  1. Oct 2025
    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Vision is a highly active process. Humans move their eyes 3-4 times per second to sample information with high visual acuity from our environment, and where eye movements are directed is critical to our understanding of active vision. Here, the authors propose that the cost of making a saccade contributes critically to saccade selection (i.e., whether and where to move the eyes). The authors build on their own recent work that the effort (as measured by pupil size) that comes with planning and generating an eye movement varies with saccade direction. To do this, the authors first measured pupil size for different saccade directions for each participant. They then correlated the variations in pupil size obtained in the mapping task with the saccade decision in a free-choice task. The authors observed a striking correlation: pupil size in the mapping task predicted the decision of where to move the eyes in the free choice task. In this study, the authors provide a number of additional insightful analyses (e.g., based on saccade curvature, and saccade latency) and experiments that further support their claim that the decision to move the eyes is influenced by the effort to move the eyes in a particular direction. One experiment showed that the same influence of assumed saccade costs on saccade selection is observed during visual search in natural scenes. Moreover, increasing the cognitive load by adding an auditory counting task reduced the number of saccades, and in particular reduced the costly saccades. In sum, these experiments form a nice package that convincingly establishes the association between pupil size and saccade selection.

      We thank the reviewer for highlighting the novelty and cogency of our findings.

      In my opinion, the causal structure underlying the observed results is not so clear. While the relationship between pupil size and saccade selection is compelling, it is not clear that saccade-related effort (i.e., the cost of a saccade) really drives saccade selection. Given the correlational nature of this relationship, there are other alternatives that could explain the finding. For example, saccade latency and the variance in landing positions also vary across saccade directions. This can be interpreted for instance that there are variations in oculomotor noise across saccade directions, and maybe the oculomotor system seeks to minimize that noise in a free-choice task. In fact, given such a correlational result, many other alternative mechanisms are possible. While I think the authors' approach of systematically exploring what we can learn about saccade selection using pupil size is interesting, it would be important to know what exactly pupil size can add that was not previously known by simply analyzing saccade latency. For example, saccade latency anisotropies across saccade directions are well known, and the authors also show here that saccade costs are related to saccade latency. An important question would be to compare how pupil size and saccade latency uniquely contribute to saccade selection. That is, the authors could apply the exact same logic to their analysis by first determining how saccade latencies (or variations in saccade landing positions; see Greenwood et al., 2017 PNAS) vary across saccade directions and how this saccade latency map explains saccade selection in subsequent tasks. Is it more advantageous to use one or the other saccade metric, and how well does a saccade latency map correlate with a pupil size map?

      We thank the reviewer for the detailed comment. 1) The reviewer first points out the correlational nature of many of our results. Thereafter, 2), the reviewer asks whether saccade latencies and landing precision also predict saccade selection, and whether these potential predictors could be considered alternative explanations to the idea of effort driving saccade selection. Moreover, what can pupil size add to what can be learned from saccade latency?

      In brief, although we report a combination of correlational and causal findings, we do not know of a more parsimonious explanation for our findings than “effort drives saccade selection”. Moreover, we demonstrate that oculomotor noise cannot be construed as an alternative explanation for our findings.

      (1) Correlational nature of many findings.

      We acknowledge that many of our findings are predominantly correlational in nature. In our first tasks, we correlated pupil size during saccade planning to saccade preferences in a subsequent task. Although the link across tasks was correlational, the observed relationship clearly followed our previously specified directed hypothesis. Moreover, experiments 1 and 2 of the visual search data replicated and extended this relationship. We also directly manipulated cognitive demand in the second visual search experiment. In line with the hypothesis that effort affects saccade selection, participants executed fewer saccades overall when performing a (primary) auditory dual task, and cut the costly saccades most, which constitutes causal evidence for our hypothesis. A minimal oculomotor noise account would not directly predict a reduction in saccade rate under higher cognitive demand. To summarize, we have a combination of correlational and causal findings, although mediators cannot be ruled out fully for the latter. That said, we do not know of a more fitting and parsimonious explanation for our findings than effort predicting saccade selection (see following points for saccade latencies). We now address causality in the discussion for transparency and point more explicitly to the second visual search experiment for causal evidence.

      “We report a combination of correlational and causal findings. Despite the correlational nature of some of our results, they consistently support the hypothesis that saccade costs predict saccade selection [which we predicted previously, 33]. Causal evidence was provided by the dual-task experiment, as saccade frequencies, and especially costly saccades, were reduced under additional cognitive demand. Only a cost account predicts 1) a link between pupil size and saccade preferences, 2) a cardinal saccade bias, 3) reduced saccade frequency under additional cognitive demand, and 4) disproportional cutting of especially those directions associated with more pupil dilation. Together, our findings converge upon the conclusion that effort drives saccade selection.”

      (2) Do anisotropies in saccade latencies constitute an alternative explanation?

      First of all, we would like to stress that differences in saccade latencies are indeed thought to reflect oculomotor effort (Shadmehr et al., 2019; TINS). For example, saccades with larger amplitudes and saccades where distractors need to be ignored are associated with longer latencies. Therefore, even if saccade latencies predicted saccade selection, this would not contradict the idea that effort drives saccade selection. Instead, this would provide convergent evidence for our main novel conclusion: effort drives saccade selection. There are several reasons why pupil size can be used as a more general marker of effort (see responses to R2), but ultimately, our conclusions do not hinge on the employed measure of effort per se. As stressed above in 1), we see no equally parsimonious explanation besides the cost account. Moreover, we predicted this relationship in our previous publication before running the currently reported experiments and analyses (Koevoet et al., 2023). That said, we are open to discussing further alternative options and look forward to testing these accounts against each other in future work; we welcome the reviewers’ (and the readers’) suggestions.

      We now discuss this in the manuscript as follows:

      “We here measured cost as the degree of effort-linked pupil dilation. In addition to pupil size, other markers may also indicate saccade costs. For example, saccade latency has been proposed to index oculomotor effort [100], whereby saccades with longer latencies are associated with more oculomotor effort. This makes saccade latency a possible complementary marker of saccade costs (also see Supplementary Materials). Although relatively sluggish, pupil size is a valuable measure of attentional costs for (at least) two reasons. First, pupil size is highly established as a marker of effort, and is sensitive to effort more broadly than only in the context of saccades [36–45, 48]. Pupil size therefore allows capturing not only the costs of saccades, but also of covert attentional shifts [33], or shifts with other effectors such as head or arm movements [54, 101]. Second, as we have demonstrated, pupil size can measure saccade costs even when searching in natural scenes (Figure 4). During natural viewing, it is difficult to disentangle fixation duration from saccade latencies, complicating the use of saccade latency as a measure of saccade cost.

      Together, pupil size, saccade latency, and potential other markers of saccade cost could fulfill complementary roles in studying the role of cost in saccade selection.”

      Second, we followed the reviewer’s recommendation in testing whether other oculomotor metrics would predict saccade selection. To this end, we conducted a linear regression across directions. We calculated pupil size, saccade latency, landing precision and peak velocity maps from the saccade planning task. We then used AIC-based backward model selection to determine which combination of factors predicted saccade selection best. The best model included pupil size, latency and landing precision as predictors (Wilkinson notation: saccade preferences ~ pupil size + saccade latency + landing precision). Pupil size (b = -42.853, t = 4.791, p < .001) and saccade latency (b = -.377, t = 2.106, p = .043; see Author response image 1) predicted saccade preferences significantly. In contrast, landing precision did not reach significance (b = 23.631, t = 1.675, p = .104). This analysis shows that although saccade latency also predicts saccade preferences, pupil size remains a robust predictor of saccade selection. These findings demonstrate that minimizing oculomotor noise cannot fully explain the pattern of results.
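      As an illustration of the AIC-based backward selection described above, the following is a minimal sketch and not the analysis code used for the manuscript. It assumes ordinary least squares with a Gaussian AIC, n ln(RSS/n) + 2k, and runs on synthetic data for 36 directions; the helper names (`solve`, `ols_aic`, `backward_select`), the predictor labels, and the simulated effect sizes are all hypothetical.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def ols_aic(columns, y):
    """Gaussian AIC (up to an additive constant) of an OLS fit with intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)
    rss = sum((y[i] - sum(X[i][a] * beta[a] for a in range(p))) ** 2 for i in range(n))
    return n * math.log(rss / n) + 2 * (p + 1)  # coefficients plus error variance


def backward_select(predictors, y):
    """Drop one predictor at a time for as long as doing so lowers the AIC."""
    keep = dict(predictors)
    best = ols_aic(list(keep.values()), y)
    while keep:
        trials = [
            (ols_aic([col for k, col in keep.items() if k != name], y), name)
            for name in keep
        ]
        aic, drop = min(trials)
        if aic >= best:
            break
        best = aic
        del keep[drop]
    return list(keep), best


# Synthetic direction maps (assumption: 36 directions, standardized values).
random.seed(1)
n = 36
names = ("pupil_size", "saccade_latency", "landing_precision", "peak_velocity")
predictors = {name: [random.gauss(0, 1) for _ in range(n)] for name in names}
preference = [
    -2.0 * predictors["pupil_size"][i]
    - 1.0 * predictors["saccade_latency"][i]
    + random.gauss(0, 0.5)
    for i in range(n)
]

selected, aic = backward_select(predictors, preference)
print(selected, round(aic, 2))
```

      In this simulation only pupil size and latency carry signal, so both survive selection; whether the pure-noise predictors are dropped depends on the particular sample, which is also why a predictor can be retained by AIC without reaching significance.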

      Author response image 1.

      The relationship between saccade latency (from the saccade planning task) and saccade preferences averaged across participants. Individual points reflect directions and shading represents bootstrapped 95% confidence intervals.
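      For readers unfamiliar with bootstrapped confidence intervals such as those shaded in the image above, a percentile bootstrap of a regression slope can be sketched as follows. This is an assumption-laden illustration: the resampling unit (direction points), the number of resamples, the function names, and the synthetic latency and preference values are hypothetical and not taken from the manuscript.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the slope, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample pairs with replacement
        stats.append(slope([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic example: 36 directions, preference decreasing with latency.
rng = random.Random(1)
latency = [rng.gauss(200, 10) for _ in range(36)]
preference = [50 - 0.4 * (l - 200) + rng.gauss(0, 2) for l in latency]

lo, hi = bootstrap_ci(latency, preference)
print(f"slope = {slope(latency, preference):.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```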

      We have added this argument into the manuscript, and discuss the analysis in the discussion. Details of the analysis have been added to the Supporting Information for transparency and further detail.

      “A control analysis ruled out that the correlation between pupil size and saccade preferences was driven by other oculomotor metrics such as saccade latency and landing precision (see Supporting Information).”

      “To ascertain whether pupil size or other oculomotor metrics predict saccade preferences, we conducted a multiple regression analysis. We calculated average pupil size, saccade latency, landing precision and peak velocity maps across all 36 directions. The model, determined using AIC-based backward selection, included pupil size, latency and landing precision as predictors (Wilkinson notation: saccade preferences ~ pupil size + saccade latency + landing precision). The analysis revealed that pupil size (β = -42.853, t = 4.791, p < .001) and saccade latency (β = -.377, t = 2.106, p = .043) predicted saccade preferences. Landing precision did not reach significance (β = 23.631, t = 1.675, p = .104). Together, this demonstrates that although other oculomotor metrics such as saccade latency contribute to saccade selection, pupil size remains a robust marker of saccade selection.”

      In addition to eye-movement-related anisotropies across the visual field, there are of course many studies reporting visual field anisotropies (see Himmelberg, Winawer & Carrasco, 2023, Trends in Neuroscience for a review). It would be interesting to understand how the authors think about visual field anisotropies in the context of their own study. Do they think that their results are (in)dependent on such visual field variations (see Greenwood et al., 2017, PNAS; Ohl, Kroell, & Rolfs, 2024, JEP:Gen for a similar discussion)?

      We agree that established visual field anisotropies are fascinating to discuss in the context of our own results. At the reviewer’s suggestion, we have now expanded this discussion.

      The observed anisotropies in terms of saccade costs are likely related to established anisotropies in perception and early visual cortex. However, the exact way that these anisotropies may be linked remains elusive (i.e. what is cause, what is effect, are links causal?), and more research is necessary to understand how these are related.

      “The observed differences in saccade costs across directions could be linked to established anisotropies in perception [80–86], attention [87–92], saccade characteristics [87, 88, 92, 93], and (early) visual cortex [94–98] [also see 99]. For example, downward saccades are more costly than upward saccades, which mimics a similar asymmetry in early visual areas wherein the upper visual field is relatively underrepresented [94–98]; similarly stronger presaccadic benefits are found for down- compared with upward saccades [87, 88]. Moreover, upward saccades are more precise than downward saccades [93]. Future work should elucidate where saccade cost or the aforementioned anisotropies originate from and how they are related - something that pupil size alone cannot address.”

      We also added that the finding that more precise saccades are coupled with worse performance in a crowding task might be attributed to the increased effort associated with more precise saccades (Greenwood et al., 2017).

      “Adaptive resource allocation from, and to the oculomotor system parsimoniously explains a number of empirical observations. For example, higher cognitive demand is accompanied by smooth pursuits deviating more from to-be tracked targets [137], reduced (micro)saccade frequencies [Figure 4; 63, 64, 138, 139], and slower peak saccade velocities [140–142]. Relatedly, more precise saccades are accompanied with worse performance in a crowding task [93].”

      Finally, the authors conclude that their results "suggests that the eye-movement system and other cognitive operations consume similar resources that are flexibly allocated among each other as cognitive demand changes." The authors should speculate what these similar resources could mean. What are the specific operations of the auditory task that overlap in terms of resources with the eye movement system?

      We agree that the nature of joint resources is an interesting question. Our previous discussion was likely too simplistic here (see also responses to R3). We here specifically refer to the cognitive resources that one can flexibly distribute between tasks.

      Our data do not directly speak to the question of what the shared resources between the auditory and oculomotor tasks are. Nevertheless, both tasks tax working memory, as saccade targets are mandatorily encoded into working memory prior to saccade onset (Van der Stigchel & Hollingworth, 2018), and the counting task clearly engages working memory. This may indicate some domain-generality between visual and auditory working memory during natural viewing (see Nozari & Martin, 2024 for a recent review), but this remains speculative. Another possibility is that it is not the working memory encoding associated with saccades per se, but the execution of overt motor actions itself, that requires cognitive processing, as suggested by Beatty (1982): “the organization of an overt motor act places additional demands on information-processing resources that are reflected in the task-evoked pupillary response”.

      We have expanded upon this in more detail in the results and discussion sections.

      “Besides the costs of increased neural activity when exerting more effort, effort should be considered costly for a second reason: Cognitive resources are limited. Therefore, any unnecessary resource expenditure reduces cognitive and behavioral flexibility [22, 31, 36, 116]. As a result, the brain needs to distribute resources between cognitive operations and the oculomotor system. We found evidence for the idea that such resource distribution is adaptive to the general level of cognitive demand and available resources: Increasing cognitive demand through an additional primary auditory dual task led to a lower saccade frequency, and especially costly saccades were cut. In this case, it is important to consider that the auditory task was the primary task, which should cause participants to distribute resources from the oculomotor system to the counting task. In other situations, more resources could be distributed to the oculomotor system instead, for example to discover new sources of reward [22, 136]. Adaptive resource allocation from, and to the oculomotor system parsimoniously explains a number of empirical observations. For example, higher cognitive demand is accompanied by smooth pursuits deviating more from to-be tracked targets [137], reduced (micro)saccade frequencies [Figure 4; 63, 64, 138, 139], and slower peak saccade velocities [140–142]. Relatedly, more precise saccades are accompanied with worse performance in a crowding task [93]. Furthermore, it has been proposed that saccade costs are weighed against other cognitive operations such as using working memory [33, 143–146]. How would the resources between the oculomotor system and cognitive tasks (like the auditory counting task) be related? One possibility is that both consume from limited working memory resources [147, 148]. Saccades are thought to encode target objects in a mandatory fashion into (visual) working memory [79], and the counting task requires participants to keep track of the auditory stream and maintain count of the instructed digit in working memory. However, the exact nature of which resources overlap between tasks remains open for future investigation [also see 149]. Together, we propose that cognitive resources are flexibly (dis)allocated to and from the oculomotor system based on the current demands to establish an optimal balance between performance and cost minimization.”

      Reviewer #2 (Public Review):

      The authors attempt to establish presaccadic pupil size as an index of 'saccade effort' and propose this index as one new predictor of saccade target selection. They only partially achieved their aim: When choosing between two saccade directions, the less costly direction, according to preceding pupil size, is preferred. However, the claim that with increased cognitive demand participants would especially cut costly directions is not supported by the data. I would have expected to see a negative correlation between saccade effort and saccade direction 'change' under increased load. Yet participants mostly cut upwards saccades, but not other directions that, according to pupil size, are equally or even more costly (e.g. oblique saccades).

      Strengths:

      The paper is well-written, easy to understand, and nicely illustrated.

      The sample size seems appropriate, and the data were collected and analyzed using solid and validated methodology.

      Overall, I find the topic of investigating factors that drive saccade choices highly interesting and relevant.

      We thank the reviewer for pointing out the strengths of our paper.

      Weaknesses:

      The authors obtain pupil size and saccade preference measures in two separate tasks. Relating these two measures is problematic because the computations that underlie saccade preparation differ. In Experiment 1, the saccade is cued centrally, and has to be delayed until a "go-signal" is presented; in Experiment 2, an immediate saccade is executed to an exogenously cued peripheral target. The 'costs' in Experiment 1 (computing the saccade target location from a central cue; withholding the saccade) do not relate to Experiment 2. It is unfortunate that measuring presaccadic pupil size directly in the comparatively more 'natural' Experiment 2 (where saccades did not have to be artificially withheld) does not seem to be possible. This questions the practical application of pupil size as an index of saccade effort.

      This is an important point raised by the reviewer and we agree that a discussion on these points improves the manuscript. We reply in two parts: 1) Although the underlying computations during saccade preparation might differ, and are therefore unlikely to be fully similar (we agree), we can still predict saccade selection between (Saccade planning to Saccade preference) and within tasks (Visual search). 2) Pupil size is a sluggish physiological signal, but this is outweighed by the advantages of using pupil size as a general marker of effort, also in the context of visual selection compared with saccade latencies.

      (1) Are delayed saccades (cost task) and the much faster saccades (preference task) linked?

      As the reviewer notes the underlying ‘type’ of oculomotor program may differ between voluntarily delayed-saccades and those in the saccade preference task. There are, however, also considerable overlaps between the oculomotor programs as the directions and amplitudes are identical. Moreover, the different types of saccades have considerable overlap in their underlying neural circuitry. Nevertheless, the underlying oculomotor programs likely still differ in some regard. Even despite these differences, we were able to measure differences across directions in both tasks, and costs and preferences were negatively and highly correlated between tasks. The finding itself therefore indicates that the costs of saccades measured during the saccade planning task generalize to those in the saccade preference task. Note also that we predicted this finding and idea already in a previous publication before starting the present study (Koevoet et al., 2023).

      We now address this interesting point in the discussion as follows:

      “We observed that affordable saccades were preferred over costly ones. This is especially remarkable given that the delayed saccades in the planning task likely differ in their oculomotor program from the immediate saccades in the preference task in some regard.”

      (2) Is pupil size a sensible measure of saccade effort?

      As the reviewer points out, the pupillary signal is indeed relatively sluggish, and therefore relatively slow and more artificial tasks are preferred to quantify saccade costs. This does not preclude pupil size from being applied in more natural settings, as we demonstrate in the search experiments, but a lot of care has to be taken to control for many possible confounding factors, and many trials will be needed.

      That said, as saccade latencies may also capture differences in oculomotor effort (Shadmehr et al., 2019) they are a possible alternative option to assess effort in some oculomotor tasks (see below on why saccade latencies do not provide evidence for an alternative to effort driving saccade selection, but converging evidence). Whilst we do maintain that pupil size is an established and versatile physiological marker of effort, saccade latencies provide converging evidence for our conclusion that effort drives saccade selection.

      As for the saccade preference task, we are not able to analyze the data in a similar manner as in the visual search task for two reasons. First, the number of saccades is much lower than in the natural search experiments. Second, in the saccade preference task, there were always two possible saccade targets. Therefore, even if we were able to isolate an effort signal, this signal could index a multitude of factors such as deciding between two possible saccade targets. Even simple binary decisions go hand in hand with reliable pupil dilations as they require effort (e.g. de Gee et al., 2014).

      There are three major reasons why pupil size is a more versatile marker of saccade costs than saccade latencies (although as mentioned, latencies may constitute another valuable tool to study oculomotor effort). First, pupil size is able to quantify the cost of attentional shifts more generally, including covert attention as well as other effector systems such as head and hand movements. This circumvents the issue of different latencies of different effector systems and also allows studying attentional processes that are not associated with overt motor movements. Second, saccade latencies are difficult to interpret in natural viewing data, as fixation duration and saccade latencies are inherently confounded by one another. This makes it very difficult to separate oculomotor processes and the extraction of perceptual information from a fixated target. Thus, pupil size is a versatile marker of attentional costs in a variety of settings, and can measure costs that saccade latencies cannot (i.e. covert attention). Lastly, pupil size is highly established as a marker of effort, which has been demonstrated across a wide range of cognitive tasks and is therefore not bound to eye movements alone (Bumke, 1911; Koevoet et al., 2024; Laeng et al., 2012; Loewenfeld, 1958; Mathôt, 2018; Robison & Unsworth, 2019; Sirois & Brisson, 2014; Strauch et al., 2022; van der Wel & van Steenbergen, 2018).

      We now discuss this as follows:

      “We here measured cost as the degree of effort-linked pupil dilation. In addition to pupil size, other markers may also indicate saccade costs. For example, saccade latency has been proposed to index oculomotor effort [100], whereby saccades with longer latencies are associated with more oculomotor effort. This makes saccade latency a possible complementary marker of saccade costs (also see Supplementary Materials). Although relatively sluggish, pupil size is a valuable measure of attentional costs for (at least) two reasons. First, pupil size is highly established as a marker of effort, and is sensitive to effort more broadly than only in the context of saccades [36–45, 48]. Pupil size therefore allows capturing not only the costs of saccades, but also of covert attentional shifts [33], or shifts with other effectors such as head or arm movements [54, 101]. Second, as we have demonstrated, pupil size can measure saccade costs even when searching in natural scenes (Figure 4). During natural viewing, it is difficult to disentangle fixation duration from saccade latencies, complicating the use of saccade latency as a measure of saccade cost. Together, pupil size, saccade latency, and potential other markers of saccade cost could fulfill complementary roles in studying the role of cost in saccade selection.”

      The authors claim that the observed direction-specific 'saccade costs' obtained in Experiment 1 "were not mediated by differences in saccade properties, such as duration, amplitude, peak velocity, and landing precision (Figure 1e,f)". Saccade latency, however, was not taken into account here but is discussed for Experiment 2.

      The final model that was used to test for the observed anisotropies in pupil size across directions indeed did not include saccade latencies as a predictor. However, we did originally consider saccade latency as a potential predictor. Because we performed AIC-based backward model selection, this predictor was removed: saccade latency contributed only marginally to explaining pupil size beyond the other predictors.

      For completeness, we here report the outcome of a linear mixed-effects model that does include saccade latency as a predictor. Here, saccade latencies did not predict pupil size (b = 1.859e-03, t = .138, p = .889). The asymmetry effects remained qualitatively unchanged: preparing oblique compared with cardinal saccades resulted in a larger pupil size (b = 7.635, t = 3.969, p < .001), and preparing downward compared with upward saccades also led to a larger pupil size (b = 3.344, t = 3.334, p = .003).

      The apparent similarity of saccade latencies and pupil size, however, is striking. Previous work shows shorter latencies for cardinal than oblique saccades, and shorter latencies for horizontal and upward saccades than downward saccades - directly reflecting the pupil sizes obtained in Experiment 1 as well as in the authors' previous study (Koevoet et al., 2023, PsychScience).

      As the reviewer notes, there are substantial asymmetries across the visual field in saccade latencies. These asymmetries in saccade latency could also predict saccade preferences. We will reply to this in three points: 1) even if saccade latency is a predictor of saccade preferences, this would not constitute an alternative explanation to the conclusion of effort driving saccade selection, 2) saccade latencies show an up-down asymmetry but oblique-cardinal effects in latency may not be generalizable across saccade tasks, 3) pupil size remains a robust predictor of saccade preferences even when saccade latencies are considered as a predictor of saccade preferences.

      (1) We first want to stress that saccade latencies are thought to reflect oculomotor effort (Shadmehr et al., 2019). For example, saccades with larger amplitudes and saccades where distractors need to be ignored are associated with longer latencies. Therefore, even if saccade latencies predict saccade selection, this would not contradict the idea that effort drives saccade selection. Instead, this would provide convergent evidence for our main conclusion: effort predicting saccade selection (rather than pupil size predicting saccade selection per se).

      “We here measured cost as the degree of effort-linked pupil dilation. In addition to pupil size, other markers may also indicate saccade costs. For example, saccade latency has been proposed to index oculomotor effort [100], whereby saccades with longer latencies are associated with more oculomotor effort. This makes saccade latency a possible complementary marker of saccade costs (also see Supplementary Materials). Although relatively sluggish, pupil size is a valuable measure of attentional costs for (at least) two reasons. First, pupil size is highly established as a marker of effort, and is sensitive to effort more broadly than only in the context of saccades [36–45, 48]. Pupil size therefore allows capturing not only the costs of saccades, but also of covert attentional shifts [33], or shifts with other effectors such as head or arm movements [54, 101]. Second, as we have demonstrated, pupil size can measure saccade costs even when searching in natural scenes (Figure 4). During natural viewing, it is difficult to disentangle fixation duration from saccade latencies, complicating the use of saccade latency as a measure of saccade cost. Together, pupil size, saccade latency, and potential other markers of saccade cost could fulfill complementary roles in studying the role of cost in saccade selection.”

      (2) We first tested anisotropies in saccade latency in the saccade planning task (Wilkinson notation: latency ~ obliqueness + updownness + leftrightness + saccade duration + saccade amplitude + saccade velocity + landing error + (1 + obliqueness + updownness | participant)). We found upward latencies to be shorter than downward saccade latencies (b = -.535, t = 3.421, p = .003). In addition, oblique saccades showed shorter latencies than cardinal saccades (b = -1.083, t = 3.096, p = .002), the opposite of what previous work has demonstrated.
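      To make the directional predictors in the formula above more concrete, the sketch below shows one way such terms could be coded from a saccade direction. This coding is an assumption for illustration only: the response does not state how obliqueness, updownness and leftrightness were computed, and both the continuous definitions and the angle convention used here (0° rightward, 90° upward) are hypothetical.

```python
import math


def code_direction(angle_deg):
    """Map a saccade direction (degrees; 0 = right, 90 = up) onto three predictors.

    Hypothetical coding: obliqueness is 0 on the cardinal axes and 1 on the
    diagonals; updownness and leftrightness are the vertical and horizontal
    components of the unit direction vector.
    """
    theta = math.radians(angle_deg)
    return {
        "obliqueness": abs(math.sin(2 * theta)),
        "updownness": math.sin(theta),
        "leftrightness": math.cos(theta),
    }


# 36 directions in 10-degree steps, matching the number of directions in the maps.
directions = [code_direction(a) for a in range(0, 360, 10)]
for angle, coded in zip(range(0, 360, 90), directions[::9]):
    print(angle, {k: round(v, 2) for k, v in coded.items()})
```

      A binary coding (cardinal versus oblique, up versus down) would serve the same purpose; the continuous version simply avoids having to assign intermediate directions to a category.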

      We then also tested these latency anisotropies in another dataset wherein participants (n = 20) saccaded toward a single peripheral target as fast as possible (Koevoet et al., submitted; same amplitude and eccentricity as in the present manuscript). There we did not find a difference in saccade latency between cardinal and oblique targets, but we did observe shorter latencies for up- compared with downward saccades. We are therefore not sure in which situations oblique saccades do or do not differ from cardinal saccades in terms of latency, or even in which direction the effect occurs.

      In contrast, we have now demonstrated a larger pupil size prior to oblique compared with cardinal saccades in two experiments. This indicates that pupil size may be a more reliable and generalizable marker of saccade costs than saccade latency. However, this remains to be investigated further.

      (3) To gain further insight into which oculomotor metrics predict saccade selection, we conducted a linear regression across directions. We created pupil size, saccade latency, landing precision and peak velocity maps from the saccade planning task. We then used AIC-based model selection to determine which combination of factors predicted saccade selection best. The selected model included pupil size, latency and landing precision as predictors (Wilkinson notation: saccade preferences ~ pupil size + saccade latency + landing precision). Pupil size (b = -42.853, t = 4.791, p < .001) and saccade latency (b = -.377, t = 2.106, p = .043) predicted saccade preferences significantly. In contrast, landing precision did not reach significance (b = 23.631, t = 1.675, p = .104). This analysis shows that although saccade latency predicts saccade preferences, pupil size remains a robust predictor of saccade selection.

      “To ascertain whether pupil size or other oculomotor metrics predict saccade preferences, we conducted a multiple regression analysis. We calculated average pupil size, saccade latency, landing precision and peak velocity maps across all 36 directions. The model, determined using AIC-based backward selection, included pupil size, latency and landing precision as predictors (Wilkinson notation: saccade preferences ~ pupil size + saccade latency + landing precision). The analysis revealed that pupil size (β = -42.853, t = 4.791, p < .001) and saccade latency (β = -.377, t = 2.106, p = .043) predicted saccade preferences. Landing precision did not reach significance (β = 23.631, t = 1.675, p = .104). Together, this demonstrates that although other oculomotor metrics such as saccade latency contribute to saccade selection, pupil size remains a robust marker of saccade selection.”
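
      To make the selection procedure described above concrete, the sketch below implements a generic AIC-based backward elimination for ordinary least squares. It is not the authors' analysis code: the data are synthetic, the helper names `ols_aic` and `backward_select` are hypothetical, and the AIC is computed under a Gaussian error assumption up to an additive constant.

```python
import numpy as np


def ols_aic(X: np.ndarray, y: np.ndarray) -> float:
    """AIC of an ordinary least-squares fit with Gaussian errors (up to a constant)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # prepend an intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1] + 1  # regression coefficients plus the error variance
    return n * np.log(rss / n) + 2 * k


def backward_select(predictors: dict, y: np.ndarray) -> list:
    """Drop one predictor at a time for as long as doing so lowers the AIC."""
    def aic_of(names):
        if names:
            X = np.column_stack([predictors[name] for name in names])
        else:
            X = np.empty((len(y), 0))  # intercept-only model
        return ols_aic(X, y)

    kept = list(predictors)
    best = aic_of(kept)
    while kept:
        # AIC of every model that omits exactly one of the remaining predictors
        candidates = [(aic_of([k for k in kept if k != drop]), drop) for drop in kept]
        cand_aic, drop = min(candidates)
        if cand_aic >= best:
            break  # no single removal improves the model
        best = cand_aic
        kept.remove(drop)
    return kept


# Synthetic maps over 36 directions (illustrative values only, not real data).
rng = np.random.default_rng(0)
n_directions = 36
maps = {
    "pupil_size": rng.normal(size=n_directions),
    "latency": rng.normal(size=n_directions),
    "landing_precision": rng.normal(size=n_directions),
    "peak_velocity": rng.normal(size=n_directions),
}
# Synthetic preferences driven by pupil size and latency plus noise.
preference = (-2.0 * maps["pupil_size"] - 0.5 * maps["latency"]
              + rng.normal(scale=0.5, size=n_directions))

selected = backward_select(maps, preference)
```

      Because a predictor is only removed when its removal lowers the AIC, the selected model can never have a higher AIC than the full model it started from.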

      The authors state that "from a costs-perspective, it should be efficient to not only adjust the number of saccades (non-specific), but also by cutting especially expensive directions the most (specific)". However, saccade targets should be selected based on the maximum expected information gain. If cognitive load increases (due to an additional task), an effective strategy seems to be to perform fewer, but still meaningful, saccades. How would it help natural orienting to selectively cut saccades in certain (effortful) directions? Choosing saccade targets based on comfort, over information gain, would result in more saccades overall, which is non-optimal, also from a cost perspective.

      We thank the reviewer for this comment. Although we do not fully agree, the logic is quite close to our rationale and it is worth adding a point of discussion here. A vital part of the current interpretation is the instruction given to participants. In our second natural visual search task, participants performed a dual task in which the auditory task was primary and the search task was secondary. Participants are therefore likely to adjust their resources to optimize performance on the primary task, at the expense of the secondary task. Consequently, fewer resources are available for searching in the dual than in the single task, because these resources are needed for the auditory task. Cutting expensive directions does not help search performance, but it does reduce the cost of search, so that more resources are available for the prioritized auditory task. Also note that the search task was rather difficult (participants completed it, but it was tough; see the original description of the dataset for more details), which provides another reason to go all in on the auditory task at the expense of the visual task. This, however, opens up a nice point of discussion: if one were to emphasize the importance of search (for example with punishment or reward), we would indeed expect participants to perform whichever eye movements get them to their goal fastest, thus reducing the relative influence of costs on saccade behavior. This remains to be tested, however. We are working on this and look forward to discussing such findings in the future.

      Together, we propose that there is a trade-off between distributing resources either towards cognitive tasks or the oculomotor system (also see Ballard et al., 1995; Van der Stigchel, 2020). How these resources are distributed depends highly on the current task demands (also see Sahakian et al., 2023). This allows for adaptive behavior in a wide range of contexts.

      We now added these considerations to the manuscript as follows (also see our previous replies):

      “Do cognitive operations and eye movements consume from a similar pool of resources [44]? If so, increasing cognitive demand for non-oculomotor processes should result in decreasing available resources for the oculomotor system. In line with this idea, previous work indeed shows altered eye-movement behavior under effort as induced by dual tasks, for example by making fewer saccades under increased cognitive demand [62–64]. We therefore investigated whether fewer saccades were made as soon as participants had to count the occurrence of a specific digit in the auditory number stream in comparison to ignoring the stream (in Exp. 2; Figure 4a). Participants were instructed to prioritize the auditory digit-counting task over finding the visual search target. Therefore, resources should be shifted from the oculomotor system to the primary auditory counting task. The additional cognitive demand of the dual task indeed led to a decreased saccade frequency (t(24) = 7.224, p < .001, Cohen’s d = 1.445; Figure 4h).”

      I would have expected to see a negative correlation between saccade effort and saccade direction 'change' under increased load. Yet participants mostly cut upwards saccades, but not other directions that, according to pupil size, are equally or even more costly (e.g. oblique saccades).

      The reviewer’s point continues from the initial comment, and we address it here. First, we would like to point out that it is not established that saccade costs in different directions are always the same. Instead, saccade costs could differ between natural viewing and our delayed-saccade task. Therefore, we used pupil size during natural viewing for the search experiments. Second, the reviewer correctly notes that oblique saccades are hardly cut under additional cognitive demand. However, participants already hardly execute oblique saccades when not confronted with the additional auditory task (Figure 4b, d), making it difficult to reduce those further (i.e. a floor effect). Participants chose to cut vertical saccades, possibly because these are more costly than horizontal saccades.

      We incorporated these points in our manuscript as follows:

      “To test this, we analyzed data from two existing datasets [63] wherein participants (total n = 41) searched for small targets (’Z’ or ’H’) in natural scenes (Figure 4a; [64]). Again, we tested whether pupil size prior to saccades negatively linked with saccade preferences across directions. Because saccade costs and preferences across directions could differ for different situations (i.e. natural viewing vs. saccade preference task), but should always be negatively linked, we established both cost and preferences independently in each dataset.”

      “We calculated a saccade-adjustment map (Figure 4g) by subtracting the saccade preference map in the single task (Figure 4f) from the dual task map (Figure 4d). Participants seemingly cut vertical saccades in particular, and made more saccades to the top right direction. This pattern may have emerged as vertical saccades are more costly than horizontal saccades (also see Figure 1d). Oblique saccades may not have been cut because there were very few oblique saccades in the single condition to begin with (Figure 4d), making it difficult to observe a further reduction of such saccades under additional cognitive demand (i.e. a floor effect).”
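
      The subtraction described in the passage above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names, the expression of preferences as percentages, and the binning of directions from 0 degrees upward are assumptions, and the toy directions are invented purely for demonstration.

```python
from collections import Counter


def preference_map(directions_deg, n_bins=36):
    """Percentage of saccades that fall into each direction bin."""
    width = 360 / n_bins
    # The final modulo guards against floating-point values that round up to 360.
    counts = Counter(int((d % 360) // width) % n_bins for d in directions_deg)
    total = len(directions_deg)
    return [100 * counts.get(b, 0) / total for b in range(n_bins)]


def adjustment_map(single_task_deg, dual_task_deg, n_bins=36):
    """Dual-task minus single-task preference for each direction bin."""
    single = preference_map(single_task_deg, n_bins)
    dual = preference_map(dual_task_deg, n_bins)
    return [d - s for d, s in zip(dual, single)]


# Toy example with four bins (right, up, left, down); values are made up.
single_task = [0, 90, 90, 180, 270, 270]
dual_task = [0, 0, 90, 180]
adjustment = adjustment_map(single_task, dual_task, n_bins=4)
```

      Negative entries mark directions that are chosen less often under the dual task and positive entries mark directions that are chosen relatively more often. Because both maps are expressed as percentages, the entries of the adjustment map sum to zero.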

      Overall, I am not sure what practical relevance the relation between pupil size (measured in a separate experiment) and saccade decisions has for eye movement research/vision science. Pupil size does not seem to be a straightforward measure of saccade effort. Saccade latency, instead, can be easily extracted in any eye movement experiment (no need to conduct a separate, delayed saccade task to measure pupil dilation), and seems to be an equally good index.

      There are two points here.

      (1) What is the practical relevance of a link between effort and saccade selection for eye-movement research and vision science?

      We see plenty: think of changing eye movement patterns under effort (be it smooth pursuits, saccade rates, distributions of gaze positions to images etc.), which have substantial implications for human factors research, but also neuropsychology. With a cost account, one may predict (rather than just observe) how eye movements change as soon as resources are reduced or non-visual demand increases. A cost account can also parsimoniously explain effects (e.g. lower saccade rates under effort, cardinal bias, perhaps also central bias) that cannot be explained by what are so far referred to as the three core drivers of eye movement behavior (saliency, selection history, goals; e.g., Awh et al., 2012). Conversely, one must wonder why eye-movement research/vision science simply accepts/dismisses these phenomena as such, without seeking overarching explanations.

      (2) What is the usefulness of using pupil size to measure effort?

      We hope that our replies to the comments above illustrate why pupil size is a sensible, robust and versatile marker of attentional costs. We briefly summarize our most important points here.

      - Pupil size is an established measure of effort irrespective of context, as demonstrated by hundreds of original works (e.g. working memory load, multiple object tracking, individual differences in cognitive ability). This allows pupil size to be a versatile marker of the effort, and therefore costs, of non-saccadic attentional shifts such as covert attention or those realized by other effector systems (i.e. head or hand movements).

      - Our new analysis indicates that pupil size remains a strong and robust predictor of saccade preference, even when considering saccade latency.

      - Pupil size allows saccade costs to be studied in natural viewing. In contrast, saccade latencies are difficult to assess in natural viewing as fixation durations and saccade latencies are intrinsically linked and very difficult to disentangle.

      - Note, however, that we think it is interesting and useful to study effects of effort/cost on eye movement behavior, whichever index is used to do so. We see plenty of potential in this line of research, and this paper is a starting point.

      Reviewer #3 (Public Review):

      This manuscript extends previous research by this group by relating variation in pupil size to the endpoints of saccades produced by human participants under various conditions including trial-based choices between pairs of spots and search for small items in natural scenes. Based on the premise that pupil size is a reliable proxy of "effort", the authors conclude that less costly saccade targets are preferred. Finding that this preference was influenced by the performance of a non-visual, attention-demanding task, the authors conclude that a common source of effort animates gaze behavior and other cognitive tasks.

      Strengths:

      Strengths of the manuscript include the novelty of the approach, the clarity of the findings, and the community interest in the problem.

      We thank the reviewer for pointing out the strengths of our paper.

      Weaknesses:

      Enthusiasm for this manuscript is reduced by the following weaknesses:

      (1) A relationship between pupil size and saccade production seems clear based on the authors' previous and current work. What is at issue is the interpretation. The authors test one, preferred hypothesis, and the narrative of the manuscript treats the hypothesis that pupil size is a proxy of effort as beyond dispute or question. The stated elements of their argument seem to go like this:

      PROPOSITION 1: Pupil size varies systematically across task conditions, being larger when tasks are more demanding.

      PROPOSITION 2: Pupil size is related to the locus coeruleus.

      PROPOSITION 3: The locus coeruleus NE system modulates neural activity and interactions.

      CONCLUSION: Therefore, pupil size indexes the resource demand or "effort" associated with task conditions.

      How the conclusion follows from the propositions is not self-evident. Proposition 3, in particular, fails to establish the link that is supposed to lead to the conclusion.

      We inadvertently laid out this rationale as described above, and we thank the reviewer for pointing out this initial suboptimal structure of argumentation. The notion that the link between pupil size and effort is established in the literature because of its neural underpinnings is inaccurate. Instead, the tight link between effort and pupil size is established based on covariations of pupil diameter and cognition across a wide variety of tasks and domains. In line with this, we now introduce this tight link predominantly based on the relationships between pupil size and cognition instead of focusing on putative neural correlates of this relationship.

      As reviewed previously (Beatty, 1982; Bumke, 1911; Kahneman, 1973; Kahneman & Beatty, 1966; Koevoet et al., 2024; Laeng et al., 2012; Mathôt, 2018; Sirois & Brisson, 2014; Strauch et al., 2022; van der Wel & van Steenbergen, 2018), any increase in effort is consistently associated with an increase in pupil size. For instance, the pupil dilates when increasing load in working memory or multiple object tracking tasks, and such pupillary effects robustly explain individual differences in cognitive ability and fluctuations in performance across trials (Alnæs et al., 2014; Koevoet et al., 2024; Robison & Brewer, 2020; Robison & Unsworth, 2019; Unsworth & Miller, 2021). This extends to the planning of movements as pupil dilations are observed prior to the execution of (eye) movements (Koevoet et al., 2023; Richer & Beatty, 1985). The link between pupil size and effort has thus been firmly established for a long time, irrespective of the neural correlates of these effort-linked pupil size changes.

      We again thank the reviewer for spotting this logical mistake, and now revised the paragraph where we introduce pupil size as an established marker of effort as follows:

      “We recently demonstrated that the effort of saccade planning can be measured with pupil size, which allows for a physiological quantification of saccade costs as long as low-level visual factors are controlled for [33]. Pupil size is an established marker of effort [36–44]. For instance, loading more in working memory or tracking more objects results in stronger pupil dilation [44–52]. Pupil size not only reflects cognitive (or mental) effort but also the effort of planning and executing movements [37, 53, 54]. We leveraged this to demonstrate that saccade costs can be captured with pupil size, and are higher for oblique compared with cardinal directions [33]. Here, we addressed whether saccade costs predict where to saccade.”

      We now mention the neural correlates of pupil size only in the discussion, where we took care to also mention roles for other neurotransmitter systems:

      “Throughout this paper, we have used cost in the limited context of saccades.

      However, cost-based decision-making may be a more general property of the brain [31, 36, 114–116]. Every action, be it physical or cognitive, is associated with an intrinsic cost, and pupil size is likely a general marker of this [44]. Note, however, that pupil dilation does not always reflect cost, as the pupil dilates in response to many sensory and cognitive factors which should be controlled for, or at least considered, when interpreting pupillometric data [e.g., see 39, 40, 42, 117]. Effort-linked pupil dilations are thought to be, at least in part, driven by activity in the brainstem locus coeruleus (LC) [40, 118–120] [but other neurotransmitters also affect pupil size, e.g. 121, 122]. Activity in LC with its widespread connections throughout the brain [120, 123–127] is considered to be crucial for the communication within and between neural populations and modulates global neural gain [128–132]. Neural firing is costly [22, 133], and therefore LC activity and pupil size are (neuro)physiologically plausible markers of cost [40]. Tentative evidence even suggests that continued exertion of effort (accompanied by altered pupil dilation) is linked to the accumulation of glutamate in the lateral prefrontal cortex [134], which may be a metabolic marker of cost [also see 116, 134, 135].”

      (2) The authors test one, preferred hypothesis and do not consider plausible alternatives. Is "cost" the only conceivable hypothesis? The hypothesis is framed in very narrow terms. For example, the cholinergic and dopamine systems that have been featured in other researchers' consideration of pupil size modulation are missing here. Thus, because the authors do not rule out plausible alternative hypotheses, the logical structure of this manuscript can be criticized as committing the fallacy of affirming the consequent.

      As we have noted in the response to the reviewer’s first point, we did not motivate our use of pupil size as an index of effort clearly enough. For the current purpose, the neural correlates of pupil size are less relevant than the cognitive correlates (see previous point). We reiterate that the neuromodulatory underpinnings of the observed pupil size effects (which indeed possibly include effects of the cholinergic, dopaminergic and serotonergic systems), while interesting for the discussion on the neural origin of effects, are not crucial to our conclusion. We hope the new rationale (without focusing too much on the (irrelevant) exact neural underpinnings) convinces the reviewer and reader.

      Our changes to the manuscript are shown in our reply to the previous comment.

      The reviewer notes that other plausible alternative hypotheses could explain the currently reported results. However, we did not find a more parsimonious explanation for our data than ‘Effort Drives Saccade Selection’. Effort explains why participants prefer saccading toward specific directions in (1) highly controlled and (2) more natural settings. Note that we also predicted this effect previously (Koevoet et al., 2023). Moreover, this account explains (3) why participants make fewer saccades under additional cognitive demand, and (4) why especially costly saccades are reduced under additional cognitive demand. We are very open to the reviewer presenting other possible interpretations of our data so that these can be discussed and put to the test in future work.

      (3) The authors cite particular publications in support of the claim that saccade selection is influenced by an assessment of effort. Given the extensive work by others on this general topic, the skeptic could regard the theoretical perspective of this manuscript as too impoverished. Their work may be enhanced by consideration of other work on this general topic, e.g., (i) Shenhav A, Botvinick MM, Cohen JD. (2013) The expected value of control: an integrative theory of anterior cingulate cortex function. Neuron. 2013 Jul 24;79(2):217-40. (ii) Müller T, Husain M, Apps MAJ. (2022) Preferences for seeking effort or reward information bias the willingness to work. Sci Rep. 2022 Nov 14;12(1):19486. (iii) Bustamante LA, Oshinowo T, Lee JR, Tong E, Burton AR, Shenhav A, Cohen JD, Daw ND. (2023) Effort Foraging Task reveals a positive correlation between individual differences in the cost of cognitive and physical effort in humans. Proc Natl Acad Sci U S A. 2023 Dec 12;120(50):e2221510120.

      We thank the reviewer for pointing us toward this literature. These papers are indeed relevant for our manuscript, and we have now incorporated them. Specifically, we now discuss how the costs of effort are weighed in relation to possible rewards during decision-making. We have also incorporated work that has investigated how the biomechanical costs of arm movements contribute to action selection.

      “Our findings are in line with established effort-based models that assume costs to be weighed against rewards during decision-making [102–107]. In such studies, reward and cognitive/physical effort are often parametrically manipulated to assess how much effort participants are willing to exert to acquire a given (monetary) reward [e.g. 108, 109]. Whereas this line of work manipulated the extrinsic costs and/or rewards of decision options (e.g. perceptual consequences of saccades [110, 111] or consequences associated with decision options), we here focus on the intrinsic costs of the movement itself (in terms of cognitive and physical effort). Relatedly, the intrinsic costs of arm movements are also considered during decision-making: biomechanically affordable movements are generally preferred over more costly ones [26–28]. We here extend these findings in two important ways. First, until now, the intrinsic costs of saccades and other movements have been inferred from gaze behavior itself or by using computational modelling [23, 25–28, 34, 35, 112]. In contrast, we directly measured cost physiologically using pupil size. Secondly, we show that physiologically measured saccade costs predict where saccades are directed in a controlled binary preference task, and even during natural viewing. Our findings could unite state-of-the-art computational models [e.g. 23, 25, 34, 35, 113] with physiological data, to directly test the role of saccade costs and ultimately further our understanding of saccade selection.”

      (4) What is the source of cost in saccade production? What is the currency of that cost? The authors state (page 13), "... oblique saccades require more complex oculomotor programs than horizontal eye movements because more neuronal populations in the superior colliculus (SC) and frontal eye fields (FEF) [76-79], and more muscles are necessary to plan and execute the saccade [76, 80, 81]." This statement raises questions and concerns. First, the basis of the claim that more neurons in FEF and SC are needed for oblique versus cardinal saccades is not established in any of the publications cited. Second, the authors may be referring to the fact that oblique saccades require coordination between pontine and midbrain circuits. This must be clarified. Third, the cost is unlikely to originate in extraocular muscle fatigue because the muscle fibers are so different from skeletal muscles, being fundamentally less fatigable. Fourth, if net muscle contraction is the cost, then why are upward saccades, which require the eyelid, not more expensive than downward? Thus, just how some saccades are more effortful than others is not clear.

      Unfortunately, our current data do not allow for the specification of what the source is of differences in saccade production, nor what the currency is. We want to explicitly state that while pupil size is a sensitive measure of saccade costs, pupil size cannot directly inform what underlying mechanisms are causing differences in saccade costs across conditions (e.g. directions). Nevertheless, we do speculate about these issues because they are important to consider. We thank the reviewer for pointing out the shortcomings in our initial speculations.

      Broadly, we agree with the reviewer that a neural source of differences in costs between different types of saccades is more likely than a purely muscular account (also see Koevoet et al., 2023). Furthermore, we think that the observed differences in saccade costs for oblique vs. cardinal and up vs. down could be due to different underlying mechanisms. While we caution against overinterpreting single directions, tentative evidence for this may also be drawn from the different time courses of the effects for up/down versus cardinal/oblique (Figure 1c).

      Below we speculate about why some specific saccade directions may be more costly than others:

      Why would oblique saccades be more costly than cardinal saccades? We thank the reviewer for pointing out that oblique saccades additionally require coordination between pontine and midbrain circuits (Curthoys et al., 1984; King & Fuchs, 1979; Sparks, 2002). This point warrants a more thorough discussion than in our initial version. We have incorporated this as follows:

      “The complexity of an oculomotor program is arguably shaped by its neural underpinnings. For example, oblique but not cardinal saccades require communication between pontine and midbrain circuits [73–75]. Such differences in neural complexity may underlie the additional costs of oblique compared with cardinal saccades. Besides saccade direction, other properties of the ensuing saccade such as its speed, distance, curvature, and accuracy may contribute to a saccade’s total cost [22, 33, 53, 76, 77] but this remains to be investigated directly.”

      Why would downward saccades be more costly than upward saccades? As the reviewer points out: from a net muscular contraction account of cost, one would expect the opposite pattern due to the movement of the eyelid. Instead, we speculate that our findings may be associated with the well-established anisotropy in early visual cortex along the vertical meridian. Specifically, the upper vertical meridian is represented in substantially less detail than the lower vertical meridian (Himmelberg et al., 2023; Silva et al., 2018). Prior to a saccade, attention is deployed towards the intended saccadic endpoint (Deubel & Schneider, 1996; Kowler et al., 1995). Attention tunes neurons to preferentially process the attended location over non-attended locations. Because the lower visual field is represented in higher detail than the upper visual field, attention may tune neuronal responses differently when preparing up- compared with downward saccades (Hanning et al., 2024; Himmelberg et al., 2023). Thus, it may be more costly to prepare down- compared with upward saccades. This proposition, however, does not account for the lower costs associated with horizontal compared with up- and downward saccades, as the horizontal meridian is represented at a higher acuity than the vertical meridian. This makes it unlikely that this explains the pattern of results completely. Again, at this point we can only speculate why costs differ, yet we demonstrate that these differences in cost are decisive for oculomotor behavior. We now explicitly state the speculative nature of these ideas, which would all need to be tested directly.

      We have updated our discussion of this issue as follows:

      “The observed differences in saccade costs across directions could be linked to established anisotropies in perception [80–86], attention [87–92], saccade characteristics [87, 88, 92, 93], and (early) visual cortex [94–98] [also see 99]. For example, downward saccades are more costly than upward saccades, which mimics a similar asymmetry in early visual areas wherein the upper visual field is relatively underrepresented [94–98]; similarly stronger presaccadic benefits are found for down- compared with upward saccades [87, 88]. Moreover, upward saccades are more precise than downward saccades [93]. Future work should elucidate where saccade cost or the aforementioned anisotropies originate from and how they are related - something that pupil size alone cannot address.”

      (5) The authors do not consider observations about variation in pupil size that seem to be incompatible with the preferred hypothesis. For example, at least two studies have described systematically larger pupil dilation associated with faster relative to accurate performance in manual and saccade tasks (e.g., Naber M, Murphy P. Pupillometric investigation into the speed-accuracy trade-off in a visuo-motor aiming task. Psychophysiology. 2020 Mar;57(3):e13499; Reppert TR, Heitz RP, Schall JD. Neural mechanisms for executive control of speed-accuracy trade-off. Cell Rep. 2023 Nov 28;42(11):113422). Is the fast relative to the accurate option necessarily more costly?

      We thank the reviewer for this interesting point that we will answer in two ways. First, we discuss the main point: the link between pupil size, effort, and cost. Second, we discuss the findings described specifically in these two papers and how we interpret these from a pupillometric account.

      First, one may generally ask 1) whether any effort results in pupil dilation, 2) whether any effort is costly, and 3) whether this means that pupil dilation always reflects effort and cost, respectively. Indeed, it has been argued repeatedly, prominently, and independently (e.g., Bumke, 1911; Mathôt, 2018) that any change in effort (no matter the specific origin) is associated with an evoked pupil dilation. Effort, in turn, is consistently and widely experienced as aversive, both across tasks and cultures (David et al., 2024). Effort minimization may therefore be seen as a universal law of human cognition and behavior, with effort as a to-be-minimized cost (Shadmehr et al., 2019; Hull, 1943; Tsai, 1932). However, this does not imply that any pupil dilation necessarily reflects effort or that, as a consequence thereof, any pupil dilation always signals cost. For instance, the pupil dark response, the pupil far response and changes in baseline pupil size are not associated with effort. Baseline and task-evoked pupil dilation responses have to be interpreted differently (see below). Moreover, the pupil also changes (and dilates) due to other factors (see Strauch et al., 2022; Mathôt, 2018; Bumke, 1911; Loewenfeld, 1999 for reviews).

      Second, as for Naber & Murphy (2020) and Reppert et al. (2023) specifically: both studies indeed demonstrate a larger baseline pupil size when participants made faster, less accurate responses. However, baseline pupil size is not an index of effort per se, whereas task-evoked pupil dilation responses (as studied in the present manuscript) are (Strauch et al., 2022). For work on differences between baseline pupil diameter and task-evoked pupil responses, and their respective links with exploration and exploitation, please see Jepma & Nieuwenhuis (2011). Indeed, the link between effort and larger pupil size holds for task-evoked responses, but not for baseline pupil size per se (also see Koevoet et al., 2023).
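
      The distinction between baseline pupil size and the task-evoked response can be illustrated with a minimal sketch of subtractive baseline correction, a common approach in pupillometry. The baseline window, the function name and the toy traces below are assumptions chosen for illustration; they do not describe the preprocessing used in any of the cited studies.

```python
from statistics import mean


def split_baseline_and_evoked(trace, times_ms, baseline_window=(-200, 0)):
    """Return (baseline, evoked_trace) for a single trial.

    baseline: mean pupil size inside baseline_window (ms relative to cue onset;
              the window is an assumed value for this example)
    evoked:   the trace with that baseline subtracted from every sample
    """
    start, end = baseline_window
    in_window = [p for p, t in zip(trace, times_ms) if start <= t < end]
    baseline = mean(in_window)
    return baseline, [p - baseline for p in trace]


# Two made-up trials with different baselines but the same evoked dilation.
times = [-200, -100, 0, 100, 200, 300]
trial_a = [4.0, 4.0, 4.1, 4.3, 4.4, 4.4]
trial_b = [5.0, 5.0, 5.1, 5.3, 5.4, 5.4]

baseline_a, evoked_a = split_baseline_and_evoked(trial_a, times)
baseline_b, evoked_b = split_baseline_and_evoked(trial_b, times)
```

      In this toy case the two trials differ in baseline pupil size while their evoked responses coincide, which shows why the two measures can dissociate and need to be interpreted separately.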

      Still, Naber (third author of the current paper) & Murphy (2020) also demonstrated larger task-evoked pupil dilation responses when participants were instructed to make faster, less accurate responses compared with making accurate and relatively slow responses. However, this difference in task-evoked response reaches significance only after the onset of the movement itself, and peaks substantially later than response offset. Whilst pupil dilation may be sluggish, it is not extremely sluggish either. As feedback on the participant’s performance was displayed 1.25s after performing the movement and clicking (taking about 630ms), we deem it possible that this effect may in part result from appraising the feedback rather than from the speed of the response itself (in fact, Naber and Murphy also discuss this option). In addition, because that study measured mouse movements rather than saccades, it is possible that the observed evoked pupil effects in Naber & Murphy (2020) are not purely linked to motor preparation and execution per se. Therefore, future work that aims to investigate the costs of movements should isolate the effects of feedback and other potential factors that may drive changes in pupil size. This will help clarify whether fast or more accurate movements could be linked to the underlying costs of the movements.

      Relatedly, we do not find evidence that pupil size during saccade planning predicts the onset latency of the ensuing saccade (please refer to our second response to Reviewer 2 for a detailed discussion).

      Together, we therefore do not see the results from Reppert et al. (2023) and Naber & Murphy (2020) as being at odds with our interpretation of evoked pupil size reflecting effort and cost in the context of planning saccades.

      We think that these are considerations important to the reader, which is why we now added them to the discussion as follows:

      “Throughout this paper, we have used cost in the limited context of saccades.

      However, cost-based decision-making may be a more general property of the brain [31, 36, 114–116]. Every action, be it physical or cognitive, is associated with an intrinsic cost, and pupil size is likely a general marker of this [44]. Note, however, that pupil dilation does not always reflect cost, as the pupil dilates in response to many sensory and cognitive factors which should be controlled for, or at least considered, when interpreting pupillometric data [e.g., see 39, 40, 42, 117].”

      (6) The authors draw conclusions based on trends across participants, but they should be more transparent about variation that contradicts these trends. In Figures 3 and 4 we see many participants producing behavior unlike most others. Who are they? Why do they look so different? Is it just noise, or do different participants adopt different policies?

      We disagree with the transparency point of the reviewer. Note that we deviated from the norm here by being more transparent than common: we added individual data points and relationships rather than showing pooled effects across participants with error bars alone (see Figures 2c, 3b,c, 4c,e,f).

      Moreover, our effects are consistent and stable across participants and are highly significant. To illustrate, for the classification analysis based on cost (Figure 2e), 16/20 participants showed an effect. As for the natural viewing experiments (total > 250,000 fixations), we also find that a majority of participants show the observed effects: Experiment 1: 15/16 participants; Experiment 2: 16/25 participants; Experiment 2 – adjustment: 22/25 participants.

      We fully agree that it is interesting to understand where interindividual variation may originate from. We currently have too little data to allow robust analyses across individuals or to zoom in on individual differences in cost maps, preference maps, or potential personalized strategies of saccade selection. That said, future work could study this further. For this purpose, we would recommend reducing the number of directions to obtain more pupil size data per direction, and therefore cleaner signals that may be more informative at the individual level. With such stronger signals, studying (differences in) links at the individual level may be feasible and would be interesting to consider; this will be a future direction in our own work too. Nonetheless, we again stress that the reported effects are robust and consistent across participants, and that interindividual differences are therefore not extensive. Moreover, our results from four experiments consistently support our conclusion that effort drives saccade selection.

      Recommendations for the authors:  

      Reviewer #1 (Recommendations For The Authors):

      - Based on the public review, I would recommend that the authors carefully review and correct the manuscript with regard to the causal conclusions. The study is largely correlational (i.e. the pupil was only observed, not manipulated) and therefore does not allow causal conclusions to be drawn about the relationship between pupil size and saccade selection. These causal conclusions become even more confusing when pupil size is equated with effort and saccade cost. As a consequence, an actual correlation between pupil size and saccade selection has led to the title that effort drives saccade selection. It would also be helpful for the reader to summarize in an additional section of the discussion what they consider to be a causal or correlational link based on their results.

      We agree with the reviewer, and we now state more explicitly and in more detail which findings are correlational and which are causal. As outlined before, we do not see a more parsimonious explanation for our findings than the one expressed in our title, but we fully agree that the paper benefits from making the correlational or causal nature of the evidence for this idea explicit.

      “We report a combination of correlational and causal findings. Despite the correlational nature of some of our results, they consistently support the hypothesis that saccade costs predict saccade selection [which we predicted previously, 33]. Causal evidence was provided by the dual-task experiment, as saccade frequencies, and especially the frequencies of costly saccades, were reduced under additional cognitive demand. Only a cost account predicts 1) a link between pupil size and saccade preferences, 2) a cardinal saccade bias, 3) reduced saccade frequency under additional cognitive demand, and 4) disproportionate cutting of especially those directions associated with more pupil dilation. Together, our findings converge upon the conclusion that effort drives saccade selection.”

      - Can the authors please elaborate in more detail on how they transformed the predictors of their linear mixed model for the visualization in Figure 1f? It is difficult to see how the coefficients in the table and the figure match.

      We used the ‘effectsize’ package to compute effect sizes for each predictor of the linear mixed-effects model (https://cran.r-project.org/web/packages/effectsize/index.html). We report absolute effect sizes to make it visually easier to compare different predictors. These details have now been included in the Methods section to be more transparent about how these effect sizes were computed.

      “Absolute effect sizes (i.e. r) and their corresponding 95% confidence intervals for the linear mixed-effects models were calculated using t and df values with the ’effectsize’ package (v0.8.8) in R.”
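      The conversion from t and df to an absolute effect size can be sketched as follows. This is a minimal illustration, assuming the standard conversion r = t / sqrt(t² + df); the function name `t_to_abs_r` is hypothetical, and the confidence intervals computed by the R package are not reproduced here. The example input reuses a t value and df reported elsewhere in this response, not a coefficient from Figure 1f.

```python
import math


def t_to_abs_r(t: float, df: float) -> float:
    """Convert a t statistic and its degrees of freedom to an absolute effect size |r|.

    Assumes the standard conversion r = t / sqrt(t^2 + df).
    """
    if df <= 0:
        raise ValueError("df must be positive")
    return abs(t) / math.sqrt(t * t + df)


# Illustrative input only: t(19) = 3.26, as reported for the trial-by-trial analysis.
example_r = t_to_abs_r(3.26, 19)
print(round(example_r, 3))
```

      Taking the absolute value discards the sign of each coefficient, which is why the values in such a figure need not visually match the signed coefficients in a model table.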

      - Could the authors please explain in more detail why they think that a trial-by-trial analysis in the free choice task adds something new to their conclusions? In fact, a trial-by-trial analysis somehow suggests that the pupil size data would enter the analysis at a single trial level. If I understand correctly, the pupil size data come from their initial mapping task. So there is only one mean pupil size for a given participant and direction that goes into their analysis to predict free choice in a single trial. If this is the case, I don't see the point of doing this additional analysis given the results shown in Figure 2c.

      The reviewer understands correctly that pupil size data is taken from the initial mapping task. We then used these mean values to predict which saccade target would be selected on a trial-by-trial basis. While showing the same conceptual result as the correlation analysis, we opted to include this analysis to show the robustness of the results across individuals. Therefore we have chosen to keep the analysis in the manuscript but now write more clearly that this shows the same conceptual finding as the correlation analysis.

      “As another test of the robustness of the effect, we analyzed whether saccade costs predicted saccade selection on a trial-by-trial basis. To this end, we first determined the more affordable option for each trial using the established saccade cost map (Figure 1d). We predicted that participants would select the more affordable option. Complementing the above analyses, the more affordable option was chosen above chance level across participants (M = 56.64%, 95%-CI = [52.75%-60.52%], one-sample t-test against 50%: t(19) = 3.26, p = .004, Cohen’s d = .729; Figure 2e). Together, these analyses established that saccade costs robustly predict saccade preferences.”
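      The logic of this trial-by-trial test can be sketched as below. This is an illustration under stated assumptions rather than the analysis code itself: the function names, the data layout (a per-participant dictionary from direction to mean pupil size, and trials as tuples of two options plus the chosen one) and the toy values are all hypothetical.

```python
import math
import statistics


def affordable_choice_rate(cost_map, trials):
    """Proportion of trials in which the lower-cost option was chosen.

    cost_map: dict mapping direction -> mean pupil size from the mapping task.
    trials: iterable of (option_a, option_b, chosen) directions.
    Trials with equal costs are skipped because neither option is more affordable.
    """
    hits, counted = 0, 0
    for option_a, option_b, chosen in trials:
        cost_a, cost_b = cost_map[option_a], cost_map[option_b]
        if cost_a == cost_b:
            continue
        cheaper = option_a if cost_a < cost_b else option_b
        hits += chosen == cheaper
        counted += 1
    if counted == 0:
        raise ValueError("no trials with unequal costs")
    return hits / counted


def one_sample_t(values, mu):
    """Return (t, df, Cohen's d) for a one-sample t-test of values against mu."""
    n = len(values)
    d = (statistics.fmean(values) - mu) / statistics.stdev(values)
    return d * math.sqrt(n), n - 1, d


# Hypothetical toy data for one participant (directions in degrees).
toy_costs = {0: 1.0, 90: 2.0, 45: 3.0}
toy_trials = [(0, 45, 0), (90, 45, 45), (0, 90, 0), (90, 0, 0)]
rate = affordable_choice_rate(toy_costs, toy_trials)

# Hypothetical per-participant rates tested against chance (0.5).
t_value, df, cohens_d = one_sample_t([0.5, 0.6, 0.7], 0.5)
```

      For a one-sample test, Cohen's d equals t / sqrt(n), so the reported d = .729 is consistent with t(19) = 3.26 and n = 20.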

      Reviewer #2 (Recommendations For The Authors):

      The authors report that "Whenever the difference in pupil size between the two options was larger, saccades curved away more from the non-selected option (β = .004, SE = .001, t = 4.448, p < .001; Figure 3b), and their latencies slowed (β = .050, SE = .013, t = 4.323, p < .001; Figure 3c)". I suspect this effect might not be driven by the difference but by a correlation between pupil size and latency.

      The authors correlate differences in pupil size (Exp1) with saccade latencies (Exp2), I recommend correlating pupil size with the latency directly, in either task. This would show if it is actually the difference between choices or simply the pupil size of the respective individual option that is linked to latency/effort. Same for curvature.

      The reviewer raises a good point. Please see the previous analyses concerning the possible correlations between pupil size and saccade latency, and how they jointly predict saccade selection.

      Our data show that saccade curvature and latencies are linked with the difference in pupil size between the selected and non-selected options. Are these effects driven by a difference in pupil size or by the pupil size associated with the chosen option?

      To assess this, we fitted two linear mixed-effects models. We predicted saccade curvature and latency using pupil size (from the planning task) of the selected and non-selected options while controlling for the chosen direction (Wilkinson notation: saccade curvature/latency ~ selected pupil size + non-selected pupil size + obliqueness + vertical + horizontal + (1 + selected pupil size + non-selected pupil size|participant)). We found that saccades curved away more from costlier non-selected targets (β = 1.534, t = 8.151, p < .001), and saccades curved away from the non-selected target less when the selected target was cheaper (β = -2.571, t = -6.602, p < .001). As the costs of the selected and non-selected targets show opposite effects on saccade curvature, this indicates that the difference between the two options drives oculomotor conflict.
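      Written out, the Wilkinson formula above corresponds to the following model. This is a sketch assuming the usual Gaussian linear mixed-model formulation; the symbols are introduced here for illustration, with y the outcome (curvature or latency) on trial i of participant j.

```latex
\[
\begin{aligned}
y_{ij} &= \beta_0 + \beta_1\,\mathrm{sel}_{ij} + \beta_2\,\mathrm{nonsel}_{ij}
        + \beta_3\,\mathrm{obl}_{ij} + \beta_4\,\mathrm{vert}_{ij} + \beta_5\,\mathrm{hor}_{ij} \\
       &\quad + u_{0j} + u_{1j}\,\mathrm{sel}_{ij} + u_{2j}\,\mathrm{nonsel}_{ij} + \varepsilon_{ij}, \\
(u_{0j}, u_{1j}, u_{2j})^{\top} &\sim \mathcal{N}(\mathbf{0}, \Sigma), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^{2}).
\end{aligned}
\]
```

      Here sel and nonsel are the pupil sizes associated with the selected and non-selected options, and the u terms are the per-participant random intercept and random slopes. A pure difference account would predict coefficients of opposite sign for sel and nonsel, which is the pattern the curvature result is read against.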

      As for saccade latencies, we found saccade onsets to slow when the cost of the selected target was higher (β = .068, t = 2.844, p = .004). In contrast, saccade latencies were not significantly affected by the cost of the non-selected target (β = -.018, t = 1.457, p = .145), although numerically the effect was in the opposite direction. This shows that latencies were primarily driven by the cost of the selected target, but a difference account cannot be fully ruled out.

      Together, these analyses demonstrate that the difference in costs between two alternatives reliably affects oculomotor conflict, as indicated by the curvature analysis. However, saccade latencies are predominantly affected by the cost of the selected target, even when controlling for the obliqueness, updownness and leftrightness of the ensuing saccade. We have added these analyses here for completeness, but because the findings for saccade latency seem inconclusive, we have chosen not to include them in the current paper, which also keeps the paper concise and focused. We are open to including these analyses in the supplementary materials if the reviewer and/or editor would like us to.

      I was wondering why the authors haven't analyzed the pupil size in Experiment 2. If the pupil size can be assessed during a free viewing task (Experiment 3), shouldn't it be possible to also evaluate it in the saccade choice task?

      We did not analyze the pupil size data from the saccade preference task for two reasons. First, the number of saccades is much lower than in the natural search experiments (~14,000 vs. ~250,000). Second, in the saccade preference task, there were always two possible saccade targets. Therefore, even if we were able to isolate an effort signal, this signal could index a multitude of factors, such as deciding between two possible saccade targets (de Gee et al., 2014), and could reflect two oculomotor programs being realized instead of only a single one (Van der Stigchel, 2010).

      Discussion: "due to stronger presaccadic benefits for upward compared with downward saccades [93,94]". I think this should be the other way around.

      We thank the reviewer for pointing this out. We have corrected our mistake in the revised manuscript.

      Saccade latencies differ around the visual field; to account for that, results / pupil size should be (additionally) evaluated relative to saccade onset (rather than cue offset). It is interesting that latencies were not accounted for here (Exp1), since they are considered for Exp2 (where they correlate with a pupil size difference). I suspect that latencies not only correlate with the difference in pupil size, but directly with pupil size itself.

      We agree with the reviewer that locking the pupil size signal to saccade onset instead of cue offset may be informative. We included an analysis in the supporting information that investigates this (see Figure S1). The results of the analysis were conceptually identical.

      The reviewer writes that latencies were not accounted for in Experiment 1. Although saccade latency was not included in the final model reported in the paper, it was considered during AIC-based backward model selection. As saccade latency did not predict meaningful variance in pupil size, it was ultimately not included in the analysis as a predictor. For completeness, we here report the outcome of a linear mixed-effects model that does include saccade latency as a predictor. Here, saccade latencies did not predict pupil size (β = 1.859e-03, t = .138, p = .889). The asymmetry effects remained qualitatively unchanged: preparing oblique compared with cardinal saccades resulted in a larger pupil size (β = 7.635, t = 3.969, p < .001), and preparing downward compared with upward saccades also led to a larger pupil size (β = 3.344, t = 3.334, p = .003).

      In addition, we have included a new analysis in the supporting information that directly addresses this issue. We will reiterate the main results here:

      “To ascertain whether pupil size or other oculomotor metrics predict saccade preferences, we conducted a multiple regression analysis. We calculated average pupil size, saccade latency, landing precision and peak velocity maps across all 36 directions. The model, determined using AIC-based backward selection, included pupil size, latency and landing precision as predictors (Wilkinson notation: saccade preferences ~ pupil size + saccade latency + landing precision). The analysis revealed that pupil size (β = -42.853, t = 4.791, p < .001) and saccade latency (β = -.377, t = 2.106, p = .043) predicted saccade preferences. Landing precision did not reach significance (β = 23.631, t = 1.675, p = .104). Together, this demonstrates that although other oculomotor metrics such as saccade latency contribute to saccade selection, pupil size remains a robust marker of saccade selection.”
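      AIC-based backward selection of this kind can be illustrated with a small, dependency-free sketch. It is not the analysis code: the helper names and toy data are hypothetical, the AIC is the Gaussian form n·ln(RSS/n) + 2k (defined up to an additive constant), and the ordinary least-squares fit uses plain normal equations without any check for collinear predictors.

```python
import math


def ols_rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    aug = []
    for r in range(p):
        row = [sum(design[i][r] * design[i][c] for i in range(n)) for c in range(p)]
        row.append(sum(design[i][r] * y[i] for i in range(n)))
        aug.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(aug[r][k]))
        aug[k], aug[pivot] = aug[pivot], aug[k]
        for r in range(p):
            if r != k:
                factor = aug[r][k] / aug[k][k]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[k])]
    beta = [aug[r][p] / aug[r][r] for r in range(p)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def aic(columns, y):
    """Gaussian AIC up to an additive constant: n * ln(RSS / n) + 2k."""
    n = len(y)
    k = len(columns) + 2  # slopes + intercept + residual variance
    return n * math.log(ols_rss(columns, y) / n) + 2 * k


def backward_select(predictors, y):
    """Drop one predictor at a time for as long as doing so lowers the AIC."""
    kept = dict(predictors)
    best = aic(list(kept.values()), y)
    while kept:
        scores = {}
        for name in kept:
            reduced = [col for other, col in kept.items() if other != name]
            scores[name] = aic(reduced, y)
        name, score = min(scores.items(), key=lambda item: item[1])
        if score >= best:
            break
        del kept[name]
        best = score
    return list(kept)


# Hypothetical toy data: y depends on x1 only; x2 carries no extra information.
x1 = [float(i) for i in range(8)]
x2 = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
noise = [0.1, -0.1, -0.1, 0.1, -0.1, 0.1, 0.1, -0.1]
y = [1.0 + 2.0 * a + e for a, e in zip(x1, noise)]
selected = backward_select({"x1": x1, "x2": x2}, y)
print(selected)
```

      In the toy data, removing the uninformative predictor leaves the residual error essentially unchanged while saving one parameter, so the AIC falls and the predictor is dropped; removing the informative predictor raises the AIC sharply, so it is retained.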

      We have also added this point in our discussion:

      “We here measured cost as the degree of effort-linked pupil dilation. In addition to pupil size, other markers may also indicate saccade costs. For example, saccade latency has been proposed to index oculomotor effort [100], whereby saccades with longer latencies are associated with more oculomotor effort. This makes saccade latency a possible complementary marker of saccade costs (also see Supplementary Materials). Although relatively sluggish, pupil size is a valuable measure of attentional costs for (at least) two reasons. First, pupil size is a highly established marker of effort, and is sensitive to effort more broadly than only in the context of saccades [36–45, 48]. Pupil size therefore allows capturing not only the costs of saccades, but also of covert attentional shifts [33], or shifts with other effectors such as head or arm movements [54, 101]. Second, as we have demonstrated, pupil size can measure saccade costs even when searching in natural scenes (Figure 4). During natural viewing, it is difficult to disentangle fixation duration from saccade latencies, complicating the use of saccade latency as a measure of saccade cost. Together, pupil size, saccade latency, and potential other markers of saccade cost could fulfill complementary roles in studying the role of cost in saccade selection.”

      References

      Alnæs, D., Sneve, M. H., Espeseth, T., Endestad, T., van de Pavert, S. H. P., & Laeng, B. (2014). Pupil size signals mental effort deployed during multiple object tracking and predicts brain activity in the dorsal attention network and the locus coeruleus. Journal of Vision, 14(4), 1. https://doi.org/10.1167/14.4.1

      Awh, E., Belopolsky, A. V., & Theeuwes, J. (2012). Top-down versus bottom-up attentional control: A failed theoretical dichotomy. Trends in Cognitive Sciences, 16(8), 437–443. https://doi.org/10.1016/j.tics.2012.06.010

      Ballard, D. H., Hayhoe, M. M., & Pelz, J. B. (1995). Memory Representations in Natural Tasks. Journal of Cognitive Neuroscience, 7(1), 66–80. https://doi.org/10.1162/jocn.1995.7.1.66

      Beatty, J. (1982). Task-evoked pupillary responses, processing load, and the structure of processing resources. Psychological Bulletin, 91(2), 276–292. https://doi.org/10.1037/0033-2909.91.2.276

      Bumke, O. (1911). Die Pupillenstörungen bei Geistes-und Nervenkrankheiten (2nd ed.). Fischer.

      Curthoys, I. S., Markham, C. H., & Furuya, N. (1984). Direct projection of pause neurons to nystagmus-related excitatory burst neurons in the cat pontine reticular formation. Experimental Neurology, 83(2), 414–422. https://doi.org/10.1016/S0014-4886(84)90109-2

      David, L., Vassena, E., & Bijleveld, E. (2024). The unpleasantness of thinking: A meta-analytic review of the association between mental effort and negative affect. Psychological Bulletin, 150(9), 1070–1093. https://doi.org/10.1037/bul0000443

      de Gee, J. W., Knapen, T., & Donner, T. H. (2014). Decision-related pupil dilation reflects upcoming choice and individual bias. Proceedings of the National Academy of Sciences, 111(5), E618–E625. https://doi.org/10.1073/pnas.1317557111

      Deubel, H., & Schneider, W. X. (1996). Saccade target selection and object recognition: Evidence for a common attentional mechanism. Vision Research, 36(12), 1827–1837. https://doi.org/10.1016/0042-6989(95)00294-4

      Greenwood, J. A., Szinte, M., Sayim, B., & Cavanagh, P. (2017). Variations in crowding, saccadic precision, and spatial localization reveal the shared topology of spatial vision. Proceedings of the National Academy of Sciences, 114(17), E3573–E3582. https://doi.org/10.1073/pnas.1615504114

      Hanning, N. M., Himmelberg, M. M., & Carrasco, M. (2024). Presaccadic Attention Depends on Eye Movement Direction and Is Related to V1 Cortical Magnification. Journal of Neuroscience, 44(12). https://doi.org/10.1523/JNEUROSCI.1023-23.2023

      Himmelberg, M. M., Winawer, J., & Carrasco, M. (2023). Polar angle asymmetries in visual perception and neural architecture. Trends in Neurosciences, 46(6), 445–458. https://doi.org/10.1016/j.tins.2023.03.006

      Jepma, M., & Nieuwenhuis, S. (2011). Pupil Diameter Predicts Changes in the Exploration–Exploitation Trade-off: Evidence for the Adaptive Gain Theory. Journal of Cognitive Neuroscience, 23(7), 1587–1596. https://doi.org/10.1162/jocn.2010.21548

      Kahneman, D. (1973). Attention and Effort. Prentice-Hall.

      Kahneman, D., & Beatty, J. (1966). Pupil diameter and load on memory. Science (New York, N.Y.), 154(3756), 1583–1585. https://doi.org/10.1126/science.154.3756.1583

      King, W. M., & Fuchs, A. F. (1979). Reticular control of vertical saccadic eye movements by mesencephalic burst neurons. Journal of Neurophysiology, 42(3), 861–876. https://doi.org/10.1152/jn.1979.42.3.861

      Koevoet, D., Strauch, C., Naber, M., & Van der Stigchel, S. (2023). The Costs of Paying Overt and Covert Attention Assessed With Pupillometry. Psychological Science, 34(8), 887–898. https://doi.org/10.1177/09567976231179378

      Koevoet, D., Strauch, C., Van der Stigchel, S., Mathôt, S., & Naber, M. (2024). Revealing visual working memory operations with pupillometry: Encoding, maintenance, and prioritization. WIREs Cognitive Science, e1668. https://doi.org/10.1002/wcs.1668

      Kowler, E., Anderson, E., Dosher, B., & Blaser, E. (1995). The role of attention in the programming of saccades. Vision Research, 35(13), 1897–1916. https://doi.org/10.1016/0042-6989(94)00279-U

      Laeng, B., Sirois, S., & Gredebäck, G. (2012). Pupillometry: A Window to the Preconscious? Perspectives on Psychological Science, 7(1), 18–27. https://doi.org/10.1177/1745691611427305

      Loewenfeld, I. E. (1958). Mechanisms of reflex dilatation of the pupil. Documenta Ophthalmologica, 12(1), 185–448. https://doi.org/10.1007/BF00913471

      Mathôt, S. (2018). Pupillometry: Psychology, Physiology, and Function. Journal of Cognition, 1(1), 16. https://doi.org/10.5334/joc.18

      Naber, M., & Murphy, P. (2020). Pupillometric investigation into the speed-accuracy trade-off in a visuomotor aiming task. Psychophysiology, 57(3), e13499. https://doi.org/10.1111/psyp.13499

      Nozari, N., & Martin, R. C. (2024). Is working memory domain-general or domain-specific? Trends in Cognitive Sciences, 0(0). https://doi.org/10.1016/j.tics.2024.06.006

      Reppert, T. R., Heitz, R. P., & Schall, J. D. (2023). Neural mechanisms for executive control of speed-accuracy trade-off. Cell Reports, 42(11). https://doi.org/10.1016/j.celrep.2023.113422

      Richer, F., & Beatty, J. (1985). Pupillary Dilations in Movement Preparation and Execution. Psychophysiology, 22(2), 204–207. https://doi.org/10.1111/j.1469-8986.1985.tb01587.x

      Robison, M. K., & Brewer, G. A. (2020). Individual differences in working memory capacity and the regulation of arousal. Attention, Perception, & Psychophysics, 82(7), 3273–3290. https://doi.org/10.3758/s13414-020-02077-0

      Robison, M. K., & Unsworth, N. (2019). Pupillometry tracks fluctuations in working memory performance. Attention, Perception, & Psychophysics, 81(2), 407–419. https://doi.org/10.3758/s13414-018-1618-4

      Sahakian, A., Gayet, S., Paffen, C. L. E., & Van der Stigchel, S. (2023). Mountains of memory in a sea of uncertainty: Sampling the external world despite useful information in visual working memory. Cognition, 234, 105381. https://doi.org/10.1016/j.cognition.2023.105381

      Shadmehr, R., Reppert, T. R., Summerside, E. M., Yoon, T., & Ahmed, A. A. (2019). Movement Vigor as a Reflection of Subjective Economic Utility. Trends in Neurosciences, 42(5), 323–336. https://doi.org/10.1016/j.tins.2019.02.003

      Silva, M. F., Brascamp, J. W., Ferreira, S., Castelo-Branco, M., Dumoulin, S. O., & Harvey, B. M. (2018). Radial asymmetries in population receptive field size and cortical magnification factor in early visual cortex. NeuroImage, 167, 41–52. https://doi.org/10.1016/j.neuroimage.2017.11.021

      Sirois, S., & Brisson, J. (2014). Pupillometry. WIREs Cognitive Science, 5(6), 679–692. https://doi.org/10.1002/wcs.1323

      Sparks, D. L. (2002). The brainstem control of saccadic eye movements. Nature Reviews Neuroscience, 3(12), Article 12. https://doi.org/10.1038/nrn986

      Strauch, C., Wang, C.-A., Einhäuser, W., Van der Stigchel, S., & Naber, M. (2022). Pupillometry as an integrated readout of distinct attentional networks. Trends in Neurosciences, 45(8), 635–647. https://doi.org/10.1016/j.tins.2022.05.003

      Unsworth, N., & Miller, A. L. (2021). Individual Differences in the Intensity and Consistency of Attention. Current Directions in Psychological Science, 30(5), 391–400. https://doi.org/10.1177/09637214211030266

      Van der Stigchel, S. (2010). Recent advances in the study of saccade trajectory deviations. Vision Research, 50(17), 1619–1627. https://doi.org/10.1016/j.visres.2010.05.028

      Van der Stigchel, S. (2020). An embodied account of visual working memory. Visual Cognition, 28(5–8), 414–419. https://doi.org/10.1080/13506285.2020.1742827

      Van der Stigchel, S., & Hollingworth, A. (2018). Visuospatial Working Memory as a Fundamental Component of the Eye Movement System. Current Directions in Psychological Science, 27(2), 136–143. https://doi.org/10.1177/0963721417741710

      van der Wel, P., & van Steenbergen, H. (2018). Pupil dilation as an index of effort in cognitive control tasks: A review. Psychonomic Bulletin & Review, 25(6), 2005–2015. https://doi.org/10.3758/s13423-018-1432-y

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public review): 

      Summary: 

      Nitric oxide (NO) has been implicated as a neuromodulator in the retina. Specific types of amacrine cells (ACs) produce and release NO in a light-dependent manner. NO diffuses freely through the retina and can modulate intracellular levels of cGMP, or directly modify and modulate proteins via S-nitrosylation, leading to changes in gap-junction coupling, synaptic gain, and adaptation. Although these system-wide effects have been documented, it is not well understood how the physiological function of specific neuronal types is affected by NO. This study aims to address this gap in our knowledge. 

      There are two major findings. 1) About a third of the retinal ganglion cells display cell-type specific adaptation to prolonged stimulus protocols. 2) Application of NO specifically affected Off-suppressed ganglion cells designated as G32 cells. The G32 cluster likely contains 3 ganglion cell types that are differentially affected. 

      This is the first comprehensive analysis of the functional effects of NO on ganglion cells in the retina. The cell-type specificity of the effects is surprising and provides the field with valuable new information. 

      Strengths: 

      NO was expected to produce small effects, and considerable effort was expended in validating the system to ensure that changes in the state of the preparation would not confound any effects of NO. The authors used a sequential stimulus protocol to control for changes in the sensitivity of the retina during the extended recording periods. The approach potentially increases the sensitivity of the measurements and allows more subtle effects to be observed. 

      Neural activity was measured by Ca-imaging. Responsive ganglion cells were grouped into 32 types using a clustering analysis. Initial control experiments demonstrated that the cell-types revealed by the analysis largely recapitulate those from their earlier landmark study using a similar approach. 

      Application of NO to the retina modulated responses of a single cluster of cells, labeled G32, while having little effect on the remaining 31 clusters. In separate experiments, ganglion cell spiking activity was recorded on a multi-electrode array (MEA). Together the Ca-imaging and MEA recordings provide complementary approaches and demonstrate that NO modulates the temporal but not spatial properties of affected cell-types.

      Weaknesses: 

      The concentration of NO used in these experiments was ~0.25µM, which is 5- to 10-fold lower than the endogenous concentration previously measured in rodent retina. It is perhaps surprising that this relatively low NO concentration produced significant effects. However, the endogenous measurements were done in an eye-cup preparation, while the current experiments were performed in a bare (no choroid) preparation. Perhaps the resting NO level is lower in this preparation. It is also possible that the low concentration of NO promoted more selective effects.

      Reviewer #2 (Public review): 

      Neuromodulators are important for circuit function, but their roles in the retinal circuitry are poorly understood. This study by Gonschorek and colleagues aims to determine the modulatory effect of nitric oxide on the response properties of retinal ganglion cells. The authors used two-photon calcium imaging and multi-electrode arrays to classify and compare cell responses before and after applying the NO donor DETA-NO. The authors found that DETA-NO selectively increases activity in a subset of contrast-suppressed RGC types. In addition, the authors found cell-type specific changes in light response in the absence of pharmacological manipulation in their calcium imaging paradigm. This study focuses on an important question and the results are interesting. The limitations of the method and data interpretation are adequately discussed in the revised manuscript. 

      The authors have addressed my previous comments, included additional discussions on the limitations of the method, and provided a more careful interpretation of their data. 

      Recommendations for the authors: 

      Please correct the citation that reviewer #1 mentioned. In addition, a little more discussion of the NO concentration issue would be helpful. The low NO concentration is not a weakness in the data; it simply raises questions regarding the interpretation.

      Thank you for these recommendations.

      Regarding the citation error, we are not sure if Reviewer #1 refers to a citation formatting error or incorrect placement. In any case, we modified the text: We specified the extracted information regarding the NO concentrations and put the applied concentration into that context (Lines 621-635). In addition, we made clear that the citation of Guthrie (2014) refers to the dissertation, which can be easily retrieved via Google Scholar. We also cited the mentioned ARVO abstract by Guthrie and Mieler (2014). 

      We hope that these modifications solve the above-mentioned issues. 


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):  

      Summary: 

      Nitric oxide (NO) has been implicated as a neuromodulator in the retina. Specific types of amacrine cells (ACs) produce and release NO in a light-dependent manner. NO diffuses freely through the retina and can modulate intracellular levels of cGMP, or directly modify and modulate proteins via S-nitrosylation, leading to changes in gap-junction coupling, synaptic gain, and adaptation. Although these system-wide effects have been documented, it is not well understood how the physiological function of specific neuronal types is affected by NO. This study aims to address this gap in our knowledge. 

      Strengths: 

      NO was expected to produce small effects, and considerable effort was expended in validating the system to ensure that any effects of NO would not be confounded by changes in the state of the preparation. The authors used a paired stimulus protocol to control for changes in the sensitivity of the retina during the extended recording periods. The approach potentially increases the sensitivity of the measurements and allows more subtle effects to be observed. 

      Neural activity was initially measured by Ca-imaging. Responsive ganglion cells were grouped into 32 types using a clustering analysis. Initial control experiments demonstrated that the cell-types revealed here largely recapitulate those from their earlier landmark study using the same approach (Fig. 2). 

      Application of NO to the retina strongly modulated responses of a single cluster of cells, labeled G32, while having little effect on the remaining 31 clusters. This result is evident in Fig. 3e. 

      Separate experiments measured ganglion cell spiking activity on a multi-electrode array (MEA). Clustering analysis of the peri-stimulus spike-time histograms (PSTHs) obtained from the MEA data also revealed 32 clusters. The PSTHs for each cluster were aligned to the Ca-imaging data using a convolution approach. The higher temporal resolution of the MEA recordings indicated that NO increased the speed of sub-cluster 2 responses but had no effect on receptive field size. The physiological significance of the small change in kinetics remains unclear. 

      We thank the reviewer for their detailed and constructive comments.

      Weaknesses: 

      The G32 cluster was further divided into three sub-types using Bayesian Information Criterion (BIC) based on the temporal properties of the Ca-responses. This sub-clustering result seems questionable due to the small difference in the BIC parameter between 2 and 3 clusters. Three sub-clusters of the G32 cluster were also revealed for the PSTH data, however, the BIC analysis was not applied to further validate this result. 

      (1.1) We agree with the reviewer that this is an important point to be clarified. To this end, we repeated the analysis with n=2 clusters (see Author response image 1 below). In brief, we found that the overall interpretation did not change: Both clusters in the Ctrl1-dataset showed barely any type-specific adaptational effects, whereas under NO application, temporal contrast responses decreased (see Author response image 1 below). If requested, we would be happy to add this image to the supplementary material. 

      Author response image 1.

      In an additional analysis, we evaluated whether n=2 or n=3 was the “better” choice for the number of clusters. In the new Supplementary Fig. S4, we compared the clusters with n=2 (top) and n=3 (bottom). For n=2, the two clusters are relatively strongly correlated for both visual stimuli, whereas for n=3, the clusters become more distinct, especially with respect to differences in the correlations for the two stimuli (Fig. S4b). For n=2, the low intra-cluster correlation (ICC) strongly suggests that cluster 2 contains multiple response types (ICC(C2) = 0.5 ± 0.48, mean ± s.d.; Fig. S4c). For n=3, the mean ICC values are high for all three clusters (ICC(C1) = 0.81 ± 0.16; ICC(C2) = 0.86 ± 0.07; ICC(C3) = 0.83 ± 0.1; mean ± s.d.). Together, this suggests that n=3 clusters capture the response diversity in G32 better than n=2 clusters.

      Finally, we performed a BIC analysis for the MEA dataset and found the optimal number of clusters to be also n=3 (see new Suppl. Fig. S5).

      The alignment of sub-clusters 1, 2, and 3 identified in the Ca-imaging and the MEA recordings seemed questionable, because the temporal properties of clusters did not align well, nor did the effects of NO. 

      (1.2) To address this important point, we analyzed the correlations between the control responses of the three clusters from the Ca<sup>2+</sup>-dataset with the ones from the MEA-dataset (see new Suppl. Fig. S7). To avoid confusion, we named the clusters in the MEA-dataset i, ii, and iii (see Fig. 8). We found two of the three clusters to be highly correlated (Ca<sup>2+</sup> clusters 2, 3 and MEA clusters iii, ii), whereas one cluster was much less so (cluster 1 vs. cluster i), likely due to differences in response kinetics. In clusters i and ii, NO application led to a release of suppression for temporal contrasts – similar to what we observed in the Ca<sup>2+</sup> data (see also our new analysis of the MEA data in Suppl. Fig. S6, as discussed further below).

      We agree that the cell types underlying the Ca<sup>2+</sup> and MEA G32 clusters may not be the same – aligning functional types between those two methods is challenging due to several factors, mainly because while Ca<sup>2+</sup> is a proxy for spiking activity, other Ca<sup>2+</sup> sources as well as sub-threshold membrane potential changes affect the intracellular Ca<sup>2+</sup>, potentially in a cell type-specific way. We explain this now better in the text.

      In any case, our main point was not to unambiguously align the cell types but to show that in both datasets, we find three subclusters of G<sub>32</sub>, which are affected by NO in a differential manner, particularly their suppression to temporal contrasts.

      The title of the paper indicates that nitric oxide modulates contrast suppression in a subset of mouse retinal ganglion cells, however, this result appears to be inferred from previous results showing that G32 is identified as a "suppressed-by-contrast" cell. The present study does not explicitly evaluate the amount of contrast-suppression in G32 cells. 

      (1.3) The reviewer is correct in that we did not quantify contrast-suppression in G<sub>32</sub> in detail but focused on the responses to temporal contrast (chirp and moving bar) and its modulation by NO (Fig. 5). In this context, please note that G<sub>32</sub>’s responses to the moving bar stimulus suggest that the cells are also suppressed by spatial contrast (i.e., an edge appearing in their RF). The functional RGC type G<sub>32</sub> (“Off suppressed 2”) was defined in an earlier study (Baden et al. 2016); it was assigned to the “Suppressed-by-Contrast” (SbC) category mainly because temporal contrast suppresses its responses. Already then, coverage analysis indicated that G<sub>32</sub> may indeed contain several RGC types – in line with our clustering analysis. It is still unclear if G<sub>32</sub> contains one (or more) of the SbC cells described by Jacoby & Schwartz (2018); in their recent study, Wienbar and Schwartz (2022) introduced the novel bursty-SbC RGC, which Goetz et al. (2022) speculated to potentially align with G<sub>32</sub>. We now discuss the relationship between G<sub>32</sub> and the SbC RGCs defined in other studies in the revised manuscript.

      In its current form, the work is likely to have limited impact, since the morphological and functional properties of the affected sub-cluster remain unknown. The finding that there can be cell-specific adaptation effects during experiments on in vitro retina is important new information for the field.

      (1.4) Again, we thank the reviewer for the detailed and helpful feedback. We hope that the reviewer finds our revised manuscript improved.

      Reviewer #1 (Recommendations For The Authors):  

      Most of the calcium activity traces (dF/F) throughout the paper have neither vertical nor horizontal calibration bars. Presumably, most values are positive, but this is unclear as a zero level is not indicated anywhere. Without knowing where zero dF/F is, it is not possible to determine whether the NO increased the Ca-signal or blocked a decrease in the Ca-signal. 

      Both ∆F/F and z-scoring, as we used here, are ways to normalize Ca<sup>2+</sup> traces. We decided against using ∆F/F<sub>0</sub> because this typically assumes that F<sub>0</sub> represents the cell’s Ca<sup>2+</sup> resting level (without activity). However, in our measurements, the “resting” Ca<sup>2+</sup> levels (i.e. before presenting a stimulus) may indeed reflect no spiking activity (e.g., in an ON RGC) but may also reflect baseline spiking activity (e.g., in a G<sub>32</sub> cell, which has a baseline firing rate of ~10 Hz; see Fig. S6). Hence, we used z-scoring, which carries no assumption that the resting Ca<sup>2+</sup> level corresponds to no activity. In practice, we normalized all traces to the Ca<sup>2+</sup> level prior to the light stimulus and defined this as zero (as described in the Methods).

      We considered the reviewer’s suggestion of adding zero lines to every trace but felt that this would hamper the overall readability of the figures.

      Regarding calibration bars: We made sure that horizontal bars (indicating time) are present in all figures. We decided to leave out vertical bars in Ca<sup>2+</sup> responses, because as explained above, the traces are normalized (and unit-free), and within a figure all traces are scaled the same.

      Points of clarification for the Methods: 

      (1) The stimulus field was 800 x 600 µm. Presumably, both scan fields were contained within this region when scanning either Field 1 or Field 2 so that the adaptation level of the preparation at both locations was maintained? 

      Yes, the stimulation field is always kept centered on the respective recording (scan) field and the adaptation level for each recording field was maintained.

      (2) There appeared to be an indeterminate amount of time between the initial 10-minute adaptation period and Ctrl1, whereas there were no such gaps between subsequent scans. Is this likely to produce differences in adaptation state and thus represent a systematic error? 

      At this time point, recording (scan) fields were selected to make sure that the cells in the field were uniformly labelled with the Ca<sup>2+</sup> indicator and responsive to light stimuli. This typically happened already at the end of the light adaptation phase and/or right after. When selecting the fields, light stimuli were presented (to test responsiveness) and thereby the adaptation level was maintained independent of the duration of this procedure, minimizing systematic errors.

      (3) Was the dense white noise stimulus applied during the wash-in period to maintain the adaptation state of the preparation prior to the subsequent scan? 

      The dense noise was not applied throughout the wash-in period but for at least 5-10 min before the field was recorded with a drug (e.g., NO).

      Fig. 1d illustrates very nicely how the stimuli align with the responses. It would have been helpful to have this format continue throughout the paper but unfortunately, the vertical lines are dropped in Fig. 2a and then the stimulus waveform is omitted in Fig. 2e onwards. 

      Thanks, good idea. We added the vertical lines and the stimulus waveform to the figures where they were missing to improve the readability. 

      What was the rationale for selecting the concentration of the NO donor used? Is it likely to mimic natural levels? 

      A DETA/NO concentration of 100 µM is commonly used in studies investigating NO-induced effects. DETA/NO has a half-life (t<sub>0.5</sub>) of 20 hours, which makes it more suitable for application in tissues (like our whole-mount preparation), because the donor can penetrate into the tissue before releasing NO. In turn, this long t<sub>0.5</sub> means that only a fraction of the bound NO is released per time unit.

      Based on t<sub>0.5</sub> for DETA/NO and NO, one can roughly estimate the NO range as follows: t<sub>0.5</sub> of NO strongly depends on the tissue and is estimated to be in the second-to-minute range (Beckman & Koppenol, 1996). Assuming a t<sub>0.5</sub> for NO of 2 minutes, a freshly prepared 100 µM DETA/NO solution is expected to yield, within the first hour, an NO concentration of approx. 0.25 µM (taking into account that 1 mole of DETA/NO releases 1.5 moles of NO molecules; see Ramamurthi & Lewis 1997).
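
      The estimate above can be reproduced with a short calculation. The following is only a sketch of one way to arrive at the quoted value, assuming first-order decay of the donor, first-order removal of NO, and a quasi-steady state in which NO release and removal balance; the function name and its parameters are illustrative and not taken from the manuscript.

```python
import math

def steady_state_no_um(donor_um, donor_half_life_min, no_half_life_min,
                       no_per_donor=1.5, elapsed_min=0.0):
    """Quasi-steady-state NO concentration (in uM) for a first-order NO donor.

    Assumption: NO release (donor decay) and NO removal are both first-order,
    and NO removal is fast enough that release and removal balance.
    """
    k_donor = math.log(2) / donor_half_life_min  # donor decay rate (1/min)
    k_no = math.log(2) / no_half_life_min        # NO removal rate (1/min)
    # Donor remaining after elapsed_min of first-order decay.
    donor_left = donor_um * math.exp(-k_donor * elapsed_min)
    # NO released per minute (uM/min), using 1.5 mol NO per mol donor.
    release_rate = no_per_donor * k_donor * donor_left
    return release_rate / k_no

# Values from the text: 100 uM donor, 20 h donor half-life, 2 min NO half-life.
fresh = steady_state_no_um(100.0, 20 * 60, 2.0)
after_1h = steady_state_no_um(100.0, 20 * 60, 2.0, elapsed_min=60.0)
print(f"{fresh:.3f} uM at t = 0, {after_1h:.3f} uM after 1 h")
```

      Under these assumptions the half-life terms reduce to 1.5 × 100 µM × (2 min / 1200 min), which gives 0.25 µM for a fresh solution and only slightly less after one hour, because very little donor is consumed in that time.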

      In general, it is difficult to determine the physiological concentration of NO in the retina. Different measurements point at peaks of a few 100 nM (e.g., frog retina, ganglion cells: 0.25 µM, Kalamkarov et al. 2016; rodent inner retina, 0.1 to 0.4 µM, Micah et al. 2014). Hence, the NO concentrations we apply should be within the measured physiological range.

      Fig. 3e: what are the diamond symbols? If these are the individual cells, it might be better to plot them on top of the box plots so all are visible. 

      Indeed, the diamond symbols represent individual cells, yet outliers only. We decided not to plot all cells as a dot plot on top of the box plots since the readability will suffer as there are too many individual dots to show, e.g., n=251 for G<sub>32</sub> Ctrl and n=135 for G<sub>32</sub> DETA/NO.

      Fig. 3: please explain more clearly the x-axis units in a-d and the y-axis units in e. 

      To estimate potential response differences between the first and the second scan (i.e. either Ctrl 2 or NO), the traces were subtracted cell-pairwise (∆ Ctrl: Ctrl 2 – Ctrl 1; ∆ DETA/NO: NO – Ctrl 1). As all Ca<sup>2+</sup> traces were normalized, they are unit-free. Therefore, the x-axes in Fig. 3a-d represent the mean differences of each cell per cell type, e.g., a value of zero would mean that the traces of Ctrl 1 and Ctrl 2 for a cell are identical. The y-axis in Fig. 3e is also unit-free, because technically, it is the same measure as in Fig. 3a-d. But since it summarizes the control- and NO-data, we refer to this as “delta mean trace.” We tried to make this clearer in the revised manuscript, and a detailed description can be found in the Methods.
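
      As an illustration of the cell-pairwise subtraction described above, the sketch below computes one summary value per cell. It assumes that the summary is the time average of the signed difference trace; the exact measure is defined in the Methods and may differ. The function name and the example traces are hypothetical.

```python
from statistics import mean

def delta_mean_trace(first_scan, second_scan):
    """Time-averaged difference (second - first) for one cell's normalized traces.

    Assumption: both traces are unit-free, equally sampled, and come from the
    same cell recorded in two consecutive scans.
    """
    if len(first_scan) != len(second_scan):
        raise ValueError("traces must have the same number of samples")
    return mean(b - a for a, b in zip(first_scan, second_scan))

# Hypothetical normalized traces of one cell.
ctrl1 = [0.0, 0.5, 1.0, 0.2]
ctrl2 = [0.0, 0.5, 1.0, 0.2]  # unchanged response in the second control scan
no = [0.0, 0.7, 1.0, 0.6]     # larger response in the presence of the NO donor

d_ctrl = delta_mean_trace(ctrl1, ctrl2)  # delta Ctrl: Ctrl 2 - Ctrl 1
d_no = delta_mean_trace(ctrl1, no)       # delta DETA/NO: NO - Ctrl 1
print(d_ctrl, round(d_no, 3))
```

      With identical control traces the control value is zero, while the hypothetical NO trace yields a positive value, which corresponds to a rightward shift on the x-axis of Fig. 3a-d.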

      Fig. 3: "...a substantial number of RGC types (34%) changed their responses to chirp and/or moving bar stimuli in the absence of any pharmacological perturbation in a highly reproducible manner...". How many of the cell types showed a significant difference? Two cell-types with p<0.001 are highlighted with 3 asterisks. It would be helpful to indicate on this plot which of the other cells showed significant differences.

      Yes, this is a good idea. Thank you. We tried to add this information to the figure, but it became rather crowded. Therefore, we added a new Suppl. Fig. S3 (same style as Fig. 3) where we exclusively summarized the control-dataset. 

      Fig. 7: To illustrate the transform from PSTH to Ca-imaging, why not use G32 data as an example?

      Fair point. We modified the figure and added G<sub>32</sub> as an example.

      It would be clearer if the cells were labeled consistently throughout the paper using their Baden cluster numbers rather than switching to the older nomenclature (JAM-B, local edge, alpha, etc), e.g. Fig. 7a,b. 

      In the revised manuscript, we have changed the nomenclature to the terminology of the Ca<sup>2+</sup> study by Baden et al. (2016). We had used the alternative cell type names because, at the point where Fig. 7a is discussed in the manuscript, the cell type matching had not yet been introduced. But we agree that a consistent nomenclature is helpful.

      The evidence supporting the sub-clustering of the G32 cells for the two recording methods could have been stronger. In Fig. 5, the BIC difference between 2 and 3 clusters is rather small. Is this result robust enough to justify 3 rather than 2 clusters? The BIC analysis should also be performed on the PSTH data-set to support the notion that the MEA G32 cluster also contains 3 rather than 2 sub-clusters. 

      Regarding the sub-clustering of G<sub>32</sub> into n=2 or n=3 clusters for both datasets, please see our detailed reply #1.1 in our response to the public comments above.

      The alignment of the three sub-clusters across the Ca-imaging and MEA data looked questionable. For example, the cluster 2 and cluster 3 traces in Fig. 5e,f look similar, with cluster 1 being more different. In Fig. 8c on the other hand, cluster 1 and 3 look similar with cluster 2 being more different. The pharmacological results also did not align well. For the Ca-imaging, NO appeared to have a large effect on cluster 1, a more modest effect on cluster 2 and less effect on cluster 3 (Fig. 5f). In comparison, the MEA results diverged, with NO producing the largest effect on cluster 2 and very modest if any effects on clusters 1 and 3 (Fig. 8c). Moreover, the temporal properties of cluster 1 and cluster 3 look very different between the Ca-imaging and MEA data. Without further comment, these differences raise concerns about the reliability of the clustering and the validity of comparisons made across the two sets of experiments. 

      We agree that this is a critical point. Please see our reply #1.2 in our response to the public comments above.

      Fig. 8: Transforming the PSTHs into Ca-traces is important to align the MEA recordings with the Ca-imaging data. It would also be very informative to see a more detailed overall presentation of the PSTH data since it provides a much higher temporal resolution of the responses. For example, illustrating the average PSTHs for the G32 cells under all the experimental conditions could be quite illuminating. 

      To address this point, we added a new Supplementary Fig. S6, which shows the pseudo-Ca<sup>2+</sup> traces for each cluster and condition next to the PSTHs. In addition, we quantified the cumulative firing rate for response features (time windows) where temporal suppression was observed in the Ca<sup>2+</sup> data. This new analysis shows that during NO-application, we can see an increase in firing rate in all clusters. Nevertheless, the effect of NO on the PSTHs is admittedly small and it is better visible in the pseudo-Ca<sup>2+</sup> transformed traces. One possible explanation for this difference may be that the overall firing rates are quite dynamic in G<sub>32</sub> such that a significant increase in “suppression” phases relative to the peak firing appears small.

      Reviewer #2 (Public Review):  

      Neuromodulators are important for circuit function, but their roles in the retinal circuitry are poorly understood. This study by Gonschorek and colleagues aims to determine the modulatory effect of nitric oxide on the response properties of retinal ganglion cells. The authors used two photon calcium imaging and multi-electrode arrays to classify and compare cell responses before and after applying a NO donor DETA-NO. The authors found that DETA-NO selectively increases activity in a subset of contrast-suppressed RGC types.

      In addition, the authors found cell-type specific changes in light response in the absence of pharmacological manipulation in their calcium imaging paradigm. While this study focuses on an important question and the results are interesting, the following issues need further clarification for better interpretation of the data. 

      We thank the reviewer for her/his detailed and constructive comments.

      (1) Design of the calcium imaging experiments: the control-control pair has a different time course from the control-drug pair (Fig 1e). First, the control-control pair has a 10 minute interval while the control-drug pair has a 25 minute interval. Second, Control 1 Field 2 was imaged 10 min later than Control 1 Field 1 since the start of the calcium imaging paradigm. 

      Given that the control dataset is used to control for time-dependent adaptational changes throughout the experiment, I wonder why the authors did not use the same absolute starting time of imaging and the same interval between the first and second round of imaging for both the control-control and the control-drug pairs. This can be readily done in one of two ways: 1. In a set of experiments, add DETA/NO between "Control 1 Field 1" and "Control 2 Field 1" in Fig. 1e as the drug group; or 2. Omit DETA/NO in the Fig. 1e protocol as the control group to monitor the time course of adaptational changes.

      Thank you for raising this point. We hope that in the following we can clarify the reasoning behind our protocol and the analysis approach.

      (2.1) Initially, we performed these experiments in different ways (also in the sequence suggested by the reviewer), before homing in on the paradigm illustrated in Fig. 1. We chose this paradigm for two reasons: First, we wanted to have for each retina both Ctrl1/Ctrl2 and Ctrl1/NO data sets, to be sure that the time-dependent (adaptational) effects were not related to the general condition of an individual retina preparation. Second, we did not see obvious differences in time-dependent or NO-induced effects between paradigms. Therefore, while we cannot exclude that the absolute time between recordings can affect the observed changes, we do not think that such effects are substantial enough to change our conclusions.

      In the revised manuscript, we now explicitly point at the different intervals. 

      Related to the concern above, to determine NO-specific effect, the authors used the criterion that "the response changes observed for control (ΔR(Ctrl2−Ctrl1)) and NO (ΔR(NO−Ctrl1)) were significantly different". This criterion assumes that without DETA-NO, imaging data obtained at the time points of "Control 1 Field 2" and "DETA/NO Field 2" would give the same value of ΔR as ΔR(Ctrl2−Ctrl1) for all RGC types. It is not obvious to me why this should be the case, because of the unknown time-dependent trajectory of the adaptational change for each RGC type. For example, an RGC type could show a stable response in the first 30 min and then change significantly in the following 30 min. DETA/NO may counteract this adaptational change, leading to the same ΔR as the control condition (false negative). Alternatively, DETA/NO may have no effect, but the nonlinear time-dependent response drift can give false positive results.

      (2.2) Initially, we assumed that after adapting the retina to a certain light level, RGCs exhibit stable responses over time, such that when adding a pharmacological agent, we can identify drug-induced response changes (e.g., by calculating the response difference). To our surprise, we found that for some RGC types the responses changed between the first and the second recording (referred to as cell type-specific adaptational effects), which is why we devised the Ctrl1/Ctrl2 vs. Ctrl1/NO analysis.

      The reviewer is correct in that we assume in our analysis that the adaptational- and NO-induced effects are independent and sum linearly. Further, we agree with the reviewer that there may be other possibilities, two of which are highlighted by the reviewer:

      (a) Interaction: for instance, if NO compensates for the adaptational effect, we would not be able to measure this; or, if this compensation was partial, underestimate both effects. 

      (b) More complex time-dependency: for example, if an RGC shows a pronounced adaptational effect with a longer delay (i.e. only after the second scan), or that a very transient NO effect has already disappeared when we perform the second scan. On the one hand, as we only can take snapshots of the RGC responses, we cannot exclude these possibilities. On the other hand, both effects (adaptational- and NO-dependent) were type-specific and reproducible between experiments (also with varying timing, see reply #2.1), which makes complex time dependencies less likely.

      The revised manuscript now reflects these limitations of our recording paradigm and points out which effects can be detected, and which likely not.
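
      The additive assumption and the two failure modes discussed in this reply can be made concrete with a toy calculation. This is a sketch only: it assumes that the change measured in the drug pair is the sum of an adaptational drift and an NO effect, and all numerical values and function names are hypothetical.

```python
def estimated_no_effect(delta_ctrl, delta_no):
    """NO-specific change under the additive assumption.

    delta_ctrl stands for the change Ctrl2 - Ctrl1, delta_no for NO - Ctrl1.
    """
    return delta_no - delta_ctrl

def simulate(drift_ctrl_pair, drift_drug_pair, true_no_effect):
    """Return the estimate when the drift may differ between the two pairs."""
    delta_ctrl = drift_ctrl_pair                  # control pair: drift only
    delta_no = drift_drug_pair + true_no_effect   # drug pair: drift plus NO effect
    return estimated_no_effect(delta_ctrl, delta_no)

# Same drift in both pairs: the true NO effect is recovered.
same_drift = simulate(0.2, 0.2, 0.3)
# Late drift that NO exactly counteracts: no difference is detected.
false_negative = simulate(0.0, 0.2, -0.2)
# Late drift without any NO effect: a spurious difference appears.
false_positive = simulate(0.0, 0.2, 0.0)
print(same_drift, false_negative, false_positive)
```

      The estimate equals the true NO effect only when the drift is the same in both pairs; any mismatch between the two drifts is added directly to the estimate, which is the limitation acknowledged above.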

      I also wonder why washing-out, a standard protocol for pharmacological experiments, was not done for the calcium protocol since it was done in the MEA experiments. A reversible effect by washing in and out DETA/NO in the calcium protocol would provide a much stronger support that the observed NO modulation is due to NO and not to other adaptive changes. 

      (2.3) We agree that a clear wash-out would strengthen our findings. Indeed, in the beginning of our experiments, we tried to wash-out the agent in the Ca<sup>2+</sup> recordings, as we did in the MEA recordings. We soon stopped doing this in the Ca<sup>2+</sup> experiments, because response quality decreased for the third scan of the same field, likely due to bleaching of fluorescent indicator and photopigment. This is why we typically restrict the total recording time of the same field of RGCs to about 30 min (~ two scans with all light stimuli). Moreover, our MEA data showed that DETA/NO can largely be washed-out, which supports that we observed NO-specific effects. Therefore, we decided against further attempts to establish the wash-out also in the Ca<sup>2+</sup> experiments (e.g., shortening the recording time by presenting fewer light stimuli).

      (2) Effects of Strychnine: In lines 215-219, " In the light-adapted retina, On-cone BCs boost light-Off responses in Off-cone BCs through cross-over inhibition (83, 84) and hence, strychnine affects Off-response components in RGCs - in line with our observations (Fig. S2)" However, Fig. S2 doesn't seem to show a difference in the Off-response components. Rather, the On response is enhanced with strychnine. In addition, suppressed-by-contrast cells are known to receive glycinergic inhibition from VGluT3 amacrine cells (Tien et al., 2016). However, the G32 cluster in Fig. S2 doesn't seem to show a change with strychnine. More explanation on these discrepancies will be helpful.

      (2.4) We thank the reviewer for this comment. Regarding the first part, we agree that the figure does not support differences in the Off-response components. We therefore rephrased the corresponding text accordingly. Additionally, we now show all RGC types with n>3 cells per recording condition in the revised Suppl. Fig. S2 and added statistics.

      Regarding the second part, there are several possible explanations for these discrepancies:

      (a) The SbC (transient Off SbC) studied in Tien et al. (2016) likely corresponds to the RGC type G<sub>28</sub> (see Höfling et al. 2024). As mentioned above (see reply #1.3), it is unclear if G<sub>32</sub> corresponds to a previously described SbC, and if so, to which. Goetz et al. (2022) proposed that G<sub>32</sub> may align with the bursty-SbC (bSbC) type (their Supplemental Table 3), as described also by Wienbar and Schwartz (2022). An important feature of the bSbC type is that its contrast response function is mainly driven by intrinsic properties rather than synaptic input. If G<sub>32</sub> indeed included the bSbC, this may explain why strychnine does not interfere with the suppression of temporal contrast.

      (b) In Tien et al. (2016), the authors genetically removed the VG3-ACs (see their Fig. 3) and show that this ablation reduces the inhibition of tSbC cells in a stimulus size-dependent manner. Specifically, larger light stimuli (600 µm) only show marginal effects on the IPSCs and inhibitory synaptic conductance (see their Figs. 3c,d and 3e,f, respectively). In our study, the full-field chirp had a size of 800 x 600 µm. Therefore – and assuming that G<sub>32</sub> indeed included tSbCs – our observation that strychnine did not affect temporal suppression in the full-field chirp responses would be in line with Tien et al. (2016).   

      (3) This study uses DETA-NO as an NO donor for enhancing NO release. However, a previous study by Thompson et al., Br J Pharmacol. 2009 reported that DETA-NO can rapidly and reversibly induce a cation current independent of NO release at the 100 µM used in the current study, which could potentially cause the observed effect in the G32 cluster such as reduced contrast suppression and increased activity. This potential caveat should at least be discussed, and ideally excluded by showing the absence of DETA-NO effects in nNOS knockout mice, and/or by using another pharmacological reagent such as the NO donor SNAP or the nNOS inhibitor l-NAME.

      Thank you for pointing out this potential caveat. We certainly cannot exclude such side effects. However, we think that this explanation of our observations is unlikely, because Thompson et al. barely see effects at 100 µM DETA/NO; in fact, their data suggest that clear NO-independent effects on the cation-selective channel occur at much higher DETA/NO concentrations, such as 3 mM.

      In any case, in the revised manuscript, we refer to this paper in the Discussion.

      (4) Clarification of methods: In the Methods, lines 1119-1127, the authors describe the detrending, baseline subtraction, and averaging. Then, line 1129, "the mean activity r(t) was computed and then traces were normalized such that: max<sub>t</sub>(|r(t)|) = 1". How is the normalization done? Is it over the entire recording (control and wash in) for each ROI? Or is it normalized based on the mean trace under each imaging session (i.e. twice for each imaging field)?

      The normalization (z-scoring) was done for each ROI individually per stimulus and condition (Ctrl 1, Ctrl 2, DETA/NO). We normalized the traces, because the absolute Ca<sup>2+</sup> signal depends on factors, such as “resting” state of the cell (e.g., silent vs. baseline spiking activity in the absence of a light stimulus) and its fluorescent dye concentration. This also means that absolute response amplitudes are difficult to interpret. Hence, we focused on analyzing relative changes per ROI and condition, which still allowed us to investigate adaptational and drug-induced effects. In the revised manuscript, we changed the corresponding paragraph for clarification.
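
      To make the per-ROI normalization concrete, the sketch below follows the rule quoted by the reviewer (maximum absolute value equal to 1) after setting the pre-stimulus level to zero. It is an illustration under these assumptions rather than the authors' code, it is not a standard-deviation-based z-score, and the function name, the baseline window, and the example trace are hypothetical.

```python
from statistics import mean

def normalize_trace(trace, n_baseline):
    """Subtract the pre-stimulus mean, then scale so that max |r(t)| equals 1.

    Assumption: the first n_baseline samples were recorded before stimulus onset.
    Applied separately to each ROI, stimulus, and condition.
    """
    baseline = mean(trace[:n_baseline])
    shifted = [x - baseline for x in trace]
    peak = max(abs(x) for x in shifted)
    if peak == 0:
        return shifted  # flat trace: nothing to scale
    return [x / peak for x in shifted]

# Hypothetical raw trace of one ROI: three baseline samples, then a response.
raw = [2.0, 2.0, 2.0, 4.0, 6.0, 1.0]
norm = normalize_trace(raw, n_baseline=3)
print(norm)
```

      Because each trace is rescaled on its own, absolute amplitudes are lost and only the relative shape of the response per ROI and condition remains comparable, consistent with the focus on relative changes described above.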

      As for the clustering of RGC types, I assume that each ROI's cluster identity remains unchanged through the comparison. If so, it may be helpful to emphasize this in the text.

      Yes, this is correct. We identified G<sub>32</sub> RGCs based on their Ctrl1 responses and then compared these responses with those for Ctrl2 or NO. We have now clarified this in the revised manuscript.

      Reviewer #2 (Recommendations For The Authors):  

      The manuscript would benefit from a discussion of how the findings in this study relate to known mechanisms of NO modulation and previously reported effects of NO manipulations on RGC activity. 

      Thank you for the recommendation. We already refer to known mechanisms of NO within the retina in the Introduction. In the revised manuscript, we now added information to the Discussion.

      In the abstract, "a paired-recording paradigm" could be misleading because paired recording generally refers to the simultaneous recording of two neurons. However, the paradigm in this study is essentially imaging experiments done at two time points. 

      We agree with the reviewer. To avoid any confusion with paired electrophysiological recordings, we changed the term “paired-recording paradigm” to “sequential recording paradigm” and replaced the term “pair-/ed” with “sequentially recorded”.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The manuscript investigates the role of membrane contact sites (MCSs) and sphingolipid metabolism in regulating vacuolar morphology in the yeast Saccharomyces cerevisiae. The authors show that tricalbin (1-3) deletion leads to vacuolar fragmentation and the accumulation of the sphingolipid phytosphingosine (PHS). They propose that PHS triggers vacuole division through MCSs and the nuclear-vacuolar junction (NVJ). The study presents some solid data and proposes potential mechanisms underlying vacuolar fragmentation driven by this pathway. However, there are some concerns regarding the strength and interpretation of their lipid data, and the robustness of some conclusions. The manuscript would benefit from addressing these concerns and providing more conclusive evidence to support the proposed conclusions. Overall, the study provides valuable insights into the connection between MCSs, lipid metabolism, and vacuole dynamics, but further clarification will be highly valuable to strengthen the conclusions.

      We thank Reviewer #1 for the thoughtful and positive feedback. The reviewer raised concerns regarding the strength and interpretation of the lipid data, as well as the robustness of specific conclusions. We acknowledge the importance of addressing these concerns and of providing more conclusive evidence to support our proposed conclusions. We have responded in the "Recommendations to Authors" section and hope that our research has been further strengthened.

      Reviewer #2 (Public Review):

      This manuscript investigates the mechanism behind the accumulation of phytosphingosine (PHS) and its role in triggering vacuole fission. The study proposes that membrane contact sites (MCSs) are involved in two steps of this process. First, tricalbin-tethered MCSs between the endoplasmic reticulum (ER) and the plasma membrane (PM) or Golgi modulate the intracellular amount of PHS. Second, the accumulated PHS induces vacuole fission, most likely via the nuclear-vacuolar junction (NVJ). The authors suggest that MCSs regulate vacuole morphology through sphingolipid metabolism.

      While some of the results in the manuscript are interesting, the overall logic is hard to follow. In my assessment of the manuscript, my primary concern lies in its broad conclusions which, in my opinion, exceed the available data and raise doubts. Here are some instances where this comes into play for this manuscript:

      We greatly appreciate Reviewer #2's careful insights into our research. We address the points one by one below.

      Major points for revision

      1) The rationale to start investigating a vacuolar fission phenotype in the beginning is very weak. It is basically based on a negative genetic interaction with NVJ1. Based on this vacuolar fragmentation is quantified. The binning for the quantifications is already problematic as, in my experience, WT cells often harbor one to three vacuoles. How are quantifications looking when 1-3 vacuoles are counted as "normal" and more than 3 vacuoles as "fragmented"? The observed changes seem to be relatively small and the various combinations of TCB mutants do not yield a clear picture.

      The number of vacuoles at steady state could be influenced by various environmental factors, including the composition of the medium (the manufacturer supplying the reagents and local water hardness) and the background of the strain. Possibly for those reasons, our observations differ from the experience of Reviewer #2. Indeed, we observed that WT cells always have one vacuole in YPD medium, whereas in SD medium (Fig S3B only) WT cells have mainly one or two vacuoles per cell. In both cases, we observed that some of the mutants showed a phenotype different from the WT, and those differences are supported by Student’s t-test and two-way ANOVA.
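
      The effect of the binning choice raised in point 1 can be illustrated with a minimal sketch. It assumes per-cell vacuole counts are available as a plain list; the function name, the threshold parameter, and the example counts are hypothetical and are not data from the study. The sketch only shows how the fraction of cells scored as "fragmented" shifts when the cutoff for "normal" moves from one vacuole to three.

```python
def fraction_fragmented(vacuoles_per_cell, max_normal):
    """Return the fraction of cells with more than `max_normal` vacuoles."""
    if not vacuoles_per_cell:
        raise ValueError("need at least one cell")
    fragmented = sum(1 for n in vacuoles_per_cell if n > max_normal)
    return fragmented / len(vacuoles_per_cell)


# Hypothetical per-cell vacuole counts, for illustration only
# (NOT measurements from the manuscript).
example_counts = [1, 1, 2, 3, 4, 5, 1, 2, 6, 1]

# Compare the two binning schemes discussed above:
# "normal" = 1 vacuole versus "normal" = 1-3 vacuoles.
for max_normal in (1, 3):
    share = fraction_fragmented(example_counts, max_normal)
    print(f"normal <= {max_normal} vacuoles: {share:.0%} fragmented")
```

      Applying both cutoffs to the same counts makes it easy to see whether a reported difference between strains depends on the binning scheme.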

      2) The analysis of the structural requirements of the Tcb3 protein is interesting but does not seem to add any additional value to this study. While it was used to quantify the mild vacuolar fragmentation phenotype it does not reoccur in any following analysis. Is the tcb3Δ sufficient to yield the lipid phenotype that is later proposed to cause the vacuolar fragmentation phenotype?

      We do not know whether tcb3Δ alone is sufficient to increase PHS, as we have not examined it. Nevertheless, as another approach, we analyzed the difference in IPC level between the tcb1Δ2Δ3Δ triple deletion and the tcb3Δ single deletion in a sec18 mutant background and showed that the reduction of IPC synthesis is similar between tcb1Δ2Δ3Δ and tcb3Δ alone (unpublished). This result suggests that, of the three tricalbins (Tcb1, Tcb2 and Tcb3), Tcb3 plays a central role. In addition, the IPC synthesis reduction phenotype was small in tcb1Δ alone and tcb2Δ alone, but a strong phenotype appeared in the tcb1Δtcb2Δ combined deletion (as strong as in tcb3Δ alone). The relationship between Tcb1, Tcb2 and Tcb3 indicated by these results is also consistent with the results of the structural analysis in this study. We have shown by immunoprecipitation analysis that Tcb3 physically interacts with Tcb1 and Tcb2 (unpublished). In the future, we plan to investigate the relationship between the Tcb proteins in more detail, along with the details of the interactions between Tcb1, Tcb2, and Tcb3.

      3) The quantified lipid data also has several problems. i) The quantified effects are very small. The relative change in lipid levels does not allow any conclusion regarding the phenotypes. What is the change in absolute PHS in the cell? This would be important to know for judging the proposed effects. ii) It seems as if the lipid data is contradictory to the previous study from the lab regarding the role of tricalbins in ceramide transfer. Previously it was shown that ceramides remain unchanged and IPC levels were reduced. This was the rationale for proposing the tricalbins as ceramide transfer proteins between the ER and the mid-Golgi. What could be an explanation for this discrepancy? Does the measurement of PHS after labelling the cells with DHS just reflect differences in the activity of the Sur2 hydroxylase or does it reflect different steady state levels?

      i) As Reviewer #2 pointed out, the change is slight, but we cannot say that it is insufficient. We have shown that PHS increases in the range of 10-30% depending on the concentration of NaCl that induces vacuole division (this result is related to the answers to the following questions by Reviewer #3 and to the additional data in the new version). This observation supports the possibility that a small increase in PHS levels may have an effect on vacuole fragmentation. We did not analyze the total PHS level by methods such as liquid chromatography-mass spectrometry or ninhydrin staining of TLC-separated total lipids. The reason is that radiolabeling of sphingolipids using the precursor [3H]DHS provides higher sensitivity and makes it easier to detect differences. Moreover, using [3H]DHS labeling, we measure only PHS that is synthesized in the ER and that does not originate from degradation of complex sphingolipids or dephosphorylation of PHS-1P in other organelles.

      ii) In our previous study (Ikeda et al. iScience. 2020), we separated the lipids labeled with [3H]DHS into ceramides and acylceramides. There was no significant change in ceramide levels, but acylceramides increased in tcb1Δ2Δ3Δ. Since we did not separate these lipids in the present study, the data show the total amount of both ceramide and acylceramide. We apologize that the term in Figure 3A was wrong. We have corrected it. Also, we have used [3H]DHS to detect IPC levels, which differs from the previous analysis, which used [3H]inositol. This means the lipid amounts detected are completely different. Since the amount of inositol incorporated into cells varies from cell to cell, the amount loaded on the TLC plate is adjusted so that the total amount (signal intensity) of radioactively labeled lipids is almost the same. In contrast, for DHS labeling, the amount of DHS attached to the cell membrane is almost the same between cells, so we load the total amount onto the TLC plate without adjustment. In addition, the reduction in IPC levels due to Tcb depletion that we previously reported was seen only in sec12 or sec18 mutant backgrounds, and no reduction in IPC levels was observed in tcb1Δ2Δ3Δ by [3H]inositol labeling (Ikeda et al. iScience. 2020). Therefore, we cannot simply compare the current results with the previous report, owing to the difference in experimental methods.

      The labeling time for [3H]DHS is 3 hours, and we are not measuring steady-state amounts, but rather analyzing metabolic reactions. Since [3H]DHS is converted to PHS by Sur2 hydroxylase in the cell, the possibility that differences in PHS amounts reflect differences in Sur2 hydroxylase activity cannot be ruled out. However, this possibility is highly unlikely since we have previously observed that the distribution of ceramide subclasses is hardly affected by tcb1Δtcb2Δtcb3Δ (Ikeda et al. iScience 2020). We have added to the discussion that the possibility of differences in Sur2 hydroxylase activity cannot be excluded.

      4) Determining the vacuole fragmentation phenotype of a lag1Δlac1Δ double mutant does not allow the conclusion that elevated PHS levels are responsible for the observed phenotype. This just shows that lag1Δlac1Δ cells have fragmented vacuoles. Can the observed phenotype be rescued by treating the cells with myriocin? What is the growth rate of a LAG1 LAC1 double deletion, as this strain has been previously reported to be very sick? Similarly, what is the growth phenotype of the various LCB3, LCB4 and LCB5 deletions and their combinations?

      As Reviewer #2 pointed out, the vacuolar fragmentation in lag1Δlac1Δ does not by itself support the conclusion that increased PHS levels are the cause. Since this mutant strain has decreased levels of ceramide and its subsequent products IPC/MIPC, in addition to increased levels of the ceramide precursors LCB or LCB-1P, we have changed the manuscript as follows. As noted in a following comment by Reviewer #2, myriocin treatment has been reported to induce vacuolar fragmentation, so we do not believe that rescue experiments with myriocin treatment would lead to the expected results.

      ・ Previous Version: We first tested whether increased levels of PHS cause vacuolar fragmentation. Loss of ceramide synthases could cause an increase in PHS levels. Our analysis showed that vacuoles are fragmented in lag1Δlac1Δ cells, which lack both enzymes for LCBs (DHS and PHS) conversion into ceramides (Fig 3B). This suggests that ceramide precursors, LCBs or LCB-1P, can induce vacuolar fragmentation.

      ・Current Version: We first evaluated whether the increases in certain lipids are the cause of vacuolar fragmentation in tcb1Δ2Δ3Δ. Our analysis showed that vacuoles are fragmented in lag1Δlac1Δ cells, which lack both enzymes for LCBs (DHS and PHS) conversion into ceramides (Fig 3B). This suggests that the increases in ceramide and subsequent products IPC/MIPC are not the cause of vacuolar fragmentation, but rather its precursors LCBs or LCB-1P.

      As Reviewer #2 pointed out, the lag1Δlac1Δ double mutant grows very slowly, as shown below (Author response image 1). We also examined the growth phenotype of the LCB3, LCB4, and LCB5 deletion strains and found that they grew the same as the wild-type strain, with no significant differences (Author response image 1).

      Author response image 1.

      Cells (FKY5687, FKY5688, FKY36, FKY37, FKY33, FKY38) were adjusted to OD600 = 1.0, and fivefold serial dilutions were spotted on YPD plates and incubated at 25 °C for 3 days.

      5) The model in Figure 3 E proposes that treatment with PHS accumulates PHS in the endoplasmic reticulum. How do the authors know where exogenously added PHS ends up in the cell? It would also be important to determine the steady state levels of sphingolipids after treatment with PHS. Or in other words, how much PHS is taken up by the cells when 40 µM PHS is added?

      It has been found that the addition of PHS efficiently suppresses the Gas1 trafficking (Gaigg et al. J Biol Chem. 2006) and endocytosis phenotypes of lcb1-100 mutants (Zanolari et al. EMBO J. 2000). This suppression depends on Lcb3, which is localized to the ER. Thus, we know that PHS added from outside the cell reaches the ER and is functional.

      We also agree that it is important to measure the amount of PHS taken up into the cells. However, this is extremely difficult to do for the following reasons. The majority of PHS added to the medium remains attached to the surface layer of the cells. If we measured the lipids in the cells by MS, we would detect lipids present on both the outside and the inside of the plasma membrane. This means we would need to separate the outside from the inside of the cell's membrane to determine the exact amount of LCB that has been taken up by the cells. Regrettably, this separation is currently technically difficult.

      6) Previous studies have observed that myriocin treatment itself results in vacuolar fragmentation (e.g. Hepowit et al. bioRxiv 2022, Fröhlich et al. eLife 2015). Why do both depletion and accumulation of PHS lead to vacuolar fragmentation?

      It is exactly as Reviewer #2 said. Consistent with previous results with myriocin treatment, we also observed vacuolar fragmentation in the lcb1-100 mutant strain. We have added these papers to the references for further discussion. Our discussion is as follows.

      "Previous studies have observed that myriocin treatment results in vacuolar fragmentation (Hepowit et al. bioRxiv 2022; Now published in J Cell Sci. 2023, Fröhlich et al. eLife 2015). Myriocin treatment itself causes not only the depletion of PHS but also of complex sphingolipids such as IPC. This suggests that normal sphingolipid metabolism is important for vacuolar morphology. The reason for this is unclear, but perhaps there is some mechanism by which sphingolipid depletion affects, for example, the recruitment of proteins required for vacuolar membrane fusion. In contrast, our new findings show that both PHS increase and depletion cause vacuole fragmentation. Taken together, there may be multiple mechanisms controlling vacuole morphology and lipid homeostasis by responding to both increasing and decreasing level of PHS."

      7) The experiments regarding the NVJ genes are not conclusive. While the authors mention that a NVJ1/2/3 MDM1 mutant was shown to result in a complete loss of the NVJ the observed effects cannot be simply correlated. It is also not clear why PHS would be transported towards the vacuole. In the cited study (Girik et al.) the authors show PHS transport from the vacuole towards the ER. Here the authors claim that PHS is transported via the NVJ towards the vacuole. Also, the origin of the rationale of this study is the negative genetic interaction of tcb1/2/3Δ with nvj1Δ. This interaction appears to result in a strong growth defect according to the Developmental Cell paper. What are the phenotypes of the mutants used here? Does the additional deletion of NVJ genes or MDM1 results in stronger growth phenotypes?

      We sincerely appreciate the concerns raised about our research. As Reviewer #2 pointed out, we have not shown evidence in this study that PHS is transported directly from the ER to the vacuole, so it is unclear whether PHS is transported to the vacuole and what the physiological relevance would be. Girik et al. showed that the NVJ resident protein Mdm1 is important for PHS transport between the vacuole and the ER. Given that the experimental method they applied tracks PHS released in the vacuole, only transport of PHS from the vacuole to the ER was indeed verified. However, assuming that Mdm1 transports PHS along its concentration gradient, we consider that under normal conditions PHS is transported from the ER (as the organelle of PHS synthesis) to the vacuole. We clarified this interpretation by adding the following sentences to the manuscript at line 313:

      “The study applied an experimental method that tracks LCBs released in the vacuole and showed that Mdm1p is necessary for LCBs leakage into the ER. However, assuming that Mdm1p transports LCBs along its concentration gradient we consider that under normal conditions, LCBs is transported from the ER (as the organelle of PHS synthesis) to the vacuole.”

      The negative genetic interaction between tcb1/2/3Δ and nvj1Δ is consistent with this model, but under our culture conditions we did not observe a negative interaction between TCB3 and the genes encoding NVJ proteins (Author response image 2). We do not know whether this is due to strain background or culture conditions, or whether the deletions of TCB1 and TCB2 are also required for the negative interaction. We would like to analyze this in detail in the future.

      Author response image 2.

      Cells (FKY3868, FKY5560, FKY6187, FKY6189, FKY6190, FKY6188, FKY6409) were adjusted to OD600 = 1.0, and fivefold serial dilutions were spotted on YPD plates and incubated at 25 °C for 3 days.

      Our results in this study show that deletion of NVJ component genes partially suppresses vacuolar fission upon the addition of PHS. To clarify this, we have changed the sentences in the Results and Discussion of our manuscript as follows. We hope that this change will avoid over-interpretation.

      ・ Previous: To test the role of NVJ-mediated “transport” for PHS-induced vacuolar fragmentation,

      ・Current: To test the role of NVJ-mediated “membrane contact” for PHS-induced vacuolar fragmentation,

      ・Previous: Taken together, we conclude from these findings that accumulated PHS in tricalbin deleted cells triggers vacuole fission via “non-vesicular transport of PHS” at the NVJ.

      ・Current: Taken together, we conclude from these findings that accumulated PHS in tricalbin deleted cells triggers vacuole fission via “contact between ER and vacuole” at the NVJ.

      ・Previous: Because both PHS- and tricalbin deletion-induced vacuolar fragmentations were partially suppressed by the lack of NVJ (Fig 4B, 4C), it is suggested that transport of PHS into vacuoles via the NVJ is involved in triggering vacuolar fragmentation.

      ・Current: Based on the fact that both PHS- and tricalbin deletion-induced vacuolar fragmentations were partially suppressed by the lack of NVJ (Fig 4B, 4C), it is possible that the trigger for vacuolar fragmentation is NVJ-mediated transport of PHS into the vacuole.

      8) As a consequence of the above points, several results are over-interpreted in the discussion. Most important, it is not clear that indeed the accumulation of PHS causes the observed phenotypes.

      We thank Reviewer #2 for the suggestion. In particular, whether PHS accumulation really causes vacuolar fragmentation could only be verified with an in vitro assay system. This is an important issue to be resolved in the future.

      Reviewer #3 (Public Review):

      In this manuscript, the authors investigated the effects of deletion of the ER-plasma membrane/Golgi tethering proteins tricalbins (Tcb1-3) on vacuolar morphology to demonstrate the role of membrane contact sites (MCSs) in regulating vacuolar morphology in Saccharomyces cerevisiae. Their data show that tricalbin deletion causes vacuolar fragmentation possibly in parallel with TORC1 pathway. In addition, their data reveal that levels of various lipids including ceramides, long-chain base (LCB)-1P and phytosphingosine (PHS) are increased in tricalbin-deleted cells. The authors find that exogenously added PHS can induce vacuole fragmentation and by performing analyses of genes involved in sphingolipid metabolism, they conclude that vacuolar fragmentation in tricalbin-deleted cells is due to the accumulated PHS in these cells. Importantly, exogenous PHS- or tricalbin deletion-induced vacuole fragmentation was suppressed by loss of the nucleus vacuole junction (NVJ), suggesting the possibility that PHS transported from the ER to vacuoles via the NVJ triggers vacuole fission.

      This work provides valuable insights into the relationship between MCS-mediated sphingolipid metabolism and vacuole morphology. The conclusions of this paper are mostly supported by their results, but there is concern about physiological roles of tricalbins and PHS in regulating vacuole morphology under known vacuole fission-inducing conditions. That is, in this paper it is not addressed whether the functions of tricalbins and PHS levels are controlled in response to osmotic shock, nutrient status, or ER stress.

      We appreciate the comment, and we consider it an important point. To answer this, we have performed additional experiments. Please refer to the following section, "Recommendations For The Authors", for more details. These results and discussions have also been added to the revised manuscript. We believe this addition makes our findings more comprehensive.

      There is another weakness in their claim that the transmembrane domain of Tcb3 contributes to the formation of the tricalbin complex which is sufficient for tethering ER to the plasma membrane and the Golgi complex. Their claim is based only on the structural simulation, but not on biochemical experiments such as co-immunoprecipitation and pull-down.

      We appreciate your valuable suggestion and would like to attempt to improve upon it in the future.

      Author response to Recommendations:

      The following is the authors' response to the Recommendations For The Authors. We have now incorporated the changes recommended by Reviewers to improve the interpretations and clarity of the manuscript.

      Reviewer #1 (Recommendations For The Authors):

      I would recommend the authors provide additional experimental data to fully support their claims or revise the writing of their manuscript to be more precise in their conclusions. In particular, I have suggestions/questions:

      Fig. 1A: display the results as in 1B (that is, different colors for different number of vacuoles, and the x axes showing the different conditions, in this case WT vs tcb1∆2∆3∆).

      In response to the suggestion of Reviewer #1, we have changed the display of results.

      Fig. S1B: the FM4-64 pattern looks different in the KO strain as compared to those shown in Fig. 1A. Is there a reason for that? Also, no positive control of cps1p not in the vacuole lumen is shown.

      Our apologies, this was probably due to the poor resolution of the images. We have made new observations and replaced the figure, which now includes the positive control.

      Line 172: the last condition in Fig. 2B (vi), should be compared to the tcb1∆tcb2∆ condition (shown in fig 1).

      In response to the suggestion of Reviewer #1, we have changed the manuscript as follows: We found that cells expressing Tcb3(TM)-GBP and lacking Tcb1p and Tcb2p (Fig 2B (vi)) are even more fragmented than tcb1Δ2Δ in Fig 1B and are fragmented to a similar degree as tcb3Δ (Fig 1B and Fig 2B (ii)).

      Fig 2E: the model shown here can be tested, is there binding (similar to kin recognition mechanism of some Golgi proteins) between the different Tcb TMDs?

      As Reviewer #1 mentioned, we have confirmed by co-immunoprecipitation that Tcb3 binds to both Tcb1 and Tcb2 (unpublished). Furthermore, we will test if the binding can be observed with TMD alone in the future.

      Fig 3A: you measured an increase in PHS that is metabolized from DHS (which is what you label). Are there other routes to produce PHS independently of DHS? I mean, how is the increase reporting on the total levels of this lipid?

      PHS synthesized by Sur2 is converted to PHS-1P and phytoceramide. Conversely, PHS is regenerated by degradation of PHS-1P via Lcb3 and Ysr3, and by degradation of phytoceramides via Ypc1 (Vilaça, Rita et al. Biochim Biophys Acta Mol Basis Dis. 2017. Fig1). Our analysis shows that these degradation substrates are not decreasing but rather accumulating in the tcb1Δ2Δ3Δ strain, suggesting that the degradation pathways are not what raises the PHS level. Therefore, the increase in detected PHS is most likely due to congestion in metabolic processes downstream of PHS. Possible causes of the disrupted lipid metabolism in Tcb-deletion cells are discussed in the Discussion. In brief, these are: (1) reduced activity of the PtdIns4P phosphatase Sac1, due to MCS deficiency between the ER and the PM; (2) impaired nonvesicular ceramide transport from the ER to the Golgi; and (3) low efficiency of PHS export by Rsb1, due to insufficient PHS diffusion between the ER and the PM.

      Line 248: did the authors test if the NVJ MCS is unperturbed in the triple Tcb KO?

      This is an exciting question. We are very interested in considering whether Tcb deficiency affects NVJ formation in terms of lipid transport. We would like to conduct further analysis in this regard in our future studies.

      Reviewer #2 (Recommendations For The Authors):

      I would suggest carefully evaluating the findings in this manuscript. Right now the connection between elevated PHS levels and vacuolar fragmentation are not really supported by the data. One of the major issues in the field of yeast sphingolipid biology is that quantification of the lipid levels is difficult and labor- and cost-intensive. But I think that it is very important to directly connect phenotypes with the lipid levels.

      Minor points:

      • In figure 1 c and d WT controls of the different treatments are lacking.

      As Reviewer #2 pointed out, we have added data for the WT controls.

      • The tcb1Δ mutant appears to be sensitive in pH 5.0 media while the triple tricalbins mutant grows fine. Is that a known phenotype?

      We have performed this assay on SD plates. Then, to check whether this phenotype of tcb1Δ was specific or general, we re-analyzed the same strain in YPD medium. In YPD medium, tcb1Δ strain grew normally, while the control, vma3Δ, was still pH sensitive. Therefore, the growth of this tcb1Δ strain is dependent on the nutrient conditions of the medium but does not appear to be pH sensitive. This new data was inserted as part of Supplementary Figure 1.

      • Line 305. There is an "of" in the sentence that needs to be deleted.

      As pointed out by Reviewer #2, we have corrected the sentence.

      Reviewer #3 (Recommendations For The Authors):

      In supplementary Fig 2, the authors show the involvement of the NVJ in hyperosmotic shock-induced vacuole fission, but the involvement of tricalbins and PHS in this process is not tested. Does osmotic shock affect the level or distribution of tricalbins and PHS? They will be able to test whether overexpression of tricalbins inhibits hyperosmotic shock-induced vacuole fission or not. Also, they will be able to perform similar experiments upon ER stress-induced vacuole fission.

      We thank Reviewer #3 for suggesting that it is important to test the involvement of PHS in hyperosmotic shock- or ER stress-induced vacuole fission. We have shown in a previous report that treatment with tunicamycin, an ER stress inducer, increased the PHS level by about 20% (Yabuki et al. Genetics. 2019. Fig4). In addition, we have now tested the effect of hyperosmolarity on PHS levels. Analysis of PHS under hyperosmotic shock conditions (0.2 M NaCl), in which vacuolar fragmentation was observed, showed an increase in PHS of about 10%. Furthermore, when the NaCl concentration was increased to 0.8 M, PHS levels increased by up to 30%. In other words, we have shown that PHS increases in the range of tens of percent depending on the concentration of NaCl that induces vacuole division. This observation supports the possibility that a small increase in PHS levels may have an effect on vacuole fragmentation. Moreover, NaCl-induced vacuolar fragmentation, like that caused by PHS treatment, was also suppressed when PHS export from the cell was enhanced by Rsb1 overexpression.

      These new data are now inserted, commented and discussed in the manuscript as Figure 5. We hope that these results will provide further insight into the more general aspects of PHS involvement in the vacuole fission process.

      Minor points:

      1) It is unclear to me whether endogenous Tcb3 is deleted in cells expressing Tcb3-GBP (FKY3903-3905 and FKY4754). They should clearly mention that these cells do not express endogenous Tcb3 in the manuscript.

      We apologize that our description was not clear. In this strain, the endogenous TCB3 gene is tagged with GBP, so the original Tcb3 has been replaced by the tagged version. We have changed the description in our manuscript.

      2) The strength of the effect of PHS on vacuole morphology looks different in respective WT cells in Fig 3C, 4B, and S2B. Is this due to the different yeast strains they used?

      Yes, we used BY4742 background for the strain in Figure 3C, SEY6210 background in Figure 4B, and HR background in Figure S2B. As a matter of fact, we observed that the strength of the PHS effect varies depending on their background. Strain numbers are now given in the legend so that the cells used for each data can be referenced in the strain list.

      3) p.3, line 44: the "SNARE" complex (instead of "protease")?

      We thank the reviewer for pointing out the incorrect wording. We have corrected this sentence.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review):

      Summary: 

      The authors compared four types of hiPSCs and four types of hESCs at the proteome level to elucidate the differences between hiPSCs and hESCs. Semi-quantitative calculations of protein copy numbers revealed increased protein content in iPSCs. Particularly in iPSCs, mitochondrial and cytoplasmic proteins were suggested to reflect the state of the original differentiated cells to some extent. However, the most important result of this study is the calculation of the protein copy numbers per cell, and the validity of this result is problematic. In addition, several experiments need to be improved, such as using cells of different genders (iPSC: female, ESC: male) in mitochondrial metabolism experiments.

      Strengths: 

      The focus on the number of copies of proteins is exciting and appreciated if the estimated calculation result is correct and biologically reproducible. 

      Weaknesses: 

      The proteome results in this study were likely obtained by simply looking at differences between clones, and the proteome data need to be validated. First, there were only a few clones for comparison, and the gender and number of cells did not match between ESCs and iPSCs. Second, no data show the accuracy of the protein copy number per cell obtained by the proteome data. 

      We agree with the reviewer that it would be useful to have data from more independent stem cell clones and ideally an equal gender balance of the donors would be preferable. As usual, practical cost-benefit, and time available affect the scope of work that can be performed. We note that the impact of biological donor sex on proteome expression in iPSC lines has already been addressed in previous studies (ref. 13). We will however revise the manuscript to include specific mention of these limitations and propose a larger-scale follow-up when resources are available.

      Regarding the estimation of protein copy numbers in our study, we would like to highlight that the proteome ruler approach we have used has been employed extensively in the field previously, with direct validation of differences in copy numbers provided using orthogonal methods to MS, e.g., FACS (refs. 2-4, 7, 10). Furthermore, the original manuscript (ref. 14) directly compared the copy numbers estimated using the “proteomic ruler” to spike-in protein epitope signature tags and found remarkable concordance. This original study was performed with an older generation mass spectrometer and reduced peptide coverage, compared with the instrumentation used in our present study. Further, we noted that these authors predicted that higher peptide coverage, such as we report in our study, would further increase quantitative performance.
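
      For readers unfamiliar with the approach, the arithmetic behind a proteomic-ruler estimate can be sketched as follows. This is a minimal illustration, not the analysis pipeline used in the study. It assumes that the summed histone MS signal corresponds to a protein mass roughly equal to the DNA mass per cell, and it uses an assumed DNA mass of about 6.5 pg for a diploid human cell. The function name, parameter names, and example values are hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def copies_per_cell(protein_intensity, histone_intensity,
                    molar_mass_da, dna_mass_pg=6.5):
    """Estimate protein copies per cell in the style of a proteomic ruler.

    Assumptions (illustrative, not taken from the study):
    - total histone mass per cell is about equal to the DNA mass per cell;
    - MS intensity is proportional to protein mass;
    - dna_mass_pg defaults to ~6.5 pg, an assumed diploid human value.
    """
    if histone_intensity <= 0 or molar_mass_da <= 0:
        raise ValueError("histone intensity and molar mass must be positive")
    # Scale the protein's share of the histone signal to a mass in grams.
    protein_mass_g = (protein_intensity / histone_intensity) * dna_mass_pg * 1e-12
    # Convert mass to number of molecules via molar mass (Da is g/mol).
    return protein_mass_g * AVOGADRO / molar_mass_da


# Hypothetical example: a 50 kDa protein whose MS signal is 1% of the
# summed histone signal.
estimate = copies_per_cell(protein_intensity=1.0,
                           histone_intensity=100.0,
                           molar_mass_da=50_000)
print(f"{estimate:.3g} copies per cell")
```

      Because the estimate scales with the protein-to-histone signal ratio, no cell counting or spike-in standard is needed, which is also why its accuracy depends on the histone-to-DNA assumption holding in the cell types compared.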

      Reviewer #2 (Public Review):

      Summary: 

      Pluripotent stem cells are powerful tools for understanding development, differentiation, and disease modeling. The capacity of stem cells to differentiate into various cell types holds great promise for therapeutic applications. However, ethical concerns restrict the use of human embryonic stem cells (hESCs). Consequently, induced human pluripotent stem cells (ihPSCs) offer an attractive alternative for modeling rare diseases, drug screening, and regenerative medicine. A comprehensive understanding of ihPSCs is crucial to establish their similarities and differences compared to hESCs. This work demonstrates systematic differences in the reprogramming of nuclear and non-nuclear proteomes in ihPSCs. 

      We thank the reviewer for the positive assessment.

      Strengths: 

      The authors employed quantitative mass spectrometry to compare protein expression differences between independently derived ihPSC and hESC cell lines. Qualitatively, protein expression profiles in ihPSC and hESC were found to be very similar. However, when comparing protein concentration at a cellular level, it became evident that ihPSCs express higher levels of proteins in the cytoplasm, mitochondria, and plasma membrane, while the expression of nuclear proteins is similar between ihPSCs and hESCs. A higher expression of proteins in ihPSCs was verified by an independent approach, and flow cytometry confirmed that ihPSCs had larger cell sizes than hESCs. The differences in protein expression were reflected in functional distinctions. For instance, the higher expression of mitochondrial metabolic enzymes, glutamine transporters, and lipid biosynthesis enzymes in ihPSCs was associated with enhanced mitochondrial potential, increased ability to uptake glutamine, and increased ability to form lipid droplets. 

      Weaknesses: 

      While this finding is intriguing and interesting, the study falls short of explaining the mechanistic reasons for the observed quantitative proteome differences. It remains unclear whether the increased expression of proteins in ihPSCs is due to enhanced transcription of the genes encoding this group of proteins or due to other reasons, for example, differences in mRNA translation efficiency. Another unresolved question pertains to how the cell type origin influences ihPSC proteomes. For instance, whether ihPSCs derived from fibroblasts, lymphocytes, and other cell types all exhibit differences in their cell size and increased expression of cytoplasmic and mitochondrial proteins. Analyzing ihPSCs derived from different cell types and by different investigators would be necessary to address these questions. 

      We agree with the Reviewer that our study does not provide a detailed mechanistic explanation for the quantitative differences observed between the two stem cell types, and we did not claim that it does. We have now included an expanded section in the discussion where we discuss potential causes. However, in our view, fully understanding the reasons for this difference is likely to require extensive in-depth analysis in future studies and cannot be determined by just one or two additional supplemental experiments.

      We also agree studying hiPSCs reprogrammed from different cell types, such as blood lymphocytes, would be of great interest. Again, while we agree it is a useful way forward, in practice this will require a very substantial additional commitment of time and resources. We have now included a section discussing this opportunity within the discussion to encourage further research into the area.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) aizi1 and ueah1 clones, which were analyzed in Figure 1A, were excluded from the proteome analysis. In particular, the GAPDH expression level of the aizi1 clone is similar to that of ESCs and different from other iPSC clones. An explanation of how the clones were selected for proteome analysis is needed. Previously, the comparative analysis of iPSCs and ESCs reported in many studies from 2009-2017 (Ref#1-7) has already shown that the number of clones used in the comparative analysis is small, claiming differences (Ref#1-3) and that the differences become indistinguishable when the number of clones is increased (Ref#4-7). Certainly, few studies have been done at the proteome level, so it is important to examine what differences exist in the proteome. Also, it is interesting to focus on the amount of protein per cell. However, if the authors want to describe biological differences, it would be better to get the proteome data in biological duplicate and state the reason for selecting the clones used.

      (1) M. Chin, Cell Stem Cell, 2009, PMID: 19570518

      (2) K. Kim, Nat Biotechnol., 2011, PMID: 22119740

      (3) R. Lister, Nature, 2011, PMID: 21289626

      (4) A.M. Newman, Cell Stem Cell, 2010, PMID: 20682451

      (5) M.G. Guenther, Cell Stem Cell, 2010, PMID: 20682450

      (6) C. Bock, Cell, 2010, PMID: 21295703

      (7) S. Yamanaka, Cell Stem Cell, PMID: 22704507

      We agree with the reviewer that analysing more clones would be beneficial. We have included a section on this topic in the discussion. In our study, we only had access to the 4 hESC lines included; therefore, in the original proteomic study we also analysed 4 hiPSC lines, which were routinely grown within our stem cell facility. Although the stem cell facility expanded the culture of additional hiPSC lines as the study progressed, unfortunately we could not access additional hESC lines.

      We agree that ideally combining each biological replicate with additional technical replicates would provide extra robustness. As usual, cost and practical considerations at the time the experiments were performed affected the experimental design chosen. For the experimental design, each experiment was contained within 1 batch to avoid the strong batch effects present in TMT (Brenes et al 2019).

      (2) iPSC samples used in the proteome analysis are two types of female and two types of male, while ESC samples are three types of female and one type of male. The sexes of the cells in the comparative analysis should be matched because sex differences may bias the results.

      While we agree with the reviewer in principle, we have previously performed detailed comparisons of proteome expression in many independent iPSC lines from both biological male and female donors (see Brenes et al., Cell Reports 2021) and it seems unlikely that biological sex differences alone could account for the proteome differences between iPS and ESC lines uncovered in this study. However, as this is a relevant point, we have revised the manuscript to explicitly mention this caveat within the discussion section.

      (3) In Figure 1h, I suspect that the variation of PCA plots is very similar between ESCs and iPSCs. In particular, the authors wrote "copy numbers for all 8 replicates" in the legend, but if Figure 1b was done 8 times, there should be 8 types of cells x 8 measurements = 64 points. Even if iPSCs and ESCs are grouped together, there should be 8 points for each cell type. Is it possible that there is only one TMT measurement for this analysis? If so, at least technical duplicates or biological duplicates would be necessary. I also think each cell should be plotted in the PCA analysis instead of combining the four types of ESCs and iPSCs into one.

      We thank the reviewer for bringing this error to our attention. The legend has been corrected to state, “for all 8 stem cell lines”. Each dot represents the proteome of each of the 4 hESCs and 4 hiPSCs that were analysed using proteomics.

      (4) It is necessary to show what functions are enriched in the 4408 proteins whose protein copies per cell were increased in the iPSCs obtained in Figure 2B.

      The enrichment analysis requested has been performed and is now included as a new supplemental figure 2. We find it very interesting that despite the large number of proteins involved here (4,408), the enrichment analysis still shows clear enrichment for specific cellular processes. The summary plot using affinity propagation within WebGestalt is included here:

      Author response image 1.

      (5) The Proteomic Ruler method used in this study is a semi-quantitative method to calculate protein copy numbers and is a concentration estimation method. Therefore, if the authors want to have a biological discussion based on the results, they need to show that the estimated concentrations are correct. For example, there are Western Blotting (WB) results for genes with no change in protein levels in hESC and hiPSC in Fig. 6ij, but the WB results for the group of genes that are claimed to have changed are not shown throughout the paper. Also, there is no difference in the total protein level between iPSCs and ESCs from the ponceau staining in Fig.6ij. WB results for at least a few genes are needed to show whether the concentration estimates obtained from the proteome analysis are plausible. If the protein per cell is increased in these iPSC clones, performing WB analysis using an equal number of cells would be better.

      Regarding the ‘proteome ruler’ approach, we would like to highlight that this method has previously been used extensively in the field, with detailed validation, as already explained above. It is also not ‘semi-quantitative’ and can estimate absolute abundance, as well as concentrations. Our work does not use the method's concentration formulas, but rather its estimates of protein copy numbers, which were shown to closely match the copy numbers determined when spike-ins are used (ref. 14).
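
      A minimal sketch of the histone-based calculation behind the proteomic ruler (ref. 14) may help make the copy number estimate concrete. It rests on stated assumptions: total histone mass per cell is taken to equal DNA mass per cell, the value of 6.5 pg DNA per diploid human cell is an assumed constant, and the function name and input values are hypothetical rather than taken from the authors' pipeline.

```python
AVOGADRO = 6.02214076e23      # molecules per mole
DNA_MASS_PER_CELL_PG = 6.5    # assumed DNA mass of a diploid human cell, in picograms


def copies_per_cell(protein_intensity, histone_intensity_sum, molar_mass_da,
                    dna_mass_pg=DNA_MASS_PER_CELL_PG):
    """Estimate protein copies per cell from MS intensities.

    The summed histone signal acts as the "ruler": its mass per cell is
    assumed to equal the DNA mass per cell, so the intensity ratio
    protein / histones converts directly into protein mass per cell.
    """
    if histone_intensity_sum <= 0 or molar_mass_da <= 0:
        raise ValueError("histone intensity and molar mass must be positive")
    # Protein mass per cell in grams (1 pg = 1e-12 g).
    protein_mass_g = protein_intensity / histone_intensity_sum * dna_mass_pg * 1e-12
    # Mass -> moles -> molecules.
    return protein_mass_g / molar_mass_da * AVOGADRO


# Hypothetical protein: 1% of the total histone signal, 50 kDa.
estimate = copies_per_cell(protein_intensity=0.01,
                           histone_intensity_sum=1.0,
                           molar_mass_da=50_000.0)
print(f"{estimate:.3e} copies per cell")
```

      Because the scaling factor comes from the histone signal within each sample, the estimate does not depend on how many cells were loaded, which is the property the response relies on.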

      In providing additional validation by Western Blotting (WB), we prioritised proteins related to pluripotency markers, which are vital to determine the pluripotency state of the hESCs and hiPSCs, as well as histone markers. We have included a section in the discussion concerning additional validation data and agree in general that further validation is always useful.

      (6) Regarding the experiment shown in Figure 4l, the gender of iPSC used (wibj2) is female and WA01 (H1; WA01) is male. Certainly, there is a difference in the P/E control ratio, but isn't this just a gender difference? The sexes of the cells need to be matched.

      We accept that the sexes of donors should ideally have been matched and have mentioned this within the discussion. Nonetheless, as previously mentioned, our previous detailed proteomic analyses of multiple hiPSC lines (ref. 13) derived from both biological male and female donors provide relevant evidence that the results shown in this study are not simply a reflection of the sex of the donors for the respective iPSC and ESC lines. When comparing eroded and non-eroded female hiPSCs to male hiPSCs, we found no significant differences in any electron transport chain proteins, nor in TCA proteins, between males and females.

      Minor comments:

      (1) Method: Information on the hiPSCs and hESCs used in this study should be described. In particular, the type of differentiated cells, gender, and protocols that were used in the reprogramming are needed.

      We agree with the reviewer on this. The hiPSC lines were generated by the HipSci consortium, as described in the flagship HipSci paper (ref. 15), which specifies in great detail the reprogramming protocols and quality control measures, including analysis of copy number variations. However, we agree that this information may not be easily accessible for readers, and that it is relevant to include it explicitly in our present manuscript instead of expecting readers to look at the flagship paper. These details have therefore been added to the revised version.

      (2) Method: In Figure1a, Figure 6i, j, the antibody information of Nanog, Oct4, Sox2, and Gapdh is not written in the method and needs to be shown.

      The data relating to these has now been included within the methods section.

      (3) Method: In Figure 1b and other figures, the authors should indicate which iPSC corresponds to which TMT label; the data in the Supplemental Table also needs to indicate which data is which clone.

      We have now added this to the methods section.

      (4) Method: The method of the FACS experiment used in Figure 2 should be described.

      The methods related to the FACS analysis have now been included within the manuscript.

      (5) Method: The cell name used in the mitochondria experiment shown in Figure 4 is listed as WA01, which is thought to be H1. Variations in notation should be corrected.

      This has now been corrected.

      (6) Method: The name of the cell clone shown in Figure 3l,m should be mentioned.

      We have now added these details on the corresponding figure and legend.

      Reviewer #2 (Recommendations For The Authors):

      This study utilized quantitative mass spectrometry to compare protein expression in 4 independently derived ihPSC and 4 hESC cell lines. The investigation quantified approximately 7,900 proteins, and employing the "Proteome ruler" approach, estimated protein copy numbers per cell. Principal component analyses, based on protein copy number per cell, clearly separated hiPSC and hESC, while different hiPSCs and hESCs grouped together. The study revealed a global increase in the expression of cytoplasmic, mitochondrial, membrane transporters, and secreted proteins in hiPSCs compared to hESCs. Interestingly, standard median-based normalization approaches failed to capture these differences, and the disparities became apparent only when protein copy numbers were adjusted for cell numbers. Increased protein abundance in hiPSC was associated with augmented ribosome biogenesis. Total protein content was >50% higher in hiPSCs compared to hESCs, an observation independently verified by total protein content measurement via the EZQ assay and further supported by the larger cell size of hiPSCs in flow cytometry. However, the cell cycle distribution of hiPSC and hESC was similar, indicating that the difference in protein content was not due to variations in the cell cycle. At the phenotypic level, differences in protein expression also correlated with increased glutamine uptake, enhanced mitochondrial potential, and lipid droplet formation in hiPSCs. ihPSCs also expressed higher levels of extracellular matrix components and growth factors.
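
      The point about median-based normalization can be illustrated with a deliberately simplified sketch. It assumes a hypothetical case in which one sample contains uniformly 50% more of every protein per cell; the intensity values and names are invented for illustration, and the real data show a non-uniform shift (nuclear proteins are reported as similar between the cell types).

```python
from statistics import median


def median_normalize(intensities):
    """Scale one sample so that its median intensity equals 1."""
    m = median(intensities)
    return [x / m for x in intensities]


# Hypothetical per-cell protein amounts for five proteins.
hesc = [100.0, 200.0, 400.0, 800.0, 1600.0]
hipsc = [1.5 * x for x in hesc]  # every protein 50% more abundant per cell

# After median normalization the two samples are indistinguishable,
# so the global per-cell increase is no longer visible.
print(median_normalize(hesc) == median_normalize(hipsc))

# Comparing absolute per-cell amounts retains the difference.
print(sum(hipsc) / sum(hesc))
```

      Median normalization forces the centre of every sample to the same value, so any difference shared by most proteins is removed by construction; a per-cell scale such as copy numbers is needed to see it.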

      Overall, the presented conclusions are adequately supported by the data. Although the mechanistic basis of proteome differences in ihPSC and hESC is not investigated, the work presents interesting findings that are worthy of publication. Below, I have listed my specific questions and comments for the authors.

      (1) Figure 1a displays immunoblots from 6 iPSC and 4 ESC cell lines, with 8 cell lines (4 hESC, 4 hiPSC) utilized in proteomic analyses (Fig. 1b). The figure legend should specify the 8 cell lines included in the proteomic analyses. The manuscript text describing these results should explicitly mention the number and names of cell lines used in these assays.

      We agree with the reviewer and have now marked in figure 1 all the lines that were used for proteomics and have added a section in the methods specifying which cell lines were analysed in each TMT channel.

      (2) In most figures, the quantitative differences in protein expression between hiPSC and hESC are evident, and protein expression is highly consistent among different hiPSCs and hESCs. However, the glutamine uptake capacity of different hiPSC cell lines, and to some extent hESC cell lines, appears highly variable (Figure 3e). While proteome changes were measured in 4 hiPSCs and 4 hESCs, the glutamine uptake assays were performed on a larger number of cell lines. The authors should clarify the number of cell lines used in the glutamine uptake assay, clearly indicating the cell lines used in the proteome measurements. Given the large variation in glutamine uptake among different cell lines, it would be useful to plot the correlation between the expression of glutamine transporters and glutamine uptake in individual cell lines. This may help understand whether differences in glutamine uptake are related to variations in the expression of glutamine transporters.

      The “proteomic ruler” has the capacity to estimate the protein copy numbers per cell; as such, changes in the absolute number of cells that were analysed do not cause major complications in quantification. Furthermore, TMT-based proteomics is the most precise proteomics method available, where the same peptides are detected in all samples across the same data points and peaks, as long as the analysis is done within a single batch, as is the case here.

      The glutamine uptake assay is much more sensitive to variation in the number of cells. The number of cells was estimated by plating approximately 5e4 cells two days before the assay, which creates variability. Furthermore, hESCs and hiPSCs are more adhesive than the cells used in the original protocol, hence the quench data was noisier for these lines, making the data from the assay more variable.

      (3) In Figure 4j, it would be helpful to indicate whether the observed differences in the respiration parameters are statistically significant.

      We have now modified the plot to show which proteins were significantly different.

      (4) The iPSCs used here are generated from human primary skin fibroblasts. Different cells vary in size; for instance, fibroblast cells are generally larger than blood lymphocytes. This raises the question of whether the parent cell origin impacts differences in hiPSCs and hESC proteomes. For example, do the authors anticipate that hiPSCs derived from small somatic cells would also display higher expression of cytoplasmic, mitochondrial, and membrane transporters compared to ESC? The authors may consider discussing this point.

      This is a very interesting point. We have now added an extension to the discussion focussed on this subject.

      (5) One wonders if the "Proteome ruler" approach could be applied retrospectively to previously published ihPSC and hESC proteome data, confirming higher expression of cytoplasmic and mitochondrial proteins in ihPSCs, which may have been masked in previous analyses due to median-based normalization.

      We agree with the reviewer and think this is a very good suggestion. Unfortunately, in the main proteomic papers comparing hESC and hiPSCs (refs. 16, 17) the authors did not upload their raw files to a public repository (as it was not mandatory at that period in time), and they also used the International Protein Index (IPI), which is a discontinued database. So the raw files can’t be reprocessed and the database doesn’t match the modern SwissProt entries. Therefore, reprocessing the previous data was impractical.

      (6) The work raises a fundamental question: what is the mechanistic basis for the higher expression of cytoplasmic and mitochondrial proteins in ihPSCs? Conceivably, this could be due to two reasons: (a) Genes encoding cytoplasmic and mitochondrial proteins are expressed at a higher level in ihPSCs compared to hESC. (b) mRNAs encoding cytoplasmic and mitochondrial proteins are translated at a higher level in ihPSCs compared to hESC. The authors may check published transcriptome data from the same cell lines to shed light on this point.

      This is a very interesting point. We believe that the reprogrammed cells contained mature mitochondria, which are not fully regressed upon reprogramming and that this can establish a growth advantage in the normoxic environments in which the cells are grown. Unfortunately, the available transcriptomic data lacked spike-ins, and thus only enables comparison of concentration, not of copy numbers (ref. 13). Therefore, we could not determine with the available data if there was an increase in the copies of specific mRNAs. However, with a future study where there was a transcriptomic dataset with spike-ins included, this would be very interesting to analyse.

      Reviewer #3 (Recommendations For The Authors):

      It is unclear whether changes in protein levels relate to any phenotypic features of cell lines used. For example, the authors highlight that increased protein expression in hiPSC lines is consistent with the requirement to sustain high growth rates, but there is no data to demonstrate whether hiPSC lines used indeed have higher growth rates.

      We respectfully disagree with the reviewer on this point. Our data show that hESCs and hiPSCs show significant differences in protein mass and cell size, with the MS data validated by the EZQ assay and FACS, while having no significant differences in their cell cycle profiles. Thus, increased size and protein content would require higher growth rates to sustain the increased mass, which is what we observe.

      The authors claim that the cell cycle of the lines is unchanged. However, no details of the method for assessing the cell cycle were included so it is difficult to appreciate if this assessment was appropriately carried out and controlled for.

      We apologise for this omission; the details have been included in the revised version of the manuscript.

      Details and characterisation of iPSC and ESC lines used in this study are overall lacking. The lines used are merely listed in methods, but no references are included for published lines, how lines were obtained, what passage they were used at, their karyotype status etc. For details of basic characterisation, the authors should refer to the ISSCR Standards for the use of human stem cells in research. In particular, the authors should consider whether any of the changes they see may be attributed to copy number variants in different lines.

      We agree with the reviewer on this and refer to the reply above concerning this issue.

      The expression data for markers of undifferentiated state in Figure 1a would ideally be shown by immunocytochemistry or flow cytometry as it is impossible to tell whether cultures are heterogeneous for marker expression.

      We agree with the reviewer on this. FACS is indeed much more quantitative and a better method to study heterogeneity. However, we did not have protocols to study these markers using FACS.

      TEM analysis should ideally be quantified.

      We agree with the reviewer that it would be nice to have a quantitative measure.

      All figure legends should explicitly state what graphs are representing (e.g. average/mean; how many replicates (biological or technical), which lines)? Some data is included in Methods (e.g. glutamine uptake), but not for all of the data (e.g. TEM).

      We agree with the reviewer. These have been corrected in the revised version of the manuscript, with additional details included.

      Validation experiments were performed typically on one or two cell lines, but the lines used were not consistent (e.g. wibj_2 versus H1 for respirometry and wibj_2, oaqd_3 versus SA121 and SA181 for glutamine uptake). Can the authors explain how the lines were chosen?

      The validation experiments were performed at different time points, and the selection of lines reflected the availability of hiPSC and hESC lines within our stem cell facility at a given point in time.

      We chose to use a range of different lines for comparison, rather than always comparing only one set of lines, to try to avoid a possible bias in our conclusions and thus to make the results more general.

      The authors should acknowledge the need for further functional validation of the results related to immunosuppressive proteins.

      We agree with the reviewer and have added a sentence in the discussion making this point explicitly.

      Differences in H1 histones abundance were highlighted. Can the authors speculate as to the meaning of these differences?

      Regarding H1 histones, our study of the literature, as well as discussions with chromatin and histone experts, both within our institute and externally, have not shed light on what the differences could imply. We think therefore that this is a striking and interesting result that merits further study, but we have not yet been able to formulate a clear hypothesis on the consequences.

      (1) Howden, A. J. M. et al. Quantitative analysis of T cell proteomes and environmental sensors during T cell differentiation. Nat Immunol, doi:10.1038/s41590-019-0495-x (2019).

      (2) Marchingo, J. M., Sinclair, L. V., Howden, A. J. & Cantrell, D. A. Quantitative analysis of how Myc controls T cell proteomes and metabolic pathways during T cell activation. Elife 9, doi:10.7554/eLife.53725 (2020).

      (3) Damasio, M. P. et al. Extracellular signal-regulated kinase (ERK) pathway control of CD8+ T cell differentiation. Biochem J 478, 79-98, doi:10.1042/BCJ20200661 (2021).

      (4) Salerno, F. et al. An integrated proteome and transcriptome of B cell maturation defines poised activation states of transitional and mature B cells. Nat Commun 14, 5116, doi:10.1038/s41467-023-40621-2 (2023).

      (5) Antico, O., Nirujogi, R. S. & Muqit, M. M. K. Whole proteome copy number dataset in primary mouse cortical neurons. Data Brief 49, 109336, doi:10.1016/j.dib.2023.109336 (2023).

      (6) Edwards, W. et al. Quantitative proteomic profiling identifies global protein network dynamics in murine embryonic heart development. Dev Cell 58, 1087-1105 e1084, doi:10.1016/j.devcel.2023.04.011 (2023).

      (7) Barton, P. R. et al. Super-killer CTLs are generated by single gene deletion of Bach2. Eur J Immunol 52, 1776-1788, doi:10.1002/eji.202249797 (2022).

      (8) Phair, I. R., Sumoreeah, M. C., Scott, N., Spinelli, L. & Arthur, J. S. C. IL-33 induces granzyme C expression in murine mast cells via an MSK1/2-CREB-dependent pathway. Biosci Rep 42, doi:10.1042/BSR20221165 (2022).

      (9) Niu, L. et al. Dynamic human liver proteome atlas reveals functional insights into disease pathways. Mol Syst Biol 18, e10947, doi:10.15252/msb.202210947 (2022).

      (10) Murugesan, G., Davidson, L., Jannetti, L., Crocker, P. R. & Weigle, B. Quantitative Proteomics of Polarised Macrophages Derived from Induced Pluripotent Stem Cells. Biomedicines 10, doi:10.3390/biomedicines10020239 (2022).

      (11) Ryan, D. G. et al. Nrf2 activation reprograms macrophage intermediary metabolism and suppresses the type I interferon response. iScience 25, 103827, doi:10.1016/j.isci.2022.103827 (2022).

      (12) Nicolas, P. et al. Systems-level conservation of the proximal TCR signaling network of mice and humans. J Exp Med 219, doi:10.1084/jem.20211295 (2022).

      (13) Brenes, A. J. et al. Erosion of human X chromosome inactivation causes major remodeling of the iPSC proteome. Cell Rep 35, 109032, doi:10.1016/j.celrep.2021.109032 (2021).

      (14) Wisniewski, J. R., Hein, M. Y., Cox, J. & Mann, M. A "proteomic ruler" for protein copy number and concentration estimation without spike-in standards. Mol Cell Proteomics 13, 3497-3506, doi:10.1074/mcp.M113.037309 (2014).

      (15) Kilpinen, H. et al. Common genetic variation drives molecular heterogeneity in human iPSCs. Nature 546, 370-375, doi:10.1038/nature22403 (2017).

      (16) Phanstiel, D. H. et al. Proteomic and phosphoproteomic comparison of human ES and iPS cells. Nat Methods 8, 821-827, doi:10.1038/nmeth.1699 (2011).

      (17) Munoz, J. et al. The quantitative proteomes of human-induced pluripotent stem cells and embryonic stem cells. Mol Syst Biol 7, 550, doi:10.1038/msb.2011.84 (2011).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      This paper presents a computational model of the evolution of two different kinds of helping ("work," presumably denoting provisioning, and defense tasks) in a model inspired by cooperatively breeding vertebrates. The helpers in this model are a mix of previous offspring of the breeder and floaters that might have joined the group, and can either transition between the tasks as they age or not. The two types of help have differential costs: "work" reduces "dominance value," (DV), a measure of competitiveness for breeding spots, which otherwise goes up linearly with age, but defense reduces survival probability. Both eventually might preclude the helper from becoming a breeder and reproducing. How much the helpers help, and which tasks (and whether they transition or not), as well as their propensity to disperse, are all evolving quantities. The authors consider three main scenarios: one where relatedness emerges from the model, but there is no benefit to living in groups, one where there is no relatedness, but living in larger groups gives a survival benefit (group augmentation, GA), and one where both effects operate. The main claim is that evolving defensive help or division of labor requires the group augmentation; it doesn't evolve through kin selection alone in the authors' simulations.

      This is an interesting model, and there is much to like about the complexity that is built in. Individual-based simulations like this can be a valuable tool to explore the complex interaction of life history and social traits. Yet, models like this also have to take care of both being very clear on their construction and exploring how some of the ancillary but potentially consequential assumptions affect the results, including robust exploration of the parameter space. I think the current manuscript falls short in these areas, and therefore, I am not yet convinced of the results. Much of this is a matter of clearer and more complete writing: the Materials and Methods section in particular is incomplete or vague in some important junctions. However, there are also some issues with the assumptions that are described clearly.

      Below, I describe my main issues, mostly having to do with model features that are unclear, poorly motivated (as they stand), or potentially unrealistic or underexplored.

      We would like to thank the reviewer for the thoughtful comments that helped us to greatly improve the clarity of our paper.  

      One of the main issues I have is that there is almost no information on what happens to dispersers in the model. Line 369-67 states dispersers might join another group or remain as floaters, but gives no further information on how this is determined. Poring through the notation table also comes up empty as there is no apparent parameter affecting this consequential life history event. At some point, I convinced myself that dispersers remain floaters until they die or become breeders, but several points in the text contradict this directly (e.g., l 107). Clearly this is a hugely important model feature since it determines fitness cost and benefits of dispersal and group size (which also affects relatedness and/or fitness depending on the model). There just isn't enough information to understand this crucial component of the model, and without it, it is hard to make sense of the model output.

      We use the same dispersal gene β to represent the likelihood an individual will either leave or join a group, thereby quantifying both dispersal and immigration using the same parameter. Specifically, individuals with higher β are more likely to remain as floaters (i.e., disperse from their natal group to become a breeder elsewhere), whereas those with lower β are either more likely to remain in their natal group as subordinates (i.e., queue in a group for the breeding position) or join another group if they dispersed.  

      We added in the text “Dispersers may migrate to another group to become subordinates or remain as floaters waiting for breeding opportunities, which is also controlled by the same genetic dispersal propensity as subordinates” to clarify this issue. We also added in Table 1 that β is the “genetic predisposition to disperse versus remain in a group”, and to Figure 1 that “subordinates in the group (natal and immigrants) […]” after we already clarified that “Dispersers/floaters may join a random group to become subordinates.”

      Related to that, it seems to be implied (but never stated explicitly) that floaters do not work, and therefore their DV increases linearly with age (H_work in eq.2 is zero). That means any floaters that manage to stick around long enough would have higher success in competition for breeding spots relative to existing group members. How realistic is this? I think this might be driving the kin selection-only results that defense doesn't evolve without group augmentation (one of the two main ways). Any subordinates (which are mainly zero in the no GA, according to the SI tables; this assumes N=breeder+subordinates, but this isn't explicit anywhere) would be outcompeted by floaters after a short time (since they evolve high H and floaters don't), which in turn increases the benefit of dispersal, explaining why it is so high. Is this parameter regime reasonable? My understanding is that floaters often aren't usually high resource holding potential individuals (either b/c high RHP ones would get selected out of the floater population by establishing territories or b/c floating isn't typically a thriving strategy, given that many resources are tied to territories). In this case, the assumption seems to bias things towards the floaters and against subordinates to inherit territories. This should be explored either with a higher mortality rate for floaters and/or a lower DV increase, or both.

      When it comes to floaters replacing dead breeders, the authors say a bit more, but again, the actual equation for the scramble competition (which only appears as "scramble context" in the notation table) is not given. Is it simply proportional to R_i/\sum_j R_j ? Or is there some other function used? What are the actual numbers of floaters per breeding territory that emerge under different parameter values? These are all very important quantities that have to be described clearly.
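
      The proportional form the reviewer asks about, a win probability of R_i divided by the sum of all candidates' R_j, can be sketched as a weighted lottery. This is only one reading of the question: the manuscript excerpt does not confirm that the model uses this rule, and the function names and rank values below are hypothetical.

```python
import random


def breeding_probabilities(ranks):
    """Probability that each candidate wins, proportional to its rank R_i."""
    total = sum(ranks)
    if total <= 0:
        raise ValueError("at least one candidate needs a positive rank")
    return [r / total for r in ranks]


def draw_breeder(ranks, rng):
    """Draw the index of the winning candidate with probability R_i / sum(R_j)."""
    return rng.choices(range(len(ranks)), weights=ranks, k=1)[0]


# Hypothetical candidates for one vacancy: two subordinates and one floater.
ranks = [1.0, 1.0, 2.0]
print(breeding_probabilities(ranks))

rng = random.Random(0)  # seeded only to make the illustration repeatable
print(draw_breeder(ranks, rng))
```

      Under such a rule a higher rank raises the chance of inheriting the position without guaranteeing it, in contrast to a strict queue in which the highest-ranked candidate always wins.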

      Although it is true that dispersers do not work when they are floaters, they may later help if they immigrate into a group as a subordinate. Consequently, immigrant subordinates have no inherent competitive advantage over natal subordinates (as step 2.2. “Join a group” is followed by step 3. “Help”, which occurs before step 5. “Become a breeder”). Nevertheless, floaters can potentially outcompete subordinates of the same age if they attempt to breed without first queuing as a subordinate (step 5) when subordinates are engaged in work tasks. We believe that this assumption is realistic and constitutes part of the costs associated with work tasks. However, floaters are at a disadvantage for becoming a breeder because: (1) floaters incur higher mortality than individuals within groups (Eq. 3); and (2) floaters may only attempt to become breeders in some breeding cycles (versus subordinate group members, who are automatically candidates for an open breeding position in the group in each cycle). Therefore, due to their higher mortality, floaters are rarely older than individuals within groups, which heavily influences their dominance value and competitiveness. Additionally, any competitive advantage that floaters might have over other subordinate group members is unlikely to drive the kin selection-only results because subordinates would preferably choose defense tasks instead of work tasks so as not to be at a competitive disadvantage compared to floaters.  

      Regarding the concern that floaters are not usually high resource holding potential (RHP) individuals and that our assumptions might therefore be unrealistic, empirical work in a number of species has shown that dispersers are not necessarily those of lower RHP or of lower quality. In fact, according to the ecological constraints hypothesis, one might predict that high quality individuals are the ones that disperse because only individuals in good condition (e.g., larger body size, better energy reserves) can afford the costs associated with dispersal (Cote et al., 2022). To allow differences in dispersal propensity depending on RHP, we extended our model in the Supplemental Materials by incorporating a reaction norm of dispersal based on their rank, D = 1 / (1 + exp(β<sub>R</sub> * R − β<sub>0</sub>)), under the section “Dominance-dependent dispersal propensities”, which is now referenced in L195. This approach allows individuals to adjust their dispersal strategy to their competitiveness and to avoid kin competition by remaining as a subordinate in another group. Results show that the addition of the reaction norm of dispersal to rank did not qualitatively influence the results described in the main text.
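      The dominance-dependent dispersal rule described above can be sketched in a few lines. This is a minimal illustration, assuming the logistic form D = 1 / (1 + exp(β<sub>R</sub> * R − β<sub>0</sub>)); the function name, argument names, and any parameter values are hypothetical and are not taken from the authors' code.

```python
import math


def dispersal_probability(rank: float, beta_r: float, beta_0: float) -> float:
    """Dispersal propensity D as a logistic reaction norm to dominance rank.

    Assumed form: D = 1 / (1 + exp(beta_r * rank - beta_0)).
    With beta_r > 0, higher-ranked individuals are less likely to disperse;
    with beta_r < 0, they are more likely to disperse; with beta_r = 0,
    dispersal does not depend on rank at all.
    """
    return 1.0 / (1.0 + math.exp(beta_r * rank - beta_0))
```

      Under this form, the sign of β<sub>R</sub> determines whether dispersal propensity rises or falls with rank, and β<sub>0</sub> shifts the baseline propensity.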

      We also added “number of floaters” present in the whole population to the summary tables as requested.  

      As a side note, the “scramble context” we mention was an additional implementation in which we made rank independent of age. However, since the main conclusions remained unchanged, we decided to remove it for simplicity from the final manuscript, but we forgot to remove it from Table 1 before submission.  

      I also think the asexual reproduction with small mutations assumption is a fairly strong one that also seems to bias the model outcomes in a particular way. I appreciate that the authors actually measured relatedness within groups (though if most groups under KS have no subordinates, that relatedness becomes a bit moot), and also eliminated it with their ingenious swapping-out-subordinates procedure. The fact remains that unless they eliminate relatedness completely, average relatedness, by design, will be very high. (Again, this is also affected by how the fate of the dispersers is determined, but clearly there isn't a lot of joining happening, just judging from mean group sizes under KS only.) This is, of course, why there is so much helping evolving (even if it's not defensive) unless they completely cut out relatedness.

      As we showed in the Supplementary Tables and the section on relatedness in the SI (“Kin selection and the evolution of division of labor”), high relatedness does not appear to explain our results. In evolutionary biology generally and in game theory specifically (with the exception of models on sexual selection or sex-specific traits), asexual reproduction is often modelled because it reduces unnecessary complexity. To further study the effect of relatedness on kin structures more closely resembling those of vertebrates, however, we created an additional “relatedness structure level”, where we shuffled half of the philopatric offspring using the same method used to remove relatedness completely, effectively reducing within-group relatedness structure by half. As shown in the new Figure S3, the conclusions of the model remain unchanged.
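      The idea of shuffling a fraction of philopatric offspring across groups can be illustrated as follows. This is only a sketch of one way such a procedure could work, not the authors' implementation: the data layout (a list of groups, each a list of offspring identifiers), the function name, and the rule for choosing which offspring move are all assumptions made for illustration.

```python
import random


def shuffle_offspring(groups, fraction, rng):
    """Return a copy of `groups` in which a fraction of offspring swap places.

    `groups` is a list of lists of offspring identifiers (hypothetical layout).
    A random subset of positions is picked, and the individuals occupying those
    positions are permuted among them. Group sizes are therefore unchanged, but
    the picked individuals may end up with unrelated group members. By chance,
    an individual can also land back in its own group.
    """
    slots = [(g, i) for g, members in enumerate(groups) for i in range(len(members))]
    n_pick = round(fraction * len(slots))
    picked = rng.sample(slots, n_pick)
    movers = [groups[g][i] for g, i in picked]
    rng.shuffle(movers)
    result = [list(members) for members in groups]
    for (g, i), individual in zip(picked, movers):
        result[g][i] = individual
    return result
```

      With `fraction=1.0` every offspring position is eligible for reassignment, which corresponds to removing relatedness structure as far as this procedure can; with `fraction=0.5` only half of the positions are, which is the intermediate level described above.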

      Finally, the "need for division of labor" section is also unclear, and its construction also would seem to bias things against division of labor evolving. For starters, I don't understand the rationale for the convoluted way the authors create an incentive for division of labor. Why not implement something much simpler, like a law of minimum (i.e., the total effect of helping is whatever the help amount for the lowest value task is) or more intuitively: the fecundity is simply a function of "work" help (draw Poisson number of offspring) and survival of offspring (draw binomial from the fecundity) is a function of the "defense" help. As it is, even though the authors say they require division of labor, in fact, they only make a single type of help marginally less beneficial (basically by half) if it is done more than the other. That's a fairly weak selection for division of labor, and to me it seems hard to justify. I suspect either of the alternative assumptions above would actually impose enough selection to make division of labor evolve even without group augmentation.

      In nature, multiple tasks are often necessary to successfully rear offspring. We simplify this principle in the model by maximizing reproductive output when both tasks are carried out to a similar extent, allowing for some flexibility from the mean. We added to the manuscript “For example, in many cooperatively breeding birds, the primary reasons that individuals fail to produce offspring are (1) starvation, which is mitigated by the feeding of offspring, and (2) nest depredation, which is countered by defensive behavior. Consequently, both types of tasks are necessary to successfully produce offspring, and focusing solely on one while neglecting the other is likely to result in lower reproductive success than if both tasks are performed by individuals within the group.”

      Regarding making fecundity a function of work tasks and offspring survival as a function of defensive tasks, these are actually equivalent in model terms, as it’s the same whether breeders produce three offspring and two die, or if they only produce one. This represents, of course, an oversimplification of the natural context, where breeding unsuccessfully is more costly (in terms of time and energy investment) than not breeding at all.
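      One way to make this equivalence concrete is the standard thinning property of the Poisson distribution. The derivation below assumes, as in the reviewer's suggestion, that the number of offspring N is Poisson with mean λ and that each offspring survives independently with probability p; the symbols N, S, λ and p are introduced here for illustration and are not the manuscript's notation.

```latex
\begin{aligned}
P(S = s)
  &= \sum_{n=s}^{\infty} \frac{e^{-\lambda}\lambda^{n}}{n!}
     \binom{n}{s} p^{s} (1-p)^{n-s} \\
  &= \frac{e^{-\lambda}(\lambda p)^{s}}{s!}
     \sum_{n=s}^{\infty} \frac{\bigl(\lambda(1-p)\bigr)^{n-s}}{(n-s)!} \\
  &= \frac{e^{-\lambda}(\lambda p)^{s}}{s!}\, e^{\lambda(1-p)}
   = \frac{e^{-\lambda p}(\lambda p)^{s}}{s!}.
\end{aligned}
```

      Under these assumptions, the number of surviving offspring S is itself Poisson with mean λp, so drawing fecundity and then survival yields the same distribution as a single draw whose mean combines both effects.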

      Overall, this is an interesting model, but the simulation is not adequately described or explored to have confidence in the main conclusions yet. Better exposition and more exploration of alternative assumptions and parameter space are needed.

      We hope that our clarifications and extension of the model satisfy your concerns.  

      Reviewer #2 (Public review):

      Summary:

      This paper formulates an individual-based model to understand the evolution of division of labor in vertebrates. A main conclusion of the paper is that direct fitness benefits are the primary factor causing the evolution of vertebrate division of labor, rather than indirect fitness benefits.

      Strengths:

      The paper formulates an individual-based model that is inspired by vertebrate life history. The model incorporates numerous biologically realistic details, including the possibility to evolve age polyethism where individuals switch from work to defence tasks as they age or vice versa, as well as the possibility of comparing the action of group augmentation alone with that of kin selection alone.

      Weaknesses:

      The model makes assumptions that restrict the possibility that kin selection leads to the evolution of helping. In particular, the model assumes that in the absence of group augmentation, subordinates can only help breeders but cannot help non-breeders or increase the survival of breeders, whereas with group augmentation, subordinates can help both breeders and non-breeders and increase the survival of breeders. This is unrealistic as subordinates in real organisms can help other subordinates and increase the survival of non-breeders, even in the absence of group augmentation, for instance, with targeted helping to dominants or allies. This restriction artificially limits the ability of kin selection alone to lead to the evolution of helping, and potentially to division of labor. Hence, the conclusion that group augmentation is the primary factor driving vertebrate division of labor appears forced by the imposed restrictions on kin selection. The model used is also quite particular, and so the claimed generality across vertebrates is not warranted.

      We would like to thank the reviewer for the in-depth review. We respond to these and other comments below.  

      I describe some suggestions for improving the paper below, more or less in the paper's order.

      First, the introduction goes to great lengths trying to convince the reader that this model is the first in this or another way, particularly in being only for vertebrates, as illustrated in the abstract where it is stated that "we lack a theoretical framework to explore the conditions under which division of labor is likely to evolve" (line 13). However, this is a risky and unnecessary motivation. There are many models of division of labor and some of them are likely to be abstract enough to apply to vertebrates even if they are not tailored to vertebrates, so the claims for being first are not only likely to be wrong but will put many readers in an antagonistic position right from the start, which will make it harder to communicate the results. Instead of claiming to be the first or that there is a lack of theoretical frameworks for vertebrate division of labor, I think it is enough and sufficiently interesting to say that the paper formulates an individual-based model motivated by the life history of vertebrates to understand the evolution of vertebrate division of labor. You could then describe the life history properties that the model incorporates (subordinates can become reproductive, low relatedness, age polyethism, etc.) without saying this has never been done or that it is exclusive to vertebrates; indeed, the paper states that these features do not occur in eusocial insects, which is surprising as some "primitively" eusocial insects show them. So, in short, I think the introduction should be extensively revised to avoid claims of being the first and to make it focused on the question being addressed and how it is addressed. I think this could be done in 2-3 paragraphs without the rather extensive review of the literature in the current introduction.

      We have revised the novelty statements in the Introduction by more clearly emphasizing how our model addresses gaps in the existing literature. More details are provided in the comments below.

      Second, the description of the model and results should be clarified substantially. I will give specific suggestions later, but for now, I will just say that it is unclear what the figures show. First, it is unclear what the axes in Figure 2 show, particularly for the vertical one. According to the text in the figure axis, it presumably refers to T, but T is a function of age t, so it is unclear what is being plotted. The legend explaining the triangle and circle symbols is unintelligible (lines 227-230), so again it is unclear what is being plotted; part of the reason for this unintelligibility is that the procedure that presumably underlies it (section starting on line 493) is poorly explained and not understandable (I detail why below). Second, the axes in Figure 3 are similarly unclear. The text in the vertical axis in panel A suggests this is T, however, T is a function of t and gamma_t, so something else must be being done to plot this. Similarly, in panel B, the horizontal axis is presumably R, but R is a function of t and of the helping genotype, so again some explanation is lacking. In all figures, the symbol of what is being plotted should be included.

      We added the symbols of the variables to the Figure axes to increase clarity. In Figure 3A, we corrected the subindex t in the x-axis; it should be subindex R (reaction norm to dominance rank instead of age). As described in Table 1, all values of T, H and R are phenotypically expressed values. For instance, T values are the phenotypically expressed values from the individuals in the population according to their genetic gamma values and their current dominance rank at a given time point.  

      Third, the conclusions sound stronger than the results are. A main conclusion of the paper is that "kin selection alone is unlikely to select for the evolution of defensive tasks and division of labor in vertebrates" (lines 194-195). This conclusion is drawn from the left column in Figure 2, where only kin selection is at play, and the helping that evolves only involves work rather than defense tasks. This conclusion follows because the model assumes that without group augmentation (i.e., xn=0, the kin selection scenario), subordinates can only help breeders to reproduce but cannot help breeders or other subordinates to survive, so the only form of help that evolves is the least costly, not the most beneficial as there is no difference in the benefits given among forms of helping. This assumption is unrealistic, particularly for vertebrates where subordinates can help other group members survive even in the absence of group augmentation (e.g., with targeted help to certain group members, because of dominance hierarchies where the helping would go to the breeder, or because of alliances where the helping would go to other subordinates). I go into further details below, but in short, the model forces a narrow scope for the kin selection scenario, and then the paper concludes that kin selection alone is unlikely to be of relevance for the evolution of vertebrate division of labor. This conclusion is particular to the model used, and it is misleading to suggest that this is a general feature of such a particular model.

      The scope of this paper was to study division of labor in cooperatively breeding species with fertile workers (i.e., primarily vertebrates), in which help is exclusively directed towards breeders to enhance offspring production (i.e., alloparental care). Our focus is in line with previous work in most other social animals, including eusocial insects and humans, which emphasizes how division of labor maximizes group productivity. Other forms of “general” help are not considered in the paper, and such forms of help are rarely considered in cooperatively breeding vertebrates or in the division of labor literature, as they do not result in task partitioning to enhance productivity.

      Overall, I think the paper should be revised extensively to clarify its aims, model, results, and scope of its conclusions.

      Recommendations for the authors: 

      Reviewer #1 (Recommendations for the authors):

      I reserved this section for more minor comments, relating to clarity and a general admonition to give us more detail and exploration of some basic population genetic quantities.

      Another minor point, although depending on whether I assume right or wrong, it could be major: I am not entirely sure that dispersers help in the groups they join as helpers, because of line 399, which states specifically that individuals who do remain in natal territories do. But I assume dispersers help (elsewhere, the authors state helping is not conditional on relatedness to the breeder). Otherwise, this model becomes even weirder for me. Either way, please clarify.

      Apologies if this was not clear. Immigrants that join a group as subordinates (i.e., dispersers from another group) help and queue for a breeding position, as does any natal subordinate born into the group. We rephrased the sentence to “Subordinate group members, either natal or immigrants to the group, […]”

      More generally, in simulation studies like this, there can be interactions between the strength of selection (which affects overall genetic variation maintained in the population), population size, and mutation rate/size, which can affect, for example, relatedness values. None of these quantities is explored here (and their interactions are not quantified), so it is not possible to evaluate the robustness of any of these results.

      Thank you for your comments about the parameter landscape. It is important to point out that variations in the mutation rate do not qualitatively affect our results, as this is something we explored in previous versions of the model (not shown). Briefly, we find that variations in the mutation rates only alter the time required to reach equilibrium. Increasing the step size of mutation diminishes the strength of selection by adding stochasticity and reducing the genetic correlation between offspring and their parents. Population size could, in theory, affect our results, as small populations are more prone to extinction. Since this was not something we planned to explore in the paper directly, we specifically chose a large population size, or better said, a large number of territories (i.e. 5000) that can potentially host a large population.  

      The authors also never say how it is actually determined. There is the evolved helping variable, and there is also the evolved reaction norm. I assume that the actual amount of help of each type is given by the product of T (equation 1) and H (for defense) and (1-T) and H (for work), but this should be stated explicitly.  

      Help provided is an interaction between H (total effort) and T (proportion of total effort invested in each type of task). To clarify the distinction between these two processes, we have now added “Hence, the gene α regulates the amount of help expressed, while the genes γ determine which specific helping tasks are performed at different time points in the breeding cycle”.  
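      The interaction between H and T can be written out explicitly. The sketch below follows the reviewer's reading, in which defense help is T × H and work help is (1 − T) × H; this mapping, along with the function and argument names, is an assumption for illustration and has not been checked against the authors' code.

```python
def help_by_task(total_help: float, defense_share: float) -> tuple[float, float]:
    """Split total help H into (defense, work) using the task share T.

    Assumed mapping: defense = T * H and work = (1 - T) * H, so the two
    components always sum to the total help expressed.
    """
    if total_help < 0:
        raise ValueError("total_help must be non-negative")
    if not 0.0 <= defense_share <= 1.0:
        raise ValueError("defense_share must lie in [0, 1]")
    defense = total_help * defense_share
    work = total_help * (1.0 - defense_share)
    return defense, work
```

      For instance, `help_by_task(2.0, 0.25)` returns `(0.5, 1.5)`: a helper with total effort 2 that allocates a quarter of it to defense provides 0.5 units of defense and 1.5 units of work.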

      It is also weird that after introducing the T variable as a function of age, Figure 3 actually depicts it as a function of dominance value.

      Thank you for pointing out an error in Eq. 1. This inequality was indeed written incorrectly in the paper (but is correct in the model code); it is dominance rank instead of age (see code in Individual.cpp lines 99-119). We corrected this mistake throughout the manuscript.

      What is "scramble context"?

      “Scramble context” was an additional implementation that we decided to remove from the final manuscript, but we forgot to remove from Table 1 before submission. We have now removed it from the table.

      Reviewer #2 (Recommendations for the authors):

      Some specific comments:

      (1) L 31: "All theoretical..." These absolute statements are risky and unnecessary.

      Rephrased to “To date, most theoretical and empirical work…”

      (2) L 46: I believe Tom Wenseleers has published on the evolution of division of labor with reproductive workers and high within-colony conflict.

      Tom Wenseleers has indeed produced some models on the evolution of cooperation in social insects where some workers may reproduce. However, these models focus on the relevance of relatedness and policing selecting for a reduction in within-group conflict and the evolution of reproductive division of labor. Our model focuses instead on division of labor among workers (helpers). We have rephrased this section to “task specialization is linked to sterility and where conflict of interest is generally low” to account for species of social insects in which variation in relatedness between group members and higher levels of reproductive conflict may arise. We also cited one of his papers.

      (3) L 57: Again, unnecessary categorical statements.

      Rephrased to “Although a great deal of recent empirical work highlights the importance of direct benefits in the evolution of cooperative breeding behavior in vertebrates [21–24], we lack understanding on the joint influence of direct and indirect fitness benefits in the evolution of division of labor.”

      (4) L 67: This is said to be a key distinction, but in the paper, such a key role is not clearly shown. This and other tangential points are unnecessary to keep the introduction to the point.

      The different fitness costs of different tasks are the basis of our model of division of labor. Therefore, this is a key distinction and the basis from which to describe different tasks in the model. We have left this sentence unchanged.

      (5) L 61-73: "In vertebrates, however, helpers may obtain fitness benefits directly via reproduction..." Some social insects may do so as well. It seems unnecessary and incorrect to say that vertebrate sociality is fundamentally different from invertebrate one. I think it is sufficiently interesting to say this work aims to understand vertebrate division of labor, by explicitly modeling aspects of its life history, without saying this can't happen in invertebrates or that no other model has ever done anything like it.

      Our point is not that workers in some social insects cannot obtain direct fitness benefits, but that previous models, in which the focus is on the colony reproductive outcome, are only a good approximation for eusocial insects with sterile workers. However, to make this clearer we have added “In vertebrates and social insect with fertile workers, however, helpers may obtain fitness benefits directly via […]”.

      (6) L 74-86: By this point, the introduction reads like a series of disconnected comments without a clear point.

      In L60 we added: “Understanding how direct and indirect benefits interact is particularly important in systems where individuals may differentially bear the fitness costs of cooperation”. By adding this sentence, we emphasize our focus on the largely unexplored direct fitness benefits and costs, as well as their interaction with indirect fitness. We then proceed to explain why it is crucial to consider that tasks have varying direct fitness costs and how the fitness benefits derived from cooperation change with age and resource-holding potential. These elements are essential for studying the division of labour in species with totipotent workers.

      (7) L 87: This sentence gives a clear aim. It would be clearer if the introduction focused on this aim.

      With the new sentence added in L60 (see previous comment), we bring the focus to the main question that we are trying to address in this paper earlier in the Introduction.  

      (8) L 88: "stochastic model" should be changed to "individual-based model".

      Done.

      (9) L 104: "limited number" is unclear. Say a fixed finite number, or something specific.

      Done.

      (10) L 105: "unspecified number" is unclear. Say the number of subordinates emerges from the population dynamics.

      Changed to “variable number of subordinate helpers, the number of which is shaped by population dynamics, with all group members capable of reproducing during their lifetime”.

      (11) L 112: "Dispersers" is used, but in the previous lines 107-109, the three categories introduced used different terms. Those three terms introduced should be used consistently throughout the paper, without using two or more terms for one thing.

      We use the term “disperser” to describe individuals that disperse from their natal group.

      Dispersers can assume one of three roles: (1) they can join another group as "subordinates"; (2) they can join another group as "breeders" if they successfully outcompete others; or (3) they can remain as "floaters" if they fail to join a group. "Floaters" are individuals who persist in a transient state without access to a breeding territory, waiting for opportunities to join a group in an established territory. We rephrased the sentence to “Dispersers cannot reproduce without acquiring a territory (denoted here as floaters)”. This was also clarified in other instances where the term “dispersers” was used (e.g. L407). In other instances where the meaning might not have been clear, we replaced “dispersers” with “floaters”.

      (12) L 112: "(floaters)" Unclear parenthesis.

      See previous comment.  

      (13) L 115: There should be a reference to Methods around here.

      Added a reference to Figure 1.

      (14) L 117: To be clearer, say instead that dominance value is a linearly increasing function of age as a proxy of RHP and a linearly decreasing function of help provided due to the costs of working tasks. And refer to equation 2.

      Rephrased to “We use the term dominance value to designate the competitiveness of an individual compared to other candidates in becoming a breeder, regardless of group membership, that increases as a function of age, serving as a proxy for resource holding potential (RHP), and decreases as a function of help provided, reflecting costs to body condition from performing working tasks (Eq. 2).” We did not include “linearly” to keep it simpler, since it is clear from Eq. 2, which is now referenced here.  

      (15) L 119: "Subordinate helpers". As all subordinates are helpers, the helper qualifier is confusing.

      Subordinates are not necessarily helpers, as they can evolve help values of 0, which is why we make the distinction explicit here.

      (16) L 119: "choose". This terminology may be misleading. The way things are implemented in the model is that individuals are assigned a task depending on their genetic traits gamma. Perhaps it would be better to use a less intentional term, like perform one of two tasks.

      We changed “choose between two” to “engage in one of two”, which has fewer connotations of intentionality.

      (17) L 124: "Subordinates can [...] exhibit task specialization that [...] varies with their dominance value". It should be that it varies with age.

      Apologies. The equation was wrong; it does vary with dominance value. We corrected it accordingly.

      (18) L 133: "maximised" This is apparently important for the modelling procedure, but it is completely unclear what it means. Equation 4 comes out of nowhere, and it is said that such an equation is the maximum amount of help that can affect fecundity. Why? What does this mean? If there is something that is maximised, this should be proven. This value is then used for something (line 507), but it is unclear why or what it is used for (it says "we use the value of Hmax instead" without saying what for, no justification for the listed inequalities are given, and the claimed maximisation of an unspecified variable at those H values is not proven). Moreover, the notation in this section is also unclear: what are the sums over? Also, Hdefence and Hwork should vary over the index that is summed over, but the notation suggests that those quantities don't vary.

      We changed “maximized” to “greatest”, and we added a clarification of the rationale behind maximizing the impact of help on the breeder’s productivity: “For example, in many cooperatively breeding birds, the primary reasons that breeders fail to produce offspring are (1) starvation, which is mitigated by the feeding of offspring, here considered as a work task, and (2) nest depredation, which is countered by defensive behavior. Consequently, both types of tasks are often necessary for successful reproduction, and focusing solely on one while neglecting the other is likely to result in lower reproductive success than if both tasks are performed by helpers within the group.”

      We now also clarify that the sums are for help given within a group (L 507), and added indexes to the equations.

      (19) L 152: "habitat saturation" How is this implemented? How is density dependence implemented? Or can the population size keep increasing indefinitely? It would be good to plot the population size over time, the group size over time, and the variance in group size over time. This could substantiate later statements about enhancing group productivity and could all be shown in the SI.

      Habitat saturation emerges from population dynamics due to the limited availability of territories and the fluctuating number of individuals, leading highly productive environments to experience habitat saturation. Because the number of group members is not restricted in our model, the population could theoretically increase indefinitely. However, this is not observed in the results presented here, as we selected parameter landscapes that stabilize population numbers. Because we did not incorporate density-dependent mortality, for simplicity, we confined our parameters to those where the population neither increased indefinitely nor collapsed. Consequently, the group size reported in the SI, where the standard deviation is already included, closely represents group size at any other given time during equilibrium.

      L 336: we changed “environments with habitat saturation” to “environments that lead to habitat saturation”, to increase clarity.

      (20) L 152: "lifecycle". Rather than the lifecycle, the figure describes the cycle of events in a single time step. The lifecycle (birth to death) goes over multiple time steps (as individuals live over multiple steps). So this figure shouldn't be called a life cycle.

      We changed “lifecycle” to “breeding cycle”.

      (21) L 156: "generation". This is not a generation but a time step.

      We changed “generation” to “breeding cycle”.

      (22) L 157: "previous life cycle" would mean that the productivity of a breeder depends on the number of helpers that its parents had, which is not what is meant.

      We changed “lifecycle” to “breeding cycle”.

      (23) L 158: "Maximum productivity is achieved when different helping tasks are performed to a similar extent." Again, unclear why that is the case.

      We added a clarification on this, see response to comment 18.  

      (24) L 160: "Dispersers/floaters". Use just one term for a single thing.

      See response to comment 11.   

      (25) L 162: "dispersal costs". I don't recall these being described in Methods.

      Individuals that disperse do not enjoy the protection of living in a territory and within a group of other individuals, so they have a higher mortality risk, as described in Eq. 3.3 (negative values in the exponential part of the equation increase survival). The cost of dispersal is the same as that incurred by individuals that remain as floaters at a given time step.

      (26) L 164: "generation" -> time step.

      We changed this to “breeding cycle”.  

      (27) L 170: "Our results show that division of labor initially emerges because of direct fitness benefits..." This is a general statement, but the results are only particular to the model. So this statement and others in the manuscript should be particular to the model. Also, Figure 2 doesn't say anything about what evolves "initially" as it only plots evolutionary equilibria.

      We rephrased this statement to “Our results suggest that voluntary division of labor involving tasks with different fitness costs is more likely to emerge initially because of direct fitness benefits”, to more accurately represent the conditions under which we modeled the division of labor.  

      Our reference to “initially” is regarding group formation (family groups versus aggregations of unrelated individuals or a mix). This is shown in the comparison between the different graphs at equilibrium. The initial state of the simulation is that all individuals disperse and do not cooperate.  

      (28) L 171: "but a combination of direct and indirect fitness benefits leads to higher rates and more stable forms of division of labor". What do you mean by "higher rates and more stable forms of division of labor"? Say how division of labor is shown in the figure (with intermediate T?).

      Yes, intermediate values of T show division of labor if γR ≠ 0. This is described under the section “The role of dominance in task specialization”. We added “with intermediate values suggesting a division of labor” to the Figure 2 legend.  

      (29) L173-175: "as depicted in Figure 2, intermediate values of task specialization indicate in all cases age/dominance-mediated task specialization (γt ≠ 0; Table 1) and never a lack of specialization (γt = 0; Table 1)". This sentence is unclear and imprecise. Does this sentence want to say that in Figure 2, all plots with intermediate values of T involve gamma t different from zero? If so, just say that.

      Rephrased to: “In Figure 2, all plots depicting intermediate values of T exhibit non-zero γR values and, hence, division of labor”.
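      The distinction between rank-dependent task specialization (γ<sub>R</sub> ≠ 0) and a lack of specialization (γ<sub>R</sub> = 0) can be illustrated with a small sketch. Eq. 1 is not reproduced in this response, so the logistic form used here, T = 1 / (1 + exp(γ<sub>R</sub> * R − γ<sub>0</sub>)), is an assumption made by analogy with the dispersal reaction norm given earlier; the function names and tolerance are likewise hypothetical.

```python
import math


def task_share(rank: float, gamma_r: float, gamma_0: float) -> float:
    """Task share T expressed at a given dominance rank (assumed logistic form)."""
    return 1.0 / (1.0 + math.exp(gamma_r * rank - gamma_0))


def varies_with_rank(ranks, gamma_r: float, gamma_0: float, tol: float = 1e-9) -> bool:
    """Return True if expressed T differs across the supplied ranks.

    With gamma_r == 0, every individual expresses the same T regardless of
    rank, so an intermediate value reflects generalists rather than a
    division of labor. With gamma_r != 0, individuals of different rank
    express different T values.
    """
    shares = [task_share(r, gamma_r, gamma_0) for r in ranks]
    return max(shares) - min(shares) > tol
```

      In this sketch, an intermediate population-level T is only evidence of division of labor when `varies_with_rank` is true, which mirrors the requirement that γ<sub>R</sub> be non-zero.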

      (30) L179-180: "forms of help that impact survival never evolve under any environmental condition when only kin selection occurs". This is misleading because under the KS scenario, help cannot positively impact survival in this model, so they never evolve.

      Help cannot affect survival but could potentially affect group persistence. If helpers increase breeder productivity and offspring remain philopatric and queue for the breeding position, then they will receive help from related individuals.   

      (31) L 210: "initially". What do you mean by that?

      Help only evolves in our model in family groups, which may then open the door for the evolution of help in mixed-kin groups. Therefore, we use “initially” to refer to the ancestral group structure that likely led to cooperation under benign environmental conditions. We rephrased this section to “in more benign (and often highly productive) environments that lead to habitat saturation, help likely evolved initially in family groups, and defensive tasks are favored because competition for the breeding position is lower under kin selection.”

      (32) L 212: "kin selection is achieved". What does that mean?

      Rephrased to “kin selection acts not only by selecting subordinates in their natal group to increase the productivity of a related breeder […]”

      (33) L 216: "division of labor seems to be more likely to evolve in increasingly harsh environments". Say in parentheses where this is shown.

      Added.  

      (34) L 218: "help evolves in benign environments". I don't see where this is shown. Figure 2 doesn't show that H is higher with lower m (e.g., in KS+GA column).

      Help does not evolve in benign environments under only direct fitness benefits derived from group augmentation (shown in Figure 2).  

      (35) L 225: "y-axis" should be "vertical axis", as y has another meaning in the model.

      Done.

      (36) L 226: "likelihood". Here and throughout, "likelihood" should be changed to probability. Likelihood means something else.

      Thank you for the advice; we have corrected this throughout the manuscript.  

      (37) L 236: "the slope of the reaction norm for the dominance value in task specialization".

      Unclear. Clearer to say: the rate at which individuals shift from defense to work as they age.

      The important part is not so much the rate but the direction, that is, from work task to defense (or vice versa) as their rank increases. Changed to “the direction and rate of change in task specialization with dominance”.

      (38) L 257: "(task = 0; cost to dominance value)," This seems out of place.

      This aims to clarify that work tasks have a cost to dominance, while defense tasks have a cost to survival. This is particularly relevant in this model since different helping tasks are defined by their fitness costs.

      (39) L 258: "increase"-> "increase with age".

      Added “with dominance”.

      (40) L 262: "division of labor equilibria" What is that?

      Changed to “at equilibrium when division of labor evolves”

      (41) L 268: "Our findings suggest that direct benefits of group living play a driving role in the evolution of division of labor via task specialization in species with totipotent workers". This is a very general statement, but the results are much more circumscribed. First, the model is quite specific by assuming that, in the absence of group augmentation (xn=0), indirect fitness benefits can only be given to breeders (Equation 5) but not to other subordinates (Equations 2, 3.1). This is unrealistic, particularly for vertebrates, and reduces the possibility that indirect fitness benefits play a role.  

      As previously discussed, the scope of this paper was to study division of labor in cooperatively breeding species with fertile workers in which help is exclusively directed towards breeders to enhance offspring production through alloparental care. Other forms of “general” help do not result in task partitioning to enhance productivity.

      Second, the difference in costs of work and defense are what drive the evolution of "division of labor" (understood as intermediate T in case this is what the authors mean) in the KS scenario, but the functional forms of those two costs are quite specific and not of the same form, so these functions may bias the results found. Specifically, R is an unbounded linear function of work and the effect of this function becomes weaker as the individual ages due to the weakening force of selection with age (Equation 2) whereas Sh is a particular bounded nonlinear function of defense (Equation 3.1). These differences may tend to make the effect of Sh stronger due to the particular functions chosen.  

      The difference in costs is inherent to the nature of the different tasks (work versus defense): while survival is naturally bounded, with death as the lower bound, dominance costs are potentially unbounded, as they are influenced by dynamic social contexts and potential competitors. Therefore, we believe that the model’s cost structure is not too different from that in nature.  

      Third, no parameter sweep is given to see to what extent these results hold across the many parameters involved. So, in summary, the discussion should at least reflect that the results are of a restricted nature rather than giving the impression that they are of the suggested level of generality.

      During the exploratory phase of the model development, various parameters and values were assessed. However, for clarity and conciseness, the manuscript only details the ranges of values and parameters where changes in the behaviors of interest were observed. For instance, variation in yh (the cost of help on dominance when performing “work tasks”) led to behavioral changes similar to those caused by changes in xh (the cost of help in survival when performing “defensive tasks”), as both are proportional to each other. Specifically, since an increase in defense costs raises the proportion of work relative to defense tasks, while an increase in the costs of work tasks has the opposite effect, only results for the variation of xh were included in the manuscript to avoid redundancy. Added to Table 1: “To maintain conciseness, further exploration of the parameter landscape was not included in the manuscript”.

      (42) L 270: "in eusocial insects often characterized by high relatedness and reproductive inhibition, sterile workers acquire fitness benefits only indirectly". This is misleading. Sterile workers of any taxa, be it insects or vertebrates, can only acquire fitness benefits indirectly as they are sterile, but eusocial insects involve not only sterile workers.

      Rephrased to “In contrast, in eusocial species characterized by high relatedness and permanent worker sterility, such as most eusocial insects, workers acquire fitness benefits only indirectly”. In any case, permanent sterility only occurs in eusocial invertebrates; in vertebrates with reproductive inhibition, sterility is only temporary and context dependent. Therefore, in vertebrates, sterile workers may potentially obtain direct fitness benefits if the social context changes, as is the case in naked mole-rats.  

      (43) L 273: "Group members in eusocial species are therefore predicted to maximize colony fitness due to the associated lower within-group conflict". Again, this is incorrect. Primitively eusocial insects have high conflict.

      We added “Group members in such eusocial species” to clarify that we are not referring here to primitively eusocial species but those with permanent sterile workers.  

      (44) L 277: "when the benefits of cooperation are evenly distributed among group members". In this model, the benefits of cooperation are not evenly distributed among group members: breeders reproduce, but subordinates don't.

      Subordinates may reproduce if they become breeders later in life. However, subordinates also benefit from cooperation as subordinates directly (greater survival in larger groups), and indirectly if they are related to the breeder. Here we refer to the first one, and we expand on that in the following sentence.  

      (45) L 280: "survival fitness benefits derived from living in larger groups seem to be key for the evolution of cooperative behavior in vertebrates [22, 63], and may also translate into low within-group conflict. This suggests that selection for division of labor in vertebrates is stronger in smaller groups". I don't see how the previous sentence suggests this. The paper does not present results to support this statement (i.e., no selection gradients in smaller vs larger groups are shown).

      The benefits of living in a larger group entail diminishing returns, so individuals living in smaller groups benefit more from an increase in productivity and group size than those in larger groups.  

      (46) L 284: "Our model demonstrates that vertebrates evolve a more stable division of labor". Where is that shown? How is "more stable" measured?

      Rephrased to “vertebrates are more likely to evolve division of labor”. This is shown in Figure 2, which illustrates that division of labor evolves under a wider range of environmental conditions and to a higher degree (intermediate values of T).  

      (47) L 287: "direct fitness benefits in the form of group augmentation select more strongly for defensive tasks". Where is that shown? Establishing this would entail comparing selection gradients with direct fitness benefits of group augmentation and without them.

      In Figure 2, when we compare the GA column to the KS+GA column, we see that at equilibrium, more helpers choose defense tasks, especially when they are free to choose their preferred task (circles).  

      (48) L 288: "kin selection alone seems to select only for work tasks." Again, this may be an artifact of the model assuming that helpers cannot increase non-breeders' fitness components except via group augmentation, and that defense tasks are inherently more costly than work tasks.

      As stated previously, we are studying task specialization in cooperative breeders where help is in the form of alloparental care (from allofeeding and egg care to defense from predators). We also assume that the costs are different, but whether one or the other is more costly depends on the relative context (e.g., a task can be more costly if it affects competitiveness in a very competitive environment). It is important to note that we name these tasks “work” and “defense” for practical reasons, but the focus of the paper is on tasks with different fitness costs, which, given their characteristics, may not fit so well under this terminology. While we acknowledge that most tasks have both kinds of fitness costs to a degree, here we focus on the main fitness costs of each kind of task (L430-436).  

      (49) L 290: "are comparatively large". This sounds as if the tasks are large, which is presumably not what is meant.

      Rephrased to “costs to dominance value and to the probability of attaining a breeding position are comparatively larger than survival costs.”

      (50) L 298: "helpers are predicted to increase defensive tasks with age or rank, whereas in harsh environments, work tasks are predicted to increase with age or rank." Add parentheses referring to where this is shown.

      This is shown in Figure 3, but since this is described in the discussion, we did not add a reference to the figure. If the editor would like us to refer to figures here, we can (see also comments below relating to the same issue).

      (51) L 308: "the role of age and environmental harshness on the evolution of division of labor". What is the prediction? Simply, the role of age is an assumption, not a prediction.

      Rephrased to “the role of environmental harshness on the evolution of division of labor via age-dependent task specialization”.

      (52) L 315: "individuals shifting from work tasks such as foraging for food, digging, and maintaining the burrow system, to defensive tasks such as guarding and patrolling as individuals grow older and larger". Say in parentheses where this is predicted.

      This prediction comes from Figure 3, we do not reference it here since we are in the Discussion section.  

      (53) L 320: "Under these conditions, our model predicts the highest levels of task partitioning and division of labor." Where is this predicted? Add parentheses referring to where this is shown. As it is, it is not possible to check the validity of the statement.

      This prediction comes from Figure 2 column KS+GA, we do not reference it here since we are in the Discussion section. The results with references to the figures are found under the Results section. In the discussion, we reiterate the results already described and add some examples from real data that seem to confirm our predictions.  

      (54) L 322: "In line with our model predictions, larger and older helpers of this species invest relatively more in territory maintenance, whereas younger/smaller helpers defend the breeding shelter of the dominant pair to a greater extent against experimentally exposed egg predators". These predictions are neat, but are now very difficult to understand from the figures. Maybe at the bottom of 3A, you could add a diagram work->defense for negative gamma_t and defense>work for positive gamma_t (or whatever order it is).

      Done.

      (55) L 325: "Territory maintenance has been shown to greatly affect routine metabolic rates and, hence, growth rates [80], which directly translates into a decrease in the likelihood of becoming dominant and attaining breeding status, as predicted by our model." This seems to be an assumption, not a prediction.

      That is true. We removed: “as predicted by our model”.  

      (56) L 352: "controlled". This means something else.

      Changed to “addressed”.

      (57) L 356: "summary, our study represents the first theoretical model aimed at elucidating the potential mechanisms underlying division of labor between temporal non-reproductives via task specialization in taxa beyond eusocial organisms". Again, claiming to be the first is risky and unnecessary.

      Rephrased to “our study helps to elucidate”.

      (58) L 358: "Harsh environments, where individuals can obtain direct fitness benefits from group living, favor division of labor, thereby enhancing group productivity and, consequently, group size." I'm not sure about this conclusion as harsh environments (large m in Figure 2) also involve the evolution of no division of labor (from the triangles and circles that are zero in the right bottom panel) and perhaps more so than with less harsh environments (intermediate m). Incidentally, in the bottom right panel of Figure 2, do the two separate clusters of triangles and circles mean that there is some sort of evolutionary branching?

      Yes, there are two different equilibria for the same set of conditions. Although it is true that for m=0.3 less division of labor evolves when kin selection and group augmentation act together, this is not the case when only group augmentation takes place. In addition, we qualify m=0.2 as harsh, as opposed to benign (m=0.1), in which we observe the rise of habitat saturation. m=0.3 is then an extremely harsh environment, in which several parameter combinations cause population collapse (see figures in the Supplemental Material).  

      (59) L 360: "Variation in the relative fitness costs of different helping tasks with age favors temporal polyethism". I don't see that this has been shown. Temporal polyethism evolves here whenever gamma_t evolves non-zero values. Figure 3A shows that non-zero gamma_t evolves with harsher environments, but I don't see what the "variation in relative fitness costs of different helping tasks" refers to.

      The evolved reaction norms of the model are towards different fitness costs depending on the task performed, since this is how we define the different types of tasks in the model.  

      (60) L 382: "undefined". Say variable. Undefined is something else.

      Undefined is more accurate, since we did not define how many subordinates there were per group, while “variable” could have been defined within a range, which was not the case in this model.  

      (61) L 390: "each genetic locus". Say earlier that each genetic trait is controlled by a single locus.

      Added.  

      (62) L 395: "complete" and "consistent" -> "certain".

      We changed one to “certain” and another to “absolute” to avoid using the same adjective twice in a sentence.  

      (63) L 396: What determines whether dispersers become subordinates or floaters? A trait? Or a fixed probability?

      We added “which is also controlled by the same genetic dispersal predisposition as for subordinates”.

      (64) L 412-413: "cycle". This should be a breeding step.

      Changed to “season” instead.

      (65) L 418: Say negatively impacts (it could also be positively impacts, which I guess is not what you mean).

      Done.

      (66) L 425: "a sample of floaters". Chosen how?

      Added “randomly drawn”.

      (67) L 426-428. But the equation in Table 1 indicates that all floaters compete for breeding spots, not a sample of floaters. This is not clear.

      The number of floaters sampled to try to breed at a given group is N<sub>f,b</sub> = f × N<sub>f</sub> / N<sub>b</sub> (Table 1).

      Therefore, N<sub>f,b</sub> is the sample size of floaters for a given open breeding position, and f is how many groups on average a floater attempts to access in each time step.  
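      As a minimal sketch of the sampling rule above, the following Python snippet computes N<sub>f,b</sub> = f × N<sub>f</sub> / N<sub>b</sub> and draws a random sample of that size. The function names, the rounding to the nearest integer, and the capping at the number of available floaters are illustrative assumptions, not details taken from the authors' simulation code.

```python
import random


def floater_sample_size(f, n_floaters, n_breeders):
    """Expected number of floaters competing for one open breeding position.

    Implements N_fb = f * N_f / N_b, where f is the average number of groups
    a floater attempts to access per time step.
    """
    if n_breeders <= 0:
        raise ValueError("n_breeders must be positive")
    return f * n_floaters / n_breeders


def draw_floater_sample(floaters, f, n_breeders, rng=None):
    """Randomly draw the floaters that compete for one breeding position.

    Assumption (not from the source): the sample size is rounded to the
    nearest integer and capped at the number of available floaters.
    """
    rng = rng or random.Random()
    size = round(floater_sample_size(f, len(floaters), n_breeders))
    size = min(size, len(floaters))
    return rng.sample(floaters, size)


# Hypothetical numbers: 300 floaters, 100 breeding positions, f = 2.
floater_ids = list(range(300))
expected = floater_sample_size(2, len(floater_ids), 100)
competitors = draw_floater_sample(floater_ids, 2, 100, random.Random(0))
```

      With these hypothetical numbers, each open position is contested by six randomly drawn floaters, which illustrates why a sample, and not the whole floater pool, competes at any one group.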

      (68) L 432. In the figure, the breeding cycle is called a step, but here it is called a cycle. There should be a single term used throughout. Breeding is not really a cycle here (it doesn't involve multiple steps that are repeated cyclically), so it seems more appropriate to call this breeding steps or breeding seasons.

      Taking previous comments into account, we changed the terms “generation” and “life cycle” to “breeding cycle”. We added “or seasons”.  

      (69) L 439: "generations". What are generations here, as generations are overlapping? You probably mean time steps or something else.

      Changed to “breeding cycles”.

      (70) L 439: "equilibrium was reached". Presumably, equilibrium is reached only asymptotically, so some cutoff is implemented in practice. So maybe say explicitly what cutoff was implemented.

      As mentioned, we ran the model for 200,000 time steps, and if the phenotypic values had not reached equilibrium, we ran the model for longer, with 400,000 time steps being the maximum by which all simulations reached equilibrium. In some cases, genetic values did not reach equilibrium in ranges where they had no impact on phenotypic values, so these were disregarded when assessing whether equilibrium was reached.  

      (71) L 452: "Even though individuals are likely to change the total amount of help given throughout their lives". Do you mean in real organisms or in the model? Say which. If it is in the model, it is not clear how.

      We added “in nature” to clarify that this was not the case in the model.  

      (72) L 455: "For more details on how individuals may adapt their level of help with age and social and environmental conditions, see [63]." Do you mean real individuals or in the model? Again, if it is in the model, it is unclear how this is possible and should be explained in this paper at least briefly rather than citing another one.

      We rephrased it to “How individuals in the model may adapt their level of help with age and social and environmental conditions has been described elsewhere.” We do not go into detail here because it is not within the scope of the paper, and those results have been described elsewhere.  

      (73) L 475: "helpers". Make terminology consistent throughout.

      All helpers are subordinates, but not all subordinates are helpers, as they may evolve no help. Since here we are describing those subordinates that do help, we use that terminology. We added “subordinate helpers” to clarify this further.  

      (74) L 476: "proportional". The dependence in Equation 1 is not "proportional to". Say something like "a survival probability (not rate) that decreases with the amount of help provided".

      Done.

      (75) L 482: "environmental"-> baseline, as defined first.

      Done.

      (76) L 486: "benefits". Can you briefly say in parentheses what those benefits are in real organisms? As in line 475, where you reminded the reader of survival costs due to predator defense.

      Added “such as those offered by safety in numbers or increased resource defense potential”.

      (77) L 494. "we first outline a basic model in which individuals". It is not clear what this sentence says, and the remainder of this section does not clarify it.

      We made two models for comparison, one where individuals can choose freely which task they prefer to perform, and another in which there is an increase in productivity when both kinds of tasks are performed to a similar extent at group level. In the latter model, individuals may choose an unpreferred task at certain times during their lives to increase the effect of the help provided on the breeder’s (and group’s) productivity.  

      We rephrased this section to “we first outline a basic model where individuals evolve their preferred helping task. Then we compare this to another model in which the breeder’s reproductive outcome is maximized when the group’s helping effort in each kind of tasks is performed to a roughly equal degree.”

      (78) L 496: "by performing both tasks". Sounds as if the breeder performs both tasks, not helpers.

      We changed to “when the group’s helping effort in each kind of tasks”.

      (79) L 497: "the maximum amount of cumulative help of each type (sigma Hmax) that can affect fecundity is given by Eq. 4:" This statement is imprecise. Presumably, what is meant is that this level of help maximises breeder productivity, as stated earlier in the paper. However, there is no proof that this level of help maximises breeder productivity, so this expression seems unjustified and it is unclear how it is used.

      This is a description of the model setup. As described later in the same section, the cumulative help of each type that can influence the breeder’s fecundity is capped at Hmax. Therefore, it does represent the maximum amount of cumulative help of each type that can affect the breeder’s fecundity.

      (80) L 500: "reproduced" -> "reproduce".

      Done.  

      (81) L 503. Say here what K is so that the reader knows what equation 5 is showing.

      Added “K” to the “The quantity of offspring produced (K)”.

      (82) L 503: "diminishing returns" -> "diminishing returns as help increases".

      Done.  

      (83) L 507: Why these inequalities?

      These inequalities explain the use of Hmax (response to comment 79). We rephrased it to “the cumulative defense effort is larger than or the cumulative work effort is larger than ”.  

      (84) L 526: "removing the influence of relatedness from the model". It would be helpful to plot relatedness in this and the other scenario to check that it is indeed low here and high in the other.

      The actual values of relatedness are provided in the Supplemental Material Table S1. We added this reference to Figure 2.  

      (85) L 528: "It is possible that direct and indirect fitness benefits could have an additive effect on the evolution of alloparental care". This is technically incorrect. It is also unclear what the point of this sentence is.

      We have removed this sentence.  

      (86) Table 1: Say what are the allowed values for these genotypic traits (can they take negative values, be greater than one, are they continuous or discrete?): e.g., alpha \in [0,1] or alpha \in (-infinity, infinity). For phenotypic traits, it would be helpful if the third column lists the equation where the trait is defined. As the variables in the first column are scalars, they should not be bold face. Survival "rate" should be survival "probability" throughout.

      All genetic traits can take any real number (-infinity, infinity), but the phenotypic values are either constrained by the equation like for logistic formulas, or manually constrained like for dispersal propensity or help (only positive numbers allowed). We added “Each genetic trait is controlled by a single locus, and may take any real number” (L403), and added the boundaries for help and dominance value in Table 1. We decided against including the equations in the table due to space constraints. We removed the bold face as suggested. We changed all instances of “survival rate” to “survival probability”.

      (87) Figures S1, S2: I don't recall seeing references to these figures in the main text, but there should be, as well as for Tables S1-S3.

      Table S1 is now referenced in Figure 2. The other figures are now referenced in the main text when we reference the different sections in the Supplemental Materials (L190 and L198). Other Tables are referenced in their respective Figures in the SI.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Khan et al. investigated the functional redundancy of the non-canonical L-cysteine synthases of M. tuberculosis, CysM and CysK2, focussing on their role in mitigating the effects of host-derived stress. They found that while deletion mutants of the two synthases (Rv∆cysM, Rv∆cysK2) have similar transcriptomes under standard conditions, their transcriptional response to oxidative stress is distinct. The impact of deleting the synthases also differentially affected the pools of L-cysteine-derived metabolites. They show that the mutants (Rv∆cysM, Rv∆cysK2) have impaired survival in peritoneal macrophages and in a mouse model of infection. Importantly, they show that the survival of the mutants increases when the host is defective in producing reactive oxygen and nitrogen species, linking the phenotype to a defect in combating host-derived stress. Finally, they show that compounds inhibiting L-cysteine synthases reduce the intracellular survival of M. tuberculosis.

      Strengths:

      (1) The distinct transcriptome of the Rv∆cysM and Rv∆cysK2 mutants in the presence of oxidative stress provides solid evidence that these mutants are distinct in their response to oxidative stress, and suggests that they are not functionally redundant.

      (2) The use of macrophages from phox-/- and INF-/- mice and an iNOS inhibitor for the intracellular survival assays provides solid evidence that the survival defect seen for the Rv∆cysM and Rv∆cysK2 mutants is related to their reduced ability to combat host-derived oxidative and nitrosative stress. This is further supported by the infection studies in phox-/- and INF-/- mice.

      Weaknesses:

      (1) There are several previous studies looking at the transcriptional response of M. tuberculosis to host-derived stress, however, the authors do not discuss initial RNA-seq data in the context of these studies. Furthermore, while several of the genes in sulfur assimilation and L-cysteine biosynthetic pathway genes are upregulated by more than one stress condition, the data does not support the statement that it is the "most commonly upregulated pathway in Mtb exposed to multiple host-like stresses".

      We have made changes in the manuscript in line with the reviewer’s suggestion.  

      “Thus RNA-Seq data suggest that genes involved in sulfur assimilation and L-cysteine biosynthetic pathway are upregulated during various host-like stresses in Mtb (Figure S2). Given the importance of sulphur metabolism genes in in vivo survival of Mtb [1, 2], it is not surprising that these genes are dynamically regulated by diverse environmental cues. Microarray studies have shown upregulation of genes encoding sulphate transporter upon exposure to hydrogen peroxide and nutrient starvation [3-7]. Similarly, ATP sulfurylase and APS kinase are induced during macrophage infection and by nutrient depletion. Induction of these genes, which coordinate the first few steps of the sulphur assimilation pathway, indicates a probable increase in biosynthesis of sulphate-containing metabolites that may be crucial against host-inflicted stresses. Furthermore, genes involved in synthesis of reduced sulphur moieties (cysH, sirA and cysM) are also induced by hydrogen peroxide and nutrient starvation. Sulfur metabolism has been postulated to be important in transition to latency. This hypothesis is based on transcriptional upregulation of cysD, cysNC, cysK2, and cysM upon exposure to hypoxia. Multiple transcriptional profiling studies have reported upregulation of moeZ, mec, cysO and cysM genes when cells were subjected to oxidative and hypoxic stress [1, 6-11], further suggesting an increase in the biosynthesis of reduced metabolites such as cysteine and methionine and sulfur-containing cell wall glycolipids upon exposure to oxidative stress [12].” We have modified the sentence to “significantly upregulated pathway in Mtb exposed to multiple host-like stresses”.

      (2) For the quantification of the metabolites, it isn't clear how the abundance was calculated (e.g., were standards for each metabolite used? How was abundance normalised between samples?), and this information should be included to strengthen the data.

      Thank you for picking this up. We have extended our description of the metabolomics methods. It now reads: “Due to the tendency of M. tuberculosis to form clumps, which significantly skews any cell number estimation, we normalized samples to protein/peptide concentration using the BCA assay kit (Thermo). Therefore, our LC-MS data is expressed as ion counts/mg protein or ratios of that for the same metabolite. This is a standard way to express ion abundance data, as was done previously [13, 14].”

      Furthermore, labelling with L-methionine was performed to determine the rate of synthesis of the L-cysteine-derived metabolites. L-cysteine is produced from L-methionine via the transsulfuration pathway, which is independent of CysM and CysK2. It is therefore difficult to interpret this experiment, as the impact of deleting CysM and CysK2 on the transsulfuration pathway is likely indirect.

      The reviewer may have misunderstood the experiment and the results presented. Labelling was not performed with L-methionine. We use <sup>34</sup>S derived from SO<sub>4</sub><sup>2-</sup> to monitor reductive assimilation of sulfur and its transit from S<sup>2-</sup> to L-methionine, passing through cysteine. We specified in the Materials and Methods that we used sodium sulfate-<sup>34</sup>S (Merck 718882) as our labelled source of sulfur. This method was first employed in M. tuberculosis by the Bertozzi group to identify sulfolipids in mycobacteria. Therefore, we are not measuring transsulfuration, but instead direct synthesis of L-methionine via cysteine, and consequently we are indeed assessing the importance of cysK2 and cysM in this process. We have now added to the results section (page 9) that we employed sodium sulfate-<sup>34</sup>S (Na<sub>2</sub><sup>34</sup>SO<sub>4</sub>) for labeling, to make sure other readers will not think we are measuring transsulfuration.

      (3) The ability of L-cysteine to rescue the survival defect of the Rv∆cysM and Rv∆cysK2 mutants in macrophages is interpreted as exogenous L-cysteine being able to compensate for reduced intracellular levels. However, there is no evidence that L-cysteine is being taken up by the mutants and an alternate explanation is that L-cysteine functions as an antioxidant within cells i.e., it reduces intracellular ROS.

      The concentration of L-cysteine used for peritoneal macrophage survival rescue experiments was titrated to have no minimum survival advantage in case of wild-type Rv. Thus, at the given concentration, we believe that the contribution of cysteine in reducing intracellular ROS within cells does not have a major role since there is no significant difference in the survival of wild-type Rv strain. Had cysteine reduced intracellular ROS, we would expect increased bacterial survival of Rv due to diminished oxidative stress. 

      Furthermore, L-cysteine addition also mitigates the CHP-induced survival defect in vitro [15] and nullifies the observed effect of cysteine synthase inhibitors in vitro [16], suggesting that cysteine or cystine can be transported into Mtb. This has also been shown previously for the AosR mutant strain [15] and CysH [2], and by the uptake of over 70% of [35S] cysteine added exogenously to a growing culture of Mtb [17].

      The authors sought to investigate the functional redundancy of the non-canonical L-cysteine synthases CysM and CysK2. While their distinct transcriptional response to oxidative stress suggests distinct physiological roles, the study did not explore these differences and therefore provides only preliminary insight into the underlying reasons for this observation. In the context of drug development, this work suggests that while L-cysteine synthase inhibitors do not have high potency for killing intracellular M. tuberculosis, they have the potential to decrease the pathogen's survival in the presence of host-derived stress.

      Reviewer #2 (Public Review):

      Summary:

      The paper examines the role L-cysteine metabolism plays in the biology of Mycobacterium tuberculosis. The authors have preliminary data showing that Mycobacterium tuberculosis has two unique pathways to synthesize cysteine. The data showing new compounds that act synergistically with INH is very interesting.

      Strengths:

      RNAseq data is interesting and important.

      Weaknesses:

      The paper would be strengthened if the authors were to add further detail to their genetic manipulations.

      The authors provide evidence that they have successfully made a cysK2 mutant by recombineering. This data looks promising, but I do not see evidence for the cysM deletion. It is also important to state what sort of complementation was done (multicopy plasmid, integration proficient vector, or repair of the deletion). Since these mutants are the basis for most of the additional studies, these details are essential. It is important to include complementation in mouse studies as unexpected loss of PDIM could have occurred.

      The details of CysM knockout generation have been previously published ([15]; Appendix Figure S4), and complementation strain details are provided in the methods section.  

      Reviewer #3 (Public Review):

      In this work, the authors conduct transcriptional profiling experiments with Mtb under various different stress conditions (oxidative, nitrosative, low pH, starvation, and SDS). The Mtb transcriptional responses to these stress conditions are not particularly new, having been reported extensively in the literature over the past ~20 years in various forms. A common theme from the current work is that L-cysteine synthesis genes are seemingly up-regulated by many stresses. Thus, the authors focused on deleting two of the three L-cysteine synthesis genes (cysM and cysK2) in Mtb to better understand the roles of these genes in Mtb physiology.

      The cysM and cysK2 mutants display fitness defects in various media (Sautons media, starvation, oxidative and nitrosative stress) noted by CFU reductions. Transcriptional profiling studies with the cysM and cysK2 mutants revealed that divergent gene signatures are generated in each of these strains under oxidative stress, suggesting that cysM and cysK2 have non-redundant roles in Mtb's oxidative stress response which likely reflects the different substrates used by these enzymes, CysO-L-cysteine and O-phospho-L-serine, respectively. Note that these studies lack genetic complementation and are thus not rigorously controlled for the engineered deletion mutations.

      The authors quantify the levels of sulfur-containing metabolites (methionine, ergothioneine, mycothiol, mycothionine) produced by the mutants following exposure to oxidative stress. Both the cysM and cysK2 mutants produce more methionine, ergothioneine, and mycothionine relative to WT under oxidative stress. Both mutants produce less mycothiol relative to WT under the same condition. These studies lack genetic complementation and thus do not rigorously control for the engineered mutations.

      Next, the mutants were evaluated in infection models to reveal fitness defects associated with oxidative and nitrosative stress in the cysM or cysK2 mutants. In LPS/IFNg activated peritoneal macrophages, the cysM or cysK2 mutants display marked fitness defects which can be rescued with exogenous cysteine added to the cell culture media. Peritoneal macrophages lacking the NADPH oxidase (Phox) or IFNg fail to produce fitness phenotypes in the cysM or cysK2 mutants suggesting that oxidative stress is responsible for the phenotypes. Similarly, chemical inhibition of iNOS partly abrogated the fitness defect of the cysM or cysK2 mutants. Similar studies were conducted in mice lacking IFNg and Phox establishing that cysM or cysK2 mutants have fitness defects in vivo that are dependent on oxidative and nitrosative stress.

      Lastly, the authors use small molecule compounds to inhibit cysteine synthases. It is demonstrated that the compounds display inhibition of Mtb growth in 7H9 ADC media. No evidence is provided to demonstrate that these compounds are specifically inhibiting the cysteine synthases via "on-target inhibition" in the whole Mtb cells. Additionally, it is wrongly stated in the discussion that "combinations of L-cys synthase inhibitors with front-line TB drugs like INH, significantly reduced the bacterial load inside the host". This statement suggests that the INH + cysteine synthase inhibitor combinations reduce Mtb loads within a host in an infection assay. No data is presented to support this statement.

      We agree with the reviewer that the experiments do not conclusively prove that these compounds specifically inhibit the cysteine synthases via "on-target inhibition" in whole Mtb cells. However, the inhibitors used in this study have been previously profiled in vitro (https://www.sciencedirect.com/science/article/abs/pii/S0960894X17308405?via%3Dihub). We have modified the sentence to “a combination of L-cysteine synthase inhibitors with front-line TB drugs like INH, significantly reduced the bacterial survival in vitro”.

      References

      (1) Hatzios, S.K. and C.R. Bertozzi, The regulation of sulfur metabolism in Mycobacterium tuberculosis. PLoS Pathog, 2011. 7(7): p. e1002036.

      (2) Senaratne, R.H., et al., 5'-Adenosinephosphosulphate reductase (CysH) protects Mycobacterium tuberculosis against free radicals during chronic infection phase in mice. Mol Microbiol, 2006. 59(6): p. 1744-53.

      (3) Betts, J.C., et al., Evaluation of a nutrient starvation model of Mycobacterium tuberculosis persistence by gene and protein expression profiling. Mol Microbiol, 2002. 43(3): p. 717-31.

      (4) Hampshire, T., et al., Stationary phase gene expression of Mycobacterium tuberculosis following a progressive nutrient depletion: a model for persistent organisms? Tuberculosis (Edinb), 2004. 84(3-4): p. 228-38.

      (5) Schnappinger, D., et al., Transcriptional Adaptation of Mycobacterium tuberculosis within Macrophages: Insights into the Phagosomal Environment. J Exp Med, 2003. 198(5): p. 693-704.

      (6) Voskuil, M.I., et al., The response of mycobacterium tuberculosis to reactive oxygen and nitrogen species. Front Microbiol, 2011. 2: p. 105.

      (7) Voskuil, M.I., K.C. Visconti, and G.K. Schoolnik, Mycobacterium tuberculosis gene expression during adaptation to stationary phase and low-oxygen dormancy. Tuberculosis (Edinb), 2004. 84(3-4): p. 218-27.

      (8) Brunner, K., et al., Profiling of in vitro activities of urea-based inhibitors against cysteine synthases from Mycobacterium tuberculosis. Bioorg Med Chem Lett, 2017. 27(19): p. 4582-4587.

      (9) Manganelli, R., et al., Role of the extracytoplasmic-function sigma factor sigma(H) in Mycobacterium tuberculosis global gene expression. Mol Microbiol, 2002. 45(2): p. 365-74.

      (10) Burns, K.E., et al., Reconstitution of a new cysteine biosynthetic pathway in Mycobacterium tuberculosis. J Am Chem Soc, 2005. 127(33): p. 11602-3.

      (11) Manganelli, R., et al., The Mycobacterium tuberculosis ECF sigma factor sigmaE: role in global gene expression and survival in macrophages. Mol Microbiol, 2001. 41(2): p. 423-37.

      (12) Tyagi, P., et al., Mycobacterium tuberculosis has diminished capacity to counteract redox stress induced by elevated levels of endogenous superoxide. Free Radic Biol Med, 2015. 84: p. 344-354.

      (13) de Carvalho, L.P., et al., Metabolomics of Mycobacterium tuberculosis reveals compartmentalized co-catabolism of carbon substrates. Chem Biol, 2010. 17(10): p. 1122-31.

      (14) Agapova, A., et al., Flexible nitrogen utilisation by the metabolic generalist pathogen Mycobacterium tuberculosis. Elife, 2019. 8.

      (15) Khan, M.Z., et al., Redox homeostasis in Mycobacterium tuberculosis is modulated by a novel actinomycete-specific transcription factor. EMBO J, 2021. 40(14): p. e106111.

      (16) Brunner, K., et al., Inhibitors of the Cysteine Synthase CysM with Antibacterial Potency against Dormant Mycobacterium tuberculosis. J Med Chem, 2016. 59(14): p. 6848-59.

      (17) Wheeler, P.R., et al., Functional demonstration of reverse transsulfuration in the Mycobacterium tuberculosis complex reveals that methionine is the preferred sulfur source for pathogenic Mycobacteria. J Biol Chem, 2005. 280(9): p. 8069-78.

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) In Figure S1 it would be useful to include the reverse transsulfuration pathway given that it contributes to the L-cysteine pool, and that L-methionine was used for metabolite labelling experiments.

      We agree with the reviewer’s suggestion and have included reverse transsulfuration in Fig S1. Please note that labelling was not performed with L-methionine. We used ³⁴S derived from SO₄²⁻ to monitor the reductive assimilation of sulfur and its transit from S²⁻ to L-methionine, passing through cysteine. We specified in the Materials and Methods that we used sodium sulfate-34S (Merck 718882) as our labelled source of sulfur. This method was first employed in M. tuberculosis by the Bertozzi group to identify sulfolipids in mycobacteria. Therefore, we are not measuring transsulfuration but rather the direct synthesis of L-methionine via cysteine, and consequently we are indeed assessing the importance of cysK2 and cysM in this process. We have now added to the results section (page 9) that we employed (Na₂³⁴SO₄) for labelling, so that other readers will not think we are measuring transsulfuration.

      Author response image 1.

      (2) In Figure S2 it is unclear why the control is included in this figure given that the stress conditions were compared to the control. What is the control being compared to here?

      The heat maps of the controls are included to show relative gene expression in each independent replicate; the normalized counts for the differentially expressed genes are plotted. To better understand the RNA-seq results, we plotted the fold change of differentially expressed genes under the different stress conditions (new figure and table: Figure S3 & Table S2). This allowed us to examine the expression profile of these genes across all stress conditions simultaneously, regardless of whether they were identified as differentially expressed in a given condition. The data revealed that specific clusters of genes are up- and downregulated in the oxidative, SDS, and starvation conditions. In comparison, the differences observed in the pH 5.5 and nitrosative conditions were limited (Figure S3 & Table S2).

      (3) In Figure S3 it would be more informative to show fold-enrichment than gene counts in (b) to (f).

      In our opinion, gene counts are more informative when plotting GO enrichments, as the number of genes in each GO category can vary drastically. The significance values are already calculated from the fold enrichment of a category relative to the background, so the p-adj values plotted on the x-axis can serve as a proxy for fold enrichment. Hence, instead of plotting two related variables, plotting the total number of genes belonging to a category helps the reader understand the “scale” at which a category is affected.
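
      To make the relationship between gene counts, fold enrichment, and significance concrete, the following is a minimal sketch and not the enrichment tool used in the study. It assumes a one-sided hypergeometric test, a common choice for GO over-representation; the function names and all numbers are hypothetical.

```python
from math import comb


def fold_enrichment(k, n, K, N):
    """Fold enrichment of a category.

    k: differentially expressed (DE) genes in the category
    n: all DE genes
    K: genes in the category in the background
    N: all background genes
    """
    return (k / n) / (K / N)


def hypergeom_pvalue(k, n, K, N):
    """One-sided P(X >= k) when drawing n genes from N, K of which are in the category."""
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical background of 4000 genes with 200 DE genes.
N_BACKGROUND, N_DE = 4000, 200

# A small category: 5 of its 20 genes are DE.
small = (5, N_DE, 20, N_BACKGROUND)
# A large category: 40 of its 400 genes are DE.
large = (40, N_DE, 400, N_BACKGROUND)

for label, args in (("small", small), ("large", large)):
    print(label, "count:", args[0],
          "fold:", fold_enrichment(*args),
          "p:", hypergeom_pvalue(*args))
```

      With these made-up numbers the small category shows a 5-fold enrichment based on only 5 genes, whereas the large category shows a 2-fold enrichment based on 40 genes, which illustrates why the gene count conveys information that fold enrichment alone does not.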

      (4) Figure 1c standard Sautons is a defined media, and is not nutrient-limiting - the authors should clarify the composition of the media that they used here.

      The composition of the Sautons media used in the study is 0.5 g/L MgSO4·7H2O, 2 g/L citric acid, 1 g/L L-asparagine, 0.3 g/L KCl·H2O, 0.2% glycerol, 0.64 g/L FeCl3, 100 μM NH4Cl, and 0.7 g/L K2HPO4·3H2O. We have modified the sentence in line with the reviewer’s suggestion.

      (5) The authors claim that the distinct transcriptomes for the two mutants indicate that "CysM and CysK2 distinctly modulate 324 and 1104 genes". The effect is likely due to distinct downstream consequences of the deletions, rather than direct regulation by the synthases. This section should be reworded for clarity.

      We have modified the sentence in line with the reviewer’s suggestion.

      (6) In Figure 3 it would be useful to express mycothione levels as a percentage of the total mycothiol pool to give an indication of the extent to which the thiol is being oxidised.

      While we appreciate the reviewer’s suggestion, we cannot take ratios of ion counts (IC) for two different compounds, because they ionize differently: 100 ion counts of one compound do not equal 100 ion counts of the other.

      (7) Figure 6 is difficult to interpret as the concentrations used in the INH + inhibitor wells are not clear. It would be useful to indicate the concentrations of each compound added next to the wells in the figure.

      We have modified the figure and legends in line with the reviewer’s suggestion.

      Reviewer #2 (Recommendations For The Authors):

      (1) Document the cysM deletion.

      The details of CysM knockout generation have been previously published ([15]; Appendix Figure S4), and complementation strain details are provided in the methods section. 

      (2) The oxidative stress CHP is not defined in the figure legend.

      We have modified the legend in line with the reviewer’s suggestion.

      (3) Can we see the structures of the compounds?

      Please refer to Fig 6a for the structures of the compounds.

      (4) Fix the genetics and the paper is very interesting.

      I might be missing something. The authors do provide promising complementation data for several of the stresses. Provide evidence for the cysM deletion and complementation and the data will be very compelling. The focus of the paper is important for our understanding of the biology of Mycobacterium tuberculosis.

      Thank you for appreciating our study. The details of CysM knockout and complementation strain generation have been previously published ([15]; Appendix Figure S4 & Methods). CysK2 mutant and complementation strain details are included in the present manuscript (Figure 1b & Methods).

      Reviewer #3 (Recommendations For The Authors):

      The transcriptional profiling studies do not rigorously control for the engineered mutations using genetic complementation.

      The complementation strains used in all in vitro, ex vivo, and in vivo experiments show that the phenotypes associated with the knockouts are gene-specific. We chose not to include complementation strains in the RNA sequencing experiments because of the large number of samples to handle and the associated costs.

      Figure 3. These data are not rigorously controlled without genetic complementation, explain why some data in Figure 3 was generated at 24 hr and other data was generated at 48 hr, remove subbars in 3g. Please provide more clarification on Fig 3e-g because the normalization in these panels makes it appear as if there is little- or no-difference in the levels of 34S incorporation into the thiol metabolites.

      The complementation strains used in all in vitro, ex vivo, and in vivo experiments show that the phenotypes associated with the knockouts are gene-specific. We chose not to include complementation strains in the Figure 3 experiments because of the large number of samples to handle and the associated costs.

      The time points in the given experiment were chosen based on an initial pilot experiment. It is apparent that a longer duration is required to see the phenotypes associated with labelling compared to pool size. The differences observed are statistically significant. 

      Surfactant and SDS stress are used interchangeably in the text, legends, and figures. Please be consistent here.

      We have modified the text in line with the reviewer’s suggestion.

      Consider re-wording the 1st paragraph on page 5 to better clarify how Trp, Lys, and His interact with the host immune cells.

      We have modified the text in line with the reviewer’s suggestion.

      Cite the literature associated with the sulfur import system in Mtb on page 3 in the 2nd paragraph.

      We have modified the text in line with the reviewer’s suggestion.

      The manuscript nicely describes the construction of a cysK2 mutant. It is unclear how the cysM mutant was generated. Please clarify, cite, or add the cysM mutant construction to this manuscript.

      The details of CysM knockout and complementation strain generation have been previously published ([15]; Appendix Figure S4 & Methods). We have included the citation in the methods section of the current manuscript.

      Provide evidence that the small molecules used in Fig 6 are on target and inhibit the cysteine biosynthetic enzymes in whole bacteria. It is unclear how a MIC can be determined with these compounds in 7H9 ADC when deletion mutants grow just fine in this media. Is this because the compounds inhibit multiple cysteine synthesis enzymes and/or enzymatic targets in other pathways? To me, the data suggests that the compounds are hitting multiple enzymes in whole Mtb cells. Does cysteine supplementation reverse the inhibitory profiles with the compounds in Figure 6?

      As mentioned in the text, all the compounds were ineffective in killing Mtb, likely because L-cysteine synthases are not essential under regular growth conditions. Hence, the MICs of the cysteine synthase inhibitors were very high, C1 (0.6 mg/ml), C2 (0.6 mg/ml), and C3 (0.15 mg/ml), as opposed to the standard drug isoniazid, with an MIC of 0.06 μg/ml. We agree with the reviewer that the experiments do not conclusively prove that these compounds specifically inhibit the cysteine synthases via "on-target inhibition" in Mtb cells. The inhibitors used in this study have been previously profiled in vitro [8]. However, one cannot rule out the hypothesis that these compounds might also have some off-target effects.

    1. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      Cheong et al. use a synapse-resolution wiring map of the fruit fly nerve cord to comprehensively investigate circuitry between descending neurons (DNs) from the brain and motor neurons (MNs) that enact different behaviours. These neurons were painstakingly identified, categorised, and linked to existing genetic driver lines; this allows the investigation of circuitry to be informed by the extensive literature on how flies walk, fly, and escape from looming stimuli. New motifs and hypotheses of circuit function were presented. This work will be a lasting resource for those studying nerve cord function.

      Strengths:

      The authors present an impressive amount of work in reconstructing and categorising the neurons in the DN to MN pathways. There is always a strong link between the circuitry identified and what is known in the literature, making this an excellent resource for those interested in connectomics analysis or experimental circuits neuroscience. Because of this, there are many testable hypotheses presented with clear predictions, which I expect will result in many follow-up publications. Most MNs were mapped to the individual muscles that they innervate by linking this connectome to pre-existing light microscopy datasets. When combined with past fly brain connectome datasets (Hemibrain, FAFB) or future ones, there is now a tantalising possibility of following neural pathways from sensory inputs to motor neurons and muscle.

      Weaknesses:

      As with all connectome datasets, the sample size is low, limiting statistical analyses. Readers should keep this in mind, but note that this is the current state-of-the-art. Some figures are weakened by relying too much on depictions of wiring diagrams as evidence of circuit function, similarity between neuropils, etc. without additional quantitative justification.

      We thank the reviewer for their helpful comments. We are excited about the release of this densely reconstructed connectome and its potential to facilitate circuit exploration in the VNC. We note that while statistical methods for analyzing complicated networks such as the connectome are still being developed, the wiring diagrams presented are themselves visualizations of quantitative data. We address specific concerns below.

      Reviewer #2 (Public Review):

      Summary:

      In Cheong et al., the authors analyze a new motor system (ventral nerve cord) connectome of Drosophila. Through proofreading, cross-referencing with another female VNC connectome, they define key features of VNC circuits with a focus on descending neurons (DNs), motor neurons (MNs), and local interneuron circuits. They define DN tracts, MNs for limb and wing control, and their nerves (although their sample suffers for a subset of MNs). They establish connectivity between DNs and MNs (minimal). They perform topological analysis of all VNC neurons including interneurons. They focus specifically on identifying core features of flight circuits (control of wings and halteres), leg control circuits with a focus on walking rather than other limbed behaviors (grooming, reaching, etc.), and intermediate circuits like those for escape (GF). They put these features in the context of what is known or has been posited about these various circuits.

      Strengths:

      Some strengths of the manuscript include the matching of new DN and MN types to light microscopy, including the serial homology of leg motor neurons. This is a valuable contribution that will certainly open up future lines of experimental work.

      Also, the analysis of conserved connectivity patterns within each leg neuromere and interconnecting connectivity patterns between neuromeres will be incredibly valuable. The standard leg connectome is very nice.

      Finally, the finding of different connectivity statistics (degrees of feedback) in different neuropils is quite interesting and will stimulate future work aimed at determining its functional significance.

      We thank the reviewer for their constructive feedback, and are optimistic about the utility of the MANC connectome to the Drosophila neurobiology community in dissecting VNC circuit function.

      Weaknesses:

      First, it seems like quite a limitation that the neurotransmitter predictions were based on training data from a fairly small set of cells, none of which were DNs. It's wonderful that the authors did the experimental work to map DN neurotransmitter identity using FISH, and great that the predictions were overall decently accurate for both ACh and Glu, but unfortunate that they were not accurate for GABA. I hope there are plans to retrain the neurotransmitter predictions using all of this additional ground truth experimental data that the authors collected for DNs, in order to provide more accurate neurotransmitter type predictions across more cell types.

      The reviewer makes an excellent suggestion, and collecting further ground truth data and retraining the neurotransmitter classifier is an ongoing research project. 

      Second, the degradation of many motor neurons is unfortunate. Figure 5 Supplement 1 shows that roughly 50% of the leg motor neurons have significantly compromised connectivity data, whereas, for non-leg motor neurons, few seem to be compromised. If that is the correct interpretation of this figure, perhaps a sentence like this that includes some percentages (~50% of leg MNs, ~5% of other MNs) could be added to the main text so that readers can get a sense of the impact more easily.

      Thank you for this suggestion. We have added a line describing the percentage of leg and other MNs affected (L416-417).

      As well, Figure 5 Supplement 1 caption says "Note that MN groups where all members of the group have reconstruction issues may not be flagged" - could the authors comment on how common they think this is based on manual inspection? If it changes the estimate of the percentage of affected leg motor neurons from 50% to 75% for example, this caveat in the current analysis would need to be addressed more directly. Comparing with FANC motor neurons could perhaps be an alternative/additional approach for estimating the number of motor neurons that are compromised.

      We agree that a direct comparison to another dataset, such as FANC, would aid in identifying reconstruction issues. However, a full analysis is not currently possible as only a minority of FANC neurons have been proofread or annotated. We were able to gain some insights into reconstruction quality by looking at T1 motor neurons, where FANC MN reconstruction is more complete. As reported in the submitted manuscript, we were able to confidently match T1 MNs between FANC and MANC for all but one MN (we are missing one ltm MN on the right side of MANC). While some of the MANC neurons had smaller/less dense arbors than FANC, none of them would have been flagged as having reconstruction issues. However, for FANC, we observe that neurons on the right have less dense arbors and fewer reconstructed synapses than neurons on the left.  We have prepared a reviewer figure analyzing the consistency of synapse counts for the T1 (front leg) MNs:

      Author response image 1.

      In these results (MANC on the left, FANC on the right) we compare the number of input synapses on matched motor neurons on the left (LHS) and right hand side (RHS) of each dataset. We see that the MANC distribution is much more symmetric, indicating left and right hand side synapse counts for matched MNs are more similar in MANC. This is likely largely due to the left-right difference in reconstruction completeness in the FANC T1 leg neuropils. The number of synapses per cell type is also more variable in FANC. Overall, we recommend that end users should inspect the morphology and total synapse counts of individual MNs of interest in either dataset as part of any detailed analysis.

      This analysis might benefit from some sort of control for true biological variability in the number of MN synapses between left and right or across segments. I assume the authors chose the threshold of 0.7 because it seemed to do a good job of separating degraded neurons from differences in counts that could just be due to biological variability or reconstruction imperfections, but perhaps there's some way to show this more explicitly. For example, perhaps show how much variability there is in synapse counts across all homologs for one or two specific MN types that are not degraded and are reconstructed extremely well, so any variability in input counts for those neurons is likely to be biologically real. Especially because the identification of serial homologs among motor neurons is a key new contribution of this paper, a more in-depth analysis of similarities and differences in homologous leg MNs across segments could be interesting to the field if the degradation doesn't preclude it.

      We agree that there can be ambiguity in whether variability in synapse counts between left-right homologs of a MN type represents biological variability or technical issues. We have added a comparison of synapse counts of T1 leg MNs in MANC (Left) vs FANC (Right) as noted in the previous point. As the number of connectomes available to us increases, we will have a better idea of how synapse counts of MNs vary within and between animals.
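
      The left-right comparison discussed above can be sketched as follows. This is an illustrative sketch only: it assumes that the 0.7 threshold is applied to the ratio of the smaller to the larger input synapse count of a matched left-right pair, which is not confirmed here, and the motor neuron names and counts are hypothetical.

```python
def symmetry_ratio(left_count, right_count):
    """Ratio of the smaller to the larger synapse count (1.0 means perfectly symmetric)."""
    larger = max(left_count, right_count)
    if larger == 0:
        return 1.0  # no synapses on either side; treat as symmetric
    return min(left_count, right_count) / larger


def flag_asymmetric(pairs, threshold=0.7):
    """Return the names of matched MN pairs whose ratio falls below the threshold.

    pairs: dict mapping an MN type name to (left_count, right_count).
    """
    return [name for name, (left, right) in pairs.items()
            if symmetry_ratio(left, right) < threshold]


# Hypothetical input synapse counts for matched left/right motor neurons.
example_pairs = {
    "MN_type_A": (1000, 950),   # near-symmetric
    "MN_type_B": (1000, 500),   # strongly asymmetric
    "MN_type_C": (820, 1200),   # asymmetric in the other direction
}

print(flag_asymmetric(example_pairs))
```

      A ratio of this kind cannot by itself distinguish biological variability from reconstruction problems, and it misses groups in which both sides are degraded to a similar extent, so inspecting morphology remains necessary.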

      Fourth, the infomap communities don't seem to be so well controlled/justified. Community detection can be run on any graph - why should I believe that the VNC graph is actually composed of discrete communities? Perhaps this comes from a lack of familiarity with the infomap algorithm, but I imagine most readers will be similarly unfamiliar with it, so more work should be done to demonstrate the degree to which these communities are really communities that connect more within than across communities.

      A priori we expect that there is some degree of functional division between circuits controlling different limbs or motor systems, given current evidence that VNC neuropils and neural hemilineages are relatively specialized in controlling motor output. We have added this explanation to section 2.4.2 (L633-635).

      The Infomap algorithm was chosen out of several directed and undirected community detection methods that we tried, as it defined communities that each had connectivity with narrow and specific motor neuron subclasses. For example, it labeled populations in each of the six leg neuropils as belonging to distinct communities. We think this provides an interesting partitioning of the VNC network that could have biological relevance (which future functional studies should investigate). To the reviewer’s final sentence, we do show intra- vs inter-community connectivity in Fig. 9–supplement 1B. Notably, most communities except several small ones have far more intra-community connectivity than inter-community connectivity. We have added text highlighting this observation (L656-658).

      We do, however, agree with the general point of the reviewer that it is not yet known which community detection methods are ‘optimal’ for use with connectomics data, so we have added further text (L679-683) explaining that community detection in MANC will require further investigation and validation in the future.
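
      The intra- versus inter-community comparison mentioned above can be illustrated with a short sketch. It does not run Infomap; it only takes community labels from any detection method and checks how much of each community's connectivity stays within it. The edge list, labels, and function name are hypothetical, and the exact normalisation used in the manuscript's figure may differ.

```python
from collections import defaultdict


def intra_community_fraction(edges, community):
    """Fraction of each community's synaptic weight that stays inside it.

    edges: iterable of (pre, post, weight) directed connections.
    community: dict mapping each neuron to its community label.
    An edge between two communities counts towards the total of both.
    """
    intra = defaultdict(int)
    total = defaultdict(int)
    for pre, post, weight in edges:
        c_pre, c_post = community[pre], community[post]
        total[c_pre] += weight
        if c_pre == c_post:
            intra[c_pre] += weight
        else:
            total[c_post] += weight
    return {c: intra[c] / total[c] for c in total}


# Hypothetical toy network: neurons a, b in community 0; c, d in community 1.
toy_edges = [
    ("a", "b", 10),
    ("b", "a", 5),
    ("a", "c", 2),   # the only inter-community edge
    ("c", "d", 8),
]
toy_labels = {"a": 0, "b": 0, "c": 1, "d": 1}

print(intra_community_fraction(toy_edges, toy_labels))
```

      Values close to 1 indicate that a community connects mostly within itself; comparing them against the same statistic on shuffled labels would be one way to judge whether the partition is stronger than expected by chance.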

      I think the length of this manuscript reduces its potential for impact, as I suspect the reality is that many people won't read through all 140 pages and 21 main figures of (overall excellent) work and analysis.

      We intend this paper to serve not only as a first look into the organization of descending-to-motor circuits, but also as a resource for future investigations in MANC. The provided detail is intended to serve these purposes.

      Reviewer #1 (Recommendations For The Authors):

      General comments:

      I find that there are too many main figures with too much content in them, as well as too much corresponding text. Much of the initial anatomical identification and description could be summarised in fewer main figures, with more supplementary figures if the authors desired. I think there is a lot of great insight in this paper, particularly in the second half, but I am concerned that the extensive detail in the initial sections may challenge reader engagement through to the later sections of the paper. It would also be useful to have a higher level and shorter discussion.

      Reiterating our response from above, we intend this paper to serve not only as a first look into the organization of descending-to-motor circuits, but also as a resource for future investigations in MANC. The provided detail is intended to serve these purposes.

      There is sometimes an over-reliance on wiring diagrams or complex plots as evidence without further quantification. I will mention several examples below, as well as additional suggestions.

      Specific comments:

      In Figure 2E, how are DNs divided into pair vs population type? This was a very interesting idea, particularly in light of "command-like" neurons vs ensembles of DNs controlling behaviour. However, it is not clear how this distinction is made. This concept is referenced throughout the manuscript, so I think a clear quantitative way of identifying "pair" vs "population" identity for each DN would be very useful. And at the very least, a thorough explanation of how it is done in the current manuscript.

      We have added additional text in the Figure 2 legend to point towards Materials and Methods where the DN grouping (pair vs. population) is explained. These groups were formed based on morphology and further split into types based on connectivity, if needed. However, as the connectome represents a static snapshot of connectivity with no functional data, it remains possible that some DNs that were grouped as populations may act functionally as multiple pairs. Future work should continue to update these annotations.

      In Figure 4, there are some inconsistencies between neurotransmitter predictions and experimental FISH data. Have the authors taken into consideration Lacin et al. 2019 (https://elifesciences.org/articles/43701)? Specifically in that paper, it is stated: "We did not find any cases of neurons using more than one neurotransmitter, but found that the acetylcholine specific gene ChAT is transcribed in many glutamatergic and GABAergic neurons, but these transcripts typically do not leave the nucleus and are not translated." I wonder if this might explain some of the inconsistencies between FISH (mRNA detection) and the neurotransmitter predictions (presumably based on indirect protein structures detected via EM imagery), or the presence of so much co-transmission.

      We agree and have added this possible explanation for apparent co-transmission in the text (L394-397).

      In Figure 8B, the authors state: "We found that individual DN and MN subclasses have direct downstream and upstream partners, respectively, that are relatively hemilineage-restricted (Figure 8B)." While the connectivity patterns highlighted are intriguing, further quantitative analysis could help strengthen this point. The connectivity matrices in Figure 8B are linked to activation phenotypes and hemilineages below. But I don't really know how to interpret "relatively hemilineage-restricted" in light of this plot. How does this connectivity pattern for example compare statistically to a randomly selected set of DNs (maintaining the same group size for example)? Would random DN sets be less hemilineage restricted? Similar quantification would be helpful to support this statement "...with high correspondence between the hemilineages connected to individual DN and MN subclasses that are expected to be functionally related."

      "both upper tectulum DNs (DNut) and wing MNs (MNwm) have significant connectivity with hemilineages 6A, 7B, 2A, 19B, 12A and 3B". What is significant connectivity? Looking at the plot in Figure 8B, why is DNut -> 16B not considered significant? Is there a threshold and if so, what is the justification?

      These plots aim to be descriptive rather than drawing hard quantitative thresholds between ‘significant’ and ‘non-significant’ connectivity. We have revised the text to remove the terms ‘restricted’ and ‘significant’ and to clarify our interpretation (L555-559).

      In Figure 9G-H, this is a very interesting finding, but how do we know that the difference is real? Why not do a statistical test to compare the brain and VNC? Or create a null model network with edge swaps, etc. to compare against.

      Statistical comparison between the brain and VNC may be problematic given differences in how these connectomes were generated, as well as missing connectivity (only half the brain is imaged) in the hemibrain connectome. Comparison to a null model is possible and, for the purpose of understanding motif frequency in general, has already been done (see, for example, Lin et al., 2024, Nature). However, a null or shuffled model is not required for comparing motif frequencies between brain and VNC neuropils, which is the point of this particular graph. At present, we simply highlight a qualitative observation that will require future work to investigate.

      Referring to Figure 12 in the main text, "we observe that the power MN upstream network is largely shared among all power MNs and is highly bilateral." Quantifying the fraction of shared upstream neurons from power MNs would make this statement much stronger. Particularly if compared to other non-power MNs. Or potentially using some other network comparison metric.

      This is a good point. We have added cosine similarity to figure 6 for wing/haltere MNs to show the similarity between inputs across these MNs, and added text in section 2.3 (L461-465) and 2.5.3 discussing the cosine similarity (L987-988).
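
      For readers unfamiliar with the metric, a minimal sketch of cosine similarity between MN input profiles is given below. It assumes an input matrix with one column per MN and one row per upstream neuron; the function name `input_cosine_similarity` and the toy counts are hypothetical and do not come from the MANC data.

```python
import numpy as np


def input_cosine_similarity(inputs):
    """Return the MN-by-MN cosine similarity of input profiles.

    inputs[i, j] is the synapse count from upstream neuron i onto MN j
    (assumed layout).
    """
    inputs = np.asarray(inputs, dtype=float)
    norms = np.linalg.norm(inputs, axis=0)
    # Leave all-zero columns unscaled to avoid division by zero.
    norms[norms == 0] = 1.0
    unit = inputs / norms
    return unit.T @ unit


# Three hypothetical MNs: the first two share upstream partners, the third does not.
toy_inputs = [[10, 20, 0],
              [5, 10, 0],
              [0, 0, 8]]
print(np.round(input_cosine_similarity(toy_inputs), 3))
```

      Values near 1 indicate MNs whose inputs are distributed across upstream neurons in the same proportions, regardless of the total synapse count.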

      In Figure 13B, "Nearly 50% of these restricted neurons (totalling about 1200 per leg neuropil) have been serially matched across the six neuropils (Figure 13B)". There seems like a disconnect here. In the IR, CR, and BR columns, I see ~2750, ~500, and ~1250 neurons not in a serial set (~4500 total); I see ~1500, ~750, and ~1000 in a serial set (~3250 total). This would mean that ~58% of neurons are not in serial sets, ~42% are in serial sets. Shouldn't the conclusion be the opposite then? That surprisingly most intrinsic neurons are not repeated across leg neuropils. I find this fascinating if true. Perhaps there is some confusion on my part, however.

      We now find that about half of the leg-restricted neurons are serially repeated across the six leg neuropils with similar morphology and connectivity, especially to the downstream leg motor neurons. Since the first submission of this paper, we have identified some additional serial homologues while completing the systematic cell typing described in the accompanying paper, Marin et al. 2024. Figure 13B has now been updated to reflect this. In total, 3998 of 7684 restricted neurons (IR, CR, BR) have been assigned to a serial set or serial type. The sentence in the text has been adjusted to report that 52% of these restricted neurons are in serial sets (L1125).

      In Figure 13D-E, "the Tect INs are not a homogenous population." Providing additional evidence could strengthen this statement. A connectivity matrix is shown in (D), followed by examples of morphologies in (E). What makes a population homogenous or heterogenous? For example, compared to all possible INs, the Tect IN morphology actually looks quite similar. Are those connectivity matrices in (D) really so different? What would a random selection of neurons look like?

      Our sister paper, Marin et al. (2024), has looked into variation of connectivity across neurons of the entire VNC in much more detail, including clustering methods that include connectivity and other criteria for cell typing. Thus, we have now amended the text to direct the reader to that paper for more detail on variability of connectivity in the Tect INs, which were divided into 5 cell types in Marin et al. (2024) (L1027-1031). In addition, we have replaced our clustering by connectivity in Figure 13 with the cell type clusters from Marin et al. (2024).

      In reference to Figure 13 - Supplement 1, "This standard leg connectome was very similar across legs, but there were small deviations between T1, T2, and T3 legs, as shown in Figure 13-Supplement 1." - what makes a deviation considered small? T1 seems to generally have many more synapses, T2 many less, and T3 a mixture depending on the connection. Also, are there lost connections or new connections? A quantification of these issues would be helpful instead of simply depicting the wiring diagrams.

      The connections that differ are likely due to the reconstruction state of leg MNs. We have now stated this in the main text for clarification (L1143-1145). In the leg neuropils, T2 and T3 left hand side MNs have sparser dendritic arbors than the right hand side. Therefore the differences in Figure 13–Supplement 1, which are almost exclusively the connections between the leg restricted neurons onto leg MNs, seem stronger in T1. Future work, bolstered by additional datasets, will undoubtedly reveal further insight into the comparison of circuits for the different legs.

      In Figure 15 - Supplement 2, "We used effective connectivity to identify leg DNs with similar MN connectivity patterns (Figure 15-Supplement 2). Of previously identified DNs, we found that DNg13 showed a highly similar effective connectivity fingerprint."

      How was this similarity calculated? How do we know these particular DNs have similar effective connectivity? The connectivity matrix depicted is quite complex, with both layer and connectivity scores quantified at each location. A principled way of determining similarity would make this statement much stronger.

      The similarity was calculated simply as the Euclidean distance between the effective connectivity matrix for each DN onto the set of MNs. While this is a straightforward comparison mathematically, effective connectivity calculations (first introduced in this context in Li et al., 2020, by our collaborators Larry Abbott and Ashok Litwin-Kumar) have not yet been subject to functional validation. We therefore agree with the reviewer that this should not be overinterpreted at this point. Future functional work should explore hypotheses suggested here and more quantitatively compare the similarity of different DN-MN pathways.
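
      A minimal sketch of this kind of comparison is shown below, assuming each DN is summarised by a vector of effective connectivity values onto a fixed, ordered set of MNs. The DN names, the values, and the function `rank_by_euclidean_distance` are hypothetical placeholders, not the data or code behind Figure 15-Supplement 2.

```python
import numpy as np


def rank_by_euclidean_distance(fingerprints, reference):
    """Rank DNs by Euclidean distance to a reference DN's fingerprint.

    fingerprints maps a DN name to its effective connectivity values onto
    the same ordered set of MNs (assumed layout). Smaller distance means
    a more similar fingerprint.
    """
    ref = np.asarray(fingerprints[reference], dtype=float)
    distances = {
        name: float(np.linalg.norm(np.asarray(vec, dtype=float) - ref))
        for name, vec in fingerprints.items()
        if name != reference
    }
    return sorted(distances.items(), key=lambda item: item[1])


# Hypothetical fingerprints over three MN groups.
toy_fingerprints = {
    "DN_a": [0.8, 0.1, 0.0],
    "DN_b": [0.8, 0.1, 0.3],
    "DN_c": [0.0, 0.1, 0.6],
}
for name, distance in rank_by_euclidean_distance(toy_fingerprints, "DN_a"):
    print(name, round(distance, 3))
```

      Because Euclidean distance is sensitive to overall magnitude as well as pattern, two DNs with proportionally identical but differently scaled fingerprints would not be ranked as identical under this measure.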

      Minor notes:

      In Figure 4E, the circles, squares, and triangles in the figure legend are too small. This is also true to some extent in the plot itself.

      We have increased the size of the symbols in the legend and plot.

      In Figure 8E right, the figure legend and x/y axes are not clear to me. Unfortunately, I'm not sure what the plot is showing because of this.

      The right plot in figure 8E is the number of DN groups each MN group receives input from, at a threshold of 1% input. As this plot is redundant to the left plot, we have decided to remove it.

      In Figure 8I, it would be interesting to see which neurons are directly downstream of DNs. One can't see layers 2/3/4 with the fan-out expansion of neurons and the y-axis scale.

      We have revised the plot to better show cell composition of individual layers.

      In Figure 19E, it would be helpful to also have a standard y-axis.

      The panel has been revised accordingly.

      Reviewer #2 (Recommendations For The Authors):

      General:

      In the Title, you do not mention DNs or MNs but these are a major focus of this study. The title could be more descriptive of the work.

      Per the reviewer’s comments, we have revised the title to “Transforming descending input into motor output: An analysis of the Drosophila Male Adult Nerve Cord connectome”.

      A glossary would be helpful, where all the paper's abbreviations and their definitions are provided in one place. Perhaps a hierarchical structure would help (for at least part of the glossary), so that terms like NTct, WTct, and HTct could be nested underneath UTct, for example.

      We do include a glossary in the sister paper, Marin et al. (2024) and in this paper have included a short glossary in the first Figure. Please refer to these sources for abbreviation reference.

      Introduction:

      Define 'Premotor'.

      We have defined ‘premotor circuits’ to be ‘circuits that directly or indirectly control motor output’ in lines 45-46.

      It might be worthwhile to start with a broader introduction sentence than the current one that focuses just on the fly, in order to emphasize the impact of MANC as the first complete connectome of a motor circuit in any animal with limbs or wings.

      We have revised the introductory paragraph per the reviewer’s suggestions.

      "Muscles in the leg are not innervated uniformly; indeed, in the T1 legs the number of MNs per muscle varies by as much as an order of magnitude" needs to specify the axis of variability more clearly - the authors probably mean variability across muscles in the leg (not variability across individuals for example) but I think the current sentence is a bit ambiguous in that respect.

      We have reworded this sentence to clarify this point (L132-133).

      Line 182 end of paragraph: It would be useful to point out explicitly what makes the MANC project valuable in the context of a similar FANC project - for example, that the MANC connectome is more complete, is a male (so interesting for anyone interested in sexual dimorphism), and gives the field an n=2 for VNC connectome datasets.

      We agree, and have added a sentence describing the benefits of the MANC connectome on L209-212.

      Line 213: A brief phrase or sentence of context could be provided to help unaware readers understand that 42% of synaptic connectivity being captured is in the same sort of range as previous datasets like the hemibrain and likely leads to the vast majority of important cell-cell connections being identified (perhaps cite Buhmann et al 2021 Nature Methods which does an analysis of this), and therefore is a reason to think highly of this dataset's quality and its potential for impact on the field. The sentence at the end of this paragraph doesn't quite do it for me.

      We have added the comparison of MANC synapse completeness to that of the Hemibrain, and revised the ending sentence in L234-237.

      Line 271: Clarify what happened to the remaining 15% of DNs that weren't able to be assigned to a tract. They travelled outside the tracts, or data quality issues prevented assignment, or something else?

      Indeed, some DNs could not be assigned to a tract as they traveled outside of all axon tracts and did not bundle with other DNs. We have added this explanation to the text (L300-301).

      Figure 1:

      The pie chart "DN postsynaptic partners by neuron class" is a bit hard to interpret without having another pie chart next to it showing "Neurons in MANC by neuron class". I know these numbers are written on the schematic but it would be nice to be able to easily tell which cell classes are overrepresented or underrepresented in the set of postsynaptic partners of DNs. e.g. It's obvious that ANs are overrepresented and DNs are underrepresented in the set of postsynaptic partners of DNs, but it would be nice if readers didn't have to do any mental math to figure out if INs or MNs are under/overrepresented.

      We agree and have added a pie chart of the neuron class composition of the entire VNC to Figure 1.

      "35.9% of leg MNs are matched to FANC" Why is this number so low? Because FANC motor neurons were only identified in T1, so the remaining 2/3rds of leg MNs in MANC weren't matched? How successful was matching for the neurons where it was actually attempted?

      For this work, we only matched the T1 neurons across the two datasets. This was both a way of checking that we found everything in these segments and a way of being more sure of muscle target assignments as our collaborators in the FANC dataset had generated extensive light level data to match motor neurons with their target leg muscles. The T2 and T3 MNs were not fully proofread or identified in FANC, precluding further analysis, and leading to the 35.9% matched number. We hope to be able to compare between these datasets more thoroughly in future, and have matched all the premotor leg restricted intrinsic neurons of our standard connectome to FANC. We report on their stereotypy in our latest preprint, Stürner, Brooks et al. 2024.

      Figure 2:

      Figure 2A: Perhaps darken the color of the MTD-III skeletons. Currently, they're so light it's hard to see, and this is one of the most interesting tracts because the claim is that it's a new tract.

      We take the reviewer’s point, however, the color scheme used for the tracts in Figure 2 is coordinated between multiple figures and figure panels, and thus we would prefer to keep it as is. If readers would like to examine DNs of a particular tract, we encourage them to retrieve said DNs using the tract annotations in NeuPrint.

      Figure 2 supplement 1: It's not clear to me what I should be getting out of seeing the right side DNs as well. If you want readers to be able to visually compare the left and right side morphologies and appreciate the high degree of symmetry, you may want to put the left and right side DN panels side-by-side. Perhaps do that (show both the left and right side DNs) for one or two tracts in the main Fig2, and then leave out the remaining panels - or if you want to include the remaining panels, explain more clearly what readers are supposed to learn from seeing them.

      We agree and have now removed Figure 2 supplement 1.

      Figure 2C caption: Instead of "DN primary neurites" I think the authors probably mean "longest single branch of each DN" or something along those lines. I think "primary neurite" is usually used to refer to the thick non-synaptic branch coming out of a neuron's soma, which can't be how it's being used here.

      We agree and have changed all references to ‘primary neurite’ for DNs to ‘longest neurite’.

      Figure 2D+E: Perhaps add an overall % of neurons of each class to the legend. I ask because I would be very interested to know what % of all DNs exist as single pairs versus as populations, and I imagine that could be a number that is quoted a fair amount by others in the field when talking about DNs.

      We agree and have added the overall percentage of each neuron class to the results (L275-276) and Figure 2 legend.

      Figure 3:

      UTct.IntTct neurons are by far the largest class of DNxn neurons, so would it be worth calling these the DNxt class (DN projecting to some combination of tectulum neuropils), to mirror the DNxl class? I would vote for doing that.

      Thanks for the suggestion. However, the subclass naming scheme for DNs had been coordinated between multiple groups of people working on MANC reconstruction and annotation. As making changes to subclasses will impact many analyses that have already been completed for existing work, we will refrain from doing so.

      Figure 3G feels a bit out of place in this figure and under-explained

      We have clarified in the text our citations to Figure 3G to better explain our interpretation of this data.

      Figure 4

      "DNp20 has few vesicles and may be electrically coupled": If I'm correct that DNp20 is also known as DNOVS1 and is the second largest diameter axon in the neck after the giant fiber, then yes, Suver et al. 2016 J Neurosci show that this DN is gap junction coupled to neck motor neurons (see their Fig 2F). This neuron (along with the giant fiber) is enough of an outlier that it might be more representative to show a different, more canonical DN that has a low prediction probability.

      The reviewer is right that DNp20 is also known as DNOVS1 with known gap junction coupling. We now clarify in the text (L366) how we think that could lead to a lower neurotransmitter prediction score, which is what we were trying to illustrate.

      Figure 4E: It looks like only a single DN has more inputs (~11000) than outputs (~9000), is that right? It could be interesting to dedicate some panels and text to the connectivity profile of that one unique neuron.

      Yes, that is correct: there is just one pair of DNs, DNxn166, that receives more input than it gives output (the two triangles lie on top of each other). We think that the other DN pair in that same box (more variable in total synapse number, so the triangles are further apart) also receives an unusually high amount of input versus output. The morphologies of these two types are shown in Figure 4F, and both have fine processes that look more like dendrites, especially when compared to other DNs such as the ones in 4G. Unfortunately, neither of these two types has been matched to light microscopy images, so we cannot say at this time whether they have the same type of morphology in the brain, nor can we further explore their brain connectivity.

      Figure 4E: "black rectangle ... gray rectangle" don't look different shades to me. It's obvious which is which based on where they are in the graph but if you want to color code this, pick more separate colors. Or code it with something other than colors.

      We have made the rectangle in Figure 4E a lighter shade of grey and added labels to refer to the panels D, F and G. The figure legend now also describes more clearly that we are plotting every DN as a single shape and exactly how many DN types are included in those rectangles to avoid confusion.

      Figure 5:

      "subclass is their two-letter muscle anatomical category" should be explained better, I'm not sure what "muscle anatomical category" means.

      We have changed the wording in the Figure 5 legend to better clarify that MN subclasses are the broad muscle category that they innervate (e.g. legs, wings).

      Figure 7:

      Leg MN identification and serial homology.

      Why are there no tarsus reductor (tarm1 and tarm2) motor neurons? Do we not know their anatomy from light microscopy well enough, perhaps? Were these MNs identified in FANC? Is it reasonable to guess that the remaining small number of unidentified T1 leg motor neurons in MANC would control these muscles? I think Marta Moita's lab has some ongoing projects on these muscles (see Twitter), so if more LM data is needed perhaps it will come from them.

      We now know that the small number of unidentified T1 leg motor neurons (a T1 pair with a serial T2 pair, serial set 17664) are not in fact MNs. A new and unpublished dataset (Janelia whole male CNS volume, the optic lobe from which has been published as Nern et al., 2025) shows they have axons within the VNC. The MN annotation for these neurons has been removed and they now have the type name INXXX471. Thus, we have no T1 leg MNs without a muscle target annotated. Our muscle target annotation comes from matching to the FANC dataset that has also not annotated tarsus reductor MNs. We suspect that the tarsus reductor MNs are hard to distinguish from the tarsus depressor MNs of which there are 5 per side and segment.

      It seems there are a few more leg motor neurons in MANC vs FANC. Any indication of which muscles they control?

      See above.

      Figure 7E: A qualitative comparison between the cosine similarity results here and from FANC could be useful. What generally is the same versus different? Any indication of male/female differences?

      We observe no differences in the cosine similarity of T1 leg MNs between MANC and FANC and only very minor differences between T1, T2 and T3, as shown in Figure 7. In our most recent work, now on bioRxiv (Stürner, Brooks et al., 2024), we were able to find all intrinsic leg serial sets that we included in our standard leg premotor circuit here in the FANC dataset. We do not see any differences between them in terms of morphology, and while we have several cases in which we are still missing 1 of the 6 neurons in a serial set in FANC, we see similar connectivity when comparing small circuits. We have also found almost all neurons interconnecting the legs, with some very interesting exceptions, mainly coming from the abdomen, that we believe are male specific. These male-specific neurons can also be found in this preprint (Stürner, Brooks et al., 2024).

      Figure 8

      Figure 8A: Why are ~1/3rd of the wing and leg motor neurons considered populations instead of pairs? I thought essentially all wing and leg motor neurons have unique morphologies.

      Pair vs populations are assigned based on MN morphology and connectivity. For the wing MNs, many sets of DVMns and DLMns have near-identical morphology and connectivity, are not easily distinguishable in the VNC and are categorized as a ‘population’. For the leg MNs, there are ‘true’ population MN types that provide multiple innervation of the same muscle.

      The text states "up to a maximum of 20% [traversal probability] (corresponding to a synapse input fraction of 1)" but I interpret the bottom of Figure 8G to have flipped values, where a synapse input fraction of 0.2 yields a traversal probability of 1. Is there a mistake here or have I misunderstood?

      Thank you for pointing this discrepancy out. The text description was indeed flipped, and we have corrected this error.

      Caption for J says "Layers without neurons are omitted". How is it possible to have a layer without neurons?? Something about how the traversal is done doesn't seem to be explained clearly enough. If it's really possible to have a layer without neurons, I think the approach might need to be revisited as this seems quite strange.

      Here, ‘layer’ should be viewed as a nonlinear measure of indirect connectivity combining path length and synaptic weights. Layers without neurons are possible due to the details of the calculation–layer position is assigned probabilistically by the downstream synapse connectivity of the source neurons, and the probability is scaled up to 1 at an input synapse fraction of 0.2. Neuron-to-neuron connectivity of an input synapse fraction of >=0.2 is very rare in the VNC connectome and thus neurons strictly assigned to layer 2 downstream of each DN type are similarly rare. We have updated the figure legend for figure 8 to better explain this.
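
      The scaling described above can be sketched as follows. This is an illustration under stated assumptions, not the traversal code used for Figure 8: we assume the probability rises linearly with input synapse fraction until it saturates at 1, and the function name `traversal_probability` and its `saturation` parameter are hypothetical.

```python
import numpy as np


def traversal_probability(input_fraction, saturation=0.2):
    """Map an input synapse fraction to a per-step traversal probability.

    Assumes a linear ramp that reaches 1 at `saturation` (0.2 in the text)
    and stays at 1 for larger fractions.
    """
    fractions = np.asarray(input_fraction, dtype=float)
    return np.clip(fractions / saturation, 0.0, 1.0)


# A connection supplying 5% of a neuron's input is traversed with a
# probability below 1, whereas one supplying 20% or more is always traversed.
for fraction in (0.01, 0.05, 0.2, 0.5):
    print(fraction, float(traversal_probability(fraction)))
```

      Under this scheme a neuron is placed in the layer immediately downstream of a source with certainty only when a single connection reaches the saturation fraction, which is why a layer can end up with no neurons strictly assigned to it when such strong connections are rare.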

      Section 2.6

      "flies have been shown to walk normally without proprioceptive feedback, suggesting that inter- and intra-leg coordination is not strictly dependent on sensory feedback loops from the legs" is quite a drastic overinterpretation of that paper's results. The ablation there was not complete (some subtypes of sensory neurons were not perturbed), and the perturbed flies certainly walked with some defects. This statement certainly should be removed or significantly softened.

      Thank you for pointing this detail out. The term ‘normally’ has been removed from this sentence to soften the statement.

      Figure 13, Standard leg connectome

      Unfortunately, the motor neurons controlling the tarsus could not be included here, I suppose due to the difficulty in identifying the T2 and T3 homologs for these motor neurons. This should be mentioned in the text. This version of the standard leg connectome is without a doubt still an incredibly valuable discovery, but readers should be made aware that this version of the standard leg connectome does in fact lack the motor neurons for one joint.

      The MNs controlling the tarsus could not be matched with high confidence. We have added a sentence pointing this out when the leg circuit is introduced (L1141-1142).

      The focus here is on locomotion in the absence of other behaviors, whereas the legs are also responsible for grooming, reaching, boxing, etc. How should we consider the leg connectome in light of this?

      This is a very good point, and we have indeed found known grooming neurons that target our leg premotor circuit (L1158-1161). We’ve now added this observation to the Discussion (L1949-1951).

      Minor points

      L84 - re: Descending neurons work together - cite Braun et al., bioRxiv 2023; cite Yang HH bioRxiv 2023.

      We agree that these papers are relevant to the function of DNs in combination, and have added them to the introduction (L83-84, 86-87).

      L193 - "intrepid" is overly florid language; similar for L1507 "enigmatic".

      We have replaced these words with suitable synonyms.

      L273 - The acronym "ITD" is not explained. Please check all other acronyms. Related, it would be good to include a Table or Box with all acronyms for the reader.

      We have added the full name of the ITD to the text. A glossary is available in Figure 1, and a full glossary of MANC terms is available in Table 1 of our sister paper, Marin et al. 2024.

      L514, you state that hemilineages 6A and 6B unexpectedly produce uncoordinated leg movements (flight-related was expected). However, Harris didn't study animals in tethered flight but headless on the ground.

      The experimental setup of Harris et al. was capable of assessing flight-like motor output even if not true flight, as seen in the predominantly wing movement phenotypes of activating hemilineages 7B, 11A/B and 2A. We now also note that hemilineage annotation in Marin et al., 2024, shows that the 6B hemilineage has some projections into the leg neuropils, in support of a leg motor role in addition to an upper tectular role (L570-571).

      L1425 - "the TTM" is repeated twice.

      This sentence addresses both the TTM and its MN (TTMn). We have revised this sentence to improve clarity by expanding the full name of TTM in that paragraph and leaving TTMn abbreviated.

      L1728 - Ascending neuron projections to the brain - cite Chen et al., Nat Neuro 2023.

      We agree that Chen et al. 2023 is relevant to the discussion of AN function, and have added this citation (L1836-1838).

      L1817, It is a good idea to compare with previous predictions for circuit control. But these originate from non-Drosophila work as well. Please cite and consider the original models from Buschges, Cruse, Holmes, and others.

      Thanks for the suggestion. We now cite the non-Drosophila literature as well (L1971).

      L1827, how precisely should these "theories" be updated? Be explicit.

      In the preceding sentences, we summarize what differs from one of the suggested models. We have now additionally added examples to the sentence (L1942-1945) to suggest that theoretical leg circuits need to account for the posterior-to-anterior as well as anterior-to-posterior connections between leg neuropils, as well as the relative lack of connectivity between the left and right mesothoracic leg neuropils.

      L1831, include a discussion about another alternative which is through mechanical coupling and sensory feedback.

      We agree that leg sensory input likely contributes to leg locomotor circuits. We have added a sentence pointing out that annotations of sensory neurons in MANC are available through work in a companion paper (Marin et al. 2024), and that future work is necessary to examine the contribution of sensory input to leg motor circuits (L1954-1956).

      Methods

      https://flyconnectome.github.io/malevnc/ link doesn't work.

      We have updated the link.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      The study presents valuable findings on the role of RIPK1 in maintaining liver homeostasis under metabolic stress. Strengths include the intriguing findings that RIPK1 deficiency sensitizes the liver to acute liver injury and apoptosis, but because the conclusions require additional experimental support, the evidence is incomplete.

      We are truly grateful to the reviewer and the editor, and wish to express our sincere appreciation for the time and effort spent in reviewing our manuscript. We highly appreciate the thorough and constructive comments, which have greatly helped to improve our manuscript. We have conducted new experiments to address the reviewer’s concerns. We also carefully checked and revised our manuscript according to the reviewer’s constructive suggestions. We hope we have adequately addressed all the concerns. In the revised manuscript, changes are highlighted in yellow. Please find the detailed point-by-point responses below.

      Public Reviews:

      Reviewer #1 (Public Review):

      This study presents an investigation into the physiological functions of RIPK1 within the context of liver physiology, particularly during short-term fasting. Through the use of hepatocyte-specific Ripk1-deficient mice (Ripk1Δhep), the authors embarked on an examination of the consequences of Ripk1 deficiency in hepatocytes under fasting conditions. They discovered that the absence of RIPK1 sensitized the liver to acute injury and hepatocyte apoptosis during fasting, a finding of significant interest given the crucial role of the liver in metabolic adaptation. Employing a combination of transcriptomic profiling and single-cell RNA sequencing techniques, the authors uncovered intricate molecular mechanisms underlying the exacerbated proinflammatory response observed in Ripk1Δhep mice during fasting. While the investigation offers valuable insights into the consequences of Ripk1 deficiency in hepatocytes during fasting conditions, there appears to be a primarily descriptive nature to the study with a lack of clear connection between the experiments. Thus, a stronger focus is warranted, particularly on understanding the dialogue between hepatocytes and macrophages. Moreover, the data would benefit from reinforcement through additional experiments such as Western blotting, flow cytometry, and rescue experiments, which would offer a more quantitative aspect to the findings. By incorporating these enhancements, the study could achieve a more comprehensive understanding of the underlying mechanisms and ultimately strengthen the overall impact of the research.

      We thank the reviewer for the encouraging comments and helpful suggestions. We agree with the reviewer that additional experiments could reinforce our findings. Therefore, we conducted additional experiments including flow cytometry, western blotting, and using kinase-dead mutant mice to further investigate the underlying mechanisms. We carefully addressed every comment by the reviewer as indicated below.

      Detailed major concerns:

      (1) Related to Figure 1.

      It is imperative to ensure consistency in the number of animals analyzed across the different graphs. The current resolution of the images appears to be low, resulting in unsharp visuals that hinder the interpretation of data beyond the presence of "white dots". To address this issue, it is recommended to enhance the resolution of the images and consider incorporating zoom-in features to facilitate a clearer visualization of the observed differences. Moreover, it would be beneficial to include a complete WB analysis for the cell death pathways analyzed. These adjustments will significantly improve the clarity and interpretability of Figure 1.

      Thanks very much for the constructive advice. We carefully checked the number of animals and made sure that the animal numbers were consistent across the different figures. We further updated Figure 1 to incorporate zoom-in features, and the resolution of the figures was greatly improved. Western blot analysis was also included in updated Supplementary Figure 1.

      (2) Related to Figure 2.

      It is essential to ensure consistency in the number of animals analyzed across the different graphs, as indicated by n=6 in the figure legend (similar to Figure 1). Additionally, it is crucial to distinguish between male and female subjects in the dot plots to assess any potential gender-based differences, which should be consistent throughout the paper. To achieve this, the dots plot should be harmonized to clearly differentiate between males and females and investigate if there are any disparities between the genders. Moreover, it is imperative to correlate hepatic inflammation with the activation of Kupffer cells, infiltrating monocytes, and/or hepatic stellate cells (HSCs). Therefore, conducting flow cytometry would be instrumental in achieving this correlation. Additionally, the staining for Ki67 appears to be non-specific, showing a granular pattern reminiscent of bile crystals rather than the expected nuclear staining of hepatocytes or immune cells. It is crucial to ensure specific staining for Ki67, and conducting in vitro experiments on primary hepatocytes could further elucidate the proliferation process. These experiments are relatively straightforward to implement and would provide valuable insights into the mechanisms underlying hepatic inflammation and proliferation.

      Thanks very much for the helpful advice. First, we corrected the number of animals analyzed in the different graphs and made sure that the number of animals listed in each figure legend was consistent with the graphs in all figures. Second, to distinguish the results between male and female mice, blue represents male mice, pink represents female mice, and green represents RIPK1 kinase-inactivated mice. The majority of results were obtained from male mice, and our results indicated that there was no difference between male and female mice herein.

      The percentages of immune cell subpopulations isolated from mouse liver tissue were determined. The results were consistent with the single-cell analysis in showing that a greater number of macrophages were recruited into the liver tissue of Ripk1<sup>Δhep</sup> mice upon 12-hour fasting (updated Figure 4F&G).

      To confirm the Ki67 results, we first measured the transcriptional expression of Ki67 using real-time qPCR, and the results were consistent with the protein expression measured by immunohistochemical analysis. The percentage of Ki67<sup>+</sup> cells among liver cells was also determined, and there were significantly more Ki67<sup>+</sup> cells in Ripk1<sup>Δhep</sup> mouse liver than in WT control mouse liver upon 12-hour fasting. Taken together, our transcriptional analysis, immunohistochemical analysis, and flow cytometry data indicated that Ki67 expression was higher in Ripk1<sup>Δhep</sup> mice than in Ripk1<sup>fl/fl</sup> mice (updated Figure 2).

      (3) Related to Figure 3 & related to Figure 4.

      The immunofluorescence data presented are not entirely convincing and are insufficient to conclusively demonstrate the recruitment of monocytes. Previous suggestions for flow cytometry studies remain pertinent and are indeed necessary to bolster the robustness of the data and conclusions. Conducting flow cytometry analyses would provide more accurate and quantitative assessments of monocyte recruitment, ensuring the reliability of the findings and strengthening the overall conclusions of the study. Regarding the single-cell RNA sequencing analysis presented in the manuscript, it's worth questioning its relevance and depth of information provided. While it successfully identifies a quantitative difference in the cellular composition of the liver between control and knockout mice, it may fall short in elucidating the intricate interactions between different cell populations, which are crucial for understanding the underlying mechanisms of hepatic inflammation. Therefore, I propose considering alternative bioinformatic analyses, such as CellPhone-CellChat, which could potentially provide a more comprehensive understanding of the cellular dynamics and interactions within the liver microenvironment. By examining the dialogue between different cell clusters, these analyses could offer deeper insights into the functional consequences of Ripk1 deficiency in hepatocytes and its impact on hepatic inflammation during fasting.

      Thanks very much for the constructive suggestion. We agree with the reviewer that conducting flow cytometry analyses would provide accurate and quantitative assessments of monocyte recruitment, ensuring the reliability of the findings. Following the advice, both WT and Ripk1<sup>Δhep</sup> mice were fasted for 12 hours, and single hepatic cells were then isolated and analyzed by flow cytometry. As indicated in updated Figure 4F&G, the percentage of F4/80<sup>+</sup>CD11b<sup>+</sup> cells was significantly higher in Ripk1<sup>Δhep</sup> mice than in WT control mice, confirming that more monocytes were recruited into the liver.

      Additionally, we performed CellChat analysis on the single-cell transcriptomic data. As shown in updated Figures 4H-J, both the number of ligand-receptor pairs and the interaction strength among the eight cell types were significantly increased in Ripk1<sup>Δhep</sup> mice, particularly the interactions between macrophages and other cell types. Network analysis indicated that inflammation and proliferation signals were amplified in Ripk1<sup>Δhep</sup> mice. Consistent with the bulk RNA sequencing data, SAA signaling was upregulated in the hepatocytes of Ripk1<sup>Δhep</sup> mice (updated Figure 4K). SAA has been found to play a role in regulating immune responses and tumor development. Based on these findings, we speculate that fasting-induced liver injury in RIPK1 knockout mice may exacerbate the inflammatory response in liver tissue through enhanced SAA signaling. The above data analysis and interpretation are included in updated Figures 4 and S4 and in lines 421-443.

      (4) Related to Figure 5.

      What additional insights do the data from Figure 5 provide compared to the study published in Nat Comms, which demonstrated that RIPK1 regulates starvation resistance by modulating aspartate catabolism (PMID: 34686667)?

      Thank you very much for your constructive suggestion. As noted by the reviewer, this study (PMID: 34686667) primarily focuses on metabolomic analyses of Ripk1<sup>-/-</sup> neonatal mouse brain tissue and Ripk1<sup>-/-</sup> MEF cells. The authors propose that Ripk1 regulates starvation resistance by modulating aspartate catabolism.

      In our study, the global metabolic changes induced by fasting were monitored. Fasting-induced lipolysis in peripheral adipose tissue leads to hepatic lipid accumulation, and excessive deposition of free fatty acids has been shown to induce endoplasmic reticulum (ER) stress in the liver. Data from Figure 5 demonstrate that administering the ER stress inhibitor 4-PBA effectively mitigated fasting-induced liver injury and inflammatory responses in Ripk1<sup>Δhep</sup> mice. Our findings suggest that ER stress plays a critical role in fasting-induced liver injury and inflammation in Ripk1<sup>Δhep</sup> mice.

      (5) Related to Figure 6.

      The data presented in Figure 7 are complementary and do not introduce new mechanistic insights.

      Thank you very much for your insightful suggestion. As you mentioned, the AAV-TBG-Cre-mediated liver-specific RIPK1 knockout mice offer complementary validation of the results obtained from Ripk1<sup>Δhep</sup> mice. Moreover, TBG is a promoter that is exclusively expressed in mature hepatocytes, while the ALB promoter is active not only in mature hepatocytes but also in precursor cells and cholangiocytes. Therefore, we think that the inclusion of AAV-TBG-Cre further strengthens our finding that RIPK1 in hepatocytes is responsible for fasting-induced liver injury and inflammatory responses.

      (6) Related to Figure 7.

      The data from Figure 7 suggest that RIPK1 in hepatocytes is responsible for the observed damage. However, it has been previously demonstrated that inhibition of RIPK1 activity in macrophages protects against the development of MASLD (PMID: 33208891). One possible explanation for these findings could be that the overreaction of macrophages to fasting, coupled with the absence of RIPK1 in hepatocytes (an indirect effect), contributes to the observed damage. Considering this, complementing hepatocytes with a kinase-dead version of RIPK1 could be a valuable approach to further refine the molecular aspect of the study. This would allow for a more precise investigation into the specific role of RIPK1's scaffolding or kinase function in response to starvation in hepatocytes. Such experiments could provide additional insights into the mechanisms underlying the observed effects and help delineate the contributions of RIPK1 in different cell types to metabolic stress responses.

      Thank you very much for the constructive suggestion. We fully agree with the reviewer that employing RIPK1 kinase-inactive mutant mice would allow the specific roles of RIPK1's scaffolding and kinase functions in hepatocyte responses to starvation to be investigated precisely. In accordance with this advice, we established a 12-hour fasting model using Ripk1<sup>WT/WT</sup> and Ripk1<sup>K45A/K45A</sup> mice, a previously established line confirmed to lack RIPK1 kinase activity. As demonstrated in updated Supplementary Figure 2, these mice did not show significant liver damage or inflammatory responses after 12 hours of fasting. These findings suggest that the liver damage and inflammatory response induced by fasting in Ripk1<sup>Δhep</sup> mice may not depend on the kinase activity of RIPK1.

      Reviewer #2 (Public Review):

      Summary:

      Zhang et al. analyzed the functional role of hepatocyte RIPK1 during metabolic stress, particularly its scaffold function rather than kinase function. They show that Ripk1 knockout sensitizes the liver to cell death and inflammation in response to short-term fasting, a condition that would not induce obvious abnormality in wild-type mice.

      Strengths:

      The findings are based on a knockout mouse model and supported by bulk RNA-seq and scRNA-seq. The work consolidates the complex role of RIPK1 in metabolic stress.

      Weaknesses:

      However, the findings are not novel enough because the pro-survival role of RIPK1 scaffold is well-established and several similar pieces of research already exist. Moreover, the mechanism is not very clear and needs additional experiments.

      We thank the reviewer for the encouraging comments and helpful suggestions. Here we conducted additional experiments including flow cytometry, western blotting, and using kinase-dead mutant mice to further investigate the underlying mechanisms. We carefully addressed every comment by the reviewer as indicated below.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (7) I recommend that the authors consider reassessing their results, particularly with regards to elucidating the dialogue between macrophages and hepatocytes, as this could further strengthen the study's conclusions.

      Thank you very much for your constructive suggestion. We conducted additional experiments, including flow cytometry and western blotting, to reassess our findings. Furthermore, to clarify the interactions between cells, we employed CellChat for a more in-depth analysis of the single-cell sequencing results. In the revised manuscript version, changes are highlighted in yellow. In this study, we demonstrated that the specific deletion of RIPK1 in hepatocytes exacerbated the liver's vulnerability to metabolic disturbances, such as short-term fasting and high-fat diet feeding, resulting in increased liver damage, apoptosis, inflammation, and compensatory proliferation. The data indicate that fasting-induced liver injury in mice lacking RIPK1 in hepatic parenchymal cells may exacerbate the inflammatory response in liver tissue through enhanced SAA signaling. In summary, we revealed a novel physiological role of RIPK1 as a scaffold in maintaining liver homeostasis during fasting and other nutritional disturbances.

      (8) It would be beneficial for the authors to address the minor weaknesses identified in the study, such as ensuring consistency in the number of animals analyzed across different graphs and enhancing the resolution of images to improve data clarity.

      Thank you for the suggestion. In the revised manuscript, we have addressed these minor weaknesses: we checked the consistency of the number of animals across the different graphs and enhanced the resolution of all images.

      (9) I encourage the authors to incorporate additional experiments, such as Western blotting and flow cytometry, to provide a more quantitative assessment of the observed effects and enhance the robustness of their conclusions.

      Thank you for your insightful suggestion. We completely agree with the reviewer that incorporating flow cytometry and western blotting would strengthen the robustness of our conclusions. We conducted flow cytometry analysis and western blotting, and the results are shown in updated Supplementary Figure 1, Figure 2, Figure 4, and Supplementary Figure 4.

      (10) Furthermore, the authors may consider conducting complementary experiments, such as rescue experiments involving complementing hepatocytes with a kinase-dead version of RIPK1, to further refine the molecular aspect of the study and elucidate the specific roles of RIPK1's scaffolding or kinase function in response to starvation.

      Thank you very much for your constructive suggestion. As shown in updated Supplementary Figure 2, we conducted fasting experiments using RIPK1 kinase-dead mice. These findings suggest that the liver damage and inflammatory response induced by fasting in Ripk1<sup>Δhep</sup> mice may not depend on the kinase activity of RIPK1.

      Reviewer #2 (Recommendations For The Authors):

      Major:

      (11) What is the upstream signal for RIPK1? The study investigated the change induced by short-term fasting, which is a metabolic stress. Although RIPK1 knockout promotes cell death and inflammation, how it is involved in this condition is unclear. RIPK1 has never been reported as a metabolic sensor, and its function is typically downstream of TNFR1 as well as other death receptors such as Fas, TRAIL-R1, and TRAIL-R2. Thus, it's probable that metabolic stress induces the expression and secretion of some ligand of the above receptors. Although TNFα expression is upregulated at both the mRNA and protein levels, it could not be concluded that TNFα is the upstream signal for RIPK1, because an expression difference does not always indicate a functional role. In addition, a recent study, which is also reference 33, reports that knockout of TNFR1/2 does not protect against 18 h liver ischemia, a condition that is similar to the present study. Therefore, the link between the metabolic fluctuation and RIPK1 function is elusive and should be addressed. The expression difference analysis should be extended to other relevant ligands. A functional study using neutralizing antibodies in RIPK1ΔHep mice is encouraged. At least, this should be discussed in the discussion section.

      Thank you very much for your insightful comments. The upstream signals of RIPK1 remain a significant area of scientific inquiry. Fasting, as one of the main causes of metabolic stress, is known to trigger a series of physiological changes, including but not limited to decreased blood glucose levels, hepatic glycogen depletion, increased production of hepatic glucose and ketone bodies, adipose tissue lipolysis, and the influx and accumulation of free fatty acids in the liver. It is well established that the elevated lipid influx and hepatic accumulation during fasting may cause lipotoxic stress in the liver. To investigate whether the elevated free fatty acid influx might act as the signal that induces cytotoxicity, we isolated primary hepatocytes but observed that a significant number of cells underwent spontaneous death during the isolation and perfusion processes. To address this question, we utilized CRISPR-Cas9 technology to generate Ripk1<sup>-/-</sup> AML12 cells, as illustrated in Author response image 1A.

      To mimic hepatic lipid accumulation induced by short-term fasting, we treated the cells with palmitic acid (PA) or oleic acid (OA) for 12 hours in vitro. Our results indicated a significant increase in cell death among Ripk1<sup>-/-</sup> AML12 cells after PA treatment compared to WT control cells (Author response image 1B). As shown in Author response image 1C, we also observed a marked increase in caspase-3 activity in Ripk1<sup>-/-</sup> AML12 cells following PA treatment.

      Collectively, our results highlight the crucial role of RIPK1 in hepatocytes in maintaining the liver's adaptive capacity to counteract lipotoxicity induced by metabolic stress. These in vitro results were not included in the manuscript; however, we addressed them in the discussion section (lines 593-597). If the reviewer suggests, we would be glad to incorporate them into our manuscript.

      Author response image 1.

      (12) What is the exact relationship between ER stress and RIPK1? In Figure 5A and Figure 6B, Ripk1 knockout only slightly promotes the expression of ER stress markers. The evidence of RIPK1 leading to ER stress is limited in the literature and poorly supported in this study. Also in reference 33, the hypothesis is proposed that ER stress leads to death receptor upregulation and activation, which induces RIPK1 activation. Although the ER stress inhibitor showed good efficacy in rescue experiments, it could not determine whether RIPK1 deficiency leads to ER stress-associated phenotype or ER stress leads to death receptor activation and RIPK1 deficiency-associated phenotype. If RIPK1 deficiency leads to ER stress, the possible mechanism should be investigated.

      Thank you very much for your insightful comments. As the reviewer noted, the specific relationship between endoplasmic reticulum (ER) stress and RIPK1 remains unclear. However, our data, along with findings from other studies (Piccolis M et al., Mol Cell. 2019; Geng Y et al., Hepatol Int. 2021), suggest that fasting-induced lipolysis in peripheral adipose tissue leads to hepatic lipid accumulation. Additionally, excessive deposition of free fatty acids has been shown to induce ER stress in the liver. One possible explanation is that ER stress may trigger the upregulation and activation of death receptors, and the scaffold function of RIPK1 may play a protective checkpoint role in this process. ER stress during fasting might therefore act upstream of RIPK1. This could help explain why short-term fasting results in liver damage in Ripk1<sup>Δhep</sup> mice while control mice remain unaffected. Moreover, the inhibition of ER stress using 4-PBA can effectively alleviate this damage.

      Minor:  

      (13) The study starts directly from functional experiments. However, it should be firstly explored whether RIPK1 expression or activation is modulated in wild-type mice.

      Thank you very much for your insightful observation. Previous studies showed that RIPK1 deficiency in hepatocytes does not impact the growth and development of mice, indicating that RIPK1 is dispensable for proper liver development and homeostasis (Filliol A et al., Cell Death Dis. 2016). Furthermore, we did not observe any changes in RIPK1 levels in wild-type mice induced by fasting across different experimental batches. In our bulk transcriptomic analysis, the expression of RIPK1 was not changed before and after 12-hour fasting in Ripk1<sup>fl/fl</sup> mice. Therefore, we focused our attention on the function of RIPK1 and started our study directly with functional experiments.

      (14) Knockout of RIPK1 deprived both its scaffold function and kinase function. It is encouraged to explore whether blocking RIPK1 kinase activity influences the outcome of metabolic stress.

      Thank you for your insightful suggestion. To investigate the role of RIPK1 kinase activity in response to metabolic stress, we added fasting experiments using RIPK1 kinase-inactive mice in updated Supplementary Figure 2. These experiments show that blocking RIPK1 kinase activity does not affect the outcome of metabolic stress.

      (15) In Figure 1, the number of TUNEL+ cells is about 2 times of c-casp3. What is the possible reason?

      Thank you for your careful reading. Indeed, the number of TUNEL<sup>+</sup> cells in Figure 1 is twice that of cleaved-caspase-3<sup>+</sup> cells. There are two possible reasons. First, we speculate that this discrepancy may be attributed to the higher sensitivity of the TUNEL assay compared to the cleaved-caspase-3 assay. Second, the TUNEL assay detects DNA fragmentation, indicating that these cells are in a pre-apoptotic state or poised to undergo apoptosis, whereas cleaved-caspase-3 staining specifically identifies cells that have already committed to the apoptotic pathway. The TUNEL assay can therefore detect all types of apoptosis, and the mechanisms of apoptosis may involve more than cleaved caspase-3 alone.

      (16) Infiltrated innate immune cells could lead to hepatocyte death. Is the hepatocyte death in this study partially caused by immune cells?

      Many thanks for the advice. As outlined in the response to the 11th comment from the second reviewer, our findings indicate that metabolic stress induced by short-term fasting is the primary cause of hepatocyte death. Additionally, we demonstrate that infiltrated innate immune cells may also play a partial role in hepatocyte death through subsequent cascade reactions.

      (17) Could the in vivo results be consolidated by in vitro experiments on primary mouse hepatocytes? This would be helpful to answer question 4.

      Thank you for your helpful comments. As described in the response to the 11th comment by the second reviewer, we attempted to conduct in vitro experiments using primary hepatocytes. However, during the isolation and perfusion processes, we observed that a significant number of cells underwent spontaneous death. To address this issue, we utilized CRISPR-Cas9 technology to generate Ripk1<sup>-/-</sup> AML12 cells. We observed a significant increase in cell death among Ripk1<sup>-/-</sup> AML12 cells after palmitic acid (PA) treatment compared to WT control cells, as well as a marked increase in caspase-3 activity in Ripk1<sup>-/-</sup> AML12 cells following PA treatment.

      (18) RIPK1 scaffold function is associated with NF-kB signal. Is NF-kB signal transduction influenced by Ripk1 deficiency? If so, to what extent does it contribute to the observed phenotype? If not, what is the direct downstream effect of Ripk1 deficiency?

      Thank you very much for your insightful perspective. As reported by Clucas J et al., RIPK1 serves as a scaffold for downstream NF-κB signaling through the ubiquitin chains generated by its ubiquitination (Clucas J et al., Nat Rev Mol Cell Biol. 2023). The deficiency of RIPK1 in hepatic parenchymal cells can disrupt NF-κB signaling and impair its pro-survival functions, resulting in increased cell death in response to stress. Our current findings suggest that the RIPK1-NF-κB axis serves as a crucial scaffold platform essential for the liver's adaptation to metabolic fluctuations. Any inappropriate inactivation or deletion of components within this scaffold disrupts the delicate balance between cell death, inflammation, and normal function, making the liver susceptible to metabolic changes, ultimately leading to liver damage, hepatic inflammation, and compensatory proliferation.

      (19) In Figure 6B, the 'RIP' should be changed to 'RIPK1'.

      Thank you for your careful observation. We have corrected "RIP" to "RIPK1" in updated Figure 6B.

      (20) For Western blot results, the blot height should be at least the lane width to reveal additional signals and the molecular weight as well as unspecific signals should be denoted.

      Thank you for your valuable advice. We appreciate your suggestions regarding the western blot results. We went through the previous western blot results and did not find any additional nonspecific signals. We added the molecular weights in updated Figure 5, Figure 6, and Supplementary Figure 1.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Summary:

      In this manuscript, Fister et al. investigate how amputational and burn wounds affect sensory axonal damage and regeneration in a zebrafish model system. The authors discovered that burn injury results in increased peripheral axon damage and impaired regeneration. Convincing experiments show altered axonal morphology and increased Ca2+ fluxes as a result of burn damage. Further experimental proof supports that early removal of the burnt tissue by amputation rescues axonal damage. Burn damage was also shown to markedly increase keratinocyte migration and increase localized ROS production as measured by the dye Pfbsf. These responses could be inhibited by Arp 2/3 inhibition and isotonic treatment.

      Strengths: 

      The authors use state-of-the-art methods to study and compare transection and burn-induced tissue damage. Multiple experimental approaches (morphology, Ca2+ fluxing, cell membrane labeling) confirm axonal damage and impaired regeneration time. Furthermore, the results are also accompanied by functional response tests of touch sensitivity. This is the first study to extend the role of tissue-damage-related osmotic exposure beyond wound closure and leukocyte migration to a novel layer of pathology: axonal damage and regeneration. 

      Weaknesses: 

      The conclusions of the paper claiming a link between burn-induced epithelial cell migration, spatial redox signaling, and sensory axon regeneration are mainly based on correlative observations. Arp 2/3 inhibition impairs cell migration but has no significant effect on axon regeneration and restoration of touch sensitivity. 

      We agree with the reviewer. We have tried many experiments to address this question. The data show that Arp 2/3 inhibition with CK666 is an effective way to inhibit initial keratinocyte migration. However, later migration still proceeds. What is interesting is that inhibition of the early migration alone is sufficient to restore localized ROS production in the wound area in the first hour post-burn, even if this is not sufficient to prevent ROS accumulation over time. There is also a trend toward improved sensory neuron function late after this early treatment. However, this is not statistically significant. We think it is likely that both migration and tissue-scale ROS influence the regeneration defect of sensory neurons after burn. The data using isotonic solution support this conclusion. We have tried many other ways to limit keratinocyte migration, including depletion of talin and expression of a dominant negative Rac in basal epithelial cells, but these treatments were not compatible with survival of the fish after burn.

      Pharmacological or genetic approaches should be used to prove the role of ROS production by directly targeting the known H2O2 source in the system: DUOX. 

      We agree that pharmacologic or genetic approaches to directly manipulate ROS production would provide substantial support to the hypothesis that ROS, along with keratinocyte migration, is a main factor contributing to poor burn outcomes. To address this, we first tried using a morpholino to deplete DUOX. However, the combination of DUOX morpholino and burn injury was lethal to larvae. We also inhibited ROS production pharmacologically using DPI (diphenyleneiodonium). With this treatment, ROS is inhibited for only the first hour post-burn, as treatment is lethal for longer periods of time. Burned larvae treated with DPI had marginally improved axon density and touch sensitivity, suggesting the importance of ROS in burn outcomes; however, the improvement was not statistically significant. It is likely that an increased effect would be observed with longer treatment, but treatment for more than 1 hour was toxic. We have added a supplemental figure with this new DPI data.

      While the authors provide clear and compelling proof that osmotic responses lie at the heart of the burn-induced axonal damage responses, they did not consider the option of further exploring any biology related to osmotic cell swelling. Could osmotic ATP release maybe play a role through excitotoxicity? Could cPLA2 activation-dependent eicosanoid production relate to the process? Pharmacological tests using purinergic receptor inhibition or blockage of eicosanoid production could answer these questions. 

      We agree that the role of osmotic cell swelling in the burn response is an interesting avenue for future study. However, we make use of isotonic treatment in this study specifically for its effect on keratinocyte migration and broad-scale wound healing. As a result, we feel that pursuing the biology of this swelling phenomenon is outside the scope of this paper.

      The authors provide elegant experiments showing that early removal of the burnt tissue can rescue damage-induced axonal damage, which could also be interpreted in an osmotic manner: tail fin transections could close faster than burn wounds, allowing for lower hypotonic exposure time. Axonal damage and slow regeneration in tail fin burn wounds could be a direct consequence of extended exposure time to hypotonic water. 

      We have done experiments using FM dye to test how long it takes burn and transection wounds to close (shown below). In these experiments, dye entry into wounded tissue is used as a readout of wound closure. Dye is only able to enter wounded tissue when the epithelial barrier is disrupted. Our data reveal that transections take approximately 10 minutes to fully close, while burns take approximately 20 minutes to close.

      Author response image 1.

      To test if this difference in wound closure time would have an effect on axon outcomes, we repeated, but slightly modified, the dual-wound experiment. We increased the amount of time the burn condition was exposed to hypotonic conditions by 10 additional minutes (by transecting burned tissue at 15 minutes post burn, shortly before closure) and compared axon outcomes to the 5 mpw control transection. These results show there was no difference in axon regeneration or function when secondary transection was performed at 5 or 15 minutes post burn, suggesting that increased exposure to hypotonic solution is not the reason for defects in axon outcomes after burn injury.

      Author response image 2.

      Reviewer #2 (Public Review): 

      This is an interesting study in which the authors show that a thermal injury leads to extensive sensory axon damage and impaired regrowth compared to a mechanical transection injury. This correlates with increased keratinocyte migration. That migration is inhibited by CK666 drug treatment and isotonic medium. Both restrict ROS signalling to the wound edge. In addition, the isotonic medium also rescues the regrowth of sensory axons and recovery of sensory function. The findings may have implications for understanding non-optimal re-innervation of burn wounds in mammals. 

      The interpretation of results is generally cautious and controls are robust. 

      Here are some suggestions for additional discussion: 

      The study compares burn injury which produces a diffuse injury to a mechanical cut injury which produces focal damage. It would help the reader to give a definition of wound edge in the burn situation. Is the thermally injured tissue completely dead and is resorbed or do axons have to grow into damaged tissue? The two-cut model suggests the latter. Also giving timescales would help, e.g. when do axons grow in relation to keratinocyte movement? An introductory cartoon might help. 

      We thank the reviewer for these insightful comments and questions. The burn wound is defined as the area that is directly damaged as a result of increased heat (labeled by FM dye entry), and the burn wound edge as the first line of healthy cells adjacent to the burned cells. These definitions have been added to the text to clarify the areas referenced. Recent experiments lead us to believe the wound area is composed almost completely of dead cells, but we are currently working to discover the fate of these dead cells as well as the wound adjacent cells that migrate to the wound edge after burn. As a result, we do not know whether axons grow into damaged tissue or if the damaged tissue is extruded, but we do see growth cone formation within a few hours after wounding suggesting the axons are actively trying to regenerate after a burn.

      Could treatment with CK666 or isotonic solution influence sensory axons directly, or through other non-keratinocyte cell types, such as immune cells? 

      We have done experiments looking at the density of caudal fin innervation in CK666, isotonic, or DPI treated fins. The axon density is unchanged in all these treatments compared to control treated larvae, so we do not believe these treatments affect axon health homeostatically. These data have been added to supplemental figure 3. Additionally, one of the benefits of the larval zebrafish burn model is the simplicity of the system – the epidermis is primarily composed of sensory axons, mesenchymal cells and keratinocytes. The burn environment is proinflammatory so it does promote immune cell recruitment, but we do not believe the immune cells are interacting directly with sensory axons besides clearing axonal debris. Previous papers by our lab have shown that peak immune cell recruitment occurs at 6 hpw, but they localize to the damaged tissue in the burn area and not the wound edge.

      Reviewer #3 (Public Review): 

      Fister and colleagues use regeneration of the larval zebrafish caudal fin to compare the effects of two modes of tissue damage-transection and burn-on cutaneous sensory axon regeneration. The authors found that restoration of sensory axon density and function is delayed following burn injury compared to transection. 

      The authors hypothesized that thermal injury triggers signals within the wound microenvironment that impair sensory neuron regeneration. The authors identify differences in the responses of epithelial keratinocytes to the two modes of injury: keratinocytes migrate in response to burn but not transection. Inhibiting keratinocyte migration with the small-molecule inhibitor of Arp2/3 (CK666) resulted in decreased production of reactive oxygen species (ROS) at early, but not late, time points. Preventing keratinocyte migration by wounding in isotonic media resulted in increased sensory function 24 hours after burn. 

      Strengths of the study include the beautiful imaging and rigorous statistical approaches used by the authors. The ability to assess both axon density and axon function during regeneration is quite powerful. The touch assay adds a unique component to the paper and strengthens the argument that burns are more damaging to sensory structures and that different treatments help to ameliorate this. 

      A weakness of the study is the lack of genetic and cell-autonomous manipulations. Additional comparisons between transection and burns, in particular with manipulations that specifically modulate ROS generation or cell migration without potentially confounding effects on other cell types or processes would help to strengthen the manuscript.

      The use of genetic and cell-autonomous approaches would strengthen our study; however, we were unable to pursue them because these approaches proved lethal. Basal epithelial migration is necessary for embryonic development. We attempted to circumvent this by generating larvae transiently expressing a dominant-negative form of Rac, a protein crucial to the migratory process. Chimeric expression of the dominant-negative Rac was either damaging to the larvae, or the mosaicism was too low to observe any effect on the migration phenotype.

      We also attempted a genetic approach to manipulate ROS production, as discussed above. We found that the DUOX morpholino was lethal to burned larvae. Finally, we attempted pharmacological inhibition of ROS production using the inhibitor DPI (Diphenyleneiodonium). With this treatment, burned larvae have marginally improved axon density and touch sensitivity, suggesting that dampening ROS may improve outcome. The DPI data have been added to the manuscript.

      In terms of framing their results, the authors refer to "sensory neurons" and "sensory axons" throughout the text - it should be made clear what type of neuron(s)/axon(s) are being visualized/assayed. Along these lines, a broader discussion of how burn injuries affect sensory function in other systems - and how the authors' results might inform our understanding of these injury responses - would be beneficial to the reader. 

      In summary, the authors have established a tractable vertebrate system to investigate different sensory axon wound healing outcomes in vivo that may ultimately allow for the identification of improved treatment strategies for human burn patients. Although the study implicates differences in keratinocyte migration and associated ROS production in sensory axon wound healing outcomes, the links between these processes could be more rigorously established. 

      The inconsistency between “neuron” and “axon” has been noted and the text has been corrected accordingly. “Neuron” is used when referring to the cell as a whole, while “axon” is used when referring to the sensory processes in the caudal fin. We added information about burn in the introduction as suggested: “While epithelial tissue is well adapted to repair from mechanical damage, burn wounds heal poorly. Thermal injury results in chronic pain and lack of sensation in the affected tissue, suggesting that an abnormal sensory neuron response contributes to burn wound pathophysiology.”

      We thank the reviewers for their comments.

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors): 

      Suggested experiments: 

      (1) ROS measurements with the dye Pfbsf should be validated with more established ROS probes such as HyPer. 

      Pfbsf has been used previously as a readout of ROS production, and its use is documented in zebrafish (Maeda et al., Angew Chem Int Ed Engl, 2004, and Niethammer et al, Nature, 2009). These sources have been added as references when introducing Pfbsf to provide context for its use. The probe was validated and compared to HyPer in Niethammer’s 2009 paper. In our hands, we have used both probes and have similar results with tail transection.

      (2) To better support claims on ROS and H2O2 playing a central role in mediating axonal damage, the authors should consider pharmacological approaches such as rescue experiments with H2O2 and experiments using inhibitors such as DPI or apocynin. 

      While the above reagents and drugs have limitations and non-specific side effects, more convincing proof could result from genetic approaches including experiments on DUOX knockdown or knockout lines. 

      To further dissect the role of ROS in the burn response, we conducted experiments using DPI, a potent ROS inhibitor that is well-documented in the literature. We found that treatment with 20 µM DPI (1 hour of pretreatment plus 1 hour post-burn) marginally improved axon density when quantified at 24 hpw. Any higher dose, when combined with a burn, proved to be lethal. Longer treatment with DPI was also not tolerated.

      In addition to experiments with DPI, we attempted to burn larvae that were injected with DUOX morpholino. The combined use of burn and DUOX MO was lethal. We have tempered our conclusions and included the new DPI data in the revised manuscript.

      Minor corrections: 

      (1) A phrase/expression in the abstract is confusing: isotonic treatment does not "induce osmotic regulation". Cells exposed to hypo- or hypertonicity will respond by regulatory volume decrease or increase, respectively. Isotonic treatment maintains homeostasis. 

      We appreciate this point and agree with the distinction. Revisions have been made in the text accordingly.

      (2) Figures 4E and 5E would be better to show as an average of multiple experiments with statistical significance. 

      The purpose of figures 4E and 5E is to demonstrate changes in fluorescence intensity and localization of ROS using the representative time series shown in 4D and 5D. The figure legend has been updated accordingly.

      Reviewer #2 (Recommendations For The Authors): 

      Figure 3D How can one distinguish between the two cellular elements that randomly meet or that there is actual coordination? Can the interactions be quantified? It is also unclear what the authors mean by "sensory neuron movement". The authors show that the neuronal cell bodies stay in their position, so only the axons change position. Do they do this by growth, i.e. the neuronal growth cones follow the keratinocytes or do keratinocytes displace the axon shafts? 

      We have included supplemental movies that address this question in the newly uploaded document. Figure 3D is composed of still images taken from supplemental movie 2, which is a timelapse of keratinocytes and axons moving together after a burn injury. This movie clearly shows keratinocytes and their ensheathed axons moving simultaneously, so keratinocytes are mechanically pulling sensory axon shafts with them. We have revised the text to say axon movement, not sensory neuron movement.

      Over the time course of axonal movement (1 hour post-burn), it is not possible that neuronal growth cones contribute to movement, as this is too slow – previous work by other labs has shown that it takes several hours for axons to fully regenerate into amputated tissue, with movement not even noticeable until about 3 hours post-wound (Rieger and Sagasti, PLOS Biology, 2011).

      Regarding the second point, “neuron” vs. “axon” is an inconsistency in the text that has been corrected. “Neuron” is used when referring to the cell as a whole, “axon” is used when referring to the processes that innervate the caudal fin. The axons are physically pulled along with keratinocytes as they migrate after burn application. From our observations, growth cones appear closer to the wound site after the movement has stopped.

      Figure 4G It is surprising that the visual differences in the distribution of values are not statistically significant. 

      The distribution of values in 4G was broad, which is why there is no statistically significant difference; we were also surprised by this result. All statistics were performed in consultation with a statistician and used rigorous criteria for significance.

      Figure 4H The images seem to show a difference, whereas the quantification does not. I suggest choosing more representative images. 

      Figure 4H has been updated to include a more representative image of axon patterning with CK666 treatment.

      Figure 6A The text states that axon damage in the control and isotonic condition is comparable, yet in the image, it appears that the damage in the isotonic treatment at 0 hpw is more distal. 

      This is a good observation that we consistently see in isotonic-treated fish after burn. Axon damage localizes more proximally in isotonic-treated samples because the keratinocytes distal to the notochord are likely dead, and the axons innervating those cells are likely immediately destroyed upon burn application. As a result, the distal axons are not present to express GCaMP. We believe isotonic treatment allows keratinocytes to live slightly longer, so axon damage is therefore prevented for longer. This is also the focus of continuing work to further understand the burn microenvironment.

      Finally, the materials section could mention bias mitigation measures, e.g. withholding the treatment condition from the experimenter in the touch test. 

      We minimized bias in experiments whenever possible, and the conservative statistical measures that were applied to our data further reduce the likelihood of false significance.

      Reviewer #3 (Recommendations For The Authors): 

      - Line numbers would have facilitated reviewer feedback. 

      - Supplementary movies were missing in the submission. 

      The lack of supplementary movies upon submission was a mistake and the movies have been uploaded along with the revised manuscript.

      Introduction: 

      - Pg. 3: "In response to tissue damage, sensory neurons undergo rapid and localized axonal degeneration 4,5." Not sure reference 4 (Reyes et al) is appropriate here as this study was not in the context of tissue damage. 

      We have revised this section as suggested by the reviewer.

      Results: 

      - The expected expression pattern/localization of several transgenes was unclear. Please clearly state what cell type(s) each should label. For example, pg. 5 - "We next sought to further investigate sensory neuron function in burned tissue. For this, we assessed wound-induced axonal damage using zebrafish larvae that express the calcium probe GCaMP." Where is GCaMP expressed? 

      The manuscript has been updated to include expression patterns for the included transgenes – in this mentioned case, GCaMP is expressed in neurons under the pan-neuronal Elavl3 promoter.

      - Introducing the GCaMP labeling could use some clarification. Pg. 5 - "As shown previously by other groups, GCaMP labels degenerating neurons in real time35." This is confusing. Do the authors mean that GCaMP increases immediately prior to Wallerian degeneration as shown by Vargas et al. (PMID: 26558774)? 

      Sustained elevated calcium levels are associated with axon damage. Previous work from other labs has shown that calcium influx follows axon injury (Ziv and Spira, EJN 1993, Adalbert et al., Neuroscience 2012). In these experiments, GCaMP-positive puncta indicate axon damage. We have revised the manuscript to address this critique.

      The Elavl3-GCaMP5 transgenic line will label when calcium levels increase in neurons. However, given the parameters used for imaging in our study (20x magnification, 100 ms exposure, and collection speed every 30 seconds for timelapses), we believe that only sufficiently large increases in calcium that are indicative of cell damage, and not physiological function, are being visualized.

      - Figure 1E - Are these panels images of the same fish? Please specify in the legend. 

      Figure 1E is comprised of one transected and one burned larva each, live-imaged over the course of six hours. The legend has been updated to include this information.

      - Figure 1F - How was the damage area measured? Consider doing this measurement over time to match Figure 1E. 

      Axon damage area measurements were performed similarly to axon density measurements: maximum intensity z-projected confocal images of the caudal fin were generated using FIJI. For all experiments, the caudal fin area posterior to the notochord was outlined using the Polygon tool and measured to obtain a total surface area ROI. Axon fragments inside the outlined area were manually thresholded so that all fragments posterior to the notochord were labeled and no saturated pixels were present, and an area measurement of these thresholded pixels was taken. We have added a section describing these measurements in the Methods section under “Axon damage quantification.”
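
      The measurement described above can be illustrated with a minimal sketch. This is not the authors' FIJI workflow; it only mirrors the two area measurements (total ROI area and thresholded area inside the ROI) on a small image held as nested lists. The function name `measure_damage_area`, the `pixel_area_um2` calibration parameter, and the use of a single fixed threshold are assumptions made for illustration.

```python
from typing import List, Tuple


def measure_damage_area(
    image: List[List[int]],
    roi_mask: List[List[bool]],
    threshold: int,
    pixel_area_um2: float = 1.0,
) -> Tuple[float, float]:
    """Return (roi_area, thresholded_area) in calibrated units.

    image     -- 2D maximum-intensity projection (hypothetical input)
    roi_mask  -- True for pixels inside the outlined region
    threshold -- intensities at or above this value count as axon fragments
    """
    roi_pixels = 0
    damaged_pixels = 0
    for image_row, mask_row in zip(image, roi_mask):
        for value, inside in zip(image_row, mask_row):
            if not inside:
                continue  # ignore everything outside the outlined region
            roi_pixels += 1
            if value >= threshold:
                damaged_pixels += 1
    return roi_pixels * pixel_area_um2, damaged_pixels * pixel_area_um2


# Toy example: the right two columns stand in for the outlined fin region.
toy_image = [[0, 200, 50], [10, 180, 220], [255, 0, 90]]
toy_mask = [[False, True, True], [False, True, True], [False, True, True]]
roi_area, damage_area = measure_damage_area(toy_image, toy_mask, threshold=100, pixel_area_um2=0.5)
```

      Returning both areas keeps the ROI measurement available in case the damaged area needs to be normalized to fin size; whether such a normalization was applied is not stated in the response.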

      - Pg. 5 - When introducing the ngn1 MO - please state the expected phenotype and cite the appropriate background literature. 

      The ngn1 morpholino was cited in the Methods section with the appropriate literature (Cornell and Eisen, Development, 2002), from which we got the morpholino sequence. We thank the reviewer for pointing out the need for more introduction and clarification in the main text, so the ngn1 morpholino has been discussed in greater depth and cited in the main text as well using the same citation.

      - The two-wound model is an elegant approach but could be more clearly described in the main text. 

      An improved explanation of the two-wound experiment has been added to the text.

      - For Figure 3, it would be helpful to have a schematic of the anatomy illustrating the relative positions of axons and epidermal cell types. 

      - Figure 3C - should an additional control here be transected? Given that the krt4:lifeact transgene labels both layers of the epidermis, how were the superficial and basal keratinocytes separated? Interpretation of this section should be carefully worded. The authors state that "...suggesting that the superficial keratinocytes are being pulled by the motile basal keratinocytes" (pg.7 ) but isn't another possibility that the superficial cells are stationary? 

      It is correct that the krt4:lifeact transgene labels both layers of keratinocytes, which together span 20-30 microns. These layers were separated from the same z-stack collected by confocal imaging. The first z-slice and last z-slice of the same stack were separated using FIJI and pseudocolored to appear as different colors. This clarification has been added to the Methods.
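
      As a rough illustration of the separation step described above, the sketch below takes the first and last slices of one z-stack and merges them into a two-color image. It is a minimal sketch under stated assumptions, not the FIJI procedure used in the study: the function name `pseudocolor_outer_slices` and the green/magenta color assignment are hypothetical, and which slice corresponds to the superficial or basal layer depends on acquisition order.

```python
from typing import List, Tuple

Slice = List[List[int]]
RGBImage = List[List[Tuple[int, int, int]]]


def pseudocolor_outer_slices(stack: List[Slice]) -> RGBImage:
    """Merge the first and last z-slices of a stack into one RGB image.

    The first slice is shown in green and the last slice in magenta
    (red + blue); the color choice is an assumption for illustration.
    """
    if len(stack) < 2:
        raise ValueError("need at least two z-slices to separate two layers")
    first_slice, last_slice = stack[0], stack[-1]
    return [
        [(last, first, last) for first, last in zip(first_row, last_row)]
        for first_row, last_row in zip(first_slice, last_slice)
    ]


# Toy stack: three slices of a 1 x 2 image; the middle slice is ignored.
toy_stack = [[[10, 0]], [[5, 5]], [[0, 20]]]
merged = pseudocolor_outer_slices(toy_stack)
```

      Using only the outermost slices discards the intermediate planes, so the two channels never share signal from the same slice.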

      Prior observations with the krt4:lifeact and krt4:utrch (figure 3A) transgenic lines reveal that both keratinocyte layers will move distally after burn application.

      - Pg. 7 - "The axons of sensory neurons are ensheathed within actin-rich channels running through basal keratinocytes 50,51." ref 51 is a C. elegans paper which does not have basal keratinocytes.

      This citation was an error. The correct reference has replaced reference 51 (O’Brien, J Comp. Neurol., 2012), in which electron microscopy is used to document the development of two layers of epithelial cells that also ensheath sensory neurons in a protective manner similar to glial cells in the central nervous system.

      - Figures S1E and F - the authors state that RB and DRG soma don't move. However, it was unclear from the figure panels and legend whether the authors imaged neurons that actually innervate the caudal fin (rather than some other region of the animal). Please clarify. For comparison, Fig S1F needs a pre-injury image to be meaningful. 

      The imaged cell bodies were those in the posterior trunk region, which are responsible for innervating the posterior sections of the fish including the caudal fin. From our observations, there was no movement of neuronal cell bodies after the burn.

      - Figure 5 title - can the authors clarify what aspect of this figure relates to "sustained epidermal damage" 

      The figure 5 title has been updated in response to the reviewer comments.

      - Figure 6 - is touch sensitivity really "restored" as the authors suggest? Alternatively, sensitivity may never be lost in isotonic treatment. Or the loss may be delayed? 

      We have modified the text accordingly by updating our phrasing – “restored” has been replaced with “improved” to indicate benefit over time.

      - Can the authors further disentangle the effects of keratinocyte migration, ROS, and isotonic treatment on axon regeneration? For example, would the addition of CK666 to the Isotonic +1 hpw treatment improve axon regeneration? Can the authors directly manipulate ROS signaling (e.g., through exogenous addition of H2O2 or duox1 MO) to alter regeneration outcomes in their wounding assays? 

      See the comments above.

      - Figure 6 title - consider removing or clarifying the word "excessive" here 

      The title has been revised according to the reviewer suggestion.

      - hpw vs hpb were used inconsistently throughout the text 

      The manuscript has been revised to use “hpw” when referring to the timeframe after injury application.

      Methods: 

      - Zebrafish transgenics are missing allele names 

      References: 

      - Many mistakes were noted in this section e.g., journal names missing, wrong authors, typos, DOIs misformatted 

      The references section has been corrected to use formatting consistent with APA citation and eLife preferred guidelines.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This manuscript reports the investigation of PriC activity during DNA replication initiation in Escherichia coli. It is reported that PriC is necessary for the growth and control of DNA replication initiation under diverse conditions where helicase loading is perturbed at the chromosome origin oriC. A model is proposed where PriC loads helicase onto ssDNA at the open complex formed by DnaA at oriC. Reconstituted helicase loading assays in vitro support the model. The manuscript is well-written and has a logical narrative.

      Thank you for understanding this study.

      Major Questions/Comments:

      An important observation here is that a ΔpriC mutant alone displays under-replication, suggesting that this helicase loading pathway is physiologically relevant. Has this PriC phenotype been reported previously? If not, would it be possible to confirm this result using an independent experimental approach (e.g. marker frequency analysis or fluorescent reporter-operator systems)?

      We thank Reviewer 1 for this comment. This study provides the first direct evidence for PriC’s role in initiation of chromosome replication. Because the change in oriC copy number in ∆priC cells under non-stressed conditions is only slight, the resolution of the suggested methods is not high enough to distinguish differences in oriC copy number between priC+ (WT) and ∆priC cells. Thus, to corroborate the ∆priC phenotype, we additionally used flow cytometry to analyze priC+ and ∆priC cells growing under various nutritional and thermal conditions.

      As shown in Figure 2-figure supplement 1 of the revised version, the fraction of cells with non-2^n oriC copies was slightly higher in ∆priC cells compared to priC+ cells. Furthermore, when grown in M9 minimal medium at 37˚C, ∆priC mutant cells exhibited slightly reduced ori/mass values. These results support the idea that inhibition of replication initiation occurs at low frequency even in the WT dnaA and dnaC background, and that PriC function is necessary to ensure normal replication initiation. Related descriptions have been revised accordingly.
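
      To make the "non-2^n" metric concrete, the sketch below computes the fraction of cells whose oriC copy number is not a power of two from a table of cell counts per copy number. It is a minimal sketch, assuming the counts have already been extracted from the flow cytometry histogram; the function names and the example counts are hypothetical and are not data from the study.

```python
from typing import Dict


def is_power_of_two(copies: int) -> bool:
    """True for 1, 2, 4, 8, ... (copy numbers expected after synchronous initiation)."""
    return copies > 0 and (copies & (copies - 1)) == 0


def non_power_of_two_fraction(cells_by_copies: Dict[int, int]) -> float:
    """Fraction of cells whose oriC copy number is not 2^n.

    cells_by_copies maps an oriC copy number to the number of cells
    counted with that copy number.
    """
    total_cells = sum(cells_by_copies.values())
    if total_cells == 0:
        raise ValueError("no cells counted")
    irregular_cells = sum(
        count for copies, count in cells_by_copies.items() if not is_power_of_two(copies)
    )
    return irregular_cells / total_cells


# Made-up counts for illustration only: 3- and 5-copy cells are "non-2^n".
example_counts = {1: 5, 2: 50, 3: 10, 4: 30, 5: 5}
fraction = non_power_of_two_fraction(example_counts)
```

      A higher fraction indicates more cells in which not every origin fired in the same cycle, which is how the response interprets the ∆priC result.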

      Is PriA necessary for the observed PriC activity at oriC? Is there evidence that PriC functions independently of PriA in vivo?

      As described in the Introduction of the original manuscript, PriA is a 3’-to-5’ helicase which specifically binds to forked DNA carrying the 3’-end of the nascent DNA strand. Thus, the structural specificity for target DNA is essentially different between PriA and PriC. Consistent with this, our in vitro data indicate that PriC alone is sufficient to rescue the abortive helicase loading at oriC (Figure 7), indicating that PriA is principally unnecessary for PriC activity at oriC. In addition, as described in the Introduction, PriC can interact with ssDNA to reload DnaB (Figure 1E). Nevertheless, the possibility that PriA participates in the PriC-dependent rescue of DnaB loading at oriC in vivo cannot be completely excluded. Testing this possibility is beyond the scope of the present study and should be addressed in the future. An additional explanation has been included in the Discussion of the revised version.

      Is PriC helicase loading activity in vivo at the origin direct (the genetic analysis leaves other possibilities tenable)? Could PriC enrichment at oriC be detected using chromatin immunoprecipitation?

      These are advanced questions about the genomic dynamics of PriC. Given that PriC facilitates DnaB reloading at stalled replication forks (Figure 1E) (Heller and Marians, Mol Cell., 2005; Wessel et al., J Biol Chem, 2013; Wessel et al., J Biol Chem, 2016), PriC might interact with the whole genome and its localization might not necessarily exhibit a preference for oriC in growing cells. These questions are interesting but are beyond the scope of the present study and should be addressed in future work.

      Reviewer #2 (Public review):

      This is a great paper. Yoshida et al. convincingly show that DnaA does not exclusively do loading of the replicative helicase at the E. coli oriC, but that PriC can also perform this function. Importantly, PriC seems to contribute to helicase loading even in wt cells albeit to a much lesser degree than DnaA. On the other hand, PriC takes a larger role in helicase loading during aberrant initiation, i.e. when the origin sequence is truncated or when the properties of initiation proteins are suboptimal. Here highlighted by mutations in dnaA or dnaC.

      This is a major finding because it clearly demonstrates that the two roles of DnaA in the initiation process can be separated into initially forming an open complex at the DUE region by binding/nucleation onto DnaA-boxes and second by loading of the helicase. Whereas these two functions are normally assumed to be coupled, the present data clearly show that they can be separated and that PriC can perform at least part of the helicase loading provided that an area of duplex opening is formed by DnaA. This puts into question the interpretation of a large body of previous work on mutagenesis of oriC and dnaA to find a minimal oriC/DnaA complex in many bacteria. In other words, mutants in which oriC is truncated/mutated may support the initiation of replication and cell viability only in the presence of PriC. Such mutants are capable of generating single-strand openings but may fail to load the helicase in the absence of PriC. Similarly, dnaA mutants may generate an aberrant complex on oriC that trigger strand opening but are incapable of loading DnaB unless PriC is present.

      We would like to thank Reviewer #2 for the very positive comments about our work.

      In the present work, the sequence of experiments presented is logical and the manuscript is clearly written and easy to follow. The very last part regarding PriC in cSDR replication does not add much to the story and may be omitted.

      Given that the role of PriC in stimulating cSDR was unclear, we believe that our finding that PriC has little or no role in cSDR, despite being a negative result, is valuable for the general readership of eLife. To further assess the impact of PriC on cSDR, and as recommended by Referee #1, we carried out chromosome loci copy-number analysis by whole-genome sequencing. As shown in Figure 8-supplement 1 of the revised version, the results support our conclusion from the original version.

      Reviewer #3 (Public review):

      Summary:

      At the abandoned replication fork, loading of DnaB helicase requires assistance from PriABC, repA, and other protein partners, but it does not require replication initiator protein, DnaA. In contrast, nucleotide-dependent DnaA binding at the specific functional elements is fundamental for helicase loading, leading to the DUE region's opening. However, the authors questioned in this study that in case of impeding replication at the bacterial chromosomal origins, oriC, a strategy similar to an abandoned replication fork for loading DnaB via bypassing the DnaA interaction step could be functional. The study by Yoshida et al. suggests that PriC could promote DnaB helicase loading on the chromosomal oriC ssDNA without interacting with the DnaA protein. However, the conclusions drawn from the primarily qualitative data presented in the study could be slightly overwhelming and need supportive evidence.

      Thank you for your understanding and careful comments.

      Strengths:

      Understanding the mechanism of how DNA replication restarts via reloading the replisomes onto abandoned DNA replication forks is crucial. Notably, this knowledge becomes crucial to understanding how bacterial cells maintain DNA replication from a stalled replication fork when challenging or non-permissive conditions prevail. This critical study combines experiments to address a fundamental question of how DnaB helicase loading could occur when replication initiation impedes at the chromosomal origin, leading to replication restart.

      Thank you for your understanding.

      Weaknesses:

      The term colony formation used for a spotting assay could be misleading for apparent reasons. Both assess cell viability and growth; while colony formation is quantitative, spotting is qualitative. Particularly in this study, where differences appear minor but draw significant conclusions, the colony formation assays representing growth versus moderate or severe inhibition are a more precise measure of viability.

      We used serial dilutions of the cell culture for the spotting assay, and thus this assay should be referred to as semi-quantitative rather than simply qualitative. For more quantitative assessment of viability, we analyzed the growth rates of cells and the chromosome replication activity using flow cytometry.

      Figure 2

      The reduced number of two oriC copies per cell in the dnaA46priC-deficient strain was considered moderate inhibition. When combined with the data suggested by the dnaAC2priC-deficient strain containing two origins in cells with or without PriC (indicating no inhibition)-the conclusion was drawn that PriC rescue blocked replication via assisting DnaC-dependent DnaB loading step at oriC ssDNA.

      The results provided by Saifi B, Ferat JL. PLoS One. 2012;7(3):e33613 suggests the idea that in an asynchronous DnaA46 ts culture, the rate by which dividing cells start accumulating arrested replication forks might differ (indicated by the two subpopulations, one with single oriC and the other with two oriC). DnaA46 protein has significantly reduced ATP binding at 42C, and growing the strain at 42C for 40-80 minutes before releasing them at 30 C for 5 minutes has the probability that the two subpopulations may have differences in the active ATP-DnaA. The above could be why only 50% of cells contain two oriC. Releasing cells for more time before adding rifampicin and cephalexin could increase the number of cells with two oriCs. In contrast, DnaC2 cells have inactive helicase loader at 42 C but intact DnaA-ATP population (WT-DnaA at 42 or 30 C should not differ in ATP-binding). Once released at 30 C, the reduced but active DnaC population could assist in loading DnaB to DnaA, engaged in normal replication initiation, and thus should appear with two oriC in a PriC-independent manner.

      This is a question about dnaA46 Δ_priC_ mutant cells. Inhibition of the replication forks causes inhibition of the RIDA (the DNA-clamp complex-dependent DnaA-ATP hydrolysis) system, resulting in an increase of ATP-DnaA molecules (Kurokawa et al. (1999) EMBO J.). Thus, if Δ_priC_ inhibits the replication forks significantly, the ATP-DnaA level should increase and initiation should be stimulated. However, the results of Figure 2BC are opposite, indicating inhibition of initiation by Δ_priC_. Thus, we infer that the inhibition of initiation in the Δ_priC_ cells is not related to possible changes in the ATP-DnaA level. Even if the ATP-DnaA levels are different in subpopulations in dnaA46 cells, the Δ_priC_ mutation should not affect the ATP-DnaA levels significantly. Thus, we infer that even in dnaA46 Δ_priC_ mutant cells, the Δ_priC_ mutation directly affects initiation mechanisms, rather than indirectly through the ATP-DnaA levels.

      Broadly, the evidence provided by the authors may support the primary hypothesis. Still, it could call for an alternative hypothesis: PriC involvement in stabilizing the DnaA-DnaB complex (this possibility could exist here). To prove that the conclusions made from the set of experiments in Figures 2 and 3, which laid the foundations for supporting the primary hypothesis, require insights using on/off rates of DnaB loading onto DnaA and the stability of the complexes in the presence or absence of PriC, I have a few other reasons to consider the latter arguments.

      This is a very careful consideration. However, we infer that stabilization of the DnaA-DnaB interaction by PriC, even if present, does not always result in stimulation of DnaB loading to oriC. Given that interactions between DnaA and DnaB during DnaB loading to oriC are highly dynamic and complicated with multiple steps, stabilization of the DnaA-DnaB interaction by PriC, even if it occurs, has a considerable risk of inhibiting the DnaB loading by constructing abortive complexes. In addition, DnaA-DiaA binding is very tight and stable (Keyamura et al., 2007, 2009). Even if WT DnaA and WT DnaB are present, PriC can rescue the initiation defects of oriC mutants. Based on these facts and the known characteristics of PriC as explained in Introduction, it is more reasonable to infer that PriC provides a bypass of DnaB loading even at oriC, as proposed for the mechanism at the stalled replication fork. However, we cannot completely rule out the indicated possibility and these explanations are included in the revised version.

      Figure 3

      One should consider the fact that dnaA46 is present in these cells. Overexpressing pdnaAFH could produce mixed multimers containing subunits of DnaA46 (reduced ATP binding) and DnaAFH (reduced DnaB binding). Both have intact DnaA-DnaA oligomerization ability. The cooperativity between the two functions by a subpopulation of two DnaA variants may compensate for the individual deficiencies, making a population of an active protein, which in the presence of PriC could lead to the promotion of the stable DnaA: DnaBC complexes, able to initiate replication. In the light of results presented in Hayashi et al. and J Biol Chem. 2020 Aug 7;295(32):11131-11143, where the identified mutant DnaBL160A was shown to be impaired in DnaA binding but contained an active helicase function and still inhibited growth, how could one explain the hypothesis presented in this manuscript? If PriC-assisted helicase loading could bypass DnaA interaction, then how should growth inhibition in a strain carrying DnaBL160A be described? However, seeing the results in light of the alternative possibility that PriC assists in stabilizing the DnaA: DnaBC complex is more compatible with the previously published data.

      Unfortunately, in this comment, there is a crucial misunderstanding regarding the growth of cells bearing DnaB L160A. Hayashi et al. reported that the dnaB(Ts) cells bearing the dnaB L160A allele grew slowly and formed colonies even at 42°C. This feature is similar to the growth of dnaA46 cells bearing the dnaA F46A H136A allele (Figure 2). Thus, the results of dnaB L160A cells are consistent with our model and support the idea that PriC partially rescues the growth inhibition of cells bearing the DnaB L160A allele by bypassing the strict requirement for the DnaA-DnaB interaction. Nevertheless, we have to be careful about the possibility that DnaB L160A could affect interaction with PriC, which we are going to investigate for a future paper.

      As suggested, if mixed complexes of DnaA46 and DnaA F46A H136A proteins are formed, those might retain partial activities in oriC unwinding and DnaB interaction although those cells are inviable at 42°C without PriC. It is noteworthy that in the specific oriC mutants which are impaired in DnaB loading (e.g., Left-oriC), PriC effectively rescues the initiation and cell growth. In these cells, both DnaA and DnaB are intact. Thus, the idea that only mutant DnaA (or DnaB) protein is stimulated specifically via PriC interaction is invalid. Even in cells bearing wild-type oriC, DnaA and DnaB, a contribution of PriC to initiation is detected.

      In addition, as described in the above response, given that interactions between DnaA and DnaB during DnaB loading to oriC are very dynamic and complicated with multiple steps, stabilization of the DnaA-DnaB interaction by PriC, even if present, would not simply result in stimulation of DnaB loading to oriC; rather, we think it would likely inhibit the DnaB loading by constructing abortive complexes. Based on the known characteristics of PriC as explained in Introduction, it is more reasonable to infer that PriC provides a bypass of DnaB loading even at oriC, as proposed for the mechanism at the stalled replication fork.

      However, we cannot completely rule out the indicated possibility and this explanation has been described in the revised version as noted in response to the above question.

      Figure 4

      Overexpression of DiaA could contribute to removing a higher number of DnaA populations. This could be more aggravated in the absence of PriC (DiaA could titrate out more DnaA)-the complex formed between DnaA: DnaBC is not stable, therefore reduced DUE opening and replication initiation leading to growth inhibition (Fig. 4A ∆priC-pNA135). Figure 7C: Again, in the absence of PriC, the reduced stability of DnaA: DnaBC complex leaves more DnaA to titrate out by DiaA, and thus less Form I*. However, adding PriC stabilizes the DnaA: DnaBC hetero-complexes, with reduced DnaA titration by DiaA, producing additional Form I*. Adding a panel with DnaBL160A that does not interact with DnaA but contains helicase activity could be helpful. Would the inclusion of PriC increase the ability of mutant helicase to produce additional Form I*?

      Unfortunately, the proposed idea disregards the fact that DiaA effectively stimulates the assembly of DnaA molecules at oriC. As oriC contains multiple DnaA boxes and multiple DnaA molecules are recruited there, DiaA will efficiently facilitate assembly of DnaA molecules on oriC. Even DnaA molecules of DnaA-DiaA complexes can efficiently bind to oriC. This is consistent with in vitro experiments showing that higher levels of DiaA stimulate assembly of DnaA molecules and oriC unwinding (i.e., DUE opening) but even excessive levels of DiaA do not inhibit those reactions (Keyamura et al., J. Biol. Chem. (2009) 284, 25038-25050). However, as shown in Figure 9, DiaA tightly binds to the specific site of DnaA which is the same as the DnaB L160-binding site, which causes inhibition of DnaA-DnaB binding (ibid). These findings are consistent with in vivo experiments and with the idea that an excessive DiaA level inhibits interaction and loading of DnaB by the DnaA-oriC complexes, but not oriC unwinding (i.e., DUE opening), in vivo. Also, as mentioned above, we do not consider that stabilization of the DnaA-DnaBC complex simply results in stimulation of DnaB loading to oriC. Based on the known characteristics of PriC, it is more reasonable to infer that PriC provides a bypass of DnaB loading even at oriC, as proposed for the mechanism at the stalled replication fork (Figure 1E), as described in the above response.

      As for DnaB L160A, as mentioned above, we are currently investigating interaction modes between DnaB and PriC. While investigating DnaB L160A could further support our model, we believe its contribution to the present manuscript would be incremental. In addition, there is a possibility that DnaB L160A could affect interaction with PriC. Thus, analysis of DnaB mutants in this PriC rescue mechanism should be addressed in a future study.

      Figure 5

      The interpretation is that colony formation of the Left-oriC ∆priC double mutant was markedly compromised at 37˚C (Figure 5B), and the growth defects of the Left-oriC mutant at 25°C and 30°C were aggravated. However, prima facie, the relative differences in the growth of cells containing and lacking PriC are similar. Quantitative colony-forming data is required to claim these results. Otherwise, it is slightly confusing.

      The indicated concern arose from our typing error, which omitted ∆priC. In the revised manuscript, we have amended the text as follows: the cell growth of the Left-oriC ∆priC double mutant was markedly compromised at 37˚C and moderately reduced at 25°C and 30°C (Figure 5B).

      A minor suggestion is to include cells expressing PriC using plasmid DNA to show that adding PriC should reverse the growth defect of dnaA46 and dnaC2 strains at non-permissive temperatures. The same should be added at other appropriate places.

      Even in the presence of PriC, unwinding of oriC and DnaB helicase loading to the unwound oriC require DnaA and DnaC activities as indicated by previous studies (see for a review, Windgassen et al., (2018) Nucleic Acids Res. 46, 504-519). Thus, dnaA46 cells and dnaC2 cells bearing pBR322-priC cannot grow at 42°C and 37°C (as follows). These are reasonable results. However, at semi-permissive temperatures (37°C for dnaA46 and 35°C for dnaC2), slight stimulation of the cell growth by pBR322-priC might be barely observed (Figure 2-supplement 1 of the revised version). These results suggest that the intrinsic level of PriC is functionally nearly sufficient. This explanation has been included in the revised version.

      Author response image 1.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Line 38. "in assembly of the replisome".

      Corrected.

      Line 137. "specifically" rather than specificity.

      Corrected.

      Line 139. "at" rather than by.

      Corrected.

      The DnaA46 protein variant contains two amino acid substitutions (A184V and H252Y) within the AAA+ motif. H136 appears to reside adjacent to A184 in structure. Is A184V mutation causative?

      The DnaA H136A and A184V alleles are responsible for different defects. Indeed, the DnaA A184V variant is thermolabile and defective in ATP binding whereas the H136A variant retains ATP binding but impairs DnaB loading (Carr and Kaguni, Mol. Microbiol., 1996; Sakiyama et al., Front. Microbiol., 2018). These observations strongly support the view that the phenotype of the DnaA H136A allele is independent of that of the DnaA A184V allele.

      Figure 2A. Regarding the dnaA46 allele grown at 37°C.

      Individual colonies cannot be resolved. Is an image from a later time-point available?

      We have replaced the original image with one from another replicate that provides better resolution. Please see Figure 2A in the revised version.

      Figure 2C. Quantification of the number of cells with more than one chromosome equivalent in the dnaC2 ΔpriC strain. The plot from flow cytometry appears to show >20% of cells with only 1 genome. Are these numbers correct?

      Thank you for this careful comment. We quantified the peaks more strictly, but the percentages were not largely changed. To improve resolution of the DNA profiles, we have changed the range of the x-axis in panels B and C of Figure 2 in the revised version.

      Figure 3. Are both F46A and H136A mutations in the plasmid-encoded dnaA necessary?

      Yes. The related explanation is included in the Discussion section (the third paragraph) of the original manuscript. As described there, dnaA46 cells expressing the DnaA H136A single mutant exhibited severe defects in cell growth even in the presence of PriC (Sakiyama et al., 2018). The His136 residue is located within the weak, secondary DnaB interaction region in DnaA, and is crucial for DnaB loading onto oriC ssDNA. Given domain I in DnaA H136A can stably tether DnaB-DnaC complexes to DnaA complexes on oriC (Sakiyama et al., 2018), we infer that oriC-DnaA complexes including DnaA H136A stably bind DnaB via DnaA domain I as an abortive complex, which inhibits functional interaction between PriC and DnaB as well as DnaB loading to oriC DNA.

      As for the DnaA F46A mutant, our previous studies show that DnaA F46A has a limited residual activity in vivo (unlike in vitro), and allows slow growth of cells. As the stable DnaA-DnaB binding is partially impaired in vivo in DnaA F46A, this feature is consistent with the above ideas. Thus, both F46A and H136A mutations are required for more severe inhibition of DnaB loading. This is additionally described in the revised Discussion.

      Figure 3. Is the DnaA variant carrying F46A and H136A substitutions stably expressed in vivo?

      We have performed western blotting, demonstrating that the DnaA variant carrying F46A and H136A substitutions is stable in vivo. In the revised version, we have added new data to Figure 3-figure supplement 1 and relevant description to the main text as follows:

      Western blotting demonstrated that the expression levels were comparable between WT DnaA and DnaA F46A H136A double mutant (Figure 3-figure supplement 1).

      Figure 5A. Should the dashed line extending down from I2 reach the R4Tma construct?

      We have amended the indicated line appropriately.

      Figure 6C. It was surprising that the strain combining the subATL mutant with ΔpriC displayed a pronounced under-initiation profile by flow cytometry, and yet there was no growth defect observed (see Figure 6B). This seems to contrast with results using the R4Tma origin, where the ΔpriC mutant produced a relatively modest change to the flow cytometry profile, and yet growth was perturbed (Figure 5C-D). How might these observations be interpreted? Is the absolute frequency of DNA replication initiation critical?

      Please note that, in E. coli, initiation activity correlates closely with the number of oriC copies per cell mass (ori/mass), rather than the apparent DNA profiles measured by flow cytometry. When cells were grown in LB at 30˚C, the mean ori/mass values were as follows: 0.34 for R4Tma ΔpriC, 0.51 for R4Tma, 0.82 for DATL ΔpriC, 0.99 for DATL (Figures 5 & 6 in the original manuscript). These values closely correspond to the cell growth ability shown in Figure 5C in the original manuscript.

      In the revised manuscript, we have cited appropriate references for introduction of the ori/mass values as follows.

      To estimate the number of oriC copies per unit cell mass (ori/mass) as a proxy for initiation activity (Sakiyama et al., 2017, 2022),

      Line 295. Reference for Form I* assay should cite the original publication.

      Done. The following paper is additionally cited.

      Baker, T. A., Sekimizu, K., Funnell, B. E., and Kornberg, A. (1986). Extensive unwinding of the plasmid template during staged enzymatic initiation of DNA replication from the origin of the Escherichia coli chromosome. Cell 45, 53–64.doi: 10.1016/0092-8674(86)90537-4

      Reviewer #2 (Recommendations for the authors):

      The partial complementation of the dnaC2 strain by PriC seems quite straightforward since this particular mutation leads to initiation arrest at the open complex stage and this sets the stage for PriC to load the helicase. The situation is somewhat different for dnaA46. Why is this mutation partly complemented by PriC at 37°C? DnaA46 binds neither ATP nor ADP, yet it functions in initiation at permissive temperature. At nonpermissive temperature, it binds oriC as well but does not lead to initiation. Does the present data imply that the true initiation defect of DnaA46 lies in helicase loading? The authors need to comment on this in the text.

      Given the thermolabile propensity of the DnaA46 protein, it is presumable that DnaA46 protein becomes partially denatured at the sub-permissive temperature of 37˚C. This partial denaturation should impair both origin unwinding and helicase loading, though not to the extent that cell viability is lost. The priC deletion should further exacerbate helicase loading defects by inhibiting the bypass mechanism, resulting in the lethality of dnaA46 cells at this temperature. This explanation is included in the revised Discussion section.

      Relating to the above. In Figure 3 it is shown that the pFH plasmid partly complements dnaA46 in a PriC-dependent manner. Again, it would be nice to know the nature of the DnaA46 protein defect. It would be interesting to see how a pING1-dnaA46 plasmid performs in the experiment presented in Figure 3.

      A previous paper showed that multicopy supply of DnaA46 can suppress temperature sensitivity of the dnaA46 cells (Rao and Kuzminov, G3, 2022). This is reasonable in that DnaA46 has a rapid degradation rate unlike wild-type DnaA. As DnaA46 preserves the intact sequences in DnaB binding sites such as G21, F46 and H136, the suppression would not depend on PriC but would be due to the dosage effect.

      Figure 8 B: The authors should either remove the data or show a genome coverage: it is not clear that yapB is a good reference. A genome coverage would be nice, and show whether initiation can occur at oriC even if it is not the major place of initiation in a rnhA mutant.

      As suggested, we carried out the chromosome loci copy-number analysis by whole-genome sequencing to assess the impact of PriC on cSDR. The new data are shown in Figure 8-supplement 1 with relevant descriptions in the main text of the revised version as shown below. Briefly, results of the chromosome loci copy-number analysis are consistent with those of real-time qPCR (Figure 8B). Given that the role of PriC in stimulating cSDR was unclear, we believe that our finding that PriC has little or no role in cSDR, despite being a negative result, is valuable for the general readership of eLife.

      Line 38-39: .....resulting in replisome assembly.

      Corrected.

      Line 48: Something is wrong with the Michel reference. Also in the reference list.

      Corrected.

      Line 156: replace retarded with reduced.

      Corrected.

      Line 171 and elsewhere: WT priC cells is somewhat misleading. Isn't this simply PriC+ cells?

      Yes. We have revised the wording to “priC<sup>+</sup>” for clarity.

      Line 349-350: "the oriC copy number ratio of the dnaA46 ΔpriC double mutant was lower than that of the dnaA46 single mutant....". This holds only provided that the growth rates of the strains are the same.

      These strains exhibited similar growth rates. This is included in the Results section of the revised manuscript as follows: At the permissive temperature, despite having similar growth rates, the oriC copy number ratio of the dnaA46 ΔpriC double mutant strain was lower than that of the dnaA46 single mutant.

      Reviewer #3 (Recommendations for the authors):

      I would suggest improved or additional experiments, data, or analyses.

      The revised version includes improved or additional experiments, data, or analyses.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      The authors describe a massively parallel reporter assay (MPRA) screen focused on identifying polymorphisms in 5' and 3' UTRs that affect translation efficiency and thus might have a functional impact on cells. The topic is of timely interest, and indeed, several related efforts have recently been published and preprinted (e.g., https://pubmed.ncbi.nlm.nih.gov/37516102/ and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10635273/). This study has several major issues with the results and their presentation.

      Major comments:

      • The main issue remains that it appears that the screen has largely failed, and the reasons for that remain unclear, which makes it difficult to judge how useful the resulting data are. The authors mention batch effects as a potential contributor. The authors start with a library that includes ~6,000 variants, which makes it a medium-size MPRA. But then, only 483 pairs of WT/mutated UTRs yield high-confidence information, which is already a small number for any downstream statistical analysis, particularly since most don't actually affect translation in the reporter screen setting (which is not unexpected). It is unclear why >90% of the library did not give high-confidence information. The profiles presented as base-case examples in Fig. 2B don't look very informative or convincing. All the subsequent analysis is done on a very small set of UTRs that have an effect, and it is unclear to this reviewer how these can yield statistically significant and/or biologically relevant associations.

      • From the variants that had an effect, the authors go on to carry out some protein-level validations, and see some changes, but it is not clear if those changes are in the same direction as observed in the screen. In their rebuttal the authors explain that they largely cannot infer directionality of changes from the screen, which further limits its utility.

      • It is particularly puzzling how the authors can build a machine learning predictor with >3,000 features when the dataset they use for training the model has just a few dozen translation-shifting variants.

      We recognize that RNA distribution within polysomes is inherently less stable than the associated protein components. This instability has been noted in previous studies, including those cited by the reviewer, which used RNA from bulk polysomes to infer the translatome without fractionation. Acknowledging this limitation, we purposely adopted a conservative strategy: (i) performing gross fractionation of polysomes, and (ii) collaborating with biostatisticians at the Institute of Statistical Science, Academia Sinica, to design a conservative yet optimized analysis pipeline that minimized batch effects.

      This approach proved robust: representative cases in Fig. 2B clearly demonstrate distinct distributions of reference and alternative alleles. From our high-confidence dataset, we applied a well-established statistical framework specifically designed to accommodate multiple influencing factors in relatively small datasets (Elements of Statistical Learning by Hastie, Tibshirani, and Friedman). We further conducted sensitivity analyses to select an optimal QC cutoff across a range of stringencies, ensuring maximal reliability of our results. We have therefore successfully shortlisted UTR variants that have a strong effect on translation.

      Building upon these conservative measures, we developed a predictive model for translation effects of UTR variants. Importantly, this model was validated not only with our internal test dataset but also with independent external datasets. In addition, the sequence features identified by the model were validated through reporter assays and in vivo CRISPR editing. These external and functional validations establish the generalizability and robustness of our approach.

      A more detailed analysis of the directionality of changes in translation efficiency is under active investigation. These results will be reported in a separate manuscript currently in preparation.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The authors describe a massively parallel reporter assay (MPRA) screen focused on identifying polymorphisms in 5' and 3' UTRs that affect translation efficiency and thus might have a functional impact on cells. The topic is of timely interest, and indeed, several related efforts have recently been published and preprinted (e.g., https://pubmed.ncbi.nlm.nih.gov/37516102/ and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10635273/). This study has several major issues with the results and their presentation.

      Major comments:

      (1) The main issue is that it appears that the screen has largely failed, yet the reasons for that are unclear, which makes it difficult to interpret. The authors start with a library that includes approximately 6,000 variants, which makes it a medium-sized MPRA. But then, only 483 pairs of WT/mutated UTRs yield high-confidence information, which is already a small number for any downstream statistical analysis, particularly since most don't actually affect translation in the reporter screen setting (which is not unexpected). It is unclear why >90% of the library did not give high-confidence information. The profiles presented as base-case examples in Figure 2B don't look very informative or convincing. All the subsequent analysis is done on a very small set of UTRs that have an effect, and it is unclear to this reviewer how these can yield statistically significant and/or biologically relevant associations.

      To make sure our final results are technically and statistically sound, we applied stringent selection criteria and cutoffs in our analytics workflow. First, from our RNA-seq dataset, we retained only the UTRs with at least 20 reads in a polysome profile across all three repeated experiments. Second, in the subsequent main analysis using a negative binomial generalized linear model (GLM), we further excluded the UTRs that displayed a batch effect, i.e., those whose batch-related main effect and interaction were significant. We believe these measures have safeguarded the filtered observations (UTRs) from the (potentially) high variation of our massively parallel translation assays and thus give high confidence in our results.

      Regarding the interpretation of Figure 2B, since we aimed to identify the UTRs whose interaction term of genotype and fractions is significant in our generalized linear model, it is statistically conventional to double-check the interaction of the two variables using such a graph. For instance, in the top left panel of Figure 2B (5'UTR of ANK2:c.-39G>T), we can see that read counts of WT samples consistently decreased from Mono to Light, whereas the read counts of mutant samples were roughly the same in the two fractions, so the trend differs between WT and mutant. Thus, the distinct distribution patterns of the two genotypes across three fractions in Figure 2B offer the readers a convincing visual supplement to our statistics from the GLM.

      In contrast to Figure 2B, the graphs of nonsignificant UTRs (shown below) reveal that the trends between the two genotypes are similar across the 'Mono and Light' and 'Light and Heavy' polysome fractions. Importantly, our analysis remains unaffected by differential expression levels between WT and mutant, as it specifically distinguishes polysome profiles with different distributions. This consistent trend further supports the lack of interaction between genotype and polysome fractions for these UTRs.

      Author response image 1.

      Examples of non-significant UTR pairs in massively parallel polysome profiling assays.
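
      The genotype-by-fraction interaction discussed in this response can be illustrated with a simple difference of log-ratios. The sketch below is not the negative binomial GLM used in the analysis; it only shows why a profile-shape comparison is insensitive to overall expression differences between WT and mutant. The function name and all read counts are hypothetical and chosen for illustration.

```python
import math

def interaction_contrast(wt_counts, mut_counts, frac_a, frac_b):
    """Difference between genotypes in the log change from frac_a to frac_b.

    A value near 0 means WT and mutant follow the same trend across the two
    fractions; a value far from 0 means the trends differ (an interaction).
    Counts must be positive. All inputs here are hypothetical.
    """
    wt_change = math.log(wt_counts[frac_b]) - math.log(wt_counts[frac_a])
    mut_change = math.log(mut_counts[frac_b]) - math.log(mut_counts[frac_a])
    return mut_change - wt_change

# Hypothetical profile: WT halves from Mono to Light, mutant stays flat.
wt = {"Mono": 120, "Light": 60, "Heavy": 30}
mut = {"Mono": 80, "Light": 80, "Heavy": 40}
shifted = interaction_contrast(wt, mut, "Mono", "Light")

# Hypothetical profile: mutant is expressed at twice the WT level,
# but the shape of the profile is identical, so the contrast vanishes.
wt_same = {"Mono": 100, "Light": 50}
mut_same = {"Mono": 200, "Light": 100}
parallel = interaction_contrast(wt_same, mut_same, "Mono", "Light")

print(f"shifted profile contrast:  {shifted:.3f}")
print(f"parallel profile contrast: {parallel:.3f}")
```

      In a full analysis this contrast would be estimated as the interaction coefficient of a count model with replicate and batch terms, rather than computed from single counts.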

      (2) From the variants that had an effect, the authors go on to carry out some protein-level validations and see some changes, but it is not clear if those changes are in the same direction as observed in the screen.

      To infer the directionality of translation efficiency from polysome profiles, a common approach involves pooling polysome fractions and comparing them with free or monosome fractions to identify 'translating' fractions. However, this method has two major potential pitfalls: (i) it sacrifices resolution and does not account for potential bias toward light or heavy polysomes, and (ii) it fails to account for discrepancies between polysome load and actual protein output (as discussed in https://doi.org/10.1016/j.celrep.2024.114098 and https://doi.org/10.1038/s41598-019-47424-w). Therefore, our analysis focused on the changes within polysome profiles themselves. 'Significant' candidates were identified based on a significant interaction between genotype and polysome distribution using a negative binomial generalized linear model, without presupposing the direction of change on protein output. 

      (3) The authors follow up on specific motifs and specific RBPs predicted to bind them, but it is unclear how many of the hits in the screen actually have these motifs, or how significant motifs can arise from such a small sample size.

      We calculated the Δmotif enrichment in significant UTRs versus nonsignificant UTRs using Fisher’s exact test. For example, the enrichment of the Δ‘AGGG’ motif in 3’ UTRs is shown below:

      Author response table 1.

      This comparison yields a P-value of 0.004167 by Fisher’s exact test. The P-values and odds ratios of Δmotifs in relation to polysome shifting are included in Supplementary Table S4, and we will update the detailed motif information in the revised Supplementary Table S4.
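
      For readers unfamiliar with the enrichment test named above, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table (motif changed or not, by significant or nonsignificant UTR). The counts are hypothetical and are not the values from Author response table 1, so the sketch does not reproduce the reported P-value; the function names are illustrative.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)

def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c); infinite when b*c is zero."""
    return float("inf") if b * c == 0 else (a * d) / (b * c)

# Hypothetical counts: rows = significant / nonsignificant UTRs,
# columns = motif altered / motif not altered.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
ratio = odds_ratio(3, 1, 1, 3)
print(p_value, ratio)
```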

      (4) It is particularly puzzling how the authors can build a machine learning predictor with >3,000 features when the dataset they use for training the model has just a few dozen translation-shifting variants.

      We understand the concern regarding the relatively small number of translation-shifting variants compared to the large number of features. To address this, we employed LASSO regression, which, according to The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, is particularly suitable for datasets where the number of features p is much larger than the number of samples N. LASSO effectively performs feature selection by shrinking less important coefficients to zero, allowing us to build a robust and generalizable model despite the limited number of variants.
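
      The shrinkage-to-zero behaviour described above can be sketched with a small coordinate-descent LASSO. This is a squared-error toy version under stated assumptions; the loss, penalty strength, and software actually used for the predictor are not specified in this response, and the data below are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic updates.

    X is a list of n rows with p features each; y is a list of n responses.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                fit_without_j = sum(X[i][k] * b[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
                norm += X[i][j] ** 2
            b[j] = soft_threshold(rho / n, lam) / (norm / n) if norm > 0 else 0.0
    return b

# Hypothetical data: the weakly associated second feature is dropped entirely.
coefs = lasso_coordinate_descent([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.2], lam=0.5)
print(coefs)  # [2.0, 0.0]
```

      Because the penalty sets many coefficients exactly to zero, the fitted model can use far fewer features than were offered to it, which is what makes the p much larger than N setting tractable.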

      (5) The lack of meaningful validation experiments altering the SNPs in the endogenous loci by genome editing limits the impact of the results.

      Following the reviewer’s suggestion, we assessed the endogenous mutant effect by generating CRISPR knock-in clones carrying the IRF6:c.-4609G>A variant. We showed that this G>A variant generates a deleterious upstream open reading frame, which dramatically reduced protein expression of the main open reading frame (Fig. 7B-D). The genome editing further demonstrated that the G>A variant reduced endogenous IRF6 protein expression to 23% or 44% in two independent clones. We have incorporated the genome editing results in the revised main text and the new Figure 7E&F:

      “To further validate the endogenous effect of the novel upstream ATG (uATG), we generated CRISPR knock-in clones carrying the IRF6:c.-4609G>A variant and examined its impact on gene expression. The introduction of the uATG reduced RNA levels to 88% and 37% of the wild-type in two independent clones (Fig. 7E), and protein levels to 44% and 23%, respectively (Fig. 7F), resulting in an overall reduction of translation efficiency to 50–62%.” (p.18)
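
      The 50–62% range quoted above follows from dividing the relative protein level by the relative RNA level for each clone. The short sketch below reproduces that arithmetic, assuming translation efficiency is defined as protein per unit RNA relative to wild-type; the clone labels and function name are illustrative.

```python
def translation_efficiency(protein_pct, rna_pct):
    """Relative translation efficiency (%) = protein level / RNA level."""
    return 100.0 * protein_pct / rna_pct

# RNA and protein levels (% of wild-type) as quoted for the two clones.
clones = {
    "clone 1": {"rna": 88, "protein": 44},
    "clone 2": {"rna": 37, "protein": 23},
}

for name, levels in clones.items():
    te = translation_efficiency(levels["protein"], levels["rna"])
    print(f"{name}: {round(te)}%")
# clone 1: 50%
# clone 2: 62%
```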

      Reviewer #2 (Public Review):

      Summary:

      In their paper "Massively Parallel Polyribosome Profiling Reveals Translation Defects of Human Disease-Relevant UTR Mutations" the authors use massively parallel polysome profiling to determine the effects of 5' and 3' UTR SNPs (from dbSNP/ClinVar) on translational output. They show that some UTR SNPs cause a change in the polysome profile with respect to the wild-type and that pathogenic SNPs are enriched in the polysome-shifting group. They validate that some changes in polysome profiles are predictive of differences in translational output using transiently expressed luciferase reporters. Additionally, they identify sequence motifs enriched in the polysome-shifting group. They show that 2 enriched 5' UTR motifs increase the translation of a luciferase reporter in a protein-dependent manner, highlighting the use of their method to identify translational control elements.

      Strengths:

      This is a useful method and approach, as UTR variants have been more difficult to study than coding variants. Additionally, their evidence that pathogenic mutations are more likely to cause changes in polysome association is well supported.

      Weaknesses:

      The authors acknowledge that they "did not intend to immediately translate the altered polysome profile into an increase or decrease in translation efficiency, as the direction of the shift was not readily evident. Additionally, sedimentation in the sucrose gradient may have been partially affected by heavy particles other than ribosomes." However, shifted polysome distribution is used as a category for many downstream analyses. Without further clarity or subdivision, it is very difficult to interpret the results (for example in Figure 5A, is it surprising that the polysome shifting mutants decrease structure? Are the polysome "shifts" towards the untranslated or heavy fractions?)

      Our approach, combining polysome fractionation of the UTR library with negative binomial generalized linear model (GLM) analysis of RNA-seq data, systematically identifies variants that affect translational efficiency. The GLM model is specifically designed to detect UTR pairs with significant interactions between genotype and polysome fractions, relying solely on changes in polysome profiles to identify variants that disrupt translation. Consequently, our analytical method does not determine the direction of translation alteration.

      Following the massively parallel polysome profiling, we sought to understand how these polysome-shifting variants influence the translation process. To do this, we examined their effects on RNA characteristics related to translation, such as RBP binding and RNA structure. In Figure 5A, we observed a notable trend in significant hits within 5’ UTRs—they tend to increase ΔG (weaker folding energy) in response to changes in polysome profiles, regardless of whether protein production increases or decreases (Fig. 3).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Minor comments:

      (1) Figure 3A - the claim that 5'UTR variants had a stronger effect than 3'UTR is based on the two UTRs with the strongest effect. It is unclear how these differences between 5' and 3'UTRs are significant.

      We carried out a Wilcoxon rank-sum test to compare the mut/WT fold change in translation efficiency between the 3’ and 5’ UTR variants. The results showed that the 5’ UTR variants exhibited a greater change in translation efficiency. We have inserted this result in the revised Figure 3C and refer to this figure in the main text: “Furthermore, we observed that 5’ UTR variants had a greater impact on translation activity relative to 3’ UTR variants (Fig. 3C).” (p. 12)

      (2) Figures 2B and S1, S2 - what is the meaning of less signal for a light chain and a similar signal for a heavy chain? How can this situation, while being a significant difference between the profiles, lead to a biologically relevant difference in eventual protein output?

      Taking 3’UTR ACADSB:c.*4177G>A (bottom-left panel in Figure 2B) as an example: WT transcripts have lower read counts (in units of log(CPM)) than transcripts carrying the mutant UTR in the light polysome-containing fraction, whereas the read counts of the two genotypes are approximately the same in the heavy polysome-containing fraction.

      In line with our reply to Reviewer 1’s major comment 1, we aimed to identify the UTRs for which the interaction term between genotype and fraction is significant in our generalized linear model (GLM). That is, our targets are the UTR pairs whose WT and mutant show different trends across the fractions (Mono to Light and Light to Heavy). In Figure 2B, 3’UTR ACADSB:c.*4177G>A is a clear example of a significant hit, as the trends of the two genotypes across the three fractions are plainly distinct.
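      To make the idea of a genotype-by-fraction interaction concrete, the sketch below computes a simple "difference of differences" on log(CPM) values. It is only an illustration of what an interaction term captures, not the negative binomial GLM used in the study, and the numbers are hypothetical values chosen to mimic the pattern described for ACADSB (WT lower than mutant in the light fraction, similar in the heavy fraction).

```python
# Hypothetical log(CPM) values for one UTR pair; NOT data from the study.
FRACTIONS = ["mono", "light", "heavy"]
log_cpm = {
    "WT":  {"mono": 5.0, "light": 4.0, "heavy": 5.5},
    "Mut": {"mono": 5.0, "light": 5.5, "heavy": 5.5},
}

def interaction_contrasts(values):
    """Return (Mut step - WT step) for each consecutive pair of fractions.

    A value near zero means the two genotypes change in parallel between
    the two fractions; a value far from zero means their trends diverge.
    """
    contrasts = {}
    for lower, upper in zip(FRACTIONS, FRACTIONS[1:]):
        wt_step = values["WT"][upper] - values["WT"][lower]
        mut_step = values["Mut"][upper] - values["Mut"][lower]
        contrasts[f"{lower}->{upper}"] = mut_step - wt_step
    return contrasts

contrasts = interaction_contrasts(log_cpm)
print(contrasts)
```

      Non-zero contrasts correspond to non-parallel WT and mutant lines across fractions; a formal test would additionally model count dispersion and replicate variation.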

      It is well established that a change in the distribution of a transcript across polysome fractions indicates a change in its translational efficiency. Our GLM helped us identify the UTR pairs whose WT and mutant have different polysome profiles and are thus likely to differ in translational efficiency. Nevertheless, since we had only a limited number of polysome fractions in our experiments, we further validated our significant hits and confirmed the direction of regulation using luciferase reporter assays.

      (3) The paragraph starting with "Even with the high confidence dataset, we did not intend to immediately translate the altered polysome profile into an increase or decrease in translation efficiency" is confusing. The whole premise of the screen used by the authors is that polysome profiling is a useful proxy for estimating levels of translation, so claiming that it doesn't necessarily measure translation is counterintuitive.

      In line with our reply to the last question, our goal is to use the alteration of polysome profiling patterns as a proxy for the change in translational efficiency. However, due to the limited number of fractions in our experiment, we could not directly infer the direction of regulation, i.e. an increase or decrease in translational efficiency, for the statistically significant variants. That is why we refrained from drawing any conclusion about the direction of regulation for the significant hits and proceeded to validate them using luciferase reporter assays.

      (4) Figure S5A - this is normalized to the nucleotide distribution in 5' or 3'UTRs? Is this statistic being applied to 27 SNPs in 3'UTRs?

      To identify sequence features associated with altered polysome association, we systematically analyzed both significant and nonsignificant UTRs for nucleotide and motif-level changes. Fisher’s exact test was employed to evaluate whether specific nucleotide or motif alterations were enriched or depleted in polysome-shifting UTRs, compared to nonsignificant UTR pairs. For example, in the case of nucleotide C (see table below; also Table S4 and new Fig. S6A), only four significant 3’ UTRs involved a change in C, resulting in a significant depletion of this nucleotide change among polysome-shifting 3’ UTRs (odds ratio = 0.22, p = 0.0069). Expanding this approach to all 1-7 nt motifs, we identified multiple motif and nucleotide changes that were significantly associated with altered polysome association.

      Author response table 2.
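      For readers unfamiliar with the enrichment test described above, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table, written with the standard library only. The function name and the example counts are illustrative assumptions; the counts are an arbitrary demonstration table and are not the values from the study's table.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value). The p-value sums the
    hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of observing x in the top-left cell given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)  # tolerance for float ties
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)

# Arbitrary demonstration counts (hypothetical, not from the study):
# rows = significant / nonsignificant UTR pairs,
# columns = feature changed / feature unchanged.
odds_ratio, p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(odds_ratio, round(p_value, 4))
```

      An odds ratio below 1, such as the 0.22 reported for nucleotide C, indicates depletion of the feature among polysome-shifting UTRs, whereas a value above 1 indicates enrichment.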

      (5) "uATG in the 5' UTR was not identified by the model as a widespread feature explaining polysome shifting". Is this because of the method of ribosome profiling or because of the sequences in the library? Can having more sequences in the library specifically looking at 5'UTR give more power for such an effect to emerge?

      Our assay design accounted for the presence of upstream ATG codons and the strength of adjacent Kozak sequences. However, additional factors known to influence the function of upstream open reading frames (uORFs), such as the reading frame of the uORF relative to the main coding sequence and the use of non-ATG initiation codons, were not systematically included. As a result, the current assay may have limited sensitivity in detecting uORF-related regulatory effects. A dedicated design specifically tailored to uORF variants is likely to enhance the detection power and better capture their contribution to translational control.

      (6) Figure 7B - it is not clear whether the luciferase reporter and the GFP reporter in the library function in a similar manner; is it creating an out-of-frame or an in-frame uORF? Also, it is not clear if the differences are statistically significant.

      In the MPRA library, the IRF6 uORF is out of frame relative to the GFP coding sequence. To directly assess its translational impact, we employed a luciferase reporter assay by fusing luciferase downstream of the IRF6 uORF. These constructs revealed a significant reduction in protein production, as shown in Figures 3 and 7B–F. Although the clinically relevant IRF6 uORF is out of frame with the main ORF, we engineered an in-frame uORF variant to validate translation initiation at the upstream ATG (uATG) (Fig. 7B-D). The in-frame construct confirmed uATG usage and led to a significant reduction in luciferase protein expression. Together, these results support the conclusion that the IRF6:c.-4609G>A variant gives rise to an active uORF that suppresses translation of the main ORF.

      Reviewer #2 (Recommendations For The Authors):

      (1) It would be helpful for the authors to subcategorize their data in ways that they consider meaningful and interpretable (e.g. shifts from all monosome to heavy, all heavy to monosome/free, etc.) Relatedly, what do the authors think the functional meaning is when a given transcript has high mono/heavy occupancy but low light occupancy (like what is shown in Figure 2B for ANK2) in the polysome profiling experiment? It is not apparent why a transcript with a high ribosome occupancy (heavy) would also have light occupancy (light).

      From the amplicon sequencing data, we obtained read counts for each UTR variant across the monosome, light, and heavy polysome fractions. Notably, this approach does not preserve the original relative abundance of transcripts among the three fractions. That is, despite a greater abundance of mRNAs in the heavy polysome fraction, comparable numbers of sequencing reads were recovered from the monosome and light fractions. As a result, this method is not suitable for interpreting the global directionality of translational shifts but is well-suited for detecting relative differences in polysome association. Therefore, our experimental and analytical design—combining targeted amplicon sequencing with generalized linear modeling (GLM)—was optimized to identify UTR variants that alter polysome association, independently of absolute transcript abundance in each fraction.

      (2) The method put forward in Figure 2 would be more convincing if there was data showing reproducibility in the massively parallel reporter assay. Perhaps the mut/WT ratio for all transcripts can be plotted against each other and a statistical test of correlation can be performed.

      Thank you for pointing this out. To demonstrate the reproducibility of our massively parallel reporter assay, we generated scatter plots of the mut/WT ratios of all transcripts (summing the monosome, light, and heavy fractions) across different batches using our high-confidence dataset. We calculated the Pearson correlation coefficients and corresponding p-values for these comparisons. The results show strong correlation between batches, supporting the reproducibility of our assay. We have incorporated this analysis in the main text as well as Supplemental Figure 3: “Pearson correlation analysis revealed R coefficients ranging from 0.59 to 0.71 for the mut-to-WT transcript ratios across three independent experiments (Supplemental Fig. 3).”
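      As an illustration of the batch-to-batch comparison described above, the sketch below computes a Pearson correlation coefficient between the mut/WT ratios of two batches. The ratios are hypothetical placeholders, the function name is illustrative, and the p-value calculation is omitted.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical mut/WT ratios for the same five UTR pairs in two batches.
batch_1 = [0.8, 1.0, 1.2, 1.5, 0.6]
batch_2 = [0.9, 1.1, 1.0, 1.6, 0.7]

r = pearson_r(batch_1, batch_2)
print(f"Pearson R = {r:.2f}")
```

      With three batches, the same function would be applied to each of the three pairwise comparisons.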

      (3) The dots in Figure 2B indicate separate experiments, but the y-axis is log(counts). Values could be normalized (perhaps a ratio of mut/WT) for comparison between experiments.

      We aimed to compare UTR distribution across polysome fractions and recognized the importance of presenting the distribution patterns for both genotypes. This approach allows us to more clearly illustrate the differences or similarities in polysome association between the two genotypes.

      (4) When describing the 5' UTRs used for the validation experiments in Figure 3, more information about the 5' UTR sequence used is necessary. It is not clear how much or what part of the 5' UTRs were removed, or why this was necessary considering the same experiment was conducted using full-length UTRs.

      In the initial library design, technical limitations of bulk oligonucleotide synthesis constrained the UTRs to 155 nucleotides, comprising 115-nt of endogenous human UTR sequence flanked by 20-nt priming sites on both ends. Variants were centered at the 58th nucleotide within the 115-nt UTR sequence. When one flanking region of the native UTR was shorter than 57 nt, the variant was shifted accordingly toward the shorter arm to maintain the 115-nt UTR length (Fig. 2A).
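      The windowing rule described above can be sketched as follows. This is an illustrative reading of the description rather than the authors' actual design script: the function name is hypothetical, positions are 0-based, and UTRs shorter than 115 nt are simply rejected because the text does not say how they were handled.

```python
UTR_SEGMENT_LEN = 115   # endogenous UTR sequence per oligo
PRIMING_SITE_LEN = 20   # priming site on each end
CENTER_OFFSET = 57      # 0-based offset of the 58th nucleotide

def segment_window(utr_len, variant_index):
    """Return (start, end, offset) for the 115-nt segment around a variant.

    `variant_index` is the 0-based variant position in the full UTR.
    `start` and `end` form a half-open slice into the full UTR, and
    `offset` is the 0-based variant position inside that slice.
    """
    if utr_len < UTR_SEGMENT_LEN:
        raise ValueError("UTR shorter than 115 nt is not covered by this sketch")
    if not 0 <= variant_index < utr_len:
        raise ValueError("variant_index lies outside the UTR")
    # Centre the variant, then clamp the window so it stays inside the UTR.
    start = variant_index - CENTER_OFFSET
    start = max(0, min(start, utr_len - UTR_SEGMENT_LEN))
    return start, start + UTR_SEGMENT_LEN, variant_index - start

oligo_len = UTR_SEGMENT_LEN + 2 * PRIMING_SITE_LEN
print(oligo_len)                 # total synthesized length
print(segment_window(300, 150))  # variant far from both ends: centred
print(segment_window(300, 10))   # short upstream flank: variant shifts left
```

      When both flanks are at least 57 nt long the variant sits at offset 57 (the 58th nucleotide); otherwise the window is clamped to the UTR boundary and the variant moves toward the shorter arm while the segment length stays at 115 nt.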

      Given that endogenous UTRs in the human genome are often longer than 155 nt, we further evaluated the functional consequences of variants within full-length UTR sequences (Fig. 3B). While the mutant effects observed in the library setting were largely recapitulated, their magnitude was diminished in the full-length context, likely due to the increased sequence and structural complexity.

      To clarify the experimental design related to Figure 3, we modified the text as the following: “The variants significantly altering the polysome profile were then individually validated by means of high-sensitivity luciferase reporter assays (Fig. 3A). To that end, we resynthesized both the variant and corresponding wildtype alleles in the same library format - 115-nt native UTR segments centered on the variant and flanked by 20-nt priming sites. These UTRs were then cloned upstream (5’) or downstream (3’) of the firefly luciferase coding sequence, depending on their genomic location.” (p. 11)

      (5) The conclusions from inserting RBP-binding motifs into 5' UTRs and assaying translational output (Figure 4) would be strengthened by including luciferase reporters containing endogenous 5' UTRs containing these motifs, and versions where the motifs are disrupted.

      Several variants that altered translation efficiency were validated in their native sequence contexts, including 5’ UTR variants in DMD and NF1 that affect SRSF1/2 binding sites, as well as a 3’ UTR variant in AL049650.1 that impacts a KHSRP binding site (Fig. 3 and Supplemental Figs. S1 & S2). To address the functional relevance of these variants within their native regulatory landscapes, we have incorporated the following clarification into the text (p. 13): “This observation is consistent with additional findings where variants that create or disrupt specific RBP binding sites—such as SRSF1/2 (e.g., in DMD and NF1; Fig. 2 and Supplementary Fig. S4) and KHSRP (e.g., in AL049650.1; Fig. 2 and Supplementary Figs. S4 & S5)—led to significant changes in translation efficiency within their native UTR contexts.”

      (6) Figure 5C shows that 5' UTR SNPs that form an uAUG are associated with greater structural changes, but this does not "indicate" that "structure‐modifying UTR variants may control primary ORF translation partly by interfering with translation initiation from a uORF." The data presented in Figure 5 and luciferase/polysome data presented previously do not distinguish whether translation is occurring at an uAUG or canonical AUG. The statement quoted above is speculative and it should be clear that it is a hypothesis generated by the data and is not conclusive.

      We appreciate the reviewer’s suggestion. We have therefore modified our text to: “Therefore, while changes in uATG may not be common explanatory factors for polysome-shifting mutations, our results suggest that structure-modifying UTR variants may control primary ORF translation partly by interfering with translation initiation from a uORF.” (p. 14)

      Minor points/questions

      (1) The authors should clarify whether during library construction for massively parallel polysome profiling the 3' UTR constructs contain a common 5' UTR? Likewise, do the 5' UTR constructs contain a common 3' UTR? Perhaps the lack of a 5' UTR in the 3' UTR constructs, which is implied by Figure 2A, would influence differences seen between 3' UTR pairs (and likewise for 5' UTR pairs).

      A short common 5’ UTR is appended to the constructs of the 3’ UTR library, and likewise, a short common 3’ UTR is included in the 5’ UTR library. The common 5’ UTR comprises partial sequences from the CMV promoter and the plasmid backbone of the pEGFP-N1 vector. The common 3’ UTR includes sequences from the pEGFP-N1 backbone and a short polyadenylation signal from HBA1 (hemoglobin subunit alpha 1). While we cannot entirely rule out potential crosstalk between 5’ and 3’ UTRs, the design ensures that all constructs are compared in a controlled and consistent context, enabling valid pairwise comparisons between variant and wildtype alleles.

      To clarify the library design, we have revised the main text to include this explanation: 

      “The entire library of UTR oligonucleotides (UTR library) was subsequently ligated upstream or downstream of an enhanced GFP (EGFP) coding region, along with a CMV promoter and a common UTR sequence on the opposite end. Cells transfected with the UTR library were treated with cycloheximide 14 hours post transfection and then subjected to polysome fractionation (see Methods).” (p.11) 

      “The variants significantly altering the polysome profile were then individually validated through high-sensitivity luciferase reporter assays (Fig. 3A). To this end, we resynthesized both the variant and corresponding wildtype alleles in the same library format - 115-nt native UTR segments centered on the variant and flanked by 20-nt priming sites. These UTRs were then cloned upstream (5’) or downstream (3’) of the firefly luciferase coding sequence, depending on their genomic location. As in the initial library design, the test UTR segment differs only by one nucleotide, while a shared short UTR fragment is present on the opposite end of the coding sequence to ensure consistency across constructs (Fig. 2A).” (p. 12)

      (2) The lines connecting the polysome distribution points make the plots appear busy and difficult to read, the data would be easier to interpret if they were removed.

      We employed a generalized linear model (GLM) to identify the variants that altered the polysome association of the corresponding transcripts. Statistically speaking, we were looking for variants that led to a significant interaction between genotype and polysome fraction. Displaying the lines in our plots therefore offers readers a direct visualization of the interaction: when the lines from the WT and Mut groups are not parallel, genotype and polysome fraction interact. Moreover, showing the lines from three batches of experiments also helps us ascertain the reproducibility of our experiments. Taken together, the lines make our plots more informative.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      The study could also valuably explore what kinds of genes experienced what forms of expression evolution. A brief description of GO terms frequently represented in genes which showed strong patterns of expression evolution might be suggestive of which selective pressures led to the changes in expression in the C. bursa-pastoris lineage, and to what extent they related to adaptation to polyploidization (e.g. cell-cycle regulators), compensating for the initial pollen and seed inviability or adapting to selfing (endosperm- or pollen-specific genes), or adaptation to abiotic conditions.

      We did not include a gene ontology (GO) analysis in the first place as we did not have a clear expectation of which GO terms would be enriched in the genes that are differentially expressed between resynthesized and natural allotetraploids. Even if we only consider adaptive changes, the modifications could occur in various aspects, such as stabilizing meiosis, adapting to the new cell size, reducing hybrid incompatibility and adapting to self-fertilization, and each of these modifications involves numerous biological processes and molecular functions. As we could make post-hoc stories for too many GO terms, extrapolating at this stage has limited implications and could be misleading.

      Nonetheless, ours is not the only study that has compared newly resynthesized and established allopolyploids. GO terms that are repeatedly revealed by this type of exploratory analysis may give a hint for future studies. For this reason, we have now reported the results of a simple GO analysis.

      Recommendations for the authors: please note that you control which revisions, if any, to undertake

      The majority of concerns from reviewers and the reviewing editor are in regards to the presentation of the manuscript; that the framing of the manuscript does not help the general reader understand how this work advances our knowledge of allopolyploid evolution in the broad sense. The manuscript may be challenging to read for those who aren't familiar with the study system or the genetic basis of polyploidy/gene expression regulation. Further, it is difficult to understand from the introduction how this work is novel compared to the recently published work from Duan et al and compared to other systems. Because eLife is a journal that caters to a broad readership, re-writing the introduction to bring home the novelty for the reader will be key.

      Additionally, the writing is quite technical and contains many short-hands and acronyms that can be difficult to keep straight. Revising the full text for clarity (and additionally not using acronyms) would help highlight the findings for a larger audience.

      Reviewer #1 (Recommendations For The Authors):

      Most of my suggestions on this interesting and well-written study are minor changes to clarify the writing and the statistical approaches.

      The use of abbreviations throughout for both transcriptional phenomena and lines is logical because of word limits, but for me as a reader, it really added to the cognitive burden. Even though writing out "homoeolog expression bias" or "hybridization-first" every time would add length, I would find it easier to follow and suspect others would too.

      Thank you for this suggestion. Indeed, using fewer uncommon acronyms or short-hands should increase the readability of the text for a broader audience. Now in most places, we refer to “Sd/Sh” and “Cbp” as “resynthesized allotetraploids” and “natural allotetraploids”, respectively. We have also replaced most occurrences of the acronyms for transcriptional phenomena (ELD, HEB and TRE) with full phrases, unless there are extra attributes before them (such as “Cg-/Co-ELD” and “relic/Cbp-specific ELD”).

      It would be helpful to include complete sample sizes to either a slightly modified Figure 1 or the beginning of the methods, just to reduce mental arithmetic ("Each of the five groups was represented by six "lines", and each line had six individuals" so there were 180 total plants, of which 167 were phenotyped - presumably the other 13 died? - and 30 were sequenced).

      The number 167 only applied to floral morphological traits (“Floral morphological traits were measured for all five groups on 167 plants…”), and the exact total sample size differed for other traits. The total sample sizes of the other traits have now also been added to the beginning of the second paragraph of the methods.

      For this study, 180 seedlings were transplanted from Petri dishes to soil, but 8 seedlings died right after transplanting, seemingly because of mechanical damage and insufficient moistening. Later phenotyping (February to May 2020) was also disrupted by the COVID-19 pandemic, and some individuals were not measured as we missed the right life stages. Specifically, 5 individuals were missing for floral morphological traits (sepal width, sepal length, petal width, petal length, pistil width, pistil length, and stamen length), 30 for pollen traits, 1 for stem length, and 2 for flowering time. As for seed traits, we only measured individuals with more than ten fruits, so apart from the reasons mentioned above, individuals that were self-incompatible and had insufficient hand-pollination were also excluded. We spotted another mistake during the revision: two individuals with floral morphological measurements had no positional information (tray ID). These measurements were likely mis-sampled or mislabeled, and were therefore excluded from analysis. We assumed most of these missing values resulted from random technical mistakes and were not directly related to the measured traits.
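      The sample-size tally implied by these numbers can be checked with a few lines of arithmetic. The sketch assumes that each "missing" count is relative to the seedlings that survived transplanting; it does not cover seed traits or the two individuals later excluded for lacking a tray ID, since the text gives no totals for those.

```python
GROUPS, LINES_PER_GROUP, PLANTS_PER_LINE = 5, 6, 6

planted = GROUPS * LINES_PER_GROUP * PLANTS_PER_LINE  # seedlings transplanted
survivors = planted - 8                               # 8 died after transplanting

# Individuals reported missing for each trait set.
missing = {
    "floral morphology": 5,
    "pollen": 30,
    "stem length": 1,
    "flowering time": 2,
}

# Assumption: missing counts are relative to the transplant survivors.
measured = {trait: survivors - n for trait, n in missing.items()}

for trait, n in measured.items():
    print(f"{trait}: {n} plants")
```

      Under this assumption the floral morphology total comes out at 167, which agrees with the number quoted from the manuscript.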

      In general, the methods did a thorough job of describing the genomics approaches but could have used more detail for the plant growth (were plants randomized in the growth chamber, can you rule out block/position effects) and basic statistics (what statistical software was used to perform which tests comparing groups in each section, after the categories were identified).

      When describing the methods, mention whether the plants; this should be straightforward as a linear model with position as a covariate.

      Data used in the present study and a previously published work (Duan et al., 2023) were different subsets of a single experiment. For this reason, we spent fewer words on describing shared methods in this manuscript but tried to summarize the methods that were essential for understanding the current paper. But as you have pointed out, we did miss many important details that should have been kept. We have now added a description and a table (Supplementary file 1) in the “Plant material” section to explain the randomization, and added more information on the software used for performing statistical tests in the “Phenotyping” section.

      Although we did not mention it in the present manuscript, we used a randomized block design for the experiment (Author response image 1).

      Author response image 1.

      Plant positions inside the growth chamber. Plants used in the present study and Duan et al. (2023) were different subsets of a single experiment. The entire experiment had eight plant groups: the five plant groups used in the present study (diploid C. orientalis (Co2), diploid C. grandiflora (Cg2), “whole-genome-duplication-first” (Sd) and “hybridization-first” (Sh) resynthesized allotetraploids, and natural allotetraploid C. bursa-pastoris (Cbp)), as well as three plant groups that were only used in Duan et al. (2023): tetraploid C. orientalis (Co4), tetraploid C. grandiflora (Cg4) and diploid hybrids (F). Each of the eight plant groups had six lines and each line was represented by six plants, resulting in 288 plants (8 groups x 6 lines x 6 individuals = 288 plants). The 288 plants were grown in 36 trays placed on six shelves inside the same growth chamber. Each tray had exactly one plant from each of the eight groups, and the positions of the eight plants within each tray (A-H) were randomized with the random.shuffle() method in Python (Supplementary file 1). The positions of the 36 trays inside the growth room (1-36) were also random, and the positions of all trays were shuffled once again 28 days after germination (randomized with RAND() and sorting in Microsoft Excel). (a) Plant distribution; (b) An example of one tray; (c) A view inside the growth chamber, showing the six benches.
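      The within-tray randomization described in this legend can be sketched as below. This is a minimal reconstruction under stated assumptions, not the authors' script: the function name and the fixed seed are hypothetical, and the sketch only assigns groups to positions A-H, without tracking which line or individual of each group goes into which tray.

```python
import random

GROUPS = ["Co2", "Cg2", "Sd", "Sh", "Cbp", "Co4", "Cg4", "F"]
POSITIONS = "ABCDEFGH"   # eight positions within a tray
N_TRAYS = 36             # 6 lines x 6 individuals per group

def randomize_layout(seed=None):
    """Return {tray number: {position: group}} with one plant per group per tray."""
    rng = random.Random(seed)
    layout = {}
    for tray in range(1, N_TRAYS + 1):
        order = GROUPS[:]      # copy so the master list stays untouched
        rng.shuffle(order)     # random order of the eight groups
        layout[tray] = dict(zip(POSITIONS, order))
    return layout

# The seed is a hypothetical choice, used here only for repeatability.
layout = randomize_layout(seed=0)
print(layout[1])
```

      Because every tray contains exactly one plant from each group, each tray acts as a complete block, which is what allows tray ID to be used as a blocking factor in the statistical models.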

      With the randomized block design and one round of shuffling, positional effects are very unlikely to bias the comparison among the five plant groups. The main risk of not adding position to the statistical model is an increase in error variance and a decrease in the statistical power for detecting a group effect. As we had already observed significant among-group variation in all phenotypic traits (p-value <2.2e-16 for group effect in most tests), further increasing statistical power is not our primary concern. In addition, during the experiment we did not notice obvious differences in plant growth related to position. Although we could have added more variables to account for potential positional effects (tray ID, shelf ID, positions in a tray etc.), adding variables with little effect may reduce statistical power due to the loss of degrees of freedom.

      Because of the round of random shuffling, position cannot easily be added as a single continuous variable. We have now redone all the statistical tests on phenotypic traits and included tray ID as a categorical factor (Figure 2-Source Data 1). In general, the results were similar to those of the models without tray ID. The F-values of the group effect changed only slightly, and p-values were almost unchanged in most cases (still < 2.2e-16). The tray effect (df=35) was not significant in most tests and was only significant for petal length (p-value=0.0111), sepal length (p-value=0.0242) and the number of seeds in ten fruits (p-value=0.0367). As expected, position (tray ID) had a limited effect on phenotypic traits.

      Figure 2 - I assume the numbers at the top indicate sample sizes but perhaps add this to the figure caption.

      Statistical power depends on both the total sample size and the sample size of each group, especially the group with the fewest observations. We lost different numbers of measurements for each phenotypic trait, and for pollen traits the loss was notable, so we chose to show sample sizes above each group to increase transparency. Since we had five different sets of sample sizes (for floral morphological traits, stem length, days to flowering, pollen traits and seed traits, respectively), it would be cumbersome to introduce all 25 numbers in the figure caption, and it could be hard for readers to match the sample sizes with the results. For this reason, we would like to keep the sample sizes in the figure, and we have now modified the legend to clarify that the numbers above groups are sample sizes.

      ’The trend has been observed in a wide range of organisms, including ...’ - perhaps group Brassica and Raphanobrassica into one clause in the sentence, since separating them out undermines the diversity somewhat.

      Indeed, it is very strange to put “cotton” between two representatives from Brassicaceae. Now the sentence is changed to “… including Brassica (Wu et al., 2018; Li et al., 2020; Wei et al., 2021) and Raphanobrassica (Ye et al., 2016), cotton (Yoo et al., 2013)…”

      The diagrams under the graph in Figure 4B are particularly helpful for understanding the expression patterns under consideration! I appreciated them a lot!

      Thank you for the comment. We also feel the direction of expression level dominance is convoluted and hard to remember, so we adopted the convention of showing the directions with diagrams.

      Reviewer #2 (Recommendations For The Authors):

      The science is very interesting and thorough, so my comments are mostly meant to improve the clarity of the manuscript text:

      • I found it challenging to remember the acronyms for the different gene expression phenomena and had to consistently cross-reference different parts of the manuscript to remind myself. I think using the full phrase once or twice at the start of a paragraph to remind readers what the acronym stands for could improve readability.

      Thank you for this reasonable suggestion. We have now replaced most occurrences of the acronyms with the full phrases.

      • There are some technical terms, such as "homoeologous synapsis" and "disomic inheritance", which I think are under-defined in the current text.

      Indeed, these terms were not well defined before being used in the manuscript. We have now added a brief explanation for each term.

      • "Under the joint action of these forces, allopolyploid subgenomes are further coordinated and degenerated, and subgenomes are often biasedly fractionated" This sentence has some unclear terminology. Does "coordinated" mean co-adapted, co-inherited, or something else? Is "biasedly fractionated" referring to biased inheritance or evolution of one of the parental subgenomes?

      We apologize for not using accurate terms. By “coordinated” we meant that the evolution of both homoeologs depends on selection on the total expression of both homoeologs, and on both relative and absolute dosages, which may have shifted away from their optima after allopolyploidization. “Co-evolved” or “co-adapted” might be a better word.

      But the term "biasedly fractionation" has been commonly used for referring to the phenomenon that genes from one subgenome of polyploids are preferentially retained during diploidization (Woodhouse et al., 2014; Wendel, 2015). Instead of inventing a new term, we prefer to keep the same term for consistency, so readers could link our findings with numerous studies in this field. Now the sentence is changed to “Under the joint action of these forces, allopolyploid subgenomes are further co-adapted and degenerated, and subgenomes are often biasedly retained, termed biased fractionation”.

      • There are a series of paragraphs in the results, starting with "Resynthesized allotetraploids and the natural Cbp had distinct floral morphologies", which consistently reference Figure 1 where they should be referencing Figure 2.

      Thank you for spotting this mistake! Now the numbers have been corrected.

      • ‘The number of pollen grains per flower decreased in natural Cbp’ this wording implies it's the effect of some experimental treatment on Cbp, rather than just measured natural variation.

      Yes, it is not scientifically precise to say this in the Results section, especially when describing the details of the results. We meant that, assuming resynthesized allopolyploids are a good approximation of the initial state of natural allotetraploid C. bursa-pastoris, our results indicate that the number of pollen grains has decreased in natural C. bursa-pastoris. But this is an implication rather than an observation, so the sentence is better rewritten as “Natural allotetraploids had less pollen grains per flower.”

      • ‘The percentage of genes showing complete ELD was altogether limited but doubled between resynthesized allotetraploid groups and natural allotetraploids’ for clarity, I would suggest revising this to something like "doubled in natural allotetraploids relative to resynthesized allotetraploids".

      Thank you for the suggestion. The sentence has been revised as suggested.

      • I'm not sure I understand what the difference is between expression-level dominance and homeolog expression bias. It seems to me like the former falls under the umbrella of the latter.

      Expression-level dominance and homeolog expression bias are easily confused, but they are conceptually independent. A gene can show expression-level dominance without any homeolog expression bias, or strong homeolog expression bias without any expression-level dominance. The concepts are well explained, with clear figures, in Grover et al. (2012).

      Expression-level dominance compares the total expression level of both homoeologs in allopolyploids with the expression of the same gene in the parental species, and asks whether the total expression level in allopolyploids is similar to that of only one of the parental species. The contributions of the different homoeologs are not distinguished.

      Homoeolog expression bias, by contrast, compares the relative expression levels of the two homoeologs in allopolyploids, with no implication for the total expression of both homoeologs.

      Let the expression level of one gene in parental species X and Y be e(X) and e(Y), respectively. And let the expression level of x homoeolog (from species X) and y homoeolog (from species Y) in allopolyploids be e(x) and e(y), respectively.

      Then a (complete) expression level dominance toward species X means: e(x)+e(y)=e(X) and e(x)+e(y)≠e(Y);

      By contrast, a homoeolog expression bias toward species X means: e(x) > e(y), or e(x)/e(y) > e(X)/e(Y), depending on the definition adopted in a given study.

      Both expression-level dominance and homeolog expression bias have been widely studied in allopolyploids (Combes et al., 2013; Li et al., 2014; Yoo et al., 2014; Hu & Wendel, 2019). As the two phenomena can act in opposite directions and may be caused by different mechanisms, we think that adopting the definitions in Grover et al. (2012) and distinguishing the two concepts facilitates communication.
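
The two definitions above can be illustrated with a minimal Python sketch. It is not the analysis used in the manuscript: the function names, the example expression values, and the relative tolerance `rel_tol` (a crude stand-in for a statistical test of "equal expression") are all assumptions made for illustration.

```python
import math


def expression_level_dominance(e_X, e_Y, e_x, e_y, rel_tol=0.1):
    """Return "X", "Y", or None for (complete) expression-level dominance.

    rel_tol is a hypothetical tolerance replacing a proper statistical test.
    """
    total = e_x + e_y  # homoeolog contributions are pooled, not distinguished
    like_X = math.isclose(total, e_X, rel_tol=rel_tol)
    like_Y = math.isclose(total, e_Y, rel_tol=rel_tol)
    if like_X and not like_Y:
        return "X"
    if like_Y and not like_X:
        return "Y"
    return None


def homoeolog_expression_bias(e_x, e_y):
    """Return "X", "Y", or None using the simple definition e(x) > e(y)."""
    if e_x > e_y:
        return "X"
    if e_y > e_x:
        return "Y"
    return None


def relative_homoeolog_expression_bias(e_X, e_Y, e_x, e_y):
    """Return "X", "Y", or None using e(x)/e(y) > e(X)/e(Y).

    The ratios are compared by cross-multiplication, which assumes that
    e(y) and e(Y) are positive.
    """
    lhs, rhs = e_x * e_Y, e_X * e_y
    if lhs > rhs:
        return "X"
    if lhs < rhs:
        return "Y"
    return None


# Hypothetical gene: total expression in the allopolyploid matches parent X,
# yet the y homoeolog contributes most of that expression.
eld = expression_level_dominance(e_X=100.0, e_Y=40.0, e_x=30.0, e_y=70.0)
heb = homoeolog_expression_bias(e_x=30.0, e_y=70.0)
print(eld, heb)
```

With these made-up values the gene shows expression-level dominance toward X but homoeolog expression bias toward Y, which is the kind of opposite-direction case that motivates keeping the two concepts separate.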

      • Is it possible to split up the results in Figure 7 to show which of the two homeologs was lost (i.e. orientalis vs. grandiflora)? Or at least clarify in the legend that these scenarios are pooled together in the figure?

      Using acronyms without explanation may have made the figure titles hard to understand, but in the original Figure 7 the losses of the two homoeologs were shown separately. Figure 7a,c showed the loss of the C. orientalis homoeolog (“co-expression loss”), and Figure 7b,d showed the loss of the C. grandiflora homoeolog (“cg-expression loss”). The legends have now been modified to explain the figure.

      • The paragraph starting with "The extant diploid species" is too long, should probably be split into two paragraphs and edited for clarity.

      The whole paragraph was used to explain why the resynthesized allotetraploids could be a realistic approximation of the early stage of C. bursa-pastoris with two arguments:

      1) The further divergence between C. grandiflora and C. orientalis after the formation of C. bursa-pastoris should be small compared to the total divergence between the two parental species; 2) The mating systems of the real parental populations were most likely the same as today. The two arguments have now been separated into two paragraphs, and the second paragraph has been shortened.

      • "On the other hand, the number of seeds per fruit" implies this is evidence for an alternative hypothesis, when I think it's really just more support for the same idea.

      “On the other hand” was used to contrast the reduced number of pollen grains with the increased number of seeds in natural allotetraploids. As both changes are typical of the selfing syndrome, the two indeed support the same idea. We have replaced “On the other hand” with “Moreover”.

      • "has become self-compatible before the formation": "has become" should be "became".

      The tense of the word has been changed.

      • ‘If natural C. bursa-pastoris indeed originated from the hybridization between C. grandiflora-like outcrossing plants and C. orientalis-like self-fertilizing plants, the selfing syndrome in C. bursa-pastoris does not reflect the instant dominance effect of the C. orientalis alleles, but evolved afterward.’ This sentence should be closer to the end of the paragraph, after the main morphological results are summarized.

      Thank you for the suggestion. The paragraph is indeed more coherent after moving the conclusion sentence.

      References

      Combes, M.C., Dereeper, A., Severac, D., Bertrand, B. & Lashermes, P. (2013) Contribution of subgenomes to the transcriptome and their intertwined regulation in the allopolyploid Coffea arabica grown at contrasted temperatures. New Phytologist, 200, 251–260.

      Grover, C.E., Gallagher, J.P., Szadkowski, E.P., Yoo, M.J., Flagel, L.E. & Wendel, J.F. (2012) Homoeolog expression bias and expression level dominance in allopolyploids. New Phytologist, 196, 966–971.

      Hu, G. & Wendel, J.F. (2019) Cis – trans controls and regulatory novelty accompanying allopolyploidization. New Phytologist, 221, 1691–1700.

      Li, A., Liu, D., Wu, J., Zhao, X., Hao, M., Geng, S., et al. (2014) mRNA and Small RNA Transcriptomes Reveal Insights into Dynamic Homoeolog Regulation of Allopolyploid Heterosis in Nascent Hexaploid Wheat. The Plant Cell, 26, 1878–1900.

      Wendel, J.F. (2015) The wondrous cycles of polyploidy in plants. American Journal of Botany, 102, 1753–1756.

      Woodhouse, M.R., Cheng, F., Pires, J.C., Lisch, D., Freeling, M. & Wang, X. (2014) Origin, inheritance, and gene regulatory consequences of genome dominance in polyploids. Proceedings of the National Academy of Sciences of the United States of America, 111, 5283–5288.

      Yoo, M.J., Liu, X., Pires, J.C., Soltis, P.S. & Soltis, D.E. (2014) Nonadditive Gene Expression in Polyploids. Annual Review of Genetics, 48, 485–517. https://doi.org/10.1146/annurev-genet-120213-092159

    1. Author Response

      The following is the authors’ response to the original reviews.

      First, we discovered several erroneous duplicate values in our source data sets from figures S1, 2, 4, and 8, due to mistakes from MATLAB analysis. We have re-analyzed the data and corrected these errors; since limited values in each data set changed, the results were unaffected. The changes are reflected in updated figures and source data.

      Overall, the reviewers gave a positive assessment of our work, but had reservations about:

      (1) Specifics of the iGluSnFR data and analysis

      (2) Overstatement/oversimplification of the importance of syt7 and Doc2

      (3) The strength and interpretation of the EM data

      (4) The relevance and parametrization of the modeling data

      (1) We have clarified aspects of the iGluSnFR data and analysis in the point-by-point response, as well as in the manuscript.

      (2) We have toned down our statements about the role of syt7 and Doc2 throughout, and emphasized that the DKO data are conclusive and reveal that there must be additional Ca2+ sensors for AR. We have also added to the discussion, noting syt3 as a strong candidate to perform a function analogous to syt7 (to regulate docking), along with another protein (or proteins) performing a role similar to Doc2 (directly in fusion) that has not been identified as a candidate in the field yet.

      (3) We feel the EM data are consistent with the model as much as they could be, and while a sequence of events can only be inferred from time-resolved EM, we believe our work falls in the scope of reasonable interpretation. However, upon reexamining the terminology of ‘feeding’ and related discussion, we realized this could be misleading, so these sections have been revised.

      (4) We have improved the description and interpretation of the model in the manuscript and provide a detailed rationale of our approach in the point-by-point-response.

      Reviewer #1 (Recommendations For The Authors):

      Major points:

      (1) It is surprising the optical GluSnFR approach reports so much asynchronous release in control hippocampal neurons after single stimuli (36% of release). This seems much higher than what is observed at most synapses, where asynchronous release is usually less than 5% of the initial response to the first evoked stimuli. Any thoughts on why the GluSnFR approach reports such a high level of asynchronous release? Could the optical approach be slower in activation kinetics in some cases, which artificially elevates the asynchronous aspect of fusion? This seems to be the case, given electrophysiology recordings in Figure 3 show the asynchronous release component as ~10% in controls at the 1st stimuli (panel C).

      The reported proportion of asynchronous release from cultured hippocampal neurons varies, contingent upon a range of factors (calcium concentration, how asynchronous release is quantified, etc). However, we would argue that there is considerable evidence for a higher percentage of asynchronous release (more than the <5% indicated by the referee) at synapses in the hippocampus. In our previous work on Doc2 using electrophysiology in cultured hippocampal neurons (Yao et al., 2011, Cell), it was noted that there is an approximate 25% incidence of asynchronous release after a single action potential. Furthermore, Hagler and Goda also reported a 26% ratio of asynchronous neurotransmitter release, also from cultured hippocampal neurons (Hagler and Goda, 2001, J Neurophysiol.).

      We also point out that another study using iGluSnFR to measure synchronous/asynchronous release ratios, with more sophisticated stimulation, imaging, and analysis procedures than ours, found an average ratio of synchronous to asynchronous release that is in-line with our values, with considerable variability among individual boutons (Mendonça et al., 2022; 25% asynchronous release after a single action potential). We feel that iGluSnFR is actually the superior approach (barring specialized e-phys preparations that can measure quantal events at individual small synapses; please see Miki et al., 2018), as it directly measures the timing of individual release events at individual boutons. By comparison, in most electrophysiology experiments there is a large peak of synchronous release from many synapses. iGluSnFR also bypasses postsynaptic considerations such as receptor kinetics and desensitization, or asynchronous release being poorly aligned to AMPA receptors, per a recent study of ours (Li et al., 2021), and a study showing 25% of asynchronous release occurs outside the active zone (Malagon et al., 2023). All these factors could obscure asynchronous release or otherwise make it difficult to measure by electrophysiology. To our knowledge, the approach in Miki et al., 2018 best bypasses these limitations, though the data in that study are from exceptionally fast and synchronous cerebellar synapses, and so cannot be directly compared to our findings. Thus, it is possible that iGluSnFR can report more asynchronous release than electrophysiological recordings, but this may actually reflect real biology.

      This being said, after considering the reviewer’s points we realized that our analysis method likely underestimates the total amount of synchronous release when using the high-affinity sensor (Figure 1). We quantify release by ‘events’ (that is, peaks), which does not take into account multiquantal peaks resulting from near-simultaneous multivesicular release. We have previously determined by quantal analysis that most synchronous peaks after a single action potential are multiquantal, while for asynchronous release there are still multiquantal events but they are in the minority (Vevea et al., 2021; Mendonça et al., 2022). So, in our data sets, the total amount of synchronous release is underestimated more so than asynchronous release. Thus, 37% asynchronous release is probably an overestimate, which explains the 12% difference compared to Mendonça et al., 2022, who used sophisticated quantal analysis (though that study also was performed at room temperature, which could also cause differences). We have now pointed this out in the text:

      “This ratio of synchronous to asynchronous release is likely an underestimate, since our analysis only counts the number of peaks (‘events’) and does not take into account multiquantal peaks resulting from near-simultaneous multivesicular release. We have previously determined by quantal analysis that most synchronous peaks are multiquantal after a single action potential, while for AR there are still multiquantal events but they are in the minority (Vevea et al., 2021). So, in our measurements, the total amount of synchronous release is underestimated; sophisticated quantal analysis using the A184V iGluSnFR recently found the percentage of total release that is AR to be ~25%, with otherwise similar results to ours (Mendonça et al., 2022). Nonetheless, this approach faithfully distinguishes synchronous from asynchronous release…”

      However, while this method underestimates total synchronous release, it does not misclassify synchronous events as asynchronous because of kinetics. Even the slower iGluSnFR variant does not have a rise time that would misrepresent a synchronous event as asynchronous (Marvin et al., 2018). Mendonça et al (2022) note that averaged iGluSnFR traces for the A184V are biphasic, with the transition from fast to slow component occurring around 10 ms. These authors also determined that the temporal resolution of glutamate imaging is actually limited by the frame rate, not the biosensor, and based on simulations found that detection time was biased in their data to be about 1 ms earlier than the actual timing of release events.

      The reviewer’s final point about Figure 3 is a misunderstanding, as these are data from iGluSnFR, not electrophysiology. The asynchronous proportion in these experiments is ~10% because, as noted in the manuscript, we used a faster, lower-affinity variant of iGluSnFR in train stimulation experiments (Figure 2). In contrast to the high-affinity sensor, as explained above, in our analysis this variant would be expected to underestimate the amount of asynchronous release because it fails to detect many uniquantal release events (presumably those further from the focal plane, with too little fluorescence to reach our detection threshold) as evidenced by the fact that the apparent mini rate is much lower as measured by this sensor compared to higher-affinity variants. Since synchronous peaks are mostly multiquantal after a single action potential, while asynchronous peaks are mostly uniquantal, a fraction of release going undetected results in mostly smaller synchronous peaks, which are counted the same in our analysis while many asynchronous peaks are missed entirely. We have added a bit more clarification in the text to avoid confusion on this point:

      “This sensor underestimates the fraction of AR (~10% of total release for a single action potential) as compared to the A184V variant used above that overestimates the fraction of AR (~35% of total release for a single action potential). This is because it is less sensitive and misses many uniquantal events; as discussed above, our analysis quantifies release by number of peaks, and most synchronous peaks are multiquantal after a single action potential, while most AR peaks are uniquantal (Vevea et al., 2021). Still, the S72A variant reported the same phenotypes as the A184V variant after the first action potential (Fig. 3B, C).”
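
The counting argument made in this response can be illustrated with a small Python sketch. Everything in it is hypothetical: the quantal contents are invented, and modeling the lower-affinity sensor as one that misses every uniquantal peak is a deliberately extreme simplification, not a description of the actual detection threshold.

```python
def ar_fraction_by_events(sync_quanta, async_quanta):
    """Fraction of release scored as asynchronous when every peak counts once."""
    n_sync, n_async = len(sync_quanta), len(async_quanta)
    return n_async / (n_sync + n_async)


def ar_fraction_by_quanta(sync_quanta, async_quanta):
    """Fraction of release that is asynchronous when every quantum counts once."""
    total_sync, total_async = sum(sync_quanta), sum(async_quanta)
    return total_async / (total_sync + total_async)


def drop_uniquantal(peaks):
    """Extreme toy model of a less sensitive sensor: uniquantal peaks go undetected."""
    return [q for q in peaks if q > 1]


# Hypothetical quantal content of each detected peak (one entry per peak):
# synchronous peaks are mostly multiquantal, asynchronous peaks mostly uniquantal.
sync_quanta = [2, 3, 2, 1, 2, 2]
async_quanta = [1, 1, 2, 1]

true_fraction = ar_fraction_by_quanta(sync_quanta, async_quanta)
sensitive_sensor = ar_fraction_by_events(sync_quanta, async_quanta)
insensitive_sensor = ar_fraction_by_events(
    drop_uniquantal(sync_quanta), drop_uniquantal(async_quanta)
)
print(insensitive_sensor, true_fraction, sensitive_sensor)
```

In this toy case, counting peaks with a sensor that detects everything overstates the asynchronous fraction relative to counting quanta, while the sensor that misses uniquantal peaks understates it, mirroring the direction of the two biases described above.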

      As discussed above, we think the synchronous-to-asynchronous ratio is actually harder to determine with electrophysiology, and the preparations are different (acute slice vs dissociated culture); still, our electrophysiological measurements are in line with the iGluSnFR data: 29% for Figure 2 and 26% from the first action potential of Figure 4. These values also agree with the findings from Yao et al. (2011) and Hagler and Goda (2001), discussed above.

      Finally, the ultimate goal of our study was to measure the effects of deleting Doc2 and syt7 on synchronous and asynchronous release, not to measure the exact ratio between the two. If iGluSnFR greatly misreported synchronous events as asynchronous, we would expect the results from the knockouts to diverge between our imaging and electrophysiology data, which they do not. We have also previously applied this approach to syt1 knockouts, showing the characteristic desynchronization of release (Vevea et al., 2020). Furthermore, the high-affinity and low-affinity iGluSnFR variants, which as discussed above in our analysis overestimate and underestimate the fraction of release that is asynchronous, respectively, both reported the same phenotypes.

      (2) In the acute hippocampal physiology traces, it looks like the effect on cumulative release in Doc2A mutants only appears around ~40 msec after stimulation. This is a relatively late phase of asynchronous release. Any reason this effect does not show up sooner, where most asynchronous fusion events occur, or is this due to some technical aspects of the physiology clamp that masks earlier components?

      The reviewer is correct, although the curves actually diverge at around 30 ms (see image below). This can be attributed to the fact that the EPSCs in our recordings are broad, probably because of the large number of different synaptic inputs captured in our stimulation and recording paradigm (note that the currents are also quite large), resulting in a broad spread in the timing of release. That is to say, synchronous release is likely still occurring fairly late into the trace, obscuring any changes in asynchronous release earlier than 30 ms. This is not related to Doc2 specifically, as the EGTA charge transfer curve also diverges from the control curve at the same time. This EGTA control gives us confidence that our broad EPSCs still faithfully report synchronous and asynchronous release, even if the exact timing is spread-out to some extent.

      Author response image 1.

      (3) How do the authors treat multi-vesicular release in their synchronous/asynchronous quantification? It was not clear from the methods section. Many of the optical traces show dual peaks - are those that occur in the 10 ms bin assigned to synchronous and those outside to asynchronous? Are the authors measuring the area of the response or just the peak amplitude for the measurements? The methods seem to indicate peak amplitude, but asynchronous is better quantified with area measurements for electrophysiology.

      This is an excellent point by the reviewer, and in the Methods we now explicitly state how we treat multivesicular release/multiple peaks in our analysis. Release timing is assigned based on peak timing, including when there are multiple peaks at the same bouton.

      “Timing of release was determined based on the frame in which the signal peaked, including for dual peaks in the case of synchronous and asynchronous release at the same bouton.”

      Regarding the comparison to area measurements for electrophysiology, we agree with the reviewer, which is why we used such an approach for our electrophysiological data. However, a key advantage of iGluSnFR is the ability to resolve individual quantal events (or, as is often the case for synchronous release, simultaneous multiquantal events), so temporal binning of the peaks is the appropriate analysis approach regarding these data. This is comparable to the analysis used for electrophysiology recordings of responses from single small synapses, which also detects individual quantal events, where release timing is calculated as the latency between the stimulus and the beginning of each EPSC (Miki et al., 2018).

      This leaves the general concern that multiple vesicle fusions at the same bouton that occur milliseconds apart could blur together and make it more difficult to accurately determine release timing, particularly with the slower sensor used in the single-stim experiments in Figure 1. We believe this is not a major concern, since we also performed experiments with the much faster S72A sensor, which can resolve peaks from 100 Hz stimulation (Marvin et al., 2018). Furthermore, while the peak-calling method we used is crude by comparison, the synchronous/asynchronous ratio we report is similar to that of Mendonça et al. (2022), who used a higher frame rate and deconvolution to produce more easily distinguishable quanta when synchronous and asynchronous release occur at the same bouton after the same action potential.
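
The peak-based temporal binning described in this response can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name, the frame-index inputs, and the choice to treat a latency of exactly 10 ms as synchronous are assumptions; only the 100 Hz frame rate and the roughly 10 ms boundary are taken from the text.

```python
FRAME_RATE_HZ = 100.0   # imaging frame rate mentioned in the response
SYNC_WINDOW_MS = 10.0   # assumed synchronous/asynchronous boundary


def classify_peaks(peak_frames, stim_frame,
                   frame_rate_hz=FRAME_RATE_HZ,
                   sync_window_ms=SYNC_WINDOW_MS):
    """Label each peak at one bouton as "sync" or "async" by its peak frame.

    peak_frames: frame indices at which the fluorescence signal peaked
    stim_frame:  frame index of the stimulus
    Several peaks at the same bouton are classified independently.
    """
    frame_ms = 1000.0 / frame_rate_hz
    labels = []
    for frame in peak_frames:
        latency_ms = (frame - stim_frame) * frame_ms
        if latency_ms < 0:
            raise ValueError("peak occurs before the stimulus")
        labels.append("sync" if latency_ms <= sync_window_ms else "async")
    return labels


# Hypothetical bouton with a dual peak: one peak in the first frame after
# the stimulus and a second peak three frames after it.
print(classify_peaks([11, 13], stim_frame=10))
```

Because each label depends only on the frame of the peak, the temporal resolution of this kind of scoring is set by the frame period, consistent with the point above that the frame rate rather than the biosensor limits timing.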

      (4) It would be relevant to show that calcium binding mutations in Syt7 do not support SV docking/capture in the current assays, given some evidence for Syt7 calcium-independent activities has been reported in the field.

      To our knowledge, when using the correct mutations to block calcium binding, none of the reported syt7 knockout phenotypes (including those reported by our laboratory in Liu et al., 2014) have ever been rescued. However, this does not formally rule out a calcium-independent role in transient docking. For the EM data, we originally considered including rescue experiments with normal and non-calcium binding mutants of both syt7 and Doc2 in our study. However, our EM approach is spectacularly expensive and labor-intensive and such experiments would as much as triple the amount of EM work in the study. We plan on doing such experiments, and there is a great deal of additional structure-function work to be done on both these proteins. We feel that reassessing the calcium binding mutants with iGluSnFR and zap-and-freeze falls into the scope of this future work. For now, we acknowledge this as a limitation of the current study.

      (5) The authors are not consistent in how they describe the role of the two proteins in asynchronous release, with the reader often drawing the impression that these two proteins solely mediate this aspect of SV fusion. As the authors note, some synapses do not require Syt7 or Doc2 for SV release, indicating different asynchronous sensors or molecular components at distinct brain synapses. Indeed, asynchronous release is only reduced, not eliminated, in the double mutants the authors report, so other components are at play even in these hippocampal synapses. The authors should be more consistent in noting this in their text, as the wording can be confusing as noted below:

      "Together, these data further indicated that AR after single action potentials is driven by Doc2α, but not syt7, in excitatory mouse hippocampal synapses."

      "after a single action potential, Doc2α accounts for 54-67% of AR at hippocampal excitatory synapses, whereas deleting syt7 has no effect."

      "This, along with our finding that syt7/Doc2a DKOs still had remaining AR, raises the possibility that there are other unidentified calcium sensors for AR."

      We have made adjustments throughout to not overstate the role of syt7 and Doc2, including at the locations the reviewer points out. This is an important point from the reviewer, and not just to avoid misleading readers. It is itself interesting; in the original manuscript we should have emphasized, far more than we did, that the DKO experiments strongly point to as-yet-unidentified proteins being involved in asynchronous release. This has been rectified in the revised text: we now emphasize that another calcium sensor for asynchronous release is likely present at all relevant points in the manuscript.

      (6) Given the authors' data, I don't think it's fair to say "raises the possibility" of other AR sensors, as almost 50% of AR remained in the Doc2A mutant in some of the experimental approaches. Clearly, other AR calcium sensors or molecular components are required, so better to just state that in the 1st paragraph of the discussion with something like: "Given syt7/Doc2a DKOs still had remaining AR, further work should explore the diversity of synaptic Ca2+ sensors and how they contribute to heterogeneity in synaptic transmission throughout the brain."

      We agree; this was poor phrasing on our part. We meant to imply that there may be proteins that have not even been considered, because it is also technically possible that the remaining asynchronous release is supported by the known machinery (i.e., syt1). We have changed “raises the possibility” to “indicates”.

      Minor points:

      (1) Remove "on" from the abstract sentence "Consequently, both synchronous and asynchronous release depress from the second pulse on during repetitive activity".

      We have changed “on” to “onward” to reduce ambiguity.

      (2) Shouldn't syt7 be Syt7 and syt1 be Syt1 when referring to the proteins?

      To our knowledge there is not a hard-and-fast convention for non-acronym mouse protein abbreviations. The technically correct full name is lowercase, so we find it reasonable to use lowercase for the abbreviation.

      (3) Both calcium and Ca2+ are used in the manuscript - better to stick to one term throughout.

      We thank the referee for catching this error; we now use only “Ca2+” throughout our study.

      Reviewer #2 (Recommendations For The Authors):

      (1) While the GluSnFR experiments appear to be well done, what is striking is the relatively small and "jagged" fluorescent responses. Are the authors concerned that they are missing many fast (with peaks occurring within 10 ms) synchronous events and incorrectly identifying them as asynchronous? If this is not a concern, why not?

      With respect to the small raw responses, this is the nature of measuring individual quanta from individual boutons while imaging at 100 Hz, even with the excellent signal-to-noise ratio of the iGluSnFR variants we used.

      As far as kinetics, as noted in the response to Reviewer 1 point #1, even the slower iGluSnFR variant has a rise time fast enough that it cannot misrepresent a synchronous event as asynchronous (Marvin et al., 2018). This threshold for iGluSnFR has been used by others: see Mendonça et al., 2022, who note that averaged iGluSnFR traces are biphasic, with the transition from fast to slow component occurring around 10 ms. The ‘jaggedness’ is in large part due to the frame rate (100 Hz); Mendonça et al., 2022 used 250 Hz and deconvolution to produce smoother, cleaner traces, but still achieved similar results to us.

      Finally, we reiterate what we wrote in response to Reviewer 1 point #1: “the ultimate goal of our study was to measure the effects of deleting Doc2 and syt7 on synchronous and asynchronous release, not to measure the exact ratio between the two. If iGluSnFR misreported synchronous events as asynchronous, we would expect the results from the knockouts to diverge between those data and our electrophysiology data, which they do not. We have also previously applied this approach to syt1 knockouts, showing the characteristic desynchronization of release (Vevea et al., 2020). Also, the phenotypes reported by the faster and slower iGluSnFR variants were identical. ”

      (2) On page 6, I'm not sure I would agree that short-term plasticity is "so catastrophically disrupted". It is probably enough to say that plasticity is disrupted in the ko.

      We argue that syt7 knockout causes the most severe phenotype specific to short-term plasticity so far described (that is, without affecting initial release probability), but we have changed “catastrophically” to “strongly”.

      (3) Differences in the post-stim number of "docked" vesicles between conditions are, in absolute numbers, very small. For example, it seems that the number of docked vesicles goes from ~ 2.2 prior to stimulation, to ~ 1.5 in the first 5 ms window following stimulation. While this number may be statistically significant, I worry about bias and sampling errors. It is comforting that images are randomized prior to analysis. Nevertheless, the differences are very small and this should be explicitly acknowledged.

      This ~40% decrease in the number of docked vesicles in dissociated cultured hippocampal neurons has been consistent throughout all our studies using flash-and-freeze and zap-and-freeze electron microscopy (Watanabe et al., 2013; Kusick et al., 2020; Li et al., 2021), as well as those of other labs (Chang et al., 2018). Statistically, 40% is far beyond the limit to detect differences between samples with 200-300 synapses quantified per condition and an average of ~2 docked vesicles per image. The low absolute number of docked vesicles per synaptic profile (since the 40 nm section only captures a portion of the active zone, which contains an average of 12 docked vesicles in total; Kusick et al., 2020) is not relevant except that it does reduce the statistical power to detect differences, but this is compensated for by the huge number of images we capture and annotate per sample. We are able to detect differences in fusion and endocytic pits (albeit with much less precision and sensitivity), such as the Doc2 phenotype in this study, even though these events are an order of magnitude rarer than docked vesicles. Biologically, in our view, a 40% reduction in all docked vesicles across all synapses, considering that the majority of synapses do not have even 1 vesicle fusion, after only a single action potential, is substantial. We have even been puzzled why there is such a large decrease, but as stated above this result has been consistent for a decade of using this approach. For comparison to the magnitude of baseline docking changes in mutants, this 40% is similar to the effect of deleting synaptotagmin 1 (Imig et al., 2014; Chang et al., 2018; note in Imig et al., considered a gold standard in the field, the average number of docked vesicles per tomogram is ~10, but there are fewer than 25 tomograms per sample, so the actual amount of sampling in our data set is slightly greater).

      (4) The related point is that how can one know about the "transient" nature of vesicle docking when the analysis is performed on completely different sections from different cells? Moreover, what does it mean that the docked granules have recovered or not recovered (abstract)? This should be explained in more detail.

      This is a fundamental difficulty of interpreting time-resolved electron microscopy data. We cannot observe a sequence of events at any given synapse, but only try to measure each time point as accurately as we can and interpret the data.

      By ‘recovery’ we simply mean that the number of docked vesicles at a given time point after stimulation is similar to the no-stimulation baseline. We have replaced ‘recovery’ in the abstract with ‘replenishment’ to avoid confusion.

      We now realize that in the context of this study the term ‘transient docking’ is confusing, since we only measured out to 14 ms. In experiments with samples frozen at 5 ms, 14 ms, 100 ms, 1 s, and 10 s, the return to baseline at 14 ms appears temporary, since samples frozen at 100 ms show a reduction in docked vesicles similar to that at 5 ms (Kusick et al., 2020). The number of vesicles again returns to baseline at 10 s, so we used the term ‘transient docking’ to distinguish the recovery at 14 ms from the slower and presumably permanent return to baseline that takes 10 s. The apparently temporary nature of this process is why we believe it contributes to facilitation, which likewise peaks soon after stimulation and decays over the course of ~100 ms.

      To make the transient docking terminology less confusing, we have removed the word ‘transiently’ from the title and added a clarification of what transient docking is when it is first mentioned:

      “vesicles can dock within 15 ms of an action potential to replenish vacated release sites and undock over the next 100 ms”

      As noted by the reviewer, such a sequence of events, where vesicles dock within 14 ms, then undock over the course of 100 ms, then dock again over the course of 10 s, is an inference, but is based on predictions from electrophysiological data and modeling (see Silva, Tran, and Marty, 2021 for review; those authors use the term ‘calcium-dependent docking’ but this refers to the same process), and as yet there is no way to directly observe vesicle dynamics at synapses down to nanometer resolution in live cells.

      On the reviewer’s recommendation, we have removed references to syt7 ‘feeding’ vesicles from the abstract and the beginning of the “physiological relevance” section of the discussion. This phrasing could imply a direct molecular pipeline between syt7 and syt1/Doc2, which would misrepresent our actual model, in which syt7 simply helps recruit docked vesicles.

      “These findings result in a new model whereby syt7 drives activity-dependent docking, thus providing synaptic vesicles for synchronous (syt1) and asynchronous (Doc2 and other unidentified sensors) release during ongoing transmission.”

      “In the case of paired-pulse facilitation it can supply docked vesicles for syt1-mediated synchronous release to enhance signaling; it likely functions in the same manner to reduce synaptic depression during train stimulation. In the case of AR, syt-7-mediated docked vesicles can be used by Doc2α, which then directly triggers this slow mode of transmission.”

      (5) In this study, docking is phenomenologically defined and, therefore, arbitrary; vesicles are defined as docked if there is no space between them and the plasma membrane. What happens if the definition is broadened to include some small distance between the respective membranes? Does the timecourse of "recovery" change?

      We always quantify at least all vesicles within 100 nm of the active zone; these data are shown in Figure S6D. We show only docking in the main figures because, consistent with our previous work and as stated in the text, we found no change in the number of vesicles at any distance from the plasma membrane at the active zone after stimulation, nor did we find any difference in the mutants. In our previous work on syt7 (Vevea et al., 2021) we quantified all the vesicles within the synapse and also found no differences after stimulation or in the KO further from the active zone.

      The reviewer is correct that the term ‘docking’ at synapses is often used quite arbitrarily; even among morphological studies the definition is inconsistent. We consider our strict docking definition that we explain in the manuscript (in high-pressure-frozen and freeze-substituted samples) of no visible distance between membranes to be less arbitrary, since only the number of these attached vesicles decreases after stimulation (Watanabe et al., 2013, Kusick et al., 2020, Li et al., 2021, this study) and in SNARE knockouts (Imig et al., 2014). Broadening the definition, as is done in some other studies (for example Chang et al., 2018), retains the effect, since the majority of vesicles within 10 nm are at ~0 nm, but again all that is actually changing is the number of vesicles at ~0 nm.

      (6) My overall impression is that this model is not adding much to the story. Specifically, the model was not fit to any data and has a huge number of states and free parameters given the dynamics that it is trying to capture (ie I think this is overkill). Many of the free parameters were arbitrarily constrained with little to no justification and there was minimal parameter space exploration, in part because the model wasn't being quantitatively constrained to any data. While advertised to be a 3-state model, there is a combinatorial explosion of substates by distinguishing between levels of calcium occupancy simultaneously in three separate calcium sensors so that one ends up with 9 empty states, 9 tethered states, and 45 docked states for a total of 63 distinguishable states. At 63 states and 21 free parameters, one could of course model just about any dynamics imaginable. But the relatively simple dynamics of AR and its perturbation by removal of Doc2 and Syt7 can likely be captured with far fewer states and parameters (such as Neher's recent proposal). Specifically, starting with the Neher ES-LS-TS model along with adding a transient labile docked state affected by Syt7 and Doc2 (TSL in Neher nomenclature), I wonder if the authors could more or less capture what they are observing during stimulus trains. The advantage of a minimal model is that readers don't have to struggle with fairly elaborate systems of differential equations and parameter plots to get a feel for what's going on. Especially since the point of this model is to develop intuition rather than to capture with physical accuracy exactly what is transpiring at a docked vesicle (which would require many more details excluded from the current model).

      We would like to thank the reviewer for pointing out unclear passages and mistakes in the description of the model, which we have worked to improve. We now explain in more detail why we made certain assumptions and how we constrained the parameter values in the model. As the reviewer points out, other models might also explain the dynamics of the experimental data presented in this paper. Thus, we agree that it is unlikely that this theory and model implementation is the only one that can account for the observations. With this model we aimed to investigate whether the theory proposed on the basis of the experimental data could indeed reproduce the dynamics that are observed experimentally. In the section below we briefly explain the decisions we made in constructing the model in response to the reviewer’s concerns. We also describe the adjustments we have made to the model’s description to improve its readability and to be open about its limitations.

      One of the main concerns of the reviewer is that the model has many states and free parameters, some of which are poorly constrained. We agree that the model indeed contains many states. However, in essence, the model corresponds to a two-step docking model, in which SVs get tethered to an empty release site and subsequently dock/prime in a fusion-competent state. This structure corresponds to the ES-LS-TS model (Neher and Brose 2018, Neuron) mentioned by the reviewer and to the replacement-docking model (Miki et al., 2016, Neuron). As the reviewer points out, by making the transition rates in those models calcium-dependent, we would indeed be able to capture dynamics similar to those of our model. However, instead of directly implementing calcium-dependent rates, we let the rates depend on the number of calcium ions bound to syt7, Doc2 and syt1. We decided to do so because some information on the calcium binding dynamics of these proteins is available, and simulating calcium binding to the proteins explicitly allowed us to integrate this knowledge into our model. Moreover, by explicitly simulating calcium binding to these proteins, we included the time it takes to reach a new steady-state binding occupancy after a change in calcium levels. This is crucial especially for Ca2+ sensors with slow kinetics, such as syt7 and Doc2. These properties are highly relevant for asynchronous release (which we quantified as the release >5 ms after onset of the AP). The consequence is that, because of combinatorics (e.g., if we assume 5 calcium ions to bind to syt1 and 2 to Doc2 this leads to 24 different states), explicit simulation of all relevant states increases the number of different states a vesicle can be in. In the main text of the manuscript, we added this explanation of why we decided on the structure of the model as it is presented and discussed it in the context of previous models.

      Our decision to simulate calcium binding to syt1, syt7 and Doc2 also increased the number of parameters in our model. As the reviewer points out, the large number of parameters in our model, compared to the relatively low number of features in the experimental behavior the model is compared to, is a limitation. However, after thorough exploration of the model, we are certain that the model cannot produce just any desired dynamics. The large number of parameters does make it possible that different combinations of parameter values would lead to similar responses, as can be seen in the parameter space exploration in Figure S9. This means that our modelling effort does not provide estimates of parameter values. We now mention this explicitly in the discussion section of the model. Some of the parameter values we were able to constrain based on previous literature (10 parameters), others were set more arbitrarily (8 parameters), and some were adjusted to match the experimental data closely (7 parameters). We now indicate more clearly in Supplementary Table 3 to which category each parameter value belongs. We determined the values of the model parameters through a manual exploration of the parameter space. One of the main reasons why we decided not to fit the model to the data obtained in this work is that the obtained parameters would not be informative (e.g., multiple combinations of parameters will lead to similar results). We agree with the reviewer that a direct quantitative comparison between model predictions and experimental data obtained by fitting would be nice. However, fitting the model to experimental data would be close to impossible computationally. This is in part because of the large number of states, but mainly due to the large number of APs that need to be simulated. Because the transients in our model have slow and fast parts (the decay of the residual Ca2+ transient and the peak of the local Ca2+ transient), the model is challenging to solve with the ODE solvers available in Matlab, even when using a high-performance computer system optimized for parallel computation (32 cores). Moreover, fitting the model to experimental data would require the addition of extra assumptions and parameters to the model. As the experiments are performed using different samples, different parameter settings are probably required (e.g., it is likely that the number of release sites or the fusion probability differs between cultured hippocampal neurons and hippocampal slices). Additionally, if we decided to fit the model, we would need to define a cost function (i.e., a quantitative measure of how well the model fits the experimental data), which would require us to determine the weights given to the different experiments we compare our model predictions to. The decision on how to weight the different types of data is very difficult (not to say arbitrary).

      Therefore, we constrained the parameter values in our model based on a manual (but systematic) exploration of the parameter space. The simulations of the model were evaluated based on the increase in the number of docked vesicles between 5 and 15 ms after AP stimulation (this should be as large as possible for the control and Doc2- model, and close to 0 for the syt7- model simulations), the peak release rates in response to the first AP (to be equal between all conditions), the ratio between the peak release rate of the 1st and 10th response (depressive phenotype should be more prominent in the syt7- model simulation and the least in the Doc2- simulation), and the amount of asynchronous release (syt7- and Doc2- simulations should have approximately half of the total amount of asynchronously released vesicles compared to the control simulations). Moreover, the parameter values for the calcium transient should be realistic. We do not know the exact parameter values of the calcium transient in the samples used in the experiments performed here, but previous studies have provided a range of realistic parameter values (Brenowitz and Regehr 2007, PMID: 17652580; Helmchen et al., 1998, PMID: 9138591; Sabatini and Regehr 1998, PMID: 9512051; Wang et al., 2008, PMID: 19118179). Furthermore, we decided to set the parameters describing calcium binding to syt7 and Doc2 to the same values, as the scope of the model was to investigate the role of syt7 and Doc2 in asynchronous release when they act on different steps in the reaction scheme. By using the same parameter values both proteins are identical except for their mechanism of action. We added this section to the methods of the manuscript.

      In the parameter space evaluation, we decided to vary parameters one by one or in pairs. We decided not to extend the parameter space evaluation further, as it would be challenging to interpret and visualize the results, and the simulations would be computationally expensive.

      (7) The graphics, equations, and nomenclature all need some work. The equations aren't numbered or indexed, so I can't really refer to any of them in particular, but the symbols being used generally were not defined well enough for a naïve reader to follow. The 15 diffEQs compressed into a single expression at the bottom of page 19 are basically impenetrable. The 'equation' near the bottom of p. 20 is not an equation - it is a set of four symbols lacking a definition. The fusion rate equation (with f1 and f2 factors) isn't spelled out clearly enough (top of p. 20). Can fusion occur from any of the 45 docked states but just with a different probability? Or does fusion only occur from the 3 states where Doc2+Syt1 Ca occupancy = 5? The graphical representation of Syt7 occupancy and its effects in Fig S7 doesn't work well. Tons of color and detail but very hard to decipher and intuit what Syt7 is doing to the SV buried in the arrow lengths. And this is a crucial point of the paper - it really needs to shine through in this figure.

      We thank the reviewer for pointing out the unclear passages in the description of the model. We have worked on improving this section. Specifically, we have improved the equations and now explain the symbols used in them more clearly. We have also altered the graphical representation of the effect of calcium binding to syt7 on docking and undocking rates.

      (8) I would strongly recommend abandoning this large-scale soft modeling effort altogether, but if the authors feel that all the states and parameters are absolutely required, they need to justify this point, define all symbols systematically, number all equations, and provide some evidence of actual data fitting, systematic parameter space exploration, and more exposition of why they are making the various assumptions and constraints that were used to lower the number of free parameters. For instance, why are the tethering and untethering (or docking and undocking) rate constants set to equal each other? And why is it assumed that Syt7 enhances both the docking and undocking rates? Why is fusion set to occur as long as the sum of Syt1 and Doc2 calcium occupancy is exactly 5 regardless of the specific occupancy of either Syt1 or Doc2? Again probably quite important but unjustified physically. Given the efforts of this model to capture some sort of realistic calcium liganding by Syt1, Syt7, and Doc2, the model doesn't seem to take into account the copy number of each protein at a release site. Shouldn't it matter if there are 2 Syt7s vs 20 Syt7s? Or the stoichiometry between Doc2 and Syt1? Either this model assumes that there is exactly one copy of each protein at a release site or that all copies are always identically liganded and strictly act as a unit. Neither of these possibilities seems plausible.

      Although this model (like all models) is a simplified version of reality and has its limitations, we decided to keep it in our work to illustrate that the well-defined hypothesis put forth in this paper is consistent with the experimental data. Again, we are not claiming that this model is the only one that may explain the data, nor do we claim that we have uniquely identified its parameters. As indicated above, we improved the description of the model in the methods and our description of how the parameter values are constrained. For the reasons mentioned above (first and foremost infeasibility due to excessive computation time), we did not perform data fitting or change the parameter space exploration. We would like to thank the reviewer for pointing out that some of the assumptions of the model are not explained well enough. We added an extra explanation of these assumptions to the main text.

      One of the assumptions we made, as the reviewer points out, is that the tethering and untethering rate constants are set equal to each other, as are the docking and undocking rate constants. This is indeed an arbitrary assumption, with the main aim of reducing the number of free parameters in our model, given that there is currently no experimental constraint on the relation between the two rate constants. We agree that this assumption is as good as any other, and we have pointed this out more clearly in the main text.

      In the model, syt7 enhances both docking and undocking rates because we assumed it to function as a catalyst of the docking reaction. A catalyst lowers the energy barrier for the reaction and thereby promotes both forward and backward rates. One of the main reasons we decided on this is that syt1 and Doc2 are also assumed in the model to function by lowering the energy barrier, in their case for the fusion reaction. However, since fusion is irreversible, this only affects the forward reaction rate. We cannot exclude that syt7 acts on the forward rate only, which we now mention in the results section of the model.
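
      The catalyst argument can be illustrated with a minimal two-state sketch (tethered and docked). This is not the model used in the manuscript. The rate constants and the catalytic factor are hypothetical values chosen for illustration, and the function name is introduced here.

```python
def two_state_summary(k_dock, k_undock):
    """Summarize a reversible tethered <-> docked reaction.

    Returns the equilibrium docked fraction and the relaxation rate
    (the rate at which the system approaches equilibrium).
    """
    docked_fraction = k_dock / (k_dock + k_undock)
    relaxation_rate = k_dock + k_undock
    return docked_fraction, relaxation_rate


# Hypothetical rate constants (per second) and catalytic speed-up factor.
k_dock, k_undock, catalysis = 20.0, 10.0, 5.0

base = two_state_summary(k_dock, k_undock)
# A catalyst scales forward and backward rates by the same factor.
catalyzed = two_state_summary(catalysis * k_dock, catalysis * k_undock)

print("docked fraction:", base[0], "->", catalyzed[0])
print("relaxation rate:", base[1], "->", catalyzed[1])
```

      In this sketch the equilibrium docked fraction is unchanged, while the approach to equilibrium is faster by the catalytic factor. This is the sense in which a catalyst could speed the refilling of vacated sites without changing the resting number of docked vesicles.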

      In our model, fusion can occur from any docked SV state. The probability of fusion, however, increases the more calcium ions are bound to Doc2 or syt1, with calcium-bound syt1 being more effective in promoting fusion. This structure matches the dual-sensor model proposed by Sun et al., 2007, Science (PMID: 18046404) and Kobbersmed et al., 2020, eLife (PMID: 32077852), and is based on the assumption that each protein bound to calcium lowers the energy barrier by a certain amount. We have explained this in more detail in the results section of the model.
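
      One way to write this assumption down, as a sketch rather than the exact expression used in the model: if each calcium ion bound to syt1 lowers the fusion barrier by ΔE_1 and each ion bound to Doc2 lowers it by ΔE_2, an Arrhenius-type rate gains one multiplicative factor per bound ion. The symbols k_0 (basal fusion rate), n and m (ions bound to syt1 and Doc2), ΔE_1 and ΔE_2 are introduced here for illustration. The factors f_1 and f_2 are assumed to correspond to those mentioned by the reviewer.

```latex
k_{\mathrm{fuse}}(n, m)
  = k_0 \exp\!\left(\frac{n\,\Delta E_1 + m\,\Delta E_2}{k_B T}\right)
  = k_0\, f_1^{\,n}\, f_2^{\,m},
\qquad
f_i = \exp\!\left(\frac{\Delta E_i}{k_B T}\right),
\qquad
f_1 > f_2 > 1 .
```

      In this form fusion is possible from every docked state (n = m = 0 gives the basal rate k_0), and the condition f_1 > f_2 expresses that calcium-bound syt1 promotes fusion more strongly than calcium-bound Doc2.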

      We decided that syt1 and Doc2 together could have no more than five calcium ions bound to them. This is based on the idea that syt1 and Doc2 are competing for the same type of resources, which could for instance be a limited number of SNARE complexes that are available to execute the reaction. An indication for competition between the two proteins can be found in the synchronous release amplitudes after stimulus 2, which are larger in the Doc2KO.

      The reviewer rightfully points out that for realistic simulations of the role of syt1, syt7 and Doc2, the stoichiometry of these proteins at the release site is relevant. In the ideal scenario, we would have included this in our model. However, this would massively increase the possible number of states (which this reviewer criticizes already in our simpler model), making the model even more computationally expensive to run. Additionally, we currently have no reliable estimates of the number of syt7 and Doc2 molecules per release site. In our model, all syt1s on an SV can together bind up to five calcium ions. We have recently shown that this simplified model can capture the features of all syt1 proteins per vesicle that compete for the binding of three substrates on the plasma membrane to exert their function in speeding up fusion (Kobbersmed et al., 2022, eLife, PMID: 35929728). This means that the copy number is indirectly covered in our model. The number of five calcium ions (and two for Doc2 and syt7), however, is not based on the estimated number of syt1s on an SV (which would be around 15, Takamori 2006), but rather on the calcium dependence of the fusion reaction. Similarly, the number of two calcium ions binding to Doc2 is based on the calcium dependence of asynchronous fusion rates (Sun et al., 2007). Based on the reviewer’s comment, we now mention more explicitly in the text that the numbers of calcium ions binding to syt1, Doc2 and syt7 correspond to the total number of calcium ions that can bind to each of these molecules per release site/SV.
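
      The state counting behind these numbers can be sketched as follows. The sketch assumes that the shared limit on syt1 and Doc2 is a simple cap on the sum of their bound calcium ions, and that syt7 occupancy (0 to 2 ions) is independent of both. The variable names are hypothetical. Under these assumptions the enumeration reproduces the 45 docked states counted by the reviewer.

```python
from itertools import product

# Maximum number of bound calcium ions per sensor, as stated in the response.
MAX_SYT1, MAX_DOC2, MAX_SYT7 = 5, 2, 2
# Assumed form of the competition: syt1 + Doc2 ions together cannot exceed 5.
SHARED_CAP = 5

docked_states = [
    (syt1, doc2, syt7)
    for syt1, doc2, syt7 in product(
        range(MAX_SYT1 + 1), range(MAX_DOC2 + 1), range(MAX_SYT7 + 1)
    )
    if syt1 + doc2 <= SHARED_CAP
]

print(len(docked_states))  # 45
```

      Of the 6 x 3 = 18 possible syt1/Doc2 occupancy pairs, the cap removes three: (4, 2), (5, 1) and (5, 2). The remaining 15 pairs multiplied by 3 syt7 occupancies give 45.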

      We would again like to thank the reviewer for asking us to improve the explanation of the assumptions made in constructing our model and of how we constrained its parameter values.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Review:

      1. Evidence for a disulfide bridge contained in membrane-associated FGF2 dimers

      This aspect was brought up in detail by both Reviewer #1 and Reviewer #3. It has been addressed in the revised manuscript by (i) new experimental and computational analyses, (ii) a more detailed discussion of previous work from our lab that includes the experiments the reviewers asked for and (iii) a more general discussion of known examples of disulfide formation in protein complexes, with a particular focus on membrane surfaces facing the cytoplasm, the inner plasma membrane leaflet being a prominent example. Please find our detailed comments in our direct response to Reviewers #1 and #3 below.

      2. Affinity towards PI(4,5)P2 comparing FGF2 dimers versus monomers

      This aspect was raised by Reviewer #3 along with additional comments on the interaction of FGF2 with PI(4,5)P2. Please find our detailed response below. With regard to the PI(4,5)P2 affinity of FGF2 dimers versus FGF2 monomers, we think that the increased avidity of FGF2 dimers, with two high-affinity binding pockets for PI(4,5)P2, is a good explanation for the different free energies of binding that were calculated from the atomistic molecular dynamics simulations shown in Fig. 9. This phenomenon is well known for many biomolecular interactions and is also consistent with the cryoEM data contained in our manuscript, showing an FGF2 dimer with two PI(4,5)P2 binding sites facing the membrane surface.

      3. C95-C95 FGF2 dimers as signaling units

      We have put forward this hypothesis because the structural studies analyzing the FGF ternary signaling complex, consisting of FGF2, FGF receptor and heparin, used FGF2 mutants that lack C95. Nevertheless, two FGF2 molecules are contained in FGF signaling complexes. In addition to the papers on the structure of the FGF signaling complex, we have cited work showing that C95-C95 crosslinked FGF2 dimers are efficient FGF signaling modules (Decker et al, 2016; Nawrocka et al, 2020). Therefore, given an assembly/disassembly mechanism with the transient formation of pore-forming FGF2 oligomers, we think it is an interesting idea that the FGF2 secretion pathway produces C95-C95 disulfide-linked FGF2 dimers at the outer plasma membrane leaflet that can engage in FGF2 ternary signaling complexes. While this is a possibility we put forward to stimulate the field, it of course remains a hypothesis, which has been clearly indicated as such in the revised manuscript.

      Reviewer #1:

      1. Evidence for disulfide-bridged FGF2 dimers and higher oligomers on non-reducing versus reducing SDS gels

      The experiment suggested by Reviewer #1 is an important one that our group has published in previous work. In these studies, we found FGF2 oligomers analyzed on non-reducing SDS gels to be sensitive to DTT, which turned the vast majority of oligomeric FGF2 species into monomers [(Müller et al, 2015); Fig. 3, compare panel D with panel H]. This phenomenon could be observed most clearly after short incubation periods (0.5 hours) of FGF2 with PI(4,5)P2-containing liposomes. These findings constituted the original evidence that PI(4,5)P2-induced FGF2 oligomerization depends on the formation of intermolecular disulfide bridges.

      In the current manuscript, we established the structural principles underlying this process and identified C95 to be the only cysteine residue involved in disulfide formation. Based on biochemical cross-linking experiments in cells, cryo-electron tomography, predictions from AlphaFold-2 Multimer and molecular dynamics simulations, we demonstrated a strong FGF2 dimerization interface in which C95 residues are brought into close proximity when FGF2 is bound to membranes in a PI(4,5)P2-dependent manner. These findings provide the structural basis by which disulfide bridges can be formed from the thiols contained in the side chains of two C95 residues directly facing each other in the dimerization interface. In the revised manuscript, we included additional data that further strengthen this analysis. In the experiments shown in the new Fig. 10, we combined chemical cross-linking with mass spectrometry, further validating the reported FGF2 dimerization interface. In addition, illustrated in the new Fig. 8, we employed a new computational analysis combining 360 individual atomistic molecular dynamics simulations, each spanning 0.5 microseconds, with advanced machine learning techniques. This new data set corroborates our findings, demonstrating that the C95-C95 interface self-assembles independently of C95-C95 disulfide formation, based on electrostatic interactions. Intriguingly, it is consistent with our experimental findings based on cross-linking mass spectrometry (new Fig. 10) where cross-linked peptides could also be observed with the C77/95A variant form of FGF2, suggesting a protein-protein interface whose formation does not depend on disulfide formation. Therefore, we propose that disulfide formation occurs in a subsequent step, representing the committed step of FGF2 membrane translocation with the formation of disulfide-bridged FGF2 dimers being the building blocks for pore-forming FGF2 oligomers.

      As a more general remark on the mechanistic principles of disulfide formation in different cellular environments, we would like to emphasize that it is a common misconception that the reducing environment of the cytoplasm generally makes the formation of disulfide bridges unlikely or even impossible. From a biochemical point of view, the formation of disulfide bridges is not limited by a reducing cellular environment but is rather controlled by kinetic parameters when two thiols are brought into proximity. Indeed, it has become well established that disulfide bridges can also be formed in compartments other than the lumen of the ER/Golgi system, including the cytoplasm. For example, viruses maturing in the cytoplasm can form stable structural disulfide bonds in their coat proteins (Locker & Griffiths, 1999; Hakim & Fass, 2010). Moreover, many cytosolic proteins, including phosphatases, kinases and transcription factors, are now recognized to be regulated by thiol oxidation and disulfide bond formation, formed as a post-translational modification (Lennicke & Cocheme, 2021). In numerous cases with direct relevance for our studies on FGF2, disulfide bond formation and other forms of thiol oxidation occur in association with membrane surfaces. In fact, many of these processes are linked to the inner plasma membrane leaflet (Nordzieke & Medrano-Fernandez, 2018). Growth factors, hormones and antigen receptors are observed to activate transmembrane NADPH oxidases generating O2·-/H2O2 (Brown & Griendling, 2009). For example, the local and transient oxidative inactivation of membrane-associated phosphatases (e.g., PTEN) serves to enhance receptor-associated kinase signaling (Netto & Machado, 2022). It is therefore conceivable that similar processes introduce disulfide bridges into FGF2 while it assembles into oligomers at the inner plasma membrane leaflet. In the revised version of our manuscript, we have discussed the above-mentioned aspects in more detail, highlighting the known role of NADPH oxidases in disulfide formation at the inner plasma membrane leaflet.

      Reviewer #2:

      1. Potential effects of a C95A substitution on protein folding and comparison with a C95S substitution with regard to phenotypes observed in FGF2 secretion

      This is a valid point that we indeed addressed at the beginning of this project. Most importantly, we tested whether both FGF2 C95A and FGF2 C95S are characterized by severe phenotypes in FGF2 secretion efficiency. As shown in the revised Fig. 1, cysteine-to-serine substitutions showed very similar FGF2 secretion phenotypes to cysteine-to-alanine substitutions (Fig. 1C and 1D). In addition, in the pilot phase of this project, we also compared recombinant forms of FGF2 C95A and FGF2 C95S in various in vitro assays. For example, we tested the full set of FGF2 variants in membrane integrity assays such as those shown in Fig. 4. As shown in Author response image 1, FGF2 variant forms carrying a serine in position 95 behaved very similarly to FGF2 C95A variant forms. Relative to FGF2 wild-type, membrane pore formation was strongly reduced for both types of C95 substitutions. By contrast, both FGF2 C77S and C77A showed activities that were similar to FGF2 wild-type.

      Author response image 1.

      From these experiments, we conclude that changes in protein structure are not the basis for the phenotypes we report on the C95A substitution in FGF2.

      2. Effects of a C77A substitution on FGF2 membrane recruitment in cells

      The effect of a C77A substitution on FGF2 recruitment to the inner plasma membrane leaflet is indeed a moderate one. This is likely because C77 is only one residue of a more complex surface that contacts the α1 subunit of the Na,K-ATPase. Stronger effects can be observed when K54 and K60 are changed, residues that are positioned in close proximity to C77 (Legrand et al, 2020). Nevertheless, as shown in the revised Fig. 1, we consistently observed a reduction in membrane recruitment when comparing FGF2 C77A with FGF2 wild-type. When analyzing the raw data without GFP background subtraction, a significant reduction in membrane recruitment of FGF2 C77A was observed compared to FGF2 wild-type (Fig. 1A and 1B). We therefore conclude that C77 not only plays a role in FGF2/α1 interactions in biochemical assays using purified components (Fig. 7) but also contributes to FGF2/α1 interactions in a cellular context (Fig. 1A and 1B).

      1. Identity of the protein band in Fig. 3 labeled with an empty diamond

      This is a misunderstanding as we did not assign this band to a FGF2-GFP dimer. When we produced the corresponding cell lines, we used constructs that link FGF2 with GFP via a ‘self-cleaving’ P2A sequence. During translation, even though arranged on one mRNA, this causes the production of FGF2 and GFP as separate proteins in stoichiometric amounts, the latter being used to monitor transfection efficiency. However, a small fraction is always expressed as a complete FGF2-P2A-GFP fusion protein (a monomer). This band can be detected with the FGF2 antibodies used and was labeled in Fig. 3 by an empty diamond.

      1. Labeling of subpanels in Fig. 5A

      We have revised Fig. 5 according to the suggestion of Reviewer #2.

      1. FGF2 membrane binding efficiencies shown in Fig. 5C

      It is true that FGF2 variant forms defective in PI(4,5)P2-dependent oligomerization (C95A and C77/95A) bind to membranes with somewhat reduced efficiencies. This is also evident from the intensity profiles shown in Fig. 5A and was observed in biochemical in vitro experiments as well. A plausible explanation for this phenomenon would be the increased avidity when FGF2 oligomerizes, stabilizing membrane interactions (see also Fig. 9B).

      1. Residual activities of FGF2 C95A and C77/95A in membrane pore formation?

      We do not interpret the phenomenon in Fig. 5 that Reviewer #2 refers to as controlled activities of FGF2 C95A and C77/95A in membrane pore formation. Rather, GUVs containing PI(4,5)P2 are relatively labile structures, and a certain level of integrity loss upon protein binding and extended incubation is conceivable. This is essentially a technical limitation of the assay, in which GUVs are incubated with proteins for 2 hours. Even after substitution of PI(4,5)P2 with a Ni-NTA membrane lipid, background levels of loss of membrane integrity can be observed (Fig. 6). The critical point is therefore that, compared to FGF2 C95A and C77/95A, FGF2 wt and FGF2 C77A display significantly higher levels of loss of membrane integrity in PI(4,5)P2-containing GUVs, a phenomenon that we interpret as controlled membrane pore formation. By contrast, all variant forms of FGF2 show only background levels of loss of membrane integrity in GUVs containing the Ni-NTA lipid.

      1. Why does PI(4,5)P2 induce FGF2 dimerization?

      This has been studied extensively in previous work (Steringer et al, 2017). As also discussed in the current manuscript, the interaction of FGF2 with membranes through its high-affinity PI(4,5)P2 binding pocket orients FGF2 molecules on a 2D surface, which increases the likelihood of the formation of the C95-containing FGF2 dimerization interface. Moreover, in the presence of cholesterol at levels typical for plasma membranes, PI(4,5)P2 forms clusters containing up to 4 PI(4,5)P2 molecules (Lolicato et al, 2022), a process that may further facilitate FGF2 dimerization.

      1. Is it possible to pinpoint the number of FGF2 subunits in oligomers observed in cryo-electron tomography?

      We indeed took advantage of the Halo tags that appear as dark globular structures in cryo-electron tomography. For most FGF2 oligomers with FGF2 subunits on both sides of the membrane, we could observe 4 to 6 Halo tags which is consistent with the functional subunit number that has been analyzed for membrane pore formation (Steringer et al., 2017; Sachl et al, 2020; Singh et al, 2023). However, since the number of higher FGF2 oligomers we observed in cryo-electron tomography was relatively small and the nature of these oligomers appears to be highly dynamic, caution should be taken to avoid overinterpretation of the available data.

      Reviewer #3:

      1. Conclusive demonstration of disulfide-linked FGF2 dimers

      A similar point was raised by Reviewer #1, so we refer to our response above (page 2).

      1. Identity of FGF2-P2A-GFP observed in Fig. 3

      Again, a similar point has been made, in this case by Reviewer #2 (Point 3). The observed band is not a FGF2-P2A-GFP dimer but rather the complete FGF2-P2A-GFP fusion protein (a monomer) that corresponds to a small population produced during mRNA translation where the P2A sequence did not cause the production of FGF2 and GFP as separate proteins in stoichiometric amounts.

      1. Quantification of GFP signals in Fig. 6

      Fig. 6 has been revised according to the suggestion of Reviewer #3. A comprehensive comparison of PI(4,5)P2 and the Ni-NTA membrane lipid in FGF2 membrane translocation assays is also contained in previous work that introduced the GUV-based FGF2 membrane translocation assay (Steringer et al., 2017).

      1. Experimental evidence for various aspects of FGF2 interactions with PI(4,5)P2

      Most of the points raised by Reviewer #3 have been addressed in previous work. For example, FGF2 has been demonstrated to dimerize only on membrane surfaces containing PI(4,5)P2 (Müller et al., 2015). In solution, FGF2 remained a monomer even after hours of incubation as analyzed by native gel electrophoresis and reducing vs. non-reducing SDS gels (see Fig. 3 in Müller et al, 2015). The same paper reported the first evidence for a potential role of C95 in FGF2 oligomerization; however, at the time, our studies were limited to FGF2 C77/95A. In the current manuscript, the in vitro experiments shown in Figs. 2 to 6 establish the unique role of C95 in PI(4,5)P2-dependent FGF2 oligomerization. As discussed above, FGF2 oligomers have been shown to contain disulfide bridges based on analyses on non-reducing gels in the absence and presence of DTT (Müller et al., 2015).

      References

      Brown DI, Griendling KK (2009) Nox proteins in signal transduction. Free Radic Biol Med 47: 1239-1253

      Decker CG, Wang Y, Paluck SJ, Shen L, Loo JA, Levine AJ, Miller LS, Maynard HD (2016) Fibroblast growth factor 2 dimer with superagonist in vitro activity improves granulation tissue formation during wound healing. Biomaterials 81: 157-168

      Hakim M, Fass D (2010) Cytosolic disulfide bond formation in cells infected with large nucleocytoplasmic DNA viruses. Antioxid Redox Signal 13: 1261-1271

      Legrand C, Saleppico R, Sticht J, Lolicato F, Muller HM, Wegehingel S, Dimou E, Steringer JP, Ewers H, Vattulainen I et al (2020) The Na,K-ATPase acts upstream of phosphoinositide PI(4,5)P2 facilitating unconventional secretion of Fibroblast Growth Factor 2. Commun Biol 3: 141

      Lennicke C, Cocheme HM (2021) Redox metabolism: ROS as specific molecular regulators of cell signaling and function. Mol Cell 81: 3691-3707

      Locker JK, Griffiths G (1999) An unconventional role for cytoplasmic disulfide bonds in vaccinia virus proteins. J Cell Biol 144: 267-279

      Lolicato F, Saleppico R, Griffo A, Meyer A, Scollo F, Pokrandt B, Muller HM, Ewers H, Hahl H, Fleury JB et al (2022) Cholesterol promotes clustering of PI(4,5)P2 driving unconventional secretion of FGF2. J Cell Biol 221

      Müller HM, Steringer JP, Wegehingel S, Bleicken S, Munster M, Dimou E, Unger S, Weidmann G, Andreas H, Garcia-Saez AJ et al (2015) Formation of Disulfide Bridges Drives Oligomerization, Membrane Pore Formation and Translocation of Fibroblast Growth Factor 2 to Cell Surfaces. J Biol Chem 290: 8925-8937

      Nawrocka D, Krzyscik MA, Opalinski L, Zakrzewska M, Otlewski J (2020) Stable Fibroblast Growth Factor 2 Dimers with High Pro-Survival and Mitogenic Potential. Int J Mol Sci 21

      Netto LES, Machado L (2022) Preferential redox regulation of cysteine-based protein tyrosine phosphatases: structural and biochemical diversity. FEBS J 289: 5480-5504

      Nordzieke DE, Medrano-Fernandez I (2018) The Plasma Membrane: A Platform for Intra- and Intercellular Redox Signaling. Antioxidants (Basel) 7

      Sachl R, Cujova S, Singh V, Riegerova P, Kapusta P, Muller HM, Steringer JP, Hof M, Nickel W (2020) Functional Assay to Correlate Protein Oligomerization States with Membrane Pore Formation. Anal Chem 92: 14861-14866

      Singh V, Macharova S, Riegerova P, Steringer JP, Muller HM, Lolicato F, Nickel W, Hof M, Sachl R (2023) Determining the Functional Oligomeric State of Membrane-Associated Protein Oligomers Forming Membrane Pores on Giant Lipid Vesicles. Anal Chem 95: 8807-8815

      Steringer JP, Lange S, Cujova S, Sachl R, Poojari C, Lolicato F, Beutel O, Muller HM, Unger S, Coskun U et al (2017) Key steps in unconventional secretion of fibroblast growth factor 2 reconstituted with purified components. eLife 6: e28985

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      In their paper, Kang et al. investigate rigidity sensing in amoeboid cells, showing that, despite their lack of proper focal adhesions, amoeboid migration of single cells is impacted by substrate rigidity. In fact, many different amoeboid cell types can durotax, meaning that they preferentially move towards the stiffer side of a rigidity gradient.

      The authors observed that NMIIA is required for durotaxis and, building on this observation, they generated a model to explain how durotaxis could be achieved in the absence of strong adhesions. According to the model, substrate stiffness alters the diffusion rate of NMIIA, with softer substrates allowing for faster diffusion. This allows for NMIIA accumulation at the back, which, in turn, results in durotaxis.

      The experiments support the main message of the paper regarding durotaxis by amoeboid cells. In my opinion, a few clarifications on the mechanism proposed to explain this phenomenon could strengthen this research:

      (1) According to your model, the rear end of the cell, which is in contact with softer substrates, will have slower diffusion rates of NMIIA. Does this mean that bigger cells will durotax better than smaller cells because the stiffness difference between front and rear is higher? Is it conceivable to attenuate the slope of the durotactic gradient to a degree where smaller cells lose their ability to durotax, while longer cells retain their capacity for directional movement?

      We thank the reviewer for this comment. In fact, it is not always the case that bigger cells will durotax better than smaller cells. Although bigger cells will sense higher stiffness difference between the front and rear, cells placed on different regions of underlying substrates may respond differently. This is because diffusion coefficient difference is not proportional to stiffness difference in our theoretical model. Therefore, when cells are placed on a very stiff substrate, cells may not durotax. When cells are placed on a region with suitable stiffness, where cells are sensitive to stiffness gradient, bigger cells will durotax better than smaller cells. In this situation, as you mentioned, lowering the stiffness gradient will make smaller cells become adurotactic while longer cells still durotax.

      We tried to address this question further with our durotaxis assay, but there was a challenge: the amoeboid cells we use, including CD4+ Naïve T cells, neutrophils, dHL-60 cells and Dictyostelium, frequently protrude, retract and alter their contact area with the substrate, which makes it difficult for us to distinguish between bigger and smaller cells within a particular cell type. Previously reported durotactic cell lines, such as MDA-MB-231 and HT1080 cells, are bigger than the amoeboid cells we use, but they are mesenchymal cells and adopt distinct mechanisms which always involve stable focal adhesions. For this reason, although we are eager to answer this question experimentally and the stiffness gradient is tunable in our system, we have not found an appropriate approach and experimental setup.

      (2) Where did you place the threshold for soft, middle, and stiff regions (Figure 6)? Is it possible that you only have a linear rigidity gradient in the center of your gel and the more you approach the borders, the flatter the gradient gets? In this case, cells would migrate randomly on uniform substrates. Did you perform AFM over the whole length of the gel or just in the central part?

      We thank the reviewer for this comment. We have performed AFM over the whole length of our gradient gel (Fig. S1A). We divide the gel into three equal parts (stiff: 1-4 mm; middle: 4-7 mm; soft: 7-10 mm) and the stiffness gradient is almost linear within each part as shown in Fig. S1A.

      (3) In which region (soft, middle, stiff) did you perform all the cell tracking of the previous figures?

      We thank the reviewer for this question. We performed the cell tracking in the soft region of the gradient gel.

      (4) What is the level of confinement experienced by the cells? Is it possible that cells on the soft side of the gels experience less confinement due to a "spring effect" whereby the coverslips descending onto the cells might exert diminished pressure because the soft hydrogels act as buffers, akin to springs? If this were the case, cells could migrate following a confinement gradient.

      We thank the reviewer for this comment. Although the possibility that our thin hydrogel layers act as buffers cannot be completely excluded, we have performed the durotaxis assay without the upper gradient gel that provides confinement (Author response image 1A). In this case, CD4+ Naïve T cells, neutrophils, dHL-60 cells and Dictyostelium can still durotax (Author response image 1B-E), indicating that the stiffness gradient itself is sufficient to direct amoeboid cell migration.

      Author response image 1.

      Illustration of the durotaxis system without confinement (A) and y-FMI of CD4+ Naïve T cells (B), neutrophils (C), dHL-60 cells (D) and Dictyostelium (E) cultured on uniform substrate or gradient substrate (n ≥ 30 tracks were analyzed for each experiment, N = 3 independent experiments for each condition, replicates are biological). All error bars are SEM. ****, P < 0.0001, by Student’s t-test.
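
      The y-FMI reported above can be illustrated with a minimal sketch. It assumes the commonly used definition of the forward migration index (net displacement along the gradient axis divided by the total path length) and that the stiffness gradient runs along y; the function name and the example tracks are hypothetical and are not taken from the authors' analysis code.

```python
import math


def y_fmi(track):
    """Forward migration index along y for one cell track.

    `track` is a sequence of (x, y) positions in time order.
    Assumed definition: net y displacement / total path length.
    """
    if len(track) < 2:
        raise ValueError("a track needs at least two positions")
    # Total distance travelled, summed over consecutive positions.
    path_length = sum(math.dist(p, q) for p, q in zip(track, track[1:]))
    if path_length == 0:
        return 0.0  # the cell never moved
    return (track[-1][1] - track[0][1]) / path_length


# Hypothetical tracks: one moving straight up the gradient, one wandering.
straight = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
zigzag = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
```

      Under this definition, a value near 1 indicates persistent movement up the gradient, a value near 0 indicates no net bias along y, and negative values indicate movement down the gradient.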

      Reviewer #2 (Public Review):

      Summary:

      The authors developed an imaging-based device that provides both spatial confinement and stiffness gradient to investigate if and how amoeboid cells, including T cells, neutrophils, and Dictyostelium, can durotax. Furthermore, the authors showed that the mechanism for the directional migration of T cells and neutrophils depends on non-muscle myosin IIA (NMIIA) polarized towards the soft-matrix-side. Finally, they developed a mathematical model of an active gel that captures the behavior of the cells described in vitro.

      Strengths:

      The topic is intriguing as durotaxis is essentially thought to be a direct consequence of mechanosensing at focal adhesions. To the best of my knowledge, this is the first report on amoeboid cells that do not depend on FAs to exert durotaxis. The authors developed an imaging-based durotaxis device that provides both spatial confinement and stiffness gradient and they also utilized several techniques such as quantitative fluorescent speckle microscopy and expansion microscopy. The results of this study have well-designed control experiments and are therefore convincing.

      Weaknesses:

      Overall this study is well performed but there are still some minor issues I recommend the authors address:

      (1) When using NMIIA/NMIIB knockdown cell lines to distinguish the role of NMIIA and NMIIB in amoeboid durotaxis, it would be better if the authors took compensatory effects into account.

      We thank the reviewer for this suggestion. We have investigated the compensation of myosin in NMIIA and NMIIB KD HL-60 cells using Western blot and added this result in our updated manuscript (Fig. S4B, C). The results showed that the level of NMIIB protein in NMIIA KD cells doubled while there was no compensatory upregulation of NMIIA in NMIIB KD cells. This is consistent with our conclusion that NMIIA rather than NMIIB is responsible for amoeboid durotaxis since in NMIIA KD cells, compensatory upregulation of NMIIB did not rescue the durotaxis-deficient phenotype.

      (2) The expansion microscopy assay is not clearly described and some details are missed such as how the assay is performed on cells under confinement.

      We thank the reviewer for this comment. We have updated the details of the expansion microscopy assay in our revised manuscript (lines 481-485), including how the assay is performed on cells under confinement:

      Briefly, CD4+ Naïve T cells were seeded on a gradient PA gel with another upper gel providing confinement. 4% PFA was used to fix cells for 15 min at room temperature. After fixation, the upper gradient PA gel was carefully removed and the bottom gradient PA gel with seeded cells was immersed in an anchoring solution containing 1% acrylamide and 0.7% formaldehyde (Sigma, F8775) for 5 h at 37 °C.

      (3) In this study, an active gel model was employed to capture experimental observations. Previously, some active nematic models were also considered to describe cell migration, which is controlled by filament contraction. I suggest the authors provide a short discussion on the comparison between the present theory and those prior models.

      We thank the reviewer for this suggestion. Active nematic models have been employed to recapitulate many phenomena during cell migration (Nat Commun., 2018, doi: 10.1038/s41467-018-05666-8.). The active nematic model describes the motion of cells using the orientation field, Q, and the velocity field, u. The director field n with (n = −n) is employed to represent the nematic state, which has head-tail symmetry. However, in our experiments, actin filaments are clearly polarized: they polymerize and flow towards the direction of cell migration. Therefore, we chose the active gel model, which describes the polarized actin field during cell migration. In the Discussion, we have provided a comparison between the active gel model and the motor-clutch model. We have also added a short discussion of the present model and the active nematic model in the main text (lines 345-347):

      The active nematic model employs active extensile or contractile agents to push or pull the fluid along their elongation axis to simulate cells flowing (61).

      (4) In the present model, actin flow contributes to cell migration while myosin distribution determines cell polarity. How does this model couple actin and myosin together?

      We thank the reviewer for this question. In our model, the polarization field P(r,t) is employed to couple actin and myosin together. Actin accumulates at the front while myosin diffuses in the opposite direction. We therefore propose that actin and myosin flow in opposite directions, which is captured in the convection terms of the actin (∇[c(v+wP)]) and myosin (∇[m(-wP)]) density fields.

      Reviewing Editor (Recommendations For The Authors):

      We suggest that you cite the publication about confinement force microscopy from the Betz lab (https://doi.org/10.1101/2023.08.22.554088).

      We thank the editor for this suggestion. We have cited this publication in line 89 in our updated manuscript.

      Reviewer #1 (Recommendations For The Authors):

      Minor points and text corrections:

      - In line 288 you state that NMIIA basal diffusion rate is larger on softer substrates, while in line 315 you say that NMIIA is more diffusive on stiff. The two sentences seem to contradict each other.

      We thank the reviewer for pointing out this mistake. In our active gel model, the basal diffusion rate of NMIIA is larger on stiffer substrate. We have corrected this mistake in line 288 (line 283 in the updated manuscript) in our revised manuscript.

      - How were the non-muscle myosin images (Figure 3F) collected?

      We thank the reviewer for this question. The non-muscle myosin images in Fig. 3F are single planes collected by epifluorescence-confocal microscopy. We have updated the related method in our revised manuscript in line 477-478:

      After mounting medium is solidified, single plane images were captured using a 63×1.4 NA objective lens on Andor Dragonfly epi-fluorescence confocal imaging system.

      - Is there a quantification of NMIIA accumulation at the back?

      We thank the reviewer for this question. We have a quantification of NMIIA distribution in Fig. 3G. We measured the fluorescence intensity of NMIIA and NMIIB in the soft and stiff region of cells and found that the soft/stiff fluorescence ratio of NMIIB is about 0.95 and the ratio of NMIIA is about 1.82, indicating that NMIIA tends to be localized at the back while NMIIB is evenly distributed between the soft and stiff regions of cells.

      - At which frequency were images acquired for Fluorescent Speckle Microscopy? Overall, I think it would help to state the length and frequency of videos in the legends.

      We thank the reviewer for this comment. We have updated the length (10 min for movies 6-10 and 80 sec for movie 11) and frequency (15 sec intervals for movies 6-10 and 2 sec intervals for movie 11) of Fluorescent Speckle Microscopy videos in our revised manuscript.

      Reviewer #2 (Recommendations For The Authors):

      The cell contour of Figure S5C is not very clear.

      We thank the reviewer for this comment. We have marked the outline of the cell in Fig. S5C in our updated manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, Kroll et al. conduct an in-depth behavioral analysis of F0 knockouts of 4 genes associated with late-onset Alzheimer's Disease (AD), together with 3 genes associated with early-onset AD. Kroll and colleagues developed a web application (ZOLTAR) to compare sleep-associated traits between genetic mutants with those obtained from a panel of small molecules to promote the identification of affected pathways and potential therapeutic interventions. The authors make a set of potentially important findings vis-à-vis the relationship between AD-associated genes and sleep. First, they find that loss-of-function in late-onset AD genes universally results in night-time sleep loss, consistent with the well supported hypothesis that sleep disruption contributes to Alzheimer's-related pathologies. psen-1, an early-onset associated AD gene, which the authors find is principally responsible for the generation of AB40 and AB42 in zebrafish, also shows a slight increase in activity at night and slight decreases in night-time sleep. Conversely, psen-2 mutations increase daytime sleep, while appa/appb mutations have no impact on sleep. Finally, using ZOLTAR, the authors identify serotonin receptor activity as potentially disrupted in sorl1 mutants, while betamethasone is identified as a potential therapeutic to promote reversal of psen2 knockout-associated phenotypes.

      This is a highly innovative and thorough study, yet a handful of key questions remain. First, are night-time sleep loss phenotypes observed in all knockouts for late-onset AD genes in the larval zebrafish a valid proxy for AD risk?

      We cannot say, but it is an interesting question. We selected the four late-onset Alzheimer’s risk genes (APOE, CD2AP, CLU, SORL1) based on human genetics data and brain expression in zebrafish larvae, not based on their likelihood to modify sleep behaviour, which we could have tried by searching for overlaps with GWAS of sleep phenotypes, for example. Consequently, we find it remarkable that all four of these genes caused a night-time sleep phenotype when mutated. We also find it reassuring that knockout of appa/appb and psen2 did not cause a night-time sleep phenotype, which largely excludes the possibility that the phenotype is a technical artefact (e.g. caused by the F0 knockout method) or a property of every gene expressed in the larval brain.

      Having said that, it could still be a coincidence, rather than a special property of genes associated with late-onset AD. In addition to testing additional late-onset Alzheimer’s risk genes, the ideal way to answer this question would be to test in parallel a random set of genes expressed in the brain at this stage of development. From this random set, one could estimate the proportion of genes that cause a night-time sleep phenotype when mutated. One could then use that information to test whether late-onset Alzheimer’s risk genes are indeed enriched for genes that cause a night-time sleep phenotype when mutated.
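
      The enrichment test described above can be sketched as a one-sided Fisher's exact test on a 2×2 table (risk genes vs. random brain-expressed genes, phenotype vs. no phenotype). This is only an illustration of the proposed logic: the function name is hypothetical, and the counts for the random gene set are invented for the example, since that experiment has not been performed.

```python
from math import comb


def enrichment_p_value(hits_risk, n_risk, hits_random, n_random):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Returns the probability of seeing at least `hits_risk` genes with the
    phenotype among the `n_risk` risk genes, if phenotype-causing genes were
    spread at random over the risk set and the random comparison set.
    """
    total = n_risk + n_random
    total_hits = hits_risk + hits_random
    upper = min(n_risk, total_hits)
    # Sum the hypergeometric probabilities for outcomes at least as extreme.
    numerator = sum(
        comb(total_hits, x) * comb(total - total_hits, n_risk - x)
        for x in range(hits_risk, upper + 1)
    )
    return numerator / comb(total, n_risk)


# 4 of 4 late-onset risk genes gave a night-time sleep phenotype (from the study).
# HYPOTHETICAL comparison set: 10 of 50 random brain-expressed genes do so.
p = enrichment_p_value(hits_risk=4, n_risk=4, hits_random=10, n_random=50)
print(f"one-sided p = {p:.4f}")
```

      With only four risk genes the test has limited power, which is one reason why testing additional late-onset risk genes alongside the random set would be valuable.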

      For those mutants that cause night-time sleep disturbances, do these phenotypes share a common underlying pathway? e.g. Do 5-HT reuptake inhibitors promote sleep across all 4 late-onset genes in addition to psen1? Can 5-HT reuptake inhibitors reverse other AD-related pathologies in zebrafish? Can compounds be identified that have a common behavioral fingerprint across all or multiple AD risk genes? Do these modify sleep phenotypes?

      To attempt to answer these questions, we used ZOLTAR to generate predictions for all the knockout behavioural fingerprints presented in the study, in the same way as for sorl1 in Fig. 5 and Fig. 5–supplement 1. Here are the indications, targets, and KEGG pathways which are shared by the largest number of knockouts (Author response image 1):

      – One indication is shared by 4/7 knockouts: “opioid dependence” (significant for appa/appb, psen1, apoea/apoeb, cd2ap).

      – Four targets are shared by 4/7 knockouts: “strychnine-binding glycine receptor” (psen1, apoea/apoeb, clu, sorl1); “neuronal acetylcholine receptor beta-2” (psen1, apoea/apoeb, cd2ap, clu); thyroid peroxidase (psen1, apoea/apoeb, cd2ap, clu); carbonic anhydrase IV (appa/appb, psen1, psen2, cd2ap).

      – Three KEGG pathways are shared by 5/7 knockouts: “cholinergic synapse” (psen1, apoea/apoeb, cd2ap, clu, sorl1); tyrosine metabolism (psen2, apoea/apoeb, cd2ap, clu, sorl1); and “nitrogen metabolism” (appa/appb, psen1, psen2, apoea/apoeb, cd2ap).

      As a reminder, we hypothesised that loss of Sorl1 affected serotonin signalling based on the following annotations being significant: indication “depression”, target “serotonin transporter”, and KEGG pathway “serotonergic synapse”. Indication “depression” is only significant for sorl1 knockouts; target “serotonin transporter” is also significant for appa/appb and psen2 knockouts; and KEGG pathway “serotonergic synapse” is also significant for psen2 knockouts. ZOLTAR therefore does not predict serotonin signalling to be a major theme common to all mutants with a night-time sleep loss phenotype.

      Particularly interesting is cholinergic signalling appearing in the most common targets and KEGG pathways. Acetylcholine signalling is a major theme in research on AD. For example, the first four drugs ever approved by the FDA to treat AD were acetylcholinesterase inhibitors, which increase acetylcholine signalling by preventing its breakdown by acetylcholinesterase. These drugs are generally considered only to treat symptoms and not modify disease course, but this view has been called into question (Munoz-Torrero, 2008; Relkin, 2007). If, as ZOLTAR suggests, mutations in several Alzheimer’s risk genes affect cholinergic signalling early in development, this would point to a potential causal role of cholinergic disruption in AD.

      Author response image 1.

      Common predictions from ZOLTAR for the seven Alzheimer’s risk genes tested. Predictions from ZOLTAR which are shared by multiple knockout behavioural fingerprints presented in the study. Only indications, targets, and KEGG pathways which are significant for at least three of the seven knockouts tested are shown, ranked from the annotations which are significant for the largest number of knockouts.
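
      The ranking behind this figure can be sketched as a simple count of how many knockouts share each significant annotation. The sketch below is an illustration only: the function name is hypothetical, and the input lists just four of the annotations named in the text above (not the full ZOLTAR output), re-arranged per knockout.

```python
from collections import Counter


def shared_annotations(significant, min_knockouts=3):
    """Rank annotations by the number of knockouts in which they are significant.

    `significant` maps each knockout to its significant annotations.
    Only annotations found in at least `min_knockouts` knockouts are returned.
    """
    counts = Counter(a for annots in significant.values() for a in set(annots))
    # Most widely shared first; ties broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(a, n) for a, n in ranked if n >= min_knockouts]


GLYR = "strychnine-binding glycine receptor"
# Partial data transcribed from the response text (four annotations only).
significant = {
    "appa/appb": ["opioid dependence", "nitrogen metabolism"],
    "psen1": ["opioid dependence", GLYR, "cholinergic synapse", "nitrogen metabolism"],
    "psen2": ["nitrogen metabolism"],
    "apoea/apoeb": ["opioid dependence", GLYR, "cholinergic synapse", "nitrogen metabolism"],
    "cd2ap": ["opioid dependence", "cholinergic synapse", "nitrogen metabolism"],
    "clu": [GLYR, "cholinergic synapse"],
    "sorl1": [GLYR, "cholinergic synapse"],
}

ranking = shared_annotations(significant)
```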

      Finally, the web-based platform presented could be expanded to facilitate comparison of other behavioral phenotypes, including stimulus-evoked behaviors.

      Yes, absolutely. The behavioural dataset we used (Rihel et al., 2010) did not measure responses to stimuli other than day/night light transitions, but the “SauronX” platform and dataset (Myers-Turnbull et al., 2022) seems particularly well suited for this. To provide some context, we and collaborators have occasionally used the dataset by Rihel et al. (2010) to generate hypotheses or find candidate drugs that reverse a behavioural phenotype measured in the sleep/wake assay (Ashlin et al., 2018; Hoffman et al., 2016). The present work was the occasion to enable a wider and more intuitive use of this dataset through the ZOLTAR app, which has already proven successful. Future versions of ZOLTAR may seek to incorporate larger drug datasets using more types of measurements.

      Finally, the authors propose but do not test the hypothesis that sorl1 might regulate localization/surface expression of 5-HT2 receptors. This could provide exciting / more convincing mechanistic support for the assertion that serotonin signaling is disrupted upon loss of AD-associated genes.

      While working on the Author Response, we made some changes to the analysis run by ZOLTAR to calculate enrichments (see Methods and github.com/francoiskroll/ZOLTAR, notes on v2). With the new version, 5-HT receptor type 2 is not a significantly enriched target for the sorl1 knockout fingerprint but type 4 is. 5-HT receptor type 4 was also shown to interact with sorting nexin 27, a subunit of retromer, so is a promising candidate (Joubert et al., 2004). Antibodies against human 5-HT receptor type 2 and 4a exist; whether they would work in zebrafish remains to be tested. In our experience, the availability of antibodies suitable for immunohistochemistry in the zebrafish is a serious experimental roadblock.

      Note, all the results presented in the “Version of Record” are from ZOLTAR v2.

      Despite these important considerations, this study provides a valuable platform for high-throughput analysis of sleep phenotypes and correlation with small-molecule-induced sleep phenotypes.

      Strengths:

      - Provides a useful platform for comparison of sleep phenotypes across genotypes/drug manipulations.

      - Presents convincing evidence that night-time sleep is disrupted in mutants for multiple late onset AD-related genes.

      - Provides potential mechanistic insights for how AD-related genes might impact sleep and identifies a few drugs that modify their identified phenotypes

      Weaknesses:

      - Exploration of potential mechanisms for serotonin disruption in sorl1 mutants is limited.

      - The pipeline developed can only be used to examine sleep-related / spontaneous movement phenotypes and stimulus-evoked behaviors are not examined.

      - Comparisons between mutants/exploration of commonly affected pathways are limited.

      Thank you for these excellent suggestions, please see our answers above.

      Reviewer #2 (Public Review):

      Summary:

      This work delineates the larval zebrafish behavioral phenotypes caused by the F0 knockout of several important genes that increase the risk for Alzheimer's disease. Using behavioral pharmacology, comparing the behavioral fingerprint of previously assayed molecules to the newly generated knockout data, compounds were discovered that impacted larval movement in ways that suggest interaction with or recovery of disrupted mechanisms.

      Strengths:

      This is a well-written manuscript that uses newly developed analysis methods to present the findings in a clear, high-quality way. The addition of an extensive behavioral analysis pipeline is of value to the field of zebrafish neuroscience and will be particularly helpful for researchers who prefer the R programming language. Even the behavioral profiling of these AD risk genes, regardless of the pharmacology aspect, is an important contribution. The recovery of most behavioral parameters in the psen2 knockout with betamethasone, predicted by comparing fingerprints, is an exciting demonstration of the approach. The hypotheses generated by this work are important stepping stones to future studies uncovering the molecular basis of the proposed gene-drug interactions and discovering novel therapeutics to treat AD or co-occurring conditions such as sleep disturbance.

      Weaknesses:

      - The overarching concept of the work is that comparing behavioral fingerprints can align genes and molecules with similarly disrupted molecular pathways. While the recovery of the psen2 phenotypes by one molecule with the opposite phenotype is interesting, as are previous studies that show similar behaviorally-based recoveries, the underlying assumption that normalizing the larval movement normalizes the mechanism still lacks substantial support. There are many ways that a reduction in movement bouts could be returned to baseline that are unrelated to the root cause of the genetically driven phenotype. An ideal experiment would be to thoroughly characterize a mutant, such as by identifying a missing population of neurons, and use this approach to find a small molecule that rescues both behavior and the cellular phenotype. If the connection to serotonin in the sorl1 was more complete, for example, the overarching idea would be more compelling.

      Thank you for this cogent criticism.

      On the first point, we were careful not to claim that betamethasone normalises the molecular/cellular mechanism that causes the psen2 behavioural phenotype. Having said that, yes, to a certain extent that would be the hope of the approach. As you say, not every compound which normalises the behavioural fingerprint will normalise the underlying mechanism, but the converse seems true: every compound that normalises the underlying mechanism should also normalise the behavioural fingerprint. We think this logic makes the “behaviour-first” approach innovative and interesting. The logic is to discover compounds that normalise the behavioural phenotype first, and only subsequently test whether they also normalise the molecular mechanism, akin to testing first whether a drug resolves the symptoms before testing whether it actually modifies disease course. While in practice testing thousands of drugs in sufficient sample sizes and replicates on a mutant line is challenging, the dataset queried through ZOLTAR provides a potential shortcut by shortlisting in silico compounds that have the opposite effect on behaviour.
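      To make the shortlisting idea concrete, here is a minimal sketch of ranking compounds by how strongly their fingerprint opposes a mutant fingerprint. It is an illustration only and not the ZOLTAR implementation: the use of cosine similarity as the ranking score, the function names, the compound names, and all numbers are assumptions made for this example.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two fingerprints (-1 = opposite, 1 = identical direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def shortlist_opposite(mutant_fp, compound_fps, top_n=3):
    """Return the top_n (name, score) pairs with the most negative similarity to the mutant."""
    scored = [(name, cosine_similarity(mutant_fp, fp)) for name, fp in compound_fps.items()]
    scored.sort(key=lambda pair: pair[1])  # most anti-correlated first
    return scored[:top_n]


# Hypothetical fingerprints (z-scores for four made-up parameters).
mutant = [-1.0, -0.5, 0.8, 0.6]
library = {
    "compound_A": [1.0, 0.5, -0.8, -0.6],   # mirrors the mutant fingerprint
    "compound_B": [-1.0, -0.5, 0.8, 0.6],   # mimics the mutant fingerprint
    "compound_C": [0.5, -1.0, 0.6, -0.8],   # unrelated (orthogonal) fingerprint
}

for name, score in shortlist_opposite(mutant, library):
    print(f"{name}: cos = {score:+.2f}")
```

      In this toy library, the compound that mirrors the mutant fingerprint is ranked first, which is the type of candidate one would then test for behavioural rescue.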

      You mention a “reduction in movement bouts” but note here that the number of behavioural parameters tested is key to our argument. To take the two extremes, say the only behavioural parameter we measured in psen2 knockout larvae was time active during the day, then, yes, any stimulant used at the right concentration could probably normalise the phenotype. In this situation, claiming that the stimulant is likely to also normalise the underlying mechanism, or even that it is a genuine “phenotypic rescue”, would not be convincing. Conversely, say we were measuring thousands of behavioural parameters under various stimuli, such as swimming speed, position in the well, bout usage, tail movements, and eye angles, it seems almost impossible for a compound to rescue most parameters without also normalising the underlying mechanism. The present approach is somewhere in between: ZOLTAR uses six behavioural parameters for prediction (e.g. Fig 6a), but all 17 parameters calculated by FramebyFrame can be used to assess rescue during a subsequent experiment (Fig. 6c). For both, splitting each parameter into day and night increases the resolution of the approach, which partly answers your criticism. For example, betamethasone rescued the day-time hypoactivity without causing night-time hyperactivity, so we are not making the “straw man argument” explained above of using any broad stimulant to rescue the hypoactivity phenotype.

      Furthermore, for diseases where the behavioural defect is the primary concern, such as autism or bipolar disorder, perhaps this behaviour-first approach is all that is needed, and whether or not the compound precisely rescues the underlying mechanism is somewhat secondary. The use of lithium to prevent manic episodes in bipolar disorder is a good example. It was initially tested because mania was thought to be caused by excess uric acid and lithium can dissolve uric acid (Mitchell and Hadzi-Pavlovic, 2000). The theory is now discredited, but lithium continues to be used without a precise understanding of its mode of action. In this example, behavioural rescue alone, assuming the secondary effects are tolerable, is sufficient to be beneficial to patients, and whether it modulates the correct causal pathway is secondary.

      On the second point, we agree that testing first ZOLTAR on a mutant for which we have a fairly good understanding of the mechanism causing the behavioural phenotype could have been a productive approach. Note, however, that examples already exist in the literature (Ashlin et al., 2018; Hoffman et al., 2016). The example from Hoffman et al. (2016) is especially convincing. Drugs generating behavioural fingerprints that positively correlate with the cntnap2a/cntnap2b double knockout fingerprint were enriched with NMDA and GABA receptor antagonists. In experiments analogous to our citalopram and fluvoxamine treatments (Fig. 5c,d and Fig. 5–supplement 1c,d), cntnap2a/cntnap2b knockout larvae were overly sensitive to the NMDA receptor antagonist MK-801 and the GABAA receptor antagonist pentylenetetrazol (PTZ). Among other drugs tested, zolpidem, a GABAA receptor agonist, caused opposite effects on wild-type and cntnap2a/cntnap2b knockout larvae. Knockout larvae were found to have fewer GABAergic neurons in the forebrain. While these studies did not use precisely the same analysis that ZOLTAR runs, they used the same rationale and behavioural dataset to make these predictions (Rihel et al., 2010), which shows that approaches like ZOLTAR can point to causal processes.

      On your last point, we hope our experiment testing fluvoxamine, another selective serotonin reuptake inhibitor (SSRI), makes the connection between Sorl1 and serotonin signalling more convincing.

      - The behavioral difference between the sorl1 KO and scrambled at the higher dose of the citalopram is based on a small number of animals. The KO Euclidean distance measure is also more spread out than for the other datasets, and it looks like only five or so fish are driving the group difference. It also appears as though the numbers were also from two injection series. While there is nothing obviously wrong with the data, I would feel more comfortable if such a strong statement of a result from a relatively subtle phenotype were backed up by a higher N or a stable line. It is not impossible that the observed difference is an experimental fluke. If something obvious had emerged through the HCR, that would have also supported the conclusions. As it stands, if no more experiments are done to bolster the claim, the confidence in the strength of the link to serotonin should be reduced (possibly putting the entire section in the supplement and modifying the discussion). The discussion section about serotonin and AD is interesting, but I think that it is excessive without additional evidence.

      We mostly agree with this criticism. One could interpret the larger spread of the data for sorl1 KO larvae treated with 10 µM citalopram as evidence that the knockout larvae do indeed react differently to the drug at this dose, regardless of being driven by a subset of the animals. The result indeed does not survive removing the top 3 (p = 0.18) or top 5 (p = 0.87) sorl1 KO + 10 µM larvae, but this amounts to excluding 21% (3/14) or 36% (5/14) of the datapoints as potential outliers, which is unreasonable. In fact, excluding the top 5 sorl1 KO + 10 µM is equivalent to calling any datapoint with z-score > 0.2 an outlier (z-scores of the top 5 datapoints are 0.2–1.8). Applying consistently the same criterion to the scrambled + 10 µM group would remove the top 6 datapoints (z-scores = 0.5–3.9). Comparing the resulting two distributions again gives the sorl1 KO + 10 µM distribution as significantly higher (p = 0.0015). We would also mention that Euclidean distance, as a summary metric for distance between behavioural fingerprints, has limitations. For example, the measure will be more sensitive to changes in some parameters but not others, depending on how much room there is for a given parameter to change. We included this metric to lend support to the observation one can draw from the fingerprint plot (Fig. 5c) that sorl1 mutants respond in an exaggerated way to citalopram across many parameters, while being agnostic to which parameter might matter most.

      Given that the HCR did not reveal anything striking, we agree with you that too much of our argument relied on this result being robust. As you and Reviewer #3 suggested, we repeated this experiment with a different SSRI, fluvoxamine (Fig. 5–supplement 1). We cannot readily explain why the result was opposite to what we found with citalopram, but in both cases sorl1 knockout larvae reacted differently than their control siblings, which adds an argument to our claim that ZOLTAR correctly predicted serotonin signalling as a disrupted pathway from the behavioural fingerprint. Accordingly, we mostly kept the Discussion on Sorl1 the same, although we concede that we may not have identified the molecular mechanism.

      - The authors suggest two hypotheses for the behavioral difference between the sorl1 KO and scrambled at the higher dose of the citalopram. While the first is tested, and found to not be supported, the second is not tested at all ("Ruling out the first hypothesis, sorl1 knockouts may react excessively to a given spike in serotonin." and "Second, sorl1 knockouts may be overly sensitive to serotonin itself because post-synaptic neurons have higher levels of serotonin receptors."). Assuming that the finding is robust, there are probably other reasons why the mutants could have a different sensitivity to this molecule. However, if this particular one is going to be mentioned, it is surprising that it was not tested alongside the first hypothesis. This work could proceed without a complete explanation, but additional discussion of the possibilities would be helpful or why the second hypothesis was not tested.

      There are no strong scientific reasons why this hypothesis was not tested. The lead author (F Kroll) moved to a different lab and country so the project was finalised at that time. We do not plan on testing this hypothesis at this stage. However, we adapted the wording to make it clear this is one possible alternative hypothesis which could be tested in the future. The small differences found by HCR are actually more in line with the new results from the fluvoxamine experiment, so it may also be that both hypotheses (pre-synaptic neurons releasing less serotonin when reuptake is blocked; or post-synaptic neurons being less sensitive) contribute. The fluvoxamine experiment was performed in a different lab (ICM, Paris; all other experiments were done in UCL, London) in a different wild-type strain (TL in ICM, AB x Tup LF in UCL), which complicates how one interprets this discrepancy.

      - The authors claim that "all four genes produced a fairly consistent phenotype at night". While it is interesting that this result arose in the different lines, the second clutch for some genes did not replicate as well as others. I think the findings are compelling, regardless, but the sometimes missing replicability should be discussed. I wonder if the F0 strategy adds noise to the results and if clean null lines would yield stronger phenotypes. Please discuss this possibility, or others, in regard to the variability in some phenotypes.

      For the first part of this point, please see below our answer to Reviewer #3, point (2) c.

      Regarding the F0 strategy potentially adding variability, it is an interesting question which we tested in a larger dataset of behavioural recordings from F0 and stable knockouts for the same genes (unpublished). In summary, the F0 knockout method does not increase clutch-to-clutch or larva-to-larva variability in the assay. F0 knockout experiments found many more significant parameters and larger effect sizes than stable knockout experiments, but this difference could largely be explained by the larger sample sizes of F0 knockout experiments. In fact, larger sample sizes within individual clutches appear to be a major advantage of the F0 knockout approach over in-cross of heterozygous knockout animals as they increase the sensitivity of the assay without causing substantial variability. We plan to report in more detail on this analysis in a separate paper as we think it would dilute the focus of the present work.

      - In this work, the knockout of appa/appb is included. While APP is a well-known risk gene, there is no clear justification for making a knockout model. It is well known that the upregulation of app is the driver of Alzheimer's, not downregulation. The authors even indicate an expectation that it could be similar to the other knockouts ("Moreover, the behavioural phenotypes of appa/appb and psen1 knockout larvae had little overlap while they presumably both resulted in the loss of Aβ." and "Comparing with early-onset genes, psen1 knockouts had similar night-time phenotypes, but loss of psen2 or appa/appb had no effect on night-time sleep."). There is no reason to expect similarity between appa/appb and psen1/2. I understand that the app knockouts could unveil interesting early neurodevelopmental roles, but the manuscript needs to be clarified that any findings could be the opposite of expectation in AD.

      On “there is no reason to expect similarity […]”, we disagree. Knockout of appa/appb and knockout of psen1 will both result in loss of Aβ (appa/appb encode Aβ and psen1 cleaves Appa/Appb to release Aβ, cf. Fig. 3e). Consequently, a phenotype caused by the loss of Aβ, or possibly other Appa/Appb cleavage products, should logically be found in both appa/appb and psen1 knockouts.

      On “it is well known that the upregulation of APP is the driver of Alzheimer’s, not downregulation”; we of course agree. Among others, the examples of Down syndrome, APP duplication (Sleegers et al., 2006), or mouse models overexpressing human APP show definitely that overexpression of APP is sufficient to cause AD. Having said that, we would not be so quick in dismissing APP knockout as potentially relevant to understanding of AD.

      Loss of soluble Aβ due to aggregation could contribute to pathology (Espay et al., 2023). Without getting too much into this intricate debate, links between levels of Aβ and risk of disease are often counter-intuitive too. For example, out of 138 PSEN1 mutations screened in vitro, 104 reduced total Aβ production and 11 even seemingly abolished the production of both Aβ40 and Aβ42 (Sun et al., 2017). In short, loss of soluble Aβ occurs in both AD and in our appa/appb knockout larvae.

      We added a sentence in Results (section psen2 knockouts […]) to briefly justify our appa/appb knockout approach. To be clear, we do not want to imply, for example, that the absence of a night-time sleep phenotype for appa/appb is contradictory to the body of literature showing links between Aβ and sleep, including in zebrafish (Özcan et al., 2020). As you say, our experiment tested loss of App, including Aβ, while the literature typically reports on overexpression of APP, as in APP/PSEN1-overexpressing mice (Jagirdar et al., 2021).

      Reviewer #3 (Public Review):

      In this manuscript by Kroll and colleagues, the authors describe combining behavioral pharmacology with sleep profiling to predict disease and potential treatment pathways at play in AD. AD is used here as a case study, but the approaches detailed can be used for other genetic screens related to normal or pathological states for which sleep/arousal is relevant. The data are for the most part convincing, although generally the phenotypes are relatively small and there are no major new mechanistic insights. Nonetheless, the approaches are certainly of broad interest and the data are comprehensive and detailed. A notable weakness is the introduction, which overly generalizes numerous concepts and fails to provide the necessary background to set the stage for the data.

      Major points

      (1) The authors should spend more time explaining what they see as the meaning of the large number of behavioral parameters assayed and specifically what they tell readers about the biology of the animal. Many are hard to understand--e.g. a "slope" parameter.

      We agree that some parameters do not tell something intuitive about the biology of the animal. It would be easy to speculate. For example, the “activity slope” parameter may indicate how quickly the animal becomes tired over the course of the day. On the other hand, fractal dimension describes the “roughness/smoothness” of the larva’s activity trace (Fig. 2–supplement 1a); but it is not obvious how to translate this into information about the physiology of the animal. We do not see this as an issue though. While some parameters do provide intuitive information about the animal’s behaviour (e.g. sleep duration or sunset startle as a measure of startle response), the benefit of having a large number of behavioural parameters is to compare behavioural fingerprints and assess rescue of the behavioural phenotype by small molecules (Fig. 6c). For this purpose, the more parameters the better. The “MoSeq” approach from Wiltschko et al., 2020 is a good example from literature that inspired our own Fig. 6c. While some of the “behavioural syllables” may be intuitive (e.g. running or grooming), it is probably pointless to try to explain the ‘meaning’ of the “small left turn in place with head motion” syllable (Wiltschko et al., 2020). Nonetheless, this syllable was useful to assess whether a drug specifically treats the behavioural phenotype under study without causing too many side effects. Unfortunately, ZOLTAR has to reduce the FramebyFrame fingerprint (17 parameters) to just six parameters to compare it to the behavioural dataset from Rihel et al., 2010, but here, more parameters would almost certainly translate into better predictions too, regardless of their intuitiveness.

      It is true however that we did not give much information on how some of the less intuitive parameters, such as activity slope or fractal dimension, are calculated or what they describe about the dataset (e.g. roughness/smoothness for fractal dimension). We added a few sentences in the legend of Fig. 2–supplement 1.

      (2) Because in the end the authors did not screen that many lines, it would increase confidence in the phenotypes to provide more validation of KO specificity. Some suggestions include:

      a. The authors cite a psen1 and psen2 germline mutant lines. Can these be tested in the FramebyFrame R analysis? Do they phenocopy F0 KO larvae?

      We unfortunately do not have those lines. We looked into importing a psen2 knockout line from abroad, but shipping live animals is becoming increasingly cost- and time-prohibitive. However, we observed the same pigmentation phenotype for psen2 knockouts as reported by Jiang et al., 2018, which at least partially confirms that the F0 knockouts phenocopy a stable loss-of-function mutant.

      b. psen2_KO is one of the larger centerpieces of the paper. The authors should present more compelling evidence that animals are truly functionally null. Without this, how do we interpret their phenotypes?

      We disagree that there should be significant doubt about these mutants being truly functionally null, given the high mutation rate and presence of the expected pigmentation phenotype (Jiang et al., 2018, Fig. 3f and Fig. 3–supplement 3a). The psen2 F0 knockouts were virtually 100% mutated at three exons across the gene (mutation rates were locus 1: 100 ± 0%; locus 2: 99.99 ± 0.06%; locus 3: 99.85 ± 0.24%). Additionally, two of the three mutated exons had particularly high rates of frameshift mutations (locus 1: 97 ± 5%; locus 2: 88 ± 17% frameshift mutation rate). It is virtually impossible that a functional protein is translated given this burden of frameshift mutations. Phenotypically, in addition to the pigmentation defect, double psen1/psen2 F0 knockout larvae had curved tails, the same phenotype as caused by a high dose of the γ-secretase inhibitor DAPT (Yang et al., 2008). These double F0 knockouts were lethal, while knockout of psen1 or psen2 alone did not cause obvious morphological defects. Evidently, most larvae must have been psen2 null mutants in this experiment, otherwise functional Psen2 would have prevented early lethality.

      Translation of zebrafish psen2 can start at downstream start codons if the first exon has a frameshift mutation, generating a seemingly functional Psen2 missing the N-terminus (Jiang et al., 2020). Zebrafish homozygous for this early frameshift mutation had normal pigmentation, showing it is a reliable marker of Psen2 function even when it is mutated. This mechanism is not a concern here as the alternative start codons are still upstream of two of the three mutated exons (the alternative start codons discovered by Jiang et al., 2020 are in exon 2 and 3, but we targeted exon 3, exon 4, and exon 6).

      We understand that the zebrafish community may be cautious about F0 phenotyping compared to stably generated mutants. As mentioned to Reviewer #2, we are planning to assemble a paper that expressly compares behavioural phenotypes measured in F0 vs. stable mutants to allay some of these concerns. Our current manuscript, which combines CRISPR-Cas9 rapid F0 screening with in silico pharmacological predictions, inevitably represents a first step in characterizing the functions of these genes.

      c. Related to the above, for cd2AP and sorl1 KO, some of the effect sizes seem to be driven by one clutch and not the other. In other words, great clutch-to-clutch variability. Should the authors increase the number of clutches assayed?

      Correct, there is substantial clutch-to-clutch variability in this behavioural assay. This is not specific to our experiments. Even within the same strain, wild-type larvae from different clutches (i.e. non-siblings) behave differently (Joo et al., 2021). This is why it is essential to compare behavioural phenotypes within individual clutches (i.e. from a single pair of parents, one male and one female), as we explain in Methods (section Behavioural video-tracking) and in the documentation of the FramebyFrame package. We often see two different experimental designs in literature: comparing non-sibling wild-type and mutant larvae, or pooling different clutches which include all genotypes (e.g. pooling multiple clutches from heterozygous in-crosses or pooling wild-type clutches before injecting them). The first experimental design causes false positive findings (Joo et al., 2021), as the clutch-to-clutch variability we and others observe gets interpreted as a behavioural phenotype. The second experimental design should not cause false positives but likely decreases the sensitivity of the assay by increasing the spread within genotypes. In both cases, the clutch-to-clutch variability is hidden, either by interpreting it as a phenotype (first case) or by adding it to animal-to-animal variability (second case). Our experimental design is technically more challenging as it requires obtaining large clutches from unique pairs of parents. However, this approach is better as it clearly separates the different sources of variability (clutch-to-clutch or animal-to-animal). As for every experiment, yes, a larger number of replicates would be better, but we do not plan to assay additional clutches at this time. Our work heavily focuses on the sorl1 and psen2 knockout behavioural phenotypes. The key aspects of these phenotypes were effectively tested in four experiments (five to six clutches) as sorl1 knockout larvae were also tracked in the citalopram and fluvoxamine experiments (Fig. 5 and Fig. 5–supplement 1), and psen2 knockout larvae were also tracked in the small molecule rescue experiment (Fig. 6 and Fig. 6–supplement 1).

      The psen2 behavioural phenotype replicated well across the six clutches tested (pairwise cosine similarities: 0.62 ± 0.15; Author response image 2a). 5/6 clutches were less active and initiating more sleep bouts during the day, as we claimed in Fig. 3.

      In the citalopram experiment, the H₂O-treated sorl1 knockout fingerprint replicated fairly well the baseline recordings in Fig. 4, despite the smaller sample size (cos = 0.30 and 0.78; Author response image 2b, see “KO Fig. 5”). 5/6 of the significant parameters presented in Fig. 4–supplement 4 moved in the same direction, and knockout larvae were also hypoactive during the day but hyperactive at night. Note that two clutches were tracked on the same 96-well plate in this experiment. We calculated each larva’s z-score using the average of its control siblings, then we averaged all the z-scores to generate the fingerprint. The H₂O-treated sorl1 knockout clutch from the fluvoxamine experiment did not replicate well the baseline recordings (cos = 0.08 and 0.11; Author response image 2b, see “KO Fig. 5–suppl. 1”). Knockout larvae were hypoactive during the day as expected, but behaviour at night was not as robustly affected. As mentioned above, knockouts were made in a different genetic background (TL, instead of AB x Tup LF used for all other experiments), which could explain the discrepancy.
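      The fingerprint calculation and the cosine comparison described above can be sketched as follows. This is a minimal illustration rather than the FramebyFrame code: the function names are hypothetical, and we assume here that each larva's z-score is scaled by the standard deviation of its same-clutch controls.

```python
import numpy as np


def larva_zscores(ko, ctrl):
    """Z-score each knockout larva against its same-clutch control siblings.

    ko, ctrl: arrays of shape (n_larvae, n_parameters) from ONE clutch.
    Assumption: scaling uses the sample SD (ddof=1) of the control siblings.
    """
    ko = np.asarray(ko, dtype=float)
    ctrl = np.asarray(ctrl, dtype=float)
    return (ko - ctrl.mean(axis=0)) / ctrl.std(axis=0, ddof=1)


def pooled_fingerprint(zscores_per_clutch):
    """Average z-scores over all larvae after normalising within each clutch."""
    return np.vstack(zscores_per_clutch).mean(axis=0)


def cosine_similarity(a, b):
    """Cosine similarity between two fingerprints (range -1 to 1)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

      Normalising within each clutch before pooling keeps clutch-to-clutch differences out of the fingerprint, so two clutches tracked on the same plate can be combined without one clutch's baseline shifting the other.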

      We also took the opportunity to check whether our SSRI treatments replicated well the data from Rihel et al., 2010. For both citalopram (n = 3 fingerprints in the database) and fluvoxamine (n = 4 fingerprints in the database), replication was excellent (cos ≥ 0.67 for all comparisons of a fingerprint from this study vs. a fingerprint from Rihel et al. 2010; Author response image 2c,d). Note that the scrambled + 10 µM citalopram and + 10 µM fluvoxamine fingerprints correlate extremely well (cos = 0.92; can be seen in Author response image 2c,d), which was predicted by the small molecule screen dataset.

      Author response image 2.

      Replication of psen2 and sorl1 F0 knockout fingerprints and SSRI treatments from Rihel et al., 2010. a, (left) Every psen2 F0 knockout behavioural fingerprint generated in this study. Each dot represents the mean deviation from the same-clutch scrambled-injected mean for that parameter (z-score, mean ± SEM). From the experiments in Fig. 6, the psen2 F0 knockout + H₂O fingerprints are presented. The fingerprints in grey (“not shown”) are from a preliminary drug treatment experiment we did not include in the final study. These fingerprints are from psen2 F0 knockout larvae treated with 0.2% DMSO, normalised to scrambled-injected siblings also treated with 0.2% DMSO. (right) Pairwise cosine similarities (−1.0–1.0) for the fingerprints presented. b, Every sorl1 F0 knockout behavioural fingerprint, as in a). c, The scrambled-injected + citalopram (10 µM) fingerprints (grey) in comparison to the citalopram (10–15 µM) fingerprints from the Rihel et al., 2010 database (green). d, The scrambled-injected + fluvoxamine (10 µM) fingerprint (grey) in comparison to the fluvoxamine fingerprints from the Rihel et al., 2010 database (pink). In c) and d), the scrambled-injected fingerprints are from the experiments in Fig. 5 and Fig. 5–suppl. 1, but were converted here into the behavioural parameters used by Rihel et al., 2010 for comparison. Parameters: 1, average activity (sec active/min); 2, average waking activity (sec active/min, excluding inactive minutes); 3, total sleep (hr); 4, number of sleep bouts; 5, sleep bout length (min); 6, sleep latency (min until first sleep bout).

      (3) The authors make the point that most of the AD risk genes are expressed in fish during development. Is there public data to comment on whether the genes of interest are expressed in mature/old fish as well? Just because the genes are expressed early does not at all mean that early-life dysfunction is related to future AD (though this could be the case, of course). Genes with exclusive developmental expression would be strong candidates for such an early-life role, however. I presume the case is made because sleep studies are mainly done in juvenile fish, but I think it is really a pretty minor point and such a strong claim does not even need to be made.

      This is a fair criticism but we do not make this claim (“early-life dysfunction is related to future AD”) from expression alone. The reviewer is probably referring to the following quote:

      “[…] most of these were expressed in the brain of 5–6-dpf zebrafish larvae, suggesting they play a role in early brain development or function,” which does not mention future risk of AD. We do suggest that these genes have a function in development. After all, every gene that plays a role in brain development must be expressed during development, so this wording seemed reasonable. Nevertheless, we adapted the wording to address this point and Reviewer #2’s complaint below. As noted, the primary goal was to check that the genes we selected were indeed expressed in zebrafish larvae before performing knockout experiments. Our discussion does raise the hypothesis that mutations in Alzheimer’s risk genes impact brain development and sleep early in life, but this argument primarily relies on our observation that knockout of late-onset Alzheimer’s risk genes causes sleep phenotypes in 7-day old zebrafish larvae and from previous work showing brain structural differences in children at high genetic risk of AD (Dean et al., 2014; Quiroz et al., 2015), not solely on gene expression early in life.

      Please also see our answer to a similar point raised by Reviewer #2 below (cf. Author response image 7).

      (4) A common quandary with defining sleep behaviorally is how to rectify sleep and activity changes that influence one another. With psen2 KOs, the authors describe reduced activity and increased sleep during the day. But how do we know if the reduced activity drives increased behavioral quiescence that is incorrectly defined as sleep? In instances where sleep is increased but activity during periods during wake are normal or elevated, this is not an issue. But here, the animals might very well be unhealthy, and less active, so naturally they stop moving more for prolonged periods, but the main conclusion is not sleep per se. This is an area where more experiments should be added if the authors do not wish to change/temper the conclusions they draw. Are psen2 KOs responsive to startling stimuli like controls when awake? Do they respond normally when quiescent? Great care must be taken in all models using inactivity as a proxy for sleep, and it can harm the field when there is no acknowledgment that overall health/activity changes could be a confound. Particularly worrisome is the betamethasone data in Figure 6, where activity and sleep are once again coordinately modified by the drug.

      This is a fair criticism. We agree it is a concern, especially in the case of psen2 as we claim that day-time sleep is increased while zebrafish are diurnal. We do not rely heavily on the day-time inactivity being sleep (the ZOLTAR predictions or the small molecule rescue do not change whether the parameter is called sleep or inactivity), but our choice of labelling can fairly be challenged.

      To address “are psen2 KO responsive to startling stimuli like controls when awake/when quiescent”, we looked at the larvae’s behaviour immediately after lights abruptly switched on in the mornings. Almost every larva, regardless of genotype, responded strongly to every lights-off transition during the experiment. Instead, we chose the lights-on transition for this analysis because it is a weaker startling stimulus for the larvae than the lights-off transition (Fig. 3–supplement 3), potentially exposing differences between genotypes or behavioural states (quiescent or awake). We defined a larva as having reacted to the lights switching on if it made a swimming bout during the second (25 frames) after the lights-on transition. Across two clutches and two lights-on transitions, an average of 65% (range 52–73%) of all larvae reacted to the stimulus. psen2 knockout larvae were similarly likely, if not more likely, to respond (on average 69% responded, range 60–76%) than controls (60% average, range 44–75%). When the lights switched on, about half of the larvae (39–51%) would have been classified as asleep according to the one-minute inactivity definition (i.e. the larva did not move in the minute preceding the lights transition). This allowed us to also compare behavioural states, as suggested by the reviewer. For three of the four light transitions, larvae which were awake when lights switched on were more likely to react than asleep larvae, but this difference was not striking (overall, awake larvae were only 1.1× more likely to react; Author response image 3). Awake psen2 knockout larvae were 1.1× (range 1.04–1.11×) more likely to react than awake control larvae, so, yes, psen2 knockout larvae respond normally when awake. Asleep psen2 knockout larvae were 1.4× (range 0.63–2.19×) more likely to react than asleep control larvae, so psen2 knockouts are also at least as likely to react as control larvae when asleep. In summary, the overall health of psen2 knockouts did not seem to be a significant confound in the experiment. As the reviewer suggested, if psen2 knockout larvae were seriously unhealthy, they would not be as responsive as control larvae to a startling stimulus.

      Author response image 3.

      psen2 F0 knockouts react normally to lights switching on, indicating they are largely healthy. At each lights-on transition (9 AM), each larva was categorised as awake if it had moved in the preceding one minute or asleep if it had been inactive for at least one minute. Darker tiles represent larvae which performed a swimming bout during the second following lights-on; lighter tiles represent larvae which did not move during that second. The total count of each waffle plot was normalised to 25 so plots can be compared to each other. The real count is indicated in the corner of each plot. Data is from the baseline psen2 knockout trackings presented in Fig. 3 and Fig. 3–suppl. 2.
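
      The classification described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function names, the per-frame activity lists, and the choice to count the lights-on frame as part of the following second are all assumptions. It takes one activity trace per larva (any value above zero meaning movement in that frame), labels the larva as awake or asleep from the preceding minute, and checks for movement in the 25 frames after lights-on.

```python
from collections import Counter

FPS = 25  # frames per second, as stated in the response


def classify_lights_on(activity, lights_on_frame, fps=FPS,
                       sleep_window_s=60, react_window_s=1):
    """Return (state, reacted) for one larva.

    activity: hypothetical list of per-frame activity values (> 0 = movement).
    state is "asleep" if no movement occurred in the preceding minute.
    reacted is True if any movement occurred in the following second.
    """
    start = max(0, lights_on_frame - sleep_window_s * fps)
    before = activity[start:lights_on_frame]
    after = activity[lights_on_frame:lights_on_frame + react_window_s * fps]
    state = "awake" if any(v > 0 for v in before) else "asleep"
    reacted = any(v > 0 for v in after)
    return state, reacted


def reaction_rates(traces, lights_on_frame):
    """Fraction of larvae that reacted, split by behavioural state."""
    totals, reacted = Counter(), Counter()
    for trace in traces:
        state, did_react = classify_lights_on(trace, lights_on_frame)
        totals[state] += 1
        reacted[state] += did_react
    return {state: reacted[state] / totals[state] for state in totals}
```

      Dividing the rate of one genotype by the rate of the other, within the same state, would give ratios of the kind quoted above (e.g. 1.1× for awake larvae).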

      Next, we compared inactive period durations during the day between psen2 and control larvae. If psen2 knockout larvae indeed sleep more during the day compared to controls, we may predict inactive periods longer than one minute to increase disproportionately compared to the increase in shorter inactive periods. This broadly appeared to be the case, especially for one of the two clutches (Author response image 4). In clutch 1, inactive periods lasting 1–60 sec were equally frequent in both psen2 and control larvae (fold change 1.0× during both days), while inactive periods lasting 1–2 min were 1.5× (day 1) and 2.5× (day 2) more frequent in psen2 larvae compared to control larvae. In clutch 2, 1–60 sec inactive periods were also equally frequent in both psen2 and control larvae, while inactive periods lasting 1–2 min were 3.4× (day 1) and 1.5× (day 2) more frequent in psen2 larvae compared to control larvae. Therefore, psen2 knockouts disproportionately increased the frequency of inactive periods longer than one minute, suggesting they genuinely slept more during the day.

      Author response image 4.

      psen2 F0 knockouts preferentially increased the frequency of longer inactive bouts. For each day and clutch, we calculated the mean distribution of inactive bout lengths across larvae of the same genotype (psen2 F0 knockout or scrambled-injected), then compared the frequency of inactive bouts of different lengths between the two genotypes. For example, in clutch 1 during day 2, 0.01% of the average scrambled-injected larva’s inactive bouts lasted 111–120 seconds (X axis 120 sec) while 0.05% of the average psen2 F0 knockout larva’s inactive bouts lasted this long, so the fold change was 5×. Inactive bouts lasting < 1 sec were excluded from the analysis. In clutch 2, day 1 plot, two datapoints fall outside the Y axis limit: 140 sec, Y = 32×; 170 sec, Y = 16×. Data is from the baseline psen2 knockout trackings presented in Fig. 3 and Fig. 3–suppl. 2.
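
      As a rough illustration of the fold-change calculation in this legend, the sketch below bins each larva’s inactive bout lengths, averages the per-larva distributions within a genotype, and divides knockout by control frequencies. The 10-second bin width is inferred from the “111–120 seconds (X axis 120 sec)” example; the function names and input format are hypothetical, not taken from the FramebyFrame package.

```python
import math
from collections import Counter


def bout_distribution(bout_lengths_s, bin_width_s=10, min_length_s=1.0):
    """Fraction of one larva's inactive bouts per bin.

    Bins are keyed by their upper bound in seconds (e.g. 120 for 111-120 s).
    Bouts shorter than min_length_s are excluded, as in the legend.
    """
    kept = [b for b in bout_lengths_s if b >= min_length_s]
    if not kept:
        return {}
    counts = Counter(math.ceil(b / bin_width_s) * bin_width_s for b in kept)
    return {upper: n / len(kept) for upper, n in counts.items()}


def mean_distribution(per_larva_bouts, **kwargs):
    """Average the per-larva distributions across larvae of one genotype."""
    dists = [bout_distribution(b, **kwargs) for b in per_larva_bouts]
    bins = set().union(*dists)
    return {k: sum(d.get(k, 0.0) for d in dists) / len(dists) for k in bins}


def fold_change(knockout_mean, control_mean):
    """Knockout / control frequency for every bin observed in controls."""
    return {k: knockout_mean[k] / control_mean[k]
            for k in knockout_mean if control_mean.get(k, 0) > 0}
```

      With the frequencies quoted in the legend (0.05% vs. 0.01% for the 120 sec bin), `fold_change({120: 0.0005}, {120: 0.0001})` gives a value of about 5 for that bin.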

      Ultimately, this criticism seems challenging to address definitively by experiment. A possible approach could be to use a closed-loop system which, after one minute of inactivity, triggers a stimulus that is sufficient to startle an awake larva but not an asleep larva. If psen2 knockout larvae indeed sleep more during the day, the stimulus should usually not be sufficient to startle them. Nevertheless, we believe the two analyses presented here are consistent with psen2 knockout larvae genuinely sleeping more during the day, so we decided to keep this label. We agree with the reviewer that the one-minute inactivity definition has limitations, especially for day-time inactivity.

      (5) The conclusions for the serotonin section are overstated. Behavioural pharmacology purports to predict a signaling pathway disrupted with sorl1 KO. But is it not just possible that the drug acts in parallel to the true disrupted pathway in these fish? There is no direct evidence for serotonin dysfunction - that conclusion is based on response to the drug. Moreover, it is just one drug - is the same phenotype present with another SSRI? Likewise, language should be toned down in the discussion, as this hypothesis is not "confirmed" by the results (consider "supported"). The lack of measured serotonin differences further raises concern that this is not the true pathway. This is another major point that deserves further experimental evidence, because without it, the entire approach (behavioral pharm screen) seems more shaky as a way to identify mechanisms. There are any number of testable hypotheses to pursue such as a) Using transient transgenesis to visualize 5HT neuron morphology (is development perturbed: cell number, neurite morphology, synapse formation); b) Using transgenic Ca reporters to assay 5HT neuron activity.

      Regarding the comment, “is it not just possible that the drug acts in parallel to the true disrupted pathway”, we think not, assuming we understand the question correctly. Key to our argument is the fact that sorl1 knockout larvae react differently to the drug(s) than control larvae. As an example, take night-time sleep bout length, which was not affected by knockout of sorl1 (Fig. 4–supplement 4). For the sake of argument, say only dopamine signalling (the “true disrupted pathway”) was affected in sorl1 knockouts and that serotonin signalling was intact. Assuming that citalopram specifically alters serotonin signalling, treatment should cause the same increase in sleep bout length in both knockouts and controls as serotonin signalling is intact in both. This is not what we see, however. Citalopram caused a greater increase in sleep bout length in sorl1 knockouts than in scrambled-injected larvae. In other words, the effect is non-additive, in the sense that citalopram did not add the same number of z-scores to sorl1 knockouts and controls. We think this shows that serotonin signalling is somehow different in sorl1 knockouts. Nonetheless, we concede that the experiment does not necessarily say much about the importance of the serotonin disruption caused by loss of Sorl1. It could be, for example, that the most salient consequence of loss of Sorl1 is cholinergic disruption (see reply to Reviewer #1 above) and that serotonin signalling is a minor theme.

      Furthermore, we agree with the reviewer and Reviewer #2 that the conclusions were overly confident. As suggested, we decided to repeat this experiment with another SSRI, fluvoxamine. Please find the results of this experiment in Fig. 5–supplement 1. The suggestions to further test the serotonin system in the sorl1 knockouts are excellent as well; however, we do not plan to pursue them at this stage.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Major Comments:

      - Data are presented in a variety of different ways, occasionally making comparisons across figures difficult. Perhaps at a minimum, behavioral fingerprints as in Figure 3 - Supplementary Figure 1 should be presented for all mutants in the main figures.

      We like this suggestion! Thank you. We brought the behavioural fingerprints figure (previously Fig. 4–supplement 5) as main Fig. 4, and put the figure focused on the sorl1 knockout behavioural phenotype in supplementary, with the other gene-by-gene figures.

      - It is not clear why some data were selected for supplemental rather than main figures. In many cases, detailed phenotypic data is provided for one example mutant in the main figures, and then additional mutants are described in detail in the supplement. Again, to facilitate comparisons between mutants, fingerprints could be provided for all mutants in a main figure, with detailed analyses moved to the supplements.

      The logic was to dedicate one main figure to psen2 (Fig. 3) as an example of an early-onset Alzheimer’s risk gene, and one to sorl1 (previously Fig. 4) as an example of a late-onset Alzheimer’s risk gene. We focused on them in main figures as they are both tested again later (Fig. 5 and Fig. 6). Having said that, we agree that the fingerprints may be a better use of main figure space than the parameters plots. In addition to the above (fingerprints of late-onset Alzheimer’s risk genes in main figure), we rearranged the figures in the early-onset AD section to have the psen2 F0 knockout fingerprint in main.

      - The explication of the utility of behavioral fingerprinting on page 35 is somewhat confusing. The authors describe drugs used to treat depression as enriched among small molecules anti-correlating with the sorl1 fingerprint. However, in Figure 5 - Supplementary Figure 1, drugs used to treat depression are biased toward positive cosines, which are indicated as having a more similar fingerprint to sorl1. These drugs should be described as more present among compounds positively correlating with the sorl1 fingerprint.

      Sorry, the confusion is about “(anti-)correlating”. Precisely, we meant “correlating and/or anti-correlating”, not just anti-correlating. We changed to that wording. In short, the analysis is by design agnostic to whether compounds with a given annotation are found more on the positive cosines side (left side in Fig. 5–supplement 1a) or the negative cosines side (right side). This is because the dataset often includes both agonists and antagonists to a given pathway but these are difficult to annotate. For example, say 10 compounds in the dataset target the dopamine D4 receptor, but these are an unknown mix of agonists and antagonists. In this case, we want ZOLTAR to generate a low p-value when all 10 compounds are found at extreme ends of the list, regardless of which end(s) that is (e.g. top 8 and bottom 2 should give an extremely low p-value). Initially, we were splitting the list, for each annotation, into positive-cosine fingerprints and negative-cosine fingerprints and testing enrichment on both separately, but we think the current approach is better as it better reflects the cases we want to detect and considers all available examples for a given annotation in one test. In sum, yes, in this case drugs used to treat depression were mostly on the positive-cosine side, but the other drugs on the negative-cosine side also contributed to the p-value, so “correlating and/or anti-correlating” describes the analysis more accurately. You can read more about our logic for the analysis in Methods (section Behavioural pharmacology from sorl1 F0 knockout’s fingerprint).

      - The authors conclude the above-described section by stating: "sorl1 knockout larvae behaved similarly to larvae treated with small molecules targeting serotonin signaling, suggesting that the loss of Sorl1 disrupted serotonin signaling." Directionality here may be important. Are all of the drugs targeting the serotonin transporter SSRIs or similar? If so, then a correct statement would be that loss of Sorl1 causes similar phenotypes to drugs enhancing serotonin signaling. Finally, based on the correlation between serotonin transporter inhibitor trazodone and the sorl1 crispant phenotype, it is potentially surprising that the SSRI citalopram caused the opposite phenotype from sorl1, that is, increased sleep during the day and night. It is potentially interesting that this result was enhanced in mutants, and suggests dysfunction of serotonin signaling, but the statement that "our behavioral pharmacology approach correctly predicted from behaviour alone that serotonin signaling was disrupted" is too strong a conclusion.

      We understand “disrupt” as potentially going either way, but this may not be the common usage. We changed to “altered”.

      The point regarding directionality is excellent, however. We tested the proportion of serotonin transporter agonists and antagonists (SSRIs) on each side of the ranked list of small molecule fingerprints. We used the STITCH database for this analysis as it has more drug–target interactions, though they are likely less curated, than the Therapeutic Target Database (Szklarczyk et al., 2016). As with the Therapeutic Target Database, most fingerprints of compounds interacting with the serotonin transporter SLC6A4 were found on the side of positive cosines (p ~ 0.005 using the custom permutation test), which replicates Fig. 5a with a different source for the drug–target annotations (Author response image 5). On the side of positive cosines (small molecules which generate behavioural fingerprints correlating with the sorl1 fingerprint), there were 2 agonists and 26 antagonists. On the side of negative cosines (small molecules which generate behavioural fingerprints anti-correlating with the sorl1 fingerprint), there were 3 agonists and 2 antagonists. Using a Chi-squared test, this suggests a significant (p = 0.002) over-representation of antagonists (SSRIs) on the positive side (expected count = 24, vs. 26 observed) and agonists on the negative side (expected count = 1, vs. 3 observed). If SLC6A4 antagonists, i.e. SSRIs, indeed tend to cause a behavioural phenotype similar to that of sorl1 knockout, this would point in the direction of our original interpretation of the citalopram experiment, which was that excessive serotonin signalling is what causes the sorl1 behavioural phenotype.
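
      For readers who wish to check the arithmetic, the counts above can be arranged as a 2×2 table (rows: positive-cosine and negative-cosine side; columns: agonists and antagonists) and passed through a Pearson Chi-squared test. The sketch below uses only the Python standard library and assumes no continuity correction; the response does not state the exact settings used, so this is an approximation of the test rather than a record of it.

```python
import math


def chi_squared_2x2(table):
    """Pearson Chi-squared test (1 degree of freedom, no continuity
    correction) on a 2x2 table of counts.

    Returns (statistic, p_value, expected_counts).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    expected = [[row_totals[i] * col_totals[j] / n for j in range(2)]
                for i in range(2)]
    stat = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))
    # Survival function of the Chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value, expected


# Rows: positive cosines, negative cosines. Columns: agonists, antagonists.
counts = [[2, 26],
          [3, 2]]
stat, p_value, expected = chi_squared_2x2(counts)
```

      Under these assumptions, the expected counts round to 24 antagonists on the positive side and 1 agonist on the negative side, and the p-value falls between 0.002 and 0.003, in line with the figures reported above. With expected counts this small, an exact test may be a more cautious choice.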

      Author response image 5.

      Using the STITCH database as source of annotations also predicts SLC6A4 as an enriched target for the sorl1 behavioural fingerprint. Same figures as Fig. 5a,b but using the STITCH database (Szklarczyk et al., 2016) as source for the drug targets. a, Compounds annotated by STITCH as interacting with the serotonin transporter SLC6A4 tend to generate behavioural phenotypes similar to the sorl1 F0 knockout fingerprint. 40,522 compound–target protein pairs (vertical bars; 1,592 unique compounds) are ranked from the fingerprint with the most positive cosine to the fingerprint with the most negative cosine in comparison with the mean sorl1 F0 knockout fingerprint. Fingerprints of drugs that interact with SLC6A4 are coloured in yellow. Simulated p-value = 0.005 for enrichment of drugs interacting with SLC6A4 at the top (positive cosine) and/or bottom (negative cosine) of the ranked list by a custom permutation test. b, Result of the permutation test for top and/or bottom enrichment of drugs interacting with SLC6A4 in the ranked list. The absolute cosines of the fingerprints of drugs interacting with SLC6A4 (n = 52, one fingerprint per compound) were summed, giving sum of cosines = 15.9. To simulate a null distribution, 52 fingerprints were randomly drawn 100,000 times, generating a distribution of 100,000 random sums of cosines. Here, only 499 random draws gave a larger sum of cosines, so the simulated p-value was p = 499/100,000 = 0.005 **.
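
      The permutation test described in panel b can be sketched as follows. This is a minimal illustration, not the ZOLTAR implementation: the function name and arguments are hypothetical, and drawing fingerprints without replacement from the full list is an assumption. Summing absolute cosines is what makes the test indifferent to whether annotated compounds sit at the top or the bottom of the ranked list.

```python
import random


def permutation_enrichment(cosines, is_annotated, n_draws=100_000, seed=0):
    """Simulated p-value for top and/or bottom enrichment in a ranked list.

    cosines: one cosine per fingerprint (vs. the knockout fingerprint).
    is_annotated: matching booleans, True for the annotation of interest.
    Returns (observed_sum, p_value), where p_value is the fraction of random
    draws whose sum of absolute cosines exceeds the observed sum.
    """
    rng = random.Random(seed)
    abs_cosines = [abs(c) for c in cosines]
    n_annotated = sum(is_annotated)
    observed = sum(a for a, flag in zip(abs_cosines, is_annotated) if flag)
    larger = 0
    for _ in range(n_draws):
        if sum(rng.sample(abs_cosines, n_annotated)) > observed:
            larger += 1
    return observed, larger / n_draws
```

      In the legend’s example, 499 of 100,000 draws exceeded the observed sum of 15.9, which corresponds to the reported simulated p-value of 0.005.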

      If this were true, we would expect, as the reviewer suggested, SSRI treatment (citalopram or fluvoxamine) on control larvae to give a similar behavioural phenotype as knockout of sorl1. However, this generally did not appear to be the case (sorl1 knockout fingerprint vs. SSRI-treated control fingerprint, cosine = 0.08 ± 0.35; Author response image 6).

      Author response image 6.

      sorl1 F0 knockouts in comparison to controls treated with SSRIs. a, sorl1 F0 knockout fingerprints (baseline recordings and sorl1 + H<sub>2</sub>O fingerprint from the citalopram experiment) in comparison with the scrambled-injected + citalopram (1 or 10 µM) fingerprints. Each dot represents the mean deviation from the same-clutch scrambled-injected H<sub>2</sub>O-treated mean for that parameter (z-score, mean ± SEM). b, As in a), sorl1 F0 knockout fingerprints (baseline recordings and sorl1 + H<sub>2</sub>O fingerprint from the fluvoxamine experiment) in comparison with the scrambled-injected + fluvoxamine (10 µM) fingerprint.

      The comparison with trazodone is an interesting observation, but it is only a weak serotonin reuptake inhibitor (Ki for SLC6A4 = 690 nM, vs. 8.9 nM for citalopram; Owens et al., 1997) and it has many other targets, as agonist or antagonist, including serotonin, adrenergic, and histamine receptors (Mijur, 2011). In any case, the average trazodone fingerprint does not correlate particularly well with the sorl1 knockout fingerprint (cos = 0.3). Finally, the sorl1 knockout behavioural phenotype could be primarily caused by altered serotonin signalling in the hypothalamus, where we found both the biggest difference in tph1a/1b/2 HCR signal intensity (Fig. 5f) and the highest expression of sorl1 across scRNA-seq clusters (Fig. 1–supplement 2). In this case, it would be correct to expect sorl1 knockouts to react differently to SSRIs than controls, but it would be incorrect to expect SSRI treatment to cause the same behavioural phenotype, as it concurrently affects every other serotonergic neuron in the brain.

      Finally, we agree the quoted conclusion was too strong given the current evidence. We since tested another SSRI, fluvoxamine, on sorl1 knockouts.

      - Also in reference to Figure 5: in panel c, data are presented as deviation from vehicle treated. Because of this data presentation choice, it's no longer possible to determine whether, in this experiment, sorl1 crispants sleep less at night relative to their siblings. Does citalopram rescue / reverse sleep deficits in sorl1 mutants?

      On your first point, please see our response to Reviewer #3 (2)c and Author Response 2b above.

      On “does citalopram rescue/reverse sleep deficits in sorl1 mutants”: citalopram (and fluvoxamine) tends to reverse the key aspects of the sorl1 knockout behavioural phenotype by reducing night-time activity (% time active and total Δ pixels), increasing night-time sleep, and shortening sleep latency (Author response image 7). Extrapolating from the hypothesis presented in Discussion, this may be interpreted as a hint that sorl1 knockouts have reduced levels of 5-HT receptors, as increasing serotonin signalling using an SSRI tends to rescue the phenotype. However, we do not think that focusing on the significant behavioural parameters necessarily makes sense here. Rather, one should take all parameters into account to conclude whether knockouts react differently to the drug than wild types (also see answer to Reviewer #3, (7) on this). For example, citalopram increased the night-time sleep bout length of sorl1 knockouts more than that of controls (Fig. 5), but this parameter was not modified by knockout of sorl1 (Fig. 4). To explain the rationale more informally, citalopram is only used as a tool here to probe serotonin signalling in sorl1 knockouts; whether it worsens or rescues the behavioural phenotype is somewhat secondary. The key question is whether knockouts react differently than controls.

      Author response image 7.

      Comparing untreated sorl1 F0 knockouts vs. treated with SSRIs. a, sorl1 F0 knockout fingerprints (baseline recordings and sorl1 + H<sub>2</sub>O fingerprint from the citalopram experiment) in comparison with the sorl1 knockout + citalopram (1 or 10 µM) fingerprints. Each dot represents the mean deviation from the same-clutch scrambled-injected H<sub>2</sub>O-treated mean for that parameter (z-score, mean ± SEM). b, As in a), sorl1 F0 knockout fingerprints (baseline recordings and sorl1 + H<sub>2</sub>O fingerprint from the fluvoxamine experiment) in comparison with the sorl1 + fluvoxamine (10 µM) fingerprint.

      - Possible molecular pathways targeted by tinidazole, fenoprofen, and betamethasone are not described.

      Tinidazole is an antibiotic, fenoprofen is a non-steroidal anti-inflammatory drug (NSAID), and betamethasone is a steroidal anti-inflammatory drug. Interestingly, long-term use of NSAIDs reduces the risk of AD (in ’t Veld Bas A. et al., 2001). Several mechanisms are possible (Weggen et al., 2007), including reduction of Aβ42 production by interacting with γ-secretase (Eriksen et al., 2003). However, we did not explore the mechanism of action of these drugs on psen2 knockouts so do not feel comfortable speculating. We do not know, for example, whether these findings apply to betamethasone.

      Minor Comments:

      - On page 25, panel "g" should be labeled as "f".

      Thank you!

      - On page 35, a reference should be provided for the statement "From genomic studies of AD, we know that mutations in genes such as SORL1 modify risk by disrupting some biological processes.".

      Thank you, this is now corrected. These were the same studies as those mentioned in the Introduction.

      - On page 43, the word "and" should be added - "in wild-type rats and mice, overexpressing mutated human APP and PSEN1, AND restricting sleep for 21 days...".

      Right, this sentence could be misread, so we edited it. “overexpressing […]” only applied to the mice, not the rats (as they are wild-type); and both are sleep-deprived.

      - On page 45, a reference should be provided for the statement "SSRIs can generally be used continuously with no adverse effects" and this statement should potentially be softened.

      The reference is at the end of that sentence (Cirrito et al., 2011). You are correct though; we reformulated this statement to: “SSRIs can generally be used safely for many years”. SSRIs indeed have side effects.

      - On page 54, a 60-minute rolling average is described as 45k rows, but this seems to be a 30-minute rolling average.

      Thank you! We corrected this. It should have been 90k rows, as in: 25 frames-per-second × 60 seconds × 60 minutes.

      Reviewer #2 (Recommendations For The Authors):

      "As we observed in the scRNA-seq data, most genes tested (appa, appb, psen1, psen2, apoea, cd2ap, sorl1) were broadly expressed throughout the 6-dpf brain (Fig. 1d and Fig. 1–supplement 3 and 4)."

      - apoea and appb are actually not expressed highly in the scRNA-seq data, and the apoea in situ looks odd, as if it has no expression. The appb gene mysteriously does not look as though it has high expression in the Raj data, but it is clearly expressed based on the in situ. I had previously noticed the same discrepancy, and I attribute it to the transcriptome used to map the Raj data, as the new DanioCell data uses a new transcriptome and indicates high appb expression in the brain. Please point out the discrepancy and possible explanation, perhaps in the figure legend.

      All excellent points, thank you. We included them directly in Results text.

      "most of these were expressed in the brain of 5-6-dpf zebrafish larvae, suggesting they play a role in early brain development or function."

      - Evidence of expression does not suggest function, particularly not a function in brain development. As one example, almost half of the genome is expressed prior to the maternal-zygotic transition but does not have a function in those earliest stages of development. There are numerous other instances where expression does not equal function. Please change the sentence even as simply as "it is possible that they".

      We mostly agree and edited to “[…], so they could play a role […]”.

      Out of curiosity, we plotted, for each zebrafish developmental stage, the proportion of Alzheimer’s risk gene orthologues expressed in comparison to the proportion of all genes expressed (Author response image 8). We defined “all genes” as every gene that is expressed in at least one of the developmental stages (n = 24,856), not the complete transcriptome, to avoid including genes that are never expressed in the brain or whose expression is always below detection limit. We counted a gene as “expressed” if at least three cells had detectable transcripts. Using these definitions, 82 ± 7% of genes are expressed during development. For every developmental stage except 5 dpf (so 11/12), a larger proportion of Alzheimer’s risk genes than all genes are expressed (+5 ± 4%).

      Author response image 8.

      Proportion of Alzheimer’s risk genes orthologues expressed throughout zebrafish development. Proportion of Alzheimer’s risk genes orthologues (n = 42) and all genes (n = 24,856) expressed in the zebrafish brain at each developmental stage, from 12 hours post-fertilisation (hpf) to 15 days post-fertilisation (dpf). “All genes” corresponds to every gene expressed in the brain at any of the developmental stages, not the complete transcriptome. A gene is considered “expressed” (green) if at least three cells had detectable transcripts. Single-cell RNA-seq dataset from Raj et al., 2020.
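
      The definitions used for this plot can be written down compactly. The sketch below is illustrative only: the input format (for each stage, a mapping from gene to the number of cells with detectable transcripts) and the function names are assumptions, as is the choice to divide by the full list of risk gene orthologues even if some are never detected.

```python
MIN_CELLS = 3  # a gene counts as "expressed" if at least three cells detect it


def expressed_genes(cells_per_gene, min_cells=MIN_CELLS):
    """Set of genes detected in at least min_cells cells at one stage."""
    return {gene for gene, n in cells_per_gene.items() if n >= min_cells}


def proportions_by_stage(stages, risk_genes, min_cells=MIN_CELLS):
    """For each stage, return (proportion of risk genes expressed,
    proportion of all genes expressed).

    "All genes" means every gene expressed in at least one stage,
    not the complete transcriptome.
    """
    expressed = {stage: expressed_genes(cells, min_cells)
                 for stage, cells in stages.items()}
    all_genes = set().union(*expressed.values())
    risk = set(risk_genes)
    if not all_genes or not risk:
        raise ValueError("need at least one expressed gene and one risk gene")
    return {stage: (len(genes & risk) / len(risk),
                    len(genes) / len(all_genes))
            for stage, genes in expressed.items()}
```

      Subtracting the second value from the first at each stage would give per-stage differences of the kind summarised above as +5 ± 4%.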

      "This frame-by-frame analysis has several advantages over previous methods that analysed activity data at the one-minute resolution."

      - Which methods are these? There are no citations. There are certainly existing methods in the zebrafish field that can produce similar data to the method developed for this project. This new package is useful, as most existing software is not written in R, so it would help scientists who prefer this programming language. However, I would be careful not to oversell its novelty, since many methods do exist that produce similar results.

      We added the references. They were referenced above after “we combined previous sleep/wake analysis methods”, but should have been referenced again here.

      We are not convinced by this criticism. We would obviously not claim that the FramebyFrame package is as sophisticated and versatile as video-tracking tools like SLEAP or DeepLabCut, but we do think it answers a genuine need that was not addressed by other methods. Specifically, we know of many labs recording pixel count data across multiple days using the Zebrabox or DanioVision (we added support for DanioVision data after submission), but there were no packages to extract behavioural parameters from these data. Other methods involved standalone scripts with no documentation or version tracking. We would concede the FramebyFrame package is mostly targeted at these labs, but we already know of six labs routinely using it and were recently contacted by a researcher tracking Daphnia in the Zebrabox.

      "F0 knockouts of both cutches" - "clutches"

      Thank you!

      Reviewer #3 (Recommendations For The Authors):

      I would suggest totally revamping the Introduction section, and being sure to provide readers with the context and background they need for the data that comes thereafter. Key areas to touch on, in no particular order, include:

      • Far more detail on the behavioral pharm screen upon which this paper builds, as a brief overview of that approach and the data generated are needed.

      Thank you for the suggestion, we added a sentence hinting at this work in the last Introduction paragraph.

      • Limitations of current zebrafish sleep/arousal assays that motivated the authors to develop a new, temporally high-resolution system.

      We think this is better explained in Results, as is currently. For example, we need to point to Fig. 2–supplement 2a,b,c to explain that one-minute methods were missing sleep bouts and how FramebyFrame resolves this issue.

      • A paragraph about sleep and AD, that does a better job of citing work in humans, mammalian, and invertebrate models that motivate the interest in the connection pursued here.

      Sorry, we think this would place too much focus on sleep and AD. We want the main topic of the paper to be the behavioural pharmacology approach, not AD or sleep per se. As the Introduction states, we see Alzheimer’s risk genes as a case study for the behavioural pharmacology approach, rather than the reason why the approach was developed. Additionally, presenting sleep and AD in Introduction risks sounding like ZOLTAR is specifically designed for this context, while we conceived of it as much more generalisable and explicitly encourage its use to study genes associated to other diseases. Note that the paragraph you suggest is, we think, mostly present in Discussion (section Disrupted sleep and serotonin signalling […]).

      • I modestly suggest eliminating making such a strong case for a gene-first approach being the best way to understand disease. It is not a zero-sum game, and there is plenty to learn from proteomics, metabolomics, etc. I suspect nobody will argue with the authors saying they leveraged the strength of their system and focused on key AD genes of interest.

      From your point below, we understand the following quote is the source of the issue: “For finding causal processes, studying the genome, rather than the transcriptome or epigenome, is advantageous because the chronology from genomic variant to disease is unambiguous […]”. We did not want to suggest it is a zero-sum game, but we now understand how it can be read this way. We slightly adapted the wording. What we want to do is highlight the causality argument as the advantage of the genomics approach. We feel we do not read this argument often enough, while it remains a ‘magic power’ of genomics. One essentially does not have to worry about causality when studying a pathogenic germline variant, while it is a constant concern when studying the transcriptome or epigenome (i.e. did the change in this transcript’s level cause disease, or vice-versa?). To take an example in the context of AD, arguments based on genomics (e.g. Down syndrome or APP duplication) are often the definitive arbiters when debating the amyloid hypothesis, exactly because their causality cannot be doubted.

      Minor comments

      (1) The opening of the introduction is perhaps overly broad, spending an entire paragraph on genome vs transcriptome, etc and making the claim that a gene-first approach is the best path. It isn't zero-sum, and the authors could just get right into AD and study genes of interest. Similar issues occur throughout the manuscript, with sentences/paragraphs that are not necessarily needed.

      Please see our answer to your previous point. On the introduction being overly broad, we perfectly agree it is broad, but related to your point about presenting sleep and AD in the Introduction, we wish to talk about finding causal processes from genomics findings using behavioural pharmacology. We purposefully present research on AD as one instance of this broader goal, not the primary topic of the paper.

      Another example are these sentences, which could be totally removed as the following paragraph starts off making the same point much more succinctly. "From genomic studies of AD, we know that mutations in genes such as SORL1 modify risk by disrupting some biological processes. Presumably, the same processes are disrupted in zebrafish sorl1 knockouts, and some caused the behavioural alterations we observed. Can we now follow the thread backwards and predict some of the biological processes in which Sorl1 is involved based on the behavioural profile of sorl1 knockouts?"

      Thanks for the suggestion, but we think these sentences are useful to place this Results section back in the context of the Introduction. Think of the paper as mainly about the behavioural pharmacology approach, not about Alzheimer’s risk genes. The function of the paragraph here is not simply to explain the method by which we decided to study sorl1; it is to reiterate the rationale behind the behavioural pharmacology approach so that the reader understands where this Results section fits in the overall structure.

      (2) Related to the above, the authors use lecanemab as an example to support their approach, but there has been a great deal of controversy regarding this drug. I don't think such extensive justification is needed. This study uses AD risk genes as a case study in a newly developed behavioral pharm pipeline. A great deal of the rest of the intro seems to just fill space and could be more focused on the study at hand. Interestingly, after gene selection, the next step in their pipeline is sleep/wake analysis yet nothing is covered about AD and sleep in the intro. Some justification of that approach (why focus on sleep/wake as a starting point for behavioral pharm rather than learning and memory?) would be a better use of intro space.

      There has indeed been controversy about lecanemab, but even the harshest critiques of the amyloid hypothesis concede that it slows down cognitive decline (Espay et al., 2023). That is all that is needed to support our argument, which is that research on AD started primarily from genomics and thereby yielded a disease-modifying drug. The controversy seems mostly focused on whether this effect size is clinically significant, and we think we correctly represent this uncertainty (e.g. “antibodies against Aβ such as lecanemab show promise in slowing down disease progression” and “the beneficial effects from targeting Aβ aggregation currently remain modest”).

      Your next point is entirely fair. We mostly answered it above. To explain further, the primary reason why we measured sleep/wake behaviour is to match the behavioural dataset from Rihel et al., 2010 so we can use it to make predictions, not to study sleep in the context of AD per se. Sure, perhaps learning and memory would have been interesting, but we do not know of any study testing thousands of small molecules on zebrafish larvae during a memory task. We understand it can be slightly confusing though, as we then spend a paragraph of Discussion on sleep as a causal process in AD, but we obviously need to discuss this topic given the findings. However, to reiterate, we purposefully designed FramebyFrame and ZOLTAR to be useful beyond studying sleep/wake behaviour. For example, FramebyFrame would not calculate 17 behavioural parameters if the only goal was to measure sleep. We now mention the Rihel et al., 2010 study in the Introduction as you suggested above (“Far more detail on the behavioral pharm screen […]”), as that is the real reason why sleep/wake behaviour was measured in the first place.

      (3) Also related to the above, another more relevant point that could be talked about in the intro is the need for more refined approaches to analyze sleep in zebrafish, given the effort that went into the new analysis system described here. Again, I think the context for why the authors developed this system would be more meaningful than the current content.

      Thank you, we think we answered this point above (especially below Limitations of current zebrafish sleep/arousal assays […]).

      (4) GWAS can stand for genome-wide association studies (plural), so I do not think the extra "s" is needed (GWASs).

      Indeed, that seems to be the common usage. Thank you.

      (5) AD candidate risk genes were determined from loci using "mainly statistic colocalization". Can the authors add a few more details about what was done and what the "mainly" caveat refers to?

      “Mainly” simply refers to the fact that other methods were used by Schwartzentruber et al. (2021) to annotate the GWAS loci with likely causal genes, but that most calls were ultimately made from statistic colocalisation. Readers can refer to this work to learn more about the methods used.

      (6) The authors write "The loss of psen1 only had mild effects on behaviour" but I think they mean "sleep behaviors" as there could be many other behaviors that are disrupted but were not assessed. The same issue a few sentences later with "Behaviour during the day was not affected" and at the end of the following paragraph.

      Yes, that would be more precise, thank you.

      (7) For the Sorl1 pharmacology data, it is very hard to understand what is being measured behaviorally. Are the authors measuring sleep +/- citalopram, or something else, and why the change to Euclidean distance rather than all the measures we were just introduced to earlier in the manuscript?

      We understand these plots (Fig. 5c,d) are less intuitive, but it is important that we show the difference in behaviour compared to H₂O-treated larvae of the same genotype. The claim is that citalopram has a larger effect on knockouts than on controls, so the reader needs to focus on the effect of the drug on each genotype, not on the effect of sorl1 knockout. We added the standard fingerprints (i.e. setting controls to z-score = 0) here in Author response figures.

      Euclidean distance takes as input all the measures we introduced. The point is precisely not to select a single measure. For example, say we were only plotting active bout number during the day, we would conclude that 10 µM citalopram has the same effect on knockouts and controls. Conversely, if we had taken sleep bout length at night, we would conclude 10 µM has a stronger effect on knockouts. What is the correct parameter to select? Using Euclidean distance resolves this by taking all parameters into account, rather than arbitrarily choosing one.
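
      To illustrate why a distance over all parameters can separate two conditions that a single parameter cannot, here is a minimal Python sketch. It is not taken from FramebyFrame; the function name `fingerprint_distance` and all z-scores are hypothetical, chosen so that the first parameter shows an identical drug effect in both genotypes while the other parameters differ.

```python
import math


def fingerprint_distance(treated, baseline):
    """Euclidean distance between two behavioural fingerprints.

    Each fingerprint is a sequence of z-scores, one per behavioural parameter,
    listed in the same order for both conditions.
    """
    if len(treated) != len(baseline):
        raise ValueError("fingerprints must cover the same parameters")
    return math.sqrt(sum((t - b) ** 2 for t, b in zip(treated, baseline)))


# Hypothetical z-scores for three parameters (illustrative values only).
baseline = [0.0, 0.0, 0.0]        # H2O-treated larvae of the same genotype
knockout_drug = [-1.0, 2.0, 2.0]  # drug-treated knockouts
control_drug = [-1.0, 0.5, 0.5]   # drug-treated controls

distance_knockout = fingerprint_distance(knockout_drug, baseline)
distance_control = fingerprint_distance(control_drug, baseline)
```

      In this toy case, a reader plotting only the first parameter would see the same drug effect in both genotypes, whereas the distance over all parameters is larger for the knockouts.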

      And what exactly is a "given spike in serotonin"? and how is this hypothesis the conclusion based on the lack of evidence for the second hypothesis? As the authors say, there could be other ways sorl1 knockouts are more sensitive to citalopram, so the absence of evidence for one hypothesis certainly does not support the other hypothesis.

      We mean a given release of serotonin in the synaptic cleft. We have fixed this wording. 

      We tend to disagree on the second point. We can think of two ways that sorl1 knockouts are more sensitive to citalopram: 1) they produce more serotonin, so blocking reuptake causes a larger spike in knockouts; or 2) blocking reuptake causes the same increase in both knockouts and wild-types but knockouts react more strongly to serotonin. We cannot in fact think of another way to explain the citalopram results. Not finding overwhelming evidence for 1) surely supports 2) somewhat, even if we do not have direct evidence for it. As an analogy, if two diagnoses are possible for a patient, testing negative for the first one supports the other one, even before it is directly tested.

      (8) Again some language is used without enough care. Fish are referred to as "drowsier" under some drug conditions. How do the authors know the animal is drowsy? The phenotype is more specific - more sleep, less activity.

      Thank you, we switched to “Furthermore, fenoprofen worsened the day-time hypoactivity of psen2 knockout larvae […]”.

      (9) This sentence is misleading as it gives the impression that results in this manuscript suggest the conclusion: "Our observation that disruption of genes associated with AD diagnosis after 65 years reduces sleep in 7-day zebrafish larvae suggest that disrupted sleep may be a common mechanism through which these genes exert an effect on risk." That idea is widely held in the field, and numerous other previous manuscripts/reviews should be cited for clarity of where this hypothesis came from.

      This idea is not widely held in the field. You likely read this point as “disrupted sleep is a risk factor for AD”, which, yes, is widely discussed in the field, but is not precisely what we are saying. We hypothesise that mutations in some of the Alzheimer’s risk genes cause disrupted sleep, possibly from a very early age, which then causes AD decades later. Studies and reviews on sleep and AD rarely make this hypothesis, at least not explicitly. The closest we know of are a few recent human genetics studies, typically using Mendelian Randomisation, finding that higher genetic risk of AD correlates with some sleep phenotypes, such as sleep duration (Chen et al., 2022; Leng et al., 2021). The work of Muto et al. (2021) is particularly interesting as it found correlations between higher genetic risk of AD and some sleep phenotypes in men in their early twenties, which seems unlikely to be a consequence of early pathology. Note, however, that even these studies do not mention sleep possibly being disrupted early in development, which is what our findings in zebrafish larvae support. As we mention, we think a team should test whether sleep is different in infants at higher genetic risk of AD, essentially performing an experiment analogous to, but obviously much more difficult than, the one we did in zebrafish larvae. We do not know of any study testing this or even raising this idea, so evidently it is not widely held. Having said that, the studies we mention here were not referenced in the Discussion paragraph. We have now corrected this.

      Ashlin TG, Blunsom NJ, Ghosh M, Cockcroft S, Rihel J. 2018. Pitpnc1a Regulates Zebrafish Sleep and Wake Behavior through Modulation of Insulin-like Growth Factor Signaling. Cell Rep 24:1389–1396. doi:10.1016/j.celrep.2018.07.012

      Chen D, Wang X, Huang T, Jia J. 2022. Sleep and Late-Onset Alzheimer’s Disease: Shared Genetic Risk Factors, Drug Targets, Molecular Mechanisms, and Causal Effects. Front Genet 13. doi:10.3389/fgene.2022.794202

      Cirrito JR, Disabato BM, Restivo JL, Verges DK, Goebel WD, Sathyan A, Hayreh D, D’Angelo G, Benzinger T, Yoon H, Kim J, Morris JC, Mintun MA, Sheline YI. 2011. Serotonin signaling is associated with lower amyloid-β levels and plaques in transgenic mice and humans. Proc Natl Acad Sci U S A 108:14968–14973. doi:10.1073/pnas.1107411108

      Dean DC, Jerskey BA, Chen K, Protas H, Thiyyagura P, Roontiva A, O’Muircheartaigh J, Dirks H, Waskiewicz N, Lehman K, Siniard AL, Turk MN, Hua X, Madsen SK, Thompson PM, Fleisher AS, Huentelman MJ, Deoni SCL, Reiman EM. 2014. Brain Differences in Infants at Differential Genetic Risk for Late-Onset Alzheimer Disease: A Cross-sectional Imaging Study. JAMA Neurol 71:11–22. doi:10.1001/jamaneurol.2013.4544

      Eriksen JL, Sagi SA, Smith TE, Weggen S, Das P, McLendon DC, Ozols VV, Jessing KW, Zavitz KH, Koo EH, Golde TE. 2003. NSAIDs and enantiomers of flurbiprofen target γ-secretase and lower Aβ42 in vivo. J Clin Invest 112:440–449. doi:10.1172/JCI18162

      Espay AJ, Herrup K, Kepp KP, Daly T. 2023. The proteinopenia hypothesis: Loss of Aβ42 and the onset of Alzheimer’s Disease. Ageing Res Rev 92:102112. doi:10.1016/j.arr.2023.102112

      Hoffman EJ, Turner KJ, Fernandez JM, Cifuentes D, Ghosh M, Ijaz S, Jain RA, Kubo F, Bill BR, Baier H, Granato M, Barresi MJF, Wilson SW, Rihel J, State MW, Giraldez AJ. 2016. Estrogens Suppress a Behavioral Phenotype in Zebrafish Mutants of the Autism Risk Gene, CNTNAP2. Neuron 89:725–733. doi:10.1016/j.neuron.2015.12.039

      in ’t Veld Bas A, Ruitenberg A, Hofman A, Launer LJ, van Duijn CM, Stijnen T, Breteler MMB, Stricker BHC. 2001. Nonsteroidal Anti-inflammatory Drugs and the Risk of Alzheimer’s Disease. N Engl J Med 345:1515–1521. doi:10.1056/NEJMoa010178

      Jagirdar R, Fu C-H, Park J, Corbett BF, Seibt FM, Beierlein M, Chin J. 2021. Restoring activity in the thalamic reticular nucleus improves sleep architecture and reduces Aβ accumulation in mice. Sci Transl Med 13:eabh4284. doi:10.1126/scitranslmed.abh4284

      Jiang H, Newman M, Lardelli M. 2018. The zebrafish orthologue of familial Alzheimer’s disease gene PRESENILIN 2 is required for normal adult melanotic skin pigmentation. PLOS ONE 13:e0206155. doi:10.1371/journal.pone.0206155

      Jiang H, Pederson SM, Newman M, Dong Y, Barthelson K, Lardelli M. 2020. Transcriptome analysis indicates dominant effects on ribosome and mitochondrial function of a premature termination codon mutation in the zebrafish gene psen2. PloS One 15:e0232559. doi:10.1371/journal.pone.0232559

      Joo W, Vivian MD, Graham BJ, Soucy ER, Thyme SB. 2021. A Customizable Low-Cost System for Massively Parallel Zebrafish Behavioral Phenotyping. Front Behav Neurosci 14.

      Joubert L, Hanson B, Barthet G, Sebben M, Claeysen S, Hong W, Marin P, Dumuis A, Bockaert J. 2004. New sorting nexin (SNX27) and NHERF specifically interact with the 5-HT4a receptor splice variant: roles in receptor targeting. J Cell Sci 117:5367–5379. doi:10.1242/jcs.01379

      Leng Y, Ackley SF, Glymour MM, Yaffe K, Brenowitz WD. 2021. Genetic Risk of Alzheimer’s Disease and Sleep Duration in Non-Demented Elders. Ann Neurol 89:177–181. doi:10.1002/ana.25910

      Mitchell PB, Hadzi-Pavlovic D. 2000. Lithium treatment for bipolar disorder. Bull World Health Organ 78:515–517.

      Mittur A. 2011. Trazodone: properties and utility in multiple disorders. Expert Rev Clin Pharmacol 4:181–196. doi:10.1586/ecp.10.138

      Munoz-Torrero D. 2008. Acetylcholinesterase Inhibitors as Disease-Modifying Therapies for Alzheimer’s Disease. Curr Med Chem 15:2433–2455. doi:10.2174/092986708785909067

      Muto V, Koshmanova E, Ghaemmaghami P, Jaspar M, Meyer C, Elansary M, Van Egroo M, Chylinski D, Berthomier C, Brandewinder M, Mouraux C, Schmidt C, Hammad G, Coppieters W, Ahariz N, Degueldre C, Luxen A, Salmon E, Phillips C, Archer SN, Yengo L, Byrne E, Collette F, Georges M, Dijk D-J, Maquet P, Visscher PM, Vandewalle G. 2021. Alzheimer’s disease genetic risk and sleep phenotypes in healthy young men: association with more slow waves and daytime sleepiness. Sleep 44. doi:10.1093/sleep/zsaa137

      Myers-Turnbull D, Taylor JC, Helsell C, McCarroll MN, Ki CS, Tummino TA, Ravikumar S, Kinser R, Gendelev L, Alexander R, Keiser MJ, Kokel D. 2022. Simultaneous analysis of neuroactive compounds in zebrafish. doi:10.1101/2020.01.01.891432

      Owens MJ, Morgan WN, Plott SJ, Nemeroff CB. 1997. Neurotransmitter receptor and transporter binding profile of antidepressants and their metabolites. J Pharmacol Exp Ther 283:1305–1322.

      Özcan GG, Lim S, Leighton PL, Allison WT, Rihel J. 2020. Sleep is bi-directionally modified by amyloid beta oligomers. eLife 9:e53995. doi:10.7554/eLife.53995

      Quiroz YT, Schultz AP, Chen K, Protas HD, Brickhouse M, Fleisher AS, Langbaum JB, Thiyyagura P, Fagan AM, Shah AR, Muniz M, Arboleda-Velasquez JF, Munoz C, Garcia G, Acosta-Baena N, Giraldo M, Tirado V, Ramírez DL, Tariot PN, Dickerson BC, Sperling RA, Lopera F, Reiman EM. 2015. Brain Imaging and Blood Biomarker Abnormalities in Children With Autosomal Dominant Alzheimer Disease: A Cross-Sectional Study. JAMA Neurol 72:912–919. doi:10.1001/jamaneurol.2015.1099

      Relkin NR. 2007. Beyond symptomatic therapy: a reexamination of acetylcholinesterase inhibitors in Alzheimer’s disease. Expert Rev Neurother 7:735–748. doi:10.1586/14737175.7.6.735

      Rihel J, Prober DA, Arvanites A, Lam K, Zimmerman S, Jang S, Haggarty SJ, Kokel D, Rubin LL, Peterson RT, Schier AF. 2010. Zebrafish Behavioral Profiling Links Drugs to Biological Targets and Rest/Wake Regulation. Science 327:348–351. doi:10.1126/science.1183090

      Sleegers K, Brouwers N, Gijselinck I, Theuns J, Goossens D, Wauters J, Del-Favero J, Cruts M, van Duijn CM, Van Broeckhoven C. 2006. APP duplication is sufficient to cause early onset Alzheimer’s dementia with cerebral amyloid angiopathy. Brain J Neurol 129:2977–2983. doi:10.1093/brain/awl203

      Sun L, Zhou R, Yang G, Shi Y. 2017. Analysis of 138 pathogenic mutations in presenilin-1 on the in vitro production of Aβ42 and Aβ40 peptides by γ-secretase. Proc Natl Acad Sci 114:E476–E485. doi:10.1073/pnas.1618657114

      Szklarczyk D, Santos A, von Mering C, Jensen LJ, Bork P, Kuhn M. 2016. STITCH 5: augmenting protein–chemical interaction networks with tissue and affinity data. Nucleic Acids Res 44:D380–D384. doi:10.1093/nar/gkv1277

      Weggen S, Rogers M, Eriksen J. 2007. NSAIDs: small molecules for prevention of Alzheimer’s disease or precursors for future drug development? Trends Pharmacol Sci 28:536–543. doi:10.1016/j.tips.2007.09.004

      Wiltschko AB, Tsukahara T, Zeine A, Anyoha R, Gillis WF, Markowitz JE, Peterson RE, Katon J, Johnson MJ, Daka SR. 2020. Revealing the structure of pharmacobehavioral space through motion sequencing. Nat Neurosci 23:1433–1443. doi:10.1038/s41593-020-00706-3

      Yang T, Arslanova D, Gu Y, Augelli-Szafran C, Xia W. 2008. Quantification of gamma-secretase modulation differentiates inhibitor compound selectivity between two substrates Notch and amyloid precursor protein. Mol Brain 1:15. doi:10.1186/1756-6606-1-15

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In their paper, Zhan et al. have used Pf genetic data from simulated data and Ghanaian field samples to elucidate a relationship between multiplicity of infection (MOI) (the number of distinct parasite clones in a single host infection) and force of infection (FOI). Specifically, they use sequencing data from the var genes of Pf along with Bayesian modeling to estimate the MOI of individual infections, and use these values along with methods from queueing theory that rely on various assumptions to estimate FOI. They compare these estimates to known FOIs in a simulated scenario and describe the relationship between these estimated FOI values and another commonly used metric of transmission, the entomological inoculation rate (EIR).

      This approach does fill an important gap in malaria epidemiology, namely estimating the force of infection, which is currently complicated by several factors including superinfection, unknown duration of infection, and highly genetically diverse parasite populations. The authors use a new approach borrowing from other fields of statistics and modeling and make extensive efforts to evaluate their approach under a range of realistic sampling scenarios. However, the write-up would greatly benefit from added clarity both in the description of methods and in the presentation of the results. Without these clarifications, rigorously evaluating whether the authors' proposed method of estimating FOI is sound remains difficult. Additionally, there are several limitations that call into question the stated generalizability of this method that should at minimum be further discussed by the authors and in some cases require a more thorough evaluation.

      Major comments:

      (1) Description and evaluation of FOI estimation procedure.

      a. The methods section describing the two-moment approximation, and the accompanying appendix, lack several important details. Equations on lines 891 and 892 are only a small part of the equations in Choi et al. and do not adequately describe the procedure. Notably, several quantities in those equations are never defined, and some of them are important for understanding the method (e.g. A and S as the main random variables for inter-arrival times and service times, and aR and bR, which are the known time-average quantities; these also rely on the squared coefficient of variation of the random variable, which is also never introduced in the paper). Without going back to the Choi paper to understand these quantities and the assumptions of this method, it was not possible to follow how this works in the paper. At a minimum, all variables used in the equations should be clearly defined.

      We thank the reviewer for this useful comment. We have clarified the method and defined all relevant variables in the revised manuscript (Line 537-573). The reviewer correctly pointed out additional sections and equations in Choi et al., including the derivation of an exact expression for the steady-state queue-length distribution and the two-moment approximation. Since our work directly utilized the two-moment approximation, our previous manuscript included only material on that section. However, we agree that providing additional details on the derivation of the exact expression would benefit readers. Therefore, we have summarized this derivation in the revised manuscript (Line 561-564). Additionally, we clarified the method’s assumptions, particularly those involved in transitioning from the exact expression to the two-moment approximation (Line 565-570).

      b. Additionally, the description in the main text of how the queueing procedure can be used to describe malaria infections would benefit from a diagram; as currently written, it is very difficult to follow.

      We thank the reviewer for this suggestion. In the revised manuscript, we included a diagram illustrating the connection between the queueing procedure and malaria transmission (Appendix 1-Figure 8).

      c. Just observing the box plots of mean and 95% CI on a plot with the FOI estimate (Figures 1, 2, and 10-14) is not sufficient to adequately assess the performance of this estimator. First, it is not clear whether the authors are displaying the bootstrapped 95% CIs or whether they are just showing the distribution of the mean FOI taken over multiple simulations, and then it seems that they are also estimating mean FOI per host on an annual basis. Showing a distribution of those per-host estimates would also be helpful. Second, a more quantitative assessment of the ability of the estimator to recover the truth across simulations (e.g. the proportion of simulations where the truth is captured in the 95% CI, or something like this) is important: in many cases it seems that the estimator is always underestimating the true FOI and may not even contain the true value in the FOI distribution (e.g. Figure 10, Figure 1 under the mid-IRS panel). But it is not possible to conclude one way or the other based on this visualization. This is a major issue since it calls into question whether there is in fact data to support that these methods give good and consistent FOI estimates.

      There seems to be some confusion on what we display in some key figures. Figures 1-2 and 10-14 (labeled as Figure 1-2 and Appendix 1-Figure 11-15 in the revised manuscript) display bootstrapped distributions including the 95% CIs, not the distribution of the mean FOI taken over multiple simulations. To estimate the mean FOI per host on an annual basis, the two proposed methods require either the steady-state queue length distribution (MOI distribution) or the moments of this distribution. Obtaining such a steady-state queue length distribution necessitates either densely tracked time-series observations per host or many realizations at the same sampling time per host. However, under the sparse sampling schemes, we only have two one-time-point observations per host: one at the end of wet/high-transmission and another at the end of dry/low-transmission. This is typically the case for empirical data, although numerical simulations could circumvent this limitation and generate such output. Nonetheless, we have a population-level queue length distribution from both simulation outputs and empirical data by aggregating MOI estimates across all sampled individuals. We use this population-level distribution to represent and approximate the steady-state queue length distribution at the individual level, not explicitly considering any individual heterogeneity due to transmission. The estimated FOI is per host in the sense of representing the FOI experienced by an individual host whose queue length distribution is approximated from the collection of all sampled individuals. The true FOI per host per year in the simulation is the total FOI of all hosts per year divided by the number of hosts. Therefore, our estimator, combined with the demographic information on population size, estimates the total number of Plasmodium falciparum infections acquired by all individual hosts in the population of interest per year. 
      We clarified this point in the revised manuscript in the subsection of the Materials and Methods, entitled ‘Population-level MOI distribution for approximating time-series observation of MOI per host or many realizations at the same sampling time per host’ (Line 623-639).

      We evaluated the impact of individual heterogeneity due to transmission on FOI inference using simulation outputs (Line 157-184, Figure 1-2 and Appendix 1-Figure 11-15). Even with significant heterogeneity among individuals (2/3 of the population receiving approximately 94% of all bites whereas the remaining 1/3 receives the rest of the bites), our methods performed comparably to scenarios with homogeneous transmission. Furthermore, our methods demonstrated similar performance for both non-seasonal and seasonal transmission scenarios.

      Regarding the second point, we quantitatively assessed the ability of the estimator to recover the truth across simulations and included this information in a supplementary table in the revised manuscript (supplementary file 3-FOImethodsPerformance.xlsx). Specifically, we indicated whether the truth lies within the bootstrap distribution and provided a measure of relative deviation, which is defined as the true FOI value minus the median of the bootstrap distribution for the estimate, normalized by the true FOI value. This assessment is a valuable addition which enhances clarity, but please note that our previous graphical comparisons do illustrate the ability of the methods to estimate “sensible” values, close to the truth despite multiple sources of errors. “Close” here is relative to the scale of variation of FOI in the field and to the kind of precision that would be useful in an empirical context. From a practical perspective based on the potential range of variation of FOI, the graphical results already illustrate that the estimated distributions would be informative.
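
      The two summaries described above can be written down in a few lines. The following is a sketch under assumptions, not the code behind the supplementary table: the function names and the numbers are hypothetical, and "within the bootstrap distribution" is interpreted here as lying between the smallest and largest bootstrap value.

```python
import statistics


def relative_deviation(true_foi, bootstrap_estimates):
    """(true FOI - median of the bootstrap estimates) / true FOI.

    Positive values mean the bootstrap median underestimates the truth.
    """
    return (true_foi - statistics.median(bootstrap_estimates)) / true_foi


def truth_within_bootstrap(true_foi, bootstrap_estimates):
    """Assumed reading: the truth lies between the bootstrap minimum and maximum."""
    return min(bootstrap_estimates) <= true_foi <= max(bootstrap_estimates)


# Hypothetical values for one simulated scenario (illustrative only).
true_foi = 5.0
bootstrap_estimates = [3.8, 4.0, 4.2, 4.4, 4.6]

deviation = relative_deviation(true_foi, bootstrap_estimates)
covered = truth_within_bootstrap(true_foi, bootstrap_estimates)
```

      With these made-up numbers the median is 4.2, so the relative deviation is positive (an underestimate of about 16%) and the truth falls outside the bootstrap range.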

      We also thank the reviewer for highlighting instances where our proposed methods for FOI inference perform sub-optimally (e.g. Figure 10, Figure 1 under the mid-IRS panel in the previous manuscript). This feedback prompted us to examine these instances more closely and identify the underlying causes related to the stochastic impact introduced during various sampling processes. These include sampling the host population and their infections at a specific sampling depth in the simulated output, matching the depth used for collecting empirical data. In addition, previously, we imputed MOI estimates for treated individuals by sampling only once from non-treated individuals. This time, we conducted 200 samplings and used the final weighted MOI distribution for FOI inference. By doing so, we reduced the impact of extreme single-sampling efforts on MOI distribution and FOI inference. In other words, some of these suboptimal instances correspond to the scenarios where the one-time sampled MOIs from non-treated individuals do not fully capture the MOI distribution of non-treated individuals. We added a section titled ‘Reducing stochastic impact in sampling processes’ to Appendix 1 on this matter (Line 841-849).
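
      One way to picture the repeated-sampling step is sketched below. This is an assumption-laden illustration rather than the procedure in Appendix 1: it assumes that imputed MOI values for treated hosts are drawn with replacement from the non-treated hosts, and that the 200 replicates are pooled with equal weight. The function name and inputs are hypothetical.

```python
import random
from collections import Counter


def pooled_moi_distribution(untreated_moi, n_treated, n_draws=200, seed=0):
    """Average the population-level MOI distribution over repeated imputations.

    untreated_moi: MOI estimates of the non-treated hosts.
    n_treated:     number of treated hosts whose MOI must be imputed.
    n_draws:       number of independent imputation replicates.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_draws):
        # Assumption: impute each treated host by resampling a non-treated host.
        imputed = rng.choices(untreated_moi, k=n_treated)
        counts.update(untreated_moi)
        counts.update(imputed)
    total = sum(counts.values())
    # Every replicate has the same size, so pooling counts weights them equally.
    return {moi: counts[moi] / total for moi in sorted(counts)}


# Hypothetical input: four non-treated hosts and two treated hosts.
distribution = pooled_moi_distribution([1, 1, 2, 3], n_treated=2)
```

      Compared with a single draw, pooling many replicates makes the resulting distribution less sensitive to one unlucky sample of non-treated hosts, which is the effect the response describes.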

      The reviewer correctly noted that our proposed methods tend to underestimate FOI (Figure 1-2, 10-14, ‘Estimated All Errors’ and ‘Estimated Undersampling of Var’ panels in the previous manuscript, corresponding to Figure 1-2 and Appendix 1-Figure 11-15 in the revised manuscript). This underestimation arises from the underestimation of MOI. The Bayesian formulation of the varcoding method does not account for the limited overlap between co-infecting strains, an additional factor that reduces the number of var genes detected per individual. We have elaborated on this matter in the Results and Discussion sections of the revised manuscript (Line 142-149, 252-256).

      d. Furthermore, the authors state in the methods that the choice of mean and variance (and thus second moment) parameters for inter-arrival times is varied widely; however, it is not clear what those ranges are. There needs to be a clear table or figure caption showing what combinations of values were tested and which results are produced from them. This is an essential component of the method, and it is impossible to fully evaluate its performance without this information. This relates to the issue of selecting the mean and variance values that maximize the likelihood of observing a given distribution of MOI estimates. This is very unclear, since no likelihoods have been written down in the methods section of the main text. Which likelihood are the authors referring to? Is this the steady-state queue length distribution? At other places the authors refer to these quantities as maximum likelihood estimators; how do they know they have found the MLE? There are no derivations in the manuscript to support this. The authors should specify the likelihood and include in an appendix an explanation of why their estimation procedure is in fact maximizing this likelihood, preferably with evidence of the shape of the likelihood and of how fine the grid of values they tested is for their mean and variance, since this could influence the overall quality of the estimation procedure.

      We thank the reviewer for pointing out these aspects of the work that can be further clarified. In response, we maximized the likelihood of observing the population-level MOI distribution in the sampled population (see our responses to your previous comment c), given queue length distributions, derived from the two-moment approximation method for various mean and variance combinations of inter-arrival times. We added a new section to the Materials and Methods in the revised manuscript with an explicit likelihood formulation (Line 574-585).

      Additionally, we specified the ranges for the mean and variance parameters for inter-arrival times and provided the grid of values tested in a supplementary table (supplementary file 4-meanVarianceParams.xlsx). Example figures illustrating the shape of the likelihood have also been included in Appendix 1-Figure 9. We tested the impact of different grid value choices on estimation quality by refining the grid to include more points, ensuring the FOI inference results are consistent. The results of the test are documented in the revised manuscript (Line 587-593, Appendix 1-Figure 10).
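
      The grid-based likelihood maximization discussed here can be sketched generically. The code below is an illustration under stated assumptions, not the manuscript's implementation: the log-likelihood treats hosts as independent draws from a queue-length distribution, and `toy_truncated_poisson` is a stand-in that is NOT the two-moment approximation of Choi et al. In the authors' setting each grid point would be a (mean, variance) pair for inter-arrival times; in this toy, each grid point is simply a mean queue length.

```python
import math


def log_likelihood(moi_counts, pmf):
    """Log-likelihood of observed MOI counts under a queue-length distribution.

    moi_counts: dict mapping MOI value -> number of hosts with that MOI.
    pmf:        list where pmf[k] is the probability of queue length k.
    """
    total = 0.0
    for moi, n_hosts in moi_counts.items():
        p = pmf[moi] if moi < len(pmf) else 0.0
        if p <= 0.0:
            return float("-inf")
        total += n_hosts * math.log(p)
    return total


def grid_search(moi_counts, grid, queue_length_pmf):
    """Return the grid point whose queue-length distribution maximizes the likelihood."""
    return max(grid, key=lambda params: log_likelihood(moi_counts, queue_length_pmf(params)))


def toy_truncated_poisson(mean_queue_length, capacity=30):
    """Toy stand-in: Poisson distribution truncated at the capacity c."""
    weights = [
        math.exp(-mean_queue_length) * mean_queue_length ** k / math.factorial(k)
        for k in range(capacity + 1)
    ]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


# Hypothetical population-level MOI counts (sample mean 1.5) and a coarse grid.
moi_counts = {0: 10, 1: 20, 2: 20, 3: 10}
grid = [0.5, 1.0, 1.5, 2.0, 2.5]
best = grid_search(moi_counts, grid, toy_truncated_poisson)
```

      Refining `grid` and checking that `best` stays stable mirrors the consistency check on grid resolution mentioned in the response; evaluating `log_likelihood` over the whole grid gives the likelihood surface that such figures would display.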

      (2) Limitation of FOI estimation procedure.

      a. The authors discuss the importance of the duration of infection to this problem. While I agree that empirically estimating this is not possible, there are other options besides assuming that all 1-5-year-olds have the same duration of infection distribution as naïve adults co-infected with syphilis. For example, it would be useful to test a wide range of assumed infection durations and assess their impact on the estimation procedure. Furthermore, if the authors are going to stick to the described method for duration of infection, the potentially limited generalizability of this method needs to be further highlighted in both the introduction and the discussion. In particular, for an estimated mean FOI of about 5 per host per year in the pre-IRS season as estimated in Ghana (Figure 3), it seems that this would not translate to a 4-year-old being immune-naïve, and certainly this would not necessarily generalize well to a school-aged child population or an adult population.

      We thank the reviewer for this useful comment. The reviewer correctly noted the challenge in empirically measuring the duration of infection for 1-5-year-olds and comparing it to that of naïve adults co-infected with syphilis. We nevertheless continued to use the described method for the duration of infection, while more thoroughly acknowledging and discussing the limitations this aspect of the method introduces. We have highlighted this potential limitation in the Abstract, Introduction, and Discussion sections of the revised manuscript (Line 26-28, 99-103, 270-292). It is important to note that the infection duration from the historical clinical data we have relied on has been used, and is still used, in the malaria modeling community as a credible source for this parameter in untreated natural infections of malaria-naïve individuals in endemic settings of Africa (e.g. in the agent-based model OpenMalaria, see 1).

      To reduce misspecification in infection duration and fully utilize our proposed methods, future data collection and sampling could prioritize subpopulations with minimal prior infections and an immune profile similar to naïve adults, such as infants and toddlers. As these individuals are also the most vulnerable, prioritizing them aligns with the priority of all intervention efforts in the short term, which is to monitor and protect the most vulnerable individuals from severe symptoms and death. We discuss this aspect in detail in the Discussion section of the revised manuscript (Line 287-292).

      In the pre-IRS phase of Ghana surveys, an estimated mean FOI of about 5 per host per year indicates that a 4-year-old child would have experienced around 20 infections, which could suggest they are far from naïve. The extreme diversity of circulating var genes (2) implies, however, that even after 20 infections, a 4-year-old may have only developed immunity to a small fraction of the variant surface antigens (PfEMP1, Plasmodium falciparum erythrocyte membrane protein 1) encoded by this important gene family. Consequently, these children are not as immunologically experienced as it might initially seem. Moreover, studies have shown that long-lived infections in older children and adults can persist for months or even years, including through the dry season. This persistence is driven by high antigenic variation of var genes and associated incomplete immunity. Additionally, parasites can skew PfEMP1 expression to produce less adhesive erythrocytes, enhancing splenic clearance, reducing virulence, and maintaining sub-clinical parasitemia (3, 4, 5). The impact of immunity on infection duration with age for falciparum malaria remains a challenging open question.

      Lastly, the FOI for naïve hosts is a key basic parameter for epidemiological models of complex infectious diseases like falciparum malaria, in both agent-based and equation-based formulations. This is because FOI for non-naïve hosts is typically a function of their immune status, body size, and the FOI of naïve hosts. Thus, knowing the FOI of naïve hosts helps parameterize and validate these models by reducing degrees of freedom.

      b. The evaluation of the capacity parameter c seems to be quite important and is set at 30, however, the authors only describe trying values of 25 and 30, and claim that this does not impact FOI inference, however it is not clear that this is the case. What happens if the carrying capacity is increased substantially? Alternatively, this would be more convincing if the authors provided a mathematical explanation of why the carrying capacity increase will not influence the FOI inference, but absent that, this should be mentioned and discussed as a limitation.

      Thank you for this question. This parameter represents the carrying capacity of the queuing system, or the maximum number of blood-stage strains with which an individual human host can be co-infected. Empirical evidence, estimated using the varcoding method, suggests this value is 20 (2), providing a lower bound for parameter c. However, the varcoding method does not account for the limited overlap between co-infecting strains, which reduces the number of var genes detected in an individual, thereby affecting the basis of MOI estimation. Additional factors, such as the synchronicity of clones in their 48-hour life cycle on alternate days (6) and within-host competition of strains leading to low-parasitemia levels (7, 8), contribute to under-sampling of strains and are not accounted for in MOI estimation (9). To address these potential under-sampling issues, we previously tested values of 25 and 30.

      This time, we systematically investigated a wider range of values, including substantially higher ones: 25, 30, 40, and 60. We found that the FOI inference results are similar across these values. Figure 3 in the main text and supplementary figures (Appendix 1-Figure 16-18) illustrate these findings.

      The parameter c influences the steady-state queue length distribution based on the two-moment approximation with specific mean and variance combinations, primarily affecting the distribution’s tail when customer or infection flows are high. Smaller values of c lower the maximum possible queue length, making the system more prone to “overflow”. In such cases, customers or infections may find no space available upon their arrival, hence not incrementing the queue length.

      Empirical MOI distributions for high-transmission endemic regions center around 4 or 5, mostly remaining below 10, with only a small fraction between 15-20 (2). These distributions do not support parameter combinations resulting in frequent overflow for a system with c equal to 25 or 30. As one increases the value of c further, these parameter combinations would cause the MOI distributions to shift to larger values inconsistent with the empirical MOI distributions. We therefore do not expect substantially higher values for parameter c to noticeably change either the relative shape of the likelihood or the MLE.
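
      The overflow argument can be illustrated with a deliberately simplified stand-in for the two-moment approximation. The sketch below assumes an Erlang loss queue in which each infection is cleared independently, so that the steady-state number of co-infecting strains follows a Poisson distribution truncated at c. The offered load of 5 (roughly the centre of the empirical MOI distributions mentioned above) and all function names are illustrative assumptions, not part of the manuscript's method.

```python
import math


def truncated_poisson_pmf(load, c):
    """Steady-state queue-length pmf of an Erlang loss system.

    Assumption: `load` is the arrival rate times the mean infection
    duration, and `c` is the carrying capacity (maximum queue length).
    """
    weights = [load ** n / math.factorial(n) for n in range(c + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def overflow_probability(load, c):
    """Probability that an arriving infection finds the queue full."""
    return truncated_poisson_pmf(load, c)[c]


def mean_queue_length(load, c):
    """Mean number of co-infecting strains at steady state."""
    pmf = truncated_poisson_pmf(load, c)
    return sum(n * p for n, p in enumerate(pmf))


if __name__ == "__main__":
    load = 5.0  # hypothetical offered load, chosen for illustration only
    for c in (25, 30, 40, 60):
        print(c, mean_queue_length(load, c), overflow_probability(load, c))
```

      Under these assumptions the overflow probability is already negligible at c = 25, so raising c further leaves the distribution essentially unchanged. This is the same qualitative behaviour described above for the likelihood and the MLE.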

      We have included a subsection on parameter c in the Materials and Methods section of the revised manuscript (Line 596-612).

      Reviewer #2 (Public Review):

      Summary:

      The authors combine a clever use of historical clinical data on infection duration in immunologically naive individuals and queuing theory to infer the force of infection (FOI) from measured multiplicity of infection (MOI) in a sparsely sampled setting. They conduct extensive simulations using agent-based modeling to recapitulate realistic population dynamics and successfully apply their method to recover FOI from measured MOI. They then go on to apply their method to real-world data from Ghana before and after an indoor residual spraying campaign.

      Strengths:

      (1) The use of historical clinical data is very clever in this context.

      (2) The simulations are very sophisticated with respect to trying to capture realistic population dynamics.

      (3) The mathematical approach is simple and elegant, and thus easy to understand.

      Weaknesses:

      (1) The assumptions of the approach are quite strong and should be made more clear. While the historical clinical data is a unique resource, it would be useful to see how misspecification of the duration of infection distribution would impact the estimates.

      We thank the reviewer for bringing up the limitation of our proposed methods due to their reliance on a known and fixed duration of infection distribution from historical clinical data. Please see our response to Reviewer 1, Comment 2a, for a detailed discussion on this matter.

      (2) Seeing as how the assumption of the duration of infection distribution is drawn from historical data and not informed by the data on hand, it does not substantially expand beyond MOI. The authors could address this by suggesting avenues for more refined estimates of infection duration.

      We thank the reviewer for pointing out a potential improvement to our work. We acknowledge that FOI is inferred from MOI and thus depends on the information contained in MOI. However, MOI by definition is a number and not a rate parameter. FOI for naïve hosts is a fundamental parameter for epidemiological models of complex infectious diseases like falciparum malaria, in both agent-based and equation-based formulations. FOI of non-naïve hosts is typically a function of their immune status, body size, and the FOI of naïve hosts. Thus, knowing the FOI of naïve hosts helps parameterize and validate these models by reducing degrees of freedom. In this sense, we believe the transformation from MOI to FOI is valuable.

      Measuring infection duration is challenging, making the simultaneous estimation of infection duration and FOI an attractive alternative, as the referee noted. This, however, would require closely monitored cohort studies or densely sampled cross-sectional surveys to reduce issues like identifiability. For instance, a higher arrival rate of infections paired with a shorter infection duration could generate a similar MOI distribution to a lower arrival rate with a longer infection duration. In some cases, incorrect combinations of rate and duration might even produce an MOI distribution that appears closer to the targeted distribution. Such cohort studies and densely sampled cross-sectional surveys have not been and will not be widely available across different geographical locations and times. This work utilizes more readily available data from sparsely sampled single-time-point cross-sectional surveys, which precludes more sophisticated derivation of time-varying average arrival rates of infections and lacks the resolution to simultaneously estimate arrival rates and infection duration. In the revised manuscript, we have elaborated on this matter and added a paragraph in the Discussion section (Line 306-309).
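
      The identifiability issue can be made concrete with a minimal sketch. It assumes a simplified queue without a capacity limit, in which infections arrive at a constant rate and are cleared independently, so that the steady-state MOI is Poisson with mean equal to the arrival rate times the mean infection duration. The numerical values and function names are hypothetical, and the two-moment approximation used in the manuscript additionally depends on variances, which this sketch ignores.

```python
import math


def poisson_pmf(mean, n_max):
    """Poisson probabilities for counts 0..n_max."""
    return [math.exp(-mean) * mean ** n / math.factorial(n)
            for n in range(n_max + 1)]


def steady_state_moi_pmf(foi_per_year, mean_duration_years, n_max=20):
    """MOI distribution under the simplified infinite-capacity assumption.

    Only the product of arrival rate and mean duration enters the result.
    """
    return poisson_pmf(foi_per_year * mean_duration_years, n_max)


# Two hypothetical parameter pairs with the same product (2.5).
high_rate_short_duration = steady_state_moi_pmf(10.0, 0.25)
low_rate_long_duration = steady_state_moi_pmf(5.0, 0.5)

if __name__ == "__main__":
    for n, (p, q) in enumerate(zip(high_rate_short_duration,
                                   low_rate_long_duration)):
        print(n, p, q)
```

      Because the two parameter pairs yield the same MOI distribution in this simplified setting, a single cross-sectional MOI sample cannot separate arrival rate from duration, which is why the duration distribution is fixed from external data.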

      (3) It is unclear in the example how their bootstrap imputation approach is accounting for measurement error due to antimalarial treatment. They supply two approaches. First, there is no effect on measurement, so the measured MOI is unaffected, which is likely false and I think the authors are in agreement. The second approach instead discards the measurement for malaria-treated individuals and imputes their MOI by drawing from the remaining distribution. This is an extremely strong assumption that the distribution of MOI of the treated is the same as the untreated, which seems unlikely simply out of treatment-seeking behavior. By imputing in this way, the authors will also deflate the variability of their estimates.

      We thank the reviewer for pointing out aspects of the work that can be further clarified. Disentangling the effect of drug treatment on measurements like infection duration is challenging. Since our methods rely on the known and fixed distribution of infection duration from historical data of naïve patients with neurosyphilis infected with malaria as a therapy, drug treatment can potentially violate this assumption. In the previous manuscript, we did not attempt to directly address the impact of drug treatment. Instead, we considered two extreme scenarios that bound reality, well summarized by the reviewer. Reality lies somewhere in between these two extremes, with antimalarial treatment significantly affecting measurements in some individuals but not in others. Nonetheless, the results of FOI inference do not differ significantly across both extremes.

      The impact of the drugs likely depends on their nature, efficiency, and duration. We note that treatment information was collected via a routine questionnaire, with participants self-reporting that they had received an antimalarial treatment in the two weeks before the surveys (i.e., participants who reported that they were sick, sought treatment, and were provided with an antimalarial treatment). No confirmation through hospital or clinic records was conducted, as it was beyond the scope of the study. Additionally, many of these sick individuals seek treatment at local chemists, which may limit the relevance of hospital or clinic records, if they are even available. Consequently, information on the nature, efficiency, and duration of administered drugs was incomplete or lacking. As this is not the focus of this work, we do not elaborate on the impact of drug treatment in the revised manuscript.

      The reviewer correctly noted that this imputation might not add additional information and could reduce MOI variability. Therefore, in the revised manuscript, we reported FOI estimates with drug-treated 1-5-year-olds excluded. Additionally, we discarded the infection status and MOI values of treated individuals and sampled their MOI from non-treated microscopy-positive individuals, imputing a positive MOI for treated and uninfected individuals. We also reported FOI estimates based on these MOI values. This scenario provides an upper bound for FOI estimates. Note that we do not assume that the MOI distribution for treated individuals is the same as that for untreated individuals. Rather, we aim to estimate what their MOI would have been, and consequently, determine what the FOI per individual per year in the combined population would be, had these individuals not received antimalarial treatment. The results of FOI inference do not differ significantly between these two approaches. They can serve as general solutions to antimalarial treatment issues for others applying our FOI inference methods. These details can be found in the revised manuscript (Line 185-210, 462-484).
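
      The second scenario can be sketched as a simple resampling step. The code below is a minimal illustration, assuming that MOI values are available as plain lists and that the donor pool of non-treated microscopy-positive individuals can be approximated by untreated individuals with a positive MOI. The data values, the seed, and the function names are hypothetical and do not reproduce the analysis in the manuscript.

```python
import random


def impute_treated_moi(donor_moi, n_treated, rng):
    """Draw one MOI per treated individual from the donor pool, with replacement."""
    if not donor_moi:
        raise ValueError("donor pool must not be empty")
    return [rng.choice(donor_moi) for _ in range(n_treated)]


def combined_moi_sample(untreated_moi, donor_moi, n_treated, seed=0):
    """Untreated MOI values followed by imputed values for treated individuals."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return list(untreated_moi) + impute_treated_moi(donor_moi, n_treated, rng)


# Hypothetical MOI values for untreated individuals (0 means uninfected).
untreated_moi = [0, 0, 1, 2, 3, 5, 8]
# Simplifying assumption: donors are untreated individuals with MOI > 0.
donor_moi = [m for m in untreated_moi if m > 0]
sample = combined_moi_sample(untreated_moi, donor_moi, n_treated=3)

if __name__ == "__main__":
    print(sample)
```

      Since every treated individual receives a positive MOI in this sketch, the resulting sample is shifted upward relative to the exclusion scenario, which is why it is described above as an upper bound.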

      - For similar reasons, their imputation of microscopy-negative individuals is also questionable, as it also assumes the same distributions of MOI for microscopy-positive and negative individuals.

      We thank the reviewer for this comment. The reviewer correctly noted that we imputed the MOI values for microscopy-negative but PCR-positive 1-5-year-olds by sampling from the microscopy-positive 1-5-year-olds, under the assumption that both groups have similar MOI distributions. This approach was motivated by the analysis of our Ghana surveys, which shows no clear relationship between MOI (or the number of var genes detected within an individual host, on the basis of which our MOI values were estimated) and the parasitemia levels of those hosts. Parasitemia levels underlie the difference in detection sensitivity between PCR and microscopy.

      In the revised manuscript, we elaborated on this issue and included formal regression tests showing the lack of a relationship between MOI/the number of var genes detected within an individual host and the parasitemia levels of those hosts (Line 445-451, Appendix 1-Figure 7). We also described potential reasons or hypotheses behind this observation (Line 452-461).

      Reviewer #3 (Public Review):

      Summary:

      It has been proposed that the FOI is a method of using parasite genetics to determine changes in transmission in areas with high asymptomatic infection. The manuscript attempts to use queuing theory to convert multiplicity of infection estimates (MOI) into estimates of the force of infection (FOI), which they define as the number of genetically distinct blood-stage strains. They look to validate the method by applying it to simulated results from a previously published agent-based model. They then apply these queuing theory methods to previously published and analysed genetic data from Ghana. They then compare their results to previous estimates of FOI.

      Strengths:

      It would be great to be able to infer FOI from cross-sectional surveys which are easier and cheaper than current FOI estimates which require longitudinal studies. This work proposes a method to convert MOI to FOI for cross-sectional studies. They attempt to validate this process using a previously published agent-based model which helps us understand the complexity of parasite population genetics.

      Weaknesses:

      (1) I fear that the work could be easily over-interpreted as no true validation was done, as no field estimates of FOI (I think considered true validation) were measured. The authors have developed a method of estimating FOI from MOI which makes a number of biological and structural assumptions. I would not call being able to recreate model results that were generated using a model that makes its own (probably similar) defined set of biological and structural assumptions a validation of what is going on in the field. The authors claim this at times (for example, Line 153) and I feel it would be appropriate to differentiate this in the discussion.

      We thank the reviewer for this comment, although we think there is a misunderstanding about what can and cannot be practically validated in the sense of a “true” measure of FOI that would be free from assumptions for a complex disease such as malaria. We would not want the results to be over-interpreted, and we have extended the discussion of what we have done to test the methods in the revised manuscript (Line 314-328). Performance evaluation via simulation output is common and often necessary for statistical methods. These simulations can come from dynamical or descriptive models, each making their own assumptions to simplify reality. Our stochastic agent-based model (ABM) of malaria transmission, used in this study, has successfully replicated several key patterns from high-transmission endemic regions in the field, including aspects of strain diversity not represented or captured by simpler models (10).

      In what sense this ABM makes a set of biological and structural assumptions that are “probably similar” to those of the queuing methods we present is not clear to us. We agree that using models with different structural assumptions from the method being tested is ideal. Our FOI inference methods based on queuing theory require the duration of infection distribution and the MOI distribution among sampled individuals. However, these FOI inference methods are agnostic to the specific biological mechanisms governing these distributions.

      Another important point raised by this comment is what would be the “true” FOI value against which to validate our methods. Empirical MOI-FOI pairs from cohort studies tracking FOI directly are still lacking. Direct FOI measurements are prone to errors because differentiating new infections from the temporary absence of an old infection in the peripheral blood and its subsequent re-emergence remains challenging. Reasons for this challenge include the low resolution of the polymorphic markers used in cohort studies, which cannot fully differentiate hyper-diverse antigenic strains, and the complexity of within-host dynamics and competitive interaction of co-infecting strains (6, 8, 9). Alternative approaches also do not provide a “true” FOI estimation free from assumptions. These approaches involve fitting simplified epidemiological models to densely sampled/repeated cross-sectional surveys for FOI inference. In this case, no FOI is measured directly, and thus, there are no FOI values available for benchmarking against fitted FOI values. The evaluation or validation of these model-fitting approaches is typically based on their ability to capture other epidemiological quantities that are easier to sample or measure, such as prevalence or incidence, with criteria such as the Akaike information criterion (AIC). This type of evaluation is similar to the one done in this work. We selected FOI values that maximize the likelihood of observing the given MOI distribution. Furthermore, we paired our estimated FOI values for Ghana surveys with the independently measured EIR (Entomological Inoculation Rate), a common field measure of transmission intensity. We ensured that our resulting FOI-EIR points align with existing FOI-EIR pairs and the relationship between these quantities from previous studies. 
      We acknowledge that, like model-fitting approaches, our validation for the field data is also indirect and further complicated by high variance in the relationship between EIR and FOI from previous studies.

      Prompted by the reviewer’s comment, we elaborated on these points in the revised manuscript, emphasizing the indirect nature and existing constraints of our validation with field data in the Discussion section (Line 314-328). Additionally, we clarified certain basic assumptions of our agent-based model in Appendix 1-Simulation data.

      (2) Another aspect of the paper is adding greater realism to the previous agent-based model, by including assumptions on missing data and under-sampling. This takes prominence in the figures and results section, but I would imagine is generally not as interesting to the less specialised reader. The apparent lack of impact of drug treatment on MOI is interesting and counterintuitive, though it is not really mentioned in the results or discussion sufficiently to allay my confusion. I would have been interested in understanding the relationship between MOI and FOI as generated by your queuing theory method and the model. It isn't clear to me why these more standard results are not presented, as I would imagine they are outputs of the model (though happy to stand corrected - it isn't entirely clear to me what the model is doing in this manuscript alone).

      We thank the reviewer for this comment. Please refer to our response to Reviewer 2, comment (3), as we made changes in the revised manuscript regarding individuals treated with antimalarial drugs. We reported two sets of FOI estimates. In the first, we excluded these treated individuals from the analysis as suggested by Reviewer 2. In the second, we discarded their infection status and MOI estimates and sampled their MOI from non-treated individuals.

      The reviewer correctly noted the surprising lack of impact of antimalarial treatment on MOI estimates. This pattern is indeed interesting and counterintuitive. The impact of the drugs likely depends on their nature, efficiency, and duration. We note that treatment information was collected via a routine questionnaire, with participants self-reporting that they had received an antimalarial treatment in the two weeks before the surveys (i.e., participants who reported that they were sick, sought treatment, and were provided with an antimalarial treatment). No confirmation through hospital, clinic, or pharmacy records was conducted, as it was beyond the scope of the study. Additionally, many of these sick individuals seek treatment at local chemists, which may limit the relevance of hospital or clinic records, if they are even available. Consequently, information on the nature, efficiency, and duration of administered drugs was incomplete or lacking. As this is not the focus of this work, we do not elaborate on the impact of drug treatment in the revised manuscript.

      Regarding the last point of the reviewer, on understanding the relationship between MOI and FOI, we are not fully clear about what was meant. We are also confused about the statement on what the “model is doing in this manuscript alone”. We interpret the overall comment as the reviewer suggesting a better understanding of the relationship between MOI and FOI generated by the two-moment approximation method and the agent-based model. This could involve exploring the relationship between the moments of their distributions, possibly by fitting models such as simple linear regression models. Although this approach is in principle possible, it falls outside the focus of our work. Moreover, it would be challenging to evaluate the performance of this alternative approach given the lack of MOI-FOI pairs from empirical settings with directly measured FOI values (from large cohort studies). Nonetheless, we note that the qualitative relationship between the two quantities is intuitive. Higher FOI values should correspond to higher MOI values. Less variable FOI values should result in more narrow or concentrated MOI distributions, whereas more variable FOI values should lead to more spread-out MOI distributions. We described this qualitative relationship between MOI and FOI in the revised manuscript (Line 499-502).

      As mentioned in the response to the reviewer’s previous point (1), we hope that our clarification of the basic assumptions underlying our agent-based model in Appendix 1-Simulation data helps the reviewer gain a better sense of the model. We appreciate that agent-based models involve more assumptions and parameters than typical equation-based models in epidemiology, and that their description can be difficult to follow. We have extended this description to rely less on previous publications. As for other ABMs, the population dynamics of the disease are followed over time by tracking individual hosts and strains. This allows us to implement specific immune memory to the large number of strains arising from the var multigene family. There is no equation-based formulation of the transmission dynamics that can incorporate immune memory in the presence of such large variation as well as recombination of the strains. We rely on this model because large strain diversity at high transmission underlies superinfection of individual hosts, and therefore, MOI values larger than one. We relied on the estimation of MOI with a method based on var gene sampling, and therefore, simulated such sampling for individual hosts (which requires an ABM and one that represents such genes and resulting strains explicitly).

      (3) I would suggest that outside of malaria geneticists, the force of infection is considered to be the entomological inoculation rate, not the number of genetically distinct blood-stage strains. I appreciate that FOI has been used to explain the latter before by others, though the authors could avoid confusion by stating this clearly throughout the manuscript. For example, the abstract says FOI is "the number of new infections acquired by an individual host over a given time interval" which suggests the former, please consider clarifying.

      We thank the reviewer for this helpful comment, as it is crucial to avoid any confusion regarding basic definitions. EIR, the entomological inoculation rate, is closely related to the FOI, force of infection, but they are not equivalent. EIR focuses on the rate of arrival of infectious bites and is measured as such by focusing on the mosquito vectors that are infectious and arrive to bite a given host. Not all these bites result in actual infection of the human host. Epidemiological models of malaria transmission clearly make this distinction, as FOI is defined as the rate at which a host acquires infection. This definition comes from more general models of the population dynamics of infectious diseases. For simpler diseases without super-infection, the typical SIR models define FOI as the rate at which a susceptible individual becomes infected. In the context of malaria, FOI refers to the number of new infections acquired by an individual host over a given time interval. This distinction between EIR and FOI is the reason why studies have investigated their relationship, with the nonlinearity of this relationship reflecting the complexity of the underlying biology and how host immunity influences the outcome of an infectious bite.

      We added “blood-stage strains” to the definition of FOI in the previous manuscript, as pointed out by the reviewer, for the following reason. After an individual host acquires an infection/strain from an infectious mosquito bite, the strain undergoes a multi-stage life cycle within the host, including the liver stage and asexual blood stage. Liver-stage infections can fail to advance to the blood stage due to immunity or exceeding the blood-stage carrying capacity. Only active blood-stage infections are detectable in all direct measures of FOI. Quantities used in indirect model-fitting approaches for estimating FOI are also based on or reflect these blood-stage strains/infections. Only these blood-stage strains/infections are transmissible to other individuals, impacting disease dynamics. Ultimately, the FOI we seek to estimate is the one defined as specified above, as well as in both the previous and revised manuscripts, consistent with the epidemiological literature. We expanded on this point in the revised manuscript (Line 641-656).

      (4) Line 319 says "Nevertheless, overall, our paired EIR (directly measured by the entomological team in Ghana (Tiedje et al., 2022)) and FOI values are reasonably consistent with the data points from previous studies, suggesting the robustness of our proposed methods". I would agree that the results are consistent, given that there is huge variation in Figure 4 despite the transformed scales, but I would not say this suggests a robustness of the method.

      We thank the reviewer for this comment and have modified the relevant sentences to use “consistent” instead of “robust” (Line 229-231).

      (5) The text is a little difficult to follow at times and sometimes requires multiple reads to understand. Greater precision is needed with the language in a few situations and some of the assumptions made in the modelling process are not referenced, making it unclear whether it is a true representation of the biology.

      We thank the reviewer for this comment. As mentioned in the response to Reviewer 1 and in response to your previous points, we have shortened, reorganized and rewritten parts of the text in the revised manuscript to improve clarity and readability.

      Reviewer #1 (Recommendations For The Authors):

      Minor comments:

      Bar graphs in Figures 6 and 7 are not an appropriate way to rigorously compare whether your estimated MOI (under different approaches) is comparable to your true MOIs. Particularly in Figure 6 it is very difficult to clearly compare what is going on. If anything in Figure 7 it looks like as MOI gets higher, Bayesian methods and barcoding are overestimating relative to the truth. The large Excel file that shows KS statistics could be better summarized (and include p-values not in a separate table) and further discussion of how these methods perform on metrics other than the mean value would be important given that MOI distributions can be heavily right skewed and these high MOI values contain a large proportion of genetic diversity which can be highly informative for the purposes of this estimation.

      We appreciate the reviewer’s comment. It appears there may have been some misinterpretation of the pattern in Figure 7 in the previous manuscript. We believe the reviewer meant “as MOI gets higher, Bayesian methods and varcoding are UNDERESTIMATING relative to the truth” rather than “OVERESTIMATING”.

      We agree with the reviewer that the comparison of MOI distributions can be improved. To better quantify the difference between the MOI distribution from the original varcoding method and its Bayesian formulation relative to true MOIs, we replaced the KS test conducted in the previous manuscript with two alternative, more powerful tests: the Cramér-von Mises Test and the Anderson-Darling Test. The Cramér-von Mises Test quantifies the sum of the squared differences between the two cumulative distribution functions, while the Anderson-Darling Test, a modification of the Cramér-von Mises Test, gives more weight to the tails of the distribution, as noted by the reviewer. We have summarized the results, including test statistics and their associated p-values, in a supplementary table (Line 135-149, Line 862-883, supplementary file 1-MOImethodsPerformance.xlsx and supplementary file 7-BayesianImprovement.xlsx).
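
      For readers unfamiliar with the Cramér-von Mises statistic, the following is a minimal two-sample sketch. It assumes one common form of the statistic, in which the squared difference between the two empirical cumulative distribution functions is summed over the pooled sample and scaled by NM/(N+M)^2, and it uses a permutation p-value because MOI values are discrete and heavily tied. The function names, defaults, and example values are hypothetical, and this is not the implementation used for the supplementary tables.

```python
import bisect
import random


def ecdf(sorted_sample, x):
    """Empirical CDF of a sorted sample evaluated at x."""
    return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)


def cvm_two_sample(x, y):
    """Two-sample Cramér-von Mises statistic over the pooled sample."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    squared_gaps = sum((ecdf(xs, z) - ecdf(ys, z)) ** 2 for z in xs + ys)
    return n * m / (n + m) ** 2 * squared_gaps


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Share of label permutations with a statistic at least as large as observed."""
    rng = random.Random(seed)
    observed = cvm_two_sample(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if cvm_two_sample(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    true_moi = [1, 2, 2, 3, 4, 6, 9, 12]        # hypothetical values
    estimated_moi = [1, 2, 2, 3, 4, 5, 7, 9]    # hypothetical values
    print(cvm_two_sample(true_moi, estimated_moi))
    print(permutation_pvalue(true_moi, estimated_moi))
```

      An Anderson-Darling variant would divide each squared gap by a weight that shrinks toward the tails of the pooled distribution, which is what makes it more sensitive to differences at high MOI values.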

      Throughout the text the authors use "consistent" to describe their estimation of FOI, I know this is meant in the colloquial use of the word but consider changing this word to replicable or something similar. When talking about estimators, usually, consistency implies asymptotic convergence in probability which we do not know whether the proposed estimator does.

      We thank the reviewer for this suggestion. We changed “consistent” to “replicable” in the revised manuscript.

      I think there is an issue with the numbering of the figures, they are just numbered continuously between the main text and appendix between 1 and 15, but in the text, there is a different numbering system between the main text and appendix figures.

      We thank the reviewer for this comment. We have double-checked to ensure that the numbering of the figures is consistent with the text in the revised manuscript. Figures are numbered continuously between the main text and the appendix. When referring to these figures in the text, we provide a prefix (i.e., Appendix 1) indicating whether the figure is in the main text or Appendix 1, followed by the figure number.

      The description of the bootstrap for 95% CI is a bit sparse, did bootstrap distributions look symmetric? If not did authors use a skewness adjustment to ensure good coverage? Also, is the bootstrap unit of resampling at the individual level, the simulation scenario level, population level?

      We checked the bootstrap distributions and calculated their skewness. The majority fall within the range of -0.5 to 0.5, with a few exceptions falling within the range of 0.5-0.75 (supplementary file 6-FOIBootstrapSkewness.xlsx). We considered them as fairly symmetric and thus did not use a skewness adjustment.
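
      The symmetry check described above can be sketched as follows. The code assumes the moment coefficient of skewness (third central moment divided by the 1.5 power of the second) and treats the bootstrap output as a plain list of replicate FOI estimates. The exact skewness definition used for the supplementary file, the simulated replicates, and the function names are assumptions made for illustration only.

```python
import random


def skewness(values):
    """Moment coefficient of skewness of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


def is_fairly_symmetric(values, threshold=0.5):
    """True when the absolute skewness does not exceed the threshold."""
    return abs(skewness(values)) <= threshold


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical bootstrap replicates of an FOI estimate.
    replicates = [rng.gauss(5.0, 0.4) for _ in range(1000)]
    print(skewness(replicates), is_fairly_symmetric(replicates))
```

      With a threshold of 0.5, replicates whose skewness falls between 0.5 and 0.75 would be flagged by this check, so the threshold is a judgment call rather than a fixed rule.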

      In Figures 8 and 9 the x-axes seem to imply there are both the true and estimated MOI distributions on the plot but only 1 color of grey is clearly visible. If there are 2 distributions the color or size needs to be changed or if not consider re-labeling the x-axis.

      We thank the reviewer for this comment. There was a mistake in the x-axis labels of Figures 8 and 9. Only the estimated MOI distributions were shown because the true ones are not available for the Ghana field surveys. The labels should simply be “Estimated MOIvar”.

      Reviewer #2 (Recommendations For The Authors):

      (1) Throughout the results section there are lots of vague statements such as "differ only slightly", "exhibit a somewhat larger, but still small, difference", etc. Please include the exact values and ranges within the text where appropriate because it can be difficult to discern from the figure.

      We thank the reviewer for this useful comment. In the revised manuscript, we have provided exact values and ranges where appropriate (supplementary file 1- MOImethodsPerformance.xlsx, supplementary file 3- FOImethodsPerformance.xlsx, and supplementary file 7-BayesianImprovement.xlsx).

      (2) Truncate decimals to 2 places.

      We thank the reviewer for this comment. In the revised manuscript, we have truncated decimals to two places where applicable.

      (3) The queueing theory notation in the methods section is unfamiliar, specifically things like "M/M/c/k", please define the variables used.

      We thank the reviewer for this useful comment. In the revised manuscript, we have defined all the variables used. Please refer to our responses to Reviewer 1 Point (1) a.

      Reviewer #3 (Recommendations For The Authors):

      (1) The work takes many of the models and data from a previous paper published in eLife in 2023 (the 4 most senior authors of this previous manuscript are the 4 authors of the current manuscript). This previous paper introduced some new terminology "census population" which was highlighted as being potentially confusing by 2 of the 3 reviewers of the original article. This was somewhat rebuffed by the authors, though their response was ambiguous about whether the terminology would be changed in any potential future revision. The census population terminology does not appear in this manuscript, though the same data is being used. Publication of similar papers with the same data and different terminology could generate confusion, so I would encourage authors to be consistent and make sure the two papers are in line. To this end, it feels like this paper would be better suited to be classified as a "Research Advances" on this original manuscript and linked, which is a nice functionality that eLife offers.

      We thank the reviewer for this comment, but we do not think our work would fall under the criteria of “Research Advances” based on our previous paper pointed out by the reviewer. The reviewer correctly noted that the current work and the previous paper used the same datasets. However, they have different goals and are not related in terms of content.

      The previous paper examined how epidemiological quantities and diversity measurements of the local parasite population change following the initiation of effective control interventions and subsequently as this control wanes. These quantities included MOI and census population size (MOI was estimated using the Bayesian formulation of the varcoding method, and the census population size was derived from summing MOIvar across individuals in the human population). In contrast, our current work focused on a different goal: inferring FOI based on MOI. We proposed two methods from queuing theory and illustrated them with MOI estimates obtained with the Bayesian formulation of the "varcoding" method. Although the method applied to estimate MOI is indeed the same as that of the paper mentioned by the reviewer, the proposed methods should be applicable to MOI estimates obtained in any other way, as stated in the Abstract of the previous manuscript. That is, the methods we present in the current paper are independent of the way the MOI estimation has been carried out. Our results are not about the MOI values themselves but rather about an illustration of the methods for converting those MOI values to FOI. In fact, there are different ways to obtain MOI estimates for Plasmodium falciparum (9). The most common approach for determining MOI involves size-polymorphic antigenic markers, such as msp1, msp2, msp3, glurp, ama1, and csp. Similarly, microsatellites, also termed simple sequence repeats (SSRs), are another type of size-polymorphic marker that can be amplified to estimate MOI by determining the number of alleles detected. Combinations of genome-wide single nucleotide polymorphisms (SNPs) have also been used to estimate MOI.
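
      To give a sense of how an MOI value can be converted to an FOI value with queuing theory, the sketch below applies Little's law, L = λW, which relates the mean number of items in a system (L) to the arrival rate (λ) and the mean time each item spends in the system (W). The mapping used here (mean MOI as L, mean infection duration as W, FOI as λ), the function name, and the numbers are illustrative assumptions and not the implementation in the manuscript.

```python
def foi_from_littles_law(mean_moi, mean_duration_days):
    """Return the arrival rate lambda = L / W, per host per day.

    Assumption: mean_moi plays the role of L (mean number in the system)
    and mean_duration_days plays the role of W (mean time in the system).
    """
    if mean_duration_days <= 0:
        raise ValueError("mean_duration_days must be positive")
    return mean_moi / mean_duration_days


# Hypothetical inputs chosen for illustration only.
foi_per_day = foi_from_littles_law(mean_moi=2.0, mean_duration_days=200.0)
foi_per_year = foi_per_day * 365.0
```

      Little's law needs only the two means, which is why it can be applied to MOI estimates regardless of how they were obtained.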

      The result section of the current manuscript begins by evaluating how different kinds of errors/sampling limitations affect the estimation of MOI using the Bayesian formulation of the varcoding method. Only that brief section, which is not the core or primary objective of the manuscript, could be considered an extension and an advancement related to the other paper. We considered the effect of these errors on the resulting estimates of FOI.

      We further note that, as the reviewer pointed out, the census population size is not utilized at all in our current work. We are unclear on why this quantity is mentioned here. Our previous paper has been revised and can be found in eLife as such. We have not changed this terminology and have provided a clear explanation for why we chose it. The reviewer seems to have read the previous response to version 1 posted on December 28, 2023 (note that version 2 and the associated response were posted on November 20, 2024). Regardless, this is not the place for a discussion on another paper on a quantity that is irrelevant to the current work being reviewed.

      We understand that the reviewer’s impression may have been influenced by the previous emphasis on the Bayesian formulation of the varcoding method in our manuscript. With the reorganization and rewriting of parts of the manuscript, we hope the revised version will clearly convey the central goal of our work.

      (2) Similar statements that could be toned down. 344 ".... two-moment approximation approach and Little's law are shown to provide consistent and good FOI estimates,.....", 374 "Thus, the flexibility and generality of these two proposed methods allow robust estimation of an important metric for malaria transmission"

      We thank the reviewer for this comment. We have modified the descriptive terms for the performance of our methods. Please also refer to our responses to Reviewer 1, Point (1) c and your previous Point (1).

      (3) Various assumptions seem to have been made which are not justified. For example, heterogeneous mixing is defined as 2/3rd of the population receives 90% of the bites. A reference for this would be good.

      In this work, we considered heterogeneous transmission arising from 2/3 of the population receiving approximately 94% of all bites, because we believe this distribution introduces a reasonable and sufficient amount of heterogeneity in exposure risk across individuals. We are not aware of field studies justifying this degree of heterogeneity.

      (4) The work assumes children under 5 have no immunity (Line 648 says "It is thus safe to consider negligible the impact of immune memory accumulated from previous infections on the duration of a current infection." ). Is there supporting evidence for this and what would happen if this wasn't the case?

      We thank the reviewer for this helpful comment. Please refer to our responses to Reviewer 1 Point (2) a.

      (5) Similarly, there are a few instances of a need for more copy-editing. The text says "We continue with the result of the heterogeneous exposure risk scenarios in which a high-risk group ( 2/3 of the total population) receives around 94% of all bites whereas a low-risk group ( 1/3 of the total population) receives the remaining bites (Appendix 1-Figure 5C)." whereas the referenced caption says "For example, heterogeneous mixing is defined as 2/3rd of population receives 90% of the bites."

      We believe there was a misinterpretation of the legend caption. In the referenced caption, we stated “2/3rd of population receives MORE THAN 90% of the bites”, which aligns with “around 94% of all bites”. Nonetheless, to maintain consistency in the revised manuscript, we have updated the description to uniformly state “approximately 94% of all bites” throughout.

      (6) The term "measurement error" is used to describe the missing potential under-sampling of var genes. Given this would only go one way isn't the term "bias" more appropriate?

      We understand that, in general English, “bias” might seem more precise for describing a deviation in one direction. However, in malaria epidemiology and in models for malaria and other infectious diseases, “measurement error” is a general term that describes deviations introduced in the process of measurement and sampling, which can confound or add noise to the true values being collected. This term is commonly used, and we have adhered to it in the revised manuscript.

      (7) Line 739 "Though FOI and EIR both reflect transmission intensity, the former refers directly to detectable blood-stage infections whereas the latter concerns human-vector contact rates." In my mind this is not true, the EIR is the number of potentially invading parasites (a contact rate between parasites in mosquitoes and humans if you will). The human-vector contact rate is the human biting rate.

      We thank the reviewer for this comment. We have clarified the definition regarding FOI and EIR in our response to your previous comment (3) and in the revised manuscript. We agree that the term “human-vector contact rates” was not precise enough for EIR. We intended “human-infectious vector contact rates”, and we have updated the text to reflect this change (Line 644-645).

      References and Notes

      (1) Maire, N. et al. A model for natural immunity to asexual blood stages of Plasmodium falciparum malaria in endemic areas. Am J Trop Med Hyg., 75(2 Suppl):19-31 (2006).

      (2) Tiedje, K. E. et al. Measuring changes in Plasmodium falciparum census population size in response to sequential malaria control interventions. eLife, 12 (2023).

      (3) Andrade C. M. et al. Infection length and host environment influence on Plasmodium falciparum dry season reservoir. EMBO Mol Med.,16(10):2349-2375 (2024).

      (4) Zhang X. and Deitsch K. W. The mystery of persistent, asymptomatic Plasmodium falciparum infections, Current Opinion in Microbiology, 70:102231 (2022).

      (5) Tran, T. M. et al. An Intensive Longitudinal Cohort Study of Malian Children and Adults Reveals No Evidence of Acquired Immunity to Plasmodium falciparum Infection, Clinical Infectious Diseases, 57(1):40–47 (2013).

      (6) Farnert, A., Snounou, G., Rooth, I., Bjorkman, A. Daily dynamics of Plasmodium falciparum subpopulations in asymptomatic children in a holoendemic area. Am J Trop Med Hyg., 56(5):538-47 (1997).

      (7) Read, A. F. and Taylor, L. H. The Ecology of Genetically Diverse Infections, Science, 292:1099-1102 (2001).

      (8) Sondo, P. et al. Genetically diverse Plasmodium falciparum infections, within-host competition and symptomatic malaria in humans. Sci Rep 9(127) (2019).

      (9) Labbe, F. et al. Neutral vs. non-neutral genetic footprints of Plasmodium falciparum multiclonal infections. PLoS Comput Biol, 19(1) (2023).

      (10) He, Q. et al. Networks of genetic similarity reveal non-neutral processes shape strain structure in Plasmodium falciparum. Nat Commun 9(1817) (2018).

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study presents a useful modification of a standard model of genetic drift by incorporating variance in offspring numbers, claiming to address several paradoxes in molecular evolution. It is unfortunate that the study fails to engage prior literature that has extensively examined the impact of variance in offspring number, implying that some of the paradoxes presented might be resolved within existing frameworks.

      The prior literature the reviewers referred to consists entirely of "modified WF models". In the original submission, we lumped the standard and modified WF models together as the "generalized WF models". Because this lumping caused confusion, the distinctions are now made clear. That said, the Haldane model in our proposal is not a modification of the standard WF model because, conceptually, the two models are very different. WF is based on sampling whereas the Haldane model is based on gene transmission.

      While the "modified WF models" often incorporate V(K) [variance in progeny number], the modification is still based on the WF model of population sampling. The modification is mathematically feasible but biologically untenable, as explained explicitly in the revised text. Most important, all four paradoxes are as incompatible with the modified WF models as with the standard model. Note that the Haldane model does not have the sampling step, which is absorbed into the V(K) term. In the integrated WF-Haldane model, these paradoxes are resolved (see the new sections of Discussion, quoted below).

      If readers do not have time to ponder all four paradoxes, they may simply read the first one, as follows. When the population size (N) is growing exponentially, such as in a bacterial culture, drift is nearly absent when N is small and becomes stronger as N increases, especially when approaching the carrying capacity. Such common observations are exactly the opposite of the WF model's central prediction. Any model based on sampling cannot escape the constraint of "greater drift, smaller N".
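
      The "greater drift, smaller N" constraint of sampling-based models can be illustrated with a minimal simulation of one generation of haploid WF sampling, in which the variance of the allele frequency in the next generation is p(1 - p)/N. The function names, seed, and parameter values below are arbitrary choices made for illustration.

```python
import random
import statistics


def wf_next_frequency(p, n, rng):
    """One generation of haploid Wright-Fisher sampling: Binomial(n, p) / n."""
    return sum(rng.random() < p for _ in range(n)) / n


def drift_variance(p, n, reps=4000, seed=1):
    """Simulated variance of the allele frequency after one generation."""
    rng = random.Random(seed)
    return statistics.pvariance([wf_next_frequency(p, n, rng) for _ in range(reps)])


p = 0.5
# For each N: (simulated variance, theoretical variance p(1 - p)/N).
results = {n: (drift_variance(p, n), p * (1 - p) / n) for n in (50, 500)}
```

      In this sketch the variance shrinks in proportion to 1/N, so under pure sampling drift can only weaken as the population grows.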

      Revision - The following text is a reproduction of the last 7 paragraphs of Discussion.

      “The standard WF model has been extended in several directions (overlapping generations, multiple alleles, ploidy, etc.). The modification most relevant to our studies here is the introduction of V(K) into the model, thus permitting V(K) ≠ E(K). While the modifications are mathematically valid, they are often biologically untenable. Kimura and Crow (1963) may be the first to offer a biological mechanism for V(K) ≠ E(K), effectively imposing the Haldane model on the WF model. Other models (Kimura and Crow 1963; Lynch, et al. 1995; Sjodin, et al. 2005; Der, et al. 2011; Cannings 2016) indeed model mathematically the imposition of the branching process on the population, followed by the WF sampling. The constructions of such models are biologically dubious but, more importantly, still unable to resolve the paradoxes. It would seem more logical to use the Haldane model in the first place by having two parameters, E(K) and V(K). 

      Even if we permit V(K) ≠ E(K) under the WF sampling, the models would face other difficulties. For example, a field biologist needs to delineate a Mendelian population and determine its size, N or Ne. In all WF models, one cannot know what the actual population being studied is. Is it the fly population in an orchard being sampled, in the geographical region, or in the entire species range? It is unsatisfactory when a population biologist cannot identify the population being studied. The Haldane model is an individual-output model (Chen, et al. 2017), which does not require the delineation of a Mendelian population.

      We shall now review the paradoxes specifically in relation to the modified WF models, starting with the multi-copy gene systems such as viruses and rRNA genes covered in the companion study (Wang, et al. 2024). These systems evolve both within and between hosts. Given the small number of virions transmitted between hosts, drift is strong in both stages as shown by the Haldane model (Ruan, Luo, et al. 2021; Ruan, Wen, et al. 2021; Hou, et al. 2023). Therefore, it does not seem possible to have a single effective population size in the WF models to account for the genetic drift in two stages. The inability to deal with multi-copy gene systems may explain the difficulties in accounting for the SARS-CoV-2 evolution (Deng, et al. 2022; Pan, Liu, et al. 2022; Ruan, Wen, et al. 2022; Hou, et al. 2023; Ruan, et al. 2023).

      We now discuss the first paradox of this study, which is about the regulation of N. In the general WF models, N is imposed from outside of the model, rather than self-generating within the model. When N is increasing exponentially as in bacterial or yeast cultures, there is almost no drift when N is very low and drift becomes intense as N grows to near the carrying capacity. As far as we know, no modifications of the WF model can account for this phenomenon that is opposite of its central tenet. In the general WF models, N is really the carrying capacity, not population size. 

      The second paradox of sex chromosomes is rooted in V(K) ≠ E(K). As E(K) is the same between sexes but V(K) is different, clearly V(K) = E(K) would not be feasible. The mathematical solution of defining separate Ne's for males and females (Kimura and Crow 1963; Lynch, et al. 1995; Sjodin, et al. 2005; Der, et al. 2011; Cannings 2016) unfortunately obscures the interesting biology. As shown in Wang et al. (2024; MBE), the kurtosis of the distribution of K indicates the presence of super-breeder males. While the Haldane model can incorporate the kurtosis, the modified WF models are able to absorb only up to the variance term, i.e., the second moment of the distribution. The third paradox of genetic drift is manifested in the fixation probability of an advantageous mutation, 2s/V(K). As explained above, the fixation probability is determined by the probability of reaching a low threshold that is independent of N itself. Hence, the key parameter of drift in the WF model, N (or Ne), is missing. This paradox supports the assertion that genetic drift is fundamentally about V(K) with N being a scaling factor.

      As the domain of evolutionary biology expands, many new systems do not fit into the WF models, resulting in the lack of a genetic drift component in their evolutionary trajectories. Multi-copy gene systems are obvious examples. Others include domestications of animals and plants that are processes of rapid evolution  (Diamond 2002; Larson and Fuller 2014; Purugganan 2019; Chen, Yang, et al. 2022; Pan, Zhang, et al. 2022; Wang, et al. 2022). Due to the very large V(K) in domestication, drift must have played a large role. Somatic cell evolution is another example with “undefinable” genetic drift (Wu, et al. 2016; Chen, et al. 2017; Chen, et al. 2019; Ruan, et al. 2020; Chen, Wu, et al. 2022). The Haldane (or WFH) model, as an "individual output" model, can handle these general cases of genetic drift.

      The Haldane model and the WF model are fundamentally different approaches to the random forces of evolution. While the WF models encounter many biological contradictions, they have provided approximate mathematical solutions to more realistic scenarios. In systems such as viral evolution (Ruan, Hou, et al. 2022; Hou, et al. 2023) or somatic cell evolution (Chen, Wu, et al. 2022; Zhai, et al. 2022), where the WF solution is absent, further development of the WFH model will be necessary.”

      In addition, while the modified model yields intriguing theoretical predictions, the simulations and empirical analyses are incomplete to support the authors' claims.

      This point is addressed in the responses to reviewers' comments. Since they are quite technical, they do not fit in the overview here.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors present a theoretical treatment of what they term the "Wright-Fisher-Haldane" model, a claimed modification of the standard model of genetic drift that accounts for variability in offspring number, and argue that it resolves a number of paradoxes in molecular evolution. Ultimately, I found this manuscript quite strange.

      The notion of effective population size as inversely related to the variance in offspring number is well known in the literature, and not exclusive to Haldane's branching process treatment. However, I found the authors' point about variance in offspring changing over the course of, e.g. exponential growth fairly interesting, and I'm not sure I'd seen that pointed out before.

      Weaknesses:

      I have several outstanding issues. First of all, the authors really do not engage with the literature regarding different notions of an effective population. Most strikingly, the authors don't talk about Cannings models at all, which are a broad class of models with non-Poisson offspring distributions that nonetheless converge to the standard Wright-Fisher diffusion under many circumstances, and to "jumpy" diffusions/coalescents otherwise (see e.g. Mohle 1998, Sagitov (2003), Der et al (2011), etc.). Moreover, there is extensive literature on effective population sizes in populations whose sizes vary with time, such as Sano et al (2004) and Sjodin et al (2005).

      Of course in many cases here the discussion is under neutrality, but it seems like the authors really need to engage with this literature more.

      The reviewer's summary and statement of weaknesses reflect the general criticism summarized by the editors. The reply and revision to these criticisms have been presented in the long reply to the eLife Assessment above.

      We hence re-emphasize only the key points here.

      (1) The literature that the reviewers fault us for not citing is about the modifications of the standard WF model. We now cite them as well as a few others in that vein. However, the WF-Haldane model we propose is conceptually very different from the modified WF models. This WFH model is in essence the Haldane model which may use the results of the WF models as the starting point to find the exact solutions.

      (2) The check of the power of the modified WF models is whether they can resolve the paradoxes. None of them can. The arguments apply to neutral cases as well as selection effects. Hence, our central point is that the modifications of the standard WF model [e.g., by incorporating V(K)] do not help the WF model in resolving the paradoxes.  Besides, the incorporation of V(K) is mathematically feasible but biologically untenable as presented in the new sections of Discussion.

      Nonetheless, I don't think the authors' modeling, simulations, or empirical data analysis are sufficient to justify their claims.

      The most interesting part of the manuscript, I think, is the discussion of the Density Dependent Haldane model (DDH). However, I feel like I did not fully understand some of the derivation presented in this section, …… - this is the whole notion of exchangeability, also neglected in this manuscript). As such, I don't believe that their analysis of the empirical data supports their claim. [Since the comments above are highly technical and fairly long, they are not copied verbatim.]

      We thank this reviewer for the detailed comments with respect to the potential confusion in the discussion of the Density Dependent Haldane (DDH) model.

      First, the reviewer appears to ask how Eqs (5-6) are derived. We should clarify that both Eq (5) and (6) are assumptions rather than derived results. Both equations are assumptions based on population ecology. Eq (7) is then derived by substituting the assumptions in Eq (5) and (6) into Eq (3).

      The definition in Equation (5) allows the growth rate of the population size to be dependent on N itself, such that growth rate E(K) (average offspring number per generation) is greater than 1 when N < Ck and less than 1 when N > Ck. The parameter z is introduced to adjust the sensitivity of E(K) to changes in population size (as shown in Fig. 3a).

      Second, we appreciate the comments regarding the use of individual-based simulations and the apparent lack of interaction between individuals. In our simulations, there is indeed an interaction among individuals, which is represented by Eq (5). This equation reflects how the competition between two alleles affects the expected growth rate E(K), which decreases as the population size increases. Furthermore, once E(K) for the entire population is determined, the offspring numbers of the alleles are independent.

      We believe that the primary purpose of our simulations was not clearly stated. This lack of clarity may be the root of the criticisms. We now note that the simulations are aimed at testing the accuracy of Equation (10).

      Note that Eq. (10) is a textbook result and quite important in our study. This equation shows that the strength of genetic drift, as given by Pf (the fixation probability of an advantageous mutation), is not a function of N at all. This approximate solution has been obtained using the WF model by Kimura.  The Haldane model solution that can explain Paradox 1 is based on Equation (7) as shown below

      Since the fixation probability of Equation (10) cannot be easily obtained using Eq. (7), we conducted simulations to confirm the accuracy of Eq. (10) when applied to the Haldane model.

      We have revised the relevant sections of the manuscript to clarify these points and to better distinguish between assumptions and results. 

      Revision - Details of the DDH model are given in the Supplementary Information. A synopsis is given here: We consider a non-overlapping haploid population with two neutral alleles. The population size at time t is Nt. We assume that expected growth rate E(K) is greater than 1 when N < Ck and less than 1 when N > Ck, as defined by Eq. (5) below:

      The slope of E(K) vs. N (i.e., the sensitivity of the growth rate to changes in population size), as shown in Fig 3a, depends on z. To determine the variance V(K), we assume that K follows the negative binomial distribution whereby parents would suffer reproduction-arresting injury with a probability of pt at each birthing (Supplementary Information). Accordingly, V(K) can then be expressed as

      By Eq. (6), the ratio of V(K)/E(K) could be constant, decrease or increase with the increase of population size. With E(K) and V(K) defined, we could obtain the effective population size by substituting Eq. (5) and Eq. (6) into Eq. (3).

      Eq. (7) presents the relationship between effective population size (Ne) and the population size (N) as shown in Fig. 3. The density-dependent E(K) could regulate N with different strength (Fig. 3a). The steeper the slope in Fig. 3a, the stronger the regulation.

      Simulation of genetic drift in the Haldane model and the Wright-Fisher (WF) model. In both models, interactions between individuals are implicitly included through the dependency of the average number of offspring on population size, as defined by Eq. (5). This dependency leads to the logistic population growth, reflecting the density-dependent interactions.
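
      The way a density-dependent E(K) regulates N from within the model can be sketched with a deterministic recursion, N(t+1) = N(t) × E(K). Eq. (5) is not reproduced in this response, so the functional form below is a hypothetical Ricker-type stand-in chosen only because it satisfies the stated properties (E(K) > 1 when N < Ck, E(K) < 1 when N > Ck, with z setting the sensitivity). It should not be read as the authors' equation.

```python
import math


def expected_offspring(n, carrying_capacity, z):
    """Hypothetical density-dependent E(K); NOT the manuscript's Eq. (5).

    Exceeds 1 below the carrying capacity and falls below 1 above it;
    z controls how sharply E(K) responds to population size.
    """
    return math.exp(z * (1.0 - n / carrying_capacity))


def trajectory(n0, carrying_capacity, z, generations):
    """Iterate N(t+1) = N(t) * E(K) and return the sequence of sizes."""
    sizes = [float(n0)]
    for _ in range(generations):
        n = sizes[-1]
        sizes.append(n * expected_offspring(n, carrying_capacity, z))
    return sizes


# Illustrative parameters (assumptions): start small, grow toward Ck = 1000.
sizes = trajectory(n0=10, carrying_capacity=1000, z=0.5, generations=200)
```

      With these illustrative parameters, the population grows quickly while N is small and settles at the carrying capacity without N being imposed from outside the recursion.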

      Thus, while I think there are some interesting ideas in this manuscript, I believe it has some fundamental issues:

      first, it fails to engage thoroughly with the literature on a very important topic that has been studied extensively. Second, I do not believe their simulations are appropriate to show what they want to show. And finally, I don't think their empirical analysis shows what they want to show.

      References omitted

      The comments are the summary of previous ones, which have been addressed in detail in the preceding sections.

      Reviewer #2 (Public Review):

      Summary:

      This theoretical paper examines genetic drift in scenarios deviating from the standard Wright-Fisher model. The authors discuss Haldane's branching process model, highlighting that the variance in reproductive success equates to genetic drift. By integrating the Wright-Fisher model with the Haldane model, the authors derive theoretical results that resolve paradoxes related to effective population size [Ne]

      Thanks.  The issue of Ne will be addressed below where the reviewer returns to this issue. The strength of the integrated WFH model is that N (or Ne) is generated by the model itself, rather than externally imposed as in WF models.

      Strengths:

      The most significant and compelling result from this paper is perhaps that the probability of fixing a new beneficial mutation is 2s/V(K). This is an intriguing and potentially generalizable discovery that could be applied to many different study systems.

      The authors also made a lot of effort to connect theory with various real-world examples, such as genetic diversity in sex chromosomes and reproductive variance across different species.

      Thanks. 

      Weaknesses:

      One way to define effective population size is by the inverse of the coalescent rate. This is where the geometric mean of Ne comes from. If Ne is defined this way, many of the paradoxes mentioned seem to resolve naturally. If we take this approach, one could easily show that a large N population can still have a low coalescent rate depending on the reproduction model. However, the authors did not discuss Ne in light of the coalescent theory. This is surprising given that Eldon and Wakeley's 2006 paper is cited in the introduction, and the multiple mergers coalescent was introduced to explain the discrepancy between census size and effective population size, superspreaders, and reproduction variance - that said, there is no explicit discussion or introduction of the multiple mergers coalescent.

      The Haldane model treats N’s very differently from the WF models.  In the WF models, N’s are imposed externally (say, constant N, exponentially growing N, temporally fluctuating N’s and so on; all provided from outside of the model). Ne and coalescence are all derived from these given N’s.  In order to account for the first paradox (see the next paragraph), N needs to be regulated but the WF models cannot regulate N’s. The density-dependent Haldane model that Reviewer 1 inquired above is a model that regulates N internally. It can thus account for the paradox.

      Paradox 1 - When the population size (N) is growing exponentially, such as in a bacterial culture, drift is nearly absent when N is small and is much stronger as N increases, especially when approaching the carrying capacity. Such a pattern is a common observation and is exactly the opposite of the WF model's central prediction. In short, a model that does not regulate N cannot explain the paradox.

      Ne is a fix of the WF model in order to account for the missing components of genetic drift. The paradoxes presented in this one and the companion study show that the fix is rather inadequate.  In contrast, by the WFH model, N is regulated within the model itself as E(K) and V(K) are both functions of N.

      The Wright-Fisher model is often treated as a special case of the Cannings 1974 model, which incorporates the variance in reproductive success. This model should be discussed. It is unclear to me whether the results here have to be explained by the newly introduced WFH model, or could have been explained by the existing Cannings model. The abstract makes it difficult to discern the main focus of the paper. It spends most of the space introducing "paradoxes".

      We greatly appreciate the illuminating advice. Nevertheless, we should explain, or should have explained, more clearly that the four paradoxes presented are central to this pair of eLife papers. The WF and Haldane models are conceptually very different. The choice should not be based on mathematical grounds but on how they help us understand biological evolution. We are using four paradoxes to highlight the differences. We have said in the papers that the origin and evolution of COVID-19 caused a lot of confusion partly because the WF models cannot handle multi-copy gene systems, including viruses that evolve both within and between hosts.

      The standard Wright-Fisher model makes several assumptions, including hermaphroditism, non-overlapping generations, random mating, and no selection. It will be more helpful to clarify which assumptions are being violated in each tested scenario, as V(K) is often not the only assumption being violated. For example, the logistic growth model assumes no cell death at the exponential growth phase, so it also violates the assumption about non-overlapping generations.

      We appreciate the question, which has two aspects. First, why do we think the WF models are insufficient? After all, for each assumption of the WF model (as given in the reviewer’s examples), there is often a solution that relaxes the assumption by modifying Ne. In this sense, there is only one grand assumption made by the WF models: however complex the biology is, it is possible to find an Ne that makes the WF model work. Our argument is that Ne is a cumbersome fix of the WF model and that it does not work in many situations. That is how we replied about the importance of the paradoxes above. We shall again use the first paradox as an example: because drift is stronger as N becomes larger, the fix would have to make Ne negatively correlated with N. In reality, it does not appear possible to resolve this paradox. Another paradox is the evolution of multi-copy gene systems. In short, it seems clear that Ne is not a useful or usable fix.

      The second aspect is: “why, among the many modifications the WF models make, do we only emphasize the inclusion of V(K)?” This is the essence of our two papers. Although V(K) is a modification of the WF models, it does not enable the WF models to resolve the paradoxes. In contrast, the Haldane model incorporates E(K) and V(K) in the model itself. In presenting paradox 3, it was stated that

      This equation shows that the strength of genetic drift, as given by Pf (the fixation probability of an advantageous mutation), is not a function of N at all. It supports the view that the essence of genetic drift is V(K) with N as a scaling factor. Note that, if V(K) = 0, there is no genetic drift regardless of N. As V(K) is not an add-on to the Haldane model (unlike in WF models), the Haldane model can resolve the paradoxes.
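
      The dependence of the fixation probability on V(K), and not on N, can be checked numerically. The following is a minimal sketch, not the authors' code: it simulates a branching process that starts from one copy of an advantageous mutation and counts how often the lineage reaches an establishment threshold, then compares the result with 2s/V(K). The offspring distribution (Poisson, or gamma-Poisson when V(K) exceeds the mean), the threshold of 100 copies, and all function names are assumptions made for illustration.

```python
import math
import random


def sample_offspring(mean, variance, rng):
    """Draw one offspring number K with the requested mean and variance.

    Assumption for illustration: K is Poisson when variance == mean and
    gamma-Poisson (negative binomial) when variance > mean.
    """
    if variance < mean - 1e-12:
        raise ValueError("this sketch only supports variance >= mean")
    if variance - mean > 1e-12:
        shape = mean * mean / (variance - mean)
        scale = (variance - mean) / mean
        lam = rng.gammavariate(shape, scale)
    else:
        lam = mean
    # Knuth's multiplication method for a Poisson draw (fine for small lam).
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def fixation_probability(s, var_k, trials=4000, threshold=100, rng=None):
    """Fraction of single-copy lineages with E(K) = 1 + s that reach `threshold` copies.

    No population size N appears anywhere: each copy reproduces independently.
    """
    if rng is None:
        rng = random.Random(0)
    established = 0
    for _ in range(trials):
        copies = 1
        while 0 < copies < threshold:
            copies = sum(sample_offspring(1.0 + s, var_k, rng) for _ in range(copies))
        if copies >= threshold:
            established += 1
    return established / trials


if __name__ == "__main__":
    s = 0.1
    for var_k in (1.1, 2.2, 4.4):
        estimate = fixation_probability(s, var_k, trials=2000)
        print(f"V(K)={var_k}: simulated={estimate:.3f}, 2s/V(K)={2 * s / var_k:.3f}")
```

      Because 2s/V(K) is an approximation for small s, the simulated values should be close to it rather than identical. Doubling V(K) should roughly halve the estimate.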

      The theory and data regarding sex chromosomes do not align. The fact that \hat{alpha'} can be negative does not make sense. The authors claim that a negative \hat{alpha'} is equivalent to infinity, but why is that? It is also unclear how theta is defined. It seems to me that one should take the first principle approach e.g., define theta as pairwise genetic diversity, and start with deriving the expected pair-wise coalescence time under the MMC model, rather than starting with assuming theta = 4Neu. Overall, the theory in this section is not well supported by the data, and the explanation is insufficient.

      a' can be negative for the same reason that a (the male/female ratio in mutation rate) can be negative (Miyata, et al. 1987; Li, et al. 2002; Makova and Li 2002). Clearly, this has not been a problem in the large literature on a becoming negative.  In fact, in many reports, a is negative, which is read as a approaching infinity.  Imagine that our equation is a'^2 = 0.25, then a' can be 0.5 or -0.5, although the latter solution is not biologically meaningful.

      As for theta, the reviewer asked why we do not use the pairwise genetic diversity, theta(pi), as the first-principle approach to estimating theta. While theta(pi) was the first estimator of theta used, the general principle is that every bin of the frequency spectrum can be used for estimating theta, since the expected value of the i-th bin is theta/i, where i is the number of occurrences of the mutation in the sample. (If the sample size is 100, then i is between 1 and 99.) Hence, the issue is which part of the spectrum has the best statistical properties for the question at hand. While theta(pi) and theta(w) are most commonly used, there are in fact numerous ways to estimate theta. (Fu (2022) presents an excellent review.) For our purpose, we need a theta estimate least affected by selection, and we choose the lowest frequency bin of the spectrum, which is theta(1) based on the singletons. Theta(1), least affected by selection, is the basis of the Fu and Li test.
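
      The point that every frequency bin yields its own estimate of theta can be made concrete with a short sketch. The code below is illustrative only and assumes an unfolded site frequency spectrum, where `sfs[i-1]` is the number of sites at which the derived mutation occurs i times in a sample of n sequences. The function name and the toy spectrum are hypothetical.

```python
from math import comb


def theta_estimates(sfs):
    """Return several estimators of theta from an unfolded frequency spectrum.

    sfs[i-1] = number of segregating sites whose derived allele occurs i times,
    for i = 1 .. n-1, where n is the sample size.
    """
    n = len(sfs) + 1
    harmonic = sum(1.0 / i for i in range(1, n))
    # Watterson's estimator: total segregating sites over the harmonic number.
    theta_w = sum(sfs) / harmonic
    # Pairwise diversity: a site with i derived copies differs in i*(n-i) pairs.
    theta_pi = sum(i * (n - i) * x for i, x in enumerate(sfs, start=1)) / comb(n, 2)
    # Each bin alone estimates theta, because its expectation is theta / i.
    per_bin = [i * x for i, x in enumerate(sfs, start=1)]
    return {
        "theta_1": per_bin[0],  # singleton-based estimate
        "theta_w": theta_w,
        "theta_pi": theta_pi,
        "per_bin": per_bin,
    }


if __name__ == "__main__":
    # Toy spectrum for n = 5 that follows theta / i exactly with theta = 12.
    print(theta_estimates([12, 6, 4, 3]))
```

      On a spectrum that follows the neutral expectation exactly, all estimators agree. Departures among them (for example, theta(1) versus theta(pi)) are what tests of the Fu and Li type exploit.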

      Reviewer #3 (Public Review):

      Summary:

      Ruan and colleagues consider a branching process model (in their terminology the "Haldane model") and the most basic Wright-Fisher model. They convincingly show that offspring distributions are usually non-Poissonian (as opposed to what's assumed in the Wright-Fisher model), and can depend on short-term ecological dynamics (e.g., variance in offspring number may be smaller during exponential growth). The authors discuss branching processes and the Wright-Fisher model in the context of 3 "paradoxes": (1) how Ne depends on N might depend on population dynamics; (2) how Ne is different on the X chromosome, the Y chromosome, and the autosomes, and these differences do match the expectations based on simple counts of the number of chromosomes in the populations; (3) how genetic drift interacts with selection. The authors provide some theoretical explanations for the role of variance in the offspring distribution in each of these three paradoxes. They also perform some experiments to directly measure the variance in offspring number, as well as perform some analyses of published data.

      Strengths:

      (1) The theoretical results are well-described and easy to follow.

      (2) The analyses of different variances in offspring number (both experimentally and analyzing public data) are convincing that non-Poissonian offspring distributions are the norm.

      (3) The point that this variance can change as the population size (or population dynamics) change is also very interesting and important to keep in mind.

      (4) I enjoyed the Density-Dependent Haldane model. It was a nice example of the decoupling of census size and effective size.

      Thanks.

      Weaknesses:

      (1) I am not convinced that these types of effects cannot just be absorbed into some time-varying Ne and still be well-modeled by the Wright-Fisher process.

      Please allow us to refer, again, to two of the four paradoxes. We believe that no modification of the WF model can resolve the paradoxes.

      (1) When the population size (N) is growing exponentially, such as in a bacterial culture, drift is nearly absent when N is small and is much stronger as N increases, especially when approaching the carrying capacity. Such common observations are exactly the opposite of the WF model's key prediction. It is not possible for a model that does not regulate N to explain the paradox.

      (2) There is no way the WF models can formulate Ne for, say viruses or ribosomal RNA genes that have two levels of populations – the within-host populations as well as the host population itself.

      The fact that there are numerous Ne's suggests that Ne is a collection of cumbersome fixes of the WF model. By the WF-Haldane model, all factors are absorbed into V(K) resulting in a simpler model in the end. V(K) is often a measurable quantity. Note that, even if V(K) is incorporated into the WF model, the paradoxes remain unresolvable.

      (2) Along these lines, there is well-established literature showing that a broad class of processes (a large subset of Cannings' Exchangeable Models) converge to the Wright-Fisher diffusion, even those with non-Poissonian offspring distributions (e.g., Mohle and Sagitov 2001). E.g., equation (4) in Mohle and Sagitov 2001 shows that in such cases the "coalescent Ne" should be (N-1) / Var(K), essentially matching equation (3) in the present paper.

      The criticism of a lack of engagement with the well-established literature has been addressed extensively above. Briefly, the literature is about modifications of the WF model, which share the same feature of population sampling. With that feature, the paradoxes are unresolvable. For example, however Ne is defined, the fixation probability of an advantageous mutation does not depend on N or Ne. This is the third paradox of the WF models.

      (3) Beyond this, I would imagine that branching processes with heavy-tailed offspring distributions could result in deviations that are not well captured by the authors' WFH model. In this case, the processes are known to converge (backward-in-time) to Lambda or Xi coalescents (e.g., Eldon and Wakely 2006 or again in Mohle and Sagitov 2001 and subsequent papers), which have well-defined forward-in-time processes.

      We admire the reviewer's learned understanding of the literature. The review raises two points. First, our model may not be able to handle a heavy-tailed progeny distribution (i.e., high kurtosis in the distribution of K). Second, the Xi coalescent models (cited above) can do that. Below are our clarifications.

      First, the WFH model is based on the general distribution of K, which includes flexible and realistic representations of offspring number distributions. In fact, we have used various forms of the K distribution in our publications on the evolution of SARS-CoV-2 (see the Ruan et al. publications in the bibliography). A power-law distribution is particularly useful, as the K distribution in viral transmission is highly kurtotic. This is reflected in the super-spreader hypothesis. In short, the branching process on which the WFH model is based is mainly about the distribution of K. Nevertheless, the variance V(K) can often yield good approximations when the kurtosis is modest.

      Second, we would like to comment on the models of Eldon and Wakely 2006 or Mohle and Sagitov 2001 and subsequent papers. These papers are based on the Moran model and consider a highly skewed distribution of offspring numbers. Fundamentally, the Moran models generally behave like WF models (standard or modified) and hence have the same problems with the paradoxes that are central to our studies. In fact, the reservations about introducing V(K) into the WF models apply as well to the Moran models. The introduction of V(K) is mathematically valid but biologically untenable. Essentially, the WF models incorporate the Haldane model as a first step in the generation transition. The introduction of V(K) into the Moran model is even less biologically sensible. Furthermore, the model allows K to take only three discrete values: 0, 2, and Nψ (see Eq. (7) in Eldon and Wakely). Their model also assumes a constant population size, which contrasts with our model's flexibility in handling varying population sizes and more complex distributions for K.

      In short, the modifications of the WF (and Moran) models are unnecessarily complicated, biologically untenable but still fail to account for the paradoxes. The WFH model can rectify these problems. 

      (4) These results that Ne in the Wright-Fisher process might not be related to N in any straightforward (or even one-to-one) way are well-known (e.g., Neher and Hallatschek 2012; Spence, Kamm, and Song 2016; Matuszewski, Hildebrandt, Achaz, and Jensen 2018; Rice, Novembre, and Desai 2018; the work of Lounès Chikhi on how Ne can be affected by population structure; etc...)

      The reviewer is correct in pointing out the inexact correlation between N and Ne. Nevertheless, it should still be true that the WF models predict qualitatively weaker drift as N increases. The first paradox is as stated:

      When the population size (N) is growing exponentially, such as in a bacterial culture, drift is nearly absent when N is small and is much stronger as N increases, especially when approaching the carrying capacity. Such common observations are exactly the opposite of the WF model's key prediction.

      (5) I was also missing some discussion of the relationship between the branching process and the Wright-Fisher model (or more generally Cannings' Exchangeable Models) when conditioning on the total population size. In particular, if the offspring distribution is Poisson, then conditioned on the total population size, the branching process is identical to the Wright-Fisher model.

      We thank the reviewer for this important comment. The main difference is that N is imposed from outside the WF models but can be generated from within the Haldane model (see the density-dependent Haldane model). In nature, N of the next generation is the sum of the K’s among members of the population. This is how the Haldane model determines N(t+1) from N(t). In the WF models, N is imposed from outside the model and, hence, the given N determines the distribution of K. For this reason, N regulation is not possible in the WF models, thus resulting in the paradoxes.
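
      The contrast between the two ways of obtaining N(t+1) can be sketched in a few lines. This is an illustrative sketch with hypothetical function names, not the authors' implementation. In the Haldane-type step, each individual draws its own K and the next population size is simply the sum. In the Wright-Fisher-type step, the next population size is an input and offspring are sampled from the parental allele frequency.

```python
import random


def haldane_step(n_mutant, n_wild, draw_k, rng):
    """One generation in which N(t+1) emerges from the model.

    Each individual independently leaves K offspring, with K drawn by
    draw_k(rng); the new population size is the sum of all the K's.
    """
    next_mutant = sum(draw_k(rng) for _ in range(n_mutant))
    next_wild = sum(draw_k(rng) for _ in range(n_wild))
    return next_mutant, next_wild


def wright_fisher_step(n_mutant, n_wild, n_next, rng):
    """One generation in which N(t+1) = n_next is imposed from outside.

    The n_next offspring are sampled with replacement from the parents,
    i.e. binomial sampling at the parental mutant frequency.
    """
    freq = n_mutant / (n_mutant + n_wild)
    next_mutant = sum(1 for _ in range(n_next) if rng.random() < freq)
    return next_mutant, n_next - next_mutant


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical offspring rule for the demo: K is 0, 1, 2 or 3 with equal chance.
    print(haldane_step(10, 90, lambda r: r.randint(0, 3), rng))
    print(wright_fisher_step(10, 90, 100, rng))
```

      In the first function the total size is an output that fluctuates with the draws of K, so any rule that makes the K distribution depend on the current N regulates the population from within. In the second function the total is fixed by the caller.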

      (6) In the discussion, it is claimed that the last glacial maximum could have caused the bottleneck observed in human populations currently residing outside of Africa. Compelling evidence has been amassed that this bottleneck is due to serial founder events associated with the out-of-Africa migration (see e.g., Henn, Cavalli-Sforza, and Feldman 2012 for an older review - subsequent work has only strengthened this view). For me, a more compelling example of changes in carrying capacity would be the advent of agriculture ~11kya and other more recent technological advances.

      We thank the reviewer and have used this more convincing case as suggested by the reviewer.

      Recommendations for the authors:

      General replies - We thank the editors and reviewers again. The points below are reiterations of the comments received above, which have been answered in detail. Specific points about wording and notation have also been addressed. Again, we are grateful for the inputs, from which we learned a great deal.

      Reviewing Editor Comments:

      The reviewers recognize the value of this model and some of the findings, particularly results from the density-dependent Haldane model. However, they expressed considerable concerns with the model and overall framing of this manuscript.

      First, all reviewers pointed out that the manuscript does not sufficiently engage with the extensive literature on various models of effective population size and genetic drift, notably lacking discussion on Cannings models and related works.

      We have addressed this issue in the beginning of the Introduction and Discussion, pointing to the long section in the new second half of the Discussion. The essence is that the literature is all about the modified WF models. The WF-Haldane model is conceptually and operationally distinct from the WF models, either standard or modified.

      Second, there is a disproportionate discussion on the paradoxes, yet some of the paradoxes might already be resolved within current theoretical frameworks. All three reviewers found the modeling and simulation of the yeast growth experiment hard to follow or lacking justification for certain choices. The analysis approach of sex chromosomes is also questioned.

      This criticism is addressed together with the next one as they make the same point.

      The reviewers recommend a more thorough review of relevant prior literature to better contextualize their findings. The authors need to clarify and/or modify their derivations and simulations of the yeast growth experiment to address the identified caveats and ensure robustness. Additionally, the empirical analysis of the sex chromosome should be revisited, considering alternative scenarios rather than relying solely on the MSE, which only provides a superficial solution. Furthermore, the manuscript's overall framing should be adjusted to emphasize the conclusions drawn from the WFH model, rather than focusing on the "unresolved paradoxes", as some of these may be more readily explained by existing frameworks. Please see the reviewers' overall assessment and specific comments.

      Many thanks.  We have carefully reframed and presented the WF-Haldane model to make it clear and logically consistent. Whether a new model (i.e., the WF-Haldane model) deserves to be introduced depends on whether it makes any contribution for understanding nature. That is why we emphasize the four paradoxes. 

      A most important disagreement between the reviewers and the authors is about the nature of the paradoxes. While the reviewers suggest that they "may" be resolvable by the conventional WF model (standard or modified), they did not offer the possible resolutions.  To use the analogy in our provisional response: the WF vs. Haldane models are compared to gas cars vs electric vehicles.  We can say confidently that the internal combustion engine cannot resolve the conflicting demands of transportation and zero emission. Its design has limited its capability. 

      Reviewer #2 (Recommendations For The Authors):

      Many thanks.  We have incorporated all these suggestions.  When the incorporation is not straightforward, we have carefully revised the text to minimize mis-communications.

      In the introduction -- "Genetic drift is simply V(K)" -- this is a very strong statement. You can say it is inversely proportional to V(K), but drift is often defined based on changes in allele frequency.

      We have changed the word “simply” to “essentially”. This wording is supported by the fixation probability of advantageous mutations, 2s/V(K). We have shown in the text that N does not matter here because fixation is nearly deterministic once the copy number reaches, say, 100, regardless of whether N is 10^4 or 10^8.

      Page 3 line 86. "sexes is a sufficient explanation."--> "sex could be a sufficient explanation"

      The strongest line of new results is about 2s/V(K). Perhaps, the paper could put more emphasis on this part and demonstrate the generality of this result with a different example.

      The math notations in the supplement are not intuitive. e.g., using i_k and j_k as probabilities. I also recommend using E[X] and V[X]for expectation and variance rather than \italic{E(X)} to improve the readability of many equations.

      Thank you for your careful reading. Regarding the use of i_k and j_k as probabilities, we initially considered using 𝑝 or 𝑞 to represent probabilities. However, since 𝑝 and 𝑞 are already used in the main text, we opted for 𝑖 and 𝑗 to avoid potential confusion. As for your recommendation to use E[X] and V[X] for expectation and variance, we would like to clarify that we follow the standard practice of italicizing these symbols to represent variables.

      Eq A6, A7, While I manage to follow, P_{10}(t) and P_{10} are not defined anywhere in the text.

      Supplement page 7, the term "probability of fixation" is confusing in a branching model.

      Thank you for your observation. We have carefully revised the supplement to provide clarity on these points.

      Revision - In population genetics, the fixation of the M allele means that the population consists entirely of the M allele, with no W alleles remaining. We define the fixation probability of the M allele by generation t as follows:

      Given that the M and W alleles reproduce independently, this can be factored as:

      As t approaches infinity, the ultimate fixation probability of the M allele can be derived as follows:

      Eq. A28. It is unclear whether Eq. A1 could be used here directly. Some justification would be nice.

      We appreciate your careful review, and we will ensure this connection between the two equations is made clearer in the supplement. 

      Revision - We would like to clarify that Eq. (A1) and Eq. (A28) are essentially the same, with the only difference being the subscript 𝑡, which indicates the time dependence in the dynamic process.

      Supplement page 17. "the biological meaning of negative..". There is no clear justification for this claim. As a reader, I don't have any intuition as to why that is the case.

      Thank you for raising this concern. We have addressed this issue earlier.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Brdar, Osterburg, Munick, et al. present an interesting cellular and biochemical investigation of different p53 isoforms. The authors investigate the impact of different isoforms on the in-vivo transcriptional activity, protein stability, induction of the stress response, and hetero-oligomerization with WT p53. The results are logically presented and clearly explained. Indeed, the large volume of data on different p53 isoforms will provide a rich resource for researchers in the field to begin to understand the biochemical effects of different truncations or sequence alterations.

      Strengths:

      The authors achieved their aims to better understand the impact/activity of different p53 isoforms, and their data support their statements. Indeed, the major strengths of the paper lie in its comprehensive characterization of different p53 isoforms and the different assays that are measured. Notably, this includes p53 transcriptional activity, protein degradation, induction of the chaperone machinery, and hetero-oligomerization with wtp53. This will provide a valuable dataset where p53 researchers can evaluate the biological impact of different isoforms in different cell lines. The authors went to great lengths to control and test for the effect of (1) p53 expression level, (2) promoter type, and (3) cell type. I applaud their careful experiments in this regard.

      Weaknesses:

      One thing that I would have liked to see more of is the quantification of the various pull-down/gel assays - to better quantify the effect of, e.g., hetero-oligomerization among the various isoforms. In addition, a discussion about the role of isoforms that contain truncations in the IDRs is not available. It is well known that these regions function in an auto-inhibitory manner (e.g. work by Wright/Dyson) and also mediate many PPIs, which likely have functional roles in vivo (e.g. recruiting p53 to various complexes). The discussion could be strengthened by focusing on some of these aspects of p53 as well.

      Thank you for these comments. In this paper we have focused on the importance of the integrity of the folded domains of p53 for their function. The unfolded regions in the N- and the C-terminus have not been our main target but the reviewer is right that they play important regulatory functions that are lost in the corresponding isoforms. We have, therefore, added a few sentences in the Discussion section.

      With respect to a better quantification, we have re-evaluated the quantification and adjusted it where necessary (see also reviewer 2). With respect to the hetero-oligomerization, we have run a new mass spectrometry experiment in which we focus only on the p53 peptides. These have now been quantitatively evaluated and the results are provided in Fig. 5 of this manuscript.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript entitled "p53 isoforms have a high aggregation propensity, interact with chaperones and lack binding to p53 interaction partners", the authors suggest that the p53 isoforms have high aggregation propensity and that they can co-aggregate with canonical p53 (FLp53), p63 and p73 thus exerting a dominant-negative effect.

      Strengths:

      Overall, the paper is interesting as it provides some characterization of most p53 isoforms DNA binding (when expressed alone), folding structure, and interaction with chaperones. The data presented support their conclusion and bring interesting mechanistic insight into how p53 isoforms may exert some of their activity or how they may be regulated when they are expressed in excess.

      Weaknesses:

      The main limitation of this manuscript is that the isoforms are highly over-expressed throughout the manuscript, although the authors acknowledge that the level of expression is a major factor in the aggregation phenomenon and "that aggregation will only become a problem if the expression level surpasses a certain threshold level" (lines 273-274 and results shown in Figures S3D, 6E). The p53 isoforms are physiologically expressed in most normal human cell types at relatively low levels which makes me wonder about the physiological relevance of this phenomenon.

      Furthermore, it was previously reported that some isoforms clearly induce transcription of target genes which are not observed here. For example, p53β induces p21 expression (Fujita K. et al. p53 isoforms Delta133p53 and p53beta are endogenous regulators of replicative cellular senescence. Nat Cell Biol. 2009 Sep;11(9):1135-42), and Δ133p53α induces RAD51, RAD52, LIG4, SENS1 and SOD1 expression (Gong, L. et al. p53 isoform D113p53/D133p53 promotes DNA double-strand break repair to protect cell from death and senescence in response to DNA damage. Cell Res. 2015, 25, 351-369. / Gong, L. et al. p53 isoform D133p53 promotes the efficiency of induced pluripotent stem cells and ensures genomic integrity during reprogramming. Sci. Rep. 2016, 6, 37281. / Horikawa, I. et al. D133p53 represses p53-inducible senescence genes and enhances the generation of human induced pluripotent stem cells. Cell Death Differ. 2017, 24, 1017-1028. / Gong, L. p53 coordinates with D133p53 isoform to promote cell survival under low-level oxidative stress. J. Mol. Cell Biol. 2016, 8, 88-90. / Joruiz et al. Distinct functions of wild-type and R273H mutant Δ133p53α differentially regulate glioblastoma aggressiveness and therapy-induced senescence. Cell Death Dis. 2024 Jun 27;15(6):454.) which demonstrates that some isoforms can induce target genes transcription and have defined normal functions (e.g. Cellular senescence or DNA repair).

      However, in this manuscript, the authors conclude that isoforms are "largely unfolded and not capable of fulfilling a normal cellular function" (line 438), that they do not have "well defined physiological roles" (line 456), and that they only "have the potential to inactivate members of the p53 protein family by forming inactive hetero complexes with wtp53" (line 457-458).

      Therefore, I think it is essential that the authors better discuss this major discrepancy between their study and previously published research.

      This manuscript is not about hunting for the next “signal transduction pathway” that is “regulated” by a specific p53 isoform. For such a project, work indeed has to be conducted at the endogenous level. However, our manuscript is about the basic thermodynamic behavior of these isoforms in in vitro assays and in some cell culture assays.

      What does depend on the expression level, however, is the interaction with chaperones as well as the tendency to aggregate. We show this in our manuscript by using two different promoters of very different strength: strong overexpression leads to aggregation, much weaker expression to soluble isoforms. For the mass spectrometry experiments we established stably expressing cell lines and did not use transiently overexpressing ones.

      The level above which the chaperone systems of the cell can no longer keep these isoforms soluble, so that they start to aggregate, is certainly an important question. We have experimental evidence that the percentage of the aggregating isoforms in the insoluble fraction increases when we use different chaperone inhibitors.

      Proteins have to follow the basic physicochemical rules also in cells. And this manuscript sets the stage for re-interpreting the observed cellular effects – not in terms of specific interaction with certain promoters but as causing a stress response and non-specific interaction with other not-well folded domains of other proteins.

      With respect to this discussion about the physiological relevance, it is interesting to look at a study that was published in Cell:

      Rohaly, G., Chemnitz, J., Dehde, S., Nunez, A.M., Heukeshoven, J., Deppert, W. and Dornreiter, I. (2005) A novel human p53 isoform is an essential element of the ATR-intra-S phase checkpoint. Cell, 122, 21-32.

      This manuscript describes how a specific isoform regulates an important pathway. Two other studies also focused on the same isoform but showed that it lacks the nuclear localization signal and therefore does not enter the nucleus. And even if it did, it would have no transcriptional activity due to the unfolding of the DBD.

      Chan, W.M. and Poon, R.Y. (2007) The p53 Isoform Deltap53 lacks intrinsic transcriptional activity and reveals the critical role of nuclear import in dominant-negative activity. Cancer Res, 67, 1959-1969.

      Garcia-Alai, M.M., Tidow, H., Natan, E., Townsley, F.M., Veprintsev, D.B. and Fersht, A.R. (2008) The novel p53 isoform "delta p53" is a misfolded protein and does not bind the p21 promoter site. Protein Sci, 17, 1671-1678.

      This example shows that it is important to re-consider the basic principles of protein structure and protein folding. And that is exactly what this manuscript is about.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Does the p53g C-terminus (322-346) form cross-beta amyloid structures? The strong fluorescence signal in the presence of ThT suggests this may be forming amyloid. I wonder if any amyloid sequence predictors identify this region as amyloidogenic.

      Using the Waltz predictor (https://doi.org/10.1038/nmeth.1432), the amino acids 339-346 have been identified as potentially amyloidogenic. We have added this information to the manuscript.

      (2) The chaperone binding results in Figure 5 are interesting and indeed suggest that many p53 isoforms interact with chaperones in vivo to counteract their destabilized nature. For the 5 p53 isoforms shown in Figure 5D, do they present any HSP70-binding motifs that may not exist in wtp53? These motifs can be predicted from the sequence with established software in a similar manner as the authors performed for TANGO.

      Author response image 1.

      Predicted chaperone binding sites using the LIMBO prediction tool (http://www.ncbi.nlm.nih.gov/pubmed/19696878).

      We have analyzed the sequence of p53 and the isoforms for potential HSP70 binding sites using the LIMBO prediction tool. The results are shown in the figure above. Wild type p53 has a very strong site that is lost in the β- and ɣ-isoforms. The ɣ-isoform in addition loses another predicted binding site which is replaced with a ɣ-specific one. Overall, this analysis does not provide a very clear picture due to the loss of some and the creation of new, isoform-specific binding sites. We have, therefore, not included this analysis in the manuscript but show it here for the reviewers.

      (3) The mixed hetero-tetramers detected by the MS is very interesting. Also the pull-down experiments in Figure 6. However, the extent of hetero-oligomerization is at times hard to follow. Could you more clearly summarize and/or quantify the results of the hetero-oligomerization experiments?

      We have conducted a new mass spectrometry experiment that focused only on the analysis of p53 peptides. These data are now shown in Figure 5 and Supplementary Figure 6. They show that peptides that are not present in the Δ133p53α isoform, and therefore must come from wild type p53, can be detected. For the Δ133p53β isoform these peptides are absent, suggesting that this isoform does not hetero-oligomerize with wild type p53. Furthermore, none of the β- and ɣ-isoforms show peptides derived from wild type p53, again suggesting that they cannot hetero-oligomerize due to the lack of a functional oligomerization domain.

      (4) There is a typo in Figure 5. The figure title (top of page) says "Figure 4: Chaperons". Also, "chaperons" appears in the legend.

      Thank you for making us aware of this problem. This has been corrected.

      (5) The figures are often quite small with a lot of white space. Figure 4 in particular is arranged in a confusing way with A, D, B, C, E, F, G in T->B L->R order. Perhaps some figures could be expanded or re-arranged to make better use of the available space. E.g. could move B, C above panel D, and then shift F, G to be next to E. This would give you A, [B, C, D], [E, F, G] in a 2x2 format.

      We have rearranged figures 2, 4, 5 and 6 to be able to enlarge the individual figure panels.

      Reviewer #2 (Recommendations for the authors):

      (1) Figure 2C: Why is the p21-Luc reporter assay performed in SAOS-2 cells when all other assays are performed in H1299?

      The assays we have performed in this study are independent of the cell type because we investigate very basic principles of protein folding and stability. If one removes a third of a folded domain, this domain will no longer fold, independent of the cell type it is in. However, to show that the cell type indeed does not play any role, we have repeated the experiments in H1299 cells. These data are now shown in Figure 2C, and we have moved the original data in SAOS cells to Supplementary Figure 1E.

      (2) Figure 3: I find the statistics on this figure very confusing... It looks like every isoform is compared to the "WT", but in that case, in Figure 3B for example, how can the Δ40p53β be ****, Δ133p53γ be *** while the Δ133p53α, more different to WT and narrower error bars is non-specific? I guess this comes from the normalization of the GST expression of each isoform but in this case, the isoforms should not be compared to the WT, but to their respective GST sample.

      There was indeed a mistake in the statistics, thank you for pointing this out.

      We repeated the statistical analysis and the relative protein level within each sample is now calculated using the ratio between the respective GST sample and the sample containing E6. Significance for each isoform was assessed by comparing the relative protein level to the protein level of the WT.

      (3) Figures 3D and 3E: the authors did not perform the assays on Δ40p53 isoforms because they "contain a fully folded DBD" (lines 218-219). This may be true for Δ40p53α as shown by the pAB240 binding figure 3C, but it is speculative for Δ40p53β and Δ40p53γ since these were not tested in Figure 3C either... Furthermore, Figure 3B suggests that there may be differences between Δ40p53α, Δ40p53β and Δ40p53γ and therefore these two isoforms should be tested for pAB240 IP at least (and DARPin as well if the pAB240 IP shows differences). Also, why were the TAp53β and TAp53γ not tested in Figures 3D and 3E?

      Here we disagree with the reviewer. The PDB is full of structures of the p53 DNA binding domain. All of them – including many structures of the same domain from other species – span residues ~90 to 294 (or the equivalent residues in other species). That means that the β- and ɣ- versions of p53 contain the full DNA binding domain. In contrast to the DNA binding domain, the oligomerization domain, however, is truncated and therefore does not form functional tetramers. This is the reason for the reduced binding affinity to DNA.

      The pAB240 antibody recognizes and binds to an epitope that becomes exposed upon the unfolding of the DBD. This manuscript shows by multiple experiments that the DBD of the β- and the ɣ-isoforms are not compromised but that the oligomerization domain is not functional. In figures 3D and 3E we have not included the TA β- and the ɣ-isoforms, because, again, they have a folded DBD and their inclusion would not provide any additional information compared to TAp53α.

      (4) Figures 4B and 4C are small and extremely difficult to read.

      We agree and have rearranged and enlarged these and other figures. Please see also answer to comment (5) of reviewer 1.

      (5) Figure 5C: the authors claim that "the isoform induced cellular stress that triggers the expression of chaperones" (line 320). However, if the induction of the HSP70 promoter is shown, there is no evidence that this is due to cellular stress. Evidence to support that claim should be shown.

      The expression and accumulation of unfolded, aggregation prone sequences is a stress situation for the cell which triggers the expression of chaperones. The expression of isoforms that are not well folded or of p53 mutants that are not well folded increases expression both from the HSP70 promoter and the heat shock promoter. This shows that the expression of unfolded isoforms induces cellular stress.

      (6) Figure 5D: why was this experiment performed in SAOS2 cells when the whole paper was otherwise performed in H1299 cells?

      Also, about this figure, the authors write "In addition to this common set, Δ133p53α and Δ40p53α showed only very few additional interaction partners. This situation was very different for Δ133α, Δ133β and TAp53γ." (lines 331 to 333). My feeling is that we should instead read "In addition to this common set, TAp53β and Δ40p53α showed only very few additional interaction partners. This situation was very different for Δ133p53α, Δ133p53β and TAp53γ"

      Thank you for spotting this mistake. Indeed, the correct wording is TAp53β and Δ40p53α and we have corrected the manuscript.

      The mass spectrometry experiments were actually not carried out in SAOS cells, but in U2OS cells. The reason for not using the H1299 cell line was that these cells do not contain functional p53. In contrast, U2OS cells express wild type p53. We have repeated the mass spectrometry analysis and analyzed the data with a special focus on p53 peptides. This information is now added as Figure 5E. In this analysis we show that the Δ133p53α samples contain peptides from the DBD that are not part of this truncated isoform and must therefore originate from wild type p53 with which this isoform hetero-oligomerizes. The corresponding peptides are absent from Δ133p53β, showing that without a functional oligomerization domain this isoform does not interact with wild type p53. Likewise, the data demonstrate that the β- and the ɣ-isoforms do not form hetero-oligomers.

      (7) Supplementary Table 2: the authors claim "For Δ133p53α we could identify peptides between amino acids 102 and 132 that must originate from wild type p53". SAOS2 has a WT TP53 gene and expresses all isoforms endogenously. Therefore, peptides between amino acids 102 and 132 can actually originate from "WT p53" but also TAp53β, TAp53γ, Δ40p53α, Δ40p53β or Δ40p53γ (most likely a mix of these).

      We have not used SAOS cells but U2OS cells. As mentioned above the data show that the Δ133p53α sample contains peptides from wild type p53 and that these peptides cannot be found in the Δ133p53β sample. In addition, peptides originating from the oligomerization domain are only found in the samples of isoforms containing an oligomerization domain but not in samples of β- and ɣ-isoforms. The data are presented in Figure 5 E-G and Supplementary Figure S5.

      Since the Biotin ligase is directly fused to a specific isoform, peptides from other isoforms can only be detected if these directly interact with the isoform fused to the ligase (and contain unique peptides, not present in the isoform fused to the ligase). The data confirm that only isoforms that have a functional oligomerization domain can interact with wild type p53 (or potentially other isoforms with a functional oligomerization domain).

      (8) Figure 6: Why not conduct these luciferase reporter assays using the MDM-2 and p21 promoters like in Figure 2B and 2C since there may be promoter-specific regulation?

      This would be particularly important for the p21 promoter as TAp53β is known to induce it (Fujita K. et al. p53 isoforms Delta133p53 and p53beta are endogenous regulators of replicative cellular senescence. Nat Cell Biol. 2009 Sep;11(9):1135-42) and the Δ133p53α, Δ133p53β and Δ133p53γ isoforms were shown to reduce p21 transcription by TAp73β when co-expressed in H1299 cells (Zorić A. et al. Differential effects of diverse p53 isoforms on TAp73 transcriptional activity and apoptosis. Carcinogenesis. 2013 Mar;34(3):522-9.). Neither of these regulations appears here on the pBDS2 reporter, which is puzzling.

      The main point of this paper is that all isoforms without a complete DNA binding domain and without a complete oligomerization domain do not bind to DNA with high affinity and do not show transcriptional activity, and that this is independent of the promoter. There might be effects of expressing certain isoforms in some cells, but these most likely arise from inducing a stress response via expression of chaperones etc. High affinity sequence specific DNA binding does not play a role here (see results in Figure 2) and we have therefore not conducted these suggested experiments.

    1. Author response:

      The following is the authors’ response to the original reviews

      We would like to thank you and the reviewers for valuable feedback on the first version of the manuscript. We have now addressed all of the issues raised by the reviewers, mostly by implementing the suggested changes and clarifying important details in the revised version of the manuscript. A detailed response to each comment is provided in the rebuttal letter. Briefly, the main changes were as follows:

      - We changed homeostatic balance to network balance especially when describing the main finding as the response changes induced by the stimulation occurred on a fast timescale. We speculate the sustained changes observed in the post-stimulation condition are the result of homeostatic mechanisms.

      - We added further verification of the target stimulation effect with a supplementary result comparing its effect between the target and off-target z-planes, and by demonstrating the minimal impact of the imaging laser on rsChRmine.

      - We added a simple toy model illustrating suppression specifically applied to co-tuned cells that yields the response amplitude decrease, to further support our findings.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Kang et al. provide the first experimental insights from holographic stimulation of auditory cortex. Using stimulation of functionally-defined ensembles, they test whether overactivation of a specific subpopulation biases simultaneous and subsequent sensory-evoked network activations.

      Strengths:

      The investigators use a novel technique to investigate the sensory response properties in functionally defined cell assemblies in auditory cortex. These data provide the first evidence of how acutely perturbing specific frequency-tuned neurons impacts the tuning across a broader population.

      Weaknesses:

      I have several main concerns about the interpretation of these data:

      (1) The premise of the paper suggests that sensory responses are noisy at the level of neurons, but that population activity is reliable and that different neurons may participate in sensory coding on different trials. However, no analysis related to single trial variance or overall stability of population coding is provided. Specifically, showing that population activity is stable across trials in terms of total activity level or in some latent low dimensional representation would be required to support the concept of "homeostatic balancing".

      Thank you for raising an important point. We agree that ‘homeostatic balancing’ may not be the best term to explain the main results. We have now toned down the homeostatic plasticity aspect of our explanation and changed the term to a simple ‘network balance’, potentially arising from various factors including rapid synaptic plasticity. We speculate that the persistent activity of co-tuned cells in the post-stimulation session, rather than a rapid return of their responses to baseline, is a result of homeostatic balance. Relevant changes are implemented throughout the manuscript, including the Introduction (e.g., lines 76-78) and Discussion sections (e.g., lines 453-456).

      (2) Rebalancing would predict either that the responses of stimulated neurons would remain A) elevated after stimulation due to a hebbian mechanism or B) suppressed due to high activity levels on previous trials, a homeostatic mechanism. The authors report suppression in targeted neurons after stimulation blocks, but this appears similar to all other non-stimulated neurons. How do the authors interpret the post-stimulation effect in stimulated neurons?

      It is true that the post-stimulation sessions show no response change, both for co-tuned and non-co-tuned neurons and for both stimulation and control sessions. This could be because neuronal activity had already adapted and decreased sufficiently from the consecutive presentation of the acoustic stimuli themselves. However, we still think that if the response decrease of co-tuned non-stimulated neurons were driven purely by the stimulation, without homeostasis, their responses should at least bounce back during the post-stimulation session. We agree that further investigation is required to confirm such an effect. We have elaborated on this as another discussion point in the Discussion section (lines 457-464).

      (3) The authors suggest that ACtx is different from visual cortex in that neurons with different tuning properties are intermingled. While that is true at the level of individual neurons, there is global order, as demonstrated by the authors own widefield imaging data and others at the single cell level (e.g. Tischbirek et al. 2019). Generally, distance is dismissed as a variable in the paper, but this is not convincing. Work across multiple sensory systems, including the authors own work, has demonstrated that cortical neuron connectivity is not random but varies as a function of distance (e.g. Watkins et al. 2014). Better justification is needed for the spatial pattern of neurons that were chosen for stimulation. Further, analyses that account for center of mass of stimulation, rather than just the distance from any stimulated neuron would be important to any negative result related to distance.

      Thank you for the further suggestion regarding distance. While Watkins et al. (2014) and Levy and Reyes (2012) showed stronger connectivity for nearby cells as well as for more distant patches, on a functional level Winkowski & Kanold (2013) showed high frequency heterogeneity, especially in L2/3, which we targeted for imaging in this study. Thus, connected cells can have varied tuning, consistent with spine imaging (Konnerth paper). As an additional verification, we have now also calculated the distance from the center of mass of the target cells and still observed no distance-related stimulation effect. We have replaced Figure 4B with the result from the center of mass calculation.

      (4) Data curation and presentation: Broadly, the way the data were curated and plotted makes it difficult to determine how well-supported the authors claims are. In terms of curation, the removal of outliers 3 standard deviations above the mean in the analysis of stimulation effects is questionable. Given the single-cell stimulation data presented in Figure 1, the reader is led to believe that holographic stimulation is quite specific. However, the justification for removing these outliers is that there may be direct stimulation 20-30 um from the target. Without plotting and considering the outliers as well, it is difficult to understand if these outsized responses are due to strong synaptic connections with neighboring neurons or rather just direct off-target stimulation. Relatedly, data presentation is limited to the mean + SEM for almost all main effects and pre-post stimulation effects are only compared indirectly. Whether stimulation effects are driven by just a few neurons that are particularly suppressed or distinct populations which are suppressed or enhanced remains unclear.

      Thank you for pointing this out. We have now specifically removed neighboring cells that are < 20 µm from the target point and observed similar results. We replaced all the relevant figures, text, and statistical results to ensure that the exclusion was specific to overlapping neighboring cells.
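
      The exclusion rule described above can be sketched as follows. This is a minimal illustration only, not our analysis code: the function name, the coordinate arrays, and the example positions are hypothetical, and we assume cell positions are given as (x, y) centroids in µm.

```python
import numpy as np

def keep_mask_far_from_targets(cells_xy, targets_xy, min_dist_um=20.0):
    """Return a boolean mask that is True for cells at least
    `min_dist_um` away from every target cell (hypothetical helper)."""
    # Pairwise distances, shape (n_cells, n_targets), via broadcasting.
    diffs = cells_xy[:, None, :] - targets_xy[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # A cell is kept only if its nearest target is >= min_dist_um away.
    return dists.min(axis=1) >= min_dist_um

# Hypothetical positions in µm: one target, four non-target cells.
targets = np.array([[0.0, 0.0]])
cells = np.array([[5.0, 0.0], [0.0, 19.9], [20.0, 0.0], [100.0, 100.0]])
keep = keep_mask_far_from_targets(cells, targets)
```

      Cells strictly closer than 20 µm to any target are dropped; a cell at exactly 20 µm is retained, matching the "< 20 µm" wording.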

      Reviewer #2 (Public review):

      The goal of HiJee Kang et al. in this study is to explore the interaction between assemblies of neurons with similar pure-tone selectivity in mouse auditory cortex. Using holographic optogenetic stimulation in a small subset of target cells selective for a given pure tone (PTsel), while optically monitoring calcium activity in surrounding non-target cells, they discovered a subtle rebalancing process: co-tuned neurons that are not optogenetically stimulated tend to reduce their activity. The cortical network reacts as if an increased response to PTsel in some tuned assemblies is immediately offset by a reduction in activity in the rest of the PTsel-tuned assemblies, leaving the overall response to PTsel unchanged. The authors show that this rebalancing process affects only the responses of neurons to PTsel, not to other pure tones. They also show that assemblies of neurons that are not selective for PTsel don't participate in the rebalancing process. They conclude that assemblies of neurons with similar pure-tone selectivity must interact in some way to organize this rebalancing process, and they suggest that mechanisms based on homeostatic signaling may play a role.

      The conclusions of this paper are very interesting but some aspects of the study including methods for optogenetic stimulation, statistical analysis of the results and interpretation of the underlying mechanisms need to be clarified and extended.

      (1) This study uses an all-optical approach to excite a restricted group of neurons chosen for their functional characteristics (their frequency tuning), and simultaneously record from the entire network observable in the FOV. As stated by the authors, this approach is applied for the first time to the auditory cortex, which is a tour de force. However, such an approach is complex and requires precise controls to be convincing. In the manuscript, several methodological aspects are not sufficiently described to allow a proper understanding.

      (i) The use of CRmine together with GCaMP8s has been reported as problematic as the 2Ph excitation of GCaMP8s also excites the opsin. Here, the authors use a red-shifted version of CRmine to prevent such cross excitation by the imaging laser. To be convincing, they should explain how they controlled for the absence of rsCRmine activation by the 940nm light. Showing the fluorescence traces immediately after the onset of the imaging session would ensure that neurons are not excited as they are imaged.

      Thank you for pointing this out. We realized that an important reference had been omitted. Kishi et al. 2022 validated the efficacy of rsChRmine compared to ChRmine. In that paper, they compared regular ChRmine and rsChRmine activity at different wavelengths and settings and showed the efficiency of rsChRmine with reduced optical crosstalk. This reference is now included in the manuscript (line 98). We also checked the spontaneous baseline activity, which lasted about 10 s before any sound presentation, and observed relatively stable activity throughout rather than any activation related to imaging session onset; this is also similar to what we see in another group of GCaMP6s transgenic animals.

      Author response image 1.

      Baseline fluorescence activity across cells within FOVs from AAV9-hSyn-GCaMP8s-T2A-rsChRmine injected mice (top) and CBA X Thy1-GCaMP6s F1 transgenic mice (bottom). Fluorescence levels and activity patterns remain similar, suggesting no evident imaging laser-induced activation from rsChRmine. Note that GCaMP8s examples are smoothed by using moving average of 4 points as GCaMP8s show faster activity.

      (ii) Holographic patterns used to excite 5 cells simultaneously may be associated with out-of-focus laser hot spots. Cells located outside of the FOV could be activated, therefore engaging other cells than the targeted ones in the stimulation. This would be problematic in this study as their tuning may be unrelated to the tuning of the targeted cells. To control for such an effect, one could in principle decouple the imaging and the excitation planes, and check for the absence of out-of-focus unwanted excitation.

      We further verified whether the laser power at the targeted z-plane influences cells’ activity at nearby z-planes. As the Reviewer pointed out, the previous x- and y-axis shifts were tested by single-cell stimulation. This time, we stimulated five cells simultaneously to match the actual experimental setup and assess potential artifacts in other planes. We observed no stimulation-driven activity increase in cells at a z-plane shifted by 20 µm (Supplementary Figure 1). This confirms that the holographic stimulation accurately manipulates the pre-selected target cells and that the effects we observe are unlikely to be due to out-of-focus stimulation artifacts. It is true that not all pre-selected cells showing significant response changes prior to the main experiment are effectively activated at every trial during the experiments. We varied the target cell distances across FOVs, from nearby cells to those farther apart within the FOV. We have not observed a significant relationship between the target cell distances and the stimulation effect. Lastly, cells within < 20 µm of the target were excluded to prevent potential excitation due to the holographic stimulation power. Given the spontaneous movements of the FOV during imaging sessions due to the animal’s movement, despite our efforts to minimize them, we believe that any excitation of these neighboring neurons would result directly from the stimulation rather than from the light pattern artifact itself.

      (iii) The control shown in Figure 1B is intended to demonstrate the precision of the optogenetic stimulation: when the stimulation spiral is played at a distance larger or equal to 20 µm from a cell, it does not activate it. However, in the rest of the study, the stimulation is applied with a holographic approach, targeting 5 cells simultaneously instead of just one. As the holographic pattern of light could produce out-of-focus hot spots (absent in the single cell control), we don't know what is the extent of the contamination from non-targeted cells in this case. This is important because it would determine an objective criterion to exclude non-targeted but excited cells (last paragraph of the Result section: "For the stimulation condition, we excluded non-target cells that were within 15 µm distance of the target cells...")

      Neurons that are highly sensitive to a certain frequency also show the greatest adaptation effect, which can be observed in the control condition. Therefore, the greater amplitude change in highly sensitive neurons is first of all related to neuronal adaptation to their preferred stimulus. However, when co-tuned target neurons are stimulated, other co-tuned non-target neurons show a significantly greater amplitude decrease compared to stimulation of non-co-tuned target neurons, and a greater decrease compared to control (the latter did not meet the significance level).

      As you suggested, we also tried a more rigorous criterion of 20 µm instead of 15 µm, since the spiral size was 20 µm. This yielded a further significant response amplitude decrease due to stimulation, only in co-tuned non-target neurons processing their preferred frequency.

      (2) A strength of this study comes from the design of the experimental protocol used to compare the activity in non-target co-tuned cells when the optogenetic stimulation is paired with their preferred tone versus a non-preferred pure tone. The difficulty lies in the co-occurrence of the rebalancing process and the adaptation to repeated auditory stimuli, especially when these auditory stimuli correspond to a cell's preferred pure tones. To distinguish between the two effects, the authors use a comparison with a control condition similar to the optogenetic stimulation conditions, except that the laser power is kept at 0 mW. The observed effect is shown as an extra reduction of activity in the condition with the optogenetic paired with the preferred tone, compared to the control condition. The specificity of this extra reduction when stimulation is synchronized with the preferred tone, but not with a non-preferred tone, is a potentially powerful result, as it points to an underlying mechanism that links the assemblies of cells that share the same preferred pure tones.

      The evidence for this specificity is shown in Figure 3A and 3D. However, the universality of this specificity is challenged by the fact that it is observed for 16kHz preferring cells, but not so clearly for 54kHz preferring cells: these 54kHz preferring cells also significantly (p = 0.044) reduce their response to 54kHz in the optogenetic stimulation condition applied to 16kHz preferring target cells compared to the control condition. The proposed explanation for this is the presence of many cells with a broad frequency tuning, meaning that these cells could have been categorized as 54kHz preferring cells, while they also responded significantly to a 16kHz pure tone. To account for this, the authors divide each category of pure tone cells into three subgroups with low, medium and high frequency preferences. Following the previous reasoning, one would expect at least the "high" subgroups to show a strong and significant specificity for an additional reduction only if the optogenetic stimulation is targeted to a group of cells with the same preferred frequency. Figure 3D fails to show this. The extra reduction for the "high" subgroups is significant only when the condition of opto-stimulation synchronized with the preferred frequency is compared to the control condition, but not when it is compared to the condition of opto-stimulation synchronized with the non-preferred frequency.

      Therefore, the claim that "these results indicate that the effect of holographic optogenetic stimulation depends not on the specific tuning of cells, but on the co-tuning between stimulated and non-stimulated neurons" (end of paragraph "Optogenetic holographic stimulation decreases activity in non-target co-tuned ensembles") seems somewhat exaggerated. Perhaps increasing the number of sessions in the 54kHz target cell optogenetic stimulation condition (12 FOV) to the number of sessions in the 16kHz target cell optogenetic stimulation condition (18 FOV) could help to reach significance levels consistent with this claim.

      We previously tested this by randomly subselecting 12 FOVs from the 16 kHz stimulation condition to match the number of FOVs between the two groups and did not see any difference in the results. However, to further verify the results, we have now added three more datasets for the 54 kHz target cell stimulation condition (now 15 FOVs), which yielded a similar outcome. We have updated the statistical values to include the added datasets.

      (3) To interpret the results of this study, the authors suggest that mechanisms based on homeostatic signaling could be important to allow the rebalancing of the activity of assemblies of co-tuned neurons. In particular, the authors try to rule out the possibility that inhibition plays a central role. Both mechanisms could produce effects on short timescales, making them potential candidates. The authors quantify the spatial distribution of the balanced non-targeted cells and show that they are not localized in the vicinity of the targeted cells. They conclude that local inhibition is unlikely to be responsible for the observed effect. This argument raises some questions. The method used to quantify spatial distribution calculates the minimum distance of a non-target cell to any target cell. If local inhibition is activated by the closest target cell, one would expect the decrease in activity to be stronger for non-target cells with a small minimum distance and to fade away for larger minimum distances. This is not what the authors observe (Figure 4B), so they reject inhibition as a plausible explanation. However, their quantification doesn't exclude the possibility that non-target cells in the minimum distance range could also be close and connected to the other 4 target cells, thus masking any inhibitory effect mediated by the closest target cell. In addition, the authors should provide a quantitative estimate of the range of local inhibition in layers 2/3 of the mouse auditory cortex to compare with the range of distances examined in this study (< 300 µm). Finally, the possibility that some target cells could be inhibitory cells themselves is considered unlikely by the authors, given the proportions of excitatory and inhibitory neurons in the upper cortical layers. On the other hand, it should be acknowledged that inhibitory cells are more electrically compact, making them easier to be activated optogenetically with low laser power.

      Minimum distance is defined as the smallest distance from a non-target cell to any of the target cells. Thus, if the effect were due to local inhibition, the closest target cell would likely have affected the non-target cells’ response changes. Based on both Reviewers’ comments, we also calculated the distance from the center of mass of the target cells as an additional verification and still observed no distance-related stimulation effect. The result is now updated in Figure 4B.
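
      To make the two distance measures concrete, a minimal sketch is given below. It is illustrative only and not our analysis code: the function names and the example coordinates are hypothetical, and we assume 2D (x, y) centroids in µm with an unweighted mean of target positions as the center of mass.

```python
import numpy as np

def min_target_distance(cells_xy, targets_xy):
    """Smallest distance from each non-target cell to any target cell."""
    diffs = cells_xy[:, None, :] - targets_xy[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1)

def center_of_mass_distance(cells_xy, targets_xy):
    """Distance from each non-target cell to the (unweighted) center
    of mass of the target cells."""
    com = targets_xy.mean(axis=0)
    return np.linalg.norm(cells_xy - com, axis=1)

# Hypothetical layout in µm: two targets 100 µm apart, two non-targets.
targets = np.array([[0.0, 0.0], [100.0, 0.0]])
cells = np.array([[0.0, 30.0], [50.0, 0.0]])
d_min = min_target_distance(cells, targets)      # 30 and 50 µm
d_com = center_of_mass_distance(cells, targets)  # about 58.3 and 0 µm
```

      In this example the two measures rank the cells in opposite order: the second cell is farther from its nearest target but sits exactly at the center of the stimulated group, which is why the center of mass calculation is a useful complementary check.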

      Based on previous literature, such as Levy & Reyes 2012, excitatory and inhibitory connectivity is known to extend over a range of around 100 µm. Our results do not show any stronger effect for cells at distances below 100 µm. This suggests that the effect is not limited to local inhibition. We also added further speculation on why our results are less likely to be due to increased inhibition, notwithstanding the biological characteristics of inhibitory neurons with respect to optogenetics.

      Reviewer #3 (Public review):

      Summary:

      The authors optogenetically stimulate 5 neurons all preferring the same pure tone frequency (16 or 54 kHz) in the mouse auditory cortex using a holography-based single cell resolution optogenetics during sound presentation. They demonstrate that the response boosting of target neurons leads to a broad suppression of surrounding neurons, which is significantly more pronounced in neurons that have the same pure tone tuning as the target neurons. This effect is immediate and spans several hundred micrometers. This suggests that the auditory cortical network balances its activity in response to excess spikes, a phenomenon already seen in visual cortex.

      Strengths:

      The study is based on a technologically very solid approach based on single-cell resolution two-photon optogenetics. The authors demonstrate the potency and resolution of this approach. The inhibitory effects observed upon targeted stimulation are clear and the relative specificity to co-tuned neurons is statistically clear although the effect size is moderate.

      Weaknesses:

      The evaluation of the results is brief and some aspects of the observed homeostatic are not quantified. For example, it is unclear whether stimulation produces a net increase or decrease of population activity, or if the homeostatic phenomenon fully balances activity. A comparison of population activity for all imaged neurons with and without stimulation would be instructive. The selectivity for co-tuned neurons is significant but weak. Although it is difficult to evaluate this issue, this result may be trivial, as co-tuned neurons fire more strongly. Therefore, the net activity decrease is expected to be larger, in particular, for the number of non-co-tuned neurons which actually do not fire to the target sound. The net effect for the latter neurons will be zero just because they do not respond. The authors do not make a very strong case for a specific inhibition model in comparison to a broad and non-specific inhibitory effect. Complementary modeling work would be needed to fully establish this point.

      Thank you for raising these important points. We agree that the term homeostatic balancing may have been an overstatement. We have toned down the claims regarding homeostatic plasticity and now frame the result as rapid plasticity at the single-trial level. Regardless, the average activity level did not differ among stimulation conditions (control, 16kHz stim, and 54kHz stim), which suggests that the overall activity level was maintained regardless of the stimulation. We added a new figure of the global activity change as Fig. 4A.

      We also added a simple model in which a suppression term was applied either to all neurons or specifically to non-target co-tuned cells, to test the results obtained from the data.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) For the first holography paper in A1, more information is needed about how holographic stimulation was performed and how stimulation artifacts were avoided or removed from the data set, especially as the text states that the PMTs were left open for the duration of the experiment.

      We clarified the rationale for leaving the shutter open, which is to avoid mechanical sounds that could activate neurons in the AC. We keep the uncaging shutter open because the Bruker default setting (Software version: 5.7) opens and closes the shutter for every iteration of the stimulation, which generates loud mechanical sounds and makes it difficult to determine whether the activation is due to the sound or the stimulation.

      (2) The choice of the dF/F as the primary tool for quantifying data should be better justified. Presumably, cells have very different variances in baseline activity levels and baseline fluorescence levels that create a highly skewed distribution of responses across the population. Further, a

      To take the baseline activity variances into account, we first calculate dF/F for each cell by normalising to the baseline period (about 330 ms before the sound onset) right before each trial. By doing so, we minimize any effect that could have been driven by variable baseline activity levels across neurons.
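      The per-trial baseline normalisation described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name, the frame indices, and the assumption that the baseline window is expressed as a number of imaging frames (for example, roughly 10 frames if the frame rate were about 30 Hz) are all hypothetical.

```python
from statistics import mean


def dff_per_trial(trace, onset_idx, n_baseline):
    """Return dF/F for one cell and one trial.

    F0 is the mean fluorescence over the n_baseline frames that
    immediately precede sound onset (hypothetical frame-based window).
    """
    if n_baseline <= 0 or onset_idx < n_baseline:
        raise ValueError("baseline window must fit before sound onset")
    f0 = mean(trace[onset_idx - n_baseline:onset_idx])
    if f0 == 0:
        raise ValueError("baseline fluorescence is zero")
    # Normalise every frame of the trial to the pre-stimulus baseline.
    return [(f - f0) / f0 for f in trace]


# Illustrative trial: four baseline frames followed by a response.
example = dff_per_trial([100.0, 100.0, 100.0, 100.0, 150.0, 200.0],
                        onset_idx=4, n_baseline=4)
```

      Because F0 is recomputed for every trial and every cell, differences in resting fluorescence between cells do not carry over into the normalised amplitudes.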

      (3) More analysis should be performed to determine why 33% of stimulated cells are not activated, and instead are suppressed during stimulation. Is this related to a cell's baseline fluorescence?

      Great point. Although we did our best to pre-select stimulation-responsive neurons before starting the actual experiments and to keep the animals head-fixed as stably as possible, these neurons do not remain the “best stimulation-responsive neurons” throughout the entire imaging session. There are several caveats. First, they seem to change their activity levels in response to the optogenetic stimulation after they are exposed to acoustic stimulation. Second, since the AC is on the temporal side, it is likely to be more affected by movements of the animal and its brain throughout the imaging session, and these movements could be larger than in visual cortex or motor cortex. However, 33% of 5 cells is about 1.5 cells, so on average about one cell is missed, although some sessions have all 5 cells stimulated while other sessions show a clearly less effective holographic stimulation.

      We even manually visualised the fluorescence change due to the holographic stimulation before starting any imaging session. Nevertheless, the cells do not remain the ‘best stimulation-responsive cells’ throughout, because we cannot control the natural biological variability of neuronal activity. Even so, based on the significant stimulation effects observed by presenting different pure tone frequencies as well as delivering different target stimulation and no-stimulation control, we believe that the effect itself is valid. We added these caveats to the manuscript as a further discussion point and things to consider.

      (4) The linear mixed-effects model should include time as a variable as A) the authors hypothesize that responses should be reduced over time due to sensory adaptation and that B) stimulation induced suppression might be dynamic (though they find it is not).

      Since the stimulation effect seems to be independent of trial-by-trial changes among stimulation conditions (Fig. 4) and we have now toned down the claims about homeostasis, we kept the current mixed-effect model variables.

      (5) More speculation is needed on why stimulation suppresses responses from the first trial onwards.

      We further speculate that such rapid response changes arise from activity-dependent synaptic changes, triggered by the overall shift in network energy caused by optogenetic stimulation, that maintain the cortical circuit balance.

      (6) What does each dot represent in Figure 4A vs. Figure 4B? They are very different in number.

      In 4A, each dot is the average amplitude change for one trial. There are exactly the same number of dots across frequencies, cell groups and conditions, as each dot represents one trial (20 each). They may look different only because of some overlap between frequencies.

      In 4B, each dot is one cell. The 16kHz-preferring cells panel of the stimulation conditions is denser because it had more FOVs and thus more cells to plot. We further clarified these details in the figure legend.

      (7) How sensory responsive neurons were selected should be shown in the figures. Specifically, which fraction of the 30% of most responsive neurons were stimulated should be stated. Depending on the exact yield in the field of view, all or only a minority of strongly sensory responsive neurons are being stimulated, which in either case would color the interpretation of the data.

      We varied the FOV as much as possible across sessions to ensure that the FOVs were directly in A1 and covered a range of frequencies. If we could not observe more than 80 sound-responsive neurons in the processed suite2p data, we searched for another FOV.

      We now included an example FOV of the widefield imaging we first conducted to identify A1, and another example FOV of the 2-photon imaging where we conducted a short sound presentation session to identify the sensory responsive neurons, as an inset of the ‘Cell selection’ part in Figure 1.

      Reviewer #2 (Recommendations for the authors):

      Minor points:

      - p.4, last line: "of" probably missing "the processing the target..."

      Fixed.

      - p.5, top, end of the first paragraph of this page: Figure 3B and 3E don't show exemplar traces.

      Corrected as Figure 2A and 2D.

      - P.5, first sentence of the paragraph "Optogenetic holographic stimulation increases activity in targeted ensembles": reference to Figure 3A and 3D should rather be Figure 2A and 2D.

      Corrected.

      - P.9, 2nd paragraph: sentence with a strange syntax: "since their response amplitude..."

      Corrected.

      - Figure 2: panels C and F are missing.

      Corrected.

      - p.11, methods: "wasthen" should be "was then".

      Corrected.

      - p.12, analysis: it is not clearly explained why the sound evoked activity is computed based on the 160ms to 660ms after sound onset instead of 0ms to 660 ms. It is likely related to some potential contamination but it should be explicitly explained.

      Because calcium transients are relatively slow, this window captures the sound-evoked responses more accurately. We added this detail.

      - Methods, analysis: the authors should better explain how they conducted the random permutation described in the Figures 1D, 2B and 2E. Which signals were permutated?

      We used random permutation to shuffle the target cell IDs.

      - References 55 and 56 don't explicitly state that excitatory neurons generally have stronger responses to sound than inhibitory neurons.

      Thank you for pointing out this error. We replaced those references with Maor et al. 2016 and Kerlin et al. 2010, showing excitatory neurons show more selective tuning, and also changed the wording more appropriately.

      - It is not explained whether the imaging sessions are performed on awake or anaesthetized animals. It is probably done on awake animals, but then it is not clear what procedure is used to get the animals used to the head restraint. It usually takes a few days for the mice to get used to it, and the stress level is often different at the beginning and end of an experiment. Given the experimental protocol used in the study, in which sessions are performed sequentially and compared to each other, this aspect could play a role. However, the main comparison made is probably safe as it compares a control condition (laser at 0mW) and conditions with optogenetic stimulation, all done with similar sequences of sessions.

      The experiment was conducted on awake animals. Although we did not have a control comparing their state at the beginning and the end of the experiment, they all had a widefield imaging session to identify the A1 region, which uses the same head-fixation setup, so they were already more used to the setup when we conducted 2-photon imaging and stimulation. Regardless of the session, if animals showed any sign of extra discomfort due to the unfamiliar setup, we kept them there for 10-15 minutes until they were accustomed to the setup and showed no movement. If they still showed a sign of discomfort, we took them out and tried again on another day. We have now included this detail in the manuscript.

      Reviewer #3 (Recommendations for the authors):

      - Evaluate the global effect of stimulation on the population activity averaged across all neurons (activated and non-activated).

      Thank you for your suggestions. We have now included a new Figure 3A that presents the population activity across all responsive cells. The average activity level did not differ among stimulation conditions (control, 16kHz stim, and 54kHz stim).

      - Evaluate with a simple model if a population of neurons with different sound tuning receiving non-specific inhibition would not produce the observed effect.

      Thank you for the suggestion. We generated a simple model in which a suppression term was applied either to all neurons or specifically to non-target co-tuned cells to test our results from the data. We used a similar range of neuron and FOV numbers so that the model closely matched the structure of the real dataset. For 50 simulated calcium traces of neurons (n),

      Trace<sub>n(t)</sub> = R<sub>n(t)</sub> – theta<sub>n</sub> + epsilon<sub>n(t)</sub>

      where R<sub>n(t)</sub> is a response amplitude from either the baseline or the stimulation session, theta<sub>n</sub> is a suppression term applied either to all neurons or only to non-target co-tuned neurons, only during the stimulation session, and epsilon<sub>n(t)</sub> is additive noise. Theta was defined based on the average increase in activity amplitude generated in target neurons by the stimulation, taken from the real dataset with extra neuron-level jitter. As in the real data analyses, we compared the response change between the trace amplitudes of the stimulation and baseline sessions. Comparing the two model outcomes and the real data, we observed a significant effect of model type (F(2, 2535) = 34.943, p < 0.0001) and a significant interaction between model type and cell group (F(2, 2535) = 36.348, p < 0.0001). Applying suppression to only non-target co-tuned cells during the stimulation session yielded a significant response amplitude decrease for co-tuned cells compared to non-co-tuned cells (F(1, 2535) = 45.62, p < 0.0001), which resembles the real data. In contrast, applying suppression to all non-target cells led to similar amplitude changes in both co-tuned and non-co-tuned neurons (F(1, 2535) = 0.87, p = 0.35), which was not observed in either the real data or the simulated data restricted to co-tuned cell suppression. Therefore, the model predicts correctly that the specific suppression given to only co-tuned neurons drove the real data outcome. All of this information is now added to the Methods and Results sections and the figure is added as Figure 3C.
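      The structure of this model can be sketched in a few lines. The following is only an illustration under stated assumptions, not the authors' implementation: the amplitude range, the noise level, the value of theta, the jitter range, the number of trials, and the alternating assignment of neurons to the co-tuned and non-co-tuned groups are all hypothetical choices, and the comparison is a simple difference of means rather than the reported F-tests.

```python
import random
from statistics import mean


def simulate_change(mode, n_neurons=50, n_trials=20, theta=0.2,
                    noise_sd=0.05, seed=0):
    """Mean response change (stimulation minus baseline) per cell group.

    mode "all": the suppression term theta_n is applied to every neuron.
    mode "co_tuned_only": theta_n is applied to co-tuned neurons only.
    All numeric defaults are hypothetical illustration values.
    """
    if mode not in ("all", "co_tuned_only"):
        raise ValueError("mode must be 'all' or 'co_tuned_only'")
    rng = random.Random(seed)
    changes = {"co_tuned": [], "non_co_tuned": []}
    for n in range(n_neurons):
        group = "co_tuned" if n % 2 == 0 else "non_co_tuned"
        r_n = rng.uniform(0.5, 1.5)  # response amplitude R_n
        suppressed = mode == "all" or group == "co_tuned"
        # theta_n with neuron-level jitter; zero when not suppressed.
        theta_n = theta * rng.uniform(0.8, 1.2) if suppressed else 0.0
        # Baseline session: Trace = R + epsilon (no suppression term).
        base = [r_n + rng.gauss(0.0, noise_sd) for _ in range(n_trials)]
        # Stimulation session: Trace = R - theta + epsilon.
        stim = [r_n - theta_n + rng.gauss(0.0, noise_sd)
                for _ in range(n_trials)]
        changes[group].append(mean(stim) - mean(base))
    return {group: mean(vals) for group, vals in changes.items()}


specific = simulate_change("co_tuned_only")
broad = simulate_change("all")
```

      In this sketch, group-specific suppression produces a change in the co-tuned group that is absent in the non-co-tuned group, whereas broad suppression shifts both groups by a similar amount, which is the qualitative contrast the response describes.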

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Weaknesses:

      One important question must be addressed to further clarify the mechanisms of aberrant Ca2+ microwaves, as described below.

      Synapsin promoter labels both excitatory pyramidal neurons and inhibitory neurons. To avoid aberrant Ca2+ microwave, a combination of Flex virus and CaMKII-Cre or Thy-1-GCaMP6s and 6f mice were tested. However, all these approaches limit the number of infected pyramidal neurons. While the comprehensive display of these results is appreciated, a crucial question remains unanswered. To distinguish whether the microwave of Ca2+ is caused selectively via the abnormality of interneurons, or just a matter of pyramidal neuron density, testing Flex-GCaMP6 in interneuron specific mouse lines such as PV-Cre and SOM-Cre will be critical.

      We agree that unravelling the role of interneurons is important to the understanding of the cellular mechanisms. However, the primary goal of this preprint was to alert the field and those embarking on in vivo Ca2+ imaging to AAV transduction induced artefacts mediated by one of the most widely used viral constructs for Ca2+ imaging in the field. It was important to us to distribute this finding among the community in a timely manner to avoid the unnecessary waste of resources.

      We consider a thorough understanding of cell-type specific mechanisms interesting. However, the biological relevance of the Ca2+ waves is as yet unclear, and disentangling exactly which cellular and subcellular factors drive the aberrant phenomenon will require a large systematic effort which goes beyond our resources. In particular, it will not be technically trivial to separate biologically relevant contributions from technical differences. For instance, the absence of Ca2+ waves under the principal neuron promoter CaMKII may suggest the involvement of interneurons. However, alternative possibilities are a reduced density of expression across principal neurons or a difference in expression levels between the 2 promoters.

      The important, take-home message of the preprint, in our opinion, is that users check carefully their viral protocols, adjust the protocols for their specific scientific question and report any issues. We now emphasise the fact that although Ca2+ waves were not observed following conditional expression of syn.GCaMP with CaMKII.cre, this may not be due to a requirement for interneuronal expression but simply reflect differences in final GCaMP expression density and levels between the two transduction procedures (P12, L298-303).

      Reviewer #2 (Public Review):

      Weaknesses:

      Whether micro-waves are associated with the age of mice was not quantified. This would be good to know and the authors do have this data.

      We plotted the animal age at the time of injection for all injections of Syn.GCaMP6 into CA1/CA3 and found no correlation with either the occurrence or the frequency of Ca2+ waves over the age range of 5 – 79 wks (see reviewer Fig1; linear regression fit to the Ca2+ wave frequency against age was not significant: intercept = 1.37, slope = -0.007, p=0.62, n = 14; and generalized linear model relating Ca2+ wave ~ age was not significant: z score = 0.19, deviance above null = 0.04, p = 0.85, n=24). We have now added a statement on this to the revised manuscript (P14 L354-359) and for the reviewers we have added the plots below.

      Author response image 1.

      Plot of Ca2+ micro-wave frequency (left: number of Ca2+ waves/min) or occurrence (right: yes/no) against the animal age at the time of viral injection. Blue line is linear (left) or logistic (right) fit to the data with 95% confidence level.

      The effect of micro-waves on single cell function was not analyzed. It would be useful, for example, if we knew the influence of micro-waves on place fields. Can a place cell still express a place field in a hippocampus that produces micro-waves? What effect might a microwave passing over a cell have on its place field? Mice were not trained in these experiments, so the authors do not have the data.

      We agree that these are interesting questions; however, the preprint is focused on describing the GECI expression conditions prone to generating these artefacts. Studying the effects of Ca2+ micro-waves on the circuitry are scientific questions, and would require an experimental framework of testing the aberrant activity on a specific physiological function e.g. place activity or specific oscillations (e.g. sharp-wave activity). Ca2+ microwaves, as the ones described here, have not been reported under physiological conditions or pathophysiological conditions and studying the effects of such artefactual waves on the circuit was not our intention.

      With respect to place cell activity, specifically, it is intuitive that during the Ca2+ micro-wave the participating cell’s place field activity would be obscured by the artefactual activity. Cell activity appears to return immediately following the wave suggesting that the cells could exhibit place activity outside their participation in the Ca2+ micro-waves. However, we do not know if the Ca2+ micro-wave activity disrupts the generation or maintenance of place fields. We have now added a brief reference to possible effects on place coding to the paper (P12, L315-317).

      The CaMKII-Cre approach for flexed-syn-GCaMP expression shows no micro-waves and is convincing, but it is only from 2 animals, even though both had no micro-waves.

      In light of the reviewer’s comment, we have added a further 3 animals with conditional expression of GCaMP6m from the DZNE to complement the current dataset with conditional expression of GCaMP6s from UoB (P10, L236 & 239 and revised table 1). Although Ca2+ waves were not observed in any of the 5 animals in total, we still do not know with certainty whether this approach is completely safe. Time will tell whether researchers still encounter the phenotype under certain conditions when using this conditional approach.

      The authors state in their Discussion that even without observable microwaves, a syn-Ca2+-indicator transduction strategy could still be problematic. This may be true, but they do not check this in their analysis, so it remains unknown.

      We agree with the reviewer and have now made this point clearer in the revised discussion (P11, L257-258).

      Reviewer #3 (Public Review):

      Weaknesses:

      I believe that the weaknesses of the manuscript are appropriately highlighted by the authors themselves in the discussion. I would, however, like to emphasize several additional points.

      As the authors state, the exact conditions that lead to Ca2+ micro-waves are unclear from this manuscript. It is also unclear if Ca2+ micro-waves are specific to GECI expression or if high-titer viral transduction of other proteins such as genetically encoded voltage indicators, static fluorescent proteins, recombinases, etc could also cause Ca2+ micro-waves.

      The high expression of other proteins has been shown to result in artefactual phenomena such as toxicity or fluorescent puncta (for GFP see Hechler et al. 2006; Katayama et al. 2008; for GEVI see Rühl et al. 2021), but we are not aware of reports of micro-waves. Although it is certainly possible that high expression levels of other proteins could lead to waves, we suspect the Ca2+ micro-waves observed in this preprint result from a dysregulation of Ca2+ homeostasis. This is not to suggest that voltage indicators could not result in micro-waves (e.g. Ca2+ homeostasis may be indirectly affected).

      The authors almost exclusively tested high titer (>5x10^12 vg/mL) large volume (500-1000 nL) injections using the synapsin promoter and AAV1 serotypes. It is possible that Ca2+ micro-waves are dramatically less frequent when titers are lowered further but still kept high enough to be useful for in vivo imaging (e.g. 1x10^12 vg/mL) or smaller injection volumes are used. It is also possible that Ca2+ micro-waves occur with high titer injections using other viral promoter sequences such as EF1α or CaMKIIα. There may additionally be effects of viral serotype on micro-wave occurrence.

      We agree with all points raised by the reviewer. Notably, we used viral transduction protocols with titers and volumes within the range of those previously used for viral transduction of GCaMP under the synapsin promoter (see P11 L269-275) and we observed Ca2+ micro-waves. As the reviewer suggested, we did find that lowering the titer is an important factor in reducing these Ca2+ micro-waves and there is likely a wide range of approaches that avoid the phenomenon. With regard to viral serotype, we show that micro-waves occurred across AAV1 and 9, but it is possible that other serotypes may avoid the phenomenon.

      We reiterate in the abstract of the revised manuscript that expression level is a crucial factor (P2, L40 and P2, L44-45) and now mention that other promoters and induction protocols that result in high Ca2+ indicator expression may result in Ca2+ micro-waves (P12, L291-294).

      The number of animals in any particular condition are fairly low (Table 1) with the exception of V1 imaging and thy1-GCaMP6 imaging. This prohibits rigorous comparison of the frequency of pathological calcium activity across conditions.

      We have now added 3 more animals with conditional GCaMP6 expression. In total, the study contains 34 animals with viral injection into the hippocampus from different laboratories and under different conditions resulting in multiple groups. As such we are cognizant of the resulting limitations for statistical evaluation.

      However, in light of the reviewer’s comment, we have now employed a generalized linear model tested on all the data to examine the relationship between the Ca2+ micro-wave incidence and the different factors. The multivariate GLM did find a significant relationship between Ca2+ micro-wave incidence and both viral dilution and weeks post injection (see below and revised manuscript P8, L189-193).

      For injections into CA1 in the hippocampus (n=28), a GLM found no relationship between Ca2+ micro-waves and each of the individual variables x (Ca-wave ~ x): viral dilution: z score = 1.14, deviance above null = 1.31, p = 0.254; post injection weeks: z score = 1.18, deviance above null = 1.44, p = 0.239; injection volume: z score = -0.76, deviance above null = 0.59, p = 0.45; construct: z score = 1.18, difference in deviance above null = 1.44, p = 0.239.

      However, a multivariable logistic GLM relating dilution and post injection weeks (Ca-wave ~ dilution + p.i_wks) showed that together both variables were significantly related to Ca2+ micro-waves (Deviance above null = 7.5; Dilution: z score = 2.18, p < 0.05; p.i_wks : z score = 2.22, p < 0.05).
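      For readers unfamiliar with the reported quantities, the sketch below shows how a z score and a deviance above null are obtained for a single-predictor logistic GLM of the form Ca-wave ~ x. It is a hedged illustration only: the response does not state which software was used, the Newton-Raphson fit here is a generic textbook procedure rather than the authors' code, and the example values of x and y are invented for demonstration and are not data from the study.

```python
import math


def _sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def _score_and_info(b0, b1, x, y):
    """Gradient and information matrix entries of the log-likelihood."""
    p = [_sigmoid(b0 + b1 * xi) for xi in x]
    w = [pi * (1.0 - pi) for pi in p]
    g0 = sum(yi - pi for yi, pi in zip(y, p))
    g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
    h00 = sum(w)
    h01 = sum(wi * xi for wi, xi in zip(w, x))
    h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    return p, g0, g1, h00, h01, h11


def fit_logistic(x, y, n_iter=50, tol=1e-10):
    """Fit y ~ x (binary y) by Newton-Raphson.

    Returns (intercept, slope, z score of slope, deviance above null).
    Assumes the data are not perfectly separable and 0 < mean(y) < 1.
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        _, g0, g1, h00, h01, h11 = _score_and_info(b0, b1, x, y)
        if abs(g0) + abs(g1) < tol:
            break
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information) * delta = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    p, _, _, h00, h01, h11 = _score_and_info(b0, b1, x, y)
    det = h00 * h11 - h01 * h01
    z = b1 / math.sqrt(h00 / det)  # Wald z: slope / standard error
    loglik = sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
                 for yi, pi in zip(y, p))
    pbar = sum(y) / len(y)
    null_loglik = len(y) * (pbar * math.log(pbar)
                            + (1.0 - pbar) * math.log(1.0 - pbar))
    # Deviance above null = null deviance - residual deviance.
    return b0, b1, z, 2.0 * (loglik - null_loglik)


# Invented illustration data (NOT from the study).
x_demo = [0, 1, 2, 3, 4, 5, 6, 7]
y_demo = [0, 0, 1, 0, 1, 0, 1, 1]
b0_demo, b1_demo, z_demo, dev_demo = fit_logistic(x_demo, y_demo)
```

      The multivariable model in the response adds a second predictor to the linear term; the same quantities are then read off per coefficient (z score) and for the model as a whole (deviance above null).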

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors):

      Results are straightforward and convincing. While a couple of ways to reduce the aberrant microwaves of calcium responses were demonstrated, delving into the functions of interneurons is crucial for a more comprehensive understanding of cellular causality.

      As mentioned in the public response, disentangling cellular mechanism from technical requirements will need a large and systematic study. To determine the contribution from interneurons, the use of specific interneuron promoters would be required, and viral titers systematically varied to result in similar cellular GCaMP expression levels as seen under the synapsin promoter condition.

      Reviewer #2 (Recommendations For The Authors):

      Do the authors think the cells are firing when they participate in a micro-wave, or do they think the calcium influx is due to something else? A discussion point on this would be good.

      This is an excellent point raised by the reviewer. We do not know if the elevated cellular Ca2+ during the artifactual Ca2+ micro-wave reflects action potential firing or an increase of Ca2+ from intracellular stores. As already described in the text of the preprint, their optical spatiotemporal profile neither fits with known microseizure progression patterns, nor with spreading depolarization/depression. We have adopted the reviewer’s suggestion and added the following point to the discussion section in the revised preprint (P12, L308-315):

      In a limited dataset, we attempted to detect the Ca2+ micro-waves by hippocampal LFP recordings (using a conventional insulated Tungsten wire, diameter ~110µm). We could not identify a specific signature, e.g. ictal activity or LFP depression, which may correspond to these Ca2+ micro-waves. The crucial shortcoming of this experiment of course is that with these LFP recordings, we could not simultaneously perform hippocampal 2-photon microscopy. Thus, it is uncertain if the Ca2+ micro-waves indeed occurred in proximity to our electrode.

      The results seem to suggest that micro-waves may involve interneurons as their CaMKII-Cre strategy avoids waves - possibly due to a lack of expression of GECIs in interneurons. It would be great to hear the author's thoughts on this and add a brief discussion point.

      As mentioned in the public response to Reviewer 1, it is difficult to disentangle cellular mechanisms from technical requirements, and the exact requirements for the Ca2+ micro-waves to occur are still not fully clear. The absence of Ca2+ micro-waves in our CaMKII-Cre dataset may indeed reflect the requirement of interneurons. However, it could just as well be due to a sparse labelling of principal cells or simply reflect differences in the expression levels of GCaMP under the different promoters.

      All in all, a more complete understanding of the requirements of such Ca2+ micro-waves will require a community effort. Therefore, it is important that each group check the safety profile of their GECI and report problems to the community.

      We have added these points to the revised preprint (P12, L291 and P12, L298).

      Plotting the incidence of micro-waves as a function of the age of mice would be a nice addition (the authors have the data).

      There was no relationship of Ca2+ micro-wave occurrence or frequency with age over the range of 5-79 wks (see public response) and this has been added to the preprint (P14, L354).

      Reviewer #3 (Recommendations For The Authors):

      I appreciate the authors raising the awareness of this issue. I had personally observed micro-waves in my own data as well. In agreement with their findings, I found that the occurrence of micro-waves was dramatically lower when I reduced the viral titer. Anecdotally, I also observed voltage micro-waves when virally transducing genetically encoded voltage indicators at similar titers. For that reason, I am skeptical that this issue is exclusive to GECIs.

      We find it interesting that the reviewer has also seen artefactual micro-waves following viral transduction of genetically encoded voltage indicators. Without seeing the voltage waves the referee is referring to or the conditions, it is of course difficult to compare with the Ca2+ micro-waves we report. However, this comment again raises the question of mechanism. We believe that in the GECI framework, Ca2+ homeostatic aspects are important. Voltage indicators are based on different sensor mechanisms, and expressed in the cell membrane, but it may very well be that there are overlapping factors between Ca2+ and voltage indicators that could trigger a similar, or even the same phenomenon in the end.

      Minor comments:

      (1) Line 131-132: I believe the authors only tested for micro-waves in V1. This should be made clear in the results. It could be that micro-waves could occur in other parts of cortex with the same viral titers.

      Both V1 and somatosensory cortex were tested, as described in the methods (P15, L395-397); we have made this clearer in the revised preprint (P6, L138).

      (2) There are no statistics associated with the data from Fig 1e.

      We have now added statistics (P5, L126).

      (3) The authors may be able to make a stronger claim about the pathological nature of the micro-waves if there are differences in the histology between the injected and non-injected hemispheres. For example, is there evidence of widespread cell death in the injected hemisphere (e.g. lower cell count, smaller hippocampal volume, caspase staining, etc).

      We found no evidence of gross morphological changes to the hippocampus following viral transduction with no changes in CA1 pyramidal cell layer thickness or CA1 thickness (pyramidal cell layer thickness: 49 ± 12.5 µm ipsilateral and 50.3 ± 11.1 µm contralateral, n=4, Student’s t-test p=0.89; CA1 thickness: 553.3 ± 14 µm ipsilateral and 555.8 ± 62 µm contralateral, n = 4, Student’s t-test p=0.94; 48 ± 13 weeks post injection at time of perfusion).

      We have added this to the preprint (P5, L117-122).

      (4) The broader micro-waves in the stratum oriens versus the stratum pyramidale are likely due to the spread of the basal dendrites of pyramidal cells. If the typical size of the basal dendritic arbor of CA1 pyramidal neurons is taken into account, does this explain the wider calcium waves in this layer?

      This is a great point and we completely agree. It is likely that the active neuropil (including the dendritic arbour) contributes to the apparently broader diameter. In addition, as is evident in video 5, cell somata in the stratum oriens (possibly interneurons) are active and their processes also contribute.

      We have now mentioned these points in the revised preprint (P5, L132).

      (5) Lines 179-181: Is the difference in the prevalence of micro-waves between viral titers statistically significant?

      Although we have a large number of animals in total (n=34) with viral injection into the hippocampus, the number of animals in each condition, given the many factors, is low. We therefore used a generalized linear model to test the relationship between the Ca2+ micro-waves and the variables.

      We have now added this analysis to the revised preprint (P8, L189-193).

      (6) Lines 200-203: The CA3 micro-waves were only observed at one institution. The current wording is slightly misleading.

      We agree and have changed this to be clearer (P9 L216).

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This work describes the mechanism of protein disaggregation by the ClpL AAA+ protein of Listeria monocytogenes. Using several model substrate proteins the authors first show that ClpL possesses a robust disaggregase activity that does not further require the endogenous DnaK chaperone in vitro. In addition, they found that ClpL is more thermostable than the endogenous L. monocytogenes DnaK and has the capacity to unfold tightly folded protein domains. The mechanistic basis for the robust disaggregase activity of ClpL was also dissected in vitro and in some cases, supported by in vivo data performed in chaperone-deficient E. coli strains. The data presented show that the two AAA domains, the pore-2 site and the N-terminal domain (NTD) of ClpL are critical for its disaggregase activity. Remarkably, grafting the NTD of ClpL to ClpB converted ClpB into an autonomous disaggregase, highlighting the importance of such a domain in the DnaK-independent disaggregation of proteins. The role of the ClpL NTD domain was further dissected, identifying key residues and positions necessary for aggregate recognition and disaggregation. Finally, using sets of SEC and negative staining EM experiments combined with conditional covalent linkages and disaggregation assays the authors found that ClpL shows significant structural plasticity, forming dynamic hexameric and heptameric active single rings that can further form higher assembly states via their middle domains.

      Strengths:

      The manuscript is well-written and the experimental work is well executed. It contains a robust and complete set of in vitro data that push further our knowledge of such important disaggregases. It shows the importance of the atypical ClpL N-terminal domain in the disaggregation process as well as the structural malleability of such AAA+ proteins. More generally, this work expands our knowledge of heat resistance in bacterial pathogens.

      Weaknesses:

      There is no specific weakness in this work, although it would have helped to have a drawing model showing how ClpL performs protein disaggregation based on their new findings. The function of the higher assembly states of ClpL remains unresolved and will need further extensive research. Similarly, it will be interesting in the future to see whether the sole function of the plasmid-encoded ClpL is to cope with general protein aggregates under heat stress.

      We thank the reviewer for the positive evaluation. We agree with the reviewer that it will be important to test whether ClpL can bind to and process non-aggregated protein substrates. Our preliminary analysis suggests that the disaggregation activity of ClpL is most relevant in vivo, pointing to protein aggregates as main target.

      We also agree that the role of dimers or tetramers of ClpL rings needs to be further explored. Our initial analysis suggests a function of ring dimers as a resting state. It will now be important to study the dynamics of ClpL assembly formation and test whether substrate presence shifts ClpL assemblies towards an active, single ring state.

      Reviewer #2 (Public Review):

      The manuscript by Bohl et al. is an interesting and carefully done study on the biochemical properties and mode of action of potent autonomous AAA+ disaggregase ClpL from Listeria monocytogenes. ClpL is encoded on plasmids. It shows high thermal stability and provides Listeria monocytogenes food-pathogen substantial increase in resistance to heat. The authors show that ClpL interacts with aggregated proteins through the aromatic residues present in its N-terminal domain and subsequently unfolds proteins from aggregates translocating polypeptide chains through the central pore in its oligomeric ring structure. The structure of ClpL oligomers was also investigated in the manuscript. The results suggest that mono-ring structure and not dimer or trimer of rings, observed in addition to mono-ring structures under EM, is an active species of disaggregase.

      Presented experiments are conclusive and well-controlled. Several mutants were created to analyze the importance of a particular ClpL domain.

      The study's strength lies in the direct comparison of ClpL biochemical properties with autonomous ClpG disaggregase present in selected Gram-negative bacteria and well-studied E. coli system consisting of ClpB disaggregase and DnaK and its cochaperones. This puts the obtained results in a broader context.

      We thank the reviewer for the detailed comments. There are no specific weaknesses indicated in the public review.

      Reviewer #3 (Public Review):

      Summary:

      This manuscript details the characterization of ClpL from L. monocytogenes as a potent and autonomous AAA+ disaggregase. The authors demonstrate that ClpL has potent and DnaK-independent disaggregase activity towards a variety of aggregated model substrates and that this disaggregase activity appears to be greater than that observed with the canonical DnaK/ClpB co-chaperone. Furthermore, Lm ClpL appears to have greater thermostability as compared to Lm DnaK, suggesting that ClpL-expressing cells may be able to withstand more severe heat stress conditions. Interestingly, Lm ClpP can provide thermotolerance to E. coli that have been genetically depleted of either ClpB or in cells expressing a mutant DnaK103. The authors further characterized the mechanisms by which ClpL interacts with protein aggregates, identifying that the N-terminal domain of ClpL is essential for disaggregase function. Lastly, by EM and mutagenesis analysis, the authors report that ClpL can exist in a variety of larger macromolecular complexes, including dimer or trimers of hexamers/heptamers, and they provide evidence that the N-terminal domains of ClpL prevent dimer ring formation, thus promoting an active and substrate-binding ClpL complex. Throughout this manuscript the authors compare Lm ClpL to ClpG, another potent and autonomous disaggregase found in gram-negative bacteria that have been reported on previously, demonstrating that these two enzymes share homologous activity and qualities. Taken together this report clearly establishes ClpL as a novel and autonomous disaggregase.

      Strengths:

      The work presented in this report amounts to a significant body of novel and significant work that will be of interest to the protein chaperone community. Furthermore, by providing examples of how ClpL can provide in vivo thermotolerance to both E. coli and L. gasseri the authors have expanded the significance of this work and provided novel insight into potential mechanisms responsible for thermotolerance in food-borne pathogens.

      Weaknesses:

      The figures are clearly depicted and easy to understand, though some of the axis labeling is a bit misleading or confusing and may warrant revision. While I do feel that the results and discussion as presented support the authors' hypothesis and overall goal of demonstrating ClpL as a novel disaggregase, interpretation of the data is hindered as no statistical tests are provided throughout the manuscript. Because of this only qualitative analysis can be made, and as such many of the concluding statements involving pairwise comparisons need to be revisited or quantitative data with stats needs to be provided. The addition of statistical analysis is critical and should not be difficult, nor do I anticipate that it will change the conclusions of this report.

      We thank the reviewer for the valid criticism. We addressed the major concern of the reviewer and added the requested statistical analysis to all relevant figures. The analysis confirms our conclusions. We also followed the advice of the reviewer and revised axis labeling to increase clarity.

      Reviewer #1 (Recommendations For The Authors):

      • It would really help to have a model showing how ClpL performs protein disaggregation based on their findings.

      We show that ClpL exerts a threading activity that is fueled by ATP hydrolysis in both AAA domains and executed by pore-located aromatic residues. The basic disaggregation mechanism of ClpL therefore does not differ from ClpB and ClpG disaggregases. Similarly, the specificity of ClpL towards protein aggregates is based on simultaneous interactions of multiple N-terminal domains with the aggregate surface. We could recently describe a similar mode of aggregate recognition for ClpG [1]. We therefore prefer not to add a model to the manuscript. We are currently in preparation of a review that includes the characterization of the novel bacterial disaggregases and will present models there as we consider a review article as more appropriate for such illustrations.

      • AAA2 domain of ClpL in Fig 3E should be the same color as in Fig 1A.

      We used light grey instead of dark grey for the ClpL AAA2 domain in Fig 3E, to distinguish between ClpL and ClpB AAA domains. This kind of illustration allows for clearer separation of both AAA+ proteins and the fusion construct LN-ClpB*. We therefore prefer keeping the color code.

      • Partial suppression of the dnaK mutant could be added in the main manuscript Figure.

      The main figure 3 is already very dense and we therefore prefer showing respective data as part of a supplementary figure.

      • It would have been interesting to know if the robust autonomous disaggregation activity of ClpL would be sufficient to rescue the growth of more severe E. coli chaperone mutants, like dnaK tig for example. Did the authors test this?

      We tested whether expression of clpL can rescue growth of E. coli dnaK103 mutant cells at 40°C on LB plates. This experiment is different from the restoration of heat resistance in dnaK103 cells (Figure 3, figure supplement 2A), as continuous growth at elevated temperatures (40°C) is monitored instead of cell survival upon abrupt severe heat shock (49°C). We did not observe rescue of the temperature-sensitive growth phenotype (40°C) of dnaK103 cells upon clpL expression, though expression of clpG complemented the temperature-sensitive growth phenotype (see Author response image 1 below). This finding points to differences in chaperone activities of ClpL and ClpG. It also suggests that ClpL activity is largely restricted to heat-shock generated protein aggregates, enabling ClpL to complement the missing disaggregation function of DnaK but not other Hsp70 activities including folding and targeting of newly synthesized proteins. We believe that dissecting the molecular reasons for differences in ClpG and ClpL complementation activities should be part of an independent study and prefer showing the growth-complementation data only in the response letter.

      Author response image 1.

      Serial dilutions (10⁻¹ – 10⁻⁶) of E. coli dnaK103 mutant cells expressing E. coli dnaK, L. monocytogenes clpL or P. aeruginosa clpG were spotted on LB plates including the indicated IPTG concentrations. Plates were incubated at 30°C or 40°C for 24 h. p: empty vector control.

      Reviewer #2 (Recommendations For The Authors):

      Based on results presented in Fig. 2B the authors conclude "that stand-alone disaggregases ClpL and ClpG but not the canonical KJE/ClpB disaggregase exhibit robust threading activities that allow for unfolding of tightly folded domains" (page 5 line 209). In this experiment, the threading power of disaggregases was assessed by monitoring YFP fluorescence during the disaggregation of aggregates formed by fusion luciferase-YFP protein. In my opinion, the results of the experiment depend not only on the threading power of disaggregases but also on the substrate recognition by analyzed disaggregating systems and/or processivity of disaggregases. N-terminal domain in the case of ClpL and KJE chaperones in the case of the KJE/ClpB system are involved in recognition. This is not discussed in the manuscript and the obtained result might be misinterpreted. The authors have created the LN-ClpB* construct (N-terminal domain of ClpL fused to derepressed ClpB) (Fig. 3 E and F). In my opinion, this construct should be used as an additional control in the experiment in Fig. 2 B. It possesses the same substrate recognition domain and therefore the direct comparison of disaggregases threading power might be possible.

      We performed the requested experiment (new Figure 3 - figure supplement 2D). We did not observe unfolding of YFP by LN-ClpB. Since ClpL and LN-ClpB do not differ in their aggregate targeting mechanisms, this finding underlines the differences in threading power between ClpL and activated (derepressed) ClpB. It also suggests that the AAA threading motors and the aggregate-targeting NTD largely function independently.

      Presented results suggest that tetramer and dimer of rings might be a "storage form" of disaggregase. It would be interesting to analyze the thermotolerance and/or phenotype of ClpL mutants that do not form tetramer and dimer (E352A). This variant possesses similar to WT disaggregation activity but does not form dimers and tetramers. If in vivo the differences are observed (for example toxicity of the mutant), the "storage form" hypothesis will be probable.

      When testing expression of clpL-MD mutants (E352A, F354A), which cannot form dimers and tetramers of ClpL rings, in E. coli ∆clpB cells, we observed reduced production levels as compared to ClpL wildtype and speculated that reduced expression might be linked to cellular toxicity. We therefore compared spotting efficiencies of E. coli ∆clpB cells expressing clpL, ∆N-clpL or the clpL-MD mutants at different temperatures. Expression of clpL at high levels abrogated colony formation at 42°C (new Figure 6 - figure supplement 3). ClpL toxicity was dependent on its NTD as no effect was observed upon expression of ∆N-clpL. ClpL-MD mutants (E352A, F354A) were expressed at much lower levels and exhibited strongly increased toxicity as compared to ClpL-WT when produced at comparable levels (new Figure 6 – figure supplement 3). This implies a protective role of ClpL ring dimers and tetramers in the cellular environment by downregulating ClpL activity. We envision that the formation of ClpL assemblies restricts accessibility of the ClpL NTDs and reduces substrate interaction. Increased toxicity of ClpL-E352A and ClpL-F354A points to a physiological relevance of the dimers and tetramers of ClpL rings and is in agreement with the proposed function as storage forms. We added this potential role of ClpL ring assemblies to the discussion section. Due to the strongly reduced production levels of ClpL MD mutants and their enhanced toxicity at elevated temperatures we did not test for their ability to restore thermotolerance in E. coli ∆clpB cells.

      Figure 6G and Figure 6 -figure supplement 2 - it is not clear what is the difference in the preparation of WT and WTox forms of ClpL.

      ClpL WT was purified under reduced conditions (+ 2 mM DTT), whereas WTox was purified in absence of DTT, thus serving as control for ClpL-T355C, which forms disulfide bonds upon purification without DTT. We have added respective information to the figure legend and the materials and methods section.

      Page 5 line 250 - wrong figure citation. Instead of Figure 1 - Figure Supplement 2A should be Figure 3 - Figure Supplement 2A.

      Page 5 line 251 - wrong figure citation. Instead of Figure 1 - Figure Supplement 2B/C should be Figure 3 - Figure Supplement 2B/C.

      Page 7 line 315 - wrong figure citation. Instead of Figure 4F, it should be Figure 4G.

      Figure 1 - Figure Supplement 2E - At first glance, this Figure does not correspond to the text and is confusing. It would be nice to have bars for Lm ClpL activity in the figure. Alternatively, the description of the y-axis might be changed to "relative to Lm ClpL disaggregation activity" instead of "relative disaggregation activity". One has to carefully read the figure legend to find out that 1 corresponds to Lm ClpL activity.

      We have corrected all mistakes and changed the description of y-axis (Figure 1 - figure Supplement 2E) as suggested.

      Reviewer #3 (Recommendations For The Authors):

      (1) While the authors make many experimental comparisons throughout their study, no statistical tests are described or presented with their results or figures, nor are these statistical tests described in the methods. While the data as presented does appear to support the author's conclusions, without these statistical tests no meaningful conclusions from paired analysis can be drawn. Critically, please report these statistical tests. As a general suggestion please include the statistics (p-values) in the results section when presenting this data, as well as in the figure legends, as this will allow the reader to better understand the authors' presentation and interpretation of the data.

      We have added statistical tests to all relevant figures. The analysis confirms our earlier statements. We have further clarified our approach for the statistical analysis in the methods section. We report p-values in the results section. However, due to the volume of comparisons, we did not add individual p-values to the figure legends but used standard labeling with stars.
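
      As an illustration of the star labeling mentioned above, the following minimal Python sketch maps a p-value to a star label. The thresholds (0.05, 0.01, 0.001) are a widely used convention and are an assumption here; the response does not state which cut-offs or which statistical test were applied. The function name `significance_stars` is hypothetical.

```python
def significance_stars(p_value: float) -> str:
    """Map a p-value to a star label.

    The thresholds below are an assumed, commonly used convention;
    they are not taken from the manuscript.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie between 0 and 1")
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return "ns"  # not significant at the assumed 0.05 level
```

      Whatever thresholds are actually used should be stated once in the methods section so that the stars in each figure can be read without individual p-values.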

      (2) Some of the axis labels for the presented graphs are a bit misleading or confusing. Many describe a relative (%) disaggregation rate, but it is not clear from the methods or figure legends what this rate is relative to. Is it relative to non-denatured substrates, to no chaperone conditions, etc.? Is it possible to present the figures with the raw data rates/activity (ex. luciferase activity / time) vs. relative rates? I think that labeling these figure axes with "disaggregation rate" is a bit misleading as none of these experiments measure the actual rate of disaggregation of these model substrates per se (say by SEC-MALS or other biophysical measurements), but instead infer the extent of disaggregation by measuring a property of these substrates, i.e. luciferase activity or fluorescence intensity over time. Thus, labeling these figures with the appropriate axis for what is being measured, and then clarifying in the methods and results what is being inferred by these measurements, will help solidify the author's conclusions.

      Relative (%) disaggregation rate usually refers to the disaggregation activity of ClpL wildtype serving as reference. We clarified this point in the revised text and respective figure legends. We now also refer to the process measured (e.g. relative refolding activity of aggregated Luciferase instead of relative disaggregation activity) as suggested by the reviewer and added clarifications to text and materials and methods.

      Since we have many measurements for our most frequently used assays and have a reasonable estimate for the general variance within these assays, we found it reasonable to show activity data in relation to fixed controls. This reduces the impact of unspecific variance and thereby allows more accurate comparisons between different repetitions. The reference is now indicated in the axis title.
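
      The normalization to a fixed reference described here can be sketched as follows. This is a minimal illustration, assuming that each activity is expressed as a percentage of the mean activity of the reference (for example ClpL wildtype); the function name `relative_activity` and the input values are hypothetical and are not data from the manuscript.

```python
from statistics import mean


def relative_activity(rates, reference_rates):
    """Express raw assay rates as a percentage of the mean reference rate.

    `rates` are raw readouts for a sample (e.g. luciferase refolding per time);
    `reference_rates` are the readouts of the reference (e.g. ClpL wildtype).
    """
    reference_mean = mean(reference_rates)
    if reference_mean == 0:
        raise ValueError("reference activity must be non-zero")
    return [100.0 * rate / reference_mean for rate in rates]


# Made-up illustration values, not measurements from the study.
mutant_percent = relative_activity([5.0, 10.0], [10.0, 10.0])
```

      With this convention the reference itself sits at 100 %, which is why stating the reference in the axis title removes the ambiguity raised by the reviewer.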

      (3) The figures are well presented, clutter-free, and graphically easy to understand. Figure legends have sufficient information aside from the aforementioned statistical information and should include the exact number of independent replicates for each panel/experiment (ex. n=4), not just a greater than 3. While the figures do show each data point along with the mean and error, in some figures it is difficult to determine the number of replicate data points. Example figures 2c, 2d, and 3a. Also, please state whether the error is std. error or SEM.

      While we agree that this is valuable information, we fear that overloading the figure legends with information may take a toll on readability. We therefore decided to list the number of replicates for each experiment in a separate supplementary table (Table S2). The depicted error is the SD and not the SEM, which we have also specified in the figure legends.
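
      To make the distinction between SD and SEM concrete, the short Python sketch below computes both from the same replicate values. It assumes the sample standard deviation (n - 1 denominator) and SEM = SD / sqrt(n); the function name `sd_and_sem` and the example values are hypothetical.

```python
from math import sqrt
from statistics import stdev


def sd_and_sem(values):
    """Return (sample SD, SEM) for a list of replicate measurements."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two replicates are needed")
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    sem = sd / sqrt(n)  # standard error of the mean
    return sd, sem


# Made-up replicate values for illustration only.
sd, sem = sd_and_sem([2.0, 4.0, 4.0, 6.0])
```

      Because the SEM shrinks with the number of replicates while the SD does not, reporting which of the two is plotted, together with n, is needed to interpret the error bars.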

      (4) There are various examples throughout the results where qualitative descriptors are used to describe comparisons. Examples of this are "hardly enhanced" (Figure 1) and "partially reduced" (Figure 6). While this is not necessarily wrong, qualitative descriptions of comparisons in this manner would require further explanation. What is the definition of "hardly" or "partially"? My recommendation is to just state the data quantitatively, such as "% enhanced" or "reduced by x", this way there is no misinterpretation. Examples of this can be found in Figures 6C-G. This would require a full statistical overview and presentation of these stats in the results.

      We followed the reviewer's advice and no longer use the terms criticized (e.g. “hardly enhanced”). We instead provide the requested quantifications in the text.

      Questions for Figures:

      Figures 1B and 1C:

      (1) Is the disaggregase activity of ClpL towards heat-denatured luciferase and GFP ATP-dependent? While the authors later in the manuscript show that mutations within the Walker B domains dramatically impair reactivation (disaggregation) of denatured luciferase, this does not rule out an ATP-independent effect of these mutations. Thus, the authors should test whether disaggregase activity is observed when wild-type ClpL is incubated with denatured substrates without ATP present or in the presence of ADP only.

      We tested for ClpL disaggregation activity in absence of nucleotide and presence of ADP only (new Figure 1 – figure supplement 2A). We did not observe any activity, demonstrating that ClpL activity depends on ATP binding and hydrolysis (see also Figure 3 – figure supplement 1D: ATPase-deficient ClpL-E197A/E530A is lacking disaggregation activity).

      (2) The authors suggest that a reduction in disaggregase activity observed in samples combining Lm ClpL and KJE (Figure 1C, supp. 1C-E) could be due to competition for protein aggregate binding as observed previously with ClpG. Did the authors test this directly by pulldown assay or another interaction-based assay? While ClpL and ClpG appear to work in a similar manner, it would be good to confirm this. Also, clarification on how this competition operates would be useful. Is it that ClpL prevents aggregates from interacting with KJE, or vice versa?

      We probed for binding of ClpL to aggregated Malate Dehydrogenase in the presence of L. monocytogenes or E. coli Hsp70 (DnaK + respective J-domain protein DnaJ) by a centrifugation-based assay. Here, we used the ATPase-deficient ClpL-E197A/E530A (ClpL-DWB) mutant, ensuring stable substrate interaction in the presence of ATP. We observe reduced binding of ClpL-DWB to protein aggregates in the presence of DnaK/DnaJ (new Figure 1 – figure supplement 2G). This finding indicates that both chaperones compete for binding to aggregated proteins and explains the inhibition of ClpL disaggregation activity in the presence of Hsp70.

      (3) Related to the above, while incubation of aggregated substrates with ClpL and KJE does appear to reduce aggregase activity towards GFP (Figure 1c), α-glucosidase (Supp. 1C), and MDH (Supp. 1D), this doesn't appear to be the case towards luciferase (Figure 1b, Supp. 1b). Furthermore, ClpL aggregase activity is reduced towards luciferase when combined with E. coli KJE (Supp. 1e) but not with Lm KJE (Figure 1b). The authors provide no commentary or explanation for these observations. Furthermore, these results complicate the concluding statement that "combining ClpL with Lm KJE always led to a strong reduction in disaggregation activity ... ".

      We suggest that the differing inhibitory degrees of the KJE system on ClpL disaggregation activities reflect diverse binding affinities of KJE and ClpL to the respective aggregates. While we usually observe strong inhibition of ClpL activity in presence of KJE, this is different for aggregated Luciferase. This points to specific structural features of Luciferase aggregates or the presence of distinct binding sites on the aggregate surface that favour ClpL binding. We have added a respective comment to the revised manuscript.

      The former statement that “combining ClpL with Lm KJE always led to a strong reduction in disaggregation activity” referred to aggregated GFP, MDH and α-Glucosidase for which a strong inhibition of ClpL activity was observed. We have specified this point.

      Figures 1D and 1E:

      (1) The authors conclude that the heat sensitivity of ΔClpL L. gasseri cells is because they do not express the canonical ClpB disaggregase. A good test to validate this would be to express KJE/ClpB in these Lg ΔClpL cells to see if heat-sensitivity could be fully or partially rescued.

      We agree that such an experiment would further strengthen the in vivo function of ClpL as alternative disaggregase. However, such an approach would require co-expression of E. coli ClpB with the authentic E. coli DnaK chaperone system (KJE), as ClpB and DnaK cooperate in a species-specific manner [2-4]. This makes the experiment challenging, also because the individual components need to be expressed at the correct stoichiometry. Furthermore, the presence of the authentic L. gasseri KJE system, which is likely competing with the E. coli KJE system for aggregate binding, will hamper E. coli KJE/ClpB disaggregation activity in L. gasseri. In view of these limitations, we would like to refrain from conducting such an experiment.

      (2) The rationale for investigating Lg ClpL, and the aggregase activity assays are compelling and support the hypothesis that ClpL contributes to thermotolerance in multiple Gram-positive species. Though, from Figure 1d, why was only Lg ClpL investigated? It appears that S. thermophilus also lacks the canonical ClpB disaggregase and demonstrates ΔClpL heat sensitivity. There is also other Lactobacillus sp. presented that lack ClpB but were not tested for heat sensitivity. Why only test and move forward with L. gasseri? Lastly, L. mesenteroides is ClpB-negative but doesn't demonstrate ΔClpL heat sensitivity. Why?

      We wanted to document high, partner-independent disaggregation activity for another ClpL homolog. We chose L. gasseri, as (i) this bacterial species lacks a ClpB homolog and (ii) a ∆clpL mutant exhibits reduced survival upon severe heat shock (thermotolerance phenotype), which is associated with defects in cellular protein disaggregation. The characterization of L. gasseri ClpL as a potent disaggregase in vitro represents a proof-of-concept and allows us to generalize our conclusion. We therefore did not further test S. thermophilus ClpL. L. mesenteroides encodes ClpL but not ClpB, yet a ∆clpL mutant has not been characterized in this species to the best of our knowledge. As we wanted to link ClpL in vitro activity with an in vivo phenotype, we did not characterize L. mesenteroides ClpL.

      We agree with the reviewer that the characterization of additional ClpL homologs is meaningful and interesting. However, we strongly believe that such analysis should be part of an exhaustive and independent study.

      Figures 2A and 2B:

      (1) Figure 2B demonstrates that both ClpL and ClpG, but not the canonical KJE/ClpB, are able to unfold YFP during the luciferase disaggregation process, suggesting that ClpL and ClpG exhibit stronger threading activity. A technical question, can luciferase activity be measured alongside in the same assay sample? If so, would you expect to observe a concomitant increase in luciferase activity as YFP fluorescence decreases?

      KJE/ClpB can partially disaggregate and refold aggregated Luciferase-YFP without unfolding YFP during the disaggregation reaction [5]. YFP unfolding is therefore not linked to refolding of aggregated Luciferase-YFP. On the other hand, unfolding of YFP during disaggregation can hamper the refolding of the fused Luciferase moiety as observed for the AAA+ protein ClpC in presence of its partner MecA [5]. These diverse effects make the interpretation of Luciferase-YFP refolding experiments difficult as the degree of YFP unfolding activity does not necessarily correlate with the extent of Luciferase refolding. We therefore refrained from performing the suggested experiment.

      Figure 2C and 2D:

      (1) Thermal shift assays for ClpL, ClpG, and DnaK were completed with various nucleotides. Were these experiments also completed with samples in their nucleotide-free apo state? Also, while all these chaperones are ATPases, the nucleotides used differ, but no explanation is provided. Comparison should be made of these ATPases bound to the same molecules.

      We did not monitor thermal stabilities of chaperones without nucleotide as such a state is likely not relevant in vivo. We used ATPγS in the case of ClpL to keep the AAA+ protein in the ATP conformation. ATP would be rapidly converted to ADP due to the high intrinsic ATPase activity of ClpL. In the case of DnaK, ATPγS cannot be used as it does not induce the ATP conformation [6]. The low intrinsic ATPase activity of DnaK allows determining the thermal stability of its ATP conformation in presence of ATP. This is confirmed by calculating a reduced thermal stability of ADP-bound DnaK.

      (2) The authors suggest that incubation at 55°C will cause unfolding of Lm DnaK, but not ClpL, providing ClpL-positive Lm cells disaggregase activity at 55°C. While the thermal shift assays in Figures 2C and 2D support this, an experiment to test this would be to heat-treat Lm DnaK and ClpL at 55°C then test for disaggregase activity using either aggregated luciferase or GFP as in Figure 1.

      We followed the suggestion of the reviewer and incubated Lm ClpL and DnaK at 55-58°C in presence of ATP for 15 min prior to their use in disaggregation assays. We compared the activities of pre-heated chaperones with controls that were incubated at 30°C for 15 min. Notably, we did not observe a loss of DnaK disaggregation activity, suggesting that thermal unfolding of DnaK at this temperature is reversible. We provide these data as Figure 2 -figure supplement 1 and added a respective statement to the revised manuscript.

      Figure 3B:

      (1) The authors state that ATPase activity of ΔN-ClpL was "hardly affected", but from the data provided it appeared to result in an approximate 35% reduction. As discussed above, no stats are provided for this figure, but given the error bars, it is highly likely that this reduction is significant. Please perform this statistical test, and if significant, please reflect this in the written results as well as the figure. Lastly, if this reduction in ATPase activity is significant, why would this be so, and could this contribute to the reduction in aggregase activity towards luciferase and MDH observed in Figure 3A?

      We applied statistical tests as suggested by the reviewer, showing that the reduction in ATPase activity of ∆N-ClpL is statistically significant. N-terminal domains of Hsp100 proteins can modulate ATPase activity as shown for the family member ClpB, functioning as auxiliary regulatory element for fine tuning of ClpB activity [7]. We speculate that the impact of the ClpL-NTD on the assembly state (stabilization of ClpL ring dimers) might affect ClpL ATPase activity. We would like to point out that other ClpL mutants (e.g. NTD mutant ClpL-Y51A; MD mutant ClpL-F354A) have a similarly reduced ATPase activity, yet exhibit substantial disaggregation activity (approx. 2-fold reduced compared to ClpL wildtype). In contrast, ∆N-ClpL does not exhibit any disaggregation activity. This suggests that the loss of disaggregation activity is caused by a substrate binding defect but not by a partial reduction in ATPase activity. We added a comment on the reduced ATPase activity and also discuss its potential reasons in the discussion section.

      (2) I think the authors' conclusion that deletion of the ClpL NTD does not contribute to structural defects of ClpL is premature given the apparent reduction in ATPase activity. Did the authors perform any biophysical analysis of ΔN-ClpL to confirm this conclusion? Thermal shift assays, Native-PAGE, or size-exclusion chromatography for aggregates would all be good assays to demonstrate that the wild-type and ΔN-ClpL have similar structural properties. Surprisingly, Figure 6 describes significant macromolecular changes associated with ΔN-ClpL such that it preferentially forms a dimer of rings. Furthermore, in Supp. Figure 6D the authors report that ΔN-ClpL appears to have an increased Tm as compared to WT- or ΔM-ClpL. The authors should reflect these observations as deletion of the ClpL NTD does appear to contribute to structural changes, though perhaps only at the macromolecular scale, i.e. dimerization of the rings.

      We have characterized the oligomeric state of ∆N-ClpL by size exclusion chromatography (Figure 6 – figure supplement 1A) and negative staining electron microscopy (Figure 6C), both showing that it forms assemblies similar to ClpL wildtype. We did not observe an increased tendency of ∆N-ClpL to form aggregates and the protein remained fully soluble after several cycles of thawing and freezing. EM data reveal that ∆N-ClpL exclusively forms ring dimers, suggesting that the NTDs destabilize MD-MD interactions. The stabilized interaction between two ∆N-ClpL rings can explain the increased thermal stability (Figure 6 – figure supplement 1D). We speculate that the ClpL NTDs either affect MD-MD interactions through steric hindrance or by directly contacting MDs. We have added a respective statement to the discussion section.

      Figure 3C and 3D:

      (1) Given the larger error in samples expressing ClpG (100) or ClpL (100) statistical analysis with p-values is required to make conclusions regarding the comparison of these samples vs. plasmid-only control. The effect of ΔN-ClpL vs. wild-type ClpL looks compelling and does appear to attenuate the ClpL-induced thermotolerance. This is nicely demonstrated in Figure 3D.

      We quantified the respective spot tests (new Figure 3E) and tested for statistical significance as suggested by the reviewer. We show that restoration of heat resistance is significant for the first 30 min. While we always observe rescue at later timepoints, significance is lost there due to larger deviations in the number of viable cells and thus in the degree of complementation.

      Figure 3F:

      (1) What is the role of the ClpB NTD? It appears to be dispensable for disaggregase activity, assuming that ClpB is co-incubated with KJE. A quick explanation of this domain in ClpB could be useful.

      The ClpB NTD is not required for disaggregation activity, as ClpB is recruited to protein aggregates by DnaK, which interacts with the ClpB MDs. Still, two functions have been described for the ClpB NTD. First, it can bind soluble unfolded substrates such as casein [8]. This substrate binding function can increase ClpB disaggregation activity towards some aggregated model substrates (e.g. Glucose-6-phosphate dehydrogenase) [9]. However, NTD deletion usually does not decrease ClpB disaggregation activity and can even lead to an increase [7, 10, 11]. An increased disaggregation activity of ∆N-ClpB correlates with an enhanced ATPase activity, which is explained by NTDs stabilizing a repressing conformation of the ClpB MDs, which function as main regulators of ClpB ATPase activity [7]. We added a short description on the role of the ClpB NTD to the respective results section.

      (2) The result of fusing the ClpL NTD to ClpB supports a role for this NTD in promoting autonomous disaggregase activity. What would you expect to observe if the fused Ln-ClpB protein was co-incubated with KJE? Would this further promote disaggregase activity, or potentially impair through competition? This experiment could potentially support the authors' hypothesis that ClpL and ClpB/KJE can compete with each other for aggregated substrates as suggested in Figure 1.

      We have performed the suggested experiment using aggregated MDH as model substrate. We did not observe an inhibition of LN-ClpB disaggregation activity in the presence of KJE. In contrast, ClpL disaggregation activity towards aggregated MDH is inhibited upon addition of KJE due to competition for aggregate binding (Figure 1 – figure supplement 2D/F). Disaggregation activity of LN-ClpB in the presence of KJE can be explained by functional cooperation between both chaperone systems, which involves interactions between aggregate-bound DnaK and the ClpB MDs of the LN-ClpB fusion construct. We prefer showing these data only in the response letter and not including them in the manuscript, as the respective results distract from the main message of the LN-ClpB fusion construct: the ClpL NTD functions as an autonomous aggregate-targeting unit that can be transferred to other Hsp100 family members.

      Author response image 2.

      LN-ClpB cooperates with DnaK in protein disaggregation. Relative MDH disaggregation activities of indicated disaggregation systems were determined. KJE: DnaK/DnaJ/GrpE. The disaggregation activity of Lm ClpL was set to 1. Statistical Analysis: One-way ANOVA, Welch’s Test for post-hoc multiple comparisons. Significance levels: **p < 0.001. n.s.: not significant.

      Figures 4E and 4F:

      (1) While the effect of various NTD mutations follows a similar trend in regard to the impairment of ClpL-mediated disaggregation of luciferase and MDH, the degree of these effects does appear different. For example, patch A and C mutations reduce ClpL disaggregase activity towards luciferase (~60% / 50% reduction) vs. MDH (>90%) respectively. While these results do suggest a critical role for residues in patches A and C of ClpL, these substrate-specific differences are not discussed. Why would we expect a difference in the effect of these patch A/C ClpL mutations on different substrates?

      We speculate that the aggregate structure and the presence or distributions of ClpL NTD binding sites differ between aggregated Luciferase and MDH. A difference between both aggregated model substrates was also observed when testing for an inhibitory effect of Lm KJE (and Ec KJE) on ClpL disaggregation activity (see comment above). We speculate that the mutated NTD residues make specific contributions to aggregate recognition. The severity of binding defects (and reduction of disaggregation activities) of these mutants will depend on specific features of the aggregated model substrates. We now point out that ClpL NTD patch mutants can differ in disaggregation activities depending on the aggregated model substrate used and refer to potential differences in aggregate structures.

      (2) The authors suggest that the loss of disaggregation activity of selected NTD mutants could be linked to reduced binding to aggregated luciferase. While this is likely given that these mutations do not appear to affect ATPase activity (Supp. 4), it could be possible that these mutants can still bind to aggregated luciferase and some other mechanism may impair disaggregation. A pull-down assay would help to prove whether reduced binding is observed in these NTD ClpL mutants. This also needs to be confirmed for Supp. Figure 4.2H.

      We have shown a strong correlation between loss of aggregate binding and loss of disaggregation activity for several NTD mutants (Fig. 4G, Figure 4 – figure supplement 2H). We decided to perform the aggregate binding assay only with mutants that show a full but not a partial disaggregation defect because, in our experience, the centrifugation-based assay provides clear and reproducible results for loss-of-activity mutants but has limitations in revealing differences for partially affected mutants. This might be explained by the use of nonhydrolyzable ATPγS in these experiments, which strongly stabilizes substrate interactions, potentially masking partial binding defects. We agree with the reviewer that some ClpL NTD mutants might have additional effects on disaggregation activity, e.g. by controlling substrate transfer to the processing pore site. We have added a respective comment to the revised manuscript.

      (3) Supp. Figure 4.2H has no description in the figure legend. The Y-axes states % aggregate bound to chaperone. How was this measured? See the above comments for Figures 4E and 4F.

      We apologize and added the description to the figure legend. The determination of % aggregate bound chaperone is based on the quantifications of chaperones present in the supernatant and pellet fractions after sample centrifugation. Background levels of chaperones in the pellet fractions in absence of protein aggregates were subtracted. We added this information to the materials and methods section.
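
      The quantification described above can be sketched as follows. This is only an illustration under stated assumptions: the response does not give the exact formula, so the sketch assumes that the bound fraction is the pellet signal divided by the total (pellet plus supernatant) signal, and that the background is the same fraction measured without aggregates. The function name and the example band intensities are hypothetical.

```python
def percent_aggregate_bound(pellet, supernatant, bg_pellet, bg_supernatant):
    """Background-corrected percentage of chaperone found in the pellet.

    pellet / supernatant: band intensities measured with aggregates present.
    bg_pellet / bg_supernatant: the same measurement without aggregates.
    Assumption (not from the source): the background is subtracted as a
    pellet fraction, and negative values are clipped to zero.
    """
    def pellet_fraction(p, s):
        total = p + s
        if total <= 0:
            raise ValueError("total signal must be positive")
        return p / total

    corrected = pellet_fraction(pellet, supernatant) - pellet_fraction(
        bg_pellet, bg_supernatant
    )
    return max(0.0, corrected) * 100.0


# Hypothetical intensities: 60% in the pellet with aggregates, 5% without.
bound = percent_aggregate_bound(60.0, 40.0, 5.0, 95.0)
print(f"{bound:.1f}% of chaperone bound to aggregates")
```

      Clipping at zero is a design choice of this sketch; it prevents a mutant with no binding from reporting a small negative percentage caused by measurement noise.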

      Figure 6G:

      The authors observed reduced disaggregase activity and ATPase activity of mutant T355C under both oxidative and reducing conditions. While this observation under oxidative conditions supports the authors' hypothesis, under reducing conditions (+DTT) we would expect the enzyme to behave similarly to wild-type ClpL unless this mutation has other effects. Can the authors please comment on this and provide an explanation or hypothesis?

      The reviewer is correct: ClpL-T355C exhibits a reduced disaggregation activity (Figure 6 – figure supplement 2B). We observe a similar reduction in disaggregation activity for the ClpL MD mutant F354A, pointing to an auxiliary function of the MD in protein disaggregation. We have made a respective comment in the discussion section of the revised manuscript. How exactly ClpL MDs support protein disaggregation is currently unclear and will be the subject of future analysis in the lab. We strongly believe that such analysis should be part of an independent study.

      Discussion:

      In the fourth feature, it is discussed that one disaggregase feature of ClpL is that it does not cooperate with the ClpP protease. While a reference is provided for the canonical ClpB, no data in this paper, nor a reference, is provided demonstrating that ClpL does not interact with ClpP. As discussed, it is highly unlikely that ClpL interacts with ClpP given that ClpL does not contain the IGL/F loops that mediate the interaction of ClpP with cochaperones, such as ClpX, but data or a reference is needed to make such a factual statement.

      The absence of the IGL/F loop makes an interaction between ClpL and ClpP highly unlikely. However, the reviewer is correct: direct evidence for a ClpP-independent function of ClpL, though very likely, is not provided. We have therefore rephrased the respective statement: “Fourth, novel disaggregases lack the specific IGL/F signature motif, which is essential for cooperation of other Hsp100 proteins with the peptidase ClpP. This feature is shared with the canonical ClpB disaggregase [12] suggesting that protein disaggregation is primarily linked to protein refolding.”

      References

      (1) Katikaridis P, Simon B, Jenne T, Moon S, Lee C, Hennig J, et al. Structural basis of aggregate binding by the AAA+ disaggregase ClpG. J Biol Chem. 2023:105336.

      (2) Glover JR, Lindquist S. Hsp104, Hsp70, and Hsp40: A novel chaperone system that rescues previously aggregated proteins. Cell. 1998;94:73-82.

      (3) Krzewska J, Langer T, Liberek K. Mitochondrial Hsp78, a member of the Clp/Hsp100 family in Saccharomyces cerevisiae, cooperates with Hsp70 in protein refolding. FEBS Lett. 2001;489:92-6.

      (4) Seyffer F, Kummer E, Oguchi Y, Winkler J, Kumar M, Zahn R, et al. Hsp70 proteins bind Hsp100 regulatory M domains to activate AAA+ disaggregase at aggregate surfaces. Nat Struct Mol Biol. 2012;19:1347-55.

      (5) Haslberger T, Zdanowicz A, Brand I, Kirstein J, Turgay K, Mogk A, et al. Protein disaggregation by the AAA+ chaperone ClpB involves partial threading of looped polypeptide segments. Nat Struct Mol Biol. 2008;15:641-50.

      (6) Theyssen H, Schuster H-P, Bukau B, Reinstein J. The second step of ATP binding to DnaK induces peptide release. J Mol Biol. 1996;263:657-70.

      (7) Iljina M, Mazal H, Goloubinoff P, Riven I, Haran G. Entropic Inhibition: How the Activity of a AAA+ Machine Is Modulated by Its Substrate-Binding Domain. ACS chemical biology. 2021;16:775-85.

      (8) Rosenzweig R, Farber P, Velyvis A, Rennella E, Latham MP, Kay LE. ClpB N-terminal domain plays a regulatory role in protein disaggregation. Proc Natl Acad Sci U S A. 2015;112:E6872-81.

      (9) Barnett ME, Nagy M, Kedzierska S, Zolkiewski M. The amino-terminal domain of ClpB supports binding to strongly aggregated proteins. J Biol Chem. 2005;280:34940-5.

      (10) Beinker P, Schlee S, Groemping Y, Seidel R, Reinstein J. The N Terminus of ClpB from Thermus thermophilus Is Not Essential for the Chaperone Activity. J Biol Chem. 2002;277:47160-6.

      (11) Mogk A, Schlieker C, Strub C, Rist W, Weibezahn J, Bukau B. Roles of individual domains and conserved motifs of the AAA+ chaperone ClpB in oligomerization, ATP-hydrolysis and chaperone activity. J Biol Chem. 2003;278:15-24.

      (12) Weibezahn J, Tessarz P, Schlieker C, Zahn R, Maglica Z, Lee S, et al. Thermotolerance Requires Refolding of Aggregated Proteins by Substrate Translocation through the Central Pore of ClpB. Cell. 2004;119:653-65.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1

      1) Here are a few sentences that could potentially benefit from further discussion, particularly in the context of the plant developmental framework of an effective germline. It is important to note that the idea of an effective germline is supported by many, but not all, scientists. Nevertheless, as long as this concept remains relevant, a discussion based on it may be appropriate.

      The early establishment of germlines during development is crucial in addressing the impact of somatic mutation on the next generation. To emphasize this aspect, we have included an additional sentence addressing this point in ll. 242–244.

      2) Lines 161-163: The suggestion that long-lived tropical trees do not necessarily suppress somatic mutation rates to the same extent as their temperate counterparts might warrant additional examination.

      We have revised our statement to present a more balanced perspective, and we have also included a sentence to emphasize the importance of conducting further studies in future.

      3) Lines 200-202: The observation of potential influences of GC-biased gene conversion during meiosis or biased purifying selection for C>T inter-individual nucleotide substitutions could be further elaborated upon.

      Our data does not provide enough information to delve into a more detailed discussion regarding GC-biased gene conversion during meiosis or biased purifying selection for C>T substitution. However, future studies that obtain genome sequences from somatic cells, male or female gametophytes, and offspring (such as seeds or seedlings) would offer opportunities to assess these phenomena.

      4) Line 245: The statement "somatic mutations can be transmitted to seeds" might be correct, but it would be helpful to explore the extent to which this occurs.

      In response to the comments from Reviewer 1 (#4) and Reviewer 2 (#16), we have decided to remove the discussion about the heritability of somatic mutations in the next generation. We have completely rewritten the final paragraph to discuss the possibility of a disparity in the relationship between lifespan and somatic mutation rates between plants and animals.

      Reviewer #2

      5) l. 108- 115: The authors seem to have made a really great work at assembling and annotating two reference genomes. Even if this does not represent the main result of the manuscript, these genomic resources are a plus for the community, especially given that reference genomes from tropical trees are known to be underrepresented in the literature (e.g. Plomion et al. 2016). The authors have made the particular effort of generating two high-quality reference genome assemblies for two species of the same genus, including one with an excellent contiguity. Even if they do not explicitly indicate the divergence time between the two species, it is clear that the cheapest solution would have been to map the reads of the two species against a single assembly, but this could have generated some biases. So by generating two de novo assemblies, the authors have used here the best design possible to control for some potential biases for the detection of somatic mutations. However, given the interests these two assemblies represent by themselves, I consider that a couple of additional investigations could have been made on local synteny and orthologous genes in particular. Thanks to whole-genome alignments and orthology (e.g. Lovell et al. 2022), they could have generated more general information regarding the two assembles and investigated additional questions regarding mutations, e.g. mutations in collinear / non-collinear (if any) segments, intensity of purifying selection (or neutral evolution) at single vs. multiple copies or between shared vs. private genes, etc.

      To address the comment by Reviewer 2, we performed synteny analysis using MCScanX in TBtools-II and added Supplementary Figure 3 to illustrate the conserved synteny relationship between S. laevis and S. leprosula. Detecting selection in the genome will be left for a future study, as our current data are not sufficient for this aim because of the limited number of individuals (n = 2 for each species).

      6) l. 123-124. Here, the authors indicate that they have "validated" 93.9% of the mutations. It would be more accurate to indicate that they have "validated" 31/33 mutations (94%), 22/24 mutations on S1 and 9/9 on S2 (Table S5). Can the authors indicate why no somatic mutations from the F1 and F2 were tested? According to me, the use of the word "validation" is not totally accurate (see also Schmitt et al. 2022), since amplicon sequencing can be viewed as a kind of validation but it doesn't represent a complete validation since it represents new sequencing data that are mapped against the same reference assembly, in such a way that we could always imagine that the same biases are at play, leading to a similarly false positive call. Reciprocally, a "non-validated" mutation could be associated to a mutation that is at a too low allele frequency, at least after amplification, in such a way that the call is not heterozygous despite the fact that the mutation is real. I think that another terminology than "validated" could be used, plus one or two sentences explaining this degree of complexity.

      To improve the clarity of the statement, we have modified the sentence as follows: “We conducted an independent evaluation of a subset of the inferred single nucleotide variants (SNVs) using amplicon sequencing. Our analysis demonstrated accurate annotation for 31 out of 33 mutations (94% overall), with 22 out of 24 mutations on S1 and all 9 mutations on S2 (Supplementary Table 5).”

      While we did not conduct additional assessments using F1 and F2, we anticipate a similar high level of agreement between the somatic SNV calls and amplicon sequencing in these trees. We have included sentences in the Materials and Methods section to elucidate the challenges involved in validating true somatic mutations.

      7) l. 135-137 the reasoning appears to be quite circular to me. As indicated by the authors in the line just before, an incongruent pattern could also be explained biologically, in such a way that the overall congruency between the phylogenetic tree and the tree architecture cannot be considered as a way to prove the reliability of the detection. In some species, it seems clear that the phylogenetic tree do not seem to follow the plant architecture (Zahradnikova et al. 2020) in such a way that we should argue to not consider the plant architecture in the design and not consider this represents either a way to validate mutations or a way to validate the methodological framework. I suggest removing this sentence.

      We have removed the sentence as suggested by Reviewer 2.

      8) l. 150. It seems that the differences in length and diameter between the two species come from two different studies and therefore that no statistical test has been performed to test its significance.

      We agree with Reviewer 2. To clarify this point, we have replaced “significantly” with “substantially” in the revised text.

      9) l. 156-159: the same sentence is repeated twice.

      We have removed the repeated sentence.

      10) l. 159-161: Comparing somatic mutation rates between studies is difficult. It is too sensitive to the methodology used, here again see Schmitt et al. 2022. I propose to remove these two sentences. It represents an interesting working hypothesis but would require a better design, or at least, to reanalyze all the data with the same pipeline.

      We have toned down our statement, and added a sentence that additional studies are required to compare somatic mutation rates among trees in tropical, temperate, and boreal regions, employing standardized methodologies.

      11) l. 171-175: Here I am wondering if the authors could provide more information regarding the enrichment at CpG sites? I suggest first estimating the proportion of CpG sites thanks to the two genome assemblies and then using this information as a way to weight the results and therefore to estimate the level of enrichment of mutations at CpG sites.

      In response to the comment by Reviewer 2, we first determined the proportion of CpG sites as 0.030 and 0.028 for S. laevis and S. leprosula, respectively, based on the triplet matrix using the reference genome of each species. Subsequently, we estimated the proportion of somatic mutations at CpG sites. The results revealed a 4.54-fold and 3.53-fold increase in somatic mutations at CpG sites for S1 and S2, and a 3.38-fold and 2.56-fold increase for F1 and F2, respectively. We have incorporated this finding into ll. 172–175.
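
      The enrichment calculation described above can be illustrated with a short sketch. It assumes that the fold increase is the share of somatic mutations falling at CpG sites divided by the genome-wide proportion of CpG sites; the response does not spell out the formula, so this is an interpretation. The CpG proportion of 0.030 is the value reported for S. laevis, whereas the mutation counts are hypothetical placeholders, not data from the study.

```python
def cpg_fold_enrichment(n_mut_at_cpg, n_mut_total, genome_cpg_proportion):
    """Fold enrichment of mutations at CpG sites relative to the genome.

    Assumed definition (not confirmed by the source):
        (mutations at CpG / all mutations) / (CpG sites / all sites)
    """
    if n_mut_total <= 0:
        raise ValueError("n_mut_total must be positive")
    if not 0.0 < genome_cpg_proportion <= 1.0:
        raise ValueError("genome_cpg_proportion must be in (0, 1]")
    observed_share = n_mut_at_cpg / n_mut_total
    return observed_share / genome_cpg_proportion


# Hypothetical counts: 15 of 100 somatic SNVs at CpG sites.
# 0.030 is the CpG proportion reported for the S. laevis reference genome.
fold = cpg_fold_enrichment(15, 100, 0.030)
print(f"{fold:.2f}-fold enrichment at CpG sites")
```

      A value above 1 means that CpG sites carry more mutations than expected from their share of the genome.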

      12) l. 176-187. Interesting comparison and insights. You could also indicate that SBS5 is detected in all human cancers too. So the detection of SBS1 and SBS5 signatures indeed suggests some shared mutation biases. Note that in humans, a specific signature of UV is associated with TCG -> TTG mutations (Martincorena & Campbell, 2015). It seems that there is a substantial difference in the mutation spectra between the two trees for this specific category; not sure if this difference could be associated with UV.

      We slightly modified the sentence to indicate that SBS5 is also detected in all human cancers. We are very interested in the potential impact of UV on somatic mutations in tropical trees, considering the high levels of UVR in the tropics. Conducting a comparative analysis of the mutational spectrum among trees inhabiting diverse UVR environments would provide valuable insights to substantiate this hypothesis.

      13) l. 206: I rather suggest "the somatic mutation rate per year is roughly the same, suggesting that somatic mutations rates are independent of growth rate".

      In response to the suggestion from Reviewer 2, we have revised the sentence as follows: "The somatic mutation rate per year remains largely consistent, indicating that somatic mutation rates are independent of the growth rate."

      14) l. 207-232: Here, the section looks like a mixture between a result and a discussion. I guess the authors consider that it remains a verbal model at this stage and that it therefore represents more of a discussion. If so, I agree, but it would be good to discuss this part more, in particular how this model could be improved and empirically tested.

      The argument based on the model will be more accurate when the cell cycle duration can be directly estimated for each tree. We have added this explanation in the revised text.

      15) l. 238-239: The parallel drawn with the molecular clock is interesting but according to me, it remains a working hypothesis at this stage, since it is not validated outside the two focal species. I encourage the readers to continue to work on this question and to investigate also some annual plants for instance in the future (assuming that they have a higher α) in order to be able to derive a global model. In addition, even if I consider that the authors use and interpret this parallel wisely, I consider that the use of this terminology could be misleading for some readers. That's why I also suggest removing "molecular clock" from the title and using a more explicit one, e.g. "Somatic mutation rates scale with time not growth rate in dipterocarp trees".

      We agree with Reviewer 2. We have changed the title to “Somatic mutation rates scale with time not growth rate in long-lived tropical trees.”

      16) l. 245-249: The results rather suggest that (i) there is little diversity due to somatic mutations and that (ii) most heritable non-synonymous mutations are deleterious and therefore purged from the population. So rather than this last section of this discussion that has little interest and could be quite debatable, I consider that the authors could extend their discussion, e.g. the differences with somatic mutations in mammals (recently, Cagan and coauthors (2022) demonstrated that somatic mutation rates are inversely correlated with lifespan in mammals) or the overall low rate of molecular evolution in trees could be some directions. But there are many others.

      We have completely rewritten the final paragraph to propose the possibility of a disparity in the relationship between lifespan and somatic mutation rates between plants and animals, rather than discussing the heritability of somatic mutations in the next generation.

      17) l. 570-571: I guess, the reader should understand here "fixed at the heterozygous state"

      To avoid confusion, we have modified the text as follows: “If the alternative allele was present or absent in all eight branches in the amplicon sequence, the site was determined as fixed within an individual tree.” We have also removed “heterozygote” in Supplementary Figure 5.

      18) Fig. 4d. the y-axis would be easier to interpret by writing "Delta Inter-individual vs. Somatic SNPs" and/or by adding arrows on the right margin of the plot to indicate the directions with some short sentences such as "more somatic mutations observed than expected assuming the inter-individual comparison", "less somatic mutation than expected". According to me, some statistical tests are lacking here. Are the differences in the mutation spectra significant given the relatively limited amount of somatic mutations detected?

      We have added short sentences explaining the directions.

      19) Supplementary Tables (excel file): please correct the typos. There are many on these supplementary tables.

      We carefully checked supplementary tables and corrected the typos.

      Reviewer #3

      20) To estimate false negative rates, the authors might consider using mutation insertion tools such as Bamsurgeon (https://github.com/adamewing/bamsurgeon) to create simulated mutations. Alternatively, one could assess the calling rate of high-confidence SNPs that differ between individuals of the same species to get at the FNR.

      We agree with Reviewer 3. To calibrate our pipeline, we previously performed simulations to estimate the false negative and false positive rates in a different tree species (Betula platyphylla) using wgsim v0.1.11 (https://github.com/lh3/wgsim). Based on our simulations, we found that the false negative and false positive rates were very low, averaging 0.050 and 0.046, respectively. It is important to note that the estimated false positive rate obtained from the simulation data was substantially lower than the proportion of potential false positive SNVs (as shown in Supplementary Fig. 5). This observation suggests that simulation-based evaluation of the false positive rate is not reliable, at least for the tree species we studied. The same argument could be applied to the false negative rate. Therefore, we conclude that the simulation-based analysis for estimating false positive and false negative rates is not informative for our study.

      The rate of true-positive or false-negative mutation calls can be estimated only when the true mutational status is known, but the data are not currently available. However, under the assumption that the final set of SNVs represents true somatic mutations, we were able to calculate the potential false negative rate. Our findings indicate that this rate is low, specifically less than 10%, when using less stringent filtering thresholds such as BQ20 and MQ20. While these estimated values may not precisely represent the true false negative rate, we included them as potential false negative rates in Supplementary Figure 7 of the revised manuscript. This information provides additional insights into the performance of our pipeline under different filtering thresholds and contributes to the overall assessment of our study.

      21) It may be interesting to examine the mutation trees for constancy (or not) in mutation rate per meter. Examining Figure 1, it appears that the number of mutations near the crown "4" node is consistently higher than in nearby nodes (3-1 and 3-2).

      We calculated the branch-level increment of SNVs per meter by dividing the number of single nucleotide variations (SNVs) by the physical distance. Our analysis revealed a slight increase in the number of SNVs per meter as the branch position became higher in S. laevis, as shown in Author response table 1. However, this trend was not clearly observed in S. leprosula. We found this observation in S. laevis intriguing, particularly because our recent analysis (Tomimoto et al., in preparation) demonstrated that genetic distance increases in branch pairs located in the upper part of a tree. This was elucidated through a mathematical model that describes the dynamics of the stem cell population during elongation and branching. We opted not to delve further into the findings in the current manuscript, as this topic will be extensively investigated in a future study.

      Author response table 1.

      The branch-level increment of SNVs per meter.
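
      The calculation behind this table can be sketched as below. It follows the description given in the response (number of SNVs divided by physical distance, per branch); the branch labels, SNV counts, and lengths are hypothetical and do not reproduce the values in the table.

```python
def snvs_per_meter(branches):
    """Return SNVs per meter for each branch.

    branches: iterable of (label, n_snvs, length_m) tuples, where length_m
    is the physical length of the branch segment in meters.
    """
    rates = {}
    for label, n_snvs, length_m in branches:
        if length_m <= 0:
            raise ValueError(f"branch {label!r} has non-positive length")
        rates[label] = n_snvs / length_m
    return rates


# Hypothetical branch segments, ordered from the base toward the crown.
example = [("lower", 12, 10.0), ("middle", 9, 6.0), ("upper", 8, 4.0)]
for label, rate in snvs_per_meter(example).items():
    print(f"{label}: {rate:.2f} SNVs per meter")
```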

      22) Line 150: Use of "significantly different" is confusing as the phrase is usually reserved for statistical significance. Consider replacing with "substantially different."

      We have replaced “significantly” with “substantially” in the revised text.

      23) In the Discussion, a clearer explanation of the assumptions that underlie the authors' reasoning would be welcome: e.g., constancy in mutation rate per meter within an individual tree. In particular, the authors assume that mutations that are seen in one leaf and not in another cannot have predated the most recent common meristematic node linking the two leaves. Is this a reasonable assumption? Since the meristem is multicellular, is it possible for a mutation to have arisen earlier in development and "assorted" into one cell lineage but not another?

      We greatly appreciate this important comment. It is true that when the meristem is multicellular and the stem cell lines are retained during mutation accumulation (e.g. a structured meristem analyzed in Tomimoto and Satake 2023), it is possible for a mutation to have arisen earlier, before the bifurcation. Using a mathematical model, we have shown that the intercept and slope of the linear regression between the pairwise genetic distance and physical distance are influenced by the type of meristem (strength of somatic genetic drift in the meristem) as well as by the branching architecture of the tree. We have included an explanation of this point in the revised manuscript (ll. 244–249).

      24) Supplementary Data 7: Column J should be "2_2"

      We corrected the typo.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Transcriptional readthrough, intron retention, and transposon expression have been previously shown to be elevated in mammalian aging and senescence by multiple studies. The current manuscript claims that the increased intron retention and readthrough could completely explain the findings of elevated transposon expression seen in these conditions. To that end, they analyze multiple RNA-seq expression datasets of human aging, human senescence, and mouse aging, and establish a series of correlations between the overall expression of these three entities in all datasets.

      While the findings are useful, the strength of the evidence is incomplete, as the individual analyses unfortunately do not support the claims. Specifically, to establish this claim there is a burden of proof on the authors to analyze both intron-by-intron and gene-by-gene, using internal matched regions, and, in addition, thoroughly quantify the extent of transcription of completely intergenic transposons and show that they do not contribute to the increase in aging/senescence. Furthermore, the authors chose to analyze the datasets as unstranded, even though strand information is crucial to their claim, as both introns and readthrough are stranded, and if there is causality, than opposite strand transposons should show no preferential increase in aging/senescence. Finally, there are some unclear figures that do not seem to show what the authors claim. Overall, the study is not convincing.

      Major concerns: 1) Why were all datasets treated as unstranded? Strand information seems critical, and should not be discarded. Specifically, stranded information is crucial to increase the confidence in the causality claimed by the authors, since readthrough and intron retention are both strand-specific, and therefore should influence only the same-strand transposons and not the opposite-strand ones.

      This is an excellent suggestion. Since only one of our datasets was stranded, we did not run stranded analyses for the sake of consistency. We would like to provide two analyses here that consider strandedness:

      First, we find that within the set of all expressed transposons (passing minimal read filtering), 86% of intronic transposons match the strand of the intron (3147 out of 3613). In contrast, the number is 51% after permutation of the strands. Similarly, when we randomly select 1000 intronic transposons 45% match the strandedness of the intron (here we select from the set of all transposons). This is consistent with the idea that most transposons are only detectable because they are co-expressed on the sense strand of other features that are highly expressed.

      As for the readthrough data, 287 out of 360 transposons (79%) within readthrough regions matched the strand of the gene and its readthrough.
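      The strand-concordance comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names, the pair representation, and the choice to shuffle transposon strand labels as the permutation are all assumptions made for the example.

```python
import random


def strand_match_fraction(pairs):
    """Fraction of (transposon_strand, host_strand) pairs with equal strands."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    return sum(t == h for t, h in pairs) / len(pairs)


def permuted_match_fraction(pairs, n_perm=1000, seed=0):
    """Average match fraction after shuffling transposon strand labels.

    Assumption: the permutation baseline reassigns transposon strands at
    random while keeping the host feature strands fixed.
    """
    pairs = list(pairs)
    rng = random.Random(seed)
    t_strands = [t for t, _ in pairs]  # copy, so the input is not mutated
    h_strands = [h for _, h in pairs]
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(t_strands)
        matches = sum(t == h for t, h in zip(t_strands, h_strands))
        total += matches / len(pairs)
    return total / n_perm
```

      With a balanced mix of plus- and minus-strand features, the permuted fraction settles near 0.5, which is the kind of baseline the observed concordance is compared against.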

      Second, in the model we postulate, the majority of transposon transcription occurs as a co-transcriptional artifact. This applies equally to genic transposons (gene expression), intronic (intron retention) and gene proximal (readthrough or readin) transposons. Therefore, we performed the following analysis for the set of all transposons in the Fleischer et al. fibroblast dataset.

      When we invert the strand annotation for transposons, before counting and differential expression, we would expect the counts and log fold changes to be lower compared to using the “correct” annotation file.

      Indeed, we show that out of 6623 significantly changed transposons with age only 226 show any expression in the “inverted run” (-96%). (Any expression is defined as passing basic read filtering.)

      Out of the 226 transposons that can be detected in both runs most show lower counts (A) and age-related differential expression converging towards zero (B) in the inverted run (Fig. L1).
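      For readers who want to reproduce a strand-inverted run, the annotation change itself is small. The sketch below assumes a tab-separated GTF-style annotation, in which the strand is the seventh column; the function name is hypothetical and the authors' actual script may differ.

```python
def invert_strand(gtf_line):
    """Return a GTF line with '+' and '-' swapped in the strand column.

    Comment lines and blank lines are returned unchanged. Strand values
    other than '+' or '-' (for example '.') are left as they are.
    The trailing newline of a feature line is dropped.
    """
    if gtf_line.startswith("#") or not gtf_line.strip():
        return gtf_line
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError("expected at least 9 tab-separated GTF columns")
    flip = {"+": "-", "-": "+"}
    fields[6] = flip.get(fields[6], fields[6])  # column 7 holds the strand
    return "\t".join(fields)
```

      Counting reads against the flipped annotation in a strand-aware mode then gives the "inverted run" that is compared with the original.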

      Author response image 1.

      Transposons with inverted strandedness (“reverse”) show lower expression levels (log counts; A) and no differential expression with age (B) when compared to matched differentially expressed transposons (“actual”). For this analysis we selected all transposons showing significant differential expression with age in the actual dataset that also showed at least minimal expression in the strand-inverted analysis (n=226). Data from Fleischer et al. (2018). (A) The log (counts) are clipped because we only used transposons that passed minimal read filtering in this analysis. (B) The distribution of expression values in the actual dataset is bimodal and positive since some transposons are significantly up- or downregulated. This bimodal distribution is lost in the strand-inverted analysis.

      2) "Altogether this data suggests that intron retention contributes to the age-related increase in the expression of transposons" - this analysis doesn't demonstrate the claim. In order to prove this they need to show that transposons that are independent of introns are either negligible, or non-changing with age.

      We would like to emphasize that we never claimed that intron retention and readthrough can explain all of the age-related increases in transposon expression. In fact, our data is compatible with a multifactorial origin of transposon expression. Age- and senescence-related transposon expression can occur due to: 1/ intron retention, 2/ readthrough, 3/ loss of intergenic heterochromatin. Specifically, we do not try to refute 3.

      However, since most transposons are found in introns or downstream of genes, this suggests that intron retention and readthrough will be major, albeit non-exclusive, drivers of age-related changes in transposon expression. Even if the fold-change for intergenic transposons with aging or senescence were higher, this would not account for the broad-scale expression patterns seen in RNAseq data.

      To further illustrate this, we analyzed transposons located in introns, genes, downstream (ds) or upstream (us) of genes (distance to gene < 25 kb) or in intergenic regions (distance to gene > 25 kb). Indeed, we find that although intergenic transposons show similar log-fold changes to other transposon classes (Fig. L2A), their total contribution to read counts is negligible (Fig. L2B, Fig. S15). We have also now added a more nuanced explanation of this issue to the discussion.
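      The location classes used here can be made concrete with a small classifier. This is a sketch under stated assumptions: it considers only the single nearest gene, uses the transposon midpoint, takes intron intervals as given, and treats a distance of exactly 25 kb as intergenic (the text does not specify that boundary case). The function and argument names are hypothetical.

```python
def classify_transposon(te_mid, gene_start, gene_end, gene_strand,
                        introns=(), max_dist=25_000):
    """Assign a transposon to 'intronic', 'genic', 'ds', 'us' or 'intergenic'.

    te_mid      -- midpoint coordinate of the transposon
    gene_start  -- leftmost coordinate of the nearest gene
    gene_end    -- rightmost coordinate of the nearest gene
    gene_strand -- '+' or '-'; decides which side counts as downstream
    introns     -- iterable of (start, end) intron intervals of that gene
    """
    if gene_start <= te_mid <= gene_end:
        if any(start <= te_mid <= end for start, end in introns):
            return "intronic"
        return "genic"
    if te_mid > gene_end:
        distance = te_mid - gene_end
        side = "ds" if gene_strand == "+" else "us"
    else:
        distance = gene_start - te_mid
        side = "us" if gene_strand == "+" else "ds"
    return side if distance < max_dist else "intergenic"
```

      Summing counts per class after such an assignment is what separates the per-element fold change from the total contribution of each class to the read counts.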

      Author response image 2.

      We analyzed transposons located in introns, genes, downstream (ds) or upstream (us) of genes (distance to gene < 25 kb) or in intergenic regions (distance to gene > 25 kb). Independent of their location, transposons show similar differential expression with aging or cellular senescence (A). In contrast, the expression of transposons (log counts) is highly dependent on their location and the median log(count) value decreases in the order: genic > intronic > ds > us > intergenic.

      Author response image 3.

      Total counts are the sum of all counts from transposons located in introns, genes, downstream (ds) or upstream (us) of genes (distance to gene < 25 kb) or in intergenic regions (distance to gene > 25 kb). Counts were defined as cumulative counts across all samples.

      3) Additionally, the correct control regions should be intronic regions other than the transposon, which overall contributed to the read counts of the intron.

      4) Furthermore, analysis of reads spanning introns and partly transposons should more directly show this contribution.

      Thank you for this comment. To rephrase this, if we understand correctly, the concern is that an increase in transposon expression could bias the analysis of intron retention since transposons often make up a substantial portion of an intron. We would like to address this concern with the following three points:

      First, if the concern is the correlation between log fold-change of transposons vs log fold-change of their containing introns, we do not think that this kind of data is biased. While transposons make up much of the intron, a single transposon on average only accounts for less than 10% of an intron.

      Second, to address this more directly, we show here that even introns that do not contain expressed transposons are increased in aging fibroblasts and after induction of cellular senescence (Fig. S8). This shows that intron retention is universal and most likely not heavily biased by the presence or absence of expressed transposons.

      Author response image 4.

      We split the set of introns that significantly change with cellular aging (A) or cell senescence (B) into introns that contain at least one transposon (has_t) and those that do not contain any transposons (has_no_t). Intron retention is increased in both groups. In this analysis we included all transposons that passed minimal read filtering (n=63782 in A and n=124173 in B). Median log-fold change indicated with a dashed red line for the group of introns without transposons.

      Third, we provide an argument based on the distribution of transposons within introns (Fig. L3).

      Author response image 5.

      The 5’ and 3’ splice sites show the highest sequence conservation between introns, whereas the majority of the intronic sequence does not. This is because these sites contain binding sites for splicing factors such as U1, U2 and SF1 (A). Transposons could affect splicing and we present a biologically plausible mechanism and two ancillary hypotheses here (B). If transposons affect the splicing (retention) of introns the most likely mechanism would be via impairment of splice site recognition because a transposon close to the site forms a secondary structure, binds an effector protein or provides inadequate sequences for pairing. Hypothesis 1: Transposons impair splicing because they are close to the splice site. Hypothesis 2: Transposons do not impair splicing because they are located away from the splice junction. Retained introns should show a similar depletion of transposons around the junction. Image adapted from: Ren, Pingping, et al. "Alternative splicing: a new cause and potential therapeutic target in autoimmune disease." Frontiers in Immunology 12 (2021): 713540.

      Consistent with hypothesis 2 (“transposons do not impair splicing”), we show that the distribution of transposons within introns is similar for the set of all transposons and all significant transposons within significantly overexpressed introns (Fig. S7; A and B are similar in the case of aged fibroblasts, D and E are similar in the case of cellular senescence). If transposon expression was causally linked to changes in intron retention, the most likely mechanism would be via an impairment of splicing. We would expect transposons to be located close to the splice junction, which is not what we observed. Instead, the data is more consistent with intron retention as a driver of transposon expression.

      Author response image 6.

      Transposons are evenly distributed within introns except for the region close to splice junctions (A-E). Transposons appear to be excluded from the splice junction-adjacent region both in all introns (A, D) and in significantly retained introns (B, E). In addition, transposon density of all introns and significantly retained introns is comparable (C, F). We included only introns containing at least one transposon in this analysis. A) Distribution of 2292769 transposons within 163498 introns among all annotated transposons. B) Distribution of 195190 transposons within 14100 introns significantly retained with age. C) Density (transposon/1kb of intron) of transposons in all introns (n=163498) compared to significantly retained introns (n=14100). D) as in (A) E) Distribution of 428130 transposons within 13205 introns significantly retained with induced senescence. F) Density (transposon/1kb of intron) of transposons in all introns (n=163498) compared to significantly retained introns (n=13205).

      5) "This contrasts with the almost completely even distribution of randomly permuted transposons." How was random permutation of transposons performed? Why is this contrast not trivial, and why is this a good control?

      Permutation was performed using the bedtools shuffle function (Quinlan et al. 2010). We use the set of all annotated transposons and all reshuffled transposons as a control. It is interesting to observe that these two show a very similar distribution with transposons evenly spread out relative to genes. In contrast, expressed transposons are found to cluster downstream of genes. This gave rise to our initial working hypothesis that readthrough should affect transposon expression.

      6) Fig 4: the choice to analyze only the 10kb-20kb region downstream to TSE for readthrough regions has probably reduced the number of regions substantially (there are only 200 left) and to what extent this faithfully represent the overall trend is unclear at this point.

      This is addressed in Suppl. Fig. 7, where we repeated the analysis for every 10 kb region between 0 and 100 kb, showing similar results.

      Furthermore, we show below in a new figure that the results are comparable when we measure readthrough in the 0 to 10kb region, while the sample size of readthrough regions is increased.

      Finally, it is commonly accepted to remove readthrough regions overlapping genes, which reduces sample size but increases accuracy for readthrough determination (Rosa-Mercado et al. 2021). Without filtering, readthrough regions can overlap neighboring genes, which is reflected in an elevated ratio of Readthrough_counts/Genic_counts (Fig. S9).
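      The readthrough measure and the neighbor filter can be illustrated as follows. This is a minimal sketch: the pseudocount, the function names and the 10 kb default are assumptions chosen for the example, not a description of the authors' code.

```python
import math


def readthrough_log2_ratio(rt_counts, gene_counts, pseudocount=1.0):
    """log2 ratio of readthrough counts to genic counts.

    The pseudocount (an assumption of this sketch) avoids taking the
    logarithm of zero when a region has no reads.
    """
    return math.log2((rt_counts + pseudocount) / (gene_counts + pseudocount))


def has_clear_downstream_window(gene_end, next_gene_start, window=10_000):
    """True if the downstream window does not reach the neighboring gene.

    Coordinates are taken along the direction of transcription, so
    next_gene_start is the start of the next gene downstream.
    """
    return next_gene_start - gene_end >= window
```

      Restricting the analysis to genes that pass the second check keeps genic reads from a neighbor out of the readthrough window, at the cost of fewer regions.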

      Author response image 7.

      A) Readthrough was determined in a region 0 to 10 kb downstream of genes for a subset of genes that were at least 10 kb away from the nearest neighboring gene (n=684 regions). The log2 ratio of readthrough to gene expression is plotted across five age groups (adolescent n=32, young n=31, middle-aged n=22, old n=37 and very old n=21). B) As in (A) but data is plotted on a per sample basis. C) Readthrough was determined in a region 0 to 10 kb downstream of genes for a subset of genes that were at least 10 kb away from the nearest neighboring gene (n=1045 regions). The log2 ratio of readthrough to gene expression is plotted for the groups comprising senescence (n=12) and the non-senescent group (n=6). D) As in (C) but data is plotted on a per sample basis and for additional control datasets (serum-starved, immortalized, intermediate passage and early passage). N=3 per group.

      7) Fig. 5B shows the opposite of the authors claims: in the control samples there are more transposon reads than in the KCl samples.

      Thank you for pointing this out. During preparation of the manuscript the labels of Fig. 5B were switched (however, the color matching between Fig. 5A-C is correct). We apologize for this mistake, which we have now corrected.

      8) "induced readthrough led to preferential expression of gene proximal transposons (i.e. those within 25 kb of genes), when compared with senescence or aging". A convincing analysis would show if there is indeed preferential proximity of induced transposons to TSEs. Since readthrough transcription decays as a function of distance from TSEs, the expression of transposons should show the same trends if indeed simply caused by readthrough. Also, these should be compared to the extent of transposon expression (not induction) in intergenic regions without any readthrough, in these conditions.

      This is a very good suggestion. We now provide two new supplementary figures analyzing the distance-dependence of transposon expression.

      In the first figure (Fig. S13) we show that readthrough decreases with distance (A, B) and we show that transposon counts are higher for transposons close to genes, following a similar pattern to readthrough. This is true in fibroblasts isolated from aged donors (A) and with cellular senescence (B).

      Author response image 8.

      Readthrough counts (rt_counts) decrease exponentially downstream of genes, both in the aging dataset (A) and in the cellular senescence dataset (B). Although noisier, the pattern for transposon counts (transp_cum_counts) is similar with higher counts closer to gene terminals, both in the aging dataset (C) and in the cellular senescence dataset (D). Readthrough counts are the cumulative counts across all genes and samples. Readthrough was determined in 10 kb bins and the values are assigned to the midpoint of the bin for easier plotting. Transposon counts are the cumulative counts across all samples for each transposon that did not overlap a neighboring gene. n=801 in (C) and n=3479 in (D).
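      The binning described in this caption (10 kb bins, values assigned to the bin midpoint) can be sketched as below. The record format and function name are assumptions made for illustration.

```python
from collections import defaultdict


def bin_counts_by_distance(records, bin_size=10_000):
    """Sum counts in fixed-width distance bins keyed by bin midpoint.

    records -- iterable of (distance_downstream_of_gene, count) pairs,
               with distances in base pairs and not negative
    Returns a dict {bin_midpoint: cumulative_count}, sorted by midpoint.
    """
    binned = defaultdict(float)
    for distance, count in records:
        if distance < 0:
            raise ValueError("distance must be non-negative")
        midpoint = (distance // bin_size) * bin_size + bin_size / 2
        binned[midpoint] += count
    return dict(sorted(binned.items()))
```

      Applying the same binning to readthrough counts and to transposon counts puts both on a common distance axis, which is what allows the two decay patterns to be compared.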

      In the second figure (Fig. S14) we show that transposons found downstream of genes with high readthrough show a more pronounced log-fold change (differential expression) than transposons downstream of genes with low readthrough (defined based on log-fold change). This is true in fibroblasts isolated from aged donors (A) and with cellular senescence (B). Furthermore, the difference between high and low readthrough region transposons is diminished for transposons that are more than 10 kb downstream of genes, as would be expected given that readthrough decreases with distance.

      Author response image 9.

      Transposons found downstream of genes with high readthrough (hi_RT) show a more pronounced log-fold change (transp_logfc) than transposons downstream of genes with low readthrough (low_RT). This is true in fibroblasts isolated from aged donors (A) and with cellular senescence (B). Furthermore, the difference between high and low readthrough region transposons is diminished for transposons that are more than 10 kb downstream of genes (“Transp > 10 kb”). Transposons in high readthrough regions were defined as those in the top 20% of readthrough log-fold change. Readthrough was measured between 0 and 10 kb downstream from genes. n=2124 transposons in (A) and n=6061 transposons in (B) included in the analysis.
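      The high/low readthrough split used in this figure (top 20% of readthrough log-fold change versus the rest) can be written as a short helper. The name, the dictionary input and the rounding rule for the group size are assumptions of this sketch.

```python
def split_by_top_fraction(logfc_by_gene, top_fraction=0.2):
    """Split genes into (high, low) sets by readthrough log-fold change.

    logfc_by_gene -- dict mapping a gene identifier to its readthrough
                     log-fold change
    The high group holds the top `top_fraction` of genes (at least one).
    """
    if not logfc_by_gene:
        return set(), set()
    ranked = sorted(logfc_by_gene, key=logfc_by_gene.get, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    return set(ranked[:n_top]), set(ranked[n_top:])
```

      Transposons downstream of genes in the first set would then form the hi_RT group and the remainder the low_RT group.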

      Reviewer #2 (Public Review):

      In this manuscript, the authors examined the role of transcriptional readthrough and intron retention in increasing transcription of transposable elements during aging in mammals. It is assumed that most transposable elements have lost the regulatory elements necessary for transcription activation. Using available RNA-seq datasets, the authors showed that an increase in intron retention and readthrough transcription during aging contributes to an increase in the number of transcripts containing transposable elements.

      Previously, it was assumed that the activation of transposable elements during aging is a consequence of a gradual imbalance of transcriptional repression and a decrease in the functionality of heterochromatin (de-repression of transcription in heterochromatin). Therefore, this is an interesting study with an important novel conclusion. However, there are many questions about the bioinformatics analysis and the results obtained.

      Major comments:

      1) In the Introduction, the authors indicated that only a small fraction of LINE-1 and SINE elements are expressed from functional promoters and that most LINE-1 elements are co-expressed with neighboring transcriptional units. What about other classes of mobile elements (LTR mobile elements and transposons)?

      We thank the reviewer for this comment. Historically, most repetitive elements, e.g. DNA elements and retrotransposon-like elements, have been considered inactive, having accrued mutations which prevent them from transposition. On the other hand, based on recent data it is indeed very possible that certain LTR elements become active with aging as suggested in several manuscripts (Liu et al. 2023, Autio et al. 2020). However, these elements are not well annotated and our final analysis (Fig. 6) relies on a well-defined distinction between active and inactive elements. (See also question 2 for further discussion.)

      Finally, we would like to point out some of the difficulties with defining expression and re-activation of LTR/ERV elements based on RNAseq data that have been highlighted for the Liu manuscript and are concordant with several of our results: https://pubpeer.com/publications/364E785636ADF94732A977604E0256

      Liu, Xiaoqian, et al. "Resurrection of endogenous retroviruses during aging reinforces senescence." Cell 186.2 (2023): 287-304.

      Autio A, Nevalainen T, Mishra BH, Jylhä M, Flinck H, Hurme M. Effect of ageing on the transcriptomic changes associated with expression at the HERV-K (HML-2) provirus at 1q22. Immun Ageing. 2020;17(1):11.

      2) Results: Why did the authors consider all classes of mobile elements together? It is likely that most of the LTR-containing mobile elements and transposons contain active promoters that are repressed in heterochromatin or by KRAB-C2H2 proteins.

      We do not consider LTR containing elements because there is uncertainty regarding their overall expression levels and their expression with aging (Nevalainen et al. 2018). Furthermore, we believe that substantial activity of LTR elements in human genomes should have been detectable through patterns of insertional mutagenesis. Yet studies generally show low to negligible levels of LTR (ERV) mutagenesis. Here, for example, at a 200-fold lower rate than for LINEs (Lee et al. 2012).

      Importantly, our analysis in Fig. 6 relies on well-annotated elements like LINEs, which is why we do not include LTR or SINE elements that could be potentially expressed. However, for other analyses we did consider element families independently as can be seen in Table S1, for example.

      Nevalainen, Tapio, et al. "Aging-associated patterns in the expression of human endogenous retroviruses." PLoS One 13.12 (2018): e0207407.

      Lee, Eunjung, et al. "Landscape of somatic retrotransposition in human cancers." Science 337.6097 (2012): 967-971.

      3) Fig. 2. A schematic model of transposon expression is not presented clearly. What is the purpose of showing three identical spliced transcripts?

      This is indeed confusing. There are three spliced transcripts to schematically indicate that the majority of transcripts will be correctly spliced and that intron retention is rare (estimated at 4% of all reads in our dataset). We have clarified the figure now, please see below:

      Author response image 10.

      A schematic model of transposon expression. In our model, represented in this schematic, transcription (A) can give rise to mRNAs and pre-mRNAs that contain retained introns when co-transcriptional splicing is impaired. This is often seen during aging and senescence, and these can contain transposon sequences (B). In addition, transcription can give rise to mRNAs and pre-mRNAs that contain transposon sequences towards the 3’-end of the mRNA when co-transcriptional termination at the polyadenylation signal (PAS) is impaired (C, D) as seen with aging and senescence. Some of these RNAs may be successfully polyadenylated (as depicted here) whereas others will be subject to nonsense mediated decay. Image created with Biorender.

      4) The study analyzed the levels of RNA from cell cultures of human fibroblasts of different ages. The annotation to the dataset indicated that the cells were cultured and maintained. (The cells were cultured in high-glucose (4.5mg/ml) DMEM (Gibco) supplemented with 15% (vol/vol) fetal bovine serum (Gibco), 1X glutamax (Gibco), 1X non-essential amino acids (Gibco) and 1% (vol/vol) penicillin-streptomycin (Gibco).) How valid is it to assume that gene expression levels in cell cultures are the same as in body cells? In cell cultures, transcription is optimized for efficient division and is very different from that of cells in the body. In order to correlate a result on cells with an organism, there must be rigorous evidence that the transcriptomes match.

      We agree and have updated the discussion to reflect this shortcoming. While we do not have human tissue data, we would like to draw the reviewer’s attention to Fig. S3 where we presented some liver data for mice. We now provide an additional supplementary figure (in a style similar to Fig. S2) showing how readthrough, transposon expression and intron retention change in 26 vs 5-month-old mice (Fig. S4). Indeed, intron, readthrough and transposon expression increases with age in mice, although this is more pronounced for transposons and readthrough.

      Author response image 11.

      Intron, readthrough and transposon elements are elevated in the liver of aging mice (26 vs 5-month-old, n=6 per group). Readthrough and transposon expression is especially elevated even when compared to genic transcripts. The percentage of upregulated transcripts is indicated above each violin plot and the median log10-fold change for genic transcripts is indicated with a dashed red line.

      Finally, just to elaborate, we used the aging fibroblast dataset by Fleischer et al. for three reasons:

      1) Yes, aging fibroblasts could be a model of human aging, with important caveats as you correctly point out,

      2) it is one of the largest such datasets, allowing us to draw conclusions with higher statistical confidence and to do things such as partial correlations, and

      3) it has been analyzed using similar techniques before (LaRocca, Cavalier and Wahl 2020) and this dataset is often used to make strong statements about transposons and aging such as transposon expression in this dataset being “consistent with growing evidence that [repetitive element] transcripts contribute directly to aging and disease”. Our goal was to put these statements into perspective and to provide a more nuanced interpretation.

      LaRocca, Thomas J., Alyssa N. Cavalier, and Devin Wahl. "Repetitive elements as a transcriptomic marker of aging: evidence in multiple datasets and models." Aging Cell 19.7 (2020): e13167.

      5) The results obtained for isolated cultures of fibroblasts are transferred to the whole organism, which has not been verified. The conclusions should be more accurate.

      We agree and have updated the discussion accordingly.

      6) The full pipeline with all the configuration files IS NOT available on github (pabisk/aging_transposons).

      Thank you for pointing this out, we have now uploaded the full pipeline and configuration files.

      7) Analysis of transcripts passing through repeating regions is a complex matter. There is always a high probability of incorrect mapping of multi-reads to the genome. Things worsen if unpaired short reads are used, as in the study (L=51). Therefore, the authors used the Expectation maximization algorithm to quantify transposon reads. Such an option is possible. But it is necessary to indicate how statistically reliable the calculated levels are. It would be nice to make a similar comparison of TE levels using only unique reads. The density of reads would drop, but in this case it would be possible to avoid the artifacts of the EM algorithm.

      We thank the reviewer for this suggestion. We show here that mapping only unique alignments (outFilterMultimapNmax=1 in STAR) leads to similar results.
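      As an illustration of what "unique alignments only" means at the read level, the sketch below checks the `NH` tag of a SAM record, which reports the number of alignments found for a read. This is an after-the-fact filter written for illustration; it is not the same operation as setting `outFilterMultimapNmax` at alignment time, although both keep only reads that map to a single locus. The function name is hypothetical.

```python
def is_unique_alignment(sam_line):
    """True if a SAM record carries NH:i:1, i.e. the read maps to one locus.

    Header lines (starting with '@') and records without an NH tag
    return False. Optional tags begin at the 12th tab-separated column.
    """
    if sam_line.startswith("@"):
        return False
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            return int(tag[len("NH:i:"):]) == 1
    return False
```

      Counting only records that pass this check trades read depth for freedom from the assignment step of the expectation-maximization approach.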

      For the aging fibroblast dataset:

      Author response image 12.

      For the induced senescence dataset:

      Author response image 13.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Review:

      We would like to thank the reviewers for providing constructive feedback on the manuscript. To address their concerns, we have performed additional experiments, analyzed the new data, and revised the manuscript.

      (1) The utility of a pipeline depends on the generalization properties.

      While the proposed pipeline seems to work for the data the authors acquired, it is unclear if this pipeline will actually generalize to novel data sets possibly recorded by a different microscope (e.g. different brand), or different imaging conditions (e.g. illumination or different imaging artifacts) or even to different brain regions or animal species, etc.

      The authors provide a 'black-box' approach that might work well for their particular data sets and image acquisition settings but it is left unclear how this pipeline is actually widely applicable to other conditions as such data is not provided.

      In my experience, without well-defined image pre-processing steps and without training on a wide range of image conditions pipelines typically require significant retraining, which in turn requires generating sufficient amounts of training data, partly defying the purpose of the pipeline.

      It is unclear from the manuscript, how well this pipeline will perform on novel data possibly recorded by a different lab or with a different microscope.

      To address the generalizability of our DL segmentation model, we have performed several validation experiments by deploying our model on out-of-distribution data that 1) had distinct channels, 2) were acquired in a different species (rat) with a different vascular fluorescent label and a different imaging protocol, and 3) were acquired on a different microscope and with a different vascular label. We first used our model to segment images (507x507 um lateral FOV, 170-250 um axial range) from three C57BL/6 mice imaged on the same two-photon fluorescence microscope following the same imaging protocol. The vasculature was labelled by intravenous injection of the Texas Red dextran (70 kDa MW, Thermo Fisher Scientific Inc, Waltham MA), as in the current experiment. In lieu of the EYFP signal from pyramidal neurons that was present in the original data, we added Gaussian noise with a mean and standard deviation identical to the acquired vascular channel in the out-of-distribution dataset. Second, we applied our model to images (507x507 um lateral FOV, 300-400 um axial range) from two Fischer rats that were injected with 2000-kDa Alexa680-dextran via a tail vein catheter. These rats were imaged on the same two-photon fluorescence microscope, but with Galvano scanners (instead of resonant scanners). As before, a second channel of Gaussian noise was added to simulate the missing EYFP signal. Finally, we segmented an image of vasculature from an ex-vivo cleared mouse brain (1665x1205x780 um) acquired on a light sheet fluorescence microscope (Miltenyi UltraMicroscope Blaze), with a Lectin-DyLight 649 labelling the vessel walls. The Dice Score, Precision, Recall, Hausdorff 95%, and Mean surface distance were reported for segmentations of 2PFM data sets, following the generation of ground truth images by assisted manual segmentation in ilastik. Examples of the generated segmentation masks are presented in Supplementary figure 9 for visual comparison.

      We have described the image pre-processing steps/transforms before model inference in the revised Methods section. In general, should the segmentation results on a data set be deemed unsatisfactory, our model can be further fine-tuned on out-of-distribution data. Furthermore, the image analyses downstream from segmentation are applicable irrespective of the method utilized to arrive at a robust vascular segmentation.
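      Two of the steps mentioned above lend themselves to a short illustration: the overlap metrics used to score a segmentation against ground truth, and the substitute noise channel whose mean and standard deviation match the vascular channel. The sketch below works on flattened lists of voxel values and uses only the standard library; the function names are hypothetical, and the use of the population standard deviation is an assumption.

```python
import random
import statistics


def overlap_metrics(pred, truth):
    """Dice score, precision and recall for two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    denom = 2 * tp + fp + fn
    dice = 2 * tp / denom if denom else 1.0  # two empty masks agree fully
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return dice, precision, recall


def matched_noise_channel(vascular, seed=0):
    """Gaussian noise with the same mean and SD as the vascular channel.

    vascular -- flattened list of voxel intensities (at least one value)
    """
    mean = statistics.fmean(vascular)
    sd = statistics.pstdev(vascular)
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) for _ in vascular]
```

      Hausdorff 95% and mean surface distance need surface extraction and are not covered by this sketch.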

      Author response table 1.

      Dataset performance comparison for UNETR

      (2) Some of the chosen analysis results seem to not fully match the shown data, or the visualization of the data is hard to interpret in the current form.

      We have updated the visualizations to make them more accessible and ensure close correspondence between tables and figures.

      (3) Additionally, some measures seem not fully adapted to the current situation (e.g. the efficiency measure does not consider possible sources or sinks). Thus, some additional analysis work might be required to account for this.

      Thank you for your comment. The efficiency metric was selected as it does not consider sources or sinks. We do agree that accounting for vessel subtypes in the analysis (thus classifying larger vessels as either suppliers/sources or drainers/sinks) would be very useful; however, this classification is extremely laborious, as we have noted in our prior work1. We are therefore leveraging machine learning in a parallel project to afford vessel classification by type. In addition, the source/sink analysis based on in vivo 2PFM data is confounded by the small FOV.

      (4) The authors apply their method to in vivo data. However, there are some weaknesses in the design that make it hard to accept many of the conclusions and even to see that the method could yield much useful data with this type of application. Primarily, the acquisition of a large volume of tissue is very slow. In order to obtain a network of vascular activity, large volumes are imaged with high resolution. However, the volumes are scanned once every 42 seconds following stimulation. Most vascular responses to neuronal activation have come and gone in 42 seconds so each vessel segment is only being sampled at a single time point in the vascular response. So all of the data on diameter changes are impossible to compare since some vessels are sampled during the initial phase of the vascular response, some during the decay, and many probably after it has already returned to baseline. The authors attempt to overcome this by alternating the direction of the scan (from surface to deep and vice versa). But this only provides two sample points along the vascular response curve and so the problem still remains.

      We thank the Reviewer for bringing up this important point. Although vessels can show relatively rapid responses to perturbation, vascular responses to photostimulation of Channelrhodopsin-2 in neighbouring neurons are long-lasting: they do not come and go in 42 seconds. To demonstrate this point, we acquired higher temporal-resolution images of smaller volumes of tissue over 5 minutes preceding and 5 minutes following the 5-s photoactivation with the original photostimulation parameters. The imaging protocol was different in that we utilized a piezoelectric motor, a smaller field of view (512 um x (80-128) um x (34-73) um), and only 3x frame averaging, resulting in a temporal resolution of 1.57-3.17 seconds per frame. This acquisition was repeated at different cortical depths in three Thy1-ChR2 mice and the vascular radii were estimated using our presented pipeline. Significantly responding vessels here were selected via an F-test of radius estimates before vs. after stimulation. LOESS fits to the time-dependent radius of significantly responding vessels are shown in Supplementary Figure 5. Vessels shorter than 20 um in length were excluded from the analysis so as to focus on vessel segments where averaging the vascular radius over many vertices was possible. A video of one of the acquisitions is shown along with the timecourses of select vessels’ calibre changes in Author response image 1. The vascular calibre changes following photostimulation persisted for several minutes, consistent with earlier observations by us and others2–5. These small-volume acquisitions demonstrated that dilations consistently lasted longer than 42 seconds (i.e. our original temporal resolution).

      Our temporal sampling was chosen to permit a large field of view acquisition while still being well within the span of the vascular response to look at larger scale vascular coordination that has not previously been studied. The pipeline readily adapts to smaller fields of view at a finer temporal sampling, though such an acquisition precludes the study of the response coordination across hundreds of vessels. While a greater number of baseline frames would help with the baseline variability estimation, maintaining animals under anesthesia during prolonged imaging is exceedingly difficult, precluding us from extending our total acquisition time.

      Author response image 1.

      Estimated vascular radius at each timepoint for select vessels from the imaging stack shown in the following video: https://flip.com/s/kB1eTwYzwMJE

      (5) A second problem is the use of optogenetic stimulation to activate the tissue. First, it has been shown that blue light itself can increase blood flow (Rungta et al 2017). The authors note the concern about temperature increases but that is not the same issue. The discussion mentions that non-transgenic mice were used to control for this with "data not shown". This is very important data given these earlier reports that have found such effects and so should be included.

      We have updated the manuscript to incorporate the data on volumetric scanning in (nontransgenic) C57BL/6 mice undergoing blue light stimulation, with identical parameters as those used in Thy1-ChR2 mice (Supplementary Figure 8). As before, responders were identified as vessels that, following blue light stimulation, showed a radius change greater than twice the standard deviation of their baseline radius: their estimated radii changes are shown in Supplementary Figure 8. There was no statistical difference between the radii distributions of any of the photostimulation conditions and pre-photostimulation baseline.

      (6) Secondly, there doesn't seem to be any monitoring of neural activity following the photo-stimulation. The authors repeatedly mention "activated" neurons and claim that vessel properties change based on distance from "activated" neurons. But I can't find anything to suggest that they know which neurons were active versus just labeled. Third, the stimulation laser is focused at a single depth plane. Since it is single-photon excitation, there is likely a large volume of activated neurons. But there is no way of knowing the spatial arrangement of neural activity and so again, including this as a factor in the analysis of vascular responses seems unjustified.

      Given the high fidelity of Channelrhodopsin-2 activation with blue light photostimulation found by us and others3, we assume that all labeled neurons within the volume of photostimulation are being activated. Depending on their respective connectivities, their postsynaptic neurons (whether or not they are labeled) may also get activated. We therefore agree with the reviewer that the spatial distribution of neuronal activation is not well defined. The manuscript has been revised to update the terminology from activated to labeled neurons and to stress in the Discussion that the motivation for assessing the distance to the closest labeled neuron as one of our metrics is purely to demonstrate the possibility of linking vascular response to activations in their neighbouring neurons and including morphological metrics in the computational pipeline.

      (7) The study could also benefit from more clear illustration of the quality of the model's output. It is hard to tell from static images of 3-D volumes how accurate the vessel segmentation is. Perhaps some videos going through the volume with the masks overlaid would provide some clarity. Also, a comparison to commercial vessel segmentation programs would be useful in addition to benchmarking to the ground truth manual data.

      We generated a video demonstrating the deep-learning model outputs and have made the video available here: https://flip.com/s/_XBs4yVxisNs. We aimed to develop an open-source method for the research community as the vast majority of groups do not have access to commercial software for vessel segmentation.

      (8) Another useful metric for the model's success would be the reproducibility of the vessel responses. Seeing such a large number of vessels showing constrictions raises some flags and so showing that the model pulled out the same response from the same vessels across multiple repetitions would make such data easier to accept.

      We have generated a figure demonstrating the repeatability of the vascular responses following photostimulation in a volume and presented them next to the corresponding raw acquisitions for visual inspection (Supplementary Figure 6). It is important to note that there is significant biological variability in vessels’ responses to repeated stimulation, as described previously 3,6: a well-performing model should be able to quantify biological heterogeneity, as it may itself be of interest. Constrictions have been reported in the literature by our group and others 1,2,4,5,7, though their prevalence has not been systematically studied to date. Concerning the reproducibility of our analysis, we have demonstrated model reproducibility (as a metric of its success) on a dataset where vessels visually appeared to dilate consistently following 452 nm light stimulation: these results are now presented in Supplementary Figure 6 of the revised Manuscript. We thus observed that the model repeatedly detected the vessels that appeared to dilate on visual inspection as dilating. Examples of vessels constricting repeatedly were also examined, and maximal intensity projections of the vessel before and after photostimulation were inspected, confirming their repeated constriction (Author response image 2).

      It is also worth noting that while the presence of the response (defined as change above 2 standard deviations of the radius across baseline frames) was infrequent (2107 vessels responded at least once, out of a total of 10,552 unique vessels imaged), the direction of the response was highly consistent across trials. Given twice the baseline variability as the threshold for response, of the vessels that responded more than once, 31.7% dilated on some trials while constricting on others; 41.1% dilated on each trial; and 27.2% constricted on each trial. (Note that some trials use 1.1 vs. 4.3 mW/mm^2 and some have opposite scanning directions).
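
      The response criterion and the per-vessel direction summary described above can be sketched as follows. This is a minimal illustration only: the function names, the use of the sample standard deviation, and the handling of non-responding trials are assumptions, not the pipeline's actual code.

```python
import numpy as np

def classify_response(baseline_radii, post_radius, n_sd=2.0):
    """Label one trial: +1 dilation, -1 constriction, 0 no response.

    Assumption: a vessel responds when its post-stimulation radius differs
    from the baseline mean by more than n_sd baseline standard deviations.
    """
    baseline_radii = np.asarray(baseline_radii, dtype=float)
    delta = post_radius - baseline_radii.mean()
    if abs(delta) <= n_sd * baseline_radii.std(ddof=1):
        return 0
    return 1 if delta > 0 else -1

def direction_consistency(trial_labels):
    """Summarize the response direction over the trials in which a vessel responded."""
    responses = [label for label in trial_labels if label != 0]
    if len(responses) < 2:
        return "responded fewer than twice"
    if all(label == 1 for label in responses):
        return "dilated on each trial"
    if all(label == -1 for label in responses):
        return "constricted on each trial"
    return "mixed"
```

      Under these assumptions, a vessel labelled +1 on one trial and -1 on another falls in the "mixed" group, matching the first of the three categories listed above.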

      Author response image 2.

      Sample capillary constrictions from maximum intensity projections at repeated time points following optogenetic stimulation. The baseline (pre-stimulation) image is shown on the left and the post-stimulation image is on the right, with the estimated radius changes listed to the left.

      (9) A number of findings are questionable, at least in part due to these design properties. There are unrealistically large dilations and constrictions indicated. These are likely due to artifacts of the automated platform. Inspection of these results by eye would help understand what is going on.

      Some of the dilations were indeed large in magnitude. We present select examples of large dilations and constrictions ranging in magnitude from 2.08 to 10.80 um for visual inspection (Author response image 3). For reference, the average magnitude of radius changes across vessels and stimuli was 0.32 +/- 0.54 um. Diameter changes above 5 um were visually inspected.

      Author response image 3.

      Additional views of diameter change in maximum intensity projections ranging in magnitude from 2.08 um to 10.80 um.

      (10) In Figure 6, there doesn't seem to be much correlation between vessels with large baseline level changes and vessels with large stimulus-evoked changes. It would be expected that large arteries would have a lot of variability in both conditions and veins much less. There is also not much within-vessel consistency. For instance, the third row shows what looks like a surface vessel constricting to stimulation but a branch coming off of it dilating - this seems biologically unrealistic.

      We now plot photostimulation-elicited vessel-wise radius changes vs. their corresponding baseline radius standard deviations (Author response image 4). The Pearson correlation coefficient between the baseline standard deviation and the radius change was 0.08 (p<1e-5) for 552nm 4.3 mW/mm^2 stimulation, -0.08 (p<1e-5) for 458nm 1.1 mW/mm^2 stimulation, and -0.04 (p<1e-5) for 458nm 4.3 mW/mm^2 stimulation. For non-control (i.e. blue) photostimulation conditions, the change in the radius is thus weakly negatively correlated with the vessel’s baseline radius standard deviation; the small magnitude of this correlation indicates that the vessel radius change is largely independent of the baseline variability in the vessel radius. Classification of vessels by type (arteries vs. veins) is needed before we can comment on differences between these vascular components. The between-vessel consistency (i.e. between parent vessels and their daughter branches separated by branch points) is explicitly evaluated by the assortativity metric in Figure 9. Vessels do somewhat tend to react similarly to their downstream branches: we observed a mean assortativity of 0.4. As for the instance of a surface vessel constricting while a downstream vessel dilates, it is important to remember that the 2PFM FOV restricts us to imaging a very small portion of the cortical microvascular network: one (among many) daughter vessels showing changes in the opposite direction to the parent vessel does not violate the conservation of mass; in addition, mural cells on adjacent branches can respond differently.
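
      A minimal sketch of this vessel-wise correlation is given below, assuming radius estimates are stored as arrays of shape (vessels, frames); the array layout and the function name are illustrative assumptions rather than the pipeline's actual interface.

```python
import numpy as np
from scipy.stats import pearsonr

def change_vs_baseline_sd(baseline, post):
    """Correlate stimulus-evoked radius change with baseline variability.

    baseline, post: arrays of shape (n_vessels, n_frames) holding radius
    estimates (um) before and after photostimulation (assumed layout).
    Returns the Pearson r and its two-sided p-value.
    """
    baseline = np.asarray(baseline, dtype=float)
    post = np.asarray(post, dtype=float)
    baseline_sd = baseline.std(axis=1, ddof=1)             # per-vessel variability
    radius_change = post.mean(axis=1) - baseline.mean(axis=1)
    r, p = pearsonr(baseline_sd, radius_change)
    return r, p
```

      A coefficient near zero, as reported above, would indicate that vessels with noisier baselines do not show systematically larger or smaller evoked changes.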

      Author response image 4.

      Vessel radius change elicited by photostimulation vs. baseline radius standard deviation across all vessels. The threshold level for response identification is shown as the black line.

      (11) As mentioned, the large proportion of constricting capillaries is not something found in the literature. Do these happen at a certain time point following the stimulation? Did the same vessel segments show dilation at times and constriction at other times? In fact, the overall proportion of dilators and constrictors is not given. Are they spatially clustered? The assortativity result implies that there is some clustering, and the theory of blood stealing by active tissue from inactive tissue is cited. However, this theory would imply a region where virtually all vessels are dilating and another region away from the active tissue with constrictions. Was anything that dramatic seen?

      The kinetics of the vascular responses are not accessible via the current imaging protocol and acquired data; however, this computational pipeline can readily be adapted to test hypotheses surrounding the temporal evolution of the vascular responses, as shown in Supplementary Figure 2 (with higher temporal-resolution data). Some vessels dilate at some time points and constrict at others as shown in Supplementary Figure 2. As listed in Table 2, 4.4% of all vessels constrict and 7.5% dilate for 452nm stimulation at 4.3 mW/mm^2. There was no obvious spatial clustering of dilators or constrictors: we expect such spatial patterns to be more common with different modes of stimulation and/or in the presence of pathology. The assortativity peaked at 0.4 (quite far from 1 where each vessel’s response exactly matches that of its neighbour).
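
      For readers unfamiliar with the metric, the sketch below computes a scalar assortativity coefficient in Newman's sense: the Pearson correlation of an attribute across the two ends of every adjacency, counted in both orders. Treating vessel segments as items and shared branch points as adjacencies is an assumption made for illustration; the manuscript's exact formulation may differ.

```python
import numpy as np

def scalar_assortativity(adjacent_pairs, values):
    """Pearson correlation of a per-vessel value across adjacent vessels.

    adjacent_pairs: list of (i, j) index pairs of vessels sharing a branch
    point (hypothetical input format). values: per-vessel attribute, e.g.
    the radius change. Each pair is counted in both orders, so the
    coefficient is symmetric.
    """
    pairs = np.asarray(adjacent_pairs, dtype=int)
    values = np.asarray(values, dtype=float)
    x = np.concatenate([values[pairs[:, 0]], values[pairs[:, 1]]])
    y = np.concatenate([values[pairs[:, 1]], values[pairs[:, 0]]])
    return float(np.corrcoef(x, y)[0, 1])
```

      A value of 1 corresponds to every vessel responding exactly like its neighbours, and -1 to neighbours responding in opposite directions.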

      (12) Why were nearly all vessels > 5um diameter not responding >2SD above baseline? Did they have highly variable baselines or small responses? Usually, bigger vessels respond strongly to local neural activity.

      In Author response image 5, we now present the stimulation-induced radius changes vs. baseline radius variability across vessels with a radius greater than 5 um. The Pearson correlation between the radius change and the baseline radius standard deviation across time was low: r=0.05 (p=0.5) for 552nm 4.3 mW/mm^2 stimulation, r=-0.27 (p<1e-5) for 458nm 1.1 mW/mm^2 stimulation, and r=-0.31 (p<1e-5) for 458nm 4.3 mW/mm^2 stimulation. These results demonstrate that the changes following optogenetic stimulation are lower than twice the baseline standard deviation across time for most of these vessels. The pulsatility of arteries results in significant variability in their baseline radius8; in turn, literature to date suggests very limited radius changes in veins. Both of these effects could contribute to the radius response not being detected in many larger vessels.

      Author response image 5.

      The change in the vessel radius elicited by photostimulation vs. baseline vessel radius standard deviation in vessels with a baseline radius greater than 5 um. The threshold level for response identification is shown as the black line.

      References

      (1) Mester JR, Rozak MW, Dorr A, Goubran M, Sled JG, Stefanovic B. Network response of brain microvasculature to neuronal stimulation. NeuroImage. 2024;287:120512. doi:10.1016/j.neuroimage.2024.120512

      (2) Alarcon-Martinez L, Villafranca-Baughman D, Quintero H, et al. Interpericyte tunnelling nanotubes regulate neurovascular coupling. Nature. 2020;585(7823):91-95. doi:10.1038/s41586-020-2589-x

      (3) Mester JR, Bazzigaluppi P, Weisspapir I, et al. In vivo neurovascular response to focused photoactivation of Channelrhodopsin-2. NeuroImage. 2019;192:135-144. doi:10.1016/j.neuroimage.2019.01.036

      (4) O’Herron PJ, Hartmann DA, Xie K, Kara P, Shih AY. 3D optogenetic control of arteriole diameter in vivo. Nelson MT, Calabrese RL, Devor A, Rungta R, eds. eLife. 2022;11:e72802. doi:10.7554/eLife.72802

      (5) Hartmann DA, Berthiaume AA, Grant RI, et al. Brain capillary pericytes exert a substantial but slow influence on blood flow. Nat Neurosci. Published online February 18, 2021:1-13. doi:10.1038/s41593-020-00793-2

      (6) Mester JR, Bazzigaluppi P, Dorr A, et al. Attenuation of tonic inhibition prevents chronic neurovascular impairments in a Thy1-ChR2 mouse model of repeated, mild traumatic brain injury. Theranostics. 2021;11(16):7685-7699. doi:10.7150/thno.60190

      (7) Hall CN, Reynell C, Gesslein B, et al. Capillary pericytes regulate cerebral blood flow in health and disease. Nature. 2014;508(7494):55-60. doi:10.1038/nature13165

      (8) Meng G, Zhong J, Zhang Q, et al. Ultrafast two-photon fluorescence imaging of cerebral blood circulation in the mouse brain in vivo. Proc Natl Acad Sci U S A. 2022;119(23):e2117346119. doi:10.1073/pnas.2117346119

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Line 207: a superfluous '.' before the references.

      This has been corrected.

      Line 273 ff:

      While the metrics are described in mathematical terms which is very useful, the appearing distances (d) and mathematical symbols are not. While mostly intuitively clear, precise definitions of all symbols introduced should be given to avoid ambiguities.

      The description has been clarified.

      This applies to all formulas appearing in the manuscript and the authors might want to check them carefully.

      We have updated them wherever needed.

      The mean surface distance seems not to reflect the mean MINIMAL surface distance but just the overall mean surface distance. Or a different definition of the appearing symbols is used, highlighting the need for introducing every mathematical symbol carefully.

      The definitions have been updated for clarity, specifying the distinction between Hausdorff 95% distance and mean surface distance.
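
      To make the distinction concrete, the sketch below computes both quantities from the minimal distances between the surfaces of two binary masks. It assumes one common convention (symmetric averaging for the mean surface distance and the larger of the two directed 95th percentiles for the Hausdorff 95% distance); the definitions adopted in the revised manuscript may differ in detail, and the function names are illustrative.

```python
import numpy as np
from scipy import ndimage

def _surface(mask):
    """Surface voxels: foreground voxels removed by one binary erosion."""
    mask = np.asarray(mask, dtype=bool)
    return mask & ~ndimage.binary_erosion(mask)

def _directed_distances(pred, truth, spacing=1.0):
    """Minimal distance from each surface voxel of one mask to the other surface."""
    s_pred, s_truth = _surface(pred), _surface(truth)
    # The EDT of the complement gives, at every voxel, the distance to the nearest surface voxel.
    to_truth = ndimage.distance_transform_edt(~s_truth, sampling=spacing)
    to_pred = ndimage.distance_transform_edt(~s_pred, sampling=spacing)
    return to_truth[s_pred], to_pred[s_truth]

def mean_surface_distance(pred, truth, spacing=1.0):
    d_pt, d_tp = _directed_distances(pred, truth, spacing)
    return 0.5 * (d_pt.mean() + d_tp.mean())

def hausdorff95(pred, truth, spacing=1.0):
    d_pt, d_tp = _directed_distances(pred, truth, spacing)
    return max(np.percentile(d_pt, 95), np.percentile(d_tp, 95))
```

      Passing the voxel size as `spacing` returns both distances in physical units (e.g. um) rather than in voxels.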

      Line 284:

      It is unclear to me why center-line detection was performed in MATLAB and not Python. Using multiple languages/software packages and in addition relying on one that is not freely available/open source makes this tool much less attractive as a real open-source tool for the community. The authors stress in the manuscript abstract that their pipeline is an open and accessible tool, the use of MATLAB defies this logic to some extent in my view.

      Centerline detection for large volumetric data is available in Python; see, e.g., the Scipy packages, as well as ClearMap or VesselVio for large data sets.

      We tested centerline detection in Python (scipy 1.9.3) and in Matlab. We found that the Matlab implementation performed better due to its inclusion of a branch length parameter for the identification of terminal branches, which greatly reduced the number of false branches; the Python implementation does not include this feature (in any version) and its output had many more such “hair” artifacts. ClearMap skeletonization uses an algorithm by Palagyi & Kuba (1999) to thin segmentation masks, which does not include hair removal. VesselVio uses a parallelized version of the scipy implementation of the Lee et al. (1994) algorithm, which does not perform hair removal based on a terminal branch length filter; instead, VesselVio performs a threshold-based hair removal that is frequently overly aggressive (it removes true positive vessel branches), as highlighted by its authors.
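
      The idea of a terminal branch length filter can be illustrated with the short sketch below, which operates on an already extracted skeleton graph. It is not the Matlab routine's algorithm: the edge-list format, the single non-iterative pass, and the function name are all assumptions made for illustration.

```python
from collections import Counter

def prune_terminal_branches(segments, min_length_um):
    """Drop short terminal branches ("hairs") from a skeleton graph.

    segments: list of (node_u, node_v, length_um) tuples, one per vessel
    segment between branch points or end points (hypothetical format).
    A segment is terminal when one of its end nodes has degree 1.
    """
    degree = Counter()
    for u, v, _ in segments:
        degree[u] += 1
        degree[v] += 1

    kept = []
    for u, v, length in segments:
        is_terminal = degree[u] == 1 or degree[v] == 1
        if is_terminal and length < min_length_um:
            continue  # treat as a spurious branch
        kept.append((u, v, length))
    return kept
```

      Because only terminal segments are candidates for removal, short segments connecting two branch points are retained regardless of their length.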

      Moreover, the authors mention that robust center-line detection was critical. In my view, robust center-line extraction typically requires some additional processing of the binarized data, e.g. using a binary smoothing step. Various binary smoothers are available in the literature and as Python code.

      Indeed, binary smoothing was performed: background “holes” located within the vasculature were filled; the masks were dilated (3x) and then eroded to the centreline. Scipy’s binary closing function smoothes the morphology of binary segmentation masks by dilating and then eroding the segmentation masks (as a part of the selected skeletonization algorithm).
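
      A minimal sketch of such mask smoothing with scipy.ndimage is shown below. The choice of the default structuring element and the pairing of three dilations with three erosions are assumptions; in the pipeline described above, the erosion proceeds to the centreline as part of skeletonization, which this sketch does not reproduce.

```python
import numpy as np
from scipy import ndimage

def smooth_vessel_mask(mask, iterations=3):
    """Fill enclosed background holes, then apply a binary closing.

    binary_closing dilates the mask `iterations` times and then erodes it
    the same number of times, which bridges small gaps and smooths the
    boundary without thinning the vessel.
    """
    mask = np.asarray(mask, dtype=bool)
    filled = ndimage.binary_fill_holes(mask)
    return ndimage.binary_closing(filled, iterations=iterations)
```

      Masks that touch the image border may need padding before closing, since the erosion step treats voxels outside the image as background.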

      Line 303:

      'RBC' is not defined (red blood cells?)

      This has been updated.

      Line 398:

      pPhotonsimulation -> Photostimulation

      This has been corrected.

      Line 400 ff: Efficiency:

      I am not sure how useful the measure really is without any information about the 'sources' (i.e. arteries) and sinks (i.e. veins) as blood does not need to be moved between any two arbitrary nodes.

      While blood reversals are observed, blood is typically not moved arbitrarily between two arbitrary nodes in capillary networks.

      We agree with the reviewer that classifying the vessels by type is important and are currently working on deep learning-based algorithms for the classification of microvasculature into arterioles and venules for future work.
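
      For context, a global efficiency measure of the kind discussed here averages the inverse shortest-path length over all node pairs, without singling out sources or sinks. The sketch below assumes this standard definition and a hypothetical edge-list input; the edge weights used in the manuscript (e.g. length or resistance) are not specified here.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def global_efficiency(n_nodes, edges):
    """Mean inverse shortest-path length over all ordered node pairs.

    edges: list of (u, v, weight) with positive weights (assumed format).
    Unreachable pairs have infinite distance and contribute zero.
    """
    rows = [u for u, _, _ in edges]
    cols = [v for _, v, _ in edges]
    weights = [float(w) for _, _, w in edges]
    graph = csr_matrix((weights, (rows, cols)), shape=(n_nodes, n_nodes))
    dist = shortest_path(graph, directed=False)
    off_diagonal = ~np.eye(n_nodes, dtype=bool)
    return float(np.mean(1.0 / dist[off_diagonal]))
```

      Because every pair of nodes enters the average, short low-weight paths contribute the largest terms, which is the behaviour the reviewer points out.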

      In addition, short paths between two nodes with low resistivity will potentially dominate the sum and the authors excluded vessels 10um and above. This threshold seems arbitrary.

      The 10-um diameter threshold was not applied in the computation of the network metrics. The 10-um thresholding was restricted to “capillary” identification in Figure 8: the 10-um cutoff for referring to a vessel as a capillary has long been applied in the literature [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11].

      Figure 3:

      It's unclear what the units are for the Mean Surface and Hausdorff Distances (pixel or um?).

      The units have now been specified (um).

      Figure 4:

      The binarized data, and particularly the crops are difficult to interpret in black and white. It would be much more useful to present the segmentation results in a way that is interpretable (e.g. improving the rendering of the 3d information, particularly in the crops by using shadows or color codes for depth, etc).

      We have updated these visualizations and shaded them based on cortical depth.

      Panel C indicates that ilastik is performing badly due to changes in imaging conditions (much higher background level). As pointed out before, in my view, a reasonable pipeline should start by removing and standardizing background levels as well as dynamic ranges and possibly other artifacts before performing a more detailed analysis. This would also make the pipeline more robust against data from other microscopes etc. as only a few preprocessing parameters might need to be adjusted.

      I wonder whether, after such a pre-processing step, UNET / UNETR would still perform in a way that was superior to ilastik, as ground truth data was initially generated with the aid of ilastik.

      The Ilastik model is based on semi-automatically generated foreground labels in small batches. We had to break the data up into small groups during manual labelling, as larger groups could not be run due to the computational limits of Ilastik. Ilastik is typically trained in an iterative fashion on a few patches at a time because it takes 2-3 hours per patch to train, and the resulting model does not generalize to the remaining patches or out-of-distribution data, even with image pre-processing steps. Following the reviewer's comment, we did try inputting normalized images into Ilastik, but this did not improve its results. UNET and UNETR inputs have been normalized for signal intensities.

      Typical pre-processing/standard computer vision techniques with parameter tuning do not generalize on out-of-distribution data with different image characteristics, motivating the shift to DL-based approaches.

      Figure 5:

      This is a validation figure that might be better shown in an appendix or as a supplement.

      Since this is a methodological paper, we think it is important to highlight the validation of the proposed method.

      Line 476:

      It's surprising that the number of vessel segments almost doubles when taking the union. Is the number of RBC plugs expected to be so high?

      The etiology of discontinuities includes, but is not limited to, RBC plugs; we expect discontinuities to arise also from the very short pixel dwell time (0.067 us) of the resonant scanning and have indeed observed apparent vessel discontinuities on resonant scanning that are not present with galvanometer scanning using a pixel dwell time of 2 us.

      Section 4.4 / 4.5 :

      The analysis in these sections provides mostly tables with numbers that are more difficult to read and hides possible interesting structures in the distribution of the various measures/quantities. For example, why is 5um a good choice to discriminate between small and large vessels, why not resolve this data more precisely via scatter plots?

      Some distributions are shown in the appendix and could be moved to the main analysis.

      Generally, visualizing the data and providing more detailed insights into the results would make this manuscript more interesting for the general reader.

      The radius of vessel segments drops off after 5.0 um, as shown in Supplementary Figure 4A. The 10-um diameter thresholding is based on prior literature [1], [12], [13], [14], [15], [16], [17], [18], [19] and is used to segregate different vessel types in a conservative manner. The smallest capillaries are expected to have pericytes on their vessel walls whereas arteries are expected to have smooth muscle cells on their vessel walls. These differences in mural cells also may lead to differences in respective vessels’ reactivity.

      The data summarized in Tables 1 and 2 are shown as scatter plots in Figures 8, Supplementary Fig 4 and Supplementary Fig 5.

      Line 556:

      The authors deem a certain change in radius as the relevant measure for responding vessels. They deem a vessel responding if it dilates by twice the std deviation in the radius.

      Based on this measure they find that large vessels rarely respond.

      However, I think this analysis might obscure some interesting effects:

      (1) The standard deviation of the radius depends on the correct estimation of the center point. Given the limited spatial resolution the center point (voxel) obtained from the binarization and skeletonization might not lie in the actual center of the vessel. This effect will be stronger for larger vessels. Center point coordinates should thus be corrected to minimize the std in radius.

      (2) Larger vessels will not necessarily have a perfectly circular shape, and thus the std measure is not necessarily a good measure of 'uncertainty' of estimating the actual radius.

      (3) The above reasons possibly contribute to the fact that from Figure 6 it seems vessels with larger radii have higher std in general (as indicated above some more detailed visualization of the data instead of plain tables could reveal such effects better, e.g. scatter radius vs std). This higher std is making it harder to detect changes in larger vessels. However, with respect to the blood flow, the critical factor is the cross-section of the vessel that scales with the radius squared. Thus, a fixed change in radius for a vessel (say 1um) will induce a larger increase in the flow rate in larger vessels as the change in cross-section is also proportional to the radius of the vessel.

      Thus, larger vessels to be deemed responders should probably have lower thresholds, thresholds should be taken on the cross-section change, or at least thresholds should not be higher for larger vessels as it is the case now using the higher std.

      (1) The radius estimate does not depend on the precise placement of the center point, as the radius is not estimated from the distance between the center point and the boundary of the vessel. Instead, our strategy is to estimate the cross-sectional area (A) of the vessel by the Riemann sum of the sectors with the apex at the center point; the radius is then quoted as sqrt(A/pi) (Supplementary Figure 3B). The radius estimates in each cross-sectional plane are then averaged across the cross-sectional planes placed every ~1 um along the vessel length. The uncertainty in the cross-sectional plane’s vessel radius, the uncertainty in the vessel radius (upon averaging the cross-sectional planes), and the uncertainty in the radius estimate across repeated measures of a state (i.e. across different samples of the baseline vs. post-photostimulation states) are all reported, and the last one is used to define responding vessels.

      To demonstrate the insensitivity to the precise placement of the vessel’s centrepoint, we have jittered the centerline in the perpendicular plane to the vessel tangent plane at each point along the vessel and then estimated the mean radius in 71 cross-sectional planes of larger vessels (mean radius > 5 um). The percent difference in the estimated radius at our selected vessel centrepoints vs. the jittered centrepoints is plotted above. The percent difference in the mean radius estimated was 0.64±3.44% with 2.45±0.30 um centerpoint jittering. (In contrast, photostimulation was estimated to elicit an average 25.4±18.1% change in the magnitude of the radius of larger vessels, i.e. those with a baseline radius >5 um.)

      (2) Indeed, the cross-sectional areas of either large or small vessels are not circles. Consequently, we are placing the vessel boundary, following other published work[20], at the minimum of the signal intensity gradients computed along thirty-six spokes emanating from the centrepoint (cf Figure 2H,K). The cross-sectional area of the vessel in the said cross-sectional plane is then estimated by summing the areas of the sectors flanked by neighbouring spokes. We do not make an assumption about the cross-sectional area being circular. We report radii of circles with the equivalent area as that of the cross-sectional areas merely for ease of communication (as most of the literature to date reports vessel radii, rather than vessel cross-sectional areas.)

      To demonstrate the robustness of this approach, we show the sensitivity of vessel-wise radius estimate on the number of spokes used to estimate the radius in Supplementary Figure 3a. The radius estimate converges after 20 spokes have been used for estimation. Our pipeline utilizes 36 spokes and then excludes minima that lie over 2 STD away from the mean radius estimate across those 36 spokes. With 36 spokes, the vesselwise mean radius estimation was within 0.24±0.62% of the mean of radius estimates using 40-60 spokes.
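
      The sector-sum estimate of the equivalent radius can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the spokes are taken to be equally spaced in angle, each sector area is approximated as r^2 * dtheta / 2, and the spokes remaining after outlier exclusion are assumed to share the full circle equally.

```python
import numpy as np

def equivalent_radius(spoke_lengths, n_sd=2.0):
    """Radius of the circle whose area equals the summed sector areas.

    spoke_lengths: centrepoint-to-boundary distance along each spoke (um),
    e.g. 36 values. Spokes further than n_sd standard deviations from the
    mean length are excluded before the area is computed.
    """
    r = np.asarray(spoke_lengths, dtype=float)
    keep = np.abs(r - r.mean()) <= n_sd * r.std()
    r = r[keep]
    dtheta = 2.0 * np.pi / r.size            # angle subtended by each sector
    area = np.sum(0.5 * r**2 * dtheta)       # Riemann sum of sector areas
    return float(np.sqrt(area / np.pi))
```

      Because the area is summed sector by sector, no circular cross-section is assumed; a non-circular vessel simply maps to the radius of the circle with the same area.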

      (3) Across-baseline sample uncertainty in vessel radius is not dependent on baseline vessel caliber (i.e. this uncertainty is not larger in larger vessels).

      Supplementary Figure 5 shows vessel radius changes for large vessels without a threshold defining responding or non-responding vessels. To explore the dependence of the outcomes on the threshold used to identify the responding vessels, we have explored an alternative strategy, whereby responding small vessels are identified as those vessels that show a post-photostimulation (vs. baseline) radius change of more than 10%. These data are now plotted in Supplementary Figure 10 for capillaries and are in agreement with Figure 8. These points are now also discussed in the Discussion section of the revised manuscript:

      “Additionally, alternative definitions of responding vessels may be useful depending on the end goal of a study (e.g., this could mean selecting a threshold for the radius change based on a percentage change from the baseline level).”

      Section 4.5.1

      Why is the distance to the next neuron a good measure here? If two or more neurons are just a bit further away there will be twice or multiple times the 'load' while the measure would only indicate the distance to the shortest neuron. I wonder how the results change if those 'ensemble' effects are taken into account.

      In this direction, looking for network-level effects with respect to the full spatial organization of the neurons would be very interesting to look at.

      We agree with the reviewer that this question is interesting; however, it is not addressable using present data: activated neuronal firing will have effects on their postsynaptic neighbors, yet we have no means of measuring the spread of activation using the current experimental model.

      Figure 8

      The scatter plots shown are only partly described (e.g. what's the line with error bars in C, why does it only appear for the high-intensity stimulation?).

      A quadratic polynomial fit is shown only in C because a significant response was observed only for this condition, i.e., for the higher-intensity blue photostimulation.

      From the scatter plots as shown it is not clear to me why dilations happen on average further away. This might be a density effect not well visible in this representation. The data does not seem to show a clear relationship between neuron distance and Delta R.

      Particularly in the right panel (high stimulation) there seems to be a similar number of close by neurons responding in both directions, but possibly a few more contracting at larger distances?

      So, the overall effect does not seem as 'simple' as suggested in the title of section 4.5.1 in my view, but rather more cells start to contract at larger distances while there seems to be a more intricate balance nearby.

      A more thorough analysis and visualization of the densities etc. might be needed to clarify this point.

      The language has been revised to:

      458-nm photostimulation resulted in a mix of constrictions and dilations with 44.1% of significantly responding vessels within 10 μm of a labelled pyramidal neuron constricting and 55.1% dilating, while 53.3% of vessels further than 30 μm constricted and 46.7% dilated. The cutoff distances from the closest labelled neuron were based on estimates of cerebral metabolic rate of oxygen consumption that showed a steep gradient in oxygen consumption with distance from arteries, CMRO2 being halved by 30 μm away.

      We added a probability density plot for significant constrictors and dilators to Figure 8 and Supplementary Figure 5.
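
      The distance-binned proportions quoted above can be tabulated along the following lines. This is a minimal sketch, not the authors' code: the function name, the input layout (one `(distance_um, delta_r)` pair per significantly responding vessel), and the treatment of vessels between the two cutoffs (ignored) are assumptions made for illustration; only the 10 μm and 30 μm cutoffs come from the text.

```python
def response_proportions(vessels, near_um=10.0, far_um=30.0):
    """Percentage of constricting vs. dilating vessels near and far from a neuron.

    vessels: iterable of (distance_um, delta_r) pairs, one per significantly
    responding vessel (hypothetical layout). delta_r < 0 is a constriction,
    delta_r > 0 a dilation. Vessels between the two cutoffs are ignored.
    """
    counts = {
        "near": {"constricting": 0, "dilating": 0},
        "far": {"constricting": 0, "dilating": 0},
    }
    for distance_um, delta_r in vessels:
        if distance_um < near_um:
            zone = "near"
        elif distance_um > far_um:
            zone = "far"
        else:
            continue
        if delta_r < 0:
            counts[zone]["constricting"] += 1
        elif delta_r > 0:
            counts[zone]["dilating"] += 1

    percentages = {}
    for zone, zone_counts in counts.items():
        total = sum(zone_counts.values())
        percentages[zone] = {
            label: (100.0 * n / total if total else float("nan"))
            for label, n in zone_counts.items()
        }
    return percentages


# Made-up example values, for illustration only.
example = [(5, 0.3), (8, -0.2), (9, 0.1), (4, 0.4), (20, 0.5),
           (35, -0.3), (40, -0.1), (50, 0.2), (45, -0.2)]
result = response_proportions(example)
```

      Reporting both bins side by side, rather than a single fitted trend, keeps the mixed nature of the response visible, which is the point the reviewer raised.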

      Figure 8 Panel D / Section 4.5.2

      This is a very interesting result in my view found in this study.

      I am unclear how to interpret the effect. The authors state that dilators tend to be closer to the surface. Looking at the scatter plot (without real density information except the alpha value) it seems again the number of responders in both directions is about the same, but in deeper regions the contraction is just larger? This would be different, than how the authors interpret the data. It is unclear from the provided analysis/plots what is actually the case.

      We added a probability density function plot of the constrictors and dilators, which shows a greater incidence of constrictions (vs. dilations). The text of the paper was then clarified to include the proportion of significant constrictors/dilators closer than 10 μm vs. further than 30 μm away from the closest labeled neuron.

      For the analyses above involving $\Delta R$ I recommend also look how those results change when looking at changes in cross section instead, i.e. taking into account the actual vessel radius as well as discussed above.

      It would be interesting to speculate here or in the discussion on a reason why vessels in deeper regions might need to contract more?

      Unaddressed is the question if e.g. contraction in a vessel for small stimulation is predictive of contractions for larger stimulation or any other relationships?

      Thank you for your comment. Given its hierarchical organization and high within-vessel response heterogeneity, we believe that the vasculature is best analyzed as a network. Our radius estimates come from averaged cross-sectional estimates allowing us to examine heterogeneity within individual vessel segments.

      The discussion has been updated to include reasons as to why deeper vessels may contract more:

      “As the blue light stimulation power increased, the mean depth of both constricting and dilating vessels increased, likely resulting from higher intensity light reaching ChR2-expressing neurons deeper in the tissue and exciting superficial neurons (and thus their postsynaptic neurons) to a greater level [21], [22]. The blue light would be expected to excite a lower number of neurons farther from the cortical surface at lower powers.”

      Also, how consistent are contractions/dilations observed at a particular vessel etc.

      To look at the consistency of a particular vessel's response to the 1.1 or 4.3 mW/mm^2 blue light photostimulation, we categorized all significant responses as constrictions or dilations, defining a responding vessel as one showing a change that is either > 2 × baseline vessel radius variability or > 10% of the vessel’s mean baseline radius.

      Using twice the baseline variability as the response threshold, of the vessels that responded more than once, 31.7% dilated on some trials while constricting on others; 41.1% dilated on each trial; and 27.2% constricted on each trial. (Note that some trials use 1.1 vs. 4.3 mW/mm^2 and some have opposite scanning directions.)
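
      The two response criteria and the per-vessel consistency categories described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and taking "baseline vessel radius variability" to mean the sample standard deviation of the baseline radius estimates is an assumption, since the response does not define it.

```python
from statistics import mean, stdev


def classify_response(baseline_radii, post_radii, criterion="variability"):
    """Label one vessel's response on one trial.

    criterion="variability": |change| must exceed 2 x baseline variability
        (assumed here to be the sample SD of the baseline radii; needs at
        least two baseline samples).
    criterion="percent": |change| must exceed 10% of the mean baseline radius.
    """
    base_mean = mean(baseline_radii)
    change = mean(post_radii) - base_mean
    if criterion == "variability":
        threshold = 2.0 * stdev(baseline_radii)
    elif criterion == "percent":
        threshold = 0.10 * base_mean
    else:
        raise ValueError(f"unknown criterion: {criterion!r}")
    if abs(change) <= threshold:
        return "none"
    return "dilation" if change > 0 else "constriction"


def consistency(trial_labels):
    """Summarise one vessel's labels across repeated trials."""
    responses = [label for label in trial_labels if label != "none"]
    if len(responses) < 2:
        return "responded fewer than twice"
    if all(label == "dilation" for label in responses):
        return "always dilated"
    if all(label == "constriction" for label in responses):
        return "always constricted"
    return "mixed"


# Made-up radii in micrometres, for illustration only.
baseline = [2.0, 2.1, 1.9, 2.0]
label_a = classify_response(baseline, [2.5, 2.6])
label_b = classify_response(baseline, [1.95])
label_c = classify_response(baseline, [1.7], criterion="percent")
```

      Keeping the criterion as a parameter makes it straightforward to rerun the same tally under either threshold and compare the resulting proportions.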

      Section 4.5.3

      The results in assortativity are interesting. It would be interesting to look at how the increase in assortativity is mediated. For, example, is this in localized changes in some parts of the graph as visible in A or are there other trends? Do certain sub-graphs that systematically change their radius have certain properties (e.g. do activated neurons cluster there) or are these effects related to some hotspots that also show a coordinated change in control conditions (the assortativity seems not zero there)?

      I already discussed if the efficiency measure is necessarily the best measure to use here without taking into account 'sources' and 'sinks'.

      We plan to address this in future work once we have successfully trained models for the classification of vessels into arteries, veins, and capillaries. Capillaries will be classified based on their branch order from parent arteries to specify where in the network changes are occurring.

      Figure 9

      It's unclear to me why the Ohm symbol needs to be bold?

      It is not bolded (just the font’s appearance).

      Line 707:

      "458-nm photostimulation caused capillaries to dilate when pyramidal neurons were close, and constrict when they were further away."

      In my view, this interpretation is too simple, given the discussion above. A more detailed analysis could clarify this point.

      The discussion on this point has been revised to:

      458-nm photostimulation resulted in a mix of constrictions and dilations, with 44.1% of significantly responding vessels within 10 μm of a labelled pyramidal neuron constricting, and 55.1% dilating; while 53.3% of vessels further than 30 μm constricted and 46.7% dilated. The cutoff distances from the closest labelled neuron were based on estimates of cerebral metabolic rate of oxygen consumption that showed a steep gradient in oxygen consumption with distance from arteries, CMRO2 being halved by 30 μm away [23].

      Line 740:

      "The network efficiency here can be thought of as paralleling mean transit time, i.e., the time it takes blood to traverse the capillary network from the arteries to the veins".

      The network efficiency as defined by the authors seems not to rely on artery/vein information and thus this interpretation is not fully correct in my view.

      The authors might want to reconsider this measure for one that accounts for sources and sinks, if they like to interpret their results as in this line.

      Yes, the efficiency described does not account for sources and sinks. It estimates the resistivity of capillaries, as a proxy for the ease of moving through the observed capillary nexus. Looking at the efficiency metric from graph theory does not require knowledge of the direction of blood flow, and can comment on the resistivity changes across capillary networks.

      For future work, we are investigating methods of classifying vessels as arteries, capillaries, or veins. This type of analysis will provide more detailed information on paths between arteries and veins; it will not provide insight into large-scale network-wide modifications, as those require larger fields of view. 
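
      To make concrete why a graph-theoretic efficiency needs no flow direction, the sketch below computes a global efficiency on an undirected vessel graph whose edges are weighted by a resistance term. It is a sketch under stated assumptions, not the pipeline's implementation: the Poiseuille-style weight (length divided by radius to the fourth power, with the constant prefactor dropped) and all names are assumptions introduced here.

```python
import math


def poiseuille_resistance(length_um, radius_um):
    """Relative hydraulic resistance of a segment, proportional to L / r^4.

    Assumed weighting for illustration; the viscosity prefactor is dropped
    because only relative changes matter for the comparison below.
    """
    return length_um / radius_um ** 4


def global_efficiency(n_nodes, edges):
    """Mean of 1 / (lowest-resistance path) over all ordered node pairs.

    edges: list of (u, v, length_um, radius_um) for an undirected graph.
    Disconnected pairs contribute zero. No artery/vein labels are used.
    """
    if n_nodes < 2:
        return 0.0
    dist = [[math.inf] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        dist[i][i] = 0.0
    for u, v, length_um, radius_um in edges:
        w = poiseuille_resistance(length_um, radius_um)
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    # Floyd-Warshall all-pairs shortest paths; fine for small graphs.
    for k in range(n_nodes):
        for i in range(n_nodes):
            for j in range(n_nodes):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    total = sum(
        1.0 / dist[i][j]
        for i in range(n_nodes)
        for j in range(n_nodes)
        if i != j and dist[i][j] < math.inf
    )
    return total / (n_nodes * (n_nodes - 1))


# Three-node chain; dilating one segment raises the efficiency.
eff_baseline = global_efficiency(3, [(0, 1, 10.0, 1.0), (1, 2, 10.0, 1.0)])
eff_dilated = global_efficiency(3, [(0, 1, 10.0, 2.0), (1, 2, 10.0, 1.0)])
```

      Because every node pair is treated symmetrically, the metric summarises resistance changes across the imaged network but, as the reviewer notes, cannot stand in for an artery-to-vein transit time.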

      Line 754 Pipeline Limitations and Adaptability

      I think the additional 'problem' of generating new training data for novel data sets or data from other microscopes etc should be addressed or the pipeline tested on such data sets.

      Generating training data is typically the biggest time investment when adapting pipelines.

      The generalization properties of the current pipeline are not discussed (e.g. performance on a different microscope / different brain area / different species etc.).

      The public response to reviews has been updated with out-of-distribution data from other imaging protocols, microscopes, and species, showing generalizability. These results have also been added to the paper as Supplementary Table 4 and Figure 6. The performance of our pipeline on these out-of-distribution data is now discussed in the updated Discussion section.

      Line 810

      Code availability should be coupled with the publication of this paper as it seems the main contribution. I don't see how the code can be made available after publication only. It should be directly available once the manuscript is published and it could help to make it available to the reviewers before that. It can be updated later of course.

      The code is being made available.

      Reviewer #2 (Recommendations For The Authors):

      This analytical pipeline could be quite useful but it needs to be better demonstrated. If faster volumetric imaging is not possible, perhaps using it over a small volume would still demonstrate its utility at a smaller but more believable scale.

      The higher temporal resolution scans (over smaller tissue volumes) have now been performed and the results of applying our pipeline to these data are summarized in Supplementary Figure 2.

      Using sensory stimuli for neuronal activation might be a better idea than optogenetic stimulation. It isn't necessary but it would avoid the blue light issue.

      The pipeline is readily applicable for analysis of vasoreactivity following different perturbers; however, the robustness of vessels’ response is higher with blue light photostimulation of ChR2 than with sensory stimuli [24]. Notwithstanding, an example of the vascular response to electrical stimulation of the contralateral forepaw is now included in Supplementary Figure 2.

      This tool could be quite useful even without neural activity mapping. It obviously makes it even more powerful, but again, the utility could be demonstrated with just vascular data or even anatomical neuronal data without function.

      We agree with both points, and have emphasized them in the revised discussion section.

      Line 559 says the average capillary diameter change was 1.04 um. The next sentence and the table below all have different values so this is unclear.

      The wording was updated to make this clearer.

      Line 584 - should 458 be 552?

      458 is correct.

      Figure 1 - the schematic doesn't seem right - the 650 LPF with the notches is positioned to pass short light and reflect long wavelengths and the notch bands.

      The figure has been updated to reflect this. The original layout was done for compactness.

      References

      (1) D. A. Hartmann, V. Coelho-Santos, and A. Y. Shih, “Pericyte Control of Blood Flow Across Microvascular Zones in the Central Nervous System,” Annu. Rev. Physiol., vol. 84, pp. 331–354, Feb. 2022, doi: 10.1146/annurev-physiol-061121-040127.

      (2) J. Batista, “An adaptive gradient-based boundary detector for MRI images of the brain,” in 7th International Conference on Image Processing and its Applications, Manchester, UK: IEE, 1999, pp. 440–444. doi: 10.1049/cp:19990360.

      (3) Y. Le, X. Xu, L. Zha, W. Zhao, and Y. Zhu, “Tumor boundary detection in ultrasound imagery using multi-scale generalized gradient vector flow,” J. Med. Ultrason., vol. 42, no. 1, pp. 25–38, Jan. 2015, doi: 10.1007/s10396-014-0559-3.

      (4) X. Ren, “Multi-scale Improves Boundary Detection in Natural Images,” in Computer Vision – ECCV 2008, D. Forsyth, P. Torr, and A. Zisserman, Eds., Berlin, Heidelberg: Springer, 2008, pp. 533–545. doi: 10.1007/978-3-540-88690-7_40.

      (5) C. Grigorescu, N. Petkov, and M. A. Westenberg, “Contour and boundary detection improved by surround suppression of texture edges,” Image Vis. Comput., vol. 22, no. 8, pp. 609–622, Aug. 2004, doi: 10.1016/j.imavis.2003.12.004.

      (6) J. Tang and S. T. Acton, “Vessel Boundary Tracking for Intravital Microscopy Via Multiscale Gradient Vector Flow Snakes,” IEEE Trans. Biomed. Eng., vol. 51, no. 2, pp. 316–324, Feb. 2004, doi: 10.1109/TBME.2003.820374.

      (7) J. Merkow, A. Marsden, D. Kriegman, and Z. Tu, “Dense Volume-to-Volume Vascular Boundary Detection,” in Medical Image Computing and Computer-Assisted Intervention - MICCAI 2016, S. Ourselin, L. Joskowicz, M. R. Sabuncu, G. Unal, and W. Wells, Eds., Cham: Springer International Publishing, 2016, pp. 371–379. doi: 10.1007/978-3-319-46726-9_43.

      (8) F. Orujov, R. Maskeliūnas, R. Damaševičius, and W. Wei, “Fuzzy based image edge detection algorithm for blood vessel detection in retinal images,” Appl. Soft Comput., vol. 94, p. 106452, Sep. 2020, doi: 10.1016/j.asoc.2020.106452.

      (9) M. E. Martinez-Perez, A. D. Hughes, S. A. Thom, A. A. Bharath, and K. H. Parker, “Segmentation of blood vessels from red-free and fluorescein retinal images,” Med. Image Anal., vol. 11, no. 1, pp. 47–61, Feb. 2007, doi: 10.1016/j.media.2006.11.004.

      (10) A. M. Mendonca and A. Campilho, “Segmentation of retinal blood vessels by combining the detection of centerlines and morphological reconstruction,” IEEE Trans. Med. Imaging, vol. 25, no. 9, pp. 1200–1213, Sep. 2006, doi: 10.1109/TMI.2006.879955.

      (11) A. F. Frangi, W. J. Niessen, K. L. Vincken, and M. A. Viergever, “Multiscale vessel enhancement filtering,” in Medical Image Computing and Computer-Assisted Intervention — MICCAI’98, W. M. Wells, A. Colchester, and S. Delp, Eds., Berlin, Heidelberg: Springer, 1998, pp. 130–137. doi: 10.1007/BFb0056195.

      (12) K. Bisht et al., “Capillary-associated microglia regulate vascular structure and function through PANX1-P2RY12 coupling in mice,” Nat. Commun., vol. 12, no. 1, p. 5289, Sep. 2021, doi: 10.1038/s41467-021-25590-8.

      (13) Y. Wu et al., “Quantitative relationship between cerebrovascular network and neuronal cell types in mice,” Cell Rep., vol. 39, no. 12, p. 110978, Jun. 2022, doi: 10.1016/j.celrep.2022.110978.

      (14) T. Kirabali et al., “The amyloid-β degradation intermediate Aβ34 is pericyte-associated and reduced in brain capillaries of patients with Alzheimer’s disease,” Acta Neuropathol. Commun., vol. 7, no. 1, p. 194, Dec. 2019, doi: 10.1186/s40478-019-0846-8.

      (15) X. Ren et al., “Linking cortical astrocytic neogenin deficiency to the development of Moyamoya disease–like vasculopathy,” Neurobiol. Dis., vol. 154, p. 105339, Jul. 2021, doi: 10.1016/j.nbd.2021.105339.

      (16) J. Steinman, M. M. Koletar, B. Stefanovic, and J. G. Sled, “3D morphological analysis of the mouse cerebral vasculature: Comparison of in vivo and ex vivo methods,” PLOS ONE, vol. 12, no. 10, p. e0186676, Oct. 2017, doi: 10.1371/journal.pone.0186676.

      (17) A.-A. Berthiaume et al., “Dynamic Remodeling of Pericytes In Vivo Maintains Capillary Coverage in the Adult Mouse Brain,” Cell Rep., vol. 22, no. 1, pp. 8–16, Jan. 2018, doi: 10.1016/j.celrep.2017.12.016.

      (18) S. Katz, R. Gattegno, L. Peko, R. Zarik, Y. Hagani, and T. Ilovitsh, “Diameter-dependent assessment of microvascular leakage following ultrasound-mediated blood-brain barrier opening,” iScience, vol. 26, no. 6, p. 106965, Jun. 2023, doi: 10.1016/j.isci.2023.106965.

      (19) J. Drouin-Ouellet et al., “Cerebrovascular and blood-brain barrier impairments in Huntington’s disease: Potential implications for its pathophysiology,” Ann. Neurol., vol. 78, no. 2, pp. 160–177, Aug. 2015, doi: 10.1002/ana.24406.

      (20) K. P. McDowell, A.-A. Berthiaume, T. Tieu, D. A. Hartmann, and A. Y. Shih, “VasoMetrics: unbiased spatiotemporal analysis of microvascular diameter in multi-photon imaging applications,” Quant. Imaging Med. Surg., vol. 11, no. 3, pp. 969–982, Mar. 2021, doi: 10.21037/qims-20-920.

      (21) E. L. Johnson et al., “Characterization of light penetration through brain tissue, for optogenetic stimulation.” bioRxiv, p. 2021.04.08.438932, Apr. 08, 2021. doi: 10.1101/2021.04.08.438932.

      (22) S. I. Al-Juboori, A. Dondzillo, E. A. Stubblefield, G. Felsen, T. C. Lei, and A. Klug, “Light scattering properties vary across different regions of the adult mouse brain,” PloS One, vol. 8, no. 7, p. e67626, 2013, doi: 10.1371/journal.pone.0067626.

      (23) P. Mächler et al., “Baseline oxygen consumption decreases with cortical depth,” PLOS Biol., vol. 20, no. 10, p. e3001440, Oct. 2022, doi: 10.1371/journal.pbio.3001440.

      (24) J. R. Mester et al., “In vivo neurovascular response to focused photoactivation of Channelrhodopsin-2,” NeuroImage, vol. 192, pp. 135–144, May 2019, doi: 10.1016/j.neuroimage.2019.01.036.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We have significant concerns about the eLife assessment and the reviews. The reviewers acknowledged substantial strengths in our work:

      • Reviewer 3 noted that “the single-unit analyses of tuning direction are robustly characterized”, “the differences in neural correlations across behaviors, regions and perturbations are robust”, and “The evidence for these claims is solid.”

      • Reviewer 2 stated that “the manuscript has been improved” with “new analyses [that] provide improved rigor”.

      Despite these, the final eLife assessment inexplicably downplayed the significance of the findings and strength of evidence.

      Broader Impact and Significance. The findings, not only the data, have theoretical and/or practical implications extending well beyond a single subfield relevant to:

      1. behavioral neuroscientists studying sensorimotor integration

      2. systems and theoretical neuroscientists

      3. neural and biomechanical engineers working on brain-computer interfaces for speech or oral or limb prosthetics

      4. soft robotics researchers

      5. comparative motor control researchers

      6. clinicians involved in the evaluation and rehabilitation of orolingual function (e.g., after stroke or glossectomy, dysphagia)

      Given this broad relevance, we question why the significance was characterized as merely "useful" rather than "important."

      Dismissive Tone Toward Descriptive Research. Some reviews displayed a dismissive or skeptical tone toward the findings and their significance, even when the methods were solid and support for the claims was strong. They critiqued the “descriptive nature” of our study, faulting the lack of mechanistic explanation. However, in poorly understood fields such as orofacial sensorimotor control, descriptive studies provide the empirical foundation for mechanistic studies. Rich descriptive data generate testable hypotheses that drive mechanistic discoveries forward, while mechanistic studies conducted without this groundwork often pursue precise answers to poorly formulated questions.

      Specific Issues with Reviews:

      1. Significant omission in study description:

      The eLife Assessment’s second sentence states: “The data, which include both electrophysiology and nerve block manipulations, will be of value to neuroscientists and neural engineers interested in tongue use.”

      This description omits our simultaneously recorded high-resolution 3D kinematics data—a significant oversight given that combining high-density electrophysiological recording from multiple cortical regions with high-resolution 3D tongue kinematics during naturalistic behaviors in non-human primates represents one of our study's key strengths. Currently, only two research labs in the US possess this capability.

      2. Overemphasis on the “smaller” and “inconsistent” findings

      While we acknowledge some inconsistent findings between animals, the reviews overemphasized these inconsistencies in ways that cast unwarranted doubt on our more significant and consistent results.

      a. Reviewer 1: “[...] the discrepancies in tuning changes across the two NHPs, coupled with the overall exploratory nature of the study, render the interpretation of these subtle differences somewhat speculative.” “[...] in some recording sessions, they blocked sensory feedback using bilateral nerve block injections, which seemed to result in fewer directionally tuned units and changes in the overall distribution of the preferred direction of the units.”

      The skeptical tone of this critique is at odds with Reviewer 3’s statement that “The evidence for these claims is solid.” Here, Reviewer 1 characterized our findings as “somewhat speculative”, seemingly overlooking the robust and consistent changes we documented:

      • “Following nerve block, MIo and SIo showed significant decreases in the proportion of directionally modulated neurons across both tasks (Fig. 10A; Chi-square, MIo: p <0.001, SIo: p < 0.05).”

      • “Nerve block significantly altered PD distributions during both tasks. During feeding, MIo neurons in both subjects exhibited a significant clockwise shift in mean PD toward the center (0°), resulting in more uniform distributions (Fig. 11A; circular k-test, p < 0.01).”

      These results were obtained through careful subsampling of trials with similar kinematics for both feeding and drinking tasks, ensuring that the tuning changes in the nerve block experiments could not be attributed to differing kinematics.

      b. Reviewer 2: “One weakness of the current study is that there is substantial variability in results between monkeys.”

      This vague critique, without specifying which results showed “substantial variability”, reads as though most findings were inconsistent, unfairly casting doubt on our study’s validity.

      3. Inaccurate statements in the Reviewers’ summaries

      Several reviewer statements contain factual inaccuracies:

      a. Reviewer 2: “A majority of neurons in MIo and a (somewhat smaller) percentage of SIo modulated their firing rates during tongue movements, with different modulation depending on the direction of movement (i.e., exhibited directional tuning).”

      Reviewer 2's characterization of directional tuning misrepresents our findings. We reported substantial differences in the proportion of directionally tuned neurons between MIo and SIo during the feeding task but a smaller difference in the drinking task:

      • “The proportion of directionally tuned neurons [...] differed significantly between MIo and SIo during the feeding task in both subjects (Chi-square, p < 0.001). In rostral and caudal MIo, 80% of neurons were modulated to 3D direction (bootstrap, p < 0.05, Fig. 3B, left), compared to 52% in areas 1/2 and 3a/3b.”

      • “During drinking, the proportion of directionally modulated neurons was more similar between regions (69% in MIo vs. 60% in SIo: Chi-square, p > 0.05, Fig. 3B right).”

      b. Reviewer 2: “There were differences observed in the proportion and extent of directional tuning between the feeding and licking behaviors, with stronger tuning overall during licking.”

      Reviewer 2's claim about task differences directly contradicts our findings. We consistently reported stronger tuning in feeding compared to drinking across multiple measures:

      • “The proportion of directionally tuned neurons was higher in the feeding vs. drinking task (Chi-square, p < 0.05, feeding: 72%, drinking: 66%)”;

      • “Cumulative explained variance for the first three factors was higher in feeding (MIo: 82%, SIo: 81%) than in drinking (MIo: 74%, SIo: 63%)”;

      • “Decoding using LSTM showed consistently higher accuracies in feeding compared to drinking regardless of the length of intervals used ..., behavioral window .., and directional angles ...”

      These results were also summarized in the Discussion.

      c. Reviewer 1: In Figure 12, factor 2 and 3 are plotted against each other? and factor 1 is left out?

      Reviewer 1’s observation about Figure 12 is incorrect. Factor 1 was included: Top subplots (feeding) show Factor 1 vs 3 (MIo) and Factor 1 vs 2 (SIo) while the bottom subplots (drinking) show Factor 2 vs 3 (MIo) and Factor 1 vs 2 (SIo). We plotted the two latent factors with highest explained variance for clarity, though all 20 factors were included in intertrajectory distance calculations.

      4. Framing and interpretive over-scrutiny

      Several critiques targeted framing rather than methodological rigor and emphasized that interpretations were speculative even when appropriately hedged:

      a. Reviewer 2: “A revised version of the manuscript incorporates more population-level analyses, but with inconsistent use of quantifications/statistics and without sufficient contextualization of what the reader is to make of these results.”

      Reviewer 2 mentioned "inconsistent use of quantifications/statistics" without specifying which analyses were problematic or updating their summary to include our additional population-level findings.

      b. Reviewer 2: “The described changes in tuning after nerve block could also be explained by changes in kinematics between these conditions, which temper the interpretation of these interesting results”

      Despite our addressing kinematic concerns through subsampled data analysis, Reviewer 2 remained unsatisfied, contrasting sharply with Reviewer 3's assessment that our arguments were "convincing" with "solid" evidence.

      c. Reviewer 2: “I am not convinced of the claim that tongue directional encoding fundamentally changes between drinking and feeding given the dramatically different kinematics and the involvement of other body parts like the jaw”

      Reviewer 2 expressed skepticism about fundamental encoding differences between tasks, despite our comprehensive controls including subsampled data with similar kinematics and multiple verification analyses (equal neuron numbers, stable neurons, various interval lengths, behavioral windows, and directional angles).

      Without describing why these analyses were insufficient, this criticism goes beyond methods or statistics. It casts doubt and challenges whether the conclusions are even worth drawing despite careful experimental controls.

      d. Reviewer 2: “The manuscript states that "An alternative explanation be more statistical/technical in nature: that during feeding, there will be more variability in exactly what somatosensation afferent signals are being received from trial to trial (because slight differences in kinematics can have large differences in exactly where the tongue is and the where/when/how of what parts of it are touching other parts of the oral cavity)? This variability could "smear out" the apparent tuning using these types of trial-averaged analyses. Given how important proprioception and somatosensation are for not biting the tongue or choking, the speculation that somatosensory cortical activity is suppressed during feedback is very counter-intuitive to this reviewer".

      By not updating this section, Reviewer 2 failed to acknowledge our responsive revisions, including Fano factor analysis showing higher variability in SIo during feeding versus drinking, and our updated discussion addressing their concerns about trial-to-trial variability: “Varying tongue shape, tongue’s contact with varying bolus properties (size and texture) and other oral structures (palate, teeth) may weaken the directional signal contained in SIo activity. Thus, small differences in tongue kinematics might create large differences in sensory signals across trials. When looking at trial-averaged signals, this natural variability could make the neural response patterns appear less precise or specific than they are. These are consistent with our findings that for both tasks, spiking variability was higher in SIo.”

      Authors’ Response to Recommendations for the authors:

      We thank the editors and the reviewers for their helpful comments. We have provided a response to the reviewers’ recommendations and made some revisions to the manuscript.

      Reviewer #1 (Recommendations for the authors): 

      In the newly added population factor analysis, several methodological decisions remain unclear to me:

      In Figure 7, why do the authors compare the mean distance between conditions in the latent spaces of MIo and SIo? Since these latent spaces are derived separately, they exist on different scales (with MIo appearing roughly four times larger than SIo), and this discrepancy is reflected in the reported mean distances (Figure 7, inset plots). Wouldn't this undermine a direct comparison?

      Thank you for this helpful feedback. The reviewer is correct that the latent spaces are derived separately for MIo and SIo, thus they exist on different scales as we have noted in the caption of Figure 7: “Axes for SIo are 1/4 scale of MIo.” 

      To allow for a direct comparison between MIo and SIo, we corrected the analysis by comparing their normalized mean inter-trajectory distances obtained by first calculating the geometric index (GI) of the inter-trajectory distances, d, between each pair of population trajectories per region as: $GI = (d_1 - d_2)/(d_1 + d_2)$. We then performed the statistics on the GIs and found a significant difference between mean inter-trajectory distances in MIo vs. SIo. We performed the same analysis comparing the distance travelled between MIo and SIo trajectories by getting the normalized difference in distances travelled and still found a significant difference in both tasks. We have updated the results and figure inset to reflect these changes.
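
      The normalization described above has a useful property: rescaling both distances by the same factor leaves the index unchanged, which is what removes the dependence on the scale of each latent space. The sketch below illustrates only this arithmetic; the function name is hypothetical, and which two distances play the roles of d1 and d2 is not specified in the response, so the inputs here are placeholders.

```python
def geometric_index(d1, d2):
    """Normalized difference (d1 - d2) / (d1 + d2), bounded in [-1, 1].

    d1 and d2 are non-negative distances; their sum must be non-zero.
    """
    if d1 < 0 or d2 < 0:
        raise ValueError("distances must be non-negative")
    if d1 + d2 == 0:
        raise ValueError("at least one distance must be non-zero")
    return (d1 - d2) / (d1 + d2)


# Made-up distances: the index is identical at 1x and 4x scale.
gi_small_scale = geometric_index(3.0, 1.0)
gi_large_scale = geometric_index(12.0, 4.0)
```

      An index of 0 means the two distances are equal, and values near ±1 mean one distance dominates the other.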

      In Figure 12, unlike Figure 7 which shows three latent dimensions, only two factors are plotted. While the methods section describes a procedure for selecting the optimal number of latent factors, Figure 7 - figure supplement 3 shows that variance explained continues to increase up to about five latent dimensions across all areas. Why, then, are fewer dimensions shown?

      Thank you for the opportunity to clarify the figure. The number of latent factors, m, obtained from the 3-fold cross-validation varied for the full sample and was 20 factors for the subsample. We clarify that all statistical analyses were done using 20 latent factors. Using the full sample of neurons, the first 3 factors explained 81% of variance in feeding data compared to 71% in drinking data. When extended to 5 factors, feeding maintained its advantage with 91% variance explained versus 82% for drinking. Because feeding showed higher variance explained than drinking with either 3 or 5 factors, only three factors were shown in Figure 7 for better visualization. We added this clarification to the Methods and Results.

      Figure 12 shows the differences in the neural trajectories between the control and nerve block conditions. The control vs. nerve block comparison complicated the visualization of the results. Thus, we plotted only the two latent factors with the highest separation between population trajectories. This was clarified in the Methods and caption of Figure 12.

      In Figure 12, factor 2 and 3 are plotted against each other? and factor 1 is left out?

      This observation is incorrect; Factor 1 was included: Top subplots (feeding) show Factor 1 vs 3 (MIo) and Factor 1 vs 2 (SIo) while the bottom subplots (drinking) show Factor 2 vs 3 (MIo) and Factor 1 vs 2 (SIo).  We have clarified this in the Methods and caption of Figure 12.

      Finally, why are factor analysis results shown only for monkey R? 

      Factor analysis was performed on both animals, but the results were shown only for monkey R to decrease the number of figures in the manuscript. Figure 7 - figure supplement 1 shows the data for both monkeys. Here are the equivalent Figure 7 plots for monkey Y.

      Author response image 1.

      Reviewer #2 (Recommendations for the authors): 

      Overall, the manuscript has been improved. 

      New analyses provide improved rigor (as just one example, organizing the feeding data into a three-category split to better match the three-direction drinking data decoding analysis and also matching the neuron counts).

      The updated nerve block change method (using an equal number of trials with a similar left-right angle of movement in the last 100 ms of the tongue trajectory) somewhat reduces my concern that kinematic differences could account for the neural changes, but on the other hand the neural analyses use 250 ms (meaning that the neural differences could be related to behavioral differences earlier in the trial). Why not subselect to trials with similar trajectories throughout the whole movement (or at least show that as an additional analysis, albeit one with lower trial counts)?

      As the reviewer pointed out, selecting similar trajectories throughout the whole movement would result in lower trial counts that lead to poor statistical power. We think that the 100 ms prior to maximum tongue protrusion is a more important movement segment to control for similar kinematics between the control and nerve block conditions since this represents the subject’s intended movement endpoint. 

      A lot of the Results seemed like a list of measurements without sufficient hand-holding or guide-posting to explain what the take-away for the reader should be. Just one example to make concrete this broadly-applicable feedback: "Cumulative explained variance for the first three factors was higher in feeding (MIo: 82%, SIo: 81%) than in drinking (MIo: 74%, SIo: 63%) when all neurons were used for the factor analysis (Fig. 7)": why should we care about 3 factors specifically? Does this mean that in feeding, the neural dimensionality is lower (since 3 factors explain more of it)? Does that mean feeding is a "simpler" behavior (which is counter-intuitive and does not conform to the authors' comments about the higher complexity of feeding)? And from later in that paragraph: what are we to make of the differences in neural trajectory distances (aside from quantifying using a different metric the same larger changes in firing rates that could just as well be quantified as statistics across single-neuron PETHs)?

      Thank you for the feedback on the writing style. We have made some revisions to describe the takeaway for the reader. That fewer latent factors explain 80% of the variance in the feeding data means that the underlying network activity is relatively simple despite apparent complexity. When neural population trajectories are farther away from each other in state space, it means that the patterns of activity across tongue directions are more distinct and separable, thus, less likely to be confused with each other. This signifies that neural representations of 3D tongue directions are more robust. When there is better neural discrimination and more reliable information processing, it is easier for downstream brain regions to distinguish between different tongue directions.  
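      To make the notion of distance between population trajectories concrete, here is a minimal sketch that takes two trajectories in latent-factor space and averages the Euclidean distance across time bins. Averaging over time, the function name, and the toy values are assumptions for illustration; the manuscript only states that the Euclidean distance was calculated over m latent factors.

```python
import math


def trajectory_distance(traj_a, traj_b):
    """Mean Euclidean distance between two trajectories of shape (time, m)."""
    if len(traj_a) != len(traj_b):
        raise ValueError("trajectories must have the same number of time bins")
    # One distance per time bin, computed across all m latent factors.
    per_bin = [math.dist(a, b) for a, b in zip(traj_a, traj_b)]
    return sum(per_bin) / len(per_bin)


# Hypothetical trajectories: 2 time bins, m = 3 latent factors.
anterior = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
posterior = [[3.0, 4.0, 0.0], [1.0, 0.0, 12.0]]
print(trajectory_distance(anterior, posterior))  # 8.5
```

      A larger value indicates that the population activity patterns for the two directions sit farther apart in state space, which is the sense of "more distinct and separable" used above.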

      The addition of more population-level analyses is nice as it provides a more efficient summary of the neural measurements. However, it's a surface-level dive into these methods; ultimately the goal of ensemble "computation through dynamics" analyses is to discover simpler structure / organizational principles at the ensemble level (i.e., show things not evident from single neurons), rather than just using them as a way to summarize data. For instance, here neural rotations are remarked upon in the Results, without referencing influential prior work describing such rotations and why neural circuits may use this computational motif to separate out conditions and shape muscle activity-generating readouts (Churchland et al. Nature 2012 and subsequent theoretical iterations including Russo et al.). That said, the Russo et al. tangling study was well-referenced and the present tangling results were effectively contextualized with respect to that paper in terms of the interpretation. I wish more of the results were interpreted with comparable depth.

      Speaking of Russo et al: the authors note qualitative differences in tangling between brain areas, but do not actually quantify tangling in either. These observations would be stronger if quantified and accompanied with statistics.

      Contrary to the reviewer’s critique, we did frame these results in the context of structure/organizational principles at the ensemble level. We had already cited prior work of Churchland et al., 2012; Michaels et al., 2016 and Russo et al., 2018. In the Discussion, Differences across behaviors, we wrote: “In contrast, MIo trajectories in drinking exhibited a consistent rotational direction regardless of spout location (Fig. 7). This may reflect a predominant non-directional information such as condition-independent time-varying spiking activity during drinking (Kaufman et al., 2016; Kobak et al., 2016; Arce-McShane et al., 2023).”

      Minor suggestions: 

      Some typos, e.g. 

      • no opening parenthesis in "We quantified directional differences in population activity by calculating the Euclidean distance over m latent factors)"

      • missing space in "independent neurons(Santhanam et al., 2009;..."); 

      • missing closing parentheses in "followed by the Posterior Inferior (Figure 3 - figure supplement 1."

      There is a one-page long paragraph in the Discussion. Please consider breaking up the text into more paragraphs each organized around one key idea to aid readability.

      Thank you, we have corrected these typos.

      Could it be that the Kaufman et al 2013 reference was intended to be Kaufman et al 2015 eNeuro (the condition-invariant signal paper)?

      Thank you, we have corrected this reference.

      At the end of the Clinical Implications subsection of the Discussion, the authors note the growing field of brain-computer interfaces with references for motor read-out or sensory write-in of hand motor/sensory cortices, respectively. Given that this study looks at orofacial cortices, an even more clinically relevant development is the more recent progress in speech BCIs (two recent reviews: https://www.nature.com/articles/s41583-024-00819-9, https://www.annualreviews.org/content/journals/10.1146/annurev-bioeng-110122012818), many of which record from human ventral motor cortex, and aspirations towards FES-like approaches for orofacial movements (e.g., https://link.springer.com/article/10.1186/s12984-023-01272-y).

      Thank you, we have included these references.

      Reviewer #3 (Recommendations for the authors): 

      Major Suggestions 

      (1) For the factor analysis of feeding vs licking, it appears that the factors were calculated separately for the two behaviors. It could be informative to calculate the factors under both conditions and project the neural data for the two behaviors into that space. The overlap/separations of the subspace could be informative. 

      We clarify that we performed a factor analysis that included both feeding and licking for MIo, as stated in the Results: “To control for factors such as different neurons and kinematics that might influence the results, we performed factor analysis on stable neurons across both tasks using all trials (Fig. 7- figure supplement 2A) and using trials with similar kinematics (Fig. 7- figure supplement 2B).” We have revised the manuscript to reflect this more clearly.

      (2) For the LSTM, the Factor analyses and the decoding it is unclear if the firing rates are mean subtracted and being normalized (the methods section was a little unclear). Typically, papers in the field either z-score the data or do a softmax.

      The firing rates were z-scored for the LSTM and KNN. For the factor analysis, the spike counts were not z-scored, but the results were normalized. We clarified this in the Methods section.
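      For readers unfamiliar with the normalization mentioned here, the following is a minimal sketch of z-scoring one neuron's binned firing rates. Applying it per neuron across time bins, using the population standard deviation, and returning zeros for a silent or constant neuron are all assumptions for illustration, not details taken from the manuscript.

```python
import statistics


def zscore(rates):
    """Z-score one neuron's firing rates: subtract the mean, divide by the SD."""
    mu = statistics.fmean(rates)
    sd = statistics.pstdev(rates)
    if sd == 0:
        # Constant firing rate: no variability to scale, so return zeros.
        return [0.0 for _ in rates]
    return [(r - mu) / sd for r in rates]


# Hypothetical firing rates (spikes/s) for one neuron across three time bins.
print(zscore([2.0, 4.0, 6.0]))
```

      After this step every neuron has zero mean and unit variance, so high-firing neurons do not dominate a distance-based decoder.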

      Minor: 

      Page 1: Abstract- '... how OSMCx contributes to...' 

      Since there are no direct causal manipulations of OSMCx in this manuscript, this study doesn't directly study the OSMCx's contribution to movement - I would recommend rewording this sentence.

      Similarly, Page 2: 'OSMCx plays an important role in coordination...' the citations in this paragraph are correlative, and do not demonstrate a causal role.

      There are similar usages of 'OSMCx coordinates...' in other places e.g. Page 8. 

      Thank you, we revised these sentences.

      Page 7: the LSTM here has 400 units, which is a very large network and contains >12000 parameters. Networks of this size are prone to memorization; it would be wise to test the R-squared of the validation set against a shuffled dataset to see if the network is actually working as intended.

      Thank you for bringing up this important point of verifying that the network is learning meaningful patterns versus memorizing. Considering the size of our training samples, the ratio of samples to parameters is appropriate and thus the risk of memorization is low. Indeed, validation tests and cross-validation performed indicated expected network behavior and the R squared values obtained here were similar to those reported in our previous paper (Laurence-Chasen et al., 2023).
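      One inexpensive way to approximate the shuffle control the reviewer describes is a permutation test on held-out predictions, sketched below. This is an assumption-laden stand-in: it shuffles the pairing between predictions and true values at evaluation time rather than retraining the network on shuffled data, and the function names and toy values are hypothetical.

```python
import random


def r_squared(actual, predicted):
    """Coefficient of determination between true and predicted values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def shuffle_test(actual, predicted, n_shuffles=1000, seed=0):
    """Return (observed R^2, p-value) against a shuffled-target null."""
    rng = random.Random(seed)
    observed = r_squared(actual, predicted)
    shuffled = list(actual)
    n_at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break the pairing between targets and predictions
        if r_squared(shuffled, predicted) >= observed:
            n_at_least += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (n_at_least + 1) / (n_shuffles + 1)


# Hypothetical held-out tongue positions and decoder predictions.
actual = [float(i) for i in range(20)]
predicted = [a + 0.1 * (-1) ** i for i, a in enumerate(actual)]
observed, p_value = shuffle_test(actual, predicted)
print(observed, p_value)
```

      If the observed R squared sits well above the shuffled distribution, the decoder is tracking the true kinematics rather than exploiting chance structure. A stricter control would retrain on shuffled targets.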


      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In their paper, Hosack and Arce-McShane investigate how the 3D movement direction of the tongue is represented in the orofacial part of the sensory-motor cortex and how this representation changes with the loss of oral sensation. They examine the firing patterns of neurons in the orofacial parts of the primary motor cortex (MIo) and somatosensory cortex (SIo) in non-human primates (NHPs) during drinking and feeding tasks. While recording neural activity, they also tracked the kinematics of tongue movement using biplanar videoradiography of markers implanted in the tongue. Their findings indicate that most units in both MIo and SIo are directionally tuned during the drinking task. However, during the feeding task, directional tuning was more frequent in MIo units and less prominent in SIo units. Additionally, in some recording sessions, they blocked sensory feedback using bilateral nerve block injections, which resulted in fewer directionally tuned units and changes in the overall distribution of the preferred direction of the units.

      Strengths:

      The most significant strength of this paper lies in its unique combination of experimental tools. The author utilized a video-radiography method to capture 3D kinematics of the tongue movement during two behavioral tasks while simultaneously recording activity from two brain areas. Moreover, they employed a nerve-blocking procedure to halt sensory feedback. This specific dataset and experimental setup hold great potential for future research on the understudied orofacial segment of the sensory-motor area.

      Weaknesses:

      Aside from the last part of the result section, the majority of the analyses in this paper are focused on single units. I understand the need to characterize the number of single units that directly code for external variables like movement direction, especially for less-studied areas like the orofacial part of the sensory-motor cortex. However, as a field, our decade-long experience in the arm region of sensory-motor cortices suggests that many of the idiosyncratic behaviors of single units can be better understood when the neural activity is studied at the level of the state space of the population. By doing so, for the arm region, we were able to explain why units have "mixed selectivity" for external variables, why the tuning of units changes in the planning and execution phase of the movement, why activity in the planning phase does not lead to undesired muscle activity, etc. See (Gallego et al. 2017; Vyas et al. 2020; Churchland and Shenoy 2024) for a review. Therefore, I believe investigating the dynamics of the population activity in orofacial regions can similarly help the reader go beyond the peculiarities of single units and in a broader view, inform us if the same principles found in the arm region can be generalized to other segments of sensory-motor cortex.

      We thank and agree with the reviewer on the value of information gained from studying population activity. We also appreciate that population analyses have led to the understanding that individual neurons have “mixed selectivity”. We have shown previously that OSMCx neurons exhibit mixed selectivity in their population activity and clear separation between latent factors associated with gape and bite force levels (Arce-McShane FI, Sessle BJ, Ram Y, Ross CF, Hatsopoulos NG (2023) Multiple regions of primate orofacial sensorimotor cortex encode bite force and gape. Front Systems Neurosci. doi: 10.3389/fnsys.2023.1213279. PMID: 37808467 PMCID: 10556252), and chew-side and food types (Li Z & Arce-McShane FI (2023). Cortical representation of mastication in the primate orofacial sensorimotor cortex. Program No. NANO06.05. 2023 Neuroscience Meeting Planner. Washington, D.C.: Society for Neuroscience, 2023. Online.). 

      The primary goal of this paper was to characterize single units in the orofacial region and to do a follow-up paper on population activity. In the revised manuscript, we have now incorporated the results of population-level analyses. The combined results of the single unit and population analyses provide a deeper understanding of the cortical representation of 3D direction of tongue movements during natural feeding and drinking behaviors. 

      Further, for the nerve-blocking experiments, the authors demonstrate that the lack of sensory feedback severely alters how the movement is executed at the level of behavior and neural activity. However, I had a hard time interpreting these results since any change in neural activity after blocking the orofacial nerves could be due to either the lack of the sensory signal or, as the authors suggest, due to the NHPs executing a different movement to compensate for the lack of sensory information or the combination of both of these factors. Hence, it would be helpful to know if the authors have any hint in the data that can tease apart these factors. For example, analyzing a subset of nerve-blocked trials that have similar kinematics to the control.

      Thank you for bringing up this important point. We agree with the reviewer that any change in the neural activity may be attributed to lack of sensory signal or to compensatory changes or a combination of these factors. To tease apart these factors, we sampled an equal number of trials with similar kinematics for both control and nerve block feeding sessions. We added a clarifying description of this approach in the Results section of the revised manuscript: “To confirm this effect was not merely due to altered kinematics, we conducted parallel analyses using carefully subsampled trials with matched kinematic profiles from both control and nerve-blocked conditions.”

      Furthermore, we ran an additional analysis for the drinking datasets by subsampling a similar distribution of drinking movements from each condition. We compared the neural data, including directional tuning, across an equal number of trials with a similar left-right angle of movement in the last 100 ms of the tongue trajectory, nearest the spout. These analyses, which control for similar kinematics, showed that there was still a decrease in the proportion of directionally modulated neurons with nerve block compared to the control. This supports the interpretation that the results can be attributed to the lack of tactile information. These are now integrated in the revised paper under Methods section: Directional tuning of single neurons, as well as Results section: Effects of nerve block: Decreased directional tuning of MIo and SIo neurons and Figure 10 – figure supplement 1.
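      The kinematic matching described above can be sketched as follows: trials from each condition are binned by their left-right angle, and an equal number is drawn from each shared bin. The bin width, the random draw within bins, and the function name are hypothetical choices for illustration; the response does not specify how "similar" angles were defined.

```python
import math
import random
from collections import defaultdict


def match_trials_by_angle(control_angles, block_angles, bin_width=10.0, seed=0):
    """Return (control_idx, block_idx) with equal trial counts per angle bin."""
    rng = random.Random(seed)

    def bin_trials(angles):
        bins = defaultdict(list)
        for idx, angle in enumerate(angles):
            bins[math.floor(angle / bin_width)].append(idx)
        return bins

    control_bins = bin_trials(control_angles)
    block_bins = bin_trials(block_angles)
    control_idx, block_idx = [], []
    # Only bins populated in both conditions can contribute matched trials.
    for b in sorted(set(control_bins) & set(block_bins)):
        n = min(len(control_bins[b]), len(block_bins[b]))
        control_idx += sorted(rng.sample(control_bins[b], n))
        block_idx += sorted(rng.sample(block_bins[b], n))
    return control_idx, block_idx


# Hypothetical left-right angles (degrees) over the final 100 ms of each trial.
control = [-25.0, -22.0, -3.0, 1.0, 2.0, 24.0]
block = [-21.0, 0.0, 3.0, 5.0, 8.0, 40.0]
control_idx, block_idx = match_trials_by_angle(control, block)
print(control_idx, block_idx)
```

      Tuning analyses run on the returned indices then compare conditions on trials whose endpoint kinematics overlap, at the cost of a smaller trial count.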

      Reviewer #2 (Public review):

      Summary:

      This manuscript by Hosack and Arce-McShane examines the directional tuning of neurons in macaque primary motor (MIo) and somatosensory (SIo) cortex. The neural basis of tongue control is far less studied than, for example, forelimb movements, partly because the tongue's kinematics and kinetics are difficult to measure. A major technical advantage of this study is using biplanar video-radiography, processed with modern motion tracking analysis software, to track the movement of the tongue inside the oral cavity. Compared to prior work, the behaviors are more naturalistic behaviors (feeding and licking water from one of three spouts), although the animals were still head-fixed.

      The study's main findings are that:

      • A majority of neurons in MIo and a (somewhat smaller) percentage of SIo modulated their firing rates during tongue movements, with different modulations depending on the direction of movement (i.e., exhibited directional tuning). Examining the statistics of tuning across neurons, there was anisotropy (e.g., more neurons preferring anterior movement) and a lateral bias in which tongue direction neurons preferred that was consistent with the innervation patterns of tongue control muscles (although with some inconsistency between monkeys).

      • Consistent with this encoding, tongue position could be decoded with moderate accuracy even from small ensembles of ~28 neurons.

      • There were differences observed in the proportion and extent of directional tuning between the feeding and licking behaviors, with stronger tuning overall during licking. This potentially suggests behavioral context-dependent encoding.

      • The authors then went one step further and used a bilateral nerve block to the sensory inputs (trigeminal nerve) from the tongue. This impaired the precision of tongue movements and resulted in an apparent reduction and change in neural tuning in MIo and SIo.

      Strengths:

      The data are difficult to obtain and appear to have been rigorously measured, and provide a valuable contribution to this under-explored subfield of sensorimotor neuroscience. The analyses adopt well-established methods, especially from the arm motor control literature, and represent a natural starting point for characterizing tongue 3D direction tuning.

      Weaknesses:

      There are alternative explanations for some of the interpretations, but those interpretations are described in a way that clearly distinguishes results from interpretations, and readers can make their own assessments. Some of these limitations are described in more detail below.

      One weakness of the current study is that there is substantial variability in results between monkeys, and that only one session of data per monkey/condition is analyzed (8 sessions total). This raises the concern that the results could be idiosyncratic. The Methods mention that other datasets were collected, but not analyzed because the imaging pre-processing is very labor-intensive. While I recognize that time is precious, I do think in this case the manuscript would be substantially strengthened by showing that the results are similar on other sessions.

      We acknowledge the reviewer’s concern about inter-subject variability. Animal feeding and drinking behaviors are quite stable across sessions; thus, we do not think that additional sessions will address the concern that the results could be idiosyncratic. Each of the eight datasets analyzed here has sufficient neural and kinematic data to capture neural and behavioral patterns. Nevertheless, we performed some of the analyses on a second feeding dataset from Monkey R. The results from analyses on a subset of this data were consistent across datasets; for example, (1) similar proportions of directionally tuned neurons, (2) similar distances between population trajectories (t-test p > 0.9), and (3) a consistently smaller distance between Anterior-Posterior pairs than others in MIo (t-test p < 0.05) but not SIo (p > 0.1).

      This study focuses on describing directional tuning using the preferred direction (PD) / cosine tuning model popularized by Georgopoulos and colleagues for understanding neural control of arm reaching in the 1980s. This is a reasonable starting point and a decent first-order description of neural tuning. However, the arm motor control field has moved far past that viewpoint, and in some ways, an over-fixation on static representational encoding models and PDs held that field back for many years. The manuscript would benefit from drawing the readers' attention (perhaps in the Discussion) to the fact that PDs are a very simple starting point for characterizing how cortical activity relates to kinematics, but that there is likely much richer population-level dynamical structure and that a more mechanistic, control-focused analytical framework may be fruitful. A good review of this evolution in the arm field can be found in Vyas S, Golub MD, Sussillo D, Shenoy K. 2020. Computation Through Neural Population Dynamics. Annual Review of Neuroscience. 43(1):249-75.

      Thank you for highlighting this important point. Research on orofacial movements hasn't progressed at the same pace as limb movement studies. Our manuscript focused specifically on characterizing the 3D directional tuning properties of individual neurons in the orofacial area—an analysis that has not been conducted previously for orofacial sensorimotor control. While we initially prioritized this individual neuron analysis, we recognize the value of broader population-level insights.

      Based on your helpful feedback, we have incorporated additional population analyses to provide a more comprehensive picture of orofacial sensorimotor control and expanded our discussion section. We appreciate your expertise in pushing our work to be more thorough and aligned with current neuroscience approaches.

      Can the authors explain (or at least speculate) why there was such a large difference in behavioral effect due to nerve block between the two monkeys (Figure 7)?

      We acknowledge this as a variable inherent to this type of experimentation. Previous studies have found large kinematic variation in the effect of oral nerve block as well as in the following compensatory strategies between subjects. Each animal’s biology and response to perturbation vary naturally. Indeed, our subjects exhibited different feeding behavior even in the absence of nerve block perturbation (see Figure 2 in Laurence-Chasen et al., 2022). This is why each individual serves as its own control.

      Do the analyses showing a decrease in tuning after nerve block take into account the changes (and sometimes reduction in variability) of the kinematics between these conditions? In other words, if you subsampled trials to have similar distributions of kinematics between Control and Block conditions, does the effect hold true? The extreme scenario to illustrate my concern is that if Block conditions resulted in all identical movements (which of course they don't), the tuning analysis would find no tuned neurons. The lack of change in decoding accuracy is another yellow flag that there may be a methodological explanation for the decreased tuning result.

      Thank you for bringing up this point. We accounted for the changes in the variability of the kinematics between the control and nerve block conditions in the feeding dataset where we sampled an equal number of trials with similar kinematics for both control and nerve block. However, we did not control for similar kinematics in the drinking task. In the revised manuscript, we have clarified this and performed a similar analysis for the drinking task. We sampled a similar distribution of drinking movements from each condition. We compared the neural data from an equal number of trials with a similar left-right angle of movement in the last 100 ms of the tongue trajectory, nearest the spout. There was a decrease in the percentage of neurons that were directionally modulated (between 30 and 80%) with nerve block compared to the control. These results have been included in the revised paper under Methods section: Directional tuning of single neurons, as well as Results section: Effects of nerve block: Decreased directionality of MIo and SIo neurons.

      While the results from decoding using KNN did not show significant differences between decoding accuracies in control vs. nerve block conditions, the results from the additional factor analysis and decoding using LSTM were consistent with the decrease in directional tuning at the level of individual neurons.  

      The manuscript states that "Our results suggest that the somatosensory cortex may be less involved than the motor areas during feeding, possibly because it is a more ingrained and stereotyped behavior as opposed to tongue protrusion or drinking tasks". Could an alternative explanation be more statistical/technical in nature: that during feeding, there will be more variability in exactly what somatosensation afferent signals are being received from trial to trial (because slight differences in kinematics can have large differences in exactly where the tongue is and the where/when/how of what parts of it are touching other parts of the oral cavity)? This variability could "smear out" the apparent tuning using these types of trial-averaged analyses. Given how important proprioception and somatosensation are for not biting the tongue or choking, the speculation that somatosensory cortical activity is suppressed during feedback is very counter-intuitive to this reviewer.

      Thank you for bringing up this point. We have now incorporated this in our revised Discussion (see Comparison between MIo and SIo). We agree with the reviewer that trial-by-trial variability in the afferent signals may account for the lower directional signal in SIo during feeding than in drinking. Indeed, SIo’s mean-matched Fano factor in feeding was significantly higher than that in drinking (Author response image 1). Moreover, the results of the additional population and decoding analyses also support this.

      Author response image 1.

      Comparison of mean-matched Fano Factor between SIo neurons during feeding and drinking control tasks across both subjects (Wilcoxon rank sum test, p < 0.001).

      Reviewer #3 (Public review):

      Summary:

      In this study, the authors aim to uncover how 3D tongue direction is represented in the Motor (M1o) and Somatosensory (S1o) cortex. In non-human primates implanted with chronic electrode arrays, they use X-ray-based imaging to track the kinematics of the tongue and jaw as the animal is either chewing food or licking from a spout. They then correlate the tongue kinematics with the recorded neural activity. Using linear regressions, they characterize the tuning properties and distributions of the recorded population during feeding and licking. Then, they recharacterize the tuning properties after bilateral lidocaine injections in the two sensory branches of the trigeminal nerve. They report that their nerve block causes a reorganization of the tuning properties. Overall, this paper concludes that M1o and S1o both contain representations of the tongue direction, but their numbers, their tuning properties, and susceptibility to perturbed sensory input are different.

      Strengths:

      The major strengths of this paper are in the state-of-the-art experimental methods employed to collect the electrophysiological and kinematic data.

      Weaknesses:

      However, this paper has a number of weaknesses in the analysis of this data.

      It is unclear how reliable the neural responses are to the stimuli. The trial-by-trial variability of the neural firing rates is not reported. Thus, it is unclear if the methods used for establishing that a neuron is modulated and tuned to a direction are susceptible to spurious correlations. The authors do not use shuffling or bootstrapping tests to determine the robustness of their fits or determining the 'preferred direction' of the neurons. This weakness colors the rest of the paper.

      Thank you for raising these points. We have performed the following additional analyses: (1) We have added analyses to ensure that the results could not be explained by neural variability. To show the trial-by-trial variability of the neural firing rates, we have calculated the Fano factor (mean overall = 1.34747; control = 1.46471; nerve block = 1.23023). The distribution was similar across directions, suggesting that responses of MIo and SIo neurons to varying 3D directions were reliable. (2) We have used a bootstrap procedure to ensure that directional tuning cannot be explained by mere chance. (3) To test the robustness of our PDs we also performed a bootstrap test, which yielded the same results for >90% of neurons, and a multiple linear regression test for fit to a cosine-tuning function. In the revised manuscript, the Methods and Results sections have been updated to include these analyses.  
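      For reference, the Fano factor reported above is the variance of a neuron's spike counts across trials divided by their mean. The sketch below computes the plain (not mean-matched) version per movement direction; the use of the sample variance, the direction labels, and the toy counts are assumptions for illustration.

```python
import statistics


def fano_factor(spike_counts):
    """Fano factor = variance / mean of spike counts across trials."""
    mean_count = statistics.fmean(spike_counts)
    if mean_count == 0:
        raise ValueError("Fano factor is undefined for a neuron with no spikes")
    return statistics.variance(spike_counts) / mean_count


# Hypothetical spike counts for one neuron, grouped by tongue direction.
counts_by_direction = {
    "anterior": [2, 4, 4, 6],
    "posterior": [1, 3, 3, 5],
}
for direction, counts in counts_by_direction.items():
    print(direction, fano_factor(counts))
```

      A value near 1 is what a Poisson process would give; comparing the values across directions is one way to check that trial-to-trial reliability does not differ systematically between movement directions.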

      Author response image 2.

      Comparison of Fano Factor across directions for MIo and SIo Feeding Control (Kruskal-Wallis, p > 0.7).

      The authors compare the tuning properties during feeding to those during licking but only focus on the tongue-tip. However, the two behaviors are different also in their engagement of the jaw muscles. Thus many of the differences observed between the two 'tasks' might have very little to do with an alteration in the properties of the neural code - and more to do with the differences in the movements involved.

      Using the tongue tip for the kinematic analysis of tongue directional movements was a deliberate choice as the anterior region of the tongue is highly mobile and sensitive due to a higher density of mechanoreceptors. The tongue tip is the first region that touches the spout in the drinking task and moves the food into the oral cavity for chewing and subsequent swallowing. 

      We agree with the reviewer that the jaw muscles are engaged differently in feeding vs. drinking (see Fig. 2). For example, a wider variety of jaw movements along the three axes are observed in feeding compared to the smaller amplitude and mostly vertical jaw movements in drinking. Also, the tongue movements are very different between the two behaviors. In feeding, the tongue moves in varied directions to position the food between left-right tooth rows during chewing, whereas in the drinking task, the tongue moves to discrete locations to receive the juice reward. Moreover, the tongue-jaw coordination differs between tasks; maximum tongue protrusion coincides with maximum gape in drinking but with minimum gape in the feeding behavior. Thus, the different tongue and jaw movements required in each behavior may account for some of the differences observed in the directional tuning properties of individual neurons and population activity. These points have been included in the revised Discussion.

      Author response image 3.

      Tongue tip position (mm) and jaw pitch (degrees) during feeding (left) and drinking (right) behaviors. Most protruded tongue position coincides with minimum gape (jaw pitch at 0°) during feeding but with maximum gape during drinking.

      Many of the neurons are likely correlated with both Jaw movements and tongue movements - this complicates the interpretations and raises the possibility that the differences in tuning properties across tasks are trivial.

      We thank the reviewer for raising this important point. In fact, we verified in a previous study whether the correlation between the tongue and jaw kinematics might explain differences in the encoding of tongue kinematics and shape in MIo (see Supplementary Fig. 4 in Laurence-Chasen et al., 2023): “Through iterative sampling of sub-regions of the test trials, we found that correlation of tongue kinematic variables with mandibular motion does not account for decoding accuracy. Even at times where tongue motion was completely un-correlated with the jaw, decoding accuracy could be quite high.” 

      The results obtained from population analyses showing distinct properties of population trajectories in feeding vs. drinking behaviors provide strong support to the interpretation that directional information varies between these behaviors.

      The population analyses for decoding are rudimentary and provide very coarse estimates (left, center, or right), it is also unclear what the major takeaways from the population decoding analyses are. The reduced classification accuracy could very well be a consequence of linear models being unable to account for the complexity of feeding movements, while the licking movements are 'simpler' and thus are better accounted for.

      We thank the reviewer for raising this point. The population decoding analyses provide additional insight on the directional information in population activity, as well as a point of comparison with the results of numerous decoding studies on the arm region of the sensorimotor cortex. In the revised version, we have included the results from decoding tongue direction using a long short-term memory (LSTM) network for sequence-to-sequence decoding. These results differed from the KNN results, indicating that a linear model such as KNN was better for drinking and that a non-linear and continuous decoder was better suited for feeding. These results have been included in the revised manuscript.

      The nature of the nerve block and what sensory pathways are being affected is unclear - the trigeminal nerve contains many different sensory afferents - is there a characterization of how effectively the nerve impulses are being blocked? Have the authors confirmed or characterized the strength of their inactivation or block? I was unable to find any electrophysiological evidence characterizing the perturbation.

      The strength of the nerve block is characterized by a decrease in the baseline firing rate of SIo neurons, as shown in Supplementary Figure 6 of “Loss of oral sensation impairs feeding performance and consistency of tongue–jaw coordination” (Laurence-Chasen et al., 2022).

      Overall, while this paper provides a descriptive account of the observed neural correlations and their alteration by perturbation, a synthesis of the observed changes and some insight into neural processing of tongue kinematics would strengthen this paper.

      We thank the reviewer for this suggestion. We have revised the Discussion to provide a synthesis of the results and insights into the neural processing of tongue kinematics.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) The procedure for anesthesia explained in the method section was not clear to me. The following information was missing: what drug/dose was used? How long the animal was under anesthesia? How long after the recovery the experiments were done?

      The animals were fully sedated with ketamine (100 mg/ml, 10 mg/kg) for less than 30 minutes, and all of the data was collected within 90 minutes after the nerve block was administered.

      (2) In Figure 10, panels A and B are very close together, it was not at first clear whether the text "Monkey R, Monkey Y" belongs to panel A or B.

      We have separated the two panels further in the revised figure.

      (3) I found Figure 11 very busy and hard to interpret. Separating monkeys, fitting the line for each condition, or using a bar plot can help with the readability of the figure.

      Thank you for the suggestion. We agree with you and have reworked this figure. To simplify it we have shown the mean accuracy across iterations.

      (4) I found the laterality discussions like "This signifies that there are more neurons in the left hemisphere contributes toward one direction of tongue movement, suggesting that there is some laterality in the PDs of OSMCx neurons that varies between individuals" a bit of an over-interpretation of the data, given the low n value and the dissimilarity in how strongly the nerve block altered the monkeys' behavior.

      Thank you for sharing this viewpoint. We do think that laterality is a good point of comparison with studies on M1 neurons in the arm/hand region. In our study, we found that the peak of the PD distribution coincides with leftward tongue movements in feeding. The distribution of PDs provides insight into how tongue muscles are coordinated during movement. Intrinsic and extrinsic tongue muscles are involved in shaping the tongue (e.g., elongation, broadening) and positioning the tongue (e.g., protrusion/retraction, elevation/depression), respectively. These muscles receive bilateral motor innervation except for genioglossus. Straight tongue protrusion requires the balanced action of the right and left genioglossi while the lateral protrusion involves primarily the contralateral genioglossus. Given this unilateral innervation pattern, we hypothesized that left MIo/SIo neurons would preferentially respond to leftward tongue movements, corresponding to right genioglossus activation. 

      Reviewer #2 (Recommendations for the authors):

      Are the observation of tuning peaks being most frequently observed toward the anterior and superior directions consistent with the statistics of the movements the tongue typically makes? This could be analogous to anisotropies previously reported in the arm literature, e.g., Lillicrap TP, Scott SH. 2013. Preference Distributions of Primary Motor Cortex Neurons Reflect Control Solutions Optimized for Limb Biomechanics. Neuron. 77(1):168-79

      Thank you for bringing our attention to analogous findings by Lillicrap & Scott, 2013. Indeed, we do observe the highest number of movements in the Anterior Superior directions, followed by the Posterior Inferior. This does align with the distribution of tuning peaks that we observed. Author response image 4 shows the proportions of observed movements in each group of directions across all feeding datasets. We have incorporated this data in the Results section: Neuronal modulation patterns differ between MIo and SIo, as well as added this point in the Discussion.

      Author response image 4.

      Proportion of feeding trials in each group of directions. Error bars represent ±1 standard deviation across datasets (n = 4).

      "The Euclidean distance was used to identify nearest neighbors, and the number of nearest neighbors used was K = 7. This K value was determined after testing different Ks which yielded comparable results." In general, it's a decoding best practice to tune hyperparameters (like K) on fully held-out data from the data used for evaluation. Otherwise, this tends to slightly inflate performance because one picks the hyperparameter that happened to give the best result. It sounds like that held-out validation set wasn't used here. I don't think that's going to change the results much at all (especially given the "comparable results" comment), but providing this suggestion for the future. If the authors replicate results on other datasets, I suggest they keep K = 7 to lock in the method.

      K = 7 was chosen based on the size of our smallest training dataset (n = 55). The purpose of testing different K values was not to select which value gave the best result, but to demonstrate that similar K values did not affect the results significantly. We tested the different K values on a subset of the feeding data, but that data was not fully held-out from the training set. We will keep your suggestion in mind for future analysis.
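
      As a minimal sketch of the kind of classifier described above (Euclidean distance, K = 7 nearest neighbors, majority vote), the following uses toy two-dimensional vectors. The data, the class labels, and the function name `knn_predict` are hypothetical and are not taken from the study's code.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=7):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training sample
    dists = [(math.dist(x, query), label) for x, label in zip(train_x, train_y)]
    dists.sort(key=lambda pair: pair[0])
    # Majority vote over the labels of the k closest samples
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): 2-D feature vectors for three direction classes
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.0, 0.3),
           (5.0, 5.0), (5.1, 4.9), (4.8, 5.2), (5.2, 5.1),
           (10.0, 0.0), (9.8, 0.2), (10.1, 0.1), (9.9, -0.2)]
train_y = ["left"] * 4 + ["center"] * 4 + ["right"] * 4

prediction = knn_predict(train_x, train_y, (5.0, 4.8), k=7)
print(prediction)
```

      Fixing K in advance, rather than choosing it from test performance, is what the reviewer's suggestion to lock in K = 7 amounts to in practice.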

      The smoothing applied to Figure 2 PSTHs appears perhaps excessive (i.e., it may be obscuring interesting finer-grained details of these fast movements). Can the authors reduce the 50 ms Gaussian smoothing (I assume this is the s.d.?) ~25 ms is often used in studying arm kinematics. It also looks like the movement-related modulation may not be finished in these 200 ms / 500 ms windows. I suggest extending the shown time window. It would also be helpful to show some trial-averaged behavior (e.g. speed or % displacement from start) under or behind the PSTHs, to give a sense of what phase of the movement the neural activity corresponds to.

      Thank you for the suggestion. We have taken your suggestions into consideration and modified Figure 2 accordingly. We decreased the Gaussian kernel to 25 ms and extended the time window shown. The trial-averaged anterior/posterior displacement was also added to the drinking PSTHs.
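
      The effect of narrowing the smoothing kernel can be illustrated with a minimal sketch. It assumes that the 25 ms value is the standard deviation of the Gaussian and that spikes are binned at 1 ms; both are assumptions, and the function names are hypothetical.

```python
import math

def gaussian_kernel(sd_ms, bin_ms=1.0, n_sd=3):
    """Return a unit-sum Gaussian kernel truncated at +/- n_sd standard deviations."""
    half = int(round(n_sd * sd_ms / bin_ms))
    weights = [math.exp(-0.5 * ((i * bin_ms) / sd_ms) ** 2)
               for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(rate, kernel):
    """Convolve `rate` with `kernel`, renormalizing at the edges of the window."""
    half = len(kernel) // 2
    out = []
    for t in range(len(rate)):
        acc, norm = 0.0, 0.0
        for j, w in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < len(rate):  # ignore kernel taps that fall outside the data
                acc += w * rate[idx]
                norm += w
        out.append(acc / norm)
    return out

kernel_25 = gaussian_kernel(25.0)  # narrower kernel, preserves faster modulation
kernel_50 = gaussian_kernel(50.0)  # wider kernel, smooths more heavily

# A single spike at bin 100 of a 200-bin window (hypothetical data)
spikes = [0.0] * 200
spikes[100] = 1.0
smoothed = smooth(spikes, kernel_25)
```

      The narrower kernel has a taller, tighter peak, so brief modulations are attenuated less than with the wider kernel.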

      Reviewer #3 (Recommendations for the authors):

      The major consideration here is that the data reported for feeding appears to be very similar to that reported in a previous study:

      "Robust cortical encoding of 3D tongue shape during feeding in macaques"

      Are the neurons reported here the same as the ones used in this previous paper? It is deeply concerning that this is not reported anywhere in the methods section.

      These are the same neurons as in our previous paper, though here we include several additional datasets of the nerve block and drinking sessions. We have now included this in the methods section.

      Second, I strongly recommend that the authors consider a thorough rewrite of this manuscript and improve the presentation of the figures. As written, it was not easy to follow the paper, the logic of the experiments, or the specific data being presented in the figures.

      Thank you for this suggestion. We have done an extensive rewrite of the manuscript and revision of the figures.

      A few recommendations:

      (1) Please structure your results sections and use descriptive topic sentences to focus the reader. In the current version, it is unclear what the major point being conveyed for each analysis is.

      Thank you for this suggestion. We have added topic sentences to begin each section of the results.

      (2) Please show raster plots for at least a few example neurons so that the readers have a sense of what the neural responses look like across trials. Is all of Figure 2 one example neuron or are they different neurons? Error bars for PETH would be useful to show the reliability and robustness of the tuning.

      Figure 2 shows different neurons, one from MIo and one from SIo for each task. There is shading showing ±1 standard error around the line for each direction, however this was a bit difficult to see. In addition to the other changes we have made to these figures, we made the lines smaller and darkened the error bar shading to accentuate this. We also added raster plots corresponding to the same neurons represented in Figure 2 as a supplement.

      (3) Since there are only two data points, I am not sure I understand why the authors have bar graphs and error bars for graphs such as Figure 3B, Figure 5B, etc. How can one have an error bar and means with just 2 data points?

      Those bars represent the standard error of the proportion. We have changed the y-axis label on these figures to make this clearer.
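
      For reference, the standard error of a proportion is commonly computed from the binomial formula sqrt(p(1 - p)/n). The sketch below assumes that formula and uses made-up counts; the exact computation used for the figures is not stated here.

```python
import math

def proportion_standard_error(count, n):
    """Return (p, SE) for a proportion p = count / n, using the binomial formula."""
    p = count / n
    return p, math.sqrt(p * (1.0 - p) / n)

# Hypothetical example: 30 of 100 neurons meet some criterion
p, se = proportion_standard_error(30, 100)
print(f"p = {p:.2f}, SE = {se:.4f}")
```

      Under this formula, an error bar can be drawn for a single proportion, because the uncertainty comes from the number of neurons counted rather than from the number of animals.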

      (4) Results in Figure 6 could be due to differential placement of the electrodes across the animals. How is this being accounted for?

      Yes, this is a possibility which we have mentioned in the discussion. Even with careful placement there is no guarantee to capture a set of neurons with the exact same function in two subjects, as every individual is different. Rather we focus on analyses of data within the same animal. The purpose of Figure 6 is to show the difference between MIo and SIo, and between the two tasks, within the same subject. The more salient result from calculating the preferred direction is that there is a change in the distribution between control and nerve block within the same exact population. Discussions relating to the comparison between individuals are speculative and cannot be confirmed without the inclusion of many more subjects.

      (5) For Figure 7, I would recommend showing the results of the Sham injection in the same figure instead of a supplement.

      Thank you for the suggestion, we have added these results to the figure.

      (6) I think the effects of the sensory block on the tongue kinematics are underexplored in Figure 7 and Figure 8. The authors could explore the deficits in tongue shape, and the temporal components of the trajectory.

      Some of these effects on feeding have been explored in a previous paper, Laurence-Chasen et al., 2022. We performed some additional analyses on changes to kinematics during drinking, including the number of licks per 10-second trial and the length of individual licks. The results of these are included below. We also calculated the difference in the speed of tongue movement during drinking, which generally decreased and exhibited an increase in variance with nerve block (f-test, p < 0.001). However, we have not included these figures in the main paper as they do not inform us about directionality.

      Author response image 5.

      Left halves of hemi-violins (black) are control and right halves (red) are nerve block for an individual. Horizontal black lines represent the mean and horizontal red lines the median. Results of two-tailed t-test and f-test are indicated by asterisks and crosses, respectively: *,† p < 0.05; **,†† p < 0.01; ***,††† p < 0.001.

      (9) In Figures 9 and 10. Are the same neurons being recorded before and after the nerve block? It is unclear if the overall "population" properties are different, or if the properties of individual neurons are changing due to the nerve block.

      Yes, the same neurons are being recorded before and after nerve block. Specifically, Figure 9B shows that the properties of many individual neurons do change due to the nerve block. Differences in the overall population response may be attributed to some of the units having reduced/no activity during the nerve block session.

      Additionally, I recommend that the authors improve their introduction and provide more context to their discussion. Please elaborate on what you think are the main conceptual advances in your study, and place them in the context of the existing literature. By my count, there are 26 citations in this paper, 4 of which are self-citations - clearly, this can be improved upon.

      Thank you for this suggestion. We have done an extensive rewrite of the Introduction and Discussion. We discussed the main conceptual advances in our study and placed them in the context of the existing literature.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This valuable study investigates how the neural representation of individual finger movements changes during the early period of sequence learning. By combining a new method for extracting features from human magnetoencephalography data and decoding analyses, the authors provide incomplete evidence of an early, swift change in the brain regions correlated with sequence learning, including a set of previously unreported frontal cortical regions. The addition of more control analyses to rule out that head movement artefacts influence the findings, and to further explain the proposal of offline contextualization during short rest periods as the basis for improved performance, would strengthen the manuscript.

      We appreciate the Editorial assessment on our paper’s strengths and novelty. We have implemented additional control analyses to show that neither task-related eye movements nor increasing overlap of finger movements during learning account for our findings, which are that contextualized neural representations in a network of bilateral frontoparietal brain regions actively contribute to skill learning. Importantly, we carried out additional analyses showing that contextualization develops predominantly during rest intervals.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study addresses the issue of rapid skill learning and whether individual sequence elements (here: finger presses) are differentially represented in human MEG data. The authors use a decoding approach to classify individual finger elements and accomplish an accuracy of around 94%. A relevant finding is that the neural representations of individual finger elements dynamically change over the course of learning. This would be highly relevant for any attempts to develop better brain machine interfaces - one now can decode individual elements within a sequence with high precision, but these representations are not static but develop over the course of learning.

      Strengths:

      The work follows a large body of work from the same group on the behavioural and neural foundations of sequence learning. The behavioural task is well established and neatly designed to allow for tracking learning and how individual sequence elements contribute. The inclusion of short offline rest periods between learning epochs has been influential because it has revealed that a lot, if not most of the gains in behaviour (i.e., speed of finger movements) occur in these so-called micro-offline rest periods. The authors use a range of new decoding techniques, and exhaustively interrogate their data in different ways, using different decoding approaches. Regardless of the approach, impressively high decoding accuracies are observed, but when using a hybrid approach that combines the MEG data in different ways, the authors observe decoding accuracies of individual sequence elements from the MEG data of up to 94%.

      We have previously shown that neural replay of MEG activity representing the practiced skill was prominent during rest intervals of early learning, and that the replay density correlated with micro-offline gains (Buch et al., 2021). These findings are consistent with recent reports (from two different research groups) that hippocampal ripple density increases during these inter-practice rest periods, and predicts offline learning gains (Chen et al., 2024; Sjøgård et al., 2024). However, decoder performance in our earlier work (Buch et al., 2021) left room for improvement. Here, we reported a strategy to improve decoding accuracy that could benefit future studies of neural replay or BCI using MEG.

      Weaknesses:

      There are a few concerns which the authors may well be able to resolve. These are not weaknesses as such, but factors that would be helpful to address as these concern potential contributions to the results that one would like to rule out. Regarding the decoding results shown in Figure 2 etc, a concern is that within individual frequency bands, the highest accuracy seems to be within frequencies that match the rate of keypresses. This is a general concern when relating movement to brain activity, so is not specific to decoding as done here. As far as reported, there was no specific restraint to the arm or shoulder, and even then it is conceivable that small head movements would correlate highly with the vigor of individual finger movements. This concern is supported by the highest contribution in decoding accuracy being in middle frontal regions - midline structures that would be specifically sensitive to movement artefacts and don't seem to come to mind as key structures for very simple sequential keypress tasks such as this - and the overall pattern is remarkably symmetrical (despite being a unimanual finger task) and spatially broad. This issue may well be matching the time course of learning, as the vigor and speed of finger presses will also influence the degree to which the arm/shoulder and head move. This is not to say that useful information is contained within either of the frequencies or broadband data. But it raises the question of whether a lot is dominated by movement "artefacts" and one may get a more specific answer if removing any such contributions.

      Reviewer #1 expresses concern that the combination of the low-frequency narrow-band decoder results, and the bilateral middle frontal regions displaying the highest average intra-parcel decoding performance across subjects is suggestive that the decoding results could be driven by head movement or other artefacts.

      Head movement artefacts are highly unlikely to contribute meaningfully to our results for the following reasons. First, in addition to ICA denoising, all “recordings were visually inspected and marked to denoise segments containing other large amplitude artifacts due to movements” (see Methods). Second, the response pad was positioned in a manner that minimized wrist, arm or more proximal body movements during the task. Third, while online monitoring of head position was not performed for this study, it was assessed at the beginning and at the end of each recording. The head was restrained with an inflatable air bladder, and head movement between the beginning and end of each scan did not exceed 5 mm for all participants included in the study.

      The Reviewer states a concern that “it is conceivable that small head movements would correlate highly with the vigor of individual finger movements”. We agree that despite the steps taken above, it is possible that minor head movements could still contribute to some remaining variance in the MEG data in our study. However, such correlations between small head movements and finger movements could only meaningfully contribute to decoding performance if: (A) they were consistent and pervasive throughout the recording (which might not be the case if the head movements were related to movement vigor and vigor changed over time); and (B) they systematically varied between different finger movements, and also between the same finger movement performed at different sequence locations (see 5-class decoding performance in Figure 4B). The possibility of any head movement artefacts meeting all these conditions is unlikely. Alternatively, for this task design a much more likely confound could be the contribution of eye movement artefacts to the decoder performance (an issue raised by Reviewer #3 in the comments below).

      Remember from Figure 1A in the manuscript that an asterisk marks the current position in the sequence and is updated at each keypress. Since participants make very few performance errors, the position of the asterisk on the display is highly correlated with the keypress being made in the sequence. Thus, it is possible that if participants are attending to the visual feedback provided on the display, they may generate eye movements that are systematically related to the task. Since we did record eye movements simultaneously with the MEG recordings (EyeLink 1000 Plus; Fs = 600 Hz), we were able to perform a control analysis to address this question. For each keypress event during trials in which no errors occurred (which is the same time-point that the asterisk position is updated), we extracted three features related to eye movements: 1) the gaze position at the time of asterisk position update (triggered by a KeyDown event), 2) the gaze position 150ms later, and 3) the peak velocity of the eye movement between the two positions. We then constructed a classifier from these features with the aim of predicting the location of the asterisk (ordinal positions 1-5) on the display. As shown in the confusion matrix below (Author response image 1), the classifier failed to perform above chance levels (overall cross-validated accuracy = 0.21817):

      Author response image 1.

      Confusion matrix showing that three eye movement features fail to predict asterisk position on the task display above chance levels (Fold 1 test accuracy = 0.21718; Fold 2 test accuracy = 0.22023; Fold 3 test accuracy = 0.21859; Fold 4 test accuracy = 0.22113; Fold 5 test accuracy = 0.21373; Overall cross-validated accuracy = 0.21817). Since the ordinal position of the asterisk on the display is highly correlated with the ordinal position of individual keypresses in the sequence, this analysis provides strong evidence that keypress decoding performance from MEG features is not explained by systematic relationships between finger movement behavior and eye movements (i.e. – behavioral artefacts) (end of figure legend).
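
      The fold accuracies reported in the legend can be summarized against the nominal chance level. This sketch assumes that chance is 1/5 (five equally frequent ordinal positions) and that the overall accuracy is the unweighted mean across folds; neither assumption is confirmed by the text.

```python
# Fold accuracies as reported in the figure legend
fold_accuracies = [0.21718, 0.22023, 0.21859, 0.22113, 0.21373]

n_classes = 5                # ordinal asterisk positions 1-5
chance = 1.0 / n_classes     # assumes balanced classes

mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
margin = mean_accuracy - chance

print(f"mean accuracy = {mean_accuracy:.5f}, chance = {chance:.2f}, margin = {margin:.5f}")
```

      This is a descriptive comparison only; it does not replace a statistical test against chance.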

      Remember that the task display does not provide explicit feedback related to performance, only information about the present position in the sequence. Thus, it is possible that participants did not actively attend to the feedback. In fact, inspection of the eye position data revealed that on the majority of trials, participants displayed random-walk-like gaze patterns around a central fixation point located near the center of the screen. Thus, participants did not attend to the asterisk position on the display, but instead intrinsically generated the action sequence. A similar real-world example would be manually inputting a long password into a secure online application. In this case, one intrinsically generates the sequence from memory and receives similar feedback about the password sequence position (also provided as asterisks) as provided in the study task – feedback which is typically ignored by the user.

      The minimal participant engagement with the visual task display observed in this study highlights another important point – that the behavior in explicit sequence learning motor tasks is highly generative in nature rather than reactive to stimulus cues as in the serial reaction time task (SRTT). This is a crucial difference that must be carefully considered when designing investigations and comparing findings across studies.

      We observed that keypress decoding accuracy was predominantly driven by contralateral primary sensorimotor cortex in the initial practice trials before transitioning to bilateral frontoparietal regions by trials 11 or 12 as performance gains plateaued. The contribution of contralateral primary sensorimotor areas to early skill learning has been extensively reported in humans and non-human animals (Buch et al., 2021; Classen et al., 1998; Karni et al., 1995; Kleim et al., 1998). Similarly, the increased involvement of bilateral frontal and parietal regions to decoding during early skill learning in the non-dominant hand is well known. Enhanced bilateral activation in both frontal and parietal cortex during skill learning has been extensively reported (Doyon et al., 2002; Grafton et al., 1992; Hardwick et al., 2013; Kennerley et al., 2004; Shadmehr & Holcomb, 1997; Toni, Ramnani, et al., 2001), and appears to be even more prominent during early fine motor skill learning in the non-dominant hand (Lee et al., 2019; Sawamura et al., 2019). The frontal regions identified in these studies are known to play crucial roles in executive control (Battaglia-Mayer & Caminiti, 2019), motor planning (Toni, Thoenissen, et al., 2001), and working memory (Andersen & Buneo, 2002; Buneo & Andersen, 2006; Shadmehr & Holcomb, 1997; Toni, Ramnani, et al., 2001; Wolpert et al., 1998) processes, while the same parietal regions are known to integrate multimodal sensory feedback and support visuomotor transformations (Andersen & Buneo, 2002; Buneo & Andersen, 2006; Shadmehr & Holcomb, 1997; Toni, Ramnani, et al., 2001; Wolpert et al., 1998), in addition to working memory (Grover et al., 2022). Thus, it is not surprising that these regions increasingly contribute to decoding as subjects internalize the sequential task. We now include a statement reflecting these considerations in the revised Discussion.

      A somewhat related point is this: when combining voxel and parcel space, a concern is whether a degree of circularity may have contributed to the improved accuracy of the combined data, because it seems to use the same MEG signals twice - the voxels most contributing are also those contributing most to a parcel being identified as relevant, as parcels reflect the average of voxels within a boundary. In this context, I struggled to understand the explanation given, ie that the improved accuracy of the hybrid model may be due to "lower spatially resolved whole-brain and higher spatially resolved regional activity patterns".

      We disagree with the Reviewer’s assertion that the construction of the hybrid-space decoder is circular for the following reasons. First, the base feature set for the hybrid-space decoder constructed for all participants includes whole-brain spatial patterns of MEG source activity averaged within parcels. As stated in the manuscript, these 148 inter-parcel features reflect “lower spatially resolved whole-brain activity patterns” or global brain dynamics. We then independently test how well spatial patterns of MEG source activity for all voxels distributed within individual parcels can decode keypress actions. Again, the tests of these intra-parcel spatial patterns, intended to capture “higher spatially resolved regional brain activity patterns”, are completely independent of one another and independent of the weighting of individual inter-parcel features. These intra-parcel features could, for example, provide additional information about muscle activation patterns or the task environment. These approximately 1150 intra-parcel voxels (on average, with the total number varying between subjects) are then combined with the 148 inter-parcel features to construct the final hybrid-space decoder. In fact, this varied spatial filter approach shares some similarities to the construction of convolutional neural networks (CNNs) used to perform object recognition in image classification applications (Srinivas et al., 2016). One could also view this hybrid-space decoding approach as a spatial analogue to common time-frequency based analyses such as theta-gamma phase amplitude coupling (θ/γ PAC), which assess interactions between two or more narrow-band spectral features derived from the same time-series data (Lisman & Jensen, 2013).

      We directly tested this hypothesis – that spatially overlapping intra- and inter-parcel features portray different information – by constructing an alternative hybrid-space decoder (Hybrid<sub>Alt</sub>) that excluded average inter-parcel features which spatially overlapped with intra-parcel voxel features, and comparing the performance to the decoder used in the manuscript (Hybrid<sub>Orig</sub>). The prediction was that if the overlapping parcel contained similar information to the more spatially resolved voxel patterns, then removing the parcel features (n=8) from the decoding analysis should not impact performance. In fact, despite making up less than 1% of the overall input feature space, removing those parcels resulted in a significant drop in overall performance greater than 2% (78.15% ± 7.03% SD for Hybrid<sub>Orig</sub> vs. 75.49% ± 7.17% for Hybrid<sub>Alt</sub>; Wilcoxon signed rank test, z = 3.7410, p = 1.8326e-04; Author response image 2).

      Author response image 2.

      Comparison of decoding performances with two different hybrid approaches. Hybrid<sub>Alt</sub>: Intra-parcel voxel-space features of top ranked parcels and inter-parcel features of remaining parcels. Hybrid<sub>Orig</sub>: Voxel-space features of top ranked parcels and whole-brain parcel-space features (i.e. – the version used in the manuscript). Dots represent decoding accuracy for individual subjects. Dashed lines indicate the trend in performance change across participants. Note that Hybrid<sub>Orig</sub> (the approach used in our manuscript) significantly outperforms the Hybrid<sub>Alt</sub> approach, indicating that the excluded parcel features provide unique information compared to the spatially overlapping intra-parcel voxel patterns (end of figure legend).
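
      The feature-count arithmetic behind this comparison can be checked directly. The sketch assumes the approximate average of 1150 intra-parcel voxel features; the actual count varies between subjects.

```python
n_parcel_features = 148    # inter-parcel (whole-brain) features
n_voxel_features = 1150    # approximate average intra-parcel voxel features (assumption)
n_removed = 8              # overlapping parcel features excluded in the alternative decoder

total_features = n_parcel_features + n_voxel_features
fraction_removed = n_removed / total_features

# Mean accuracies (in %) reported for the two decoders
accuracy_orig = 78.15
accuracy_alt = 75.49
accuracy_drop = accuracy_orig - accuracy_alt

print(f"removed {fraction_removed:.2%} of features; accuracy fell by {accuracy_drop:.2f} points")
```

      With these numbers, the removed parcels make up well under 1% of the input features, while the mean accuracy difference is about 2.7 percentage points.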

      Firstly, there will be a relatively high degree of spatial contiguity among voxels because of the nature of the signal measured, i.e. nearby individual voxels are unlikely to be independent. Secondly, the voxel data gives a somewhat misleading sense of precision; the inversion can be set up to give an estimate for each voxel, but there will not just be dependence among adjacent voxels, but also substantial variation in the sensitivity and confidence with which activity can be projected to different parts of the brain. Midline and deeper structures come to mind, where the inversion will be more problematic than for regions along the dorsal convexity of the brain, and a concern is that in those midline structures, the highest decoding accuracy is seen.

      We agree with the Reviewer that some inter-parcel features representing neighboring (or spatially contiguous) voxels are likely to be correlated, an important confound in connectivity analyses (Colclough et al., 2015; Colclough et al., 2016), not performed in our investigation.

      In our study, correlations between adjacent voxels effectively reduce the dimensionality of the input feature space. However, as long as there are multiple groups of correlated voxels within each parcel (i.e. – the rank is greater than 1), the intra-parcel spatial patterns could meaningfully contribute to the decoder performance, as shown by the following results:

      First, we obtained higher decoding accuracy with voxel-space features (74.51% ± 7.34% SD) compared to parcel space features (68.77% ± 7.6%; Figure 3B), indicating individual voxels carry more information in decoding the keypresses than the averaged voxel-space features or parcel space features. Second, individual voxels within a parcel showed varying feature importance scores in decoding keypresses (Author response image 3). This finding shows that correlated voxels form mini subclusters that are much smaller spatially than the parcel they reside within.

      Author response image 3.

      Feature importance score of individual voxels in decoding keypresses: MRMR was used to rank the individual voxel space features in decoding keypresses and the min-max normalized MRMR score was mapped to a structural brain surface. Note that individual voxels within a parcel showed different contributions to decoding (end of figure legend).

      Some of these concerns could be addressed by recording head movement (with enough precision) to regress out these contributions. The authors state that head movement was monitored with 3 fiducials, and their time courses ought to provide a way to deal with this issue. The ICA procedure may not have sufficiently dealt with removing movement-related problems, but one could eg relate individual components that were identified to the keypresses as another means for checking. An alternative could be to focus on frequency ranges above the movement frequencies. The accuracy for those still seems impressive and may provide a slightly more biologically plausible assessment.

      We have already addressed the issue of movement-related artefacts in the first response above. With respect to a focus on frequency ranges above movement frequencies, the Reviewer states the “accuracy for those still seems impressive and may provide a slightly more biologically plausible assessment”. First, it is important to note that cortical delta-band oscillations measured with local field potentials (LFPs) in macaques are known to contain important information related to end-effector kinematics (Bansal et al., 2011; Mollazadeh et al., 2011), muscle activation patterns (Flint et al., 2012) and temporal sequencing (Churchland et al., 2012) during skilled reaching and grasping actions. Thus, there is a substantial body of evidence that low-frequency neural oscillatory activity in this range contains important information about the skill learning behavior investigated in the present study. Second, our own data show (as the Reviewer also points out) that significant information related to the skill learning behavior is also present in higher frequency bands (see Figure 2A and Figure 3—figure supplement 1). As we pointed out in our earlier response to questions about the hybrid space decoder architecture (see above), it is likely that different, yet complementary, information is encoded across different temporal frequencies (just as it is encoded across different spatial frequencies) (Heusser et al., 2016). Again, this interpretation is supported by our data, as the highest performing classifiers in all cases (when holding all parameters constant) were always constructed from broadband input MEG data (Figure 2A and Figure 3—figure supplement 1).

      One question concerns the interpretation of the results shown in Figure 4. They imply that during the course of learning, entirely different brain networks underpin the behaviour. Not only that, but they also include regions that would seem rather unexpected to be key nodes for learning and expressing relatively simple finger sequences, such as here. What then is the biological plausibility of these results? The authors seem to circumnavigate this issue by moving into a distance metric that captures the (neural network) changes over the course of learning, but the discussion seems detached from which regions are actually involved; or they offer a rather broad discussion of the anatomical regions identified here, eg in the context of LFOs, where they merely refer to "frontoparietal regions".

      The Reviewer notes the shift in brain networks driving keypress decoding performance between trials 1, 11 and 36 as shown in Figure 4A. The Reviewer questions whether these shifts in brain network states underpinning the skill are biologically plausible, as well as the likelihood that bilateral superior and middle frontal and parietal cortex are important nodes within these networks.

      First, previous fMRI work in humans assessed changes in functional connectivity patterns while participants performed a similar sequence learning task to our present study (Bassett et al., 2011). Using a dynamic network analysis approach, Bassett et al. showed that flexibility in the composition of individual network modules (i.e. – changes in functional brain region membership of orthogonal brain networks) is up-regulated in novel learning environments and explains differences in learning rates across individuals. Thus, consistent with our findings, it is likely that functional brain networks rapidly reconfigure during early learning of novel sequential motor skills.

      Second, frontoparietal network activity is known to support motor memory encoding during early learning (Albouy et al., 2013; Albouy et al., 2012). For example, reactivation events in the posterior parietal (Qin et al., 1997) and medial prefrontal (Euston et al., 2007; Molle & Born, 2009) cortex (MPFC) have been temporally linked to hippocampal replay, and are posited to support memory consolidation across several memory domains (Frankland & Bontempi, 2005), including motor sequence learning (Albouy et al., 2015; Buch et al., 2021; F. Jacobacci et al., 2020). Further, synchronized interactions between MPFC and hippocampus are more prominent during early as opposed to later learning stages (Albouy et al., 2013; Gais et al., 2007; Sterpenich et al., 2009), perhaps reflecting “redistribution of hippocampal memories to MPFC” (Albouy et al., 2013). MPFC contributes to very early memory formation by learning associations between contexts, locations, events and adaptive responses during rapid learning (Euston et al., 2012). Consistently, coupling between hippocampus and MPFC has been shown during initial memory encoding and during subsequent rest (van Kesteren et al., 2010; van Kesteren et al., 2012). Importantly, MPFC activity during initial memory encoding predicts subsequent recall (Wagner et al., 1998). Thus, the spatial map required to encode a motor sequence memory may be “built under the supervision of the prefrontal cortex” (Albouy et al., 2012), which is also engaged in the development of an abstract representation of the sequence (Ashe et al., 2006). In more abstract terms, the prefrontal, premotor and parietal cortices support novice performance “by deploying attentional and control processes” (Doyon et al., 2009; Hikosaka et al., 2002; Penhune & Steele, 2012) required during early learning. The dorsolateral prefrontal cortex (DLPFC) specifically is thought to engage in goal selection and sequence monitoring during early skill practice (Schendan et al., 2003), all consistent with the schema model of declarative memory in which prefrontal cortices play an important role in encoding (Morris, 2006; Tse et al., 2007). Thus, several prefrontal and frontoparietal regions contributing to long term learning (Berlot et al., 2020) are also engaged in early stages of encoding. Altogether, there is strong biological support for the involvement of bilateral prefrontal and frontoparietal regions in decoding during early skill learning. We now address this issue in the revised manuscript.

      If I understand correctly, the offline neural representation analysis is in essence the comparison of the last keypress vs the first keypress of the next sequence. In that sense, the activity during offline rest periods is actually not considered. This makes the nomenclature somewhat confusing. While it matches the behavioural analysis, having only key presses one can't do it in any other way, but here the authors actually do have recordings of brain activity during offline rest. So at the very least calling it offline neural representation is misleading to this reviewer because what is compared is activity during the last and during the next keypress, not activity during offline periods. But it also seems a missed opportunity - the authors argue that most of the relevant learning occurs during offline rest periods, yet there is no attempt to actually test whether activity during this period can be useful for the questions at hand here.

      We agree with the Reviewer that our previous “offline neural representation” nomenclature could be misinterpreted. In the revised manuscript we refer to this difference as the “offline neural representational change”. Please note that our previous work did link offline neural activity (i.e. – 16-22 Hz beta power (Bonstrup et al., 2019) and neural replay density (Buch et al., 2021) during inter-practice rest periods) to observed micro-offline gains.

      Reviewer #2 (Public review):

      Summary

      Dash et al. asked whether and how the neural representation of individual finger movements is "contextualized" within a trained sequence during the very early period of sequential skill learning by using decoding of MEG signal. Specifically, they assessed whether/how the same finger presses (pressing index finger) embedded in the different ordinal positions of a practiced sequence (4-1-3-2-4; here, the numbers 1 through 4 correspond to the little through the index fingers of the non-dominant left hand) change their representation (MEG feature). They did this by computing either the decoding accuracy of the index finger at the ordinal positions 1 vs. 5 (index_OP1 vs index_OP5) or pattern distance between index_OP1 vs. index_OP5 at each training trial and found that both the decoding accuracy and the pattern distance progressively increase over the course of learning trials. More interestingly, they also computed the pattern distance for index_OP5 for the last execution of a practice trial vs. index_OP1 for the first execution in the next practice trial (i.e., across the rest period). This "off-line" distance was significantly larger than the "on-line" distance, which was computed within practice trials and predicted micro-offline skill gain. Based on these results, the authors conclude that the differentiation of representation for the identical movement embedded in different positions of a sequential skill ("contextualization") primarily occurs during early skill learning, especially during rest, consistent with the recent theory of the "micro-offline learning" proposed by the authors' group. I think this is an important and timely topic for the field of motor learning and beyond.

      Strengths

      The specific strengths of the current work are as follows. First, the use of temporally rich neural information (MEG signal) has a large advantage over previous studies testing sequential representations using fMRI. This allowed the authors to examine the earliest period (= the first few minutes of training) of skill learning with finer temporal resolution. Second, through the optimization of MEG feature extraction, the current study achieved extremely high decoding accuracy (approx. 94%) compared to previous works. As claimed by the authors, this is one of the strengths of the paper (but see my comments). Third, although some potential refinement might be needed, comparing "online" and "offline" pattern distance is a neat idea.

      Weaknesses

      Along with the strengths I raised above, the paper has some weaknesses. First, the pursuit of high decoding accuracy, especially the choice of time points and window length (i.e., 200 msec window starting from 0 msec from key press onset), casts a shadow on the interpretation of the main result. Currently, it is unclear whether the decoding results simply reflect behavioral change or true underlying neural change. As shown in the behavioral data, the key press speed reached 3~4 presses per second already at around the end of the early learning period (11th trial), which means inter-press intervals become as short as 250-330 msec. Thus, in almost more than 60% of training period data, the time window for MEG feature extraction (200 msec) spans around 60% of the inter-press intervals. Considering that the preparation/cueing of subsequent presses starts ahead of the actual press (e.g., Kornysheva et al., 2019) and/or potential online planning (e.g., Ariani and Diedrichsen, 2019), the decoder likely has captured these future press information as well as the signal related to the current key press, independent of the formation of genuine sequential representation (e.g., "contextualization" of individual press). This may also explain the gradual increase in decoding accuracy or pattern distance between index_OP1 vs. index_OP5 (Figure 4C and 5A), which co-occurred with performance improvement, as shorter inter-press intervals are more favorable for the dissociating the two index finger presses followed by different finger presses. The compromised decoding accuracies for the control sequences can be explained in similar logic. Therefore, more careful consideration and elaborated discussion seem necessary when trying to both achieve high-performance decoding and assess early skill learning, as it can impact all the subsequent analyses.

      The Reviewer raises the possibility that (given the windowing parameters used in the present study) an increase in “contextualization” with learning could simply reflect faster typing speeds as opposed to an actual change in the underlying neural representation.

      We now include a new control analysis that addresses this issue as well as additional re-examination of previously reported results with respect to this issue – all of which are inconsistent with this alternative explanation that “contextualization” reflects a change in mixing of keypress related MEG features as opposed to a change in the underlying representations themselves. As correct sequences are generated at higher and higher speeds over training, MEG activity patterns related to the planning, execution, evaluation and memory of individual keypresses overlap more in time. Thus, increased overlap between the “4” and “1” keypresses (at the start of the sequence) and “2” and “4” keypresses (at the end of the sequence) could artefactually increase contextualization distances even if the underlying neural representations for the individual keypresses remain unchanged. One must also keep in mind that since participants repeat the sequence multiple times within the same trial, a majority of the index finger keypresses are performed adjacent to one another (i.e. - the “4-4” transition marking the end of one sequence and the beginning of the next). Thus, increased overlap between consecutive index finger keypresses as typing speed increased should increase their similarity and mask contextualization related changes to the underlying neural representations.

      We addressed this question by conducting a new multivariate regression analysis to directly assess whether the neural representation distance score could be predicted by the 4-1, 2-4 and 4-4 keypress transition times observed for each complete correct sequence (both predictor and response variables were z-score normalized within-subject). The results of this analysis affirmed that the alternative explanation, namely that contextualization effects simply reflect increased mixing, is not supported by the data (Adjusted R<sup>2</sup> = 0.00431; F = 5.62). We now include this new negative control analysis in the revised manuscript.
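
      The structure of this control analysis can be sketched as follows. This is a minimal illustration, assuming an ordinary least squares fit with an intercept and three predictors (the 4-1, 2-4 and 4-4 transition times); all function names are hypothetical, and the sketch is not the code used for the reported result.

```python
import math


def zscore(values):
    """Z-score a list of numbers using the population standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


def zscore_within_subject(subject_ids, values):
    """Z-score values separately for each subject, preserving the input order."""
    out = [0.0] * len(values)
    for s in set(subject_ids):
        idx = [i for i, sid in enumerate(subject_ids) if sid == s]
        for i, z in zip(idx, zscore([values[i] for i in idx])):
            out[i] = z
    return out


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def adjusted_r2(X, y):
    """Fit y ~ intercept + X by least squares and return the adjusted R^2."""
    n, p = len(y), len(X[0])
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = p + 1
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    pred = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_mean = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

      In this sketch, each predictor column and the distance score would first be passed through `zscore_within_subject`, and an adjusted R<sup>2</sup> close to zero would indicate that transition times explain almost none of the variance in the distance score.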

      We also re-examined our previously reported classification results with respect to this issue. We reasoned that if mixing effects reflecting the ordinal sequence structure are an important driver of the contextualization finding, these effects should be observable in the distribution of decoder misclassifications. For example, “4” keypresses would be more likely to be misclassified as “1” or “2” keypresses (or vice versa) than as “3” keypresses. The confusion matrices presented in Figures 3C and 4B and Figure 3—figure supplement 3A display a distribution of misclassifications that is inconsistent with an alternative mixing effect explanation of contextualization.
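
      The logic of this check can be expressed as a short Python sketch. The confusion matrix layout (rows are true labels, columns are predicted labels), the function name and the example counts are all hypothetical and serve only to show how misclassification rates toward sequence-adjacent keys could be compared with rates toward the non-adjacent key.

```python
def misclassification_rates(confusion, true_label, labels):
    """Fraction of trials of true_label assigned to each other label.

    confusion[i][j] is the number of trials whose true label is labels[i]
    and whose predicted label is labels[j].
    """
    row = confusion[labels.index(true_label)]
    total = sum(row)
    return {lab: row[j] / total for j, lab in enumerate(labels) if lab != true_label}


# Hypothetical counts, for illustration only (not data from the study).
labels = ["1", "2", "3", "4"]
confusion = [
    [94, 2, 2, 2],
    [3, 92, 3, 2],
    [2, 3, 93, 2],
    [3, 2, 3, 92],
]

rates = misclassification_rates(confusion, "4", labels)
# Keys adjacent to "4" in the sequence 4-1-3-2-4 are "1" and "2".
adjacent = (rates["1"] + rates["2"]) / 2
non_adjacent = rates["3"]
```

      A mixing account would predict `adjacent` to be clearly larger than `non_adjacent`; comparable values, as in these made-up counts, would not support it.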

      Based upon the increased overlap between adjacent index finger keypresses (i.e. – “4-4” transition), we also reasoned that the decoder tasked with separating individual index finger keypresses into two distinct classes based upon sequence position should show decreased performance as typing speed increases. However, Figure 4C in our manuscript shows that this is not the case. The 2-class hybrid classifier actually displays improved classification performance over early practice trials despite greater temporal overlap. Again, this is inconsistent with the idea that the contextualization effect simply reflects increased mixing of individual keypress features.

      In summary, re-examination of previously reported data and the new control analyses converged on the idea that the proximity between keypresses does not explain contextualization.

      We do agree with the Reviewer that the naturalistic, generative, self-paced task employed in the present study results in overlapping brain processes related to planning, execution, evaluation and memory of the action sequence. We also agree that there are several tradeoffs to consider in the construction of the classifiers depending on the study aim. Given our aim of optimizing keypress decoder accuracy in the present study, the set of trade-offs resulted in representations reflecting more the latter three processes, and less so the planning component. Whether separate decoders can be constructed to tease apart the representations or networks supporting these overlapping processes is an important future direction of research in this area. For example, work presently underway in our lab constrains the selection of windowing parameters in a manner that allows individual classifiers to be temporally linked to specific planning, execution, evaluation or memory-related processes to discern which brain networks are involved and how they adaptively reorganize with learning. Results from the present study (Figure 4—figure supplement 2) showing hybrid-space decoder prediction accuracies exceeding 74% for temporal windows spanning as little as 25ms and located up to 100ms prior to the KeyDown event strongly support the feasibility of such an approach.

      Related to the above point, testing only one particular sequence (4-1-3-2-4), aside from the control ones, limits the generalizability of the finding. This also may have contributed to the extremely high decoding accuracy reported in the current study.

      The Reviewer raises a question about the generalizability of the decoder accuracy reported in our study. Fortunately, a comparison between decoder performances on Day 1 and Day 2 datasets does provide insight into this issue. As the Reviewer points out, the classifiers in this study were trained and tested on keypresses performed while practicing a specific sequence (4-1-3-2-4). The study was designed this way to avoid the impact of interference effects on learning dynamics. The cross-validated performance of classifiers on MEG data collected within the same session was 90.47% overall accuracy (4-class; Figure 3C). We then tested classifier performance on data collected during a separate MEG session conducted approximately 24 hours later (Day 2; see Figure 3 — figure supplement 3). We observed a reduction in overall accuracy rate to 87.11% when tested on MEG data recorded while participants performed the same learned sequence, and 79.44% when they performed several previously unpracticed sequences. Both changes in accuracy are important with regard to the generalizability of our findings. First, 87.11% performance accuracy for the trained sequence data on Day 2 (a reduction of only 3.36%) indicates that the hybrid-space decoder performance is robust over multiple MEG sessions, and thus, robust to variations in SNR across the MEG sensor array caused by small differences in head position between scans. This indicates a substantial advantage over sensor-space decoding approaches. Furthermore, when tested on data from unpracticed sequences, overall performance dropped an additional 7.67%. This difference reflects the performance bias of the classifier for the trained sequence, possibly caused by high-order sequence structure being incorporated into the feature weights. In the future, it will be important to understand in more detail how random or repeated keypress sequence training data impacts overall decoder performance and generalization. We strongly agree with the Reviewer that the issue of generalizability is extremely important and have added a new paragraph to the Discussion in the revised manuscript highlighting the strengths and weaknesses of our study with respect to this issue.

      In terms of clinical BCI, one of the potential relevance of the study, as claimed by the authors, it is not clear that the specific time window chosen in the current study (up to 200 msec since key press onset) is really useful. In most cases, clinical BCI would target neural signals with no overt movement execution due to patients' inability to move (e.g., Hochberg et al., 2012). Given the time window, the surprisingly high performance of the current decoder may result from sensory feedback and/or planning of subsequent movement, which may not always be available in the clinical BCI context. Of course, the decoding accuracy is still much higher than chance even when using signal before the key press (as shown in Figure 4 Supplement 2), but it is not immediately clear to me that the authors relate their high decoding accuracy based on post-movement signal to clinical BCI settings.

      The Reviewer questions the relevance of the specific window parameters used in the present study for clinical BCI applications, particularly for paretic patients who are unable to produce finger movements or for whom afferent sensory feedback is no longer intact. We strongly agree with the Reviewer that any intended clinical application must carefully consider the specific input feature constraints dictated by the clinical cohort, and in turn impose appropriate and complementary constraints on classifier parameters that may differ from the ones used in the present study. We now highlight this issue in the Discussion of the revised manuscript and relate our present findings to published clinical BCI work within this context.

      One of the important and fascinating claims of the current study is that the "contextualization" of individual finger movements in a trained sequence specifically occurs during short rest periods in very early skill learning, echoing the recent theory of micro-offline learning proposed by the authors' group. Here, I think two points need to be clarified. First, the concept of "contextualization" is kept somewhat blurry throughout the text. It is only at the later part of the Discussion (around line #330 on page 13) that some potential mechanism for the "contextualization" is provided as "what-and-where" binding. Still, it is unclear what "contextualization" actually is in the current data, as the MEG signal analyzed is extracted from 0-200 msec after the keypress. If one thinks something is contextualizing an action, that contextualization should come earlier than the action itself.

      The Reviewer requests that we: 1) more clearly define our use of the term “contextualization” and 2) provide the rationale for assessing it over a 200ms window aligned to the KeyDown event. This choice of window parameters means that the MEG activity used in our analysis was coincident with, rather than preceding, the actual keypresses. We define contextualization as the differentiation of representation for the identical movement embedded in different positions of a sequential skill. That is, representations of individual action elements progressively incorporate information about their relationship to the overall sequence structure as the skill is learned. We agree with the Reviewer that this can be appropriately interpreted as “what-and-where” binding. We now incorporate this definition in the Introduction of the revised manuscript as requested.

      The window parameters for optimizing accurate decoding of individual finger movements were determined using a grid search of the parameter space (a sliding window of variable width between 25-350 ms with 25 ms increments variably aligned from 0 to +100ms with 10ms increments relative to the KeyDown event). This approach generated 140 different temporal windows for each keypress for each participant, with the final parameter selection determined through comparison of the resulting performance between each decoder. Importantly, the decision to optimize for decoding accuracy placed an emphasis on keypress representations characterized by the most consistent and robust features shared across subjects, which in turn maximize statistical power in detecting common learning-related changes. In this case, the optimal window encompassed a 200ms epoch aligned to the KeyDown event (t<sub>0</sub> = 0 ms). We then asked if the representations (i.e. – spatial patterns of combined parcel- and voxel-space activity) of the same digit at two different sequence positions changed with practice within this optimal decoding window. Of course, our findings do not rule out the possibility that contextualization can also be found before or even after this time window, as we did not directly address this issue in the present study. Future work in our lab, as pointed out above, is investigating contextualization within different time windows tailored specifically for assessing sequence skill action planning, execution, evaluation and memory processes.

      The second point is that the result provided by the authors is not yet convincing enough to support the claim that "contextualization" occurs during rest. In the original analysis, the authors presented the statistical significance regarding the correlation between the "offline" pattern differentiation and micro-offline skill gain (Figure 5. Supplement 1), as well as the larger "offline" distance than "online" distance (Figure 5B). However, this analysis looks like regressing two variables (monotonically) increasing as a function of the trial. Although some information in this analysis, such as what the independent/dependent variables were or how individual subjects were treated, was missing in the Methods, getting a statistically significant slope seems unsurprising in such a situation. Also, curiously, the same quantitative evidence was not provided for its "online" counterpart, and the authors only briefly mentioned in the text that there was no significant correlation between them. It may be true looking at the data in Figure 5A as the online representation distance looks less monotonically changing, but the classification accuracy presented in Figure 4C, which should reflect similar representational distance, shows a more monotonic increase up to the 11th trial. Further, the ways the "online" and "offline" representation distance was estimated seem to make them not directly comparable. While the "online" distance was computed using all the correct press data within each 10 sec of execution, the "offline" distance is basically computed by only two presses (i.e., the last index_OP5 vs. the first index_OP1 separated by 10 sec of rest). Theoretically, the distance between the neural activity patterns for temporally closer events tends to be closer than that between the patterns for temporally far-apart events. It would be fairer to use the distance between the first index_OP1 vs. the last index_OP5 within an execution period for "online" distance, as well.

      The Reviewer suggests that the current data is not enough to show that contextualization occurs during rest and raises two important concerns: 1) the relationship between online contextualization and micro-online gains is not shown, and 2) the online distance was calculated differently from its offline counterpart (i.e. - instead of calculating the distance between last Index<sub>OP5</sub> and first Index<sub>OP1</sub> from a single trial, the distance was calculated for each sequence within a trial and then averaged).

      We addressed the first concern by performing individual subject correlations between 1) contextualization changes during rest intervals and micro-offline gains; 2) contextualization changes during practice trials and micro-online gains, and 3) contextualization changes during practice trials and micro-offline gains (Figure 5 – figure supplement 4). We then statistically compared the resulting correlation coefficient distributions and found that within-subject correlations for contextualization changes during rest intervals and micro-offline gains were significantly higher than online contextualization and micro-online gains (t = 3.2827, p = 0.0015) and online contextualization and micro-offline gains (t = 3.7021, p = 5.3013e-04). These results are consistent with our interpretation that micro-offline gains are supported by contextualization changes during the inter-practice rest periods.
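
      The two-stage logic of this comparison can be sketched in Python as follows. This is a minimal illustration, assuming a Pearson correlation computed per subject across trials and a paired t statistic comparing the resulting per-subject coefficients between conditions; the specific test variant and the function names are assumptions, not a description of the code used in the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two matched samples."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1


def compare_correlations(pairs_a, pairs_b):
    """Compare per-subject correlations between two conditions.

    pairs_a and pairs_b are lists with one (x, y) tuple per subject, in the
    same subject order, e.g. (contextualization change, skill gain) per trial.
    """
    r_a = [pearson_r(x, y) for x, y in pairs_a]
    r_b = [pearson_r(x, y) for x, y in pairs_b]
    return paired_t(r_a, r_b)
```

      Here `pairs_a` could hold, for each subject, the offline contextualization changes and micro-offline gains across trials, and `pairs_b` the online contextualization changes and micro-online gains; a positive t statistic would then indicate stronger within-subject correlations for the offline measures.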

      With respect to the second concern, we agree with the Reviewer that one limitation of the analysis comparing online versus offline changes in contextualization, as presented in the original manuscript, is that it does not eliminate the possibility that any differences could simply be explained by the passage of time (which is smaller for the online analysis compared to the offline analysis). The Reviewer suggests an approach that addresses this issue, which we have now carried out. When quantifying online changes in contextualization from the first Index<sub>OP1</sub> to the last Index<sub>OP5</sub> keypress in the same trial, we observed no learning-related trend (Figure 5 – figure supplement 5, right panel). Importantly, offline distances were significantly larger than online distances regardless of the measurement approach and neither predicted online learning (Figure 5 – figure supplement 6).

      A related concern regarding the control analysis, where individual values for max speed and the degree of online contextualization were compared (Figure 5 Supplement 3), is whether the individual difference is meaningful. If I understood correctly, the optimization of the decoding process (temporal window, feature inclusion/reduction, decoder, etc.) was performed for individual participants, and the same feature extraction was also employed for the analysis of representation distance (i.e., contextualization). If this is the case, the distances are individually differently calculated and they may need to be normalized relative to some stable reference (e.g., 1 vs. 4 or average distance within the control sequence presses) before comparison across the individuals.

      The Reviewer makes a good point here. We have now implemented the suggested normalization procedure in the analysis provided in the revised manuscript.
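
      The point of such a normalization can be illustrated with a short sketch. This is an assumption-laden illustration rather than the procedure implemented in the revised manuscript: it assumes Euclidean distances and uses a single reference pair (for example, two different-finger keypresses) taken from the same participant's feature space. The function names are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def normalized_distance(op1, op5, ref_a, ref_b):
    """OP1-vs-OP5 distance expressed in units of a within-participant reference distance."""
    return euclidean(op1, op5) / euclidean(ref_a, ref_b)

# Two hypothetical participants whose individually optimized feature spaces
# differ only by an arbitrary scale factor of 10.
p1 = normalized_distance([0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [6.0, 8.0])
p2 = normalized_distance([0.0, 0.0], [30.0, 40.0], [0.0, 0.0], [60.0, 80.0])
print(p1, p2)
```

      Raw distances would differ tenfold between these two participants, whereas the normalized values coincide, which is why a stable reference makes cross-participant comparison more defensible.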

      Reviewer #3 (Public review):

      Summary:

      One goal of this paper is to introduce a new approach for highly accurate decoding of finger movements from human magnetoencephalography data via dimension reduction of a "multiscale, hybrid" feature space. Following this decoding approach, the authors aim to show that early skill learning involves "contextualization" of the neural coding of individual movements, relative to their position in a sequence of consecutive movements. Furthermore, they aim to show that this "contextualization" develops primarily during short rest periods interspersed with skill training and correlates with a performance metric which the authors interpret as an indicator of offline learning.

      Strengths:

      A clear strength of the paper is the innovative decoding approach, which achieves impressive decoding accuracies via dimension reduction of a "multi-scale, hybrid space". This hybrid-space approach follows the neurobiologically plausible idea of the concurrent distribution of neural coding across local circuits as well as large-scale networks. A further strength of the study is the large number of tested dimension reduction techniques and classifiers (though the manuscript reveals little about the comparison of the latter).

      We appreciate the Reviewer’s comments regarding the paper’s strengths.

      A simple control analysis based on shuffled class labels could lend further support to this complex decoding approach. As a control analysis that completely rules out any source of overfitting, the authors could test the decoder after shuffling class labels. Following such shuffling, decoding accuracies should drop to chance level for all decoding approaches, including the optimized decoder. This would also provide an estimate of actual chance-level performance (which is informative over and beyond the theoretical chance level). Furthermore, currently, the manuscript does not explain the huge drop in decoding accuracies for the voxel-space decoding (Figure 3B). Finally, the authors' approach to cortical parcellation raises questions regarding the information carried by varying dipole orientations within a parcel (which currently seems to be ignored?) and the implementation of the mean-flipping method (given that there are two dimensions - space and time - what do the authors refer to when they talk about the sign of the "average source", line 477?).

      The Reviewer recommends that we: 1) conduct an additional control analysis on classifier performance using shuffled class labels, 2) provide a more detailed explanation regarding the drop in decoding accuracies for the voxel-space decoding following LDA dimensionality reduction (see Fig 3B), and 3) provide additional details on how problems related to dipole solution orientations were addressed in the present study.

      In relation to the first point, we have now implemented a random shuffling approach as a control for the classification analyses. The results of this analysis indicated that the chance level accuracy was 22.12% (± SD 9.1%) for individual keypress decoding (4-class classification), and 18.41% (± SD 7.4%) for individual sequence item decoding (5-class classification), irrespective of the input feature set or the type of decoder used. Thus, the decoding accuracy observed with the final model was substantially higher than these chance levels.
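
      The logic of a shuffled-label control can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the data are synthetic, a simple nearest-centroid classifier stands in for the actual decoders, and the fold count and number of shuffles are arbitrary. The function name `cv_accuracy` is hypothetical.

```python
import random
import statistics

def cv_accuracy(X, y, n_folds=5):
    """Cross-validated accuracy of a nearest-centroid classifier (illustrative stand-in)."""
    n = len(X)
    correct = 0
    for fold in range(n_folds):
        test_idx = set(range(fold, n, n_folds))
        # Class centroids are estimated from the training samples only.
        sums, counts = {}, {}
        for i in range(n):
            if i in test_idx:
                continue
            acc = sums.setdefault(y[i], [0.0] * len(X[i]))
            for d, value in enumerate(X[i]):
                acc[d] += value
            counts[y[i]] = counts.get(y[i], 0) + 1
        centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        for i in test_idx:
            pred = min(
                centroids,
                key=lambda c: sum((a - b) ** 2 for a, b in zip(X[i], centroids[c])),
            )
            correct += int(pred == y[i])
    return correct / n

# Synthetic 4-class data (assumption): each class has elevated means on two of eight features.
rng = random.Random(1)
n_classes, n_per_class, n_features = 4, 50, 8
X, y = [], []
for i in range(n_classes * n_per_class):
    label = i % n_classes
    X.append([rng.gauss(3.0 if d % n_classes == label else 0.0, 1.0) for d in range(n_features)])
    y.append(label)

true_acc = cv_accuracy(X, y)

# Empirical chance level: repeat the same cross-validation after shuffling the labels.
shuffled_accs = []
for _ in range(20):
    y_shuffled = y[:]
    rng.shuffle(y_shuffled)
    shuffled_accs.append(cv_accuracy(X, y_shuffled))
chance_acc = statistics.fmean(shuffled_accs)
chance_sd = statistics.stdev(shuffled_accs)
print(f"true labels: {true_acc:.3f}; shuffled labels: {chance_acc:.3f} (SD {chance_sd:.3f})")
```

      Because the shuffled runs pass through exactly the same cross-validation as the real run, the resulting distribution gives an empirical chance level that can be compared with the theoretical one (25% for four balanced classes).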

      Second, please note that the dimensionality of the voxel-space feature set is very high (i.e. – 15684). LDA attempts to map the input features onto a much smaller dimensional space (number of classes – 1; e.g. – 3 dimensions, for 4-class keypress decoding). Given the very high dimension of the voxel-space input features in this case, the resulting mapping exhibits reduced accuracy. Despite this general consideration, please refer to Figure 3—figure supplement 3, where we observe improvement in voxel-space decoder performance when utilizing alternative dimensionality reduction techniques.
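
      The "number of classes – 1" limit follows from a standard property of LDA, stated here as background rather than as the authors' own derivation. With $C$ classes, class means $\mu_c$, class sizes $n_c$, and grand mean $\mu$, the between-class scatter matrix has rank at most $C - 1$:

```latex
S_B = \sum_{c=1}^{C} n_c\,(\mu_c - \mu)(\mu_c - \mu)^{\top},
\qquad
\sum_{c=1}^{C} n_c\,(\mu_c - \mu) = 0
\;\Longrightarrow\;
\operatorname{rank}(S_B) \le C - 1 .
```

      For 4-class keypress decoding this leaves at most 3 discriminant axes, however many of the 15684 voxel-space features are supplied. In addition, the within-class scatter matrix estimated from $N$ samples has rank at most $N - C$, so it is singular whenever the feature count greatly exceeds the sample count, which is one commonly cited reason why LDA projections can be unstable in this regime.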

      The decoders constructed in the present study assess the average spatial patterns across time (as defined by the windowing procedure) in the input feature space. We now provide additional details in the Methods of the revised manuscript pertaining to the parcellation procedure and how the sign ambiguity problem was addressed in our analysis.

      Weaknesses:

      A clear weakness of the paper lies in the authors' conclusions regarding "contextualization". Several potential confounds, described below, question the neurobiological implications proposed by the authors and provide a simpler explanation of the results. Furthermore, the paper follows the assumption that short breaks result in offline skill learning, while recent evidence, described below, casts doubt on this assumption.

      We thank the Reviewer for giving us the opportunity to address these issues in detail (see below).

      The authors interpret the ordinal position information captured by their decoding approach as a reflection of neural coding dedicated to the local context of a movement (Figure 4). One way to dissociate ordinal position information from information about the moving effectors is to train a classifier on one sequence and test the classifier on other sequences that require the same movements, but in different positions (Kornysheva et al., 2019). In the present study, however, participants trained to repeat a single sequence (4-1-3-2-4). As a result, ordinal position information is potentially confounded by the fixed finger transitions around each of the two critical positions (first and fifth press). Across consecutive correct sequences, the first keypress in a given sequence was always preceded by a movement of the index finger (=last movement of the preceding sequence), and followed by a little finger movement. The last keypress, on the other hand, was always preceded by a ring finger movement, and followed by an index finger movement (=first movement of the next sequence). Figure 4 - Supplement 2 shows that finger identity can be decoded with high accuracy (>70%) across a large time window around the time of the key press, up to at least +/-100 ms (and likely beyond, given that decoding accuracy is still high at the boundaries of the window depicted in that figure). This time window approaches the keypress transition times in this study. Given that distinct finger transitions characterized the first and fifth keypress, the classifier could thus rely on persistent (or "lingering") information from the preceding finger movement, and/or "preparatory" information about the subsequent finger movement, in order to dissociate the first and fifth keypress. Currently, the manuscript provides no evidence that the context information captured by the decoding approach is more than a by-product of temporally extended, and therefore overlapping, but independent neural representations of consecutive keypresses that are executed in close temporal proximity - rather than a neural representation dedicated to context.

      Such temporal overlap of consecutive, independent finger representations may also account for the dynamics of "ordinal coding"/"contextualization", i.e., the increase in 2-class decoding accuracy, across Day 1 (Figure 4C). As learning progresses, both tapping speed and the consistency of keypress transition times increase (Figure 1), i.e., consecutive keypresses are closer in time, and more consistently so. As a result, information related to a given keypress is increasingly overlapping in time with information related to the preceding and subsequent keypresses. The authors seem to argue that their regression analysis in Figure 5 - Figure Supplement 3 speaks against any influence of tapping speed on "ordinal coding" (even though that argument is not made explicitly in the manuscript). However, Figure 5 - Figure Supplement 3 shows inter-individual differences in a between-subject analysis (across trials, as in panel A, or separately for each trial, as in panel B), and, therefore, says little about the within-subject dynamics of "ordinal coding" across the experiment. A regression of trial-by-trial "ordinal coding" on trial-by-trial tapping speed (either within-subject or at a group-level, after averaging across subjects) could address this issue. Given the highly similar dynamics of "ordinal coding" on the one hand (Figure 4C), and tapping speed on the other hand (Figure 1B), I would expect a strong relationship between the two in the suggested within-subject (or group-level) regression. Furthermore, learning should increase the number of (consecutively) correct sequences, and, thus, the consistency of finger transitions. Therefore, the increase in 2-class decoding accuracy may simply reflect an increasing overlap in time of increasingly consistent information from consecutive keypresses, which allows the classifier to dissociate the first and fifth keypress more reliably as learning progresses, simply based on the characteristic finger transitions associated with each. In other words, given that the physical context of a given keypress changes as learning progresses - keypresses move closer together in time and are more consistently correct - it seems problematic to conclude that the mental representation of that context changes. To draw that conclusion, the physical context should remain stable (or any changes to the physical context should be controlled for).

      The issues raised by Reviewer #3 here are similar to two issues raised by Reviewer #2 above. We agree they must both be carefully considered in any evaluation of our findings.

      As both Reviewers pointed out, the classifiers in this study were trained and tested on keypresses performed while practicing a specific sequence (4-1-3-2-4). The study was designed this way to avoid the impact of interference effects on learning dynamics. The cross-validated performance of classifiers on MEG data collected within the same session was 90.47% overall accuracy (4-class; Figure 3C). We then tested classifier performance on data collected during a separate MEG session conducted approximately 24 hours later (Day 2; see Figure 3—supplement 3). We observed a reduction in overall accuracy rate to 87.11% when tested on MEG data recorded while participants performed the same learned sequence, and 79.44% when they performed several previously unpracticed sequences. This classification performance difference of 7.67% when tested on the Day 2 data could reflect the performance bias of the classifier for the trained sequence, possibly caused by mixed information from temporally close keypresses being incorporated into the feature weights.

      Along these same lines, both Reviewers also raise the possibility that an increase in “ordinal coding/contextualization” with learning could simply reflect an increase in this mixing effect caused by faster typing speeds as opposed to an actual change in the underlying neural representation. The basic idea is that as correct sequences are generated at higher and higher speeds over training, MEG activity patterns related to the planning, execution, evaluation and memory of individual keypresses overlap more in time. Thus, increased overlap between the “4” and “1” keypresses (at the start of the sequence) and “2” and “4” keypresses (at the end of the sequence) could artefactually increase contextualization distances even if the underlying neural representations for the individual keypresses remain unchanged (assuming this mixing of representations is used by the classifier to differentially tag each index finger press). If this were the case, it follows that such mixing effects reflecting the ordinal sequence structure would also be observable in the distribution of decoder misclassifications. For example, “4” keypresses would be more likely to be misclassified as “1” or “2” keypresses (or vice versa) than as “3” keypresses. The confusion matrices presented in Figures 3C and 4B and Figure 3—figure supplement 3A in the previously submitted manuscript do not show this trend in the distribution of misclassifications across the four fingers.

      Following this logic, it’s also possible that if the ordinal coding is largely driven by this mixing effect, the increased overlap between consecutive index finger keypresses during the 4-4 transition marking the end of one sequence and the beginning of the next one could actually mask contextualization-related changes to the underlying neural representations and make them harder to detect. In this case, a decoder tasked with separating individual index finger keypresses into two distinct classes based upon sequence position might show decreased performance with learning as adjacent keypresses overlapped in time with each other to an increasing extent. However, Figure 4C in our previously submitted manuscript does not support this possibility, as the 2-class hybrid classifier displays improved classification performance over early practice trials despite greater temporal overlap.

      As noted in the above reply to Reviewer #2, we also conducted a new multivariate regression analysis to directly assess whether the neural representation distance score could be predicted by the 4-1, 2-4 and 4-4 keypress transition times observed for each complete correct sequence (both predictor and response variables were z-score normalized within-subject). The results of this analysis affirmed that the possible alternative explanation put forward by the Reviewer is not supported by our data (Adjusted R<sup>2</sup> = 0.00431; F = 5.62). We now include this new negative control analysis result in the revised manuscript.
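
      Two bookkeeping steps in this control analysis, within-subject z-scoring and the adjustment of R<sup>2</sup> for the number of predictors, can be sketched as follows. This is a minimal illustration, not the analysis code; the regression fit itself is omitted, and the numbers passed to `adjusted_r2` are made up for the example rather than taken from the study.

```python
import statistics

def zscore(values):
    """Z-score one subject's values (sample standard deviation), applied separately per subject."""
    m = statistics.fmean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical transition times (ms) for one subject, z-scored within that subject.
print(zscore([210.0, 190.0, 200.0]))

# Hypothetical fit with three predictors (the 4-1, 2-4 and 4-4 transition times).
print(adjusted_r2(r2=0.10, n_obs=104, n_predictors=3))
```

      An adjusted R<sup>2</sup> close to zero, as reported above, means that the three transition times jointly explain almost none of the variance in the representation distance score once the number of predictors is taken into account.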

      Finally, the Reviewer hints that one way to address this issue would be to compare MEG responses before and after learning for sequences typed at a fixed speed. However, given that the speed-accuracy trade-off should improve with learning, a comparison between unlearned and learned skill states would dictate that the skill be evaluated at a very low fixed speed. Essentially, such a design presents the problem that the post-training test evaluates the representation in an unlearned behavioral state that is not representative of the acquired skill. Thus, this approach would miss most learning effects on a task in which speed is the main learning metric.

      A similar difference in physical context may explain why neural representation distances ("differentiation") differ between rest and practice (Figure 5). The authors define "offline differentiation" by comparing the hybrid space features of the last index finger movement of a trial (ordinal position 5) and the first index finger movement of the next trial (ordinal position 1). However, the latter is not only the first movement in the sequence but also the very first movement in that trial (at least in trials that started with a correct sequence), i.e., not preceded by any recent movement. In contrast, the last index finger of the last correct sequence in the preceding trial includes the characteristic finger transition from the fourth to the fifth movement. Thus, there is more overlapping information arising from the consistent, neighbouring keypresses for the last index finger movement, compared to the first index finger movement of the next trial. A strong difference (larger neural representation distance) between these two movements is, therefore, not surprising, given the task design, and this difference is also expected to increase with learning, given the increase in tapping speed, and the consequent stronger overlap in representations for consecutive keypresses. Furthermore, initiating a new sequence involves pre-planning, while ongoing practice relies on online planning (Ariani et al., eNeuro 2021), i.e., two mental operations that are dissociable at the level of neural representation (Ariani et al., bioRxiv 2023).

      The Reviewer argues that the last finger movement of a trial and the first in the next trial are performed in different circumstances and contexts. This is an important point and one we tend to agree with. For this task, the first sequence in a practice trial is pre-planned before the first keypress is performed. This occurs in a somewhat different context from the sequence iterations that follow, which involve temporally overlapping planning, execution and evaluation processes. The Reviewer is concerned about a difference in the temporal mixing effect issue raised above between the first and last keypresses performed in a trial. Please note that since neural representations of individual actions are competitively queued during the pre-planning period in a manner that reflects the ordinal structure of the learned sequence (Kornysheva et al., 2019), mixing effects are most likely also present for the first keypress in a trial.

      Separately, the Reviewer suggests that contextualization during early learning may reflect preplanning or online planning. This is an interesting proposal. Given the decoding time-window used in this investigation, we cannot dissect separate contributions of planning, memory and sensory feedback to contextualization. Taking advantage of the superior temporal resolution of MEG relative to fMRI tools, work under way in our lab is investigating decoding time-windows more appropriate to address each of these questions.

      Given these differences in the physical context and associated mental processes, it is not surprising that "offline differentiation", as defined here, is more pronounced than "online differentiation". For the latter, the authors compared movements that were better matched regarding the presence of consistent preceding and subsequent keypresses (online differentiation was defined as the mean difference between all first vs. last index finger movements during practice). It is unclear why the authors did not follow a similar definition for "online differentiation" as for "micro-online gains" (and, indeed, a definition that is more consistent with their definition of "offline differentiation"), i.e., the difference between the first index finger movement of the first correct sequence during practice, and the last index finger of the last correct sequence. While these two movements are, again, not matched for the presence of neighbouring keypresses (see the argument above), this mismatch would at least be the same across "offline differentiation" and "online differentiation", so they would be more comparable.

      This is the same point made earlier by Reviewer #2, and we agree with this assessment. As stated in the response to Reviewer #2 above, we have now carried out quantification of online contextualization using this approach and included it in the revised manuscript. We thank the Reviewer for this suggestion.
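
      The matched definitions discussed here can be made concrete with a short sketch. It is an illustration only: it assumes Euclidean distance between feature vectors, and the data layout (one dict per practice trial holding the feature vector of the first OP1 keypress and of the last OP5 keypress) and the function names are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def online_offline_distances(trials):
    """Return (online, offline) index-finger representation distances.

    online[k]  : first OP1 keypress vs last OP5 keypress within trial k
    offline[k] : last OP5 keypress of trial k vs first OP1 keypress of trial k + 1
    """
    online = [euclidean(t["first_op1"], t["last_op5"]) for t in trials]
    offline = [
        euclidean(prev["last_op5"], nxt["first_op1"])
        for prev, nxt in zip(trials, trials[1:])
    ]
    return online, offline

# Toy two-trial example with 2-D feature vectors.
trials = [
    {"first_op1": [0.0, 0.0], "last_op5": [1.0, 0.0]},
    {"first_op1": [4.0, 4.0], "last_op5": [4.0, 5.0]},
]
online, offline = online_offline_distances(trials)
print(online, offline)
```

      Defined this way, both measures compare one OP1 keypress with one OP5 keypress, so they differ only in whether the interval between the two keypresses is a practice period or a rest period.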

      A further complication in interpreting the results regarding "contextualization" stems from the visual feedback that participants received during the task. Each keypress generated an asterisk shown above the string on the screen, irrespective of whether the keypress was correct or incorrect. As a result, incorrect (e.g., additional, or missing) keypresses could shift the phase of the visual feedback string (of asterisks) relative to the ordinal position of the current movement in the sequence (e.g., the fifth movement in the sequence could coincide with the presentation of any asterisk in the string, from the first to the fifth). Given that more incorrect keypresses are expected at the start of the experiment, compared to later stages, the consistency in visual feedback position, relative to the ordinal position of the movement in the sequence, increased across the experiment. A better differentiation between the first and the fifth movement with learning could, therefore, simply reflect better decoding of the more consistent visual feedback, based either on the feedback-induced brain response, or feedback-induced eye movements (the study did not include eye tracking). It is not clear why the authors introduced this complicated visual feedback in their task, besides consistency with their previous studies.

      We strongly agree with the Reviewer that eye movements related to task engagement are important to rule out as a potential driver of the decoding accuracy or contextualization effect. We address this issue above in response to a question raised by Reviewer #1 about the impact of movement-related artefacts on our findings.

      First, the assumption the Reviewer makes here about the distribution of errors in this task is incorrect. On average across subjects, 2.32% ± 1.48% (mean ± SD) of all keypresses performed were errors, which were evenly distributed across the four possible keypress responses. While errors increased progressively over practice trials, they did so in proportion to the increase in correct keypresses, so that the overall ratio of correct-to-incorrect keypresses remained stable over the training session. Thus, the Reviewer’s assumptions that there is a higher relative frequency of errors in early trials, and a resulting systematic trend in phase-shift differences between the visual display updates (i.e. – a change in asterisk position above the displayed sequence) and the keypress performed, are not substantiated by the data. To the contrary, the asterisk position on the display and the keypress being executed remained highly correlated over the entire training session. We now include a statement about the frequency and distribution of errors in the revised manuscript.

      Given this high correlation, we firmly agree with the Reviewer that the issue of eye-movement-related artefacts is still an important one to address. Fortunately, we did collect eye movement data during the MEG recordings, so we were able to investigate this. As detailed in the response to Reviewer #1 above, we found that gaze positions and eye-movement velocity time-locked to visual display updates (i.e. – a change in asterisk position above the displayed sequence) did not reflect the asterisk location above chance levels (Overall cross-validated accuracy = 0.21817; see Author response image 1). Furthermore, an inspection of the eye position data revealed that most participants on most trials displayed random walk gaze patterns around a center fixation point, indicating that participants did not attend to the asterisk position on the display. This is consistent with intrinsic generation of the action sequence, and congruent with the fact that the display does not provide explicit feedback related to performance. As pointed out above, a similar real-world example would be manually inputting a long password into a secure online application. In this case, one intrinsically generates the sequence from memory and receives similar feedback about the password sequence position (also provided as asterisks), which is typically ignored by the user.

      The minimal participant engagement with the visual display in this explicit sequence learning motor task (which is highly generative in nature) contrasts markedly with behavior observed when reactive responses to stimulus cues are needed in the serial reaction time task (SRTT). This is a crucial difference that must be carefully considered when comparing findings across studies using the two sequence learning tasks.

      The authors report a significant correlation between "offline differentiation" and cumulative micro-offline gains. However, it would be more informative to correlate trial-by-trial changes in each of the two variables. This would address the question of whether there is a trial-by-trial relation between the degree of "contextualization" and the amount of micro-offline gains - are performance changes (micro-offline gains) less pronounced across rest periods for which the change in "contextualization" is relatively low? Furthermore, is the relationship between micro-offline gains and "offline differentiation" significantly stronger than the relationship between micro-offline gains and "online differentiation"?

      In response to a similar issue raised above by Reviewer #2, we now include new analyses comparing correlation magnitudes between (1) “online differentiation” vs micro-online gains, (2) “online differentiation” vs micro-offline gains and (3) “offline differentiation” vs micro-offline gains (see Figure 5 – figure supplements 4, 5 and 6). These new analyses and results have been added to the revised manuscript. Once again, we thank both Reviewers for this suggestion.

      The authors follow the assumption that micro-offline gains reflect offline learning.

      We disagree with this statement. The original (Bonstrup et al., 2019) paper clearly states that micro-offline gains do not necessarily reflect offline learning in some cases and must be carefully interpreted based upon the behavioral context within which they are observed. Further, the paper lays out the conditions under which one can have confidence that micro-offline gains reflect offline learning. In fact, the excellent meta-analysis of (Pan & Rickard, 2015), which re-interprets the benefits of sleep in overnight skill consolidation from a “reactive inhibition” perspective, was a crucial resource in the experimental design of our initial study (Bonstrup et al., 2019), as well as in all our subsequent work. Pan & Rickard state:

      “Empirically, reactive inhibition refers to performance worsening that can accumulate during a period of continuous training (Hull, 1943). It tends to dissipate, at least in part, when brief breaks are inserted between blocks of training. If there are multiple performance-break cycles over a training session, as in the motor sequence literature, performance can exhibit a scalloped effect, worsening during each uninterrupted performance block but improving across blocks (Brawn et al., 2010; Rickard et al., 2008). Rickard, Cai, Rieth, Jones, and Ard (2008) and Brawn, Fenn, Nusbaum, and Margoliash (2010) demonstrated highly robust scalloped reactive inhibition effects using the commonly employed 30 s–30 s performance break cycle, as shown for Rickard et al.’s (2008) massed practice sleep group in Figure 2. The scalloped effect is evident for that group after the first few 30 s blocks of each session. The absence of the scalloped effect during the first few blocks of training in the massed group suggests that rapid learning during that period masks any reactive inhibition effect.”

      Crucially, Pan & Rickard make several concrete recommendations for reducing the impact of the reactive inhibition confound on offline learning studies. One of these recommendations was to reduce practice times to 10s (most prior sequence learning studies up until that point had employed 30s long practice trials). They state:

      “The traditional design involving 30 s-30 s performance break cycles should be abandoned given the evidence that it results in a reactive inhibition confound, and alternative designs with reduced performance duration per block used instead (Pan & Rickard, 2015). One promising possibility is to switch to 10 s performance durations for each performance-break cycle (Pan & Rickard, 2015). That design appears sufficient to eliminate at least the majority of the reactive inhibition effect (Brawn et al., 2010; Rickard et al., 2008).”

      We mindfully incorporated recommendations from (Pan & Rickard, 2015) into our own study designs including 1) utilizing 10s practice trials and 2) constraining our analysis of micro-offline gains to early learning trials (where performance monotonically increases and 95% of overall performance gains occur), which are prior to the emergence of the “scalloped” performance dynamics that are strongly linked to reactive inhibition effects.

      However, there is no direct evidence in the literature that micro-offline gains really result from offline learning, i.e., an improvement in skill level.

      We strongly disagree with the Reviewer’s assertion that “there is no direct evidence in the literature that micro-offline gains really result from offline learning, i.e., an improvement in skill level.” The initial (Bonstrup et al., 2019) report was followed up by a large online crowd-sourcing study (Bonstrup et al., 2020). This second (and much larger) study provided several additional important findings supporting our interpretation of micro-offline gains in cases where the important behavioral conditions clarified above were met (see Author response image 4 below for further details on these conditions).

      Author response image 4.

      This Figure shows that micro-offline gains observed in learning and nonlearning contexts are attributed to different underlying causes. Micro-offline and online changes relative to overall trial-by-trial learning. This figure is based on data from (Bonstrup et al., 2019). During early learning, micro-offline gains (red bars) closely track trial-by-trial performance gains (green line with open circle markers), with minimal contribution from micro-online gains (blue bars). The stated conclusion in Bönstrup et al. (2019) is that micro-offline gains only during this Early Learning stage reflect rapid memory consolidation (see also (Bonstrup et al., 2020)). After early learning, at about practice trial 11, skill plateaus. This plateau skill period is characterized by a striking emergence of coupled (and relatively stable) micro-online drops and micro-offline increases. Bönstrup et al. (2019), as well as others in the literature (Brooks et al., 2024; Gupta & Rickard, 2022; Jacobacci et al., 2020), argue that micro-offline gains during the plateau period likely reflect recovery from inhibitory performance factors such as reactive inhibition or fatigue, and thus must be excluded from analyses relating micro-offline gains to skill learning. The Non-repeating groups in Experiments 3 and 4 from Das et al. (2024) suffer from a lack of consideration of these known confounds (end of Fig legend).

      Evidence documented in that paper (Bonstrup et al., 2020) showed that micro-offline gains during early skill learning were: 1) replicable and generalized to subjects learning the task in their daily living environment (n=389); 2) equivalent when significantly shortening practice period duration, thus confirming that they are not a result of recovery from performance fatigue (n=118); 3) reduced (along with learning rates) by retroactive interference applied immediately after each practice period relative to interference applied after passage of time (n=373), indicating stabilization of the motor memory at a microscale of several seconds consistent with rapid consolidation; and 4) not modified by random termination of the practice periods, ruling out a contribution of predictive motor slowing (n=71) (Bonstrup et al., 2020). Altogether, our findings were strongly consistent with the interpretation that micro-offline gains reflect memory consolidation supporting early skill learning. This is precisely the portion of the learning curve that Pan & Rickard (2015) refer to when they state “…rapid learning during that period masks any reactive inhibition effect”.

      This interpretation is further supported by brain imaging evidence linking known memory-related networks and consolidation mechanisms to micro-offline gains. First, we reported that the density of fast hippocampo-neocortical skill memory replay events increases approximately three-fold during early learning inter-practice rest periods with the density explaining differences in the magnitude of micro-offline gains across subjects (Buch et al., 2021). Second, Jacobacci et al. (2020) independently reproduced our original behavioral findings and reported BOLD fMRI changes in the hippocampus and precuneus (regions also identified in our MEG study (Buch et al., 2021)) linked to micro-offline gains during early skill learning. These functional changes were coupled with rapid alterations in brain microstructure in the order of minutes, suggesting that the same network that operates during rest periods of early learning undergoes structural plasticity over several minutes following practice (Deleglise et al., 2023). Crucial to this point, Chen et al. (2024) and Sjøgård et al (2024) provided direct evidence from intracranial EEG in humans linking sharp-wave ripple density during rest periods (which are known markers for neural replay (Buzsaki, 2015)) in the human hippocampus (80-120 Hz) to micro-offline gains during early skill learning.

      Thus, there is now substantial converging evidence in humans across different indirect noninvasive and direct invasive recording techniques linking hippocampal activity, neural replay dynamics and offline performance gains in skill learning.

      On the contrary, recent evidence questions this interpretation (Gupta & Rickard, npj Sci Learn 2022; Gupta & Rickard, Sci Rep 2024; Das et al., bioRxiv 2024). Instead, there is evidence that micro-offline gains are transient performance benefits that emerge when participants train with breaks, compared to participants who train without breaks; however, these benefits vanish within seconds after training if both groups of participants perform under comparable conditions (Das et al., bioRxiv 2024).

      The recent work of Gupta & Rickard (2022, 2024) does not present any data that directly opposes our finding that early skill learning (Bonstrup et al., 2019) is expressed as micro-offline gains during rest breaks. These studies are an extension of the Rickard et al. (2008) paper that employed a massed (30s practice followed by 30s breaks) vs spaced (10s practice followed by 10s breaks) experimental design to assess if recovery from reactive inhibition effects could account for performance gains measured after several minutes or hours. Gupta & Rickard (2022) added two groups (30s practice/10s break and 10s practice/10s break as used in the work from our group). The primary aim of the study was to assess whether it was more likely that changes in performance when retested 5 minutes after skill training (consisting of 12 practice trials for the massed groups and 36 practice trials for the spaced groups) had ended reflected memory consolidation effects or recovery from reactive inhibition effects. The Gupta & Rickard (2024) follow-up paper employed a similar design, with the primary difference being that participants performed a fixed number of sequences on each trial as opposed to trials lasting a fixed duration. This was done to facilitate the fitting of a quantitative statistical model to the data.

      To reiterate, neither study included any analysis of micro-online or micro-offline gains, nor any comparison focused on skill gains during early learning trials (only at retest 5 min later). Instead, Gupta & Rickard (2022) reported evidence for reactive inhibition effects for all groups over much longer training periods than early learning. In fact, we reported the same findings for trials following the early learning period in our original 2019 paper (Bonstrup et al., 2019) (Author response image 4). Please note that we also reported that cumulative micro-offline gains over early learning did not correlate with overnight offline consolidation measured 24 hours later (Bonstrup et al., 2019) (see the Results section and further elaboration in the Discussion). We interpreted these findings as indicative that the mechanisms underlying offline gains over the micro-scale of seconds during early skill learning versus over minutes or hours very likely differ.

      In the recent preprint from Das et al. (2024), the authors make the strong claim that “micro-offline gains during early learning do not reflect offline learning”, which is not supported by their own data. The authors hypothesize that if “micro-offline gains represent offline learning, participants should reach higher skill levels when training with breaks, compared to training without breaks”. To test this hypothesis, the study uses a between-subjects design with spaced vs. massed practice groups, inspired by the reactive inhibition work from Rickard and others.

      Crucially, their design incorporates only a small fraction of the training used in other investigations to evaluate early skill learning (Bonstrup et al., 2020; Bonstrup et al., 2019; Brooks et al., 2024; Buch et al., 2021; Deleglise et al., 2023; Jacobacci et al., 2020; Mylonas et al., 2024). A direct comparison between the practice schedule designs for the spaced and massed groups in Das et al. and the training schedule all participants experienced in the original Bönstrup et al. (2019) paper highlights this issue as well as several others (Author response image 5):

      Author response image 5.

      (A) Comparison of the Das et al. Spaced and Massed group training session designs with the training session design from the original Bönstrup et al. (2019) paper. Similar to the approach taken by Das et al., all practice is visualized as 10-second practice trials with a variable number (either 0, 1 or 30) of 10-second-long inter-practice rest intervals to allow for direct comparisons between designs. The two key takeaways from this comparison are that (1) the intervention differences (i.e. – practice schedules) between the Massed and Spaced groups from the Das et al. report are extremely small (less than 12% of the overall session schedule) (gaps in the red shaded area) and (2) the overall amount of practice is much less than in the design from the original Bönstrup report (Bonstrup et al., 2019) (which has been utilized in several subsequent studies). (B) Group-level learning curve data from Bönstrup et al. (2019) are used to estimate the performance range accounted for by the equivalent periods covering Test 1, Training 1 and Test 2 from Das et al. (2024). Note that the intervention in the Das et al. study is limited to a period covering less than 50% of the overall learning range.

      Participants in the original Bönstrup et al. (2019) study experienced 157.14% more practice time and 46.97% less inter-practice rest time than the Spaced group in the Das et al. study (Author response image 5). Thus, the overall amounts of practice and rest differ substantially between studies, with much more limited training occurring for participants in Das et al.

      In addition, the training interventions (i.e. – the practice schedule differences between the Spaced and Massed groups) were designed in a manner that minimized any chance of effectively testing their hypothesis. First, the interventions were applied over an extremely short period relative to the length of the total training session (5% and 12% of the total training session for Massed and Spaced groups, respectively; see gaps in the red shaded area in Author response image 5). Second, the intervention was applied during a period in which only half of the known total learning occurs. Specifically, we know from Bönstrup et al. (2019) that only 46.57% of the total performance gains occur in the practice interval covered by the Das et al. Training 1 intervention. Thus, early skill learning as evaluated by multiple groups (Bonstrup et al., 2020; Bonstrup et al., 2019; Brooks et al., 2024; Buch et al., 2021; Deleglise et al., 2023; Jacobacci et al., 2020; Mylonas et al., 2024) is truncated to about half in the Das et al. experiment.

      Furthermore, a substantial amount of learning takes place during the Das et al. Test 1 and Test 2 periods (32.49% of total gains combined). The fact that substantial learning is known to occur over both the Test 1 (18.06%) and Test 2 (14.43%) intervals presents a fundamental problem described by Pan & Rickard (2015). They reported that averaging over intervals where substantial performance gains occur (i.e. – performance is not stable) injects crucial artefacts into analyses of skill learning:

      “A large amount of averaging has the advantage of yielding more precise estimates of each subject’s pretest and posttest scores and hence more statistical power to detect a performance gain. However, calculation of gain scores using that strategy runs the risk that learning that occurs during the pretest and (or) posttest periods (i.e., online learning) is incorporated into the gain score (Rickard et al., 2008; Robertson et al., 2004).”

      The above statement indicates that the Test 1 and Test 2 performance scores from Das et al. (2024) are substantially contaminated by the learning rate within these intervals. This is particularly problematic if the intervention design results in different Test 2 learning rates between the two groups. This, in fact, is apparent in their data (Figure 1C,E of the Das et al., 2024 preprint), as the Test 2 learning rate for the Spaced group is negative (indicating a unique interference effect observable only for this group). Specifically, the Massed group continues to show an increase in performance during Test 2 and 4 relative to the last 10 seconds of practice during Training 1 and 2, respectively, while the Spaced group displays a marked decrease. This post-training performance decrease for the Spaced group is in stark contrast to the monotonic performance increases observed for both groups at all other time-points. One possible cause could be related to the structure of the Test intervals, which include 20 seconds of uninterrupted practice. For the Spaced group, this effectively is a switch to a Massed practice environment (i.e., two 10-second-long practice trials merged into one long trial), which interferes with the greater Training 1 interval gains observed for the Spaced group. Interestingly, when statistical comparisons between the groups are made at the time-points when the intervention is present (Figure 1E), the stated hypothesis, “If micro-offline gains represent offline learning, participants should reach higher skill levels when training with breaks, compared to training without breaks”, is confirmed.
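
      The concern raised in the Pan & Rickard quote can be illustrated with a toy simulation. The sketch below is purely illustrative: the exponential learning curve, its time constant, and the 20-second window lengths are assumptions chosen for the example, not values fitted to any dataset. Skill is made to depend only on cumulative practice time, so the rest break contributes no change by construction, yet a gain score computed from window averages is still positive.

```python
import math


def performance(practice_seconds, asymptote=1.0, tau=30.0):
    """Hypothetical skill level after a given amount of cumulative practice."""
    return asymptote * (1.0 - math.exp(-practice_seconds / tau))


def window_mean(start, stop, step=1.0):
    """Average performance over a practice window sampled every `step` seconds."""
    n_samples = int(round((stop - start) / step))
    samples = [performance(start + i * step) for i in range(n_samples)]
    return sum(samples) / n_samples


# Windows are expressed in cumulative practice time (assumed values).
# The rest break sits between them and, by construction, changes nothing.
pretest = (0.0, 20.0)
posttest = (20.0, 40.0)

# Gain score from window averages, as in a pretest/posttest comparison.
averaged_gain = window_mean(*posttest) - window_mean(*pretest)

# Change across the break itself: end of pretest vs. start of posttest.
edge_gain = performance(posttest[0]) - performance(pretest[1])
```

      Here `edge_gain` is exactly zero, while `averaged_gain` is positive solely because learning continues within both averaging windows.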

      In summary, the experimental design and analyses used by Das et al. do not contradict the view that early skill learning is expressed as micro-offline gains during rest breaks. The data presented by Gupta and Rickard (2022, 2024) and Das et al. (2024) are in many ways more confirmatory of the constraints employed by our group and others with respect to experimental design, analysis and interpretation of study findings, rather than contradictory. Still, they do highlight a limitation of the current micro-online/offline framework, which was originally only intended to be applied to early skill learning over spaced practice schedules when reactive inhibition effects are minimized (Bonstrup et al., 2019; Pan & Rickard, 2015). Extrapolation of this current framework to post-plateau performance periods, longer timespans, or non-learning situations (e.g. – the Nonrepeating groups from Das et al. (2024)), when reactive inhibition plays a more substantive role, is not warranted. Ultimately, it will be important to develop new paradigms allowing one to independently estimate the different coincident or antagonistic features (e.g. - memory consolidation, planning, working memory and reactive inhibition) contributing to micro-online and micro-offline gains during and after early skill learning within a unifying framework.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) I found Figure 2B too small to be useful, as the actual elements of the cells are very hard to read.

      We have removed the grid colormap panel (top-right) from Figure 2B. All of this colormap data is actually a subset of data presented in Figure 2 – figure supplement 1, so can still be found there.

      Reviewer #2 (Recommendations for the authors):

      (1) Related to the first point in my concerns, I would suggest the authors compare decoding accuracy between correct presses followed by correct vs. incorrect presses. This would clarify if the decoder is actually taking the MEG signal for subsequent press into account. I would also suggest the authors use pre-movement MEG features and post-movement features with shorter windows and compare each result with the results for the original post-movement MEG feature with a longer window.

      The present study does not contain enough errors to perform the analysis proposed by the Reviewer. As noted above, we re-examined our data and now report a new control regression analysis; together, these indicate that the proximity between keypresses does not explain contextualization effects.

      (2) I was several times confused by the author's use of "neural representation of an action" or "sequence action representations" in understanding whether these terms refer to representation on the level of whole-brain, region (as defined by the specific parcellation used), or voxels. In fact, what is submitted to the decoder is some complicated whole-brain MEG feature (i.e., the "neural representation"), which is a hybrid of voxel and parcel features that is further dimension-reduced and not immediately interpretable. Clarifying this point early in the text and possibly using some more sensible terms, such as adding "brain-wise" before the "sequence action representation", would be the most helpful for the readers.

      We have now clarified this terminology in the revised manuscript.

      (3) Although comparing many different ways in feature selection/reduction, time window selection, and decoder types is undoubtedly a meticulous work, the current version of the manuscript seems still lacking some explanation about the details of these methodological choices, like which decoding method was actually used to report the accuracy, whether or not different decoding methods were chosen for individual participants' data, how training data was selected (is it all of the correct presses in Day 1 data?), whether the frequency power or signal amplitude was used, and so on. I would highly appreciate these additional details in the Methods section.

      The reported accuracies were based on a linear discriminant analysis (LDA) classifier. A comparison of different decoders (Figure 3 – figure supplement 4) shows that LDA was the optimal choice.

      Whether or not different decoding methods were chosen for individual participants' data

      We used the same decoder (LDA) for all participants when reporting the final accuracy.

      How training data was selected (is it all of the correct presses in Day 1 data?),

      Decoder training was conducted as a randomized split of the data (all correct keypresses of Day 1) into training (90%) and test (10%) samples for 8 iterations.

      Whether the frequency power or signal amplitude was used

      Signal amplitude was used for feature calculation.
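
      The randomized 90%/10% split repeated over 8 iterations, as described above, can be sketched in a few lines. This is a minimal illustration rather than the analysis code used in the study: the function name `repeated_holdout_splits`, the seed, and the rounding of the test-set size are assumptions, and classifier fitting (e.g., with LDA) is omitted.

```python
import random


def repeated_holdout_splits(n_samples, test_fraction=0.10, n_iterations=8, seed=0):
    """Return a list of (train_indices, test_indices) pairs.

    Each iteration reshuffles all sample indices and holds out
    `test_fraction` of them for testing (hypothetical helper).
    """
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    splits = []
    for _ in range(n_iterations):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        test_indices = sorted(indices[:n_test])
        train_indices = sorted(indices[n_test:])
        splits.append((train_indices, test_indices))
    return splits


# Example: 200 correct keypresses (an assumed count, for illustration only).
splits = repeated_holdout_splits(n_samples=200)
```

      Each of the 8 splits is drawn independently, so a given keypress may appear in the test set of more than one iteration; accuracy would then be averaged across iterations.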

      (4) In terms of the Methods, please consider adding some references about the 'F1 score', the 'feature importance score,' and the 'MRMR-based feature ranking,' as the main readers of the current paper would not be from the machine learning community. Also, why did the LDA dimensionality reduction reduce accuracy specifically for the voxel feature?

      We have now added the following statements to the Methods section that provide more detailed descriptions and references for these metrics:

      “The F1 score, defined as the harmonic mean of the precision (percentage of true predictions that are actually true positive) and recall (percentage of true positives that were correctly predicted as true) scores, was used as a comprehensive metric for all one-versus-all keypress state decoders to assess class-wise performance that accounts for both false-positive and false-negative prediction tendencies [REF]. A weighted mean F1 score was then computed across all classes to assess the overall prediction performance of the multi-class model.”

      and

      “Feature Importance Scores

      The relative contribution of source-space voxels and parcels to decoding performance (i.e. – feature importance score) was calculated using minimum redundant maximum relevance (MRMR) and highlighted in topography plots. MRMR, an approach that combines both relevance and redundancy metrics, ranked individual features based upon their significance to the target variable (i.e. – keypress state identity) prediction accuracy and their non-redundancy with other features.”
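
      The F1 and weighted mean F1 definitions quoted above can be made concrete with a short sketch. It assumes that "weighted" means weighting each class by its number of true instances, which is a common convention but is our assumption here; the function names and toy labels are hypothetical.

```python
def f1_per_class(true_labels, predicted_labels):
    """One-versus-all F1 score for each class present in `true_labels`."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for c in sorted(set(true_labels)):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        total = precision + recall
        # Harmonic mean of precision and recall.
        scores[c] = 2.0 * precision * recall / total if total else 0.0
    return scores


def weighted_mean_f1(true_labels, predicted_labels):
    """Mean of class-wise F1 scores, weighted by true-class frequency (assumed)."""
    scores = f1_per_class(true_labels, predicted_labels)
    n = len(true_labels)
    return sum(scores[c] * true_labels.count(c) for c in scores) / n


# Toy 4-class example with a single misclassification (class 1 predicted as 2).
true_labels = [1, 1, 2, 2, 3, 4]
predicted_labels = [1, 2, 2, 2, 3, 4]
class_scores = f1_per_class(true_labels, predicted_labels)
overall = weighted_mean_f1(true_labels, predicted_labels)
```

      In this toy case the single error lowers the F1 score of both classes involved: class 1 through reduced recall and class 2 through reduced precision.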

      As stated in the Reviewer responses above, the dimensionality of the voxel-space feature set is very high (i.e. – 15684). LDA attempts to map the input features onto a much lower-dimensional space (number of classes-1; e.g. – 3 dimensions for 4-class keypress decoding). It is likely that the reduction in accuracy observed only for the voxel-space feature was due to the loss of relevant information during this mapping. This reduction in accuracy for voxel-space decoding was specific to LDA. Figure 3—figure supplement 3 shows that voxel-space decoder performance actually improved when utilizing alternative dimensionality reduction techniques.

      (5) Paragraph 9, lines #139-142: "Notably, decoding associated with index finger keypresses (executed at two different ordinal positions in the sequence) exhibited the highest number of misclassifications of all digits (N = 141 or 47.5% of all decoding errors; Figure 3C), raising the hypothesis that the same action could be differentially represented when executed at different learning state or sequence context locations."

      This does not seem to be a fair comparison, as the index finger appears twice as often as the other fingers do in the sequence. To claim this, proper statistical analysis needs to be done taking this difference into account.

      We thank the Reviewer for bringing this issue to our attention. We have now corrected this comparison to evaluate relative false negative and false positive rates between individual keypress state decoders, and have revised this statement in the manuscript as follows:

      “Notably, decoding of index finger keypresses (executed at two different ordinal positions in the sequence) exhibited the highest false negative (0.116 per keypress) and false positive (0.043 per keypress) misclassification rates compared with all other digits (false negative rate range = [0.067 0.114]; false positive rate range = [0.020 0.037]; Figure 3C), raising the hypothesis that the same action could be differentially represented when executed within different contexts (i.e. - different learning states or sequence locations).”
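
      The reason for moving from raw error counts to rates can be illustrated with a small sketch. The denominators are assumptions made for this example (false negatives divided by the number of true instances of the class; false positives divided by the number of keypresses belonging to other classes) and may differ from the normalization used in the manuscript. The function name and toy data are hypothetical.

```python
from collections import Counter


def misclassification_rates(true_labels, predicted_labels):
    """Per-class false negative and false positive rates (assumed denominators)."""
    n_total = len(true_labels)
    n_true = Counter(true_labels)
    errors = [(t, p) for t, p in zip(true_labels, predicted_labels) if t != p]
    false_negatives = Counter(t for t, _ in errors)  # missed instances of a class
    false_positives = Counter(p for _, p in errors)  # wrongly assigned to a class
    rates = {}
    for c in sorted(n_true):
        n_other = n_total - n_true[c]
        rates[c] = {
            "false_negative_rate": false_negatives[c] / n_true[c],
            "false_positive_rate": false_positives[c] / n_other if n_other else 0.0,
        }
    return rates


# Two iterations of the sequence 4-1-3-2-4, in which digit 4 occurs twice as often.
true_labels = [4, 1, 3, 2, 4, 4, 1, 3, 2, 4]
predicted_labels = [4, 1, 3, 2, 1, 4, 1, 3, 4, 4]
rates = misclassification_rates(true_labels, predicted_labels)
```

      In this toy data, digits 4 and 2 each have one missed keypress, but the false negative rate of digit 4 is half that of digit 2 because it has twice as many opportunities for error.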

      (6) Finally, the authors could consider acknowledging in the Discussion that the contribution of micro-offline learning to genuine skill learning is still under debate (e.g., Gupta and Rickard, 2023; 2024; Das et al., bioRxiv, 2024).

      We have added a paragraph in the Discussion that addresses this point.

      Reviewer #3 (Recommendations for the authors):

      In addition to the additional analyses suggested in the public review, I have the following suggestions/questions:

      (1) Given that the authors introduce a new decoding approach, it would be very helpful for readers to see a distribution of window sizes and window onsets eventually used across individuals, at least for the optimized decoder.

      We have now included a new supplemental figure (Figure 4 – figure supplement 2) that provides this information.

      (2) Please explain in detail how you arrived at the (interpolated?) group-level plot shown in Figure 1B, starting from the discrete single-trial keypress transition times. Also, please specify what the shading shows.

      Instantaneous correct sequence speed (skill measure) was quantified as the inverse of time (in seconds) required to complete a single iteration of a correctly generated full 5-item sequence. Individual keypress responses were labeled as members of correct sequences if they occurred within a 5-item response pattern matching any possible circular shifts of the 5-item sequence displayed on the monitor (41324). This approach allowed us to quantify a measure of skill within each practice trial at the resolution of individual keypresses. The dark line indicates the group mean performance dynamics for each trial. The shaded region indicates the 95% confidence limit of the mean (see Methods).
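
      The labeling of correct sequences and the speed measure described above can be sketched as follows. This is an illustration under stated assumptions, not the study's analysis code: the duration of a sequence iteration is taken as the time from the first to the fifth keypress of the matching window, and the function names are hypothetical.

```python
SEQUENCE = "41324"


def circular_shifts(sequence):
    """All rotations of the target sequence, e.g. 41324 -> 13244, 32441, ..."""
    return {sequence[i:] + sequence[:i] for i in range(len(sequence))}


def correct_sequence_speeds(keys, times, sequence=SEQUENCE):
    """Return (time_of_last_keypress, speed) for every 5-item window that
    matches any circular shift of the target sequence.

    `keys` is a string of pressed digits and `times` the keypress times in
    seconds. Speed is the inverse of the window duration (sequences per second).
    """
    shifts = circular_shifts(sequence)
    n = len(sequence)
    speeds = []
    for end in range(n - 1, len(keys)):
        start = end - n + 1
        if keys[start:end + 1] in shifts:
            duration = times[end] - times[start]
            if duration > 0:
                speeds.append((times[end], 1.0 / duration))
    return speeds


# One correct iteration followed by an erroneous keypress.
speeds = correct_sequence_speeds("413241", [0.0, 0.25, 0.5, 0.75, 1.0, 1.25])
```

      Because a speed value is produced at every keypress that completes a matching window, the measure can be tracked at the resolution of individual keypresses within a trial.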

      (3) Similarly, please explain how you arrived at the group-level plot shown in Figure 1C. What are the different colored lines (rows) within each trial? How exactly did the authors reach the conclusion that KTT variability stabilizes by trial 6?

      Figure 1C provides additional information to the correct sequence speed measure above, as it also tracks individual transition speed composition over learning. Figure 1C, thus, represents both changes in overall correct sequence speed dynamics (indicated by the overall narrowing of the horizontal speed lines moving from top to bottom) and the underlying composition of the individual transition patterns within and across trials. The coloring of the lines is a shading convention used to discriminate between different keypress transitions. These curves were sampled with 1 ms resolution, as in Figure 1B. Addressing the underlying keypress transition patterns requires within-subject normalization before averaging across subjects. The distribution of KTTs was normalized to the median correct sequence time for each participant and centered on the mid-point for each full sequence iteration during early learning.

      (4) Maybe I missed it, but it was not clear to me which of the tested classifiers was eventually used. Or was that individualized as well? More generally, a comparison of the different classifiers would be helpful, similar to the comparison of dimension reduction techniques.

      We have now included a new supplemental figure that provides this information.

      (5) Please add df and effect sizes to all statistics.

      Done.

      (6) Please explain in more detail your power calculation.

      The study was powered to determine the minimum sample size needed to detect a significant change in skill performance following training using a one-sample t-test (two-sided; alpha = 0.05; 95% statistical power; Cohen’s D effect size = 0.8115 calculated from previously acquired data in our lab). The calculated minimum sample size was 22. The included study sample size (n = 27) exceeded this minimum.

      This information is now included in the revised manuscript.
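
      As a rough cross-check, the reported minimum sample size can be approximated with the standard library alone. The sketch below uses a normal approximation plus a common small-sample correction term (adding half the squared critical z value); it is an assumption-laden approximation and not the software or exact noncentral-t procedure used for the original calculation. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def approx_one_sample_n(effect_size, alpha=0.05, power=0.95):
    """Approximate n for a two-sided one-sample t-test.

    Returns (n_normal, n_corrected): the plain normal-approximation size and
    the size after adding the z_alpha**2 / 2 small-sample correction.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    n_normal = ((z_alpha + z_power) / effect_size) ** 2
    n_corrected = n_normal + z_alpha ** 2 / 2.0
    return math.ceil(n_normal), math.ceil(n_corrected)


# Parameters taken from the power calculation described above.
n_normal, n_corrected = approx_one_sample_n(effect_size=0.8115)
```

      With these inputs the uncorrected approximation gives 20 and the corrected one gives 22, in line with the minimum sample size stated above.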

      (7) The cut-off for the high-pass filter is unusually high and seems risky in terms of potential signal distortions (de Cheveigne, Neuron 2019). Why did the authors choose such a high cut-off?

      The 1 Hz high-pass cut-off frequency for the 1-150 Hz band-pass filter applied to the continuous raw MEG data during preprocessing has been used in multiple previous MEG publications (Barratt et al., 2018; Brookes et al., 2012; Higgins et al., 2021; Seedat et al., 2020; Vidaurre et al., 2018).

      (8) "Furthermore, the magnitude of offline contextualization predicted skill gains while online contextualization did not", lines 336/337 - where is that analysis?

      Additional details pertaining to this analysis are now provided in the Results section (Figure 5 – figure supplement 4).

      (9) How were feature importance scores computed?

      We have now added a new subheading in the Methods section with a more detailed description of how feature importance scores were computed.

      (10) Please add x and y ticks plus tick labels to Figure 5 - Figure Supplement 3, panel A.

      Done.

      (11) Line 369, what does "comparable" mean in this context?

      The sentence in the “Study Participants” part of the Methods section referred to here has now been revised for clarity.

      (12) In lines 496/497, please specify what t=0 means (KeyDown event, I guess?).

      Yes, the KeyDown event occurs at t = 0. This has now been clarified in the revised manuscript.

      (13) Please specify consistent boundaries between alpha- and beta-bands (they are currently not consistent in the Results vs. Methods (14/15 Hz or 15/16 Hz)).

      We thank the Reviewer for alerting us to this discrepancy caused by a typographic error in the Methods. We have now corrected this so that the alpha (8-14 Hz) and beta-band (15-24 Hz) frequency limits are described consistently throughout the revised manuscript.

      References

      Albouy, G., Fogel, S., King, B. R., Laventure, S., Benali, H., Karni, A., Carrier, J., Robertson, E. M., & Doyon, J. (2015). Maintaining vs. enhancing motor sequence memories: respective roles of striatal and hippocampal systems. Neuroimage, 108, 423-434. https://doi.org/10.1016/j.neuroimage.2014.12.049

      Albouy, G., King, B. R., Maquet, P., & Doyon, J. (2013). Hippocampus and striatum: dynamics and interaction during acquisition and sleep-related motor sequence memory consolidation. Hippocampus, 23(11), 985-1004. https://doi.org/10.1002/hipo.22183

      Albouy, G., Sterpenich, V., Vandewalle, G., Darsaud, A., Gais, S., Rauchs, G., Desseilles, M., Boly, M., Dang-Vu, T., Balteau, E., Degueldre, C., Phillips, C., Luxen, A., & Maquet, P. (2012). Neural correlates of performance variability during motor sequence acquisition. NeuroImage, 60(1), 324-331. https://doi.org/10.1016/j.neuroimage.2011.12.049

      Andersen, R. A., & Buneo, C. A. (2002). Intentional maps in posterior parietal cortex. Annu Rev Neurosci, 25, 189-220. https://doi.org/10.1146/annurev.neuro.25.112701.142922

      Ashe, J., Lungu, O. V., Basford, A. T., & Lu, X. (2006). Cortical control of motor sequences. Curr Opin Neurobiol, 16(2), 213-221. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=16563734

      Bansal, A. K., Vargas-Irwin, C. E., Truccolo, W., & Donoghue, J. P. (2011). Relationships among low-frequency local field potentials, spiking activity, and three-dimensional reach and grasp kinematics in primary motor and ventral premotor cortices. J Neurophysiol, 105(4), 1603-1619. https://doi.org/10.1152/jn.00532.2010

      Barratt, E. L., Francis, S. T., Morris, P. G., & Brookes, M. J. (2018). Mapping the topological organisation of beta oscillations in motor cortex using MEG. NeuroImage, 181, 831-844. https://doi.org/10.1016/j.neuroimage.2018.06.041

      Bassett, D. S., Wymbs, N. F., Porter, M. A., Mucha, P. J., Carlson, J. M., & Grafton, S. T. (2011). Dynamic reconfiguration of human brain networks during learning. Proc Natl Acad Sci U S A, 108(18), 7641-7646. https://doi.org/10.1073/pnas.1018985108

      Battaglia-Mayer, A., & Caminiti, R. (2019). Corticocortical Systems Underlying High-Order Motor Control. J Neurosci, 39(23), 4404-4421. https://doi.org/10.1523/JNEUROSCI.2094-18.2019

      Berlot, E., Popp, N. J., & Diedrichsen, J. (2020). A critical re-evaluation of fMRI signatures of motor sequence learning. Elife, 9. https://doi.org/10.7554/eLife.55241

      Bonstrup, M., Iturrate, I., Hebart, M. N., Censor, N., & Cohen, L. G. (2020). Mechanisms of offline motor learning at a microscale of seconds in large-scale crowdsourced data. NPJ Sci Learn, 5, 7. https://doi.org/10.1038/s41539-020-0066-9

      Bonstrup, M., Iturrate, I., Thompson, R., Cruciani, G., Censor, N., & Cohen, L. G. (2019). A Rapid Form of Offline Consolidation in Skill Learning. Curr Biol, 29(8), 1346-1351 e1344. https://doi.org/10.1016/j.cub.2019.02.049

      Brawn, T. P., Fenn, K. M., Nusbaum, H. C., & Margoliash, D. (2010). Consolidating the effects of waking and sleep on motor-sequence learning. J Neurosci, 30(42), 13977-13982. https://doi.org/10.1523/JNEUROSCI.3295-10.2010

      Brookes, M. J., Woolrich, M. W., & Barnes, G. R. (2012). Measuring functional connectivity in MEG: a multivariate approach insensitive to linear source leakage. NeuroImage, 63(2), 910-920. https://doi.org/10.1016/j.neuroimage.2012.03.048

      Brooks, E., Wallis, S., Hendrikse, J., & Coxon, J. (2024). Micro-consolidation occurs when learning an implicit motor sequence, but is not influenced by HIIT exercise. NPJ Sci Learn, 9(1), 23. https://doi.org/10.1038/s41539-024-00238-6

      Buch, E. R., Claudino, L., Quentin, R., Bonstrup, M., & Cohen, L. G. (2021). Consolidation of human skill linked to waking hippocampo-neocortical replay. Cell Rep, 35(10), 109193. https://doi.org/10.1016/j.celrep.2021.109193

      Buneo, C. A., & Andersen, R. A. (2006). The posterior parietal cortex: sensorimotor interface for the planning and online control of visually guided movements. Neuropsychologia, 44(13), 2594-2606. https://doi.org/10.1016/j.neuropsychologia.2005.10.011

      Buzsaki, G. (2015). Hippocampal sharp wave-ripple: A cognitive biomarker for episodic memory and planning. Hippocampus, 25(10), 1073-1188. https://doi.org/10.1002/hipo.22488

      Chen, P.-C., Stritzelberger, J., Walther, K., Hamer, H., & Staresina, B. P. (2024). Hippocampal ripples during offline periods predict human motor sequence learning. bioRxiv, 2024.10.06.614680. https://doi.org/10.1101/2024.10.06.614680

      Churchland, M. M., Cunningham, J. P., Kaufman, M. T., Foster, J. D., Nuyujukian, P., Ryu, S. I., & Shenoy, K. V. (2012). Neural population dynamics during reaching. Nature, 487(7405), 51-56. https://doi.org/10.1038/nature11129

      Classen, J., Liepert, J., Wise, S. P., Hallett, M., & Cohen, L. G. (1998). Rapid plasticity of human cortical movement representation induced by practice. J Neurophysiol, 79(2), 1117-1123. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=9463469

      Colclough, G. L., Brookes, M. J., Smith, S. M., & Woolrich, M. W. (2015). A symmetric multivariate leakage correction for MEG connectomes. NeuroImage, 117, 439-448. https://doi.org/10.1016/j.neuroimage.2015.03.071

      Colclough, G. L., Woolrich, M. W., Tewarie, P. K., Brookes, M. J., Quinn, A. J., & Smith, S. M. (2016). How reliable are MEG resting-state connectivity metrics? NeuroImage, 138, 284-293. https://doi.org/10.1016/j.neuroimage.2016.05.070

      Das, A., Karagiorgis, A., Diedrichsen, J., Stenner, M.-P., & Azanon, E. (2024). “Micro-offline gains” convey no benefit for motor skill learning. bioRxiv, 2024.07.11.602795. https://doi.org/10.1101/2024.07.11.602795

      Deleglise, A., Donnelly-Kehoe, P. A., Yeffal, A., Jacobacci, F., Jovicich, J., Amaro, E., Jr., Armony, J. L., Doyon, J., & Della-Maggiore, V. (2023). Human motor sequence learning drives transient changes in network topology and hippocampal connectivity early during memory consolidation. Cereb Cortex, 33(10), 6120-6131. https://doi.org/10.1093/cercor/bhac489

      Doyon, J., Bellec, P., Amsel, R., Penhune, V., Monchi, O., Carrier, J., Lehéricy, S., & Benali, H. (2009). Contributions of the basal ganglia and functionally related brain structures to motor learning. [Review]. Behavioural brain research, 199(1), 61-75. https://doi.org/10.1016/j.bbr.2008.11.012

      Doyon, J., Song, A. W., Karni, A., Lalonde, F., Adams, M. M., & Ungerleider, L. G. (2002). Experience-dependent changes in cerebellar contributions to motor sequence learning. Proc Natl Acad Sci U S A, 99(2), 1017-1022. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=11805340

      Euston, D. R., Gruber, A. J., & McNaughton, B. L. (2012). The role of medial prefrontal cortex in memory and decision making. Neuron, 76(6), 1057-1070. https://doi.org/10.1016/j.neuron.2012.12.002

      Euston, D. R., Tatsuno, M., & McNaughton, B. L. (2007). Fast-forward playback of recent memory sequences in prefrontal cortex during sleep. Science, 318(5853), 1147-1150. https://doi.org/10.1126/science.1148979

      Flint, R. D., Ethier, C., Oby, E. R., Miller, L. E., & Slutzky, M. W. (2012). Local field potentials allow accurate decoding of muscle activity. J Neurophysiol, 108(1), 18-24. https://doi.org/10.1152/jn.00832.2011

      Frankland, P. W., & Bontempi, B. (2005). The organization of recent and remote memories. Nat Rev Neurosci, 6(2), 119-130. https://doi.org/10.1038/nrn1607

      Gais, S., Albouy, G., Boly, M., Dang-Vu, T. T., Darsaud, A., Desseilles, M., Rauchs, G., Schabus, M., Sterpenich, V., Vandewalle, G., Maquet, P., & Peigneux, P. (2007). Sleep transforms the cerebral trace of declarative memories. Proc Natl Acad Sci U S A, 104(47), 18778-18783. https://doi.org/10.1073/pnas.0705454104

      Grafton, S. T., Mazziotta, J. C., Presty, S., Friston, K. J., Frackowiak, R. S., & Phelps, M. E. (1992). Functional anatomy of human procedural learning determined with regional cerebral blood flow and PET. J Neurosci, 12(7), 2542-2548.

      Grover, S., Wen, W., Viswanathan, V., Gill, C. T., & Reinhart, R. M. G. (2022). Long-lasting, dissociable improvements in working memory and long-term memory in older adults with repetitive neuromodulation. Nat Neurosci, 25(9), 1237-1246. https://doi.org/10.1038/s41593-022-01132-3

      Gupta, M. W., & Rickard, T. C. (2022). Dissipation of reactive inhibition is sufficient to explain post-rest improvements in motor sequence learning. NPJ Sci Learn, 7(1), 25. https://doi.org/10.1038/s41539-022-00140-z

      Gupta, M. W., & Rickard, T. C. (2024). Comparison of online, offline, and hybrid hypotheses of motor sequence learning using a quantitative model that incorporate reactive inhibition. Sci Rep, 14(1), 4661. https://doi.org/10.1038/s41598-024-52726-9

      Hardwick, R. M., Rottschy, C., Miall, R. C., & Eickhoff, S. B. (2013). A quantitative meta-analysis and review of motor learning in the human brain. NeuroImage, 67, 283-297. https://doi.org/10.1016/j.neuroimage.2012.11.020

      Heusser, A. C., Poeppel, D., Ezzyat, Y., & Davachi, L. (2016). Episodic sequence memory is supported by a theta-gamma phase code. Nat Neurosci, 19(10), 1374-1380. https://doi.org/10.1038/nn.4374

      Higgins, C., Liu, Y., Vidaurre, D., Kurth-Nelson, Z., Dolan, R., Behrens, T., & Woolrich, M. (2021). Replay bursts in humans coincide with activation of the default mode and parietal alpha networks. Neuron, 109(5), 882-893 e887. https://doi.org/10.1016/j.neuron.2020.12.007

      Hikosaka, O., Nakamura, K., Sakai, K., & Nakahara, H. (2002). Central mechanisms of motor skill learning. Curr Opin Neurobiol, 12(2), 217-222. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=12015240

      Jacobacci, F., Armony, J. L., Yeffal, A., Lerner, G., Amaro, E., Jr., Jovicich, J., Doyon, J., & Della-Maggiore, V. (2020). Rapid hippocampal plasticity supports motor sequence learning. Proc Natl Acad Sci U S A, 117(38), 23898-23903. https://doi.org/10.1073/pnas.2009576117

      Karni, A., Meyer, G., Jezzard, P., Adams, M. M., Turner, R., & Ungerleider, L. G. (1995). Functional MRI evidence for adult motor cortex plasticity during motor skill learning. Nature, 377(6545), 155-158. https://doi.org/10.1038/377155a0

      Kennerley, S. W., Sakai, K., & Rushworth, M. F. (2004). Organization of action sequences and the role of the pre-SMA. J Neurophysiol, 91(2), 978-993. https://doi.org/10.1152/jn.00651.2003

      Kleim, J. A., Barbay, S., & Nudo, R. J. (1998). Functional reorganization of the rat motor cortex following motor skill learning. J Neurophysiol, 80, 3321-3325.

      Kornysheva, K., Bush, D., Meyer, S. S., Sadnicka, A., Barnes, G., & Burgess, N. (2019). Neural Competitive Queuing of Ordinal Structure Underlies Skilled Sequential Action. Neuron, 101(6), 1166-1180 e1163. https://doi.org/10.1016/j.neuron.2019.01.018

      Lee, S. H., Jin, S. H., & An, J. (2019). The difference in cortical activation pattern for complex motor skills: A functional near-infrared spectroscopy study. Sci Rep, 9(1), 14066. https://doi.org/10.1038/s41598-019-50644-9

      Lisman, J. E., & Jensen, O. (2013). The theta-gamma neural code. Neuron, 77(6), 1002-1016. https://doi.org/10.1016/j.neuron.2013.03.007

      Mollazadeh, M., Aggarwal, V., Davidson, A. G., Law, A. J., Thakor, N. V., & Schieber, M. H. (2011). Spatiotemporal variation of multiple neurophysiological signals in the primary motor cortex during dexterous reach-to-grasp movements. J Neurosci, 31(43), 15531-15543. https://doi.org/10.1523/JNEUROSCI.2999-11.2011

      Molle, M., & Born, J. (2009). Hippocampus whispering in deep sleep to prefrontal cortex--for good memories? Neuron, 61(4), 496-498. https://doi.org/10.1016/j.neuron.2009.02.002

      Morris, R. G. M. (2006). Elements of a neurobiological theory of hippocampal function: the role of synaptic plasticity, synaptic tagging and schemas. [Review]. The European journal of neuroscience, 23(11), 2829-2846. https://doi.org/10.1111/j.1460-9568.2006.04888.x

      Mylonas, D., Schapiro, A. C., Verfaellie, M., Baxter, B., Vangel, M., Stickgold, R., & Manoach, D. S. (2024). Maintenance of Procedural Motor Memory across Brief Rest Periods Requires the Hippocampus. J Neurosci, 44(14). https://doi.org/10.1523/JNEUROSCI.1839-23.2024

      Pan, S. C., & Rickard, T. C. (2015). Sleep and motor learning: Is there room for consolidation? Psychol Bull, 141(4), 812-834. https://doi.org/10.1037/bul0000009

      Penhune, V. B., & Steele, C. J. (2012). Parallel contributions of cerebellar, striatal and M1 mechanisms to motor sequence learning. Behav. Brain Res., 226(2), 579-591. https://doi.org/10.1016/j.bbr.2011.09.044

      Qin, Y. L., McNaughton, B. L., Skaggs, W. E., & Barnes, C. A. (1997). Memory reprocessing in corticocortical and hippocampocortical neuronal ensembles. Philos Trans R Soc Lond B Biol Sci, 352(1360), 1525-1533. https://doi.org/10.1098/rstb.1997.0139

      Rickard, T. C., Cai, D. J., Rieth, C. A., Jones, J., & Ard, M. C. (2008). Sleep does not enhance motor sequence learning. J Exp Psychol Learn Mem Cogn, 34(4), 834-842. https://doi.org/10.1037/0278-7393.34.4.834

      Robertson, E. M., Pascual-Leone, A., & Miall, R. C. (2004). Current concepts in procedural consolidation. Nat Rev Neurosci, 5(7), 576-582. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=15208699

      Sawamura, D., Sakuraba, S., Suzuki, Y., Asano, M., Yoshida, S., Honke, T., Kimura, M., Iwase, Y., Horimoto, Y., Yoshida, K., & Sakai, S. (2019). Acquisition of chopstick-operation skills with the non-dominant hand and concomitant changes in brain activity. Sci Rep, 9(1), 20397. https://doi.org/10.1038/s41598-019-56956-0

      Schendan, H. E., Searl, M. M., Melrose, R. J., & Stern, C. E. (2003). An FMRI study of the role of the medial temporal lobe in implicit and explicit sequence learning. Neuron, 37(6), 1013-1025. https://doi.org/10.1016/s0896-6273(03)00123-5

      Seedat, Z. A., Quinn, A. J., Vidaurre, D., Liuzzi, L., Gascoyne, L. E., Hunt, B. A. E., O'Neill, G. C., Pakenham, D. O., Mullinger, K. J., Morris, P. G., Woolrich, M. W., & Brookes, M. J. (2020). The role of transient spectral 'bursts' in functional connectivity: A magnetoencephalography study. NeuroImage, 209, 116537. https://doi.org/10.1016/j.neuroimage.2020.116537

      Shadmehr, R., & Holcomb, H. H. (1997). Neural correlates of motor memory consolidation. Science, 277, 821-824.

      Sjøgård, M., Baxter, B., Mylonas, D., Driscoll, B., Kwok, K., Tolosa, A., Thompson, M., Stickgold, R., Vangel, M., Chu, C., & Manoach, D. S. (2024). Hippocampal ripples mediate motor learning during brief rest breaks in humans. bioRxiv. https://doi.org/10.1101/2024.05.02.592200

      Srinivas, S., Sarvadevabhatla, R. K., Mopuri, K. R., Prabhu, N., Kruthiventi, S. S. S., & Babu, R. V. (2016). A Taxonomy of Deep Convolutional Neural Nets for Computer Vision [Technology Report]. Frontiers in Robotics and AI, 2. https://doi.org/10.3389/frobt.2015.00036

      Sterpenich, V., Albouy, G., Darsaud, A., Schmidt, C., Vandewalle, G., Dang Vu, T. T., Desseilles, M., Phillips, C., Degueldre, C., Balteau, E., Collette, F., Luxen, A., & Maquet, P. (2009). Sleep promotes the neural reorganization of remote emotional memory. J Neurosci, 29(16), 5143-5152. https://doi.org/10.1523/JNEUROSCI.0561-09.2009

      Toni, I., Ramnani, N., Josephs, O., Ashburner, J., & Passingham, R. E. (2001). Learning arbitrary visuomotor associations: temporal dynamic of brain activity. Neuroimage, 14(5), 1048-1057. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=11697936

      Toni, I., Thoenissen, D., & Zilles, K. (2001). Movement preparation and motor intention. NeuroImage, 14(1 Pt 2), S110-117. https://doi.org/10.1006/nimg.2001.0841

      Tse, D., Langston, R. F., Kakeyama, M., Bethus, I., Spooner, P. A., Wood, E. R., Witter, M. P., & Morris, R. G. (2007). Schemas and memory consolidation. Science, 316(5821), 76-82. https://doi.org/10.1126/science.1135935

      van Kesteren, M. T., Fernandez, G., Norris, D. G., & Hermans, E. J. (2010). Persistent schema-dependent hippocampal-neocortical connectivity during memory encoding and postencoding rest in humans. Proc Natl Acad Sci U S A, 107(16), 7550-7555. https://doi.org/10.1073/pnas.0914892107

      van Kesteren, M. T., Ruiter, D. J., Fernandez, G., & Henson, R. N. (2012). How schema and novelty augment memory formation. Trends Neurosci, 35(4), 211-219. https://doi.org/10.1016/j.tins.2012.02.001

      Vidaurre, D., Hunt, L. T., Quinn, A. J., Hunt, B. A. E., Brookes, M. J., Nobre, A. C., & Woolrich, M. W. (2018). Spontaneous cortical activity transiently organises into frequency specific phase-coupling networks. Nat Commun, 9(1), 2987. https://doi.org/10.1038/s41467-018-05316-z

      Wagner, A. D., Schacter, D. L., Rotte, M., Koutstaal, W., Maril, A., Dale, A. M., Rosen, B. R., & Buckner, R. L. (1998). Building memories: remembering and forgetting of verbal experiences as predicted by brain activity. [Comment]. Science (New York, N.Y.), 281(5380), 1188-1191. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=9712582&retmode=ref&cmd=prlinks

      Wolpert, D. M., Goodbody, S. J., & Husain, M. (1998). Maintaining internal representations: the role of the human superior parietal lobe. Nat Neurosci, 1(6), 529-533. https://doi.org/10.1038/2245

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Review:

      Reviewer #1 (Public review): 

      Summary: 

      Odor- and taste-sensing are mediated by two different systems, the olfactory and gustatory systems, and have different behavioral roles. In this study, Wei et al. challenge this dichotomy by showing that odors can activate gustatory receptor neurons (GRNs) in Drosophila to promote feeding responses, including the proboscis extension response (PER) that was previously thought to be driven only by taste. While previous studies suggested that odors can promote PER to appetitive tastants, Wei et al. go further to show that odors alone cause PER, this effect is mediated through sweet-sensing GRNs, and sugar receptors are required. The study also shows that odor detection by bitter-sensing GRNs suppresses PER. The authors' conclusions are supported by behavioral assays, calcium imaging, electrophysiological recordings, and genetic manipulations. The observation that both attractive and aversive odors promote PER leaves an open question as to why this effect is adaptive. Overall, the study sheds new light on chemosensation and multimodal integration by showing that odor and taste detection converge at the level of sensory neurons, a finding that is interesting and surprising while also being supported by another recent study (Dweck & Carlson, Sci Advances 2023).

      Strengths: 

      (1) The main finding that odors alone can promote PER by activating sweet-sensing GRNs is interesting and novel.

      (2) The study uses video tracking of the proboscis to quantify PER rather than manual scoring, which is typically used in the field. The tracking method is less subjective and provides a higher-resolution readout of the behavior.

      (3) The study uses calcium imaging and electrophysiology to show that odors activate GRNs. These represent complementary techniques that measure activity at different parts of the GRN (axons versus dendrites, respectively) and strengthen the evidence for this conclusion. 

      (4) Genetic manipulations show that odor-evoked PER is primarily driven by sugar GRNs and sugar receptors rather than olfactory neurons. This is a major finding that distinguishes this work from previous studies of odor effects on PER and feeding (e.g., Reisenman & Scott, 2019; Shiraiwa, 2008) that assumed or demonstrated that odors were acting through olfactory neurons.

      We appreciate the reviewer’s positive assessment of the novelty and significance of our work.

      Weaknesses/Limitations: 

      (1) The authors may want to discuss why PER to odors alone has not been previously reported, especially as they argue that this is a broad effect evoked by many different odors. Previous studies testing the effect of odors on PER only observed odor enhancement of PER to sugar (Oh et al., 2021; Reisenman & Scott, 2019; Shiraiwa, 2008) and some of these studies explicitly show no effect of odor alone or odor with low sugar concentration; regardless, the authors likely would have noticed if PER to odor alone had occurred. Readers of this paper may also be aware of unpublished studies failing to observe an effect of PER on odor alone (including studies performed by this reviewer and unrelated work by other colleagues in the field), which of course the authors are not expected to directly address but may further motivate the authors to provide possible explanations.

      We appreciate the reviewer’s comment. We believe that the difference in genotype is likely the main reason. The strength of odor-evoked PER varied widely across genotypes and was quite weak in some strains, including the commonly used w[1118] empty Gal4 and w[1118] empty split Gal4, as shown in Figure 1-figure supplement 3 (Figure S3 in original submission). However, given that we observed odor-evoked PER in various genotypes (many in main Figures and three in Figure 1-figure supplement 3, including Drosophila simulans), the data illustrate that it is a general phenomenon in Drosophila. Indeed, although Oh et al. (2021) did not emphasize it in the text, their Fig. 1E showed that yeast odor evoked PER at a probability of 20%, which is much higher than the rate of spontaneous PER in many genotypes. Therefore, this literature may provide further support for the presence of odor-evoked PER. We have expanded our text in the Discussion to describe these issues.

      Another possibility is our use of DeepLabCut to quantitatively track the kinematics of proboscis movement, which may have facilitated the detection of PER.

      (2) Many of the odor effects on behavior or neuronal responses were only observed at very high concentrations. Most effects seemed to require concentrations of at least 10<sup>-2</sup> (0.01 v/v), which is at the high end of the concentration range used in olfactory studies (e.g., Hallem et al., 2004), and most experiments in the paper used a far higher concentration of 0.5 v/v. It is unclear whether these are concentrations that would be naturally encountered by flies.

      We acknowledge that the concentrations used are on the higher side, suggesting that GRNs may need to be stimulated with relatively concentrated odors to induce PER. Although it is difficult to determine the naturalistic range of odor concentration, it is widely reported that olfactory neurons, including olfactory receptor neurons and projection neurons, do not saturate and exhibit odor identity-dependent responses at the concentration of 10<sup>-2</sup>, where odor-evoked PER can be observed. Furthermore, we have shown in Figure 6 that a low concentration (10<sup>-4</sup>) of banana odor, ethyl butyrate, and 4-methylcyclohexanol each significantly increased the rate of odor-taste multisensory PER even in flies whose olfactory organs were removed, suggesting that low-concentration odors can influence feeding behavior via GRNs in a natural context where odors and tastants coexist at food sites. Finally, we note that odors were further diluted by a factor of 0.375 by mixing the odor stream with the main air stream before being applied to the flies, as described in Methods.
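To make the final point concrete, the arithmetic can be sketched as follows. This is only an illustration, assuming that "diluted by a factor of 0.375" means the nominal concentration is multiplied by 0.375; the function and constant names are hypothetical and do not come from the manuscript.

```python
# Assumption: the nominal (stated) concentration is multiplied by 0.375
# when the odor stream is mixed with the main air stream.
DILUTION_FACTOR = 0.375


def delivered_concentration(nominal: float) -> float:
    """Return the concentration reaching the fly for a given nominal value."""
    return nominal * DILUTION_FACTOR


if __name__ == "__main__":
    # Nominal concentrations mentioned in the response: 0.5, 10^-2 and 10^-4.
    for nominal in (0.5, 1e-2, 1e-4):
        print(f"nominal {nominal:g} -> delivered {delivered_concentration(nominal):g}")
```

Under this assumption, a nominal 0.5 v/v stimulus corresponds to roughly 0.19 at the fly, and 10<sup>-2</sup> to roughly 4 × 10<sup>-3</sup>.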

      (3) The calcium imaging data showing that sugar GRNs respond to a broad set of odors contrasts with results from Dweck & Carlson (Sci Adv, 2023) who recorded sugar neurons with electrophysiology and observed responses to organic acids, but not other odors. This discrepancy is not discussed.  

      As the reviewer points out, Dweck and Carlson (Sci Adv, 2023) reported, using single sensillum electrophysiology (base recording), that sugar GRNs respond only to organic acids, whereas we found, using calcium imaging from a group of axons and single sensillum electrophysiology (tip recording), that these GRNs respond to a wide variety of odors. Given that we observed odor responses using two methods, the discrepancy is likely due to differences in the genotypes examined. We have now discussed this point in the text.

      (4) Related to point #1, it would be useful to see a quantification of the percent of flies or trials showing PER for the key experiments in the paper, as this is the standard metric used in most studies and would help readers compare PER in this study to other studies. This is especially important for cases where the authors are claiming that odor-evoked PER is modulated in the same way as previously shown for sugar (e.g., the effect of starvation in Figure S4).

      For starved flies, we would like to remind the reviewer that the percentage of trials showing PER is reported in Fig. 1E, which shows a similar trend as the integrated PER duration. For fed flies, we have analyzed the percentage of PER and added the result to Figure 2-figure supplement 1C (Figure S4 in original submission).

      (5) Given the novelty of the finding that odors activate sugar GRNs, it would be useful to show more examples of GCaMP traces (or overlaid traces for all flies/trials) in Figure 3. Only one example trace is shown, and the boxplots do not give us a sense of the reliability or time course of the response. A related issue is that the GRNs appear to be persistently activated long after the odor is removed, which does not occur with tastes. Why should that occur? Does the time course of GRN activation align with the time course of PER, and do different odors show differences in the latency of GRN activation that correspond with differences in the latency of PER (Figure S1A)?

      Following the reviewer’s suggestion, we now report GCaMP responses for all the trials in all the flies (both Gr5a>GCaMP and Gr66a>GCaMP flies), where the time course and trial-to-trial/animal-to-animal variability of calcium responses can be observed (Figure 3-figure supplement 2).

      Regarding the second point, we recorded responses to both sucrose and odors in some flies and found that calcium responses of GRNs are long-lasting not only to odors but also to sucrose, as shown in Author response image 1. This may be due in part to the properties of GCaMP6s and slower decay of intracellular calcium concentration as compared to spikes.

      Author response image 1.

      Example calcium responses to sucrose and odor (MCH) in the same fly (normalized by the respective peak responses to better illustrate the time course of responses). Sucrose (blue) and odor (orange) concentrations are 100 mM and 10<sup>-1</sup>, respectively. Odor stimulation begins at 5 s and lasts for 2 s. Sucrose was applied with the same timing and duration, although the precise timing and duration of tastant application could not be tightly controlled. Because of this limitation, we did not quantify the off time constants of the two responses.

      To address whether the time course of GRN activation aligns with the time course of PER, and whether different odors evoke different latencies of GRN activation that correspond to latencies of PER, we plotted the time course of GRN responses and PER, and further compared the response latencies across odors and across two types of responses in Gr5a>GCaMP6s flies. As shown in Author response image 2, no significant differences were found in response latency between the six odors for PER and odor responses. Furthermore, Pearson correlation between GRN response latencies and PER latencies was not significant (r = 0.09, p = 0.872).

      Author response image 2.

      (A) PER duration in each second in Gr5a-Gal4>UAS-GCaMP6s flies. The black lines indicate the mean and the shaded areas indicate standard error of the mean. n = 25 flies. (B) Time course of calcium responses (ΔF/F) to nine odors in Gr5a GRNs. n = 5 flies. (C) Latency to the first odor-evoked PER in Gr5a-Gal4>UAS-GCaMP6s flies. Green bar indicates the odor application period. p = 0.67, one-way ANOVA. Box plots indicate the median (orange line), mean (black dot), quartiles (box), and 5-95% range (bar). Dots are outliers. (D) Latency of calcium responses (10% of rise to peak time) in Gr5a GRNs. Green bar indicates the odor application period. p = 0.32, one-way ANOVA. Box plots indicate the median (orange line), mean (black dot), quartiles (box), and 5-95% range (bar). Dots are outliers.
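The latency comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: it assumes that "10% of rise to peak time" means the first time after stimulus onset at which ΔF/F reaches 10% of its post-onset peak, and all function names are hypothetical. The Pearson function returns only the correlation coefficient, not a p-value.

```python
import math


def onset_latency(times_s, trace, stim_onset_s, fraction=0.1):
    """Time from stimulus onset until the trace first reaches fraction * peak.

    Assumption: the peak is taken over samples at or after stimulus onset.
    Returns None if there is no positive post-onset response.
    """
    post = [(t, v) for t, v in zip(times_s, trace) if t >= stim_onset_s]
    if not post:
        return None
    peak = max(v for _, v in post)
    if peak <= 0:
        return None
    threshold = fraction * peak
    for t, v in post:
        if v >= threshold:
            return t - stim_onset_s
    return None


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

In use, one latency per odor would be computed for the calcium responses and for PER, and `pearson_r` would then be applied to the two per-odor latency lists.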

      (6) Several controls are missing, and in some cases, experimental and control groups are not directly compared. In general, Gal4/UAS experiments should include comparisons to both the Gal4/+ and UAS/+ controls, at least in cases where control responses vary substantially, which appears to be the case for this study. These controls are often missing, e.g. the Gal4/+ controls are not shown in Figure 2C-G and the UAS/+ controls are not shown in Figure 2J-L (also, the legend for the latter panels should be revised to clarify what the "control" flies are). For the experiments in Figure S5, the data are not directly compared to any control group. For several other experiments, the control and experimental groups are plotted in separate graphs (e.g., Figure 2C-G), and they would be easier to visually compare if they were together. In addition, for each experiment, the authors should denote which comparisons are statistically significant rather than just reporting an overall p-value in the legend (e.g., Figure 2H-L).

      We thank the reviewer for the input. We have conducted additional experiments for four Gal4/+ controls in Figure 2 and added detailed information about control flies in the figure legend (Figure 2C-F).

      For the RNAi flies shown in Figure 2 and Figure 2-figure supplement 3, we used the recommended controls suggested by the VDRC. These control flies were crossed with tubulin-Gal4 lines to include both Gal4 and UAS control backgrounds.

      Regarding Figure S5 in original submission (current Figure 2-figure supplement 2), we now present the results of statistical tests which revealed that PER to certain odors is statistically significantly stronger than that to the solvent control (mineral oil) for both wing-removed and wing-leg-removed flies.

      For Figure 2C-F, we now plot the results for experimental and control groups side by side in each figure.

      Regarding the results of statistical tests, we have provided more information in the legend and also prepared a summary table (supplemental table). 

      (7) Additional controls would be useful in supporting the conclusions. For the Kir experiments, how do we know that Kir is effective, especially in cases where odor-evoked PER was not impaired (e.g., Orco/Kir)? The authors could perform controls testing odor aversion, for example. For the Gr5a mutant, few details are provided on the nature of the control line used and whether it is in the same genetic background as the mutant. Regardless, it would be important to verify that the Gr5a mutant retains a normal sense of smell and shows normal levels of PER to stimuli other than sugar, ruling out more general deficits. Finally, as the method of using DeepLabCut tracking to quantify PER was newly developed, it is important to show the accuracy and specificity of detecting PER events compared to manual scoring.  

      A previous study (Sato, 2023, Front Mol Neurosci) showed that the avoidance to 100 μM 2-methylthiazoline (2MT) was abolished, and the avoidance to 1 mM 2MT was partially impaired, in Orco>Kir2.1 flies. However, because Orco-Gal4 does not label all the ORNs, and because we have more concrete results from flies in which all the olfactory organs are removed as well as from flies in which specific GRNs and Grs are manipulated, we decided to remove the data for Orco>Kir2.1 flies and have updated the text and Figure 2 accordingly.

      For the Gr5a mutant and its control, we have added detailed information about the genotype in the figure legend and in the Methods. We used the exact same lines as reported in Dahanukar et al. (2007), which we obtained from Dr. Dahanukar. Dahanukar et al. have already carefully shown that the Gr5a mutant loses responses only to certain types of sugars (i.e., it retains normal responses to some other sugars), demonstrating that Gr5a mutants do not exhibit general deficits.

      As for the PER scoring method, we manually scored PER duration and compared the results with those obtained using DeepLabCut in wild type flies for the representative data. The two results were similar (no statistical difference). We have reported the result in Figure 1-figure supplement 1C.

      (8) The authors' explanation of why both attractive and aversive odors promote PER (lines 249-259) did not seem convincing. The explanation discusses the different roles of smell and taste but does not address the core question of why it would be adaptive for an aversive odor, which flies naturally avoid, to promote feeding behavior.  

      We have extended our explanation in the Discussion by adding the following possibility: “Enhancing PER to aversive odors might also be adaptive as animals often need to carry out the final check by tasting a trace amount of potentially dangerous substances to confirm that those should not be further consumed.”

      Reviewer #2 (Public review): 

      Summary: 

      A gustatory receptor and neuron enhances an olfactory behavioral response, proboscis extension. This manuscript clearly establishes a novel mechanism by which a gustatory receptor and neuron evokes an olfactory-driven behavioral response. The study expands recent observations by Dweck and Carlson (2023) that suggest new and remarkable properties among GRNs in Drosophila. Here, the authors articulate a clear instance of a novel neural and behavioral mechanism for gustatory receptors in an olfactory response.

      Strengths: 

      The systematic and logical use of genetic manipulation, imaging and physiology, and behavioral analysis makes a clear case that gustatory neurons are bona fide olfactory neurons with respect to proboscis extension behavior.

      Weaknesses: 

      No weaknesses were identified by this reviewer.  

      We appreciate the reviewer’s recognition of the novelty and significance of our work.

      Reviewer #3 (Public review): 

      Summary: 

      Using flies, Kazama et al. combined behavioral analysis, electrophysiological recordings, and calcium imaging experiments to elucidate how odors activate gustatory receptor neurons (GRNs) and elicit a proboscis extension response, which is interpreted as a feeding response. 

      The authors used DeepLabCut v2.0 to estimate the extension of the proboscis, which represents an unbiased and more precise method for describing this behavior compared to manual scoring.

      They demonstrated that the probability of eliciting a proboscis extension increases with higher odor concentrations. The most robust response occurs at a 0.5 v/v concentration, which, despite being diluted in the air stream, remains a relatively high concentration. Although the probability of response is not particularly high it is higher than control stimuli. Notably, flies respond with a proboscis extension to both odors that are considered positive and those regarded as negative.

      The authors used various transgenic lines to show that the response is mediated by GRNs.

      Specifically, inhibiting Gr5a reduces the response, while inhibiting Gr66a increases it in fed flies. Additionally, they find that odors induce a strong positive response in both types of GRNs, which is abolished when the labella of the proboscis are covered. This response was also confirmed through electrophysiological tip recordings.

      Finally, the authors demonstrated that the response increases when two stimuli of different modalities, such as sucrose and odors, are presented together, suggesting clear multimodal integration.

      Strengths: 

      The integration of various techniques, that collectively support the robustness of the results.

      The assessment of electrophysiological recordings in intact animals, preserving natural physiological conditions.

      We appreciate the reviewer’s recognition of the novelty and significance of our work.

      Weaknesses: 

      The behavioral response is observed in only a small proportion of animals.  

      We acknowledge that the probability of odor-evoked PER is lower than that of sucrose-evoked PER, which can approach 100% depending on the concentration. To further quantify what proportion of animals exhibit odor-evoked PER, we now report this number alongside the probability of PER for each odor shown in Fig. 1E. We found that, in wild type Dickinson flies, 73% and 68% of flies exhibited PER to at least one odor presented at concentrations of 0.5 and 0.1, respectively.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      Minor comments/suggestions: 

      - Define "MO" in Figure 1D.  

      We have defined it as mineral oil in the figure legend.

      - Clarify how peak response was calculated for GCaMP traces (is it just the single highest frame per trial?).

      We extended the description in the Methods as follows: “The peak stimulus response was quantified by averaging ΔF/F across five frames at the peak, followed by averaging across three trials for each stimulus. Odor stimulation began at frame 11, and the frames used for peak quantification were 12 to 16.” We made sure that information about the image acquisition frame rate was provided earlier in the text.
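The quantification quoted above can be sketched as follows. This is a minimal illustration rather than the authors' code; it assumes that frame numbers in the Methods are 1-indexed and that each trial is available as a list of ΔF/F values, and the function name is hypothetical.

```python
from statistics import mean

# Frames 12 to 16 as stated in the Methods (assumed to be 1-indexed).
PEAK_FRAMES = range(12, 17)


def peak_response(trials):
    """Average dF/F over the five peak frames, then across trials.

    trials: list of per-trial dF/F traces (lists of floats), where
    index 0 holds frame 1.
    """
    per_trial_peaks = [
        mean(trace[frame - 1] for frame in PEAK_FRAMES) for trace in trials
    ]
    return mean(per_trial_peaks)
```

For three trials of one stimulus, `peak_response([trial1, trial2, trial3])` yields a single value per stimulus and fly, which is the quantity that would then be summarized across flies.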

      - Clarify how the labellum was covered in Figure 3 and show that this does not affect the fly's ability to do PER (e.g., test PER to sugar stimulation on tarsus) - otherwise one might think that gluing the labella could affect PER.

      In Figure 3, only calcium responses were recorded, and PER was not recorded simultaneously from the same flies. To ensure stable recording from GRN axons in the SEZ, we kept the fly’s proboscis in an extended position as gently as possible using a strip of parafilm. In some of the imaging experiments, we covered the labellum with UV curable glue, whose purpose was not to fix the labellum in an extended position but to prevent the odors from interacting with GRNs on the labellum. We have added text to the Methods to explain how we covered the labellum.

      - Clarify how the coefficients for the linear equation were chosen in Figure 3G.  

      We used linear regression (implemented in Python using scikit-learn) to model the relationship between neural activity and behavior, aiming to predict the PER duration based on the calcium responses of two GRN types, Gr5a and Gr66a. The coefficients were estimated using the LinearRegression function. We added this description to the Methods. 
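The response states that the coefficients were estimated with scikit-learn's LinearRegression, which performs an ordinary least-squares fit with an intercept. As a dependency-free sketch of the same kind of fit (not the authors' code), the following solves the two-predictor normal equations directly. The argument names `gr5a`, `gr66a`, and `per_duration` are hypothetical placeholders for per-odor calcium responses and PER durations.

```python
def fit_two_predictor_ols(gr5a, gr66a, per_duration):
    """Fit per_duration ~ b0 + b1 * gr5a + b2 * gr66a by least squares.

    Returns (b0, b1, b2). Raises ValueError if the two predictors are
    perfectly collinear, in which case the coefficients are not unique.
    """
    n = len(per_duration)
    m1 = sum(gr5a) / n
    m2 = sum(gr66a) / n
    my = sum(per_duration) / n

    # Centered sums of squares and cross-products.
    s11 = sum((a - m1) ** 2 for a in gr5a)
    s22 = sum((b - m2) ** 2 for b in gr66a)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(gr5a, gr66a))
    s1y = sum((a - m1) * (y - my) for a, y in zip(gr5a, per_duration))
    s2y = sum((b - m2) * (y - my) for b, y in zip(gr66a, per_duration))

    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are collinear")

    # Solve the 2x2 normal equations, then recover the intercept.
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2
```

A positive `b1` and a negative `b2` would correspond to Gr5a activity promoting and Gr66a activity suppressing PER, which is the pattern the response describes qualitatively; the actual coefficient values are those reported in the manuscript, not anything produced by this sketch.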

      - Typo in "L-type", Figure 4A.  

      We thank the reviewer for pointing out this error and have corrected it.

      - Clarify over what time period ephys recordings were averaged to obtain average responses.

      We have modified the description in the Methods as follows: “The average firing rate was quantified by using the spikes generated between 200 and 700 ms after the stimulus contact following the convention to avoid the contamination of motion artifact (Dahanukar and Benton, 2023; Delventhal et al., 2014; Hiroi et al., 2002).”
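The firing-rate window described above can be sketched as follows. This is an illustration only: it assumes that spike times and the contact time are given in milliseconds on a common clock and that the window is half-open (200 ms inclusive, 700 ms exclusive); the function name is hypothetical.

```python
def mean_firing_rate(spike_times_ms, contact_time_ms,
                     start_ms=200.0, end_ms=700.0):
    """Mean firing rate (spikes/s) in a window after stimulus contact.

    Counts spikes with contact + start_ms <= t < contact + end_ms and
    divides by the window length in seconds.
    """
    window_start = contact_time_ms + start_ms
    window_end = contact_time_ms + end_ms
    n_spikes = sum(window_start <= t < window_end for t in spike_times_ms)
    return n_spikes / ((end_ms - start_ms) / 1000.0)
```

Excluding the first 200 ms after contact is what keeps the contact-related motion artifact out of the spike count.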

      - The data and statistics indicate that MCH does not enhance feeding in Figure 6G, so the text in lines 207-208 is not accurate.

      We have modified the text as follows: “A similar result was observed with ethyl butyrate, and a slight, although not significant, increase was also observed with 4-methylcyclohexanol (Figure 6G).”

      - P-value for Figure S9 correlation is not reported.  

      We thank the reviewer for pointing this out. The p-value is 0.00044, and we have added it to the figure legend (current Figure 5-figure supplement 1).

      Reviewer #2 (Recommendations for the authors): 

      Honestly, I have no recommendations for improvement. The manuscript is extremely well-written and logical. The experiments are persuasive. A lapidary piece of work.

      We thank the reviewer for the positive assessment of our work.

      Reviewer #3 (Recommendations for the authors): 

      - I suggest explaining the rationale for selecting a 4-second interval, beginning 1 second after the onset of stimulation.

      Integrated PER duration was defined as the sum of PER duration over 4 s starting 1 s after the odor onset. This definition was set based on the following data.

      (1) We used a photoionization detector (PID) to measure the actual time that the odor reaches the position of a tethered fly, which was approximately 1.1 seconds after the odor valve was opened. Therefore, we began analyzing PER responses 1 second after the odor onset (valve opening) to align with the actual timing of stimulation.

      (2) As shown in Fig.1D and 1F, the majority of PER occurred within 4 s after the odor arrival.

      We have now added the above rationale in the Methods.
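
      The definition above can be sketched in code as follows. This is an illustrative sketch under stated assumptions: PER bouts are represented as (start, end) times in seconds relative to valve opening, and the function name and example bouts are hypothetical.

```python
def integrated_per_duration(bouts, window_start=1.0, window_len=4.0):
    """Sum the PER time that falls inside the analysis window.

    `bouts` is a list of (start_s, end_s) PER episodes, timed from odor
    onset (valve opening). The window spans 4 s starting 1 s after onset.
    """
    window_end = window_start + window_len
    total = 0.0
    for start, end in bouts:
        # Overlap of this bout with the window; zero if they do not intersect
        overlap = min(end, window_end) - max(start, window_start)
        total += max(0.0, overlap)
    return total


# Hypothetical bouts: the first and last are clipped by the window edges
bouts = [(0.5, 1.5), (2.0, 3.0), (4.5, 6.0)]
duration = integrated_per_duration(bouts)
```

      Clipping each bout to the window means that extensions occurring before the odor reaches the fly, or after the 4 s window, do not contribute to the total.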

      - I could not find the statistical analysis for Figures 1E and 1G. If these figures are descriptive, I suggest the authors revise the sentences: 'Unexpectedly, we found that the odors alone evoked repetitive PER without an application of a tastant (Figures 1D-1G, and Movie S1). Different odors evoked PER with different probability (Figure 1E), latency (Figure S1A), and duration (Figures 1F, 1G, and S2)'.

      We have added the results of statistical analysis to the figure legend.

      - In Figure 2, the authors performed a Scheirer-Ray-Hare test, which, to my knowledge, is a nonparametric test for comparing responses across more than two groups with two factors. If this is the case, please provide the p-values for both factors and their interaction.

      We now show the p-values for both factors, odor and group as well as their interaction in the supplementary table. 

      - In line 83, I suggest the authors avoid claiming that 'these data show the olfactory system modulates but is not required for odor-evoked PER,' as they are inhibiting most, but not all olfactory receptor neurons. In this regard, is it possible to measure the olfactory response to odors in these flies?  

      We thank the reviewer for the comment. Because Orco-Gal4 does not label all the ORNs and because we have more concrete results on flies in which all the olfactory organs are removed as well as specific GRNs and Gr are manipulated, we decided to remove the data for Orco>kir2.1 flies and have updated the text and Figure 2 accordingly.

      - In Figure 2, I wonder if there are differences in the contribution of various receptors in detecting different odors. A more detailed statistical analysis might help address this question.

      Although it might be possible to infer the contribution of different gustatory receptors by constructing a quantitative model to predict PER, it is a bit tricky because the activity of individual GRNs and not Grs are manipulated in Figure 2 except for Gr5a. The idea could be tested in the future by more systematically manipulating many Grs that are encoded in the fly genome.

      - For Figures 2J-L, please clarify which group serves as the control.  

      We have added this information to the legend. 

      - In Figure 3, I recommend including an air control in panels D and F to better appreciate the magnitude of the response under these conditions.

      The responses to all three controls, air, mineral oil and water, were almost zero. As the other reviewer suggested to present trial-to-trial variability as well, we now show responses to all the controls in all the trials in all the animals tested in Figure 3-figure supplement 2.

      - I had difficulty understanding Figure 3G. Could the authors provide a more detailed explanation of the model?

      We used linear regression (implemented in Python using scikit-learn) to model the relationship between neural activity and behavior, aiming to predict the PER duration based on the calcium responses of two GRN types, Gr5a and Gr66a. The weights for GRNs were estimated using the LinearRegression function. The weights for Gr5a and Gr66a were positive and negative, respectively, indicating that Gr5a activity enhances PER whereas Gr66a activity reduces it.

      To evaluate the model performance, we calculated the coefficient of determination (R<sup>2</sup>), which was 0.81, meaning the model explained 81% of the variance in the PER data.

      The scatter plot in Fig. 3G shows the predicted PER duration (y-axis) plotted against the actual PER duration (x-axis); the tight relationship demonstrates the strong predictive power of the model.

      We added the details to the Methods.
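
      To make the form of such a model concrete, a two-predictor least-squares fit and its R<sup>2</sup> can be sketched as below. This is not the authors' code: it is a dependency-free stand-in that solves the same ordinary least-squares problem as scikit-learn's LinearRegression, and the data, weights (+3.0 and -2.0) and function names are hypothetical values chosen so that the fit can be checked by eye.

```python
def fit_linear(X, y):
    """Ordinary least squares with an intercept, via the normal equations.

    Returns [intercept, w_1, ..., w_k]. Stand-in for a library regression.
    """
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    n = len(rows[0])
    # A = X^T X and b = X^T y
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for k in range(col + 1, n):
            factor = A[k][col] / A[col][col]
            for c in range(col, n):
                A[k][c] -= factor * A[col][c]
            b[k] -= factor * b[col]
    # Back substitution
    coef = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (b[i] - tail) / A[i][i]
    return coef


def r_squared(X, y, coef):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    pred = [coef[0] + sum(w * v for w, v in zip(coef[1:], x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical (Gr5a, Gr66a) calcium responses, one pair per odor
responses = [(0.1, 0.0), (0.5, 0.2), (0.9, 0.1), (0.3, 0.6), (0.7, 0.4), (0.2, 0.3)]
# Hypothetical noise-free PER durations: positive Gr5a weight, negative Gr66a weight
per_duration = [0.5 + 3.0 * g5 - 2.0 * g66 for g5, g66 in responses]

intercept, w_gr5a, w_gr66a = fit_linear(responses, per_duration)
r2 = r_squared(responses, per_duration, [intercept, w_gr5a, w_gr66a])
```

      Because the toy durations are generated without noise, the fit recovers the chosen weights and R<sup>2</sup> is essentially 1; with real data, R<sup>2</sup> is the fraction of variance in PER duration explained by the two GRN responses.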

      - In Figure S4a, the reported p-value is 0.88, which seems to be a typo, as the text indicates that PER is enhanced in a starved state.

      Thank you for pointing this out. We have modified the figure legend to describe that PER was enhanced in a starved state only for the experiments conducted with odors at 10<sup>-1</sup> concentration (current Figure 2-figure supplement 1).

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The authors of this study seek to visualize NS1 purified from dengue virus infected cells. They infect Vero cells with DV2-WT and DV2 NS1-T164S (a mutant virus previously characterized by the authors). The authors utilize an anti-NS1 antibody to immunoprecipitate NS1 from cell supernatants and then elute the antibody/NS1 complex with acid. The authors evaluate the eluted NS1 by SDS-PAGE, native PAGE, mass spec, negative-stain EM, and eventually cryoEM. SDS-PAGE, mass spec, and native PAGE reveal a >250 kDa species containing both NS1 and the proteinaceous component of HDL (ApoA1). The authors produce evidence to suggest that this population is predominantly NS1 in complex with ApoA1. This contrasts with recombinantly produced NS1 (obtained from a collaborator) which did not appear to be in complex with or contain ApoA1 (Figure 1C). The authors then visualize their NS1 stock in complex with their monoclonal antibody by cryoEM. For NS1-WT, the major species visualized by the authors was a ternary complex of an HDL particle in complex with an NS1 dimer bound to their mAb. For their mutant NS1-T164S, they find similar structures, but in contrast to NS1-WT, they visualize free NS1 dimers in complex with 2 Fabs (similar to what's been reported previously) as one of the major species. This highlights that different NS1 species have markedly divergent structural dynamics. It's important to note that the electron density maps for their structures do appear to be a bit overfitted since there are many regions with electron density that do not have a predicted fit and their HDL structure does not appear to have any predicted secondary structure for ApoA1. The authors then map the interaction between NS1 and ApoA1 using cross-linking mass spectrometry revealing numerous NS1-ApoA1 contact sites in the beta-roll and wing domain. The authors find that NS1 isolated from DENV infected mice is also present as a >250 kDa species containing ApoA1. They further determine that immunoprecipitation of ApoA1 out of the sera from a single dengue patient correlates with levels of NS1 (presumably co-IPed by ApoA1) in a dose-dependent manner.

      In the end, the authors make some useful observations for the NS1 field (mostly confirmatory) providing additional insight into the propensity of NS1 to interact with HDL and ApoA1. The study does not provide any functional assays to demonstrate activity of their proteins or conduct mutagenesis (or any other assays) to support their interaction predictions. The authors' assertion that higher-order NS1 exists primarily as an NS1 dimer in complex with HDL is not well supported as their purification methodology of NS1 likely introduces bias as to what NS1 complexes are isolated. While their results clearly reveal NS1 in complex with ApoA1, the lack of other NS1 homo-oligomers may be explained by how they purify NS1 from virally infected supernatant. Because NS1 produced during viral infection is not tagged, the authors use an anti-NS1 monoclonal antibody to purify NS1. This introduces a source of bias since only NS1 oligomers with their mAb epitope exposed will be purified. Further, the use of acid to elute NS1 may denature or alter NS1 structure and the authors do not include controls to test functionality of their NS1 stocks (capacity to trigger endothelial dysfunction or immune cell activation). The acid elution may force NS1 homo-oligomers into dimers which then reassociate with ApoA1 in a manner that is not reflective of native conditions. Conducting cryoEM of NS1 stocks only in the presence of full-length mAbs or Fabs also severely biases what species of NS1 is visualized since any NS1 oligomers without the β-ladder domain exposed will not be visualized. If the residues obscured by their mAb are involved in formation of higher-order oligomers then this antibody would functionally inhibit these species from forming. The absence of critical controls, use of one mAb, and acid elution for protein purification severely limit the interpretation of these data and do not paint a clear picture of whether NS1 produced during infection is structurally distinct from recombinant NS1. Certainly there is novelty in purifying NS1 from virally infected cells, but without using a few different NS1 antibodies to purify NS1 stocks (or better yet a polyclonal population of antibodies) it's unclear if the results of the authors are simply a consequence of the mAb they selected.

      Data produced from numerous labs studying structure and function of flavivirus NS1 proteins provide diverse lines of evidence that the oligomeric state of NS1 is dynamic and can shift depending on context and environment. This means that the methodology used for NS1 production and purification will strongly impact the results of a study. The data in this manuscript certainly capture one of these dynamic states and overall support the general model of a dynamic NS1 oligomer that can associate with both host proteins as well as itself but the assertions of this manuscript are overall too strong given their data, as there is little evidence in this manuscript, and none available in the large body of existing literature, to support that NS1 exists only as a dimer associated with ApoA1. More likely the results of this paper are a result of their NS1 purification methodology.

      Suggestions for the Authors:

      Major:

      (1) Because of the methodology used for NS1 purification, it is not clear from the data provided if NS1 from viral infection differs from recombinant NS1. Isolating NS1 from viral infection using a polyclonal antibody population would be better to answer their questions. On this point, Vero cells are also not the best candidate for their NS1 production given these cells do not come from a human. A more relevant cell line like U937-DC-SIGN would be preferable.

      We performed an optimization of sNS1 secretion from DENV infection in different cell lines (Author response image 1 below) to identify the best cell line candidate to obtain relatively high yield of sNS1 for the study. As shown in Author response image 1, the levels of sNS1 in the tested human cell lines Huh7 and HEK 293T were at least 3-5 fold lower than in Vero cells. Although using a monocytic cell line expressing DC-SIGN as suggested by the reviewer would be ideal, in our experience the low infectivity of DENV in monocytic cell lines will not yield sufficient amount of sNS1 needed for structural analysis. For these practical reasons we decided to use the closely related non-human primate cell line Vero for sNS1 production supported by our optimization data.

      Author response image 1.

      sNS1 secretion in different mammalian and mosquito cell lines after DENV2 infection. The NS1 secretion level was measured using the Platelia™ Dengue NS1 Ag ELISA kit (Bio-Rad) on day 3 (left) and day 5 (right) post infection, respectively.

      (2) The authors need to support their interaction predictions and models via orthogonal assays like mutagenesis followed by HDL/ApoA1 complexing and even NS1 functional assays. The authors should be able to mutate NS1 at regions predicted to be critical for ApoA1/HDL interaction. This is critical to support the central conclusions of this manuscript.

      In our previous publication (Chan et al., 2019 Sci Transl Med), we used similarly purified sNS1 (immunoaffinity purification followed by acid elution) from infected culture supernatants from both DENV2 wild-type and T164S mutant (both also studied in the present work) to carry out stimulation assay on human PBMCs as described by other leading laboratories investigating NS1 (Modhiran et al., 2015 Sci Transl Med). For reader convenience we have extracted the data from our published paper and present it as Author response image 2 below.

      Author response image 2.

      (A) IL6 and (B) TNF-α concentrations measured in the supernatants of human PBMCs incubated with either 1 µg/ml or 10 µg/ml of the BHK-21 immunoaffinity-purified WT and TS mutant sNS1 for 24 hours. Data are adapted from Chan et al., 2019.

      Incubation of immunoaffinity-purified sNS1 (WT and TS) with human PBMCs from 3 independent human donors triggered the production of the proinflammatory cytokines IL6 and TNF in a concentration-dependent manner (Author response image 2), consistent with the published data by Modhiran et al., 2015 Sci Transl Med. Interestingly, the TS mutant-derived sNS1 induced higher proinflammatory cytokine production than WT virus-derived sNS1, which appears to correlate with the more lethal and severe disease phenotype in mice, as also reported in our previous work (Chan et al., 2019). Additionally, the functionality of our immunoaffinity-purified, infection-derived sNS1 (isNS1) is now further supported by our preliminary results from the NS1-induced endothelial cell permeability assay using the purified WT and mutant isNS1 (Author response image 3). As shown in Author response image 3, both isNS1wt and the isNS1ts mutant reduced the relative transendothelial resistance from 0 to 9 h post-treatment, with the peak resistance reduction observed at 6 h post-treatment, suggesting that the purified isNS1 induced endothelial dysfunction as reported in Puerta-Guardo et al., 2019, Cell Rep. It is noteworthy that the isNS1 in our study behaves similarly to the commercial recombinant sNS1 (rsNS1 purchased from the same source used in the study by Puerta-Guardo et al., 2019) in inducing endothelial hyperpermeability. Collectively, our previously published and current data suggest that the purified isNS1 (as a complex with ApoA1) has a pathogenic role in disease, which is also supported by a recent publication (Benfrid et al., EMBO 2022). The acid elution has not affected the functionality of NS1.

      Author response image 3.

      Functional assessment of isNS1wt and isNS1ts on vascular permeability in vitro. A trans-endothelial permeability assay via measurement of the transendothelial electrical resistance (TEER) on human umbilical vascular endothelial cells (hUVEC) was performed, as described previously (Puerta-Guardo et al., 2019, Cell Rep). Ovalbumin serves as the negative control, while TNF-α and rsNS1 serve as the positive controls.

      We agree with the reviewer about the suggested mutagenesis study. We will perform site-directed mutagenesis at selected residues, followed by further structural and functional analyses, and report the results in a follow-up study.

      (3) The authors need to show that the NS1 stocks produced using acid elution are functional compared to standard recombinantly produced NS1. Do acidic conditions impact structure/function of NS1?

      We are providing the same response to comments 1 & 2 above. We would like to reiterate that we have previously used sNS1 from immunoaffinity purification followed by acid elution to test its function in stimulating PBMCs to produce pro-inflammatory cytokines (Chan et al., 2019; Author response image 2). Similar to Modhiran et al. (2015) and Benfrid et al. (2022), the sNS1 that we extracted using acid elution are capable of activating PBMCs to produce pro-inflammatory cytokines. We have now further demonstrated the ability of both WT and TS isNS1 in inducing endothelial permeability in vitro in hUVECs, using the TEER assay (Author response image 3). Based on the data presented in the rebuttal figures as well as our previous publication we do not think that the acid elution has a significant impact on function of isNS1.

      We performed affinity purification to enrich the complex for better imaging and analysis (Supp Fig. 1b), since the crude supernatant contains serum proteins and serum-free infections do not provide sufficient isNS1. The major complex observed in negative stain is 1:1 (also under acidic conditions, which implies that the complexes are stable and intact). We agree that it is possible that other oligomers can form, but we have observed only a small population (74 out of 3433 particles, 2.15%; 24 micrographs) of HDL:sNS1 complex at 1:2 ratio, as shown in Author response image 4 below and in the manuscript (p. 4 lines 114-117, Supp Fig. 1c). Other NS1 dimer:HDL ratios including 2:1 and 3:1 have been reported by Benfrid et al., 2022 by spiking healthy sera with recombinant sNS1 and subsequent re-affinity purification. However, this method used an approximately 8-fold higher sNS1 concentration (400 μg/mL) than the maximum clinically reported concentration (50 μg/mL) (Young et al., 2000; Alcon et al., 2002; Libraty et al., 2002). In our hands, the sNS1 concentration in the concentrated media from in vitro infection was quantified as 30 μg/mL, which is more physiologically relevant.

      We conclude that the integrity of the HDL of the complex is not lost during sample preparation, as we are able to observe the complex under the negative staining EM as well as infer from XL-MS. Our rebuttal data and our previous studies with our acid-eluted isNS1 from immunoaffinity purification clearly show that our protein is functional and biologically relevant.

      Author response image 4.

      (A) Representative negative stain micrograph of sNS1wt. (B) Representative 2D averages of negative stained isNS1wt. Red arrows indicate the characteristic wing-like protrusions of NS1 inserted in HDL. (C) Data adapted from Figure 2 in Benfrid et al. (2022).

      (4) Overall, the data obtained from the mutant NS1 (contrasted to WT NS1) reveals how dynamic the oligomeric state of NS1 proteins are but the authors do not provide any insight into how/why this is, some additional lines of evidence using either structural studies or mutagenesis to compare WT and their mutant and even NS1 from a different serotype of DENV would help the field to understand the dynamic nature of NS1.

      The T164S mutation in DENV2 NS1 was proposed as the residue associated with disease severity in the 1997 Cuban dengue epidemic (Halstead SB. “Intraepidemic increases in dengue disease severity: applying lessons on surveillance and transmission”. Whitehorn, J., Farrar, J., Eds., Clinical Insights in Dengue: Transmission, Diagnosis & Surveillance. The Future Medicine (2014), pp. 83-101). Our previous manuscript examined this mutation by engineering it into a less virulent clade 2 DENV isolated in Singapore and showed that sNS1 production was higher without any change in viral RNA replication. Transcript profiling of mutant compared to WT virus showed that genes that are usually induced during vascular leakage were upregulated for the mutant. We also showed that infection of interferon-deficient AG129 mice with the mutant virus resulted in disease severity, increased complement protein expression in the liver, tissue inflammation and greater mortality compared to WT virus-infected mice. The lipid profiling in our study (Chan et al., 2019) suggested small differences with WT but was overall similar to HDL as described by Gutsche et al. (2011). We were intrigued by our functional results and wanted to explore more deeply the impact of the mutation on sNS1 structure, which at that stage was widely believed to be a trimer of NS1 dimers with a central channel (~ X Å) stuffed with lipid, as established in several seminal publications (Flamand et al., 1999; Gutsche et al., 2011; Muller et al., 2012). In fact, the “This Week in Virology” netcast (https://www.microbe.tv/twiv/twiv-725/) discussed two back-to-back publications in Science (Modhiran et al., Science 371(6625):190-194; Biering et al., Science 371(6625):194-200), which showed that therapeutic antibodies can ameliorate NS1-induced pathogenesis. The expert discussants posed questions that also pointed to the need for a more accurate definition of the molecular composition and architecture of the circulating NS1 complex during virus infection, in order to get a clearer handle on its pathogenic mechanism. Our current studies and the recent high-resolution cryoEM structures (Shu et al., 2022) do not support the notion of a central channel “stuffed with lipid”. Even in the rare instances where trimers of dimers are shown, the narrow channel in the center could only accommodate one lipid molecule no bigger than a typical triglyceride. This hexamer model cannot explain the lipid proteomics data in the literature.

      In our study we observed predominantly 1:1 NS1 dimer to HDL (~30 μg/mL), mirroring the maximum clinically reported concentration of sNS1 in the sera of DENV patients (40-50 μg/mL), as we highlighted in our main text (P. 18, lines 461-471). What is often quoted (also see later) is the recent study of Flamand & co-workers, which shows 1-3 NS1 dimers per HDL (Benfrid et al., 2022) after spiking rsNS1 (400 μg/mL) with HDL. This should not be confused with the previous models, which suggested a lipid-filled central channel holding together the hexamer. The use of physiologically relevant concentrations is important for these studies, as we have highlighted in our main text (P. 18, lines 461-471).

      Our interpretation for the mutant (isNS1ts) is that the hydrophilic serine at residue 164, located in the greasy finger loop, may weaken isNS1ts binding to HDL, hence the observation of free sNS1 dimers in our immunoaffinity-purified (acid-eluted) sample. The disease severity and increased complement protein expression in AG129 mouse liver can be ascribed to weakly bound mutant NS1, with a fast on/off rate with HDL, being transported to the liver, where specific receptors bind to free sNS1 and interact with effector proteins such as complement to drive inflammation and associated pathology. Our indirect support for this is that the XL-MS analysis of purified isNS1ts identified only 7 isNS1ts:ApoA1 crosslinks, while 25 isNS1wt:ApoA1 crosslinks were identified from purified isNS1wt (refer to Fig. 4 and Supp. Fig. 8).

      Taken together, the cryoEM and XL-MS analyses of purified isNS1ts suggest that isNS1ts has weaker affinity for HDL than isNS1wt. We welcome constructive discussion of our interpretation and hope that we and others will obtain more data to support or refute our proposed explanation. Our focus has been to compare WT with mutant sNS1 from DENV2, and we agree that it will be useful to study other serotypes.

      Reviewer #2:

      CryoEM:

      Some of the neg-stain 2D class averages for sNS1 in Fig S1 clearly show 1 or 2 NS1 dimers on the surface of a spherical object, presumably HDL, and indicate the possibility of high-quality cryoEM results. However, the cryoEM results are disappointing. The cryo 2D class averages and refined EM map in Fig S4 are of poor quality, indicating sub-optimal grid preparation or some other sample problem. Some of the FSC curves (2 in Fig S7 and 1 in Fig S6) have extremely peculiar shapes, suggesting something amiss in the map refinement. The sharp drop in the "corrected" FSC curves in Figs S5c and S6c (upper) indicate severe problems. The stated resolutions (3.42 & 3.82 Å) for the sNS1ts-Fab56.2 are wildly incompatible with the images of the refined maps in Figs 3 & S7. At those resolutions, clear secondary structural elements should be visible throughout the map. From the 2D averages and 3D maps shown in the figures this does not seem to be the case. Local resolution maps should be shown for each structure.

      The same sample was used for the negative staining and cryoEM results presented. The cryoEM 2D class averages are similar to the negative stain ones, with many spherical-like densities with no discernible features, presumably HDL only or cases in which the NS1 features are averaged out. The key difference lies in the 2D class averages where NS1 could be seen. The side views of NS1 (wing-like protrusion) are more obvious in the negative stain, while the top views of NS1 (cross-shaped protrusion) are more obvious under cryoEM. HDL particles are inherently heterogeneous and known to range from 70-120 Å; this has been highlighted in the main text (p. 8, lines 203 and 228). This helps to explain why the reviewer may find the cryoEM result disappointing. The sample is inherently challenging to resolve structurally, which does not mean that the sample is of poor quality. In terms of grid preparation, Supp Fig 4b shows a representative motion-corrected micrograph of the isNS1ts sample in which individual particles can be discerned and are evenly distributed across the grid at high density.

      We acknowledge that most of the dips in the FSC curves (Fig S5-7) are irregular and affect the accuracy of the stated resolutions, particularly for the HDL-isNS1ts-Fab56.2 and isNS1ts-Fab56.2 maps for which the local resolution maps are shown (Fig S7d-e). Probable reasons affecting the FSC curves include (1) the heterogeneous nature of HDL, (2) the preferred orientation issue (p 7, lines 198-200), and (3) data quality that is intrinsically less ideal for high-resolution single particle analysis. Optimizing the dynamic masking so that the mask is not sharper than the resolution of the map, by varying the near (default = 3 angstroms) and far (default = 12 angstroms) parameters over 6-12 and 14-20 angstroms respectively during data processing, did not help to improve the FSC curves. To report a more accurate global resolution, we have revised Figures S5-7 with new FSC curve plots generated using the remote 3DFSC processing server.

      Regardless, the overall architecture and the relative arrangement of NS1 dimer, Fab, and HDL are clearly visible and identifiable in the map. These results agree well with our biochemical data and mass-spec data.

      The samples were clearly challenging for cryoEM, leading to poor quality maps that were difficult to interpret. None of the figures are convincing that NS1, Ab56.2 or Fab56.2 are correctly fit into EM maps. There is no indication of ApoA1 helices. Details of the fit of models to density for key regions of the higher-resolution EM maps should be shown and the models should be deposited in the PDB. An example of modeling difficulty is clear in the sNS1ts dimer with bound Fab56.2 (figs 3c & S7e). For this complex, the orientation of the Fab56.2 relative to the sNS1ts dimer in this submission (Fig 3c) is substantially different than in the bioRxiv preprint (Fig 3c). Regions of empty density in Fig 3c also illustrate the challenge of building a model into this map.

      We acknowledge the modelling challenge posed by low resolution maps in general, such as the handedness of the Fab molecule as pointed out by the reviewer (which is why others have developed the use of anti-fab nanobody to aid in structure determination among other methods). The change in orientation of the Fab56.2 relative to the sNS1ts dimer was informed by the HDX-MS results which was not done at the point of bioRxiv preprint mentioned. With regards to indication of ApoA1 helices, this is expected given the heterogeneous nature of HDL. To the best of our knowledge, engineered apoA1 helices were also not reported in many cryoEM structures of membrane proteins solved in membrane scaffold protein (MSP) nanodiscs. This is despite nanodiscs, comprised of engineered apoA1 helices, having well-defined size classifications.

      Regions of weak density in Fig 3c are expected due to the preferred orientation issue acknowledged in the results section of the main text (p. 9, line 245). The cryoEM density maps have been deposited in the Electron Microscopy Data Bank (EMDB) under accession codes EMD-36483 (isNS1ts:Fab56.2) and EMD-36480 (Fab56.2:isNS1ts:HDL). The protein model files for the isNS1ts:Fab56.2 and Fab56.2:isNS1ts:HDL models are available upon request. Crosslinking MS raw files and the search results can be downloaded from https://repository.jpostdb.org/preview/14869768463bf85b347ac2 with the access code: 3827. The HDX-MS data are deposited to the ProteomeXchange consortium via the PRIDE partner repository (ref 51) with the dataset identifier PXD042235.

      Mass spec:

      Crosslinking-mass spec was used to detect contacts between NS1 and ApoA1, providing strong validation of the sNS1-HDL association. As the crosslinks were detected in a bulk sample, they show that NS1 is near ApoA1 in many/most HDL particles, but they do not indicate a specific protein-protein complex. Thus, the data do not support the model of an NS1-ApoA1 complex in Fig 4d. Further, a specific NS1-ApoA1 interaction should have evidence in the EM maps (helical density for ApoA1), but none is shown or mentioned. If such exists, it could perhaps be visualized after focused refinement of the map for sNS1ts-HDL with Fab56.2 (Fig S7d). The finding that sNS1-ApoA1 crosslinks involved residues on the hydrophobic surface of the NS1 dimer confirms previous data that this NS1 surface engages with membranes and lipids.

      We thank the reviewer for the comment. XL-MS is a method that identifies protein-protein interactions by proximity, within the spacer arm length of the crosslinker. The crosslinking MS data do support the NS1-ApoA1 complex model obtained by cryo-EM because the identified crosslinks that are superimposed on the EM map are within the cut-off distance of 30 Å. We agree that the XL-MS data do not dictate the specific interactions between specific residues of NS1-ApoA1 in the EM model. We also do not claim that a specific residue of NS1 in the beta roll or wing domain is interacting with a specific residue of ApoA1 in the H4 and H5 domains. We claim that the beta roll and wing domain regions of NS1 are interacting with ApoA1 in HDL, indicating the proximity nature of NS1-ApoA1 interactions as warranted by the XL-MS data.
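
      As an illustration of the cut-off check described above, the sketch below flags crosslinks whose two residues lie within 30 Å of each other in a model. It is not the authors' pipeline: measuring straight-line distances between one representative atom per residue is an assumption of this sketch, and the residue labels and coordinates are hypothetical.

```python
import math


def crosslinks_within_cutoff(crosslinks, coords, cutoff=30.0):
    """Return (residue_a, residue_b, distance, satisfied) for each crosslink.

    `coords` maps a residue label to an (x, y, z) position in angstroms.
    A crosslink counts as satisfied when the straight-line distance
    between its two residues does not exceed the cut-off.
    """
    results = []
    for res_a, res_b in crosslinks:
        distance = math.dist(coords[res_a], coords[res_b])
        results.append((res_a, res_b, distance, distance <= cutoff))
    return results


# Hypothetical residue positions (angstroms) and crosslinked pairs
coords = {
    "NS1:K1": (0.0, 0.0, 0.0),
    "ApoA1:K1": (0.0, 0.0, 25.0),
    "ApoA1:K2": (30.0, 40.0, 0.0),
}
crosslinks = [("NS1:K1", "ApoA1:K1"), ("NS1:K1", "ApoA1:K2")]
checked = crosslinks_within_cutoff(crosslinks, coords)
```

      In this toy case the first pair is 25 Å apart and satisfies the cut-off, while the second is 50 Å apart and does not.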

      As explained in the previous response on the lack of indication of ApoA1 helical density, this is expected given the heterogeneous nature of HDL. It is typical to see lipid membranes as unstructured and of lower density than the structured protein. In our study, local refinement was performed on either the global map (presented in Fig S7d) or focused on the NS1-Fab region only. Both yielded similar maps, as illustrated in the real space slices shown in Author response image 5. The mask and map overlay is depicted in similar orientations to the real space slices, and at different contour thresholds of 0.05 (Author response image 5e) and 0.135 (Author response image 5f). While the overall map is of poor resolution and directional anisotropy is evident, there are clear signal differences in the low-density region (i.e. the HDL sphere) indicative of NS1 interaction with ApoA1 in HDL, extending from the NS1 wing to the base of the HDL sphere.

      Author response image 5.

      Real Space Slices of map and mask used during Local Refinement for overall structure (a-b) and focused mask on NS1 region (c-d). The corresponding map (grey) contoured at 0.05 (e) and 0.135 (f) in similar orientations as shown for the real space slices of map and masks. The focused mask of NS1 used is colored in semi-transparent yellow. Real Space Slices of map and mask are generated during data processing in Cryosparc 4.0 and the map figures were prepared using ChimeraX.

      Sample quality:

      The paper lacks any validation that the purified sNS1 retains established functions, for example the ability to enhance virus infectivity or to promote endothelial dysfunction.

      Please see the detailed response to question 2 in Reviewer #1’s comments. In essence, we have shown that both isNS1wt and isNS1ts are capable of inducing endothelial permeability in an in vitro TEER assay (Rebuttal Fig 3), and have also shown this functionality in our previous study, which quantified inflammation in human PBMCs (Rebuttal Fig 2).

      Peculiarities include the gel filtration profiles (Fig 2a), which indicate identical elution volumes (apparent MWs) for sNS1wt-HDL bound to Ab56.2 (~150 kDa) and to the ~3X smaller Fab56.2 (~50 kDa). There should also be some indication of sNS1wt-HDL pairs crosslinked by the full-length Ab, as can be seen in the raw cryoEM micrograph (Fig S5b).

      Obtaining high quality structures is often more demanding of sample integrity than are activity assays. Given the low quality of the cryoEM maps, it's possible that the acidification step in immunoaffinity purification damaged the HDL complex. No validation of HDL integrity, for example with acid-treated HDL, is reported.

      Please see detailed response for question 3 in Reviewer #1’s comments.

      Acid treatment is perhaps discounted by a statement (line 464) that another group also used immunoaffinity purification in a recent study (ref 20) reporting sNS1 bound to HDL. However the statement is incorrect; the cited study used affinity purification via a strep-tag on recombinant sNS1.

      We thank the Reviewer for pointing this out and have rewritten this paragraph (p 18, line 445-455). We also expanded our discussion to highlight our prior functional studies showing that acid-eluted isNS1 proteins do induce endothelial hyperpermeability (p 18-19, line 470-476).

      Discussion:

      The Discussion reflects a view that the NS1 secreted from virus-infected cells is a 1:1 sNS1dimer:HDL complex with the specific NS1-ApoA1 contacts detected by crosslinking mass spec. This is inconsistent with both the neg-stain 2D class average with 2 sNS1 dimers on an HDL (Fig S1c) and with the recent study of Flamand & co-workers showing 1-3 NS1 dimers per HDL (ref 20). It also ignores the propensity of NS1 to associate with membranes and lipids. It is far more likely that NS1 association with HDL is driven by these hydrophobic interactions than by specific protein-protein contacts. A lengthy Discussion section (lines 461-522) includes several chemically dubious or inconsistent statements, all based on the assumption that specific ApoA1 contacts are essential to NS1 association with HDL and that sNS1 oligomers higher than the dimer necessarily involve ApoA1 interaction, conclusions that are not established by the data in this paper.

      We thank the Reviewer and have revised our discussion to cover available structural and functional data to draw conclusions that invariably also need further validation by others. One point that is repeatedly brought up by Reviewers 1 and 2 is the quality and functionality of our sample. Our conclusion now reiterates this point based on our own published data (Chan et al., 2019) and also the TEER assay data provided as Author response image 3.

      Reviewer #1 (Recommendations For The Authors):

      Minor:

      (1) Fig. S3B, should the label for lane 4 be isNS1? In figure 1C you do not see ApoA1 for rsNS1 but for S3B you do? Which is correct?

      This has been corrected in Fig. S3B: the label for lane 4 has been corrected to isNS1 and that for lane 1 to rsNS1, where no ApoA1 band (25 kDa) is found.

      (2) Line 436, is this the correct reference? Reference 43?

      This has been corrected in the main text. (p 20, Line 507; Lee et al., 2020, J Exp Med).

      Reviewer #2 (Recommendations For The Authors):

      The cryoEM data analysis is incompletely described. The process (software, etc) leading to each refined EM map should be stated, including the use of reference structures in any step. These details are not in the Methods or in Figs S4-7, as claimed in the Methods. The use of DeepEMhancer (which refinements?) with the lack of defined secondary structural features in the maps and without any validation (or discussion of what was used as "ground truth") is concerning. At the least, the authors should show pre- and post-DeepEMhancer maps in the supplemental figures.

      The data processing steps in the Methods section have been described with improved clarity. DeepEMhancer is a deep learning solution for cryo-EM volume post-processing to reduce noise levels and obtain more detailed versions of the experimental maps (Sanchez-Garcia, et al., 2021). DeepEMhancer was only used to sharpen the maps and reduce the noise for classes 1 and 2 of isNS1wt in complex with Ab56.2 for visualization purpose only and not for any refinements. To avoid any confusion, the use of DeepEMhancer has been removed from the supp text and figures.

      Line 83 - "cryoEM structures...recently reported" isn't ref 17

      This reference has been corrected to Shu et al. (2022) in p 3, line 83.

      Fig. S3 - mis-labeled gel lanes

      This has been corrected in Fig. S3B: the label for lane 4 has been corrected to isNS1 and that for lane 1 to rsNS1.

      Fig S6c caption - "Representative 2D classes of each 3D classes, white bar 100 Å. Refined 3D map for classes 1 and 2 coloured by local resolution". The first sentence is unclear, and there is no white scale bar and no heat map.

      Fig S6c caption has been corrected to “Representative 3D classes contoured at 0.06 and its particle distribution as labelled and coloured in cyan. Scale bar of 100 Å as shown. Refined 3D maps and their respective FSC resolution charts and posterior precision directional distribution as generated in Cryosparc 4.0”.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary: 

      The authors performed experimental evolution of MreB mutants that have a slow-growing round phenotype and studied the subsequent evolutionary trajectory using analysis tools from molecular biology. It was remarkable and interesting that they found that the original phenotype was not restored (most common in these studies) but that the round phenotype was maintained. 

      Strengths: 

      The finding that the round phenotype was maintained during evolution rather than that the original phenotype, rod-shaped cells, was recovered is interesting. The paper extensively investigates what happens during adaptation with various different techniques. Also, the extensive discussion of the findings at the end of the paper is well thought through and insightful.

      Weaknesses: 

      I find there are three general weaknesses: 

      (1) Although the paper states in the abstract that it emphasizes "new knowledge to be gained" it remains unclear what this concretely is. On page 4 they state three research questions, these could be more extensively discussed in the abstract. Also, these questions read more like genetics questions while the paper is a lot about cell biological findings.

      Thank you for drawing attention to the unnecessary and gratuitous nature of the last sentence of the Abstract. We are in agreement. It has been modified, and we have taken advantage of additional word space to draw attention to the importance of the two competing (testable) hypotheses laid out in the Discussion.

      As to new knowledge, please see the Results and particularly the Discussion. But beyond this, and as recognised by others, there is real value for cell biology in seeing how (and whether) selection can compensate for effects that are deleterious to fitness. The results will very often depart from those delivered from, for example, suppressor analyses, or bottom-up engineering.

      In the work recounted in our paper, we chose to focus – by way of proof-of-principle – on the most commonly observed mutations, namely, those within pbp1A. But beyond this gene, we detected mutations in other components of the cell shape / division machinery whose connections are not yet understood and which are the focus of on-going investigation.

      As to the three questions posed at the end of the Introduction, the first concerns whether selection can compensate for deleterious effects of deleting mreB (a question that pertains to evolutionary aspects); the second seeks understanding of genetic factors; the third aims to shed light on the genotype-to-phenotype map (which is where the cell biology comes into play). Given space restrictions, we cannot see how we could usefully expand, let alone discuss, the three questions raised at the end of the Introduction in the restrictive space available in the Abstract.

      (2) It is not clear to me from the text what we already know about the restoration of MreB loss from suppressors studies (in the literature). Are there suppressor screens in the literature and which part of the findings is consistent with suppressor screens and which parts are new knowledge?  

      As stated in the Introduction, in a previous study with B. subtilis (which harbours three MreB isoforms and where the isoform named “MreB” is essential for growth under normal conditions), suppressors of MreB lethality were found to occur in ponA, a class A penicillin binding protein (Kawai et al., 2009). This led to recognition that MreB plays a role in recruiting Pbp1A to the lateral cell wall. On the other hand, Patel et al. (2020) have shown that deletion of class A PBPs leads to an up-regulation of rod complex activity. Although there is a connection between rod complex and class A PBPs, a further study has shown that the two systems work semi-autonomously (Cho et al., 2016).

      Our work confirms a connection between MreB and Pbp1A, and sheds new light on how this interaction is established by natural selection, which targets the integrity of the cell wall. Indeed, the Rod complex and class A PBPs have complementary activities in building the cell wall, with each of the two systems able to compensate for the other in order to maintain cell wall integrity. Please see the major part of the Discussion. In terms of specifics, the connection between mreB and pbp1A shown by Kawai et al. (2009) is indirect because it is based on extragenic transposon insertions. In our study, the genetic connection is mechanistically demonstrated. In addition, we show that the evolutionary dynamics are rapid, and we enrich understanding of the genotype-to-phenotype map.

      (3) The clarity of the figures, captions, and data quantification need to be improved.  

      Modifications have been implemented. Please see responses to specific queries listed below.

      Reviewer #2 (Public Review): 

      Yulo et al. show that deletion of MreB causes reduced fitness in P. fluorescens SBW25 and that this reduction in fitness may be primarily caused by alterations in cell volume. To understand the effect of cell volume on proliferation, they performed an evolution experiment through which they predominantly obtained mutations in pbp1A that decreased cell volume and increased viability. Furthermore, they provide evidence to propose that the pbp1A mutants may have decreased PG cross-linking which might have helped in restoring the fitness by rectifying the disorganised PG synthesis caused by the absence of MreB. Overall this is an interesting study. 

      Queries: 

      Do the small cells of mreB null background indeed have no DNA? It is not apparent from the DAPI images presented in Supplementary Figure 17. A more detailed analysis will help to support this claim.

      It is entirely possible that small cells have no DNA, because if cell division is aberrant then division can occur prior to DNA segregation resulting in cells with no DNA. It is clear from microscopic observation that both small and large cells do not divide. It is, however, true, that we are unable to state – given our measures of DNA content – that small cells have no DNA. We have made this clear on page 13, paragraph 2.

      What happens to viability and cell morphology when pbp1A is removed in the mreB null background? If it is actually a decrease in pbp1A activity that leads to the rescue, then pbp1A- mreB- cells should have better viability, reduced cell volume and organised PG synthesis. Especially as the PG cross-linking is almost at the same level as the T362 or D484 mutant.  

      Please see fitness data in Supp. Fig. 13. Fitness of ∆mreBpbp1A is no different to that caused by a point mutation. Cells remain round.  

      What is the status of PG cross-linking in ΔmreB Δpflu4921-4925 (Line 7)? 

      This was not analysed as the focus of this experiment was PBPs. A priori, there is no obvious reason to suspect that ∆4921-25 (which lacks oprD) would be affected in PBP activity.

      What is the morphology of the cells in Line 2 and Line 5? It may be interesting to see if PG cross-linking and cell wall synthesis is also altered in the cells from these lines. 

      The focus of investigation was restricted to L1, L4 and L7. Indeed, it would be interesting to look at the mutants harbouring mutations in ftsZ, but this is beyond the scope of the present investigation (but is on-going). The morphology of L2 and L5 are shown in Supp. Fig. 9.

      The data presented in 4B should be quantified with appropriate input controls. 

      Band intensity has now been quantified (see new Supp. Fig. 20). The controls are SBW25, SBW25∆pbp1A, SBW25 ∆mreB and SBW25 ∆mreBpbp1A as explained in the paper.

      What are the statistical analyses used in 4A and what is the significance value? 

      Our oversight. These were reported in Supp. Fig. 19, but should also have been presented in Fig. 4A. Data are means of three biological replicates. The statistical tests are comparisons between each mutant and SBW25, and assessed by paired t-tests.  

      A more rigorous statistical analysis indicating the number of replicates should be done throughout. 

      We have checked and made additions where necessary and where previously lacking. In particular, details are provided in Fig. 1E, Fig. 4A and Fig. 4B. For Fig. 4C we have produced quantitative measures of heterogeneity in new cell wall insertion. These are reported in Supp. Fig. 21 (and referred to in the text and figure caption) and show that patterns of cell wall insertion in ∆mreB are highly heterogeneous.

      Reviewer #3 (Public Review): 

      This paper addresses an understudied problem in microbiology: the evolution of bacterial cell shape. Bacterial cells can take a range of forms, among the most common being rods and spheres. The consensus view is that rods are the ancestral form and spheres the derived form. The molecular machinery governing these different shapes is fairly well understood but the evolutionary drivers responsible for the transition between rods and spheres are not. Enter Yulo et al.'s work. The authors start by noting that deletion of a highly conserved gene called MreB in the Gram-negative bacterium Pseudomonas fluorescens reduces fitness but does not kill the cell (as happens in other species like E. coli and B. subtilis) and causes cells to become spherical rather than their normal rod shape. They then ask whether evolution for 1000 generations restores the rod shape of these cells when propagated in a rich, benign medium. 

      The answer is no. The evolved lineages recovered fitness by the end of the experiment, growing just as well as the unevolved rod-shaped ancestor, but remained spherical. The authors provide an impressively detailed investigation of the genetic and molecular changes that evolved. Their leading results are: 

      (1) The loss of fitness associated with MreB deletion causes high variation in cell volume among sibling cells after cell division.

      (2) Fitness recovery is largely driven by a single, loss-of-function point mutation that evolves within the first ~250 generations that reduces the variability in cell volume among siblings. 

      (3) The main route to restoring fitness and reducing variability involves loss of function mutations causing a reduction of TPase and peptidoglycan cross-linking, leading to a disorganized cell wall architecture characteristic of spherical cells. 

      The inferences made in this paper are on the whole well supported by the data. The authors provide a uniquely comprehensive account of how a key genetic change leads to gains in fitness and the spectrum of phenotypes that are impacted and provide insight into the molecular mechanisms underlying models of cell shape. 

      Suggested improvements and clarifications include: 

      (1) A schematic of the molecular interactions governing cell wall formation could be useful in the introduction to help orient readers less familiar with the current state of knowledge and key molecular players. 

      We understand that this would be desirable, but there are numerous recent reviews with detailed schematics that we think the interested reader would be better consulting. These are referenced in the text.

      (2) More detail on the bioinformatics approaches to assembling genomes and identifying the key compensatory mutations are needed, particularly in the methods section. This whole subject remains something of an art, with many different tools used. Specifying these tools, and the parameter settings used, will improve transparency and reproducibility, should it be needed.

      We overlooked providing this detail, which has now been corrected by provision of more information in the Materials and Methods. In short, we used breseq (clonal option, default parameters). Additional analyses were conducted using Geneious. The breseq output files, which include all read data, are provided at https://doi.org/10.17617/3.CU5SX1.

      (3) Corrections for multiple comparisons should be used and reported whenever more than one construct or strain is compared to the common ancestor, as in Supplementary Figure 19A (relative PG density of different constructs versus the SBW25 ancestor). 

      The data presented in Supp Fig 19A (and Fig 4A) do not involve multiple comparisons. In each instance the comparison is between SBW25 and each of the different mutants. A paired t-test is thus appropriate.

      (4) The authors refrain from making strong claims about the nature of selection on cell shape, perhaps because their main interest is the molecular mechanisms responsible. However, I think more can be said on the evolutionary side, along two lines. First, they have good evidence that cell volume is a trait under strong stabilizing selection, with cells of intermediate volume having the highest fitness. This is notable because there are rather few examples of stabilizing selection where the underlying mechanisms responsible are so well characterized. Second, this paper succeeds in providing an explanation for how spherical cells can readily evolve from a rod-shaped ancestor but leaves open how rods evolved in the first place. Can the authors speculate as to how the complex, coordinated system leading to rods first evolved? Or why not all cells have lost rod shape and become spherical, if it is so easy to achieve? These are important evolutionary questions that remain unaddressed. The manuscript could be improved by at least flagging these as unanswered questions deserving of further attention. 

      These are interesting points, but our capacity to comment is entirely speculative. Nonetheless, we have added an additional paragraph to the Discussion that expresses an opinion that has yet to receive attention:

      “Given the complexity of the cell wall synthesis machinery that defines rod-shape in bacteria, it is hard to imagine how rods could have evolved prior to cocci. However, the cylindrical shape offers a number of advantages. For a given biomass (or cell volume), shape determines the surface area of the cell envelope, with the smallest surface area being associated with the spherical shape. As shape sets the surface/volume ratio, it also determines the ratio between supply (proportional to the surface) and demand (proportional to cell volume). From this point of view, it is more efficient to be cylindrical (Young 2006). This also holds for surface attachment and biofilm formation (Young 2006). But above all, for growing cells, the ratio between supply and demand is constant in rod-shaped bacteria, whereas it decreases for cocci. This requires that spherical cells evolve complex regulatory networks capable of maintaining the correct concentration of cellular proteins despite changes in surface/volume ratio. From this point of view, rod-shaped bacteria offer opportunities to develop unsophisticated regulatory networks.”
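      The supply/demand argument in the quoted paragraph can be checked with elementary geometry. The sketch below is a minimal illustration under two simplifying assumptions introduced here (they are not the authors' model): a coccus is treated as a perfect sphere, and a rod as a cylinder of fixed radius whose end caps are ignored. Function names are illustrative.

```python
import math


def sphere_surface_to_volume(volume):
    """Surface/volume ratio of a sphere with the given volume (equals 3/R)."""
    radius = (3.0 * volume / (4.0 * math.pi)) ** (1.0 / 3.0)
    return 3.0 / radius


def cylinder_surface_to_volume(radius, length):
    """Lateral surface/volume ratio of a cylinder, end caps ignored (equals 2/r)."""
    lateral_area = 2.0 * math.pi * radius * length
    volume = math.pi * radius ** 2 * length
    return lateral_area / volume


# A sphere growing in volume: the ratio falls as the cell gets larger.
for volume in (1.0, 2.0, 4.0):
    print("sphere  ", volume, round(sphere_surface_to_volume(volume), 3))

# A cylinder elongating at fixed radius: the ratio stays the same.
for length in (2.0, 4.0, 8.0):
    print("cylinder", length, round(cylinder_surface_to_volume(0.5, length), 3))
```

      Under these assumptions the sphere's ratio scales as 3/R and so drops as the cell grows, while the elongating cylinder keeps a ratio of 2/r regardless of length. Including hemispherical end caps would make the rod's ratio approach 2/r from above rather than equal it exactly.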

      why not all cells have lost rod shape and become spherical.

      Please see Kevin Young’s 2006 review on the adaptive significance of cell shape.

      The value of this paper stems both from the insight it provides on the underlying molecular model for cell shape and from what it reveals about some key features of the evolutionary process. The paper, as it currently stands, provides more on which to chew for the molecular side than the evolutionary side. It provides valuable insights into the molecular architecture of how cells grow and what governs their shape. The evolutionary phenomena emphasized by the authors - the importance of loss-of-function mutations in driving rapid compensatory fitness gains and that multiple genetic and molecular routes to high fitness are often available, even in the relatively short time frame of a few hundred generations - are well-understood phenomena and so arguably of less broad interest. The more compelling evolutionary questions concern the nature and cause of stabilizing selection (in this case cell volume) and the evolution of complexity. The paper misses an opportunity to highlight the former and, while claiming to shed light on the latter, provides rather little useful insight.

      Thank you for these thoughts and comments. However, we disagree that the experimental results are an overlooked opportunity to discuss stabilising selection. Stabilising selection occurs when selection favours a particular phenotype causing a reduction in underpinning population-level genetic diversity. This is not happening when selection acts on SBW25 ∆mreB leading to a restoration of fitness. Driving the response are biophysical factors, primarily the critical need to balance elongation rate with rate of septation. This occurs without any change in underlying genetic diversity.  

      Recommendations for the authors:  

      Reviewer 1 (Recommendations for the Authors): 

      Hereby my suggestion for improvement of the quantification of the data, the figures, and the text. 

      -  p 14, what is the unit of elongation rate?  

      At first mention we have made clear that the unit is given in minutes^-1

      -  p 14, please give an error bar for both p=0.85 and f=0.77, to be able to conclude they are different 

      Error on the probability p is estimated at the 95% confidence interval by the formula 1.96 × sqrt(p(1 − p)/N), where N is the total number of cells. This has been added in the paragraph on p (“probability”) of the Image Analysis section in the Material and Methods.
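      As a minimal sketch of how such an error is computed, assuming the estimate meant here is the usual normal-approximation (Wald) interval for a proportion. The function name and the example numbers are illustrative, not taken from the paper.

```python
import math

Z_95 = 1.96  # standard normal quantile for a two-sided 95% interval


def proportion_error_95(p, n_cells):
    """Half-width of the 95% Wald interval for a proportion p from n_cells."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    return Z_95 * math.sqrt(p * (1.0 - p) / n_cells)


# Illustrative numbers only: the error shrinks as more cells are counted.
for n in (100, 400, 1600):
    print(n, round(proportion_error_95(0.85, n), 4))
```

      Two proportions can be considered distinguishable when their intervals (p ± error) do not overlap. The Wald form is known to be unreliable for small N or for p close to 0 or 1.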

      We also added errors on p measurement in the main text.

      -  p 14, all the % differences need an errorbar 

      The error bars and means are given in Fig 3C and 3D.

      -  Figure 1B adds units to compactness, and what does it represent? Is the cell size the estimated volume (that is mentioned in the caption)? Shouldn't the datapoints have error bars? 

      Compactness is defined in the “Image Analysis” section of the Material and Methods. It is a dimensionless parameter. The distribution of individual cell shapes / sizes are depicted in Fig 1B. Error does arise from segmentation, but the degree of variance (few pixels) is much smaller than the representations of individual cells shown.

      -  Figure 1C caption, are the 50.000 cells? 

      Correct. Figure caption has been altered.

      -  Figure 1D, first the elongation rate is described as a volume per minute, but now, looking at the units it is a rate, how is it normalized? 

      Elongation rate is explained in the Materials and Methods (see the image analysis section) and is not volume per minute. It is dV/dt = r*V (the unit of r is min^-1). Page 9 includes specific mention of the unit of r.
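      The relation dV/dt = r*V implies exponential growth, V(t) = V0 * exp(r*t), so r can be read off as the slope of ln(V) against time. The sketch below is a minimal illustration with synthetic data; the least-squares fit and the function name are assumptions made here and do not describe the authors' image-analysis pipeline.

```python
import math


def fit_elongation_rate(times_min, volumes):
    """Least-squares slope of ln(volume) versus time, in min^-1."""
    if len(times_min) != len(volumes) or len(times_min) < 2:
        raise ValueError("need at least two matching (time, volume) points")
    log_volumes = [math.log(v) for v in volumes]
    n = len(times_min)
    mean_t = sum(times_min) / n
    mean_y = sum(log_volumes) / n
    s_tt = sum((t - mean_t) ** 2 for t in times_min)
    s_ty = sum((t - mean_t) * (y - mean_y)
               for t, y in zip(times_min, log_volumes))
    return s_ty / s_tt


# Synthetic single-cell trace: V0 = 3.0 (arbitrary units), r = 0.02 min^-1.
times = [0.0, 5.0, 10.0, 15.0, 20.0]
volumes = [3.0 * math.exp(0.02 * t) for t in times]
print(fit_elongation_rate(times, volumes))
```

      Because r multiplies V, its unit is min^-1 whatever unit the volume is measured in, which is why the rate is not a volume per minute.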

      -  Figure 1E, how many cells (n) per replicate? 

      Our apologies. We have corrected the figure caption that now reads:

      “Proportion of live cells in ancestral SBW25 (black bar) and ΔmreB (grey bar) based on LIVE/DEAD BacLight Bacterial Viability Kit protocol. Cells were pelleted at 2,000 x g for 2 minutes to preserve ΔmreB cell integrity. Error bars are means and standard deviation of three biological replicates (n>100).”

      -  Figure 1G, how does this compare to the wildtype 

      The volume for wild type SBW25 is 3.27µm^3 (within the “white zone”). This is mentioned in the text.

      -  Figure 2B, is this really volume, not size? And can you add microscopy images? 

      The x-axis is volume (see Materials and Methods, subsection image analysis). Images are available in Supp. Fig. 9.

      -  Figure 3A what does L1, L4 and L7 refer to? Is it correct that these same lines are picked for WT and delta_mreB 

      Thank you for pointing this out. This was an earlier nomenclature. It was shorthand for the mutants that are specified everywhere else by genotype and has now been corrected. 

      -  Figure 3c: either way write out p, so which probability, or you need a simple cartoon that is plotted. 

      The value p is the probability to proceed to the next generation and is explained in Materials and Methods  subsection image analysis.  We feel this is intuitive and does not require a cartoon. We nonetheless added a sentence to the Materials and Methods to aid clarity.

      -  Figure 4B can you add a ladder to the gel? 

      No ladder was included, but the controls provide all the necessary information. The band corresponding to PBP1A is defined by presence in SBW25, but absence in SBW25 ∆pbp1A.

      -  Figure 4c, can you improve the quantification of these images? How were these selected and how well do they represent the community? 

      We apologise for the lack of quantitative description for data presented in Fig 4C. This has now been corrected. In brief, we measured the intensity of fluorescent signal from between 10 and 14 cells and computed the mean and standard deviation of pixel intensity for each cell. To rule out possible artifacts associated with variation of the mean intensity, we calculated the ratio of the standard deviation divided by the square root of the mean. These data reveal heterogeneity in cell wall synthesis and provide strong statistical support for the claim that cell wall synthesis in ∆mreB is significantly more heterogeneous than the control. The data are provided in new Supp. Fig. 21. 
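      The heterogeneity measure described above (per-cell standard deviation of pixel intensity divided by the square root of the mean) can be sketched as follows. This is a minimal illustration: the pixel values are made up, the function name is illustrative, and the use of the population standard deviation is an assumption, since the response does not say which estimator was used.

```python
import math
import statistics


def heterogeneity_index(pixel_intensities):
    """Standard deviation of pixel intensity divided by sqrt(mean intensity)."""
    mean = statistics.fmean(pixel_intensities)
    if mean <= 0:
        raise ValueError("mean intensity must be positive")
    # Assumption: population standard deviation over the cell's pixels.
    return statistics.pstdev(pixel_intensities) / math.sqrt(mean)


# Made-up pixel values for two cells with the same mean intensity.
uniform_cell = [100.0, 101.0, 99.0, 100.0, 100.0]
patchy_cell = [20.0, 180.0, 40.0, 160.0, 100.0]

print(heterogeneity_index(uniform_cell))
print(heterogeneity_index(patchy_cell))
```

      Dividing by the square root of the mean discounts the spread expected from brightness alone if the noise is roughly Poisson-like, so a brighter cell is not scored as more heterogeneous merely for being brighter.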

      Minor comments: 

      -  It would be interesting if the findings of this experimental evolution study could be related to comparative studies (if these have ever been executed).  

      Little is possible, but Hendrickson and Yulo published a portion of the originally posted preprint separately. We include a citation to that paper. 

      -  p 13, halfway through the page, the second paragraph lacks a conclusion, why do we care about DNA content? 

      It is a minor observation that was included by way of providing a complete description of cell phenotype.  

      -  p 17, "suggesting that ... loss-of-function", I do not understand what this is based upon. 

      We show that the fitness of a pbp1A deletion is indistinguishable from the fitness of one of the pbp1A point mutants. This fact establishes that the point mutation had the same effects as a gene deletion thus supporting the claim that the point mutations identified during the course of the selection experiment decrease (or destroy) PBP1A function.

      -  p 25, at the top of the page: do you have a reference for the statement that a disorganized cell wall architecture is suited to the topology of spherical cells? 

      The statement is a conclusion that comes from our reasoning. It stems from the fact that it is impossible to entirely map the surface of a sphere with parallel strands.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, Basha and colleagues aim to test whether the thalamic nucleus reuniens can facilitate the hippocampus/prefrontal cortex coupling during sleep. Considering the importance of sleep in memory consolidation, this study is important to understand the functional interaction between these three majorly involved regions. This work suggests that the thalamic nucleus reuniens has a functional role in synchronizing the hippocampus and prefrontal cortex.

      Strengths:

      The authors performed recordings in naturally sleeping cats, and analysed the correlation between the main slow wave sleep oscillatory hallmarks: slow waves, spindles, and hippocampal ripples, and with reuniens' neurons firing. They also associated intracellular recordings to assess the reuniens-prefrontal connectivity, and computational models of large networks in which they determined that the coupling of oscillations is modulated by the strength of hippocampal-thalamic connections.

      Thank you for your positive evaluation.

      Weaknesses:

      The authors' main claim is made on slow waves and spindle coupling, which are recorded both in the prefrontal cortex and surprisingly in reuniens. Known to be generated in the cortex by cortico-thalamic mechanisms, the slow waves and spindles recorded in reuniens show no evidence of local generation in the reuniens, which is not anatomically equipped to generate such activities. Until shown differently, these oscillations recorded in reuniens are most likely volume-conducted from nearby cortices. Therefore, such a caveat is a major obstacle to analysing their correlation (in time or frequency domains) with oscillations in other regions.

      (1) We fully agree with the reviewer that reuniens likely generates neither slow waves nor spindles. We make no such claim, as we clearly stated in the discussion (lines 319-324). We propose that reuniens neurons mediate different forms of activity. In the model, we introduced the MD nucleus only because without MD we were unable to generate spindles. While the slow waves and spindles are generated in other thalamocortical regions, the REU neurons show these rhythms due to long-range projections from these regions to REU, as has been shown in the model.

      (2) Definitely, we cannot exclude some influence of volume conductance on the LFP recordings obtained in the REU nucleus. However, we show modulation of spiking activity within REU by spindles. Spike modulation cannot be explained by volume conductance but can be explained by either synaptic drive (likely the case here) or some intrinsic neuronal processes (like T-current).

      (3) In our REU recordings, we used tetrodes for spike identification. If slow waves and spindles were volume conducted, then slow waves and spindles recorded with tetrodes should have identical shapes. Following the reviewer's comment, we took these recordings and subtracted one channel from another. The difference in signal during slow waves is on the order of 0.1 mV. Considering that the distance between electrodes is on the order of 20 µm, such a difference in voltage is major and can only be explained by local extracellular currents, likely due to synaptic activities originating in afferent structures.

      Finally, the choice of the animal model (cats) is not the best suited one, as too few data, particularly anatomical ones regarding reuniens connectivity, are available to support functional results.

      (1) The thalamus of the majority of mammals (definitely primates and carnivores, including cats) contains local circuit interneurons (about 30% of all neurons). The vast majority of studies in rodents (except for the LGN nucleus) report either an absence or an extremely low number of thalamic interneurons (e.g., Jager P, Moore G, Calpin P, et al. Dual midbrain and forebrain origins of thalamic inhibitory interneurons. eLife. 2021; 10: e59272). Therefore, studies on species other than rodents are necessary and bring new information that is impossible to obtain in rodents.

      (2) The cat brain is much larger than the brain of mice or rats; therefore, the effects of volume conductance from cortex to REU are much smaller, if not negligible. The distance between REU and the closest cortical structure (ectosylvian gyrus) in cats is about 15 mm.

      (3) Indeed, there is much less anatomical data on cats than on rodents. This is why we performed the experiments shown in figure 1. This figure contains functional anatomy data. Antidromic responses show that the recorded structure projects to the stimulated structure. Orthodromic responses show that the stimulated structure projects to the recorded structure.

      Reviewer #2 (Public Review):

      Summary:

      The interplay between the medial prefrontal cortex and ventral hippocampal system is critical for many cognitive processes, including memory and its consolidation over time. A prominent idea in recent research is that this relationship is mediated at least in part by the midline nucleus reuniens with respect to consolidation in particular. Whereas the bulk of evidence has focused on neuroanatomy and the effects of temporary or permanent lesions of the nucleus reuniens, the current work examined the electrophysiology of these three structures and how they inter-relate, especially during sleep, which is anticipated to be critical for consolidation. They provide evidence from intracellular recordings of the bi-directional functional connectivity among these structures. There is an emphasis on the interactions between these regions during sleep, especially slow-wave sleep. They provide evidence, in cats, that cortical slow waves precede reuniens slow waves and hippocampal sharp-wave ripples, which may reflect prefrontal control of the timing of thalamic and hippocampal events. They also find evidence that hippocampal sharp wave ripples trigger thalamic firing and precede the onset of reuniens and medial prefrontal cortex spindles. The authors suggest that the effectiveness of bidirectional connections between the reuniens and the (ventral) CA1 is particularly strong during non-rapid eye movement sleep in the cat. This is a very interesting, complex study on a highly topical subject.

      Strengths:

      An excellent array of different electrophysiological techniques and analyses are conducted. The temporal relationships described are novel findings that suggest mechanisms behind the interactions between the key regions of interest. These may be of value for future experimental studies to test more directly their association with memory consolidation.

      We thank this reviewer for the very positive evaluation of our study.

      Weaknesses:

      Given the complexity and number of findings provided, clearer explanation(s) and organisation that directed the specific value and importance of different findings would improve the paper. Most readers may then find it easier to follow the specific relevance of key approaches and findings and their emphasis. For example, the fact that bidirectional connections exist in the model system is not new per se. How and why the specific findings add to existing literature would have more impact if this information was addressed more directly in the written text and in the figure legends.

      Thank you for this comment. In the revised version, we will do our best to simplify presentation and more clearly explain our findings.

      Reviewing Editor (Recommendations for Authors):

      Please discuss the ability of reuniens to generate spindles?

      We briefly discussed this in the previous version. We have now extended the discussion (p. 18).

      For population data, how many cats were used in acute and chronic experiments, where does the population data originate in Fig. 2? How repeatable were the findings across animals? Was histology verified in each animal?

      As previously stated at the beginning of the method section, we used a total of 20 cats: 16 anesthetized (acute) and 4 non-anesthetized (chronic). We added the number of cats in the appropriate places in the result section. Population data in figure 2 come from 48, 49 or 52 recording sessions (depending on the type of analysis, as indicated in the figure legend) from 4 chronic cats; we clarified this information in the legend. Results were highly repeatable across animals. Histology was verified in all chronic and acute animals; we added a sentence to this effect in the method section.

      Explanation of figures is very poor, values in figures should be reported in results so they can be compared in the context of the description.

      In this revised version, we report most of the numbers shown in the figures and their legends in the main text (result section).

      The depth of the recording tungsten electrodes are meaningless without the AP and ML coordinates given how heterogenous mPFC is. What is the ventromedial wall of the mPFC in the cat?

      We added the ML and AP coordinates in the method section. We replaced "ventromedial wall" with "ventroposterior part of the mPFC".

      What are the two vertical lines in 1F?

      This was an error while preparing the figure. The panel was corrected.

      Line 90: mean ± SD of what? There are no numbers.

      Thanks, we now indicate the values.

      Panel 2L does not show increased spindling in reuniens prior to PFC as indicated in the results, please explain. It does show SWR in the hippocampus prior to spindles, what is the meaning of such a time relationship?

      Panel 2L did show increased spindling in reuniens prior to mPFC, but indeed, at the time scale shown, it was not very clear. In this revised manuscript, we added an inset zooming in around time zero to make this point clearer.

      Panel 2L indeed shows an increase in SWRs prior to the increase in spindles in both Reuniens and mPFC.

      As stated in the discussion, ‘We found that hippocampal SWRs trigger thalamic firing and precede the onset of reuniens and mPFC spindles, which points to SWRs as one of candidate events for spindle initiation.’

      It is unclear what the slow waves of PFC mean. These represent filtered PFC LFP, but is this a particular oscillation? They continue to occur during the spindle, while the slow waves supposedly trigger the spindle. Please explain and clarify.

      We recently published a review article, involving several scientists studying both human and animal sleep, that includes Box 1 (Timofeev I, Schoch S, LeBourgeois M, Huber R, Riedner B, Kurth S. Spatio-temporal properties of sleep slow waves and implications for development. Current Opinion in Physiology. 2020; 15: 172–182). In this box, among other terms, we provide the current definitions of slow waves vs slow oscillation. Briefly, if slow waves are repeated with a given rhythm, they typically form a slow oscillation. However, if they occur in isolation or are not rhythmic, they remain slow waves but cannot be called a slow oscillation.

      Regarding the relation between spindles and slow oscillation, we are currently systematically analyzing data on spindles and slow waves obtained from head-restrained and freely behaving cats. One of the main findings is that a majority of ‘cortical’ spindles are local, to the extent that spindles can occur in alternation in two neighboring cortical cells. Largely, LFP sleep spindles occur more or less synchronously within the suprasylvian gyrus of cats, where indeed a large majority of them were triggered by slow waves. The synchrony between LFP spindles in suprasylvian vs other cortical areas is much less clear. So, it is not surprising that spindles in one brain region can occur when there is a slow wave present in some other brain region. Something similar was also shown in humans (Mölle M, Bergmann TO, Marshall L, Born J. Fast and slow spindles during the sleep slow oscillation: disparate coalescence and engagement in memory processing. Sleep. 2011; 34 (10): 1411-1421).

      In this regard, we are not ready to include modifications in the manuscript.

      Line 134, where is spindle amplitude shown? Plots report power within the spindle frequency band, which obviously captures more than just spindles.

      No, plots of figure 3 B, C show the phase-amplitude coupling (PAC) strength. These were calculated with detected spindles, therefore, while we cannot exclude some false spindle detections, we are confident that the false spindle detections are at a negligible level. We modified text and instead of spindle amplitude, we describe SW-spindle amplitude coupling. This reflects our analysis with exactitude.
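
      As an illustration of what a phase-amplitude coupling (PAC) strength can measure, the sketch below computes the mean vector length, one widely used PAC estimator. The response does not state which estimator was used, so this choice, the function name `mean_vector_length`, and the synthetic phase and amplitude series are all assumptions made for illustration only.

```python
import cmath
import math


def mean_vector_length(phases, amplitudes):
    """Return |mean(A * exp(i * phi))| for paired phase/amplitude samples.

    The value grows when the amplitude is concentrated at one phase of the
    slow signal and approaches zero when amplitude is independent of phase.
    """
    if not phases or len(phases) != len(amplitudes):
        raise ValueError("phases and amplitudes must be non-empty and equal in length")
    total = sum(a * cmath.exp(1j * p) for p, a in zip(phases, amplitudes))
    return abs(total) / len(phases)


# Hypothetical data: one full slow-wave cycle sampled at 1000 evenly spaced phases.
n = 1000
phases = [2 * math.pi * k / n - math.pi for k in range(n)]

# Coupled case: the (assumed) spindle envelope is largest at phase 0.
coupled_amplitude = [1.0 + math.cos(p) for p in phases]
# Uncoupled case: the envelope is flat across all phases.
uncoupled_amplitude = [1.0 for _ in phases]

mvl_coupled = mean_vector_length(phases, coupled_amplitude)
mvl_uncoupled = mean_vector_length(phases, uncoupled_amplitude)
```

      In real recordings, the phase and envelope series would first have to be extracted from band-limited signals, and the raw value is usually compared against surrogate data; neither step is shown here.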

      The discussion must include the mediodorsal nucleus, which is the largest thalamic input to the prefrontal cortex and also receives input from the hippocampus. In particular, the case must be made for why reuniens would play a more important or different role than MD. (For example: Occurrence of Hippocampal Ripples is Associated with Activity Suppression in the Mediodorsal Thalamic Nucleus.)

      We cited the suggested study. We cannot say whether reuniens plays a more or less important role. What is clear is that hippocampal ripples at the onset of spindles trigger increased firing in both MD and reuniens. Our extracellular recordings (Fig. 4, K) suggest that the increased firing is associated with spike-bursts. We also have a parallel unpublished study done on anesthetized mice showing SWR-triggered inhibitory potentials in both reuniens and MD that reverse at around -65 to -70 mV. Because the majority of SWRs occurred at the onset of the cortical up state, the relative roles of cortico-thalamic vs hippocampo-thalamic drive are not easy to separate. We hope to do this convincingly in our forthcoming study, with the limitation that it was done on anesthetized mice.

      Reviewer #1 (Recommendations For The Authors):

      I strongly encourage the authors to perform current source density analyses on the LFP signals recorded in the nucleus reuniens to make sure that the observed oscillations are indeed locally generated. So far, the anatomical organisation in reuniens cannot support the local generation of oscillations, such as spindles and slow wave. At least in rodents (the cat reuniens does not seem too different, until shown differently), there were no oscillators found in reuniens, and at least not arranged like in cortical areas, allowing the summation in time, and particularly space, of rhythmic input currents. Bipolar recordings with pairs of twisted electrodes might also be useful to assess the local existence of spindles and slow waves.

      Current source density calculation is possible when one knows the exact distance between recording sites. As we used tetrodes made with 4 twisted platinum-iridium wires, we know more or less the range of distance between recording sites, but not the exact distance between any given pair of electrodes.

      Moreover, the physical distance between the reuniens and any cortical structure is about 8-9 mm. With such distances, volume conductance is expected to be negligible. If slow waves and spindles were volume conducted, then slow waves and spindles recorded with tetrodes should have identical shapes. Following the reviewer's comment, we took these recordings and subtracted one channel from another. The difference in signal during slow waves is on the order of 0.1 mV. Considering that the distance between electrodes is on the order of 20 µm, such a difference in voltage is major and can only be explained by local extracellular currents, likely due to synaptic activities originating in afferent structures.

      Below, we plotted the voltage of one channel of the tetrode versus another channel of the same tetrode. If the signal was simply volume conducted, one would expect to see the vast majority of points on the x=y line (red).

      Author response image 1.

      Below is a segment of mPFC LFP recording (upper black trace), the mPFC LFP filtered for spindle frequency (7-15 Hz), and the detected spindles (black lines above the filtered trace). Two LFP traces from a tetrode in the Reuniens (orange and light blue) are then overlaid. The second trace from the bottom (blue) represents the subtraction of the Reuniens 2 channel from the Reuniens 1 channel, and just below it (lower blue trace) is this subtraction trace filtered for spindle frequency (7-15 Hz), showing a clear voltage difference in the spindle range between the two electrodes. Note also that around time 179-179.5 s, there is a clear spindle oscillation in the mPFC recording which is not present in the Reuniens recordings.

      Author response image 2.

      Therefore, we are convinced that in our recordings, volume conductance did not play any significant role.
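
      The logic of the channel-subtraction test can be sketched with synthetic data. The example below is only an illustration under stated assumptions: the sampling rate, frequencies, amplitudes, and the helper names `simulate_tetrode_pair`, `differential` and `rms` are hypothetical and are not taken from the recordings. It shows that a purely shared (volume-conducted) signal cancels when one channel is subtracted from the other, whereas channel-specific local components leave a residual.

```python
import math


def rms(values):
    """Root-mean-square of a sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def simulate_tetrode_pair(n_samples, local_amplitude_mv, fs=1000.0):
    """Return two hypothetical channels (in mV) that share one common signal.

    The common 1 Hz component stands in for a distant, volume-conducted
    source. Each channel also carries its own 10 Hz component scaled by
    `local_amplitude_mv`, phase-shifted between channels, standing in for
    local currents that differ between nearby electrodes.
    """
    ch1, ch2 = [], []
    for i in range(n_samples):
        t = i / fs
        common = 0.5 * math.sin(2 * math.pi * 1.0 * t)
        local1 = local_amplitude_mv * math.sin(2 * math.pi * 10.0 * t)
        local2 = local_amplitude_mv * math.sin(2 * math.pi * 10.0 * t + math.pi / 2)
        ch1.append(common + local1)
        ch2.append(common + local2)
    return ch1, ch2


def differential(ch_a, ch_b):
    """Sample-by-sample subtraction of one channel from another."""
    return [a - b for a, b in zip(ch_a, ch_b)]


# Case 1: only the shared signal is present, so the difference vanishes.
shared_a, shared_b = simulate_tetrode_pair(2000, local_amplitude_mv=0.0)
rms_shared_only = rms(differential(shared_a, shared_b))

# Case 2: each channel also sees its own local component, leaving a residual.
local_a, local_b = simulate_tetrode_pair(2000, local_amplitude_mv=0.1)
rms_with_local = rms(differential(local_a, local_b))
```

      In this toy setting, the residual in the second case reflects only the channel-specific components, which mirrors the reasoning behind plotting one tetrode channel against another and looking for departures from the identity line.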

      Another concern regarding delays between events, like slow waves, measured between two regions (as exemplified by Figure 3). It appears that the delays were calculated from the filtered signal. Figure 3G shows a delay between the peak of the mPFC slow wave between the raw and the filtered signal, which might be artifactual of the processing. It is though not (or less) visible for the reuniens recording. Such mismatch might explain the observed differences in delays.

      Thanks for this comment. We recomputed the analysis using the original signal (smoothed) and obtained very similar results. Panels H and I of figure 3 were updated using the new analysis performed on original signal.
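
      To illustrate why a filtered trace can shift the apparent time of a peak, the sketch below compares a causal moving average with a symmetric (centered) one on a synthetic triangular wave. This is a generic illustration, not the processing used in the manuscript: the filter type, the window width, and the function names are assumptions.

```python
def causal_moving_average(signal, width):
    """Average the current sample with the previous `width - 1` samples."""
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - width + 1): i + 1]
        out.append(sum(window) / len(window))
    return out


def centered_moving_average(signal, width):
    """Average over a window placed symmetrically around each sample (odd width)."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def peak_index(signal):
    """Index of the largest sample."""
    return max(range(len(signal)), key=signal.__getitem__)


# Hypothetical slow-wave-like event: a triangular wave peaking at sample 100.
raw = [100 - abs(i - 100) for i in range(201)]
width = 21

raw_peak = peak_index(raw)
causal_peak = peak_index(causal_moving_average(raw, width))
centered_peak = peak_index(centered_moving_average(raw, width))

# A causal window of this width delays the peak by (width - 1) / 2 samples,
# whereas the symmetric window leaves the peak time unchanged.
causal_shift = causal_peak - raw_peak
```

      If the delay introduced by processing differs between two recording sites, for example because the waveforms differ in shape, measured inter-regional delays could be biased, which is why checking the result on the original signal is a useful control.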

      The overall analyses of LFP-triggered reuniens MUA activity lack statistics (at least z-scored firing to normalise the firing rates).

      Fig. 2 H and I are representative examples of histograms; statistical data are shown in circular plots, as explained in the legend. Fig. 2 L shows population data, and we now provide the standard error. Fig. 4 C and D show individual examples. Fig. 4 E shows histograms of the activity of all identified putative single units. Units that show significant modulation are displayed above the white line. Fig. 4 F shows population data for significantly modulated units.
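
      The z-score normalisation suggested by the reviewer can be sketched as follows. This is a minimal illustration with made-up spike counts; the choice of baseline window and the function name `zscore_against_baseline` are assumptions and do not describe the analysis in the manuscript.

```python
import statistics


def zscore_against_baseline(counts, baseline_bins):
    """Express each histogram bin in units of baseline standard deviations.

    `counts` is an event-triggered histogram of spike counts and
    `baseline_bins` is the number of leading bins treated as baseline.
    """
    baseline = counts[:baseline_bins]
    mu = statistics.fmean(baseline)
    sigma = statistics.pstdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has zero variance; z-score is undefined")
    return [(c - mu) / sigma for c in counts]


# Hypothetical peri-event histogram: six baseline bins, then a transient increase.
spike_counts = [4, 6, 5, 5, 4, 6, 12, 15, 7, 5]
z_scores = zscore_against_baseline(spike_counts, baseline_bins=6)

# Bins exceeding a chosen threshold (here 3 baseline SDs) are flagged as modulated.
modulated_bins = [i for i, z in enumerate(z_scores) if z > 3]
```

      Normalising in this way puts units with very different firing rates on a common scale before averaging across the population.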

      A last point of detail in the model, which surprisingly shows reuniens to excitatory hippocampal cells' connectivity. Recent literature reports that reuniens only connect hippocampal interneurons, and not principal cells (at least in rodents, I could not find any report in cats). I wonder how changing this parameter would affect the results of the computational investigation, particularly the results shown in Figure 6.

      There are several studies in the literature showing a direct excitation from the Reuniens to pyramidal cells in the CA1, here are three of them:

      Goswamee, P., et al. (2021). "Nucleus Reuniens Afferents in Hippocampus Modulate CA1 Network Function via Monosynaptic Excitation and Polysynaptic Inhibition." Frontiers in Cellular Neuroscience 15.

      Dolleman-Van der Weel MJ, Lopes da Silva FH, Witter MP (1997) Nucleus Reuniens Thalami Modulates Activity in Hippocampal Field CA1 through Excitatory and Inhibitory Mechanisms. The Journal of Neuroscience 17:5640.

      Dolleman-van der Weel MJ, Lopes da Silva FH, Witter MP (2017) Interaction of nucleus reuniens and entorhinal cortex projections in hippocampal field CA1 of the rat. Brain Structure and Function 222:2421-2438.

      Because this is not a review paper, we opted to not cite all the papers describing connectivity between mPFC, hippocampus and thalamus.

      Reviewer #2 (Recommendations For The Authors):

      I respectfully suggest that the earlier (public) comments listed above should be addressed. In addition, it would be useful to make it clearer when non-rapid eye movement sleep was being addressed and when rapid eye movement sleep was being addressed. Is it of value to use a single term instead of adding "slow wave sleep", or else clarify when either term is used? The addition of more subheadings might help. Moreover, the relative contribution/value of evidence from these two sleep states was not addressed or was not very clear.

      We tried to make it clearer when NREM and when REM was analysed.

      We replaced slow-wave sleep with NREM sleep in the figure 5 title.

      We added several subheadings in the discussion.

      Regarding the comment that the relative contribution of NREM vs REM sleep was not addressed: we are sorry, but we do not clearly understand the question. Figs. 2 and 3 deal mainly with NREM sleep (Fig. 2B has an example of REM sleep). Fig. 4 essentially describes results obtained during REM sleep.

      I was not sure if the Abstract summarised the key take-home messages from the large amount of evidence provided. Some choices are needed, of course, but "evidence of bidirectional connectivity" struck me as less novel than other evidence provided. Given the huge amount of findings provided, which is commendable, it is still useful to present it perhaps in a more digestible fashion. For example, the headings or the first sentence(s) below headings could indicate the aim or the outcome of the specific method/analysis/findings.

      We rewrote the abstract, and we also added a conclusion to highlight the major findings and their meaning.

      It is more common to use NRe or Re, rather than REU.

      We avoided using RE because, for decades, we have used RE to abbreviate the thalamic reticular nucleus in several publications. In this revised version, we spell it out in full: Reuniens.

      Line 49 mentions "short-term" memory. Please specify this more clearly as it is otherwise ambiguous. Also, line 303.

      We rephrased the sentence: In particular, the hierarchical coupling of slow waves, spindles and SWRs is thought to play a key role in memory consolidation.

      Line 303 was likely about the ventromedial wall: we corrected that sentence.

      Line 62: the word, "required" (for memory function) is too strong because there is evidence that it is not always required.

      We changed the wording to "plays a major role".

      The focus within the medial prefrontal cortex could be specified more clearly / earlier.

      The mPFC is mentioned in the second sentence of the abstract and in the first sentence of the introduction.

      Line 134: The heading states "determine" and then mentions modulation. These terms may not be interchangeable or they need clarification.

      We changed it to slow wave-spindle amplitude coupling. This represents exactly our analysis.

      Line 204: Does "cortical network" mean "prefrontal cortex network"?

      Yes, as described in lines 192-193, the two cortical networks (N1 and N2) of the model represent the mPFC layer 5 and 6 respectively.

      Lines 283 to 289: These were not very clear to me.

      These lines described the potential mechanisms for the responses to hippocampal and reuniens stimulation recorded intracellularly (results in figure 1). We modified this paragraph for clarity.

      Line 296: Specify the "claim".

      We modified the sentence to “[…] provides supporting evidence for this claim that nucleus Reuniens might synchronize the activity of ventral hippocampus and mPFC.”

      The discussion naturally focuses on the thalamic nucleus reuniens, but also occasionally mentions the thalamic mediodorsal nucleus. The distinction, assuming this is highly relevant, could be expressed more clearly (direct comparison with their previous papers).

      We never published a study on the mediodorsal nucleus. We do have some unpublished results from recordings in the MD nucleus and they reveal the presence of an inhibitory component at the beginning of cortical active states, therefore behaving in a similar way to first order nuclei. It is then possible that spindles recorded in the reuniens are actually generated in the MD nucleus and then transmitted to Reuniens through the thalamic reticular nucleus, as both MD and reuniens are connected to the rostral thalamic reticular nucleus. We added some discussion about this.

      Figure 1B: Do the authors have any additional evidence of the placements in the reuniens? The photo provided suggests a large area beyond the reuniens boundary. Also, please confirm whether the CEM lies between Rh and Re in the cat (I think the Rh and Re are adjacent in the rat).

      Figure 1B is from an electrolytic lesion, which is necessarily bigger than the tip of the electrode. Therefore, the center of the electrolytic lesion indicates where the electrode tip was located, which is well within the reuniens nucleus.

      Also, yes, CE (Nucleus centralis thalami, pars medialis) is located between the reuniens and rhomboid nuclei in cats. This can be found in two cat atlases:

      Reinoso-Suárez, F. (1961). Topographischer Hirnatlas der Katze für experimental-physiologische Untersuchungen (Merck).

      Berman AL, Jones EG (1982) The Thalamus and Basal Telencephalon of the Cat: A Cytoarchitectonic Atlas with Stereotaxic Coordinates: University of Wisconsin Press.

      The first mention of hippocampus in the figure legends should remind the reader by stating "ventral hippocampus".

      In this revised version, we added “ventral” in several instances, both in the main text and in the figure legends.

      Figure 2: It seems unusual to mention "unusually short NREM". Presumably, things are the same otherwise - if so, perhaps mention that, especially if some of the effects reflect an "unusual" episode.

      We display this particular segment because we want to show a continuous recording in which the individual elements characterizing specific states are still visible.

      Some effects look like they are strong and others perhaps weaker. If so, how do these impact the final conclusions?

      Sorry, we did not clearly understand what the reviewer means here. In general, if an effect shows a statistically significant difference (at the conventional 0.05 threshold), we consider it significant. Any other cases are described on an individual basis.

      Perhaps "MAD" should be in full on the first occasion, if not already.

      It was spelled out at line 659, but we now spell it out also in the results section and in figure 2 legend.

      Methods: the key question is the use of rodent recordings to classify cat recordings. It would be good to have a reference indicating that this can be directly used for cats, which may have different sleep cycles and patterns compared to rats.

      We did not use rodent recordings to classify cat recordings; however, we did use a state detection script that was developed with rodent recordings. As mentioned in the method section, we adapted the script to cat mPFC recordings, and manual corrections were then made to correctly detect REM episodes. Respectfully, our lab has investigated sleep-wake states in non-anesthetized animals for a few decades; we have developed state detection algorithms in mice, cats, and marmosets when needed (to analyse months of recordings), and we have extensive expertise in identifying states of vigilance from electrophysiological recordings.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Weaknesses:

      INTRODUCTION & THEORY

      (1) Can the authors please clarify why the first trial of extinction in a standard protocol does NOT produce the retrieval-extinction effect? Particularly as the results section states: "Importantly, such a short-term effect is also retrieval dependent, suggesting the labile state of memory is necessary for the short-term memory update to take effect (Fig. 1e)." The importance of this point comes through at several places in the paper:

      1A. "In the current study, fear recovery was tested 30 minutes after extinction training, whereas the effect of memory reconsolidation was generally evident only several hours later and possibly with the help of sleep, leaving open the possibility of a different cognitive mechanism for the short-term fear dementia related to the retrieval-extinction procedure." ***What does this mean? The two groups in study 1 experienced a different interval between the first and second CS extinction trials; and the results varied with this interval: a longer interval (10 min) ultimately resulted in less reinstatement of fear than a shorter interval. Even if the different pattern of results in these two groups was shown/known to imply two different processes, there is absolutely no reason to reference any sort of cognitive mechanism or dementia - that is quite far removed from the details of the present study.

      Indeed, the only difference between the standard extinction paradigm and the retrieval-extinction paradigm is the interval between the first and second CS extinction trials. It has been shown before that a second CS+ presented 1 hour after the initial retrieval CS+ resulted in the dephosphorylation of GluR1 in rats, which was indicative of memory destabilization. The second CS+ presented only 3 minutes after the initial retrieval CS+, as in the standard extinction training, did not cause the GluR1 dephosphorylation effect (Monfils et al., 2009). Therefore, an isolated presentation of the CS+ seems to be important in preventing the return of fear expression. Behaviorally, when the CSs were presented in a more temporally spaced (vs. massed) or a more gradual manner in the extinction training, the fear amnesia effects were more salient (Cain et al., 2003, Gershman et al., 2013). It has also been suggested that the old memory can be successfully modified only when the old memory and the new experience (through extinction) can be inferred to have been generated from the same underlying latent cause (Gershman et al., 2017). On the other hand, if the new experiences are believed to be generated by a different latent cause, then the old memory is less likely to be subject to modification. Therefore, from a theoretical perspective, the way the first and second CSs are temporally organized (retrieval-extinction or standard extinction) might affect how the latent cause is inferred and lead to different levels of fear expression. These findings, together with studies in both fear and drug memories using the retrieval-extinction paradigm (Liu et al., 2014, Luo et al., 2015, Schiller et al., 2010, Xue et al., 2012), seem to suggest that the retrieval-extinction and the standard extinction procedures engage different cognitive and molecular mechanisms that lead to significantly different behavioral outcomes.

      In our study, we focus on the short-term and long-term amnesia effects of the retrieval-extinction procedure but also point out the critical role of retrieval in eliciting the short-term effect.

      1B. "Importantly, such a short-term effect is also retrieval dependent, suggesting the labile state of memory is necessary for the short-term memory update to take effect (Fig. 1e)." ***As above, what is "the short-term memory update"? At this point in the text, it would be appropriate for the authors to discuss why the retrieval-extinction procedure produces less recovery than a standard extinction procedure as the two protocols only differ in the interval between the first and second extinction trials. References to a "short-term memory update" process do not help the reader to understand what is happening in the protocol.

      Sorry for the lack of clarity here. By short-term memory update we meant the short-term amnesia in fear expression.

      (2) "Indeed, through a series of experiments, we identified a short-term fear amnesia effect following memory retrieval, in addition to the fear reconsolidation effect that appeared much later."

      ***The only reason for supposing two effects is because of the differences in responding to the CS2, which was subjected to STANDARD extinction, in the short- and long-term tests. More needs to be said about how and why the performance of CS2 is affected in the short-term test and recovers in the long-term test. That is, if the loss of performance to CS1 and CS2 is going to be attributed to some type of memory updating process across the retrieval-extinction procedure, one needs to explain the selective recovery of performance to CS2 when the extinction-to-testing interval extends to 24 hours. Instead of explaining this recovery, the authors note that performance to CS1 remains low when the extinction-to-testing interval is 24 hours and invoke something to do with memory reconsolidation as an explanation for their results: that is, they imply (I think) that reconsolidation of the CS1-US memory is disrupted across the 24-hour interval between extinction and testing even though CS1 evokes negligible responding just minutes after extinction.

      In our results, we did not focus only on the fear expression related to CS2. In fact, we also demonstrated that the CS1-related fear expression diminished in the short-term memory test but re-appeared in the long-term memory test after the CS1 retrieval-extinction training.

      The “…recovery of performance to CS2 when the extinction-to-testing interval extends to 24 hours…” is a result that has been demonstrated in various previous studies (Kindt and Soeter, 2018, Kindt et al., 2009, Nader et al., 2000, Schiller et al., 2013, Schiller et al., 2010, Xue et al., 2012). That is, the reconsolidation framework stipulates that the pharmacological or behavioral intervention during the labile states of the reconsolidation window only modifies the fear memory linked to the reminded retrieval cue, but not for the non-reminded CS-US memory expression (but also see (Liu et al., 2014, Luo et al., 2015) for using the unconditioned stimulus as the reminder cue and the retrieval-extinction paradigm to prevent the return of fear memory associated with different CS).  In fact, we hypothesized the temporal dynamics of CS1 and CS2 related fear expressions were due to the interplay between the short-term and long-term (reconsolidation) effects of the retrieval-extinction paradigm in the last figure (Fig. 6). 

      (3) The discussion of memory suppression is potentially interesting but, in its present form, raises more questions than it answers. That is, memory suppression is invoked to explain a particular pattern of results but I, as the reader, have no sense of why a fear memory would be better suppressed shortly after the retrieval-extinction protocol compared to the standard extinction protocol; and why this suppression is NOT specific to the cue that had been subjected to the retrieval-extinction protocol.

      We discussed memory suppression as one of the potential mechanisms to account for the three characteristics of the short-term amnesia effects: cue-independence, temporal dynamics (short-term) and thought-control-ability relevance. According to the memory suppression theory, the memory suppression effect is NOT specific to the cue and this effect was demonstrated via the independent cue test in a variety of studies (Anderson and Floresco, 2022, Anderson and Green, 2001, Gagnepain et al., 2014, Zhu et al., 2022). Therefore, we suggest in the discussion that it might be possible the CS1 retrieval cue prompted an automatic suppression mechanism and yielded the short-term fear amnesia consistent with various predictions from the memory suppression theory:

      “In our experiments, subjects were not explicitly instructed to suppress their fear expression, yet the retrieval-extinction training significantly decreased short-term fear expression. These results are consistent with the short-term amnesia induced with the more explicit suppression intervention (Anderson et al., 1994; Kindt and Soeter, 2018; Speer et al., 2021; Wang et al., 2021; Wells and Davies, 1994). It is worth noting that although consciously repelling unwanted memory is a standard approach in memory suppression paradigm, it is possible that the engagement of the suppression mechanism can be unconscious. For example, in the retrieval-induced forgetting (RIF) paradigm, recall of a stored memory impairs the retention of related target memory and this forgetting effect emerges as early as 20 minutes after the retrieval procedure, suggesting memory suppression or inhibition can occur in a more spontaneous and automatic manner (Imai et al., 2014). Moreover, subjects with trauma histories exhibited more suppression-induced forgetting for both negative and neutral memories than those with little or no trauma (Hulbert and Anderson, 2018). Similarly, people with higher self-reported thought-control capabilities showed more severe cue-independent memory recall deficit, suggesting that suppression mechanism is associated with individual differences in spontaneous control abilities over intrusive thoughts (Küpper et al., 2014). It has also been suggested that similar automatic mechanisms might be involved in organic retrograde amnesia of traumatic childhood memories (Schacter et al., 2012; Schacter et al., 1996).”

      3A. Relatedly, how does the retrieval-induced forgetting (which is referred to at various points throughout the paper) relate to the retrieval-extinction effect? The appeal to retrieval-induced forgetting as an apparent justification for aspects of the present study reinforces points 2 and 3 above. It is not uninteresting but needs some clarification/elaboration.

      We introduced retrieval-induced forgetting (RIF) to make the point that RIF is believed to be related to the memory suppression mechanism and that the RIF effect can appear relatively early, consistent with what we observed in the short-term amnesia effect. We have re-written the manuscript to make this point clearer:

      “It is worth noting that although consciously repelling unwanted memory is a standard approach in memory suppression paradigm, it is possible that the engagement of the suppression mechanism can be unconscious. For example, in the retrieval-induced forgetting (RIF) paradigm, recall of a stored memory impairs the retention of related target memory and this forgetting effect emerges as early as 20 minutes after the retrieval procedure, suggesting memory suppression or inhibition can occur in a more spontaneous and automatic manner (Imai et al., 2014). Moreover, subjects with trauma histories exhibited more suppression-induced forgetting for both negative and neutral memories than those with little or no trauma (Hulbert and Anderson, 2018). Similarly, people with higher self-reported thought-control capabilities showed more severe cue-independent memory recall deficit, suggesting that suppression mechanism is associated with individual differences in spontaneous control abilities over intrusive thoughts (Küpper et al., 2014).”

      (4) Given the reports by Chalkia, van Oudenhove & Beckers (2020) and Chalkia et al (2020), some qualification needs to be inserted in relation to reference 6. That is, reference 6 is used to support the statement that "during the reconsolidation window, old fear memory can be updated via extinction training following fear memory retrieval". This needs a qualifying statement like "[but see Chalkia et al (2020a and 2020b) for failures to reproduce the results of 6]."

      https://pubmed.ncbi.nlm.nih.gov/32580869/

      https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7115860/

      We have incorporated the reviewer’s suggestion into the revised manuscript in both the introduction:

      “Pharmacological blockade of protein synthesis and behavioral interventions can both eliminate the original fear memory expression in the long-term (24 hours later) memory test (Lee, 2008; Lee et al., 2017; Schiller et al., 2013; Schiller et al., 2010), resulting in the cue-specific fear memory deficit (Debiec et al., 2002; Lee, 2008; Nader, Schafe, & LeDoux, 2000). For example, during the reconsolidation window, retrieving a fear memory allows it to be updated through extinction training (i.e., the retrieval-extinction paradigm (Lee, 2008; Lee et al., 2017; Schiller et al., 2013; Schiller et al., 2010), but also see (Chalkia, Schroyens, et al., 2020; Chalkia, Van Oudenhove, et al., 2020; D. Schiller, LeDoux, & Phelps, 2020)).”

      And in the discussion:

      “It should be noted that while our long-term amnesia results were consistent with the fear memory reconsolidation literatures, there were also studies that failed to observe fear prevention (Chalkia, Schroyens, et al., 2020; Chalkia, Van Oudenhove, et al., 2020; Schroyens et al., 2023). Although the memory reconsolidation framework provides a viable explanation for the long-term amnesia, more evidence is required to validate the presence of reconsolidation, especially at the neurobiological level (Elsey et al., 2018). While it is beyond the scope of the current study to discuss the discrepancies between these studies, one possibility to reconcile these results concerns the procedure for the retrieval-extinction training. It has been shown that the eligibility for old memory to be updated is contingent on whether the old memory and new observations can be inferred to have been generated by the same latent cause (Gershman et al., 2017; Gershman and Niv, 2012). For example, prevention of the return of fear memory can be achieved through gradual extinction paradigm, which is thought to reduce the size of prediction errors to inhibit the formation of new latent causes (Gershman, Jones, et al., 2013). Therefore, the effectiveness of the retrieval-extinction paradigm might depend on the reliability of such paradigm in inferring the same underlying latent cause. Furthermore, other studies highlighted the importance of memory storage per se and suggested that memory retention was encoded in the memory engram cell ensemble connectivity whereas the engram cell synaptic plasticity is crucial for memory retrieval (Ryan et al., 2015; Tonegawa, Liu, et al., 2015; Tonegawa, Pignatelli, et al., 2015). It remains to be tested how the cue-independent short-term and cue-dependent long-term amnesia effects we observed could correspond to the engram cell synaptic plasticity and functional connectivity among engram cell ensembles (Figure 6). 
      This is particularly important, since the cue-independent characteristic of the short-term amnesia suggest that either different memory cues fail to evoke engram cell activities, or the retrieval-extinction training transiently inhibits connectivity among engram cell ensembles. Finally, SCR is only one aspect of the fear expression, how the retrieval-extinction paradigm might affect subjects’ other emotional (such as the startle response) and cognitive fear expressions such as reported fear expectancy needs to be tested in future studies since they do not always align with each other (Kindt et al., 2009; Sevenster et al., 2012, 2013).”

      5A. What does it mean to ask: "whether memory retrieval facilitates update mechanisms other than memory reconsolidation"? That is, in what sense could or would memory retrieval be thought to facilitate a memory update mechanism?

      It is widely documented in the literature that memory retrieval renders the old memory labile and thus susceptible to the memory reconsolidation process. However, as we mentioned in the manuscript, studies have shown that memory reconsolidation requires de novo protein synthesis and usually takes hours to complete. What remains unknown is whether old memories are subject to modifications other than the reconsolidation process. Our task specifically tested the short-term effect of the retrieval-extinction paradigm and found that fear expression diminished 30 min after the retrieval-extinction training. Such an effect cannot be accounted for by memory reconsolidation.

      5B. "First, we demonstrate that memory reactivation prevents the return of fear shortly after extinction training in contrast to the memory reconsolidation effect which takes several hours to emerge and such a short-term amnesia effect is cue independent (Study 1, N = 57 adults)."

      ***The phrasing here could be improved for clarity: "First, we demonstrate that the retrieval-extinction protocol prevents the return of fear shortly after extinction training (i.e., when testing occurs just min after the end of extinction)." Also, cue-dependence of the retrieval-extinction effect was assessed in study 2.

      We thank the reviewer and have modified the phrasing of the sentence:

      “First, we demonstrate that memory retrieval-extinction protocol prevents the return of fear expression shortly after extinction training and this short-term effect is memory reactivation dependent (Study 1, N = 57 adults).”

      5C. "Furthermore, memory reactivation also triggers fear memory reconsolidation and produces cue-specific amnesia at a longer and separable timescale (Study 2, N = 79 adults)." ***In study 2, the retrieval-extinction protocol produced a cue-specific disruption in responding when testing occurred 24 hours after the end of extinction. This result is interesting but cannot be easily inferred from the statement that begins "Furthermore..." That is, the results should be described in terms of the combined effects of retrieval and extinction, not in terms of memory reactivation alone; and the statement about memory reconsolidation is unnecessary. One can simply state that the retrieval-extinction protocol produced a cue-specific disruption in responding when testing occurred 24 hours after the end of extinction.

      We have revised the text according to the reviewer’s comment.

      “Furthermore, across different timescales, the memory retrieval-extinction paradigm triggers distinct types of fear amnesia in terms of cue-specificity and cognitive control dependence, suggesting that the short-term fear amnesia might be caused by different mechanisms from the cue-specific amnesia at a longer and separable timescale (Study 2, N = 79 adults).”

      5D. "...we directly manipulated brain activities in the dorsolateral prefrontal cortex and found that both memory retrieval and intact prefrontal cortex functions were necessary for the short-term fear amnesia."

      ***This could be edited to better describe what was shown: E.g., "...we directly manipulated brain activities in the dorsolateral prefrontal cortex and found that intact prefrontal cortex functions were necessary for the short-term fear amnesia after the retrieval-extinction protocol."

      Edited:

      “Finally, using continuous theta-burst stimulation (Study 3, N = 75 adults), we directly manipulated brain activity in the dorsolateral prefrontal cortex, and found that both memory reactivation and intact prefrontal cortex function were necessary for the short-term fear amnesia after the retrieval-extinction protocol.”

      5E. "The temporal scale and cue-specificity results of the short-term fear amnesia are clearly dissociable from the amnesia related to memory reconsolidation, and suggest that memory retrieval and extinction training trigger distinct underlying memory update mechanisms."

      ***The pattern of results when testing occurred just minutes after the retrieval-extinction protocol was different from that obtained when testing occurred 24 hours after the protocol. Describing this in terms of temporal scale is unnecessary, and suggesting that memory retrieval and extinction trigger different memory update mechanisms is not obviously warranted. The results of interest are due to the combined effects of retrieval+extinction and there is no sense in which different memory update mechanisms should be identified with retrieval (mechanism 1) and extinction (mechanism 2).

      We did not argue for different memory update mechanisms for the “retrieval (mechanism 1) and extinction (mechanism 2)” in our manuscript. Instead, we proposed that the retrieval-extinction procedure, which has mainly been documented in the previous literature for its association with the reconsolidation-related fear memory retention (the long-term effect), also has a much faster effect (the short-term effect). These two effects differed in many aspects, suggesting that different memory update mechanisms might be involved.

      5F. "These findings raise the possibility of concerted memory modulation processes related to memory retrieval..."

      ***What does this mean?

      As we mentioned in our response to the previous comment, we believe that the retrieval-extinction procedure triggers different types of memory update mechanisms working on different temporal scales.

      (6) "...suggesting that the fear memory might be amenable to a more immediate effect, in addition to what the memory reconsolidation theory prescribes..."

      ***What does it mean to say that the fear memory might be amenable to a more immediate effect?

      We intended to state that the retrieval-extinction procedure can produce a short-term amnesia effect and have thus revised the text.

      (7) "Parallel to the behavioral manifestation of long- and short-term memory deficits, concurrent neural evidence supporting memory reconsolidation theory emphasizes the long-term effect of memory retrieval by hypothesizing that synapse degradation and de novo protein synthesis are required for reconsolidation."

      ***This sentence needs to be edited for clarity.

      We have rewritten this sentence:

      “Corresponding to the long-term behavioral manifestation, concurrent neural evidence supporting memory reconsolidation hypothesis emphasizes that synapse degradation and de novo protein synthesis are required for reconsolidation.”

      (8) "previous behavioral manipulations engendering the short-term declarative memory effect..."

      ***What is the declarative memory effect? It should be defined.

      We meant the amnesia reported in declarative memory research, such as the memory deficit caused by the think/no-think paradigm. The text has been modified for clarity:

      “On the contrary, previous behavioral manipulations engendering the short-term amnesia on declarative memory, such as the think/no-think paradigm, hinges on the intact activities in brain areas such as dorsolateral prefrontal cortex (cognitive control) and its functional coupling with specific brain regions such as hippocampus (memory retrieval) (Anderson and Green, 2001; Wimber et al., 2015).”

      (9) "The declarative amnesia effect emerges much earlier due to the online functional activity modulation..."

      ***Even if the declarative memory amnesia effect had been defined, the reference to online functional activity modulation is not clear.

      We have rephrased the sentence:

      “The declarative amnesia effect arises much earlier due to the more instant modulation of functional connectivity, rather than the slower processes of new protein synthesis in these brain regions.”

      (10) "However, it remains unclear whether memory retrieval might also precipitate a short-term amnesia effect for the fear memory, in addition to the long-term prevention orchestrated by memory consolidation."

      ***I found this sentence difficult to understand on my first pass through the paper. I think it is because of the phrasing of memory retrieval. That is, memory retrieval does NOT precipitate any type of short-term amnesia for the fear memory: it is the retrieval-extinction protocol that produces something like short-term amnesia. Perhaps this sentence should also be edited for clarity.

      We have changed “memory retrieval” to “retrieval-extinction” where applicable.

      I will also note that the usage of "short-term" at this point in the paper is quite confusing: Does the retrieval-extinction protocol produce a short-term amnesia effect, which would be evidenced by some recovery of responding to the CS when tested after a sufficiently long delay? I don't believe that this is the intended meaning of "short-term" as used throughout the majority of the paper, right?

      By “short-term”, we meant the lack of fear expression in the test phase (measured by skin conductance responses) shortly after the retrieval-extinction procedure (30 mins in studies 1 & 2 and 1 hour in study 3). It does not indicate that the effect is by itself “short-lived”.

      (11) "To fully comprehend the temporal dynamics of the memory retrieval effect..."

      ***What memory retrieval effect? This needs some elaboration.

      We’ve changed the phrase “memory retrieval effect” to “retrieval-extinction effect” to refer to the effect of retrieval-extinction on fear amnesia.

      (12) "We hypothesize that the labile state triggered by the memory retrieval may facilitate different memory update mechanisms following extinction training, and these mechanisms can be further disentangled through the lens of temporal dynamics and cue-specificities."

      ***What does this mean? The first part of the sentence is confusing around the usage of the term "facilitate"; and the second part of the sentence that references a "lens of temporal dynamics and cue-specificities" is mysterious. Indeed, as all rats received the same retrieval-extinction exposures in Study 2, it is not clear how or why any differences between the groups are attributed to "different memory update mechanisms following extinction".

      As the reviewer mentioned, if data were collected at only one time point, we could not tell whether different memory update mechanisms are involved. In study 2, however, the 3 groups differed only in the time at which the reinstatement test was conducted. Accordingly, our results showed that the fear amnesia effects for CS1 and CS2 cannot be simply explained by forgetting: different memory update mechanisms must be at work to explain the characteristics of the SCR related to both CS1 and CS2 at three different time scales (30min, 6h and 24h). It was based on these results, together with the results from the TMS study (study 3), that we proposed the involvement of a short-term memory update mechanism in addition to the reconsolidation-related fear amnesia (which should become evident much later) induced by the retrieval-extinction protocol.

      (13) "In the first study, we aimed to test whether there is a short-term amnesia effect of fear memory retrieval following the fear retrieval-extinction paradigm."

      ***Again, the language is confusing. The phrase, "a short-term amnesia effect" implies that the amnesia itself is temporary; but I don't think that this implication is intended. The problem is specifically in the use of the phrase "a short-term amnesia effect of fear memory retrieval." To the extent that short-term amnesia is evident in the data, it is not due to retrieval per se but, rather, the retrieval-extinction protocol.

      We have changed the wording and replaced “memory retrieval” with “retrieval-extinction” where applicable.

      (14) The authors repeatedly describe the case where there was a 24-hour interval between extinction and testing as consistent with previous research on fear memory reconsolidation. Which research exactly? That is, in studies where a CS re-exposure was combined with a drug injection, responding to the CS was disrupted in a final test of retrieval from long-term memory which typically occurred 24 hours after the treatment. Is that what the authors are referring to as consistent? If so, which aspect of the results are consistent with those previous findings? Perhaps the authors mean to say that, in the case where there was a 24-hour interval between extinction and testing, the results obtained here are consistent with previous research that has used the retrieval-extinction protocol. This would clarify the intended meaning greatly.

      Our 24-hour test results after the retrieval-extinction protocol were consistent with both pharmacological and behavioral intervention studies of fear memory reconsolidation (Kindt and Soeter, 2018, Kindt et al., 2009, Liu et al., 2014, Luo et al., 2015, Monfils et al., 2009, Nader et al., 2000, Schiller et al., 2013, Schiller et al., 2010, Xue et al., 2012), in which the final test phase typically occurred 24 hours after the treatment. At the 24-hour interval, the memory reconsolidation effect would become evident following either drug administration or behavioral intervention (extinction training).

      DATA

      (15) Points about data:

      15A. The eight participants who were discontinued after Day 1 in study 1 were all from the no-reminder group. Can the authors please comment on how participants were allocated to the two groups in this experiment so that the reader can better understand why the distribution of non-responders was non-random (as it appears to be)?

      15B. Similarly, in study 2, of the 37 participants that were discontinued after Day 2, 19 were from Group 30 min, and 5 were from Group 6 hours. Can the authors comment on how likely these numbers are to have been by chance alone? I presume that they reflect something about the way that participants were allocated to groups, but I could be wrong.

      We went back and checked our data. As we mentioned in the supplementary materials, we categorized subjects as non-responders if their SCR response to any CS was less than 0.02 on Day 1 (fear acquisition). Most of the discontinued participants (non-responders) in the no-reminder group (study 1) and the 30min & 24h groups (study 2) were tested when the heating season had just ended or had yet to start, respectively. It has been documented that the thermal condition of the human body is related to the quality of skin conductance response (SCR) measurements (Bauer et al., 2022, Vila, 2004). We suspect that the high proportion of non-responders might be related to the body thermal conditions caused by the lack of central heating.

      15C. "Post hoc t-tests showed that fear memories were resilient after regular extinction training, as demonstrated by the significant difference between fear recovery indexes of the CS+ and CS- for the no-reminder group (t26 = 7.441, P < 0.001; Fig. 1e), while subjects in the reminder group showed no difference of fear recovery between CS+ and CS- (t29 = 0.797, P = 0.432, Fig. 1e)."

      ***Is the fear recovery index shown in Figure 1E based on the results of the first test trial only? How can there have been a "significant difference between fear recovery indexes of the CS+ and CS- for the no-reminder group" when the difference in responding to the CS+ and CS- is used to calculate the fear recovery index shown in 1E? What are the t-tests comparing exactly, and what correction is used to account for the fact that they are applied post-hoc?

      As we mentioned in the results section of the manuscript, the fear recovery index was defined as “the SCR difference between the first test trial and the last extinction trial of a specific CS”. We then calculated the “differential fear recovery index” (figure legend of Fig. 1e) between CS+ and CS- for both the reminder and no-reminder groups. The post-hoc t-tests were used to examine whether there was significant fear recovery (compared to 0) in the reminder (t<sub>29</sub> = 0.797, P = 0.432, Fig. 1e) and no-reminder (t<sub>26</sub> = 7.441, P < 0.001; Fig. 1e) groups. We realize that the Bonferroni correction was not described in the original manuscript and have added it in the revision where applicable.
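      The definitions above can be illustrated with a minimal sketch. It assumes one SCR value per CS per trial; the function names and the example numbers are hypothetical and are not taken from the manuscript or its analysis code.

```python
from math import sqrt
from statistics import mean, stdev


def fear_recovery_index(last_extinction_scr: float, first_test_scr: float) -> float:
    """SCR on the first test trial minus SCR on the last extinction trial, for one CS."""
    return first_test_scr - last_extinction_scr


def differential_fear_recovery(cs_plus: tuple, cs_minus: tuple) -> float:
    """Recovery index for CS+ minus recovery index for CS-.

    Each argument is a (last_extinction_scr, first_test_scr) pair.
    """
    return fear_recovery_index(*cs_plus) - fear_recovery_index(*cs_minus)


def one_sample_t(values: list) -> float:
    """t statistic for testing whether the mean of `values` differs from 0."""
    n = len(values)
    return mean(values) / (stdev(values) / sqrt(n))


# Hypothetical subject: CS+ goes from 0.25 to 0.75, CS- from 0.25 to 0.5.
print(differential_fear_recovery((0.25, 0.75), (0.25, 0.5)))  # 0.25
```

      In this sketch, the post-hoc comparison against 0 within a group corresponds to calling `one_sample_t` on that group's list of differential indexes; any correction for multiple tests would be applied to the resulting P values separately.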

      15D. "Finally, there is no statistical difference between the differential fear recovery indexes between CS+ in the reminder and no reminder groups (t55 = -2.022, P = 0.048; Fig. 1c, also see Supplemental Material for direct test for the test phase)."

      ***Is this statement correct - i.e., that there is no statistically significant difference in fear recovery to the CS+ in the reminder and no reminder groups? I'm sure that the authors would like to claim that there IS such a difference; but if such a difference is claimed, one would be concerned by the fact that it is coming through in an uncorrected t-test, which is the third one of its kind in this paragraph. What correction (for the Type 1 error rate) is used to account for the fact that the t-tests are applied post-hoc? And if no correction, why not?

      We are sorry about the typo. The reviewer was correct that we meant to claim here that “… there is a significant difference between the differential fear recovery indexes between CS+ in the reminder and no-reminder groups (t<sub>55</sub> = -2.022, P = 0.048; Fig. 1e)”. Note that the t-test performed here was a confirmatory test following our two-way ANOVA with the factors group (reminder vs. no-reminder) and time (last extinction trial vs. first test trial) on the differential CS SCR response (CS+ minus CS-), in which we found a significant group x time interaction effect (F<sub>1,55</sub> = 4.087, P = 0.048, η<sup>2</sup> = 0.069). The significant difference between the differential fear recovery indexes simply re-expresses the interaction effect mentioned above, and therefore no correction for multiple comparisons is needed. We have reorganized the sequence of the sentences such that this t-test now directly follows the results of the ANOVA:

      “The interaction effect was confirmed by the significant difference between the differential fear recovery indexes between CS1+ and CS2+ in the reminder and no-reminder groups (t<sub>55</sub> = -2.022, P = 0.048; Figure 1E, also see Supplemental Material for the direct test of the test phase).”
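      The equivalence invoked here can be checked numerically. In a design with two groups and two time points, the group x time interaction F (with 1 numerator degree of freedom) equals the square of the pooled-variance independent-samples t computed on each subject's difference score. The sketch below only verifies that the reported t and F agree within rounding; the helper names are illustrative, and it assumes the reported t came from a pooled-variance test with 55 degrees of freedom.

```python
def interaction_f_from_t(t_value: float) -> float:
    """Interaction F(1, df) implied by a t test on difference scores with the same df."""
    return t_value ** 2


def agrees_within_rounding(t_value: float, f_value: float, decimals: int = 3) -> bool:
    """True if f_value is compatible with t_value**2, given both were rounded to `decimals`."""
    half_unit = 0.5 * 10 ** -decimals
    low = (abs(t_value) - half_unit) ** 2 - half_unit
    high = (abs(t_value) + half_unit) ** 2 + half_unit
    return low <= f_value <= high


print(round(interaction_f_from_t(-2.022), 3))   # 4.088
print(agrees_within_rounding(-2.022, 4.087))    # True
```

      The small gap between 4.088 and the reported 4.087 is what one would expect from rounding t to three decimals before squaring.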

      15E. In study 2, why is responding to the CS- so high on the first test trial in Group 30 min? Is the change in responding to the CS- from the last extinction trial to the first test trial different across the three groups in this study? Inspection of the figure suggests that it is higher in Group 30 min relative to Groups 6 hours and 24 hours. If this is confirmed by the analysis, it has implications for the fear recovery index which is partly based on responses to the CS-. If not for differences in the CS- responses, Groups 30 minutes and 6 hours are otherwise identical.

      Following the reviewer’s comments, we went back and calculated the mean SCR difference of CS- between the first test trial and the last extinction trial for all three studies (see Author response image 1 below). In study 1, there was no difference in the mean CS- SCR (between the first test trial and last extinction trial) between the reminder and no-reminder groups (Kruskal-Wallis test, panel a), though both groups showed significant fear recovery even in the CS- condition (Wilcoxon signed rank test, reminder: P = 0.0043, no-reminder: P = 0.0037). Next, we examined the mean SCR for CS- for the 30min, 6h and 24h groups in study 2 and found that there was indeed a group difference (one-way ANOVA, F<sub>2,76</sub> = 5.3462, P = 0.0067, panel b), suggesting that the CS- related SCR was influenced by the test time (30min, 6h or 24h). We also tested the CS- related SCR for the 4 groups in study 3 (where the test was conducted 1 hour after the retrieval-extinction training) and found that, across TMS stimulation types (PFC vs. VER) and reminder types (reminder vs. no-reminder), the ANOVA yielded neither a main effect of TMS stimulation type (F<sub>1,71</sub> = 0.322, P = 0.572) nor a main effect of reminder type (F<sub>1,71</sub> = 0.0499, P = 0.824, panel c). We added the R-VER group results from study 3 (see panel c) to panel b, plotted the CS- SCR difference across the 4 different test time points, and found that the CS- SCR decreased as the test-extinction delay increased (Jonckheere-Terpstra test, P = 0.00028). These results suggest a natural “forgetting” tendency for CS- related SCR and highlight the importance of having the CS- as a control condition against which the CS+ related SCR is compared.

      Author response image 1.

      15F. Was the 6-hour group tested at a different time of day compared to the 30-minute and 24-hour groups; and could this have influenced the SCRs in this group?

      For the 30min and 24h groups, the test phase could be arranged in the morning, in the afternoon or at night. However, for the 6h group, the test phase was inevitably in the afternoon or at night since we wanted to exclude the potential influence of overnight sleep on the expression of fear memory (see Author response table 1 below). If we had restricted the test time to the afternoon or night for all three groups, the timing of their extinction training would not have been matched.

      Author response table 1.

      Nevertheless, we also went back and examined the data for only those subjects in the 30min and 24h groups who were tested in the afternoon or at night, to match the 6h group, in which all subjects were tested either in the afternoon or at night. According to Author response table 1 above, this leaves 17 subjects in the 30min group (9 + 8), 18 subjects in the 24h group (9 + 9) and 26 subjects in the 6h group (12 + 14). As Author response image 2 shows, the SCR patterns in the fear acquisition, extinction and test phases were similar to the results presented in the original figure.

      Author response image 2.

      15G. Why is the range of scores in "thought control ability" different in the 30-minute group compared to the 6-hour and 24-hour groups? I am not just asking about the scale on the x-axis: I am asking why the actual distribution of the scores in thought control ability is wider for the 30-minute group?

      We went back and tested whether the TCAQ score variance was the same across the three groups. We found a significant difference in the variance of the TCAQ score distribution across the three groups (F<sub>2,155</sub> = 4.324, P = 0.015, Levene test). However, post-hoc analyses found that the variance of TCAQ scores did not differ significantly between the 30min and 6h groups (F<sub>26,25</sub> = 0.4788, P = 0.0697), nor between the 30min and 24h groups (F<sub>26,25</sub> = 0.4692, P = 0.0625). To further validate the correlation between the TCAQ score and the fear recovery index, we removed from the 30min group the TCAQ scores that fell outside the TCAQ score range of the 6h & 24h groups (4 “outlier” TCAQ scores in the 30min group, panel a in Author response image 3 below), and the Levene test confirmed that the variance of the TCAQ scores showed no difference across groups after removing these 4 “outlier” data points (F<sub>2,147</sub> = 0.74028, P = 0.4788). Even with the 4 “outliers” removed from the 30min group, the correlational analysis of the TCAQ scores and the fear recovery index still yielded a significant result in the 30min group (beta = -0.0148, t = -3.731, P = 0.0006, see panel b below), indicating that our results were not likely due to the inclusion of subjects with extreme TCAQ scores.

      Author response image 3.
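      For readers unfamiliar with the variance check described above, the following is a minimal sketch of the Levene statistic and of the range restriction applied to the 30min group. It assumes the mean-centered form of the test (the response does not say whether mean or median centering was used), and the scores shown are invented for illustration only.

```python
from statistics import mean


def levene_w(*groups):
    """Levene's W: a one-way ANOVA F computed on absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every score from its own group's mean.
    z = [[abs(x - mean(g)) for x in g] for g in groups]
    z_means = [mean(zi) for zi in z]
    grand_mean = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - grand_mean) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


def restrict_to_range(scores, reference_scores):
    """Keep only the scores that lie within the min-max range of the reference scores."""
    low, high = min(reference_scores), max(reference_scores)
    return [s for s in scores if low <= s <= high]


# Invented scores: the first group is more spread out than the second.
wide = [40.0, 55.0, 70.0, 85.0, 100.0]
narrow = [60.0, 65.0, 70.0, 75.0, 80.0]
print(levene_w(wide, narrow) > 0)        # True
print(restrict_to_range(wide, narrow))   # [70.0]
```

      W is compared against an F distribution with (k - 1, N - k) degrees of freedom to obtain a P value; that step is omitted here to keep the sketch free of external dependencies.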

      (16) During testing in each experiment, how were the various stimuli presented? That is, was the presentation order for the CS+ and CS- pseudorandom according to some constraint, as it had been in extinction? This information should be added to the method section.

      We mentioned the order of the stimuli in the testing phase in the methods section “… For studies 2 & 3, …a pseudo-random stimulus order was generated for fear acquisition and extinction phases of three groups with the rule that no same trial-type (CS1+, CS2+ and CS-) repeated more than twice. In the test phase, to exclude the possibility that the difference between CS1+ and CS2+ was simply caused by the presentation sequence of CS1+ and CS2+, half of the participants completed the test phase using a pseudo-random stimuli sequence and the identities of CS1+ and CS2+ reversed in the other half of the participants.”
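      One way such a constrained order could be generated is sketched below. It reads the rule as "no trial type appears more than twice in a row" and uses simple rejection sampling; the trial counts, function names and seed are hypothetical and do not describe the actual procedure used in the studies.

```python
import random


def max_run_length(sequence):
    """Length of the longest run of identical consecutive items."""
    longest = run = 1
    for previous, current in zip(sequence, sequence[1:]):
        run = run + 1 if current == previous else 1
        longest = max(longest, run)
    return longest


def pseudo_random_order(counts, max_repeat=2, seed=None, max_tries=100_000):
    """Shuffle trials until no trial type occurs more than `max_repeat` times in a row."""
    rng = random.Random(seed)
    trials = [label for label, n in counts.items() for _ in range(n)]
    for _ in range(max_tries):
        rng.shuffle(trials)
        if max_run_length(trials) <= max_repeat:
            return list(trials)
    raise RuntimeError("no valid order found; relax the constraint or raise max_tries")


def swap_cs_identities(order):
    """Exchange CS1+ and CS2+ labels to counterbalance presentation sequence."""
    swap = {"CS1+": "CS2+", "CS2+": "CS1+"}
    return [swap.get(label, label) for label in order]


order = pseudo_random_order({"CS1+": 6, "CS2+": 6, "CS-": 6}, seed=1)
counterbalanced = swap_cs_identities(order)
```

      Swapping the labels rather than drawing a second sequence keeps the positions of the two reinforced cues mirrored across the two halves of the sample, which is the counterbalancing logic described in the quoted methods text.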

      (17) "These results are consistent with previous research which suggested that people with better capability to resist intrusive thoughts also performed better in motivated dementia in both declarative and associative memories."

      ***Which parts of the present results are consistent with such prior results? It is not clear from the descriptions provided here why thought control ability should be related to the present findings or, indeed, past ones in other domains. This should be elaborated to make the connections clear.

      In the 30min group, we found that subjects’ TCAQ scores were negatively correlated with their fear recovery indices. That is, people with better capacity to resist intrusive thoughts were also less likely to experience the return of fear memory, which is consistent with previous results. Together with our brain stimulation results, this suggests that the short-term amnesia is related to subjects’ cognitive control ability and intact dlPFC function. It is because of these similarities that we propose that the short-term amnesia might be related to the automatic memory suppression mechanism described in declarative memory research. Since we had not yet provided all the evidence at this point of the results section, we only briefly listed the connections with previous declarative and associative memory research.

      Reviewer #2 (Public Review):

      The fear acquisition data is converted to a differential fear SCR and this is what is analysed (early vs late). However, the figure shows the raw SCR values for CS+ and CS- and therefore it is unclear whether the acquisition was successful (despite there being an "early" vs "late" effect - no descriptives are provided).

      As the reviewer mentioned, the fear acquisition data were converted to a differential fear SCR, and we conducted a two-way mixed ANOVA of group (reminder vs. no-reminder) x time (early vs. late part of fear acquisition) on the differential SCRs. We found a significant main effect of time (early vs. late; F<sub>1.55</sub> = 6.545, P = 0.013, η<sup>2</sup> = 0.106), suggesting successful fear acquisition in both groups. Fig. 1c also showed the mean differential SCR for the latter half of the acquisition phase in both the reminder and no-reminder groups, and there was no significant difference in acquired SCRs between groups (early acquisition: t<sub>55</sub> = -0.063, P = 0.950; late acquisition: t<sub>55</sub> = -0.318, P = 0.751; Fig. 1c).

      In Experiment 1 (Test results) it is unclear whether the main conclusion stems from a comparison of the test data relative to the last extinction trial ("we defined the fear recovery index as the SCR difference between the first test trial and the last extinction trial for a specific CS") or the difference relative to the CS- ("differential fear recovery index between CS+ and CS-"). It would help the reader assess the data if Figure 1e presents all the indexes (both CS+ and CS-). In addition, there is one sentence that I could not understand "there is no statistical difference between the differential fear recovery indexes between CS+ in the reminder and no reminder groups (P=0.048)". The p-value suggests that there is a difference, yet it is not clear what is being compared here. Critically, any index taken as a difference relative to the CS- can indicate recovery of fear to the CS+ or absence of discrimination relative to the CS-, so ideally the authors would want to directly compare responses to the CS+ in the reminder and no-reminder groups. The latter issue is particularly relevant in Experiment 2, in which the CS- seems to vary between groups during the test and this can obscure the interpretation of the result.

      In all the experiments, the fear recovery index (FRI) was defined as the SCR difference between the first test trial and the last extinction trial for any CS. Subsequently, the differential fear recovery index (differential FRI) was defined as the difference between the FRI of a specific CS+ and the FRI of the CS-. The differential FRI would effectively remove the non-specific time-related effect (using the CS- FRI as the baseline). We have revised the text accordingly.
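
      The two definitions above can be written out as a short sketch. The function names and the SCR values are hypothetical and serve only to make the arithmetic explicit.

```python
def fear_recovery_index(first_test_scr, last_extinction_scr):
    """FRI for one CS: SCR on the first test trial minus SCR on the last extinction trial."""
    return first_test_scr - last_extinction_scr


def differential_fri(cs_plus, cs_minus):
    """Differential FRI: FRI of a given CS+ minus FRI of the CS-.

    Each argument is a (first_test_scr, last_extinction_scr) pair.
    """
    return fear_recovery_index(*cs_plus) - fear_recovery_index(*cs_minus)


# Hypothetical SCR values (in microsiemens), for illustration only.
cs_plus = (0.5, 0.25)     # CS+: first test trial, last extinction trial
cs_minus = (0.25, 0.125)  # CS-: first test trial, last extinction trial

print(fear_recovery_index(*cs_plus))        # FRI of the CS+
print(fear_recovery_index(*cs_minus))       # FRI of the CS-
print(differential_fri(cs_plus, cs_minus))  # differential FRI
```

      A differential FRI near zero means that the CS+ response changed from extinction to test by about as much as the CS- response did, so any recovery is not specific to the CS+.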

      As we responded to reviewer #1, the CS- fear recovery indices (FRI) for the reminder and no-reminder groups were not statistically different (Kruskal-Wallis test, panel a, Author response image 1), though both groups showed significant fear recovery even in the CS- condition (Wilcoxon signed rank test, reminder: P = 0.0043, no-reminder: P = 0.0037, panel a). Next, we examined the mean SCR for CS- for the 30min, 6h and 24h groups in study 2 and found that there was indeed a group difference (one-way ANOVA, F<sub>2.76</sub> = 5.3462, P = 0.0067, panel b), suggesting that the CS- SCR was influenced by the test time delay. We also tested the CS- SCR for the 4 groups in study 3 and found that across TMS stimulation types (PFC vs. VER) and reminder types (reminder vs. no-reminder) the ANOVA yielded neither a main effect of TMS stimulation type (F<sub>1.71</sub> = 0.322, P = 0.572) nor a main effect of reminder type (F<sub>1.71</sub> = 0.0499, P = 0.824, panel c). We added the R-VER group results in study 3 (see panel c) to panel b, plotted the CS- SCR difference across the 4 different test time points, and found that the CS- SCR decreased as the test-extinction delay increased (Jonckheere-Terpstra test, P = 0.00028). These results suggest a natural “forgetting” tendency for the CS- fear recovery index and highlight the importance of having the CS- as a control condition against which to compare the CS+ recovery index (resulting in the differential recovery index). Parametric and non-parametric analyses were adopted based on whether the data met the assumptions for the parametric analyses.

      In Experiment 1, the findings suggest that there is a benefit of retrieval followed by extinction in a short-term reinstatement test. In Experiment 2, the same effect is observed on a cue that did not undergo retrieval before extinction (CS2+), a result that is interpreted as resulting from cue-independence, rather than a failure to replicate in a within-subjects design the observations of Experiment 1 (between-subjects). Although retrieval-induced forgetting is cue-independent (the effect on items that are suppressed [Rp-] can be observed with an independent probe), it is not clear that the current findings are similar. Here, both cues have been extinguished and therefore been equally exposed during the critical stage.

      We appreciate the reviewer’s insight on this issue. Although in the discussion we raised the possibility of memory suppression to account for the short-term amnesia effect, we did not intend to compare our paradigm side-by-side with retrieval-induced forgetting. In our previous work (Wang et al., 2021), we reported that the effect of actively suppressing CS+ related fear memory during standard extinction training generalized to other CS+, yielding a cue-independent effect. In the current experiments, we did not implement active suppression; instead, we used the CS+ retrieval-extinction paradigm. It is thus possible that the CS+ retrieval cue may function to facilitate automatic suppression. Indeed, in the no-reminder group (standard extinction) of study 1, we did observe the return of fear expression, suggesting a critical role of the CS+ reminder before the extinction training. Based on the results mentioned above, we believe our short-term amnesia results were consistent with the hypothesis that the retrieval CS+ (reminder) might prompt subjects to adopt an automatic suppression mechanism in the following extinction training, yielding cue-independent amnesia effects.

      The findings in Experiment 2 suggest that the amnesia reported in Experiment 1 is transient, in that no effect is observed when the test is delayed by 6 hours. The phenomena whereby reactivated memories transition to extinguished memories as a function of the amount of exposure (or number of trials) is completely different from the phenomena observed here. In the former, the manipulation has to do with the number of trials (or the total amount of time) that the cues are exposed to. In the current study, the authors did not manipulate the number of trials but instead the retention interval between extinction and test. The finding reported here is closer to a "Kamin effect", that is the forgetting of learned information which is observed with intervals of intermediate length (Baum, 1968). Because the Kamin effect has been inferred to result from retrieval failure, it is unclear how this can be explained here. There needs to be much more clarity on the explanations to substantiate the conclusions.

      Indeed, in our studies, we did not manipulate the amount of exposure (or number of trials) but only the retention interval between extinction and test. Our results demonstrated that the retrieval-extinction protocol yielded a short-term amnesia of fear memory that is qualitatively different from the reconsolidation-related amnesia proposed in the previous literature. After examining the temporal dynamics, cue-specificity and TCAQ association of the short-term amnesia, we speculated that the short-term effect might be related to an automatic suppression mechanism. Of course, further studies will be required to test such a hypothesis.

      Our results might not be easily compared with the “Kamin effect”, a term coined to describe the “retention of a partially learned avoidance response over varying time intervals” using a learning-re-learning paradigm (Baum, 1968, Kamin, 1957). However, the retrieval-extinction procedure used in our studies was different from the learning-re-learning paradigm in the original paper (Kamin, 1957) and the reversal-learning paradigm the reviewer mentioned (Baum, 1968).

      There are many results (Ryan et al., 2015) that challenge the framework that the authors base their predictions on (consolidation and reconsolidation theory), therefore these need to be acknowledged. Similarly, there are reports that failed to observe the retrieval-extinction phenomenon (Chalkia et al., 2020), and the work presented here is written as if the phenomenon under consideration is robust and replicable. This needs to be acknowledged.

      We thank the reviewer for pointing out the related literature and have added a separate paragraph about other results in the discussion (as well as citing relevant references in the introduction) to provide a full picture of the reconsolidation theory to the audience:

      “It should be noted that while our long-term amnesia results were consistent with the fear memory reconsolidation literatures, there were also studies that failed to observe fear prevention (Chalkia, Schroyens, et al., 2020; Chalkia, Van Oudenhove, et al., 2020; Schroyens et al., 2023). Although the memory reconsolidation framework provides a viable explanation for the long-term amnesia, more evidence is required to validate the presence of reconsolidation, especially at the neurobiological level (Elsey et al., 2018). While it is beyond the scope of the current study to discuss the discrepancies between these studies, one possibility to reconcile these results concerns the procedure for the retrieval-extinction training. It has been shown that the eligibility for old memory to be updated is contingent on whether the old memory and new observations can be inferred to have been generated by the same latent cause (Gershman et al., 2017; Gershman and Niv, 2012). For example, prevention of the return of fear memory can be achieved through gradual extinction paradigm, which is thought to reduce the size of prediction errors to inhibit the formation of new latent causes (Gershman, Jones, et al., 2013). Therefore, the effectiveness of the retrieval-extinction paradigm might depend on the reliability of such paradigm in inferring the same underlying latent cause. Furthermore, other studies highlighted the importance of memory storage per se and suggested that memory retention was encoded in the memory engram cell ensemble connectivity whereas the engram cell synaptic plasticity is crucial for memory retrieval (Ryan et al., 2015; Tonegawa, Liu, et al., 2015; Tonegawa, Pignatelli, et al., 2015). It remains to be tested how the cue-independent short-term and cue-dependent long-term amnesia effects we observed could correspond to the engram cell synaptic plasticity and functional connectivity among engram cell ensembles (Figure 6). 
This is particularly important, since the cue-independent characteristic of the short-term amnesia suggest that either different memory cues fail to evoke engram cell activities, or the retrieval-extinction training transiently inhibits connectivity among engram cell ensembles. Finally, SCR is only one aspect of the fear expression, how the retrieval-extinction paradigm might affect subjects’ other emotional (such as the startle response) and cognitive fear expressions such as reported fear expectancy needs to be tested in future studies since they do not always align with each other (Kindt et al., 2009; Sevenster et al., 2012, 2013).”

      The parallels between the current findings and the memory suppression literature are speculated in the general discussion, and there is the conclusion that "the retrieval-extinction procedure might facilitate a spontaneous memory suppression process". Because one of the basic tenets of the memory suppression literature is that it reflects an "active suppression" process, there is no reason to believe that in the current paradigm, the same phenomenon is in place, but instead, it is "automatic". In other words, the conclusions make strong parallels with the memory suppression (and cognitive control) literature, yet the phenomena that they observed are thought to be passive (or spontaneous/automatic).

      Ultimately, it is unclear why 10 mins between the reminder and extinction learning will "automatically" suppress fear memories. Further down in the discussion, it is argued that "For example, in the well-known retrieval-induced forgetting (RIF) phenomenon, the recall of a stored memory can impair the retention of related long-term memory and this forgetting effect emerges as early as 20 minutes after the retrieval procedure, suggesting memory suppression or inhibition can occur in a more spontaneous and automatic manner". I did not follow with the time delay between manipulation and test (20 mins) would speak about whether the process is controlled or automatic.

      In our previous research, we showed that the memory suppression instruction together with the extinction procedure successfully prevented the return of fear expression in the reinstatement test trials 30mins after the extinction training (Wang et al., 2021). In the current experiments, we replaced the suppression instruction with the retrieval cue before the extinction training (retrieval-extinction protocol) and observed similar short-term amnesia effects. These results prompted us to hypothesize in the discussion that the retrieval cue might facilitate an automatic suppression process. We made the analogy to RIF phenomenon in the discussion to suggest that the suppression of (competing) memories could be unintentional and fast (20 mins), both of which were consistent with our results. We agree with the reviewer that this hypothesis is more of a speculation (hence in the discussion), and more studies are required to further test such a hypothesis. However, what we want to emphasize in this paper is the report of the short-term amnesia effects which were clearly not related to the memory reconsolidation effect in a variety of aspects.

      Among the many conclusions, one is that the current study uncovers the "mechanism" underlying the short-term effects of retrieval extinction. There is little in the current report that uncovers the mechanism, even in the most psychological sense of the mechanism, so this needs to be clarified. The same applies to the use of "adaptive".

      Whilst I could access the data on the OFS site, I could not make sense of the Matlab files as there is no signposting indicating what data is being shown in the files. Thus, as it stands, there is no way of independently replicating the analyses reported.

      We have re-organized data on the OFS site, and they should be accessible now.

      The supplemental material shows figures with all participants, but only some statistical analyses are provided, and sometimes these are different from those reported in the main manuscript. For example, the test data in Experiment 1 is analysed with a two-way ANOVA with the main effects of group (reminder vs no-reminder) and time (last trial of extinction vs first trial of the test) in the main report. The analyses with all participants in the sup mat used a mixed two-way ANOVA with a group (reminder vs no reminder) and CS (CS+ vs CS-). This makes it difficult to assess the robustness of the results when including all participants. In addition, in the supplementary materials, there are no figures and analyses for Experiment 3.

      We are sorry for the lack of clarity in the supplementary materials. Supplementary figures Fig. S1 & S2 show the data re-analysis with all the responders (learners + non-learners). The statistical analyses performed on the responders in both figures yielded results similar to those in the main text. For the other analyses reported in the supplementary materials, we deliberately provided different analyses to demonstrate the robustness of our results. For example, to rule out the possibility that the effects we observed in the two-way ANOVA in the main text were driven by different SCR responses on the last extinction trial, we ran the ANOVA on the first-trial SCR of the test phase only, and these analyses provided similar results. Please note that we did not include non-learners in these analyses (the texts of the supplementary materials).

      Since we did not exclude any non-learners in study 3, all the results were already reported in the main text.

      One of the overarching conclusions is that the "mechanisms" underlying reconsolidation (long term) and memory suppression (short term) phenomena are distinct, but memory suppression phenomena can also be observed after a 7-day retention interval (Storm et al., 2012), which then questions the conclusions achieved by the current study.

      As we stated before, the focus of the manuscript was to demonstrate a novel short-term fear amnesia effect following the retrieval-extinction procedure. We discussed memory suppression as one of the potential mechanisms for such a short-term effect. In fact, the durability of the memory suppression effect is still under debate. Although Storm et al. (2012) suggested that retrieval-induced forgetting can persist for as long as a week, other studies failed to observe long-term forgetting (after 24 hrs; Carroll et al., 2007, Chan, 2009). It is also worth noting that Storm et al. (2012) tested RIF one week later using half of the items, the other half of which had been tested 5 minutes after the retrieval practice. Therefore, it can be argued that the long-term RIF effect may be contaminated by the test/re-test process on the same set of (albeit different) items at different time onsets (5mins & 1 week).

      Reviewer #3 (Public Review):

      (1) The entire study hinges on the idea that there is memory 'suppression' if (1) the CS+ was reminded before extinction and (2) the reinstatement and memory test takes place 30 minutes later (in Studies 1 & 2). However, the evidence supporting this suppression idea is not very strong. In brief, in Study 1, the effect seems to only just reach significance, with a medium effect size at best, and, moreover, it is unclear if this is the correct analysis (which is a bit doubtful, when looking at Figure 1D and E). In Study 2, there was no optimal control condition without reminder and with the same 30-min interval (which is problematic, because we can assume generalization between CS1+ and CS2+, as pointed out by the authors, and because generalization effects are known to be time-dependent). Study 3 is more convincing, but entails additional changes in comparison with Studies 1 and 2, i.e., applications of cTBS and an interval of 1 hour instead of 30 minutes (the reason for this change was not explained). So, although the findings of the 3 studies do not contradict each other and are coherent, they do not all provide strong evidence for the effect of interest on their own.

      Related to the comment above, I encourage the authors to double-check if this statement is correct: "Also, our results remain robust even with the "non-learners" included in the analysis (Fig. S1 in the Supplemental Material)". The critical analysis for Study 1 is a between-group comparison of the CS+ and CS- during the last extinction trial versus the first test trial. This result only just reached significance with the selected sample (p = .048), and Figures 1D and E even seem to suggest otherwise. I doubt that the analysis would reach significance when including the "non-learners" - assuming that this is what is shown in Supplemental Figure 1 (which shows the data from "all responded participants").

      Our subjects were categorized based on the criteria specified in supplementary table S1. More specifically, we excluded the non-responders (mean CS SCR < 0.02 μS in the fear acquisition phase) and the non-learners, and focused our analyses on the learners. Non-responders were dismissed after day 1 (the day of fear acquisition), but both learners and non-learners finished the experiments. This fact gave us the opportunity to examine data for both the learners and the responders (learners + non-learners). What we showed in Fig. 1D and E were the differential SCRs (CS+ minus CS-) of the last extinction trials and the differential fear recovery indices (CS+ minus CS-), respectively. We have double-checked the figures, and both the learners (Fig. 1) and the responders (i.e. learners and non-learners, supplementary Fig. 1) showed significant differences between the reminder and no-reminder groups on the differential fear recovery index.

      Also related to the comment above, I think that the statement "suggesting a cue-independent short-term amnesia effect" in Study 2 is not correct and should read: "suggesting extinction of fear to the CS1+ and CS2+", given that the response to the CS+'s is similar to the response to the CS-, as was the case at the end of extinction. Also the next statement "This result indicates that the short-term amnesia effect observed in Study 2 is not reminder-cue specific and can generalize to the non-reminded cues" is not fully supported by the data, given the lack of an appropriate control group in this study (a group without reinstatement). The comparison with the effect found in Study 1 is difficult because the effect found there was relatively small (and may have to be double-checked, see remarks above), and it was obtained with a different procedure using a single CS+. The comparison with the 6-h and 24-h groups of Study 2 is not helpful as a control condition for this specific question (i.e., is there reinstatement of fear for any of the CS+'s) because of the large procedural difference with regard to the intervals between extinction and reinstatement (test).

      In Fig. 2e, we showed the differential fear recovery indices (FRI) for the CS+ in all three groups. Since the fear recovery index (FRI) was calculated as the SCR difference between the first test trial and the last extinction trial for any CS, differential fear recovery indices (difference between CS+ FRI and CS- FRI) that are not significantly different from 0 should be interpreted as a lack of fear expression in the test phase. Since spontaneous recovery, reinstatement and renewal are considered canonical phenomena demonstrating that extinction training does not really “erase” the conditioned fear response, a no-reinstatement control group would effectively work as a spontaneous recovery group, and the comparison between the reinstatement and no-reinstatement groups would turn into a test of the difference in fear recovery between methods (reinstatement vs. spontaneous recovery).

      (2) It is unclear which analysis is presented in Figure 3. According to the main text, it either shows the "differential fear recovery index between CS+ and CS-" or "the fear recovery index of both CS1+ and CS2+". The authors should clarify what they are analyzing and showing, and clarify to which analyses the ** and NS refer in the graphs. I would also prefer the X-axes and particularly the Y-axes of Fig. 3a-b-c to be the same. The image is a bit misleading now. The same remarks apply to Figure 5.

      We are sorry about the lack of clarity here. Figures 3 & 5 showed the correlational analyses between TCAQ and the differential fear recovery index (FRI) between CS+ and CS-. That is, the differential FRI of CS1+ (CS1+ FRI minus CS- FRI) and the differential FRI of CS2+ (CS2+ FRI minus CS- FRI).

      We have rescaled both X and Y axes for figures 3 & 5 (please see the revised figures). 

      (3) In general, I think the paper would benefit from being more careful and nuanced in how the literature and findings are represented. First of all, the authors may be more careful when using the term 'reconsolidation'. In the current version, it is put forward as an established and clearly delineated concept, but that is not the case. It would be useful if the authors could change the text in order to make it clear that the reconsolidation framework is a theory, rather than something that is set in stone (see e.g., Elsey et al., 2018 (https://doi.org/10.1037/bul0000152), Schroyens et al., 2022 (https://doi.org/10.3758/s13423-022-02173-2)).

      In addition, the authors may want to reconsider if they want to cite Schiller et al., 2010 (https://doi.org/10.1038/nature08637), given that neither the main findings of this paper nor the analyses could be replicated (see Chalkia et al., 2020 (https://doi.org/10.1016/j.cortex.2020.04.017; https://doi.org/10.1016/j.cortex.2020.03.031)).

      We thank the reviewer for these comments and have incorporated the mentioned papers into our revised manuscript by pointing out the extant debate surrounding the reconsolidation theory in the introduction:

      “Pharmacological blockade of protein synthesis and behavioral interventions can both eliminate the original fear memory expression in the long-term (24 hours later) memory test (Lee, 2008; Lee et al., 2017; Schiller et al., 2013; Schiller et al., 2010), resulting in the cue-specific fear memory deficit (Debiec et al., 2002; Lee, 2008; Nader, Schafe, & LeDoux, 2000). For example, during the reconsolidation window, retrieving a fear memory allows it to be updated through extinction training (i.e., the retrieval-extinction paradigm; Lee, 2008; Lee et al., 2017; Schiller et al., 2013; Schiller et al., 2010; but also see Chalkia, Schroyens, et al., 2020; Chalkia, Van Oudenhove, et al., 2020; D. Schiller, LeDoux, & Phelps, 2020).”

      As well as in the discussion:

      “It should be noted that while our long-term amnesia results were consistent with the fear memory reconsolidation literatures, there were also studies that failed to observe fear prevention (Chalkia, Schroyens, et al., 2020; Chalkia, Van Oudenhove, et al., 2020; Schroyens et al., 2023). Although the memory reconsolidation framework provides a viable explanation for the long-term amnesia, more evidence is required to validate the presence of reconsolidation, especially at the neurobiological level (Elsey et al., 2018). While it is beyond the scope of the current study to discuss the discrepancies between these studies, one possibility to reconcile these results concerns the procedure for the retrieval-extinction training. It has been shown that the eligibility for old memory to be updated is contingent on whether the old memory and new observations can be inferred to have been generated by the same latent cause (Gershman et al., 2017; Gershman and Niv, 2012). For example, prevention of the return of fear memory can be achieved through gradual extinction paradigm, which is thought to reduce the size of prediction errors to inhibit the formation of new latent causes (Gershman, Jones, et al., 2013). Therefore, the effectiveness of the retrieval-extinction paradigm might depend on the reliability of such paradigm in inferring the same underlying latent cause. Furthermore, other studies highlighted the importance of memory storage per se and suggested that memory retention was encoded in the memory engram cell ensemble connectivity whereas the engram cell synaptic plasticity is crucial for memory retrieval (Ryan et al., 2015; Tonegawa, Liu, et al., 2015; Tonegawa, Pignatelli, et al., 2015). It remains to be tested how the cue-independent short-term and cue-dependent long-term amnesia effects we observed could correspond to the engram cell synaptic plasticity and functional connectivity among engram cell ensembles (Figure 6). 
This is particularly important, since the cue-independent characteristic of the short-term amnesia suggest that either different memory cues fail to evoke engram cell activities, or the retrieval-extinction training transiently inhibits connectivity among engram cell ensembles. Finally, SCR is only one aspect of the fear expression, how the retrieval-extinction paradigm might affect subjects’ other emotional (such as the startle response) and cognitive fear expressions such as reported fear expectancy needs to be tested in future studies since they do not always align with each other (Kindt et al., 2009; Sevenster et al., 2012, 2013).”

      Relatedly, it should be clarified that Figure 6 is largely speculative, rather than a proven model as it is currently presented. This is true for all panels, but particularly for panel c, given that the current study does not provide any evidence regarding the proposed reconsolidation mechanism.

      We agree with the reviewer that Figure 6 is largely speculative. We realize that there are still debates regarding the retrieval-extinction procedure and the fear reconsolidation hypothesis. We have provided a more elaborated discussion and pointed out that figure 6 is only a working hypothesis and more work should be done to test such a hypothesis:

      “Although mixed results have been reported regarding the durability of suppression effects in the declarative memory studies (Meier et al., 2011; Storm et al., 2012), future research will be needed to investigate whether the short-term effect we observed is specifically related to associative memory or the spontaneous nature of suppression (Figure 6C).”

      Lastly, throughout the paper, the authors equate skin conductance responses (SCR) with fear memory. It should at least be acknowledged that SCR is just one aspect of a fear response, and that it is unclear whether any of this would translate to verbal or behavioral effects. Such effects would be particularly important for any clinical application, which the authors put forward as the ultimate goal of the research.

      Again, we agree with the reviewer on this issue, and we have acknowledged that SCR is only one aspect of the fear response and caution should be exerted in clinical application:

      “Finally, SCR is only one aspect of the fear expression, how the retrieval-extinction paradigm might affect subjects’ other emotional (such as the startle response) and cognitive fear expressions such as reported fear expectancy needs to be tested in future studies since they do not always align with each other (Kindt et al., 2009; Sevenster et al., 2012, 2013).”

      (4) The Discussion quite narrowly focuses on a specific 'mechanism' that the authors have in mind. Although it is good that the Discussion is to the point, it may be worthwhile to entertain other options or (partial) explanations for the findings. For example, have the authors considered that there may be an important role for attention? When testing very soon after the extinction procedure (and thus after the reminder), attentional processes may play an important role (more so than with longer intervals). The retrieval procedure could perhaps induce heightened attention to the reminded CS+ (which could be further enhanced by dlPFC stimulation)?

      We thank the reviewer for this suggestion and have added more discussion of the potential mechanisms involved. Unfortunately, the literature on attention and fear recovery is rather scarce, and because our study design and results mainly concern subjects’ skin conductance responses (SCR), any account of the role of attention remains speculative.

      (5) There is room for improvement in terms of language, clarity of the writing, and (presentation of the) statistical analyses, for all of which I have provided detailed feedback in the 'Recommendations for the authors' section. Idem for the data availability; they are currently not publicly available, in contrast with what is stated in the paper. In addition, it would be helpful if the authors would provide additional explanation or justification for some of the methodological choices (e.g., the 18-s interval and why stimulate 8 minutes after the reminder cue, the choice of stimulation parameters), and comment on reasons for (and implications of) the large amount of excluded participants (>25%).

      We have addressed the data accessibility issue and added justifications for the methodological choices as well as for the excluded participants. As we mentioned in the manuscript and the supplementary materials, adding the non-learners to the data analysis did not change the results. Since the non-responders discontinued after Day 1 because their spontaneous SCR signals towards the different CSs were not measurable, it is hard to speculate whether or how the results might have changed. However, participant exclusion rates in SCR studies are relatively high (Hu et al., 2018, Liu et al., 2014, Raio et al., 2017, Schiller et al., 2010, Schiller et al., 2012, Wang et al., 2021). In our tasks, the non-responders were mostly participants tested in the winter. Cold weather and dry skin in the winter are likely to have made the SCR hard to measure (Bauer et al., 2022, Vila, 2004). Different intervals between the reinstating US (electric shock) and the test trials have been used in the previous literature, such as 10 min (Schiller et al., 2010, Schiller et al., 2013) and 18 or 19 s (Kindt and Soeter, 2018, Kindt et al., 2009, Wang et al., 2021). We used the 18 s reinstatement interval in the current experiment. For the cTBS stimulation, since the stimulation itself lasted less than 2 min, we started the cTBS 8 min after the onset of the reminder cue to ensure that any effect caused by the cTBS stimulation occurred during the hypothesized time window, in which the old fear memory becomes labile after memory retrieval. All the stimulation parameters were determined based on previous literature, which showed that transcranial magnetic stimulation (TMS) over the human dorsolateral prefrontal cortex could disrupt fear memory reconsolidation (Borgomaneri et al., 2020, Su et al., 2022).

      Finally, I think several statements made in the paper are overly strong in light of the existing literature (or the evidence obtained here) or imply causal relationships that were not directly tested.

      We have revised the texts accordingly.

      Reviewer #2 (Recommendations For The Authors):

      On numerous occasions there are typos and the autocorrect has changed "amnesia" for "dementia".

      We are sorry about this mistake and have revised the text accordingly.

      Reviewer #3 (Recommendations For The Authors):

      *"Neither of the studies reported in this article was preregistered. The data for both studies are publicly accessible at https://osf.io/9agvk". This excerpt from the text suggests that there are 2 studies, but there are 3 in the paper. Also, the data are only accessible upon request, not publicly available. I haven't requested them, as this could de-anonymize me as a reviewer.

      We apologize for the inaccessible link. The data should now be publicly available.

      *Please refrain from causal interpretations when they are not supported by the data:

      - Figure 3 "thought-control ability only affected fear recovery"; a correlation does not provide causal evidence.

      - "establishing a causal link between the dlPFC activity and short-term fear amnesia." I feel this statement is too strong; to what extent do we know for sure what the applied stimulation of (or more correct: near) the dlPFC does exactly?

      We thank the reviewer for the suggestion and have changed the wording related to Figure 3. However, we would argue that the causal relationship between dlPFC activity and short-term fear amnesia is supported by the results from study 3. Although the exact functional role of TMS over the dlPFC can be debated, the fact that TMS of the dlPFC (compared to the vertex group) brought back the otherwise diminished fear memory expression can be viewed as causal evidence linking dlPFC activity and short-term fear amnesia.

      *The text would benefit from language editing, as it contains spelling and grammar mistakes, as well as wording that is vague or inappropriate. I suggest the authors check the whole text, but below are already some excerpts that caught my eye:

      "preludes memory reconsolidation"; "old fear memory can be updated"; "would cause short-term memory deficit"; "the its functional coupling"; "Subjects (...) yielded more severe amnesia in the memory suppression tasks"; "memory retrieval might also precipitate a short-term amnesia effect"; "more SEVERE amnesia in the memory suppression tasks"; "the effect size of reinstatement effect"; "the previous literatures"; "towards different CS"; "failed to show SCR response to the any stimuli"; "significant effect of age of TMS"; "each subject' left hand"; "latter half trials"; "Differntial fear recovery"; "fear dementia"; "the fear reinstatement effects at different time scale is related to"; "fear reocery index"; "thought-control abiliites"; "performed better in motivated dementia"; "we tested that in addition to the memory retrieval cue (reminder), whether the"; "during reconsolidation window"; "consisitent with the short-term dementia"; "low level of shock (5v)"

      We thank the reviewer for the thorough reading and apologize for the typos in the manuscript. We have corrected all the typos and grammar mistakes we could find.

      *In line with the remark above, there are several places where the text could still be improved.

      - The last sentence of the Abstract is rather vague and doesn't really add anything.

      - Please reword or clarify: "the exact functional role played by the memory retrieval remains unclear".

      - Please reword or clarify: "the unbinding of the old memory trace".

      - "suggesting that the fear memory might be amenable to a more immediate effect, in addition to what the memory reconsolidation theory prescribes" shouldn't this rather read "in contrast with"?

      We have modified the manuscript.

      - In the Introduction, the authors state: "Specifically, memory reconsolidation effect will only be evident in the long-term (24h) memory test due to its requirement of new protein synthesis and is cue-dependent". They then continue about the more immediate memory update mechanisms that they want to study, but it is unclear from how the rationale is presented whether (and why (not)) they also expect this mechanism to be cue-dependent.

      Most previous studies of fear memory reconsolidation that used the CS as the memory retrieval cue have demonstrated that the reconsolidation effect is cue-dependent (Kindt and Soeter, 2018, Kindt et al., 2009, Monfils et al., 2009, Nader et al., 2000, Schiller et al., 2013, Schiller et al., 2010, Xue et al., 2012). However, other studies using an unconditioned stimulus retrieval-extinction paradigm showed that such a protocol was able to prevent the return of fear memory expression associated with different CSs (Liu et al., 2014, Luo et al., 2015). In our task, we used the CS+ as the memory retrieval cue, and our results were consistent with results from previous studies using similar paradigms.

      - "The effects of cTBS over the right dlPFC after the memory reactivation were assessed using the similar mixed-effect four-way ANOVA". Please clarify what was analyzed here.

      - "designing novel treatment of psychiatric disorders". Please make this more concrete or remove the statement.

      This sentence came right after a similar analysis reported in the previous paragraph. While the previous paragraph focused on how the SCRs in the acquisition phase were modulated by factors such as CS+ (CS1+ and CS2+), reminder (reminder vs. no-reminder), cTBS site (right dlPFC vs. vertex) and trial number, this analysis focused instead on the SCRs in the extinction training phase. We have made the modifications as the reviewer suggested.

      *I have several concerns related to the (presentation) of the statistical analyses/results:

      - Some statistical analyses, as well as calculation of certain arbitrary indices (e.g., differential fear recovery index) are not mentioned nor explained in the Methods section, but only mentioned in the Results section.

      We have added the explanation of the differential fear recovery index into the methods section:

      “To measure the extent to which fear returns after the presentation of unconditioned stimuli (US, electric shock) in the test phase, we defined the fear recovery index as the SCR difference between the first test trial and the last extinction trial for a specific CS for each subject. Similarly, in studies 2 and 3, differential fear recovery index was defined as the difference between fear recovery indices of CS+ and CS- for both CS1+ and CS2+.”
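      The index definitions quoted above can be illustrated with a minimal Python sketch. The function names and the SCR amplitudes below are hypothetical and chosen only for illustration; they are not taken from the study's analysis code or data.

```python
def fear_recovery_index(extinction_scrs, test_scrs):
    """SCR on the first test trial minus SCR on the last extinction trial."""
    return test_scrs[0] - extinction_scrs[-1]


def differential_fear_recovery_index(cs_plus, cs_minus):
    """Fear recovery index of a CS+ minus that of the CS-.

    Each argument is an (extinction_scrs, test_scrs) pair for one subject.
    """
    return fear_recovery_index(*cs_plus) - fear_recovery_index(*cs_minus)


# Made-up SCR amplitudes for one subject (arbitrary units).
cs1_plus = ([0.50, 0.25, 0.25], [0.75, 0.50])  # (extinction trials, test trials)
cs_minus = ([0.25, 0.25, 0.25], [0.50, 0.25])

index_cs1 = fear_recovery_index(*cs1_plus)
diff_index_cs1 = differential_fear_recovery_index(cs1_plus, cs_minus)
```

      Under this reading of the definition, the same two functions would be applied separately to CS1+ and CS2+, each paired with the CS-.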

      - Figure 1C-E: It is unclear what the triple *** mean. Do they have the same meaning in Figure 1C and Figure 1E? I am not sure that that makes sense. The meaning is not explained in the figure caption (I think it is different from the single asterisk*) and is not crystal clear from the main text either.

      We explained the triple *** in the figure legend (Fig. 1): ***P < 0.001. The asterisk placed within each bar in Figure 1C-E indicates the statistical results of the post-hoc test of whether each bar was significant. For example, the *** placed inside bars in Figure 1E indicates that the differential fear recovery index is statistically significant in the no-reminder group (P < 0.001).

      - Supplemental Figure 1: "with all responded participants" Please clarify how you define 'responded participants' and include the n's.

      We presented the criteria for both the responder/non-responder and the learner/non-learner in the table of the supplementary materials and reported the number of subjects in each category (please see supplement Table 1).

      - "the differential SCRs (difference between CS+ and CS-) for the CS+". Please clarify what this means and/or how it is calculated exactly.

      Sorry, it means the difference between the SCRs invoked by CS+ and CS- for both CS1+ (CS1+ minus CS-) and CS2+ (CS2+ minus CS-).

      *I suggest that the authors provide a bit more explanation about the thought-control ability questionnaire. For example, the type of items, etc, as this is not a very commonly used questionnaire in the fear conditioning field.

      We provided a brief introduction to the thought-control ability questionnaire in the methods section:

      “The control ability over intrusive thought was measured by the 25-item Thought-Control Ability Questionnaire (TCAQ) scale (30). Participants were asked to rate on a five-point Likert-type scale the extent to which they agreed with the statement from 1 (completely disagree) to 5 (completely agree). At the end of the experiments, all participants completed the TCAQ scale to assess their perceived control abilities over intrusive thoughts in daily life (17).”

      We have added further description of the item types to the TCAQ scale.

      *The authors excluded more than 25% of the participants. It would be interesting to hear reasons for this relatively large number and some reflection on whether they think this selection affects their results (e.g., could being a (non)responder in skin conductance influence the susceptibility to reactivation-extinction in some way?).

      Participant exclusion rates in SCR studies are relatively high (Hu et al., 2018, Liu et al., 2014, Raio et al., 2017, Schiller et al., 2010, Schiller et al., 2012, Wang et al., 2021). In our tasks, the non-responders were mostly participants tested in the winter. Cold weather and dry skin in the winter are likely to have made the SCR hard to measure (Bauer et al., 2022, Vila, 2004).

      *Minor comments that the authors may want to consider:

      - Please explain abbreviations upon first use, e.g., TMS.

      - In Figure 6, it is a bit counterintuitive that the right Y-axis goes from high to low.

      We added the explanation of TMS:

      “Continuous theta burst stimulation (cTBS), a specific form of repetitive transcranial magnetic stimulation (rTMS)…”

      We agree that the right Y-axis is rather counterintuitive. However, since the fear recovery index (which is what we measured in the experiment) and the short/long-term amnesia effect run in opposite directions, plotting one index from low to high inevitably causes the other to go from high to low.

      References:

      Anderson, M. C. and Floresco, S. B. 2022. Prefrontal-hippocampal interactions supporting the extinction of emotional memories: The retrieval stopping model. Neuropsychopharmacology, 47, 180-195.

      Anderson, M. C. and Green, C. 2001. Suppressing unwanted memories by executive control. Nature, 410, 366-9.

      Bauer, E. A., Wilson, K. A. and Macnamara, A. 2022. 3.03 - cognitive and affective psychophysiology. In: ASMUNDSON, G. J. G. (ed.) Comprehensive clinical psychology (second edition). Oxford: Elsevier.

      Baum, M. 1968. Reversal learning of an avoidance response and the kamin effect. J Comp Physiol Psychol, 66, 495-7.

      Borgomaneri, S., Battaglia, S., Garofalo, S., Tortora, F., Avenanti, A. and Di Pellegrino, G. 2020. State-dependent tms over prefrontal cortex disrupts fear-memory reconsolidation and prevents the return of fear. Curr Biol, 30, 3672-3679.e4.

      Cain, C. K., Blouin, A. M. and Barad, M. 2003. Temporally massed cs presentations generate more fear extinction than spaced presentations. J Exp Psychol Anim Behav Process, 29, 323-33.

      Carroll, M., Campbell-Ratcliffe, J., Murnane, H. and Perfect, T. 2007. Retrieval-induced forgetting in educational contexts: Monitoring, expertise, text integration, and test format. European Journal of Cognitive Psychology, 19, 580-606.

      Chan, J. C. K. 2009. When does retrieval induce forgetting and when does it induce facilitation? Implications for retrieval inhibition, testing effect, and text processing. Journal of Memory and Language, 61, 153-170.

      Gagnepain, P., Henson, R. N. and Anderson, M. C. 2014. Suppressing unwanted memories reduces their unconscious influence via targeted cortical inhibition. Proc Natl Acad Sci U S A, 111, E1310-9.

      Gershman, S. J., Jones, C. E., Norman, K. A., Monfils, M. H. and Niv, Y. 2013. Gradual extinction prevents the return of fear: Implications for the discovery of state. Front Behav Neurosci, 7, 164.

      Gershman, S. J., Monfils, M. H., Norman, K. A. and Niv, Y. 2017. The computational nature of memory modification. Elife, 6.

      Hu, J., Wang, W., Homan, P., Wang, P., Zheng, X. and Schiller, D. 2018. Reminder duration determines threat memory modification in humans. Sci Rep, 8, 8848.

      Kamin, L. J. 1957. The retention of an incompletely learned avoidance response. J Comp Physiol Psychol, 50, 457-60.

      Kindt, M. and Soeter, M. 2018. Pharmacologically induced amnesia for learned fear is time and sleep dependent. Nat Commun, 9, 1316.

      Kindt, M., Soeter, M. and Vervliet, B. 2009. Beyond extinction: Erasing human fear responses and preventing the return of fear. Nat Neurosci, 12, 256-8.

      Liu, J., Zhao, L., Xue, Y., Shi, J., Suo, L., Luo, Y., Chai, B., Yang, C., Fang, Q., Zhang, Y., Bao, Y., Pickens, C. L. and Lu, L. 2014. An unconditioned stimulus retrieval extinction procedure to prevent the return of fear memory. Biol Psychiatry, 76, 895-901.

      Luo, Y.-X., Xue, Y.-X., Liu, J.-F., Shi, H.-S., Jian, M., Han, Y., Zhu, W.-L., Bao, Y.-P., Wu, P., Ding, Z.-B., Shen, H.-W., Shi, J., Shaham, Y. and Lu, L. 2015. A novel ucs memory retrieval-extinction procedure to inhibit relapse to drug seeking. Nature Communications, 6, 7675.

      Monfils, M. H., Cowansage, K. K., Klann, E. and Ledoux, J. E. 2009. Extinction-reconsolidation boundaries: Key to persistent attenuation of fear memories. Science, 324, 951-5.

      Nader, K., Schafe, G. E. and Le Doux, J. E. 2000. Fear memories require protein synthesis in the amygdala for reconsolidation after retrieval. Nature, 406, 722-6.

      Raio, C. M., Hartley, C. A., Orederu, T. A., Li, J. and Phelps, E. A. 2017. Stress attenuates the flexible updating of aversive value. Proc Natl Acad Sci U S A, 114, 11241-11246.

      Schiller, D., Kanen, J. W., Ledoux, J. E., Monfils, M. H. and Phelps, E. A. 2013. Extinction during reconsolidation of threat memory diminishes prefrontal cortex involvement. Proc Natl Acad Sci U S A, 110, 20040-5.

      Schiller, D., Monfils, M. H., Raio, C. M., Johnson, D. C., Ledoux, J. E. and Phelps, E. A. 2010. Preventing the return of fear in humans using reconsolidation update mechanisms. Nature, 463, 49-53.

      Schiller, D., Raio, C. M. and Phelps, E. A. 2012. Extinction training during the reconsolidation window prevents recovery of fear. J Vis Exp, e3893.

      Su, S., Deng, J., Yuan, K., Gong, Y., Zhang, Y., Li, H., Cao, K., Huang, X., Lin, X., Wu, P., Xue, Y., Bao, Y., Shi, J., Shi, L. and Lu, L. 2022. Continuous theta-burst stimulation over the right dorsolateral prefrontal cortex disrupts fear memory reconsolidation in humans. iScience, 25, 103614.

      Vila, J. 2004. Psychophysiological assessment. In: SPIELBERGER, C. D. (ed.) Encyclopedia of applied psychology. New York: Elsevier.

      Wang, Y., Zhu, Z., Hu, J., Schiller, D. and Li, J. 2021. Active suppression prevents the return of threat memory in humans. Commun Biol, 4, 609.

      Xue, Y. X., Luo, Y. X., Wu, P., Shi, H. S., Xue, L. F., Chen, C., Zhu, W. L., Ding, Z. B., Bao, Y. P., Shi, J., Epstein, D. H., Shaham, Y. and Lu, L. 2012. A memory retrieval-extinction procedure to prevent drug craving and relapse. Science, 336, 241-5.

      Zhu, Z., Anderson, M. C. and Wang, Y. 2022. Inducing forgetting of unwanted memories through subliminal reactivation. Nature communications, 13, 6496-6496.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Summary:

      The authors examine the eigenvalue spectrum of the covariance matrix of neural recordings in the whole-brain larval zebrafish during hunting and spontaneous behavior. They find that the spectrum is approximately power law, and, more importantly, exhibits scale-invariance under random subsampling of neurons. This property is not exhibited by conventional models of covariance spectra, motivating the introduction of the Euclidean random matrix model. The authors show that this tractable model captures the scale invariance they observe. They also examine the effects of subsampling based on anatomical location or functional relationships. Finally, they briefly discuss the benefit of neural codes which can be subsampled without significant loss of information.

      Strengths:

      With large-scale neural recordings becoming increasingly common, neuroscientists are faced with the question: how should we analyze them? To address that question, this paper proposes the Euclidean random matrix model, which embeds neurons randomly in an abstract feature space. This model is analytically tractable and matches two nontrivial features of the covariance matrix: approximate power law scaling, and invariance under subsampling. It thus introduces an important conceptual and technical advance for understanding large-scale simultaneously recorded neural activity.

      Weaknesses:

      The downside of using summary statistics is that they can be hard to interpret. Often the finding of scale invariance, and approximate power law behavior, points to something interesting. But here caution is in order: for instance, most critical phenomena in neural activity have been explained by relatively simple models that have very little to do with computation (Aitchison et al., PLoS CB 12:e1005110, 2016; Morrell et al., eLife 12, RP89337, 2024). Whether the same holds for the properties found here remains an open question.

      We are grateful for the thorough and constructive feedback provided on our manuscript. We have addressed each point raised by you.

      Regarding the main concern about power law behavior and scale invariance, we would like to clarify that our study does not aim to establish criticality. Instead, we focus on describing and understanding a specific scale-invariant property in terms of collapsed eigenspectra in neural activity. We tested Morrell et al.’s latent-variable model (eLife 12, RP89337, 2024, [1]), where a slowly varying latent factor drives population activity. Although it produces a seemingly power-law-like spectrum, random sampling does not replicate the strict spectral collapse observed in our data (second row in Fig. S23). This highlights that simply adding latent factors does not fully recapitulate the scale invariance we measure, suggesting richer or more intricate processes may be involved in real neural recordings.

      Specifically, we have incorporated five key revisions.

      • As mentioned, we evaluated the latent variable model proposed by Morrell et al. and found that it fails to reproduce the scale-invariant eigenspectra observed in our data; these results are now presented in the Discussion section and supported by a new Supplementary Figure (Fig. S23).

      • We included a comparison with the findings of Manley et al. (2024 [2]) regarding the issue of saturating dimension in the Discussion section, highlighting the methodological differences and their implications.

      • We added a new mathematical derivation in the Methods section, elucidating the bounded dimensionality using the spectral properties of our model.

      • We have added a sentence in the Discussion section to further emphasize the robustness of our findings by demonstrating their consistency across diverse datasets and experimental techniques.

      • We have incorporated a brief discussion on the implications for neural coding (lines 330-332). In particular, Fisher information can become unbounded when the slope of the power-law rank plot is less than one, as highlighted in the recent work by Moosavi et al. (bioRxiv 2024.08.23.608710, Aug, 2024 [3]).

      We believe these revisions address the concerns raised during the review process and collectively strengthen our manuscript, providing a more comprehensive and robust understanding of the geometry and dimensionality of brain-wide activity. We appreciate your consideration of our revised manuscript and look forward to your feedback.

      Recommendations for the authors:

      In particular, in our experience replies to the reviewers are getting longer than the paper, and we (and I’m sure you!) want to avoid that. Maybe just reply explicitly to the ones you disagree with? We’re pretty flexible on our end.

      (1) The main weakness, from our point of view, is whether the finding of scale invariance means something interesting, or should be expected from a null model. We can suggest such a model; if it is inconsistent with the data, that would make the results far more interesting.

      Morrell et al. (eLife 12, RP89337, 2024 [1]) suggest a very simple model in which the whole population is driven by a slowly time-varying quantity. It would be nice to determine whether it matched this data. If it couldn’t, that would add some evidence that there is something interesting going on.

      We appreciate your insightful suggestion to consider the model proposed by Morrell et al. (eLife 12, RP89337, 2024 [1]), where a slowly time-varying quantity drives the entire neural population. We conducted simulations using parameters from Morrell et al. [4, 1], as detailed below.

      Our simulations show that Morrell’s model can replicate a degree of scale invariance when using functional sampling, or RG as it is referred to in Morrell et al., 2021, PRL [4] (FSap, Fig. S23A-D, Author response image 1). However, it fails to fully capture the scale invariance of the collapsing spectra that we observed in the data under random sampling (RSap, Fig. S23E-H). This discrepancy suggests that additional dynamics or structures in the neural activity are not captured by this simple model, indicating the presence of potentially novel and interesting features in the data that merit further investigation.

      Unlike random sampling, the collapse of eigenspectra under functional sampling does not require a stringent condition on the kernel function f(x) in our ERM theory (see Discussion lines 269-275), potentially explaining the differing results between Fig. S23A-D and Fig. S23E-H.

      We have incorporated these findings into the Results section 2.1 (lines 100-101) and the Discussion section (lines 277-282, quoted below):

      “Morrell et al. [4, 1] suggested a simple model in which a slow time-varying factor influences the entire neural population. To explore the effects of latent variables, we assessed if this model explains the scale invariance in our data. The model posits that neural activity is primarily driven by a few shared latent factors. Simulations showed that the resulting eigenspectra differed considerably from our findings (Fig. S23). Although the Morrell model demonstrated a degree of scale invariance under functional sampling, it did not align with the scale-invariant features under random sampling observed in our data, suggesting that this simple model might not capture all crucial features in our observations.”

      Author response image 1:

      Morrell’s latent model. A: We reproduce the results as presented in Morrell et al., PRL 126(11), 118302 (2021) [4]. Parameters are same as Fig. S23A. Sampled 16 to 256 neurons. Unlike in our study, the mean eigenvalues are not normalized to one. Dashed line: eigenvalues fitted to a power law. See also Morrell et al. [4] Fig.1C. Parameters are same as Author response image 1. µ is the power law exponent (black) of the fit, which is different from the µ parameter used to characterize the slow decay of the spatial correlation function, but corresponds to the parameter α in our study.

      (2) The quantification of the degree of scale invariance is done using a “collapse index” (CI), which could be better explained/motivated. The fact that the measure is computed only for the non-leading eigenvalues makes sense but it is not clear when originally introduced. How does this measure compare to other measures of the distance between distributions?

      We thank you for raising this important point regarding the explanation and motivation for our Collapse Index (CI). We defined the CI, rather than using other measures of distance between distributions, for two main reasons. First, the CI provides an intuitive quantification of the shift of the eigenspectrum, motivated by our high-density theory for the ERM model (Eq. 3, Fig. 4A). This high-density theory is only valid for large eigenvalues excluding the leading ones, and hence we compute the CI with a similar restriction on the range of the area integration. Second, assessing the collapse with a distance between distributions (e.g., estimating the distribution of eigenvalues with a kernel density method and then calculating the KL divergence between the two distributions) requires first estimating the distributions. This estimation step introduces errors, such as inaccuracies in estimating the probability of large eigenvalues.

      We agree that a clearer explanation would enhance the manuscript and thus have made modifications accordingly. The CI is now introduced more clearly in the Results section (lines 145-148) and further detailed in the Methods section (lines 630-636). We have also revised the CI diagram in Fig. 4A to better illustrate the shift concept using a more intuitive cartoon representation.

      (3) The paper focuses on the case in which the dimensionality saturates to a finite value as the number of recorded neurons is increased. It would be useful to contrast with a case in which this does not occur. The paper would be strengthened by a comparison with Manley et al. 2024, which argued that, unlike this study, dimensionality of activity in spontaneously behaving head-fixed mice did not saturate.

      Thank you for highlighting this comparison. We have included a discussion (lines 303-309) comparing our approach with Manley et al. (2024) [2]. While Manley et al. [2] primarily used shared variance component analysis (SVCA) to estimate neural dimensionality, they observed that using PCA led to dimensionality saturation (see Figure S4D, Manley et al. [2]), consistent with our findings (Fig. 2D). We acknowledge the value of SVCA as an alternative approach and agree that it is an interesting avenue for future research. In our study, we chose to use PCA for several reasons. PCA is a well-established and widely trusted method in the neuroscience community, with a proven track record of revealing meaningful patterns in neural data. Its mathematical properties are well understood, making it particularly suitable for our theoretical analysis. While we appreciate the insights that newer methods like SVCA can provide, we believe PCA remains the most appropriate tool for addressing our specific research questions.

      (4) More importantly, we don’t understand why dimensionality saturates. For the rank plot given in Eq. 3,

      where k is rank. Using this, one can estimate sums over eigenvalues by integrals. Focusing on the N-dependence, we have

      This gives

      We don’t think you ever told us what µ/d was (see point 13 below), but in the discussion you implied that it was around 1/2 (line 249). In that case, D<sub>PR</sub> should be approximately linear in N. Could you explain why it isn’t?

      Thank you for your careful derivation. Following the line of calculation you suggested, we have now added derivations that use the ERM spectrum to estimate the upper bound of the dimension in the Methods (section 4.14.4). To deduce D<sub>PR</sub> from the spectrum, we focus on the high-density region, where an analytical expression for large eigenvalues λ is given by:

      Here, d is the dimension of the functional space, L is the linear size of the functional space, ρ is the neuron density, and γ is the coefficient in Eq. (3), which depends only on d, µ and E(σ<sup>2</sup>). The primary difference between your derivation and ours is that the eigenvalue λ<sub>r</sub> decays rapidly beyond the threshold r = β(N), which significantly affects the sums of λ<sub>r</sub> and λ<sub>r</sub><sup>2</sup> that enter D<sub>PR</sub>. Since we did not discuss the small eigenvalues in the article, we represent them here as an unknown function η(r,N,L).

      The sum of the eigenvalues is the trace of the covariance matrix C. As emphasized in the Methods section, without changing the properties of the covariance spectrum, we always consider a normalized covariance matrix such that the mean neural activity variance E(σ<sup>2</sup>) = 1. Thus

      rather than

      The issue stems from overlooking that Eq. (3) is valid only for large eigenvalues (λ > 1).

      Using the Cauchy–Schwarz inequality, we have an upper bound of

      Conversely, provides a lower bound of :

      As a result, we must have

      In random sampling (RSap), L is fixed. We thus must have a bounded dimensionality that is independent of N for our ERM model. In functional sampling (FSap), L varies while the neuronal density ρ is fixed, leading to a different scaling relationship of the upper bound, see Methods (section 4.14.4) for further discussion.

      (5) The authors work directly with ROIs rather than attempting to separate the signals from each neuron in an ROI. It would be worth discussing whether this has a significant effect on the results.

      We appreciate your thoughtful question on the potential impact of using ROIs. The use of ROIs likely does not impact our key findings since they are validated across multiple datasets with various recording techniques and animal models, from zebrafish calcium imaging to mouse brain multi-electrode recordings (see Figure S2, S24). The consistency of the scale-invariant covariance spectrum in diverse datasets suggests that ROIs in zebrafish data do not significantly alter the conclusions, and they together enhance the generalizability of our results. We highlight this in the Discussion section (lines 319-323).

      (6) Does the Euclidean random matrix model allow the authors to infer the value of D or µ? Since the measured observables only depend on µ/D it seems that one cannot infer the latent dimension where distances between neurons are computed. Are there any experiments that one could, in principle, perform to measure D or mu? Currently the conclusion from the model and data is that D/µ is a large number so that the spectrum is independent of neuron density rho. What about the heterogeneity of the scales σ<sub>i</sub>, can this be constrained by data?

      Measuring d and µ in the ERM Model

      We agree with you that the individual values of d and µ cannot be determined separately from our analysis. In our analysis using the Euclidean Random Matrix (ERM) model, we fit the ratio µ/d, rather than the individual values of d (dimension of the functional space) or µ (exponent of the distance-dependent kernel function). This limitation is inherent because the model’s predictions for observable quantities, such as the distribution of pairwise correlation, are dependent solely on this ratio.

      Currently there are no directly targeted experiments to measure d. The dimension of the functional space is largely a theoretical construct: it could serve to represent latent variables encoding cognitive factors that are distributed throughout the brain or specific sensory or motor feature maps within a particular brain region. It may also be viewed as the embedding space to describe functional connectivity between neurons. Thus, a direct experimental measurement of the dimension of the functional space could be challenging. Although there are variations in the biological interpretation of the functional space, the consistent scale invariance observed across various brain regions indicates that the neuronal relationships within the functional space can be described by a uniform, slowly decaying kernel function.
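
      A short scaling argument may help show why only the ratio enters. This is a sketch under assumptions that are not taken from the manuscript: neurons are uniformly distributed in a d-dimensional box of linear size L, the kernel behaves as f(r) ≈ (ε/r)<sup>µ</sup> for ε ≪ r ≪ L, and V<sub>d</sub> denotes the volume of the unit ball in d dimensions.

```latex
\begin{align}
P\big(\lVert x_i - x_j \rVert < r\big) &\approx V_d \left(\frac{r}{L}\right)^{d}, \\
u = f(r) \approx \left(\frac{\epsilon}{r}\right)^{\mu}
  \;&\Longleftrightarrow\; r = \epsilon\, u^{-1/\mu}, \\
P(\text{correlation} > u) &\approx V_d \left(\frac{\epsilon}{L}\right)^{d} u^{-d/\mu}, \\
g(u) = -\frac{\mathrm{d}}{\mathrm{d}u}\, P(\text{correlation} > u)
  &\approx \frac{d}{\mu}\, V_d \left(\frac{\epsilon}{L}\right)^{d} u^{-(1 + d/\mu)} .
\end{align}
```

      Under these assumptions the exponent of the large-correlation tail depends on d and µ only through d/µ, so fitting that tail cannot separate the two parameters.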

      Regarding the Heterogeneity of σ<sub>i</sub>

      The heterogeneity of neuronal activity variances (σ<sub>i</sub><sup>2</sup>) is a critical factor in our analysis. Our findings indicate that this heterogeneity:

      (1) Enhances scale invariance: The covariance matrix spectrum, which incorporates the heterogeneity of σ<sub>i</sub><sup>2</sup>, exhibits stronger scale invariance compared to the correlation matrix spectrum, which imposes σ<sub>i</sub><sup>2</sup> = 1 for all neurons. This observation is supported by both experimental data and theoretical predictions from the ERM model, particularly in the intermediate density regime.

      (2) Can be constrained by data: We fit a log-normal distribution to the experimentally observed σ<sup>2</sup> values to capture the heterogeneity in our model which leads to excellent agreement with data (section 4.8.1). Figure S10 provides evidence for this by directly comparing the eigenspectra obtained from experimental data (Fig S10A-F) with those generated by the fitted ERM model (Fig S10M-R). These results suggest that the data provides valuable information about the distribution of neuronal activity variances.

      In conclusion, the ERM model and our analysis cannot separately determine d and µ. We also highlight that the neuronal activity variance heterogeneity, constrained by experimental data, plays a crucial role in improving the scale invariance.

      (7) Does the fitting procedure for the positions x in the latent space recover a ground truth in your statistical regime (for the number of recorded neurons)? Suppose you sampled some neurons from a Euclidean random matrix theory. Does the MDS technique the authors use recover the correct distances?

      When sampling neurons from a Euclidean random matrix model, we demonstrated numerically that the MDS technique can accurately recover the true distances, provided that the true kernel function f(x) is known. To quantify the precision of recovery, we applied the CCA analysis (Section 4.9) and compared the true coordinates from the original Euclidean random matrix with the fitted coordinates obtained through our MDS procedure. The CCA correlation between the true and fitted coordinates in each spatial dimension is nearly 1 (the difference from 1 is less than 10<sup>−7</sup>). When fitting with experimental data, one source of error arises from parameter estimation. To evaluate this, we assess the estimation error of the fitted parameters. When we choose µ = 0.5 in our ERM model and then fit the distribution of the pairwise correlation (Eq. 21), the estimated value of µ is 0.503 ± 0.007 (standard deviation). Then, we use the MDS-recovered distances to fit the coordinates with the fitted kernel function, which is determined by the estimated parameter. The CCA correlation between the true and fitted coordinates in each direction remains nearly 1 (the difference from 1 is less than 10<sup>−5</sup>).
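
      The recovery check described above can be sketched in the idealized case where the pairwise distances are known exactly. The sketch below is an illustration, not the authors' pipeline: it uses classical (Torgerson) MDS, which may differ from the MDS variant used in the manuscript, and the point count, the dimension, and the helper name canonical_correlations are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 300, 2

# Hypothetical "true" functional coordinates (assumption: uniform in a unit box).
X = rng.uniform(0.0, 1.0, size=(n, d))

# Exact pairwise Euclidean distances between the true coordinates.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Classical MDS: double-centre the squared distances, keep the top-d eigenpairs.
J = np.eye(n) - np.ones((n, n)) / n
gram = -0.5 * J @ (D ** 2) @ J
w, V = np.linalg.eigh(gram)
top = np.argsort(w)[::-1][:d]
Y = V[:, top] * np.sqrt(w[top])          # recovered coordinates (up to rotation/reflection)

def canonical_correlations(A, B):
    """Canonical correlations via orthonormal bases of the centred data."""
    Qa, _ = np.linalg.qr(A - A.mean(axis=0))
    Qb, _ = np.linalg.qr(B - B.mean(axis=0))
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)

rho = canonical_correlations(X, Y)
D_rec = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)

print("canonical correlations:", rho)
print("max distance error:", np.abs(D_rec - D).max())
```

      Because canonical correlations are invariant to rotations and reflections, they are a natural way to compare coordinates that MDS can only recover up to such transformations.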

      (8) l. 49: ”... both the dimensionality and covariance spectrum remain invariant ...”. Just to be clear, if the spectrum is invariant, then the dimensionality automatically is too. Correct?

      Thanks for the question. In fact, there is no direct causal relationship between eigenvalue spectrum invariance and dimensionality invariance, as we elaborate below and discuss in the newly added lines 311-317. For eigenvalue spectrum invariance, we focus on the large eigenvalues, whereas dimensionality invariance considers the second-order statistics of all eigenvalues. Consequently, the invariance results for these two concepts may differ. Dimensional and spectral invariance also have different requirements:

      (1) The condition for dimensional saturation is finite mean square covariance

      The participation ratio D<sub>PR</sub> for random sampling (RSap) is given by Eq. 5:

      This expression becomes invariant as N → ∞ if the mean square covariance is finite. In contrast, neural dynamics models, such as the balanced excitatory-inhibitory (E-I) neural network [5], exhibit a different behavior, where , leading to unbounded dimensionality (see discussion lines 291-295, section 6.9 in SI).

      (2) The requirements for spectral invariance involving the kernel function

      In our Euclidean Random Matrix (ERM) model, the eigenvalue distribution follows:

      For spectral invariance to emerge, the eigenvalue distribution must remain unchanged after sampling. Since sampling reduces the neuronal density ρ, the ratio µ/d must approach 0 to maintain invariance.

      We can also demonstrate that D<sub>PR</sub> is independent of density ρ in the large N limit (see our answer to question 4).

      In conclusion, there is no causal relationship between spectral invariance and dimensionality invariance. This is also the reason why we need to consider both properties separately in our analysis.
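
      The saturation condition in point (1) can be illustrated with a toy covariance matrix. The sketch below is not taken from the manuscript: it assumes the standard definition D<sub>PR</sub> = Tr(C)<sup>2</sup>/Tr(C<sup>2</sup>), unit variances, and a single common off-diagonal covariance c. It compares a fixed c (finite mean square covariance) with c<sup>2</sup> = 1/N, a scaling chosen here only to show an unbounded case.

```python
import numpy as np

def participation_ratio(C):
    """D_PR = Tr(C)^2 / Tr(C^2); for symmetric C, Tr(C^2) is the sum of squared entries."""
    return np.trace(C) ** 2 / np.sum(C * C)

def uniform_cov(n, c):
    """Toy covariance: unit variances, every off-diagonal entry equal to c."""
    return (1.0 - c) * np.eye(n) + c * np.ones((n, n))

for n in (100, 400, 1600):
    fixed = participation_ratio(uniform_cov(n, 0.1))                  # mean square covariance fixed at 0.01
    shrinking = participation_ratio(uniform_cov(n, 1.0 / np.sqrt(n)))  # mean square covariance ~ 1/N
    print(f"N={n:5d}  fixed c: D_PR={fixed:7.2f}   c^2=1/N: D_PR={shrinking:7.2f}")
```

      In this toy model the fixed-c case approaches 1/c<sup>2</sup> as N grows, while the c<sup>2</sup> = 1/N case grows roughly as N/2.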

      (9) In Eq. 1, the exact expression, which includes i=j, isn’t a lot harder than the one with i=j excluded. So why i≠j?

      The choice is for illustration purposes. In Eq. 1, we wanted to demonstrate that the dimension saturates to a value independent of N. When dividing the numerator and denominator of this expression by N<sup>2</sup>, the term associated with the off-diagonal entries is independent of the neuron number N, whereas the term associated with the diagonal entries is of order O(1/N) and can be ignored for large N.

      (10) Fig. 2D: Could you explain where the theory line comes from?

      We first estimate the covariance statistics that enter Eq. 5 from all neurons, and then compute D<sub>PR</sub> for different neuron numbers N using Eq. 5. This is further clarified in lines 511-512.

      (11) l 94-5: ”It [scale invariance] is also absent when replacing the neural covariance matrix eigenvectors with random ones, keeping the eigenvalues identical (Fig. 2H).” If eigenvalues are identical, why does the spectrum change?

      The eigenspectra of the full-size covariance matrices are the same by construction, but the eigenspectra of the sampled covariance matrices are different because the eigenvectors affect the sampling results. Please also refer to the construction process described in section 4.3, where this is also discussed: “The composite covariance matrix with substituted eigenvectors in (Fig. 2H) was created as described in the following steps. First, we generated a random orthogonal matrix U<sub>r</sub> (based on the Haar measure) for the new eigenvectors. This was achieved by QR decomposition A=U<sub>r</sub>R of a random matrix A with i.i.d. entries A<sub>ij</sub> ∼ N(0, 1/N). The composite covariance matrix C<sub>r</sub> was then defined as C<sub>r</sub> = U<sub>r</sub>ΛU<sub>r</sub><sup>T</sup>, where Λ is a diagonal matrix that contains the eigenvalues of C. Note that since all the eigenvalues are real and U<sub>r</sub> is orthogonal, the resulting C<sub>r</sub> is a real and symmetric matrix. By construction, C<sub>r</sub> and C have the same eigenvalues, but their sampled eigenspectra can differ.”
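
      The construction quoted above can be reproduced on a toy example. The following is a minimal sketch, not the authors' code: the diagonal covariance with a 1/k spectrum is an assumption chosen only to make the eigenvector structure obvious, and the sign correction after the QR step is a common way to obtain a Haar-distributed orthogonal matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64

# Toy covariance (assumption): power-law eigenvalues with axis-aligned eigenvectors.
eigvals = 1.0 / np.arange(1, n + 1)
C = np.diag(eigvals)

# Random orthogonal eigenvectors from the QR decomposition of a Gaussian matrix.
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
U_r, R = np.linalg.qr(A)
U_r = U_r * np.sign(np.diag(R))            # sign fix so that U_r follows the Haar measure

# Composite covariance: same eigenvalues, substituted eigenvectors.
C_r = U_r @ np.diag(eigvals) @ U_r.T

# Full-size spectra agree by construction.
full_C = np.linalg.eigvalsh(C)
full_Cr = np.linalg.eigvalsh(C_r)

# Spectra of the same randomly sampled half of the neurons differ.
idx = rng.choice(n, size=n // 2, replace=False)
sub_C = np.linalg.eigvalsh(C[np.ix_(idx, idx)])
sub_Cr = np.linalg.eigvalsh(C_r[np.ix_(idx, idx)])

print("full spectra equal:   ", np.allclose(full_C, full_Cr))
print("sampled spectra equal:", np.allclose(sub_C, sub_Cr))
```

      Sampling a principal sub-block depends on how the eigenvectors are spread over neurons, which is why the sampled spectra diverge even though the full spectra coincide.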

      (12) Eq 3: There’s no dependence on the distribution of sigma. Is that correct?

      Indeed, this is true in the high-density regime when the neuron density ρ is large. The p(λ) depends only on E(σ<sup>2</sup>) rather than the distribution of σ (see Eq. 8). However, in the intermediate density regime, p(λ) depends on the distribution of σ (see Eq.9 and Eq.10). In our analysis, we consider E(σ<sup>4</sup>) as a measure of heterogeneity.

      (13) Please tell us the best fit values of µ/d.

      This information is now added in the figure caption of Fig S10: µ/d = [0.456, 0.258, 0.205, 0.262, 0.302, 0.308] in fish 1-6.

      (14) l 133: ”The eigenspectrum is rho-independent whenever µ/d ≈ 0.”

      It looks to me like rho sets the scale but not the shape. Correct? If so, why do we care about the overall scale – isn’t it the shape that’s important?

      Yes, our study focuses on the overall scale and not only the shape, because many models, such as the ERM with other kernel functions, random RNNs, and Morrell’s latent model [4, 1], can exhibit a power-law spectrum. However, these models do not exhibit scale invariance in the sense of collapsing spectrum curves. Therefore, considering the overall scale reveals an additional non-trivial phenomenon.
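
      One way to explore this collapse numerically is to simulate a small ERM and compare rank plots before and after random sampling. The sketch below is an illustration under stated assumptions, not the authors' simulation: the kernel form f(r) = ε<sup>µ</sup>(ε<sup>2</sup> + r<sup>2</sup>)<sup>−µ/2</sup>, the parameter values, and the unit variances (no heterogeneity of σ) are all choices made for this example.

```python
import numpy as np

def erm_covariance(n, d, mu, eps, L, rng):
    """Toy ERM: n points uniform in [0, L]^d with a slowly decaying kernel (assumed form)."""
    x = rng.uniform(0.0, L, size=(n, d))
    dist2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return eps ** mu * (eps ** 2 + dist2) ** (-mu / 2.0)

def rank_plot(C):
    """Eigenvalues in descending order against normalized rank (rank / N)."""
    lam = np.sort(np.linalg.eigvalsh(C))[::-1]
    rank = np.arange(1, lam.size + 1) / lam.size
    return rank, lam

rng = np.random.default_rng(0)
n_full = 512
C = erm_covariance(n=n_full, d=2, mu=0.5, eps=0.03, L=10.0, rng=rng)   # mu/d = 0.25

# Random sampling: keep half of the neurons in the same functional space.
keep = rng.choice(n_full, size=n_full // 2, replace=False)
rank_full, lam_full = rank_plot(C)
rank_half, lam_half = rank_plot(C[np.ix_(keep, keep)])

print("top eigenvalue, full vs sampled:", lam_full[0], lam_half[0])
```

      Plotting lam_full against rank_full and lam_half against rank_half on log-log axes shows how closely the two curves overlap for a given µ/d; repeating this for larger µ/d is a simple way to see the overall scale shift with density.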

      (15) Figs. 3 and 4: Are the grey dots the same as in previous figures? Either way, please specify what they are in the figure caption.

      Yes, they are the same, and thank you for pointing it out. It has been specified in the figure caption now.

      (16) Fig. 4B: Top is correlation matrix, bottom is covariance matrix, correct? If so, that should be explicit. If not, it should be clear what the plots are.

      That is correct. Both matrices (correlation - top, covariance - bottom) are labeled in the figure caption and plot (text in the lower left corner).

      (17) l 158: ”First, the shape of the kernel function f(x) over a small distance ...”. What does ”over a small distance” mean?

      We thank you for seeking clarification on this point. We understand that the phrase ”over a small distance” could be made clearer, and we have revised the explanation in lines 164-165. Here, “over a small distance” refers to modifications of the particular kernel function f(x) we use (Eq. 11) near x = 0 in the functional space, while preserving the overall power-law decay at larger distances. The t-distribution based f(x) (Eq. 11) has a natural parameter ϵ that describes the transition to near 0. So we modified f(x) in different ways, all within this interval of |x| ≤ ϵ, and considered different values of ϵ. Table S3 and Figure S7 provide a summary of these modifications. Figure S7 visually compares these modifications to the standard power-law kernel function, highlighting the differences in shape near x = 0.

      Our findings indicate that these alterations to the kernel function at small distances do not significantly affect the distribution of large eigenvalues in the covariance spectrum. This supports our conclusion that the large eigenvalues are primarily determined by the slow decay of the kernel function at larger distances in the functional space, as this characteristic governs the overall correlations in neural activity.

      (18) l390 . This x<sub>i</sub> is, we believe, different from the x<sub>i</sub> which is position in feature space. Given the difficulty of this paper, it doesn’t help to use the same symbol to mean two different things. But maybe we’re wrong?

      Thank you for your careful reading and suggestion. Indeed here x<sub>i</sub> was representing activity rather than feature space position. We have thus revised the notation (Line 390 has been updated to line 439 as well.):

      In this revised notation, a<sub>i</sub>(t) represents the neural activity of neuron i at time t (typically the firing rate we infer from calcium imaging), and its time average is simply the mean activity of neuron i. Meanwhile, we keep x<sub>i</sub> exclusively for denoting positions in the functional space.

      This change should make it much easier to distinguish between neural activity measurements and spatial coordinates in the functional space.

      (19) Eq. 19: is it correct that g(u) is not normalized to 1? If so, does that matter?

      It is correct that the approximation of g(u) is not normalized to 1, as Eq. 19 provides an approximation suitable only for small pairwise distances (i.e., large correlation). Therefore, we believe this does not pose an issue. We have newly added this note in lines 691-693.

      (20) I get a different answer in Eq. 20:

      Whereas in Eq. 20,

      µ

      Which is correct?

      Thank you for your careful derivation. We believe the difference arises in the calculation of g(u). In our calculations:

      ,

      (Your first equation seems to have missed a 1/µ in R’s exponent.)

      ,

      That is, Eq. 20 is correct. From these, we obtain

      rather than

      We hope this clarifies the question.

      (21) I’m not sure we fully understand the CCA analysis. First, our guess as to what you did: After sampling (either Asap or Fsap), you used ERM to embed the neurons in a 2-D space, and then applied canonical correlation analysis (CCA). Is that correct? If so, it would be nice if that were more clear.

      We first used ERM to embed all the neurons in a 2-D functional space, before any sampling. Once we have the embedding, we can quantify how similar the functional coordinates are with the anatomical coordinates using R<sub>CCA</sub> (section 2.4). We can then use the anatomical and functional coordinates to perform ASap and FSap, respectively. Our theory in section 2.4 predicts the effect on dimension under these samplings given the value of R<sub>CCA</sub> estimated earlier (Fig. 5D). The detailed description of the CCA analysis is in section 4.9, where we explain how CCA is used to find the axes in both anatomical and functional spaces that maximize the correlation between projections of neuron coordinates.

      As to how you sampled under Fsap, I could not figure that out – even after reading supplementary information. A clearer explanation would be very helpful.

      Thank you for your feedback. Functional sampling (FSap) entails the expansion of regions of interest (ROIs) within the functional space, as illustrated in Figure 5A, concurrently with the calculation of the covariance matrix for all neurons contained within the ROI. Technically, we implemented the sampling using the RG approach [6], which is further elaborated in Section 4.12 (lines 852-899), quoted below.

      Stage (i): Iterative Clustering. We begin with N<sub>0</sub> neurons, where N<sub>0</sub> is assumed to be a power of 2. In the first iteration, we compute Pearson’s correlation coefficients for all neuron pairs. We then search greedily for the most correlated pairs and group the half pairs with the highest correlation into the first cluster; the remaining neurons form the second cluster. For each pair (a,b), we define a coarse-grained variable according to:

      ,

      Where normalizes the average to ensure unit nonzero activity. This process reduces the number of neurons to N<sub>1</sub> = N<sub>0</sub>/2. In subsequent iterations, we continue grouping the most correlated pairs of the coarse-grained neurons, iteratively reducing the number of neurons by half at each step. This process continues until the desired level of coarse-graining is achieved.

      When applying the RG approach to ERM, instead of combining neural activity, we merge correlation matrices to traverse different scales. During the kth iteration, we compute the coarse-grained covariance as:

      and the variance as:

      Following these calculations, we normalize the coarse-grained covariance matrix to ensure that all variances are equal to one. Note that these coarse-grained covariances are only used in stage (i) and not used to calculate the spectrum.

      Stage (ii): Eigenspectrum Calculation. The calculation of eigenspectra at different scales proceeds through three sequential steps. First, for each cluster identified in Stage (i), we compute the covariance matrix using the original firing rates of neurons within that cluster (not the coarse-grained activities). Second, we calculate the eigenspectrum for each cluster. Finally, we average these eigenspectra across all clusters at a given iteration level to obtain the representative eigenspectrum for that scale.

      In stage (ii), we calculate the eigenspectra of the sub-covariance matrices across different cluster sizes as described in [6]. Let N<sub>0</sub> = 2<sup>n</sup> be the original number of neurons. To reduce it to size N = N<sub>0</sub>/2<sup>k</sup> = 2<sup>n-k</sup>, where k indexes the reduction step, consider the coarse-grained neurons in step n-k in stage (i). Each coarse-grained neuron is a cluster of 2<sup>n-k</sup> neurons. We then calculate the spectrum of the block of the original covariance matrix corresponding to the neurons of each cluster (there are 2<sup>k</sup> such blocks). Lastly, an average of these 2<sup>k</sup> spectra is computed.

      For example, when reducing from N<sub>0</sub> = 2<sup>3</sup> = 8 to N = 2<sup>3−1</sup> = 4 neurons (k = 1), we would have two clusters of 4 neurons each. We calculate the eigenspectrum for each 4x4 block of the original covariance matrix, then average these two spectra together. To better understand this process through a concrete example, consider a hypothetical scenario where a set of eight neurons, labeled 1,2,3,...,7,8, are subjected to a two-step clustering procedure. In the first step, neurons are grouped based on their maximum correlation pairs, for example, resulting in the formation of four pairs: {1,2},{3,4},{5,6}, and {7,8} (see Fig. S22). Subsequently, the neurons are further grouped into two clusters based on the results of the RG step mentioned above. Specifically, if the correlation between the coarse-grained variables of the pair {1,2} and the pair {3,4} is found to be the largest among all other pairs of coarse-grained variables, the first group consists of neurons {1,2,3,4}, while the second group contains neurons {5,6,7,8}. Next, take the cluster size N = 4 as an example. The eigenspectra of the covariance matrices of the four neurons within each cluster are computed. This results in two eigenspectra, one for each cluster. The correlation matrices used to compute the eigenspectra of different sizes do not involve coarse-grained neurons; they involve the real neurons 1,2,3,...,7,8, but with expanding cluster sizes. Finally, the average of the eigenspectra of the two clusters is calculated.
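
      The two stages described above can be sketched on synthetic activity data. This is a simplified illustration rather than the authors' implementation: the function names, the synthetic data, and the use of an unnormalized sum for the coarse-grained variable are assumptions (Pearson correlation is unchanged by positive rescaling, so the normalization does not affect which pairs are chosen in this sketch).

```python
import numpy as np

def greedy_pairing(activity):
    """Pair units greedily, most correlated pair first; returns a list of (a, b) index pairs."""
    corr = np.corrcoef(activity)
    np.fill_diagonal(corr, -np.inf)                  # never pair a unit with itself
    order = np.argsort(-corr, axis=None)             # flattened indices, highest correlation first
    rows, cols = np.unravel_index(order, corr.shape)
    free = set(range(activity.shape[0]))
    pairs = []
    for a, b in zip(rows.tolist(), cols.tolist()):
        if a < b and a in free and b in free:
            pairs.append((a, b))
            free -= {a, b}
    return pairs

def rg_clusters(activity, n_steps):
    """Stage (i): repeat the pairing n_steps times; returns clusters of original neuron indices."""
    clusters = [[i] for i in range(activity.shape[0])]
    coarse = np.asarray(activity, dtype=float)
    for _ in range(n_steps):
        pairs = greedy_pairing(coarse)
        clusters = [clusters[a] + clusters[b] for a, b in pairs]
        coarse = np.array([coarse[a] + coarse[b] for a, b in pairs])
    return clusters

def mean_cluster_spectrum(activity, clusters):
    """Stage (ii): average the spectra of the original covariance blocks, one block per cluster."""
    C = np.cov(activity)
    spectra = [np.sort(np.linalg.eigvalsh(C[np.ix_(c, c)]))[::-1] for c in clusters]
    return np.mean(spectra, axis=0)

# Synthetic example (assumption): 8 neurons driven by 2 shared latent signals plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2, 1000))
activity = rng.normal(size=(8, 2)) @ latent + rng.normal(size=(8, 1000))

clusters = rg_clusters(activity, n_steps=2)          # two pairing steps: 8 -> 4 -> 2 clusters of 4
spectrum = mean_cluster_spectrum(activity, clusters)
print(clusters)
print(spectrum)
```

      As in the eight-neuron example, the coarse-grained variables are used only to decide the grouping, while the spectra come from blocks of the covariance matrix of the original neurons.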

      (22) Line 37: ”even if two cell assemblies have the same D<sub>PR</sub>, they can have different shapes.” What is meant by shape here isn’t clear.

      Thank you for pointing out this potential ambiguity. The “shape” here refers to the geometric configuration of the neural activity space, characterized as a high-dimensional ellipsoid by the covariance. Specifically, if we denote the eigenvalues of the covariance matrix as λ<sub>1</sub>,λ<sub>2</sub>,...,λ<sub>N</sub>, then √λ<sub>i</sub> corresponds to the length of the i-th semi-axis of this ellipsoid (Figure 1B). As shown in Figure 1C, two neural populations with the same dimensionality (D<sub>PR</sub> = 25/11 ≈ 2.27) exhibit different eigenvalue spectra, leading to differently shaped ellipsoids. This clarification is now included in lines 39-40.
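
      A small numerical example makes the point concrete. The two spectra below are hypothetical values chosen for this illustration (they are not claimed to be the ones plotted in Figure 1C); both give D<sub>PR</sub> = 25/11 under the standard definition D<sub>PR</sub> = (Σλ<sub>i</sub>)<sup>2</sup>/Σλ<sub>i</sub><sup>2</sup>, yet their semi-axis lengths differ.

```python
import numpy as np

def participation_ratio(eigenvalues):
    """D_PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    lam = np.asarray(eigenvalues, dtype=float)
    return lam.sum() ** 2 / np.sum(lam ** 2)

# Two hypothetical covariance spectra with the same sum (5) and sum of squares (11).
spectrum_a = [3.0, 1.0, 1.0]            # one dominant axis
spectrum_b = [7 / 3, 7 / 3, 1 / 3]      # two comparable axes and one thin axis

d_a = participation_ratio(spectrum_a)
d_b = participation_ratio(spectrum_b)

# Semi-axis lengths of the corresponding covariance ellipsoids.
semi_axes_a = np.sqrt(spectrum_a)
semi_axes_b = np.sqrt(spectrum_b)

print(d_a, d_b)
print(semi_axes_a, semi_axes_b)
```

      The first ellipsoid is elongated along a single direction, while the second is closer to a flattened disc, even though the participation ratio is identical.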

      (23) Please discuss if any information about the latent dimension or kernel function can be inferred from the measurements.

      As in our response to comment (6), we would like to clarify that in our analysis using the Euclidean Random Matrix (ERM) model, we fit the ratio µ/d, rather than the individual values of d (dimension of the functional space) or µ (exponent of the distance-dependent kernel function). This limitation is inherent because the model’s predictions for observable quantities, such as the eigenvalue spectrum of the covariance matrix, are dependent solely on this ratio.

      For the kernel function, once d is chosen, we can infer the general shape of the kernel function from data (Figs S12 and S13), up to a certain extent (see also lines 164-166). In particular, we can compare the eigenspectrum of the simulation results for different kernel functions with the eigenspectrum of our data. This allows us to qualitatively exclude certain kernel functions, such as the exponential and Gaussian kernels (Fig. S4), which show clear differences from our data.

      References

      (1) M. C. Morrell, I. Nemenman, A. Sederberg, Neural criticality from effective latent variables. eLife 12, RP89337 (2024).

      (2) J. Manley, S. Lu, K. Barber, J. Demas, H. Kim, D. Meyer, F. M. Traub, A. Vaziri, Simultaneous, cortex-wide dynamics of up to 1 million neurons reveal unbounded scaling of dimensionality with neuron number. Neuron (2024).

      (3) S. A. Moosavi, S. S. R. Hindupur, H. Shimazaki, Population coding under the scale-invariance of high-dimensional noise (2024).

      (4) M. C. Morrell, A. J. Sederberg, I. Nemenman, Latent dynamical variables produce signatures of spatiotemporal criticality in large biological systems. Physical Review Letters 126, 118302 (2021).

      (5) A. Renart, J. De La Rocha, P. Bartho, L. Hollender, N. Parga, A. Reyes, K. D. Harris, The asynchronous state in cortical circuits. Science 327, 587–590 (2010).

      (6) L. Meshulam, J. L. Gauthier, C. D. Brody, D. W. Tank, W. Bialek, Coarse graining, fixed points, and scaling in a large population of neurons. Physical Review Letters 123, 178103 (2019).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      This study explores the sequence characteristics and features of high-occupancy target (HOT) loci across the human genome. The computational analyses presented in this paper provide insight into the correlation of TF binding and regulatory networks at HOT loci that were regarded as lacking sequence specificity.

      By leveraging hundreds of ChIP-seq datasets from the ENCODE Project to delineate HOT loci in HepG2, K562, and H1-hESC cells, the investigators identified the regulatory significance and participation in 3D chromatin interactions of HOT loci. Subsequent exploration focused on the interaction of DNA-associated proteins (DAPs) with HOT loci using computational models. The models established that the potential formation of HOT loci is likely embedded in their DNA sequences and is significantly influenced by GC contents. Further inquiry exposed contrasting roles of HOT loci in housekeeping and tissue-specific functions spanning various cell types, with distinctions between embryonic and differentiated states, including instances of polymorphic variability. The authors conclude with a speculative model that HOT loci serve as anchors where phase-separated transcriptional condensates form. The findings presented here open avenues for future research, encouraging more exploration of the functional implications of HOT loci.

      Strengths:

      The concept of using computational models to define characteristics of HOT loci is refreshing and allows researchers to take a different approach to identifying potential targets. The major strength of the study lies in the very large number of datasets analyzed, with hundreds of ChIP-seq data sets for both HepG2 and K562 cells as part of the ENCODE project. Such quantitative power allowed the authors to delve deeply into HOT loci, which were previously thought to be artifacts.

      Weaknesses:

      While this study contributes to our knowledge of HOT loci, there are critical weaknesses that need to be addressed. There are questions on the validity of the assumptions made for certain analyses. The speculative nature of the proposed model involving transcriptional condensates either needs further validation or should be toned down. Furthermore, some apparent contradictions exist among the main conclusions, and these either need to be better explained or corrected. Lastly, several figure panels could be better explained or described in the figure legends.

      We thank the reviewer for their valuable comments.

      - We have extended the study and included a new chapter focusing on the condensate hypothesis, added more supporting evidence (including the ones suggested by the reviewer), and made explicit statements on the speculative nature of this model.

      - We have restructured the text to remove the sentences which might be construed as contradictory.

      Reviewer #2 (Public Review):

      Summary:

      The paper 'Sequence characteristic and an accurate model of abundant hyperactive loci in human genome' by Hudaiberdiev and Ovcharenko offers comprehensive analyses and insights about the 'high-occupancy target' (HOT) loci in the human genome. These are considered genomic regions that overlap with transcription factor binding sites. The authors provided very comprehensive analyses of the TF composition characteristics of these HOT loci. They showed that these HOT loci tend to overlap with annotated promoters and enhancers, GC-rich regions, open chromatin signals, and highly conserved regions, and that these loci are also enriched with potentially causal variants with different traits.

      Strengths:

      Overall, the HOT loci' definition is clear and the data of HOT regions across the genome can be a useful dataset for studies that use HepG2 or K562 as a model. I appreciate the authors' efforts in presenting many analyses and plots backing up each statement.

      Weaknesses:

      It is noteworthy that the HOT concept and their signature characteristics as being highly functional regions of the genome are not presented for the first time here. Additionally, I find the main manuscript, though very comprehensive, long-winded and can be put in a shorter, more digestible format without sacrificing scientific content.

      The introduction's mention of the blacklisted region can be rather misleading because when I read it, I was anticipating that we are uncovering new regulatory regions within the blacklisted region. However, the paper does not seem to address the question of whether the HOT regions overlap, if any, with the ENCODE blacklisted regions afterward. This plays into the central assessment that this manuscript is long-winded.

      The introduction also mentioned that HOT regions correspond to 'genomic regions that seemingly get bound by a large number of TFs with no apparent DNA sequence specificity' (this point of 'no sequence specificity' is reiterated in the discussion lines 485-486). However, later on in the paper, the authors also presented models such as convolutional neural networks that take in one-hot-encoded DNA sequence to predict HOT performed really well. It means that the sequence contexts with potential motifs can still play a role in forming the HOT loci. At the same time, lines 59-60 also cited studies that "detected putative drive motifs at the core segments of the HOT loci". The authors should edit the manuscript to clarify (or eradicate) contradictory statements.

      We thank the reviewer for their valuable comments. Below are our responses to each paragraph in the given order:

      We added the following sentence to the introduction, summarizing other publications that studied the functional aspects of HOT loci:

      “Other studies have concluded that these regions are highly functionally consequential regions enriched in epigenetic signals of active regulatory elements such as histone modification regions and high chromatin accessibility”.

      We significantly shortened the manuscript by a) moving the detailed analyses of the computational model to the supplemental materials, and b) shortening the discussions by around half, focusing on core analyses that would be most beneficial to the field.

      Given that the ENCODE guidelines recommend avoiding the blacklisted regions when mapping ChIP-seq (and other NGS) data, we excluded them from our analyzed regions before mapping to the genome. Instead, we relied on the conclusions of other publications on HOT loci, which found that the initial assessments of a fraction of HOT loci resulted from including loci that were later assigned to blacklisted regions.

      We addressed the potential confusion caused by the expression “no sequence specificity” by a) adding a clarification to the sentence in the introduction, which now reads “... with no apparent DNA sequence specificity in terms of detectible binding motifs of corresponding motifs”, and b) removing that phrase from the sentence in the Discussion.

      Reviewer #3 (Public Review):

      Summary:

      Hudaiberdiev and Ovcharenko investigate regions within the genome where a high abundance of DNA-associated proteins are located and identify DNA sequence features enriched in these regions, their conservation in evolution, and variation in disease. Using ChIP-seq binding profiles of over 1,000 proteins in three human cell lines (HepG2, K562, and H1) as a data source they're able to identify nearly 44,000 high-occupancy target loci (HOT) that form at promoter and enhancer regions, thus suggesting these HOT loci regulate housekeeping and cell identity genes. Their primary investigative tool is HepG2 cells, but they employ K562 and H1 cells as tools to validate these assertions in other human cell types. Their analyses use RNA pol II signal, super-enhancer, regular-enhancer, and epigenetic marks to support the identification of these regions. The work is notable, in that it identifies a set of proteins that are invariantly associated with high-occupancy enhancers and promoters and argues for the integration of these molecules at different genomic loci. These observations are leveraged by the authors to argue HOT loci as potential sites of transcriptional condensates, a claim that they are well poised to provide information in support of. This work would benefit from refinement and some additional work to support the claims.

      Comments:

      (1) Condensates are thought to be scaffolded by one or more proteins or RNA molecules that are associated together to induce phase separation. The authors can readily provide from their analysis a check of whether HOT loci exist within different condensate compartments (or a marker for them). Generally, ChIP-seq signal from MED1 and Ronin (THAP11) would be anticipated to correspond with transcriptional condensates of different flavors; other coactivator proteins (e.g., BRD4) would be useful to include as well. Similarly, condensate scaffolding proteins of facultative and constitutive heterochromatin (HP1a and EZH2/1) would augment the authors' model by providing further evidence that HOT loci occur at transcriptional condensates and not heterochromatin condensates. Sites of splicing might be informative as well: splicing condensates (or nuclear speckles) are scaffolded by SRRM/SON, which is probably not in their data set, but members of the serine arginine-rich splicing factor family of proteins can serve as a proxy; SRSF2 is the best studied of this set. This would provide a significant improvement to their proposed model and be expected since the authors note that these proteins occur at the enhancers and promoter regions of highly expressed genes.

      (2) It is curious that MAX is found to be highly enriched without its binding partner Myc. Is Myc's signal simply lower in abundance, or is it absent from HOT loci? How could it be possible that a pair of proteins, which bind DNA as a heterodimer, are found in HOT loci without invoking a condensate model to interpret the results?

      (3) Numerous studies have linked the physical properties of transcription factor proteins to their role in the genome. The authors here provide a limited analysis of the proteins found at different HOT loci by employing GO terms. Is there evidence for specific types of structural motifs, disordered motifs, or related properties of these proteins present in specific loci?

      (4) Condensates themselves possess different emergent properties, but it is a product of the proteins and RNAs that concentrate in them and not a result of any one specific function (condensates can have multiple functions!)

      (5) Transcriptional condensates serve as functional bodies. The notion the authors present in their discussion is not held by practitioners of condensate science, in that condensates exist to perform biochemical functions and are dissolved in response to satisfying that need, not that they serve simply as reservoirs of active molecules. For example, transcriptional condensates form at enhancers or promoters that concentrate factors involved in the activation and expression of that gene and are subsequently dissolved in response to a regulatory signal (in transcription this can be the nascently synthesized RNA itself or other factors). The association reactions driving the formation of active biochemical machinery within condensates are materially changed, as are the kinetics of assembly. It is unnecessary and inaccurate to qualify transcriptional condensates as depots for transcriptional machinery.

      (6) This work has the potential to advance the field by providing a detailed perspective on what proteins are located in what regions of the genome. Publication of this information alongside the manuscript would advance the field materially.

      We thank the reviewer for constructive comments and suggestions. Below are our point-by-point responses:

      (1) We added a new short section “Transcriptional condensates as a model for explaining the HOT regions” with additional support for the condensate hypothesis, wherein some of the points raised here were addressed. Specifically, we used a curated database of LLPS proteins (CD-CODE) and provided statistics on the annotated condensate-related DAPs.

      Regarding the DAPs mentioned in this question, we observed that the distributions of the corresponding ChIP-seq peaks confirm the patterns expected by the reviewer (Author response image 1). Namely:

      - MED1 and Ronin (THAP11) are abundant in the HOT loci, being present in 67% and 64% of HOT loci, respectively.

      - While BRD4 is present in 28% of the HOT loci, we observed that the DAPs with annotated LLPS activity ranged from 3% to 73%, providing further support for the condensate hypothesis.

      - The ENCODE database does not contain a ChIP-seq dataset for HP1A. EZH2 peaks were absent in the HOT loci (0.4% overlap), suggesting the lack of heterochromatin condensate involvement.

      - Serine-rich splicing factor family proteins were present only in 7.7% of the HOT loci, suggesting the absence or limited overlap with splicing condensates or nuclear speckles.

      Author response image 1.

      (2) In this study we selected the TF ChIP-seq datasets with stringent quality metrics, excluding those that had audit warnings or errors attached. As a result, the set of DAPs analyzed in HepG2 did not include MYC, since the corresponding ChIP-seq dataset had the audit warning tags of "borderline replicate concordance, insufficient read length, insufficient read depth, extremely low read depth". Analyses in K562 and H1 did include the MYC ChIP-seq dataset (alongside MAX).

      To address this question, we added the mentioned ChIP-seq dataset (ENCODE ID: ENCFF800JFG) and analyzed the colocalization patterns of MYC and MAX. We observed that the MYC ChIP-seq peaks in HepG2 display spurious results, overlapping with only 5% of HOT loci. Meanwhile in K562 and H1, MYC and MAX are jointly present in 54% and 44% of the HOT loci, respectively (Author response image 2).

      Author response image 2.

      These observations were also supported by the Jaccard indices between the MYC and MAX ChIP-seq peaks. For this analysis, we calculated the pairwise Jaccard index between MYC and MAX and divided it by the average Jaccard index of 2000 randomly selected DAP pairs. In K562 and H1, the Jaccard indices between MYC and MAX are 5.72x and 2.53x greater than the random background, respectively. For HepG2, the ratio was 0.21x, indicating that the HepG2 MYC ChIP-seq dataset is likely erroneous.

      Author response image 3.
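
      The normalization described above (observed Jaccard index divided by the mean Jaccard index of randomly drawn DAP pairs) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: it assumes each DAP's peaks have already been reduced to a set of locus identifiers, and the names `jaccard`, `jaccard_enrichment`, and `peaks` are hypothetical.

```python
import random


def jaccard(a, b):
    """Jaccard index of two sets: |A & B| / |A | B| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_enrichment(peaks, dap_x, dap_y, n_pairs=2000, seed=0):
    """Observed Jaccard index of (dap_x, dap_y) relative to a random background.

    `peaks` maps a DAP name to the set of locus IDs it occupies (an assumed,
    simplified representation of ChIP-seq peaks). The background is the mean
    Jaccard index over `n_pairs` randomly drawn pairs of distinct DAPs.
    """
    rng = random.Random(seed)
    names = sorted(peaks)
    background = []
    for _ in range(n_pairs):
        p, q = rng.sample(names, 2)  # two distinct DAPs
        background.append(jaccard(peaks[p], peaks[q]))
    mean_background = sum(background) / len(background)
    observed = jaccard(peaks[dap_x], peaks[dap_y])
    return observed / mean_background if mean_background > 0 else float("inf")


# Toy data: MYC and MAX share all loci, the other DAPs share none.
toy_peaks = {
    "MYC": {1, 2, 3, 4},
    "MAX": {1, 2, 3, 4},
    "X": {10},
    "Y": {20},
    "Z": {30},
}
ratio = jaccard_enrichment(toy_peaks, "MYC", "MAX")
```

      A ratio well above 1 indicates colocalization beyond the random background, while a ratio below 1 (as reported for HepG2) indicates less overlap than expected for an arbitrary DAP pair.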

      (3) Despite numerous publications focusing on different structural domains in transcription factors, we could not find an extensive database or a survey study focusing on annotations of structural motifs in human TFs. Conducting a survey at such a scale would therefore be outside the scope of this study. We added only the analysis of intrinsically disordered regions, as it pertains to the condensate hypothesis. To emphasize this shortcoming, we added the following sentence to the end of the discussion section.

      “Further, one of the hallmarks of LLPS proteins that have been associated with their abilities to phase-separate is the overrepresentation of certain structural motifs, which we did not pursue due to size limitations.”

      (4, 5) We agree with these statements and thank the reviewer for pointing out this faulty statement. We modified the parts of the discussion related to condensates and removed the part where we implied that condensates mostly serve a single function as a TF reservoir.

      (6) We added a table to the supplemental materials (Zenodo repository) with detailed annotation of HOT and non-HOT DAP-bound loci in the genome.

      Recommendations for the authors:

      Reviewing Editor (Recommendations For The Authors):

      The clause with "inadequate" would be dropped if the authors sufficiently address reviewer concerns about clarity of writing, including:

      (1) Editing the title to better reflect the findings of the paper.

      (2) Making clear that the condensate model is speculative and not explicitly tested in this study (and may be better described as a hypothesis).

      (3) Resolving apparent contradictions regarding DNA sequence specificity and the interpretation of ChIP-seq signal intensity.

      (4) Better specifying and justifying model parameters, thresholds, and assumptions.

      (5) Shortening the manuscript to emphasize the main, well-supported claims and to enhance readability (especially the discussion section).

      We thank the Editor for their work. We followed their advice and implemented changes and additions to address all 5 points.

      Reviewer #1 (Recommendations For The Authors):

      (1) The title "Sequence characteristics and an accurate model of abundant hyperactive loci in the human genome" does not accurately reflect the findings of the paper. We are unclear as to what the 'accurate model' refers to. Is it the proposed model 'based on the existence of large transcriptional condensates' (abstract)? If so, there are concerns below regarding this statement (see comment 2). If the authors are referring to the computational modeling presented in Figure 5, it is unclear that any one of them performed that much better than the others and the best single model was not identified. Furthermore, the models being developed in the study constitute only a portion of the paper and lacked validation through additional datasets. Additionally, sequence characteristics were not a primary focus of the study. Only figure 5 talks about the model and sequence characteristics, the rest of the figures are left out of the equation.

      We agree with the reviewer and thank them for the suggestion to clarify the intended meaning.

      (1) We changed the title and clarified that the computational model is meant:

      “Functional characteristics and a computational model of abundant hyperactive loci in the human genome”.

      (2) We shortened the part of the manuscript discussing the computational models and identified the CNNs as “the best single model”.

      (2) The abstract and discussion (and perhaps the title) propose a model of transcriptional condensates in relation to HOT loci. However, there is no data provided in the manuscript that relates to condensates. Therefore, anything relating to condensates is primarily speculative. This distinction needs to be properly made, especially in the abstract (and cannot be included in the title). Otherwise, these statements are misleading. Although the field of transcriptional condensates is relatively new, there have been several factors studied. The authors could include in Figure 2d which factors have been shown to form transcriptional condensates. This might provide some support for the model, though it would still largely remain speculative unless further testing is done.

      We added a new short section, “Transcriptional condensates as a model for explaining the HOT regions”, with additional analyses testing the condensate hypothesis. We provided supportive evidence by analyzing metrics used as hallmarks of condensates, including the distributions of annotated condensate-related proteins, nascent transcription, and protein-RNA interaction levels in HOT loci. Still, we acknowledge that this is a speculative hypothesis, and we clarified that with the following statement in the discussion:

      “It is important to note here that our proposed condensate model is a speculative hypothesis. Further experimental studies in the field are needed to confirm or reject it.”

      (3) Several apparent contradictions exist throughout the manuscript. For example, "HOT locus formation are likely encoded in their DNA sequences" (lines 329-330) vs the proposed model of formation through condensates (abstract). These two statements do not seem compatible, or at the very least, the authors can explain how they are consistent with each other. Another example: "ChIP-seq signal intensity as a proxy for... binding affinity" (line 229) vs. "ChIP-seq signal intensities do not seem to be a function of the DNA-binding properties of the DAPs" (lines 259-260). The first statement is the assumption for subsequent analyses, which has its own concerns (see comment 4). But the conclusion from that analysis seems to contradict the assumption, at least as it is stated.

      In this study, we argue that the two statements may not necessarily contradict each other. We aimed to a) demonstrate that the observed intensity of DAP-DNA interactions as measured by ChIP-seq experiments at HOT loci cannot be explained with direct DNA-binding events of the DAPs alone and b) propose a hypothesis that this observation can be at least partially explained if the HOT loci have the propensity to either facilitate or take part in the formation of transcriptional condensates.

      One of the conditions for condensates to form at enhancers was shown to be the presence of strong binding sites of key TFs (Shrinivas et al. 2019, “Enhancer features that drive the formation of transcriptional condensates”), although that study was conducted using only one TF (OCT4) and one coactivator (MED1). To the best of our knowledge, no such study has been conducted involving many TFs and cofactors simultaneously. We also know that the factors that lead to liquid-liquid phase separation include weak multivalent IDR-IDR, IDR-DNA, and IDR-RNA interactions. As a result, the observed total sum of ChIP-seq peaks in HOT loci reflects the direct DNA-binding events combined with the indirect DAP-DNA interactions, some of which may be facilitated by condensates. Moreover, the fact that CNNs can recognize the HOT loci with high accuracy suggests that there must be an underlying motif grammar specific to HOT loci.

      We emphasized this conclusion in the discussions.

      The comment on using the ChIP-seq signal as a proxy for DNA-binding affinity is addressed under comment 4.

      (4) In lines 229-230, the authors used "the ChIP-seq signal intensity as a proxy for the DAP binding affinity." What is the basis for this assumption? If there is a study that can be referenced, it should be added. However, ChIP-seq signal intensity is generally regarded as a combination of abundance, frequency, or percentage of cells with binding. RNA Pol2 is a good example of this as it has no specific binding affinity but the peak heights indicate level of expression. Therefore, the analyses and conclusions in Figure 4, particularly panel A, are problematic. In addition, clarification from lines 258-260 is needed as it contradicts the earlier premise of the section (see comment 3).

      We thank the reviewer for pointing out this error. The main conclusion of the paragraph is that the average ChIP-seq signal values at HOT loci do not correlate well with the sequence-specificity of TFs. We reworded the paragraph stating that we are analyzing the patterns of ChIP-seq signals across the HOT loci, removing the part that we use them as a proxy for sequence-specific binding affinity.

      (5) In Figure 1A, the authors show that "the distribution of the number of loci is not multimodal, but rather follows a uniform spectrum, and thus, this definition of HOT loci is ad-hoc" (lines 92-95). The threshold to determine how a locus is considered to be HOT is unclear. How did the authors decide to use the current threshold given the uniform spectrum observed? How does this method of calling HOT loci compare to previous studies? How much overlap is there in the HOT loci in this study versus previous ones?

      We moved the corresponding explanation from the supplemental methods to the main methods section of the manuscript.

      Briefly, our reasoning was as follows: assuming that an average TFBS is 8bp long and given that we analyze the loci of length 400bp, we can set the theoretical maximum number of simultaneous binding events to be 50. Hence, if there are >50 TF ChIP-seq peaks in a given 400bp locus, it is highly unlikely that the majority of ChIP-seq peaks can be explained by direct TF-DNA interactions. The condition of >50 TFs corresponded to the last four bins of our binning scale, which was used as an operational definition for HOT loci.
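
      The arithmetic behind this operational threshold can be written out as a short sketch. The function names and default values below are illustrative assumptions taken from the numbers quoted above (400 bp loci, 8 bp average TFBS); the actual HOT definition in the manuscript is based on the authors' binning scale, which is not reproduced here.

```python
def max_simultaneous_sites(locus_len_bp=400, tfbs_len_bp=8):
    """Upper bound on non-overlapping binding sites that fit in one locus."""
    return locus_len_bp // tfbs_len_bp


def exceeds_direct_binding_capacity(n_bound_daps, locus_len_bp=400, tfbs_len_bp=8):
    """True if the number of DAPs with peaks in the locus is greater than the
    theoretical maximum number of simultaneous direct binding events."""
    return n_bound_daps > max_simultaneous_sites(locus_len_bp, tfbs_len_bp)


capacity = max_simultaneous_sites()  # 400 // 8
```

      Under these assumptions, a locus with 51 or more overlapping ChIP-seq peaks cannot be explained by simultaneous, non-overlapping direct binding alone.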

      We have compared our definition of HOT loci to those reported in previous studies by Ramaker et al. and Boyle et al. The results of our analyses are in lines 147-154.

      (6) In Figure 3B, the authors state that of "the loop anchor regions with >3 overlapping loops, 51% contained at least one HOT locus, suggesting an interplay between chromatin loops and HOT loci." However, it is unclear how "51%" is calculated from the figure. Similarly, in the following sentence, "94% of HOT loci are located in regions with at least one chromatin interaction". It is unclear as to how the number was obtained based on the referenced figure.

      Initially, the x-axis labels in Figure 3B were missing, making it hard to understand what we meant. We added the x-axis numbers and changed “51%” to “more than half”. We intended to say that, of the loci with 4 and 5 overlapping loops, exactly 50% contain at least one HOT locus. However, since for x=6 the percentage is 100% (there is only one such locus), the overall percentage is technically “more than half”.

      The percentage of HOT loci engaging in chromatin interaction regions (91%) was calculated by simply overlapping the HOT regions with Hi-C long-range contact anchors. The details of extracting these regions using FitHiChip are described in Supplemental Methods 1.3.

      (7) While we have a limited basis to evaluate computational models, we would like to see a clearer explanation of the model set-up in terms of the number of trained vs. test datasets. In addition, it would be interesting to see if the models can be applied to data from different cell lines.

      We added the table with the sizes of the datasets used for classification in Supplemental Methods 1.6.1.

      Evaluating the models trained on the HOT loci of HepG2 and K562 on other cell lines would pose challenges, since the number of available ENCODE TF ChIP-seq datasets for other cell lines is significantly smaller than for these two. Therefore, we conducted the proposed analysis between the studied cell lines. Specifically, we used the CNN models trained on HOT and regular enhancers of HepG2 and K562. Then, we evaluated each model on the test sets of each classification experiment (Author response image 4). We observed that the classification results of the HOT loci demonstrated a higher level of tissue-specificity compared to the same classification results of the regular enhancers.

      Author response image 4.

      (8) Lines 349-351. The significance of highly expressed genes being more prone to having multiple HOT loci, and vice versa, appears conventional and remains unclear. Intuitively, it makes sense for higher expressed genes to have more of the transcriptional machinery bound, and would bias the analysis. One way to circumvent this is to only analyze sequence-specific TFs and remove ones that are directly related to transcription machinery.

      We thank the reviewer for this suggestion. Our attempt to re-annotate the HOT loci with only sequence-specific TFs led to a significantly different set of loci, which would not be strictly comparable to the HOT loci defined by this study. Analyzing these new sets of loci would create a noticeable departure from the flow of the manuscript and further extend the already long scope of the study.

      Moreover, numerous studies have shown that super-enhancers recruit large numbers of TFs via transcriptional condensates (Boija et al., 2018; Cho et al., 2018; Sabari et al., 2018). We hope that our results can serve as data-driven supportive evidence for those studies.

      (9) Lines 393-396. We would like to see a reference to the models shown in the figures, if these models have been published previously.

      We could not understand the question. Lines 393-396 contain the following sentence:

      “However, many of the features of the loci that we’ve analyzed so far demonstrated similar patterns (GC contents, target gene expressions, ChIP-seq signal values etc.) when compared to the DAP-bound loci in HepG2 and K562, suggesting that albeit limited, the distribution of the DAPs in H1 likely reflects the true distribution of HOT loci.”

      In case the question was about the models that we trained to classify the HOT loci, we deposited the models and codebase in the Zenodo and GitHub repositories.

      (10) Values in Figure 7D are not reflected in the text. Specifically, the text states "Average ... phastCons of the developmental HOT loci are 1.3x higher than K562 and HepG2 HOT loci (Figure 7D)" (lines 408-409). Figure 7D shows conservation scores between HOT enhancers vs promoters for each cell line, and does not seem to reflect the text.

      We modified the figure to reflect the statement appropriately.

      (11) Methodology should include a justification for the use of the Mann-Whitney U-test (non-parametric) over other statistical tests.

      We added the following description to the methods section:

      “For calculating the statistical significance, we used the non-parametric Mann-Whitney U-test when the compared data points are non-linearly correlated and multi-modal. When the data distributions are bell-curve shaped, the Student’s t-test was used.”

      Minor:

      (1) Figure 2b was never mentioned in the paper. This can be added alongside Figure S6C, line 148.

      Indeed, Figure 2B was supposed to be listed together with Figure S6C, which was omitted by mistake. It was corrected.

      (2) Supplementary Figure 8 has two Cs. Needs to be corrected to D.

      Fixed.

      (3) Figure 3B is missing labels on the x-axis.

      Fixed.

      (4) The horizontal bar graph on the bottom left of Figure 1E needs to be described in the figure legend.

      Description added to the figure caption.

      (5) Line 345, Fig 15A should be Fig S15A.

      Corrected.

      Reviewer #2 (Recommendations For The Authors):

      I listed all my concerns about the paper in the public comments. I think the manuscript is very comprehensive and it is valuable, but it should be cut short and presented in a more digestible way.

      We thank the reviewer for their valuable comments and suggestions. We addressed all the concerns listed in the public comments. We shortened the manuscript by reducing the paragraph that focuses on computational classification models and reduced the discussions by about half in length.

      Line 55: What are chromatin-associated proteins, i.e. are they histone modifications?

      To clarify the definition used from the citation we changed the sentence to the following:

      “For instance, Partridge et al. studied the HOT loci in the context of 208 proteins including TFs, cofactors, and chromatin regulators which they called chromatin-associated proteins.”

      Though most of the paper can be cut short to avoid analysis paralysis for readers, there are details that still need filling in. For example, how did the authors perform PCA analysis, i.e. what are the features of each data point in the PCA analysis? Lines 214-215: How do we calculate the number of multi-way contacts in Hi-C data?

      We added clarifying descriptions and changed the mentioned sentences to the following:

      PCA:

      “To analyze the signatures of unique DAPs in HOT loci, we performed a PCA analysis where each HOT locus is represented by a binary (presence/absence) vector of length equal to the total number of DAPs analyzed.”

      Multi-way contacts on loop anchors:

      “To investigate further, we analyzed the loop anchor regions harboring HOT loci and observed that the number of multi-way contacts on loop anchors (i.e. loci which serve as anchors to multiple loops) correlates with the number of bound DAPs (rho=0.84 p-value<10E-4; Pearson correlation).”
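
      The binary presence/absence representation described for the PCA above can be illustrated with a minimal sketch. The function and variable names are hypothetical and the input format (a mapping from locus ID to the DAPs bound there) is an assumption; the resulting rows could then be passed to any standard PCA routine.

```python
def presence_matrix(locus_to_daps, all_daps):
    """Encode each locus as a 0/1 vector over the full list of DAPs.

    locus_to_daps: dict mapping a locus ID to an iterable of bound DAP names
                   (assumed input format).
    all_daps:      ordered list of every DAP analyzed; it fixes the column order.
    Returns a dict mapping locus ID to a list of length len(all_daps).
    """
    column = {dap: i for i, dap in enumerate(all_daps)}
    matrix = {}
    for locus, bound in locus_to_daps.items():
        row = [0] * len(all_daps)
        for dap in bound:
            row[column[dap]] = 1  # mark the DAP as present at this locus
        matrix[locus] = row
    return matrix


# Toy example with three DAPs and two loci.
daps = ["MAX", "MED1", "BRD4"]
vectors = presence_matrix(
    {"locus_1": ["MAX"], "locus_2": ["MAX", "MED1"]},
    daps,
)
```

      Each row has one entry per analyzed DAP, so loci bound by similar sets of DAPs end up close to each other in the resulting feature space.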

      - Lines 251-252: How did the referenced study categorize DAPs? It is important for any manuscript to be self-contained.

      We added the explanation and changed the sentence to the following:

      “To test this hypothesis, we classified the DAPs into those two categories using the definitions provided in the study (Lambert et al. 2018) 28, where the TFs are classified by manual curation through extensive literature review and supported by annotations such as the presence of DNA-binding domains and validated binding motifs. Based on this classification, we categorized the ChIP-seq signal values into these two groups.”

      - Lines 181-185, sentences starting with 'To test' can be moved to the methods, leaving only brief mentions of the statistic tests if needed.

      We removed the mentioned sentence from the main text and moved it to the supplemental methods (1.4).

      - Lines 217-220: I find this sentence extremely redundant unless it can offer more specific insights about a particular set of DAPs or if the DAPs are closer/or a proven distal enhancer to a confirmed causal gene.

      We removed the mentioned sentence from the text.

      - Lines 243-246: How did the authors determine the set DAPs that have stabilizing effects, and how exactly are the 'stabilizing effects' observed/measured?

      We added explanations to Supplemental Methods 3.1 and Fig S18, S19.

      While addressing this comment we realized that the reported value of the ratio is 1.91x, not 1.7x. We corrected that value in the main text and added the p-value.

      - When discussing the phastCons scores analyses, such as in lines 268-271, how did the authors calculate the relationship between phastCons scores and HOT loci, i.e. was the score averaged across the 400-bp locus to obtain a locus-specific conservation score?

      Yes, the per-locus conservation score was obtained by averaging the phastCons scores over the base pairs of each locus. We added this clarification to the methods.
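
      As a minimal illustration of this per-locus averaging, the sketch below takes a list of per-base scores and a half-open, 0-based interval. The function name, the in-memory list, and the coordinate convention are assumptions made for the example, not a description of the authors' pipeline.

```python
def locus_conservation(per_base_scores, start, end):
    """Mean conservation score over the half-open interval [start, end).

    per_base_scores: sequence of per-base scores (e.g. phastCons values) for
                     one chromosome or region, indexed from 0 (assumed layout).
    """
    window = per_base_scores[start:end]
    if not window:
        raise ValueError("empty interval: no bases to average")
    return sum(window) / len(window)


# Toy example: a 4 bp "locus" with scores between 0 and 1.
score = locus_conservation([0.0, 1.0, 0.5, 0.5], 0, 4)
```

      For a 400 bp locus, the same call would average 400 per-base values into a single locus-level score.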

      - Line 311: What is the role of the 'control sets' in the analyses of the sequence's relationship with HOT?

      In this specific case, the control sets are used as background or negative sets to set up the classification tasks. In other words, we are asking whether the HOT loci can be distinguished from random chromatin-accessible regions, promoters, or regular enhancers. We clarified this in the text.

      - I also find the discussion about different machine learning methods that classify HOT loci based on sequence contexts quite redundant UNLESS the authors decide to go further into the features' importance (such as motifs) in the models that predict/ are associated with HOT loci, which in itself can constitute another study.

      We agree with the reviewer, and shortened the part with the discussions of models by limiting it to only 3 main models and moved the rest to the supplemental materials.

      - Can the authors clarify where they obtain data on super-enhancers?

      We obtained the super-enhancer definitions from the original study (Hnisz et al. 2013, PMID: 24119843) where the super-enhancers were defined for multiple cell lines. We clarified this in the methods.

      - Figure 1B, the x and y axis should be clarified.

      We clarified it by using MAX as an example case in the figure caption as follows:

      “Prevalence of DAPs in HOT loci. Each dot represents a DAP. X-axis: percentage of HOT loci in which the DAP is present (e.g. MAX is present in 80% of HOT loci). Y-axis: percentage of total peaks of the DAP that are located in HOT loci (e.g. 45% of all the ChIP-seq peaks of MAX are located in the HOT loci). Dot color and size are proportional to the total number of ChIP-seq peaks of the DAP.”

      Reviewer #3 (Recommendations For The Authors):

      The list of proteins associated with different types of genomic loci at a meta level (enhancers, promoters, and gene body etc.), and an annotation of the genome at the specific loci level.

      The authors use a wide range of acronyms throughout the text and figure legends, they do a reasonably good job, but the main text section "HOT-loci are enriched in causal variants" and Figure 8 would be materially improved if they held it to the same standard.

      Size is a physical property and not a physicochemical property.

      We thank the reviewer for their comments and suggestions. We added a table to supplemental files with detailed annotations of analyzed loci.

      We reviewed the section “HOT loci are enriched in causal variants” and corrected a few mismatches in the acronyms.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary: 

      In this paper, Kalidindi and Crevecoeur ask why sequential movements are sometimes coarticulated. To answer this question, first, they modified a standard optimal controller to perform consecutive reaches to two targets (T1 and T2). They investigated the optimal solution with and without a constraint on the endpoint's velocity in the via target (T1). They observed that the controller coarticulates the movements only when there is no constraint on the speed at the via-point. They characterized coarticulation in two ways: First, T2 affected the curvature of the first reach in unperturbed reaches. Second, T2 affected corrective movements in response to a mechanical perturbation of the first reach. 

      Parallel to the modeling work, they ran the same experiment on human participants. The participants were instructed to either consider T1 as via point (go task) or to slow down in T1 and then continue to T2 (stop task). Mirroring the simulation results, they observed coarticulation only in the go task. Interestingly, in the go task, when the initial reach was occasionally perturbed, the long-latency feedback responses differed for different T2 targets, suggesting that the information about the final target was already present in the motor circuits that mediate the long-latency response. In summary, they conclude that coarticulation in sequential tasks depends on instruction, and when coarticulation happens, the corrections in earlier segments of movement reflect the entirety of the coarticulated sequence.

      Evaluation 

      Among many strengths of this paper, most notably, the results and the experiment design are grounded in, and guided by the optimal control simulation. The methods and procedures are appropriate and standard. The results and methods are explained sufficiently and the paper is written clearly. The results on modulation of long-latency response based on future goals are interesting and of broad interest for future experiments on motor control in sequential movement. However, I find the authors' framing of these results, mostly in the introduction section, somewhat complicated.

      The current version of the introduction motivates the study by suggesting that "coarticulation and separation of sub-movement [in sequential movements] have been formulated as distinct hypotheses" and this apparent distinction, which led to contradictory results, can be resolved by Optimal Feedback Control (OFC) framework in which task-optimized control gains control coarticulation. This framing seems complicated for two main reasons. First, the authors use chunking and coarticulation interchangeably. However, as originally proposed by (Miller 1956), the chunking of the sequence items may fully occur at an abstract level like working memory, with no motoric coarticulation of sequence elements at the level of motor execution. In this scenario, sequence production will be faster due to the proactive preparation of sequence elements. This simple dissociation between chunking and coarticulation may already explain the apparent contradiction between the previous works mentioned in the introduction section. Second, the authors propose the OFC as a novel approach for studying neural correlates of sequence production. While I agree that OFC simulations can be highly insightful as a normative model for understanding the importance of sequence elements, it is unclear to me how OFCs can generate new hypotheses regarding the neural implementation of sequential movements. For instance, if the control gains are summarizing the instruction of the task and the relevance of future targets, it is unclear in which brain areas, or how these control gains are implemented. I believe the manuscript will benefit from making points more clear in the introduction and the discussion sections. 

      We agree that chunking may occur at different levels that do not necessarily involve motor coarticulation. We clarified that our contribution is towards answering why sequence movements sometimes coarticulate, and how the way sequences are executed influences the representation of future goals in the sensorimotor system.

      To address this point, we made the following modifications in the introduction:

      Line 44:

      “It remains unclear how future goals are integrated in the sensorimotor system. For rapid execution of a sequence, one possible solution is to represent multiple goals within low-level control circuits (3, 16), enabling the execution of several elements as a single entity, called “motor chunk”. Note that chunking can also occur at a higher level such as in working memory-guided sequences, which in this case may or may not involve the production of a movement (17, 18).”

      Line 50:

      “Recent neural recordings in the primary motor cortex (M1) have shown no specific influence of future goals on the population responses governing ongoing action (19, 20). Specifically, Zimnik and Churchland (20) observed in a two-reach sequence task that, there was no coarticulation in sub-movement kinematics although the execution got faster with practice. Notably, M1 displayed separate phases of execution related activity for each sub-movement. Using a neural network model, they interpreted that sequence goals could be separated and serially specified to the controller from regions upstream of M1 (Figure 1A). These findings contrast with earlier studies showing coarticulation of sub-movements and whole sequence representations in M1 (21–23). As a result, it has been suggested that coarticulation and separation in rapid sequences may involve distinct computations: coarticulation possibly involves replacing sub-movements with a motor chunk, while separation possibly indicates independent control of each sub-movement with chunking at a higher-level (4, 20).  Thus, there are unresolved questions regarding why sequential movements sometimes coarticulate, and how the representation of future goals in the sensorimotor system influences the way sequences are executed.”

      With respect to the second part of your concern about OFC, we agree that this framework does not make direct predictions about neural implementation and that our statements required clarification. The first link between the model and predictions about neural data follows from the observation that long-latency circuits participate in task-dependent sequence production, thus indicating that transcortical pathways must express this task dependency. The second link between our work and neural activity is that it provides a counterargument to a previous interpretation: Zimnik and Churchland argued that independent or “holistic” sequence production should be associated with different representations in the monkey brain. In contrast, we suggest that the same controller can flexibly generate both kinds of sequences, without implying a different structure in the controller, only a different cost-function. We thus refine the expectation about neural correlates of sequence representations by showing that they potentially relate to the encoding of task constraints.

      To address this point, we added the following changes in the introduction and discussion:

      Line 69 in Introduction: 

      “The theory of optimal feedback control (OFC) has been particularly useful in predicting the influence of numerous task parameters on the controller (27–34), thus reproducing goal-directed motor commands during both unperturbed movements and feedback responses to disturbances (30). OFC has been used in numerous studies to interpret flexible feedback responses occurring in the long-latency response period (30, 35).” 

      Line 454 in Discussion:

      “Although OFC has been predominantly used as a behavioral level framework agnostic to neural activity patterns, it can shed light on the planning, state estimation and execution related computations in the transcortical feedback pathway (Takei et al.,). Using OFC, our study proposes a novel and precise definition of the difference to expect in neural activities in order to identify coarticulated versus independent sequence representations from a computational point of view. Because each condition (i.e., overlapping versus non-overlapping controllers as in Figure 2) was associated with different cost-functions and time-varying control gains, it is the process of deriving these control gains, using the internal representation of the task structure, that may differ across coarticulated and separated sequence conditions. To our knowledge, how and where this operation is performed is unknown. A corollary of this definition is that the preparatory activity (20, 50) may not discern independently planned or coarticulated sequences because these situations imply different control policies (and cost functions), as opposed to different initial states. Moreover, the nature of the sequence representation is potentially not dissociable from its execution for the same reason.”

      Reviewer #2 (Public Review):

      Summary: 

      In this manuscript, the authors examine the question of whether discrete action sequences and coarticulated continuous sequential actions can be produced from the same controller, without having to derive separate control policies for each sequential movement. Using modeling and behavioral experiments, the authors demonstrate that this is indeed possible if the constraints of the policy are appropriately specified. These results are of interest to those interested in motor sequences, but it is unclear whether these findings can be interpreted to apply to the control of sequences more broadly (see weaknesses below). 

      Strengths: 

      The authors provide an interesting and novel extension of the stochastic optimal control model to demonstrate how different temporal constraints can lead to either individual or coarticulated movements. The authors use this model to make predictions about patterns of behavior (e.g., in response to perturbations), which they then demonstrate in human participants both by measuring movement kinematics as well as EMG. Together this work supports the authors' primary claims regarding how changes in task instructions (i.e., task constraints) can result in coarticulated or separated movement sequences and the extent to which the subsequent movement goal affects the planning and control of the previous movement. 

      Weaknesses: 

      I reviewed a prior version of this manuscript, and appreciate the authors addressing many of my previous comments. However, there are some concerns, particularly with regard to how the authors interpret their findings. 

      We thank the reviewer for their continued assessment of our work and for helping us to improve the paper. We are convinced that this and the previous review helped us clarify our work considerably.

      (1) It would be helpful for the authors to discuss whether they think there is a fundamental distinction between a coarticulated sequence and a single movement passing through a via point (or equivalently, avoiding an obstacle). The notion of a coarticulated sequence brings with it the notion of sequential (sub)movements and temporal structure, whereas the latter can be treated as more of a constraint on the production of a single continuous movement. If I am interpreting the authors' findings correctly it seems they are suggesting that these are not truly different kinds of movements at the level of a control policy, but it would be helpful for the authors to clarify this claim. 

      Indeed, this is our interpretation of the results/simulations. This suggestion can also be found in the article on chunking by Ramkumar et al. To clarify this, we added the following statement to the discussion:

      Line 449: 

      “Notably, in the framework of optimal feedback control, an intermediate goal is equivalent to a via-point that constrains the execution of the sequence (similar to (13)). It is thus possible that coarticulation in motor systems be processed similarly as other kinds of movement constraints, such as via-points, avoiding obstacles, or changes in control policies.”

      (2) The authors' model clearly shows that each subsequent target only influences the movement of one target back, but not earlier ones (page 7 lines 199-204). This stands in contrast to the paper they cite from Kashefi 2023, in which those authors clearly show that people account for at least 2 targets in the future when planning/executing the current movement. It would be useful to know whether this distinction arises because of a difference in experimental methodology, or because the model is not capturing something about human behavior.  

      Thank you for raising this point. There are some differences between the study of Kashefi and colleagues (2023) and ours. Both studies looked into the planning of more than one reach. In the study of Kashefi et al., the results in Figure 6 showed that there was no significant curvature in the H2 condition, and that curvature increased in the H3 and H4 conditions (only in the 75 ms dwell-time scenario). Note that the H2 condition in their work meant that the +2 target was presented after the initiation of the +1 reach. Hence, we think the GO task in our case should be compared to the H3 condition, which resulted in curvature similar to that in our study. These authors also showed that curvature increased even in the H4 condition (75 ms dwell). OFC also accommodates this observation if we consider the relationship between the cost of intermediate goals and the spatial location of the targets (see figure below, also added to Supplementary Figure 4). To see this, we performed additional three-target simulations in which the constraint on intermediate goal velocity (at T1 and T2) was varied to achieve similar dwell velocity at the intermediate targets (Supplementary Figure 4C). In this case, the hand curvature of the first reach differed while the dwell velocity was similar across the T3 up and T3 down conditions, as may be instructed experimentally. Again, the task instructions and the spatial location of the future goals together determine how much the first reach components are influenced by the next ones, and this may impact several reaches ahead.

      We added the following clarification to the results to describe this.

      Line 199:

      “It is worth noting that the OFC model can be generalized to longer sequences (10) through the incorporation of additional cost terms (in Equation 10 of Methods) and targets, enabling simultaneous planning for more than two targets. Simulations of a sample three-reach sequence (Supplementary Figure S4) revealed that, varying the cost of dwell velocity at intermediate targets (w2 and w3 parameters in Methods) caused a variation in control gains. Different amount of change in control gains can be expected for intermediate versus late targets (Supplementary Figure 4A). Notably, even when we used the same dwell velocity cost (w2 = w3 = 0), the observed velocity profiles were different between the two sequences towards different final targets (T3 up and T3 down) (Supplementary Figure 4B). We tested a condition in which both sequence reaches were forced to have similar dwell velocity profiles by increasing the dwell velocity costs in the sequence towards one of the targets (T3 down), while leaving this parameter unchanged for the other target (T3 up). In this scenario, T3 up sequence had the parameters (w2, w3) = (0, 0), while T3 down sequence had the parameters (0.8, 0.8). In this case, the curvature of the first reach was different, and predominantly occurred due to differences in K2 between the two sequence reaches (Supplementary Figure S4C). These simulations highlight that, planning for a longer horizon sequence can indirectly influence the curvature of early reaches, due to the interaction between intermediate dwell constraints, spatial arrangement of targets, and sequence horizon in a task dependent manner.”
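
      The dependence of control gains on an intermediate dwell-velocity cost, as described in the passage above, can be illustrated with a minimal sketch. This is not the authors' stochastic OFC model or their Equation 10: it assumes a deterministic finite-horizon LQR for a one-dimensional point mass, and the function name `lqr_gains`, the horizon, the time step, and the effort weight are all hypothetical choices made for illustration. Only the dwell weights 0 and 0.8 echo values quoted in the text.

```python
import numpy as np

def lqr_gains(w_dwell, n_steps=60, k_mid=30, dt=0.01, r=1e-4):
    """Backward Riccati recursion for a 1-D point mass (illustrative only).

    State is [position, velocity]; the control is an acceleration.
    w_dwell penalises velocity at the single intermediate step k_mid,
    standing in for a dwell-velocity cost at an intermediate target.
    """
    A = np.array([[1.0, dt], [0.0, 1.0]])   # discrete-time dynamics
    B = np.array([[0.0], [dt]])
    R = np.array([[r]])                      # effort cost
    S = np.diag([1.0, 1.0])                  # terminal cost on position and velocity
    gains = [None] * n_steps
    for k in range(n_steps - 1, -1, -1):
        # Gain at step k depends on the cost-to-go from step k + 1.
        K = np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
        gains[k] = K
        # State cost is non-zero only at the intermediate step.
        Q = np.diag([0.0, w_dwell]) if k == k_mid else np.zeros((2, 2))
        S = Q + A.T @ S @ (A - B @ K)
    return np.array(gains)                   # shape (n_steps, 1, 2)

g_free = lqr_gains(0.0)    # no dwell-velocity cost
g_dwell = lqr_gains(0.8)   # dwell-velocity cost at the intermediate step
print("max gain change before k_mid:", np.abs(g_free[:30] - g_dwell[:30]).max())
print("max gain change from k_mid on:", np.abs(g_free[30:] - g_dwell[30:]).max())
```

      In this simplified setting, the intermediate cost changes only the gains that precede the intermediate step, because the backward recursion propagates the extra cost toward earlier time steps. The full model additionally includes target states, noise, and state estimation, none of which are represented here.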

      (3) In my prior review I raised a concern that the authors seem to be claiming that because they can use a single control policy for both coarticulated and separated movement sequences, there need not be any higher-level or explicit specification of whether the movements are sequential. While much of that language has been removed, it still appears in a few places (e.g., p. 13, lines 403-404). As previously noted, the authors' control policy can generate both types of movements as long as the proper constraints are provided to the model. However, these constraints must be specified somewhere (potentially explicitly, as the authors do by providing them as task instructions). Moreover, in typical sequence tasks, although some movements become coarticulated, people also tend to form chunks with distinct chunk boundaries, which presumably means that there is at least some specification of the sequential ordering of these chunks that must exist (otherwise the authors' model might suggest that people can coarticulate forever without needing to exhibit any chunk boundaries). Hence the authors should limit themselves to the narrow claim that a single control policy can lead to separated or coarticulated movements given an appropriate set of constraints, but acknowledge that their work cannot speak to where or how those constraints are specified in humans (i.e., that there could still be an explicit sequence representation guiding coarticulation). 

      We thank the reviewer for raising this point. We do not dispute the statement that the controller needs to be set dependent on the constraints of the task that must be specified somewhere. In our view, this problem is similar to the question of how a cost-function (or a task representation) is transformed into a control policy in the brain, which is unknown in general. In the earlier version, our intention was to stress that separation can occur without necessarily implying that the goals be processed independently (as in Figure 1A and Zimnik 2021). To avoid confusion on this point, we modified this statement in the new version as follows:

      Line 405: 

      “A straightforward interpretation could be that the stopping at the first target invoked a completely different strategy in which the control of the two reaches was performed independently (Figure 1A), effectively separating the two movements, whereas executing them rapidly could produce the merging of the two sub-movements into a coarticulated sequence. While this is conceptually valid, it is not necessary and the model provides a more nuanced view: both apparent separation or coarticulation of the two motor patterns can be explained within the same framework of flexible feedback control. These different modes of sequence execution still require proper specification of the task constraints in the model, such as number of intermediate steps, dwell-time, or velocity limit. Such specifications must be considered as input to the controller.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      Line 57: Distinct hypotheses. 

      Line 209, The term "planned holistically" is confusing here. Seems like the authors suggest that the sequence is "planned holistically" as long as all sequence elements are given during the optimization process. 

      We changed the sentence as follows.

      Line 218: 

      “Overall, the model predicted that even if a feedback control policy was computed by optimizing the whole sequence over a long time-horizon, the requirements associated with intermediate goals determine how early in the sequence the second (future) target can influence the feedback controller”

      Line 336, It was not clear to me why the authors explained "the weak significant" results of PEC shortening in R0 given the nonsignificant values in R1. 

      We wanted to be transparent about whether changing the statistical analysis will lead to different interpretations, such as the sequence encoding even before long latency epochs. But we realized that it could lead to confusion and we deleted this sentence in the updated manuscript.

      Reviewer #2 (Recommendations For The Authors): 

      About Weakness #2, to clarify this point the authors should either model and discuss what it would take for their model to account for multiple targets ahead, or else run a study to show that in this task people indeed only ever plan 1 target ahead.  

      Please see our response above (in Weakness #2).

      I am still puzzled by why people would resist the perturbation more when they eventually have to move in the direction of the perturbation (e.g., p 10 lines 313-314). Perhaps this is simply due to the geometry of the task, but it could also depend on what participants were trying to accomplish in the experiment. To help clarify this, the authors should report exactly what instructions were given to participants in each task condition.  

      The simulations suggest that the observed perturbation movements are an optimal way to perform the task given the task constraints on accuracy, control effort and constraints at intermediate goals. The intuition is that modulating the acceleration at the intermediate goal is preferred rather than missing it. This however depends on the cost parameter. 

      Below, in Author response image 1, we show simulations in which the accuracy requirement at the intermediate goal and the total motor cost parameter were varied. As expected, increasing the cost on accuracy of the intermediate reach, or decreasing the cost on motor output, modulated the hand deviation (simulations not included in the article).

      Author response image 1.

      Impact of movement costs (motor effort and intermediate goal reach errors) on the hand path following a mechanical perturbation   

      Our observation suggests that participants’ behaviour agreed with the interpretation that can result from the model. We clarified the exact instructions in the methods section. Note that the instructions were given at the beginning of the task and did not differ across the different conditions involving changes in the location of T2 or perturbation direction:

      Line 594:

      Participants were given the following instructions verbally: “Wait in the starting circle until you receive a GO signal, where the target circles turn red and you will simultaneously hear a beep sound. When the circles turn red, react quickly, move as soon, and as straight as possible to target 1 and then move to target 2. You will get two points at the end of the trial if you reach T1 in the prescribed time window and then move to T2, and in all other cases you will not receive any points. Importantly, once you reach T1 you should try to come out of it quickly. If you stay in T1 for more than 150 ms then T2 will disappear and you will receive only one point. Additionally, in some trials, a force will perturb your hand towards the right or left direction randomly while moving towards T1. The instructions remain the same in the presence of perturbations. Try to score as many points as you can.”

      Additionally, we added the following lines in the results description:

      Line 284:

      “The influence of second target on the lateral hand deviation was qualitatively similar to that observed in model simulations, and counterintuitive to what we might expect without the help of the model simulations. As observed in the model simulations (see also Supplementary Figure S2), lateral hand deviation was smaller when the perturbation was in the direction of the second target (T2) and vice-versa. This was consistent for both rightward and leftward perturbation conditions. Both the model and humans expressed this strategy that can be seen as an emergent feature of efficient feedback control during production of movement sequences. Additionally, even though behavior was reproduced in simulations, changing the cost on control effort and/or accuracy of intermediate reaches could modulate the sequence-dependent changes in curvature.”

      I am not sure if "the data and code for simulations can be provided by the corresponding author" satisfies the eLife/PLoS software guidelines (i.e., that it be deposited in a public repository).

      Thank you for pointing this out. This sentence was added by mistake.

      We modified this statement in the updated manuscript. 

      “The data and code from simulations and experiments are available in the public repository ‘figshare’ at the following link (https://figshare.com/s/865a8b77c264ef17a181).”

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1

      Recommendation 1: The authors reasoned upon the presence of a differential basal hydraulic stress in waves' valleys vs hills at first from the observation of "domes" formation upon 48h cultivation. I suggest performing a quantification to support the statement as a good scientific practice. Furthermore, it would strengthen the concept when the formation of domes was compared between the waves' dimensions as a different grade of cell extrusion was quantified. i.e., 50, 100, and 200 µm.

      Response 1: Upon seeing the phenomenon (Author response image 1 A), we performed a count of domes on the 100 µm waves and saw a significant effect. We refrained from including the results as it is the subject of ongoing research in our lab. In response to the reviewer’s suggestion, we have included a graph (Author response image 1 B) showing the increasing number of domes over 48 hours from three 100 µm wave samples.

      We have updated Figure 2A and B in the manuscript to include the new graph.

      Author response image 1.

      (A) shows dome (white arrows) over a 100 µm wave substrate. (B) is the number of accumulated domes in valley and hill regions, for 3 independent samples, over 48 hours.

      Recommendation 2: Using RICM microscopy to quantify the cell basal separation with the substrate and hydraulic stress is very clever. Nevertheless, I am in doubt if the different intensity reported for the hills vs valley (Fig. 2G and H) is a result of the signal reduction at deeper Z levels. Since there is no difference in extrusion and forces between valleys and hills in the 200 µm waves but only in 50µm and 100µm, I would add this to the quantification. I would expect no intensity difference from RICM for the 200 µm sample if this is not an artefact of imaging.

      Response 2: We performed additional experiments on blank wave substrates (both 100 and 200 µm) to ascertain the extent of reflection intensity drop (Author response image 2A). And, as correctly pointed out by Reviewer #1, there was a drop in intensity even without cells. On the 100 µm waves, hill reflections are on average ~27 % dimmer than valley reflections. Whereas, on the 200 µm waves, hill reflections are on average ~39 % dimmer.

      Using this information, we performed a calibration on the RICM results obtained from both the 100 and 200 µm waves (Author response image 3B). The calibrated 100 µm data showed residual signatures of difference, whereas the calibrated 200 µm distributions appeared very similar. We noticed large cross-sample variations in the registered intensities, which will negatively impact effect size if not accounted for. To do this, we subsequently normalized both hill and valley intensities against planar region intensities for each sample. As shown by the final output (Author response image 3C), we were able to remove the skewness in the distributions. Moreover, 1-way ANOVA followed by a post hoc analysis with BH correction revealed a significant reduction in 100 µm hill/flat intensity ratio compared to 100 µm valley/flat intensity ratios (Δ~-23 %). Conversely, no significance was observed for the same comparison on the 200 µm waves.

      Author response image 2.

      (A). RICM from blank wave samples reveal a reduction in reflection intensity in hill regions compared to flat and valley regions.

      Author response image 3.

      (B) shows the RICM intensities after adjusting for the inherent reflection intensity drop shown in (A). (C) shows the RICM intensities after normalization against planar region signals; this removes cross-sample variations and improves the effect size of differences.

      We have updated the manuscript Figure 2I and text accordingly. The blank wave results are included in Figure 2-figure supplement 1 along with updated text and summary data table in Supplementary File 4.
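
      The two-step correction described in Response 2 can be sketched as follows. The exact calibration formula is not given in the response, so this sketch assumes one plausible reading: hill intensities are divided by the fraction of signal retained on blank substrates (1 minus the reported average drop), and both hill and valley intensities are then divided by the mean planar-region intensity of the same sample. The function name, the dictionary, and the input numbers are hypothetical; only the 27 % and 39 % drops come from the text.

```python
from statistics import mean

# Average hill-vs-valley reflection drop measured on blank substrates,
# keyed by wave dimension in micrometres (values from Response 2).
BLANK_HILL_DROP = {100: 0.27, 200: 0.39}

def calibrate_and_normalize(hill, valley, flat, wave_um):
    """Return (hill/flat, valley/flat) intensity ratios for one sample.

    Assumed procedure, not confirmed by the source:
    1. undo the geometry-related dimming of hill reflections;
    2. divide by the sample's mean planar intensity to remove
       cross-sample variation.
    """
    retained = 1.0 - BLANK_HILL_DROP[wave_um]
    hill_calibrated = [h / retained for h in hill]
    flat_mean = mean(flat)
    hill_ratio = [h / flat_mean for h in hill_calibrated]
    valley_ratio = [v / flat_mean for v in valley]
    return hill_ratio, valley_ratio

# Hypothetical intensities (arbitrary units) for a single 100 µm sample.
hill_ratio, valley_ratio = calibrate_and_normalize(
    hill=[60.0, 58.0], valley=[105.0, 98.0], flat=[100.0, 102.0], wave_um=100
)
print(hill_ratio, valley_ratio)
```

      Under this reading, a hill signal that is dimmer only because of the substrate geometry returns a hill/flat ratio equal to the valley/flat ratio, so any remaining difference would reflect the cells rather than the imaging depth.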

      Recommendation 3: To measure 3D forces on top of the hills and valleys, the use of PAA gels is necessary. Since in Fig 3B, the authors show a difference in cell extrusion number between substrates and stiffnesses, I think it is necessary to confirm the presence of more extrusion in valleys vs hills on PAA gels. This would ensure the conclusion between normal forces and extrusion.

      Response 3: We do have time-lapse data with monolayers on the PAA waves. However, we felt results from the flat regions were sufficient in supporting the point being made in the text. Specifically, our original intention with PAA gels was to show that the extrusion reductions seen in osmotic perturbations were by virtue of removing basal stress and not some cryptic osmotic response. Hydrogels were chosen because they can effectively dilute basal solute concentration and thereby reduce the osmotically induced water transport. Moreover, as fluid could freely move within the gel, the fluid stress can quickly equilibrate across the basal surface. In contrast, poorly water/solute permeable substrates could lead to localized spikes in solute concentration and transient basal regions with high fluid stress.

      To get a sense of the potential difference in basal solute concentration between the two materials, we can do a quick hand-waving estimation. For monolayers on non-water/solute permeable PDMS of 20x20 mm and using the laser wavelength (640 nm) for RICM as an extreme estimate of basal separation, we should expect ~0.25 µl of total basal water content. On the other hand, we typically produce our PAM gel slabs using ~150 µl of precursor solutions. This means that, given similar amounts of solute, PAM gels will lead to monolayer basal osmolarity that is around 3 orders of magnitude lower than monolayers on PDMS, producing significantly lower osmotic potential. This implies from the outset that we should expect high survivability of cells on these substrates irrespective of curvature domains. Indeed, later immunoblotting experiments showed MDCKs exhibiting hyper activated FAK and Akt on PAM gels.
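
      The estimate above can be checked with a few lines of arithmetic. The sketch below assumes that the basal fluid forms a uniform layer whose thickness equals the 640 nm laser wavelength, and that, for the same amount of solute, the dilution factor is approximated by the ratio of gel volume to basal fluid volume. The variable names are illustrative.

```python
import math

side_m = 20e-3           # monolayer side length: 20 mm
gap_m = 640e-9           # basal separation upper bound: 640 nm (RICM laser wavelength)
gel_volume_ul = 150.0    # precursor volume used for one PAM gel slab

# Volume of the basal fluid layer; 1 m^3 = 1e9 µl.
basal_volume_ul = side_m ** 2 * gap_m * 1e9

# Same solute amount spread over the gel volume instead of the basal layer.
dilution_factor = gel_volume_ul / basal_volume_ul

print(f"basal volume: {basal_volume_ul:.3f} µl")
print(f"dilution factor: {dilution_factor:.0f}")
print(f"orders of magnitude: {math.log10(dilution_factor):.2f}")
```

      The basal volume comes out at roughly a quarter of a microlitre and the dilution factor at several hundred, which is consistent with the stated "around 3 orders of magnitude" as an order-of-magnitude figure.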

      In response to Reviewer #1’s suggestion then, we have added another supporting time-lapse (Video 19) showing the typical response of MDCK monolayers on 100 µm PAA waves (Author response image 4). As is evident from the time-lapses, cell extrusions were very rare, as in the planar regions. This supports the idea that on PAM gels the effects of basal hydraulic stress and asymmetric forces are marginal against the strong survival signals. The response is similar to that under hyper-osmotic perturbations; there, we did not see a significant difference between valley and hill extrusions.

      Author response image 4.

      Time-lapse snapshot showing negligible MDCK extrusions 24 hours after confluency over PAM gel wave substrates.

      Recommendation 4: Before proceeding with the FAK inhibitor experiment, the authors should better justify why the 4.1 wt % sucrose vs DMSO or NaCl is the most inert treatment. This can be done by citing relevant papers or showing time-lapses (as it is done for the higher FAKI14 dose).

      Response 4: Although some cells have recently been shown to be able to transport and utilize sucrose, mammalian cells generally cannot directly take up disaccharides such as sucrose for metabolism, and this is frequently mentioned in the literature: see (Ref. R1) for example. Without special enzymes to break sucrose down into monosaccharides, such as sucrase found in the gut, the sugars should remain spectators in the culture medium, contributing only to osmotic effects.

      DMSO on the other hand, besides changing osmolarity, can also be integrated into cell membrane and pass through cells over time. It has been reported to chronically affect cell membrane properties and gene expressions (Ref. R2).

      Finally, it is well known that both sodium and chloride ions are readily taken up and transported by cells (Ref R3). They help to regulate the transmembrane potential, which in turn can affect membrane bound proteins and biochemical reactions within a cell.

      Hence, comparing the 3 hyper-osmotic perturbations, adding sucrose should have the least off-target effects on both the inhibitor study and the subsequent immunoblotting. And, in response to the reviewer’s recommendation, we have updated the text accordingly and included new references to support our statement.

      Ref R1. H. Meyer, O. Vitavska, H. Wieczorek; Identification of an animal sucrose transporter. Journal of Cell Science 124, 1984–1991 (2011). Doi: 10.1242/jcs.082024

      Ref R2. B. Gironi, Z. Kahveci, B. McGill, B.-D. Lechner, S. Pagliara, J. Metz, A. Morresi, F. Palombo, P. Sassi, P. G. Petrov; Effect of DMSO on the Mechanical and Structural Properties of Model and Biological Membranes. Biophysical Journal 119, 274-286 (2020). Doi: 10.1016/j.bpj.2020.05.037

      Ref R3. X. Zhang, H. Li; Interplay between the electrostatic membrane potential and conformational changes in membrane proteins. Protein Science 28, 502-512 (2019). Doi: 10.1002/pro.3563

      Recommendation 5: The data showing a FAK-dependent phosphorylation of AKT responsible for a higher cell survival rate in the hills is not yet completely convincing. Please show a reduced AKT phosphorylation level after FAK inhibition in high osmolarity levels. Furthermore, the levels of AKT activation seem to increase slightly upon substrate softening independently of FAK activation or osmotic pressure (i.e., Fig. 4E, Soft PDMS). The authors should comment on this in connection with the results shown for PAA gels.

      Response 5: For the additional immunoblotting experiments, work is currently underway. We could not, however, complete these experiments in time for this revision, as both Cheng-Kuang and Xianbin will shortly be taking on new jobs elsewhere. David will continue with the immunoblotting studies and should be able to include the results in an update in the coming months. As for the apparent elevated levels of AKT seen on soft silicones, we speculate that it is because we cannot immunoblot cells that have died and were inevitably washed out at the start of the procedure. Inferring from the higher extrusion rates on these soft substrates, we could be missing a significant portion of the data. Specifically, we are missing all the cells that would have had lowered AKT activation but died; had we been able to collect those data, perhaps both FAK and AKT would have shown lower levels. We risk introducing survivorship bias if we read too much into the data as is.

      Alternatively, another explanation could be that, by virtue of survival of the fittest, we might have effectively selected a subpopulation of cells that were able to survive on lower FAK signals, or completely irrespectively of it.

      At any rate, to prove our foregoing hypothesis would require us to perform comprehensive immunoblotting and total transcriptome analysis over different duration conditions. Unfortunately, we do not have the time to do that for the current article, but it could be developed into a stand-alone molecular biology investigation in future. We have included similar discussion in the main text.

      Recommendation 6: In the discussion, the authors suggest the reported findings be especially relevant for epithelia that significantly separate compartments and regulate water and soluble transport. These are for example kidney epithelia (i.e., MDCK is the best experimental choice), retinal epithelium or intestinal epithelium. I would suggest that some proof-of-concept experiments could be done to support this concept. For example, I would expect keratinocytes (i.e., HaCaT) not to show a strong difference in extrusion rate between valleys and hills since the monolayer is not so sealed as kidney epithelium. In general, this kind of experiment would significantly strengthen the finding of this work.

      Response 6: As recommended, we tracked the behavior of retinal pigment epithelial cells (hTERT RPE-1 from ATCC), which do not form tight monolayers like MDCKs (Ref. R4). We did not detect extrusion events occurring from monolayers of these cells (Author response image 5). This is true even for portions of monolayers over waved regions.

      Author response image 5.

      Time-lapse snapshot showing no cell extrusions from RPE monolayers confluent for over 21 hours.

      We have updated these findings in the main text discussions and included a new supporting time-lapse (Video 15) in our article.

      Ref R4. F. Liu, T. Xu, S. Peng, R. A. Adelman, L. I. Rizzolo, Claudins regulate gene and protein expression of the retinal pigment epithelium independent of their association with tight junctions. Experimental Eye Research 198, 108157 (2020). Doi: 10.1016/j.exer.2020.108157

      Recommendation 7 (minor point): Figure S1 needs to have clear notes indicating in each step what is what. i.e., where is glass, PDMS, NOA73, etc? A more detailed caption will help the figure's comprehension. Also "Cy52" should be changed to "soft silicone" to be consistent with the text (or Cy52 should be mentioned in the text).

      Response 7 (minor point): Changes were made to Figure 1-figure supplement 1 to improve comprehension accordingly. CY52 was added to the main-text, next to the first appearance of the word soft silicone, to be consistent with the figures.

      Recommendation 8 (minor point): The authors often mentioned that epithelial monolayers are denser on PAA gels. Please add a reference(s) to this statement.

      Response 8 (minor point): The statement is an inference from visually comparing monolayers on PAM gels and PDMS. The difference is quite evident (Author response image 6). The density difference is in spite of the fact that the substrates share similar starting cell numbers.

      To address the reviewer’s comment, we have combined time-lapses of monolayers on silicones and PAM gels side-by-side in Video 17 to facilitate convenient comparisons.

      Author response image 6.

      Time-lapse snapshot at 24 hours after confluence, showing conspicuously higher density of MDCK monolayers on PAM gel compared to those on silicone elastomer.

      Reviewer #2

      Recommendation 1: The sinusoidal wavy substrate that the authors use in their investigation is interesting and relevant, but it is important to realize that this is a single-curved surface (also known as a developable surface). This means that the Gaussian curvature is zero and that monolayers need to undergo (almost) no stretching to conform to the curvature. The authors should at least discuss other curved surfaces as an option for future research, and highlight how the observations might change. Convex and concave hemispherical surfaces, for example, might induce stronger differences than observed on the sinusoidal substrates, due to potentially higher vertical resultant forces that the monolayer would experience. The authors could discuss this geometry aspect more in their manuscript and potentially link it to some other papers exploring cell-curvature interactions in more complex environments (e.g. non-zero Gaussian curvature).

      Response 1: In response to reviewer #2’s recommendation we have highlighted in the discussion of our text that our waves constitute a developable surface and that cells will experience little stretching for the most part. Based on our knowledge of how curvature can modulate forces and thus osmotic effects, we included some rudimentary analysis of what one would expect on hemispherical surfaces of two types: one that is periodic and contiguous (Ref. R5), and another with delineating flat regions (Ref. R6).

      For epithelial monolayers in the first scenario, and on poorly solute/water permeable substrates, we should also expect to see a relatively higher likelihood of extrusions from concave regions compared to convex ones. Moreover, as the surfaces are now curved in both principal directions (producing larger out-of-plane forces), we should see the onset of differential extrusions seen in this study, but at larger length scales. For example, the effects seen on 100 µm hemicylindrical waves might now happen at a larger feature size for hemispherical waves. Furthermore, as this kind of surface would invariably contain hyperbolic regions (saddle points), we might expect an intermediate response from these locations. If the forces in both principal directions offset each other, the extrusion response may parallel planar regions. On the other hand, if one dominates over the other, we may see extrusion responses tending to the dominating curvature (concave or convex).

      On the other hand, on curved landscapes with discrete convex or concave regions, we should expect, within the curved surface, extrusion behaviors paralleling findings in this study. What would be interesting would be to see what happens at the rims (or skirt regions) of the features. At these locations we effectively have hyperbolically curved surfaces, and like before, we should expect some sort of competing effect between the forces generated from the principal directions. So, for dome skirts, we should see fewer extrusions when the domes are small, and vice versa, when they are larger. Meanwhile, for pit rims, we should see a reversed behavior. It should also be noted that the transitioning curvature between convex/concave and planar regions would also modulate the effect.

      These effects might have interesting developmental implications. For instance, in developing pillar-like tissue structures (e.g., villi), the strong curvatures of nascent lumps would favor accumulation of cell numbers. However, once the size of the lumps reaches some critical value, epithelial cell extrusions might begin to appear at the roots of the developing structures, offsetting cell division, and eventually halting growth.

      Ref R5. L. Pieuchot, J. Marteau, A. Guignandon, T. Dos Santos, I. Brigaud, P. Chauvy, T. Cloatre, A. Ponche, T. Petithory, P. Rougerie, M. Vassaux, J. Milan, N. T. Wakhloo, A. Spangenberg, M. Bigerelle, K. Anselme, Curvotaxis directs cell migration through cell-scale curvature landscapes. Nature Communications 9, 3995 (2018). Doi: 10.1038/s41467-018-06494-6

      Ref R6. M. Werner, S. B.G. Blanquer, S. P. Haimi, G. Korus, J. W. C. Dunlop, G. N. Duda, D. W. Grijpma, A. Petersen, Surface curvature differentially regulates stem cell migration and differentiation via altered attachment morphology and nuclear deformation. Advanced Science 4, 1–11 (2017). Doi: 10.1002/advs.201600347

      Recommendation 2: The discussion of the experiments on PAM gels is rather limited. The authors describe that cells on the PAM gels experience fewer extrusions than on the PDMS substrates, but this is not discussed in sufficient detail (e.g. why is this the case). Additionally, the description of the 3D traction force microscopy and its validation is quite limited and should be extended to provide more convincing evidence that the measured force differences are not an artefact of the undulations of the surface.

      Response 2: We first saw a significant reduction in cell extrusions when we performed hyper-osmotic perturbations, and to eliminate possible off-target effects of the compounds used to increase osmolarity, we used three different compounds to be sure. In spite of this, we felt it would further support our argument, that basal accumulation of fluid stress was responsible for the extrusions, if we had some other independent means of removing fluid stress without directly tuning osmolarity through addition of extraneous solutes. We hence thought of culturing MDCK monolayers on hydrogels.

      Hydrogels were chosen because they can effectively dilute basal solute concentration (for reference ions (Na+) are continuously pumped out basally by the monolayer) and thereby reduce the associated osmotically induced water transport. Moreover, as fluid could freely move within the gel, the fluid stress can quickly equilibrate across the basal surface. In contrast, poorly water/solute permeable substrates will lead to localized spikes in solute concentration and transient basal regions with high fluid stress.

      To get a sense of the extent of difference in basal solute concentration between the two materials, we can do a quick hand-waving estimation. For monolayers on non-water-permeable PDMS of 20x20 mm, and using the laser wavelength (640 nm) for RICM as an extreme estimate of basal separation, we should expect ~0.25 µl of total basal water content. On the other hand, we typically produce our PAM gel slabs using ~150 µl of precursor solutions. This means that, given similar amounts of solute, PAM gels will lead to monolayer basal osmolarity that is around 3 orders of magnitude lower than monolayers on PDMS, producing significantly lower osmotic potential. This implies from the outset that we should expect high survivability of cells on these substrates. Indeed, later immunoblotting experiments showed MDCKs exhibiting hyper activated FAK and Akt on PAM gels.
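
      The arithmetic behind this estimate can be written out as a short sketch. It uses only the figures quoted above and makes the simplifying assumption that the whole gel volume is available to dilute the solute; the variable names are ours and purely illustrative.

```python
# Back-of-envelope comparison of the fluid volume available to dilute
# basally pumped solutes on PDMS versus on a PAM gel slab.
import math

side_mm = 20.0          # monolayer footprint, 20 x 20 mm (from the text)
gap_nm = 640.0          # RICM laser wavelength, taken as an upper bound on basal separation
gel_volume_ul = 150.0   # precursor volume used per PAM gel slab (from the text)

gap_mm = gap_nm * 1e-6                        # 1 nm = 1e-6 mm
basal_volume_ul = side_mm * side_mm * gap_mm  # 1 mm^3 = 1 µl

# Assumption: the same amount of solute is diluted in either volume.
dilution_ratio = gel_volume_ul / basal_volume_ul
orders_of_magnitude = math.log10(dilution_ratio)

print(f"basal volume on PDMS: {basal_volume_ul:.3f} µl")
print(f"gel / basal volume ratio: {dilution_ratio:.0f}")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
```

      The ratio comes out at several hundred, which is what "around 3 orders of magnitude" rounds to.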

      As for the 3D TFM used in this study, it is actually implemented from a well-established finite element method to solve inverse problems in engineering and has been repeatedly validated in larger scale engineering contexts (Ref. R7). The novelty and contribution of our article is in its adaptation to reconstruct cellular forces at microscopic scales.

      In brief, soft materials, such as the hydrogels used in our case, are doped with fluorescent particles, coated with ECM, and then seeded with cells. The cells exert forces that deform the soft substrate, thereby displacing the fluorescent particles from their equilibrium positions. This particle displacement can be extracted by producing an image pair with microscopy: the first with the cells, and a subsequent one of the relaxed gel after removal of the cells with acutely cytotoxic reagents, such as SDS. There are several ways in which the displacement field can be extracted from the image pair. These include particle tracking velocimetry, particle image velocimetry, digital volume correlation, and optical flow.

      We employed 3D Farneback optical flow in our study for its superior computational performance. The method was validated using synthetically generated images from Sample 14 of the Society for Experimental Mechanics DIC challenge. The accuracy of the calculated displacements using the 3D Farneback optical flow was then compared to the provided ground truth displacements. For the highest frequency displacement image pairs, an x-component root-mean-square-error (RMSE) value of 0.0113 was observed. This was lower than the 0.0141 RMSE value for the Augmented Lagrangian Digital Volume Correlation method. This suggested that the 3D Farneback optical flow is capable of accurately calculating the displacement between two bead images.
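
      For readers unfamiliar with the metric, a minimal sketch of a per-component RMSE comparison is given below. The displacement values are made up for illustration and are not taken from the DIC challenge data; `component_rmse` is a hypothetical helper, not part of our pipeline.

```python
import math

def component_rmse(estimated, ground_truth):
    """Root-mean-square error between two equal-length sequences
    holding one displacement component (e.g. the x-component)."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(e - g) ** 2 for e, g in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical x-displacements (in voxels) at four sample points.
truth_x = [0.10, -0.20, 0.05, 0.00]
flow_x = [0.11, -0.19, 0.04, 0.01]   # each estimate is off by 0.01

rmse_x = component_rmse(flow_x, truth_x)
print(f"x-component RMSE: {rmse_x:.4f}")
```

      A lower RMSE against the ground truth indicates a more accurate displacement estimate, which is the basis of the comparison above.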

      The displacement fields are then fed into a finite element suite (ANSYS in our case) along with the model and mesh of the underlying substrate structure to obtain node-specific displacements. This is required because mesh nodes do not typically align with voxel positions of displacements. With these node-specific displacements, we subsequently solve the inverse problem for the forces using Tikhonov regularization (Ref. R8). The outcome is a vector of node-specific forces.
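
      To illustrate the regularization step, the sketch below solves a toy inverse problem with standard-form Tikhonov regularization, i.e. it minimizes ||K f - u||² + λ²||f||² through the normal equations (KᵀK + λ²I) f = Kᵀu. The identity regularizer, the 2x2 matrix K and the value of λ are assumptions made for illustration only; they do not reproduce the finite element system or the parameter choice used in the study.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def tikhonov_solve(k, u, lam):
    """Return f minimizing ||k f - u||^2 + lam^2 ||f||^2."""
    kt = transpose(k)
    normal = matmul(kt, k)
    for i in range(len(normal)):
        normal[i][i] += lam ** 2      # add lam^2 * identity
    return solve(normal, matvec(kt, u))

# Toy system: displacements u = K f, with true forces f = [1, 1].
K = [[2.0, 0.0], [0.0, 1.0]]
u = [2.0, 1.0]

f_unregularized = tikhonov_solve(K, u, lam=0.0)
f_regularized = tikhonov_solve(K, u, lam=1.0)
print(f_unregularized, f_regularized)
```

      Increasing λ shrinks the recovered forces towards zero in exchange for stability against noise in the displacements, which is why the parameter has to be chosen with care in practice.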

      In light of the above, to physically validate the method in our context would require the generation of a known ground truth force on the scale of pico- to nano-newtons and subsequently image the particle displacements from this force using confocal microscopy. The force must then be released in situ in order for the relaxed gel to be imaged again. This is not a straightforward feat at this scale, and a method that immediately springs to mind is magnetic tweezers. Unfortunately, this is a tool that we cannot develop within reasonable timeframes, as the method will have to be seamlessly integrated with our spinning-disk confocal. However, as a compromise, we have included an in-silico validation with our revised manuscript.

      Specifically, given a finite element model with a predefined curvature, a known force was applied to the surface of the model (Author response image 7A). The resulting displacements were then calculated from the finite element solution, and 10% random noise was added to them. The traction force recovery (Author response image 7B) was then performed using the noisy in-silico displacements. To evaluate the accuracy of the recovery, the cosine similarity and the mean norm of the force vectors were calculated. A value closer to 1 for both evaluation metrics indicates a more accurate reconstruction of the simulated traction force. The cosine similarity of the recovered traction forces to the original applied force was 0.977±0.056, while the norm of the recovered traction forces as a proportion of the original applied force was 1.016±0.165. As both values are close to 1 (i.e., identical), this suggested that the traction forces could be satisfactorily recovered using the finite-element based method.
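
      The two evaluation metrics can be sketched as follows. We assume here that they are computed per node and then summarized as mean and standard deviation; the force vectors are invented for illustration and the helper names are hypothetical.

```python
import math
import statistics

def cosine_similarity(a, b):
    """Cosine of the angle between two 3D vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def norm_ratio(recovered, applied):
    """Magnitude of the recovered force as a proportion of the applied force."""
    return math.hypot(*recovered) / math.hypot(*applied)

# Hypothetical applied and recovered nodal forces (arbitrary units).
applied = [(0.0, 0.0, -1.0), (0.0, 0.0, -1.0), (0.0, 0.0, -1.0)]
recovered = [(0.1, 0.0, -1.0), (0.0, -0.1, -0.9), (0.0, 0.0, -1.1)]

similarities = [cosine_similarity(r, a) for r, a in zip(recovered, applied)]
ratios = [norm_ratio(r, a) for r, a in zip(recovered, applied)]

print(f"cosine similarity: {statistics.mean(similarities):.3f} ± {statistics.stdev(similarities):.3f}")
print(f"norm ratio: {statistics.mean(ratios):.3f} ± {statistics.stdev(ratios):.3f}")
```

      Cosine similarity captures only the direction of the recovered force and the norm ratio only its magnitude, so the two are reported together.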

      In response to the reviewer’s recommendations then, additional content has been included in the main text to explain the use of PAM gels and the workings of our 3D TFM pipeline.

      Ref R7. James F. Doyle, Modern Experimental Stress Analysis: Completing the Solution of Partially Specified Problems (John Wiley & Sons, Chichester, 2004).

      Ref R8. Per Christian Hansen, Discrete Inverse Problems: Insight and Algorithms (SIAM, Philadelphia, 2010).

      Author response image 7.

      (A) shows simulated force field to generate simulated displacements. (B) shows force field reconstructed from simulated displacements with noise.

      Recommendation 3: The authors show nuclear deformation on the hills and use this as evidence for a resultant downward-pointing force vector. This has, indeed, also been observed in other works referenced by the authors (e.g. Werner et al.), and could be interesting evidence to support the current observations, provided the authors also show a nuclear shape on the concave and flat regions. The authors could potentially also characterize this shape change better using higher-resolution data.

      Response 3: We characterized nucleus deformation using Hoechst-stained samples as per the recommendation. The deformation is estimated by dividing the segmented nuclear volumes by the best-fit ellipsoid volumes of the same objects. In this way, objects exhibiting minimal bending will lead to values close to 1.0. The obtained graph is shown in Author response image 8B (and manuscript Figure 3D).
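
      As a sketch of the measure, the ratio can be computed as below once the segmented volume and the semi-axes of the fitted ellipsoid are known. How the ellipsoid is fitted is not shown; the semi-axes and volume here are hypothetical values chosen for illustration.

```python
import math

def ellipsoid_volume(a, b, c):
    """Volume of an ellipsoid with semi-axes a, b, c."""
    return 4.0 / 3.0 * math.pi * a * b * c

def deformation_measure(segmented_volume, semi_axes):
    """Segmented nuclear volume divided by the best-fit ellipsoid volume.
    Values close to 1.0 indicate a nucleus with minimal bending."""
    return segmented_volume / ellipsoid_volume(*semi_axes)

# Hypothetical nucleus: best-fit semi-axes in µm and a segmented volume
# that fills 97.5% of the fitted ellipsoid.
semi_axes_um = (5.0, 4.0, 3.0)
segmented_um3 = 0.975 * ellipsoid_volume(*semi_axes_um)

measure = deformation_measure(segmented_um3, semi_axes_um)
print(f"deformation measure: {measure:.3f}")
```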

      Author response image 8.

      (A) An example of deformed nuclei on a 50 µm wave hill region. (B) A violin plot of calculated nuclear deformations across dimensions and features using segmented volume normalized against best-fit ellipsoid volume.

      Our quantifications show a statistically significant difference in nuclei deformation measure medians between hill and valley cells on the 50 µm (0.973 vs 0.982) and 100 µm (0.971 vs 0.979) waves; this indicates that cells on the hills tend to have more deformed nuclei compared to cells in the valleys. Meanwhile, no significant difference was found for a similar comparison on 200 µm (0.978 vs 0.978) samples. For reference, the median found for cells pooled from planar regions was 0.975.

      In response to the reviewer’s suggestions Figure 3 of our manuscript has been updated to include the new results on nuclei deformation. The text has also been updated to account for the new information to support our claims. The statistics are included in a new summary data table in Supplementary File 6.

      Recommendation 4: The U-net for extrusion detection is a central tool used within this study, though the explanation and particularly validation of the tool are somewhat lacking. More clarity in the explanation and more examples of good (or bad) detections would help establish this tool as a more robust component of the data collection (on all geometries).

      Response 4: The architecture of the neural network used in this study is outlined in supplementary figure S5a. To validate the performance of the model, a test dataset consisting of 200 positive examples and 100 negative examples was fed into the network and the resulting predictions were obtained from the model. The confusion matrix of the model is shown in supplementary figure S5c. The weighted precision and recall of the model are 0.958 and 0.953, respectively.

      Additionally, we have included examples of false positive and false negative detections in Figure 1-figure supplement 5 (Author response image 9). False positive detections were typically observed to be extrusions that were labelled as having occurred in the frame prior to the frame of interest (Author response image 9, bottom sequence). However, as the extrusion process is incomplete in the prior frame, there are still changes in the extruded cell body and the network falsely predicts this as a detection.
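
      For clarity on how such figures are derived, the sketch below computes support-weighted precision and recall from a two-class confusion matrix. We assume "weighted" means that each class is weighted by its number of true examples; the counts are hypothetical and are not the confusion matrix from supplementary figure S5c.

```python
def weighted_precision_recall(confusion):
    """confusion[i][j] = number of class-i examples predicted as class j.
    Returns precision and recall averaged over classes, each class
    weighted by its support (number of true examples)."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    w_precision = 0.0
    w_recall = 0.0
    for c in range(n_classes):
        support = sum(confusion[c])
        predicted = sum(confusion[r][c] for r in range(n_classes))
        correct = confusion[c][c]
        precision = correct / predicted if predicted else 0.0
        recall = correct / support if support else 0.0
        w_precision += support / total * precision
        w_recall += support / total * recall
    return w_precision, w_recall

# Hypothetical counts for 100 negative (row 0) and 200 positive (row 1) examples.
hypothetical = [[96, 4],
                [10, 190]]

wp, wr = weighted_precision_recall(hypothetical)
print(f"weighted precision: {wp:.3f}, weighted recall: {wr:.3f}")
```

      With support weighting, the weighted recall equals the overall fraction of correctly classified examples.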

      Author response image 9.

      Examples of false negative and false positive extrusion detections.

      Recommendation 5: The authors study the involvement of FAK in the observed curvature-dependent and hydraulic stress-dependent spatial regulation of cell extrusion. In one of the experiments, the authors supplement the cell medium with FAK inhibitors, though only in a hyper-osmotic medium. They show that FAK inhibition counteracts the extrusion-suppressing effect of a hyper-osmotic medium. However, no data is shown on the effect of FAK inhibitors within the control medium. Would the extrusion rates be even higher then?

      Response 5: We proceeded, as suggested by the reviewer, to explore the effects of the FAK inhibitor on MDCK monolayers in our control medium. The results revealed that, at the 3 µM FAK inhibitor concentration, where cells in sucrose media showed an elevated extrusion rate, monolayers in control medium quickly suffered massive cell death (Author response image 10), similar to what was seen when 6 µM FAK inhibitor was introduced to sucrose medium.

      This finding suggests that osmolarity protects against FAK inhibitors in a dose-dependent manner. Moreover, as cell extrusions require an intact monolayer, their rate cannot increase indefinitely: a point will be reached where an intact monolayer can no longer be maintained.

      We have updated the main text of our article to mention this observation, and also included a new time-lapse (Video 22) to demonstrate the effect.

      Author response image 10.

      Time-lapse snapshot of MDCK monolayers over waves 4 hours after inclusion of focal adhesion kinase inhibitor.

      Recommendation 6: The supplementary videos show two fields of view next to each other, which is not immediately clear to the viewer. I strongly advise the authors to add a clear border between the two panels, so that it is clear that the cells from one panel are not migrating into the next panel.

      Response 6: A distinctive border has been added to the movies to separate panels showing different focal planes of the same stack.

      Recommendation 7: The general quality and layout of the figures could be improved. Some figures would benefit from higher-resolution or larger cell images (e.g. Figure 2A, C, D), and the organisation of subpanels could be improved (e.g. especially in Figure 2). The box plots and bar graphs are also not consistent throughout the manuscript in terms of colouring and style, which should be improved.

      Response 7: We have enlarged the figures in question accordingly, at the cost of reducing some information. However, the full scope of the sub-figures remains accessible in the supplementary movies. We have also tried to change the placement of the panels to improve readability. We have also adjusted the valley, hill, and flat coloring scheme for the extrusion boxplots in Figures 1 and 2 to make them consistent.

      Recommendation 8: The graphs in Figures 3E and F are confusing and difficult to interpret. The x-axis states "Position along curve in radians" but it is unclear how to relate this to the position on the wavy substrate. The graphs also have a second vertical axis on the right ("valley-interface-hill"), which adds to the confusion. I would recommend the authors provide more explanation and consider a different approach of plotting this.

      Response 8: We have removed the confusing plot of cross-sectional profile from the force graphs. To indicate positions on the waves, we have augmented radian values with Hill, Interface, and Valley accordingly.

      Recommendation 9: Specify which silicone was used for the low-stiffness silicone substrates in the methods and in the main text.

      Response 9: CY52 has been added to the main-text, next to the first appearance of the word soft silicone, to be consistent with the figures.

      Recommendation 10: The flow lines that are plotted over the RICM data make it difficult to see the underlying RICM images. I would advise to also show the RICM images without the flow lines.

      Response 10: The original movie S15 (now Video 16) showing the RICM overlapped with optical flow paths has now been replaced by a movie showing the same, but with the flow paths and RICM in separate panels.

      Recommendation 11: In the first paragraph of the discussion, the authors write: "And this difference was both dependent on the sense (positive or negative)...". This is superfluous since the authors already mentioned earlier in the paragraph that the convex and concave regions (i.e. different signs of curvature) show differences in extrusion rates.

      Response 11: The sentence has been changed to “And this difference was also dependent on the degree of curvature.”

      Recommendation 12: In the second paragraph of the discussion, the authors mention that "basal fluid spaces under monolayers in hill regions were found consistently smaller than those in valley regions". Is this data shown in the figures of the manuscript? If so, a reference should be made because it was unclear to me.

      Response 12: This statement is an inference from the comparison of the hill and valley RICM grey values. Specifically, RICM intensities are direct surrogates for basal separations (i.e., fluid space (as there cannot be a vacuum)) by virtue of the physics underlying the effect. To be more precise then, “inferred from RICM intensity differences (Figure 2I)” has been added to support the statement.

      Recommendation 13: On page 7 of the discussion, the authors talk about positively and negatively curved surfaces. This type of description should be avoided, as this depends on the definition of the surface normal (i.e. is positive convex or concave?). Rather use convex and concave in this context.

      Response 13: The wording has been changed accordingly.

      Recommendation 14: The label of Table 8 reads "Table 2".

      Response 14: The error has been corrected.

      Reviewer #3

      Recommendation 1: The central finding seems to be opposite to an earlier report (J Cell Sci (2019) 132, jcs222372), where MDCK cells in curved alginate tubes exhibit increased extrusion on a convex surface. I suggest that you comment on possible explanations for the different behaviors.

      Response 1: The article in question primarily reported the phenomenon of MDCK and J3B1A monolayers detaching from the concave alginate tube walls coated with Matrigel. The authors attributed this to the curvature induced out-of-plane forces towards the center of the tubes. Up to this point, the findings and interpretation are consistent with our current study where we also find a similar force trend in concave regions.

      To further lend support to the importance of curvature in inducing detachment, the authors cleverly bent the tubes to introduce asymmetry in curvature between outer and inner surfaces. Specifically, the outside bend is concave in both principal directions, whereas the inside bend is convex in one of its principal directions. As expected, the authors found that detachment rates from the outer surface were much larger compared to the inner one. Again, the observations and interpretations are consistent with our own findings; the convex direction will generate out-of-plane forces pointing into the surface, serving to stabilize the monolayer against the substrate. It should be noted however, since the inner-side tube is characterized by both convex and concave curvatures in its two principal directions, the resulting behavior of overlaying monolayers will depend on which of the two resulting forces become dominant. So, for gradual bends, one should expect the monolayers to still be able to detach from the inner tube surface. This is what was reported in their findings.

      Their extrusion observations, however, surprised us. Because their whole material (hydrogel) is presumably both solute and water permeable, we would be more inclined to expect very few extrusions irrespective of curvature. This is indeed the case in our study of MDCKs on PAM hydrogels, where the hydrogel substrate effectively buffers against the quick build-up of solute concentration and basal hydraulic stress. Without the latter, concave monolayer forces alone are unlikely to be able to disrupt cell focal adhesions. Indeed, the detachments seen in their study are more likely due to exfoliation of the Matrigel rather than cells being pulled off the Matrigel matrix entirely.

      Our guess is that the extrusions seen in their study are solely due to the canonical crowding effect. If this were the case, then the detached monolayer on the outside bend could buffer against crowding pressure by buckling. Meanwhile, the monolayer on the inside bend, being attached to the surface, can only regulate crowding pressure by removing cells through extrusions. This phenomenon should be particular to soft matrices such as Matrigel. Using stiffer and covalently bonded ECM should be sufficient to prevent monolayers from detaching, leading to similar extrusion behaviors. In response to the reviewer’s recommendation then, we have included a short paragraph to state the points discussed in this response.

      Recommendation 2: Fig 3E, F: The quantities displayed on the panels are not forces, but have units of pressure (or stress).

      Response 2: We have changed “force” to “stress” according to the reviewer’s suggestion. We kept “force” in the original text because we were reconstructing forces. Due to discretization, the resulting forces will inevitably be assigned to element nodes. In between the nodes, in the faces, there will be no information. So, in order to have some form of continuity to plot, the face forces are obtained by averaging the forces at the 4 nodes around the element face. Unfortunately, element face areas are not typically of the same size; the average forces obtained therefore need to be further normalized against the face area, leading to a quantity that has units of stress.
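
      The conversion described above can be sketched for a single quadrilateral element face as follows. The node coordinates and forces are hypothetical, and the area formula (half the magnitude of the cross product of the diagonals) is exact only for planar faces; for warped faces it is an approximation we adopt here for illustration.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def quad_area(p0, p1, p2, p3):
    """Area of a planar quadrilateral with corners listed in order."""
    d1 = sub(p2, p0)   # first diagonal
    d2 = sub(p3, p1)   # second diagonal
    return 0.5 * math.hypot(*cross(d1, d2))

def face_stress(node_forces, node_positions):
    """Average the four nodal force vectors and divide by the face area."""
    mean_force = [sum(f[i] for f in node_forces) / 4.0 for i in range(3)]
    area = quad_area(*node_positions)
    return tuple(component / area for component in mean_force)

# Hypothetical 2 µm x 2 µm face with identical nodal forces (nN).
positions_um = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 2.0, 0.0), (0.0, 2.0, 0.0)]
forces_nN = [(0.0, 0.0, 8.0)] * 4

stress = face_stress(forces_nN, positions_um)   # nN/µm^2, i.e. kPa
print(stress)
```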

      Recommendation 3: Fig 2D: Asterisks are hard to see.

      Response 3: The color of the asterisks has been changed to green for better clarity against a B&W background.

      Recommendation 4: p 19, l 7: Word missing in "the of molding"

      Response 4: The typo has been amended to “the molding of”.

    1. Author Response

      We thank you for the time you took to review our work and for your feedback!

      The major changes to the manuscript are:

      1. We have extended the range of locomotion velocity over which we compare its dependence with cholinergic activity in Figures 2E and S2H.

      2. We have quantified the contributions of cholinergic stimulation on multiplicative and additive gains on visual responses (Figure S7).

      3. We have provided single cell examples for the change in latency to visual response (Figure S12).

      4. We have added an analysis to compare layer 2/3 and layer 5 locomotion onset responses as a function of visuomotor condition (Figure S8).

      A detailed point-by-point response to all reviewer concerns is provided below.  

      Reviewer #1 (Public Review):

      The paper submitted by Yogesh and Keller explores the role of cholinergic input from the basal forebrain (BF) in the mouse primary visual cortex (V1). The study aims to understand the signals conveyed by BF cholinergic axons in the visual cortex, their impact on neurons in different cortical layers, and their computational significance in cortical visual processing. The authors employed two-photon calcium imaging to directly monitor cholinergic input from BF axons expressing GCaMP6 in mice running through a virtual corridor, revealing a strong correlation between BF axonal activity and locomotion. This persistent activation during locomotion suggests that BF input provides a binary locomotion state signal. To elucidate the impact of cholinergic input on cortical activity, the authors conducted optogenetic and chemogenetic manipulations, with a specific focus on L2/3 and L5 neurons. They found that cholinergic input modulates the responses of L5 neurons to visual stimuli and visuomotor mismatch, while not significantly affecting L2/3 neurons. Moreover, the study demonstrates that BF cholinergic input leads to decorrelation in the activity patterns of L2/3 and L5 neurons.

      This topic has garnered significant attention in the field, drawing the interest of many researchers actively investigating the role of BF cholinergic input in cortical activity and sensory processing. The experiments and analyses were thoughtfully designed and conducted with rigorous standards, leading to convincing results which align well with findings in previous studies. In other words, some of the main findings, such as the correlation between cholinergic input and locomotor activity and the effects of cholinergic input on V1 cortical activity, have been previously demonstrated by other labs (Goard and Dan, 2009; Pinto et al., 2013; Reimer et al., 2016). However, the study by Yogesh and Keller stands out by combining cutting-edge calcium imaging and optogenetics to provide compelling evidence of layer-specific differences in the impact of cholinergic input on neuronal responses to bottom-up (visual stimuli) and top-down inputs (visuomotor mismatch).

      We thank the reviewer for their feedback.

      Reviewer #2 (Public Review):

      The manuscript investigates the function of basal forebrain cholinergic axons in mouse primary visual cortex (V1) during locomotion using two-photon calcium imaging in head-fixed mice. Cholinergic modulation has previously been proposed to mediate the effects of locomotion on V1 responses. The manuscript concludes that the activity of basal forebrain cholinergic axons in visual cortex provides a signal which is more correlated with binary locomotion state than locomotion velocity of the animal. Cholinergic axons did not seem to respond to grating stimuli or visuomotor prediction error. Optogenetic stimulation of these axons increased the amplitude of responses to visual stimuli and decreased the response latency of layer 5 excitatory neurons, but not layer 2/3 neurons. Moreover, optogenetic or chemogenetic stimulation of cholinergic inputs reduced pairwise correlation of neuronal responses. These results provide insight into the role of cholinergic modulation to visual cortex and demonstrate that it affects different layers of visual cortex in a distinct manner. The experiments are well executed and the data appear to be of high quality. However, further analyses are required to fully support several of the study's conclusions.

      We thank the reviewer for their feedback.

      1) In experiments analysing the activity of V1 neurons, GCaMP6f was expressed using a ubiquitous Ef1a promoter, which is active in all neuronal cell types as well as potentially non-neuronal cells. The manuscript specifically refers to responses of excitatory neurons but it is unclear how excitatory neuron somata were identified and distinguished from that of inhibitory neurons or other cell types.

      This might be a misunderstanding. The Ef1α promoter has been reported to drive highly specific expression in neurons (Tsuchiya et al., 2002) with 99.7% of labeled cells in layer 2/3 of rat cortex being NeuN+ (a neuronal marker), with only 0.3% of labeled cells being GFAP+ (a glial marker) (Yaguchi et al., 2013). This bias was even stronger in layer 5 with 100% of labeled cells being NeuN+ and none GFAP+ (Yaguchi et al., 2013). The Ef1α promoter in an AAV vector, as we use it here, also biases expression to excitatory neurons. In layer 2/3 of mouse visual cortex, we have found that 96.8% ± 0.7% of labeled neurons are excitatory three weeks after viral injection (Attinger et al., 2017). Similar results have also been found in rats (Yaguchi et al., 2013), where, when GFP was expressed under the Ef1α promoter delivered by lentivirus, 95.2% of labeled neurons in layer 2/3 and 94.1% in layer 5 were excitatory. These numbers are comparable to the ones obtained with promoters commonly used to target expression to excitatory neurons. Typically, two variants of promoters based on the transcription start region of the CaMKIIα gene have been used for this purpose. The first, the CaMKIIα-0.4 promoter, results in 95% excitatory specificity (Scheyltjens et al., 2015). The second, the CaMKIIα-1.3 promoter, results in only 82% excitatory specificity (Scheyltjens et al., 2015), and is thus not far from chance. We have clarified this in the manuscript. Nevertheless, we have removed the qualifier “excitatory” when talking about neurons in most instances, throughout the manuscript.

      2) The manuscript concludes that cholinergic axons convey a binary locomotion signal and are not tuned to running speed. The average running velocity of mice in this study is very slow - slower than 15 cm/s in the example trace in Figure 1D and speeds <6 cm/s were quantified in Figure 2E. However, mice can run at much faster speeds both under head-fixed and freely moving conditions (see e.g. Jordan and Keller, 2020, where example running speeds are ~35 cm/s). Given that the data in the present manuscript cover such a narrow range of running speeds, it is not possible to determine whether cholinergic axons are tuned to running speed or convey a binary locomotion signal.

      Our previous analysis window of 0-6.25 cm/s covered approximately 80% of all data. We have increased the analysis window to 0-35 cm/s that now covers more than 99% of the data (see below). Also, note that very high running speeds are probably overrepresented in the Jordan and Keller 2020 paper as mice had to be trained to run reliably before all experiments given the relatively short holding times of the intracellular recordings. The running speeds in our current dataset are comparable to other datasets we have acquired in similar experiments.

      Figure 2E has now been updated to reflect the larger range of data. Please note, as the number of mice that contribute to the data now differs as a function of velocity (some mice run faster than others), we have now switched to a variant of the plot based on hierarchical bootstrap sampling (see Methods). This does not overtly change the appearance of the plot. See Author response image 1 for a comparison of the original plot, the extended range without bootstrap sampling, and the extended range with bootstrap sampling currently used in the paper.

      Author response image 1.

      Average activity of cholinergic axons as a function of locomotion velocity. (A) As in the previous version of the manuscript. (B) As in A, but with the extended velocity range. (C) As in B, but using hierarchical bootstrap sampling to estimate median (red dots) and 95% confidence interval (shading) for each velocity bin.
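
      The hierarchical bootstrap mentioned for panel C can be sketched as follows. This is a minimal illustration, not the procedure from the Methods: the function name, the data layout (one list of activity samples per mouse for a single velocity bin), the two resampling levels (mice, then samples within each mouse), and the number of resamples are all assumptions.

```python
import random
import statistics


def hierarchical_bootstrap_median(samples_by_mouse, n_boot=1000, seed=0):
    """Bootstrap the median of one velocity bin across two levels.

    samples_by_mouse: list of lists, one per mouse that contributed data
    to this velocity bin (hypothetical layout, assumed for illustration).
    Returns (median of bootstrap medians, lower bound, upper bound) for an
    approximate 95% confidence interval.
    """
    rng = random.Random(seed)
    groups = [list(g) for g in samples_by_mouse if len(g) > 0]
    if not groups:
        raise ValueError("no data in this velocity bin")

    boot_medians = []
    for _ in range(n_boot):
        pooled = []
        # Level 1: resample mice with replacement.
        for _mouse in range(len(groups)):
            group = rng.choice(groups)
            # Level 2: resample observations within the drawn mouse.
            pooled.extend(rng.choices(group, k=len(group)))
        boot_medians.append(statistics.median(pooled))

    boot_medians.sort()
    # Simple index-based 2.5% and 97.5% bounds on the sorted medians.
    lower = boot_medians[int(0.025 * (n_boot - 1))]
    upper = boot_medians[int(0.975 * (n_boot - 1))]
    return statistics.median(boot_medians), lower, upper
```

      Resampling at the level of mice first means that bins to which only a few mice contribute receive wider intervals, which is the situation described for the higher velocity bins.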

      3) The analyses in Figure 4 only consider the average response to all grating orientations and directions. Without further analysing responses to individual grating directions it is unclear how stimulation of cholinergic inputs affects visual responses. Previous work (e.g. Dadarlat and Stryker, 2017) has shown that locomotion can have both additive and multiplicative effects and it would be valuable to determine the type of modulation provided by cholinergic stimulation.

      We thank the reviewer for this suggestion. To address this, we quantified how cholinergic stimulation influenced the orientation tuning of V1 neurons. The stimuli we used were full field sinusoidal drifting gratings of 4 different orientations (2 directions each). For each neuron, we identified the preferred orientation and plotted responses relative to this preferred orientation as a function of whether the mouse was running, or we were stimulating cholinergic axons. Consistent with previous work, we found a mixture of multiplicative and additive components during running. With cholinergic axon stimulation, the multiplicative effect was stronger than the additive effect. This is now quantified in Figure S7.
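
      One common way to separate multiplicative from additive modulation is a linear fit of the modulated tuning curve against the baseline tuning curve, where the slope estimates the gain and the intercept estimates the offset. The sketch below assumes this approach for illustration only; it is not necessarily the quantification used for Figure S7, and the function name and inputs are hypothetical.

```python
def fit_gain_and_offset(baseline, modulated):
    """Least-squares fit of modulated = gain * baseline + offset.

    baseline, modulated: responses of one neuron to the same set of
    orientations (aligned to the preferred orientation), without and with
    the modulation (running or cholinergic stimulation).
    Returns (gain, offset): the multiplicative and additive components.
    """
    n = len(baseline)
    if n != len(modulated) or n < 2:
        raise ValueError("need two equal-length tuning curves with >= 2 points")

    mean_x = sum(baseline) / n
    mean_y = sum(modulated) / n
    var_x = sum((x - mean_x) ** 2 for x in baseline)
    if var_x == 0:
        raise ValueError("baseline tuning curve is flat; gain is undefined")

    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(baseline, modulated))
    gain = cov_xy / var_x            # multiplicative component (slope)
    offset = mean_y - gain * mean_x  # additive component (intercept)
    return gain, offset
```

      Under this model, a gain above 1 with an offset near 0 indicates predominantly multiplicative modulation, whereas a gain near 1 with a positive offset indicates predominantly additive modulation.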

      4) The difference between the effects of locomotion and optogenetic stimulation of cholinergic axons in Figure 5 may be confounded by differences in the visual stimulus. These experiments are carried out under open-loop conditions, where mice may adapt their locomotion based on the speed of the visual stimulus. Consequently, locomotion onsets are likely to occur during periods of higher visual flow. Since optogenetic stimulation is presented randomly, it is likely to occur during periods of lower visual flow speed. Consequently, the difference between the effect of locomotion and optogenetic stimulation may be explained by differences in visual flow speed and it is important to exclude this possibility.

      We find that in general locomotion is unaffected by visual flow in open loop conditions in this type of experiment (in this particular dataset, there was a small negative correlation between locomotion and visual flow in the open loop condition, Author response image 2).

      Author response image 2.

      Correlation between visual flow and locomotion in open loop conditions. Average correlation of locomotion velocity and visual flow speed in open loop for all mice in Figure 5. Each dot is an imaging site. In the open loop, the correlation between locomotion and visual flow speed is close to zero, but significantly negative in this dataset.

      However, to directly address the concern that our results are influenced by visual flow, we can restrict our analysis only to locomotion onsets that occurred in the absence of visual flow (Author response image 3A and 3B). These responses are not substantially different from those when including all data (Figures 5A and 5B). Thus, the difference between the effect of locomotion and optogenetic stimulation cannot be explained by differences in visual flow speed.

      Author response image 3.

      Open loop locomotion onset responses without visual flow. (A) Average calcium response of layer 2/3 neurons in visual cortex to locomotion onset in open loop in the absence of visual flow. Shading indicates SEM. (B) As in A, but for layer 5 neurons.

      5) It is unclear why chemogenetic manipulations of cholinergic inputs had no effect on pairwise correlations of L2/3 neuronal responses while optogenetic stimulation did.

      This is correct – we do not know why that is the case and can only speculate. There are at least two possible explanations for this difference:

      1) Local vs. systemic. The optogenetic manipulation is relatively local, while the chemogenetic manipulation is systemic. It is not clear how cholinergic release in other brain regions influences the correlation structure in visual cortex. It is conceivable that a cortex-wide change in cholinergic release results in a categorically different state with a specific correlation structure in layer 2/3 neurons different from the one induced by the more local optogenetic manipulation.

      2) Layer-specificity of activation. Cholinergic projections to visual cortex arrive both in superficial and deep layers. We activate the axons in visual cortex optogenetically by illuminating the cortical surface. Thus, in our optogenetic experiments, we are primarily activating the axons arriving superficially, while in the chemogenetic experiment, we are likely influencing superficial and deep axons similarly. Thus, we might expect a bias in the optogenetic activation to influencing superficial layers more strongly than the chemogenetic activation does.

      6) The effects of locomotion and optogenetic stimulation on the latency of L5 responses in Figure 7 are very large - ~100 ms. Indeed, typical latencies in mouse V1 measured using electrophysiology are themselves shorter than 100 ms (see e.g. Durand et al., 2016). Visual response latencies in stationary conditions or without optogenetic stimulation appear surprisingly long - much longer than reported in previous studies even under anaesthesia. Such large and surprising results require careful analysis to ensure they are not confounded by artefacts. However, as in Figure 4, this analysis is based only on average responses across all gratings and no individual examples are shown.

      This is correct and we speculate this is the consequence of a combination of different reasons.

      1) Calcium imaging is inherently slower than electrophysiological recordings. When spiking responses are measured using electrophysiology, response latencies on the order of 100 ms have indeed been reported, as the reviewer points out. Using calcium imaging these latencies are typically 4 times longer (Kuznetsova et al., 2021). This is likely a combination of a) calcium signals that are slower than electrical changes, b) delays in the calcium sensor itself, and c) temporal sampling used for imaging that is about 3 orders of magnitude slower than what is typically used for electrophysiology.

      2) Different neurons included in analysis. The calcium imaging likely has very different biases than electrophysiological recordings. Historically, the fraction of visually responsive neurons in visual cortex based on extracellular electrophysiological recordings has been systematically overestimated (Olshausen and Field, 2005). One key contributor to this is the fact that recordings are biased to visually responsive neurons. The criteria for inclusion of “responsive neurons” strongly influence the “average” response latency. In addition, calcium imaging has biases that relate to the vertical position of the somata in cortex. Both layer 2/3 and layer 5 recordings are likely biased to superficial layer 2/3 and superficial layer 5 neurons. Conversely, electrical recordings are likely biased to layer 4 and layer 5 neurons. Thus, comparisons at this level of resolution between data obtained with these two methods are difficult to make.

      We have added example neurons as Figure S12, as suggested.  

      Reviewer #1 (Recommendations For The Authors):

      While the study showcases valuable insights, I have a couple of concerns regarding the novelty of their research and the interpretation of results. By addressing these concerns, the authors can clarify the positioning of their research and strengthen the significance of their findings.

      (Major comments)

      1) Page 1, Line 21: The authors claim, "Our results suggest that acetylcholine augments the responsiveness of layer 5 neurons to inputs from outside of the local network, enabling faster switching between internal representations during locomotion." However, it is not clear which specific data or results support the claim of "switching between internal representations." Overall, their study primarily presents responses averaged across all neurons imaged, lacking a detailed exploration of individual neuron response patterns. Population analysis, such as PCA and decoding, can be used to assess the encoding of each stimulus by V1 neurons - "internal representation."

      To strengthen their claim regarding "switching between internal representations," the authors could consider an experiment measuring the speed at which the population activity pattern A transitions to the population activity pattern B when the visual stimulus switches from A to B. Such experiments would significantly enhance the impact of their study, providing a clearer understanding of how BF cholinergic input influences the dynamic representation of stimuli during locomotion.

      We thank the reviewer for bringing this up. That acetylcholine enables a faster switching between internal representations in layer 5 is a speculation. We have attempted to make this clearer in the discussion. Our speculation is based on the finding that the population response in layer 5 to sensory input is faster under high levels of acetylcholine (Figures 4D and 7B). In line with the reviewer’s intuition, the neuronal response to a change in visual stimulus, in our experiment from a uniform grey visual stimulus to a sinusoidal grating stimulus, is indeed faster. Based on evidence in favor of layer 5 encoding internal representation (Heindorf and Keller, 2023; Keller and Mrsic-Flogel, 2018; Suzuki and Larkum, 2020), we interpret the decrease in latency of the population response as a faster change in internal representation. We are not sure a decoding analysis would add much to this, given that a trivial decoder simply based on mean population response would already find a faster transition. We have expanded on our explanation of these points in the manuscript.

      2) Page 4, Line 103: "..., a direct measurement of the activity of cholinergic projection from basal forebrain to the visual cortex during locomotion has not been made." This statement is incorrect. An earlier study by Reimer et al. indeed imaged cholinergic axons in the visual cortex of mice running on a wheel. They found that "After walking onset, ... ACh activation, and a large pupil diameter, were sustained throughout the walking period in both cortical areas V1 and A1." Their findings are very similar to the results presented by Yogesh and Keller - that is, BF cholinergic axons exhibited locomotion state-dependent activity. The authors should clarify the positioning of this study relative to previous studies.

      Reimer, J., McGinley, M., Liu, Y. et al. Pupil fluctuations track rapid changes in adrenergic and cholinergic activity in cortex. Nat Commun 7, 13289 (2016). https://doi.org/10.1038/ncomms13289

      We have clarified this as suggested. However, we disagree slightly with the reviewer here. The key question is whether the cholinergic axons imaged originate in basal forebrain. While Reimer et al. 2016 did set out to do this, we believe a number of methodological considerations prevent this conclusion:

      1) In their analysis, Reimer et al. 2016 combine data from mice with cholinergic axons labeled with either viral injection to basal forebrain or germline cross of ChAT-cre mice with reporter line. Unfortunately, it is unclear what the exact number of mice labeled with either strategy was. Based on the information in the paper, we can conclude that of the 6 mice used for experiments between 2 and 5 were germline cross. The problem with germline labeling of ChAT positive neurons is that when using a cross, VIP-ChAT+ neurons in cortex are also labeled. Based on the fact that Reimer et al. 2016 find an anticipatory increase in activity on locomotion onset, that is also seen by Larsen et al. 2018 (they use a germline cross strategy), an effect we do not see in our data, we speculate that a significant part of the signals reported in the Reimer et al. 2016 paper are from local VIP-ChAT+ neurons.

      2) In their analysis, Reimer et al. 2016 also combine all imaging data obtained from both primary auditory cortex and primary visual cortex. Given the heterogeneity in the basal forebrain cholinergic neuronal population and their projection selectivity, to better understand these signals, it’s important to acquire the signals from cholinergic axons selectively in specific cortical regions, which we do in visual cortex. Based on the information provided in their paper, we were unfortunately not able to discern the injection location for their viral labeling strategy. Given the topographic selectivity in projection from basal forebrain, this could give hints as to the relative contribution of cholinergic projections to A1 vs V1 in their data. The injection coordinates given in the methods of the Reimer paper, of 4 mm lateral and 0.5 mm posterior to bregma to target basal forebrain, are likely wrong (they fall outside the head of the mouse).

      Given the heterogeneity in the basal forebrain cholinergic neuronal population and their projection selectivity, to better understand these signals, it’s important to acquire the signals from cholinergic axons both selectively in a cortical region, as we do in visual cortex, and purely originating from basal forebrain. Collins et al. 2023 inject more laterally and thus characterize cholinergic input to S1 and A1, while Lohani et al. 2022 use GRAB sensors which complement our findings. Please note, we don’t think there is any substantial disagreement in the results of previous studies and ours, with very few exceptions, like the anticipatory increase in cholinergic activity that precedes locomotion onset in the Reimer et al. 2016 data, but not in ours. This is a rather critical point in the context of the literature of motor-related neuronal activity in mouse V1. Based on early work on the topic, it is frequently assumed that motor-related activity in V1 is driven by a cholinergic input. This is very likely incorrect given our results, hence we feel it is important to highlight this methodological caveat of earlier work.

      3) Fig. 4H: The authors found that L5 neurons exhibit positive responses at the onset of locomotion in a closed-loop configuration. Moreover, these responses are further enhanced by photostimulation of BF axons.

      In a previous study from the same authors' group (Heindorf and Keller, 2023), they reported 'negative' responses in L5a IT neurons during closed-loop locomotion. This raises a question about the potential influence of different L5 neuron types on the observed results between the two studies. Do the authors think that the involvement of the other neuronal type in L5, the PT neurons, might explain the positive responses seen in the present study? Discussing this point in the paper would provide valuable insights into the underlying mechanisms.

      Yes, we do think the positive response observed on locomotion onset in closed loop is due to non-Tlx3+ neurons. Given that Tlx3-cre only labels a subset of inter-telencephalic (IT) neurons (Gerfen et al., 2013; Heindorf and Keller, 2023), it’s not clear whether the positive response is explained by the pyramidal tract (PT) neurons, or the non-Tlx3+ IT neurons. Dissecting the response profiles of different subsets of layer 5 neurons is an active area of research in the lab and we hope to be able to answer these points more comprehensively in future publications. We have expanded on this in the discussion as suggested.

      Furthermore, it would be valuable to investigate whether the effects of photostimulation of BF axons vary depending on neuronal responsiveness. This could help elucidate how neurons with positive responses, potentially putative PT neurons, differ from neurons with negative responses, putative IT neurons, in their response to BF axon photostimulation during locomotion.

      We have attempted an analysis of the form suggested. In short, we found no relationship between a neuron’s response to optogenetic stimulation of ChAT axons and its response to locomotion onset, or its mean activity. Based on their response to locomotion onset in closed loop, we split layer 5 neurons into three groups, 30% most strongly decreasing (putative Tlx3+), 30% most strongly increasing, and the rest. We did not see a response to optogenetic stimulation of basal forebrain cholinergic axons in any of the three groups (Author response image 4A). We also found no obvious relationship between the mean activity of neurons and their response to optogenetic stimulation (Author response image 4B).

      Author response image 4.

      Neither putative layer 5 cell types nor neuronal responsiveness correlates with the response to optogenetic stimulation of cholinergic axons. (A) Average calcium response of layer 5 neurons split into putative Tlx3 (closed loop locomotion onset suppressed) and non-Tlx3 like (closed loop locomotion onset activated) to optogenetic stimulation of cholinergic axons. (B) Average calcium response of layer 5 neurons to optogenetic stimulation of cholinergic axons as a function of their mean response throughout the experimental session. Left: Each dot is a neuron. Right: Average correlation in the response of layer 5 to optogenetic stimulation and mean activity over all neurons per imaging site. Each dot is an imaging site.

      (Minor comments)

      1) It is unclear which BF subregion(s) were targeted in this study.

      Thanks for pointing this out. We targeted the entire basal forebrain (medial septum, vertical and horizontal limbs of the diagonal band, and nucleus basalis) with our viral injections. All our axonal imaging data comes from visual cortex and given the sensory modality-selectivity of cholinergic projections to cortex, the labeled axons originate from medial septum and the diagonal bands (Kim et al., 2016). We have now added the labels for basal forebrain subregions targeted next to the injection coordinates in the manuscript.

      2) Page 43, Line 818: The journal name of the cited paper Collins et al. is missing.

      Fixed.

      3) In the optogenetic experiments, how long is the inter-trial interval? Stimulation of BF is known to have long-lasting effects on cortical activity and plasticity. It is, therefore, important to have a sufficient interval between trials.

      The median inter-trial intervals for the different stimulation events are as follows:

      • Optogenetic stimulation only: 15 s

      • Optogenetic stimulation + grating: 12 s

      • Optogenetic stimulation + mismatch: 35 s

      • Optogenetic stimulation + locomotion onset: 45 s

      We have added this information to the methods in the manuscript.

      Assuming locomotion is the primary driver of acetylcholine release (as we argue in Figures 1 and 2), the frequency of stimulation roughly corresponds to the frequency of acetylcholine release experienced endogenously. It is of course possible that being awake and mobile puts the entire system in a long-lasting acetylcholine-driven state different from what would be observed during long-term quiet wakefulness or during sleep. But the main focus of the optogenetic stimulation experiments we performed was to investigate the consequences of the rapid acetylcholine release driven by locomotion.

      4) Page 11, Line 313: "..., we cannot exclude the possibility of a systemic contribution to the effects we observe through shared projections between different cortical and subcortical target." This possibility can be tested by examining the effect of optogenetic stimulation of cholinergic axons on locomotor activity, as they did for the chemogenetic experiments (Fig. S7). If the optogenetic manipulation changes locomotor activity, it is likely that this manipulation has some impact on subcortical activity and systemic contribution to the changes in cortical responses observed.

      Based on the reviewer's suggestion, we tested this and found no change in the locomotor activity of the mice on optogenetic stimulation of cholinergic axons locally in visual cortex (we have added this as Figure S5 to the manuscript). Please note, however, that we cannot exclude a systemic contribution based on this.

      5) Fig. 4 and 5: In a closed-loop configuration, L2/3 neurons exhibit a transient increase in response at the onset of locomotion, while in an open-loop configuration, their response is more prolonged. On the other hand, L5 neurons show a sustained response in both configurations. Do the authors have any speculation on this difference?

      This is correct. Locomotion onset responses in layer 2/3 are strongly modulated by whether the locomotion onset occurs in closed loop or open loop configurations (Widmer et al., 2022). This difference is absent in our layer 5 data here. We suspect this is a function of a differential within-layer cell type bias in the different recordings. In the layer 2/3 recordings we are likely biased strongly towards superficial L2/3 neurons that tend to be negative prediction error neurons (top-down excited and bottom-up inhibited), see e.g. (O’Toole et al., 2023). A reduction of locomotion onset responses in closed loop is what one would expect for negative prediction error neurons. While layer 5 neurons exhibit mismatch responses, they do not exhibit opposing top-down and bottom-up input that would result in such a suppression (Jordan and Keller, 2020).

      We can illustrate this by splitting all layer 2/3 neurons based on their response to gratings and to visuomotor mismatch into a positive prediction error (PE) type (top 30% positive grating response), a negative prediction error type (top 30% positive visuomotor mismatch response), and the rest (remaining neurons and neurons responsive to both grating and visuomotor mismatch). Plotting the response of these neurons to locomotion onset in closed loop and open loop, we find that negative PE neurons have a transient response to locomotion onset in closed loop while positive PE neurons have a sustained increase in response in closed loop. In open loop the response of the two populations is indistinguishable. Splitting the layer 5 neurons using the same criteria, we don’t find a striking difference between closed and open loop between the two groups of neurons. We have added this as Figure S8.
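
      The split described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and inputs are hypothetical, each neuron is summarized by a single mean grating response and a single mean mismatch response, and neurons falling in the top 30% for both are assigned to the rest group, as in the description.

```python
def split_prediction_error_types(grating_resp, mismatch_resp, top_fraction=0.3):
    """Assign neurons to positive PE, negative PE, or rest by response rank.

    grating_resp, mismatch_resp: one mean response per neuron (same order).
    Returns three sorted lists of neuron indices:
    (positive_pe, negative_pe, rest).
    """
    n = len(grating_resp)
    if n != len(mismatch_resp):
        raise ValueError("response lists must have the same length")
    k = int(round(top_fraction * n))

    def top_indices(values):
        # Indices of the k largest responses.
        order = sorted(range(n), key=lambda i: values[i], reverse=True)
        return set(order[:k])

    top_grating = top_indices(grating_resp)
    top_mismatch = top_indices(mismatch_resp)

    # Neurons in the top group for both measures go to the rest group.
    positive_pe = sorted(top_grating - top_mismatch)
    negative_pe = sorted(top_mismatch - top_grating)
    rest = sorted(set(range(n)) - set(positive_pe) - set(negative_pe))
    return positive_pe, negative_pe, rest
```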

      Reviewer #2 (Recommendations For The Authors):

      Major concerns:

      1) As a ubiquitous promoter was used to drive GCaMP expression, please explain how excitatory neurons were identified.

      2) As the data cover a very small range of running speeds, it is important to confirm that the binary locomotion signal model still applies when mice run at higher speeds - either by selecting recordings where mice have a wider range of running speeds or conducting additional experiments. In addition, please show the running speed tuning of individual axons.

      3) Please provide a more detailed analysis of the effects of locomotion and cholinergic modulation on visual responses. How does cholinergic modulation affect orientation and direction tuning? Are the effects multiplicative or additive? How does this compare to the effects of locomotion on single neurons?

      4) To ensure that the analyses in Figure 5 are not confounded by differences in the visual stimulus, please include average visual flow speed traces for each condition.

      5) Please clarify why chemogenetic manipulations of cholinergic inputs had no effect on pairwise correlations in L2/3.

      6) The latency effect is quite an extraordinary claim and requires careful analysis. Please provide examples of single neurons illustrating the latency effect - including responses across individual grating orientations/directions. One possible confound is that grating presentation could itself trigger locomotion or other movements. In the stationary / noOpto conditions, the grating response might not be apparent in the average trace until the animal begins to move. Thus the large latency in the stationary / noOpto conditions may reflect movement-related rather than visual responses.

      Please see our responses to these points in the public review part above.

      There are some minor points where text and figures could be improved:

      1) When discussing the decorrelation of neuronal responses by cholinergic axon activation, it is important to make it clear that Figure 6D quantifies the responses of layer 5 apical dendrites rather than neurons.

      We have added this information to the results section.

      2) In Figure S7, please clarify why velocity is in arbitrary units.

      This was an oversight and has been fixed.

      3) Please clarify how locomotion and stationary trials are selected in Figure 4.

      We thank the reviewers for pointing this out. Trials were classified as occurring during locomotion or while mice were stationary as follows. We used a time-window of -0.5 s to +1 s around stimulus onset. If mice exhibited uninterrupted locomotion above a threshold of 0.25 cm/s in this time-window, we considered the stimulus as occurring during locomotion, otherwise it was defined as occurring while the mice were stationary. Note, the same criteria to define locomotion state were used to isolate visuomotor mismatch events, and also during control optogenetic stimulation experiments. We have added this information to the methods.
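
      The classification rule above can be sketched in a few lines. The window bounds and the 0.25 cm/s threshold come from the text; the function name, the sampled velocity trace as input, and the inclusive window edges are assumptions for illustration.

```python
def classify_trial(times, speeds, stim_onset, pre=0.5, post=1.0, threshold=0.25):
    """Label a stimulus as occurring during locomotion or while stationary.

    times: sample times in seconds; speeds: locomotion velocity in cm/s.
    A trial counts as 'locomotion' only if every velocity sample between
    stim_onset - pre and stim_onset + post exceeds the threshold, i.e. the
    locomotion is uninterrupted within the window.
    """
    window = [v for t, v in zip(times, speeds)
              if stim_onset - pre <= t <= stim_onset + post]
    if not window:
        raise ValueError("no velocity samples in the analysis window")
    return "locomotion" if all(v > threshold for v in window) else "stationary"
```

      A single sub-threshold sample inside the window is enough to label the trial as stationary, which matches the requirement that locomotion be uninterrupted.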

      4) When testing whether cholinergic activation is sufficient to explain locomotion-induced decorrelation in Figure 6G-H, please show pre-CNO and post-CNO delta-correlation, not just their difference.

      We can do that, but the results are harder to parse this way. We have added this as Figure S11 to the manuscript. The problem with parsing the figure is that the pre-CNO levels are different in different groups. This is likely a function of mouse-to-mouse variability and makes it harder to identify what the CNO induced changes are. Using the pre-post difference removes the batch influence. Hence, we have left this as the main analysis in Figure 6G and 6H.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Joint Public Review:

      Previously, this group showed that Tgfbr1 regulates the reorganization of the epiblast and primitive streak into the chordo-neural hinge and tailbud during the trunk-to-tail transition. Gdf11 signaling plays a crucial role in orchestrating the transition from trunk to tail tissues in vertebrate embryos, including the reallocation of axial progenitors into the tailbud and Tgfbr1 plays a key role in mediating its signaling activity. Progenitors that contribute to the extension of the neural tube and paraxial mesoderm into the tail are located in this region. In this work, the authors show that Tgfbr1 also regulates the reorganization of the posterior primitive streak/base of allantois and the endoderm as well. 

      By analyzing the morphological phenotypes and marker gene expression in Tgfbr1 mutant mouse embryos, they show that it regulates the merger of somatic and splanchnic layers of the lateral plate mesoderm, the posterior streak derivative. They also present evidence suggesting that Tgfbr1 acts upstream of Isl1 (key effector of Gdf11 signaling for controlling differentiation of lateral mesoderm progenitors) and regulates the remodelling of the major blood vessels, the lateral plate mesoderm and endoderm associated with the trunk-to-tail transition. Through a detailed phenotypic analysis, the authors observed that, similarly to Isl1 mutants, the lack of Tgfbr1 in mouse embryos hinders the activation of hindlimb and external genitalia marker genes and results in a failure of lateral plate mesoderm layers to converge during tail development. As a result, they interpret that ventral lateral mesoderm, which generates the pericloacal mesenchyme and genital tuberculum, fails to specify.

      They also show defects in the morphogenesis of the dorsal aorta at the trunk/tail juncture, resulting in an aberrant embryonic/extraembryonic vascular connection. Endoderm reorganization defects following abnormal morphogenesis of the gut tube in the Tgfbr1 mutants cause failure of tailgut formation and cloacal enlargement. Thus, Tgfbr1 activity regulates the morphogenesis of the trunk/tail junction and the morphogenetic switch in all germ layers required for continuing post-anal tail development. Taken together with the previous studies, this work places Gdf11/8 - Tgfbr1 signaling at the pivot of trunk-to-tail transition and the authors speculate that critical signaling through Tgfbr1 occurs in the posterior-most part of the caudal epiblast, close to the allantois. 

      Strengths: 

      The data shown is solid with excellent embryology/developmental biology. This work demonstrates meticulous execution and is presented in a comprehensive and coherent manner. Although not completely novel, the results/conclusions add to the known function of Gdf11 signaling during the trunk-to-tail transition. 

      Weaknesses: 

      The authors rely on the expression of a small number of key regulatory genes to interpret the developmental defects. The alternative possibilities remain to be ruled out thoroughly. The manuscript is also quite descriptive and would benefit from more focused highlighting of the novelty regarding the absence of Tgfbr1 in the mouse embryo. They should also strengthen some of their conclusions with more details in the results.

      Although we used a limited number of key regulatory genes to interpret the phenotype, these genes were carefully chosen to focus on specific processes involving the lateral mesoderm, its derivatives, and the endoderm. In addition to these markers, we included references to other relevant markers that were previously analyzed and initially led us to examine the lateral plate mesoderm and tail gut in Tgfbr1 mutants. To strengthen our analysis, we have now incorporated additional data to clarify specific phenotypes. For instance, in situ hybridization (ISH) for Shh further confirms abnormalities at the caudal end of the endoderm in mutant embryos, while no endodermal defects are observed in the trunk region. We also included an analysis of the intermediate mesoderm, which shows abnormalities at the same level as those found in the lateral plate mesoderm and endoderm of Tgfbr1 mutants.

      It’s important to note that using additional markers to assess the epiblast/primitive streak of Tgfbr1 mutants at E7.5–E8.5, as suggested by a reviewer, is unlikely to yield new insights. At these early stages, Tgfbr1 mutant embryos do not display observable phenotypes in the main body axis. Data in this manuscript already demonstrate the absence of abnormalities at this stage, as shown in Figure 3 and Supplementary Figure 6. Additionally, certain genes whose expression becomes abnormal at the stage when the embryo would enter tail development remain unaffected in the trunk, indicating that trunk extension is not significantly impacted by Tgfbr1 deficiency. While transcriptomic analysis of these Tgfbr1 mutants could provide interesting insights, it would be more appropriate to focus on later developmental stages, which would be beyond the scope of the current study.

      The second major critique was that the manuscript is primarily descriptive. We disagree with this assessment. Several hypotheses were rigorously tested using genetic approaches, including Isl1 knockout experiments, cell tracing from the primitive streak with a newly generated Cre driver to activate a reporter from the ROSA26 locus, and assessment of extraembryonic endoderm fate in Tgfbr1 mutants by introducing the Afp-GFP transgene into the Tgfbr1 mutant background. Additionally, we conducted tracing analyses of tail bud cell contributions to the tail gut via DiI injection and embryo incubation. To address potential concerns regarding this experiment, we have included data showing the DiI position immediately after injection to confirm that it does not contact the tail gut. We also considered and accounted for potential DiI leakage into neuromesodermal progenitors to clarify the endodermal results.

      Our genetic and DiI experiments were specifically designed to differentiate between alternative hypotheses and to confirm hypotheses generated from other analyses. Additionally, improvements in some of the imaging data have helped address remaining concerns.

      Reviewer #1 (Recommendations For The Authors): 

      I have listed my suggestions as queries. The authors may perform experiments or clarify by editing the text to address them. 

      The authors state on Page 11 and elsewhere that the ventral lateral mesoderm is absent in the Tgfbr1 mutant. What is the basis for this conclusion? Are there specific markers for PCM or GT primordium? 

      The specific marker of PCM and GT primordium is Isl1. The absence of this marker in the Tgfbr1 mutants is shown in Dias et al., 2020. The reference is included in the manuscript.

      A schematic illustrating the VLM and the expression patterns of Tgfbr1, Gdf11, etc., would be helpful. 

      Characterization of Gdf11 expression has been previously reported (e.g. McPherron et al 1999, cited in our manuscript). It is expressed in the region containing axial progenitors before the trunk to tail transition and is not expressed in the VLM. Tgfbr1 expression, in turn, is hard to detect, likely because the gene is ubiquitously expressed at a low level. We include in this document some pictures of an ISH, including a control using the Tgfbr1 mutants to illustrate that the staining resembling background actually represents Tgfbr1 expression. If the reviewers find it important, we can also incorporate these data into the manuscript. Under these circumstances, we feel that a schematic might not be very informative.

      Author response image 1.

      Image showing an example of an ISH procedure with a probe against Tgfbr1, showing widespread and low expression. The lower picture shows a ventral view of a stained wild type E10.5 embryo.

      Do Foxf1+ cells in the 'extended LPM' of Tgfbr1 mutants suggest fate transformation, or do they indicate the misexpression of a marker gene otherwise suppressed by Tgfbr1 activity? The authors suggest that Foxf1+ cells are VLM progenitors from posterior PS trapped in the extended LPM. Do they continue to express PS markers?

      The observation that in both wild type and Tgfbr1 mutant embryos Foxf1 expression in the trunk is restricted to the splanchnic LPM indicates that the absence of this marker in the somatic LPM is not the result of a suppression of its expression by Tgfbr1. In wild type embryos Foxf1 is also expressed in the posterior PS, regulated independently of its expression in the LPM (i.e. Shh-independent), and later in the pericloacal mesoderm (our supplementary figure 2). Because Foxf1 expression in the posterior PS was not suppressed in the Tgfbr1 mutants, and because the pericloacal mesoderm is absent, we interpret that the Foxf1-positive cells in the two layers around the extended coelomic cavity in the posterior end of the mutant embryos are derived from the posterior PS, resulting from the absence of its normal progression through the embryonic tissues.

      We did not find expression of markers of the PS that gives rise to paraxial mesoderm, like Tbxt, further suggesting that those cells could derive from the restricted set of cells within the posterior PS that contribute to the pericloacal mesoderm.

      For example, the misexpression of Apela is interpreted as mis-localized endoderm cells. They show scattered Keratin 8 misexpression to support the interpretation. It would be more convincing if the authors tested the expression of other endoderm markers. 

      As indicated in the manuscript, we suggest that these cells are endoderm progenitors (p. 13), like those present at the posterior end of the gut tube at E9.5 and E10.5, that are unable to incorporate into the gut tube. Apela is not a general endodermal marker: it is expressed in the foregut pocket and the nascent cells of the hindgut/tail gut, becoming downregulated as cells take on typical endodermal signatures. The presence of ectopic Apela expression in the extended LPM of the mutant embryos might indeed indicate the presence of progenitors that failed to downregulate Apela because of the lack of differentiation-associated downregulation. This would also imply the absence of definitive endodermal markers.

      The Nodal signaling pathway in the anterior PS drives endoderm development. It acts through Alk7. Does Tgfbr1 (Alk5) mutation impact endoderm development, in general? It isn't easy to assess this from the Foxa2 in situ RNA hybridization shown in Figures 6A and B. It would be helpful for the readers if the authors clarified this point. 

      The pictures in Figure 7D-D’ already show that the endoderm is mostly preserved until the region of the trunk to tail transition. The presence of a rather normal endoderm in the embryonic trunk can also be seen with Shh, in a figure added as Supplementary Fig. 5.

      Reviewer #2 (Recommendations For The Authors): 

      The authors mention two interesting novel points which they should develop in the discussion, and probably also in the results. 

      (1) The authors speculate about the possible involvement of the posterior PS as a mediator of Gdf11/Tgfbr1 signaling activity. However, as mentioned in the manuscript, their experiments do not allow regional sublocalization within the PS... Here it would be important to assess/discuss in more detail which progenitors respond to this signaling activity and when they do it. At the very least, the authors should provide high-resolution spatiotemporal data of the expression of Tgfbr1 in the PS. 

      Tgfbr1 expression at this embryonic stage does not give clear differential patterns. The data reported for this expression in Andersson et al 2006 are of very low quality and we have not been able to reproduce the reported pattern. On the contrary, all our efforts over the years provided a very general staining that could even be interpreted as background. When we now included Tgfbr1 mutants as controls, it became clear that the ubiquitous and low level signal observed in wild type embryos indeed represents the Tgfbr1 expression pattern: low level and ubiquitous. We are attaching a figure to this document illustrating these observations. If required, this can also be included in the manuscript as a supplementary figure.

      Also, the work of Wymeersch et al., 2019 regarding the lateral plate mesoderm progenitors (LPMPs) should be referred to and discussed here. 

      This has now been added to the results (page 11) and to the discussion (page 16).

      For instance, are the LPMP transcriptomic differences detected between E7.5 and E8.5 caused by Tgfbr1 signaling activity? This question could be easily answered through a comparative bulk RNAseq analysis of the posterior-most region of the PS of mutant and WT embryos. The possible colocalization of Tgfb1 (Wymeersch et al., 2019) and Tgfbr1 in the LPMPs should also be addressed. 

      We agree with the suggestion that RNA-seq in the posterior PS of WT and mutant embryos might be informative. However, it is very likely that within the proposed timeframe (E7.5 to E8.5) there are no significant differences between the wild type and the Tgfbr1 mutant embryos because there is no apparent axial phenotype in Tgfbr1 mutant embryos before the trunk to tail transition. Therefore, at this stage, we think that this experiment is out of the scope of the present manuscript.

      (2) The activity of Tgfbr1 during the trunk-to-tail transition is critical for the development of tail endodermal tissues. Here the authors suggest again the involvement of the posterior PS/allantois region, but a similar phenotype can also be observed for instance in the absence of Snai1 in the caudal epiblast (Dias et al., 2020)... It would be important to assess/discuss the origin of those morphogenetic problems in the gut. Is it due to the reallocation of NMC cells into the CNH? The tailbud-EMT process? LPMPs specification?... Regional mutations or gain of functions of Snai1 or Tgfbr1 in the caudal epiblast would help answer the question.  

      The endodermal phenotype in the Snai1 mutants is different from that observed in the Tgfbr1 mutants. As can be observed in Figures 3, 4 and 5 of Dias et al., the tailbud is absent and replaced by a structure that extends the epiblast. As a consequence, the endoderm finishes at the base of that structure, even expanding to make a structure resembling the cloaca, which is different from what is seen in the Tgfbr1 mutants. In this case, the lack of tail gut is likely to result either from the lack of formation of the progenitors of the gut endoderm or from the dissociation of what would be the tail bud from the LPM. Actually, hindlimb/pericloacal mesoderm markers, like Tbx4, are preserved in the Snai1 mutant. As for the Snai1 gain-of-function experiment, also already reported in Dias et al 2020, the fate of these cells is not clear. The ISH for Foxa2 showed extra signals, but as it is not an exclusive marker for endoderm it is not possible to know whether any of these signals correspond to endodermal tissues.

      Regarding the development of tail endodermal tissues, the authors suggest that it occurs from a structure derived from the PS that is located posteriorly, in the tailbud, after the tip of the growing gut. This is an important and novel point as it suggests that the primordium of the endoderm is not wholly specified during gastrulation. So the observation should be well supported. How can Anastasiia et al. distinguish such "structure" from the actual developing gut? Does it have a distinct molecular signature or any morphological landmark that enables its separation from the actual gut? The data suggests that the region highlighted in Supplementary Figure 4Ab contains part of the actual gut tube (the same is suggested in Figure 5B). If the authors think otherwise, they must characterize that region of the tailbud by doing a thorough morphological and gene/protein expression analysis and assess its potency, via transplantation experiments. Also, the authors' claim mostly relies on the DiI experiments and those have three problems: #1 Anastasiia et al. assess "tail" endodermal growth at E9.5 when the correct stage to do it is after E10.5 (after tailbud formation). #2 Incongruencies, low number (only three embryos), and diversity in the results shown in Figure 8 and Supplementary Figure 4. For instance, despite similar staining at 0h, the extension and amount of DiI present in the gut tube after 20h varies significantly amongst the differently labeled embryos. A possible explanation lies in the abnormal leakiness of the DiI labelings and that is confirmed by the observations shown in Supplementary Figure 4M-O; the same for Supplementary Figure 4G, which shows a substantial amount of DiI in the neural tube. #3 The authors must provide high-quality data showing which tissues/regions were labelled at time 0h, including transversal and sagittal sections as they did for the 20h time-point. Additionally, it is important to re-orient the sagittal optical sections to a position that also shows the neural tube (like a mid-sagittal section) and include information concerning the AP/DV axis, as well as the location of the transversal optical sections in the sagittal image.

      As described in the reply to reviewer 1, Apela is expressed in the nascent tail gut endoderm but not in more anterior areas except for a foregut pocket, and becomes downregulated as the tube acquires endodermal signatures. Therefore, the structure to which the reviewer refers might indeed represent a group of progenitors that extend the tail gut. The observation that this property is seen only in the tail gut as it grows already separates this region of the gut, which in the end does not contribute to mature organs, from more anterior areas of the endoderm (essentially anterior to the cloaca) that will become a relevant tissue of the intestinal organs. Our DiI labelling experiment was aimed at testing whether this pool of cells contributes to the gut, but it does not allow us to determine the nature of those cells, a question that will require further research (discussed on p. 17) and that we think is beyond the scope of the present manuscript.

      Regarding the suggestion to label at E10.5, we agree that at E9.5 the tail bud, in terms of NMCs, is not completely formed; for example, the neuropore is not yet closed at this stage. However, we are more interested in the regression of the epiblast, which is complete by E9.5. Injecting at E9.5 also has technical advantages for us: first, in our hands earlier embryos grow better in culture, and second, the tailbud is easier to inject at E9.5 because it is slightly bigger than at E10.5. Therefore, injecting at E9.5 is less prone to technical artifacts due to injection inaccuracy and compromised growth in culture.

      We agree that the injected DiI could also leak into NMPs, which might be located in the same area. However, while this could result in labeling of the neural tube, it would not affect the interpretation of the finding of labeled cells in the tail gut. Indeed, the presence of this label in the gut epithelium indicates the presence of tail gut progenitors in the injected region. We added some considerations of this possible leakage to the results section of the manuscript (p. 15). We thank the reviewer for drawing our attention to this issue.

      We also now provide high-quality data showing labelled tissue at 0h in Supplementary figure 8A-c’, higher magnification images in Fig. 8, and reoriented optical sections in Fig. 6 and in Supplementary Fig. 7, including axis and location of the sections as suggested by the reviewer.

      Minor concerns/comments: 

      (1) The abstract is quite long, though this might be fine for this journal. 

      (2) In relation to the comment on the abstract, the manuscript needs an initial Figure describing the events that are described in the introduction. Otherwise, the manuscript will only be accessible to mouse embryologists.

      We have a figure summarizing the results at the end of the manuscript, so we think that including a similar figure at the beginning might be redundant. What we could do, if required, is to include this type of schematic as a graphical abstract.

      (3) The authors need to clarify what they mean when they use the following expressions "PS fate" and "fate of the posterior PS".

      We do not think that we have used such expressions. Indeed, they did not come up when we ran a “find” in the Word document. However, they would mean the tissues that derive from these regions at later developmental stages.

      (4) The assessment of Isl1 expression in Tgfbr1 mutant and transgenic mouse embryos would be better indicative of their molecular relationship than a comparative phenotypic analysis. 

      These data have been reported in Dias et al 2020 and Jurberg et al 2013, both cited in the manuscript.  

      (5) The authors should explain or discuss what the upregulation of Foxa2 in the posterior end of Tgfbr1 mutants means.

      While an upregulation is apparent in the figure, looking at other pictures we cannot be sure that this is a significant, quantifiable upregulation. We therefore removed the statement from the text.

      (6) What happens to the intermediate mesoderm during the trunk-to-tail transition? Is Tgfbr1 involved in the regulation of its development?

      We have tested this using Pax2, added the relevant data in Supplementary Fig. 1, and described them in the results.

      (7) The term "potential" should not be used during the description of DiI labeling experiments as this technique only assesses cell fate.

      Corrected

      (8) Some figures lack AP/DV axis information (e.g. Figures 6, C, and D).

      Corrected

    1. Author Response

      The following is the authors’ response to the original reviews.

      We would like to extend our gratitude to the reviewers for their meticulous analysis and constructive feedback on our manuscript. We have revised our paper based on the suggestions regarding supporting literature and the theory behind CAPs along with detailed insights regarding our methods. Their suggestions have been extremely useful in strengthening the clarity and rigor of our manuscript.

      Reviewer #1 (Recommendations For The Authors):

      (1) There are no obvious problems with this paper and it is relatively straightforward. There are some challenges that I would like to suggest. These variants have multiple mutations, so it would be interesting if you could drill down to find out which mutation is the most important for the collective changes reported here. I would like to see a sequence alignment of these variants, perhaps in the supplemental material, just to get some indication of the extent of mutations involved.

      Finding the most important mutation within a set is a tricky question, as each mutation changes the way future mutations will affect function due to epistasis. Indeed, this is what we aim to explore in this work. To illustrate this point, we included a new supplementary figure S5A. Three critical mutations that emerged quickly, and were frequently observed in other dominant variants, were S477N, T478K, and N501Y. Thus, we computed the EpiScore values of these three mutations, with several critical residues contributing to hACE2 binding. The EpiScore distribution indicates that residues 477, 478, and 501 have strong epistatic (i.e., non-additive) interactions, as indicated by EpiScore values above 2.0.

      To further investigate these epistatic interactions, we first conducted MD simulations and computed the DFI profile of these three single mutants. We analyzed how different the DFI scores of the hACE2 binding interface residues of the RBD are, across three single mutants with Omicron, Delta, and Omicron XBB variants (Fig S5B). Fig S5B shows how mutations at these particular sites affect the binding interface DFI in various backgrounds, as the three mutations are also observed in the Omicron, XBB, and XBB 1.5 variants. If the difference in the DFI profile of the mutant and the given variant is close to 0, then we could safely state that this mutation affected the variant the most. However, what we observe is quite the opposite: the DFI profile of the mutation is significantly different in different variant backgrounds. While these mutations may change overall behavior, their individual contributions to overall function are more difficult to pin down because overall function is dependent on the non-additive interactions between many different residues.

      Author response image 1.

      (A) Three critical mutations that emerged quickly, and were frequently observed in other dominant variants, were S477N, T478K, and N501Y. EpiScores of sites 477, 478, and 501 with one another are shown with k = the binding interface of the open chain. These residues are highly epistatic, producing higher responses than expected when perturbed together. (B) The difference in the dynamic flexibility profiles between the single mutants and the most common variants for the hACE2 binding residues of the RBD. DFI profiles exhibit significant variation from zero, and also show different flexibility in each background variant, highlighting the critical non-additive interactions of the other mutation in the given background variant. Thus, these three critical mutations, impacting binding affinity, do not solely contribute to the binding. There are epistatic interactions with the other mutations in VOCs that shape the dynamics of the binding interface to modulate binding affinity with hACE2.

      As we discussed above, while the epistatic interactions are crucial and the collective impact of the mutations shapes the mutational landscape of the spike protein, we would like to note that mutation S486P is one of the critical mutations we identify, modulating both antibody and hACE2 binding, and our analysis reveals its strong non-additive interactions with the other mutational sites. This mutational site appears in both XBB1.5 and earlier Omicron strains, which highlights its importance in the functional evolution of the spike protein. CAPs 346R, 486F, and 498Q may also be important, as they have a high EpiScore, indicating critical epistatic interaction with many mutation sites.

      Regarding the suggestion about presenting the alignment of the different variants, we have attached a mutation table, highlighting the mutated residues for each strain compared to the reference sequence, as supplemental Figure S1 along with the full alignment file.

      (2) Also, I am wondering if it would be possible to insert some of these flexibilities and their correlations directly into the elastic network models to enable a simpler interpretation of these results. I realize this is beyond the scope of the present work, but such an effort might help in understanding these relatively complex effects.

      This is a great suggestion. A similar analysis has been performed for different proteins by Mcleash (see doi: 10.1016/j.bpj.2015.08.009), who modulated the spring constants of specific positions to alter their flexibility and evaluated the change in elastic free energy to identify critical mutation (in particular, allosteric mutation) sites. We will be happy to pursue this as future work.

      Minor

      (3) 1 typo on line 443 - should be binding instead of biding.

      Fixed, thanks for spotting that.

      (4) The two shades of blue in Fig. 4B were not distinguishable in my version.

      To fix this, we have changed the overlapping residues between Delta and Omicron to a higher contrast shade of blue.

      (5) Compensatory is often used in an entirely different way - additional mutations that help to recover native function in the presence of a deleterious mutation.

      Although our previous study (Ose et al. 2022, Biophysical Journal) shows that compensatory mutations were generally additive, the two ideas are not one and the same. We thank the reviewer for pointing this out. Therefore, to clarify, we have now described our results in terms of dynamic additivity, rather than compensation.

      Reviewer #2 (Recommendations For The Authors):

      (1) The authors note that the identified CAPs overlap with those of others (Cagliani et al. 2020; Singh and Yi 2021; Starr, Zepeda, et al. 2022). In itself, this merits a deeper discussion and explicit indication of which positions are not identified. However, there is one point that I believe may represent a fundamental flaw in this study in that the calculation of EP from the alignment of S proteins ignores entirely the differences in the interacting interface with which S for different coronaviruses in the alignment interact in the different receptors in each host species. This may be the reason why so many "CAPs" are in the RBD. The authors should at the very least make a convincing case of why they are not simply detecting constraints imposed by the different interacting partners, at least in the case of positions within the RBD interface with ACE2. Another point that the authors should discuss is that ACE2 is not the only receptor that facilitates infection, TMPRSS2 and possibly others have been identified as well. The results should be discussed in light of this.

      To begin with, we have now explicitly noted (on line 135) that “sites 478, 486, 498, and 681 have already been implicated in SARS-CoV-2 evolution, leaving the remaining 11 CAPs as undiscovered candidate sites for adaptation.” Evolutionary analyses are done using orthologous protein sequences, so there is no way to integrate information on different receptors in each host species in the calculation of EPs. However, we appreciate that the preponderance of CAPs in the RBD is likely due to different binding environments. We have added the following text (on line 83) to clarify our point: “Adaptation in this case means a virus which can successfully infect human hosts. As CAPs are unexpected polymorphisms under neutral theory, their existence implies a non-neutral effect. This can come in the form of functional changes (Liu et al. 2016) or compensation for functional changes (Ose et al. 2022). Therefore, we suspect that these CAPs, being unexpected changes from coronaviruses across other host species with different binding substrates, may be partially responsible for the functional change of allowing human infection.” This hypothesis is supported by the overlap of CAPs we identified with the positions identified in other studies (e.g., 478, 486, 498, and 681). Binding to TMPRSS2 and other substrates are also covered by this analysis as it is a measure of overall evolutionary fitness, rather than binding to any specific substrate. Our paper does focus on discussing hACE2 binding and mentions furin cleavage, but indeed lacks discussion on the role of TMPRSS2. We have added the following text to line 157: “Another host cell protease, TMPRSS2, facilitates viral attachment to the surface of target cells upon binding either to sites Arg815/Ser816, or Arg685/Ser686 which overlaps with the furin cleavage site 676-689, further emphasizing the importance of this area (Hoffmann et al. 2020b; Fraser et al. 2022).”

      (2) Turning now to the computational methods utilized to study dynamics, I have serious reservations about the novelty of the results as well as the validity of the methodology. First of all, the authors mention the work of Teruel et al. (PLOS Comp Bio 2021) in an extremely superficial fashion and do not mention at all a second manuscript by Teruel et al. (Biorxiv 2021.12.14.472622 (2021)). However, the work by Teruel et al. identifies positions and specific mutations that affect the dynamics of S and the evolution of the SARS-CoV-2 virus in light of immune escape, ACE2 binding, and open and closed state dynamics. The specific differences in approach should be noted but the results specifically should be compared. This omission is evident throughout the manuscript. Several other groups have also published on the use of normal-mode analysis methods to understand the Spike protein, among them Verkhivker et al., Zhou et al., Majumder et al., etc.

      Thank you for your suggestions. Upon further examination of the listed papers, we have added citations to other groups employing similar methods. However, it's worth noting that the results of Teruel et al.'s studies are generally not directly comparable to our own. Particularly, they examine specific individual mutations and overall dynamical signatures associated with them, whereas our results are always considered in the context of epistasis and joint effects with CAPs, and all mutations belong to the common variants. Although important mutations may be highlighted in both cases, it is for very different reasons. Nevertheless, we provide a more detailed mention of the results of both studies. See lines 178, 255, and 393.

      (3) The last concern that I have is with respect to the methodology. The dynamic couplings and the derived index (DCI) are entirely based on the use of the elastic network model presented which is strictly sequence-agnostic. Only C-alpha positions are taken into consideration and no information about the side-chain is considered in any manner. Of course, the specific sequence of a protein will affect the unique placement of C-alpha atoms (i.e., mutations affect structure), therefore even ANM or ENM can to some extent predict the effect of mutations in as much as these have an effect on the structure, either experimentally determined or correctly and even incorrectly modelled. However, such an approach needs to be discussed in far deeper detail when it comes to positions on the surface of a protein such that the reader can gauge if the observed effects are the result of modelling errors.

      We would like to clarify that most of our results do not involve simulations of different variants, but rather how characteristic mutation sites for those variants contribute to overall dynamics. For the full spike, we operate on only two simulations: open and closed. When we do analyze different variants, starting on line 438, the observed difference does not come from the structure, but from the covariance matrix obtained from molecular dynamics (MD) simulations, which are sensitive to single amino acid changes.
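
      The role of the covariance matrix in this answer can be illustrated with a minimal sketch. It assumes the commonly described linear-response form of the dynamic flexibility index: a unit force is applied to each residue in turn, the displacement of every residue is read off the covariance matrix, and each residue's summed response is normalized by the total response. The function names, the three axis-aligned force directions, and the percentile-rank definition of %DFI are illustrative assumptions, not the authors' code.

```python
from math import sqrt

def dfi_from_covariance(cov, directions=((1, 0, 0), (0, 1, 0), (0, 0, 1))):
    """Per-residue flexibility from a (3N x 3N) positional covariance matrix.

    cov is a list of lists ordered x1, y1, z1, x2, y2, z2, ...
    The axis-aligned unit forces are an assumption made for brevity.
    """
    n_res = len(cov) // 3
    response = [0.0] * n_res  # summed displacement of each residue
    for j in range(n_res):  # residue that receives the force
        for fx, fy, fz in directions:
            for i in range(n_res):  # residue whose response is recorded
                disp = []
                for k in range(3):
                    row = cov[3 * i + k]
                    disp.append(row[3 * j] * fx + row[3 * j + 1] * fy + row[3 * j + 2] * fz)
                response[i] += sqrt(sum(d * d for d in disp))
    total = sum(response)
    return [r / total for r in response]

def percentile_rank(values):
    """Fraction of positions whose value is less than or equal to each value."""
    n = len(values)
    return [sum(v <= x for v in values) / n for x in values]

# Toy covariance for two uncoupled residues; the second fluctuates three times more.
toy_cov = [[0.0] * 6 for _ in range(6)]
for idx, variance in enumerate([1, 1, 1, 3, 3, 3]):
    toy_cov[idx][idx] = float(variance)

dfi = dfi_from_covariance(toy_cov)
pct_dfi = percentile_rank(dfi)
```

      Because the covariance matrix comes from the trajectory rather than from a C-alpha network, any single amino acid change that alters the sampled fluctuations changes the resulting profile, which is the point made in the response above.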

      Reviewer #3 (Recommendations For The Authors):

      (1) On line 99 there is a misspelling, 'withing'.

      It has been fixed. Thanks for spotting that.

      (2) Some graphical suggestions to make the figures easier to read:

      In Figure 1C, a labeled circle around the important sites, the receptor binding domain, and the Furin cleavage site, would help the reader orient themselves. Moreover, it would make clear which CAPs are NOT in the noteworthy sites described in the text.

      Good idea. We have added transparent spheres and labels to show hACE2 binding sites and Furin cleavage sites.

      In Figure 2C the colors are a bit low contrast; moreover, there are multiple text sizes on the same figure which should perhaps be avoided to ensure legibility.

      We have made yellow brighter and standardized font sizes.

      Figure 3 is a bit dry, perhaps indicating in which bins the 'interesting' sites could be informative.

      Thank you for the suggestion, but the overall goal of Figure 3 is to illustrate that the mutational landscape is governed by the equilibrium dynamics in which flexible sites undergo more mutations during the evolution of the CoV2 spike protein. Therefore, adding additional positional information may complicate our message.

      Figure 4, the previous suggestions about readability apply.

      We ensured same-sized text and higher-contrast colors.

      Figure 5B, the residue labels are too small.

      We increased the font size of the residue labels.

      In Figure 8 maybe adding Delta to let the reader orient themselves would be helpful to the discussion.

      Unfortunately, there is no single work that has experimentally quantified binding affinities towards hACE2 for all the variants. When we conducted the same analysis for the Delta variant in Figure 8, the experimental values were obtained from a different source (doi: 10.1016/j.cell.2022.01.001) and the values were significantly different from the experimental work we used for Omicron (Yue et al. 2023). When we adjusted for the difference in the experimentally measured binding affinity of the original Wuhan strain between these two studies, we observed a similar correlation, as seen below. However, we think this might not be a proper representation. Therefore, we chose to keep the original figure.

      Author response image 2.

      The %DFI calculations for variants Delta, Omicron, XBB, and XBB 1.5. (A) %DFI profile of the variants are plotted in the same panel. The grey shaded areas and dashed lines indicate the ACE2 binding regions, whereas the red dashed lines show the antibody binding residues. (B) The sum of %DFI values of RBD-hACE2 interface residues. The trend of total %DFI with the log of Kd values overlaps with the one seen with the experiments. (C) The RBD antibody binding residues are used to calculate the sum of %DFI. The ranking captured with the total %DFI agrees with the susceptibility fold reduction values from the experiments.

      (3) Replicas of the MD simulations would make the conclusions stronger in my opinion.

      We ran a 1 µs simulation and performed a convergence analysis of the MD simulations following prior work (Sawle and Ghosh, 2016). More importantly, we also evaluated the statistical significance of the computed DFI values, as explained in detail below (please see our answer to question 3 of Reviewer #3's Public Review).

      Reviewer #3 (Public Review):

      (1) A longer discussion of how the 19 orthologous coronavirus sequences were chosen would be helpful, as the rest of the paper hinges on this initial choice.

      The following explanation has been added on line 114: EP scores of the amino acid variants of the S protein were obtained using a Maximum Likelihood phylogeny (Kumar et al. 2018) built from 19 orthologous coronavirus sequences. Sequences were selected by examining available non-human sequences with a sequence identity of 70% or above to the human SARS CoV-2’s S protein sequence. This cutoff allows for divergence over evolutionary history such that each amino acid position had ample time to experience purifying selection, whilst limiting ourselves to closely related coronaviruses. (Figure 1A).
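
      The selection rule described above (non-human sequences with at least 70% identity to the human SARS-CoV-2 S protein) can be sketched as follows. The response does not say how identity was computed, so ignoring gapped alignment columns is an assumption here, and the sequences and names are invented placeholders rather than real spike sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are not counted
        compared += 1
        matches += a == b
    return 100.0 * matches / compared if compared else 0.0

def select_orthologs(reference, candidates, cutoff=70.0):
    """Keep candidates whose identity to the reference is at or above the cutoff."""
    return {
        name: seq
        for name, seq in candidates.items()
        if percent_identity(reference, seq) >= cutoff
    }

# Placeholder aligned fragments, for illustration only.
reference = "MFVFLVLLPL"
candidates = {
    "hypothetical_close_cov": "MFVFLVLLPL",
    "hypothetical_gapped_cov": "MFIFLVLL-L",
    "hypothetical_distant_cov": "MKLLVAALPA",
}
selected = select_orthologs(reference, candidates)
```

      With these placeholders the first two sequences pass the cutoff and the distant one is excluded, which mirrors the trade-off between allowing divergence and staying within closely related coronaviruses.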

      (2) The 'reasonable similarity' with previously published data is not well defined, nor was there any comment about some of the residues analyzed (namely 417-484). We have revised this part of the manuscript and add to the revised version.

      We removed the line about reasonable similarity as it was vague, added a line about residues 417-484, and revised the text accordingly, starting on line 354.

      (3) There seem to be no replicas of the MD simulations, nor a discussion of the convergence of these simulations. A more detailed description of the equilibration and production schemes used in MD would be helpful. Moreover, there is no discussion of how the equilibration procedure is evaluated, in particular for non-experts this would be helpful in judging the reliability of the procedure.

      We opted for a single, extended equilibrium simulation to comprehensively explore the long-term behavior of the system. Given the specific nature of our investigation and resource constraints, a well-converged, prolonged simulation was deemed a practical and scientifically valid approach, providing a thorough understanding of the system's dynamics. (doi: 10.33011/livecoms.1.1.5957, https://doi.org/10.1146/annurev-biophys-042910-155255 )

      We updated our methods section starting on line 605 with extended information about the MD simulations and the convergence criteria for the equilibrium simulations. We also added a section that explains our analysis to check the statistical significance of the obtained DFI values.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, Millard and colleagues investigated if the analgesic effect of nicotine on pain sensitivity, assessed with two pain models, is mediated by Peak Alpha Frequency (PAF) recorded with resting state EEG. The authors found indeed that nicotine (4 mg, gum) reduced pain ratings during phasic heat pain but not cuff pressor algometry compared to placebo conditions. Nicotine also increased PAF (globally). However, mediation analysis revealed that the reduction in pain ratings elicited by the phasic heat pain after taking nicotine was not mediated by the changes in PAF. Also, the authors only partially replicated the correlation between PAF and pain sensitivity at baseline (before nicotine treatment). At the group-level no correlation was found, but an exploratory analysis showed that the negative correlation (lower PAF, higher pain sensitivity) was present in males but not in females. The authors discuss the lack of correlation.

      In general, the study is rigorous, methodology is sound and the paper is well-written. Results are compelling and sufficiently discussed.

      Strengths:

      Strengths of this study are the pre-registration, proper sample size calculation, and data analysis. But also the presence of the analgesic effect of nicotine and the change in PAF.

      Weaknesses:

      It would even be more convincing if they had manipulated PAF directly.

      We thank Reviewer #1 for their positive and constructive comments regarding our study. We appreciate the view that the study was rigorous and methodologically sound, that the paper was well-written, and that the strengths included our pre-registration, sample size calculation, and data analysis.

      In response to the reviewer's comment about more directly manipulating Peak Alpha Frequency (PAF), we agree that such an approach could provide a more direct investigation of the role of PAF in pain processing. We chose nicotine to modulate PAF as the literature suggested it was associated with a reliable increase in PAF speed. As mentioned in our Discussion, there are several alternative methods to manipulate PAF, such as non-invasive brain stimulation techniques (NIBS) like transcranial alternating current stimulation (tACS) or neurofeedback training. These approaches could help clarify whether a causal relationship exists between PAF and pain sensitivity. However, methods such as NIBS still require further investigation, as there is little evidence that these approaches change PAF (Millard et al., 2024).

      Reviewer #2 (Public Review):

      Summary:

      The study by Millard et al. investigates the effect of nicotine on alpha peak frequency and pain in a very elaborate experimental design. According to the statistical analysis, the authors found a factor-corrected significant effect for prolonged heat pain but not for alpha peak frequency in response to the nicotine treatment.

      Strengths:

      I very much like the study design and that the authors followed their research line by aiming to provide a complete picture of the pain-related cortical impact of alpha peak frequency. This is very important work, even in the absence of any statistical significance. I also appreciate the preregistration of the study and the well-written and balanced introduction. However, it is important to give access to the preregistration beforehand.

      Weaknesses:

      The weakness of the study revolves around three aspects:

      (1) I am not entirely convinced that the authors' analysis strategy provides a sufficient signal-to-noise ratio to estimate the peak alpha frequency in each participant reliably. A source separation (ICA or similar) would have been better suited than electrode ROIs to extract the alpha signal. By using a source separation approach, different sources of alpha (mu, occipital alpha, laterality) could be disentangled.

      (2) Also, there's a hint in the literature (reference 49 in the manuscript) that the nicotine treatment may not work as intended. Instead, the authors' decision to use nicotine to modulate the peak alpha frequency and pain relied on other, not suitable work on chronic pain and permanent smokers. In the present study, the authors use nicotine treatment and transient painful stimulation on nonsmokers.

      (3) In my view, the discussion could be more critical for some aspects and the authors speculate in directions for which their findings cannot provide any evidence. Speculations are indeed very important to generate new ideas but should be restricted to the context of the study (experimental pain, acute interventions). The unfortunate decision to use nicotine severely hampered the authors' aim of the study.

      Impact:

      The impact of the study could be to show what has not worked to answer the research questions of the authors. The authors claim that their approach could be used to define a biomarker of pain. This is highly desirable but requires refined methods and, in order to make the tool really applicable, more accurate approaches at subject level.

      We thank reviewer #2 for their recognition of the study’s design, the importance of this research area, and the pre-registration of our study. In response to the weaknesses highlighted:

      (1) We appreciate the reviewer’s suggestion to improve the signal-to-noise ratio by applying source separation techniques, such as ICA, which have now been performed and incorporated into the manuscript. Our original decision to use sensor-level ROIs followed the precedent set in previous studies, our rationale being to improve reproducibility and avoid biases from picking individual electrodes or manually picking sources. We have added analyses using an automated pipeline that selects components based on the presence of a peak in the alpha range and alignment with a predefined template topography representing sensorimotor sites. Here again we found no significant differences in the mediation results that used a sensor space sensorimotor ROI, further supporting the robustness of the chosen approach. ICA could still potentially disentangle different sources of alpha, such as occipital alpha and mu rhythm, and provide new insights into the PAF-pain relationship. We have now added a discussion in the manuscript about the potential advantages of source separation techniques and suggest that the possible contributions of separate alpha sources be investigated and compared to sensor space PAF as a direction for future research.

      (2) We recognise the reviewer's concern regarding our choice of nicotine as a modulator of pain and alpha peak frequency (PAF). The meta-analysis by Ditre et al. (2016) indeed points to small effect sizes for nicotine's impact on experimental pain and highlights the potential for publication bias. However, our decision to use nicotine in this study was not primarily based on its direct analgesic effects, but rather on its well-documented ability to modulate PAF, in smoking and non-smoker populations, as outlined in our study aims.

      In this regard, the intentional use of nicotine was to assess whether changes in PAF could mediate alterations in pain. This approach aligns with the broader concept that a direct effect of an intervention is not necessary to observe indirect effects (Fairchild & McDaniel, 2017). We have, however, revised our introduction to further clarify this rationale, highlighting that nicotine was used as a tool for PAF modulation, not solely for its potential analgesic properties.

      (3) We agree with the reviewer’s observation that certain aspects of the Discussion could be more cautious, particularly regarding speculations about nicotine’s effects and PAF as a biomarker of pain. We have revised the Discussion to ensure that our interpretations are better grounded in the data from this study, clearly stating the limitations and avoiding overgeneralization. This revision focuses on a more critical evaluation of the potential relationships between PAF, nicotine, and pain sensitivity based solely on our experimental context.

      Finally, we also apologize for not providing access to the preregistration earlier. This was an oversight on our end, and we will ensure that future preregistrations are made available upfront.

      Reviewer #3 (Public Review):

      In this manuscript, Millard et al. investigate the effects of nicotine on pain sensitivity and peak alpha frequency (PAF) in resting state EEG. To this end, they ran a pre-registered, randomized, double-blind, placebo-controlled experiment involving 62 healthy adults who received either 4 mg nicotine gum (n=29) or placebo (n=33). Prolonged heat and pressure were used as pain models. Resting state EEG and pain intensity (assessed with a visual analog scale) were measured before and after the intervention. Additionally, several covariates (sex at birth, depression and anxiety symptoms, stress, sleep quality, among others) were recorded. Data was analyzed using ANCOVA-equivalent two-wave latent change score models, as well as repeated measures analysis of variance. Results do not show *experimentally relevant* changes of PAF or pain intensity scores for either of the prolonged pain models due to nicotine intake.

      The main strengths of the manuscript are its solid conceptual framework and the thorough experimental design. The researchers make a good case in the introduction and discussion for the need to further investigate the association of PAF and pain sensitivity. Furthermore, they proceed to carefully describe every aspect of the experiment in great detail, which is excellent for reproducibility purposes. Finally, they analyse the data from almost every possible angle and provide an extensive report of their results.

      The main weakness of the manuscript is the interpretation of these results. Even though some of the differences are statistically significant (e.g., global PAF, pain intensity ratings during heat pain), these differences are far from being experimentally or clinically relevant. The effect sizes observed are not sufficiently large to consider that pain sensitivity was modulated by the nicotine intake, which puts into question all the answers to the research questions posed in the study.

      We would like to express our gratitude to Reviewer #3 for their thoughtful and constructive review, including the positive feedback on the strengths of our study's conceptual framework, experimental design, and thorough methodological descriptions.

      We acknowledge the concern regarding the experimental and clinical relevance of some statistically significant results (e.g., global PAF and pain intensity during heat pain) and agree that small effect sizes may limit their practical implications. However, our primary goal was to assess whether nicotine-induced changes in PAF mediate pain changes, rather than to demonstrate large direct effects on pain sensitivity. Nicotine was chosen for its known ability to modulate PAF, and our focus was on the mechanistic role of PAF in pain perception. To clarify this, we have revised the discussion to better differentiate between statistical significance, experimental relevance, and clinical applicability. We emphasize that this study represents a preliminary step towards understanding PAF’s mechanistic role in pain, rather than a direct clinical application.

      We appreciate the suggestion to refine our interpretation. We have adjusted our language to ensure it aligns with the effect sizes observed and made recommendations for future research, such as testing different nicotine doses, to potentially uncover stronger or more clinically relevant effects.

      Although modest, we believe these findings offer valuable insights into the potential mechanisms by which nicotine affects alpha oscillations and pain. We have also discussed how these small effects could become more pronounced in different populations (e.g., chronic pain patients) and over time, offering guidance for future research on PAF modulation and pain sensitivity.

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      I have a number of points that the authors may want to consider for this or future work.

      (1) By reviewing the literature provided by the authors in the introduction I think that using nicotine as a means to modulate pain and alpha peak frequency was a mistake. The only work that may give a hint on whether nicotine can modulate experimental pain is the meta-analysis by Ditre and colleagues (2016). They suggest that their small effect may contain a publication bias. I think the other "large body of evidence" is testing something else than analgesia.

      Thank you for your consideration of our choice of nicotine in the study. The meta-analysis by Ditre and colleagues (2016) suggests small effect sizes for nicotine's impact on experimental pain, compared to the moderate effects claimed in some papers, especially when accounting for the potential publication bias you mentioned. However, our selection of nicotine was primarily driven by its documented ability to modulate PAF rather than its direct analgesic effects, as clearly stated in our aims. Therefore, we do not view our decision to use nicotine as a mistake; instead, it was aligned with our goal of assessing whether changes in PAF mediate alterations in pain and thus served as a valuable tool. This perspective aligns with the broader concept that a direct effect is not a prerequisite for observing indirect effects of an intervention on an outcome (Fairchild & McDaniel, 2017). To further enhance clarity, we've revised the introduction to emphasize the role of nicotine in manipulating PAF in relation to our study's aims.

      Previously we wrote: “A large body of evidence suggests that nicotine is an ideal choice for manipulating PAF, as both nicotine and smoking increase PAF speed [37,40–47] as well as pain thresholds and tolerance [48–52].” This has been changed to read: “Because evidence suggests that nicotine can modulate PAF, where both nicotine and smoking increase PAF speed [37,40–47], we chose nicotine to assess our aim of whether changes in PAF mediate changes in pain in a ‘mediation by design’ approach [48]. In addition, given evidence that nicotine may increase experimental pain thresholds and tolerance [49–53], nicotine could also influence pain ratings during tonic pain.”

      (2) As mentioned above, the OSF page is not accessible.

      We apologise for this. We had not realised that the pre-registration was under embargo, but we have now made it available.

      (3) I generally struggle with the authors' approach to investigating alpha. With the approach the authors used to detect peak alpha frequency it might be that the alpha signal may just show such a low amplitude that it is impossible to reliably detect it at electrode level. In my view, the approach is not accurate enough, which can be seen by the "jagged" shape of the individual alpha peak frequency. In my view, a source separation technique would have been more useful. I wonder which of the known cortical alphas contributes to the effects the authors have reported previously: occipital, mu rhythms projections or something else? A source separation approach disentangles the different alphas and will increase the SNR. My suggestion would be to work on ICA components or similar approaches. The advantage is that the components are almost completely free of any artefacts. ICAs could be run on the entire data or separately for each individual. In the latter case, it might be that some participants do not exhibit any alpha component.

      We appreciate your thoughtful consideration of our approach to investigating alpha. The calculation of PAF involves various methods and analysis steps across the literature (Corcoran et al., 2018; Gil Avila et al., 2023; McLain et al., 2022). Your query about which known cortical alphas contribute to reported effects is important. Although Furman et al. (2018) initially focused on a sensorimotor component from an ICA, subsequent work from our labs suggested a broader relationship between PAF and pain across the scalp (Furman et al., 2019; Furman et al., 2020; Millard et al., 2022) and motivated conducting analyses at the sensor level in order to improve the reproducibility of the methods (Furman et al., 2020). However, based on your comment we have made several additions to the manuscript: we explain why we did not use manual ICA methods, suggest this for future research, and have added an exploratory analysis using a recently developed automated pipeline that selects components based on the presence of a peak in the alpha range and alignment with a predefined template topography representing activity from occipital or motor sites.

      While we acknowledge that ICA components can offer a better signal-to-noise ratio (SNR) and possibly smoother spectral plots, we opted for our chosen method to avoid potential bias inherent in deciding on a component following source separation. The desire for a quick, automated, replicable, and unbiased pipeline, crucial for potential clinical applications of PAF as a biomarker, influenced this decision. At the time of analysis registration, automated methods for deciding which alpha components to extract following ICA were not apparent. We have now added this reasoning to Methods.

      “Contrary to some previous studies that used ICA to isolate sensory region alpha sources (Furman et al., 2018; De Martino et al., 2021; Valentini et al., 2022), we used pre-determined sensor level ROIs to improve reproducibility and reduce the potential for bias when individually selecting ICA components. Using sensor level ROIs may decrease the signal-to-noise ratio of the data; however, this approach has still been effective for observing the relationship between PAF and experimental pain (Furman et al., 2019; Furman et al., 2020).”

      We have also added use of ICA and development of methods as a suggestion for future research in the discussion:

      “Additionally, the use of global PAF may have introduced mediation measurement error into our mediation analysis. The spatial precision used in the current study was based on previous literature on PAF as a biomarker of pain sensitivity, which have used global and/or sensorimotor ROIs (Furman et al., 2018; Furman et al., 2020). Identification and use of the exploratory electrode clusters found in this study could build upon the current work (e.g., Furman et al., 2021). However, exploratory analysis of the clusters found in the present analysis demonstrated no influence on mediation analysis results (Supplementary Materials 3.8-3.10). Alternatively, independent component analysis (ICA) could be used to identify separate sources of alpha oscillations (Choi et al., 2005), as used in other experimental PAF-pain studies (Furman et al., 2018; Valentini et al., 2022), which could aid to disentangle the potential relevance of different alpha sources in the PAF-pain relationship. Although this comes with the need to develop more reproducible and automated methods for identifying such components.”

      The specific location or source of PAF that relates to pain remains unclear. Because of this, we did employ an exploratory cluster-based permutation analysis to assess the potential for variations in the presence of PAF changes across the scalp at sensor level, and emphasise that location of PAF change could be explored in future. However, we have now conducted the mediation analysis (difference score 2W-LCS model) using averages from the data-driven parietal cluster, frontal cluster, and both clusters together. For these we see a stronger effect of gum on PAF change, which was expected given the data driven approach of picking electrodes. There was still a total and direct effect of nicotine on pain during the PHP model, but still no indirect effect via change in PAF. For the CPA models, there were still no significant total, direct, or indirect effects of nicotine on CPA ratings. Therefore, using these data-driven clusters did not alter results compared to the model using the global PAF variable.

      The reader has been directed to this supplementary material as follows:

      “The potential mediating effect of this change in PAF on change in PHP and CPA was explored (not pre-registered) by averaging within each cluster (central-parietal: CP1, CP2, CPz, P1, P2, P3, P4, Pz, POz; right-frontal: F8, FT8, FT10) and across both clusters. This averaging across electrodes produced three new variables, each assessed in relation to mediating effects on PHP and CPA ratings. The resulting six exploratory mediation analysis (difference score 2W-LCS) models demonstrated minimal differences from the main analysis of global PAF (8-12 Hz), except for the expected stronger effect of nicotine on change in PAF (bs = 0.11-0.14, ps < .003; Supplementary Materials 3.8-3.10).”

      Moreover, our team has been working on an automated method for selecting ICA components, so in response to your comment we assessed whether using this method altered the results of the current analysis. The in-depth methodology behind this new automatic pipeline will be published with a validation from some co-authors in the current collaboration in due course. At present, in summary, this automatic pipeline conducts independent component analysis (ICA) 10 times for each resting state, and selects the component with the highest topographical correlation to a template created of a sensorimotor alpha component from Furman et al., (2018). 
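
      The selection step summarized above can be sketched in a few lines. This is not the authors' pipeline, which is stated to be unpublished: the function names are hypothetical, each ICA run is assumed to have already produced one topography (a weight per channel) for every component, and taking the absolute correlation is an assumption motivated by the arbitrary sign of ICA components.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation between two equally long lists of channel weights."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    num = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    den = sqrt(sum((a - mean_u) ** 2 for a in u) * sum((b - mean_v) ** 2 for b in v))
    return num / den if den else 0.0

def best_matching_component(runs, template):
    """Return (run index, component index, |r|) of the topography closest to the template.

    runs: one entry per ICA run, each a list of component topographies.
    """
    best = (None, None, -1.0)
    for run_idx, topographies in enumerate(runs):
        for comp_idx, topo in enumerate(topographies):
            score = abs(pearson(topo, template))  # sign of an ICA component is arbitrary
            if score > best[2]:
                best = (run_idx, comp_idx, score)
    return best

# Toy example with four channels and two runs of two components each.
template = [1.0, 2.0, 3.0, 4.0]
runs = [
    [[4.0, 3.0, 2.0, 2.0], [1.0, 0.0, 1.0, 0.0]],
    [[-1.0, -2.0, -3.0, -4.1], [0.0, 1.0, 0.0, 1.0]],
]
best = best_matching_component(runs, template)
```

      In the toy data the sign-flipped copy of the template in the second run is selected, which shows why a signed correlation would miss a component that matches the template up to polarity.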

      The results of the PHP or CPA mediation models were not substantially different using the PAF calculated from independent components than that using the global PAF. For the PHP model, the total effect (b = -0.648, p = .033) and direct effects (b = -0.666, p = .035) were still significant, and there was still no significant indirect effect (b = 0.018, p = .726). The general fit was reduced, as although the CFI was above 0.90, akin to the original model, the RMSEA and SRMR were not below 0.08, unlike the original models (Little, 2013). For the CPA model, there were still no significant total (b = -0.371, p = .357), direct (b = -0.364, p = .386), or indirect effects (b = -0.007, p = .906), and the model fit also decreased, with CFI below 0.90 and RMSEA and SRMR above 0.08. See supplementary material (3.11). Note that still no correlations were seen between this IC sensorimotor PAF and pain (PHP: r = 0.11, p = .4; CPA: r = -0.064, p = .63).

      Interestingly, in both models, there was now no longer a significant a-path (PHP: b = 0.08, p = 0.292; CPA: b = 0.039, p = 0.575), unlike previously observed (PHP: b = 0.085, p = 0.018; CPA: b = 0.089, p = 0.011). We interpret this as supporting the previously highlighted difference between finding an effect on PAF globally but not in a sensorimotor ROI (and now a sensorimotor IC), justifying the exploratory CBPA and the suggestion in the discussion to explore methodology.
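
      For readers less familiar with the terms used above, the relationship between the a-path, the direct effect, the indirect effect, and the total effect can be shown with a simplified product-of-coefficients sketch. This is an ordinary least squares analogue written for illustration, not the two-wave latent change score model used in the study; the variable names and toy numbers are invented, and p values and fit indices (CFI, RMSEA, SRMR) would require structural equation modelling software.

```python
from statistics import mean

def _cov(u, v):
    """Sample covariance of two equally long lists."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def simple_mediation(x, m, y):
    """OLS mediation: x = treatment, m = mediator change, y = outcome change."""
    sxx, smm = _cov(x, x), _cov(m, m)
    sxm, sxy, smy = _cov(x, m), _cov(x, y), _cov(m, y)
    a = sxm / sxx          # a-path: treatment -> mediator
    total = sxy / sxx      # total effect: treatment -> outcome
    det = sxx * smm - sxm ** 2
    direct = (sxy * smm - smy * sxm) / det  # treatment -> outcome, controlling for mediator
    b = (smy * sxx - sxy * sxm) / det       # b-path: mediator -> outcome, controlling for treatment
    return {"a": a, "b": b, "direct": direct, "indirect": a * b, "total": total}

# Invented toy data: 0 = placebo, 1 = nicotine.
group = [0, 0, 0, 0, 1, 1, 1, 1]
paf_change = [0.1, -0.2, 0.0, 0.1, 0.2, 0.1, 0.3, 0.0]
pain_change = [0.5, -0.3, 0.2, 0.1, -0.6, -0.9, -0.2, -0.8]
effects = simple_mediation(group, paf_change, pain_change)
```

      In this linear setting the total effect equals the direct effect plus the indirect effect, so a significant total and direct effect can coexist with a negligible indirect effect, and a weak a-path limits the indirect effect regardless of the b-path.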

      We understand that this analysis does not fully uncover the reviewer’s question in which they wondered which of the known cortical alphas contributes to the effects reported in our previous work. However, we consider this exploration to be beyond the scope of the current paper, as it would be more appropriately addressed with larger datasets or combinations of datasets, potentially incorporating MEG to better disentangle oscillatory sources. The highlighted differences seen between global PAF, sensorimotor ROI PAF, sensorimotor IC PAF, as well as the CBPA of PAF changes provide ample directions for future research to build upon: 1) which alpha (sensor or source space) are related to pain, 2) how are these alpha signals represented robustly in a replicable way, and 3) which alpha (sensor or source space) are manipulable through interventions. These are all excellent questions for future studies to investigate.

      The below text has been added to the Discussion:

      In-house code was developed to compare a sensorimotor component to the results presented in this manuscript (Supplementary Material 3.11), showing similar results to the sensorimotor ROI mediation analysis presented here. However, examination of which alpha - be it sensor or source space - are related to pain, how they can be robustly represented, and how they can be manipulated are ripe avenues for future study.

      (4) I have my doubts that you can get a reliable close to bell-shaped amplitude distribution for every participant. The argument that the peak detection procedure is hampered by the high-amplitude lower frequency can be easily solved by subtracting the "slope" before determining the peak. My issue is that the entire analysis is resting on the assumption that each participant has a reliable alpha effect at electrode level. This is not the case. Non-alpha participants can severely distort the statistics. ICA-based analyses would be more sensitive but not every participant will show alpha. You may want to argue with robust group effects but in my view, every single participant counts, particularly for this type of data analysis, where in the case of a low SNR the "peak" can easily shift to the extremes. In case there is an alpha effect for a specific subject, we should see a smooth bump in the frequency spectrum between 8 and 12 Hz. Anything beyond that is hard to believe. The long stimulation period allows a broad FFT analysis window with a good frequency resolution in order to detect the alpha frequency bump.

      The reviewer is correct that non-alpha participants can distort the statistics. We did visually assess the EEG of each individual’s spectra at baseline to establish the presence of global peaks, as we believe this is good practice to aid understanding of the data. Please see Author response image 1 for individual spectra seen at baseline. Although not all participants had a ‘smooth bump in the frequency spectrum between 8 and 12 Hz’, we prefer to not apply/necessitate this assumption to our data. Chiang et al., (2011) suggest that ~3% of individuals do not have a discernible alpha peak, and in our data we observed only one participant without a very obvious spectral peak (px-39). But, this participant does have enough activity within the alpha range to identify PAF by the CoG method (i.e. not just flat spectra and activity on top of 1/f characteristics). Without a pre-registered and standardised decision process to remove such a participant in place, we opted to not remove any participants to avoid curation of our data.

      Author response image 1.
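
      To make the centre of gravity (CoG) estimate mentioned above concrete, the following is a minimal sketch of a power-weighted mean frequency within the alpha band. It is an illustration only and not the authors' pipeline: the function name, the toy spectrum, and the 8 to 12 Hz band (taken from the reviewer's description) are assumptions.

```python
import math


def center_of_gravity_paf(freqs, power, band=(8.0, 12.0)):
    """Power-weighted mean frequency inside `band` (band limits are assumed)."""
    lo, hi = band
    in_band = [(f, p) for f, p in zip(freqs, power) if lo <= f <= hi]
    total_power = sum(p for _, p in in_band)
    if total_power <= 0:
        raise ValueError("no spectral power inside the requested band")
    return sum(f * p for f, p in in_band) / total_power


# Hypothetical toy spectrum: a 1/f background plus a bump centred on 10 Hz.
freqs = [0.5 * i for i in range(2, 61)]  # 1.0 Hz to 30.0 Hz in 0.5 Hz steps
power = [1.0 / f + math.exp(-((f - 10.0) ** 2) / 2.0) for f in freqs]

paf = center_of_gravity_paf(freqs, power)
print(f"CoG PAF estimate: {paf:.2f} Hz")
```

      Because CoG is a weighted average rather than a peak search, it returns a value even when the spectrum has no clear bump, which is why visual inspection of individual spectra remains useful alongside it.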

      (5) I find reports on frequent channel rejections reflect badly on the data quality. Bad channels can be avoided with proper EEG preparation. EEG should be continuously monitored during recording in order to obtain best data quality. Have any of the ROI channels been rejected?

      We appreciate your attention to the channel rejection. We believe that the average channels removed (0.94, 0.98, 0.74, and 0.87 [range: 0-4] for each of the four resting states out of 64 channels) does not suggest overly frequent rejection, as it was less than one electrode on average and the numbers are below the accepted number of bad channels to remove/interpolate (i.e. 10%) in EEG pipelines (Debnath et al., 2020; Kayhan et al., 2022). To maintain data quality, consistently poor channels were identified and replaced over time. We hope you will accept our transparency on this issue and note that by stating how channel removal decisions were made (i.e. 8 or more deviations) and reporting the number of channels removed, we adhere to the COBIDAS guidelines (Pernet et al., 2018; 2020).
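
      The comparison with the 10% guideline is simple arithmetic and can be sketched as below. The function and constant names are illustrative assumptions; only the 64-channel montage, the 10% limit, and the maximum of 4 removed channels come from the text above.

```python
N_CHANNELS = 64          # channels in the montage, as stated in the response
MAX_BAD_FRACTION = 0.10  # commonly cited upper limit for removed channels


def within_rejection_limit(n_removed, n_channels=N_CHANNELS,
                           max_fraction=MAX_BAD_FRACTION):
    """Return True when the share of removed channels is at or below the limit."""
    return n_removed / n_channels <= max_fraction


worst_case_removed = 4  # upper end of the reported range (0-4)
worst_case_fraction = worst_case_removed / N_CHANNELS
print(f"worst case: {worst_case_fraction:.2%} of channels removed")
```

      Even the worst-case recording (4 of 64 channels) stays under the limit, since 10% of 64 channels corresponds to 6.4 channels.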

      During analysis, cases of sensorimotor ROI channels being rejected were noted and are now specified in our manuscript. “Out of 248 resting states recorded, 14 resting states had 4 ROI channels instead of 5. Importantly, no resting state had fewer than 4 channels for the sensorimotor ROI.”

      Note, we also realised that we had not specified that we did interpolate channels for the cluster based permutation analysis. This has been corrected with the following sentence:

      “Removed channels were not interpolated for the pre-registered global and sensorimotor ROI averaged analyses, but were interpolated for an exploratory cluster based permutation analysis using the nearest neighbour average method in `Fieldtrip`.”

      (6) I have some issues buying the authors' claims that there is an effect of nicotine on prolonged pain. By looking at the mean results for the nicotine and placebo condition, this can not be right. What was the point in including the variables in the equation? In my view, in this within-subject design the effect of nicotine should be universal, no matter what gender, age, or depression. The unconditional effect of nicotine is close to zero. I can not get my head around how any of the variables can turn the effects into significance. There must be higher or lower variable scores that might be related to a higher or lower effect on nicotine. The question is not to consider these variables as a nuisance but to show how they modulate the pain-related effect of nicotine treatment. Still, the overall nicotine effect of the entire group is basically zero.

      Another point is that for within-subject analyses even tiny effects can become statistically significant if they are systematically in one direction. This might be the case here. There might be a significant effect of nicotine on pain but the actual effect size (5.73 vs. 5.78) is actually not interpretable. I think it would be interesting for the reader how (in terms of pain rating difference) each of the variables can change the effect of nicotine.

      Thank you for your comments. We recognize the concern about interpreting the effect of nicotine on prolonged pain solely based on mean results, and in fact wish to discourage this approach. It's crucial to note that both PAF and pain are highly individual measures (i.e. high inter-individual variance), necessitating the use of random intercepts for participants in our analyses to acknowledge the inherent variability at baseline across participants. Including random intercepts rather than only considering the means helps address the heterogeneity in baseline levels among participants. We also recognise that displaying the mean PHP ratings for all participants in Table 2 could be misleading, firstly because these means do not have weight in an analysis that takes into account a random-effects intercept for participants, and secondly because two participants (one from each group) did not have post-gum PHP assessments and were not included in the mediation analysis due to list-wise deletion of missing data. Therefore, to reduce the potential for misinterpretation, we have added extra detail to display both the full sample and CPA mediation analysis (i.e. N=62) and the data used for PHP mediation analysis (i.e. n=60) in Table 2. We hope that the extra details added to this table will help the reader's interpretation of results.

      In light of this, we have also altered the PAF Table 3 to reflect both the pre-post values used for the CPA mediation and baseline correlations with CPA and PHP pain (i.e. N=62), and the pre-post values used for the PHP mediation (i.e. n=60).

      It is inherently difficult to visualise the findings of a mediation analysis with confounding variables that also used latent change scores (LCS) and random-effect intercepts for participants. LCS was specifically used because of issues of regression to the mean that occur if you calculate a straightforward ‘difference-score’. Calculating the difference in order to demonstrate the results of the statistical model in a figure, for example, therefore does not provide a full description of the data assessed (Valente & MacKinnon, 2017). Nevertheless, if we look at the data descriptively with this in mind, then calculating the change in PHP ratings does indicate that, for the nicotine group, the mean change in PHP ratings was -0.047 (SD = 1.05, range: -4.13, 1.45). Meanwhile, for the placebo group the mean change in PHP ratings was 0.33 (SD = 0.75, range: -1.37, 1.66). This suggests a slight decrease in pain ratings on average for the nicotine group compared to a slight increase on average for the placebo group. With control for pre-determined confounders, we found that the latent change score was -0.63 lower for the nicotine group compared to the control group (i.e. the direct effect of nicotine on change in pain).

      If the reviewer is only discussing the effect of nicotine on pain, we do not believe that this effect ‘should be universal’. There is clear evidence that effects of nicotine on other measures can vary greatly across individuals (Ettinger et al., 2009; Falco & Bevins, 2015; Pomerleau et al., 1995). Our intention would not be to propose a universal effect but to understand how these variables may influence nicotine's impact on pain for individuals. Here we focus on the effects of nicotine on PAF and pain sensitivity, but attempted to control for the potential influence of these other confounding factors. Therefore, our statistical approach goes beyond mean values, incorporating variables like sex at birth, age, and depression to control for and explore potential modulating factors. Control for confounding factors is an important aspect of mediation analysis (Lederer et al., 2019; VanderWeele, 2019).

      Regarding the seemingly small effect size, we understand your concern. Indeed ‘tiny effects can become statistically significant if they are systematically in one direction’, which may be what we see in this analysis. We do not agree that the effect is ‘not interpretable’, rather that it should be interpreted in light of its small effect size (effect size being the beta coefficient in our analysis, rather than the mean group difference). We agree on the importance of considering practical significance alongside statistical significance and hope to conduct additional experiments and analyses in future to elucidate the contribution of each variable to the subtle and therefore not entirely conclusive overall effect you mention.

      Your feedback on this is valuable, and we have ensured a more detailed discussion in the revised manuscript on how these factors should be interpreted alongside some additional post-hoc analyses of confounding factors that were significant in our mediation, with the note that investigation of these interactions is exploratory. We had already discussed the potential contribution of sex on the effect of nicotine on PAF, with exploratory post-hoc analysis on this included in supplementary materials. In addition, we have now added an exploratory post-hoc analysis on the potential contribution of stress on the effect of nicotine on pain. This then shows the stratified effects by the covariates that our model suggests are influencing change in PAF and pain.

      Results edits:

      “There was also a significant effect of perceived stress at baseline on change in PHP ratings when controlling for group allocation and other confounding variables (b = -0.096, p = .048, bootstrapped 95% CI: [-0.19, -0.000047]), where higher perceived stress resulted in larger decreases in PHP ratings (see Supplementary Material 3.3 for post-hoc analysis of stress).”

      Supplementary material addition:

      “3.3 Exploratory analysis of the influence of perceived stress on the effects of nicotine on change in PHP ratings”

      “Due to the significant estimated effects of perceived stress on change in PHP ratings in the 2WLCS mediation model, we also explored post-hoc effects of stress on change in PHP ratings. We found that there is strong evidence for a negative correlation between stress and change in PHP rating within the nicotine group (n = 28, r = −0.39, BF10 = 13.65; Figure 3) that is not present in the placebo group, with equivocal evidence (n = 32, r = −0.14, BF10 = 0.46). This suggests that those with higher baseline stress who had nicotine gum experienced greater decreases in PHP ratings. Note that there was less, but still sufficient evidence for this relationship within the nicotine group when the participant who was a potential outlier for change in PHP rating was removed (n = 27, r = −0.32, BF10 = 1.45).”

      Author response image 2.

      Spearman correlations of baseline perceived stress with the change in phasic heat pain (PHP) ratings suggest strong evidence for a negative relationship for the nicotine gum group in orange (n=28; BF<sub>10</sub>=13.65) but not for the placebo group in grey (n=32; BF<sub>10</sub>=0.46). Regression lines and 95% confidence intervals.

      Discussion edits:

      “For example, in addition to the effect of nicotine on prolonged heat pain ratings, our results suggest an effect of stress on changes in heat pain ratings, with those self-reporting higher stress at baseline having greater reductions in pain. Our post-hoc analysis suggested that this relationship between higher stress and larger decrease in PHP ratings was only present for the nicotine group (Supplementary Material 3.3). As stress is linked to nicotine use [69,70] and pain [71–73], these interactions should be explored in future.”

      (7) Is the differential effect of nicotine vs. placebo based on the pre vs. post treatment effect of the placebo condition or on the pre vs. post effect of the nicotine treatment? Can the mediation model be adapted and run for each condition separately? The placebo condition seems to have a stronger effect and may have driven the result.

      Thank you for your comments. In our mediation analysis, the differential effect of nicotine vs. placebo is assessed as a comparison between the pre-post difference within each condition. A latent change score (i.e. pre-post) is calculated for each condition (nicotine and placebo), and then the effect of being in the nicotine group (dummy coded as 1) is compared to being in the placebo group (dummy coded as 0). The comparison between conditions is needed for this model (Valente & MacKinnon, 2017), as we are assessing the change in PAF and pain in the nicotine group compared to the change in the placebo group.
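
      The logic of comparing pre-post change between dummy-coded groups can be illustrated with a deliberately simplified, observed-score sketch. It is not the latent change score mediation model used in the manuscript: there are no latent variables, covariates, mediator, or random intercepts, and all data and function names below are hypothetical. It only shows that regressing a change score on a 0/1 group code yields the difference in mean change between the groups.

```python
from statistics import mean


def change_scores(pre, post):
    """Observed post-minus-pre change for each participant."""
    return [b - a for a, b in zip(pre, post)]


def dummy_slope(group, y):
    """OLS slope of y on a single 0/1 predictor (model includes an intercept)."""
    g_bar, y_bar = mean(group), mean(y)
    sxy = sum((g - g_bar) * (v - y_bar) for g, v in zip(group, y))
    sxx = sum((g - g_bar) ** 2 for g in group)
    return sxy / sxx


# Hypothetical toy pain ratings (0-10 scale); 1 = nicotine, 0 = placebo.
pre = [5.0, 6.0, 4.0, 7.0, 5.0, 6.0]
post = [4.5, 5.5, 4.0, 7.5, 5.5, 6.5]
group = [1, 1, 1, 0, 0, 0]

delta = change_scores(pre, post)
group_effect = dummy_slope(group, delta)

mean_change_nicotine = mean(d for d, g in zip(delta, group) if g == 1)
mean_change_placebo = mean(d for d, g in zip(delta, group) if g == 0)
print(group_effect, mean_change_nicotine - mean_change_placebo)
```

      A negative coefficient here means that pain changed less (or decreased more) in the group coded 1 than in the reference group coded 0, which is why the placebo change acts as the baseline for interpreting the nicotine effect.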

      However, to address your question, it is possible to simplify and assess the relationship between the change in peak alpha frequency (PAF) and change in pain within each gum group (nicotine and placebo) independently, without including the intervention as a factor. To do this, the mediation model can be simplified to regression analysis with latent change scores that focus purely on these relationships. The results of this can help to understand whether change in PAF influences change in pain within each group separately. As with the main analysis, we see no significant influence of change in PAF on change in pain while controlling for the same confounding variables within the nicotine group (Beta = -0.146 +/- 1.105, p = 0.895, 95% CI: -2.243, 2.429) or the placebo group (Beta = 0.730 +/- 2.061, p = 0.723, 95% CI: -4.177, 3.625).

      When suggesting that “the placebo condition seems to have a stronger effect and may have driven the result”, we believe you are referring to the increase in mean PHP ratings within the placebo group from pre (5.51 +/- 2.53) to post-placebo gum (5.84 +/- 2.67). Indeed there was a significant increase in pain ratings pre to post chewing placebo gum (t(31) = -2.53, p = 0.0165, 95% CI: -0.603, -0.0653), that was not seen after chewing nicotine gum (t(27) = 0.237, p = 0.81, 95% CI: -0.358, 0.452). In lieu of a control where no gum was chewed (i.e. simply a second pain assessment ~30 minutes after the first), we assume the gum without nicotine is a good reference that controls for the effect of time plus expectation of chewing nicotine gum. With this in mind, as we describe in our results, the change in PHP ratings is reduced in the nicotine group compared to the placebo group. Note that this phrasing keeps the effect of placebo on pain as our reference from which to view the effect of nicotine on pain. However, you are correct that we need to ensure we emphasise that the change in PHP ratings in the nicotine group is reduced in comparison to the change seen after placebo.
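
      For readers who wish to see how a within-group pre versus post comparison of this kind is formed, a minimal paired t statistic can be sketched as follows. The data are hypothetical, and the pre-minus-post sign convention is an assumption chosen so that an increase in ratings gives a negative t, matching the direction of the placebo result reported above.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(pre, post):
    """Paired t statistic on pre-minus-post differences, with degrees of freedom."""
    diffs = [a - b for a, b in zip(pre, post)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)  # stdev uses the n - 1 denominator
    return mean(diffs) / standard_error, n - 1


# Hypothetical toy ratings in which pain tends to rise from pre to post.
pre = [5.0, 6.0, 4.0, 7.0]
post = [5.5, 6.5, 4.0, 8.0]

t_stat, df = paired_t(pre, post)
print(f"t({df}) = {t_stat:.3f}")
```

      The statistic only describes change within one group; it does not by itself compare the nicotine and placebo groups, which is what the between-group term in the mediation model addresses.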

      We have not included these extra statistics in our revised manuscript, but hope that they aid your understanding and interpretation of the included analyses and have highlighted these nuances in the discussion.

      “However, we note that the observed effect of nicotine on pain was small in magnitude, and most prominent in comparison to the effect of placebo, where pain ratings increased after chewing, which brings into question whether this reduction in pain is meaningful in practice.”

      (8) I would not dare to state that nicotine can function as an acute analgesic. Acute analgesics need to work for everyone. The average effect here is close to zero.

      In light of your feedback, we have refined our language to avoid a sweeping assertion of universal analgesic effects and emphasize individual variability. Nicotine's role as a coping strategy for pain is acknowledged in the literature (Robinson et al., 2022), with the meta-analysis by Ditre et al. (2016) discussing its potential as an acute analgesic in humans, along with some evidence from animal research (Zhang et al., 2020). Our revised discussion underscores the need for further exploration into factors influencing nicotine's potential impact on pain. We have also specified the short-term nature of nicotine use in this context to distinguish acute effects from potential opposing effects after long-term use (Zhang et al., 2020).

      “Short-term nicotine use is thought to have acute analgesic properties in experimental settings, with a review reporting that nicotine increased pain thresholds and pain tolerance [49]. In addition, research in a rat model suggests analgesic effects on mechanical thresholds after short-term nicotine use (Zhang et al., 2020). However, previous research has not assessed the acute effects of nicotine on prolonged experimental pain models. The present study found that 4 mg of nicotine reduced heat pain ratings during prolonged heat pain compared to placebo for our human participants, but that prolonged pressure pain decreased irrespective of which gum was chewed. Our findings are thus partly consistent with the idea that nicotine may have acute analgesic properties [49], although further research is required to explore factors that may influence nicotine’s potential impact on a variety of prolonged pain models. We further advance the literature by reporting this effect in a model of prolonged heat pain, which better approximates the experience of clinical pain than short lasting models used to assess thresholds and tolerance [50]. However, we note that the observed effect of nicotine on pain was small in magnitude, and most prominent in comparison to the effect of placebo, where pain ratings increased after chewing, which brings into question whether this reduction in pain is meaningful in practice. Future research should examine whether effects on pain increase in magnitude with different nicotine administration regimens (i.e. dose and frequency).”

      (9) Figures 2E and 2F are not particularly intuitive. Usually, the colour green in "jet" colour coding is being used for "zero" values. I would suggest to cut off the blue and use only the range between red green and red.

      We have chosen to retain the current colour scale for several reasons. In our analysis, green represents the middle of the frequency range (approx 10 Hz in this case), and if we were to use green as zero, it would effectively remove both blue and green from the plot, resulting in only red shades. Additionally, we have provided a clear colour scale for reference next to the plot, which allows readers to interpret the data accurately. Our intention is to maintain clarity and precision in representing the data, rather than conforming strictly to conventional practices in color coding.

      We believe that the current representation effectively conveys the results of our study while allowing readers to interpret the data within the context provided. Thank you again for your suggestion, and we hope you understand our reasoning in this matter.

      (10) Did the authors do their analysis on the parietal ROI or on the pre-registered ROI?

      The analysis was conducted on the pre-registered sensorimotor ROI and on the global values. We have now also conducted the analysis with the regions suggested with the cluster based permutation analysis as requested by reviewer 2, comment 3.

      (11) Point 3.2 in the discussion. I would be very cautious to discuss smoking and chronic pain in the context of the manuscript. The authors can not provide any additional knowledge with their design targeting non-smokers, acute nicotine and experimental pain. The information might be interesting in the introduction in order to provide the reader with some context but is probably misleading in the discussion.

      We appreciate your perspective and agree with your caution regarding the discussion of smoking and chronic pain. While our study specifically targets non-smokers and focuses on acute nicotine effects in experimental pain, we understand the importance of contextual clarity. We have removed these points from the discussion to not mislead the reader.

      Previously we wrote, and have removed: “For those with chronic pain, smoking and nicotine use is reported as a coping strategy for pain [52]; abstinence can increase pain sensitivity [48,50], and pain is thus seen as a barrier to smoking cessation due to fear of worsening pain [51,52]. Therefore, continued understanding of the acute effects of nicotine on models of prolonged pain could improve understanding of the role of nicotine and smoking use in chronic pain [49,51,52].”

      (12) I very much appreciate section 3.3 of the discussion. I would not give up on PAF as a target to modulate pain. A modulation might not be possible in such a short period of experimental intervention. PAF might need longer and different interventions to gradually shift in order to attenuate the intensity of pain. As discussed by the authors themselves, I would also consider other targets for alpha analysis (as mentioned above not other electrodes or ROIs but separated sources.)

      Thank you for your comments on section 3.3. We appreciate your recognition of the potential significance of PAF as a target for pain modulation. Your insights align with our considerations that the experimental intervention duration or type might be a limiting factor in observing substantial shifts in PAF to attenuate pain intensity. We had mentioned the use of the exploratory electrode clusters in future work, but have now also mentioned that the use of ICA to identify separate ICA sources may provide an alternative approach. See responses to your previous ICA comment regarding separate sources.

      REFERENCES for responses to reviewer 2

      Chiang, A. K. I., Rennie, C. J., Robinson, P. A., Van Albada, S. J., & Kerr, C. C. (2011). Age trends and sex differences of alpha rhythms including split alpha peaks. Clinical Neurophysiology, 122(8), 1505-1517.

      Debnath, R., Buzzell, G. A., Morales, S., Bowers, M. E., Leach, S. C., & Fox, N. A. (2020). The Maryland analysis of developmental EEG (MADE) pipeline. Psychophysiology, 57(6), e13580.

      Ettinger, U., Williams, S. C., Patel, D., Michel, T. M., Nwaigwe, A., Caceres, A., ... & Kumari, V. (2009). Effects of acute nicotine on brain function in healthy smokers and non-smokers: estimation of inter-individual response heterogeneity. Neuroimage, 45(2), 549-561.

      Falco, A. M., & Bevins, R. A. (2015). Individual differences in the behavioral effects of nicotine: a review of the preclinical animal literature. Pharmacology Biochemistry and Behavior, 138, 80-90.

      Kayhan, E., Matthes, D., Haresign, I. M., Bánki, A., Michel, C., Langeloh, M., ... & Hoehl, S. (2022). DEEP: A dual EEG pipeline for developmental hyperscanning studies. Developmental cognitive neuroscience, 54, 101104.

      Lederer, D. J., Bell, S. C., Branson, R. D., Chalmers, J. D., Marshall, R., Maslove, D. M., ... & Vincent, J. L. (2019). Control of confounding and reporting of results in causal inference studies. Guidance for authors from editors of respiratory, sleep, and critical care journals. Annals of the American Thoracic Society, 16(1), 22-28.

      Little, T. D. (2013). Longitudinal structural equation modeling. Guilford Press.

      Pernet, C., Garrido, M., Gramfort, A., Maurits, N., Michel, C. M., Pang, E., ... & Puce, A. (2018). Best practices in data analysis and sharing in neuroimaging using MEEG.

      Pernet, C., Garrido, M. I., Gramfort, A., Maurits, N., Michel, C. M., Pang, E., ... & Puce, A. (2020). Issues and recommendations from the OHBM COBIDAS MEEG committee for reproducible EEG and MEG research. Nature neuroscience, 23(12), 1473-1483.

      Pomerleau, O. F. (1995). Individual differences in sensitivity to nicotine: implications for genetic research on nicotine dependence. Behavior genetics, 25(2), 161-177.

      Robinson, C. L., Kim, R. S., Li, M., Ruan, Q. Z., Surapaneni, S., Jones, M., ... & Southerland, W. (2022). The Impact of Smoking on the Development and Severity of Chronic Pain. Current Pain and Headache Reports, 26(8), 575-581.

      Xia, J., Mazaheri, A., Segaert, K., Salmon, D. P., Harvey, D., Shapiro, K., ... & Olichney, J. M. (2020). Event-related potential and EEG oscillatory predictors of verbal memory in mild cognitive impairment. Brain communications, 2(2), fcaa213.

      VanderWeele, T. J. (2019). Principles of confounder selection. European journal of epidemiology, 34, 211-219.

      Valente, M. J., & MacKinnon, D. P. (2017). Comparing models of change to estimate the mediated effect in the pretest–posttest control group design. Structural Equation Modeling: A Multidisciplinary Journal, 24(3), 428-450.

      Vimolratana, O., Aneksan, B., Siripornpanich, V., Hiengkaew, V., Prathum, T., Jeungprasopsuk, W., ... & Klomjai, W. (2024). Effects of anodal tDCS on resting state EEG power and motor function in acute stroke: a randomized controlled trial. Journal of NeuroEngineering and Rehabilitation, 21(1), 1-15.

      Zhang, Y., Yang, J., Sevilla, A., Weller, R., Wu, J., Su, C., ... & Candiotti, K. A. (2020). The mechanism of chronic nicotine exposure and nicotine withdrawal on pain perception in an animal model. Neuroscience letters, 715, 134627.

      Reviewer #3 (Recommendations For The Authors):

      Introduction

      (1) Rationale and link to chronic pain. I am not sure I agree with the statement "The ability to identify those at greater risk of developing chronic pain is limited". I believe there is an abundance of literature associating risk factors with the different instances of chronic pain (e.g., Mills et al., 2019). The fact that the authors cite studies involving potential neuroimaging biomarkers leads me to believe that they perhaps did not intend to make such a broad statement, or that they wanted to focus on individual prediction instead of population risk.

      We thank the reviewer for the thought put into this comment. We did indeed wish to refer to individual prediction, but also realise that the focus on predicting pain might not be the most appropriate opening for this manuscript. Therefore, we have adjusted the below sentence to refer to the need to identify modifiable factors rather than the need to predict pain.

      “Identifying modifiable factors that influence pain sensitivity could be a key step in reducing the presence and burden of chronic pain (van der Miesen et al., 2019; Davis et al., 2020; Tracey et al., 2021).”

      (2) The statement "Individual peak alpha frequency (PAF) is an electro-physiological brain measure that shows promise as a biomarker of pain sensitivity, and thus may prove useful for predicting chronic pain development" is a non sequitur. PAF may very well be a biomarker of pain sensitivity, but the best measures of pain sensitivity we have (self-reported pain intensity ratings) in general are not in themselves predictive of the development of chronic pain. Conversely, features that are not related to pain sensitivity could be useful for predicting chronic pain (e.g., Tanguay-Sabourin et al., 2023).

      We agree that it is essential to acknowledge that self-reported pain intensity ratings alone are not definitive predictors of chronic pain development. To align with this, we have revised the sentence, removing the second clause to avoid overstatement. The adjusted sentence now reads, "Individual peak alpha frequency (PAF) is an electrophysiological brain measure that shows promise as a biomarker of pain sensitivity."

      (3) Finally, some of the statements in the discussion comparing a tonic heat pain model with chronic neuropathic pain might be an overstatement. Whereas it is true that some of the descriptors are similar, the time courses and mechanisms are vastly different.

      We appreciate this comment, and agree that it is difficult to compare the heat pain model used to clinical neuropathic pain. This was an oversight and with further understanding we have removed this comment from the introduction and the discussion:

      “In parallel, we saw no indication of a relationship between PAF and pain ratings during CPA. The introduction of the CPA model, specifically calibrated to a moderate pain threshold, provides further support for the notion that the relationship between PAF and pain is specific to certain pain types [17,28]. Prolonged heat pain was pre-dominantly described as moderate/severe shooting, sharp, and hot pain, whereas prolonged pressure pain was predominantly described as mild/moderate throbbing, cramping, and aching in the present study. It is possible that the PAF–pain relationship is specific to particular pain models and protocols [12,17].”

      Methodology

      (4) or the benefit of good science. However, I am compelled to highlight that I could not access the preregistered files, even though I waited for almost two weeks after requesting permission to do so. This was a problem on two levels: the main one is that I could not check the hypothesized effect sizes of the sample size estimation, which are not only central to my review, and in general negate all the benefits that should go with preregistration (i.e., avoiding p-hacking, publication bias, data dredging, HARKing, etc.). The second one is that I had to provide an email address to request access. This allows the authors to potentially identify the reviewers. Whereas I have no issues with this and I support transparent peer review practices (https://elifesciences.org/inside-elife/e3e90410/increasingtransparency-in-elife-s-review-process), I also note that this might condition other reviewers.

      We apologise for this. We had not realised that the pre-registration was under embargo, but we have now made it available.

      Interpretation of results

      (5) To be perfectly clear, I trust the results of this study more than some of the cited studies regarding nicotine and pain because it was preregistered, the sample size is considerably larger, and it seems carefully controlled. I just do not agree with the interpretation of the results, stated in the first paragraph of the Discussion. Quoting J. Cohen, "The primary product of a research inquiry is one or more measures of effect size, not P values" (Cohen, 1990). As I am sure the authors are aware, even tiny differences between conditions, treatments or groups will eventually be statistically significant given arbitrarily large sample sizes. What really matters then is the magnitude of these differences. In general, the authors hypothesize on why there were no differences on the pressure pain model, and why decreases in heat pain were not mediated by PAF, but do not seem to consider the possibility that the intervention just did not cause the intended effect on the nociceptive system, which would be a much more straightforward explanation for all observations.

      While acknowledging and agreeing with the concern that 'even tiny differences between conditions, treatments, or groups will eventually be statistically significant given arbitrarily large sample sizes,' it's crucial to clarify that our sample size of N=62 does not fall into the category of arbitrarily large. We carefully considered the observed outcomes in the pressure pain model and the lack of PAF mediation in heat pain, as dictated by our statistical approach and the obtained results.

      The suggestion of a straightforward explanation aligning with the intervention not causing the intended effect on the nociceptive system is a valid consideration. We did contemplate the possibility of a false positive, emphasising this in the limitations of our findings and the need for replication to draw stronger conclusions to follow up this initial study.

      (6) In this regard, I do not believe that an average *increase* of 0.05 / 10 (Nicotine post - pre) can be considered a "reduction of pain ratings", regardless of the contrast with placebo (average increase of 0.24 / 10). This tiny effect size is more relevant in the context of the considerable inter-individual variation, in which subjects scored the same heat pain model anywhere from 1 to 10, and the same pressure pain model anywhere from 1 to 8.5. In this regard, the minimum clinically or experimentally important differences (MID) in pain ratings varies from study to study and across painful conditions but is rarely below 1 / 10 in a VAS or NRS scale, see f. ex. (Olsen et al., 2017). It is not my intention to question whether nicotine can function as an acute analgesic in general (as stated in the Discussion), but instead, if it worked as such under these very specific experimental conditions. I also acknowledge that the authors note this issue in two lines in the Discussion, but I believe that this is not weighed properly.

      We appreciate your perspective on the interpretation of the effect size, and we understand the importance of considering it in the context of individual variation.

      As also discussed in response to comment 6 from Reviewer 2, we recognize the concern about interpreting the effect of nicotine on prolonged pain solely based on mean results, and in fact wish to discourage this approach. It's crucial to note that both PAF and pain are highly individual measures (i.e. high inter-individual variance), necessitating the use of random intercepts for participants in our analyses to acknowledge the inherent variability at baseline across participants. Including random intercepts rather than only considering the means helps address the heterogeneity in baseline levels among participants. We also recognise that displaying the mean PHP ratings for all participants in Table 2 could be misleading, firstly because these means do not have weight in an analysis that takes into account a random-effects intercept for participants, and secondly because two participants (one from each group) did not have post-gum PHP assessments and were not included in the mediation analysis due to list-wise deletion of missing data. Therefore, to reduce the potential for misinterpretation, we have added extra detail to display both the full sample and CPA mediation analysis (i.e. N=62) and the data used for PHP mediation analysis (i.e. n=60) in Table 2. We hope that the extra details added to this table will help the readers' interpretation of results.
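
      A minimal sketch of why participant-level baselines matter is given here. It uses invented ratings and subtracts each participant's own mean, which is only a simplified stand-in for what a random intercept absorbs; it is not the mixed model fitted in the study, and all names and numbers are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical ratings: very different baselines, identical pre-to-post change.
ratings = {
    "p1": {"pre": 2.0, "post": 2.3},
    "p2": {"pre": 5.0, "post": 5.3},
    "p3": {"pre": 8.0, "post": 8.3},
}


def centre_within(data):
    """Subtract each participant's own mean so only within-person variation remains."""
    centred = {}
    for pid, obs in data.items():
        own_mean = mean(obs.values())
        centred[pid] = {k: v - own_mean for k, v in obs.items()}
    return centred


centred = centre_within(ratings)
raw_spread = pstdev(v for obs in ratings.values() for v in obs.values())
within_spread = pstdev(v for obs in centred.values() for v in obs.values())
changes = [obs["post"] - obs["pre"] for obs in ratings.values()]
```

      In this toy case the raw spread is dominated by baseline differences between participants, while the within-person change is the same for everyone, which is why group means alone can be a poor summary.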

      Moreover, we have made sure to refer to the comparison with the placebo group when discussing the reduction or decrease in pain seen in the nicotine group, for example:

      “2) nicotine reduced prolonged heat pain intensity but not prolonged pressure pain intensity compared to placebo gum;”

      “The nicotine group had a decrease in heat pain ratings compared to the placebo group and increased PAF speed across the scalp from pre to post-gum, driven by changes at central-parietal and right-frontal regions.”

      We have kept our original comment of whether this effect on pain is meaningful in practice to refer to the minimum clinically or experimentally important differences in pain ratings as highlighted by Olsen et al., 2017.

      “While acknowledging the modest effect size, it’s essential to consider the broader context of our study’s focus. Assessing the clinical relevance of pain reduction is pertinent in applications involving the use of any intervention for pain management [69]. However, from a mechanistic standpoint, particularly in understanding the implications of and relation to PAF, the specific magnitude of the pain effect becomes less pivotal. Nevertheless, future research should examine whether effects on pain increase in magnitude with different nicotine administration regimens (i.e. dose and frequency).”

      (7) In line with the topic of effect sizes, average effect sizes for PAF in the studies cited in the manuscript range from around 1 Hz (Boord et al., 2008; Wydenkeller et al., 2009; Lim et al., 2016), to 2 Hz (Foulds et al., 1994), compared with changes of 0.06 Hz (Nicotine post - pre) or -0.01 Hz (Placebo post - pre). MIDs are not so clearly established for peak frequencies in EEG bands, but they should certainly be larger than some fractions of a Hertz (which is considerably below the reliability of the measurement).

      We appreciate your attention to these nuances. We acknowledge the differences in effect sizes between our study and those referenced in the manuscript. Given the current state of the literature, it's noteworthy that ‘MIDs’ for peak frequencies in EEG bands, particularly PAF changes, are not clearly established, other than a recent publication suggesting that even small changes in PAF are reliable and meaningful (Furman et al., 2021). In light of this, we have addressed the uncertainty around the existence and determination of MIDs in our revision, highlighting the need for further research in this area.

      In addition, our study employed a greater frequency resolution (0.2 Hz) compared to some of the referenced studies, with approximately 0.5 Hz resolution (Boord et al., 2008; Wydenkeller et al., 2009; Foulds et al., 1994). This improved resolution allows for a more precise measurement of changes in PAF. Considering this, it is plausible that studies with lower resolution might have conflated increases in PAF, and our higher resolution contributes to a more accurate representation of the observed changes.
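
      The role of frequency resolution can be illustrated with a minimal sketch. It relies on the standard relation that the bin spacing of a discrete Fourier transform equals the sampling rate divided by the number of samples in the analysis window. The 500 Hz sampling rate, the window lengths, and the simple nearest-bin peak picking are assumptions for illustration and are not taken from the manuscript.

```python
def frequency_resolution(sampling_rate_hz, n_samples):
    """Spacing between adjacent DFT bins, in Hz."""
    return sampling_rate_hz / n_samples


def nearest_bin(frequency_hz, resolution_hz):
    """Frequency of the DFT bin closest to a given true frequency."""
    return round(round(frequency_hz / resolution_hz) * resolution_hz, 6)


FS = 500  # hypothetical sampling rate in Hz

fine = frequency_resolution(FS, 2500)    # 5 s window
coarse = frequency_resolution(FS, 1000)  # 2 s window

# Two hypothetical alpha peaks that differ by 0.2 Hz.
peaks = (10.0, 10.2)
fine_bins = [nearest_bin(p, fine) for p in peaks]
coarse_bins = [nearest_bin(p, coarse) for p in peaks]
```

      With the coarser spacing both hypothetical peaks land in the same bin, whereas the finer spacing separates them. Estimators that interpolate across bins (for example a centre-of-gravity estimate) are less limited by bin spacing, so this sketch only covers plain peak picking.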

      We have also incorporated this insight into the manuscript, emphasising the methodological advancements in our study and their potential impact on the interpretation of PAF changes. Thank you for your thoughtful feedback.

      “The ability to detect changes in PAF can be considerably impacted by the frequency resolution used during Fourier Transformations, an element that is overlooked in recent methodological studies on PAF calculation [16,95]. Changes in PAF within individuals might be obscured or conflated by lower frequency resolutions, which should be considered further in future research.”

      (8) The authors also ran alternative statistical models to analyze the data and did not find consistent results in terms of PHP ratings (PAF modulation was still statistically significantly different). The authors attribute this to the necessity of controlling for covariates. Now, considering the effect sizes, aren't these statistically significant differences just artifacts stemming from the inclusion of too many covariates (Simmons et al., 2011)? How much influence should be attributable to depression and anxiety symptoms, stress, sleep quality and past pain, considering that these are healthy volunteers? Should these contrasting differences call the authors to question the robustness of the findings (i.e., whether the same data subjected to different analysis provides the same results), particularly when the results do not align with the preregistered hypothesis (PAF modulation should occur on sensorimotor ROIs)?

      Thank you for your comments on our alternative statistical models. By including these covariates, we aim to provide a more nuanced understanding of the complexities within our data by considering their potential impact on the effects of interest. The decision to include covariates was preregistered (apologies again that this was not available) and made with consideration of balancing model complexity and avoiding potential confounding. Moreover, we hope that the insights gained from these analyses will offer valuable information about the behaviour of our data and aid future research in terms of power calculations, expected variance, and study design.

      (9) Beyond that, I believe in some cases that the authors overreach in an attempt to provide explanations for their results. While I agree that sex might be a relevant covariate, I cannot say whether the authors are confirming a pre-registered hypothesis regarding the gender-specific correlation of PAF and pain, or if this is just a post hoc subgroup analysis. Given the large number of analyses performed (considering the main document and the supplementary files), caution should be exercised on the selective interpretation of those that align with the researchers' hypotheses.

      We chose to explore the influence of sex on the correlation between PAF and pain, because this has also been investigated in previous publications of the relationship (Furman et al., 2020). We state that the assessment by sex is exploratory in our results on p.17: “in an exploratory analysis of separate correlations in males and females (Figure 5, plot C)”. For clarity regarding whether this was a pre-registered exploration or not, we have adjusted this to be: “in an exploratory analysis (not pre-registered) of separate correlations in males and females (Figure 5, plot C), akin to those conducted in previous research on this topic (Furman et al., 2020)”.

      We have made sure to state this in the discussion also. Therefore, when we previously said on p.22:

      “Regarding the relationship between PAF and pain at baseline, the negative correlation between PAF and pain seen in previous work [7–11,15] was only observed here for male participants during the PHP model for global PAF.” We have now changed this to: “Regarding the relationship between PAF and pain at baseline, the negative correlation between PAF and pain seen in previous work [7–11,15] was only observed here for male participants during the PHP model for global PAF in an exploratory analysis.”

      Please also note that we altered the colour and shape of points on the correlation plot (Figure 5 in the initial submission). The light brown used for male participants was changed to a dark brown, as we realised that the light brown colour was difficult to read. The shape of the male points was also changed so that the two groups can be distinguished in grey-scale.

      Overall, your thoughtful feedback is instrumental in refining the interpretation of our findings, and we look forward to presenting a more comprehensive and nuanced discussion. Thank you for your comments.

      REFERENCES for responses to reviewer 3

      Arendt-Nielsen, L., & Yarnitsky, D. (2009). Experimental and clinical applications of quantitative sensory testing applied to skin, muscles and viscera. The Journal of Pain, 10(6), 556-572.

      Chowdhury, N. S., Skippen, P., Si, E., Chiang, A. K., Millard, S. K., Furman, A. J., ... & Seminowicz, D. A. (2023). The reliability of two prospective cortical biomarkers for pain: EEG peak alpha frequency and TMS corticomotor excitability. Journal of Neuroscience Methods, 385, 109766.

      Fishbain, D. A., Lewis, J. E., & Gao, J. (2013). Is There Significant Correlation between Self-Reported Low Back Pain Visual Analogue Scores and Low Back Pain Scores Determined by Pressure Pain Induction Matching?. Pain Practice, 13(5), 358-363.

      Furman, A. J., Prokhorenko, M., Keaser, M. L., Zhang, J., Chen, S., Mazaheri, A., & Seminowicz, D. A. (2021). Prolonged pain reliably slows peak alpha frequency by reducing fast alpha power. bioRxiv, 2021-07.

      Heitmann, H., Ávila, C. G., Nickel, M. M., Dinh, S. T., May, E. S., Tiemann, L., ... & Ploner, M. (2022). Longitudinal resting-state electroencephalography in patients with chronic pain undergoing interdisciplinary multimodal pain therapy. Pain, 163(9), e997.

      McLain, N. J., Yani, M. S., & Kutch, J. J. (2022). Analytic consistency and neural correlates of peak alpha frequency in the study of pain. Journal of Neuroscience Methods, 368, 109460.

      Ngernyam, N., Jensen, M. P., Arayawichanon, P., Auvichayapat, N., Tiamkao, S., Janjarasjitt, S., ... & Auvichayapat, P. (2015). The effects of transcranial direct current stimulation in patients with neuropathic pain from spinal cord injury. Clinical Neurophysiology, 126(2), 382-390.

      Parker, T., Huang, Y., Raghu, A. L., FitzGerald, J., Aziz, T. Z., & Green, A. L. (2021). Supraspinal effects of dorsal root ganglion stimulation in chronic pain patients. Neuromodulation: Technology at the Neural Interface, 24(4), 646-654.

      Petersen-Felix, S., & Arendt-Nielsen, L. (2002). From pain research to pain treatment: the role of human experimental pain models. Best Practice & Research Clinical Anaesthesiology, 16(4), 667-680.

      Sarnthein, J., Stern, J., Aufenberg, C., Rousson, V., & Jeanmonod, D. (2006). Increased EEG power and slowed dominant frequency in patients with neurogenic pain. Brain, 129(1), 55-64.

      Sato, G., Osumi, M., & Morioka, S. (2017). Effects of wheelchair propulsion on neuropathic pain and resting electroencephalography after spinal cord injury. Journal of Rehabilitation Medicine, 49(2), 136-143.

      Sufianov, A. A., Shapkin, A. G., Sufianova, G. Z., Elishev, V. G., Barashin, D. A., Berdichevskii, V. B., & Churkin, S. V. (2014). Functional and metabolic changes in the brain in neuropathic pain syndrome against the background of chronic epidural electrostimulation of the spinal cord. Bulletin of experimental biology and medicine, 157(4), 462-465.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      In an important fMRI study with an elegant experimental design and rigorous cross-decoding analyses, this work shows a solid dissociation between two parietal regions in visually processing actions. Specifically, aIPL is found to be sensitive to the causal effects of observed actions, while SPL is sensitive to the patterns of body motion involved in those actions. Additional analysis and explanation would help to determine the strength of evidence and the mechanistic underpinnings would benefit from closer consideration. Nevertheless, the work will be of broad interest to cognitive neuroscientists, particularly vision and action researchers.

      We thank the editor and the reviewers for their assessment and their excellent comments and suggestions. We really believe they helped us to provide a stronger and more nuanced paper. In our revision, we addressed all points raised by the reviewers. Most importantly, we added a new section on a series of analyses to characterize in more detail the representations isolated by the action-animation and action-PLD cross-decoding. Together, these analyses strengthen the conclusion that aIPL and LOTC represent action effect structures at a categorical rather than specific level, that is, the type of change (e.g., of location or configuration) rather than the specific effect type (e.g. division, compression). SPL is sensitive to body-specific representations, specifically manuality (unimanual vs. bimanual) and movement kinematics. We also added several other analyses and addressed each point of the reviewers. Please find our responses below.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors report a study aimed at understanding the brain's representations of viewed actions, with a particular aim to distinguish regions that encode observed body movements, from those that encode the effects of actions on objects. They adopt a cross-decoding multivariate fMRI approach, scanning adult observers who viewed full-cue actions, pantomimes of those actions, minimal skeletal depictions of those actions, and abstract animations that captured analogous effects to those actions. Decoding across different pairs of these actions allowed the authors to pull out the contributions of different action features in a given region's representation. The main hypothesis, which was largely confirmed, was that the superior parietal lobe (SPL) more strongly encodes movements of the body, whereas the anterior inferior parietal lobe (aIPL) codes for action effects or outcomes. Specifically, region of interest analyses showed dissociations in the successful cross-decoding of action category across full-cue and skeletal or abstract depictions. Their analyses also highlight the importance of the lateral occipito-temporal cortex (LOTC) in coding action effects. They also find some preliminary evidence about the organisation of action kinds in the regions examined.

      Strengths:

      The paper is well-written, and it addresses a topic of emerging interest where social vision and intuitive physics intersect. The use of cross-decoding to examine actions and their effects across four different stimulus formats is a strength of the study. Likewise, the a priori identification of regions of interest (supplemented by additional full-brain analyses) is a strength.

      Weaknesses:

      I found that the main limitation of the article was in the underpinning theoretical reasoning. The authors appeal to the idea of "action effect structures (AES)", as an abstract representation of the consequences of an action that does not specify (as I understand it) the exact means by which that effect is caused, nor the specific objects involved. This concept has some face validity, but it is not developed very fully in the paper, rather simply asserted. The authors make the claim that "The identification of action effect structure representations in aIPL has implications for theories of action understanding" but it would have been nice to hear more about what those theoretical implications are. More generally, I was not very clear on the direction of the claim here. Is there independent evidence for AES (if so, what is it?) and this study tests the following prediction, that AES should be associated with a specific brain region that does not also code other action properties such as body movements? Or, is the idea that this finding -- that there is a brain region that is sensitive to outcomes more than movements -- is the key new evidence for AES?

      Thank you for raising this important issue. We reasoned that AES should exist to support the recognition of perceptually variable actions, including those that we have never experienced before. To the best of our knowledge, there is only indirect evidence for the existence of AES, namely that humans effortlessly and automatically recognize actions (and underlying intentions and feelings) in movements of abstract shapes, as in the famous Heider and Simmel (1944) animations. As these animations do not contain any body posture or movement information at all, the only available cues are the spatiotemporal relations between entities and entity parts in the perceived scene. We think that the effortless and automatic attribution of actions to these stimuli points toward an evolutionarily optimized mechanism to capture action effect structures from highly variable action instantiations (so general that it even works for abstract animations). Our study thus aimed to test for the existence of such a level of representation in the brain. We clarified this point in the introduction.

      In our revised manuscript, we also revised our discussion of the implications of the finding of AES representations in the brain:

      "The identification of action effect structure representations in aIPL and LOTC has implications for theories of action understanding: Current theories (see for review e.g. Zentgraf et al., 2011; Kemmerer, 2021; Lingnau and Downing, 2024) largely ignore the fact that the recognition of many goal-directed actions requires a physical analysis of the action-induced effect, that is, a state change of the action target. Moreover, premotor and inferior parietal cortex are usually associated with motor- or body-related processing during action observation. Our results, together with the finding that premotor and inferior parietal cortex are similarly sensitive to actions and inanimate object events (Karakose-Akbiyik et al., 2023), suggest that large parts of the 'action observation network' are less specific for body-related processing in action perception than usually thought. Rather, this network might provide a substrate for the physical analysis and predictive simulation of dynamic events in general (Schubotz, 2007; Fischer, 2024). In addition, our finding that the (body-independent) representation of action effects substantially draws on right LOTC contradicts strong formulations of a 'social perception' pathway in LOTC that is selectively tuned to the processing of moving faces and bodies (Pitcher and Ungerleider, 2021). The finding of action effect representation in right LOTC/pSTS might also offer a novel interpretation of a right pSTS subregion thought to specialized for social interaction recognition: Right pSTS shows increased activation for the observation of contingent action-reaction pairs (e.g. agent A points toward object; agent B picks up object) as compared to two independent actions (i.e., the action of agent A has no effect on the action of agent B) (Isik et al., 2017). Perhaps the activation reflects the representation of a social action effect - the change of an agent's state induced by someone else's action. 
Thus, the representation of action effects might not be limited to physical object changes but might also comprise social effects not induced by a physical interaction between entities. Finally, not all actions induce an observable change in the world. It remains to be tested whether the recognition of, e.g., communication (e.g. speaking, gesturing) and perception actions (e.g. observing, smelling) similarly relies on structural action representations in aIPL and LOTC."

      On a more specific but still important point, I was not always clear that the significant, but numerically rather small, decoding effects are sufficient to support strong claims about what is encoded or represented in a region. This concern of course applies to many multivariate decoding neuroimaging studies. In this instance, I wondered specifically whether the decoding effects necessarily reflected a fully five-way distinction amongst the action kinds, or instead (for example) a significantly different pattern evoked by one action compared to all of the other four (which in turn might be similar). This concern is partly increased by the confusion matrices that are presented in the supplementary materials, which don't necessarily convey a strong classification amongst action kinds. The cluster analyses are interesting and appear to be somewhat regular over the different regions, which helps. However: it is hard to assess these findings statistically, and it may be that similar clusters would be found in early visual areas too.

      We agree that in our original manuscript, we did not statistically test what precisely drives the decoding, e.g., specific actions or rather broader categories. In our revised manuscript, we included a representational similarity analysis (RSA) that addressed this point. In short, we found that the action-animation decoding was driven by categorical distinctions between groups of actions (e.g. hit/place vs. the remaining actions) rather than a fully five-way distinction amongst all action kinds. The action-PLD decoding was mostly driven by body-specific representations, specifically manuality (unimanual vs. bimanual) and movement kinematics; in left and right LOTC we found additional evidence for action-specific representations.

      Please find below the new paragraph on the RSA:

      "To explore in more detail what types of information were isolated by the action-animation and action-PLD cross-decoding, we performed a representational similarity analysis.

      We first focus on the representations identified by the action-animation decoding. To inspect and compare the representational organization in the ROIs, we extracted the confusion matrices of the action-animation decoding from the ROIs (Fig. 5A) and compared them with different similarity models (Fig. 5B) using multiple regression. Specifically, we aimed at testing at which level of granularity action effect structures are represented in aIPL and LOTC: Do these regions encode the broad type of action effects (change of shape, change of location, ingestion) or do they encode specific action effects (compression, division, etc.)? In addition, we aimed at testing whether the effects observed in EVC can be explained by a motion energy model that captures the similarities between actions and animations that we observed in the stimulus-based action-animation decoding using motion energy features. We therefore included V1 in the ROI analysis. We found clear evidence that the representational content in right aIPL and bilateral LOTC can be explained by the effect type model but not by the action-specific model (all p < 0.005; two-sided paired t-tests between models; Fig. 5C). In left V1, we found that the motion energy model could indeed explain some representational variance; however, in both left and right V1 we also found effects for the effect type model. We assume that there were additional visual similarities between the broad types of actions and animations that were not captured by the motion energy model (or other visual models; see Supplementary Information). A searchlight RSA revealed converging results, and additionally found effects for the effect type model in the ventral part of left aIPL and for the action-specific model in the left anterior temporal lobe, left dorsal central gyrus, and right EVC (Fig. 5D). The latter findings were unexpected and should be interpreted with caution, as these regions (except right EVC) were not found in the action-animation cross-decoding and therefore should not be considered reliable (Ritchie et al., 2017). The motion energy model did not reveal effects that survived the correction for multiple comparisons, but a more lenient uncorrected threshold of p = 0.005 revealed clusters in left EVC and bilateral posterior SPL.

      To characterize the representations identified by the action-PLD cross-decoding, we used a manuality model that captures whether the actions were performed with both hands vs. one hand, an action-specific model as used in the action-animation RSA above, and a kinematics model that was based on the 3D kinematic marker positions of the PLDs (Fig. 6B). Since pSTS is a key region for biological motion perception, we included this region in the ROI analysis. The manuality model explained the representational variance in the parietal ROIs, pSTS, and LOTC, but not in V1 (all p < 0.002; two-sided paired t-tests between V1 and other ROIs; Fig. 6C). By contrast, the action-specific model revealed significant effects in V1 and LOTC, but not in pSTS and parietal ROIs (but note that effects in V1 and pSTS did not differ significantly from each other; all other two-sided paired t-tests between mentioned ROIs were significant at p < 0.0005). The kinematics model explained the representational variance in all ROIs. A searchlight RSA revealed converging results, and additionally found effects for the manuality model in bilateral dorsal/medial prefrontal cortex and in right ventral prefrontal cortex and insula (Fig. 6D).”
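
      The model comparison described in the quoted passage can be sketched as a multiple regression of a confusion matrix on candidate model matrices. The example below is a minimal illustration in plain Python: the grouping of five actions into effect types, the toy confusion values, and all function names are hypothetical, and the least-squares solver is a generic one rather than the software used in the study.

```python
def solve_linear(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(predictors, y):
    """Least-squares weights [intercept, beta_1, ...] for vectorised model matrices."""
    rows = [[1.0] + [p[i] for p in predictors] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


def flatten(matrix):
    return [v for row in matrix for v in row]


# Hypothetical grouping of five actions into broad effect types.
effect_types = ["shape", "shape", "location", "location", "ingestion"]
n = len(effect_types)
effect_model = [[1.0 if effect_types[i] == effect_types[j] else 0.0
                 for j in range(n)] for i in range(n)]
specific_model = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Toy confusion matrix built purely from the effect-type structure.
toy_confusions = [[0.1 + 0.5 * effect_model[i][j] for j in range(n)] for i in range(n)]

betas = ols([flatten(effect_model), flatten(specific_model)], flatten(toy_confusions))
```

      Because the toy confusions were generated from the effect-type structure alone, the regression assigns the weight to the effect-type model and none to the action-specific model, mirroring the kind of dissociation the passage reports for aIPL and LOTC.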

      We also included an ROI covering early visual cortex (V1) in our analysis. While there was significant decoding for action-animation in V1, the representational organization did not substantially match the organization found in aIPL and LOTC: A cluster analysis revealed much higher similarity between LOTC and aIPL than between these regions and V1:

      (please note that in this analysis we included the action-PLD RDMs as reference, and to test whether aIPL shows a similar representational organization in action-anim and action-PLD; see below)

      Given these results, we think that V1 captured different aspects in the action-animation cross-decoding than aIPL and LOTC. We address this point in more detail in our response to the "Recommendations for The Authors".

      Reviewer #2 (Public Review):

      Summary:

      This study uses an elegant design, using cross-decoding of multivariate fMRI patterns across different types of stimuli, to convincingly show a functional dissociation between two sub-regions of the parietal cortex, the anterior inferior parietal lobe (aIPL) and superior parietal lobe (SPL) in visually processing actions. Specifically, aIPL is found to be sensitive to the causal effects of observed actions (e.g. whether an action causes an object to compress or to break into two parts), and SPL to the motion patterns of the body in executing those actions.

      To show this, the authors assess how well linear classifiers trained to distinguish fMRI patterns of response to actions in one stimulus type can generalize to another stimulus type. They choose stimulus types that abstract away specific dimensions of interest. To reveal sensitivity to the causal effects of actions, regardless of low-level details or motion patterns, they use abstract animations that depict a particular kind of object manipulation: e.g. breaking, hitting, or squashing an object. To reveal sensitivity to motion patterns, independently of causal effects on objects, they use point-light displays (PLDs) of figures performing the same actions. Finally, full videos of actors performing actions are used as the stimuli providing the most complete, and naturalistic information. Pantomime videos, with actors mimicking the execution of an action without visible objects, are used as an intermediate condition providing more cues than PLDs but less than real action videos (e.g. the hands are visible, unlike in PLDs, but the object is absent and has to be inferred). By training classifiers on animations, and testing their generalization to full-action videos, the classifiers' sensitivity to the causal effect of actions, independently of visual appearance, can be assessed. By training them on PLDs and testing them on videos, their sensitivity to motion patterns, independent of the causal effect of actions, can be assessed, as PLDs contain no information about an action's effect on objects.
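
      The cross-decoding logic summarised above can be sketched in a few lines. The example trains on patterns from one stimulus type and tests on another. It swaps in a nearest-centroid classifier for simplicity, whereas the study is described as using linear classifiers, and the three-dimensional "patterns" and all names are invented for illustration.

```python
def centroids(patterns, labels):
    """Mean pattern per label."""
    sums, counts = {}, {}
    for pattern, label in zip(patterns, labels):
        acc = sums.setdefault(label, [0.0] * len(pattern))
        for i, value in enumerate(pattern):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(cents, pattern):
    """Label of the centroid closest to the pattern (squared Euclidean distance)."""
    def sqdist(centre):
        return sum((a - b) ** 2 for a, b in zip(centre, pattern))
    return min(cents, key=lambda label: sqdist(cents[label]))


def cross_decode(train_x, train_y, test_x, test_y):
    """Train on one stimulus type, test on another, and return the accuracy."""
    cents = centroids(train_x, train_y)
    hits = sum(predict(cents, x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


# Hypothetical response patterns for three action kinds in two stimulus formats.
animation_x = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1],
               [0.1, 1.0, 0.0], [0.0, 0.9, 0.1],
               [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
animation_y = ["break", "break", "hit", "hit", "squash", "squash"]
video_x = [[1.2, 0.3, 0.2], [0.3, 1.1, 0.2], [0.2, 0.3, 1.2]]
video_y = ["break", "hit", "squash"]

accuracy = cross_decode(animation_x, animation_y, video_x, video_y)
reverse_accuracy = cross_decode(video_x, video_y, animation_x, animation_y)
chance = 1 / len(set(video_y))
```

      Above-chance accuracy in such a scheme indicates that whatever distinguishes the action kinds in the training format is also present in the test format. Running both directions, as in the last lines, is one way to check for the asymmetries across decoding directions that the review mentions.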

      These analyses reveal that aIPL can generalize between animations and videos, indicating that it is sensitive to action effects. Conversely, SPL is found to generalize between PLDs and videos, showing that it is more sensitive to motion patterns. A searchlight analysis confirms this pattern of results, particularly showing that action-animation decoding is specific to right aIPL, and revealing an additional cluster in LOTC, which is included in subsequent analyses. Action-PLD decoding is more widespread across the whole action observation network.

      This study provides a valuable contribution to the understanding of functional specialization in the action observation network. It uses an original and robust experimental design to provide convincing evidence that understanding the causal effects of actions is a meaningful component of visual action processing and that it is specifically localized in aIPL and LOTC.

      Strengths:

      The authors cleverly managed to isolate specific aspects of real-world actions (causal effects, motion patterns) in an elegant experimental design, and by testing generalization across different stimulus types rather than within-category decoding performance, they show results that are convincing and readily interpretable. Moreover, they clearly took great care to eliminate potential confounds in their experimental design (for example, by carefully ordering scanning sessions by increasing realism, such that the participants could not associate animation with the corresponding real-world action), and to increase stimulus diversity for different stimulus types. They also carefully examine their own analysis pipeline, and transparently expose it to the reader (for example, by showing asymmetries across decoding directions in Figure S3). Overall, this is an extremely careful and robust paper.

      Weaknesses:

      I list several ways in which the paper could be improved below. More than 'weaknesses', these are either ambiguities in the exact claims made, or points that could be strengthened by additional analyses. I don't believe any of the claims or analyses presented in the paper show any strong weaknesses, problematic confounds, or anything that requires revising the claims substantially.

      (1) Functional specialization claims: throughout the paper, it is not clear what the exact claims of functional specialization are. While, as can be seen in Figure 3A, action-animation cross-decoding is significantly higher in aIPL, decoding performance is also above chance in right SPL, although this is not a strong effect. More importantly, action-PLD cross-decoding is robustly above chance in both right and left aIPL, implying that this region is sensitive to motion patterns as well as causal effects. I am not questioning that the difference between the two ROIs exists - that is very convincingly shown. But sentences such as "distinct neural systems for the processing of observed body movements in SPL and the effect they induce in aIPL" (lines 111-112, Introduction) and "aIPL encodes abstract representations of action effect structures independently of motion and object identity" (lines 127-128, Introduction) do not seem fully justified when action-PLD cross-decoding is overall stronger than action-animation cross-decoding in aIPL. Is the claim, then, that in addition to being sensitive to motion patterns, aIPL contains a neural code for abstracted causal effects, e.g. involving a separate neural subpopulation or a different coding scheme? Moreover, if sensitivity to motion patterns is not specific to SPL, but can be found in a broad network of areas (including aIPL itself), can it really be claimed that this area plays a specific role, similar to the specific role of aIPL in encoding causal effects? There is indeed, as can be seen in Figure 3A, a difference between action-PLD decoding in SPL and aIPL, but based on the searchlight map shown in Figure 3B I would guess that a similar difference would be found by comparing aIPL to several other regions. The authors should clarify these ambiguities.

      We thank the reviewer for this careful assessment. The observation of action-PLD cross-decoding in aIPL is indeed not straightforward to interpret: It could mean that aIPL encodes both body movements and action effect structures by different neural subpopulations. Or it could mean that representations of action effect structures were also activated by the PLDs, which led to successful decoding in the action-PLD cross-decoding. Our revision allows a more nuanced view of this issue:

      First, we included the results of a behavioral test showing that PLDs at least weakly allow for recognition of the specific actions (see our response to the second comment), which in turn might activate action effect structure representations. Second, the finding that the cross-decoding between animations and PLDs also revealed effects in left and right aIPL (as pointed out by the reviewer in the second comment) supports the interpretation that PLDs have activated, to some extent, action effect structure representations.

      On the other hand, if aIPL encodes only action effect structures that were also captured in the action-PLD cross-decoding, we would expect the RDMs in aIPL to be similar for the action-PLD and action-animation cross-decoding. However, the cluster analysis (see our response to Reviewer 1 above) does not show this; rather, all action-PLD RDMs are representationally more similar to each other than to action-animation RDMs, specifically with regard to aIPL. In addition, the RSA revealed sensitivity to manuality and kinematics also in aIPL. This suggests that the action-PLD decoding in aIPL was at least partially driven by representations related to body movements.

      Taken together, these findings suggest that aIPL also encodes body movements. In fact, we did not want to make the strong claim that aIPL selectively represents action effect structures. Rather, we think that our results show that aIPL and SPL are disproportionally sensitive to action effects and body movements, respectively. We added this to our revised discussion:

      "The action-PLD cross-decoding revealed widespread effects in LOTC and parietal cortex, including aIPL. What type of representation drove the decoding in aIPL? One possible interpretation is that aIPL encodes both body movements (isolated by the action-PLD cross-decoding) and action effect structures (isolated by the action-animation cross-decoding). Alternatively, aIPL selectively encodes action effect structures, which have been activated by the PLDs. A behavioral test showed that PLDs at least weakly allow for recognition of the specific actions (Tab. S2), which might have activated corresponding action effect structure representations. In addition, the finding that aIPL revealed effects for the cross-decoding between animations and PLDs further supports the interpretation that PLDs have activated, at least to some extent, action effect structure representations.  On the other hand, if aIPL encodes only action effect structures, we would expect that the representational similarity patterns in aIPL are similar for the action-PLD and action-animation cross-decoding. However, this was not the case; rather, the representational similarity pattern in aIPL was more similar to SPL for the action-PLD decoding, which argues against distinct representational content in aIPL vs. SPL isolated by the action-PLD decoding. In addition, the RSA revealed sensitivity to manuality and kinematics also in aIPL, which suggests that the action-PLD decoding in aIPL was at least partially driven by representations related to body movements. Taken together, these findings suggest that aIPL encodes not only action effect structures, but also representations related to body movements. Likewise, also SPL shows some sensitivity to action effect structures, as demonstrated by effects in SPL for the action-animation and pantomime-animation cross-decoding. 
      Thus, our results suggest that aIPL and SPL are not selectively but disproportionally sensitive to action effects and body movements, respectively."

      A clarification to the sentence "aIPL encodes abstract representations of action effect structures independently of motion and object identity": Here we are referring to the action-animation cross decoding only; specifically, the fact that because the animations did not show body motion and concrete objects, the representations isolated in the action-animation cross decoding must be independent of body motion and concrete objects. This does not rule out that the same region encodes other kinds of representations in addition.

      And another side note on the RSA: It might be tempting to test the "effects" model (distinguishing change of shape, change of location and ingest) also in the action-PLD multiple regression RSA in order to test whether this model explains additional variance in aIPL, which would point towards action effect structure representations. However, the "effects" model is relatively strongly correlated with the "manuality" model (VIF=4.2), indicating that multicollinearity might exist. We therefore decided not to include this model in the RSA. We nonetheless tested the inclusion of this model and did not find clear effects for the "effects" model in aIPL (although we did in LOTC). The other models revealed largely similar effects as in the RSA without the "effects" model, but the effects appeared overall noisier. In general, we would like to emphasize that an RSA with just 5 actions is not ideal because of the small number of pairwise comparisons (10 for 5 actions), which increases the chance of coincidental similarities between model and neural RDMs. We therefore marked this analysis as "exploratory" in the article.
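
      To make the multicollinearity argument concrete, the following minimal sketch computes a variance inflation factor for two categorical model RDMs over five actions. It is only an illustration: the assignment of actions to effect types and to uni-/bimanual categories is an assumption made for this example, and the two-predictor shortcut VIF = 1 / (1 - r^2) will not reproduce the reported VIF of 4.2, which depends on the full set of models in the regression.

```python
from itertools import combinations
from math import sqrt

def model_rdm(labels):
    """Vectorized categorical RDM: 0 if two items share a label, else 1."""
    return [0.0 if a == b else 1.0 for a, b in combinations(labels, 2)]

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def vif_two_predictors(x, y):
    """With exactly two predictors, VIF reduces to 1 / (1 - r^2)."""
    r = pearson(x, y)
    return 1.0 / (1.0 - r * r)

# Item order: break, squash, hit, place, drink.
# Both label assignments below are hypothetical and serve only as an example.
effect = model_rdm(["shape", "shape", "location", "location", "ingest"])
manual = model_rdm(["bi", "bi", "uni", "uni", "uni"])

vif = vif_two_predictors(effect, manual)
print(f"pairs: {len(effect)}, VIF: {vif:.2f}")
```

      With only 10 pairwise dissimilarities per RDM, a few shared zeros are enough to correlate two categorical models, which is why collinearity deserves attention in designs with few items.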

      (2) Causal effect information in PLDs: the reasoning behind the use of PLD stimuli is to have a condition that isolates motion patterns from the causal effects of actions. However, it is not clear whether PLDs really contain as little information about action effects as claimed. Cross-decoding between animations and PLDs is significant in both aIPL and LOTC, as shown in Figure 4. This indicates that PLDs do contain some information about action effects. This could also be tested behaviorally by asking participants to assign PLDs to the correct action category. In general, disentangling the roles of motion patterns and implied causal effects in driving action-PLD cross-decoding (which is the main dependent variable in the paper) would strengthen the paper's message. For example, it is possible that the strong action-PLD cross-decoding observed in aIPL relies on a substantially different encoding from, say, SPL, an encoding that perhaps reflects causal effects more than motion patterns. One way to exploratively assess this would be to integrate the clustering analysis shown in Figure S1 with a more complete picture, including animation-PLD and action-PLD decoding in aIPL.

      With regard to the suggestion to behaviorally test how well participants can grasp the underlying action effect structures: We indeed ran a behavioral experiment to assess the recognizability of actions in the PLD stick figures (as well as in the pantomimes). In short, this experiment revealed that participants could not recognize the actions in the PLD stick figures well and often confused them with kinematically similar but conceptually different actions (e.g. breaking --> shaking, hitting --> swiping, squashing --> knitting). However, the results also show that we cannot completely rule out that PLDs contain some information about action effects.

      Because we considered this behavioral experiment a standard assessment of the quality of the stimuli, we did not report it in the original manuscript. We have now added a section to the methods that describes the behavioral experiments in detail:

      "To assess how much the animations, PLD stick figures, and pantomimes were associated with the specific action meanings of the naturalistic actions, we performed a behavioral experiment. 14 participants observed videos of the animations, PLDs (without stick figures), and pantomimes in three separate sessions (in that order) and were asked to describe what kind of actions the animations depict and give confidence ratings on a Likert scale from 1 (not confident at all) to 10 (very confident). Because the results for PLDs were unsatisfying (several participants did not recognize human motion in the PLDs), we added stick figures to the PLDs as described above and repeated the rating for PLD stick figures with 7 new participants, as reported below.

      A general observation was that almost no participant used verb-noun phrases (e.g. "breaking a stick") in their descriptions for any stimulus type. For the animations, the participants used more abstract verbs or nouns to describe the actions (e.g. dividing, splitting, division; Tab. S1). These abstract descriptions matched the intended action structures quite well, and participants were relatively confident about their responses (mean confidences between 6 and 7.8). These results suggest that the animations were not substantially associated with specific action meanings (e.g. "breaking a stick") but captured the coarse action structures. For the PLD stick figures (Tab. S2), responses were more variable and actions were often confused with kinematically similar but conceptually different actions (e.g. breaking --> shaking, hitting --> turning page, squashing --> knitting). Confidence ratings were relatively low (mean confidences between 3 and 5.1). These results suggest that PLD stick figures, too, were not substantially associated with specific action meanings and additionally did not clearly reveal the underlying action effect structures. Finally, pantomimes were recognized much better, which was also reflected in high confidence ratings (mean confidences between 8 and 9.2; Tab. S3). This suggests that, unlike PLD stick figures, pantomimes allowed much better access to the underlying action effect structures."

      We also agree with the second suggestion to investigate in more detail the representational profiles in aIPL and SPL. We think that the best way to do so is the RSA that we reported above. However, to provide a complete picture of the results, we also added the whole brain maps and RDMs for the animation-pantomime, animation-PLD, pantomime-PLD, and action-pantomime to the supplementary information.

      (3) Nature of the motion representations: it is not clear what the nature of the putatively motion-driven representation driving action-PLD cross-decoding is. While, as you note in the Introduction, other regions such as the superior temporal sulcus have been extensively studied, with the understanding that they are part of a feedforward network of areas analyzing increasingly complex motion patterns (e.g. Giese & Poggio, Nature Reviews Neuroscience 2003), it doesn't seem like the way in which SPL represents these stimuli is similarly well understood. While the action-PLD cross-decoding shown here is a convincing additional piece of evidence for a motion-based representation in SPL, an interesting additional analysis would be to compare, for example, RDMs of different actions in this region with explicit computational models. These could be, for example, classic motion energy models inspired by the response characteristics of regions such as V5/MT, which have been shown to predict cortical responses and psychophysical performance both for natural videos (e.g. Nishimoto et al., Current Biology 2011) and PLDs (Casile & Giese, Journal of Vision 2005). A similar cross-decoding analysis between videos and PLDs as that conducted on the fMRI patterns could be done on these models' features, obtaining RDMs that could directly be compared with those from SPL. This would be a very informative analysis that could enrich our knowledge of a relatively unexplored region in action recognition. Please note, however, that action recognition is not my field of expertise, so it is possible that there are practical difficulties in conducting such an analysis that I am not aware of. In this case, I kindly ask the authors to explain what these difficulties could be.

      Thank you for this very interesting suggestion. We conducted a cross-decoding analysis that was based on the features of motion energy models as described in Nishimoto et al. (2011). Control analyses within each stimulus type revealed high decoding accuracies (animations: 100%, PLDs: 100%, pantomimes: 65%, actions: 55%), which suggests that the motion energy data generally contains information that can be detected by a classifier. However, the cross-decoding between actions and PLDs was at chance (20%), and the classification matrix did not resemble the neural RDMs. We also tested optical flow vectors as input to the decoding, which revealed similarly high decoding for the within-stimulus-type decoding (animations: 75%, PLDs: 100%, pantomimes: 65%, actions: 40%), but again at-chance decoding for action-PLD (20%), notably with a very different classification pattern:

      Author response image 1.

      Given these mixed results, we decided not to use these models for a statistical comparison with the neural action-PLD RDMs.

      It is notable that the cross-decoding generally worked less well for decoding schemes that involve PLDs, which is likely due to the highly different feature complexity of actions and PLDs: Naturalistic actions have much richer visual details, texture, and more complex motion cues. Therefore, motion energy features extracted from these videos likely capture a mixture of both fine-grained and broad motion information across different spatial frequencies. By contrast, motion energy features of PLDs are sparse and might not match the features of naturalistic actions. In a way, this was intended, as we were interested in higher-level body kinematics rather than lower-level motion features. We therefore decided to use a different approach to investigate the representational structure found in the action-PLD cross-decoding: As the PLDs were based on kinematic recordings of actions that were carried out in exactly the same manner as the naturalistic actions, we computed the dissimilarity of the 5 actions based on the kinematic marker positions. Specifically, we averaged the kinematic data across the 2 exemplars per PLD, vectorized the 3D marker positions of all time points of the PLDs (3 dimensions x 13 markers x 200 time points), computed the pairwise correlations between the 5 vectors, and converted the correlations into dissimilarity values by computing 1 - r. This RDM was then compared with the neural RDMs extracted from the action-PLD cross-decoding. This was done using a multiple regression RSA (see also our response to Reviewer 1's public comment 2), which allowed us to statistically test the kinematic model against other dissimilarity models: a categorical model of manuality (uni- vs. bimanual) and an action-specific model that discriminates each specific action from each other with equal distance.

      This analysis revealed interesting results: the kinematic model explained the representational variance in bilateral SPL and (particularly right) pSTS as well as in right fusiform cortex and early visual cortex. The action-specific model revealed effects restricted to bilateral LOTC. The manuality model revealed widespread effects throughout the action observation network but not in EVC.
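
      The construction of the kinematic model RDM described above (average across exemplars, vectorize, correlate, convert to 1 - r) can be sketched as follows. This is a minimal illustration and not the analysis code used for the study: the marker data are random placeholders, and the function and variable names are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt

N_DIMS, N_MARKERS, N_TIME = 3, 13, 200  # dimensions stated in the text
ACTIONS = ["break", "squash", "hit", "place", "drink"]

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def kinematic_rdm(recordings):
    """recordings maps action -> list of exemplars, each a flat vector of
    marker coordinates. Returns {(action_a, action_b): 1 - r}."""
    mean_vecs = {}
    for action, exemplars in recordings.items():
        # Average the exemplars element-wise to get one vector per action.
        mean_vecs[action] = [sum(v) / len(v) for v in zip(*exemplars)]
    return {
        (a, b): 1.0 - pearson(mean_vecs[a], mean_vecs[b])
        for a, b in combinations(recordings, 2)
    }

# Placeholder data: 2 exemplars per action, each already vectorized.
rng = random.Random(0)
vec_len = N_DIMS * N_MARKERS * N_TIME
recordings = {
    action: [[rng.gauss(0.0, 1.0) for _ in range(vec_len)] for _ in range(2)]
    for action in ACTIONS
}

rdm = kinematic_rdm(recordings)
for pair, dissimilarity in rdm.items():
    print(pair, round(dissimilarity, 3))
```

      The resulting 10 pairwise values form the lower triangle of the 5 x 5 model RDM, which can then serve as one predictor in a multiple regression RSA alongside the categorical models.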

      (4) Clustering analysis: I found the clustering analysis shown in Figure S1 very clever and informative. However, there are two things that I think the authors should clarify. First, it's not clear whether the three categories of object change were inferred post-hoc from the data or determined beforehand. It is completely fine if these were just inferred post-hoc, I just believe this ambiguity should be clarified explicitly. Second, while action-anim decoding in aIPL and LOTC looks like it is consistently clustered, the clustering of action-PLD decoding in SPL and LOTC looks less reliable. The authors interpret this clustering as corresponding to the manual vs. bimanual distinction, but for example "drink" (a unimanual action) is grouped with "break" and "squash" (bimanual actions) in left SPL and grouped entirely separately from the unimanual and bimanual clusters in left LOTC. Statistically testing the robustness of these clusters would help clarify whether it is the case that action-PLD in SPL and LOTC has no semantically interpretable organizing principle, as might be the case for a representation based entirely on motion pattern, or rather that it is a different organizing principle from action-anim, such as the manual vs. bimanual distinction proposed by the authors. I don't have much experience with statistical testing of clustering analyses, but I think a permutation-based approach, wherein a measure of cluster robustness, such as the Silhouette score, is computed for the clusters found in the data and compared to a null distribution of such measures obtained by permuting the data labels, should be feasible. In a quick literature search, I have found several papers describing similar approaches: e.g. Hennig (2007), "Cluster-wise assessment of cluster stability"; Tibshirani et al. (2001) "Estimating the Number of Clusters in a Data Set Via the Gap Statistic". 
      These are just pointers to potentially useful approaches, the authors are much better qualified to pick the most appropriate and convenient method. However, I do think such a statistical test would strengthen the clustering analysis shown here. With this statistical test, and the more exhaustive exposition of results I suggested in point 2 above (e.g. including animation-PLD and action-PLD decoding in aIPL), I believe the clustering analysis could even be moved to the main text and occupy a more prominent position in the paper.

      With regard to the first point, we clarified in the methods that we inferred the 3 broad action effect categories after the stimulus selection: "This categorization was not planned before designing the study but resulted from the stimulus selection."

      Thank you for your suggestion to test more specifically the representational organization in the action-PLD and action-animation RDMs. However, after a careful assessment, we decided to replace the cluster analysis with an RSA. We did this for two reasons:

      First, we think that RSA is a better (and more conventional) approach to statistically investigate the representational structure in the ROIs (and in the whole brain). The RSA allowed us, for example, to specifically test the mentioned distinction between unimanual and bimanual actions, and to test it against other models, i.e., a kinematic model and an action-specific model. This indeed revealed interesting distinct representational profiles of SPL and LOTC.

      Second, we learned that the small number of items (5) is generally not ideal for cluster analyses (absolute minimum for meaningful interpretability is 4, but to form at least 2-3 clusters a minimum of 10-15 items is usually recommended). A similar rule of thumb applies to methods to statistically assess the reliability of cluster solutions (e.g., Silhouette Scores, Cophenetic Correlation Coefficient, Jaccard Coefficient). Finally, the small number of items is not ideal to run a permutation test because the number of unique permutations (for shuffling the data labels: 5! = 120) is insufficient to generate a meaningful null distribution. We therefore think it is best to discard the cluster analysis altogether. We hope you agree with this decision.
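
      As a quick check of the permutation argument, the number of distinct label orderings for five items, and the smallest p-value an exact label-permutation test could return, can be enumerated directly. This is a small sketch; it counts only label orderings and makes no assumption about which cluster statistic would be computed on them.

```python
from itertools import permutations
from math import factorial

ACTIONS = ["break", "squash", "hit", "place", "drink"]  # five items

# Enumerate every relabeling of the five items explicitly.
label_orders = list(permutations(ACTIONS))
n_perms = len(label_orders)

# The smallest p-value an exact permutation test can return is 1 / n_perms,
# because the observed labeling is itself one of the permutations.
min_p = 1 / n_perms

print(n_perms, factorial(len(ACTIONS)), round(min_p, 4))
```

      The null distribution is therefore coarse, and many relabelings can yield identical cluster statistics, which reduces its effective resolution further.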

      (5) ROI selection: this is a minor point, related to the method used for assigning voxels to a specific ROI. In the description in the Methods (page 16, lines 514-24), the authors mention using the MNI coordinates of the center locations of Brodmann areas. Does this mean that then they extracted a sphere around this location, or did they use a mask based on the entire Brodmann area? The latter approach is what I'm most familiar with, so if the authors chose to use a sphere instead, could they clarify why? Or, if they did use the entire Brodmann area as a mask, and not just its center coordinates, this should be made clearer in the text.

      We indeed used a sphere around the center coordinate of the Brodmann areas. This was done to keep the ROI sizes / number of voxels constant across ROIs. Since we aimed at comparing the decoding accuracies between aIPL and SPL, we thereby minimized the possibility that differences in decoding accuracy between ROIs are due to ROI size differences. Using spherical ROIs is a well-established practice that we use in our lab by default (e.g. Wurm & Caramazza, NatComm, 2019; Wurm & Caramazza, NeuroImage, 2019; Karakose, Caramazza, & Wurm, NatComm, 2023). We clarified in the revised manuscript that we used spherical ROIs to keep the ROI sizes constant.
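
      The rationale for spherical ROIs can be illustrated with a short sketch: selecting all voxels within a fixed radius of a center yields the same voxel count regardless of where the center lies. The radius, voxel size, and center coordinates below are hypothetical values chosen for illustration and are not taken from the manuscript; centers are given in voxel indices rather than MNI millimeters.

```python
from itertools import product

def sphere_voxels(center, radius_mm, voxel_size_mm):
    """Return voxel indices whose centers lie within radius_mm of `center`.

    `center` is a voxel index (i, j, k); isotropic voxels are assumed.
    """
    cx, cy, cz = center
    reach = int(radius_mm // voxel_size_mm)
    offsets = range(-reach, reach + 1)
    return [
        (cx + dx, cy + dy, cz + dz)
        for dx, dy, dz in product(offsets, repeat=3)
        if (dx * dx + dy * dy + dz * dz) * voxel_size_mm ** 2 <= radius_mm ** 2
    ]

# Two hypothetical ROI centers in voxel space.
roi_a = sphere_voxels((30, 40, 50), radius_mm=12, voxel_size_mm=3)
roi_b = sphere_voxels((60, 35, 55), radius_mm=12, voxel_size_mm=3)

print(len(roi_a), len(roi_b))  # identical counts by construction
```

      In practice the count can still differ slightly between ROIs if the sphere is intersected with a brain or gray-matter mask, which this sketch does not model.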

      Reviewer #3 (Public Review):

      This study tests for dissociable neural representations of an observed action's kinematics vs. its physical effect in the world. Overall, it is a thoughtfully conducted study that convincingly shows that representations of action effects are more prominent in the anterior inferior parietal lobe (aIPL) than the superior parietal lobe (SPL), and vice versa for the representation of the observed body movement itself. The findings make a fundamental contribution to our understanding of the neural mechanisms of goal-directed action recognition, but there are a couple of caveats to the interpretation of the results that are worth noting:

      (1) Both a strength of this study and ultimately a challenge for its interpretation is the fact that the animations are so different in their visual content than the other three categories of stimuli. On one hand, as highlighted in the paper, it allows for a test of action effects that is independent of specific motion patterns and object identities. On the other hand, the consequence is also that Action-PLD cross-decoding is generally better than Action-Anim cross-decoding across the board (Figure 3A) - not surprising because the spatiotemporal structure is quite different between the actions and the animations. This pattern of results makes it difficult to interpret a direct comparison of the two conditions within a given ROI. For example, it would have strengthened the argument of the paper to show that Action-Anim decoding was better than Action-PLD decoding in aIPL; this result was not obtained, but that could simply be because the Action and PLD conditions are more visually similar to each other in a number of ways that influence decoding. Still, looking WITHIN each of the Action-Anim and Action-PLD conditions yields clear evidence for the main conclusion of the study.

      The reviewer is absolutely right: Because the PLDs are more similar to the actions than the animations, a comparison of the effects of the two decoding schemes is not informative. As we also clarified in our response to Reviewer 2, we cannot rule out that the action-PLD decoding picked up information related to action effect structures. Thus, the only firm conclusion that we can draw from our study is that aIPL and SPL are disproportionally sensitive to action effects and body movements, respectively. We clarified this point in our revised discussion.

      (2) The second set of analyses in the paper, shown in Figure 4, follows from the notion that inferring action effects from body movements alone (i.e., when the object is unseen) is easier via pantomimes than with PLD stick figures. That makes sense, but it doesn't necessarily imply that the richness of the inferred action effect is the only or main difference between these conditions. There is more visual information overall in the pantomime case. So, although it's likely true that observers can more vividly infer action effects from pantomimes vs stick figures, it's not a given that contrasting these two conditions is an effective way to isolate inferred action effects. The results in Figure 4 are therefore intriguing but do not unequivocally establish that aIPL is representing inferred rather than observed action effects.

      We agree that higher decoding accuracies for Action-Pant vs. Action-PLD and Pant-PLD could also be due to visual details (in particular of hands and body) that are more similar in actions and pantomimes relative to PLDs. However, please note that for this reason we also included the comparison of Anim-Pant vs. Anim-PLD. For this comparison, visual details should not influence the decoding. We clarified this point in our revision.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      It struck me that there are structural distinctions amongst the 5 action kinds that were not highlighted and may have been unintentional. Specifically, three of the actions are "unary" in a sense: break(object), squash(object), hit(object). One is "binary": place(object, surface), and the fifth (drink) is perhaps ternary - transfer(liquid, cup, mouth)? Might these distinctions be important for the organization of action effects (or actions generally)?

      This is an interesting aspect that we had not yet considered. We agree that for the organization of actions (and perhaps action effects) this distinction might be relevant. One issue we noticed, however, is that for the animations the suggested organization might be less clear, in particular for "drink" as ternary, and perhaps also for "place" as binary. Thus, in the action-animation cross-decoding, this distinction - if it exists in the brain - might be harder to capture. We nonetheless tested this distinction. Specifically, we constructed a dissimilarity model (using the proposed organization, valency model hereafter) and tested it in a multiple regression RSA against an effect type model and two other models for specific actions (discriminating each action from each other with the same distance) and motion energy (as a visual control model). This analysis revealed no effects for the "valency" model in the ROI-based RSA. A searchlight analysis also revealed no effects for this model. Since we think that the valency model is not ideally suited to test representations of action effects (using data from the action-animation cross-decoding), and to avoid unnecessarily complicating the description of the RSA, we decided not to include this model in the final RSA reported in the manuscript.

      In general, I found it surprising that the authors treated their LOTC findings as surprising or unexpected. Given the long literature associating this region with several high-level visual functions related to body perception, action perception, and action execution, I thought there were plenty of a priori reasons to investigate the LOTC's behaviour in this study. Looking at the supplementary materials, indeed some of the strongest effects seem to be in that region.

      (Likewise, classically, the posterior superior temporal sulcus is strongly associated with the perception of others' body movements; why not also examine this region of interest?)

      One control analysis that would considerably add to the strength of the authors' conclusions would be to examine how actions could be cross-decoded (or not) in the early visual cortex. Especially in comparisons of, for example, pantomime to full-cue video, we might expect a high degree of decoding accuracy, which might influence the way we interpret similar decoding in other "higher level" regions.

      We agree that it makes sense to also look into LOTC and pSTS, and also EVC. We therefore added ROIs for these regions: For EVC and LOTC we used the same approach based on Brodmann areas as for aIPL and SPL, i.e., we used BA 17 for V1 and BA 19 for LOTC. For pSTS, we defined the ROI based on a meta-analysis contrast for human vs. non-human body movements (Grosbras et al., HBM 2012). Indeed we find that the strongest effects (for both action effect structures and body movements) can be found in LOTC. We also found effects in EVC that, at least for the action-animation cross-decoding, are more difficult to interpret. To test for a coincidental visual confound between actions and animations, we included a control model for motion energy in the multiple regression RSA, which could indeed explain some of the representational content in V1. However, the effect type model also revealed effects in V1, suggesting that there were additional visual features that caused the action-animation cross-decoding in V1. Notably, as pointed out in our response to the Public comments, the representational organization in V1 was relatively distinct from the representational organization in aIPL and LOTC, which argues against the interpretation that effects in aIPL and LOTC were driven by the same (visual) features as in V1.

      Regarding the analyses reported in Figure 4: wouldn't it be important to also report similar tests for SPL?

      In the analysis of implied action effect structures, we focused on the brain regions that revealed robust effects for action-animation decoding in the ROI and the searchlight analysis, that is, aIPL and LOTC. However, we performed a whole brain conjunction analysis to search for other brain regions that show a profile for implied action effect representation. This analysis (that we forgot to mention in our original manuscript; now corrected) did not find evidence for implied action effect representations in SPL.

      However, for completeness, we also added a ROI analysis for SPL. This analysis revealed a surprisingly complex pattern of results: We observed stronger decoding for Anim-Pant vs. Anim-PLD, whereas there were no differences for the comparisons of Action-Pant with Action-PLD and Pant-PLD:

      This pattern of results is not straightforward to explain: First, the equally strong decoding for Action-Pant, Action-PLD, and Pant-PLD suggests that SPL is not substantially sensitive to body part details. Rather, the decoding relied on the coarse body part movements, independently of the specific stimulus type (action, pantomime, PLD). However, the stronger decoding for Anim-Pant than for Anim-PLD suggests that SPL is also sensitive to implied action effect structures. This appears unlikely, because no effects (in left SPL) or only weak effects (in right SPL) were found for the more canonical Action-Anim cross-decoding. The Anim-Pant cross-decoding was even stronger than the Action-Anim cross-decoding, which is counterintuitive because naturalistic actions contain more information than pantomimes, specifically with regard to action effect structures. How can this pattern of results be interpreted? Perhaps, for pantomimes and animations, not only aIPL and LOTC but also SPL is involved in inferring (implied) action effect structures. However, this conclusion would also require differences for the comparison of Action-Pant with Action-PLD and for Action-Pant with Pant-PLD. Another non-mutually exclusive interpretation is that both animations and pantomimes are more ambiguous in terms of the specific action, as opposed to naturalistic actions. For example, the squashing animation and pantomime are both ambiguous in terms of what is squashed/compressed, which might require additional load to infer both the action and the induced effect. The increased activation of action-related information might in turn increase the chance for a match between neural activation patterns of animations and pantomimes.

      In any case, these additional results in SPL do not question the effects reported in the main text, that is, disproportionate sensitivity for action effect structures in right aIPL and LOTC and for body movements in SPL and other AON regions. The evidence for implied action effect structures representation in SPL is mixed and should be interpreted with caution.

      We added this analysis and discussion as supplementary information.

      Statistical arguments that rely on "but not" are not very strong, e.g. "We found higher cross-decoding for animation-pantomime vs. animation-PLD in right aIPL and bilateral LOTC (all t(23) > 3.09, all p < 0.0025; one-tailed), but not in left aIPL (t(23) = 0.73, p = 0.23, one-tailed)." Without a direct statistical test between regions, it's not really possible to support a claim that they have different response profiles.

      Absolutely correct. Notably, we did not make claims about different profiles of the tested ROIs with regard to implied action effect representations. But of course it makes sense to test for differential profiles of left vs. right aIPL, so we have added a repeated measures ANOVA to test for an interaction between TEST (animation-pantomime, animation-PLD) and ROI (left aIPL, right aIPL), which, however, was not significant (F(1,23)=3.66, p = 0.068). We included this analysis in the revised manuscript.
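For a 2 × 2 within-subject design such as TEST × ROI, the interaction F statistic equals the square of a one-sample t statistic computed on each participant's difference of differences. The following is a minimal Python sketch of that relation, not the authors' analysis code. The function name `interaction_f` and all accuracy values are hypothetical and serve only as illustration.

```python
import math
import statistics

def interaction_f(a1b1, a1b2, a2b1, a2b2):
    """Interaction F and degrees of freedom for a 2x2 repeated-measures design.

    Each argument holds one value per participant for one design cell
    (factor A level, factor B level).
    """
    # Per-participant interaction contrast: difference of differences.
    contrast = [(w - x) - (y - z) for w, x, y, z in zip(a1b1, a1b2, a2b1, a2b2)]
    n = len(contrast)
    sem = statistics.stdev(contrast) / math.sqrt(n)
    t = statistics.mean(contrast) / sem
    # With one numerator degree of freedom, F(1, n-1) = t(n-1)^2.
    return t * t, (1, n - 1)

# Hypothetical cross-decoding accuracies for four participants (not study data).
pant_right = [0.36, 0.33, 0.38, 0.37]  # animation-pantomime, right aIPL
pant_left = [0.30, 0.28, 0.35, 0.31]   # animation-pantomime, left aIPL
pld_right = [0.30, 0.29, 0.31, 0.30]   # animation-PLD, right aIPL
pld_left = [0.29, 0.27, 0.33, 0.30]    # animation-PLD, left aIPL

f_value, dof = interaction_f(pant_right, pant_left, pld_right, pld_left)
print(f"F{dof} = {f_value:.2f}")
```

The p-value would follow from the F distribution with the returned degrees of freedom; it is left out here to keep the sketch free of external dependencies.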

      Reviewer #2 (Recommendations for The Authors):

      (1) I haven't found any information about data and code availability in the paper: is the plan to release them upon publication? This should be made clear.

      Stimuli, MRI data, and code are deposited at the Open Science Framework (https://osf.io/am346/). We included this information in the revised manuscript.

      (2) Samples of videos of the stimuli (or even the full set) would be very informative for the reader to know exactly what participants were looking at.

      We have uploaded the full set of stimuli on OSF (https://osf.io/am346/).

      (3) Throughout the paper, decoding accuracies are averaged across decoding directions (A->B and B->A). To my knowledge, this approach was proposed in van den Hurk & Op de Beeck (2019), "Generalization asymmetry in multivariate cross-classification: When representation A generalizes better to representation B than B to A". I believe it would be fair to cite this paper.

      Absolutely, thank you very much for the hint. We included this reference in our revised manuscript.
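The averaging across decoding directions mentioned in this comment can be illustrated with a small Python sketch. It assumes a nearest-centroid classifier as a stand-in for whatever classifier the study used. The function names, labels, and toy patterns are all hypothetical.

```python
def centroids(patterns, labels):
    """Mean pattern per class label."""
    sums, counts = {}, {}
    for pattern, label in zip(patterns, labels):
        acc = sums.setdefault(label, [0.0] * len(pattern))
        for i, value in enumerate(pattern):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def accuracy(train_x, train_y, test_x, test_y):
    """Train a nearest-centroid classifier on one set and test it on another."""
    cents = centroids(train_x, train_y)

    def predict(pattern):
        return min(cents, key=lambda lab: sum((a - b) ** 2 for a, b in zip(pattern, cents[lab])))

    hits = sum(predict(p) == y for p, y in zip(test_x, test_y))
    return hits / len(test_y)

def cross_decoding(xa, ya, xb, yb):
    """Accuracies for A->B and B->A, plus their mean."""
    a_to_b = accuracy(xa, ya, xb, yb)
    b_to_a = accuracy(xb, yb, xa, ya)
    return a_to_b, b_to_a, (a_to_b + b_to_a) / 2

# Toy two-voxel patterns for two hypothetical stimulus types A and B.
xa = [[0, 0], [0, 1], [5, 5], [5, 6]]
ya = ["push", "push", "hit", "hit"]
xb = [[1, 0], [4, 4], [6, 5], [2, 2]]
yb = ["push", "hit", "hit", "hit"]

print(cross_decoding(xa, ya, xb, yb))
```

In this toy case the two directions give different accuracies, which is the kind of generalization asymmetry that the averaged value hides.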

      (4) Page 3, line 70: this is a very nitpicky point, but "This suggests that body movements and the effects they induce are at least partially processed independently from each other." is a bit of an inferential leap from "these are distinct aspects of real-world actions" to "then they should be processed independently in the brain". The fact that a distinction exists in the world is a prerequisite for this distinction existing in the brain in terms of functional specialization, but it's not in itself a reason to believe that functional specialization exists. It is a reason to hypothesize that the specialization might exist and to test that hypothesis. So I think this sentence should be rephrased as "This suggests that body movements and the effects they induce might be at least partially processed independently from each other.", or something to that effect.

      Your reasoning is absolutely correct. We revised the sentence following your suggestion.

      (5) Page 7, line 182: the text says "stronger decoding for action-animation vs. action-PLD" (main effect of TEST), which is the opposite of what can be seen in the figure. I assume this is a typo?

      Thanks for spotting this, it was indeed a typo. We corrected it: “…stronger decoding for action-PLD vs. action-animation cross-decoding..”

      (6) Page 7, Figure 3B: since the searchlight analysis is used to corroborate the distinction between aIPL and SPL, it would be useful to overlay the contours of these ROIs (and perhaps LOTC as well) on the brain maps.

      We found that overlaying the contours of the ROIs onto the decoding searchlight maps would make the figure too busy, and the contours would partially hide effects. However, we added a brain map with all ROIs in the supplementary information.

      (7) Page 9, Figure 4A: since the distinction between the significant difference between anim-pant and anim-PLD is quite relevant in the text, I believe highlighting the lack of difference between the two decoding schemes in left aIPL (for example, by writing "ns") in the figure would help guide the reader to see the relevant information. It is generally quite hard to notice the absence of something.

      We added “n.s.” to the left aIPL in Fig. 4A.

      (8) Page 11, line 300: "Left aIPL appears to be more sensitive to the type of interaction between entities, e.g. how a body part or an object exerts a force onto a target object": since the distinction between this and "the effect induced by that interaction" is quite nuanced, I believe a concrete example would clarify this for the reader: e.g. I guess the former would involve a representation of the contact between hand and object when an object is pushed, while the latter would represent only the object's displacement following the push?

      Thank you for the suggestion. We added a concrete example: “Left aIPL appears to be more sensitive to the type of interaction between entities, that is, how a body part or an object exerts a force onto a target object (e.g. how a hand makes contact with an object to push it), whereas right aIPL appears to be more sensitive to the effect induced by that interaction (the displacement of the object following the push).”

      (9) Page 12, line 376: "Informed consent, and consent to publish, was obtained from the participant in Figure 2." What does this refer to? Was the person shown in the figure both a participant in the study and an actor in the stimulus videos? Since this is in the section about participants in the experiment, it sounds like all participants also appeared in the videos, which I guess is not the case. This ambiguity should be clarified.

      Right, the statement sounds misleading in the “Participants” section. We rephrased it and moved it to the “Stimuli” section: “actions…were shown in 4 different formats: naturalistic actions, pantomimes, point light display (PLD) stick figures, and abstract animations (Fig. 2; informed consent, and consent to publish, was obtained from the actor shown in the figure).”

      (10) Page 15, line 492: Here, "within-session analyses" are mentioned. However, these analyses are not mentioned in the text (only shown in Figure S2) and their purpose is not clarified. I imagine they were a sanity check to ensure that the stimuli within each stimulus type could be reliably distinguished. This should be explained somewhere.

      We clarified the purpose of the within session decoding analyses in the methods section: "Within-session decoding analyses were performed as sanity checks to ensure that for all stimulus types, the 5 actions could be reliably decoded (Fig. S2)."

      (11) Page 20, Figure S1: I recommend using the same color ranges for the two decoding schemes (action-anim and action-PLD) in A and C, to make them more directly comparable.

      Ok, done.

      Reviewer #3 (Recommendations For The Authors):

      (1) When first looking at Figure 1B, I had a hard time discerning what action effect was being shown (I thought maybe it was "passing through"). Figure 2 later clarified it for me, but it would be helpful to note in the caption that it depicts breaking.

      Thank you for the suggestion. Done.

      (2) It would be helpful to show an image of the aIPL and SPL ROIs on a brain to help orient readers - both to help them examine the whole brain cross-decoding accuracy and to aid in comparisons with other studies.

      We added a brain map with all ROIs in the supplementary information.

      (3) Line 181: I'm wondering if there's an error, or if I'm reading it incorrectly. The line states "Moreover, we found ANOVA main effects of TEST (F(1,24)=33.08, p=7.4E-06), indicating stronger decoding for action-animation vs. action-PLD cross-decoding..." But generally, in Figure 3A, it looks like accuracy is lower for Action-Anim than Action-PLD in both hemispheres.

      You are absolutely right, thank you very much for spotting this error. We corrected the sentence: “…stronger decoding for action-PLD vs. action-animation cross-decoding..”

      (4) It might be useful to devote some more space in the Introduction to clarifying the idea of action-effect structures. E.g., as I read the manuscript I found myself wondering whether there is a difference between action effect structures and physical outcomes in general... would the same result be obtained if the physical outcomes occurred without a human actor involved? This question is raised in the discussion, but it may be helpful to set the stage up front.

      We clarified this point in the introduction:

      In our study, we define action effects as induced by intentional agents. However, the notion of action effect structures might be generalizable to physical outcomes or object changes as such (e.g. an object's change of location or configuration, independently of whether the change is induced by an agent or not).

      (5) Regarding my public comment #2, it would perhaps strengthen the argument to run the same analysis in the SPL ROIs. At least for the comparison of Anim-Pant with Anim-PLD, the prediction would be no difference, correct?

      The prediction would indeed be that there is no difference for the comparison of Anim-Pant with Anim-PLD, but also for the comparison of Action-Pant with Action-PLD and for Action-Pant with Pant-PLD, there should be no difference. As explained in our response to the public comment #2, we ran a whole brain conjunction (Fig. 4B) to test for the combination of these effects and did not find SPL in this analysis. However, we did find differences for Anim-Pant vs. Anim-PLD, which is not straightforward to interpret (see our response to your public comment #2 for a discussion of this finding).

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Amaral et al. present a study investigating the mesoscale modelling and dynamics of bolalipids.

      Strengths:

      The figures in this paper are exceptional. Both those to outline and introduce the lipid types, but also the quality and resolution of the plots. The data held within also appears to be outstanding and of significant (hopefully) general interest.

      We thank the reviewer for their kind words and the appreciation of our work.

      Weaknesses:

      In the introduction, I would like to have read more specifics on the biological role of bolalipids. Archaea are mentioned, but this kingdom is huge - there must be specific species that can be discussed where bolalipids are integral to archaeal life. The authors should go beyond ’extremophiles’. In short, they should unpack why the general audience should be interested in these lipids, within a subset of organisms that are often forgotten about.

      Following the reviewer’s advice we have revised the introduction of the manuscript, in which we now discuss specific species (Sulfolobus acidocaldarius and Thermococcus kodakarensis) and how in these species bolalipids are integral to archaeal life. We explain that the ratio between bilayer and bolalipids, and the number of cyclopentane rings contained within bolalipids can change to adapt to the environment. The revised parts of the introduction read (p.1 ):

      “Like for bacteria and eukaryotes, archaea must keep their lipid membranes in a fluid state (homeoviscous adaptation). This is important even under extreme environmental conditions, such as hot and cold temperatures, or high and low pH values [7]. Because of this, many archaea adapt to changes in their environment by tuning the lipid composition of their membranes: altering the ratio between bola- and bilayer lipids in their membranes [8, 9] and/or by changing the number of cyclopentane rings in their lipid tails, which are believed to make lipid molecules more rigid [5]. For example, Thermococcus kodakarensis increases its tetraether bolalipid ratio from around 50% to over 80% when the temperature of the environment increases from 60 to 85 °C [10]. Along the same lines, the cell membrane of Sulfolobus acidocaldarius can contain over 90% of bolalipids with up to 8 cyclopentane rings at 70 °C and pH 2.5 [5, 11]. It is worth mentioning that in exceptional cases bacteria also synthesise bolalipids in response to high temperatures [12], highlighting that the study of bolalipid membranes is relevant not only for archaeal biology but also from a general membrane biophysics perspective.”

      Reviewer #2 (Public review):

      Summary:

      The authors aimed to understand the biophysical properties of archaeal membranes made of bolalipids. Bacterial and eukaryotic membranes are made of lipids that self-assemble into bilayers. Archaea, instead, use bolalipids, lipids that have two headgroups and can span the entire bilayer. The authors wanted to determine if the unique characteristics of archaea, which are often extremophiles, are in part due to the fact that their membranes contain bolalipids.

      The authors develop a minimal computational model to compare the biophysics of bilayers made of lipids, bolalipids, and mixtures of the two. Their model enables them to determine essential parameters such as bilayer phase diagrams, mechanical moduli, and the bilayer behaviour upon cargo inclusion and remodelling.

      The authors demonstrate that bolalipid bilayers behave as binary mixtures, containing bolalipids organized either in a straight conformation, spanning the entire bilayer, or in a u-shaped one, confined to a single leaflet. This dynamic mixture allows bolalipid bilayers to be very sturdy but also provides remodelling. However, remodelling is energetically more expensive than with standard lipids. The authors speculate that this might be why lipids were more abundant in the evolutionary process.

      Strengths:

      This is a wonderful paper, a very fine piece of scholarship. It is interesting from the point of view of biology, biophysics, and material science. The authors mastered the modelling and analysis of these complex systems. The evidence for their findings is really strong and complete. The paper is written superbly, the language is precise and the reading experience is very pleasant. The plots are very well-thought-out.

      Weaknesses:

      I would not talk about weaknesses, because this is really a nice paper. If I really had to find one, I would have liked to see some clear predictions of the model expressed in such a way that experimentalists could design validation experiments.

      We thank the reviewer for their very kind assessment. We incorporated their recommendations regarding experimental validation in the discussion section, as follows (p.14):

      “Our model makes a number of predictions that could be tested by experiment either in cells or in vitro. First, it predicts that a small increase in the fraction of archaeal bilayer lipids should be sufficient to soften a bolalipid-rich membrane. While this could be tested in the future, so far only very few studies have yet reported experimental analysis of archaeal membrane mixtures [18, 50]. Second, we observed that membranes with moderate bolalipid molecular rigidity k<sub>bola</sub> exhibit curvature-dependent bending rigidity. To experimentally verify this, one could extrude membrane tethers from cells while controlling for membrane tension. Finally, to get to the core mechanism underlying our findings, it will be important to develop experimental methods that will allow the fraction of U-shaped bolalipid conformers per leaflet to be imaged and measured.”

      Reviewer #3 (Public review):

      Summary:

      The authors have studied the mechanics of bolalipid and archaeal mixed-lipid membranes via comprehensive molecular dynamics simulations. The Cooke-Deserno 3-bead-per-lipid model is extended to bolalipids with 6 beads. Phase diagrams, bending rigidity, mechanical stability of curved membranes, and cargo uptake are studied. Effects such as the formation of U-shaped bolalipids, pore formation in highly curved regions, and changes in membrane rigidity are studied and discussed. The main aim has been to show how the mixture of bolalipids and regular bilayer lipids in archaeal membrane models enhances the fluidity and stability of these membranes.

      Strengths:

      The authors have presented a wide range of simulation results for different membrane conditions and conformations. For the most part, the analyses and their results are presented clearly and concisely. Figures, supplementary information, and movies very well present what has been studied. The manuscript is well-written and is easy to follow.

      We thank the reviewer for the detailed assessment of our work and their constructive feedback.

      Major issues

      R3.Q1: The Cooke-Deserno model, while very powerful for biophysical analysis of membranes at the mesoscale, is very much void of chemical information. It is parametrized such that it is good in producing fluid membranes and predicting values for bending rigidity, compressibility, and even thermal expansion coefficient falling in the accepted range of values for bilayer membranes. But it still represents a generic membrane. Now, the authors have suggested a similar model for the archaeal bolalipids, which have chemically different lipids (the presence of cyclopentane rings for one), and there is no good justification for using the same pairwise interactions between their representative beads in the coarse-grained model. This does not necessarily diminish the worth of all the authors’ analyses. What is at risk here is the confusion between “what we observe this model of bolalipid- or mixed-membranes do” and “how real bolalipid-containing archaeal membranes behave at these mechanical and thermal conditions.”

      As the reviewer correctly notes, Cooke and Deserno used a minimal model, devoid of chemical detail, to represent fluid lipid membranes composed of bilayer lipids. Indeed archaeal lipids are chemically different compared to non-archaeal lipids, but just like non-archaeal lipids, they can be very different from one another. Given the chemical diversity of bolalipids between each other, instead of representing their complexity in a complicated model with many experimentally unconstrained parameters, we here defined a minimal model for bolalipids. The power of this minimal model is to represent the key physical/geometrical characteristics of archaeal membranes, namely the fact that lipid heads on two sides of the membrane are often connected, that bolalipids can exhibit a conformational change, and that bolalipids mix with some percentage of bilayer molecules. We then ask a general question: how do these unique geometrical characteristics of archaeal membranes influence their mechanics and reshaping? The reviewer is however right in pointing out that a model, regardless of its level of details (atomistic, coarse-grained, minimal), is still a model.

      Our approach of extending an established coarse-grained model for bilayer lipids to bolalipids is further supported by experimental observations, which report that archaeal bilayer lipids can form membranes of comparable bending rigidity to those of non-archaeal bilayer membranes [53]. Hence, different lipid linkages (archaeal vs. non-archaeal) give rise to fluid, deformable membranes of not too dissimilar rigidities, suggesting that both archaeal and non-archaeal bilayer lipids can be represented by a similar minimal coarse-grained model for the purpose of mesoscopic biophysical investigations. Since archaeal bolalipids have the same core chemical structure as two archaeal bilayer lipids joined by their tail ends, similarly we model a bolalipid by joining two bilayer lipids. Such an approach also efficiently enables us to compare bolalipid with bilayer membranes, and connect to the large body of knowledge on the physics of bilayer membranes.

      To conclude, our coarse-grained model is indeed intended to capture the main physical properties of bolalipid membranes, and not their chemical diversity.

      R3.Q2: Another more specific, major issue has to do with using the Hamm-Kozlov model for fitting the power spectrum of thermal undulations. The 1/q<sup>2</sup> term can very well be attributed to membrane tension. While a barostat is indeed used, have the authors made absolutely sure that the deviation from 1/q<sup>4</sup> behaviour does not correspond to lateral tension?

      To the casual observer, any 1/q<sup>2</sup> trend might point at membrane tension. However, the precise functional form is relevant as it determines whether the 1/q<sup>2</sup> dominates the 1/q<sup>4</sup> trend for small or large values of the wave number q in the fitted power spectrum.

      The first model (including lipid tilt) exhibits the functional form 1/(kq<sup>4</sup>) + 1/(k<sub>θ</sub>q<sup>2</sup>). In contrast, the second model (including membrane tension) exhibits the functional form 1/(kq<sup>4</sup> + Σq<sup>2</sup>). Importantly, the two models obey different functional forms. Here k and k<sub>θ</sub> are the bending and tilt moduli, which are assumed positive, and Σ is the membrane tension, which can be either positive or negative. For the first model (with tilt), while for small q the amplitude is proportional to q<sup>-4</sup>, for large q the amplitude is proportional to q<sup>-2</sup>. In contrast, for the second model (with positive tension) while for small q the amplitude is proportional to q<sup>-2</sup>, for large q the amplitude is proportional to q<sup>-4</sup>. If membrane tension were to be negative in the second model, the slope would cross from negative infinity for small q to -4 for large q. The functional dependencies are summarized in Author response image 1A.

      For rigid bolalipid membranes, it is clearly visible that the slope of the power spectrum plotted against the wave number q decreases with increasing q (Author response image 1B). While the slope initially assumes a value close to 4, it gradually approaches 2 for larger values of q. We conclude that only the model including lipid tilt can fit the power spectrum of membrane fluctuations appropriately (solid-dashed line), whereas the model with tension fails to fit the data (dashed line). We note that the combined model containing both lipid tilt and membrane tension does not give a better fit (dotted line).
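The opposite crossover behaviour of the two functional forms can be checked numerically. The Python sketch below evaluates the local log-log slope of each form at a small and a large wave number. Prefactors are omitted, and the parameter values and function names are illustrative assumptions, not values from the simulations.

```python
import math

def tilt_spectrum(q, kappa, kappa_theta):
    """Tilt model: 1/(kappa q^4) + 1/(kappa_theta q^2), prefactors omitted."""
    return 1.0 / (kappa * q**4) + 1.0 / (kappa_theta * q**2)

def tension_spectrum(q, kappa, sigma):
    """Tension model: 1/(kappa q^4 + sigma q^2), prefactors omitted."""
    return 1.0 / (kappa * q**4 + sigma * q**2)

def loglog_slope(f, q, rel_step=1e-4):
    """Local slope d ln f / d ln q, estimated by a central difference."""
    lo, hi = q * (1 - rel_step), q * (1 + rel_step)
    return (math.log(f(hi)) - math.log(f(lo))) / (math.log(hi) - math.log(lo))

# Illustrative parameter values only (arbitrary units, positive tension).
kappa, kappa_theta, sigma = 10.0, 10.0, 10.0

for q in (0.01, 100.0):
    s_tilt = loglog_slope(lambda x: tilt_spectrum(x, kappa, kappa_theta), q)
    s_tens = loglog_slope(lambda x: tension_spectrum(x, kappa, sigma), q)
    print(f"q={q:g}: tilt slope {s_tilt:.2f}, tension slope {s_tens:.2f}")
```

With these values the tilt form goes from a slope near -4 at small q to near -2 at large q, while the positive-tension form does the reverse. This is why the direction of the crossover in the measured spectrum distinguishes the two models.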

      To demonstrate that the tension model cannot fit the data, we included the best fits for both models for rigid bolalipid membranes in the new SI section 16 (p. S22) and show that only the tilt model leads to acceptable fits. We also measured the projected membrane tension, computed from P<sub>x</sub> and P<sub>y</sub>, the pressures in the x and y directions, and L<sub>z</sub>, the dimension of the simulation box along the z axis. We found the projected membrane tension to be negligible, similar to the value that we indirectly measured by fitting a combined model with both tension and tilt, further confirming our conjecture.

      Author response image 1.

      (A) Schematic showing the decay of the power spectrum as a function of the wave number q in the tilt model (top), in the tension model with positive membrane tension (middle), and in the tension model with negative membrane tension (bottom). (B) Fitted power spectrum as a function of q for rigid bolalipid membranes (k<sub>bola</sub>=5k<sub>B</sub>T). The fit shows that while the model with tension (dashed line) cannot fit the data, the model with tilt nicely fits the spectrum (solid-dashed line). The combined model including both tension and tilt does not fit the spectrum any better (dotted line).

      R3.Q3: I got more worried when I noticed in the SI that the simulations had been done with combined ”fix langevin” and ”fix nph” LAMMPS commands. This combination does not result in a proper isothermal-isobaric ensemble. The importance of tilt terms for bolalipids is indeed very interesting, but I believe more care is needed to establish that.

      In what follows, we show that there is no reason to worry. First of all we want to clarify that the physical setup we simulate is that of a membrane contained in a heat bath under negligible tension with correct diffusional dynamics. To achieve this physical setup, we use a Langevin thermostat combined with pressure control via an overdamped barostat, which we implement in LAMMPS by combining ”fix langevin” and ”fix nph”.

      In more detail: we simulated particles in an implicit solvent, for which we use a Langevin thermostat to get the right diffusional dynamics. To apply the theory of fitting fluctuation spectrums the simulation box length needs to be (near) constant. However, simulating membranes at a fixed box size results in an average non-zero membrane tension, making it hard to measure bending rigidity. The reason is that the effect of membrane tension is most influential on the largest wavelength modes, which are also most decisive when determining mechanical membrane properties like membrane rigidity. To minimize the effect of tension, we perform our simulation with an overdamped barostat (𝜏<sub>baro</sub> = 10𝜏<sub>langevin</sub>), which keeps the membrane near tensionless, as also done before [32]. In the revised manuscript, we have clarified the statement on the physical ensemble used (p.S2):

      “For simulating flat membrane patches of bolalipids, we combined the previously used Langevin thermostat with relaxation time of 1𝜏 with a Nosé–Hoover barostat with relaxation time of 10𝜏. In LAMMPS this amounts to combining the commands ’fix langevin’ with ’fix nph’. We configured the barostat to set lateral pressure P<sub>xy</sub> to zero by re-scaling the simulation box in the x-y plane. We compare this setup to a fixed box length setup, and an NPT ensemble setup, in SI section 17.”

      To connect our results with statistical mechanics ensemble theory we tested alternative setups. Similar setups, including the formal isothermal-isobaric ensemble, where N,P,T are kept constant using Nose-Hoover style equations for thermostating and barostating with modern corrections [34], which the reviewer refers to, result in very similar fluctuation spectrums. Consequently, our measurements of bending and tilt modulus hold true regardless of the integration scheme. However, such a setup does not correctly capture implicit solvent and diffusional dynamics.

      In even more detail: we tested our setup (implemented via ”fix langevin”+”fix nph”) versus an isothermal-isobaric ensemble (implemented via ”fix npt”). We measured volume mean and standard deviation, and found them matching for a reference LJ gas.

      To be completely sure, and to please the reviewer, we have performed additional verifications in the new SI section 17, which we summarize in the following. We simulated three representative membranes with different integration schemes: ”fix npt”, ”fix langevin”+”fix nph”, and ”fix langevin” (Langevin dynamics with projected area fixed at the average value obtained from a ”langevin+nph”). We checked that the ”fix nph” barostat is merely equilibrating the membrane to a tensionless configuration, after which the projected membrane area (A<sub>p</sub> = L<sub>x</sub>L<sub>y</sub>) is practically constant. Consequently, the different schemes resulted in minor changes in the longest wavelength modes that we tracked down to small changes in the negligible tension. The resulting measurements of bending modulus change by less than 10%, and our main text conclusions do not change. Author response image 2 compares the fluctuation spectrums for the different integration schemes.

      Author response image 2.

      Height fluctuation spectrum, for a bilayer membrane at T<sub>eff</sub> =1.1, simulated with Langevin dynamics (pink, ‘langevin‘), our setup (purple, ‘nph+langevin‘), and under an isothermal-isobaric ensemble (blue, ‘npt‘); fits are shown as dotted lines.

      R3.Q4: This issue is reinforced when considering Figure 3B. These results suggest that increasing the fraction of regular lipids increases the tilt modulus, with the maximum value achieved for a normal Cooke-Deserno bilayer void of bolalipids. But this is contradictory. For these bilayers, we don’t need the tilt modulus in the first place.

      We understand the concern why this might be counter-intuitive, and we thank the reviewer for pointing it out. We first want to stress that the tilt modulus can also be measured for bilayer membranes even if it is not needed to fit the fluctuation spectrum. If we measure the tilt modulus for a bilayer membrane, we obtain a value similar to the previously measured one [36]. Importantly, here we also report measurements for the tilt modulus for bolalipid membranes.

      To understand the seemingly contradictory behaviour of the tilt modulus, it is insightful to rewrite the expression for the fluctuation spectrum as done in Eq. (1):

      where l<sub>𝜃</sub> is a characteristic length scale related to tilt, which we call the tilt persistence length. From the last equation it is easy to see that the tilt modulus 𝜅<sub>𝜃</sub> becomes relevant for the fluctuation spectrum if the tilt persistence length l<sub>𝜃</sub> is not negligible. In other words, this means that we have to consider the tilt modulus 𝜅<sub>𝜃</sub> as relevant, if it is sufficiently small compared to the bending rigidity 𝜅.
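The rewriting referred to here can be sketched as follows, assuming the tilt-model form quoted earlier in this response, 1/(𝜅q<sup>4</sup>) + 1/(𝜅<sub>𝜃</sub>q<sup>2</sup>), and omitting the thermal and area prefactors, which are not stated in this text. The identification of l<sub>𝜃</sub> with the square root of 𝜅/𝜅<sub>𝜃</sub> is an assumption. It is consistent with the statement that tilt matters when 𝜅<sub>𝜃</sub> is small compared to 𝜅, but the manuscript's Eq. (1) may differ in normalization.

```latex
\begin{align}
\langle |h_q|^2 \rangle
  &\propto \frac{1}{\kappa q^4} + \frac{1}{\kappa_\theta q^2}
   = \frac{1}{\kappa q^4}\left(1 + \frac{\kappa}{\kappa_\theta}\, q^2\right)
   = \frac{1}{\kappa q^4}\left(1 + l_\theta^2 q^2\right),
  \qquad l_\theta = \sqrt{\frac{\kappa}{\kappa_\theta}} .
\end{align}
```

In this form the correction to the pure bending spectrum is controlled by the dimensionless product l<sub>𝜃</sub>q, so tilt only becomes visible for modes whose wavelength is comparable to or shorter than l<sub>𝜃</sub>.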

      However, this is not only counter-intuitive, but also difficult to communicate graphically. Per the excellent reviewer’s suggestion, to make the interpretation more accessible, we converted in the main text and its figures the tilt modulus to the more directly interpretable tilt persistence length l<sub>𝜃</sub>, as this is small when tilt is irrelevant (for bilayer lipids and flexible bolalipids) and large otherwise (for rigid bolalipids). This includes changes to the main text on p.6 and p.8, and to the insets in Figs. 2C and 3B. We note that for completeness we also report the tilt modulus 𝜅<sub>𝜃</sub> in the SI.

      R3.Q5: Also, from the SI, I gathered that the authors have neglected the longest wavelength mode because it is not equilibrated. If this is indeed the case, it is a dangerous thing to do, because with a small membrane patch, this mode can very well change the general trend of the power spectrum. As a lot of other analyses in the manuscript rely on these measurements, I believe more elaboration is in order.

      We thank the reviewer for the careful examination of our supplementary material. For each fluctuation spectrum measurement, we ran multiple replicas. We observed that the largest wavelength modes were not fully equilibrated. In the simulations the first mode of the fluctuation spectrum is probed at different amplitudes and phases. We thus expected the potential systematic error would show up clearly when comparing spectrums of the different replicas. As we saw no correlation in these systematic offsets between replicas, we concluded that the simulations are sufficiently equilibrated and we could safely exclude the first mode of the fluctuation spectrum from our analysis.

      To show without doubt that this procedure does not randomly bias our results, we also ran simulations for three representative membranes until all modes were equilibrated. On the modes previously equilibrated, the resulting spectrums agree with our previous shorter simulations. On the largest wavelength modes that were previously not fully equilibrated, we noticed a small deviation from theory, specifically for flexible membranes (small bending modulus). These small deviations can be explained by including a negligible negative tension. Importantly, however, the resulting bending modulus 𝜅 stays nearly the same. We note that the small negative tension disappears when we halve the timestep (see Author response image 3). This verification is shown in SI section 17.

      R3.Q6: The authors have found that ”there is a strong dependency of the bending rigidity on the membrane mean curvature of stiffer bolalipids.” The effect is negative, with the membrane becoming less stiff at higher mean curvatures. Why is that? I would assume that with more flexible bolalipids, the possibility of reorganization into U-shaped chains should affect the bending rigidity more (as Figure 2E suggests). While for a stiff bolalipid, not much would change if you increase the mean curvature. This should be either a tilt effect, or have to do with asymmetry between the leaflets. But on the other hand, the tilt modulus is shown to decrease with increasing bolalipid rigidity. The authors get back to this issue only on page 10, when they consider U-shaped lipids in the inner and outer leaflets and write, ”this suggested that an additional membrane-curving mechanism must be involved.” But then again, in the Discussion, the authors write, ”It is striking that membranes made from stiffer bolalipids showed a curvature-dependent bending modulus, which is a clear signature that bolalipid membranes exhibit plastic behaviour during membrane reshaping,” adding to the confusion.

      Author response image 3.

      Height fluctuation spectrum for a bilayer membrane at T<sub>eff</sub> = 1.1, as simulated in the main text (grey, for 60×10<sup>3</sup>τ), for a longer duration (1.44×10<sup>6</sup>τ, pink), and for the longer duration with a halved timestep of 0.005τ (purple); fits are shown as dotted lines (tension and tilt) or dash-dot lines (tilt only).

      We thank the reviewer for asking this important question. Membrane bending rigidity in bolalipid membranes decreases dramatically once a small fraction of U-shapes is allowed to form, but then plateaus once this U-shape fraction reaches 20%. In a curved bolalipid membrane, U-shapes must accumulate in the outer leaflet to accommodate the area difference. Together, the non-linear dependence of the bending rigidity on the U-shape fraction and the promotion of U-shapes by curvature explain why, in a membrane made of moderately stiff bolalipids (k<sub>bola</sub> = 1k<sub>B</sub>T), which contains very few U-shapes in the flat state, the bending rigidity of the membrane decreases as curvature increases. By contrast, in a membrane made of flexible bolalipid molecules (k<sub>bola</sub> = 0), where many U-shapes are present in the flat membrane, the bending rigidity does not change with curvature.

      Bending rigidity 𝜅 in flat membranes composed of bolalipids decreases dramatically once a small fraction of U-shapes is allowed to form, but plateaus once more than 20% of U-shaped bolalipids are present. In detail, our data show that with increasing bolalipid molecular rigidity k<sub>bola</sub>, the number of U-shaped bolalipids decreases (Fig. 2B) and the membrane rigidity 𝜅 increases (Fig. 2C). Thus, the correlation suggests that U-shaped bolalipids soften the membrane in a non-linear way, where most of the change in membrane bending rigidity happens for a U-shaped bolalipid fraction < 20% (Figure S11).

      Separately, membrane curvature affects the area difference between curved membrane leaflets and thus drives U-shape accumulation. To be specific, a cylindrical membrane with area A, mean curvature H and thickness h has an outer leaflet with area A(1 + Hh) and an inner leaflet with smaller area A(1 − Hh). This difference can be large, in our simulations up to an area change of Hh = 25%. For pure bolalipid membranes, straight bolalipids occupy the same space in each leaflet. An area difference can then be achieved only by having a different amount of U-shaped bolalipids in each leaflet, which can result in a different U-shape fraction between leaflets and thus ’asymmetry between leaflets’. Figure S10 confirms a U-shape head fraction asymmetry that increases with curvature, for both flexible (k<sub>bola</sub> = 0) and moderately stiff bolalipids (k<sub>bola</sub> = 1k<sub>B</sub>T).
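
      The leaflet-area relation above can be illustrated with a few lines of Python. This is only a sketch of the first-order expressions A(1 + Hh) and A(1 − Hh) quoted in the text; the function name and the numerical values are illustrative assumptions, not taken from the simulation code.

```python
def leaflet_areas(area, mean_curvature, thickness):
    """Return (outer, inner) leaflet areas of a cylindrical membrane.

    Uses the first-order relations quoted in the text:
    outer = A * (1 + H*h), inner = A * (1 - H*h).
    Units are arbitrary but must be consistent (H in 1/length, h in length).
    """
    hh = mean_curvature * thickness  # dimensionless product H*h
    outer = area * (1.0 + hh)
    inner = area * (1.0 - hh)
    return outer, inner


# Illustrative values only: H*h = 0.25, the largest area change mentioned in the text.
outer, inner = leaflet_areas(area=100.0, mean_curvature=0.05, thickness=5.0)
print(outer, inner)
```

      With Hh = 0.25 the outer leaflet must cover a quarter more area than the midplane and the inner leaflet a quarter less, which is the mismatch that U-shaped bolalipids are proposed to absorb.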

      Together, these two effects result in membrane softening under curvature for the moderately stiff bolalipids, but constant rigidity for flexible bolalipids (Fig. 2F). In detail: for membranes composed of moderately stiff bolalipid molecules (k<sub>bola</sub> = 1k<sub>B</sub>T), the U-shape bolalipid head fraction only increases in the outer leaflet, going from 10 to 20% (Figure S10). This is in the high-sensitivity region where the bending rigidity is expected to change the most (Figure S11). We hypothesize that the molecular rigidity of a U-shaped bolalipid creates compression on the outer leaflet that stabilizes the membrane curvature and thus causes membrane softening. We suspect that for membranes composed of rigid bolalipids (k<sub>bola</sub> > 1k<sub>B</sub>T), the effect is likely not present due to the absence of U-shape formation even under strong bending.

      By contrast, for membranes composed of flexible bolalipids (k<sub>bola</sub> = 0), the U-shaped bolalipid head fraction changes relatively little from its value for flat membranes (from 50% to 60% and 40% for the outer and inner leaflet, respectively; Figure S10). This is in the region where the membrane bending rigidity is expected to respond weakly to U-shape fraction (Figure S11). Additionally, the change is symmetric, so presumably the outer leaflet becomes softer as the inner leaflet becomes stiffer, thus creating opposing effects and only weakly affecting the membrane bending rigidity as a whole. We note that the distinction between the U-shape head fraction that we plot (Figure S10) and U-shape fraction (Figure S11) matters little for this analysis.

      We have added this deduction and its plots to SI section 8, and revised the corresponding statement in the main text accordingly (p.7 ).

      “Changing membrane curvature alters the area differently in the two membrane leaflets. To adapt to the area difference, we thus expect the fraction of U-shaped bolalipids to change as the membrane curvature changes. Moreover, the results of Fig. 2B and Fig. 2C showed that the U-shaped bolalipid fraction and the membrane bending rigidity are correlated. As a result, we predict that the fraction of straight versus U-shaped bolalipids in a membrane will change in response to membrane bending, in a way that makes the bending rigidity of a bolalipid membrane curvature dependent.”

      R3.Q7: This issue is repeated when the authors study nanoparticle uptake. They write: ”to reconcile these seemingly conflicting observations we reason that the bending rigidity, similar to Figure 2F, is not constant but softens upon increasing membrane curvature, due to dynamic change in the ratio between bolalipids in straight and U-shaped conformation. Hence, bolalipid membranes show stroking plastic behaviour as they soften during reshaping.” But the softening effect that they refer to, as shown in Figure 4B, occurs for very stiff bolalipids, for which not much switching to U-shaped conformation should occur.

      We thank the reviewer for locating a particularly dense sentence. We changed the text to explicitly refer to the range k<sub>bola</sub> ∈ [0,2] k<sub>B</sub>T for which there is significant change in U-shape fraction (p.8):

      “To reconcile these seemingly conflicting observations we reason that the bending rigidity κ, similar to Fig. 2F, is not constant but softens in the range k<sub>bola</sub> ∈ [0,2] k<sub>B</sub>T, upon increasing membrane curvature. This is due to the dynamic change in the ratio between bolalipids in straight and U-shaped conformation.”

      As for Fig. 4B, for k<sub>bola</sub> > 2k<sub>B</sub>T pores form, which explains the plateau in adsorption energy.

      R3.Q8: Another major issue is with what the authors refer to as the ”effective temperature”. While plotting phase diagrams for kT/eps value is absolutely valid, I’m not a fan of calling this effective temperature. It is a dimensionless quantity that scales linearly with temperature, but is not a temperature. It is usually called a ”reduced temperature”. Then the authors refer to their findings as studying the stability of archaeal membranes at high temperatures. I have to disagree because eps is not the only potential parameter in the simulations (there are at least space exclusion and angle-bending stiffnesses) so one cannot identify changing eps with changing the global simulation temperature. This only works when you have one potential parameter, like an LJ gas.

      We indeed thought about this before and found that it makes little difference in our set-up. To thoroughly show that the distinction matters very little, per reviewer’s question, we computed our phase diagrams by scaling temperature T explicitly (and not lipid tail interactions T<sub>eff</sub> = k<sub>B</sub>T /ϵ<sub>p</sub>). We added these results to the SI section 14 and found no significant difference when comparing scaling tail interactions (Figure S15A) with scaling temperature explicitly (Figure S15B).

      We also computed Fig. 2A-C for scaling interactions (Figure S17A) and scaling temperature explicitly (Figure S17B). We found a slightly increased U-shaped bolalipid fraction for low k<sub>bola</sub> when comparing scaling interactions (Figure S17A) with temperature scaling (Figure S17B). The reason is that the U-shaped fraction depends on temperature, as at higher temperature bolalipids can more easily transition into the U-shape. Most importantly, however, we found no qualitative changes in the liquid region or the mechanical membrane properties when we compared the different scaling variants.

      The reason why both scaling variants match so well can be understood easily. All pair potentials, including volume exclusion interactions between head beads and other membrane beads, were also scaled in the same manner as tail-to-tail interactions, as described in the SI. In contrast, the energy scales for maintaining the lipid bonds, the bilayer lipid angles and the bolalipid angles are relatively large compared to the energy scales involved in tail-to-tail interactions. This separation of energy scales guarantees that there will be little effect when increasing global temperature. Regarding nomenclature, we take the reviewer’s advice and have added ’reduced temperature’ as an alias for T<sub>eff</sub> in the main text.

      In the revised version of the manuscript, we mention these observations in the SI section 14 and point towards these results in the main text (p.4 ):

      “This interaction strength governs the membrane phase behaviour and can be interpreted as the effective temperature or reduced temperature T<sub>eff</sub> = k<sub>B</sub>T /ϵ<sub>p</sub>. As the distinction between scaling interactions (T<sub>eff</sub>) or temperature (T) is not important for our analysis (see Supplemental Information (SI) section 14), for simplicity we refer to T<sub>eff</sub> as temperature in the following.”

      Minor issues

      R3.Q9: As the authors have noted, the fact that the membrane curvature can change the ratio of U-shaped to straight bolalipids would render the curvature elasticity non-linear (though the term ”plastic” should not be used, as this is still structurally reversible when the stress is removed. Technically, it is hypoelastic behaviour, possibly with hysteresis.) With this in mind, when the authors use essentially linear elastic models for fluctuation analysis, they should make a comparison of maximum curvatures occurring in simulations with a range that causes significant changes in bolalipid conformational ratios.

      We thank the reviewer for their suggestion on calling the non-linear behaviour of the curvature elasticity hypoelastic. We have edited the main text accordingly (p.8 ):

      “In an elastic material, the strain modulus holds constant and deformation is reversible. For bolalipid membranes at k<sub>bola</sub> = 1k<sub>B</sub>T, however, the bending modulus decreases when deformation increases, rendering bolalipid membranes hypoelastic.”

      Moreover, regarding the maximum curvatures occurring in the fluctuation simulations: We first note that the ensemble average of the mean curvature H from the fluctuation measurements is indicated as a vertical line in Fig. 2F. As the average value is nearly zero, the membrane can be considered as flat in good approximation. To investigate the question in more detail, we extended the SI with a careful analysis of the validity of the maximum membrane curvature and the validity of the Monge gauge approximation (SI section 15).

      In short, we found that the involved membrane curvatures are small and therefore are unlikely to trigger any significant changes of the bending modulus. Moreover, since we are dealing with two bolalipid conformations, we also tested the homogeneity of the membrane. In our simulations of flat membrane patches we did not observe clustering or phase separation between the two bolalipid conformations beyond the [2,3]σ range. Furthermore, we get good agreement between our fluctuation measurement and the cylinder simulations in Fig. 2F. We now mention this verification in the revised version of the manuscript (p.8 ):

      “Fortunately, this dependency on curvature does not invalidate our fluctuation results, where the curvature is small enough that its effect on the bending modulus is negligible (SI section 15).”

      Last but not least, simulating bending/unbending cycles of an arc-shaped membrane (frozen endpoints) shows agreement with cylinder membrane simulations, and no hysteresis at the rates of deformation employed (cf. M. Amaral’s thesis [54], soon to be out of the embargo period).

      R3.Q10: The Introduction section of the manuscript is written with a biochemical approach, with very minor attention to the simulation works on this system. Some molecular dynamics works are only cited as existing previous work, without mentioning what has already been studied in archaeal membranes. While some information, like the binding of ESCRT proteins to archaeal membranes, though interesting, helps little to place the study within the discipline. The Introduction should be revised to show what has already been studied with simulations (as the authors mention in the Discussion) and how the presented research complements it.

      The present research for the first time covers archaeal membranes with a single coarse-grained model capable of assuming both bolalipid in-membrane conformations and sweeps through temperature, membrane composition, and molecular rigidity. The work shows the first curvature dependent bending modulus for pure bolalipid membranes. It also investigates systematically bending modulus and Gaussian modulus, and tests the model in an all-encompassing budding simulation that incorporates topology changes. Existing atomistic or coarse-grained MD simulations (MARTINI or similar force fields) are limited to small patches of membrane, with no study of large-scale deformations or topology changes; plus, they rely on force fields that were parametrized for bilayer membranes.

      To give a comprehensive overview of the field, we revised the introduction section of the manuscript, in which we now discuss previous computational work investigating membrane diffusivity, U-shaped lipid fraction, and bending rigidity (p.3 ):

      “By contrast, only a few studies have investigated bolalipid membranes applying computational or theoretical tools [24, 25]. Specifically, the pore closure time in bolalipid membranes, and the role of cyclopentane rings for membrane properties has been investigated using all-atom simulations, showing decreased lateral mobility, reduced permeability to water, and increased lipid packing [26–28]. Moreover, using coarse-grained simulations, it was suggested that bolalipid membranes are thicker [29], exhibit a gel-to-liquid phase transition at higher temperature [30], and exhibit a reduced diffusivity [31]. However, little research has been devoted to investigating mechanics and reshaping of bolalipid membranes at the mesoscale despite the obvious importance of this question from evolutionary, biophysics, and biotechnological perspectives and although different membrane physics is expected to manifest.”

      Following the reviewer’s advice and to keep the introduction concise and focused on bolalipid membranes, we have removed the paragraph on ESCRT-III proteins in the revised manuscript.

      R3.Q11: The authors have been a bit loose with using the term ”stability”. I’d like to see the distinction in each case, as in ”chemical/thermal/mechanical/conformational stability”.

      We have clarified when applicable the type of stability throughout the manuscript. In all other instances, if not clear from context, we mean simply that the membrane persists being a membrane. At our coarse-grained level, this means the membrane does not disassemble into a gas phase.

      R3.Q12: In the original Cooke-Deserno model, a so-called ”poorman’s angle-bending term” is used, which is essentially a bond-stretching term between the first and third particle. However, I notice the authors using the full harmonic angle-bending potential. This should be mentioned.

      This is made clear in the SI (Eq. (S3)). Cooke and Deserno mention the harmonic angle potential as a valid alternative in their original publication. We now also added this detail to the main text (p.3 ):

      “The angle formed by the chain of three beads is kept near 180° via an angular potential with strength k<sub>0</sub>, instead of the approximation by a bond between end beads of the original model [32].”

      R3.Q13: The analysis of energy of U-shaped lipids with the linear model E = c<sub>0</sub> + c<sub>1</sub>k<sub>bola</sub> is indeed very interesting. I am curious, can this also be corroborated with mean energy measurements? The minor issue is calling the source of the favorability of U-shaped lipids ”entropic”, while clearly an energetic contribution is found. The two conformations, for example, might differ in the interactions with the neighbouring lipids.

      We were also curious and thank the reviewer for the suggestion of mean energy measurements. We concluded that there must be either an entropic contribution to the free energy or an intermolecular interaction energy favouring U-shaped bolalipids. We have now included these measurements in SI section 6 (p.S5 ):

      “By splitting the average potential energy between an internal contribution (bonds, angles and pair interactions between particles in the same molecule) and an external contribution (pair interactions between a molecule and its neighbours), we determined the transition energy from straight to U-shaped bolalipids in detail. We found that this transition lowers the internal potential energy of the bolalipid while increasing its interaction energy. In total, we obtained an energy barrier for the transition of ΔE<sub>s→u</sub> = 0.79±0.01k<sub>B</sub>T. Since the fit indicates, however, that the U-shaped bolalipid conformation is preferred over the straight conformation, we conclude that there must be either an entropic contribution to the free energy or an intermolecular interaction energy favouring U-shaped bolalipids.”

      We refer to these measurements in the main text (p.6 ):

      “For the fit it appears that c<sub>0</sub> < 0, which implies that bolalipids in U-shape conformation are slightly favoured over straight bolalipids at k<sub>bola</sub> = 0 (explored in SI section 6).”
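
      To make the sign argument concrete, the sketch below evaluates a two-state Boltzmann estimate of the U-shape fraction from the linear model E = c<sub>0</sub> + c<sub>1</sub>k<sub>bola</sub>. The two-state form, the function name, and the coefficient values are assumptions chosen for illustration; they are not the fitted values or necessarily the exact fitting function used in the SI.

```python
import math


def u_shape_fraction(k_bola, c0, c1, kT=1.0):
    """Two-state Boltzmann estimate of the U-shaped fraction (illustrative assumption).

    E = c0 + c1 * k_bola is taken as the effective energy of the U-shape
    relative to the straight conformation, in the same units as kT.
    """
    energy = c0 + c1 * k_bola
    return 1.0 / (1.0 + math.exp(energy / kT))


# Hypothetical coefficients, chosen only to show the trend.
for k in (0.0, 1.0, 2.0, 5.0):
    print(k, round(u_shape_fraction(k, c0=-0.1, c1=1.0), 3))
```

      In this picture a negative c<sub>0</sub> places the U-shape fraction slightly above one half at k<sub>bola</sub> = 0, and the fraction falls monotonically as the molecular rigidity increases.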

      R3.Q14: The authors write in the Discussion, ”In any case, our results indicate that membrane remodelling, such as membrane fission during membrane traffic, is much more difficult in bolalipid membranes [34].” Firstly, I’m not sure if studying the dependence of budding behaviour on adhesion energy with nanoparticles is enough to make claims about membrane fission. Secondly, why is the 2015 paper by Markus Deserno cited here?

      We thank the reviewer for giving us the opportunity to clarify. We make an energetic argument on membrane fission based on the observed difference in the ratio |κ̄|/𝜅 of the Gaussian modulus κ̄ to the bending modulus 𝜅.

      Splitting a spherical membrane vesicle into two spherical vesicles (fission) increases the bending energy by 8𝜋𝜅 and decreases the energy related to the Gaussian bending modulus by 4𝜋|κ̄|, since the Gaussian modulus κ̄ is negative. The second part of the argument is given, for example, in the review by Markus Deserno (p.23, right column), which is why we cite the paper here. Together, this gives an energy barrier required for membrane fission in the considered geometry of ∆E<sub>fission</sub> = 8𝜋𝜅 − 4𝜋|κ̄|. We found that |κ̄|/𝜅 is around 0.5 for bolalipid membranes and around 1 for bilayer membranes. Since 𝜅 was typically larger in bolalipid membranes, we expect the energy barrier for fission ∆E<sub>fission</sub> to be larger for bolalipid membranes. We therefore predict that membrane remodelling, such as membrane fission during membrane trafficking, is harder in bolalipid membranes. We explain our reasoning in the discussion of the revised manuscript (p.13):

      “Membrane remodelling, such as the fission of one spherical vesicle into two, increases the bending energy by 8πκ but decreases the energy related to the Gaussian modulus by 4π|κ̄| [39], giving rise to a fission energy barrier of ∆E<sub>fission</sub> = 8πκ − 4π|κ̄|. Our results indicated that while in bolalipid membranes 𝜅 is larger, |κ̄|/𝜅 is smaller compared to bilayer membranes. Our results thus predict a larger energy barrier for membrane fission ∆E<sub>fission</sub> in bolalipid membranes compared to bilayer membranes.”
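
      The energetic argument can be checked numerically with a short sketch. It assumes the standard Helfrich result that a closed spherical vesicle carries a curvature energy of 8πκ + 4πκ̄ with a negative Gaussian modulus κ̄, so that one additional sphere costs 8πκ − 4π|κ̄|. The function name and the κ values are illustrative assumptions; only the ratios of about 0.5 and 1 come from the text.

```python
import math


def fission_barrier(kappa, gaussian_ratio):
    """Energy cost of splitting one spherical vesicle into two.

    kappa          : bending modulus (in units of k_B T)
    gaussian_ratio : |kappa_bar| / kappa, with kappa_bar < 0 assumed
    Returns 8*pi*kappa - 4*pi*|kappa_bar| = 4*pi*kappa*(2 - ratio).
    """
    return 4.0 * math.pi * kappa * (2.0 - gaussian_ratio)


# Hypothetical moduli: equal kappa isolates the effect of the ratio alone.
bilayer = fission_barrier(kappa=10.0, gaussian_ratio=1.0)
bolalipid = fission_barrier(kappa=10.0, gaussian_ratio=0.5)
print(bilayer, bolalipid)
```

      Even at equal bending modulus, the smaller ratio alone raises the barrier from 4πκ to 6πκ; a larger κ in bolalipid membranes increases the difference further.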

      R3.Q15: In the SI, where the measurement of the diffusion coefficient is discussed, the expression for D is missing the power 2 of displacement.

      We thank the reviewer for spotting this oversight. We corrected it in the revised version of the SI (p.S5 ).
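
      For readers who want to see where the squared displacement enters, the sketch below estimates a lateral diffusion coefficient from the mean squared displacement using the standard Einstein relation in two dimensions, D = ⟨|Δr|²⟩ / (4t). That the SI uses exactly this form is an assumption; the function name and the sample displacements are hypothetical.

```python
def diffusion_coefficient(displacements, elapsed_time, dim=2):
    """Estimate D from per-lipid displacements after a given elapsed time.

    displacements : iterable of (dx, dy) in-plane displacement vectors
    elapsed_time  : time over which the displacements accumulated
    Uses the Einstein relation MSD = 2 * dim * D * t; note the squared displacement.
    """
    displacements = list(displacements)
    msd = sum(dx * dx + dy * dy for dx, dy in displacements) / len(displacements)
    return msd / (2 * dim * elapsed_time)


# Hypothetical displacements (arbitrary length units) after one time unit.
print(diffusion_coefficient([(2.0, 0.0), (0.0, 2.0)], elapsed_time=1.0))
```

      Omitting the square would give a quantity with the wrong dimensions (length per time instead of length squared per time), which is the oversight the reviewer pointed out.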

      R3.Q16: Where cargo uptake is discussed, the term ”adsorption energy” is used. I think the more appropriate term would be ”adhesion energy”.

      For the sake of simplicity, we changed the term to adhesion energy (caption of Fig. 4, and p.10). We do not have a strong opinion on this, but we believe that adsorption energy would be equally correct as we describe the adsorption of many lipid head beads to a nanoparticle.

      R3.Q17: Typos:

      Page 1, paragraph 2: Adaption → Adaptation. Page 10, paragraph 1: Stroking → Striking.

      We thank the reviewer for spotting these typos which we have corrected in the revised version of the manuscript.

      Recommendations for the authors

      Reviewer #1 (Recommendations for the authors):

      A few thoughts (likely out of the scope of this paper but possibly to consider upon revision):

      R1.Q1: Do bolalipids always have the same headgroup? I don’t recall reading this in the introduction/discussion. R1 and R2 are in Figure 1, but I don’t know whether there are standard types. Could this be expanded upon? Is the model able to take these differences into account?

      We thank the reviewer for raising this important question. Similar to bacteria and eukaryotes, in archaea there is a huge variety in terms of the different head groups that lipids can contain and thus also lipid variety. Most archaeal lipids have head groups that contain either phosphate groups or sugar residues. Typically, archaeal bolalipids are asymmetric and contain a phosphatidyl and a sugar moiety at the two ends of the lipid molecule. Within the membrane the lipid is oriented such that the phosphatidyl moiety points towards the interior of the cell whereas the sugar moiety points towards the outside of the cell as it occupies more space [5].

      In our computational model, however, we consider symmetric bolalipids for the sake of simplicity and to decouple the role of ”connected geometry” from other effects. In principle, we could investigate the effect of lipid asymmetry by increasing the size of one of the lipid head beads. However, this investigation exceeds the scope of the present study and therefore requires future work.

      In the revised version of the manuscript, we now clarify that bolalipids can have different headgroups (p.1 and the caption of Fig. 1):

      “The hydrophilic heads can be composed of different functional groups with phosphatidyl and sugar being the most relevant moieties. For bolalipids the two head groups at either end of the molecule are typically distinct (Fig. 1A right) [5].”

      “The hydrophilic head of a bolalipid can be composed of different functional groups represented by R1 and R2 (right).”

      We also explicitly state that we neglect lipid head group asymmetry for the sake of simplicity (p.4 ):

      “To decouple the effect of the connected geometry of the bolalipids from that of lipid asymmetry, we assume both head beads of a bolalipid to share the same properties.”

      R1.Q2: Is it possible to compare the mesoscale models to either Coarse-grained or even all-atom lipid models? Have simulations previously been performed for bolalipids at those levels of description?

      A few studies have previously investigated bolalipid membranes in simulations, using either all-atom or coarse-grained models. However, none of these studies investigated how bolalipids respond to membrane deformations. Therefore, it is currently not possible to directly compare our results to studies in the literature. Testing our predictions experimentally, however, is certainly something that could and should be done in the future. In reply to this reviewer and reviewer 3, we discuss the current state of modelling bolalipid membranes in simulations in the revised version of the manuscript (p.3):

      “By contrast, only a few studies have investigated bolalipid membranes applying computational or theoretical tools [24, 25]. Specifically, the pore closure time in bolalipid membranes, and the role of cyclopentane rings for membrane properties has been investigated using all-atom simulations, showing decreased lateral mobility, reduced permeability to water, and increased lipid packing [26–28]. Moreover, using coarse-grained simulations, it was suggested that bolalipid membranes are thicker [29], exhibit a gel-to-liquid phase transition at higher temperature [30], and exhibit a reduced diffusivity [31]. However, little research has been devoted to investigating mechanics and reshaping of bolalipid membranes at the mesoscale despite the obvious importance of this question from evolutionary, biophysics, and biotechnological perspectives and although different membrane physics is expected to manifest.”

      We want to mention, however, that we do compare membrane diffusivity, U-shaped lipid fraction, and bending rigidity to the behaviour and values that have been previously measured in simulations in the discussion section. In general, we find good agreement between our results and previously reported behaviour/values (p.13 ):

      “While flexible bolalipid membranes are liquid under the same conditions as bilayer membranes, we found that stiff bolalipids form membranes that operate in the liquid regime at higher temperatures. These results agree well with previous molecular dynamics simulations that suggested that bolalipid membranes are more ordered and have a reduced diffusivity compared to bilayer membranes [24, 29]. In our simulations, this is due to the fact that completely flexible bolalipids molecules adopt both straight (transmembrane) as well as the U-shaped (loop) conformation with approximately the same frequency. In contrast, stiff bolalipids typically only take on the straight conformation when assembled in a membrane. These results agree with the previous coarse-grained molecular dynamics simulations using the MARTINI force field which showed that the ratio of straight to U-shaped bolalipids increased upon stiffening the linker between the lipid tails [29].

      [...]

      When we determined the bending rigidity of bolalipid membranes by measuring their response to thermal fluctuations, we found that membranes made from flexible bolalipids are only slightly more rigid than bilayer membranes. This result is consistent with previous atomistic simulations, which showed that the membrane rigidity was similar for membranes composed of bilayer lipids and flexible synthetic bolalipids [45].”

      R1.Q3: How would membrane proteins alter the behaviour of bolalipids? Either those integral to the membrane or those binding peripherally?

      The reviewer asks an important question. However, the question is difficult to answer due to its scope and the gaps in the current literature. Important examples of integral or peripheral membrane proteins that alter the behaviour of bolalipids and archaeal bolalipid membranes are involved in cell homeostasis, cell division, membrane trafficking, and lipid synthesis.

      The cells of many archaeal species are enclosed in a paracrystalline protein layer called the S-layer, which is attached to the lipid membrane [4, 55]. The main function of the S-layer is to keep the cell’s shape and to protect it against osmotic stress. Due to the embedding of the S-layer in the membrane at specific locations, it is to be expected that the membrane properties are influenced by the S-layer. Furthermore, archaea execute cell division by locally reshaping the membrane using FtsZ and ESCRT-III proteins [56]. While Asgard archaeal genomes encode proteins with homology to those regulating aspects of eukaryotic membrane remodelling and trafficking [57], they have yet to be observed undergoing a process like endocytosis [58]. In addition, it has been speculated that the proteins that drive the synthesis of two diether lipids into a tetraether lipid are either membrane associated or integral membrane proteins [59].

      However, to the best of our knowledge it is not known how membrane proteins specifically alter the behaviour of bolalipids. Future work will need to be executed to answer this question. Following the advice of reviewer 3 and to keep the introduction concise and focused on bolalipid membranes, we do not mention these observations in the revised manuscript.

      R1.Q4: Is there a mechanism in cells to convert or switch bolalipids from a straight to a u-shaped description? Does this happen spontaneously or are there enzymes responsible for this?

      We thank the reviewer for bringing up this important point. Despite the relevance of the question, little is currently known about the mechanisms that make bolalipids transition between a straight and a U-shaped configuration, mainly because there is to date no established experimental method.

      Besides our own results, most of what we know comes from coarse-grained molecular dynamics simulations, which showed that bolalipids can spontaneously transition between the straight and U-shaped configuration [29]. In addition, comparative genomic analysis has predicted that many archaeal species contain flippases, i.e., membrane proteins that are able, upon the consumption of energy, to transfer (flip-flop) bilayer lipids between the two membrane leaflets [43]. Moreover, it has been shown that Halobacterium salinarum (an archaeon with a bilayer lipid membrane) [44] contains scramblases, which are membrane proteins that passively transfer bilayer lipids from one membrane leaflet to the other. It is therefore tempting to speculate that similar proteins might exist for bolalipids which could facilitate the straight to U-shaped transition.

      In addition, it has been reported that vesicles composed of bolalipid membranes can undergo fusion with enveloped influenza viruses [17]. In this context, it has been suggested that the influenza fusion protein hemagglutinin may locally induce U-shaped bolalipids to facilitate membrane fusion. However, these hints are far from proof of a mechanism that can drive the straight to U-shaped bolalipid transition, and further work needs to be done to investigate this question in detail.

      In the revised version of the manuscript, we now discuss what is known about potential mechanisms to facilitate the straight to U-shaped transition in the discussion section (p. 13):

      “While previous coarse-grained simulations predicted that bolalipids spontaneously transition between the straight and U-shaped conformations [29], how this happens in archaeal membranes and whether membrane proteins are involved in this conformational transition needs to be clarified in the future. Experimental studies suggest that archaeal membranes contain flippases and scramblases for the transitioning of bilayer lipids between membrane leaflets [43, 44], raising the possibility that similar proteins could also facilitate conformational transitions in bolalipids. In addition, it has been suggested that the viral fusion protein hemagglutinin could cause a transition from straight to U-shaped bolalipid conformation during the fusion of bolalipid vesicles with influenza viruses [17]. However, future investigation is required.”

      R1.Q5: Ideally, coordinates and any parameter files required to run the molecular simulations should be included for reproducibility.

      We share the reviewer’s concern for reproducibility. The data availability section of the original submission already includes a link to a code repository (available at: https://doi.org/10.5281/zenodo.13934991 [51]) that allows initializing and simulating flat membrane patches, with user control of the parameters explored in this paper (𝜔, T<sub>eff</sub>, k<sub>bola</sub>, f<sup>bi</sup>).

      Reviewer #2 (Recommendations for the authors):

      This is a great paper and I congratulate the authors for writing such a fine piece of scholarship. The only nitty-gritty feedback that I have is summarized in the following three points:

      R2.Q1: In the introduction the authors talk about archaea adapting their membrane to retain membrane fluidity. However, homeoviscous adaptation is also fundamental in bacteria and eukaryotes.

      The reviewer is correct: like those of archaea, the membranes of bacteria and eukaryotes must balance flexibility and stability. Moreover, the cell membranes in all three domains of life need to maintain membrane fluidity and provide mobility to the embedded lipids and membrane proteins (homeoviscous adaptation). The general idea is that these organisms change the ratio of different lipids to change membrane properties and thereby optimally adapt to their environments [10]. Importantly, however, there are differences in how homeoviscous adaptation is maintained across the different domains of life. In reply to this reviewer and reviewer 3, we now discuss the underlying mechanisms in the revised parts of the introduction (p. 1):

      “Like for bacteria and eukaryotes, archaea must keep their lipid membranes in a fluid state (homeoviscous adaptation). This is important even under extreme environmental conditions, such as hot and cold temperatures, or high and low pH values [7]. Because of this, many archaea adapt to changes in their environment by tuning the lipid composition of their membranes: altering the ratio between bola- and bilayer lipids in their membranes [8, 9] and/or by changing the number of cyclopentane rings in their lipid tails, which are believed to make lipid molecules more rigid [5]. For example, Thermococcus kodakarensis increases its tetraether bolalipid ratio from around 50% to over 80% when the temperature of the environment increases from 60 to 85 °C [10]. Along the same lines, the cell membrane of Sulfolobus acidocaldarius can contain over 90% of bolalipids with up to 8 cyclopentane rings at 70 °C and pH 2.5 [5, 11]. It is worth mentioning that in exceptional cases bacteria also synthesise bolalipids in response to high temperatures [12], highlighting that the study of bolalipid membranes is relevant not only for archaeal biology but also from a general membrane biophysics perspective.”

      R2.Q2: Uncertainties in Gaussian rigidity modulus estimates are not properly reported.

      The large uncertainties in the Gaussian rigidity modulus were due to the way it was calculated. In short, the Gaussian rigidity modulus is determined in cap folding simulations [41] (SI section 9) from the measured values of the dimensionless parameter 𝜉, which is related to the folding probability, the bending modulus 𝜅, the membrane line tension, and the cap radius R. In our case, the main source of uncertainty in the Gaussian rigidity modulus is the uncertainty in the measured bending rigidity 𝜅. Previously, to obtain 𝜅, we fitted the fluctuation spectrum of each simulation seed separately and only then averaged the fitted values. In the revised version of the manuscript, we first pool the fluctuation spectra of the different simulation seeds and then fit all spectra at the same time. This new approach results in smaller uncertainties for the bending rigidity 𝜅 and, consequently, for the Gaussian rigidity modulus.
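
      The difference between the two fitting strategies described above can be illustrated with a minimal sketch. It assumes the simplest tensionless Helfrich form of the height spectrum, ⟨|h_q|²⟩ = k<sub>B</sub>T / (A 𝜅 q⁴), which may differ from the model actually fitted in the manuscript (see its SI). The function names, the area A, and the synthetic data are hypothetical and chosen only for illustration.

```python
import math
import random
from statistics import mean, stdev


def fit_kappa(q_values, spectrum, area, kT=1.0):
    """Estimate kappa assuming <|h_q|^2> = kT / (area * kappa * q^4).

    With the q^-4 slope fixed, least squares in log space reduces to
    averaging log(S * q^4) over all supplied modes.
    """
    logs = [math.log(s * q**4) for q, s in zip(q_values, spectrum)]
    return kT / (area * math.exp(mean(logs)))


def kappa_per_seed_then_average(q_values, spectra, area, kT=1.0):
    """Old strategy: fit each seed separately, then average the fits."""
    kappas = [fit_kappa(q_values, s, area, kT) for s in spectra]
    sem = stdev(kappas) / math.sqrt(len(kappas))
    return mean(kappas), sem


def kappa_pooled(q_values, spectra, area, kT=1.0):
    """New strategy: pool the spectra of all seeds, then fit once."""
    q_all = list(q_values) * len(spectra)
    s_all = [value for s in spectra for value in s]
    return fit_kappa(q_all, s_all, area, kT)


if __name__ == "__main__":
    # Hypothetical synthetic spectra with multiplicative noise.
    random.seed(0)
    q = [0.1 * k for k in range(1, 11)]
    area, kappa_true = 400.0, 12.0
    spectra = [
        [math.exp(random.gauss(0.0, 0.3)) / (area * kappa_true * qi**4) for qi in q]
        for _ in range(5)
    ]
    print("per-seed mean and SEM:", kappa_per_seed_then_average(q, spectra, area))
    print("pooled estimate:", kappa_pooled(q, spectra, area))
```

      In this simplified model the pooled fit uses all modes of all seeds in a single estimate, so no single noisy seed dominates the result. An uncertainty for the pooled value would have to come from the fit itself or from resampling, which this sketch does not attempt.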

      As a consistency check, in addition to the simulations that we previously performed at T<sub>eff</sub> = 1.3, we have repeated the cap folding and line tension simulations at T<sub>eff</sub> = 1.2, resulting in similar values for the Gaussian rigidity modulus. In the revised version of the manuscript, we report the newly calculated values and uncertainties at T<sub>eff</sub> = 1.2 in the main text (p. 8):

      “At T<sub>eff</sub> = 1.2, we obtained a Gaussian rigidity modulus of 4.30 ± 0.22 k<sub>B</sub>T and thus a ratio of the Gaussian rigidity to the bending modulus of 0.89 ± 0.04 for bilayer membranes, similar to what has been reported previously [41]. For flexible bolalipid membranes, we got a slightly smaller value of 5.04 ± 0.37 k<sub>B</sub>T. Due to the larger bending modulus, however, flexible bolalipid membranes show a significantly smaller ratio of 0.64 ± 0.04 (k<sub>bola</sub> = 0). At larger temperature (T<sub>eff</sub> = 1.3), the ratio can be even smaller, 0.45 ± 0.07 (see SI section 9).”

      In addition, we report the values at T<sub>eff</sub> = 1.3 and T<sub>eff</sub> = 1.2 in the SI (p. S15, Table S4).

      We have also adapted the discussion of the Gaussian bending modulus accordingly (p. 13):

      “Another marked difference between bilayer and flexible bolalipid membranes is the ratio of the Gaussian rigidity to the bending modulus. Instead of being around 1 as for bilayer membranes [41], it is around 1/2 and therefore only half of that of bilayer lipids.”

      Reviewer #3 (Recommendations for the authors):

      While I think the bulk of the work presented is useful, some of the issues that I raised in my review are indeed major. Without properly addressing them, it is hard to accept the conclusions of the manuscript. I hope the authors can address them by revising their analysis.

      We thank the reviewer for their constructive feedback, which helped us to improve the manuscript. We have addressed all points raised by the reviewer in our detailed point-by-point response to the reviewer (see above). We hope the reviewer will now find it easier to accept our conclusions.

      (1) R. Phillips, J. Kondev, J. Theriot, and H. Garcia, Physical biology of the cell (Garland Science, New York, 2012).

      (2) H. T. McMahon and J. L. Gallop, Membrane curvature and mechanisms of dynamic cell membrane remodelling, Nature 438, 590 (2005).

      (3) S. B. Gould, Membranes and evolution, Curr. Biol. 28, R381 (2018).

      (4) S.-V. Albers and B. H. Meyer, The archaeal cell envelope, Nat. Rev. Microbiol. 9, 414 (2011).

      (5) P. M. Oger and A. Cario, Adaptation of the membrane in Archaea, Biophys. Chem. 183, 42 (2013).

      (6) K. Rastädter, D. J. Wurm, O. Spadiut, and J. Quehenberger, The Cell Membrane of Sulfolobus spp.—Homeoviscous Adaption and Biotechnological Applications, International Journal of Molecular Sciences 21, 3935 (2020).

      (7) P. L.-G. Chong, Archaebacterial bipolar tetraether lipids: Physico-chemical and membrane properties, Chem. Phys. Lipids 163, 253 (2010).

      (8) M. Tourte, P. Schaeffer, V. Grossi, and P. M. Oger, Functionalized Membrane Domains: An Ancestral Feature of Archaea?, Front. Microbiol. 11, 526 (2020).

      (9) Y. H. Kim, G. Leriche, K. Diraviyam, T. Koyanagi, K. Gao, D. Onofrei, J. Patterson, A. Guha, N. Gianneschi, G. P. Holland, M. K. Gilson, M. Mayer, D. Sept, and J. Yang, Entropic effects enable life at extreme temperatures, Sci. Adv. 5, eaaw4783 (2019).

      (10) M. F. Siliakus, J. van der Oost, and S. W. M. Kengen, Adaptations of archaeal and bacterial membranes to variations in temperature, pH and pressure, Extremophiles 21, 651 (2017).

      (11) D. W. Grogan, Phenotypic characterization of the archaebacterial genus sulfolobus: comparison of five wild-type strains, J. Bacteriol. 171, 6710 (1989).

      (12) D. X. Sahonero-Canavesi, M. F. Siliakus, A. Abdala Asbun, M. Koenen, F. von Meijenfeldt, S. Boeren, N. J. Bale, J. C. Engelman, K. Fiege, L. Strack van Schijndel, J. S. Sinninghe Damsté, and L. Villanueva, Disentangling the lipid divide: Identification of key enzymes for the biosynthesis of membrane-spanning and ether lipids in Bacteria, Sci. Adv. 8, eabq8652 (2022).

      (13) M. van Wolferen, A. A. Pulschen, B. Baum, S. Gribaldo, and S.-V. Albers, The cell biology of archaea, Nat. Microbiol. 10.1038/s41564-022-01215-8 (2022).

      (14) U. Bakowsky, U. Rothe, E. Antonopoulos, T. Martini, L. Henkel, and H.-J. Freisleben, Monomolecular organization of the main tetraether lipid from Thermoplasma acidophilum at the water–air interface, Chem. Phys. Lipids 105, 31 (2000).

      (15) C. Jeworrek, F. Evers, M. Erlkamp, S. Grobelny, M. Tolan, P. L.-G. Chong, and R. Winter, Structure and Phase Behavior of Archaeal Lipid Monolayers, Langmuir 27, 13113 (2011).

      (16) D. P. Brownholland, G. S. Longo, A. V. Struts, M. J. Justice, I. Szleifer, H. I. Petrache, M. F. Brown, and D. H. Thompson, Phase Separation in Binary Mixtures of Bipolar and Monopolar Lipid Dispersions Revealed by 2H NMR Spectroscopy, Small Angle X-Ray Scattering, and Molecular Theory, Biophysical Journal 97, 2700 (2009).

      (17) A. Bhattacharya, I. D. Falk, F. R. Moss, T. M. Weiss, K. N. Tran, N. Z. Burns, and S. G. Boxer, Structure–function relationships in pure archaeal bipolar tetraether lipids, Chem. Sci. 15, 14273 (2024).

      (18) V. Vitkova, D. Mitkova, V. Yordanova, P. Pohl, U. Bakowsky, G. Staneva, and O. Batishchev, Elasticity and phase behaviour of biomimetic membrane systems containing tetraether archaeal lipids, Colloids Surf. A Physicochem. Eng. Asp. 601, 124974 (2020).

      (19) E. Chang, Unusual thermal stability of liposomes made from bipolar tetraether lipids, Biochem. Biophys. Res. Commun. 202, 673 (1994).

      (20) O. V. Batishchev, A. S. Alekseeva, D. S. Tretiakova, T. R. Galimzyanov, A. Y. Chernyadyev, N. R. Onishchenko, P. E. Volynsky, and I. A. Boldyrev, Cyclopentane rings in hydrophobic chains of a phospholipid enhance the bilayer stability to electric breakdown, Soft Matter 16, 3216 (2020).

      (21) U. Seifert, Configurations of fluid membranes and vesicles, Adv. Phys. 46, 13 (1997).

      (22) H. Noguchi, Membrane Simulation Models from Nanometer to Micrometer Scale, J. Phys. Soc. Jpn. 78, 041007 (2009).

      (23) F. Frey and T. Idema, More than just a barrier: using physical models to couple membrane shape to cell function, Soft Matter 17, 3533 (2021).

      (24) C. Huguet, S. Fietz, A. Rosell-Melé, X. Daura, and L. Costenaro, Molecular dynamics simulation study of the effect of glycerol dialkyl glycerol tetraether hydroxylation on membrane thermostability, Biochimica et Biophysica Acta (BBA) - Biomembranes 1859, 966 (2017).

      (25) T. R. Galimzyanov, P. I. Kuzmin, P. Pohl, and S. A. Akimov, Elastic deformations of bolalipid membranes, Soft Matter 12, 2357 (2016).

      (26) T. R. Galimzyanov, P. E. Volynsky, and O. V. Batishchev, Continuum elasticity and molecular dynamics of a pore in archaeal bolalipid membranes, Soft Matter 21, 687 (2025).

      (27) A. O. Chugunov, P. E. Volynsky, N. A. Krylov, I. A. Boldyrev, and R. G. Efremov, Liquid but Durable: Molecular Dynamics Simulations Explain the Unique Properties of Archaeal-Like Membranes, Sci. Rep. 4, 7462 (2015).

      (28) L. F. Pineda De Castro, M. Dopson, and R. Friedman, Biological Membranes in Extreme Conditions: Simulations of Anionic Archaeal, PLoS One 11, e0155287 (2016).

      (29) M. Bulacu, X. Périole, and S. J. Marrink, In Silico Design of Robust Bolalipid Membranes, Biomacromolecules 13, 196 (2012).

      (30) C. H. Davis, H. Nie, and N. V. Dokholyan, Insights into thermophilic archaebacterial membrane stability from simplified models of lipid membranes, Phys. Rev. E 75, 051921 (2007).

      (31) S. Dey and J. Saha, Minimal Coarse-Grained Modeling toward Implicit Solvent Simulation of Generic Bolaamphiphiles, J. Phys. Chem. B 124, 2938 (2020).

      (32) I. R. Cooke and M. Deserno, Solvent-free model for self-assembling fluid bilayer membranes: Stabilization of the fluid phase based on broad attractive tail potentials, J. Chem. Phys. 123, 224710 (2005).

      (33) P. L.-G. Chong, U. Ayesa, V. Prakash Daswani, and E. C. Hur, On Physical Properties of Tetraether Lipid Membranes: Effects of Cyclopentane Rings, Archaea 2012, 1 (2012).

      (34) A. P. Thompson, H. M. Aktulga, R. Berger, D. S. Bolintineanu, W. M. Brown, P. S. Crozier, P. J. in ’t Veld, A. Kohlmeyer, S. G. Moore, T. D. Nguyen, R. Shan, M. J. Stevens, J. Tranchida, C. Trott, and S. J. Plimpton, LAMMPS - a flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales, Comput. Phys. Commun. 271, 108171 (2022).

      (35) A. Stukowski, Visualization and analysis of atomistic simulation data with ovito–the open visualization tool, Modelling and Simulation in Materials Science and Engineering 18, 015012 (2009).

      (36) E. R. May, A. Narang, and D. I. Kopelevich, Role of molecular tilt in thermal fluctuations of lipid membranes, Physical Review E 76, 021913 (2007).

      (37) W. Helfrich, Elastic Properties of Lipid Bilayers: Theory and Possible Experiments, Z. Naturforsch. C 28, 693 (1973).

      (38) M. Hamm and M. Kozlov, Elastic energy of tilt and bending of fluid membranes, Eur. Phys. J. E 3, 323 (2000).

      (39) M. Deserno, Fluid lipid membranes: From differential geometry to curvature stresses, Chemistry and Physics of Lipids 185, 11 (2015).

      (40) V. A. Harmandaris and M. Deserno, A novel method for measuring the bending rigidity of model lipid membranes by simulating tethers, The Journal of Chemical Physics 125, 204905 (2006).

      (41) M. Hu, J. J. Briguglio, and M. Deserno, Determining the Gaussian Curvature Modulus of Lipid Membranes in Simulations, Biophys. J. 102, 1403 (2012).

      (42) M. Deserno, Elastic deformation of a fluid membrane upon colloid binding, Phys. Rev. E 69, 031903 (2004), arXiv: cond-mat/0303656.

      (43) K. S. Makarova, M. Y. Galperin, and E. V. Koonin, Comparative genomic analysis of evolutionarily conserved but functionally uncharacterized membrane proteins in archaea: Prediction of novel components of secretion, membrane remodeling and glycosylation systems, Biochimie 118, 302 (2015).

      (44) A. Verchère, W.-L. Ou, B. Ploier, T. Morizumi, M. A. Goren, P. Bütikofer, O. P. Ernst, G. Khelashvili, and A. K. Menon, Light-independent phospholipid scramblase activity of bacteriorhodopsin from Halobacterium salinarum, Sci. Rep. 7, 9522 (2017).

      (45) T. B. H. Schroeder, G. Leriche, T. Koyanagi, M. A. Johnson, K. N. Haengel, O. M. Eggenberger, C. L. Wang, Y. H. Kim, K. Diraviyam, D. Sept, J. Yang, and M. Mayer, Effects of lipid tethering in extremophile-inspired membranes on H(+)/OH(-) flux at room temperature, Biophys. J. 110, 2430 (2016).

      (46) R. Xu, A. Dehghan, A.-C. Shi, and J. Zhou, Elastic property of membranes self-assembled from diblock and triblock copolymers, Chem. Phys. Lipids 221, 83 (2019).

      (47) Z. Dogic and S. Fraden, Ordered phases of filamentous viruses, Curr. Opin. Colloid Interface Sci. 11, 47 (2006).

      (48) E. Barry and Z. Dogic, Entropy driven self-assembly of nonamphiphilic colloidal membranes, Proc. Natl. Acad. Sci. U.S.A. 107, 10348 (2010).

      (49) A. J. Balchunas, R. A. Cabanas, M. J. Zakhary, T. Gibaud, S. Fraden, P. Sharma, M. F. Hagan, and Z. Dogic, Equation of state of colloidal membranes, Soft Matter 15, 6791 (2019).

      (50) M. Saracco, P. Schaeffer, M. Tourte, S.-V. Albers, Y. Louis, J. Peters, B. Demé, S. Fontanay, and P. M. Oger, Bilayer-Forming Lipids Enhance Archaeal Monolayer Membrane Stability, Int. J. Mol. Sci. 26, 3045 (2025).

      (51) M. Amaral, archaeal_membranes: code and examples (2024), available at https://doi.org/10.5281/zenodo.13934991.

      (52) M. F. Ergüder and M. Deserno, Identifying systematic errors in a power spectral analysis of simulated lipid membranes, The Journal of Chemical Physics 154, 214103 (2021).

      (53) J. Genova, N. Ulrih, V. Kralj-Iglič, A. Iglič, and I. Bivas, Bending Elasticity Modulus of Giant Vesicles Composed of Aeropyrum Pernix K1 Archaeal Lipid, Life 5, 1101 (2015).

      (54) M. Amaral, Archaeal Membranes: In Silico Modelling and Design, Ph.D. thesis, Institute of Science and Technology Austria (2024).

      (55) M. Pohlschroder, F. Pfeiffer, S. Schulze, and M. F. A. Halim, Archaeal cell surface biogenesis, FEMS Microbiol. Rev. 42, 694 (2018).

      (56) K. S. Makarova, N. Yutin, S. D. Bell, and E. V. Koonin, Evolution of diverse cell division and vesicle formation systems in Archaea, Nat. Rev. Microbiol. 8, 731 (2010).

      (57) C. W. Stairs and T. J. Ettema, The Archaeal Roots of the Eukaryotic Dynamic Actin Cytoskeleton, Curr. Biol. 30, R521 (2020).

      (58) B. Baum and D. A. Baum, The merger that made us, BMC Biol. 18, 72 (2020).

      (59) Z. Zeng, H. Chen, H. Yang, Y. Chen, W. Yang, X. Feng, H. Pei, and P. V. Welander, Identification of a protein responsible for the synthesis of archaeal membrane-spanning GDGT lipids, Nat. Commun. 13, 1545 (2022).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review):

      Aging is associated with a number of physiologic changes including perturbed circadian rhythms. However, mechanisms by which rhythms are altered remain unknown. Here authors tested the hypothesis that age-dependent factors in the sera affect the core clock or outputs of the core clock in cultured fibroblasts. They find that both sera from young and old donors are equally potent at driving robust ~24h oscillations in gene expression, and report the surprising finding that the cyclic transcriptome after stimulation by young or old sera differs markedly. In particular, genes involved in the cell cycle and transcription/translation remain rhythmic in both conditions, while genes associated with oxidative phosphorylation and Alzheimer's Disease lose rhythmicity in the aged condition. Also, the expression of cycling genes associated with cholesterol biosynthesis increases in the cells entrained with old serum. Together, the findings suggest that age-dependent blood-borne factors, yet to be identified, affect circadian rhythms in the periphery. The most interesting aspect of the paper is that the data suggest that the same system (BJ-5TA), may significantly change its rhythmic transcriptome depending on how the cells are synchronized. While there is a succinct discussion point on this, it should be expanded and described whether there are parallels with previous works, as well as what would be possible mechanisms for such an effect.

      We have expanded the discussion in the manuscript to cover possible mechanisms and to relate the genes and pathways implicated in our study to other aging literature.

      Major points: 

      Fig 1 and Table S1. Serum composition and levels of relevant blood-borne factors probably change in function of time. At what time of the day were the serum samples from the old and young groups collected? This important information should be provided in the text and added to Table S1. 

      We made sure to highlight the collection time in the abstract of the manuscript: “We collected blood from apparently healthy young (age 25-30) and old (age 70-76) individuals at 14:00<sup>1</sup> and used the serum to synchronize cultured fibroblasts.” The time of blood draw is also stated in other sections of the paper (Introduction and Methods). Since Table S1 contains demographic information, we did not think that the blood draw time fit best there, but hopefully it is now clear in the text.

      Fig 2A. Luminescence traces: the manuscript would greatly benefit from inclusion of raw luminescence traces.

      Raw luminescence traces have been added to Figure S3 (S3A).

      Fig 2. Of the many genes that change their rhythms after stimulation with young and old sera, what are the typical fold changes? For example, it would be useful to show histograms for the two groups. Does one group tend to have transcript rhythms of higher or lower fold changes? 

      We’ve presented these data in Figure S5. There are a few significant differences, but largely the groups are similar in terms of fold change.

      Fig. 2 Gene expression. Also here, the presentation would benefit from showing a few key examples for different types of responses. 

      Sample traces of genes that gain rhythmicity, lose rhythmicity, phase shift, and change MESOR are now illustrated in Figure S6.

      What was the rationale to use these cells over the more common U2OS cells? Are there similarities between the rhythmic transcriptomes of the BJ-5TA cells and that of U2OS cells or other human cells? This could easily be assessed using published datasets. 

      The original rationale for using BJ-5TA fibroblast cells was that we aimed to build upon an observation from a previous study<sup>2</sup>, which showed that circadian period changes with age in human fibroblasts. While our findings did not match theirs, we think an added benefit of using the BJ-5TA line is that, unlike U2OS cells, it is not a carcinoma-derived cell line. We’ve added this point in lines 98-101.

      Our study finds many more rhythmic transcripts than previous studies examining U2OS cells. This can be attributed to several factors: differences in methods (including the use of human serum in our study), cell type differences, or decoupling of rhythms in some cancer cells. While a comparison of BJ-5TA cells and U2OS cells could be interesting, a proper comparison requires investigation of many data sets, since any pair of BJ-5TA and U2OS data sets will most likely differ in some detail of experimental design or data processing pipeline, which could contribute to observed differences in rhythmic transcripts.

      That being said, we compared clock reference genes (see Author response image 1) between BJ-5TA and U2OS cells, comparing circadian profiles obtained from our data with those available on CircaDB. These circadian profiles exhibit many similarities and a few differences. The peak-to-trough ratios (amplitudes) are quite similar for ARNTL, NR1D1, NR1D2, PER2, and PER3; in our data they are about 25% lower for CRY1 and about 15% higher for TEF. We find that the MESORs are generally similar, with the exception of NR1D1, which is much lower, and NR1D2, which is much higher in our data.

      Author response image 1.

      BJ-5TA and U2OS Cells Exhibit Similar Profiles of Circadian Gene Transcription. We compared the transcriptomic profiles of the BJ-5TA cells in young and old serum (left) to the U2OS transcriptomic data (right) available on CircaDB, a database containing profiles of several circadian reference genes in U2OS cells. This figure suggests that circadian profiles of these genes exhibit many similarities. We find that the peak-to-trough ratios (amplitudes) are similar for ARNTL, NR1D1, NR1D2, PER2, and PER3, and that the MESORs are similar (with the exception of NR1D1, which is much lower, and NR1D2, which is much higher in the BJ-5TA cells). We find that the amplitude of CRY1 is ~25% lower and that of TEF is ~15% higher for the BJ-5TA cells. The axes of the plots on the left show counts divided by 3.5, which makes the MESORs of ARNTL similar and eases comparison.
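
      For readers unfamiliar with the quantities compared above, the following minimal sketch shows one common way to obtain a MESOR and a peak-to-trough ratio from a time series: a single-component cosinor fit with a fixed 24 h period. This is an illustration under stated assumptions, not the pipeline used for the response image. The function names and the synthetic data are hypothetical, and the closed-form estimates are exact least squares only when samples are evenly spaced and span a whole number of periods.

```python
import math


def cosinor_evenly_sampled(times_h, values, period_h=24.0):
    """Fit y(t) = M + A * cos(2*pi*(t - phi) / period).

    Returns (mesor M, amplitude A, acrophase phi in hours).
    Assumes evenly spaced samples covering a whole number of periods,
    so that the cosine and sine regressors are orthogonal.
    """
    n = len(values)
    w = 2.0 * math.pi / period_h
    mesor = sum(values) / n
    a = 2.0 / n * sum(y * math.cos(w * t) for t, y in zip(times_h, values))
    b = 2.0 / n * sum(y * math.sin(w * t) for t, y in zip(times_h, values))
    amplitude = math.hypot(a, b)
    acrophase_h = (math.atan2(b, a) / w) % period_h
    return mesor, amplitude, acrophase_h


def peak_to_trough(mesor, amplitude):
    """Peak-to-trough ratio of the fitted curve, (M + A) / (M - A)."""
    if mesor <= amplitude:
        raise ValueError("fitted trough is not positive; ratio undefined")
    return (mesor + amplitude) / (mesor - amplitude)


if __name__ == "__main__":
    # Hypothetical expression profile sampled every 4 h over 48 h.
    times = list(range(0, 48, 4))
    counts = [100.0 + 20.0 * math.cos(2.0 * math.pi * (t - 6.0) / 24.0) for t in times]
    m, amp, phase = cosinor_evenly_sampled(times, counts)
    print(m, amp, phase, peak_to_trough(m, amp))
```

      Because the peak-to-trough ratio depends on both the MESOR and the amplitude, two profiles can share a ratio while differing in MESOR, which is why the two quantities are reported separately.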

      For the rhythmic cell cycle genes, could this be the consequence of the serum which synchronizes also the cell cycle, or is it rather an effect of the circadian oscillator driving rhythms of cell cycle genes? 

      This is an interesting point. Given our previous data showing that the cell cycle gene cyclin D1 is regulated by clock transcription factors<sup>3</sup>, we believe the circadian oscillator drives, or at least contributes to, rhythms of cell cycle genes. However, the serum clearly makes a difference, as we find that MESORs of cell cycle genes decrease with aged serum. This is consistent with the decreased proliferation previously observed in aged human tissue<sup>4</sup>.

      While the reduction of rhythmicity in the old serum for oxidative phosphorylation transcripts is very interesting and fits with the general theme that metabolic function decreases with age, it is puzzling that the recipient cells are the same, but it is only the synchronization by the old and young serum that changes. Are the authors thus suggesting that decrease of metabolic rhythms is primarily a non cell-autonomous and systemic phenomenon? What would be a potential mechanism? 

      We are indeed suggesting this, although it is also possible that it is not cycling per se, but rather an overall inefficiency of oxidative phosphorylation that is conveyed by the serum. Relating other work in the field to our findings, we’ve added the following to our discussion: “Previous work in the field demonstrates that synchronization of the circadian clock in culture results in cycling of mitochondrial respiratory activity<sup>5,6</sup>, further underscoring the different effects of old serum, which does not support oscillations of oxidative phosphorylation-associated transcripts. Age-dependent decrease in oxidative phosphorylation and increase in mitochondrial dysfunction<sup>7</sup> has been seen in aged fibroblasts<sup>8</sup> and contributes to age-related diseases<sup>9</sup>. We suggest that the age-related inefficiency of oxidative phosphorylation is conferred by serum signals to the cells such that oxidative phosphorylation cycles are mitigated. On the other hand, loss of cycling could contribute to impairments in mitochondrial function with age.”

      The delayed shifts after aged serum for clock transcripts (but not for Bmal1) are interesting and indicate that there may be a decoupling of Bmal1 transcript levels from the other clock gene phases. How do the authors interpret this? could it be related to altered chronotypes in the elderly? 

      One possible explanation is that the delay of NPAS2, BMAL1’s binding partner, results in a delay in the transcription of clock-controlled genes/negative arm genes. Since the RORs do not seem to be affected, BMAL1 is transcribed/translated as usual, but there isn’t enough NPAS2 to bind with BMAL1. In this case, downstream genes are slower to transcribe, causing the phase delay.

      Reviewer #2 (Public Review): 

      Schwarz et al. have presented a study aiming to investigate whether circulating factors in the sera of subjects are able to synchronize, depending on age, the circadian rhythms of fibroblasts. The authors used human serum taken from either old (age 70-76) or young (age 25-30) individuals to synchronise cultured fibroblasts containing a clock gene promoter-driven luciferase reporter, followed by RNA sequencing to investigate whole gene expression.

      This study has the potential to be very interesting, as evidence of circulating factors in sera that mediate peripheral rhythms has long been sought after. Moreover, it raises the possibility that those factors are affected by age, which could contribute to the weakened circadian rhythmicity observed with aging.

      Here, the authors concluded that both old and young sera are equally competent at driving robust 24 hour oscillations, in particular for clock genes, although the cycling behaviour and nature of different genes is altered between the two groups, which is attributed to the age of the individuals. This conclusion could however be influenced by individual variabilities within and between the two age groups. The groups are relatively small, only four individuals (two females and two males) per group. In addition, factors such as food intake and exercise prior to the blood draw, and/or chronotype, which are known to affect systemic signals, are not taken into consideration. As seen in figure 4, traces from different individuals vary heavily in terms of their patterns, which is not addressed in the text. Only analysing the summary average curve of the entire group may be masking the true data. More focus should be attributed to investigating the effects of serum from each individual and observing common patterns. Additionally, there are many potential causes of variability, instead of or in addition to age, that may be contributing to the variation, both between the groups and between individuals within groups. All of this should be addressed by the authors and commented on appropriately in the text.

      We are not aware of any specific feature distinguishing the subjects (other than age) that could account for the differences between old and young. The fact that we see significant differences between the two groups, even with the relatively small size of the groups, suggests strongly that these differences are largely due to age. Nevertheless, we acknowledge that individual variability can be a contributing factor. For instance, the change in phase of clock genes appears to be driven largely by two subjects. We have commented on this and individual differences, in general, in the discussion.  

      The authors also note in the introduction that rhythms in different peripheral tissues vary in different ways with age; however, the entire study is performed only on fibroblasts, classified as peripheral tissue by the authors. It would be very interesting to investigate whether or not the observed changes in fibroblasts extend to other cell lines from diverse organ origins. This could provide information about whether circulating circadian synchronising factors could exert their function systemically or on specific tissues. At the very least, this hypothesis should be addressed within the discussion.

      It is likely that factors circulating in serum act on several tissues, and so their effects are relatively broad. However, this would require extensive investigation of other tissues. We now discuss this in the manuscript.

      In addition to the limitations indicated above, I consider that the data of the study are insufficiently analysed beyond the rhythmicity analysis. Results from the STRING and IPA analysis were merely descriptive, and a more comprehensive bioinformatic analysis would provide additional information about potential molecular mechanisms explaining the differential gene expression. For example, enrichment of transcription factor binding sites in those genes with different patterns could pinpoint chromatin regulatory pathways.

      We performed LinC similarity analysis (LISA) to study enrichment of transcription factor binding. Results are displayed in Fig 3B and in lines 157-168. 

      Recommendations for the authors:

      The two reviewers and reviewing editor have agreed on the following recommendations for the authors: 

      Major: 

      (1) The bioinformatic analysis would benefit from a more thorough focus on variability between individuals. Specifically, the main conclusion of the manuscript could be significantly influenced by individual variabilities within and between the two age groups. This is of particular concern, as the groups are relatively small (four individuals, two females and two males, per group). In addition, the consideration of factors such as food intake and exercise prior to the blood draw, and/or chronotype, which are known to affect systemic signals, should be more adequately explained. The lab is an experienced chronobiology lab, and thus we are confident that these factors had been thought of, but this needs to be made clearer.

      As seen in Figure 4, traces from different individuals vary heavily in terms of their patterns, which is not addressed in the text. Only analysing the summary average curve of the entire group may be masking the relevant data. Furthermore, there are many potential causes of variability, instead of or in addition to age, that may be contributing to the variation, both between the groups and between individuals within groups. All of this should be addressed by the authors and commented on appropriately in the text.

      We are not aware of any specific feature distinguishing the subjects (other than age) that could account for the differences between old and young. The fact that we see significant differences between the two groups, even with the relatively small size of the groups, suggests strongly that these differences are largely due to age. Nevertheless, we acknowledge that individual variability can be a contributing factor. For instance, the change in phase of clock genes appears to be driven largely by two subjects. We have commented on this and individual differences, in general, in the discussion. 

      (2) The study would benefit from a more thorough analysis of the data beyond the rhythmicity analysis. Results from the STRING and IPA analysis were merely descriptive and a more comprehensive bioinformatic analysis would provide additional information about potential molecular mechanisms explaining the differential gene expression. For example, enrichment of transcription factor binding sites in those genes with different patterns to pinpoint chromatin regulatory pathways. This would provide additional value to the study, especially given the otherwise apparent lack of any mechanistic explanation.

      We performed LinC similarity analysis (LISA) to study enrichment of transcription factor binding. Results are displayed in Fig 3B and in lines 157-168.

      (3) There were some questions about the amplitude of the core circadian clock gene rhythms raised, which in other human cell types would be much higher. A comment on this matter and the provision of the raw luminescence traces for Fig 2A would be greatly beneficial.

      Addressing the same topic: what are the typical fold changes of the many genes that change their rhythms after stimulation with young and old sera? For example, it would be useful to show histograms for the two groups. Does one group tend to have transcript rhythms of higher or lower fold changes? The presentation of the manuscript would further benefit from showing a few key examples for different types of responses. 

      The average luminescence trace for each individual serum sample from Fig 2A has been added to Fig S3A.

      We’ve presented the fold change data in Figure S5. There are a few significant differences, but largely the groups are similar in terms of fold change.

      (4) There are several points that we recommend to consider to add to the discussion: 

      What was the rationale to use these cells over the more common U2OS cells? Are there similarities between the rhythmic transcriptomes of the BJ-5TA cells and that of U2OS cells or other human cells? It should be relatively easy to address this point by assessing published datasets. 

      The original rationale for using BJ-5TA fibroblast cells was that we were aiming to build upon an observation from a previous study (2), which showed that circadian period changes with age in human fibroblasts. While our findings did not match theirs, we think an added benefit of using the BJ-5TA line is that, unlike U2OS cells, it is not a carcinoma-derived cell line. We’ve added this point in lines 98-101.

      Our study finds many more rhythmic transcripts compared to the previous studies examining U2OS cells. This can be attributed to several factors including differences in methods, including the use of human serum in our study, cell type differences, or decoupling of rhythms in some cancer cells. While a comparison of BJ-5TA cells and U2OS cells could be interesting, a proper comparison requires investigation of many data sets, since any pair of BJ-5TA and U2OS data sets will most likely differ in some detail of experimental design or data processing pipeline, which could contribute to observed differences in rhythmic transcripts.

      That being said, we compared clock reference genes (see Author response image 1) between BJ-5TA and U2OS cells, comparing circadian profiles obtained from our data with those available on CircaDB. These circadian profiles exhibit many similarities and a few differences. The peak-to-trough ratios (amplitudes) are quite similar for ARNTL, NR1D1, NR1D2, PER2 and PER3, and are about 25% lower for CRY1 and somewhat higher (about 15%) for TEF in our data. We find that the MESORs are generally similar, with the exception of NR1D1, which is much lower, and NR1D2, which is much higher in our data.
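
      The two quantities compared in this response, the MESOR and the peak-to-trough ratio, can be illustrated with a minimal cosinor sketch. This is not the pipeline used in the study: it assumes a fixed 24 h period, expression values on a linear scale, and a single-component cosine model fitted by ordinary least squares. The function name cosinor_fit and its output keys are hypothetical.

```python
import numpy as np

def cosinor_fit(t_hours, y, period=24.0):
    """Least-squares fit of y = MESOR + A*cos(2*pi*t/period - phi).

    Illustrative only: fixed period, one harmonic, linear-scale data.
    """
    t = np.asarray(t_hours, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 2.0 * np.pi / period
    # Linearised model: y = MESOR + b_cos*cos(w t) + b_sin*sin(w t)
    X = np.column_stack([np.ones_like(t), np.cos(w * t), np.sin(w * t)])
    (mesor, b_cos, b_sin), *_ = np.linalg.lstsq(X, y, rcond=None)
    amplitude = np.hypot(b_cos, b_sin)
    # Time of the fitted peak, in hours after t = 0
    acrophase_h = (np.arctan2(b_sin, b_cos) / w) % period
    return {
        "mesor": mesor,
        "amplitude": amplitude,
        "acrophase_h": acrophase_h,
        # Only meaningful when mesor > amplitude (fitted trough above zero)
        "peak_to_trough": (mesor + amplitude) / (mesor - amplitude),
    }

# Synthetic example: MESOR 10, amplitude 2, peak at 6 h, sampled every 4 h
t = np.arange(0, 48, 4)
y = 10 + 2 * np.cos(2 * np.pi * (t - 6) / 24)
fit = cosinor_fit(t, y)
```

      In this framing, two profiles can share a similar peak-to-trough ratio while differing in MESOR, which is the pattern described above for NR1D1 and NR1D2.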

      For the rhythmic cell cycle genes, could this be the consequence of the serum which synchronizes also the cell cycle, or is it rather an effect of the circadian oscillator driving rhythms of cell cycle genes? 

      This is an interesting point. Given our previous data showing that the cell cycle gene cyclin D1 is regulated by clock transcription factors (3), we believe the circadian oscillator drives, or at least contributes to, rhythms of cell cycle genes. However, the serum clearly makes a difference as we find that MESORs of cell cycle genes decrease with aged serum. This is consistent with the decreased proliferation previously observed in aged human tissue.

      While the reduction of rhythmicity in the old serum for oxidative phosphorylation transcripts is very interesting and fits with the general theme that metabolic function decreases with age, it is puzzling that the recipient cells are the same, but it is only the synchronization by the old and young serum that changes. Are the authors thus suggesting that decrease of metabolic rhythms is primarily a non cell-autonomous and systemic phenomenon? What would be a potential mechanism? 

      It may not be the cycling per se, but rather an overall inefficiency of oxidative phosphorylation that is conveyed by the serum. Relating other work in the field to our findings, we’ve added the following to our discussion: “Previous work in the field demonstrates that synchronization of the circadian clock in culture results in cycling of mitochondrial respiratory activity (5,6), further underscoring the different effects of old serum, which does not support oscillations of oxidative phosphorylation associated transcripts. Age-dependent decrease in oxidative phosphorylation and increase in mitochondrial dysfunction (7) is seen also in aged fibroblasts (8) and contributes to age-related diseases (9). We suggest that the age-related inefficiency of oxidative phosphorylation is conferred by serum signals to the cells such that oxidative phosphorylation cycles are mitigated. On the other hand, loss of cycling could contribute to impairments in mitochondrial function with age.”

      The delayed shifts after aged serum for clock transcripts (but not for Bmal1) are interesting and indicate that there may be a decoupling of Bmal1 transcript levels from the other clock gene phases. How do the authors interpret this? Could it be related to altered chronotypes in the elderly? 

      One possible explanation is that the delay of NPAS2, BMAL1’s binding partner, results in the delay of the transcription of clock-controlled genes/negative arm genes. Since the RORs do not seem to be affected, BMAL1 is transcribed/translated as usual, but there isn’t enough NPAS2 to bind with BMAL1. In this case, downstream genes are slower to be transcribed, causing the phase delay.

      The discussion would also benefit from mentioning parallels and dissimilarities with previous works, as well as what would be possible mechanisms for such an effect.

      We’ve expanded our discussion in the manuscript to discuss possible mechanisms and also how the genes/pathways implicated in our study relate to other aging literature.  

      Minor: 

      While time of serum collection is provided in the methods, it would be very useful to provide this information, along with the accompanying argumentation also at a more prominent position and to also add it to Table S1. 

      We made sure to highlight the collection time in the abstract of the manuscript: “We collected blood from apparently healthy young (age 25-30) and old (age 70-76) individuals at 14:00 (1) and used the serum to synchronize cultured fibroblasts.” The time of blood draw is also stated in other sections of the paper (Introduction and Methods). Since Table S1 contains demographic information, we did not think that the blood draw time fit best there, but hopefully it is now clear in the text.

      L73 EKG: define the abbreviation 

      We rewrote this paragraph, but defined the term where it is used in the paper.

      L77: transfected BJ-5TA fibroblasts. Mention in the text that these are stably transfected cells. 

      We added this to the text.

      L88: Day 2 also revealed different phases of cyclic expression between young and old "groups" for a larger number of genes. Here it is only two donors, right? 

      Yes, we swapped out the word “groups” for “subjects”.

      L115. MESORs of steroid biosynthesis genes, particularly those relating to cholesterol biosynthesis, were also increased in the old sera condition. This is quite interesting, can the authors speculate on the significance of this finding? 

      We’ve added discussion about this finding in the context of the literature in our discussion.

      Fig 3. - FDRs are only listed for certain KEGG pathways, and gene counts for each pathway are also missing, which excludes some valuable context for drawing conclusions. Full tables of KEGG pathway enrichment outputs should be provided in supplementary materials. Input gene lists should also be uploaded as supplementary data files.

      Both output and input files are included in this submission as additional files.  

      Line 322 - How many replicates were excluded in the end for each group? Providing this information would strengthen the claim that the ability of both old and young serum to drive 24h oscillations in fibroblasts is robust and not only individual. 

      Each serum was tested in triplicate in two individual runs of the experiment. Of the 15 serum samples, on one of the runs, a triplicate for each of two serum samples (one old, one young) was excluded. Given that only one technical replicate in one run of the experiment had to be excluded for one old and one young individual out of all the samples assayed, this supports the idea that young and old serum drive robust oscillations.

      Line 373 - Should list which active interaction sources were used for analysis. 

      In this manuscript we used STRING (search tool for retrieval of interacting genes) analysis to broadly identify relevant pathways defined by different algorithms. From these data, we focused in particular on KEGG pathways.

      Reviewer #1 (Recommendations For The Authors): 

      These comments are in addition to those provided above: 

      Minor: 

      L73 EKG: define the abbreviation 

      We rewrote this paragraph, but defined the term where it is used in the paper.

      L77: transfected BJ-5TA fibroblasts. Mention in the text that these are stably transfected cells. 

      We added this to the text.

      L88: Day 2 also revealed different phases of cyclic expression between young and old "groups" for a larger number of genes. Here it is only two donors, right?

      Yes, we swapped out the word “groups” for “subjects”.

      L115. MESORs of steroid biosynthesis genes, particularly those relating to cholesterol biosynthesis, were also increased in the old sera condition. This is quite interesting, can the authors speculate on the significance of this finding? 

      We’ve added discussion about this finding in the context of the literature.

      Fig.4 The fold change amplitude of the clock gene seems quite a bit lower than what is usually expected (for Nr1d1 it is usually 10 fold). The authors should provide an explanation and discuss this. 

      There are a variety of factors that contribute to the fold change amplitude of clock genes. First, the change in amplitude of clock genes is lower in vitro compared to in vivo samples. For example, in U2OS cell cultures the fold change in the cycling of Nr1d1 is only 2 fold and is not significantly different from the fold change we observe (as shown in the U2OS data from CircaDB plotted in Figure 1R). Second, the method of synchronization contributes to the strength of the rhythms. Serum synchronization is generally less effective at driving strong clock cycling than forskolin or dexamethasone although, as noted in the manuscript, it may promote the cycling of more genes. Lastly, rhythm amplitude is also dependent on the cell type in question so cell to cell variability also contributes to differences. However, overall, we do not find major differences in comparing the U2OS data and ours. Please note that the y-axis has a logarithmic scale.

      What is the authors' strategy to identify which serum components are responsible for the reported changes? This should be discussed.

      In the future, we intend to analyze the serum factors using a combination of fractionation and either proteomics or metabolomics to identify relevant factors. We have added this to the discussion.

      Reviewer #2 (Recommendations For The Authors): 

      Overall, the article is well-written but lacks some more rigorous data analysis as mentioned in the public review above. In addition to a more thorough analysis approach focusing much more heavily on individual variability, several other changes can be made to strengthen this study:

      Fig 3. - FDRs are only listed for certain KEGG pathways, and gene counts for each pathway are also missing, which excludes some valuable context for drawing conclusions. Full tables of KEGG pathway enrichment outputs should be provided in supplementary materials. Input gene lists should also be uploaded as supplementary data files. 

      Both output and input files are included in this submission as additional files.

      Fig 1A. - Only n=5 participants were used for this analysis, explanation of the exclusion criteria for the other participants would be useful. 

      As Figure 1A is a schematic, we assume the reviewer is referring to Figure 1B. We’ve provided a flow chart of subject inclusion/exclusion in Figure S2.

      Fig 2. - For circadian transcriptome analysis only n=4 participants were used - what criteria were used to exclude individuals, and why were only these individuals used in the end?

      As patient recruitment was interrupted by COVID, we selected samples where we had sufficient serum to effectively carry out the RNA seq experiment and control for age and sex.

      Line 322 - How many replicates were excluded in the end for each group? Providing this information would strengthen the claim that the ability of both old and young serum to drive 24h oscillations in fibroblasts is robust and not only individual. 

      Each serum was tested in triplicate in two individual runs of the experiment. Of the 15 serum samples, on one of the runs, a triplicate for each of two serum samples (one old, one young) was excluded. Given that only one technical replicate in one run of the experiment had to be excluded for one old and one young individual out of all the samples assayed, this supports the idea that young and old serum drive robust oscillations.

      Line 373 - Should list which active interaction sources were used for analysis. 

      In this manuscript we used STRING (search tool for retrieval of interacting genes) analysis to identify relevant pathways. We do not present any STRING networks in the paper.

      Line 68 - "These novel findings suggest that it may be possible to treat impaired circadian physiology and the associated disease risks by targeting blood borne factors." This is a complete overstatement that cannot be sustained by the limited findings provided by the authors.

      We’ve modified this statement to avoid overstating results.

      (1) Pagani, L. et al. Serum factors in older individuals change cellular clock properties. Proceedings of the National Academy of Sciences 108, 7218–7223 (2011).

      (2) Pagani, L. et al. Serum factors in older individuals change cellular clock properties. Proceedings of the National Academy of Sciences 108, 7218–7223 (2011).

      (3) Lee, Y. et al. G1/S cell cycle regulators mediate effects of circadian dysregulation on tumor growth and provide targets for timed anticancer treatment. PLOS Biology 17, e3000228 (2019).

      (4) Tomasetti, C. et al. Cell division rates decrease with age, providing a potential explanation for the age-dependent deceleration in cancer incidence. Proceedings of the National Academy of Sciences 116, 20482–20488 (2019).

      (5) Cela, O. et al. Clock genes-dependent acetylation of complex I sets rhythmic activity of mitochondrial OxPhos. Biochimica et Biophysica Acta (BBA) - Molecular Cell Research 1863, 596–606 (2016).

      (6) Scrima, R. et al. Mitochondrial calcium drives clock gene-dependent activation of pyruvate dehydrogenase and of oxidative phosphorylation. Biochimica et Biophysica Acta (BBA) - Molecular Cell Research 1867, 118815 (2020).

      (7) Lesnefsky, E. J. & Hoppel, C. L. Oxidative phosphorylation and aging. Ageing Research Reviews 5, 402–433 (2006).

      (8) Greco, M. et al. Marked aging-related decline in efficiency of oxidative phosphorylation in human skin fibroblasts. The FASEB Journal 17, 1706–1708 (2003).

      (9) Federico, A. et al. Mitochondria, oxidative stress and neurodegeneration. Journal of the Neurological Sciences 322, 254–262 (2012).

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations For The Authors):

      The manuscript is very well written, the data are clearly presented and the methodology is robust. I only have suggestions to improve the manuscript, to make the study more appealing or to discuss in more detail some questions raised by the work.

      1. In the study as it stands, PFG seems to come out of the blue. The authors apparently selected this protein based on sequence conservation between species but this is unlikely to be sufficient to identify novel TFs. Explaining in more detail the reasoning that led to PFG would make the story more appealing. Perhaps PFG was identified through a large reverse genetics screening?

      Response: Thank you for your suggestion. We identified this gene solely by the strategy we described in the manuscript. We decided on this strategy based on the findings of our previous study on AP2-family TFs, whose DNA binding domains are highly conserved among Plasmodium orthologues. Using this screening strategy, we identified a novel AP2-family TF, AP2-Z. The results of the present study demonstrated that this strategy is applicable to TFs other than those belonging to the AP2 family. We are aware that this strategy is not all-encompassing. In fact, we failed to identify HDP1 as a candidate TF although it was also in the target list of AP2-G. However, at present, this is our primary strategy for identifying novel TFs in the targetome.

      1. The authors propose that PFG and AP2-FG form a complex, but this is actually not shown. Did they try to document a physical interaction between the two proteins, for example using co-IP?

      Response: Even when two molecules are found at the same position by ChIP-seq, it cannot be concluded that they form a physical complex, because it is possible that they competitively occupy the region. However, in this study, we performed ChIP-seq in the absence of PFG and demonstrated that the cAP2-FG peaks disappeared while those of sAP2-FG remained. This result can only be explained by the two proteins forming a complex at this region, which excludes the possibility that AP2-FG binds the region independently.

      1. It is unclear how PFG can bind to DNA in the absence of DNA-binding domain. Did the authors search for unconventional domains in the protein? This should be at least discussed in the manuscript.

      Response: We speculate that the two highly conserved regions, region 1 and region 2, function as DNA-binding domains in PFG. However, this domain is not similar to any DNA binding domains reported thus far. A straightforward way to demonstrate this would be to perform in vitro binding assays using a recombinant protein. However, thus far, we have not succeeded in obtaining soluble recombinant proteins for these regions. We have added the following sentences to the results section.

      “At present, we speculate that PFG directly interacts with genomic DNA through two highly conserved regions; region 1 and region 2. However, these regions are not similar to any DNA binding domains reported thus far. In other apicomplexan orthologues, these two domains are located adjacent to one another in the protein (Fig. 1A). Therefore, these two regions may be separated by a long interval region but constitute a DNA binding domain of PFG as a result of protein folding.”

      1. How do the authors explain that PFG is still expressed in the absence of AP2-FG? Is AP2-G alone sufficient to express adequate levels of the protein? Is PFG down-regulated in the absence of AP2-FG?

      Response: Our previous ChIP-seq data indicate that PFG is a target of AP2-G. According to the study by Kent et al. (2018), this gene is up-regulated in the early period following conditional AP2-G induction. The results of the present study showed that PFG is capable of autoactivation through a transcriptional positive feedback loop. These results suggest that PFG can maintain its expression to a certain level once activated by AP2-G, even in the absence of AP2-FG. In our previous microarray analysis, significant decreases in PFG expression were not observed in AP2-FG-disrupted parasites.

      1. How do AP2-FG regulated genes (based on RNA-seq) compare with the predicted cAP2-FG/sAP2-FG genes (based on ChIP-seq)? Are the two subsets included in the genes that are actually down-regulated in AP2-FG(-)?

      Response: Disruption of the AP2-FG gene impairs gametocyte development. We considered that the direct effect of this disruption would be difficult to analyze in gametocyte-enriched blood, in which gametocytes are pooled during sulfadiazine treatment to deplete asexual stages. Therefore, in our previous paper, we performed microarray analysis between WT and KO parasites to detect the direct effect of AP2-FG disruption on target gene expression, using mice which were synchronously infected with parasites. According to our results, 206 genes were down-regulated in AP2-FG-disrupted parasites. Of these genes, 40 and 117 were targets of sAP2-FG and cAP2-FG, respectively. However, it is still possible that a significant proportion of genes were indirectly down-regulated by AP2-FG disruption, which may impair gametocyte development. Moreover, based on the results of the present study, expression of a significant proportion of AP2-FG target genes could be complemented by PFG transcription. We believe that it would be difficult to compare the direct effects of these TFs on gene expression via transcriptome analysis (therefore, targetome analysis is important). In this study, we compared the expression of target genes of sAP2-FG and cAP2-FG between PFG(-) and WT parasites. We expected that down-regulation of PFG (cAP2-FG) targets would be complemented with transcription by sAP2-FG.

      1. Minor points

      -Page 5 Line 10, remove "as"

      Response: We have corrected this.

      -Page 7 Lines 4-13: is it possible to perform the assay in PFG(-) parasites?

      Response: Thank you for your question. Even if marker gene expression were decreased in PFG(-) parasites, we could not conclude that this was a direct effect of the mutation. To determine the function of the motif, it is necessary to perform the assay using wild-type parasites.

      -Page 7 Line 45: Fig6C instead of 5C

      Response: Thank you for pointing this out. We have corrected this.

      -Page 8 Line 27: "decreases"

      Response: Thank you for pointing this out. We have corrected this.

      -Page 8 Line 36: PFG instead of PGP

      Response: We have corrected this.

      -Page 8 Line 39: remove "the fact"

      Response: We have removed this word.

      -Page 8 Line 42: Fig6G instead of 5G

      Response: We have corrected this.

      -Page 8 Line 43: PFG instead of PGP

      Response: We have corrected this.

      -Page 9 Line 23: "electroporation"

      Response: We have corrected this.

      -Page 9 Line 32: "BamHI"

      Response: We have corrected this.

      -Fig 2E: in the crosses did the authors check oocyst formation in the mosquito?

      Response: We did not check oocyst formation because abnormalities in males may not affect oocyst formation.

      -Page 17, legend Fig3, Line 14, there is probably an inversion between left and right for PFG versus AP2-FG (either in the legend or in the figure)

      Response: Thank you for pointing this out. PFG peaks are located in the center in both heat maps. The description “AP2-FG peaks” over the arrowhead in the left map was incorrect. We have corrected this to “PFG peaks”. The peaks in the left heat map must be located in the center; thus, this figure might be redundant.

      Reviewer #2 (Recommendations for the Authors):

      • Could the authors please state in the results section that PFG stands for partner of AP2-FG.

      Response: Thank you for the comment. We have added the following to the results section:

      “Through this screening, a gene encoding a 2709 amino acid protein with two regions highly conserved among Plasmodium was identified (PBANKA0902300) and designated as partner of AP2-FG (PFG; Fig. 1A).”

      • Given that the transcriptional program is so dynamic, the timing of the ChIP-seq experiments is crucial. Could the authors clarify the timings of the different ChIP-seq experiments (AP2-FG, PFG, PFG in AP2-FG-, AP2-FG in PFG-, ...)

      Response: Thank you for the comment. To deplete any parasites in the asexual stages, all ChIP-seq experiments in this study were performed using blood from mice treated with sulfadiazine, namely, gametocyte-enriched blood. As the reviewer points out, timing is important, and samples from the period when TFs are maximally expressed are optimal for ChIP-seq. However, when parasites in the asexual stages are present, the background becomes higher. Thus, we usually use gametocyte-enriched blood for ChIP-seq when expression of the TF is observed in mature gametocytes. The exception was our ChIP-seq analysis of AP2-G, because it is not present in mature gametocytes.

      • Fig 4c is an example of great overlap of peaks, but it would be helpful if the authors could quantify the overlaps between experiments (and describe the overlap parameters used).

      Response: In response to the comment, we have created a Venn diagram of overlapping peaks (attached below). However, the peaks used for this Venn diagram were selected after peak calling via fold-enrichment values. Thus, even if the counterpart of a peak is absent from these selected peaks (non-overlapping peaks in the Venn diagram), this does not indicate that it is absent from the original read map. We believe the overlap of peaks would be estimated more correctly in the heat maps.

      Author response image 1.

      Legend: The Venn diagram shows the number of common peaks between these ChIP-seq experiments (distance of peak summits < 150
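
      The overlap criterion in this legend can be sketched as follows. This is an illustration rather than the authors' script: the legend is cut off after the threshold, so base pairs are assumed as the unit, summits are assumed to be given as (chromosome, position) pairs, and the function name count_common_peaks is hypothetical.

```python
from bisect import bisect_left

def count_common_peaks(summits_a, summits_b, max_dist=150):
    """Count peaks in A that have a summit in B on the same chromosome
    at a distance strictly smaller than max_dist (unit assumed: bp)."""
    by_chrom = {}
    for chrom, pos in summits_b:
        by_chrom.setdefault(chrom, []).append(pos)
    for positions in by_chrom.values():
        positions.sort()

    common = 0
    for chrom, pos in summits_a:
        positions = by_chrom.get(chrom, [])
        i = bisect_left(positions, pos)
        # Only the nearest summit on each side can be the closest one
        neighbours = positions[max(i - 1, 0):i + 1]
        if any(abs(pos - q) < max_dist for q in neighbours):
            common += 1
    return common

# Hypothetical summit lists for two experiments
exp_a = [("chr1", 1000), ("chr1", 5000), ("chr2", 300)]
exp_b = [("chr1", 1100), ("chr1", 5150), ("chr2", 9000)]
n_common = count_common_peaks(exp_a, exp_b)
```

      Counting from the perspective of one experiment is a design choice; the count can differ slightly when the roles of the two lists are swapped, and, as the response notes, it depends on which peaks pass the fold-enrichment cut-off.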

      • Additionally, how were the promoter coordinates used for each gene when they associate ChIP peaks to a gene target. Did the authors choose 1-2kb? Or use a TSS/5utr dataset such as Adjalley 2016 or Chappell 2020?

      Response: We selected a 1.2 Kbp region for target prediction based on our previous studies. As the reviewer pointed out, target prediction using TSS information may be more accurate. However, reliable TSS information is not available for P. berghei to the best of our knowledge.

      The two papers are studies on P. falciparum.
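
      The target prediction described in this response, assigning a peak to a gene when it falls in a 1.2 kbp region, can be sketched as below. The response does not state the reference point of the region, so this sketch assumes it is measured upstream of the coding sequence on the gene's own strand; the gene identifiers, coordinates and the function name assign_targets are hypothetical.

```python
def assign_targets(peak_summits, genes, upstream=1200):
    """Map each gene to the peak summits lying within `upstream` bp
    upstream of its coding sequence (strand-aware).

    peak_summits: list of (chrom, position)
    genes: list of (gene_id, chrom, cds_start, cds_end, strand)
    """
    targets = {}
    for gene_id, chrom, start, end, strand in genes:
        if strand == "+":
            lo, hi = start - upstream, start - 1   # region before the CDS start
        else:
            lo, hi = end + 1, end + upstream       # mirrored for the minus strand
        hits = [pos for c, pos in peak_summits if c == chrom and lo <= pos <= hi]
        if hits:
            targets[gene_id] = hits
    return targets

# Hypothetical genes and peak summits
genes = [
    ("geneA", "chr1", 5000, 7000, "+"),
    ("geneB", "chr1", 1000, 2000, "-"),
]
peaks = [("chr1", 4200), ("chr1", 2500), ("chr1", 9000)]
predicted = assign_targets(peaks, genes)
```

      A fixed window of this kind is a reasonable fallback when, as stated above, reliable TSS annotation is not available for the species.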

      • In the absence of evidence of physical interaction, it remains unclear if AP2-FG and PFG actually interact directly or as part of the same complex. A more detailed characterisation with IPs/co-IPs followed by mass spectrometry of the GFP-tagged version of PFG in the presence and absence of AP2-FG would be highly informative.

      Response: Thank you for the comment. Even when these two TFs occupy the same genomic region, it cannot be conclusively said that they exist at the same time in the region: they might competitively occupy the region. However, we showed that the cAP2-FG peaks disappear from the region when PFG was disrupted, while sAP2-FG peaks remain. We believe that this is evidence that the two TFs physically interact with each other.

      • It was not clear if the assessment of motif binding using cytometry was performed using all the required controls and compensation. This section should be clarified.

      Response: Thank you for the comment. Compensation was performed using parasites expressing a single fluorescent protein. The results are attached below. The histogram of mCherry using control parasites expressing GFP under the control of the HSP70 promoter is also attached.

      Author response image 2.

      However, we found that descriptions of the filters for detecting red signals were not correct. This assay was performed using parasites which expressed GFP constitutively and mCherry under the control of the p28 promoter. These two fluorescent proteins were excited by independent lasers (488 and 561, respectively), and the emission spectra were detected using independent detectors (through 530/30 and 610/20 filters, respectively). We have revised the description regarding our FACS protocols as follows:

      “Flow cytometric analysis was performed using an LSR-II flow cytometer (BD Biosciences). In experiments using 820 parasites, the tail blood from infected mice was selected via gating with forward scatter and staining with Hoechst 33342 (excitation = 355 nm, emission = 450/50). The gated population was then analyzed for GFP fluorescence (excitation = 488 nm, emission = 530/30) and RFP fluorescence (excitation = 561 nm, emission = 610/20). In the promoter assay (using parasites transfected with a centromere plasmid), the tail blood from infected mice was selected via gating with forward scatter and staining with Hoechst 33342 (excitation = 355 nm, emission = 450/50), followed by GFP fluorescence (excitation = 488 nm, emission = 530/30). The gated population was analyzed for mCherry fluorescence (excitation = 561 nm, emission = 610/20). Analysis was performed using the DIVER program (BD Biosciences).”
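
      The compensation with single-colour controls mentioned in this response can be illustrated with a minimal two-detector sketch. It is a simplified linear model, not the instrument software's procedure: background and autofluorescence are ignored, events are assumed to be (green, red) intensity pairs, and the function names and all numbers are hypothetical.

```python
import numpy as np

def spillover_matrix(gfp_only, mcherry_only):
    """Estimate a 2x2 spillover matrix from single-colour controls.

    Rows are fluorophores (GFP, mCherry); columns are detectors
    (green, red). Each row is normalised to its primary detector.
    """
    g = np.median(np.asarray(gfp_only, dtype=float), axis=0)
    r = np.median(np.asarray(mcherry_only, dtype=float), axis=0)
    return np.array([g / g[0], r / r[1]])

def compensate(events, spillover):
    """Linear model: observed = true @ spillover, so invert it."""
    return np.asarray(events, dtype=float) @ np.linalg.inv(spillover)

# Hypothetical single-colour controls: columns are (green, red) signal
gfp_only = [[1000, 100], [2000, 200]]    # 10% of GFP signal reaches the red detector
mcherry_only = [[20, 1000], [40, 2000]]  # 2% of mCherry signal reaches the green detector
S = spillover_matrix(gfp_only, mcherry_only)

# A GFP-only event: its apparent red signal is removed by compensation
corrected = compensate([[500, 50]], S)
```

      Under this model, an uncompensated GFP-positive, mCherry-negative parasite would show a small apparent red signal, which is the spillover concern raised by the reviewer.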

      Minor points:

      • Page 4, line 37: The authors should specify the timing of expression of AP2-FG on the text.

      Response: We have added the following description to the text.

      “The timing of the expression was approximately four hours later than that of AP2-FG, which started at 16 hpi (9).”

      • Ref 9 and 17 are repeated

      Response: Thank you for pointing this out. We have corrected this.

      • Fig 1D and 1F do not have scale bars

      Response: We have added scale bars to Fig. 1D.

      We have not changed Fig. 1F, because we believe that the scales can be estimated from the size of the erythrocyte.

      • Page 5, line 29-30. Could the authors specify how many and which of the de-regulated genes have a PFG in their promoter.

      Response: Thank you for the comment. As described in a later section (page 7; Impact of PFG disruption on the expression of AP2-FG target genes), among the 279 genes significantly downregulated in PFG(-) parasites, 165 genes were targets for PFG (unique for PFG or common for sAP2-FG and PFG). In contrast, only four genes were targets unique to sAP2-FG. Therefore, 165 genes harbor the upstream peaks of PFG. These genes are shown in Table S1.

      • Fig 5F. in the methods associated with this figure there seems to be a mixup with the description of the lasers. In addition, given the spillover of the red and green signal between detectors this experiment needs compensation parameters. The authors should provide the gating strategy before and after compensation as this is critical for the correct calculation of the number of red parasites. Indeed, the lowest red cloud on the gate shown could be green signal spill over.

      Response: Thank you for the comment. As described above, there were some incorrect descriptions about the conditions of our FACS protocols in the methods section. We have revised them.

      • Page 7, line 19. Could the authors explicitly say in the text that the 810 genes are those with 1 (or more?) PFG peaks in their promoter (out of a total of 1029) to best guide the reader. Additionally, it is important to define the maximum distance allowed between a peak and CDS for it to be associated with said CDS.

      Response: We have revised Table S2 by adding the nearest genes. The revised table shows the relationship between a PFG peak and its nearest genes, together with their distances.

      • Page 7, line 45: fig 6c, not 5c

      Response: Thank you for the comment. We have corrected this.

      • Page 7 last paragraph: This section is very hard to follow. For instance, on line 50 do the authors mean that the sAP2-FG unique targets are LESS de-regulated? On line 51: do the authors mean unique targets of cAP2-FG or unique targets of PFG? Line 53: do the authors mean that genes expressed in the "common" category are LESS de-regulated than the PFG unique targets?

      Response: We are sorry for the lack of clarity; after reviewing the manuscript, it appears to be unclear what the fold change means in this section. Here, fold change means the ratio of PFG(-)/wild type. Thus “High log2(fold change) value” means that the genes were less downregulated. We have revised the description as follows:

      “The log2 distribution (fold change = PFG(-)/wild type) in the three groups of target genes showed that the average value was significantly higher (i.e., less down-regulated) in targets unique to sAP2-FG than in the other two groups (targets unique to cAP2-FG or common targets for both), with p-values of 1.3 × 10⁻¹⁰ and 1.4 × 10⁻⁵, respectively, by two-tailed Student’s t-test (Fig. 6F). In addition, the average log2 (fold change) value of the common target genes was relatively higher (i.e., less down-regulated) than that of targets unique to PFG, suggesting that transcriptional activation by sAP2-FG partly complements the impact of PFG disruption on these common targets.”
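      The sign convention in the revised passage can be illustrated with a minimal sketch. The expression values and group contents below are hypothetical, chosen only to show that a higher (less negative) mean log2 fold change corresponds to weaker down-regulation; the significance test reported in the manuscript is not reproduced here.

```python
import math
from statistics import mean

def log2_fold_change(mutant, wild_type):
    """log2 of mutant / wild-type expression; negative means down-regulated in the mutant."""
    return math.log2(mutant / wild_type)

# Hypothetical expression values (arbitrary units) as (PFG(-), wild type) pairs.
unique_to_sap2fg = [(90.0, 100.0), (80.0, 100.0), (110.0, 100.0)]
unique_to_pfg = [(25.0, 100.0), (12.5, 100.0), (50.0, 100.0)]

sap2fg_lfc = [log2_fold_change(m, w) for m, w in unique_to_sap2fg]
pfg_lfc = [log2_fold_change(m, w) for m, w in unique_to_pfg]

# A higher (less negative) mean indicates weaker down-regulation in PFG(-).
print(mean(sap2fg_lfc) > mean(pfg_lfc))
```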

      • Page 8, line 42: Fig 6G, not 5G

      Response: Thank you for pointing this out. We have corrected this.

      Reviewer #3 (Recommendations For The Authors):

      1. The gene at the center of this study (PBANKA_0902300) was identified in an earlier genetic screen by Russell et al. as being a female specific gene with essential role in transmission and named Fd2 (for female-defective 2). Since this name entered the literature first and is equally descriptive, the Fd2 name should be used instead of PFG to maintain clarity and avoid unnecessary confusion. Surprisingly, this study is neither cited nor acknowledged despite a preprint having been available since August of 2021. This should be remedied.

      Response: Thank you for the comment. We have added the paper by Russell et al. accordingly and mentioned the name FD2 in the revised manuscript. However, we have retained the use of PFG throughout the paper. We believe that this usage of PFG shouldn’t be confusing, as FD2 has only been used in one previous paper. We have added the following:

      “Through this screening, a gene encoding a 2709 amino acid protein with two regions highly conserved among Plasmodium was identified (PBANKA_0902300, designated as a partner of AP2-FG (PFG); Fig. 1A). This gene is one of the P. berghei genes that were previously identified as genes involved in female gametocyte development (named FD2), based on mass screening combined with single cell RNA-seq (ref).”

      2. While it isn't really important how the authors came to arrive at studying the function of Fd2, the rationale/approach given in the first paragraph of the result section seems far too broad to lead to Fd2, given that it lacks identifiable domains and many other ortholog sets exist across these species.

      Response: We selected this gene from the list of AP2-G targets as a candidate for a sequence-specific TF based on the hypothesis that the amino acid sequences of DNA-binding domains are highly conserved. We successfully identified two TFs (including PFG) using this method. However, there may be TFs that do not fit this hypothesis which are also targets of AP2-G. In fact, we were unable to identify HDP1 as a TF candidate, despite its being an AP2-G target.

      3. Fig. 1A-C: Gene IDs for the orthologs should be provided, as well as the methodology for generating the alignments.

      Response: We have added the gene IDs and method for alignment in the legend as follows:

      (A) Schematic diagram of PFG from P. berghei and its homologs in apicomplexan parasites. Regions homologous to Regions 1 and 2, which are highly conserved among Plasmodium species, are shown as yellow and blue rectangles, respectively. Nuclear localization signals were predicted using the cNLS mapper (http://nls-mapper.iab.keio.ac.jp/cgi-bin/NLS_Mapper_form.cgi). The gene IDs of P. berghei PFG, P. falciparum PFG, and their homologs in Toxoplasma gondii, Eimeria tenella and Vitrella brassicaformis are PBANKA_0902300, PF3D7_1146800, TGGT1_239670, ETH2_1252400, and Vbra_10234, respectively.

      (C) The amino acid sequences of Regions 1 and 2 from P. berghei PFG and its homologs from other apicomplexan parasites in (A) were aligned using the ClustalW program in MEGA X. The positions at which all these sequences have identical amino acids are indicated by two asterisks, and positions with amino acid residues possessing the same properties are indicated by one asterisk.

      4. Figure 2: The phenotype of the Fd2 knockout should be characterized more comprehensively.

      It remains unclear whether ∆Fd2 parasites generate the same number of females but these are defective upon fertilization or whether there is also a decrease in the number of female gametocytes. Is the defect just post-fertilization and zygotes lyse or are there fewer fertilization events? If so, is activation of female GCs affected?

      The number of male and female gametocytes should be quantified using sex-specific markers not affected by Fd2 knockout rather than providing a single image of each. The ability of ∆Fd2 GCs should also be evaluated.

      This is also important for the interpretation of Fig 2G. Is the down-regulation of the genes due to fewer female GCs or are the down-regulated genes only a subset of female-specific genes?

      Response: In PFG(-) parasites, the rate of conversion into zygotes of female gametocytes decreased, and zygotes had lost capacity for developing into ookinetes. This indicates that gametocyte development (i.e., the ability to egress the erythrocyte and to fertilize) and zygote development were both impaired. This phenotype is consistent with the observation that genes expressed in female gametocytes are broadly downregulated. PFG is a TF, and its disruption led to decreased expression of hundreds of female genes. Thus, the observed phenotype may be derived from combined decreased expression of these genes. We believe further detailed phenotypic analyses will not generate much novel information on this TF. Instead, RNA-seq data in PFG(-) parasites and the targetome have promise in helping to characterize the functions of this TF.

      5. Figure 3: what fraction of down-regulated genes have the Fd2 10mer motif?

      Response: Thank you for the question. We investigated the upstream binding motifs of these genes. Of the 279 significantly down-regulated genes (containing 165 targets), 161 genes harbor the motif (including nine-base motifs that lack one lateral base which is likely not essential for binding) in their upstream regions (within 1,200 bp from the first methionine codon). However, this result has not been described in the revised manuscript because it is more important whether these regions harbor PFG peaks (upstream motifs can exist without being involved in the binding of PFG).

      6. sAP2-FG (single) vs cAP2-FG (complex) nomenclature is confusing and possibly misleading since few TFs function in isolation and sAP2-FG likely functions in a complex that doesn't contain Fd2, possibly with another DNA binding protein that binds the TGCACA hexamer. The name for the distinct peaks should refer to the presence or absence of Fd2 in the complex, or maybe simply refer to them as complex A & B.

      Response: As shown in the DIP-seq analysis results, AP2-FG can bind the motif by itself. In contrast, AP2-FG must form a complex with PFG to bind to the ten-base motif. The complex and single forms are named according to this difference (the presence or absence of PFG) and used solely in its relation with PFG. We wrote “In the following, we refer to the form with PFG as cAP2-FG or the complex form, and the form without PFG as sAP2-FG or the single form.” We believe that the nomenclature has sufficient clarity. However, we have partially (underlined) revised certain sentences in the discussion section as follows.

      “As the expression of PFG increases via this mechanism, AP2-FG recruited by PFG (cAP2-FG) increases and eventually becomes predominant in the transcriptional regulation of female gametocytes.”

      “This suggests that the promoter of the CCP2 gene, which is a target of PFG only, is still active in AP2-FG(-)820 parasites.”

      We recently reported that the TGCACA motif is a cis-activation motif in early gametocytes and important for both male and female gametocyte development. Thus, we speculate that sAP2-FG is not involved in cis-activation by the TGCACA motif. The p-value of the six-base motif is indeed comparable to that of the five-base motif. However, p-values (calculated by Fisher’s exact test) for six-base motifs tend to be lower than those calculated for five-base motifs, because the population is much larger. We speculate that there is a sequence-specific TF that may be expressed in early gametocytes and bind this motif, independently of AP2-FG.
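      The dependence of Fisher's exact test on population size can be sketched as follows. This is a generic one-sided enrichment test written with the standard library, not the authors' motif-analysis pipeline; the function name and all counts are hypothetical. With the same proportions of motif-containing sequences, the larger population gives a much smaller p-value.

```python
from math import comb

def fisher_enrichment_p(hits_in_peaks, n_peaks, hits_in_background, n_background):
    """One-sided Fisher's exact test for enrichment.

    Returns the probability, under the hypergeometric null, of seeing at least
    `hits_in_peaks` motif-containing sequences among the `n_peaks` peak sequences.
    """
    total = n_peaks + n_background
    total_hits = hits_in_peaks + hits_in_background
    upper = min(n_peaks, total_hits)
    numerator = sum(
        comb(total_hits, k) * comb(total - total_hits, n_peaks - k)
        for k in range(hits_in_peaks, upper + 1)
    )
    return numerator / comb(total, n_peaks)

# Hypothetical counts: identical proportions, ten-fold different population sizes.
p_small = fisher_enrichment_p(6, 10, 20, 90)
p_large = fisher_enrichment_p(60, 100, 200, 900)
print(p_large < p_small)
```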

      7. I compared the overlap of peaks in the 4 ChIP-seq data sets:

      90% of the Fd2 peaks are shared with AP2-FG (binding at 24% of shared peaks is lost in ∆AP2-FG)

      10% are bound by Fd2 alone (binding at 35% of Fd2 is lost in ∆AP2-FG)

      75% of Fd2 peaks are bound independently of AP2-FG

      47% of AP2-FG peaks shared with Fd2 (binding at 71% of shared peaks is lost in ∆Fd2)

      53% of AP2-FG peaks are bound only by AP2-FG (but binding at 82% of AP2-FG only peaks is still lost in the ∆Fd2)

      Binding at 78% of all AP2-FG peaks is lost in ∆Fd2

      This indicates that much of AP2-FG binding, even in regions devoid of Fd2, still depends on Fd2. What are possible explanations for this?

      https://elife-rp.msubmit.net/eliferp_files/2023/04/03/00117573/00/117573_0_attach_10_17936_convrt.pdf

      Response: In the ChIP-seq of AP2-FG in the absence of PFG, 441 peaks are still called. This means that at least 441 binding sites for AP2-FG exist independently of PFG. This is a straightforward conclusion from our ChIP-seq data. On the other hand, simply subtracting the peaks of one ChIP-seq experiment from those of another (AP2-FG peaks minus PFG peaks) is not a precise method for determining sAP2-FG peaks. Peak calling is performed independently in each ChIP-seq experiment. Thus, the peaks remaining after subtraction can still include peaks that are actually common but were picked up differently during peak calling. Even for data from the same ChIP-seq experiment, markedly different numbers of peaks are called depending on the peak-calling conditions (in contrast, peaks common to two independent experiments increase the reliability of the data). To identify sAP2-FG peaks by comparing AP2-FG peaks with PFG peaks, one would have to increase the number of PFG peaks by lowering the peak-calling threshold until the number of peaks overlapping between AP2-FG and PFG saturates, and then subtract the overlapping peaks from the AP2-FG peaks. However, as described above, for estimating the number of sAP2-FG peaks, it would be better to perform ChIP-seq of AP2-FG in the absence of PFG.
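      The threshold dependence described in this response can be sketched with toy intervals. The peak coordinates, scores, thresholds, and helper names below are all hypothetical and do not correspond to any peak caller's interface; the sketch only shows that the number of apparently AP2-FG-only peaks changes with the threshold used to call PFG peaks.

```python
def overlaps(a, b):
    """True if two half-open intervals (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def subtract_peaks(query, reference):
    """Return the query peaks that overlap no reference peak."""
    return [q for q in query if not any(overlaps(q, r) for r in reference)]

def call_peaks(scored_peaks, threshold):
    """Keep candidate peaks whose score reaches the threshold."""
    return [peak for peak, score in scored_peaks if score >= threshold]

# Hypothetical AP2-FG peaks and scored PFG candidate peaks.
ap2fg_peaks = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 250)]
pfg_scored = [(("chr1", 150, 350), 40.0), (("chr1", 1050, 1150), 8.0)]

# A strict threshold misses the weak PFG peak, so its AP2-FG partner looks "unique".
strict = subtract_peaks(ap2fg_peaks, call_peaks(pfg_scored, 20.0))
lenient = subtract_peaks(ap2fg_peaks, call_peaks(pfg_scored, 5.0))
print(len(strict), len(lenient))
```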

      8. Possible explanations of why recombinant Fd2 doesn't bind the TGCACA hexamer. It would also be good to note that the GCTCA AP2-FG motif found in Fig4G is now a perfect match for the motif identified by protein binding microarray in Campbell et al.

      Response: It is not known what sequence recombinant PFG binds. The TGCACA motif is not enriched in PFG peaks. If the reviewer is referring to AP2-FG, our finding that the recombinant AP2 domain binds the five-base motif strongly suggests that other TFs recognize this motif. As described in our response to comment 9, we recently reported that TGCACA is a cis-activating sequence important for the normal development of both male and female gametocytes. Therefore, we currently speculate that this motif is a binding motif of other TFs and is independent of AP2-FG.

      We have mentioned the protein binding microarray data in the Results section as follows.

      “The most enriched motif matched well with the binding sequence of the AP2 domain of P. falciparum AP2-FG, which was reported by Campbell et al.”

      9. What might explain the strong enrichment for TGCACA in ChIPseq but not when pulled down by AP2-FG DBD: another binding partner? requires more of AP2-FG than just DBD?

      Response: As described above in our response to comment 6, we have recently submitted a preprint studying the roles of the remodeler subunit PbARID in gametocyte development. We reported that the remodeler subunit is recruited to the six-base motif and that the motif is a novel cis-activation element for early gametocyte development. We speculate that a proportion of AP2-FG targets are also targets of a TF that recognizes this motif and recruits the remodeler subunit. These two TFs may be involved in the regulation of early gametocyte genes but function independently.

      10. Calling DNA pulldown with recombinant AP2-FG DNA-binding domain DNA Immunoprecipitation sequencing (DIP-seq) is confusing since there are no antibodies involved. Describing it directly as a pulldown of fragmented DNA will be clearer to the reader.

      Response: Thank you for the comment. We have also recognized this discrepancy. However, we called the method DIP-seq because the original paper reporting this method used this name, wherein it did not use antibodies to capture the MBP-fusion recombinant protein. Our experiment was performed using essentially the same methods, and thus we retained the name.

      11. The legends and methods are very sparse and should include substantially more detail.

      Response: Thank you for the comment. We have revised the description of the FACS experimental method for clarity.

      12. BigWig files for all ChIPseq enrichment used for analysis in this study need to be provided.

      (two replicates each of: Fd2 in WT, Fd2 in ∆AP2-FG, AP2-FG in WT, AP2-FG in ∆Fd2)

      Response: We have deposited the BigWig files to GEO (GSE226028 and GSE114096).

      13. Tables of ChIP data need to have both summits and peaks and need to list nearest gene. Also the ChIPseq peaks for Fd2 are surprisingly broad (ChIP peaks are very large, e.g. 68% of Fd2 peaks (dataset2) are greater than 1000 bp) given its specificity for a long motif. Why is this?

      Response: We have revised Table S2 to include the nearest genes. We are unsure why peaks wider than 1,000 bp occur in such a high proportion. However, this proportion was also high in our previous ChIP-seq data. Therefore, we speculate that this is a tendency of peak-calling by MACS2. We did not use these values in this paper. For example, targets were predicted using peak summits, and binding motifs were calculated using the 100-base regions around peak summits.

      14. Figure 5E: The positions of the 10mer and 5mer motifs in the promoter should be indicated as well as the length of the promoter. Moreover, mutation of just the 5bp motifs would be valuable to understand if 10mer is sufficient for expression of the reporter.

      Response: Thank you for the comment. We have revised the figure accordingly. The majority of female-specific promoters only harbor ten-base motifs. Thus the ten-base motif is sufficient for evaluating reporter activity (i.e., it would function without five-base motifs).

      15. How is AP2-FG expression affected in ∆Fd2 and vice versa?

      Response: According to our previous microarray data, PFG expression was not significantly downregulated by disruption of AP2-FG. This may be because PFG transcriptionally activates itself through a positive feedback loop after being induced by AP2-G. Similarly, according to our present study, AP2-FG expression was not downregulated by PFG disruption. This may be because AP2-FG is transcriptionally activated by AP2-G.

      16. The single cell data in Russell et al. could easily be used to indicate the order of expression.

      Response: Determining the expression order of gametocyte TFs via the single cell RNA-seq data from Russell et al. is difficult, because only a small number of parasite cells were considered to be in the early gametocyte stage in this study. This is because the parasites were cultured for 24 h before the analysis. The analysis suggested by the reviewer may be possible via single cell RNA-seq, but the experiments must be performed with more focus on the early gametocyte stage.

      17. A discussion of the implication of P. falciparum transmission would be appreciated.

      Response: Thank you for the comment. We have added the following to the Discussion section:

      “P. falciparum gametocytes require 9-12 days to mature, which is much longer than that of P. berghei. Meanwhile, it has been reported that the ten-base motif is highly enriched in the upstream regions of female-specific genes also in P. falciparum. Thus, despite the difference in maturation periods, PFG is likely to play an important role in the transcriptional regulation of female P. falciparum gametocyte development.”

      18. The lack of identifiable DNA binding domains in Fd2 is intriguing given the strong sequence-specificity. Do the authors think they have identified a new DNA-binding fold?

      Alphafold of the orthologs with contiguous regions 1&2 might offer insight.

      Response: We speculate that these regions function as DNA binding domains. We performed analysis using AlphaFold2 according to the comment. However, the predicted structure of the region was not similar to any other canonical DNA-binding domains. Thus, it may be a novel DNA-binding fold as the reviewer mentioned. Further studies such as binding assays using recombinant proteins would be necessary to confirm this, but thus far we have not successfully obtained the soluble proteins of these regions.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Author response:

      Reviewer #1:

      The main objective of this study is to achieve the development of a synthetic autotroph using adaptive laboratory evolution. To accomplish this, the authors conducted chemostat cultivation of engineered E. coli strains under xylose-limiting conditions and identified autotrophic growth and the causative mutations. Additionally, the mutational mechanisms underlying these causative mutations were also explored with drill down assays. Overall, the authors demonstrated that only a small number of genetic changes were sufficient (i.e., 3) to construct an autotrophic E. coli when additional heterologous genes were added. While natural autotrophic microorganisms typically exhibit low genetic tractability, numerous studies have focused on constructing synthetic autotrophs using platform microorganisms such as E. coli. Consequently, this research will be of interest to synthetic biologists and systems biologists working on the development of synthetic autotrophic microorganisms. The conclusions of this paper are mostly well supported by appropriate experimental methods and logical reasoning. However, further experimental validation of the mutational mechanisms involving rpoB and crp would enhance readers' understanding and provide clearer insights, despite acknowledgement that these genes impact a broad set of additional genes. Additionally, a similar study, 10.1371/journal.pgen.1001186, where pgi was deleted from the E. coli genome and evolved to reveal an rpoB mutation is relevant to this work and should be placed in the context of the presented findings.

      We thank the reviewer for pointing this study out. It is very interesting that a mutation in a similar region in RpoB was observed in a related context of Pgi loss of activity. We have added a reference to this study in our text (Page 11, line 21).

      The authors addressed rpoB and crp as one unit and performed validation. They cultivated the mutant strain and wild type in a minimal xylose medium with or without formate, comparing their growth and NADH levels. The authors argued that the increased NADH level in the mutant strain might facilitate autotrophic growth. Although these phenotypes appear to be closely related, their relationship cannot be definitively concluded based on the findings presented in this paper alone. Therefore, one recommendation is to explore investigating transcriptomic changes induced by the rpoB and crp mutations. Otherwise, conducting experimental verification to determine whether the NADH level directly causes autotrophic growth would provide further support for the authors' claim.

      We appreciate the valuable comment and agree that the work was lacking such an analysis. Due to various reasons we have opted to use a proteomic approach which we feel fulfills the same purpose as the transcriptomics suggestion. We found interesting evidence in up-regulation of the fdoGH operon (comprising the native formate dehydrogenase O enzyme complex) which could indicate why there is an increase in NADH/NAD+ levels. We also hypothesize that this upregulation might be important more generally by drawing comparisons to natural chemo-autotrophs.

      Further experimental work (which we were not able to include in the current study) could help validate this link by deleting fdoGH and observing a loss of phenotype and, on the flip side, directly overexpressing the fdoGH operon and observing an increase in the NADH/NAD+ ratio. Indeed, if this overexpression were to prove sufficient for achieving an autotrophic phenotype without the mutations in the global transcription regulators, it would be a much more transparent design.

      We have added a section titled "Proteomic analysis reveals up-regulation of rPP cycle and formate-associated genes alongside down-regulation of catabolic genes" to the Results based on this analysis.

      • It would be beneficial to provide a more detailed explanation of the genetic background before the evolution stage, specifically regarding the ∆pfk and ∆zwf mutations. Furthermore, it is suggested to include a figure that provides a comprehensive depiction of the reductive pentose phosphate pathway and the bypass pathway. These will help readers grasp the concept of the "metabolic scaffold" as proposed by the authors.

      We agree with the reviewer that this could be helpful and we added a reference to the original paper Gleizer et al. 2019 that reported this design and also includes the relevant figure. We feel that the figure should not be added to the current manuscript as we continue to show that this design is not relevant in the context of the three reported mutations and such a figure could distract the attention of the reader from the main takeaways of the current study.

      • Despite the essentiality of the rpoB mutation (A1245V) to the autotrophic phenotype in the final strain, the inclusion of this mutation in step C1 does not appear to be justified. According to line 37 on page 3, the authors chose to retain the unintended mutation in rpoB based on its essentiality to the phenotype observed in other evolved strains. However, it should be noted that the mutations found in the evolved strain I, II, and III (P552T or D866E) were entirely different from the unintended mutation (A1245V) during genetic engineering. This aspect should be revised to avoid confusion among readers.

      Thank you for pointing this issue out. We have added a clarification in the text (page 4, line 7) to avoid such confusion. We believe this point is much clearer now.

      The rpoB mutation which was shown to be essential in the study is indeed known to be common in ALE experiments in E. coli. Thus, I searched the different rpoB mutations in ALEdb in E. coli and I was able to find a similar mutation in a study where pgi was knocked out and then evolved. https://doi.org/10.1371/journal.pgen.1001186 This study seems very relevant given that pgi was a key mutation in the compact set of this work and the section "Modulation of a metabolic branch-point activity increased the concentration of rPP metabolites" informs that loss of function mutations in pgi were also found. The findings of this study should thus be put in the context of the previous related ALE study. I would recommend a similar analysis of crp mutations from studies in ALEdb to see if there are similar mutations in this gene as well or if this is a unique mutation.

      We thank the reviewer for bringing this publication to our attention. We have addressed this observation in the main text (page 11, line 21). We agree that it could have some connection to the pgi mutation, yet we would not want to overspeculate about this role, as we also found the exact same mutation (A1245V) as an adaptation to higher temperature in another E. coli study (Tenaillon et al. 2012). We would like to bring forward the fact that the two reported rpoB mutations are always accompanied by another mutation with pleiotropic effects, either in the transcription factor Crp or in another RNA polymerase subunit (e.g., RpoC). As such, many epistatic effects could occur, one of which we also report here on page 13, line 18. In conclusion, although there could be a connection between the rpoB and pgi mutations, it could be a mere coincidence and the two mutations could exhibit two distinct roles in two distinct phenotypes.

      We would also like to thank the reviewer for suggesting a similar analysis for crp. We found another mutation at a nearby residue with strong adaptive effects and have mentioned it in our main text.

      Can the typical number of mutations found in a given ALE experiment be directly compared to those found in this study? It seems like a retrospective analysis of other ALE studies to show how many mutations typically occur in an ALE study and sets which were found to be causal to reproduce the phenotype of interest (through similar reverse engineering in the starting strain) should be presented. Again, the authors cite ALEdb which should provide direct numbers of mutations found in similar ALE studies with E. coli and one could then examine them to find sets of clearly causal mutations which recreate phenotypes of interest. Such an analysis would go a long way in supporting the main finding of "small number" of mutations.

      Discussion, page 12, line 42. "This could serve as a promising strategy for achieving minimally perturbed genotypes in future metabolic engineering attempts". There is an entire body of work around growth-coupled production which can be predicted and evolved with a genome-scale metabolic model and ALE. Thus, if this statement is going to be made, relevant studies should be cited and placed in context.

      The reviewer raises an important point which could indeed yield an interesting perspective. However, it would be difficult to perform this comparison in practice since many of the studies published on ALEdb have not isolated essential mutations from other mutation incidents nor have they determined the role of each mutation in the reported phenotypes. For example, many ALE trajectories include a hypermutator that greatly increases the number of irrelevant mutations and it is nearly impossible to sieve through them to find an essential set.

      Moreover, it is hard to compare the “level of difficulty” of achieving one phenotype over another. We therefore feel that, even though such an analysis would be insightful, it requires an amount of work that is outside the scope of this study.

      Finally, we would like to highlight our iterative approach: isolating the relevant consensus mutations and repeating this process until no evolution process is required. We are not aware of prior studies that used this approach.

      We now clarified what we mean by "promising strategy" in the discussion in order to avoid any false claims about novelty (page 16 line 32): "Using metabolic growth-coupling as a temporary 'metabolic scaffold' that can be removed, could serve as a promising strategy for achieving minimally perturbed genotypes in future metabolic engineering attempts."

      Reviewer #2:

      Synthetic autotrophy of biotechnologically relevant microorganisms offers exciting chances for CO2 neutral or even CO2 negative production of goods. The authors' lab has recently published an engineered and evolved Escherichia coli strain that can grow on CO2 as its only carbon source. Lab evolution was necessary to achieve growth. Evolved strains displayed tens of mutations, of which likely not all are necessary for the desired phenotype.

      In the present paper the authors identify the mutations that are necessary and sufficient to enable autotrophic growth of engineered E. coli. Three mutations were identified, and their phenotypic role in enhancing growth via the introduced Calvin-Benson-Bassham cycle were characterized. It was demonstrated that these mutations allow autotrophic growth of E. coli with the introduced CBB cycle without any further metabolic intervention. Autotrophic growth is demonstrated by 13C labelling with 13C CO2, measured in proteinogenic amino acids. In Figures 2B and S1, the labeling data are shown, with an interval of the "predicted range under 13CO2".

      Here, the authors should describe how this interval was derived.

      The methodology is clearly described and appropriate.

      The present results will allow other labs to engineer E. coli and other microorganisms further to assimilate CO2 efficiently into biomass and metabolic products. The importance is evident in the opportunity to employ such strain in CO2 based biotech processes for the production of food and feed protein or chemicals, to reduce atmospheric CO2 levels and the consumption of fossil resources.

      Please describe in the methodology how the interval of the predicted range of 13C labeling was derived for Figures 2B and S1. Was it calculated by the dilution factor during 4 generations, or did you predict the label incorporation individually with a metabolic model?

      The text needs careful editing, some sentences are incomplete and there are frequent inconsistencies in writing metabolites and enzymes.

      P2L6: unclear sentence (incomplete?)

      P2L19: pastoris with lower case "p"

      P2L40: incomplete sentence

      P2L42: here, and at many other places, the writing of RuBisCO needs to be aligned. It is an abbreviation and should begin with a capital letter. Most commonly it is written as RuBisCO which I would suggest - please unify throughout the text.

      P3L3: formate dehydrogenase ... metabolites and enzymes with lower case letter. And, no hyphen here.

      P5L4: delete the : after unintentionally

      P6L16: carboxylation of RuBP (it is not CO2 that is carboxylated - if any, CO2 is carboxylating)

      P7L25: phosphoglucoisomerase (lower case)

      P8L5: in line

      P8L9: part of glycolysis/ ...

      P10L4: pentose phosphates (lower case, no hyphen).

      P10L4: all metabolites lower case

      P12L28: incomplete sentence

      P18L4: Escherichia coli in italics

      P18L15: Pseudomonas sp. in italics

      P18L16: ... promoter and with a strong ...

      P20, chapter Metabolomics: put the numbers of 12C and 13C in superscript

      P23L9: pentose phosphates; all metabolites in lower case (as above)

      P23: all 12C and 13C with superscript numbers.

      Response to reviewer #2:

      We thank the reviewer for their comments, and for pointing out the need to clarify how we derived the predicted range of 13C labeling. We edited the text accordingly, and added the relevant calculation to the methods section (under the “13C Isotopic labeling experiment”). We would like to also thank the reviewer for the required text improvements, which were implemented. 
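      The dilution-based option raised by the reviewer can be sketched as follows. This is only an illustration of that one option, not the calculation the authors added to their methods: it assumes an entirely unlabeled inoculum (ignoring natural 13C abundance), that all new biomass carbon comes from the labeled CO2, and that unlabeled carbon is halved with every doubling. The function name and parameters are hypothetical.

```python
def predicted_labeled_fraction(generations: float, tracer_purity: float = 1.0) -> float:
    """Expected 13C fraction of biomass carbon after growth on labeled CO2.

    Assumptions (illustrative only): the inoculum is fully unlabeled, every
    doubling halves the share of inoculum-derived carbon, and all newly fixed
    carbon carries the label at `tracer_purity`.
    """
    if generations < 0:
        raise ValueError("generations must be non-negative")
    if not 0.0 <= tracer_purity <= 1.0:
        raise ValueError("tracer_purity must lie between 0 and 1")
    # Share of carbon still originating from the unlabeled inoculum.
    unlabeled_remaining = 0.5 ** generations
    return tracer_purity * (1.0 - unlabeled_remaining)


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, predicted_labeled_fraction(n))
```

      Under these assumptions, four generations on a pure tracer leave 1/16 of the carbon unlabeled, so the expected labeled fraction is 0.9375. A metabolic-model-based prediction, the reviewer's second option, would instead yield a separate value for each amino acid.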

      Reviewer #3:

      The authors previously showed that expressing formate dehydrogenase, rubisco, carbonic anhydrase, and phosphoribulokinase in Escherichia coli, followed by experimental evolution, led to the generation of strains that can metabolise CO2. Using two rounds of experimental evolution, the authors identify mutations in three genes - pgi, rpoB, and crp - that allow cells to metabolise CO2 in their engineered strain background. The authors make a strong case that mutations in pgi are loss-of-function mutations that prevent metabolic efflux from the reductive pentose phosphate autocatalytic cycle. The authors also argue that mutations in crp and rpoB lead to an increase in the NADH/NAD+ ratio, which would increase the concentration of the electron donor for carbon fixation. While this may explain the role of the crp and rpoB mutations, there is good reason to think that the two mutations have independent effects, and that the change in NADH/NAD+ ratio may not be the major reason for their importance in the CO2-metabolising strain.

      We thank the reviewer for their comments and constructive feedback.

      We agree that there is probably a broader effect caused by the rpoB and crp mutations, besides the change in the NADH/NAD+ ratio. Hence, we performed a proteomics analysis, comparing the rpoB and crp mutations on a WT background to an autotrophic E. coli, searching for a mutual change in both strains compared to their "ancestors". We found up-regulation of rPP cycle and formate-associated genes, and down-regulation of catabolic genes. We added a section dedicated to this matter under the title "Proteomic analysis reveals up-regulation of rPP cycle and formate-associated genes alongside down-regulation of catabolic genes".

      Specific comments:

      1. Deleting pgi rather than using a point mutation would allow the authors to more rigorously test whether loss-of-function mutants are being selected for in their experimental evolution pipeline. The same argument applies to crp.

      We appreciate this recommendation and indeed tried to delete pgi, but the genetic manipulation knocked out other genes along with pgi (pepE, rluF, yjbD, lysC). In the time available to us, we therefore could not confidently determine whether the deletion alone is sufficient and can replace the mutation.

      Regarding crp, we do not think there is a reason to believe the mutation is a loss-of-function. In any case, the proteomics-based characterization of the crp mutation is now included in the SI.

      1. Page 10, lines 10-11, the authors state "Since Crp and RpoB are known to physically interact in the cell (26-28), we address them as one unit, as it is hard to decouple the effect of one from the other". CRP and RpoB are connected, but the authors' description of them is misleading. CRP activates transcription by interacting with RNA polymerase holoenzyme, of which the Beta subunit (encoded by rpoB) is a part. The specific interaction of CRP is with a different RNA polymerase subunit. The functions of CRP and RpoB, while both related to transcription, are otherwise very different. The mutations in crp and rpoB are unlikely to be directly functionally connected. Hence, they should be considered separately.

      Indeed, the fact that the proteins are interacting in the cell does not necessarily mean that the mutations are functionally connected. We therefore added as further justification in the new section:

      "As far as we know, the mutations in the Crp and RpoB genes affect the binding of the RNA polymerase complex to DNA and/or its transcription rates. Depending on the transcribed gene target, the effect of the two mutations might be additive, antagonistic, or synergistic. Since each one of these mutations individually (in combination with the pgi mutation) is not sufficient to achieve autotrophic growth, it is reasonable to assume that only the target genes whose levels of expression change significantly in the double-mutant are the ones relevant for the autotrophic phenotype”.

      In our proteomics analysis we considered each mutation separately. We found that in some cases the two mutations together have an additive effect, but in other cases the two mutations together affect the proteome differently from each mutation alone. Since both mutations are essential to the phenotype, we decided to address the two mutations as one unit for the physiological and metabolic experiments.

      1. A Beta-galactosidase assay would provide a very simple test of CRP H22N activity. There are also simple in vivo and in vitro assays for transcription activation (two different modes of activation) and DNA-binding. H22 is not near the DNA-binding domain, but may impact overall protein structure.

      The mutation is located in “Activating Region 2”, which interacts with RNA polymerase. We tried an in vivo assay to determine the CRP H22N activity and obtained inconclusive results. We believe the proteomics analysis serves as a good method for understanding the global effect of the mutation.

      1. There are many high-resolution structures of both CRP and RpoB (in the context of RNA polymerase). The authors should compare the position of the sites of mutation of these proteins to known functional regions, assuming H22N is not a loss-of-function mutation in crp.

      We added a supplementary figure regarding the structural location of the two mutations, where it is demonstrated that crp H22N is located in a region interacting with the RNA polymerase and rpoB A1245V is located in proximity to regions interacting with the DNA.

      1. RNA-seq would provide a simple assay for the effects of the crp and rpoB mutations. While the precise effect of the rpoB mutation on RNA polymerase function may be hard to discern, the overall impact on gene expression would likely be informative.

      Indeed, we agree that an omics approach to infer the global effect of these mutations is beneficial. We opted to use a proteomics approach and think it serves the purpose of clarifying the final, downstream effect on the cell.

      1. Page 2, lines 40-45, the authors should more clearly explain that the deletion of pfkA, pfkB and zwf was part of the experimental evolution strategy in their earlier work (Gleizer et al., 2019), and not a new strategy in the current study.

      We thank you for pointing this out, and edited the text accordingly.

      1. Page 3, line 27. Why did the authors compare the newly acquired mutants to only two mutants from the earlier work, not all 6?

      The 6 clones that were isolated in Gleizer et al. had 2 distinct mutation profiles. During the isolation process the lineage split into two groups. Three of the 6 clones (clones 1, 2, 6) came from the same ancestor, and the other three (clones 3, 4, 5) came from another ancestor. Hence, the clones within each group shared almost all of their mutations (see Venn diagram). For our comparison, we decided to use the representative with the highest number of mutations from each group (clones 5 and 6).

      Author response image 1.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The manuscript by Rühling et al analyzes the mode of entry of S. aureus into mammalian cells in culture. The authors propose a novel mechanism of rapid entry that involves the release of calcium from lysosomes via NAADP-stimulated activation of TPC1, which in turn causes lysosomal exocytosis; exocytic release of lysosomal acid sphingomyelinase (ASM) is then envisaged to convert exofacial sphingomyelin to ceramide. These events not only induce the rapid entry of the bacteria into the host cells but are also described to alter the fate of the intracellular S. aureus, facilitating escape from the endocytic vacuole to the cytosol.

      Strengths:

      The proposed mechanism is novel and could have important biological consequences.

      Weaknesses:

      Unfortunately, the evidence provided is unconvincing and insufficient to document the multiple, complex steps suggested. In fact, there appear to be numerous internal inconsistencies that detract from the validity of the conclusions, which were reached mostly based on the use of pharmacological agents of imperfect specificity.

      We thank the reviewer for the detailed evaluation of our manuscript. We will address the criticism below.

      We agree with the reviewer that many of the experiments presented in our study rely on the usage of inhibitors. However, we want to emphasize that the main conclusion (invasion pathway affects the intracellular fate/phagosomal escape) was demonstrated without the use of inhibitors or genetic ablation in two key experiments (Figure 5D/E). These experiments were in line with the results we obtained with inhibitors (amitriptyline [Figure 4D], ARC39, PCK310 [Figure 4C] and Vacuolin-1 [Figure 4E]). Importantly, the hypothesis was also supported by another key experiment, in which we showed that the intracellular fate of bacteria is affected by removal of SM from the plasma membrane before invasion, but not by removal of SM from phagosomal membranes after bacterial internalization (Figure 5A-C). Taken together, we thus believe that the main hypothesis is strongly supported by our data.

      Moreover, we either used different inhibitors for the same molecule (ASM was inhibited by ARC39, amitriptyline and PCK310 with similar outcome) or supported our hypothesis with gene-ablated cell pools (TPC1, Syt7, SARM1), as we will point out in more detail below.

      Firstly, the release of calcium from lysosomes is not demonstrated. Localized changes in the immediate vicinity of lysosomes need to be measured to ascertain that these organelles are the source of cytosolic calcium changes. In fact, 9-phenantrol, which the authors find to be the most potent inhibitor of invasion and hence of the putative calcium changes, is not a blocker of lysosomal calcium release but instead blocks plasmalemmal TRPM4 channels. On the other hand, invasion is seemingly independent of external calcium. These findings are inconsistent with each other and point to non-specific effects of 9-phenantrol. The fact that ionomycin decreases invasion efficiency is taken as additional evidence of the importance of lysosomal calcium release. It is not clear how these observations support involvement of lysosomal calcium release and exocytosis; in fact treatment with the ionophore should itself have induced lysosomal exocytosis and stimulated, rather than inhibited invasion. Yet, manipulations that increase and others that decrease cytosolic calcium both inhibited invasion.

      With respect to lysosomal Ca<sup>2+</sup> release, we agree with the reviewer that direct visual demonstration of lysosomal Ca<sup>2+</sup> release upon infection would improve the manuscript. We therefore performed live cell imaging to visualize lysosomal Ca<sup>2+</sup> release using a previously published method.[1] The approach is based on two dextran-coupled fluorophores that are incubated with host cells. The dyes are endocytosed and eventually stain the lysosomes. One of the dyes, Rhod-2, is Ca<sup>2+</sup>-sensitive and can be used to estimate the lysosomal Ca<sup>2+</sup> content. The second dye, AF647, is Ca<sup>2+</sup>-insensitive and is used to visualize the lysosomes. A decrease in the Rhod-2/AF647 ratio within the lysosomes indicates lysosomal Ca<sup>2+</sup> release. We monitored lysosomal Ca<sup>2+</sup> content during S. aureus infection with this method (Author response image 1 and Author response video 1). However, the lysosomes are very dynamic, and it is challenging to monitor the fluorescence intensities over time. Thus, quantitative measurements are not possible with our methodology, and we decided not to include these data in the main manuscript. Nevertheless, one could speculate that lysosomal Ca<sup>2+</sup> content in the selected ROI (Author response image 1 and Author response video 1) decreases upon attachment of S. aureus to the host cells, as indicated by a decrease in the Rhod-2/AF647 ratio.

      Author response image 1.

      Lysosomal Ca<sup>2+</sup> imaging during S. aureus infection. The lysosomes of HuLEC were stained with two dextran-coupled fluorescent dyes: the Ca<sup>2+</sup>-sensitive dye Rhod-2 and the Ca<sup>2+</sup>-insensitive dye AF647. Cells were infected with fluorescent S. aureus JE2 and monitored by live cell imaging (see Author response video 1). The Rhod-2 and AF647 intensities were measured close to a S. aureus-host contact site, and the ratio of Rhod-2 to AF647 fluorescence intensity was calculated.
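      The ratiometric readout described above can be sketched as follows. This is a minimal illustration, not the authors' analysis pipeline: the function names, the constant background value, and the normalization to the first frames are all assumptions, and the authors themselves note that lysosome motility prevented a quantitative measurement.

```python
def ratio_trace(sensitive, reference, background=0.0):
    """Per-frame ratio of a Ca2+-sensitive dye to a Ca2+-insensitive dye.

    `sensitive` and `reference` are equally long sequences of mean ROI
    intensities (e.g. Rhod-2 and AF647); `background` is a hypothetical
    constant offset subtracted from both channels.
    """
    if len(sensitive) != len(reference):
        raise ValueError("both channels need the same number of frames")
    ratios = []
    for s, r in zip(sensitive, reference):
        denominator = r - background
        if denominator <= 0:
            raise ValueError("reference intensity must exceed background")
        ratios.append((s - background) / denominator)
    return ratios


def relative_to_baseline(ratios, n_baseline=1):
    """Express each ratio relative to the mean of the first n_baseline frames."""
    if not 1 <= n_baseline <= len(ratios):
        raise ValueError("n_baseline must be between 1 and the number of frames")
    baseline = sum(ratios[:n_baseline]) / n_baseline
    if baseline == 0:
        raise ValueError("baseline ratio is zero")
    return [value / baseline for value in ratios]


if __name__ == "__main__":
    trace = ratio_trace([10.0, 8.0, 5.0], [10.0, 10.0, 10.0])
    print(relative_to_baseline(trace))
```

      Dividing by the Ca<sup>2+</sup>-insensitive channel corrects for changes in dye amount or focus within the ROI, so a falling normalized ratio would be consistent with Ca<sup>2+</sup> leaving the lysosomes rather than with the lysosomes simply moving out of the ROI.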

      As to the TRPM4 involvement in S. aureus host cell internalization, it has been reported that TRPM4 is activated by cytosolic Ca<sup>2+</sup>. However, the channel conducts monovalent cations such as K<sup>+</sup> or Na<sup>+</sup> but is impermeable to Ca<sup>2+</sup> [2, 3]. The following observations support this:

      i) S. aureus invasion is dependent on intracellular Ca<sup>2+</sup>, but is independent of extracellular Ca<sup>2+</sup> (Figure 1A).

      ii) 9-phenantrol treatment reduces S. aureus internalization by host cells, illustrating the dependence of this process on TRPM4 (data removed from the manuscript). We therefore hypothesize that TRPM4 is activated by Ca<sup>2+</sup> released from lysosomes (see above).

      TRPM4 is localized to focal adhesions and is connected to the actin cytoskeleton[4, 5], a requisite for host cell entry of S. aureus.[6, 7] This suggests an important function of TRPM4 in uptake of S. aureus in general, although TRPM4 need not be involved exclusively in the rapid uptake pathway.

      TRPM4 itself is not permeable to Ca<sup>2+</sup> but is activated by the cation. Thus, it is unlikely to cause lysosomal exocytosis. The stronger reduction of bacterial uptake by 9-phenantrol compared to Ned19 may therefore be caused by the involvement of TRPM4 in additional pathways of S. aureus host cell entry that involve the association of TRPM4 with focal adhesions or, as pointed out by the reviewer, by unspecific side effects of 9-phenantrol that we currently cannot exclude. However, we think that experiments with 9-phenantrol distract from the main story (lysosomal Ca<sup>2+</sup> and exocytosis) and might be confusing for the reader. We thus removed all data and discussion concerning 9-phenantrol from the revised manuscript.

      Regarding the reduced S. aureus invasion after ionomycin treatment, we agree with the reviewer that ionomycin is known to lead to lysosomal exocytosis, as was previously shown by others[8] as well as our laboratory[9].

      We hypothesized that pretreatment with ionomycin would trigger lysosomal exocytosis and thus would reduce the pool of lysosomes that can undergo exocytosis before host cells are contacted by S. aureus. As a result, we should observe a marked reduction of S. aureus internalization in such “lysosome-depleted cells”, if the lysosomal exocytosis is coupled to bacterial uptake. Our observation of reduced bacterial internalization after ionomycin treatment supports this hypothesis.

      However, ionomycin treatment and S. aureus infection of host cells are distinct processes.  

      While ionomycin results in strong global and non-directional lysosomal exocytosis of all “releasable” lysosomes (~5-10 % of all lysosomes according to previous observations)[8], we hypothesize that lysosomal exocytosis upon contact with S. aureus only involves a small proportion of lysosomes at host-bacteria contact sites. This is supported by experiments that demonstrate that ~30% of the lysosomes that are released by ionomycin treatment are exocytosed during S. aureus infection (see below and Figure 2, A-C). We added these new data as well as a corresponding section to the discussion (line 563 ff). Moreover, we moved the data obtained with ionomycin to Figure 2E and described our idea behind this experiment more precisely (line 166 ff).

      The proposed role of NAADP is based on the effects of "knocking out" TPC1 and on the pharmacological effects of Ned-19. It is noteworthy that TPC2, rather than TPC1, is generally believed to be the primary TPC isoform of lysosomes. Moreover, the gene ablation accomplished in the TPC1 "knockouts" is only partial and rather unsatisfactory. Definitive conclusions about the role of TPC1 can only be reached with proper, full knockouts. Even the pharmacological approach is unconvincing because the high doses of Ned-19 used should have blocked both TPC isoforms and presumably precluded invasion. Instead, invasion is reduced by only ≈50%. A much greater inhibition was reported using 9-phenantrol, the blocker of plasmalemmal calcium channels. How is the selective involvement of lysosomal TPC1 channels justified?

      As to partial gene ablation of TPC1: To avoid clonal variance, we usually perform pool sorting to obtain a cell population that predominantly contains cells deficient in the target (here, TPC1) but also a small proportion of wildtype cells, as seen by the residual TPC1 protein on the Western blot. We observe a significant reduction in bacterial uptake in this cell pool, suggesting that the uptake reduction in a pure K.O. population may be even more pronounced.

      As to the inhibition by Ned19: 

      The scale of invasion reduction upon Ned19 treatment (50%, Figure 1B) is comparable with the reduction caused by other compounds that influence the ASM-dependent pathway (such as amitriptyline, ARC39 [Figure 2G], BAPTA-AM [Figure 1A], Vacuolin-1 [Figure 2D], β-toxin [Figure 2L] and ionomycin [Figure 2E]). Further, the partial reduction of invasion is most likely due to the concurrent activity of multiple internalization pathways which are not all targeted by the used compounds and which we briefly discuss in the manuscript.

      We agree with the reviewer that Ned19 inhibits TPC1 and TPC2. Since ablation of TPC1 reduced invasion of S. aureus, we concluded that TPC1 is important for S. aureus host cell invasion. We also agree with the reviewer that a role for TPC2 cannot be excluded. We clarified this in the revised manuscript (line 552). It needs to be noted, however, that deficiency in either TPC1 or TPC2 alone was sufficient to prevent Ebola virus infection[10], which is in line with our observations.

      In order to address the role of TPC2 for this review process, we were kindly gifted TPCN1/TPCN2 double knock-out HeLa cells by Norbert Klugbauer (Freiburg, Germany), which we tested for S. aureus internalization. We found that invasion was reduced in these cell lines, supporting a role of lysosomal Ca<sup>2+</sup> release in S. aureus host cell entry and a role for both TPC channels (Author response image 2, see end of the document). Since we did not have a single TPCN2 knock-out available, we decided to exclude these data from the main manuscript.

      Author response image 2.

      Invasion efficiency is reduced in TPC1/TPC2 double K.O. HeLa cells. Invasion efficiency of S. aureus JE2 was determined in TPC1/TPC2 double K.O. cells after 10 and 30 min. Results were normalized to the parental HeLa WT cell line (set to 100 %).  

      Invoking an elevation of NAADP as the mediator of calcium release requires measurements of the changes in NAADP concentration in response to the bacteria. This was not performed. Instead, the authors analyzed the possible contribution of putative NAADP-generating systems and reported that the most active of these, CD38, was without effect, while the elimination of SARM1, another potential source of NAADP, had a very modest (≈20%) inhibitory effect that may have been due to clonal variation, which was not ruled out. In view of these data, the conclusion that NAADP is involved in the invasion process seems unwarranted.

      Our results from two independent experimental set-ups (Ned19 [Figure 1B] and TPC1 K.O. [Figure 1C & Figure 2N]) indicate the involvement of NAADP in the process. Together with the metabolomics unit at the Biocenter Würzburg, we attempted to measure cellular NAADP levels, however, this proved to be non-trivial and requires further optimization. However, we can rule out clonal variation in the SARM1 mutant since experiments were conducted with a cell pool as described above in order to avoid clonal variation of single clones.

      The mechanism of NAADP biosynthesis is still debated. CD38 was the first enzyme discovered to be capable of producing NAADP. However, it requires acidic pH to produce NAADP[11], which does not match the characteristics of a cytosolic NAADP producer. HeLa cells do not express CD38 and hence, it is not surprising that inhibition of CD38 had no effect on S. aureus invasion in HeLa cells. However, NAADP production by HeLa cells was observed in the absence of CD38[12]. Thus, CD38-independent NAADP generation is likely. SARM1 can produce NAADP at neutral pH[13] and is expressed in HeLa cells, thus providing a more promising candidate.

      We agree with the reviewer that the reduction of S. aureus internalization after ablation of SARM1 is less pronounced than in our other experiments. This may be explained by NAADP originating from other enzymes, such as the recently discovered DUOX1, DUOX2, NOX1 and NOX2[14], which, with the exception of DUOX2, are expressed at low levels in HeLa cells. We added this to the discussion in the revised manuscript (line 579).

      We can, however, rule out clonal variation for the inhibitory effect. As stated above we generated K.O. cell pools specifically to avoid inherent problems of clonality. Thus, we also detect some residual wildtype cells within our cell pools.  

      The involvement of lysosomal secretion is, again, predicated largely on the basis of pharmacological evidence. No direct evidence is provided for the insertion of lysosomal components into the plasma membrane, or for the release of lysosomal contents to the medium. Instead, inhibition of lysosomal exocytosis by vacuolin-1 is the sole source of evidence. However, vacuolin-1 is by no means a specific inhibitor of lysosomal secretion: it is now known to act primarily as a PIKfyve inhibitor and to cause massive distortion of the endocytic compartment, including gross swelling of endolysosomes. The modest (20-25%) inhibition observed when using synaptotagmin 7 knockout cells is similarly not convincing proof of the requirement for lysosomal secretion.

      We agree with the reviewer that the manuscript will benefit from a functional analysis of lysosomal exocytosis and therefore conducted assays to investigate exocytosis in the revised manuscript. We previously i) showed, by addition of specific antisera, that LAMP1 is transiently exposed on the plasma membrane during ionomycin and pore-forming toxin challenge and ii) demonstrated the release of ASM activity into the culture medium under these conditions.[9] However, neither measurement is compatible with S. aureus infection, since LAMP1 antibodies are also bound non-specifically by protein A and other IgG-binding proteins on the S. aureus surface, which would bias the results. Since protein A may also serve as an adhesin in the investigated pathway, we cannot simply delete the ORF without changing other aspects of staphylococcal virulence. Further, FBS contains an ASM background activity that impedes activity measurements in cell culture medium. We previously removed this background activity by a specific heat-inactivation protocol.[9] However, S. aureus invasion is strongly reduced in culture medium containing this heat-inactivated FBS.

      We therefore developed a luminescence assay based on split NanoLuc luciferase that enables detection of LAMP1 exposed on the plasma membrane without usage of antibodies (Figure 2, A-C). We added a section on the assay in the revised manuscript. Briefly, we generated reporter cells by fusing a short peptide fragment of NanoLuc called HiBiT between the signal peptide and the mature luminal domain of LAMP1 and stably expressed the resulting protein in HeLa cells by lentiviral transduction. The LgBiT protein domain of NanoLuc luciferase (Promega) as well as the substrate Furimazine are added to the culture medium. HiBiT can reconstitute a functional NanoLuc with LgBiT and process Furimazine when lysosomes are exocytosed thereby generating luminescence measurable in a suitable plate reader. 

      With this assay we detected that about 30% of lysosomes that were “releasable” by treatment with ionomycin are exocytosed during S. aureus infection. Lysosomal exocytosis was strongly reduced (even below the levels of untreated controls) if we treated cells with Vacuolin-1 or Ned19.
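      One way to express a luminescence signal as a share of the ionomycin-releasable pool is sketched below. The response does not state how the ~30% value was normalized, so the formula (baseline-subtracted signal divided by the baseline-subtracted ionomycin signal), the function name, and the example numbers are all assumptions made for illustration.

```python
def releasable_fraction(sample: float, untreated: float, ionomycin: float) -> float:
    """Share of the ionomycin-releasable pool that a sample has exocytosed.

    All three arguments are luminescence readings in the same arbitrary
    units. The untreated control defines 0 and the ionomycin-treated
    control defines 1 (an assumed normalization, not the authors' stated one).
    """
    releasable_window = ionomycin - untreated
    if releasable_window <= 0:
        raise ValueError("ionomycin signal must exceed the untreated control")
    return (sample - untreated) / releasable_window


if __name__ == "__main__":
    # Hypothetical readings, chosen only to illustrate the arithmetic.
    print(releasable_fraction(sample=1300.0, untreated=1000.0, ionomycin=2000.0))
    # A reading below the untreated control yields a negative value.
    print(releasable_fraction(sample=900.0, untreated=1000.0, ionomycin=2000.0))
```

      With this normalization, a condition that suppresses exocytosis below the untreated control, as described for Vacuolin-1 and Ned19, appears as a negative fraction rather than being clipped to zero.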

      We agree with the reviewer that Vacuolin-1 to some extent has unspecific side effects as has been shown by others and which we addressed in the revised version of the manuscript (line 541 ff). However, our new results with the HiBiT reporter cell line clearly demonstrate a reduction of lysosomal exocytosis after Vacuolin-1 treatment. Supported by this and our other results we hypothesize that Vacuolin-1 decreases S. aureus internalization due to the inhibition of lysosomal exocytosis.

      As to the involvement of synaptotagmin 7: The effect of Syt7 K.O. on invasion was moderate in initial experiments, likely due to a high culture passage and presumably overgrowth of WT cells. However, reduction of invasion in Syt7 K.O.s was more pronounced in experiments with β-toxin complementation (Figure 2, N) and hence, we combined the two data sets (Figure 2, F). This demonstrates the reduction of bacterial invasion by ~40% in Syt7 K.O. cell pools. Moreover, Syt7 is not the only protein possibly involved in Ca<sup>2+</sup>-dependent exocytosis. For instance, Syt1 has been shown to possess an overlapping function.[15] This may explain the differences between our Vacuolin-1 and Syt7 ablation experiments. We added this information to the discussion.

      ASM is proposed to play a central role in the rapid invasion process. As above, most of the evidence offered in this regard is pharmacological and often inconsistent between inhibitors or among cell types. Some drugs affect some of the cells, but not others. It is difficult to reach general conclusions regarding the role of ASM. The argument is made even more complex by the authors' use of exogenous sphingomyelinase (beta-toxin). Pretreatment with the toxin decreased invasion efficiency, a seemingly paradoxical result. Incidentally, the effectiveness of the added toxin is never quantified/validated by directly measuring the generation of ceramide or the disappearance of SM.

      Although pharmacological inhibitors can have unspecific side effects, we want to emphasize that the inhibitors used in our study act on the enzyme ASM by completely different mechanisms. Amitriptyline is a so called functional inhibitor of ASM (FIASMA) which induces the detachment of ASM from lysosomal membranes resulting in degradation of the enzyme.[16] By contrast, ARC39 is a competitive inhibitor.[17, 18] 

      There are no inconsistencies in our data obtained with ASM inhibitors. Amitriptyline and ARC39 both reduce the invasion of S. aureus in HuLEC, HuVEC and HeLa cells (Figure 2G). ARC39 needs a longer pre-incubation, since its uptake by host cells is slower (to be published elsewhere). We observe a different outcome in 16HBE14o- and Ea.Hy 926 cells, with 16HBE14o- even demonstrating a slightly increased invasion of S. aureus upon ARC39 treatment. Amitriptyline had no effect (Figure 2G). 

      Thus, the ASM-dependent S. aureus internalization is cell type/line specific, which we state in the manuscript. The molecular origin of these differences is unclear and will require further investigation, e.g. in testing cell lines for potential differences in surface receptors. In a separate study we have already developed a biotinylation-based approach to identify potential novel host cell surface interaction partners during S. aureus infection.[19]

      Moreover, both inhibitors affected the invasion dynamics (Figure 3D), phagosomal escape (Figure 4C and Figure 4D) and Rab7 recruitment (Figure 4A and Supp. Figure 4A-C) in a similar fashion. Proper inhibition of ASM by both compounds in all cell lines used was validated by enzyme assays (Supp. Figure 2H), which again suggests that the ASM-dependent pathway exists only in specific cell lines and also supports the view that we are not observing unspecific side effects of the compounds. We clarified this in the revised manuscript.

      ASM is a key player in SM degradation and recycling. In a clinical context, deficiency in ASM results in Niemann-Pick disease type A/B. The lipid profile of ASM-deficient cells is massively altered[20], which will result in severe side effects. Short-term inhibition by small molecules therefore poses a clear benefit when compared to the usage of ASM K.O. cells. In order to satisfy the query of the reviewer, we generated two ASM K.O. cell pools (generated with two different sgRNAs) and tested these for S. aureus invasion efficiency (Figure 2, I). We did not observe differences in bacterial invasion between WT and K.O. cells. However, when we additionally treated the cells with ASM inhibitor, we observed a strongly reduced invasion in WT cells, while invasion efficiency in ASM K.O. cells was only slightly affected (Figure 2, J). We concluded that the reduced invasion observed in inhibitor-treated WT cells is predominantly due to absence of ASM activity, while the small reduction observed in ARC39-treated ASM K.O.s is likely due to unspecific side effects.

      We performed lipidomics on these cells and demonstrated a strongly altered sphingolipid profile in ASM K.O. cells compared to untreated and inhibitor-treated WT cells (Figure 2, K). We speculate that other ASM-independent bacterial invasion pathways are upregulated in ASM K.O.s, thereby obscuring the effect contributed by absence of ASM. We discussed this in the revised manuscript (line 518 ff).

      Moreover, we introduced the RFP-CWT escape marker into the ASM K.O. cells and measured phagosomal escape of S. aureus JE2 and Cowan I. The latter strain is non-cytotoxic and serves as negative control, since it is known to possess a very low escape rate, due to its inability to produce toxin. Again, we compared early invaders (infection for 10 min) with early+late invaders (infection for 30 min). As observed for JE2, “early invaders” possess lower escape rates than “early+late invaders”.

      We did not observe differences between WT and ASM K.O. cells if we infected for only 10 min. By contrast, we observed a lower escape rate in ASM K.O. compared to WT cells when we infected for 30 min (Author response image 3, see end of the document).

      However, we usually observe an increased phagosomal escape when we treat host cells with ASM inhibitors (Figure 4C and D). Reduced phagosomal escape of intracellular S. aureus in ASM K.O. cells may be caused by the altered sphingolipid profile (e.g., by interference with binding of bacterial toxins to phagosomal membranes or altered vesicular acidification). We hence think that these data are difficult to interpret, and clarification would require intense additional experimentation. Thus, we did not include these data in the manuscript.

      Author response image 3.

      Phagosomal escape rates were established in either HeLa wild-type or ASM K.O. cells expressing the phagosomal escape reporter RFP-CWT. Host cells were infected with the cytotoxic S. aureus strain JE2 or the non-cytotoxic strain Cowan I for 10 or 30 minutes, and escape rates were determined by microscopy 3 h p.i.

      As to the treatment with a bacterial sphingomyelinase:

      Treatment with the bacterial SMase (bSMase, here: β-toxin) was performed in two different ways:

      i) Pretreatment of host cells with β-toxin to remove SM from the host cell surface before infection. This removes the substrate of ASM from the cell surface prior to addition of the bacteria (Figure 2L, Figure 4A-C). Since SM is not present on the extracellular plasma membrane leaflet after treatment, a release of ASM cannot cause localized ceramide formation at the sites of lysosomal exocytosis. Similar observations were made by others.[21] 

      ii) Addition of bSMase to host cells together with the bacteria to complement for the absence of ASM (Figure 2N).  

      Removal of the ASM substrate before infection (i) prevents localized ASM-mediated conversion of SM to Cer during infection and resulted in a decreased invasion, while addition of the SMase during infection (ii) resulted in an increased invasion in TPC1- and Syt7-ablated cells. Thus, both experiments are consistent with each other and in line with our other observations.

      Removal of SM from the plasma membrane by β-toxin was indirectly demonstrated by the absence of Lysenin recruitment to phagosomes/escaped bacteria when host cells were pretreated with the toxin before infection (Figure 5C). We also added another data set that demonstrates degradation of a fluorescent SM derivative upon β-toxin treatment of host cells (Supp Figure 2, M). In another publication, we recently quantified the effectiveness of β-toxin treatment, albeit with slightly longer treatment times (75 min vs. 3 h).[22]

      To clarify our experimental approaches to the readership, we added an explanatory section to the revised manuscript (line 287 ff) and we also added a scheme in Figure 2M describing the experimental settings.

      As to the general conclusions regarding the role of ASM: ASM and lysosomal exocytosis have been shown to be involved in uptake of a variety of pathogens[21, 23-27], supporting their role in the process.

      The use of fluorescent analogs of sphingomyelin and ceramide is not well justified and it is unclear what conclusions can be derived from these observations. Despite the low resolution of the images provided, it appears as if the labeled lipids are largely in endomembrane compartments, where they would presumably be inaccessible to the secreted ASM. Moreover, considering the location of the BODIPY probe, the authors would be unable to distinguish intact sphingomyelin from its breakdown product, ceramide. What can be concluded from these experiments? Incidentally, the authors report only 10% of BODIPY-positive events after 10 min. What are the implications of this finding? That 90% of the invasion events are unrelated to sphingomyelin, ASM, and ceramide?

      During the experiments with fluorescent SM analogues (Figure 3a,b), S. aureus was added to the samples immediately before the start of video recording. Hence, bacteria slowly trickle onto the host cells, and we can thus image the initial contact between the two. For instance, the bacteria depicted in Figure 3A contact the host cell about 9 min before becoming BODIPY-FL-positive (see Supp. Video 1, 55 min). Hence, in these cases we see the formation of phagosomes around bacteria rather than bacteria in endomembrane compartments. Since generation of phagosomes happens at the plasma membrane, SM is accessible to secreted ASM.

      The “trickling” approach for infection is an experimental difference to our invasion measurements, in which we synchronized the infection by centrifugation. This ensures that all bacteria have contact with host cells and are not just floating in the culture medium. However, live cell imaging of initial bacterial-host contact and synchronization of infection are technically hard to combine.

      In our invasion measurements (with synchronization), we typically see internalization of ~20% of all added bacteria after 30 min. Hence, most bacteria that are visible in our videos are likely still extracellular and only a small proportion was internalized. This explains why only 10% of total bacteria are positive for BODIPY-FL-SM after 10 min. The proportion of internalized bacteria that are positive for BODIPY-FL-SM should be considerably higher but cannot be determined with this method.

      We agree with the reviewer that we cannot observe conversion of BODIPY-FL-SM by ASM. In order to do that, we attempted to visualize the conversion of a visible-range SM FRET probe (Supp. Figure 3), but the structure of the probe is not compatible with measurement of conversion on the plasma membrane, since the FITC fluorophore is released into the culture medium by the ASM activity and is thereby lost for imaging. In general, the visualization of SM conversion with subcellular resolution is challenging, and even with novel tools developed in our lab[28] visualization of SM on the plasma membrane is difficult.

      The conclusions we draw from these experiments are that i) S. aureus invasion is associated with SM and ii) SM-associated invasion can be very fast, since bacteria are rapidly engulfed by BODIPY-FL-SM-containing membranes.

      It is also unclear how the authors can distinguish lysenin entry into ruptured vacuoles from the entry of RFP-CWT, used as a criterion of bacterial escape. Surely the molecular weights of the probes are not sufficiently different to prevent the latter one from traversing the permeabilized membrane until such time that the bacteria escape from the vacuole.

      We want to clarify here that both Lysenin and the CWT reporter have access to ruptured vacuoles (Figure 4B). We used the Lysenin reporter in these experiments for estimation of SM content of phagosomal membranes. If a vacuole is ruptured, both the bacteria and the luminal leaflet of the phagosomal membrane remnants come into contact with the cytosol and hence with the cytosolically expressed reporters YFP-Lysenin and RFP-CWT, resulting in “Lysenin-positive escape” when phagosomes contained SM (see Figure 5C). By contrast, either β-toxin expression by S. aureus or pretreatment with the bSMase resulted in absence of Lysenin recruitment, suggesting that the phagosomal SM levels were decreased/undetectable (Figure 5C, Supp Figure 6F, G, I, J).

      Although this approach does not enable a quantitative measurement of phagosomal SM, this method is sufficient to show that β-toxin expression and pretreatment result in markedly decreased phagosomal SM levels in the host cells.

      The approach we used here to analyze “Lysenin-positive escape” can clearly be distinguished from Lysenin-based methods that were used by others.[29] There, Lysenin was used to show trans-bilayer movement of SM before rupture of bacteria-containing phagosomes.

      To clarify the function of Lysenin in our approach, we added further figures (Figure 4F, Supp. Figure 5) and a movie (Supp. Video 4) to the revised manuscript.

      Both SMase inhibitors (Figure 4C) and SMase pretreatment increased bacterial escape from the vacuole. The former should prevent SM hydrolysis and formation of ceramide, while the latter treatment should have the exact opposite effects, yet the end result is the same. What can one conclude regarding the need and role of the SMase products in the escape process?

      As pointed out above, pretreatment of host cells with SMase removes SM from the plasma membrane, so that ASM does not have access to its substrate. Both treatment with ASM inhibitors and pretreatment with bacterial SMase therefore prevent ASM from being active on the plasma membrane and thus block the ASM-dependent uptake (Figure 2 G, L). Although overall fewer bacteria were internalized by host cells under these conditions, the bacteria that invaded host cells did so in an ASM-independent manner.

      Since blockage of the ASM-dependent internalization pathway (with ASM inhibitor [Figure 4C, D], SMase pretreatment [Figure 5B] and Vacuolin-1 [Figure 4E]) always resulted in enhanced phagosomal escape, we conclude that bacteria that were internalized in an ASM-independent fashion show enhanced escape. Vice versa, bacteria that enter host cells in an ASM-dependent manner demonstrate lower escape rates.

      This is supported by comparing the escape rates of “early” and “late” invaders [Figure 5D, E], which in our opinion is a key experiment that supports this hypothesis. The “early” invaders are predominantly ASM-dependent (see e.g. Figure 3E) and thus, bacteria that entered host cells in the first 10 min of infection should have been internalized predominantly in an ASM-dependent fashion, while slower entry pathways are active later during infection. The early ASM-dependent invaders possessed lower escape rates, which is in line with the data obtained with inhibitors (e.g. Figure 4C, D).

      We hypothesize that the activity of ASM on the plasma membrane during invasion mediates the recruitment of a specific subset of receptors, which then influences downstream phagosomal maturation and escape. This hypothesis is supported by the fact that the subset of receptors interacting with S. aureus is altered upon inhibition of the ASM-dependent uptake pathway. We describe this in another study that is currently under evaluation elsewhere.  

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Ruhling et al propose a rapid uptake pathway that is dependent on lysosomal exocytosis, lysosomal Ca<sup>2+</sup> and acid sphingomyelinase, and further suggest that the intracellular trafficking and fate of the pathogen is dictated by the mode of entry.

      The evidence provided is solid, methods used are appropriate and results largely support their conclusions, but can be substantiated further as detailed below. The weakness is a reliance on chemical inhibitors that can be non-specific to delineate critical steps.

      Specific comments:

      A large number of experiments rely on treatment with chemical inhibitors. While this approach is reasonable, many of the inhibitors employed such as amitriptyline and vacuolin1 have other or nondefined cellular targets and pleiotropic effects cannot be ruled out. Given the centrality of ASM for the manuscript, it will be important to replicate some key results with ASM KO cells.

      We thank the reviewer for the critical evaluation of our manuscript and plenty of constructive comments. 

      We agree with the reviewer that ASM inhibitors such as the functional inhibitors of ASM (FIASMA), including amitriptyline used in our study, have unspecific side effects given their mode of action. FIASMAs induce the detachment of ASM from lysosomal membranes, resulting in degradation of the enzyme.[16] However, we want to emphasize that we also used the competitive inhibitor ARC39 in our study[17, 18], which acts on the enzyme by a completely different mechanism. All phenotypes (reduced invasion [Figure 2G], effect on invasion dynamics [Figure 3D], enhanced escape [Figure 4C, D] and differential recruitment of Rab7 [Supp. Figure 4A-C]) were observed with both inhibitors, thereby supporting the role of ASM in the process.

      We further agree that experiments with genetic evidence usually support and improve scientific findings. However, ASM is a cellular key player for SM degradation and recycling. In a clinical context, deficiency in ASM results in Niemann-Pick disease type A/B. The lipid profile of ASM-deficient cells is massively altered[20], which in itself will result in severe side effects. Thus, the usage of inhibitors provides a clear benefit when compared to ASM K.O. cells, since ASM activity can be targeted in a short-term fashion, thereby preventing larger alterations in cellular lipid composition.

      We nevertheless generated two ASM K.O. cell pools (generated with two different sgRNAs) and tested for invasion efficiency (Figure 2, I). Here, we did not observe differences between WT and mutants. However, if we additionally treated the cells with ASM inhibitor, we observed a strongly reduced invasion in WT cells, while invasion efficiency in ASM K.O. cells was only slightly affected (Figure 2, J). We concluded that the reduced invasion observed in WT cells upon inhibitor treatment is predominantly due to inhibition of ASM, whereas the small reduction observed in ARC39-treated ASM K.O. cells is likely due to unspecific side effects. We also demonstrated a strongly altered sphingolipid profile in ASM K.O. cells when compared to untreated and inhibitor-treated WT cells (new Figure 2, K). We speculate that other ASM-independent invasion pathways are upregulated in ASM K.O. cells, thereby making up for the absence of ASM. We discuss this in the revised manuscript (line 518 ff).

      We introduced the RFP-CWT escape marker into the ASM K.O. cells and measured phagosomal escape of S. aureus JE2 and Cowan I (Author response image 3). The latter serves as negative control, since it is known to possess a very low escape rate, due to its inability to produce toxin. Again, we compared early invaders (infection for 10 min) with early+late invaders (infection for 30 min). As seen before for JE2, early invaders possess lower escape rates than early+late invaders. We did not observe differences between WT and K.O. cells if we infected for 10 min. By contrast, we observed a lower escape rate in ASM K.O. compared to WT cells when we infected for 30 min. However, we usually observe an increased phagosomal escape when we treat host cells with ASM inhibitors (Figure 4C and D). We think that the reduced phagosomal escape in ASM K.O. cells is caused by the altered sphingolipid profile, which could have versatile effects (e.g., interference with binding of bacterial toxins to phagosomal membranes or changes in acidification). We hence think that these data are difficult to interpret, and clarification would require intense additional experimentation. Thus, we did not include these data in the manuscript.

      Most experiments are done in HeLa cells. Given the pathway is projected as generic, it will be important to further characterize cell type specificity for the process. Some evidence for a similar mechanism in other cell types S. aureus infects, perhaps phagocytic cell type, might be good. 

      Whenever possible, we performed the experiments not only in HeLa but also in HuLECs. For example, we refer to experiments concerning the role of Ca<sup>2+</sup> (Figure 1A/Supp. Figure 1A), lysosomal Ca<sup>2+</sup>/Ned19 (Figure 1B/Supp. Figure 1C), lysosomal exocytosis/Vacuolin-1 (Figure 2D/Supp. Figure 2D), ASM/ARC39 and amitriptyline (Figure 2G), surface SM/β-toxin (Figure 2L/Supp. Figure 2L), analysis of invasion dynamics (complete Figure 3) and measurement of cell death during infection (Figure 6C+E, Supp. Figure 8A+B).

      HuLECs, however, are not readily amenable to genetic manipulation. Hence, we were not able to generate gene deletions in these cells, and the cells do not grow readily after introduction of the fluorescent escape reporter.

      As to ASM involvement in phagocytic cells: a role for ASM during the uptake of S. aureus by macrophages was previously reported by others.[25] However, in professional phagocytes S. aureus does not escape from the phagosome but replicates within it.[30]

      I'm a little confused about the role of ASM on the surface. Presumably, it converts SM to ceramide, as the final model suggests. Overexpression of b-toxin results in the near complete absence of SM on phagosomes (having representative images will help appreciate this), but why is phagosomal SM detected at high levels in untreated conditions? If bacteria are engulfed by SM-containing membrane compartments, what role does ASM play on the surface? If surface SM is necessary for phagosomal escape within the cell, do the authors imply that ASM is tuning the surface SM levels to a certain optimal range? Alternatively, can there be additional roles for ASM on the cell surface? Can surface SM levels be visualized (for example, in Figure 4 E, F)?

      We initially hypothesized that we would detect higher phagosomal SM levels upon inhibition of ASM, since our model suggests SM cleavage by ASM on the host cell surface during bacterial cell entry. However, we did not detect any changes in our experiments (Supp. Figure 4F). We currently favor the following explanation: SM is the most abundant sphingolipid in human cells.[31] If peripheral lysosomes are exocytosed and thereby release ASM, only a localized and relatively small proportion of SM may get converted to Cer, which most likely is below our detection limit. In addition, the detection of cytosolically exposed phagosomal SM by YFP-Lysenin is not quantitative and provides a “Yes or No” measurement. Hence, we think that the rather limited SM to Cer conversion in combination with the high abundance of SM in cellular membranes does not visibly affect the recruitment of the Lysenin reporter.

      In our experiments that employ BODIPY-FL-SM (Figure 3a+b), we cannot distinguish between native SM and downstream metabolites such as Cer. Hence, again we cannot make any assumptions on the extent to which SM is converted on the surface during bacterial internalization. Although our laboratory recently used trifunctional sphingolipid analogs to analyze the SM to Cer conversion[22], the visualization of this process on the plasma membrane is currently still challenging.

      Overall, we hypothesize that the localized generation of Cer on the surface by released ASM leads to generation of Cer-enriched platforms. Subsequently, a certain subset of receptors may be recruited to these platforms and influence the uptake process. These platforms are supposed to be very small, which would also explain why we did not detect changes in Lysenin recruitment.

      Related to that, why is ASM activity on the cell surface important? Its role in non-infectious or other contexts can be discussed.

      ASM release by lysosomal exocytosis is implicated in plasma membrane repair upon injury. We added a short description of the role of extracellular ASM in the introduction (line 35).

      If SM removal is so crucial for uptake, can exocytosis of lysosomes alone provide sufficient ASM for SM removal? How much or to what extent is lysosomal exocytosis enhanced by initial signaling events? Do the authors envisage the early events in their model happening in localized confines of the PM, this can be discussed.

      Ionomycin treatment led to a release of ~10% of all lysosomes and also increased extracellular ASM activity.[8, 9] In the revised manuscript, we developed an assay to determine lysosomal exocytosis during S. aureus infection (Figure 2, A-C). During infection, we detected lysosomal exocytosis of ~30% when compared to ionomycin treatment. Since this is only a fraction of the “releasable lysosomes”, we assume that the effects (lysosomal Ca<sup>2+</sup> liberation, lysosomal exocytosis and ASM activity) are very localized and take place only at host-pathogen contact sites (see also above). We discuss this in the revised manuscript (line 563 ff). To our knowledge, it is currently unclear to which extent the released ASM affects surface SM levels. We attempted to visualize the local ASM activity on the cell surface by using a visible-range FRET probe (Supp. Fig. 3). Cleavage of the probe by ASM on the surface leads to release of FITC into the cell culture medium, which does not contribute a measurable signal at the surface.

      How are inhibitor doses determined? How efficient is the removal of extracellular bacteria at 10 min? It will be good to substantiate the cfu experiments for infectivity with imaging-based methods. Are the roles of TPC1 and TPC2 redundant? If so, why does silencing TPC1 alone result in a decrease in infectivity? For these and other assays, it would be better to show raw values for infectivity. Please show alterations in lysosomal Ca<sup>2+</sup> at the doses of inhibitors indicated. Is lysosomal Ca<sup>2+</sup> released upon S. aureus binding to the cell surface? Will be good to directly visualize this.

      Concerning the inhibitor concentrations, we either used values established in published studies or recommendations of the suppliers (e.g. 2-APB, Ned19, Vacuolin-1). For ASM inhibitors, we determined proper inhibition of ASM by activity assays. Concentrations of ionomycin resulting in Ca<sup>2+</sup> influx and lysosomal exocytosis were determined in earlier studies of our lab.[9, 32]

      As to the removal of bacteria at 10 min p.i.: Lysostaphin is very efficient for removal of extracellular S. aureus and sterilizes the tissue culture supernatant. It significantly lyses bacteria within a few minutes, as determined by turbidity assays.[33]

      As to imaging-based infectivity assays: We performed imaging-based invasion assays to show reduced invasion efficiency with two ASM inhibitors in the revised manuscript with similar results as obtained by CFU counts (Supp. Figure 2, J).

      Regarding the roles of TPC1 and TPC2: from our data we cannot conclude whether the roles of TPC1 and TPC2 are redundant. One could speculate that, since blockage of TPC1 alone is sufficient to reduce internalization of bacteria, both channels may have distinct roles. On the other hand, there might be a Ca<sup>2+</sup> threshold for initiating lysosomal exocytosis that can only be attained if TPC1 and TPC2 are activated in parallel. Thus, our observations are in line with another study that shows reduced Ebola virus infection in absence of either TPC1 or TPC2.[34] In order to address the role of TPC2 for this review process, we were kindly provided with TPCN1/TPCN2 double knock-out HeLa cells by Norbert Klugbauer (Freiburg, Germany), which we tested for S. aureus internalization. We found that invasion was reduced in these double KO cell lines, further supporting a role of lysosomal Ca<sup>2+</sup> release in S. aureus host cell entry (Author response image 2, see end of the document). Since we did not have a single TPCN2 knockout available, we decided to exclude these data from the main manuscript.

      As to raw CFU counts: whereas the observed effects upon blocking the invasion of S. aureus are stable, the number of internalized bacteria varies between individual biological replicates, for instance, by differences in host cell fitness or growth differences in bacterial cultures, which are prepared freshly for each experiment.

      With respect to visualization of lysosomal Ca<sup>2+</sup> release: we agree with the reviewer that direct visual demonstration of lysosomal Ca<sup>2+</sup> release upon infection would improve the manuscript. We therefore performed live cell imaging to visualize lysosomal Ca<sup>2+</sup> release by a previously published method.[1] The approach is based on two dextran-coupled fluorophores that were incubated with host cells. The dyes are endocytosed and eventually stain the lysosomes. One of the dyes, Rhod-2, is Ca<sup>2+</sup>-sensitive and can be used to estimate the lysosomal Ca<sup>2+</sup> content. The second dye, AF647, is Ca<sup>2+</sup>-insensitive and is used to visualize the lysosomes. A decrease in the Rhod-2/AF647 ratio within the lysosomes indicates lysosomal Ca<sup>2+</sup> release. We monitored lysosomal Ca<sup>2+</sup> content during S. aureus infection with this method (Author response image 1 and Author response video 1). However, the lysosomes are very dynamic, and it is challenging to monitor the fluorescence intensities over time. Thus, quantitative measurements are not possible with our methodology, and we decided not to include these data in the final manuscript. However, one could speculate that lysosomal Ca<sup>2+</sup> content in the selected ROI (Author response image 1 and Author response video 1) is decreased upon attachment of S. aureus to the host cells, as indicated by a decrease in the Rhod-2/AF647 ratio.
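
      The ratiometric readout described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the analysis pipeline used for the author response images: the function name, the baseline window, and all intensity values are hypothetical, and the intensities are assumed to be background-corrected mean values from a single lysosomal ROI.

```python
from statistics import mean


def normalized_ratio(rhod2, af647, n_baseline=3):
    """Rhod-2/AF647 ratio per frame, normalized to the first n_baseline frames.

    rhod2: Ca2+-sensitive channel intensities (one value per frame).
    af647: Ca2+-insensitive reference channel intensities (same frames).
    """
    if len(rhod2) != len(af647):
        raise ValueError("both channels need the same number of frames")
    if len(rhod2) < n_baseline:
        raise ValueError("not enough frames for the baseline window")
    ratios = [r / a for r, a in zip(rhod2, af647)]
    baseline = mean(ratios[:n_baseline])
    return [x / baseline for x in ratios]


# Hypothetical intensities (arbitrary units), NOT measured data.
rhod2 = [100.0, 100.0, 100.0, 80.0, 60.0]
af647 = [50.0, 50.0, 50.0, 50.0, 50.0]

trace = normalized_ratio(rhod2, af647)
# A normalized ratio falling below 1 would be read as a drop in lysosomal Ca2+.
print(trace)
```

      Dividing by the Ca<sup>2+</sup>-insensitive channel is what makes the readout tolerant of changes in dye load or focus, but it does not solve the tracking problem noted above: a moving lysosome still has to be followed from frame to frame before such a ratio is meaningful.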

      The precise identification of cytosolic vs phagosomal bacteria is not very easy to appreciate. The methods section indicates how this distinction is made, but how do the authors deal with partial overlaps and ambiguities generally associated with such analyses? Please show respective images.

      The number of events (individual bacteria) for the live cell imaging data should be clearly mentioned.

      We apologize for not having sufficiently explained the technology to detect escaped S. aureus. The cytosolic location of S. aureus is indicated by recruitment of RFP-CWT.[35] CWT is the cell wall targeting domain of lysostaphin, which efficiently binds to the pentaglycine cross bridge in the peptidoglycan of S. aureus. This reporter is exclusively and homogeneously expressed in the host cytosol. Only upon rupture of phagoendosomal membranes can the reporter be recruited to the cell wall of now cytosolically located bacteria. S. aureus mutants, for instance in the agr quorum sensing system, cannot break down the phagosomal membrane in non-professional phagocytes and thus stay unlabeled by the CWT reporter.[35] We include several images (Figure 4, F, Supp. Figure 5) and movies (Supp. Video 4) of escape events in the revised manuscript. The bacteria numbers for live cell experiments are now shown in Supp. Figure 7.

      In the phagosome maturation experiments, what is the proportion of bacteria in Rab5 or Rab7 compartments at each time point? Will the decreased Rab7 association be accompanied by increased Rab5? Showing raw values and images will help appreciate such differences. Given the expertise and tools available in live cell imaging, can the authors trace Rab5 and Rab7 positive compartment times for the same bacteria?

      We included the proportion of Rab7-associated bacteria in the revised manuscript (Supp. Figure 4A and C) and also briefly mention these proportions in the text (line 353). Usually, we observe that Rab5 is only transiently (for a few minutes) present on phagosomes and only afterwards do the phagosomes become positive for Rab7. We do not think that a decrease in Rab7-positive phagosomes would increase the proportion of Rab5-positive phagosomes. However, we cannot exclude this hypothesis with our data.

      We can trace individual bacteria for recruitment of Rab5/Rab7 only manually, which impedes a quantitative evaluation. However, we included a video (Supp. Video 3) that illustrates the consecutive recruitment of the GTPases.

      The results with longer-term infection are interesting. Live cell imaging suggests that ASM-inhibited cells show accelerated phagosomal escape that reduces by 6 hpi. Where are the bacteria at this time point ? Presumably, they should have reached lysosomes. The relationship between cytosolic escape, replication, and host cell death is interesting, but the evidence, as presented is correlative for the populations. Given the use of live cell imaging, can the authors show these events in the same cell?

      We think that most bacteria-containing phagoendosomes should have fused with lysosomes 6 h p.i. as we have previously shown by acidification to pH of 5 and LAMP1 decoration.[36]

      The correlation between phagosomal escape and replication in the cytosol of non-professional phagocytes has been observed by us and others. In the revised manuscript we also provide images (Supp. Figure 5)/videos (Supp. Video 4) to show this correlation in our experiments.

      Given the inherent heterogeneity in uptake processes and the use of inhibitors in most experiments, the distinction between ASM-dependent and independent pathways might not be as clear-cut as the authors suggest. Some caution here will be good. Can the authors estimate what fraction of intracellular bacteria are taken up ASM-dependent?

      We agree with the reviewer that an overlap between internalization pathways is likely. A clear distinction is therefore certainly non-trivial. Alternative to ASM-dependent and ASM-independent pathways, the ASM activity may also accelerate one or several internalization pathways. We address this limitation in the discussion of the revised manuscript (line 596 ff).

      Early in infection (~10 min after contact with the cells), the proportion of bacteria that enter host cells ASM-dependently is relatively high, amounting to roughly 75-80% in HuLEC. After 30 min, this proportion decreases to about 50%. We included a paragraph in the discussion of the revised manuscript (line 593 ff).
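
      For orientation, proportions of this kind can be related to invasion measurements with a simple calculation. The sketch below assumes, as one plausible reading rather than the procedure stated in the response, that the ASM-dependent share is taken as one minus the ratio of invasion with ASM inhibitor to invasion without it. The function name and the input values are hypothetical and were chosen only to fall in the quoted range.

```python
def asm_dependent_fraction(invasion_control, invasion_inhibited):
    """Estimate the share of uptake that is lost when ASM is inhibited.

    invasion_control: internalized bacteria without inhibitor (e.g. CFU).
    invasion_inhibited: internalized bacteria with ASM inhibitor, same units.
    Assumption: the inhibitor blocks only the ASM-dependent route.
    """
    if invasion_control <= 0:
        raise ValueError("control invasion must be positive")
    if invasion_inhibited < 0:
        raise ValueError("inhibited invasion cannot be negative")
    fraction = 1.0 - invasion_inhibited / invasion_control
    # Clamp to [0, 1]: replicate noise can push the raw estimate outside.
    return min(1.0, max(0.0, fraction))


# Hypothetical counts, NOT data from the manuscript.
early = asm_dependent_fraction(100.0, 25.0)  # short infection
late = asm_dependent_fraction(100.0, 50.0)   # longer infection
print(f"early: {early:.0%}, late: {late:.0%}")
```

      Such an estimate inherits the caveat raised in the preceding paragraph: if ASM activity merely accelerates one or several uptake routes, the number describes inhibitor sensitivity at a given time point rather than the size of a separate pathway.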

      Reviewer #2 (Recommendations for the authors):

      (1) The experiment in Figure 4H is interesting. Details on what proportion of the cell is double positive, and if only this fraction was used for analysis will be good.

      We used all bacteria found in the images, regardless of whether host cells were infected with only one or both strains. We unfortunately cannot properly determine the proportion of cells that are double infected, since i) we record the samples with CLSM and hence cannot exclude that there are intracellular bacteria in higher or lower optical sections, and ii) we visualized cells by staining nuclei and did not stain the cell borders, and thus cannot precisely tell to which host cell the bacteria localize.

      (2) Data is sparse for steps 5 and 6 of the model (line 330).

      We apologize for the inconvenience. There is a related study published elsewhere[19], in which we identified NRCAM and PTK7 as putative receptors involved in this invasion pathway. We included a section in the discussion with the corresponding citation (line 569).

      (3) Data for the reduced number of intracellular bacteria upon blocking ASM-dependent uptake (line 235) is not clear. Do they mean decreased invasion efficiency? These two need not be the same.

      We changed “reduced number of intracellular bacteria” to “invasion efficiency”.

      (4) b-toxin added to the surface can get endocytosed. Can its surface effect be delineated from endo/phagosomal effect?

      We attempted to delineate effects contributed by the toxin activity on the surface vs. within phagosomes (Figure 5 A-C). We saw increased phagosomal escape when we pretreated host cells with β-toxin (removal of SM from the surface) and infected either in the presence (toxin will be taken up together with the bacteria into the phagosome) or in the absence (toxin was washed away shortly before infection) of β-toxin. By contrast, overexpression of β-toxin by S. aureus did not affect phagosomal escape rates. The proper activity of β-toxin was confirmed by the absence of Lysenin recruitment during phagosomal escape in all three conditions. We concluded that the activity on the surface, and not the activity in the phagosome, is important.

      (5) The potential role(s) of bacterial factors in the uptake and subsequent intracellular stages can be discussed.

      There are multiple bacterial adhesins known in S. aureus. These include factors covalently attached to the bacterial cell wall, such as the sortase-dependently anchored fibronectin-binding proteins A and B, but also secreted and “cell wall binding” proteins as well as non-proteinaceous factors such as wall teichoic acids. A discussion of these factors would thus be beyond the scope of this manuscript, and we refer the reader to specialized reviews on that topic.

      (6) The manuscript is not very easy to read. The abstract could be rephrased for better clarity and succinctness, with a clearly stated problem statement. The introduction is somewhat haphazard, I feel it can be better structured.

      We apologize for the inconvenience. We stated the problem/research question in the abstract and tried to improve the introduction without adding too much unnecessary detail. In general, we tried to improve the readability of the manuscript and hope that our results and conclusions can be more easily understood by the reader in the revised version.

      (7) Typo in Figure 5F. Step 6 should read "accessory receptors"

      The typo was corrected.

      References

      (1) Lloyd-Evans, E. et al. Niemann-Pick disease type C1 is a sphingosine storage disease that causes deregulation of lysosomal calcium. Nature Medicine 14, 1247-1255 (2008).

      (2) Launay, P. et al. TRPM4 Is a Ca<sup>2+</sup>-Activated Nonselective Cation Channel Mediating Cell Membrane Depolarization. Cell 109, 397-407 (2002).

      (3) Nilius, B. et al. The Ca<sup>2+</sup>‐activated cation channel TRPM4 is regulated by phosphatidylinositol 4,5‐biphosphate. The EMBO Journal 25, 467-478 (2006).

      (4) Cáceres, M. et al. TRPM4 Is a Novel Component of the Adhesome Required for Focal Adhesion Disassembly, Migration and Contractility. PLoS One 10, e0130540 (2015).

      (5) Silva, I., Brunett, M., Cáceres, M. & Cerda, O. TRPM4 modulates focal adhesion-associated calcium signals and dynamics. Biophysical Journal 123, 390a (2024).

      (6) Schlesier, T., Siegmund, A., Rescher, U. & Heilmann, C. Characterization of the Atl-mediated staphylococcal internalization mechanism. International Journal of Medical Microbiology 310, 151463 (2020).

      (7) Jevon, M. et al. Mechanisms of Internalization of Staphylococcus aureus by Cultured Human Osteoblasts. Infection and Immunity 67, 2677-2681 (1999).

      (8) Rodriguez, A., Webster, P., Ortego, J. & Andrews, N.W. Lysosomes behave as Ca<sup>2+</sup>-regulated exocytic vesicles in fibroblasts and epithelial cells. J Cell Biol 137, 93-104 (1997).

      (9) Krones & Rühling et al. Staphylococcus aureus alpha-Toxin Induces Acid Sphingomyelinase Release From a Human Endothelial Cell Line. Front Microbiol 12, 694489 (2021).

      (10) Sakurai, Y. et al. Two-pore channels control Ebola virus host cell entry and are drug targets for disease treatment. Science 347, 995-998 (2015).

      (11) Aarhus, R., Graeff, R.M., Dickey, D.M., Walseth, T.F. & Lee, H.C. ADP-ribosyl cyclase and CD38 catalyze the synthesis of a calcium-mobilizing metabolite from NADP. J Biol Chem 270, 30327-30333 (1995).

      (12) Schmid, F., Fliegert, R., Westphal, T., Bauche, A. & Guse, A.H. Nicotinic acid adenine dinucleotide phosphate (NAADP) degradation by alkaline phosphatase. J Biol Chem 287, 32525-32534 (2012).

      (13) Angeletti, C. et al. SARM1 is a multi-functional NAD(P)ase with prominent base exchange activity, all regulated by multiple physiologically relevant NAD metabolites. iScience 25, 103812 (2022).

      (14) Gu, F. et al. Dual NADPH oxidases DUOX1 and DUOX2 synthesize NAADP and are necessary for Ca(2<sup>+</sup>) signaling during T cell activation. Sci Signal 14, eabe3800 (2021).

      (15) Schonn, J.-S., Maximov, A., Lao, Y., Südhof, T.C. & Sørensen, J.B. Synaptotagmin-1 and -7 are functionally overlapping Ca<sup>2+</sup> sensors for exocytosis in adrenal chromaffin cells. Proceedings of the National Academy of Sciences 105, 3998-4003 (2008).

      (16) Kornhuber, J. et al. Functional Inhibitors of Acid Sphingomyelinase (FIASMAs): a novel pharmacological group of drugs with broad clinical applications. Cell Physiol Biochem 26, 9-20 (2010).

      (17) Naser, E. et al. Characterization of the small molecule ARC39, a direct and specific inhibitor of acid sphingomyelinase in vitro. J Lipid Res 61, 896-910 (2020).

      (18) Roth, A.G. et al. Potent and selective inhibition of acid sphingomyelinase by bisphosphonates. Angew Chem Int Ed Engl 48, 7560-7563 (2009).

      (19) Rühling, M., Schmelz, F., Kempf, A., Paprotka, K. & Fraunholz, M.J. Identification of the Staphylococcus aureus endothelial cell surface interactome by proximity labeling. mBio 0, e03654-03624 (2025).

      (20) Schuchman, E.H. & Desnick, R.J. Types A and B Niemann-Pick disease. Mol Genet Metab 120, 27-33 (2017).

      (21) Miller, M.E., Adhikary, S., Kolokoltsov, A.A. & Davey, R.A. Ebolavirus Requires Acid Sphingomyelinase Activity and Plasma Membrane Sphingomyelin for Infection. Journal of Virology 86, 7473-7483 (2012).

      (22) M. Rühling, L.K., F. Wagner, F. Schumacher, D. Wigger, D. A. Helmerich, T. Pfeuffer, R. Elflein, C. Kappe, M. Sauer, C. Arenz, B. Kleuser, T. Rudel, M. Fraunholz, J. Seibel Trifunctional sphingomyelin derivatives enable nanoscale resolution of sphingomyelin turnover in physiological and infection processes via expansion microscopy. Nat Commun accepted in principle (2024).

      (23) Peters, S. et al. Neisseria meningitidis Type IV Pili Trigger Ca(2<sup>+</sup>)-Dependent Lysosomal Trafficking of the Acid Sphingomyelinase To Enhance Surface Ceramide Levels. Infect Immun 87 (2019).

      (24) Grassmé, H. et al. Acidic sphingomyelinase mediates entry of N. gonorrhoeae into nonphagocytic cells. Cell 91, 605-615 (1997).

      (25) Li, C. et al. Regulation of Staphylococcus aureus Infection of Macrophages by CD44, Reactive Oxygen Species, and Acid Sphingomyelinase. Antioxid Redox Signal 28, 916-934 (2018).

      (26) Fernandes, M.C. et al. Trypanosoma cruzi subverts the sphingomyelinase-mediated plasma membrane repair pathway for cell invasion. J Exp Med 208, 909-921 (2011).

      (27) Luisoni, S. et al. Co-option of Membrane Wounding Enables Virus Penetration into Cells. Cell Host & Microbe 18, 75-85 (2015).

      (28) Rühling, M. et al. Trifunctional sphingomyelin derivatives enable nanoscale resolution of sphingomyelin turnover in physiological and infection processes via expansion microscopy. Nature Communications 15, 7456 (2024).

      (29) Ellison, C.J., Kukulski, W., Boyle, K.B., Munro, S. & Randow, F. Transbilayer Movement of Sphingomyelin Precedes Catastrophic Breakage of Enterobacteria-Containing Vacuoles. Curr Biol 30, 2974-2983 e2976 (2020).

      (30) Moldovan, A. & Fraunholz, M.J. In or out: Phagosomal escape of Staphylococcus aureus. Cell Microbiol 21, e12997 (2019).

      (31) Slotte, J.P. Biological functions of sphingomyelins. Progress in Lipid Research 52, 424-437 (2013).

      (32) Stelzner, K. et al. Intracellular Staphylococcus aureus Perturbs the Host Cell Ca(2<sup>+</sup>) Homeostasis To Promote Cell Death. mBio 11 (2020).

      (33) Kunz, T.C. et al. The Expandables: Cracking the Staphylococcal Cell Wall for Expansion Microscopy. Front Cell Infect Microbiol 11, 644750 (2021).

      (34) Sakurai, Y. et al. Ebola virus. Two-pore channels control Ebola virus host cell entry and are drug targets for disease treatment. Science 347, 995-998 (2015).

      (35) Grosz, M. et al. Cytoplasmic replication of Staphylococcus aureus upon phagosomal escape triggered by phenol-soluble modulin alpha. Cell Microbiol 16, 451-465 (2014).

      (36) Giese, B. et al. Staphylococcal alpha-toxin is not sufficient to mediate escape from phagolysosomes in upper-airway epithelial cells. Infect Immun 77, 3611-3625 (2009).

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      The study starts with the notion that in an AD-like disease model, ILC2s in the Rag1 knockout were expanded and contained relatively more IL-5<sup>+</sup> and IL-13<sup>+</sup> ILC2s. This was confirmed in the Rag2 knock-out mouse model.

      By using a chimeric mouse model in which wild-type knock-out splenocytes were injected into irradiated Rag1 knock-out mice, it was shown that even though the adaptive lymphocyte compartment was restored, there were increased AD-like symptoms and increased ILC2 expansion and activity. Moreover, in the reverse chimeric model, i.e. injecting a mix of wild-type and Rag1 knock-out splenocytes into irradiated wild-type animals, it was shown that the Rag1 knock-out ILC2s expanded more and were more active. Therefore, the authors could conclude that the RAG1 mediated effects were ILC2 cell-intrinsic.

      Subsequent fate-mapping experiments using the Rag1Cre;reporter mouse model showed that there were indeed RAGnaïve and RAGexp ILC2 populations within naïve mice. Lastly, the authors performed multi-omic profiling, using single-cell RNA sequencing and ATAC-sequencing, in which a specific gene expression profile was associated with ILC2. These included well-known genes but the authors notably also found expression of Ccl1 and Ccr8 within the ILC2. The authors confirmed their earlier observations that in the RAGexp ILC2 population, the Th2 regulome was more suppressed, i.e. more closed, compared to the RAGnaïve population, indicative of the suppressive function of RAG on ILC2 activity. I do agree with the authors' notion that the main weakness was that this study lacks the mechanism by which RAG regulates these changes in ILC2s.

      The manuscript is very well written and easy to follow, and the compelling conclusions are well supported by the data. The experiments are meticulously designed and presented. I wish to commend the authors for the study's quality.

      Even though the study is compelling and well supported by the presented data, some additional context could increase the significance:

      (1) The presence of the RAGnaïve and RAGexp ILC2 populations raises some questions on the (different?) origin of these populations. It is known that there are different waves of ILC2 origin (most notably shown in the Schneider et al Immunity 2019 publication, PMID 31128962). I believe it would be very interesting to further discuss or possibly show if there are different origins for these two ILC populations.

      Several publications describe the presence and origin of ILC2s in/from the thymus (PMIDs 33432227, 24155745). Could the authors discuss whether there might be a common origin for the RAGexp ILC2 and Th2 cells from a thymic lineage? If it is true that the two populations are derived from different origins, e.g. embryonic (possibly RAGnaïve) vs. adult bone marrow/thymus (possibly RAGexp), this would show a unique functional difference between embryonic-derived and adult ILC2s.

      We agree with the Reviewer that our findings raise important questions about ILC ontogeny. These are areas of ongoing investigation for us, and it is our hope this study may inform further investigation by others as well.

      Regarding the Schneider et al study, we have considered the possibility that RAG expression may mark a particular wave of ILC2 origin. In that study, the authors used a tamoxifen-based inducible Cre strategy in their experiments to precisely time the lineage tracing of a reporter from the Rosa26 locus. Those lineage tracing mice would overlap genetically with the RAG lineage tracing mice we used in our current study, thus performing combined timed migration fate mapping and RAG fate mapping experiments would require creating novel mouse strains.

      Similarly, the possible influence of the thymic or bone marrow environment on RAG expression in ILCs is an exciting possibility. Perhaps there are signals common to those environments that can influence all developing lymphocytes, including not only T and B cells but also ILCs, with one consequence being induction of RAG expression. While assessing levels of RAG-experienced ILCs in these tissues using our lineage tracing mouse may hint at these possibilities, conclusive evidence would require more precise control over the timing of RAG lineage tracing than our current reagents allow (e.g. to control for induction in those environments vs migration of previously fate-mapped cells to those environments).

      To answer these questions directly, we are developing orthogonal lineage tracing mouse strains, which can report on both timing of ILC development and RAG expression, but these mice are not available yet. Given the limitations of our currently available reagents, we were careful to focus our manuscript on the skin phenotype and the more descriptive aspects of the RAG-induced phenotype. We have elaborated on these important questions and referenced all the studies noted by the Reviewer in the Discussion section as areas of future inquiry on lines 421-433.  

      (2) On line 104 & Figures 1C/G etc. the authors describe that in the RAG knock-out ILC2 are relatively more abundant in the lineage negative fraction. On line 108 they further briefly mentioned that this observation is an indication of enhanced ILC2 expansion. Since the study includes an extensive multi-omics analysis, could the authors discuss whether they have seen a correlation of RAG expression in ILC2 with regulation of genes associated with proliferation, which could explain this phenomenon?

      We thank the Reviewer for pointing out this opportunity to further correlate our functional and multiomic findings. To address this, we first looked deeper into our prior analyses and found that among the pathways enriched in GSEA analysis of differentially expressed genes (DEGs) between RAG<sup>+</sup> and RAG<sup>-</sup> ILC2s, one of the pathways suppressed in RAG<sup>+</sup> ILC2s was “GOBP_EPITHELIAL_CELL_PROLIFERATION” (Author response image 1).

      There are a few other gene sets present in other databases such as MSigDB with terms including “proliferation,” but these are often highly specific to a particular cell type and experimental or disease condition (e.g. tissue-specific cancers). We did not find any of these enriched in our GSEA analysis.

      Author response image 1.

      GSEA plot of GOBP epithelial proliferation pathway in RAG-experienced vs RAG-naïve ILC2s.

      The ability to predict cellular proliferation states from transcriptomic data is an area of active research, and there does not appear to be any universally accepted method to do this reliably. We found two recent studies (PMIDs 34762642; 36201535) that identified novel “proliferation signatures.” Since these gene sets are not present in any curated database, we repeated our GSEA analysis using a customized database with the addition of these gene sets. However, we did not find enrichment of these sets in our RAG+/- ILC2 DEG list. We also applied our GPL strategy integrating analysis of our epigenomic data to the proliferation signature genes, but we did not see any clear trend. Conversely, our GSEA analysis did not identify any enrichment for apoptotic signatures as a potential mechanism by which RAG may suppress ILC2s.

      Notwithstanding the limitations of inferring ILC2 proliferation states from transcriptomic and epigenomic data, our experimental data suggest RAG exerts a suppressive effect on ILC2 proliferation. To formally test the hypothesis that RAG suppresses proliferation in the most rigorous way, we feel new mouse strains are needed that allow simultaneous RAG fate mapping and temporally restricted fate mapping. We elaborate on this in new additions to the discussion on lines 421-433.

      Reviewer #2 (Public Review):

      Summary:

      The study by Ver Heul et al., investigates the consequences of RAG expression for type 2 innate lymphoid cell (ILC2) function. RAG expression is essential for the generation of the receptors expressed by B and T cells and their subsequent development. Innate lymphocytes, which arise from the same initial progenitor populations, are in part defined by their ability to develop in the absence of RAG expression. However, it has been described in multiple studies that a significant proportion of innate lymphocytes show a history of Rag expression. In compelling studies several years ago, members of this research team revealed that early Rag expression during the development of Natural Killer cells (Karo et al., Cell 2014), the first described innate lymphocyte, had functional consequences.

      Here, the authors revisit this topic, a worthwhile endeavour given the broad history of Rag expression within all ILCs and the common use of RAG-deficient mice to specifically assess ILC function. Focusing on ILC2s and utilising state-of-the-art approaches, the authors sought to understand whether early expression of Rag during ILC2 development had consequences for activity, fitness, or function. Having identified cell-intrinsic effects in vivo, the authors investigated the causes of this, identifying epigenetic changes associated with the accessibility of genes linked to core ILC2 functions.

      The manuscript is well written and does an excellent job of supporting the reader through reasonably complex transcriptional and epigenetic analyses, with considerate use of explanatory diagrams. Overall I think that the conclusions are fair, the topic is thought-provoking, and the research is likely of broad immunological interest. I think that the extent of functional data and mechanistic insight is appropriate.

      Strengths:

      - The logical and stepwise use of mouse models to first demonstrate the impact on ILC2 function in vivo and a cell-intrinsic role. Initial analyses show enhanced cytokine production by ILC2 from RAG-deficient mice. Then through two different chimeric mice (including BM chimeras), the authors convincingly show this is cell intrinsic and not simply as a result of lymphopenia. This is important given other studies implicating enhanced ILC function in RAG-/- mice reflect altered competition for resources (e.g. cytokines).

      - Use of Rag expression fate mapping to support analyses of how cells were impacted - this enables a robust platform supporting subsequent analyses of the consequences of Rag expression for ILC2.

      - Use of snRNA-seq supports gene expression and chromatin accessibility studies - these reveal clear differences in the data sets consistent with altered ILC2 function.

      - Convincing evidence of epigenetic changes associated with loci strongly linked to ILC2 function. This forms a detailed analysis that potentially helps explain some of the altered ILC2 functions observed in ex vivo stimulation assays.

      - Provision of a wealth of expression data and bioinformatics analyses that can serve as valuable resources to the field.

      We appreciate the strengths noted by the Reviewer for our study. We would like to especially highlight the last point about our single cell dataset and provision of supplemental data tables. Although our study is focused on AD-like skin disease and skin draining lymph nodes, we hope that our findings can serve as a valuable resource for future investigation into mechanisms of RAG modulation of ILC2s in other tissues and disease states.  

      Weaknesses:

      - Lack of insight into precisely how early RAG expression mediates its effects, although I think this is beyond the scale of this current manuscript. Really this is the fundamental next question from the data provided here.

      We thank the Reviewer for their recognition of the context of our current work and its future implications. We aimed to present compelling new observations within the scope of what our current data can substantiate. We believe answering the next fundamental question of the mechanisms by which RAG mediates its effects in ILC2s will require development of novel reagents. We are actively pursuing this, and we look forward to others building on our findings as well.

      - The epigenetic analyses provide evidence of differences in the state of chromatin, but there is no data on what may be interacting or binding at these sites, impeding understanding of what this means mechanistically.

      We thank the Reviewer for pointing out this aspect of the epigenomic data analysis and the opportunity to expand the scope of our manuscript. We performed additional analyses of our data to identify DNA binding motifs and infer potential transcription factors that may be driving the effects of a history of RAG expression that we observed. We hope that these additional data, analyses, and interpretation add meaningful insight for our readers.

      We first performed the analysis for the entire dataset and validated that the analysis yielded results consistent with prior studies (e.g. finding EOMES binding motifs as a marker in NK cells). Then, we examined the differences in RAG fate-mapped ILC2s. These analyses are in new Figure S10 and discussed on lines 277-316.  

      We also performed an analysis specifically on the Th2 locus, given the effects of RAG on type 2 cytokine expression. These analyses are in new Figure S12 and discussed on lines 366-378.

      - Focus on ILC2 from skin-draining lymph nodes rather than the principal site of ILC2 activity itself (the skin). This may well reflect the ease at which cells can be isolated from different tissues.

      We appreciate the Reviewer’s insight into the limitations of our study. Difficulties in isolating ILC2s from the skin were indeed a constraint in our study. In particular, we were unable to isolate enough ILC2s from the skin for stimulation and cytokine staining. Given that one of our main hypotheses was that RAG affects ILC2 function, we focused our studies on skin draining lymph nodes, which allowed measurement of the two main ILC2 functional cytokines, IL-5 and IL-13, as readouts in the key steady state and AD-like disease experiments.

      - Comparison with ILC2 from other sites would have helped to substantiate findings and compensate for the reliance on data on ILC2 from skin-draining lymph nodes, which are not usually assessed amongst ILC2 populations.

      We agree with the Reviewer that a broader survey of the RAG-mediated phenotype in other tissues and by extension other disease models would strengthen the generalizability of our observations. Indeed, we did a more expansive survey of tissues in our BM chimera experiments. We found a similar trend to our reported findings in the sdLN in tissues known to be affected by ILC2s (Author response image 2) including the skin and lung and in other lymphoid tissues including spleen and mesenteric lymph nodes (mLN). We found that donor reconstitution in each tissue was robust except for the skin, where there was no significant difference between host and donor CD45<sup>+</sup> immune cells and where CD45<sup>-</sup> parenchymal cells predominated (Author response image 2A,C,E,G,I). This may explain why Rag1<sup>-/-</sup> donor ILC2s were significantly higher in proportion in all tissues except the skin, where we observed a similar trend that was not statistically significant (Author response image 2B,D,F,H,J).

      Notwithstanding these results, given that we unexpectedly observed enhanced AD-like inflammation in the MC903 model in Rag1 KO mice, we concentrated our later experiments and analyses on defining the differences in skin draining ILC2s modulated by RAG. Our subsequent findings in the skin provoke many new hypotheses about the role of RAG in ILC2s in other tissues, and our tissue survey in the BM chimera provides additional rationale to pursue similar studies in disease models in other tissues. While this is an emerging area of investigation in our lab, we opted to focus this manuscript on our findings related to the AD-like disease model. We have ongoing studies to investigate other tissues, and we are still in the early stages of developing disease models to expand on these findings. However, if the reviewer feels strongly this additional data should be included in the manuscript, we are happy to add it. Considering the complexity of the data and concepts in the manuscript, we hoped to keep it focused to where we have strong molecular, cellular, and phenotypic outcomes.

      Author response image 2.

      Comparison of immune reconstitution and ILC2 donor proportions in different tissues from BM chimeras. Equal quantities of bone marrow cells from Rag1<sup>-/-</sup> (CD45.2, CD90.2) and WT (CD45.2, CD90.1) C57Bl/6J donor mice were used to reconstitute the immune systems of irradiated recipient WT (CD45.1) C57Bl/6J mice. The proportion of live cells that are donor-derived (CD45.2), host-derived (CD45.1), or parenchymal (CD45-) [above] and proportion of ILC2s that are from Rag1<sup>-/-</sup> (CD90.2) or WT (CD90.1) donors [below] for A,B) skin, C,D) sdLN, E,F) lung, G,H) spleen, and I,J) mLN.

      - The studies of how ILC2 are impacted are a little limited, focused exclusively on IL-13 and IL-5 cytokine expression.

      We agree with the reviewer that our functional readout on IL-5 and IL-13 is relatively narrow. However, this focused experimental design was based on several considerations. First, IL-5 and IL-13 are widely recognized as major ILC2 effector molecules (Vivier et al, 2018, PMID 30142344). Second, in the MC903 model of AD-like disease, we have previously shown a clear correlation between ILC2s, levels of IL-5 and IL-13, and disease severity as measured by ear thickness (Kim et al, 2013, PMID 23363980). Depletion of ILC2s led to decreased levels of IL-13 and IL-5 and correspondingly reduced ear inflammation. However, while ILC2s are also recognized to produce other effector molecules such as IL-9 and Amphiregulin, which are likely involved in human atopic dermatitis (Namkung et al, 2011, PMID 21371865; Rojahn et al, 2020, PMID 32344053), there is currently no evidence linking these effectors to disease severity in the MC903 model. Third, IL-13 is emerging as a key cytokine driving atopic dermatitis in humans (Tsoi et al, 2019, PMID 30641038). Drugs targeting the IL-4/IL-13 receptor (dupilumab), or IL-13 itself (tralokinumab, lebrikizumab), have shown clear efficacy in treating atopic dermatitis. Interestingly, drugs targeting more upstream molecules, like TSLP (tezepelumab) or IL-33 (etokimab), have failed in atopic dermatitis. Taken together, these findings from both mouse and human studies suggest IL-13 is a critical therapeutic target, and thus functional readout, in determining the clinical implications of type 2 immune activation in atopic dermatitis.

      Aside from effector molecules, other readouts such as surface receptors may be of interest in understanding the mechanism of how RAG influences ILC2 function. For example, IL-18 has been shown to be an important co-stimulatory molecule along with TSLP in driving production of IL-13 by cutaneous ILC2s (Ricardo-Gonzalez et al, 2018, PMID 30201992). Our multiomic analysis showed decreased IL-18 receptor regulome activity in RAG-experienced ILC2s, which may be a mechanism by which RAG suppresses IL-13 production. Ultimately, in that study the role of IL-18 in enhancing MC903-induced inflammation through ILC2s was via increased production of IL-13, which was one of our major functional readouts. To clearly define mechanisms like these will require generation of new mice to interrogate RAG status in the context of tissue-specific knockout of other genes, such as the IL-18 receptor. We plan to perform these types of experiments in follow up studies. Notwithstanding this, we have now included additional discussion on lines 476-508 to highlight why understanding how RAG impacts other regulatory and effector pathways would be an interesting area of future inquiry.

      Reviewer #3 (Public Review):

      In this study, Ver Heul et al. investigate the role of RAG expression in ILC2 functions. While RAG genes are not required for the development of ILCs, previous studies have reported a history of expression in these cells. The authors aim to determine the potential consequences of this expression in mature cells. They demonstrate that ILC2s from RAG1 or RAG2 deficient mice exhibit increased expression of IL-5 and IL-13 and suggest that these cells are expanded in the absence of RAG expression. However, it is unclear whether this effect is due to a direct impact of RAG genes or a consequence of the lack of T and B cells in this condition. This ambiguity represents a key issue with this study: distinguishing the direct effects of RAG genes from the indirect consequences of a lymphopenic environment.

      The authors focus their study on ILC2s found in the skin-draining lymph nodes, omitting analysis of tissues where ILC2s are more enriched, such as the gut, lungs, and fat tissue. This approach is surprising given the goal of evaluating the role of RAG genes in ILC2s across different tissues. The study shows that ILC2s derived from RAG-/- mice are more activated than those from WT mice, and RAG-deficient mice show increased inflammation in an atopic dermatitis (AD)-like disease model. The authors use an elegant model to distinguish ILC2s with a history of RAG expression from those that never expressed RAG genes. However, this model is currently limited to transcriptional and epigenomic analyses, which suggest that RAG genes suppress the type 2 regulome at the Th2 locus in ILC2s.

      We agree with the Reviewer that understanding the role of RAG in ILC2s across different tissues is an important goal. One of the primary inspirations for our paper was the clinical paradox that patients with Omenn syndrome, despite having profound adaptive T cell deficiency, develop AD with much greater penetrance than in the general population. Thus, there was always an appreciation for the likelihood that skin ILC2s have a unique proclivity towards the development of AD-like disease. Notwithstanding this, given the profound differences that can be found in ILC2s based on their tissue residence and disease state (as the Reviewer also points out below), we focused our investigations on characterizing the skin draining lymph nodes to better define factors underlying our initial observations of enhanced AD-like disease in Rag1<sup>-/-</sup> mice. While our findings in skin provoke the hypothesis that similar effects may be observed in other tissues and influence corresponding disease states, we were cautious not to suggest this may be the case by reporting surveys of other tissues without development of additional disease models to formally test these hypotheses. We present this manuscript now as a short, skin-focused study, rather than delaying publication to expand its scope. Truthfully, this project started in 2015 and has undergone many delays with the hopes of newer technologies and reagents coming to add greater clarity. We hope our study will enable others to pursue the goal of understanding the broader effects of RAG in ILC2s, and potentially other innate lymphoid lineages as well.

      We did a more expansive survey of tissues in our BM chimera experiments. We found a similar trend to our reported findings in the sdLN in tissues known to be affected by ILC2s (Author response image 2), including the skin and lung, and in other lymphoid tissues, including spleen and mesenteric lymph nodes (mLN). We found that donor reconstitution in each tissue was robust except for the skin, where there was no significant difference between host and donor CD45<sup>+</sup> immune cells and where CD45<sup>-</sup> parenchymal cells predominated (Author response image 2A,C,E,G,I). This may explain why Rag1<sup>-/-</sup> donor ILC2s were significantly higher in proportion in all tissues except the skin, where we observed a similar trend that was not statistically significant (Author response image 2B,D,F,H,J). Given the lack of correlation to disease readouts in other organ systems, we chose not to include these data in our manuscript. However, if the Reviewer feels these data should be included, we would be happy to include them as a supplemental figure.

      The authors report a higher frequency of ILC2s in RAG-/- mice in skin-draining lymph nodes, which is expected as these mice lack T and B cells, leading to ILC expansion. Previous studies have reported hyper-activation of ILCs in RAG-deficient mice, suggesting that this is not necessarily an intrinsic phenomenon. For example, RAG-/- mice exhibit hyperphosphorylation of STAT3 in the gut, leading to hyperactivation of ILC3s. This study does not currently provide conclusive evidence of an intrinsic role of RAG genes in the hyperactivation of ILC2s. The splenocyte chimera model is artificial and does not reflect a normal environment in tissues other than the spleen. Similarly, the mixed BM model does not demonstrate an intrinsic role of RAG genes, as RAG1-/- BM cells cannot contribute to the B and T cell pool, leading to an expected expansion of ILC2s. As the data are currently presented, it is expected that a proportion of IL-5-producing cells will come from the RAG1-/- BM.

      The Reviewer raises an important point about the potential cell-intrinsic roles of RAG vs the many cell-extrinsic explanations that could affect ILC2 populations, with the most striking being the lack of T and B cells in RAG knockout mice. It is well-established that splenocyte transfer into T and B cell-deficient mice reconstitutes T cell-mediated effects (such as the T cell transfer colitis model pioneered by Powrie and others), and we were careful in our interpretation of the splenocyte chimera experiment to conclude only that lack of Tregs was unlikely to explain the enhanced AD-like disease in T (and B) cell-deficient mice.

      We agree with the Reviewer that the Rag1<sup>-/-</sup> BM will not contribute to the B and T cell pool. However, BM from the WT mice would be expected to contribute to development of the adaptive lymphocyte pool. Indeed, we found that most of the CD45<sup>+</sup> immune cells in the spleens of BM chimera mice were donor-derived (Author response image 3A), and total levels of B cells and T cells showed reconstitution in a pattern similar to control spleens from donor WT mice, while spleens from donor Rag1<sup>-/-</sup> mice expectedly had essentially no detectable adaptive lymphocytes (Author response image 3B-D). From this, we concluded the BM chimera experiment was successful in establishing an immune environment with the presence of adaptive lymphocytes, and the differences in ILC2 proportions we observed were in the context of developing alongside a normal number of B and T lymphocytes. Notwithstanding the potential role of the adaptive lymphocyte compartment in shaping ILC2 development, since we transplanted equal amounts of WT and Rag1<sup>-/-</sup> BM into the same recipient environment, we are not able to explain how cell-extrinsic effects alone would account for the unequal numbers of WT vs Rag1<sup>-/-</sup> ILC2s we observed after immune reconstitution.

      Author response image 3.

      Comparison of immune reconstitution in BM chimeras to controls. Equal quantities of bone marrow cells from Rag1<sup>-/-</sup> (CD45.2) and WT (CD45.2) C57Bl/6J donor mice were used to reconstitute the immune systems of irradiated recipient WT (CD45.1) C57Bl/6J mice. A) Number of WT recipient CD45.1+ immune cells in the spleens of recipient mice compared to number of donor CD45.2+ cells (WT and Rag1<sup>-/-</sup>) normalized to 100,000 live cells. Comparison of numbers of B cells, CD4+ T cells, and CD8+ T cells in spleens of B) BM chimera mice, C) control WT mice and D) control Rag1<sup>-/-</sup> mice.

      We also subsequently found transcriptional and epigenomic differences in RAG-experienced ILC2s compared to RAG-naïve ILC2s. Critically, these differences were present in ILC2s from the same mice that had developed normally within an intact immune system, rather than in the setting of a BM transplant or a defective immune background such as in Rag1<sup>-/-</sup> mice.

      We recognize that there are almost certainly cell-extrinsic factors affecting ILC2s in Rag1<sup>-/-</sup> mice due to lack of B and T cells, and that BM chimeras are not perfect substitutes for simulating normal hematopoietic development. However, the presence of cell-extrinsic effects does not negate the potential contribution of cell-intrinsic factors as well, and we respectfully stand by our conclusion that our data support a role, however significant, for cell-intrinsic effects of RAG in ILC2s.

      Finally, the Reviewer mentions the interesting observation that gut ILC3s exhibit hyperphosphorylation of STAT3 in Rag1<sup>-/-</sup> mice compared to WT as an example of cell-extrinsic effects of RAG deficiency (we assume this is in reference to Mao et al, 2018, PMID 29364878 and subsequent work). We now reference this paper and have included additional discussion on how our observations of ILC2s may be generalizable to not only other organ systems, but also other ILC subsets, limitations on these generalizations, and future directions on lines 477-520.

      Overall, the level of analysis could be improved. Total cell numbers are not presented, the response of other immune cells to IL-5 and IL-13 (except the eosinophils in the splenocyte chimera mice) is not analyzed, and the analysis is limited to skin-draining lymph nodes.

      We thank the Reviewer for the suggestions to add rigor to our analysis. ILC2 populations are relatively rare, and we designed our experiments to assess frequencies, rather than absolute numbers. We did not utilize counting beads, so our counts may not be comparable between samples. We have added additional data for absolute cell counts normalized to 100,000 live cells for each experiment (see below for a summary of new panels in each figure). Our new data on total cell numbers are consistent with the initial observations regarding frequency of ILC2s we reported from our experiments. For the BM chimera experiments, we presented the proportions of ILC2s, and IL-5 and IL-13 positive ILC2s, by donor source, as this is the critical question of the experiment. Notwithstanding our analysis by proportion, we found that the frequency of Rag1<sup>-/-</sup> ILC2s, IL-5<sup>+</sup> cells, or IL-13<sup>+</sup> cells within Lin- population was also significantly increased. While our initial submission included only the proportions for clarity and simplicity, we now include frequency and absolute numbers in new panels for more critical appraisal of our data by readers.

      In New Figure 1, we added new panels for ILC2 cell number in both the AD-like disease experiment (C) and in steady state (H).

      In New Figure S2, we added a panel for ILC2 cell number in steady state (B).

      In Figure 2 and associated supplemental data in Figure S4, we added several more panels. For the splenocyte chimera, we added a panel for ILC2 cell number in New Figure 2C.

      We incorporated multiple new panels in New Figure S4 to address the need for more data to be shown for the BM chimera (also requested by Reviewer #2). These included total cell counts and frequency for ILC2 (New Figure S4F,G), and IL-5<sup>+</sup> (New Figure S4I,K) and IL-13<sup>+</sup> (New Figure S4J,L) ILCs in addition to the proportions originally presented in Figure 2.  
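
      As an illustration of the normalization to 100,000 live cells mentioned above, the following is a minimal sketch of the arithmetic only. The function name and the event counts are hypothetical and are not taken from our data; the sketch assumes that the number of gated events and the number of live-cell events are both available for each sample.

```python
def normalize_per_100k(event_count: int, live_cell_count: int,
                       scale: int = 100_000) -> float:
    """Express a gated population as events per `scale` live cells.

    Hypothetical helper for illustration; not the analysis code used
    in the study.
    """
    if live_cell_count <= 0:
        raise ValueError("live_cell_count must be positive")
    return event_count * scale / live_cell_count


# Hypothetical (gated ILC2 events, live-cell events) per sample.
samples = {
    "WT": (120, 400_000),
    "Rag1_KO": (450, 600_000),
}

for name, (ilc2_events, live_events) in samples.items():
    value = normalize_per_100k(ilc2_events, live_events)
    print(f"{name}: {value:.1f} ILC2s per 100,000 live cells")
```

      Because this scaling relies on acquired live events rather than counting beads, the resulting values are best compared within an experiment and not between experiments.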

      In terms of the limited analysis of other tissues, our initial observation of enhanced AD-like disease in Rag1<sup>-/-</sup> compared to WT mice built on our prior work elucidating the role of ILC2s in the MC903 model of AD-like disease in mice and AD in humans (Kim et al, 2013, PMID 23363980). Consequently, we focused on the skin to further develop our understanding of the role of RAG1 in this model. As in our prior studies, technical limitations in obtaining sufficient numbers of ILC2s from the skin itself for ex vivo stimulation to assess effector cytokine levels required performing these experiments in the skin draining lymph nodes.

      We agree that IL-5 and IL-13 are major mediators of type 2 pathology and studying their effects on immune cells is an important area of inquiry, particularly since there are multiple drugs available or in development targeting these pathways. However, our goal was not to study what was happening downstream of increased cytokine production from ILC2s, but instead to understand what was different about RAG-deficient or RAG-naïve ILC2s themselves that drives their expansion and production of effector cytokines compared to RAG-sufficient or RAG-experienced ILC2s. By utilizing the same MC903 model in which we previously showed a critical role for ILC2s in driving IL-5 and IL-13 production and subsequent inflammation in the skin, we were able to instead focus on defining the cell-intrinsic aspects of RAG function in ILC2s.

      The authors have a promising model in which they can track ILC2s that have expressed RAG or not. They need to perform a comprehensive characterization of ILC2s in these mice, which develop in a normal environment with T and B cells. Approximately 50% of the ILC2s have a history of RAG expression. It would be valuable to know whether these cells differ from ILC2s that never expressed RAG, in terms of proliferation and expression of IL-5 and IL-13. These analyses should be conducted in different tissues, as ILC2s adapt their phenotype and transcriptional landscape to their environment. Additionally, the authors should perform their AD-like disease model in these mice.

      We agree with the Reviewer (and a similar comment from Reviewer #2) that a broader survey of the RAG-mediated phenotype in other tissues and by extension other disease models would strengthen the generalizability of our observations. Indeed, we did a more expansive survey of tissues in our BM chimera experiments. We found a similar trend to our reported findings in the sdLN in tissues known to be affected by ILC2s (Author response image 2), including the skin and lung, and in other lymphoid tissues, including spleen and mesenteric lymph nodes (mLN). We found that donor reconstitution in each tissue was robust except for the skin, where there was no significant difference between host and donor CD45<sup>+</sup> immune cells and where CD45<sup>-</sup> parenchymal cells predominated (Author response image 2A,C,E,G,I). This may explain why Rag1<sup>-/-</sup> donor ILC2s were significantly higher in proportion in all tissues except the skin, where we observed a similar trend that was not statistically significant (Author response image 2B,D,F,H,J). We omitted these analyses to maintain the focus on the skin, but we will be happy to add these data to the manuscript if the Reviewer feels this figure would be helpful.

      Notwithstanding these results, given that we unexpectedly observed enhanced AD-like inflammation in the MC903 model in Rag1 KO mice, we concentrated our later experiments and analyses on defining the differences in skin-draining ILC2s modulated by RAG. Our subsequent findings in the skin provoke many new hypotheses about the role of RAG in ILC2s in other tissues, and our tissue survey in the BM chimera provides additional rationale to pursue similar studies in disease models in other tissues. While this is an emerging area of investigation in our lab, we opted to focus this manuscript on our findings related to the AD-like disease model. We have ongoing studies to investigate other tissues, and we are still in the early stages of developing disease models to expand on these findings. However, if the Reviewer feels strongly that these additional data should be included in the manuscript, we are happy to add them. Considering the complexity of the data and concepts in the manuscript, we hoped to keep it focused on where we have strong molecular, cellular, and phenotypic outcomes. We elaborate on the implications of our work for future studies, including limitations of our study and currently available reagents and the need for new mouse strains to rigorously answer these questions, on lines 476-508.

      The authors provide a valuable dataset of single-nuclei RNA sequencing (snRNA-seq) and ATAC sequencing (snATAC-seq) from RAGexp (RAG fate map-positive) and RAGnaïve (RAG fate map-negative) ILC2s. This elegant approach demonstrates that ILC2s with a history of RAG expression are epigenomically suppressed. However, key genes such as IL-5 and IL-13 do not appear to be differentially regulated between RAGexp and RAGnaïve ILC2s according to Table S5. Although the authors show that the regulome activity of IL-5 and IL-13 is decreased in RAGexp ILC2s, how do the authors explain that these genes are not differentially expressed between the RAGexp and RAGnaïve ILC2? I think that it is important to validate this in vivo.

      We thank the Reviewer for highlighting the value and possible elegance of our data. The Reviewer brings up an important issue that we grappled with in this study and that highlights a major technical limitation of single cell sequencing studies. Genes for secreted factors such as cytokines are often transcribed at low levels and are poorly detected in transcriptomic studies. This is particularly true in single cell studies with lower sequencing depth. Various efforts have been made to overcome these issues, such as computational approaches to estimate missing data (e.g. van Dijk et al, 2018, PMID 29961576; Huang et al, 2018, PMID 29941873), or recent use of cytokine reporter mice and dial-out PCR to enhance key cytokine signals in sequenced ILCs (Bielecki et al, 2021, PMID 33536623). We did not utilize computational methods, to avoid the risk of introducing artifacts into the data, and we did not perform our study in cytokine reporter mice. Thus, cytokines were poorly detected in our transcriptomic data, as evidenced by lack of identification of cytokines as markers for specific clusters (e.g. IL-5 for ILC2s) or significant differential expression between RAG-naïve and RAG-experienced ILC2s.

      However, the multiomic features of our data allowed a synergistic analysis to identify effects on cytokines. For example, transcripts for IL-4 and IL-5 were not detected at a high enough level to qualify as marker genes of the ILC2 cluster in the gene expression (GEX) assay but were identified as markers for the ILC2 cluster in the ATAC-seq data in the differentially accessible chromatin (DA) assay. Using the combined RNA-seq and ATAC-seq gene-to-peak link (GPL) analyses, many GPLs were identified in the Th2 locus for ILC2s, including for IL-13, which was not identified as a marker for ILC2s by any of the assays alone. Thus, our combined analysis took advantage of the potential of multiomic datasets to overcome a general weakness inherent to most scRNAseq datasets.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      - Line 168; Reference 23 also showed expression in the NK cells, please add this reference to reference 24.

      We thank the reviewer for catching this oversight, and we have corrected it in the revised manuscript.

      - Please add the full names for GPL and sdLN in the text of the manuscript when first using these abbreviations. They are now only explained in the legends.

      We reviewed the manuscript text and found that we defined sdLNs for the first time on line 104. We defined GPLs for the first time on line 248. We believe these definitions are placed appropriately near the first references to the corresponding figures/analysis, but if the Reviewer believes we should move these definitions earlier, we are happy to do so.

      Reviewer #2 (Recommendations For The Authors):

      I would suggest that the following reanalyses would improve the clarity of the data:

      - Can ILC2 numbers, rather than frequency, be used (e.g. in Figure 1C, S2B, and so on). This would substantiate the data that currently relies on percentages.

      This was a weakness also noted by Reviewer #3. We have added data on ILC2 numbers for each experiment as outlined below:

      In New Figure 1, we added new panels for ILC2 cell number in both the AD-like disease experiment (C) and in steady state (H).

      In New Figure S2, we added a panel for ILC2 cell number in steady state (B).

      In Figure 2 and associated supplemental data in Figure S4, we added several more panels. For the splenocyte chimera, we added a panel for ILC2 cell number in New Figure 2C.

      We incorporated multiple new panels in New Figure S4 to address the need for more data to be shown for the BM chimera (also requested by Reviewer #3). These included total cell counts and frequency for ILC2 (New Figure S4F,G), and IL-5<sup>+</sup> (New Figure S4I,K) and IL-13<sup>+</sup> (New Figure S4J,L) ILCs in addition to the proportions originally presented in Figure 2.

      - Can the authors provide data on IL-33R expression on sdLN ILC2s? Expression of ST-2 (IL-33R) does vary between ILC2 populations and is impacted by the digestion of tissue. All of the data provided here requires ILC2 to be IL-33R<sup>+</sup>. In the control samples, the ILC2 compartment is very scarce - in LNs, ILC2s are rare. The gating strategy with limited resolution of positive and negative cells in the lineage gate doesn't help this analysis.

      The Reviewer raises a valid point regarding the IL-33R marker and ILC2s. We designed our initial experiments to be consistent with our earlier observations of skin ILC2s, which were defined as CD45<sup>+</sup>Lin<sup>-</sup>CD90<sup>+</sup>CD25<sup>+</sup>IL-33R<sup>+</sup>, and the scarcity of skin-draining lymph node ILC2s at steady state was consistent with our prior findings (Kim et al, 2013, PMID 23363980). We can include MFI data on IL-33R expression in these cells if the Reviewer feels strongly that this would add to the manuscript, but we did not include other ILC2-specific markers in these experiments that would give us an alternative total ILC2 count to calculate frequency of IL-33R<sup>+</sup> ILC2s, which would also make the context of the IL-33R MFI difficult to interpret.

      Other studies defining tissue-specific expression patterns in ILC2s have called into question whether IL-33R is a reliable marker to define skin ILC2s (Ricardo-Gonzalez et al, 2018, PMID 30201992). However, there is evidence for region-specific expression of IL-33R (Kobayashi et al, 2019, PMID 30712873), with ILC2s in the subcutis expressing high levels of IL-33R and both IL-5 and IL-13, while ILC2s in the epidermis and dermis have low levels of IL-33R and IL-5 expression. In contrast to the Kobayashi et al study, Ricardo-Gonzalez et al sequenced ILC2s from whole skin, thus the region-specific expression patterns were not preserved, and the lower expression of IL-33R in the epidermis and dermis may have diluted the signal from the ILC2s in the subcutis. These may also be the ILC2s most likely to drain into the lymph nodes, which is the tissue on which we focused our analyses (consistent with our prior work in Kim et al, 2013).

      - In Figure 2 (related to 2H, 2I) can flow plots of the IL-5 versus IL-13 gated on either CD90.1+CD45.2+ or CD90.2+CD45.2+ ILC2 be shown? I.e. gate on the ILC2s and show cytokine expression, rather than the proportion of donor IL5/13. The proportion of donor ILC2 is shown to be significantly higher in 2G. Therefore gating on the cells of interest and showing on a cellular basis their ability to produce the cytokines would better make the point I think.

      We agree that this is important additional data to include. We have added flow plots of sdLN ILC2s from the BM chimera divided by donor genotype showing IL-5 and IL-13 expression in New Figure S4H.

      I assume the authors have looked and there is no obvious data, but does analysis of transcription factor consensus binding sequences in the open chromatin provide any new insight?

      The Reviewer also commented on this in the public review. As copied from our response above:

      We found that the most enriched sites in the ILC2 gene loci contained the consensus sequence GGGCGG (or its reverse complement), a motif recognized by a variety of zinc finger transcription factors (TFs). Our analyses predicted the KLF family of zinc finger TFs as most likely to be enriched at the identified open chromatin regions. To infer which KLFs might be occupying these sites in the RAG-experienced or RAG-naïve cells, we also assessed the expression levels of these identified TFs. Interestingly, KLF2 and KLF6 are more highly expressed in RAG-experienced ILC2s. KLF6 is a tumor suppressor (PMID: 11752579), and both KLF6 and KLF2 were recently shown to be markers of “quiescent-like” ILCs (PMID: 33536623). Further, upon analysis of the Th2 locus, the (A/T)GATA(A/G) consensus site (or reverse complement) was enriched in identified open chromatin at that locus. The algorithm predicted multiple TFs from the GATA family as possible binding partners, but expression analysis showed only GATA3 was highly expressed in ILC2s, consistent with what would be predicted from prior studies (PMID: 9160750).

      We have added this data in new Figure S10 and new Figure S12, with corresponding text in the Results section on lines 277-316 and lines 366-378.
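
      To make concrete what matching a consensus sequence "or its reverse complement" means, the following is a minimal sketch, not the motif-enrichment algorithm used in our analysis. It scans a DNA string on both strands for the two consensus sites named above; the function names and the toy sequences are hypothetical and chosen only for illustration.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_sites(sequence: str, pattern: str) -> list[tuple[int, str]]:
    """Find consensus-site matches on both strands.

    Returns sorted (0-based start on the forward strand, strand) pairs.
    Illustrative helper only; assumes `pattern` is a simple regular
    expression over A/C/G/T such as "GGGCGG" or "[AT]GATA[AG]".
    """
    seq = sequence.upper()
    n = len(seq)
    # A lookahead lets overlapping occurrences be reported as well.
    regex = re.compile(f"(?=({pattern}))")
    hits = [(m.start(), "+") for m in regex.finditer(seq)]
    for m in regex.finditer(reverse_complement(seq)):
        width = len(m.group(1))
        # Convert the reverse-strand position to forward coordinates.
        hits.append((n - m.start() - width, "-"))
    return sorted(hits)


# Toy sequences (hypothetical, not from the dataset).
print(find_sites("AAGGGCGGTTCCGCCCAA", "GGGCGG"))
print(find_sites("CCAGATAGCC", "[AT]GATA[AG]"))
```

      In the first toy sequence the site at position 10 reads CCGCCC on the forward strand, which is GGGCGG on the opposite strand, so it is counted as an occurrence of the same motif.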

      In terms of phrasing and presentation:

      - It would help to provide some explanation of why all analyses focus on the draining LNs rather than the actual site of inflammation (the ear skin). I do not think it appropriate to ask for data on this as this would require extensive further experimentation, but there should be some discussion on this topic. This feels relevant given that the skin is the site of inflammatory insult and ILC2 is present here. How the ILC2 compartment in the skin-draining lymph nodes relates to those in the skin is not completely clear, particularly given the prevailing dogma that ILC2 are tissue-resident.

      Given limitations of assessing cytokine production of the relatively rare population of skin-resident ILC2s, we focused on the skin-draining lymph nodes (sdLN). Our findings in the current manuscript are consistent with our prior work in Kim et al, 2013 (PMID 23363980), and more recently in Tamari et al, 2024 (PMID 38134932), which demonstrated correlation of increased ILC2s in sdLN with increased skin inflammation in the MC903 model. Similarly, Dutton et al (PMID 31152090) have demonstrated expansion of the sdLN ILC2 pool in response to MC903-induced AD-like inflammation in mice. We elaborate on the implications of our work for future studies, including limitations of our study (such as the focus on the sdLN) and currently available reagents and the need for new mouse strains to rigorously answer these questions, on lines 476-508.

      - I think the authors should explicitly state that cytokine production is assessed after ex vivo restimulation (e.g. Lines 112-113).

      We have added this statement to the revised text.

      - I also think that it would help to be consistent with axis scales where analyses are comparable (e.g. Figure 1D vs Figure 1H).

      We agree with the Reviewer and we have adjusted the axes for consistency. The data remains unchanged, but axes are slightly adjusted in New Figure 1 (D&I, E&J, F&K) and New Figure S2 (C-E match New Figure 1 D-F). This same axis scaling scheme is carried forward to New Figure 2 (D-E) and New Figure S4 (G,K,L). New data on cell counts is also included per request by Reviewers 2 and 3 (see above). However, we found results for total cells, including ILC2s (New Figure 1C,H, New Figure S2B, New Figure 2C, New Figure S4F), were consistent within experiments, but not between experiments, likely representing issues with normalizing counts (we did not include counting beads for more accurate total counts). Thus, the y-axes in those panels are not consistent between experiments/figures.

      We feel reporting the proportion of WT vs Rag1<sup>-/-</sup> donor cells for the BM chimera is most illustrative of the effect of RAG and have kept it in the main New Figure 2, but for the BM chimera experiment panels we also include the total counts of IL-5<sup>+</sup> and IL-13<sup>+</sup> ILC2s (New Figure S4I,J).

    1. Author response:

      The following is the authors’ response to the original reviews.

      In summary, the changes made in the revision process include:

      An addition of a paragraph in the result section that discusses the absolute values of measured Young’s moduli in the light of probing frequencies, accompanied by a new supplementary figure and a supplementary table that support that discussion

      - Fig. S10. Absolute Young’s modulus values across the frequencies characteristic for the three measurement methods.

      - Table S9. Operation parameters of the three methods used for characterizing the mechanical properties of cells.

      Three new supplementary figures that display the expression matrices for the genes from the identified modules in carcinoma datasets used for validation:

      - Fig. S4. Expression of identified target genes in the CCLE microarray dataset used for validation.

      - Fig. S5. Expression of identified target genes in the CCLE RNA-Seq dataset used for validation.

      - Fig. S6. Expression of identified target genes in the Genentech dataset used for validation.

      An addition of a paragraph in the discussion section that discusses the intracellular origins of resistance to deformation and the dominance of actin cortex at low deformations.

      - Refinement of the manuscript text and figures based on the specific feedback from the Reviewers.

      Please see below for detailed responses to the Reviewers’ comments.

      Reviewer #1 (Public Review)

      In this work, Urbanska and colleagues use a machine-learning based crossing of mechanical characterisations of various cells in different states and their transcriptional profiles. Using this approach, they identify a core set of five genes that systematically vary together with the mechanical state of the cells, although not always in the same direction depending on the conditions. They show that the combined transcriptional changes in this gene set are strongly predictive of a change in the cell mechanical properties, in systems that were not used to identify the genes (a validation set). Finally, they experimentally alter the expression level of one of these genes, CAV1, that codes for the caveolin 1 protein, and show that, in a variety of cellular systems and contexts, perturbations in the expression level of CAV1 also induce changes in cell mechanics, cells with lower CAV1 expression being generally softer.

      Overall the approach seems accessible, sound and is well described. My personal expertise is not suited to judge its validity, novelty or relevance, so I do not make comments on that. The results it provides seem to have been thoroughly tested by the authors (using different types of mechanical characterisations of the cells) and to be robust in their predictive value. The authors also show convincingly that one of the genes they identified, CAV1, is not only correlated with the mechanical properties of cells, but also that changing its expression level affects cell mechanics. At this stage, the study appears mostly focused on the description and validation of the methodological approach, and it is hard to really understand what the results obtained really mean, the importance of the biological finding - what is this set of 5 genes doing in the context of cell mechanics? Is it really central, or is it just one of the set of knobs on which the cell plays - and it is identified by this method because it is systematically modulated but maybe, for any given context, it is not the dominant player - all these fundamental questions remain unanswered at this stage. On one hand, it means that the study might have identified an important novel module of genes in cell mechanics, but on the other hand, it also reveals that it is not yet easy to interpret the results provided by this type of novel approach.

      We thank Reviewer #1 for the thoughtful evaluation of our manuscript. The primary goal of the manuscript was to present a demonstration of an unbiased approach for the identification of genes involved in the regulation of cell mechanics. The manuscript further provides a comprehensive computational validation of all genes from the identified network, and experimental validation of a selected gene, CAV1.

      We agree that at the current stage, far-reaching conclusions about the biological meaning of the identified network cannot be made. We are, however, convinced that the identification of an apparently central player such as CAV1 across various cellular systems is per se meaningful, in particular since CAV1 modulation shows clear effects on the cell mechanical state in several cell types. 

      We anticipate that our findings will encourage more mechanistic studies in the future, investigating how these identified genes regulate mechanical properties and interact with each other. Notwithstanding, the identified genes (after testing in the specific system of interest) can be readily used as genetic targets for modulating mechanical properties of cells. Access to such modifications is of huge relevance not only for performing further research on the functional consequences of cell mechanics changes (in particular in in-vivo systems where using chemical perturbations is not always possible), but also for the potential future implementation in modulating mechanical properties of the cells to prevent disease (for example to inhibit cancer metastasis or increase efficacy of cancer cell killing by cytotoxic T cells).

      We have now added the following sentence to the first paragraph of the discussion to acknowledge the open ends of our study:

      “(...). Here we leveraged this opportunity by performing discriminative network analysis on transcriptomes associated with mechanical phenotype changes to elucidate a conserved module of five genes potentially involved in cell mechanical phenotype regulation. We provided evidence that the inferred conserved functional network module contains an ensemble of five genes that, in particular when combined in a unique combinatorial marker, are universal, specific and trustworthy markers of mechanical phenotype across the studied mouse and human systems. We further demonstrated on the example of a selected marker gene, CAV1, that its experimental up- and downregulation impacts the stiffness of the measured cells. This demonstrates that the level of CAV1 not only correlates with, but also is causative of mechanical phenotype change. The mechanistic insights into how precisely the identified genes are involved in regulating mechanical properties, how they interact with each other, and whether they are universal and dominant in various contexts all remain to be established in future studies.”

      Reviewer #2 (Public Review)

      A key strength is that the quantitative approaches all add rigor to what is being attempted. The approach with very different cell culture lines will in principle help identify constitutive genes that vary in a particular and predictable way. To my knowledge, one other study that should be cited posed a similar pan-tissue question using mass spectrometry proteomics instead of gene expression, and also identified a caveolae component (cavin-1, PTRF) that exhibited a trend with stiffness across all sampled tissues. The study focused instead on a nuclear lamina protein that was also perturbed in vitro and shown to follow the expected mechanical trend (Swift et al 2013).

      We thank Reviewer #2 for the positive evaluation of the breadth of the results and for pointing us to the relevant reference for the proteomic analysis related to tissue stiffness (Swift et al., 2013). This study, which focused primarily on tissue-level mechanical properties, identified PTRF, a caveolar component, which links to our observation of another caveolar component, CAV1, at the single-cell level.

      We have now included the citation in the following paragraph of the discussion:

      “To our knowledge, there are no prior studies that aim at identifying gene signatures associated with single-cell mechanical phenotype changes, in particular across different cell types. There are, however, several studies that investigated changes in expression upon exposure of specific cell types to mechanical stimuli such as compression (87, 88) or mechanical stretch (22, 80, 89), and one study that investigated difference in expression profiles between stiffer and softer cells sorted from the same population (90). Even though the studies concerned with response to mechanical stimuli answer a fundamentally different question (how gene expression changes upon exposure to external forces vs which genes are expressed in cells of different mechanical phenotype), we did observe some similarities in the identified genes. For example, in the differentially expressed genes identified in the lung epithelia exposed to compression (87), three genes from our module overlapped with the immediate response (CAV1, FHL2, TAGLN) and four with the long-term one (CAV1, FHL2, TAGLN, THBS1). We speculate that this substantial overlap is caused by the cells undergoing change in their stiffness during the response to compression (and concomitant unjamming transition). Another previous study explored the association between the stiffness of various tissues and their proteomes. Despite the focus on the tissue-scale rather than single-cell elasticity, the authors identified polymerase I and transcript release factor (PTRF, also known as cavin 1 and encoding for a structural component of the caveolae) as one of the proteins that scaled with tissue stiffness across samples (91).”

      Reviewer #3 (Public Review)

      In this work, Urbanska et al. link the mechanical phenotypes of human glioblastoma cell lines and murine iPSCs to their transcriptome, and using machine learning-based network analysis identify genes with putative roles in cell mechanics regulation. The authors identify 5 target genes whose transcription creates a combinatorial marker which can predict cell stiffness in human carcinoma and breast epithelium cell lines as well as in developing mouse neurons. For one of the target genes, caveolin1 (CAV1), the authors perform knockout, knockdown, overexpression and rescue experiments in human carcinoma and breast epithelium cell lines. They determine the cell stiffness via RT-DC, AFM indentation and AFM rheology and confirm that high CAV1 expression levels correlate with increased stiffness in those model systems. This work brings forward an interesting approach to identify novel genes in an unbiased manner, but surprisingly the authors validate caveolin 1, a target gene with known roles in cell mechanics regulation. 

      I have two main concerns with the current version of this work: 

      (1) The authors identify a network of 5 genes that can predict mechanics. What is the relationship between the 5 genes? If the authors aim to highlight the power of their approach by knockdown, knockout or over-expression of a single gene why choose CAV1 (which has an individual p-value of 0.16 in Fig S4)? To justify their choice, the authors claim that there is limited data supporting the direct impact of CAV1 on mechanical properties of cells but several studies have previously shown its role in for example zebrafish heart stiffness, where a knockout leads to higher stiffness (Grivas et al., Scientific Reports 2020), in cancer cells, where a knockdown leads to cell softening (Lin et al., Oncotarget 2015), or in endothelial cells, where a knockout leads to cell softening (Le Master et al., Scientific Reports 2022).

      We thank the reviewer for their comments. First, we do acknowledge that studying the relationship between the five identified genes is an intriguing question and would be a natural extension of the currently presented work. It is, however, beyond the scope of the presented manuscript, in which our primary goal was to introduce a general pipeline for de novo identification of genes related to cell mechanics. We did add the following statement in the discussion (yellow highlight) to acknowledge the open ends of our study:

      “The mechanical phenotype of cells is recognized as a hallmark of many physiological and pathological processes. Understanding how to control it is a necessary next step that will facilitate exploring the impact of cell mechanics perturbations on cell and tissue function (76).

      The increasing availability of transcriptional profiles accompanying cell state changes has recently been complemented by the ease of screening for mechanical phenotypes of cells thanks to the advent of high-throughput microfluidic methods (77). This provides an opportunity for data-driven identification of genes associated with the mechanical cell phenotype change in a hypothesis-free manner. Here we leveraged this opportunity by performing discriminative network analysis on transcriptomes associated with mechanical phenotype changes to elucidate a conserved module of five genes potentially involved in cell mechanical phenotype regulation. We provided evidence that the inferred conserved functional network module contains an ensemble of five genes that, in particular when combined in a unique combinatorial marker, are universal, specific and trustworthy markers of mechanical phenotype across the studied mouse and human systems. We further demonstrated on the example of a selected marker gene, CAV1, that its experimental up- and downregulation impacts the stiffness of the measured cells. This demonstrates that the level of CAV1 not only correlates with, but also is causative of mechanical phenotype change. The mechanistic insights into how precisely the identified genes are involved in regulating mechanical properties, how they interact with each other, and whether they are universal and dominant in various contexts all remain to be established in future studies.”

      Regarding the selection of CAV1 as the gene used for the validation experiments: as mentioned in the introductory paragraph of the results section “Perturbing expression levels of CAV1 changes cells stiffness” (copied below), we were encouraged by previous data already linking CAV1 with cell mechanics when selecting it as our first target. The relationship between CAV1 and cell mechanics regulation, however, is not very well established (of note, two of the latest manuscripts came out after the initial findings of our study).

      Regarding the citations suggested by the reviewer: two are already included in the original manuscript (Lin et al., Oncotarget 2015, Ref (63); Le Master et al., 2022, Ref (67)), along with an additional one (Hsu et al., 2018 (66)), and the third one (Grivas et al., 2020 (68)) has now also been added to the manuscript. We would like to highlight, however, that even though Grivas et al. state that the CAV1 KO cells are stiffer, the AFM indentation measurements were performed on cardiac tissue with a spherical tip of 30 μm radius and likely reflect primarily supracellular, tissue-scale properties, as opposed to the cell-scale measurements performed in our study (we used cultured cells, which mostly lack the extracellular tissue structures; deformability cytometry was performed on dissociated cells and picks up on cell properties exclusively; and for the AFM measurements a spherical tip with 5 μm radius was used).

      “We decided to focus our attention on CAV1 as a potential target for modulating mechanical properties of cells, as it has previously been linked to processes intertwined with cell mechanics. In the context of mechanosensing, CAV1 is known to facilitate buffering of the membrane tension (45), play a role in β1-integrin-dependent mechanotransduction (58) and modulate the mechanotransduction in response to substrate stiffness (59). CAV1 is also intimately linked with actin cytoskeleton: it was shown to be involved in cross-talk with Rho-signaling and actin cytoskeleton regulation (46, 60–62), filamin A-mediated interactions with actin filaments (63), and co-localization with peripheral actin (64). The evidence directly relating CAV1 levels with the mechanical properties of cells (47, 62, 65, 66) and tissues (66, 67) is only beginning to emerge.”

      Regarding the cited p-value of 0.16, we would like to clarify that it is the p-value associated with the coefficient of the crude linear regression model fitted to the data for illustrative purposes in Fig S4. This value only says that from the linear fit we cannot conclude much about the correlation of the level of CAV1 with the change in Young’s modulus. Much more relevant parameters to look at are the AUC-ROC values and associated p-values reported in Table 4 in the main text (see below), which show good performance of CAV1 in separating soft and stiff cell states.

      The positive hypothesis I assumes that markers are discriminative of samples with stiff/soft mechanical phenotype regardless of the studied biological system, and CAV1 has a clear trend with the minimum AUC-ROC on 3 datasets of 0.78, even though the p-value is below the significance level. The positive hypothesis II assumes that markers are discriminative of samples with stiff/soft mechanical phenotype in carcinoma regardless of data source, and CAV1 has a clear significance because the minimum AUC-ROC on 3 datasets is 0.89 and the p-value is 0.02.
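      For readers less familiar with the metric, the AUC-ROC of a single marker can be read as the probability that a randomly chosen stiff sample has a higher marker level than a randomly chosen soft sample. The following is a minimal sketch of that definition only; the function name `auc_roc` and the expression values are hypothetical and are not taken from our datasets or from the analysis code used in the manuscript.

```python
def auc_roc(soft_scores, stiff_scores):
    """Probability that a random stiff sample scores higher than a random
    soft sample; ties count as one half (rank-based AUC-ROC definition)."""
    wins = 0.0
    for stiff in stiff_scores:
        for soft in soft_scores:
            if stiff > soft:
                wins += 1.0
            elif stiff == soft:
                wins += 0.5
    return wins / (len(stiff_scores) * len(soft_scores))


# Hypothetical marker levels (illustrative numbers, not measured data).
soft = [2.1, 2.4, 3.0, 2.2]
stiff = [2.9, 3.4, 3.8, 3.1]
print(auc_roc(soft, stiff))
```

      A value of 1 means perfect separation of the two states by the marker, and a value of 0.5 means no separation.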

      (2) The authors do not show by how much PC-Corr outperforms classical co-expression network analysis or an alternative gold standard. It is worth noting that PC-Corr was previously published by the same authors to infer phenotype-associated functional network modules from omics datasets (Ciucci et al., Scientific Reports 2017).

      As pointed out by the Reviewer, PC-corr has been introduced and characterized in detail in a previous publication (Ciucci et al, 2017, Sci. Rep.), where it was compared against standard co-expression analysis (below reported as: p-value network) on molecules selected using univariate statistical analysis. 

      See the following fragment of Discussion in Ciucci et al, 2017:

      “The PC-corr networks were always compared to P-value networks. The first strategical difference lies in the way features are selected: while the PC-corr adopts a multivariate approach, i.e. it uses a combination of features that are responsible for the sample discrimination, in the P-value network the discriminating features are singly selected (one by one) with each Mann-Whitney test (followed by Benjamini-Hochberg procedure). The second strategical difference lies in the generation of the correlation weights in the network. PC-corr combines in parallel and at the same time in a unique formula the discrimination power of the PC-loadings and the association power of the Pearson correlation, directly providing in output discriminative omic associations. These are generated using a robust (because we use as merging factor the minimum operator, which is a very penalizing operator) mathematical trade-off between two important factors: multivariate discriminative significance and correlation association. In addition, as mentioned above, the minimum operator works as an AND logical gate in a digital circuit, therefore in order to have a high link weight in the PC-corr network, both the discrimination (the PC-loadings) and the association (the Pearson correlations) of the nodes adjacent to the link should be simultaneously high. Instead, the P-value procedure begins with the pre-selection of the significant omic features and, only in a second separated step, computes the associations between these features. Therefore, in P-value networks, the interaction weights are the result neither of multivariate discriminative significance, nor of a discrimination/association interplay.”
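      The role of the minimum operator described in the quoted passage can be illustrated with a short sketch. It assumes that the link weight between genes i and j takes the form sgn(c_ij) · min(|V_i|, |V_j|, |c_ij|), where V are PC-loadings already rescaled to [-1, 1] and c is the Pearson correlation; the rescaling step, the function name `pc_corr_weights` and the toy numbers are assumptions made for illustration and do not reproduce the published implementation.

```python
import numpy as np


def pc_corr_weights(expression, loadings):
    """Sketch of a PC-corr-like link weight.

    expression: samples x genes matrix.
    loadings:   one value per gene, assumed to be already rescaled to [-1, 1].
    """
    expression = np.asarray(expression, dtype=float)
    v = np.abs(np.asarray(loadings, dtype=float))
    c = np.corrcoef(expression, rowvar=False)   # gene-gene Pearson correlation
    pairwise_v = np.minimum.outer(v, v)         # min(|V_i|, |V_j|) for every pair
    # The minimum acts as an AND gate: a link is strong only if both the
    # loadings and the correlation are simultaneously high.
    weights = np.sign(c) * np.minimum(pairwise_v, np.abs(c))
    np.fill_diagonal(weights, 0.0)              # no self-links
    return weights


# Toy data: genes 0 and 1 are co-expressed, gene 2 is anti-correlated with both.
toy_expression = np.array([[1, 2, 4], [2, 4, 3], [3, 6, 2], [4, 8, 1]])
toy_loadings = [0.9, 0.8, 0.1]
print(pc_corr_weights(toy_expression, toy_loadings))
```

      In this toy case the link between genes 0 and 1 is limited by the smaller loading (0.8), while the links to gene 2 stay weak despite perfect anti-correlation, because its loading is low.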

      Here we implement PC-corr for a particular application and do not see it as central to the message of the present manuscript to compare it with other available methods. We considered it much more relevant to focus on an in-silico validation on datasets not used during the PC-corr analysis (see Tables 3 and 4 for details).

      Altogether, the authors provide an interesting approach to identify novel genes associated with cell mechanics changes, but the current version does not fulfill such potential by focusing on a single gene with known roles in cell mechanics. 

      Our manuscript presents a demonstration of an overall approach for the identification of genes involved in the regulation of cell mechanics, and the perturbations performed on CAV1 have a demonstrative role (please also refer to the explanations of why we decided to perform the verification focused on CAV1 above). The fact that we identify CAV1, which has been implicated in regulating cell mechanics in a handful of studies, de novo and in an unbiased way speaks to the power of our approach. We do agree that investigation into the effect of manipulating the expression of the remaining genes from the identified network module, as well as into the mutual relationships between those genes and their covariance in perturbation experiments, constitutes a desirable follow-up on the presented results. It is, however, beyond the scope of the current manuscript. Regardless, the other genes identified can be readily tested in systems of interest and used as potential knobs for tuning mechanical properties on demand.

      Reviewer #1 (Recommendations For Authors)

      I am not a specialist of the bio-informatics methods used in this study, so I will not make any specific technical comments on them. 

      In terms of mechanical characterisation of cells, the authors use well established methods and the fact that they systematically validate their findings with at least two independent methods (RT-DC and AFM for example) makes them very robust. So I have no concerns with this part. The experiments of perturbations of CAV1 are also performed to the best standards and the results are clear, no concern on that.

      My main concerns are rather questions I was asking myself and could not answer when reading the article. Maybe the authors could find ways to clarify them - the discussion of their article is already very long and maybe it should not be lengthened too much. In my opinion, some of the points discussed are not really essential and rather redundant with other parts of the paper. This could be improved to give some space to clarify some of the points below:

      We thank Reviewer #1 for the overall positive evaluation of the manuscript as well as for the points of criticism, which we address point by point below.

      (1) This might be a misunderstanding of the method on my side, but I was wondering whether it is possible to proceed through the same steps but choose other pairs of training datasets amongst the 5 systems available (there are 10 such pairs if I am not mistaken) and ask whether they always give the same set of 5 genes. And if not, are the other sets also then predictive, robust, etc. Or is it that there are 'better' pairs than others in this respect. Or the set of 5 genes is the only one that could be found amongst these 5 datasets - and then could it imply that it is the only 'universal' group of predictive genes for cell mechanics (when applied to any other dataset comprising similar mechanical measures and expression profiles, for other cells, other conditions)?

      I apologize in case this question is just the result of a basic misunderstanding of the method on my side. But I could not answer the question myself based on what is in the article and it seems to be important to understand the significance of the finding and the robustness of the method. 

      We thank the Reviewer for this question. To clarify: while in general it is possible to proceed through the same analysis steps with a different pair of datasets (see below for examples), we purposefully chose those two datasets because they encompassed the highest number of samples per condition in the RNAseq data (see Fig 4 and Table R1 below), originated from two different species, and concerned the least related tissues (the other option for mouse would be neural progenitors, which in combination with the glioblastoma would likely result in a focus on genes expressed in neural tissues). This is briefly explained in the following fragment of the manuscript on Page 10:

      “For the network construction, we chose two datasets that originate from different species, concern unrelated biological processes, and have a high number of samples included in the transcriptional analysis: human glioblastoma and murine iPSCs (Table 1).”

      To further address the comment of the reviewer: there is indeed a total of 10 possible two-set combinations of datasets. Six of those pairs are human-mouse combinations (highlighted in orange in Author response Table 1), 3 are human-human combinations (highlighted in blue), and 1 is mouse-mouse (marked in green).

      Author response table 1.

      Possible two-set combinations of datasets. For each combination, the number of common genes is indicated. The number on the diagonal represents total number of transcripts in the individual datasets, n corresponds to the number of samples in the respective datasets.  * include non-coding genes.

      To reiterate, we chose the combination of set A (glioblastoma) and set D (iPSCs) in order to use datasets from different species and with the highest sample numbers.
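      The pair counts given above can be checked with a few lines of code. The sketch below assumes the set labelling A (glioblastoma), B (carcinoma), C (MCF10A), D (iPSCs) and E (developing neurons); the assignment of the labels B and C to the two remaining human systems is our assumption for this illustration, and the function name `classify_pairs` is hypothetical.

```python
from itertools import combinations

# Species of each dataset; labels B and C are assumed for illustration.
species = {"A": "human", "B": "human", "C": "human", "D": "mouse", "E": "mouse"}


def classify_pairs(species_by_set):
    """Group all two-set combinations by the species they combine."""
    groups = {}
    for first, second in combinations(sorted(species_by_set), 2):
        kind = "-".join(sorted((species_by_set[first], species_by_set[second])))
        groups.setdefault(kind, []).append(first + "&" + second)
    return groups


pairs = classify_pairs(species)
for kind, members in pairs.items():
    print(kind, len(members), members)
```

      The human-mouse group contains A&D, the pair used for network construction, together with A&E, B&D, B&E, C&D and C&E discussed below.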

      As for the other combinations of human-mouse datasets:

      • sets A & E lead to the derivation of a conserved module; however, as expected, this module includes genes specific for neuronal tissues (such as the brain- and testis-specific immunoglobulin IGSF11, or genes involved in neuronal development such as RFX4 and SOX8)

      Author response image 1.

      • the remaining combinations (sets B&D, B&E, C&D and C&E) do not lead to the derivation of a highly interconnected module

      Author response image 2.

      Author response image 3.

      Author response image 4.

      Author response image 5.

      Finally, it would have also been possible to perform the combined PC-corr procedure on all 5 datasets. However, this would prevent us from doing validation using unknown datasets.

      Hence, we decided to proceed with the 2 discovery and 4 validation datasets.

      For the sake of completeness, we present below some of the networks obtained from the analysis performed on all 5 datasets (which intersect at 8059 genes).

      Author response image 6.

      The above network was created by calculating mean/minimum PC-corr among all five datasets and applying the threshold. The thresholding can be additionally restricted in that we:

      a. constrain the directionality of the correlation between the genes (sgn(c)) to be the same among all or at least n datasets

      b. constrain the directionality of the correlation between the cell stiffness and gene expression level (sgn(V)) for individual genes.
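      The combination step and restriction (a) can be sketched as follows. This is a simplified illustration under stated assumptions, not the code used to produce the networks shown here: the function name `consensus_network`, the choice to take the minimum over absolute weights (signed by the mean), and the toy matrices are all hypothetical. Restriction (b) is not included; it could be applied analogously as a per-gene mask.

```python
import numpy as np


def consensus_network(weights, cutoff, min_same_sign=None, how="mean"):
    """Combine per-dataset link-weight matrices and apply a threshold.

    weights:       list of genes x genes matrices, one per dataset.
    cutoff:        minimum absolute combined weight for a link to be kept.
    min_same_sign: if given, a link must have the same sign in at least
                   this many datasets (restriction a).
    how:           "mean" or "min" (smallest magnitude, signed by the mean).
    """
    stack = np.stack([np.asarray(w, dtype=float) for w in weights])
    mean = stack.mean(axis=0)
    if how == "mean":
        combined = mean
    else:
        combined = np.abs(stack).min(axis=0) * np.sign(mean)
    keep = np.abs(combined) >= cutoff
    if min_same_sign is not None:
        n_pos = (stack > 0).sum(axis=0)
        n_neg = (stack < 0).sum(axis=0)
        keep &= np.maximum(n_pos, n_neg) >= min_same_sign
    return np.where(keep, combined, 0.0)


# Toy example: one link, positive in two datasets and negative in the third.
toy = [np.array([[0, 0.8], [0.8, 0]]),
       np.array([[0, 0.7], [0.7, 0]]),
       np.array([[0, -0.9], [-0.9, 0]])]
print(consensus_network(toy, cutoff=0.1, min_same_sign=3))
```

      With the toy input, requiring sign agreement in all three datasets removes the link, whereas requiring agreement in at least two datasets keeps it.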

      Some of the resulting networks for such restrictions are presented below.

      Author response image 7.

      Author response image 8.

      Of note, some of the nodes from the original network presented in the paper (CAV1, FHL2, and IGFBP7) are preserved in the 5-set network (and highlighted with blue rims).

      (2) The authors already use several types of mechanical characterisation of the cells, but there are even more of them, in particular, some that might not directly correspond to global cell stiffness but to other aspects, like traction forces, or cell cortex rheology, or cell volume or passage time through constrictions (active or passive) - they might all be in one way or another related, but they are a priori independent measures. Would the authors anticipate finding very different 'universal modules' for these other mechanical properties, or again the same one? Is there a way to get at least a hint based on some published characterisations for the cells used in the study? Basically, the question is whether the gene set identified is specific for a precise type of mechanical property of the cell, or is more generally related to cell mechanics modulation - maybe, as suggested by the authors because it is a set of molecular knobs acting upstream of general mechanics effectors like YAP/TAZ or acto-myosin?

      We thank the Reviewer for this comment. We would like to first note that in our study, we focused on the single-cell mechanical phenotype, understood as the response of cells to deformation at a global (RT-DC) or semi-local (AFM indentation with a 5-μm bead) level and at comparatively low deformations (1-3 μm, see Table S9). There is of course a variety of other methods for measuring cell mechanics and mechanics-related features, such as the traction force microscopy mentioned by the reviewer. Traction force microscopy, however, probes how cells apply forces and interact with their environment rather than the inherent mechanical properties of the cells themselves, which were the main interest of our study.

      Nevertheless, as mentioned in the discussion, we found some overlap with the genes identified in other mechanical contexts, for example in the context of mechanical stretching of cells:

      “Furthermore, CAV1 is known to modulate the activation of transcriptional cofactor yes-associated protein, YAP, in response to changes in stiffness of cell substrate (60) and in the mechanical stretch-induced mesothelial to mesenchymal transition (74).”

      This suggests that the genes identified here may be more broadly related to mechanical aspects of cells.

      Of note, we do have some insights connected to the changes of cell volume — one of the biophysical properties mentioned by the reviewer — from our experiments.  For all measurements performed with RT-DC, we can also calculate cell volumes from 2D cell contours (see Author response images 9, 10, and 11). For most of the cases (all apart from MEF CAV1KO), the stiffer phenotype of the cells, associated with higher levels of CAV1, shows a higher volume.
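      To illustrate what estimating a volume "by rotation of the cell contours" involves, the sketch below rotates a closed 2D contour around a horizontal axis and sums the truncated cones swept by each contour segment. It is a simplified stand-in written for this response, not the Shape-Out implementation: the function names, the use of the mean contour position as the rotation axis, and the averaging of the upper and lower halves are assumptions.

```python
import numpy as np


def _half_volume(x, r):
    """Volume of the solid obtained by rotating a closed contour with
    non-negative radial coordinate r around the x axis."""
    x_next = np.roll(x, -1)
    r_next = np.roll(r, -1)
    # Signed truncated-cone term per segment; summing around the closed
    # contour leaves the volume of the enclosed solid.
    terms = (x_next - x) * (r**2 + r * r_next + r_next**2) / 3.0
    return np.pi * abs(terms.sum())


def volume_by_rotation(x, y):
    """Estimate a volume from an ordered, closed 2D contour (x along the axis)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    y = y - y.mean()                      # rotation axis through the mean position
    upper = np.clip(y, 0.0, None)         # part of the contour above the axis
    lower = np.clip(-y, 0.0, None)        # part below the axis, mirrored upwards
    return 0.5 * (_half_volume(x, upper) + _half_volume(x, lower))


# Sanity check on a circular contour of radius 5 (arbitrary units).
theta = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
circle_volume = volume_by_rotation(5.0 * np.cos(theta), 5.0 * np.sin(theta))
print(circle_volume, 4.0 / 3.0 * np.pi * 5.0**3)
```

      For a circular contour the estimate approaches the volume of a sphere of the same radius, which provides a simple check of the calculation.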

      Author response image 9.

      Cell volumes for the divergent cell states in the five characterized biological systems. (A) Glioblastoma. (B) Carcinoma. (C) MCF10A. (D) iPSCs. (E) Developing neurons. Data corresponds to Figure 2. Cell volumes were estimated using Shape-Out 1.0.10 by rotation of the cell contours.

      Author response image 10.

      Cell volumes for CAV1 perturbation experiments. (A) CAV1 knockdown performed in TGBC cells. (B) CAV1 overexpression in ECC4 and TGBC cells. Data corresponds to Figure 5. Cell volumes were estimated using Shape-Out 1.0.10 by rotation of the cell contours.

      Author response image 11.

      Cell volumes for WT and CAV1KO MEFs. Data corresponds to Figure S9. Cell volumes were estimated using Shape-Out 1.0.10 by rotation of the cell contours.  

      (3) The authors have already tested a large number of conditions in which perturbations of the level of expression of CAV1 correlate with changes in cell mechanics, but I was wondering whether it also has some direct explanatory value for the initial datasets used - for example for the glioblastoma cells from Figure 2, in the different media, would a knock-down of CAV1 prevent the increase in stiffness observed upon addition of serum, or for the carcinoma cells from different tissues treated with different compounds - if I understand well, the authors have tested a subset of these (ECC4 versus TGBC in figure 5) - how did they choose these and how general is it that the mechanical phenotype changes reported in Figure 2 are all mostly dependent on CAV1 expression level? I must say that the way the text is written and the results shown, it is hard to tell whether CAV1 is really having a dominant effect on cell mechanics in most of these contexts or only a partial effect. I hope I am being clear in my question - I am not questioning the conclusions of Figures 5 and 6, but asking whether the level of expression of CAV1, in the datasets reported in Figure 2, is the dominant explanatory feature for the differences in cell mechanics.

      We thank the reviewer for this comment and appreciate the value of the question about the generality and dominance of CAV1 in influencing cell mechanics.

      On the computational side, we have addressed these issues by looking at the performance of CAV1 (among other identified genes) in classifying soft and stiff phenotypes across biological systems (positive hypothesis I), as well as across data of different types (sequencing vs microarray data) and origins (different research institutions) (positive hypothesis II). CAV1 showed strong classification performance (Table 4), suggesting it is a general marker of stiffness changes.

      On the experimental side, we conducted the perturbation experiments in two systems of choice: two intestinal carcinoma cell lines (ECC4 and TGBC) and the MCF10A breast epithelial cell line. These choices were driven by ease of handling, accessibility, as well as (for MCF10A) connection with a former study (Taveres et al, 2017). While we observed correlations between CAV1 expression and cell mechanics in a wide range of datasets, the precise role of CAV1 in each system may vary, and further perturbation experiments in specific systems could be performed to solidify the direct/dominant role of CAV1 in cell mechanics. We hypothesize that the suggested knockdown of CAV1 upon serum addition in glioblastoma cells could reduce or prevent the observed increase in stiffness, though this experiment has not been performed.

      In conclusion, while the computational analysis gives us confidence that CAV1 is a good indicator of cell stiffness, we predict that it acts in concert with other genes and in specific contexts could be replaced by other changes. We suggest that the suitability of CAV1 for manipulating mechanical properties should be tested in each system of interest before use.

      To highlight the fact that the relevance of CAV1 for modulating cell mechanics in specific systems of interest should be tested and the mechanistic insights into how CAV1 regulates cell mechanics are still missing, we have added the following sentence in the discussion:

      “The mechanical phenotype of cells is recognized as a hallmark of many physiological and pathological processes. Understanding how to control it is a necessary next step that will facilitate exploring the impact of cell mechanics perturbations on cell and tissue function (76). The increasing availability of transcriptional profiles accompanying cell state changes has recently been complemented by the ease of screening for mechanical phenotypes of cells thanks to the advent of high-throughput microfluidic methods (77). This provides an opportunity for data-driven identification of genes associated with the mechanical cell phenotype change in a hypothesis-free manner. Here we leveraged this opportunity by performing discriminative network analysis on transcriptomes associated with mechanical phenotype changes to elucidate a conserved module of five genes potentially involved in cell mechanical phenotype regulation. We provided evidence that the inferred conserved functional network module contains an ensemble of five genes that, in particular when combined in a unique combinatorial marker, are universal, specific and trustworthy markers of mechanical phenotype across the studied mouse and human systems. We further demonstrated on the example of a selected marker gene, CAV1, that its experimental up- and downregulation impacts the stiffness of the measured cells. This demonstrates that the level of CAV1 not only correlates with, but also is causative of mechanical phenotype change. The mechanistic insights into how precisely the identified genes are involved in regulating mechanical properties, how they interact with each other, and whether they are universal and dominant in various contexts all remain to be established in future studies.”

      (4) It would be nice that the authors try to more directly address, in their discussion, what is the biological meaning of the set of 5 genes that they found - is it really mostly a product of the methodology used, useful but with little specific relevance to any biology, or does it have a deeper meaning? Either at a system level, or at an evolutionary level. 

      We would like to highlight that our manuscript is focused on the method that we introduce to identify sets of genes involved in the regulation of cell mechanics. The first implementation included here is only the beginning of this line of work which, in the future, will include looking in detail at the biological meaning and the interconnectivity of the genes identified. Most likely, there is a deeper meaning of the identified module, which could be revealed with a lot of dedicated future work. As this is mere speculation at this point, we would like to refrain from going into more detail about it in the current manuscript. We provide below a few words of extended explanation and additional analysis that can shed light on the current limited knowledge of the connections between the genes and their evolutionary preservation.

      While it is difficult to prove at present, we do believe that the identified module of genes may have an actual biological meaning and is not a mere product of the methodology used. The PC-corr score used for applying the threshold and obtaining the gene network is high only if the Pearson’s correlation between the two genes is high, meaning that the highly connected module of identified genes shows correlated expression and is likely co-regulated. Additionally, we performed the GO Term analysis using DAVID to assess the connections between the genes (Figure S3). We have now performed an additional analysis using two orthogonal tools: the functional protein association tool STRING and KEGG Mapper.

      With STRING, we found moderate connectivity using the five network nodes identified in our study, and many of the obtained connections were based on text mining and co-expression, rather than direct experimental evidence (Author response image 12A). A more connected network can be obtained by allowing STRING to introduce further nodes (Author response image 12B). Interestingly, some of the nodes included by STRING in the extended network are nodes identified with milder PC-corr thresholds in our study (such as CNN2 or IGFBP3, see Table S3).

      With KEGG Mapper, we did not find an obvious pathway-based clustering of the genes from the module either. A maximum of two genes were assigned to one pathway and those included: 

      • focal adhesions (pathway hsa04510): CAV1 and THBS1

      • cytoskeleton in muscle cells (pathway hsa04820): FHL2 and THBS1

      • proteoglycans in cancer (pathway hsa05205): CAV1 and THBS1.

      As for the BRITE hierarchy, the following classification was found:

      • membrane trafficking (hsa04131): CAV1, IGFBP7, TAGLN, THBS1, with the following subcategories:

      - endocytosis / lipid raft mediated endocytosis / caveolin-mediated endocytosis: CAV1

      - endocytosis / phagocytosis / opsonins: THBS1

      - endocytosis / others / insulin-like growth factor-binding proteins: IGFBP7

      - others / actin-binding proteins / others: TAGLN.

      Taken together, all these analyses (DAVID, STRING, KEGG) show that at present no direct relationship or single pathway can be found that integrates all the genes from the identified module. Future experiments, including investigations of how other module nodes are affected when one of the genes is manipulated, will help to establish actual physical or regulatory interactions between the genes from our module. 

      To touch upon the evolutionary perspective, we provide an overview of occurrence of the genes from the identified module across the evolutionary tree. This overview shows that the five identified genes are preserved in phylum Chordata with quite high sequence similarity, and even more so within mammals (Author response image 13).

      Author response image 12.

      Visualisation of interactions between the nodes in the identified module using functional protein association networks tool STRING. (A) Connections obtained using multiple proteins search and entering the five network nodes. (B) Extended network that includes further genes to increase indirect connectivity. The genes are added automatically by STRING. Online version of STRING v12.0 was used with Homo sapiens as species of interest.   

      Author response image 13.

      Co-occurrence of genes from the network module across the evolutionary tree. Mammals are indicated with the green frame, glires (include mouse), as well as primates (include human) are indicated with yellow frames. The view was generated using online version of STRING 12.0.

      Reviewer #2 (Recommendations For Authors) 

      (1) The authors need to discuss the level of sensitivity of their mechanical measurements with RT-DC for changes to the membrane compared to changes in microtubules, nucleus, etc. The limited AFM measurements also seem membrane/cortex focused. For these and further reasons below, "universal" doesn't seem appropriate in the title or abstract, and should be deleted. 

      We thank the reviewer for this comment. Indeed, RT-DC is a technique that deforms the entire cell to a relatively low degree (inducing ca. 17% mean strain, i.e. a deformation of approximately 2.5 µm on a cell with a 15 µm diameter, see Table S9 and Urbanska et al., Nat Methods 2020). Similarly, the AFM indentation experiments performed in this study (using a 5-µm diameter colloidal probe and 1 µm indentation) induce low strains, at which, according to current knowledge, the actin cortex dominates the measured deformations. However, other cellular components, including the membrane, microtubules, intermediate filaments, nucleus, other organelles, and cytoplasmic packing, can also contribute. We have reviewed these contributions in detail in a recent publication (Urbanska and Guck, 2024, Ann Rev Biophys., PMID 38382116). For a particular system, it is hard to speculate without further investigation which parts of the cell have a dominant effect on the measured deformability. We have now added the following paragraph to the discussion to include this information:

      “The mechanical phenotype of single cells is a global readout of a cell’s resistance to deformation that integrates contributions from all cellular components. The two techniques implemented for measuring cell mechanics in this study — RT-DC and AFM indentation using a spherical indenter with 5 µm radius — exert comparatively low strain on cells (< 3 µm, see Table S9), at which the actin cortex is believed to dominate the measured response. However, other cellular components, including the membrane, microtubules, intermediate filaments, nucleus, other organelles, and cytoplasmic packing, also contribute to the measured deformations (reviewed in detail in (79)) and, for a particular system, it is hard to speculate without further investigation which parts of the cell have a dominant effect on the measured deformability.”

      The key strength of measuring the global mechanics is that such measurements are agnostic of the specific origin of the resistance to shape change. As such, the term “universal” could be seen as rather appropriate, as we are not testing specific contributions to cell mechanics, and we see the two methods used (RT-DC and AFM indentation) as representative when it comes to measuring global cell mechanics. We have also highlighted throughout the text that we are measuring the global single-cell mechanical phenotype. 

      Most importantly, however, we have used the term “universal” to capture that the genes are preserved across different systems and species, not in relation to the type of mechanical measurements performed and as such we would like to retain the term in the title.

      (2) Fig.2 cartoons of tissues is a good idea to quickly illustrate the range of cell culture lines studied. However, it obligates the authors to examine the relevant primary cell types in single-cell RNAseq of human and/or mouse tissues (e.g. Tabula Muris). They need to show CAV1 is expressed in glioblastoma, iPSCs, etc and not a cell culture artifact. CAV1 and the other genes also need to be plotted with literature values of tissue stiffness.  

      We thank the reviewer for this comment; however, we do believe that the cartoons in Figure 2 should assist the reader to readily understand whether cultured cells derived from the respective tissues were used (see cartoons representing dishes), or the cells directly isolated from the tissue were measured (this is the case for the developing neurons dataset). 

      We did, however, follow the suggestion of the reviewer to use available resources and checked the expression of genes from the identified network module across various tissues in mouse and human. We first used the Mouse Genome Informatics (MGI; https://www.informatics.jax.org/) to visualize the expression of the genes across organs and organ systems (Author response image 14) as well as across more specific tissue structures (Author response image 15). These two figures show that the five identified genes are expressed quite broadly in mouse. We next looked at the expression of the five genes in the scRNASeq dataset from Tabula Muris (Author response image 16). Here, the expression of respective genes seemed more restricted to specific cell clusters. Finally, we also collected the cross-tissue expression of the genes from our module in human tissues from Human Protein Atlas v23 at both mRNA (Author response image 17) and protein (Author response image 18) levels. CAV1, IGFBP7, and THBS1 showed low tissue specificity at mRNA level, FHL2 was enriched in heart muscle and ovary (the heart enrichment is also visible in Author response image 15 for mouse) and TAGLN in endometrium and intestine. Interestingly, the expression at the protein level (Author response image 18) did not seem to follow faithfully the mRNA levels (Author response image 17). Overall, we conclude that the identified genes are expressed quite broadly across mouse and human tissues. 

      Author response image 14.

      Expression of genes from the identified module across various organ and organ systems in mouse. The expression matrices for organs (A) and organ systems (B) were generated using Tissue x Gene Matrix tool of Gene eXpression Database (https://www.informatics.jax.org/gxd/, accessed on 22nd September 2024). No pre-selection of stage (age) and assay type (includes RNA and protein-based assays) was applied. The colors in the grid (blues for expression detected and reds for expression not detected) get progressively darker when there are more supporting annotations. The darker colors do not denote higher or lower levels of expression, just more evidence.

      Author response image 15.

      Expression of genes from the identified module across various mouse tissue structures. The expression matrices for age-selected mouse marked as adult (A) or young individuals (collected ages labelled P42-84 / P w6-w12 / P m1.5-3.0) (B) are presented and were generated using RNASeq Heatmap tool of Gene eXpression Database (https://www.informatics.jax.org/gxd/, accessed on 2nd October 2024).

      Author response image 16.

      Expression of genes from the identified module across various cell types and organs in t-SNE embedding of Tabula Muris dataset. (A) t-SNE clustering color-coded by organ. (B-F) t-SNE clustering color-coded for expression of CAV1 (B), IGFBP7 (C), FHL2 (D), TAGLN (E), and THBS1 (F). The plots were generated using FACS-collected cells data through the visualisation tool available at https://tabulamuris.sf.czbiohub.org/ (accessed on 22nd September 2024).

      Author response image 17.

      Expression of genes from the identified module at the mRNA level across various human tissues. (A-E) Expression levels of CAV1 (A), IGFBP7 (B), FHL2 (C), TAGLN (D), and THBS1 (E). The plots were generated using consensus dataset from Human Protein Atlas v23 https://www.proteinatlas.org/ (accessed on 22nd September 2024).

      Author response image 18.

      Protein levels of genes from the identified module across various human tissues. (A-E) Protein levels of CAV1 (A), IGFBP7 (B), FHL2 (C), TAGLN (D), and THBS1 (E). The plots were generated using Human Protein Atlas v23 https://www.proteinatlas.org/ (accessed on 22nd September 2024).

      Regarding literature values and tissue stiffness, we would like to argue that cell stiffness is not equivalent to tissue stiffness, and we are interested in the former. Tissue stiffness is governed by a combination of cell mechanical properties, cell adhesions, packing and the extracellular matrix. There can be, in fact, mechanically distinct cell types (for example characterized by different metabolic state, malignancy level etc) within one tissue of given stiffness. Hence, we consider that testing for the correlation between tissue stiffness and expression of identified genes is not immediately relevant.

      (3) Fig.5D,H show important time-dependent mechanics that need to be used to provide explanations of the differences in RT-DC (5B,F) and in standard AFM indentation expts (5C,G). In particular, it looks to me that RT-DC is a high-f/short-time measurement compared to the AFM indentation, and an additional Main or Supp Fig needs to somehow combine all of this data to clarify this issue. 

      We thank the reviewer for this comment. It is indeed the case that cells typically display higher stiffness when probed at higher rates. We have now expanded on this aspect of the results and added a supplementary figure (Fig. S10) that illustrates the frequencies used in different methods and summarizes the apparent Young’s moduli values into one plot in a frequency-ordered manner. Of note, we typically acquire RT-DC measurements at up to three flow rates, and the increase in measurement frequency accompanying the increase in flow rate also results in higher extracted apparent Young’s moduli (see Fig. S10 B,D). We have further added Table S9 that summarizes operating parameters of all three methods used for probing cell mechanics in this manuscript:

      “The three techniques for characterizing mechanical properties of cells — RT-DC, AFM indentation and AFM microrheology — differ in several aspects (summarized in Table S9), most notably in the frequency at which the force is applied to cells during the measurements, with RT-DC operating at the highest frequency (~600 Hz), AFM microrheology at a range of frequencies in-between (3–200 Hz), and AFM indentation operating at the lowest frequency (5 Hz) (see Table S9 and Figure S10A). Even though the apparent Young’s moduli obtained for TGBC cells were consistently higher than those for ECC4 cells across all three methods, the absolute values measured for a given cell line varied depending on the method: RT-DC measurements yielded higher apparent Young’s moduli compared to AFM indentation, while the apparent Young’s moduli derived from AFM microrheology measurements were frequency-dependent and fell between the other two methods (Fig. 5B–D, Fig. S10B). The observed increase in apparent Young’s modulus with probing frequency aligns with previous findings on cell stiffening with increased probing rates observed for both AFM indentation (68, 69) and microrheology assays (70–72).”

      (4) The plots in Fig.S4 are important as main Figs, particularly given the cartoons of different tissues in Fig.1,2. However, positive correlations for a few genes (CAV1, IGFBP7, TAGLN) are most clear for the multiple lineages that are the same (stomach) or similar (gli, neural & pluri). The authors need to add green lines and pink lines in all plots to indicate the 'lineage-specific' correlations, and provide measures where possible. Some genes clearly don't show the same trends and should be discussed. 

      We thank the reviewer for this comment. It is indeed an interesting observation (and worth highlighting by adding the fits to lineage-restricted data) that the relationship between relative change in Young’s modulus and the selected gene expression becomes steeper for samples from similar tissue contexts. 

      For the sake of keeping the main manuscript compact, we decided to keep Fig. S7 (formerly Fig. S4) in the supplement, however, we did add the linear fit to the glioblastoma dataset (pink line) and a fit to the related neural/embryonic datasets (gli, neural & pluri – purple line) as advised — see below.

      We did not pool the stomach data since it is represented by a single point in the figure, aligning with how the data is presented in the main text—stomach adenocarcinoma cell lines (MKN1 and MKN45) are pooled in Fig. 1B (see below).

      We have also amended the respective results section to emphasize that, in certain instances, the correlation between changes in mechanical phenotype and alterations in the expression of analysed genes may be less pronounced:

      “The relation between normalized apparent Young’s modulus change and fold-change in the expression of the target genes is presented in Fig. S7. The direction of changes in the expression levels between the soft and stiff cell states in the validation datasets was not always following the same direction (Fig. 4, C to F, Fig. S7). This suggests that the genes associated with cell mechanics may not have a monotonic relationship with cell stiffness, but rather are characterized by different expression regimes in which the expression change in opposite directions can have the same effect on cell stiffness. Additionally, in specific cases a relatively high change in Young’s modulus did not correspond to marked expression changes of a given gene — see for example low CAV1 changes observed in MCF10A PIK3CA mutant (Fig. S7A), or low IGFBP7 changes in intestine and lung carcinoma samples (Fig. S7C). This indicates that the importance of specific targets for the mechanical phenotype change may vary depending on the origin of the sample.”

      (5) Table-1 neuro: Perhaps I missed the use of the AFM measurements, but these need to be included more clearly in the Results somewhere. 

      To clarify: there were no AFM measurements performed for the developing neurons (neuro) dataset, and it is not marked as such in Table 1. There are previously published AFM measurements for the iPSCs dataset (maybe that caused the confusion?), and we referred to them as such in the table by citing the source (Urbanska et al (30)) as opposed to the statement “this paper” (see the last column of Table 1). We did not consider it necessary to include these previously published data. We have added horizontal lines to the table that will hopefully improve its readability.

      Reviewer #3 (For Authors) 

      Major 

      -  I strongly encourage the authors to validate their approach with a gene for which mechanical data does not exist yet, or explore how the combination of the 5 identified genes is the novel regulator of cell mechanics. 

      We appreciate the reviewer’s insightful comment and agree that it would be highly interesting to validate further targets and perform combinatorial perturbations. However, it is not feasible at this point to expand the experimental data beyond the one already provided. We hope that in the future, the collective effort of the cell mechanics community will establish more genes that can be used for tuning of mechanical properties of cells.

      - If this paper aims at highlighting the power of PC-Corr as a novel inference approach, the authors should compare its predictive power to that of classical co-expression network analysis or an alternative gold standard. 

      We thank the reviewer for the suggestion to compare the predictive power of PC-Corr with classical co-expression network analysis or an alternative gold standard. PC-corr has been introduced and characterized in detail in a previous publication (Ciucci et al, 2017, Sci. Rep.), where it was compared against standard co-expression analysis methods. Here we implement PC-corr for a particular application. Thus, we do not see it as central to the message of the present manuscript to compare it with other available methods again.

      - The authors call their 5 identified genes "universal, trustworthy and specific". While they provide a great amount of data all is derived from human and mouse cell lines. I suggest toning this down. 

      We thank the reviewer for this comment. To clarify, the terms universal, trustworthy and specific are based on the specific hypotheses tested in the validation part of the manuscript, but we understand that they may cause confusion. We have now toned down the statement by adding “universal, trustworthy and specific across the studied mouse and human systems” in the following text fragments:

      (1) Abstract

      “(…) We validate in silico that the identified gene markers are universal, trustworthy and specific to the mechanical phenotype across the studied mouse and human systems, and demonstrate experimentally that a selected target, CAV1, changes the mechanical phenotype of cells accordingly when silenced or overexpressed. (...)”

      (2) Last paragraph of the introduction

      “(…) We then test the ability of each gene to classify cell states according to cell stiffness in silico on six further transcriptomic datasets and show that the individual genes, as well as their compression into a combinatorial marker, are universally, specifically and trustworthily associated with the mechanical phenotype across the studied mouse and human systems. (…)”

      (3) First paragraph of the discussion

      “We provided strong evidence that the inferred conserved functional network module contains an ensemble of five genes that, in particular when combined in a unique combinatorial marker, are universal, specific and trustworthy markers of mechanical phenotype across the studied mouse and human systems.”

      Minor suggestions 

      -  The authors point out how genes that regulate mechanics often display non-monotonic relations with their mechanical outcome. Indeed, in Fig.4 developing neurons have lower CAV1 in the stiff group. Perturbing CAV1 expression in that model could show the non-monotonic relation and strengthen their claim. 

      We thank the reviewer for highlighting this important point. It would indeed be interesting to explore the changes in cell stiffness upon perturbation of CAV1 in a system that has the potential to show an opposing behavior. Unfortunately, we are unable to expand the experimental part of the manuscript at this time. We do hope that this point can be addressed in future research, either by our team or other researchers in the field. 

      -  In their gene ontology enrichment assay, the authors claim that their results point towards reduced transcriptional activity and reduced growth/proliferation in stiff compared to soft cells. Proving this with a simple proliferation assay would be a nice addition to the paper. 

      This is a valuable suggestion that should be followed up on in detail in the future. To give a preliminary insight into this line of investigation, we have had a look at the cell count data for the CAV1 knock down experiments in TGBC cells. Since CAV1 is associated with the GO Term “negative regulation of proliferation/transcription” (high CAV1 – low proliferation), we would expect that lowering the levels of CAV1 results in increased proliferation and higher cell counts at the end of experiment (3 days post transfection). As illustrated in Author response image 19 below, the cell counts were higher for the samples treated with CAV1 siRNAs, though, not in a statistically significant way. Interestingly, the magnitude of the effect partially mirrored the trends observed for the cell stiffness (Figure 5F).

      Author response image 19.

      The impact of CAV1 knock down on cell counts in TGBC cells. (A) Absolute cell counts per condition in a 6-well format. Cell counts were performed when harvesting for RT-DC measurements using an automated cell counter (Countess II, Thermo Fisher Scientific). (B) The event rates observed during the RT-DC measurements. The harvested cells are resuspended in a specific volume of measuring buffer standardized per experiment (50-100 μl); thus, the event rates reflect the absolute cell numbers in the respective samples. Horizontal lines delineate medians with mean absolute deviation (MAD) as error, datapoints represent individual measurement replicates, with symbols corresponding to matching measurement days. Statistical analysis was performed using two sample two-sided Wilcoxon rank sum test.
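      The summary statistics and test named in this caption can be illustrated with a minimal standard-library sketch. It computes a median with the mean absolute deviation around it, and an exact two-sided Wilcoxon rank sum p-value by enumerating all possible group assignments. The function names and the cell counts below are hypothetical placeholders chosen for illustration, not the measured data, and the software actually used for the analysis may handle ties and exact versus asymptotic p-values differently.

```python
from itertools import combinations
from statistics import mean, median


def midranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_exact(x, y):
    """Exact two-sided rank sum p-value via enumeration of all splits."""
    ranks = midranks(list(x) + list(y))
    n, total = len(x), len(x) + len(y)
    expected = n * (total + 1) / 2          # rank sum expected under H0
    observed = abs(sum(ranks[:n]) - expected)
    splits = hits = 0
    for idx in combinations(range(total), n):
        splits += 1
        if abs(sum(ranks[i] for i in idx) - expected) >= observed - 1e-12:
            hits += 1
    return hits / splits


def median_with_mad(values):
    """Median and mean absolute deviation around the median."""
    m = median(values)
    return m, mean([abs(v - m) for v in values])


# Hypothetical placeholder values (arbitrary units), NOT measured data.
control = [1.0, 2.0, 3.0]
knockdown = [4.0, 5.0, 6.0]

p_value = rank_sum_exact(control, knockdown)
print(median_with_mad(control), median_with_mad(knockdown), p_value)
```

      With only three replicates per group, the smallest two-sided p-value such an exact test can return is 2/20, which is one reason why small replicate numbers rarely reach conventional significance thresholds with rank-based tests.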

      Methods

      - The AFM indentation experiments are performed with a very soft cantilever at very high speeds. Why? Also, please mention whether the complete AFM curve was fitted with the Hertz/Sneddon model or only a certain area around the contact point. 

      We thank the reviewer for this comment. However, we believe that the spring constants and indentation speeds used in our study are typical for measurements of cells and not a cause of concern. 

      For the indentation experiments, we used Arrow-TL1 cantilevers (nominal spring constant k = 0.035-0.045 N m⁻¹, Nanoworld, Switzerland) which are used routinely for cell indentation (with over 200 search results on Google Scholar using the term: "Arrow-TL1"+"cell", and several former publications from our group, including Munder et al 2016, Tavares et al 2017, Urbanska et al 2017, Taubenberger et al 2019, Abuhattum et al 2022, among others). Additionally, cantilevers with spring constants as low as 0.01 N m⁻¹ can be used for cell measurements (Radmacher 2002, Thomas et al, 2013). 

      The indentation speed of 5 µm s⁻¹ is not unusually high and does not result in significant hydrodynamic drag. 

      For the microrheology experiments, we used slightly stiffer and shorter (100/200 µm compared to 500 µm for Arrow-TL1) cantilevers: PNP-TR-TL (nominal spring constant k = 0.08 N m⁻¹, Nanoworld, Switzerland). The measurement frequencies of 3-200 Hz correspond to movements slightly faster than 5 µm s⁻¹, but cells were indented only to 100 nm, and the data were corrected for the hydrodynamic drag (see equation (8) in Methods section).

      Author response image 20.

      Exemplary indentation curve obtained using an Arrow-TL1 cantilever decorated with a 5-µm sphere on an ECC4 cell. The shown plot is exported directly from JPK Data Processing software. The area shaded in grey is the area used for fitting the Sneddon model.  
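      The idea of fitting only a restricted indentation range can be sketched as follows. This is an illustration under stated assumptions, not the fitting routine of the JPK software: it substitutes the simpler Hertz model for a spherical indenter, F = (4/3) · E/(1 − ν²) · √R · δ^(3/2), for the Sneddon model used in the study, assumes a Poisson ratio of 0.5, and uses a synthetic noise-free curve with a hypothetical modulus of 1 kPa. The function names and the `max_indent` parameter are introduced here for illustration only.

```python
import math


def hertz_force(delta, E, R, nu=0.5):
    """Hertz force (N) for a sphere of radius R (m) at indentation delta (m)."""
    return (4.0 / 3.0) * E / (1.0 - nu ** 2) * math.sqrt(R) * delta ** 1.5


def fit_young_modulus(deltas, forces, R, nu=0.5, max_indent=1.5e-6):
    """Least-squares estimate of E (Pa) from points with 0 <= delta <= max_indent.

    The Hertz model is linear in x = delta**1.5, so the slope of a
    through-origin fit of force against x gives E after rescaling.
    """
    points = [(d ** 1.5, f) for d, f in zip(deltas, forces)
              if 0.0 <= d <= max_indent]
    slope = sum(x * f for x, f in points) / sum(x * x for x, _ in points)
    return slope * 3.0 * (1.0 - nu ** 2) / (4.0 * math.sqrt(R))


# Synthetic, noise-free curve (hypothetical values, not measured data).
R = 2.5e-6                                  # radius of a 5-µm diameter sphere, m
E_true = 1000.0                             # assumed apparent Young's modulus, Pa
deltas = [i * 1e-7 for i in range(1, 21)]   # 0.1 to 2.0 µm indentation
forces = [hertz_force(d, E_true, R) for d in deltas]

E_fit = fit_young_modulus(deltas, forces, R)
print(f"fitted E = {E_fit:.1f} Pa")
```

      On real curves the contact point and baseline would also have to be determined before the fit, which this sketch leaves out.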

      In the indentation experiments, the curves were fitted to a maximal indentation of 1.5 μm (rarely exceeded, see Author response image 20). We have now added this information to the methods section:

      - Could the authors include the dataset wt #1 in Fig 4D? Does it display the same trend? 

      We thank the reviewer for this comment. To clarify: in the MCF10A dataset (GEO: GSE69822) there are exactly three replicates of each wt (wild type) and ki (knock-in, referring to the H1047R mutation in PIK3CA) sample. The numbering wt#2, wt#3, wt#4 originated from the short names that were used in the working files containing non-averaged RPKM (possibly referring to three different measurement replicates that may not have been exactly paired with the ki samples). We have now renamed the samples as wt#1, wt#2 and wt#3 to avoid confusion. This naming also better reflects the sample description as deposited in the GSE69822 dataset (see Author response table 2).

      Author response table 2.

      - Reference (3) is an opinion article with the last author as the sole author. It is used twice as a self-standing reference, which is confusing, as it suggests there is previous experimental evidence. 

      We thank the reviewer for pointing this out and agree that it may not be appropriate to cite the article (Guck 2019 Biophysical Reviews, formerly Reference (3), currently Reference (76)) in all instances. The references to this opinion article have now been removed from the introduction:

      “The extent to which cells can be deformed by external loads is determined by their mechanical properties, such as cell stiffness. Since the mechanical phenotype of cells has been shown to reflect functional cell changes, it is now well established as a sensitive label-free biophysical marker of cell state in health and disease (1-2).”

      “Alternatively, the problem can be reverse-engineered, in that omics datasets for systems with known mechanical phenotype changes are used for prediction of genes involved in the regulation of mechanical phenotype in a mechanomics approach.”

      The reference has, however, been kept in the discussion:

      “The mechanical phenotype of cells is recognized as a hallmark of many physiological and pathological processes. Understanding how to control it is a necessary next step that will facilitate exploring the impact of cell mechanics perturbations on cell and tissue function (76).”

      This reference seems appropriate to us as it expands on the point that our ability to control cell mechanics will enable the exploration of its impact on cell and tissue function, which is central to the discussion of the current manuscript. 

      - The authors should mention what PC-corr means. Principal component correlation? Pearson's coefficient correlation? 

      PC-corr is a combination of loadings from the principal component (PC) analysis and Pearson’s correlation for each gene pair. We have aimed at conveying this in the “Discriminative network analysis on prediction datasets” result section. We have now added an extra sentence at the first appearance of PC-corr to clarify that for the readers from the start:

      “After characterizing the mechanical phenotype of the cell states, we set out to use the accompanying transcriptomic data to elucidate genes associated with the mechanical phenotype changes across the different model systems. To this end, we utilized a method for inferring phenotype-associated functional network modules from omics datasets termed PC-Corr (28), that relies on combining loadings obtained from the principal component (PC) analysis and Pearson’s correlation (Corr) for every pair of genes. PC-Corr was performed individually on two prediction datasets, and the obtained results were overlaid to derive a conserved network module. Owing to the combination of the Pearson’s correlation coefficient and the discriminative information included in the PC loadings, the PC-corr analysis does not only consider gene co-expression — as is the case for classical co-expression network analysis — but also incorporates the relative relevance of each feature for discriminating between two or more conditions; in our case, the conditions representing soft and stiff phenotypes. The overlaying of the results from two different datasets allows for a multi-view analysis (utilizing multiple sets of features) and effectively merges the information from two different biological systems.”
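      The combination of PC loadings and pairwise Pearson’s correlation described above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the published implementation: it assumes that the first principal component is the discriminating one, scales the absolute loadings to the range 0 to 1 by dividing by their maximum, and scores each gene pair as the signed minimum of the two scaled loadings and the absolute correlation. The scaling, the combination rule as written here, the function names, and the toy expression values are all assumptions for illustration and should be checked against ref (28).

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def first_pc_loadings(genes, n_iter=500):
    """Loadings of the first principal component via power iteration."""
    p, n = len(genes), len(genes[0])
    centred = [[v - sum(g) / n for v in g] for g in genes]
    cov = [[sum(a * b for a, b in zip(gi, gj)) / (n - 1) for gj in centred]
           for gi in centred]
    v = [1.0 + i / p for i in range(p)]     # arbitrary non-zero start vector
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def pc_corr(genes):
    """Pairwise score: signed minimum of scaled loadings and |correlation|."""
    loadings = first_pc_loadings(genes)
    top = max(abs(l) for l in loadings)
    scaled = [abs(l) / top for l in loadings]   # assumed scaling to [0, 1]
    scores = {}
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):
            c = pearson(genes[i], genes[j])
            scores[(i, j)] = math.copysign(min(scaled[i], scaled[j], abs(c)), c)
    return scores


# Toy expression values for three genes in six samples (three "soft",
# three "stiff"); hypothetical numbers, not taken from any dataset.
gene0 = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]   # separates the two states
gene1 = [2.0, 2.1, 1.9, 4.2, 4.0, 4.1]   # co-varies with gene0
gene2 = [0.5, 0.7, 0.4, 0.6, 0.5, 0.6]   # uninformative

scores = pc_corr([gene0, gene1, gene2])
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

      In this toy case the pair of co-varying, discriminating genes receives a high score, whereas any pair involving the uninformative gene scores close to zero because its loading is small, which is the behaviour that distinguishes such a score from plain co-expression.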

      - The formatting of Table 1 is confusing. Horizontal lines should be added to make it clear to the reader which datasets are human and which mouse as well as which accession numbers belong to the carcinomas. 

      Horizontal lines have now been added to improve the readability of Table 1. We hope that makes the table easier to follow and satisfies the request. We assume that further modifications to the table appearance may occur during publishing process in accordance with the publisher’s guidelines. 

      - In many figures, data points are shown in different shapes without an explanation of what the shapes represent. 

      We thank the reviewer for this comment and apologize for not adding this information earlier. We have added explanations of the symbols to captions of Figures 2, 3, 5, and 6 in the main text:

      “Fig. 2. Mechanical properties of divergent cell states in five biological systems. Schematic overviews of the systems used in our study, alongside with the cell stiffness of individual cell states parametrized by Young’s moduli E. (…) Statistical analysis was performed using generalized linear mixed effects model. The symbol shapes represent measurements of cell lines derived from three different patients (A), matched experimental replicates (C), two different reprogramming series (D), and four different cell isolations (E). Data presented in (A) and (D) were previously published in ref (29) and (30), respectively.”

      “Fig. 3. Identification of putative targets involved in cell mechanics regulation. (A) Glioblastoma and iPSC transcriptomes used for the target prediction intersect at 9,452 genes. (B, C) PCA separation along two first principal components of the mechanically distinct cell states in the glioblastoma (B) and iPSC (C) datasets. The analysis was performed using the gene expression data from the intersection presented in (A). The symbol shapes in (B) represent cell lines derived from three different patients. (…)”

      “Fig. 5. Perturbing levels of CAV1 affects the mechanical phenotype of intestine carcinoma cells. (…) In (E), (F), (I), and (J), the symbol shapes represent experiment replicates.”

      “Fig. 6. Perturbations of CAV1 levels in MCF10A-ER-Src cells result in cell stiffness changes. (…)  Statistical analysis was performed using a two-sided Wilcoxon rank sum test. In (B), (D), and (E), the symbol shapes represent experiment replicates.”

      As well as to Figures S2, S9, and S11 in the supplementary material (in Figure S2, the symbol explanation was added to the legends in the figure panels as well): 

      “Fig. S2. Plots of area vs deformation for different cell states in the characterized systems. Panels correspond to the following systems: (A) glioblastoma, (B) carcinoma, (C) non-tumorigenic breast epithelia MCF10A, (D) induced pluripotent stem cells (iPSCs), and (E) developing neurons. 95%- and 50%-density contours of data pooled from all measurements of given cell state are indicated by shaded areas and continuous lines, respectively. Datapoints indicate medians of individual measurements. The symbol shapes represent cell lines derived from three different patients (A), two different reprogramming series (D), and four different cell isolations (E), as indicated in the respective panels. (…).”

      “Fig. S9. CAV1 knock-out mouse embryonic fibroblasts (CAV1KO) have lower stiffness compared to the wild type cells (WT). (…) (C) Apparent Young’s modulus values estimated for WT and CAV1KO cells using area-deformation data in (B). The symbol shapes represent experimental replicates. (…)”

      “Fig. S11. Plots of area vs deformation from RT-DC measurements of cells with perturbed CAV1 levels. Panels correspond to the following experiments: (A and B) CAV1 knock-down in TGBC cells using esiRNA (A) and ONTarget siRNA (B), (C and D) transient CAV1 overexpression in ECC4 cells (C) and TGBC cells (D). Datapoints indicate medians of individual measurement replicates. The isoelasticity lines in the background (gray) indicate regions of same apparent Young’s moduli. The symbol shapes represent experimental replicates.”

      - In Figure 2, the difference in stiffness appears bigger than it actually is because the y-axes are not starting at 0. 

      While we acknowledge that starting the y-axes at a value other than 0 is generally not ideal, we chose this approach to better display data variability and minimize empty space in the plots.

      A similar effect can be achieved with logarithmic scaling, which is a common practice (see Author response image 21 for a visualization). We believe our choice of axis cut-off enhances the interpretability of the data without misleading the viewer.

      Author response image 21.

      Visualization of different axis scaling strategies applied to the five datasets presented in Figure 2 of the manuscript. 
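
The trade-off between the axis choices discussed above can be illustrated with a short calculation. This is a minimal sketch with hypothetical median values (1.0 and 1.2 kPa) and a hypothetical cut-off (0.8 kPa); none of these numbers come from the manuscript. It compares how large a difference looks when plotted heights are read from a zero baseline, from a truncated linear baseline, and on a logarithmic axis, where the visual separation depends only on the ratio of the two values.

```python
import math


def apparent_ratio(low, high, baseline):
    """Ratio of the plotted heights of two values above a linear axis baseline."""
    return (high - baseline) / (low - baseline)


# Hypothetical medians in kPa, chosen only for illustration.
soft, stiff = 1.0, 1.2

# With the axis starting at 0, plotted heights keep the true ratio.
ratio_from_zero = apparent_ratio(soft, stiff, 0.0)

# With a hypothetical cut-off at 0.8 kPa, the same values look further apart.
ratio_truncated = apparent_ratio(soft, stiff, 0.8)

# On a log10 axis, the separation depends only on the ratio stiff / soft.
log_distance = math.log10(stiff) - math.log10(soft)

print(f"zero baseline: {ratio_from_zero:.2f}")
print(f"truncated baseline: {ratio_truncated:.2f}")
print(f"log10 separation: {log_distance:.4f}")
```

For plots that show individual data points rather than bar heights, the reading is less sensitive to the baseline, so the calculation above should be taken as an upper bound on the visual effect rather than a description of the actual figure.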

      Of note, apparent Young’s moduli obtained from RT-DC measurements typically span 0.5-3.0 kPa (see Figure 2.3 from Urbanska et al 2021, PhD thesis). Differences between treatments rarely exceed a few hundred pascals. For example, in an siRNA screen of mitotic cell mechanics regulators in Drosophila cells (Kc167), the strongest hits (e.g., Rho1, Rok, dia) showed changes in stiffness of 100-150 Pa (see Supplementary Figure 11 from Rosendahl, Plak et al 2018, Nature Methods 15(5): 355-358).

      - In Figure 3, I don't personally see the benefit of showing different cut-offs for PC-corr. In the end, the paper focuses on the 5 genes in the pentagram. I think only showing one of the cutoffs and better explaining why those target genes were picked would be sufficient and make it clearer for the reader. 

      We believe it is beneficial to show the extended networks for two reasons. First, it demonstrates how the selected targets connect to the broader panel of genes, and that the selected module is indeed much more interconnected than other nodes. Second, the chosen PC-corr cut-off is somewhat arbitrary, and it may be interesting to look through the genes from the extended network as well, as they are likely also important for regulating cell mechanics. This broader view may help readers identify familiar genes and recognize the connections to relevant signaling networks and processes of interest.

      - In Figure 4C, I suggest explaining why the FANTOM5 and not another dataset was used for the visualization here and mentioning whether the other datasets were similar. 

      In Figure 4C, we have chosen to present data corresponding to FANTOM5, because that was the only carcinoma dataset in which all the mechanically tested cell lines are represented. We have now added this information to the caption of Figure 4. Additionally, the clustergrams corresponding to the remaining carcinoma datasets (CCLE RNASeq, Genentech) are presented in supplementary figures S4-S6.

      “The target genes show clear differences in expression levels between the soft and stiff cell states and provide for clustering of the samples corresponding to different cell stiffnesses in both prediction and validation datasets (Fig. 4, Figs. S4-S6).”

      Typos 

      We would like to thank Reviewer #3 for their detailed comments on the typos and details listed below. This is much appreciated, as it improved the quality of our manuscript.

      -  In the first paragraph of the results section the 'and' should be removed from this sentence: Each dataset encompasses two or more cell states characterized by a distinct mechanical phenotype, and for which transcriptomic data is available. 

      The sentence has been corrected and now reads:

      “Each dataset encompasses two or more cell states characterized by a distinct mechanical phenotype, for which transcriptomic data is available.”

      -  In the methods in the MCF10A PIK3CA cell lines part, it says cell liens instead of cell lines. 

      The sentence has been corrected and now reads:

      “The wt cells were additionally supplemented with 10 ng ml<sup>−1</sup> EGF (E9644, Sigma-Aldrich), while mutant cell lines were maintained without EGF.”

      -  In the legend of Figure 6 "accession number: GSE17941, data previously published in ())" the reference is missing. 

      The reference has been added.

      -  In the legend of Figure 5 "(E) Verification of CAV1 knock-down in TGBC cells using two knock-down system" 'a' between using and two is missing. 

      The legend has been corrected (no ‘a’ is missing, but it should say ‘systems’ in the plural).

      -  In Figure 5B one horizontal line is missing. 

      Figure 5B has been corrected accordingly.

      -  Terms such as de novo or in silico should be written in cursive. 

      We thank the Reviewer for this comment; however, we believe that in the style used by eLife, common Latin expressions such as de novo or in vitro are used in regular font.

      -  In the heading of Table 4 "The results presented in this table can be reproducible using the code and data available under the GitHub link reported in the methods section." It should say reproduced instead of reproducible. 

      Yes, indeed. It has been corrected.

      -  The citation of reference 20 contains several author names multiple times. 

      Indeed, it has now been fixed.

      -  In Figure S2 there is a vertical line in the zeros of the y axis labels. 

      We are not sure whether there was a rendering issue, but we did not see a vertical line in the zeros of the y-axis labels in Figure S2.

      - The text in Figure S4 is too small.

      We thank the reviewer for pointing this out. We have now revised Figure S7 (formerly Figure S4) to increase the text size, ensuring better readability. (It has also been updated to include additional fits as requested by Reviewer #2).

      - In Table 3 "positive hypothesis II markers are discriminative of samples with stiff/soft independent of data source" the words 'mechanical phenotype' are missing. 

      The column headings in Table 3 have now been updated accordingly.

      - In Table S3 explain in the table headline what vi1, vi2 and v are. I assume the loading for PC1, the loading for PC2 and the average of the previous two values. But it should be mentioned somewhere.

      The caption of table S3 has been updated to explain the meaning of vi1, vi2 and v.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, the authors provide strong evidence that the cell surface E3 ubiquitin ligases RNF43 and ZNRF3, which are well known for their role in regulating cell surface levels of WNT receptors encoded by FZD genes, also target EGFR for degradation. This is a newly identified function for these ubiquitin ligases beyond their role in regulating WNT signaling. Loss of RNF43/ZNRF3 expression leads to elevated EGFR levels and signaling, suggesting a potential new axis to drive tumorigenesis, whereas overexpression of RNF43 or ZNRF3 decreases EGFR levels and signaling. Furthermore, RNF43 and ZNRF3 directly interact with EGFR through their extracellular domains.

      Strengths:

      The data showing that RNF43 and ZNRF3 interact with EGFR and regulate its levels and activity are thorough and convincing, and the conclusions are largely supported.

      Weaknesses:

      While the data support that EGFR is a target for RNF43/ZNRF3, some of the authors' interpretations of the data on EGFR's role relative to WNT's roles downstream of RNF43/ZNRF3 are overstated. The authors, perhaps not intentionally, promote the effect of RNF43/ZNRF3 on EGFR while minimizing their role in WNT signaling. This is the case in most of the biological assays (cell and organoid growth and mouse tumor models). For example, the conclusion of "no substantial activation of Wnt signaling" (page 14) in the prostate cancer model is currently not supported by the data and requires further examination. In fact, examination of the data presented here indicates effects on WNT/b-catenin signaling, consistent with previous studies.

      Cancers in which RNF43 or ZNRF3 are deleted are often considered to be "WNT addicted", and inhibition of WNT signaling generally potently inhibits tumor growth. In particular, treatment of WNT-addicted tumors with Porcupine inhibitors leads to tumor regression. The authors should test to what extent PORCN inhibition affects tumor (and APC-min intestinal organoid) growth. If the biological effects of RNF43/ZNRF3 loss are mediated primarily or predominantly through EGFR, then PORCN inhibition should not affect tumor or organoid growth.

      We thank the reviewer for appreciating the key strength of our study. We fully agree with the reviewer that RNF43/ZNRF3 play key roles in restraining WNT signaling and that their deletions activate WNT signaling, which leads to cancer promotion, as discussed and cited in our manuscript (Hao et al, 2012; Koo et al, 2012). We have revised the language in this manuscript to avoid any confusion or appearance of downplaying this known signaling pathway in cancer progression.

      What we would like to highlight in this work is that our study uncovered an effect of RNF43/ZNRF3 on EGFR, leading to biological impact in multiple model systems. In particular, we included the APC-mutated human cancer cell line HT29 and Apc min mouse intestinal tumor organoids. In the context of APC mutations, β-catenin stabilization and the activation of WNT target genes are essentially decoupled from upstream WNT ligand binding to WNT receptors; thus, we could primarily focus on the effect of RNF43/ZNRF3 on EGFR. Our statement of “no substantial activation of WNT signaling”, as cited by the reviewer, was made in describing the data in Fig. 7E, where we did not observe β-catenin accumulation in the nucleus and therefore inferred that there was no substantial activation of canonical WNT signaling. We agree that further examination would help strengthen the conclusion and appreciate the reviewer’s suggestion of PORCN inhibition experiments. While PORCN inhibition is a valuable experiment in models with an abundance of WNT ligands/receptors and non-mutationally activated regulators of WNT signaling (Yu et al, 2020), in biological scenarios with existing APC mutations, another group has previously demonstrated that PORCN inhibition had no observable effect on WNT signaling in APC-deficient cells (PMID: 29533772). In our initial submission, we confirmed this predicted low response to manipulation of WNT signaling components upstream of a mutated APC. We showed that addition of RSPO1 in Apc min mouse intestinal tumor organoids failed to further activate WNT target expression (Fig. 6G). Furthermore, in this revised manuscript, we added new data on EGFR inhibition and PORCN inhibition in WT and Znrf3 KO MEFs (Fig. 6L). PORCN inhibition had no impact on cell growth in either WT or Znrf3 KO MEFs, suggesting that the growth-promoting effect of Znrf3 KO in MEFs is WNT independent. In contrast, inhibition of EGFR downstream signaling components (Fig. 6L) significantly blocked MEF growth and abolished the impact of Znrf3 KO on MEF growth. This new evidence further supports our main conclusion that RNF43/ZNRF3 controls EGFR signaling to regulate cell growth.

      Reviewer #2 (Public Review):

      Using proteogenomic analysis of human cancer datasets, Yu et al. found that EGFR protein levels negatively correlate with ZNRF3/RNF43 expression across multiple cancers. Interestingly, they found that CRC harbouring the frequent RNF43 G659Vfs*41 mutation exhibits higher levels of EGFR when compared to RNF43 wild-type tumors. This is highly interesting since this mutation is generally not thought to influence Frizzled levels and Wnt-bcatenin pathway activity. Using CRISPR knockouts and overexpression experiments, the authors show that EGFR levels are modulated by ZNRF3/RNF43. Supporting these findings, modulation of ZNRF3/RNF43 activity using Rspondin also leads to increased EGFR levels. Mechanistically, the authors show that ZNRF3/RNF43 ubiquitinate EGFR, leading to its degradation. Finally, the authors present functional evidence that loss of ZNRF3/RNF43 unleashes EGFR-mediated cell growth in 2D culture and organoids and promotes tumor growth in vivo.

      Overall, the conclusions of the manuscript are well supported by the data presented, but some aspects of the mechanism presented need to be reinforced to fully support the claims made by the authors. Additionally, the title of the paper suggests that ZNRF3 and RNF43 loss leads to the hyperactivity of EGFR and that its signalling activity contributes to cancer initiation/progression. I don't think the authors convincingly showed this in their study.

      We thank the reviewer for commenting that our “conclusions of the manuscript are well supported by the data presented.” We address the concerns raised by this reviewer point by point below:

      Major points:

      (1) EGFR ubiquitination. All of the experiments supporting that ZNRF3/RNF43 mediates EGFR ubiquitination are performed under overexpression conditions. A major caveat is also that none of the ubiquitination experiments are performed under denaturing conditions. Therefore, it is impossible to claim that the ubiquitin immunoreactivity observed on the western blots presented in Figure 4 corresponds to ubiquitinated-EGFR species. Another issue is that in Figure 4A, the experiments suggest that the RNF43-dependent ubiquitination of EGFR is promoted by EGF. However, there is no control showing the ubiquitination of EGFR in the absence of EGF but under RNF43 overexpression. According to the other experiments presented in Figures 4B, 4C, and 4F, there seems to be a constitutive ubiquitination of EGFR upon overexpression. How do the authors reconcile the role of ZNRF3/RNF43 vs c-cbl?

      We agree with the reviewer about the limitations of overexpression experiments. In this manuscript, we leveraged both overexpression and knockout systems to demonstrate that ZNRF3/RNF43 regulates EGFR ubiquitination: in Fig 4A, we showed that overexpression of RNF43 increased EGFR ubiquitination; in Fig 4B&C and Fig S3A, we showed that RNF43 knockout decreased EGFR ubiquitination; in Fig 4F, we showed that overexpression of ZNRF3 WT increased EGFR ubiquitination, whereas overexpression of the ZNRF3 RING domain deletion mutant failed to increase EGFR ubiquitination.

      We also appreciate the rigor with which the reviewer has approached our methodology. We acknowledge that denaturing conditions can provide additional validation, but the technical challenges associated with denaturing conditions include the potential disruption of epitope structures recognized by these antibodies. Our methodology was chosen to balance the need for accurate detection with the preservation of protein structure and function, which are crucial for understanding the biological implications of EGFR ubiquitination. Moreover, our immunoprecipitation and subsequent Western blotting were stringent with high SDS and 2-ME, optimized to minimize non-specific binding and enhance the specificity of detection. We believe that the data presented are robust and contribute significantly to the existing body of knowledge on EGFR ubiquitination.

      CBL is a well-known E3 ligase of EGFR, and it induces EGFR ubiquitination upon EGF ligand stimulation. Therefore, in order to have a fair comparison of RNF43 and CBL on EGFR ubiquitination, we designed Fig 4A and related experiments in the setting of EGF stimulation. We observed that RNF43 overexpression increased EGFR ubiquitination as potently as CBL did. Following this result, we further demonstrated that knockout of RNF43 decreased the endogenous ubiquitinated EGFR level in the unstimulated/basal condition (Fig 4B) as well as in the EGF-stimulated condition (Fig 4C). We acknowledge the importance of, and interest in, fully understanding how ZNRF3/RNF43 interplays with the functions of CBL in regulating EGFR ubiquitination. This line of investigation indeed holds the potential to uncover novel regulatory mechanisms in detail. However, the primary focus of the current study was to establish a foundational understanding of the role of ZNRF3/RNF43 in regulating EGFR ubiquitination. We look forward to exploring this further in future work.

      (2) EGFR degradation vs internalization. In Figure 3C, the authors show experiments that demonstrate that RNF43 KO increases steady-state levels of EGFR and prevents its EGF-dependent proteolysis. Using flow cytometry they then present evidence that the reduction in cell surface levels of EGFR mediated by EGF is inhibited in the absence of RNF43. The authors conclude that this is due to inhibition of EGF-induced internalization of surface EGF. However, the experiments are not designed to study internalization and rather merely examine steady-state levels of surface EGFR pre- and post-treatment. These changes are an integration of many things (retrograde and anterograde transport mechanisms presumably modulated by EGF). What process(es) is/are specifically affected by ZNRF3/RNF43? Are these processes differently regulated by c-cbl? If the authors are specifically interested in internalization/recycling, the use of cell surface biotinylation experiments and time courses are needed to examine the effect of EGF in the presence or absence of the E3 ligases.

      We agree that our study design primarily assesses EGFR levels on the cell surface before and after EGF treatment and does not comprehensively measure the whole internalization process. In response to the reviewer’s comments, we have revised the relevant sections of the manuscript to clarify that our current findings are focused on changes in cell surface EGFR and do not extend to the detailed mechanisms of EGF-induced internalization or recycling.

      (3) RNF43 G659Vfs*41. The authors make a point in Figure 1D that this mutant leads to elevated EGFR in cancers but do not present evidence that this mutant is ineffective in mediating ubiquitination and degradation of EGFR. As this mutant maintains its ability to promote Frizzled ubiquitination and degradation, it would be important to show side by side that it does not affect EGFR. This would perhaps imply differential mechanisms for these two substrates.

      Fig 1D is based on bioinformatic analysis of colon cancer patient samples, showing that RNF43 G659Vfs*41 mutant tumors exhibited significantly higher levels of EGFR protein compared to RNF43 WT tumors. Following this lead, we investigated whether the RNF43 G659Vfs*41 hotspot mutation abolishes the role of RNF43 in downregulating EGFR. To this end, we transfected the same amount of control vector, RNF43 WT, RING deletion mutant, or G659Vfs*41 mutant DNA into 293T cells and measured the level of co-transfected EGFR. As shown in Author response image 1, overexpression of RNF43 WT decreased the EGFR level, while overexpression of the RING deletion mutant had no impact on the EGFR level compared with the Vector group, which is consistent with our findings in the manuscript. Cells transfected with the RNF43 G659Vfs*41 mutant exhibited nearly normal levels of EGFR; however, we also observed that RNF43 G659Vfs*41 was less expressed than WT, even though the same amounts of DNA were transfected. Therefore, the insubstantial impact on EGFR levels could be attributed to either functional loss or compromised stability of RNF43 G659Vfs*41 mRNA or protein. Further investigation of RNF43 G659Vfs*41 mRNA and protein stability vs. RNF43 G659Vfs*41 protein function is needed to draw a solid conclusion.

      Author response image 1.

      (4) "Unleashing EGFR activity". The title of the paper implies that ZNRF3/RNF43 loss leads to increased EGFR expression and hence increased activity that underlies cancer. However, I could find only one direct evidence showing that increased proliferation of the HT29 cell line mutant for RNF43 could be inhibited by the EGFR inhibitor Erlotinib. All the other evidence presented that I could find is correlative or indirect (e.g. RPPA showing increased phosphorylation of pathway members upon RNF43 KO, increased proliferation of a cell line upon ZNRF3/ RNF43 KO, decreased proliferation of a cell line upon ZNRF3/RNF43 OE in vitro or in xeno...). Importantly, the authors claim that cancer initiation/ progression in ZNRF3/RNF43 mutants may in some contexts be independent of their regulation of Wnt-bcatenin signaling and relying on EGFR activity upregulation. However, this has not been tested directly. Could the authors leverage their znrf3/RNF43 prostate cancer model to test whether EGFR inhibition could lead to reduced cancer burden whereas a Frizzled or Wnt inhibitor does not?

      More broadly, if EGFR signaling were to be unleashed in cancer, then one prediction would be that these cells would be more sensitive to EGFR pathway inhibition. Could the authors provide evidence that this is the case? Perhaps using isogenic cell lines or a panel of patient-derived organoids (with known genotypes).

      We appreciate the reviewer’s suggestion to provide more direct evidence demonstrating the importance of the ZNRF3/RNF43-EGFR axis in cancer cell proliferation. In this revised manuscript, we further studied this issue in WT vs. Znrf3 KO MEF cells. We observed that treatment with the EGFR inhibitor erlotinib did not affect WT MEFs but blunted the growth advantage of Znrf3 KO MEF cells (Fig. 6L). On the other hand, treatment with the porcupine inhibitor C59 did not impact either WT or Znrf3 KO MEF cells (Fig. 6L), suggesting a more important role of the ZNRF3/RNF43-EGFR axis in mediating the enhanced cell growth of MEFs caused by Znrf3 knockout. Furthermore, considering that EGFR is often mutated in human cancer, to increase the clinical relevance of our study, we also tested the effect of RNF43 knockout on EGFR L858R (Fig. 2D), a common oncogenic EGFR mutant, and found that RNF43 knockout in HT29 boosted levels of this EGFR mutant detected by its FLAG tag, suggesting that RNF43 degrades both WT and mutated EGFR and that its loss can enhance signaling of both WT EGFR and its oncogenic mutant. However, we emphasize again that this manuscript is in no way written to diminish the proven importance of the ZNRF3/RNF43-WNT-β-catenin axis in cancer and development.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      The main conclusion that EGFR is targeted for degradation by RNF43 and ZNRF3 is well supported and documented. Figures 1-5 and associated supplemental figures contain largely convincing data. Figures 6 and 7, however, require some modifications, as follows in order of appearance:

      Figure 6C: Growth of intestinal tumor organoids from Apcmin mice does not require Rspo, however, the authors show that these organoids grow larger in the presence of Rspo, an effect they attribute to increased EGFR activity, rather than increased WNT activity. While this conclusion may be correct, the authors should address this possibility by treating the organoids with PORCN inhibitor. The prediction would be that Rspo treatment still increases organoid size in the presence of PORCN inhibition. A further prediction would be that blocking EGFR (e.g. with Cetuximab) will abrogate the RSPO1 effect.

      Yes, we attributed the impact of Rspo on Apc min organoid growth to enhanced EGFR activity because we observed increased EGFR levels (Fig 6F) but no detectable increase in the eight WNT target genes assayed. We agree that further pharmacologic experiments would strengthen our conclusion, but our few attempts at treating organoids encountered technical difficulties. Hence, we switched to testing PORCN inhibition vs EGFR inhibition in WT and Znrf3 KO MEFs. As shown in the revised Fig. 6L, EGFR inhibition significantly reversed the growth advantage caused by Znrf3 KO, but C59 did not.

      Figure 6G: It is unclear why the authors provide "8-day RSPO1 treatment" data. Here, EGFR mRNA appears to be elevated 2-fold (perhaps not statistically significant), and the Wnt targets Lef1 and Axin2 are decreased, as indicated by the statistical significance. What point is being made here?

      Our observations of increased size of Apc min mouse intestinal tumor organoids and increased EGFR protein levels were made at 8 days of RSPO1 treatment. Therefore, we measured mRNA levels at the same time point, with the 2-day time point also included for comparison. The goal of this qPCR experiment was to detect the contribution of WNT signaling, and we did not detect an increased transcriptional readout. We included EGFR mRNA levels for comparison, and we did not detect a statistically significant increase, consistent with our experiments concluding that ZNRF3/RNF43 regulate EGFR at the protein level. As stated in the preceding response, these data led us to attribute the impact of Rspo on Apc min organoid growth to enhanced EGFR activity.

      Figure 7A: This requires quantitation. How many mice were used per cell line? The data shown is not particularly convincing, with ZNRF3 overexpressing HT29 cells growing detectably. Showing representative mice is fine, but this should be supplemented with quantitation of all mice.

      We had provided this data. The BLI signal quantification was shown below the representative BLI images. Seven mice were used per cell line, as annotated at the top of the graph.

      Figure 7B: The authors assert that "canonical WNT signaling, based on levels of active-β-Catenin (non-phosphorylated at Ser33/37/Thr41; Figure 7B), remained unaffected". As shown, 2 of the 3 Myc-Znrf3 tumors have increased active-b-catenin signal over the GFP tumors. This indicates to me that canonical Wnt signaling was affected. The authors either need to present quantitative data that supports this claim or modify their conclusions. As presented, I don't think it is appropriate to decouple the effect of Znrf3 overexpression on EGFR from its effect on WNT.

      As requested, we have quantified the level of non-phospho β-Catenin at Ser33/37/Thr41 and found no significant differences (p > 0.05) between the control group vs. ZNRF3 overexpression group. We once again note that our manuscript was not meant to dispute the proven signaling and biological significance of WNT signaling regulation by ZNRF3/RNF43, and we have proof-read the manuscript multiple times to ensure that we did not make any generalized or misleading statements in this aspect.

      Author response image 2.

      Figure 7E: Here the authors assert that "no substantial activation of canonical Wnt signaling" in the Z&R KO tumors, however, the figure shows a substantial increase in active b-catenin staining. The current resolution is insufficient to claim that there is no increase in nuclear b-catenin. The authors' claim that WNT signaling is not involved here is not supported by the data presented here. One way to demonstrate that this effect is through EGFR activation and not through WNT activation is to treat mice with PORCN inhibitor. WNT-addicted tumors, such as by Rnf43 or Znrf3 deletion, regress upon PORCN inhibition. In this case, if the effect of Z&R KO is mediated through EGFR rather than WNT, then there should be no effect on tumor growth upon PORCN inhibition. This is a critical experiment in order to make this point.

      We appreciate the reviewer’s comments and suggestion of experiments. We based our initial statement on insubstantial nuclear β-catenin staining, but we agree that immunohistochemical staining lacks the resolution suitable for quantification. We could not generate an adequate number of KO animals for these in vivo experiments in the window of time planned for this revision. Instead, as shown in the newly added Fig. 6L, we tested EGFR inhibition and PORCN inhibition in Znrf3 KO MEFs and obtained strong data further supporting the role of EGFR in mediating the promotion of MEF growth by Znrf3 KO. Nevertheless, we have carefully revised our description of the in vivo data in Fig 7E to avoid any confusion or over-interpretation.

      Minor points:

      Figure 2A: provide quantitation of this immunoblot.

      We have revised the manuscript, and the quantification result is now shown next to the immunoblot.

      Figure 2B: provide more detail in the figure legend and in the Materials and Methods section on how the KO MEFs were generated. Confirmation that Znrf3 (or in cases of Rnf43 KO) expression is lost in KO would be advisable.

      We have confirmed Znrf3 KO by genotyping and RNF43 KO by immunofluorescent staining. We have also tested multiple commercial anti-ZNRF3 antibodies and anti-RNF43 antibodies for Western blotting, but they all failed.

      Figure 4C is a little misleading. The schematic indicates that ECD-TM and TM-ICD truncations were analyzed for both ZNRF3 and RNF43. However, Figure 4 only shows data for ZNRF3, and the corresponding Figure S4 lacks data for the TM-ICD of Rnf43. A recommendation is to show only those schematics for which data is presented in that figure. On a related topic, the results using the deltaRING constructs (Figure S5) are not mentioned/described in the text.

      We think that the reviewer meant Fig 5C. We have revised Fig 5C by removing the RNF43 label, and we confirm that the Results section does include the data in Fig S5.

      Figure S4A: Only ZNRF3 is indicated in this figure. Please explain why RNF43 is not represented here. Also, indicate what is plotted along the x-axis.

      We detected only the endogenous ZNRF3-EGFR interaction, possibly because the RNF43 protein level is relatively low in the cell line we used for the mass spec experiment. The x-axis shows the proteins ordered by their y-axis values, as detailed in the figure legend: each data point was arranged along the x-axis based on the log2 fold change of iBAQ of EGFR-associated proteins identified in EGF-stimulated vs. control samples, from low to high (left to right). We have added the label “Proteins detected by Mass-Spec” to the x-axis.
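
The ordering described for the x-axis can be sketched in a few lines. This is a minimal illustration, assuming iBAQ intensities are available as simple protein-to-value mappings; the protein names and intensities below are hypothetical placeholders, and the function name `rank_by_log2_fold_change` is introduced here for illustration only, not taken from the analysis code used in the study.

```python
import math


def rank_by_log2_fold_change(stimulated, control):
    """Return (protein, log2 fold change) pairs sorted from low to high.

    Only proteins with positive intensities in both conditions are kept,
    because the log2 ratio is undefined otherwise.
    """
    fold_changes = {
        protein: math.log2(stimulated[protein] / control[protein])
        for protein in stimulated
        if control.get(protein, 0) > 0 and stimulated[protein] > 0
    }
    return sorted(fold_changes.items(), key=lambda item: item[1])


# Hypothetical iBAQ intensities (arbitrary units), for illustration only.
control = {"PROT_A": 100.0, "PROT_B": 400.0, "PROT_C": 50.0}
stimulated = {"PROT_A": 200.0, "PROT_B": 100.0, "PROT_C": 400.0}

ranked = rank_by_log2_fold_change(stimulated, control)

# The x position of each protein is its rank; the y value is its log2 fold change.
for rank, (protein, log2_fc) in enumerate(ranked, start=1):
    print(rank, protein, log2_fc)
```

In such a rank plot the x-axis carries no independent measurement; it only spreads the proteins out by their y-axis values, which is why proteins enriched upon EGF stimulation appear at the far right.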

      Reviewer #2 (Recommendations For The Authors):

      Minor Points.

      (1) In Figure 2B, the authors claim that Znrf3 KO enhanced both EGFR and p-EGFR levels both in the absence and presence of EGF. Although it is clear in the presence of EGF, the increase in p-EGFR in the absence of EGF is less than clear.

      We have revised the manuscript to more clearly state the result in Fig 2B.

      (2) Importantly the authors validated their findings using three independent RNF43 gRNA (fig S2D) but they do not show the editing efficiency obtained with the gRNA.

      We did not include an RNF43 IB in this figure due to the lack of specific antibodies for detecting RNF43 by IB. We have no reason to doubt that the knockout efficiency was adequate, since EGFR was increased compared to the control group. As a result, we did not perform deep sequencing to validate knockout efficiency.

      (3) In S2E, the authors show that KO of either ZNRF3 or RNF43 enhance HER2 levels. This suggests that there is no redundancy between these E3 ligases, at least in this context. How do the authors reconcile that?

      The reviewer raised an interesting issue. Due to the lack of WB antibodies for these two proteins, we could not easily assess the feedback impact of knocking out either gene on the protein level of the other. We speculate that there may be a threshold level of the sum of the two proteins that is needed for adequate degradation of HER2, leading to a HER2 increase when either gene is knocked out. A detailed study of this issue is beyond the scope of the current work.

      (4) Experiments performed in Fig 3C are performed in only one clone. The authors need to repeat in an additional clone or rescue this phenotype using a RNF43 cDNA.

      Our RNF43 KO HT29 line is a pool of KO cells, not a single clone.

      (5) In Figure 7E, the authors suggest that the absence of nuclear β-catenin means that canonical Wnt signaling is unaffected. It is widely known that nuclear β-catenin often does not correlate with pathway activity.

      As stated above, we have revised the manuscript to avoid confusion and misinterpretation.

      (6) What is the nature of the error bars in Fig 3c? Are the differences statistically significant?

      As mentioned in the figure legend, the error bars are SEM. The result is statistically significant, and p-value is noted in the graph.

      (7) In the Figure legends, it should be stated clearly how many biological replicates were performed for each experiment and single data points should be plotted where applicable (e.g. qPCR data). It would be helpful if the uncropped and unprocessed Western blot membranes and replicates that are not shown would be accessible to allow the reader a more comprehensive view of the acquired data, especially for blots that were quantified (e.g. Figure 2F, Figure 3C, there is clearly some defect on the blot).

      For WB representation, it would be helpful to include more size markers on the Western blots (especially on the IPs that show ubiquitin smear) and in general to use a reference protein (GAPDH, Actin, Vinculin) that is closer to the protein being assessed.

      More details should be added in the Methods section to explain how protocols were performed in detail. For example, it should be explained how the viruses used for infecting cells were produced (which plasmids were transfected using which transfection reagent, how long was the virus collected for, etc). Then, it should be stated how long the cells were undergoing selection before being harvested. Because the expression of the viral constructs potentially has an effect on cell proliferation through EGFR, this information is quite relevant. This is just an example, there are details missing in nearly every section (Flow: washing protocols, gating protocols (Live/dead stain?), WB: RIPA lysis buffer composition? How much protein was loaded on blots? How was protein quantification done? IP: how were washes performed and how often repeated?)

      Missing: antibody dilutions for IF, IHC, and WB, plasmid backbones, sequences and availability, qPCR primer sequences from Origene.

      Incucyte experiments are not described.

      We have revised the relevant sections to include more details.

      (8) Line 141: revise text: 2x mRNA abundance in the same sentence.

      Line 162: define intermediate expression better.

      Line 197/198: revise text ('the predominant one'?).

      Line 218/219: revise text (Internalisation of surface EGFR?).

      Line 245: clarify in text that it is endogenous EGFR that is being pulled down.

      Line 264: typo: conserved instead of conservative.

      Line 324: revise text (What does 'unknown significance' mean).

      Line 396/397: revise text: 2x Co-IP in the same sentence.

      Figure 3 D/E: more details on the Method in the figure legend.

      We have revised them accordingly.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations for the Authors):

      The authors provide their data and code via GitHub, and Shiny apps allow easy access to their data. However, after spending a few minutes with the snRNA-seq app, I could not figure out how to search for individual genes (e.g. DBH) in the web interface. Some changes could help to make this app more user-friendly.

      While it was not possible to easily modify the user interface of the snRNA-seq app itself, we have instead added two additional supplementary figures displaying screenshots and schematics with sequential instructions that provide a short tutorial showing how to search for individual genes and display either spatial gene expression (for the Visium SRT data) or gene expression by cluster or population (for the snRNA-seq data) in each interactive web app (Figure 3-figure supplement 20-21). We hope this makes the apps more accessible and assists users to more easily query specific genes that they are interested in.

      The first sentence of the abstract and line 70 on page 2 need to be revised for language / grammar / clarity.

      We have revised these two sentences. Line 70 on page 2 contained a typo / copy-paste error. Thank you for pointing this out.

      Reviewer #2 (Recommendations For The Authors):

      While the efforts of the authors to identify NE neurons in the LC is appreciated, the data fall a little short of conclusively calling these neurons solely noradrenergic as there is an apparent lack of overlap between TH and SLC6A2 in the spots. Undoubtedly, some spots contain both which is consistent with the RNA scope results, but there is clearly a pattern that shows spots that don't contain both. It would be worth testing the presence of other catecholamines in some of these certain spots particularly dopamine (Kempadoo et al. 2016, Takeuchi et al., 2016, Devoto et al. 2005).

      We agree this is an important point. To more rigorously investigate whether TH is co-expressed within cells that produce other catecholamines, particularly dopamine (DA) in addition to norepinephrine (NE), we have included additional analyses of the snRNA-seq and Visium data, as well as generated additional RNAscope data in the revised manuscript, as follows.

      (i) We investigated the spatial expression of DA neuron marker genes besides TH, including SLC6A3 (encoding the dopamine transporter), ALDH1A1, and SLC26A7 in the Visium samples (Figure 3-figure supplement 15), which shows that these genes are not strongly expressed within the manually annotated LC regions in the Visium samples (see Figure 2-figure supplement 1).

      (ii) We investigated expression of DA neuron marker genes SLC6A3, ALDH1A1, and SLC26A7 in the snRNA-seq clustering (updated heatmap in Figure 3-figure supplement 8), which shows minimal expression of these genes within the NE neuron cluster (cluster 6).

      (iii) Despite the data above suggesting little expression of markers for DA neurons within the human LC, we wanted to investigate this question more thoroughly with an orthogonal method given that relatively lower coverage in the sequencing approaches may miss expression, particularly for more lowly expressed transcripts. We generated new high-resolution RNAscope smFISH images at 40x magnification for samples from 3 additional donors (Br8689, Br5529, and Br5426) showing expression of NE neuron marker genes (DBH and TH), a 5-HT neuron marker gene (TPH2), and a DA neuron marker gene (SLC6A3) within individual cells within the LC regions in these samples. Expression of SLC6A3 within individual NE neurons (identified by co-expression of DBH and TH) was not apparent in these RNAscope images (Figure 3-figure supplement 16).

      Together with the previous high-magnification RNAscope images showing co-expression of NE neuron marker genes (DBH, TH, and SLC6A2) within individual NE neurons (Figure 3-figure supplement 4), these new results further strengthen the conclusion that the observed TH+ cells we profiled in the LC are NE-producing neurons. In our view, the lack of observed co-expression of TH and SLC6A2 within some individual Visium spots is likely due to sampling variability and relatively lower sequencing coverage in the Visium data, rather than a true lack of co-expression. We have included additional text in the Results and Discussion further discussing this issue.

      Likewise, given the low throughput of RNA scope, and the fact that it was not done in a systematic manner, it does not conclusively identify the cell types in the region. It might be worth a systematic survey of the cells in the region with both NE and DA markers. Otherwise, it is suggested that the authors be more conservative with their annotations.

      As discussed above, we have now generated additional high-magnification RNAscope images for 3 independent donors (Br8689, Br5529, and Br5426), visualizing expression of two NE neuron marker genes (DBH and TH), one 5-HT neuron marker gene (TPH2), and one DA neuron marker gene (SLC6A3, encoding the dopamine transporter) within individual cells within the LC region in each sample (Figure 3-figure supplement 16). Expression of the DA neuron marker gene (SLC6A3) within individual NE neuron cell bodies (identified by co-expression of DBH and TH) was not apparent in these RNAscope images. Together with our previous RNAscope images showing co-expression of DBH, TH, and SLC6A2 within individual cells (Figure 3-figure supplement 4), in our view, these results provide strong evidence that the observed TH+ cells in the LC are NE-producing neurons, and the data do not provide supporting evidence for the existence of DA-synthesizing neurons in the human LC.

      For the manual annotation, it would be useful to include HE tissue images to better understand how the annotations were derived especially because the annotations are not well corroborated by the clustering.

      We have now included the H&E stained histology images for the Visium samples in Figure 2-figure supplement 2A, which can be compared with the previous figures showing the manual annotations for the LC regions (Figure 2-figure supplement 1). The histology images can also be viewed at higher resolution through the Shiny web app (https://libd.shinyapps.io/locus-c_Visium/).

      The unsupervised clustering is certainly contingent on the number of genes detected, which is in turn dependent on the quality of the material and the success of the experiment. It is unclear from the methods whether the samples were pooled for clustering. If they were pooled, the author might consider using only the samples with UMIs > 500. The low UMI may represent free-floating RNA, suggesting issues with tissue permeabilization in turn influencing the ability to confidently associate genes with spots. Sticking with the higher quality sample may improve the ability to perform unsupervised clustering.

      For the spot-level unsupervised clustering using BayesSpace, our aim was to demonstrate whether it is feasible to segment the LC and non-LC regions in the Visium samples in a data-driven manner using a spatial clustering algorithm, instead of relying on manual annotations. We performed clustering across samples (i.e. pooled) -- we have included additional wording in the text and figure caption to clarify this. We agree with the reviewer there may be further optimizations possible, such as filtering out spots or samples with low UMI counts. However, filtering out low-UMI spots may also confound the clustering if low-UMI spots are associated with biological signal (e.g. preferentially located in white matter regions).
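
      The trade-off described above can be illustrated with a short sketch. It is not the pipeline used in the study: the spot records, the region labels, and the function names are hypothetical, and only the threshold of 500 UMIs comes from the reviewer's suggestion. The second function reports what fraction of spots each region would lose, which is one simple way to check whether low-UMI spots are concentrated in particular regions before filtering.

```python
from collections import Counter

def filter_spots_by_umi(spots, min_umi=500):
    """Split spots into those above the UMI threshold and those removed."""
    kept = [s for s in spots if s["umi"] > min_umi]
    removed = [s for s in spots if s["umi"] <= min_umi]
    return kept, removed

def removed_fraction_by_region(spots, removed):
    """Fraction of spots in each region that the filter would drop."""
    total = Counter(s["region"] for s in spots)
    dropped = Counter(s["region"] for s in removed)
    return {region: dropped[region] / total[region] for region in total}

# Hypothetical toy spots: total UMI count and an annotated region label.
spots = [
    {"umi": 1200, "region": "LC"},
    {"umi": 800, "region": "LC"},
    {"umi": 200, "region": "LC"},
    {"umi": 300, "region": "WM"},
    {"umi": 450, "region": "WM"},
    {"umi": 900, "region": "WM"},
]
kept, removed = filter_spots_by_umi(spots, min_umi=500)
fractions = removed_fraction_by_region(spots, removed)
```

      If the removed fraction differs strongly between regions, as it does for the toy white matter ("WM") spots here, the filter itself carries spatial structure and could confound downstream clustering.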

      Overall, we found that applying data-driven methods such as BayesSpace to segment the LC and non-LC regions did not perform sufficiently to rely on for our downstream analyses (Figure 2-figure supplement 6), and, in our view, further incremental optimizations were unlikely to reach sufficient performance and robustness, so we chose to rely on the manual annotations instead. In addition, as noted in the Results, this avoids potentially inflated false discoveries due to issues of circularity when performing differential gene expression testing between regions defined by unsupervised clustering on the same sets of genes (Gao et al. 2022). We included the BayesSpace results (Figure 2-figure supplement 6) to provide information and ideas to method developers interested in using this dataset as a test case for further development of spatial clustering algorithms. However, further adapting or optimizing these spatial clustering algorithms ourselves was not within the scope of our current work.

      It is not entirely clear why the authors used FANS, especially with the scored tissue. Do the authors think this could have negatively influenced the capture of the desired cell type since FANS can compromise the integrity of the nuclei? In other words, have the authors considered that this may have resulted in a loss rather than enrichment? The proportion of "NE" neurons in the snRNA-Seq data is less than 2% in all cases and at its lowest in sample 6522 which does not correspond well with the proportion of tissue that was manually annotated as containing NE cells, even when taken into consideration the potential size difference of cells. In the same vein, in some samples, there are more "5-HT" neurons in the region than "NE" according to the numbers.

      As noted in our initial response to reviewers (“Response to Public Review Comments”), we used FANS to enrich for neurons based on our previous success with this approach to identify relatively rare neuronal populations in other brain regions (e.g. nucleus accumbens and amygdala; Tran and Maynard et al. 2021). Based on this previous work, our rationale was that without neuronal enrichment, we could potentially miss the LC-NE population, given the relative scarcity and low absolute number of this neuronal population (e.g. estimates of ~50K total in the entire human LC).

      We do not have a definitive answer to the question of whether our use of FANS to enrich for neurons may have led to damage and contributed to the low recovery rate of LC-NE neurons (as well as the relatively increased levels of mitochondrial contamination compared to other brain regions / preparations in the human brain in our hands). Due to our limited tissue resources for this study, we did not have sufficient tissue to perform a direct comparison with non-sorted data. However, we agree with the reviewer that this is plausible, and warrants further investigation in future work. In particular, the relatively large size and fragility of LC-NE neurons, as well as our use of a standard cell straining approach (70 µm, which may not be ideal for this population), may also be contributing factors.

      Systematically optimizing the preparation to attempt to increase recovery rate (and decrease mitochondrial contamination) is an important avenue for future work, and we have decided to share our data and experiences now to assist other groups performing related work. We have included additional wording in the Discussion to further highlight these issues.

      The majority of the snRNA-seq remained unannotated "ambiguous" neurons. It would be highly advantageous to include an annotation for these numerous cells.

      These nuclei were unidentifiable due to ambiguous marker gene expression profiles, i.e. expression of pan-neuronal marker genes without clear expression of either excitatory or inhibitory neuronal marker genes (see Figure 3A and Figure 3-figure supplement 8). Since we were not able to clearly identify these clusters, and due to our additional concerns regarding the data quality (e.g. low recovery rate of the NE neuron population of interest, potential cell damage, and mitochondrial contamination), we decided to label these neuronal clusters as “ambiguous” instead of assigning low-confidence cluster labels. We have included additional wording in the Results section to explain this issue.

      The most likely explanation for identifying serotonergic neurons in these samples is the inclusion of the Raphe Nucleus within the dissection, especially since these cells do not map to the LC per se. As such, is there a way to neuroanatomically define the potential inclusion of this region from these tissue blocks used? Or to the contrary, definitively demonstrate the exclusion of the Raphe?

      As noted in our initial response to reviewers (“Response to Public Review Comments”), our dissection strategy in this initial study precluded the ability to keep track of the exact orientation of the tissue sections on the Visium arrays with respect to their location within the brainstem. Therefore, it is not possible to definitively answer the question of whether the dissections included the raphe nucleus, and if so, which portion of it, based on neuroanatomy from the tissue blocks.

      However, during the course of this study and in parallel, ongoing work for other small, challenging brain regions, we developed a number of specialized technical and logistical strategies for keeping track of orientation and mounting serial sections from the same tissue block onto a single spatial array, which is extremely technically challenging. We are now well-prepared for addressing these issues in future studies, e.g. keeping track of the orientation of the dissections and potential inclusion of adjacent neuroanatomical structures. We have included additional details on this issue in the Discussion.

      Given that one sample (Visium capture area) was excluded as it did not seem to contain a representation of the LC for the profiling of "NE" cells, does it make sense to include this sample in the analysis of 5HT cells given the authors are trying to make claims about the cell composition in and around the LC? Since there appears to be little 5HT contribution from this sample and its inclusion results in inconsistency across experiments and not any notable advantages, the authors might want to reconsider its inclusion in the results.

      We identified a cluster of 5-HT neurons in the snRNA-seq data (Figure 3) and used the Visium samples to further investigate the spatial distribution of this population (Figure 3-figure supplement 9). For the enrichment analyses in the Visium data (Figure 3-figure supplement 9C), we used only the 8 Visium samples that passed quality control (QC). We included the 9th sample (which did not pass QC) in the spot plot visualizations (Figure 3-figure supplement 9A-B) for completeness, but did not base our main conclusions on this sample (in this sample, the tissue resource was likely depleted during earlier sections, so the section for the Visium sample was taken slightly past the extent of the LC within this tissue block). We have included additional wording in the Results section and figure captions to clarify this issue.

      For the RNAscope images, it would be useful to include (draw) the manual annotation of the LC to facilitate interpretation. This is especially useful for demonstrating the separate populations of 5HT and "NE" cells. In general, it would be useful to keep a hashed line perimeter for all sections processed by Visium.

      We have now added a dashed outline indicating the manually annotated LC region in the RNAscope image showing the full tissue section (Figure 3-figure supplement 11). The high-magnification RNAscope images (Figure 3-figure supplement 4, 16, and 17) show regions entirely within the LC regions -- we have included additional wording to note this in the figure captions. For the Visium spot plots, we either labeled spots within the annotated regions within the figures or included additional wording in the figure captions to refer to the figures showing the annotations (Figure 2-figure supplement 1).

      The authors state that they successfully mapped the NE neuron population from snRNA-seq to the manually annotated regions on the Visium slides. Based on the color-coded map, these results are not very convincing since the abundance of the given transcript profile is extremely low. Here again, it would help to draw a hashed line perimeter on the slide to denote the manually annotated region. Perhaps the authors could try a different strategy for mapping snRNA signal to the slide? However, it appears that the mapping worked better for the capture areas with higher UMI/genes counts. Perhaps the authors should consider using only the slides with high gene/UMI counts.

      We agree that the performance of these analyses (Figure 3-figure supplement 14) was not clearly described in the previous version of the manuscript. We have rewritten the corresponding paragraph in the Results section to make it more clear that the mapping (spot-level deconvolution) performance was relatively poor overall, and that we did not use these results for further downstream analyses. We did however want to include these results from the cell2location algorithm to provide information and data for method developers on the challenges of these types of analyses in our dataset (e.g. due to the presence of rare populations, relatively subtle differences in expression profiles between neuronal subpopulations, and potential issues due to large nuclei size and high transcriptional activity for NE neurons). While further approaches for these types of analyses exist, and additional optimizations such as subsetting samples or spots with high UMI counts could also be investigated, in our view, these further optimizations lie outside the scope of our current work. We have also added wording in the figure caption to refer to Figure 2-figure supplement 1, which displays the corresponding annotated LC regions per sample.

      It is hard to see if the RNA scope image Supplementary Figure 11 shows co-localization of SLC6A2, TH, and DBH. Having the individual image from each microscope filter along with the merged image is required to properly assess the colocalization of the signals.

      We updated the multi-channel RNAscope images to show both the merged channels and individual channels in separate panels (Figure 3-figure supplement 4, 16, and 17), which makes the visualization more clear. Thank you for this suggestion. (Note that the previous Supplementary Figure 11 has been re-numbered to Figure 3-figure supplement 4.)

      In the heatmap showing the level of marker transcripts, the expression of the specific markers TH, DBH, and SLC6A2 in NE vs. other clusters looks surprisingly low (particularly TH), while the much broader marker SLC18A2 (monoamine transporter) is considerably more differential. What do the authors make of this finding?

      This is correct. In the snRNA-seq data, we observed that SLC18A2 is one of the most highly differentially expressed (DE) genes in the NE neuron cluster vs. other neuronal clusters, with a high level of expression in the NE neuron cluster (Figure 3C). Note that this heatmap shows the top 70 DE genes (excluding mitochondrial genes) out of the full list of 327 statistically significant DE genes with elevated expression in the NE neuron cluster (the full list of 327 genes is provided in Supplementary File 2C). While all four of these genes (DBH, TH, SLC6A2, and SLC18A2) are identified as statistically significant DE genes, SLC18A2 is the most highly DE out of these and has an especially high level of expression in the NE neuron cluster, as noted by the reviewer (Figure 3C). This could be due to the fact that SLC18A2 transcripts are expressed at higher absolute levels in these neurons than the transcripts that are more specific to LC-NE neurons. While it is true that SLC18A2 is a “broader” marker in the sense that it is found in more cell types -- e.g. cell types within brain nuclei that contain monoaminergic as well as brain nuclei that contain catecholaminergic cells -- expression of SLC18A2 within the LC is highly specific to the catecholaminergic LC-NE neurons given its specialized functional role within monoamine and catecholamine neurons in packaging amine neurotransmitters into synaptic vesicles. We note that SLC18A2 plays a specialized role that is critical to the core function of LC-NE neurons, and hence we are not particularly surprised with this finding and think that one possibility is that this differential expression appears more robustly due to higher absolute levels of the marker.

      While it is understandable that the authors decided to include cells/nuclei with high mitochondrial reads, further work is needed to ensure these cells are of sufficient quality to use in an unbiased way knowing that a high percentage of mitochondrial reads in nuclei sequencing is usually indicative of low-quality nuclei. This can be assessed by evaluating the quality of the nuclei with GWA, which stains an intact nuclear membrane acting as a measure of the integrity of the nuclei.

      To further investigate these results, we added additional analyses evaluating quality control (QC) metrics for the NE neuron cluster in the snRNA-seq data, which had an unusually high proportion of mitochondrial reads (Figure 3-figure supplement 2, shown also below in comments for Reviewer 3) (see also related Figure 3-figure supplement 1, 3, which were included in the manuscript previously). These additional QC analyses do not show any other problematic values for this cluster, other than the high mitochondrial proportion, so we do not believe this is purely a data quality issue. We are aware that this is an unexpected result -- in most cell populations, a high proportion of mitochondrial reads would be indicative of cell damage and poor data quality. However, we have recently also observed high mitochondrial proportions in other relatively rare neuronal populations characterized by large size and high metabolic demand. As discussed below for Reviewer 3, we believe that this is mitochondrial “contamination”, as there should be no mitochondrial reads per se within the nuclear compartment.

      However, in cell populations that have abundant mitochondria and high expression of mitochondrial transcripts in the cell body, the likelihood of ambient RNA capture of mitochondrial transcripts during nuclear preparation may be higher than in cell types with lower expression of mitochondrial transcripts. Hence, we believe that our interpretation is likely correct, i.e. that a combination of technical and biological factors contributes to the inclusion of a relatively high amount of mitochondrial RNA within the droplets for these nuclei. We agree with the reviewer that this finding warrants further investigation in future work. However, in our current study, the tissue resource is depleted for any further experimental validation of this question, so we preferred to provide our data to the community in its current form, while transparently noting this unexpected finding in our results. We have included additional text in the Results section describing the new QC analyses shown in Figure 3-figure supplement 2.
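
      The QC metric discussed above, the proportion of mitochondrial reads per nucleus summarized by cluster, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code used for Figure 3-figure supplement 2: the function names, cluster labels, and counts are hypothetical, and mitochondrial genes are identified by the "MT-" prefix used in human gene symbols.

```python
def mito_fraction(counts):
    """Proportion of a nucleus's UMI counts assigned to mitochondrial genes."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Human mitochondrial gene symbols carry the "MT-" prefix.
    mito = sum(n for gene, n in counts.items() if gene.startswith("MT-"))
    return mito / total

def mean_mito_fraction_by_cluster(nuclei):
    """Average mitochondrial proportion per cluster.

    `nuclei` is a list of (cluster_label, {gene: count}) pairs (hypothetical).
    """
    sums, sizes = {}, {}
    for cluster, counts in nuclei:
        sums[cluster] = sums.get(cluster, 0.0) + mito_fraction(counts)
        sizes[cluster] = sizes.get(cluster, 0) + 1
    return {cluster: sums[cluster] / sizes[cluster] for cluster in sums}

# Hypothetical toy counts, not data from the study.
nuclei = [
    ("NE", {"MT-CO1": 30, "DBH": 10, "TH": 10}),
    ("NE", {"MT-ND1": 20, "SLC6A2": 20}),
    ("other", {"MT-CO1": 5, "SNAP25": 95}),
]
summary = mean_mito_fraction_by_cluster(nuclei)
```

      Comparing this per-cluster summary alongside other QC metrics, such as total UMIs or detected genes, is what allows a high mitochondrial proportion to be separated from generally poor data quality.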

      Minor comments:

      Line 319-321 could be written more clearly to indicate that due to the lack of resolution in a given spot, there are "contaminating reads" that reduce the precision of the cell profile. This reduced precision is likely what results in the "lack of conservation" across species.

      We have added additional wording to this sentence to clarify this point.

      In the discussion, the authors write that the analyses "unbiasedly identified a number of genes enriched in human LC", however, given the manual annotation of the region for each capture area, this resulted in a biased assessment of the spots.

      We have replaced this wording to refer to “untargeted, transcriptome-wide” analyses (i.e. analyses that are not based on a targeted panel of genes) instead of “unbiased”. We agree that the meaning of “unbiased” is ambiguous in this context.

      Reviewer #3 (Recommendations For The Authors):

      Major points:

      Overall, the discovery of some cells in the LC region that express serotonergic markers is intriguing. However, no evidence is presented that these neurons actually produce 5-HT. Perhaps more conservative language would be appropriate (i.e. "cells that possess mRNA signatures of serotonergic neurons" or something like that). Did these cells co-express other markers one would expect in 5-HT neurons like 5-HT autoreceptors and SLC6A18? Also would be useful to compare expression profiles of these putative 5-HT neurons with any published material on bona fide dorsal raphe 5-HT neurons. For the RNAscope confirmation in the supplementary material, it would be helpful to show each marker separately as well as the overlay, and to include representative higher magnification images like were provided for the ACH markers.

      Thank you for this comment. In order to further investigate the identity of these cells, we have investigated the expression of several additional genes including SLC6A18, 5-HT autoreceptor genes (HTR1A, HTR1B), marker genes for 5-HT neurons (SLC18A2, FEV), and marker genes for 5-HT neuronal subpopulations within the dorsal and median raphe nuclei from the literature (Ren et al. 2019), in both the Visium and the snRNA-seq data.

      We observed some expression of SLC18A2 and FEV within the same areas as SLC6A4 and TPH2 in the Visium samples (Figure 3-figure supplement 10A-B, reproduced below; note that SLC18A2 is also a marker gene for NE neurons located within the LC regions), consistent with Ren et al. (2019). However, we did not observe a strong or consistent expression signal for the 5-HT autoreceptors (HTR1A, HTR1B) (Figure 3-figure supplement 10C-D, reproduced below), and we observed zero expression of SLC6A18 in the Visium samples. In the snRNA-seq data, within the cluster identified as 5-HT neurons, we observed some expression of SLC18A2, low expression of FEV, and almost zero expression of SLC6A18 (Figure 3-figure supplement 8, reproduced below; note that SLC6A18 is not shown since it was removed during filtering for low-expressed genes). Similarly, we observed very low expression of the 5-HT autoreceptors (HTR1A, HTR1B) and the additional marker genes for 5-HT neuronal subpopulations from Ren et al. (2019) -- with the possible exception of the neuropeptide receptor gene HCRTR2, which was identified by Ren et al. (2019) within several clusters in both the dorsal and median raphe in mice (Figure 3-figure supplement 8, reproduced below).

      Overall, these additional results give us some further confidence that these are likely 5-HT neurons (due to expression of SLC18A2 and FEV), while also raising further questions (due to the absence of 5-HT autoreceptor genes HTR1A, HTR1B and 5-HT neuronal subpopulation marker genes). While we believe that the most likely explanation is the inclusion of 5-HT neurons from the edges of the adjacent dorsal raphe nuclei in our samples, we acknowledge that the evidence presented is not fully conclusive and does not identify specific subpopulations of 5-HT neurons. In addition, the limited size of our dataset (number of samples and cells) and the lack of information on sample orientation precludes any definitive identification of subpopulations based on their association with specific anatomical regions within the dorsal raphe nuclei. We have updated the manuscript by (i) adjusting our language in the Results and Discussion, (ii) including the additional analyses, supplementary figures, and reference to the literature (Ren et al. 2019) discussed above, and (iii) including additional wording in the Discussion on improvements to the dissection strategy that would allow these questions to be addressed in future studies via a focused molecular profiling of the dorsal raphe nuclei across the rostral-caudal axis.

      Regarding the RNAscope images, we have included additional images showing channels side-by-side and higher magnification, as suggested (and also discussed above for Reviewers 1 and 2). In addition, we have added an outline highlighting the LC region in Figure 3-figure supplement 11 (as suggested above by Reviewer 2), and included an additional high-magnification RNAscope image demonstrating co-expression of 5-HT neuron marker genes (TPH2 and SLC6A4) within individual cells (Figure 3-figure supplement 12).

      Concerning the snRNA-seq experiments, why were only 3 of the 5 donors used, particularly given the low number of LC-NE nuclear transcriptomes obtained? How were the 3 donors chosen from the 5 total donors and how many 100 um sections were used from each donor? Are the 295 nuclei obtained truly representative of the LC population or are they just the most resilient LC nuclei? How many LC nuclei would be estimated to be captured from staining the 100 um tissue sections?

      As discussed in our previous response to reviewers (“Response to Public Review Comments”), the reason we included only 3 of the 5 donors for the snRNA-seq assays was due to tissue availability on the tissue blocks. In this study, we were working with a finite tissue resource. Due to the logistics and thickness of the required tissue sections for Visium (10 μm) and snRNA-seq (100 μm), running Visium first allowed us to ensure that we could collect data from both assays -- if we ran snRNA-seq first and captured no neurons, the tissue block would be depleted. Due to resource depletion, we did not have sufficient available tissue remaining on all tissue blocks to run the snRNA-seq assay for all donors. We have conducted extensive piloting in other brain regions on the amount (mg) of tissue that is needed from various sized cryosections, and the LC is particularly difficult since these are small tissue blocks and the extent of the structure is small. Hence, in some of the subjects, we did not have sufficient tissue available for the snRNA-seq assay.

      We have included details on the number of 100 μm sections used for each donor in Methods -- this varied between 10-15 sections per donor, approximating 50-80 mg of tissue per donor.

      Regarding the question about the representativeness / resilience of the LC nuclei -- as discussed in our previous response to reviewers (“Response to Public Review Comments”) and above for Reviewer 2, we agree that this is a concern. As discussed above for Reviewer 2, it is plausible that our use of FANS may have contributed to cell damage and the low recovery rate of LC-NE neurons. The relatively large size and fragility of LC-NE neurons, as well as our use of a standard cell straining approach (70 µm, which may not be ideal for this population), may also be contributing factors. Due to our limited tissue resource, we did not have sufficient tissue to perform a direct comparison with non-sorted data.

      Systematically optimizing the preparation to attempt to increase recovery rate is an important avenue for future work. We have included additional discussion of this issue in the Discussion.

      Regarding the question about the number of expected nuclei, we have now included estimates of the number of cells per spot within the LC regions in the Visium data (see also related point below, and Figure 2-figure supplement 2B reproduced below), based on the H&E stained histology images and use of cell segmentation software (VistoSeg; Tippani et al. 2022). While we do not have any confident estimates of the number of expected nuclei in the snRNA-seq data, these estimates of cell density from the Visium data could, together with information on additional factors such as the accuracy of the tissue scoring and the effectiveness of FANS, be used to help derive an expected number of nuclei in future studies. We have included additional wording in the Discussion to note that these estimates could be used in this manner during future studies.

      The LC displays rostral/caudal and dorsal/ventral differences, including where they project, which functions they regulate, and which parts are vulnerable in neurodegenerative disease (e.g. Loughlin et al., Neuroscience 18:291-306, 1986; Dahl et al., Nat Hum Behav 3:1203-14, 2019; Beardmore et al., J Alzheimer's Dis 83:5-22, 2021; Gilvesy et al., Acta Neuropathol 144:651-76, 2022; Madelung et al., Mov Disord 37:479-89, 2022). Which part(s) of the LC was captured for the SRT and snRNAseq experiments?

      As discussed in our previous response to reviewers (“Response to Public Review Comments”), a limitation of this study was that we did not record the orientation of the anatomy of the tissue sections, precluding our ability to annotate the tissue sections with the rostral/caudal and dorsal/ventral axis labels. We agree with the reviewer that additional spatial studies, in future work, could offer needed and important information about expression profiles across the spatial axes (rostral/caudal, ventral/dorsal) of the LC. Our study provides us with insight about optimizing the dissections for spatial assays, as well as bringing to light a number of technical and logistical issues that we had not initially foreseen. For example, during the course of this study and parallel, ongoing work in other, small, challenging regions, we have now developed a number of specialized technical and logistical strategies for keeping track of orientation and mounting serial sections from the same tissue block onto a single spatial array, which is extremely technically challenging. We are now well-prepared for addressing these issues in future studies with larger numbers of donors and samples in order to make these types of insights. We have included additional details in the Discussion to further discuss this point.

      The authors mention that in other human SRT studies, there are typically between 1-10 cells per expression spot. I imagine that this depends heavily on the part of the brain being studied and neuronal density. In this specific case, can the authors estimate how many LC cells were contained in each expression spot?

      We have now performed additional analyses to provide an estimate of the number of cells per spot in the Visium data (Figure 2-figure supplement 2B), based on the application of cell segmentation software (VistoSeg; Tippani et al. 2022) to identify cell bodies in the H&E stained histology images. We applied this methodology and calculated summary statistics within the annotated LC regions for 6 samples (see Methods), and found that the median number of cells per spot within the LC regions ranged from 2 to 5 per sample. We note that these estimates include both NE neurons and other cell types within the LC regions, and that applying cell segmentation software in this brain region is particularly challenging due to the wide range in cell body sizes, with NE neurons being especially large. We have included these updated estimates in the Results and Discussion, and additional details in Methods.
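      As an illustration only (this is not the VistoSeg pipeline used in the study), the per-sample summary described above can be sketched as follows. The sample identifiers, the LC annotation flag, and the cell counts are hypothetical placeholders; in practice the counts would come from cell segmentation of the H&E images.

```python
from collections import defaultdict
from statistics import median

# Hypothetical per-spot records: (sample_id, spot_is_in_annotated_LC_region, segmented_cell_count)
spots = [
    ("sample_A", True, 3), ("sample_A", True, 2), ("sample_A", True, 5),
    ("sample_A", False, 1),
    ("sample_B", True, 4), ("sample_B", True, 6), ("sample_B", True, 4),
    ("sample_B", False, 9),
]

def median_cells_per_spot(spot_records):
    """Return the median cell count per spot within LC regions, keyed by sample."""
    counts_by_sample = defaultdict(list)
    for sample_id, in_lc_region, n_cells in spot_records:
        if in_lc_region:  # spots outside the annotated LC region are ignored
            counts_by_sample[sample_id].append(n_cells)
    return {sample: median(counts) for sample, counts in counts_by_sample.items()}

medians = median_cells_per_spot(spots)
print(medians)
```

      The median is used here, as in the response, because it is less sensitive than the mean to the occasional spot with an unusually high or low segmented cell count.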

      Regarding comparison of human LC-associated genes with rat or mouse LC-associated genes (Fig. 2D-F), the authors speculate that the modest degree of overlap may be due to species differences between rodent and human and/or methodological differences (SRT vs microarray vs TRAP). Was there greater overlap between mouse and rat than between mouse/rat and human? If so, that is evidence for the former. If not, that is evidence for the latter. Also would be useful for more in-depth comparison with snRNA-seq data from mouse LC. https://www.biorxiv.org/content/10.1101/2022.06.30.498327v1

      Our comparisons with the mouse (Mulvey et al. 2018) and rat (Grimm et al. 2004) data showed that we observed a relatively higher overlap between the human vs. mouse data than the human vs. rat data (Figures 2F-G and 3D-E). However, we note that the substantially different technologies used (TRAP-seq in mouse vs. laser capture microdissection and microarrays in rat) make it difficult to confidently interpret the degree of overlap between the two studies, and a direct comparison of these alternative platforms (TRAP-seq vs. LCM / microarray) or species (mouse vs. rat) lies outside the scope of our study. We have included updated wording in the Results and Discussion to explain this issue and help interpret these results.

      Regarding the newer mouse study using snRNA-seq (Luskin and Li et al. 2022), we have extended our analyses to perform a more in-depth comparison with this study. Specifically, we have evaluated the expression of an additional set of GABAergic neuron marker genes from this study within our secondary clustering of inhibitory neurons in the snRNA-seq data (Figure 3-figure supplement 13B). We observe some evidence of cluster-specific expression of several genes, including CCK, PCSK1, PCSK2, PCSK1N, PENK, PNOC, SST, and TAC1. We have also included additional text describing these results in the Results section.

      The finding of ACHE expression in LC neurons is intriguing. Susan Greenfield has published a series of papers suggesting that ACHE has functions independent of ACH metabolism that contributes to cellular vulnerability in neurodegenerative disease. This might be worth mentioning.

      We thank the reviewer for pointing this out. We were very surprised too by the observed expression of SLC5A7 and ACHE in the LC regions (Visium data) and within the LC-NE neuron cluster (snRNA-seq data), coupled with absence of other typical cholinergic marker genes (e.g. CHAT, SLC18A3), and we do not have a compelling explanation or theory for this. Hence, the work of Susan Greenfield and colleagues suggesting non-cholinergic actions of ACHE, particularly in other catecholaminergic neuron populations (e.g. dopaminergic neurons in the substantia nigra) is very interesting. We have included references to this work and how it could inform interpretation of this expression (Greenfield 1991; Halliday and Greenfield 2012) in the Discussion.

      High mitochondrial reads from snRNA-seq can indicate lower quality. Can the authors comment on this and explain why they are confident in the snRNA-seq data from presumptive LC-NE neurons?

      As mentioned above for Reviewer 2, we have included additional analyses to further compare quality control (QC) metrics for the NE neuron cluster (which had an unusually high proportion of mitochondrial reads) against other neuronal and non-neuronal clusters and nuclei in the snRNA-seq data (Figure 3-figure supplement 2). These additional QC analyses do not show any other problematic values for this cluster. Specifically, we show that the QC metric values for sum UMIs and detected genes per droplet for the NE neuron cluster fall within the range for (A) other neurons and (B) all other nuclei (excluding droplets with ambiguous / unidentifiable neuronal signatures). In addition, we observe that the droplets with the highest mitochondrial percentages (>75%) (C-D), which also have an unusually low number of detected genes (D), tend to be from the ambiguous category (droplets with ambiguous / unidentifiable neuronal signatures), suggesting that true low-quality droplets are correctly identified and included within the ambiguous category (e.g. consisting of a mixture of debris from partially damaged nuclei) instead of as NE neurons. Since our QC analyses for the NE neuron cluster do not show any problems other than the high mitochondrial percentage, we do not believe these are simply mis-classified low-quality droplets. We also note that we have recently observed high mitochondrial proportions in other relatively rare neuronal populations characterized by large size and high metabolic demand in human data. We believe that our interpretation is correct -- i.e. that a combination of technical and biological factors has led to the inclusion of a relatively high amount of mitochondrial RNA within the droplets for these nuclei. We have included these additional QC analyses (Figure 3-figure supplement 2) and further discussion of this issue in the Results section.
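      For readers unfamiliar with these QC metrics, a minimal sketch of how they are commonly computed per droplet is given below. This is not the code used in the study: the function names, the "MT-" gene-name prefix convention for mitochondrial genes, and the example counts are assumptions made for illustration.

```python
def qc_metrics(counts, mito_prefix="MT-"):
    """Compute three per-droplet QC metrics from a {gene: UMI count} mapping.

    Assumes mitochondrial genes are identified by a name prefix (hypothetical convention here).
    """
    sum_umis = sum(counts.values())
    detected_genes = sum(1 for c in counts.values() if c > 0)
    mito_umis = sum(c for gene, c in counts.items() if gene.startswith(mito_prefix))
    mito_pct = 100.0 * mito_umis / sum_umis if sum_umis else 0.0
    return {"sum_umis": sum_umis, "detected_genes": detected_genes, "mito_pct": mito_pct}

def within_range(values, reference):
    """True if every value lies between the minimum and maximum of the reference values."""
    return min(reference) <= min(values) and max(values) <= max(reference)

# Hypothetical droplet with a high mitochondrial fraction but several detected genes
droplet = {"DBH": 10, "TH": 20, "MT-CO1": 60, "MT-ND1": 10, "GAD1": 0}
metrics = qc_metrics(droplet)
print(metrics)

# Hypothetical check that a cluster's detected-gene counts fall within the range seen in other neurons
print(within_range([1500, 2200, 3100], [900, 1800, 4000, 5200]))
```

      The range check mirrors the comparison described above (cluster values against other neurons or all other nuclei); a full analysis would compare distributions rather than only extremes.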

      The Discussion could be expanded. Because there is a lot known and/or assumed about the LC, discussing all of it is certainly beyond the scope of this manuscript. However, perhaps the authors could pick a few more for confirmation and hypothesis generation. For example, one of the most well studied and important aspects of the LC is its regulation by neuromodulatory inputs. It would be interesting for the authors to discuss the expression of receptors for CRF, cannabinoids, orexin, galanin, 5-HT, etc, particularly when compared with the available rodent TRAP and snRNA-seq data (https://www.biorxiv.org/content/10.1101/2022.06.30.498327v1) contained some surprises, such as very low expression of CRF1 in LC-NE neurons, suggesting that the powerful activation of LC cells by CRF is indirect. Does this hold up in humans?

      We have expanded the Discussion to include additional discussion and references on several points, as discussed also above. Indeed, these are interesting questions, and these neuromodulatory systems are all of interest in the context of signaling within the LC and the function of the LC-NE system. We note that the manuscript serves primarily as a data resource and will be useful in many different ways depending on the different goals and interests of the readers. This is precisely why we wanted to take the time to make accessible, easy-to-use tools to interrogate and visualize the data. In Author response images 1-4, we have provided screenshots from the Shiny visualization app for the Visium data (https://libd.shinyapps.io/locus-c_Visium/) querying several main receptors of the neuromodulatory systems that this reviewer is particularly interested in, to illustrate how the visualization apps can readily be used to query specific genes and systems of interest.

      Author response image 1.

      CRHR1:

      Author response image 2.

      CNR1:

      Author response image 3.

      OXR1:

      Author response image 4.

      GALR1:

      Minor points:

      Line 46 add stress responses to the key functions of LC neurons

      We have added this point and included additional references to support the findings.

      Line 47 add that the LC was so named "blue spot" because of its signature production of neuromelanin pigment

      We have added this point.

      Line 49 LC's capacity to synthesize NE is not "unique" - several other brainstem/medullary nuclei also synthesize NE (e.g. A1-A7; LC is A6)

      We have updated this wording.

      Line 54 Although prior evidence indicated age-related LC cell loss in people without frank neurodegenerative disease, recent studies that are better powered and used unbiased stereological methods have refuted the idea that LC neurons die during normal aging (reviewed in Matchett et al., Acta Neuropathologica 141:631-50, 2021)

      We have updated this part of the Introduction to focus on cell loss in the LC in neurodegenerative disease and removed the older references describing studies that suggested LC neurons die in normal aging.

      Line 62 Would also be worth mentioning the role of the LC in other mood disorders where adrenergic drugs are often prescribed, such as PTSD (e.g. prazosin), opioid withdrawal (e.g. lofexidine), anxiety and depression (e.g. NE reuptake inhibitors).

      We have added additional references to these disorders and their treatment with noradrenergic drugs in the Introduction.

      Additional updates from Public Review Comments:

      We have also included the following updates, in response to additional reviewer comments received during the initial round of “Public Review Comments” and which are not already described in the responses to the “Recommendations for the Authors” above.

      ● We included updated wording in the Results section and Figure 1C caption to more clearly describe the number of donors included in the final SRT and snRNA-seq data used for analyses after all quality control (QC) steps (4 donors for SRT data, 3 donors for snRNA-seq data).

      ● Figure 3-figure supplement 1D (number of nuclei per cluster in unsupervised clustering of snRNA-seq data) has been updated to show percentages of nuclei per cluster.

      ● We have added comparisons between the lists of differentially expressed (DE) genes identified in the Visium and snRNA-seq data. To make these sets comparable, we have added (i) snRNA-seq DE testing results between the NE neuron cluster and all other clusters (instead of other neuronal clusters only, as shown in the main results in Figure 3) (excluding ambiguous neuronal) (Figure 3-figure supplement 6 and Supplementary File 2D), and (ii) calculated overlaps and comparisons between the sets of DE genes between the Visium data (pseudobulked LC vs. non-LC regions) and the snRNA-seq data (NE neuron cluster vs. all other clusters excluding ambiguous neuronal). This comparison generated a list of 51 genes that were identified as statistically significant DE genes (FDR < 0.05 and FC > 2) in both the Visium and the snRNA-seq data (Figure 3-figure supplement 7 and Supplementary File 2E).
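      The overlap calculation in the last point above can be sketched as follows. This is an illustration under stated assumptions, not the analysis code from the study: the result tables are hypothetical, the values are invented, and "FC > 2" is interpreted here as one-directional upregulation (log2 fold change > 1).

```python
import math

def significant_genes(results, fdr_max=0.05, fc_min=2.0):
    """Return genes passing both thresholds from a {gene: (fdr, log2_fold_change)} mapping.

    Assumption: FC > fc_min is applied to upregulation only, i.e. log2FC > log2(fc_min).
    """
    log2_fc_min = math.log2(fc_min)
    return {gene for gene, (fdr, log2_fc) in results.items()
            if fdr < fdr_max and log2_fc > log2_fc_min}

# Hypothetical DE results (values are invented for illustration)
visium_de = {"DBH": (0.001, 3.2), "TH": (0.002, 2.8), "SLC6A2": (0.01, 1.5),
             "GAD1": (0.2, 0.3), "MBP": (0.01, 0.5)}
snrnaseq_de = {"DBH": (0.0001, 4.0), "TH": (0.03, 1.2), "SLC6A2": (0.08, 2.0),
               "GAD1": (0.01, 1.6)}

# Genes significant in both datasets
shared = sorted(significant_genes(visium_de) & significant_genes(snrnaseq_de))
print(shared)
```

      Applying the same thresholds to both datasets before intersecting is what makes the two gene sets comparable, which is the reason for re-running the snRNA-seq test against all other clusters as described above.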

      Other additional updates:

      We have added an additional data repository (Globus). Raw data files (FASTQ sequencing data files and high-resolution TIF image files) are now available via Globus from the WeberDivecha2023_locus_coeruleus data collection from the jhpce#globus01 Globus endpoint, which is also listed at http://research.libd.org/globus/. The Globus repository is not publicly accessible due to individually identifiable donor genetic variants in the FASTQ files. Approved users may request access from the corresponding authors. This data repository is listed in the Data Availability section.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We thank you for sending our manuscript for the second round of review.  We are encouraged by the comments from reviewer #2 that our supplementary work on naïve T cells and antibody blockade work satisfied their previous concerns and is important for our work.

      The Editors raised concerns that we have shared preliminary data on Nrn1 and AMPAR double knockout mice.  We apologize for our enthusiasm for these studies.  Because of the publication model by eLife, we shared that data not because we needed to persuade the reviewer for publication purposes but rather to agree with the reviewer that the molecular target of Nrn1 is important, and we are progressing in understanding this subject.


      The following is the authors’ response to the original reviews.

      To Reviewer #1:

      Thank you for your thorough review and comments on our work, which you described as follows: “the role of neuritin in T cell biology studied here is new and interesting.” We have grouped your comments into two categories: (i) biology and investigation approach, and (ii) experimental rigor and data presentation.

      Biology and Investigation approach comments:

      (1) Questions regarding the T cell anergy model:

      Major point “(4) Figure 1E-H. The authors assume that this immunization protocol induces anergic cells, but they provide no experimental evidence for this. It would be useful to show that T cells are indeed anergic in this model, especially those that are OVA-specific. The lack of IL-2 production by Cltr cells could be explained by the presence of fewer OVA-specific cells, rather than by an anergic status.”

      T cell anergy is a well-established concept first described by Schwartz’s group. It refers to the hyporesponsive functional state of antigen-experienced CD4 T cells (Chappert and Schwartz, 2010; Fathman and Lineberry, 2007; Jenkins and Schwartz, 1987; Quill and Schwartz, 1987). Anergic T cells are characterized by their inability to expand and to produce IL2 upon subsequent antigen re-challenge. In this paper, we adopted the existing in vivo T cell anergy induction model used by Mueller’s group (Vanasek et al., 2006). Specifically, Thy1.1+ Ctrl or Nrn1-/- TCR transgenic OTII cells were co-transferred with the congenically marked Thy1.2+ WT polyclonal Treg cells into TCR-/- mice. After anergy induction, the congenically marked TCR transgenic T cells were recovered by sorting on the Thy1.1+ congenic marker and subsequently restimulated ex vivo with OVA323-339 peptide. We evaluated the T cell anergic state based on OTII cell expansion in vivo and IL2 production upon OVA323-339 restimulation ex vivo.

      “The authors assume that this immunization protocol induces anergic cells, but they provide no experimental evidence for this.”

      Because the anergy model by Mueller's group is well established (Vanasek et al., 2006), we did not feel that additional effort was required to validate this model as the reviewer suggested. Moreover, the limited IL2 production among the control cells upon restimulation confirms the validity of this model.

      “The lack of IL-2 production by Cltr cells could be explained by the presence of fewer OVA-specific cells, rather than by an anergic status.”

      Cells from Ctrl and Nrn1-/- mice on a homogeneous TCR transgenic (OTII) background were used in these experiments. The possibility that substantial variability of TCR expression or different expression levels of the transgenic TCR could have impacted IL2 production rather than anergy induction is unlikely.

      Overall, we used this in vivo anergy model to evaluate the Nrn1-/- T cell functional state in comparison to Ctrl cells under the anergy induction condition following the evaluation of Nrn1 expression, particularly in anergic T cells.  Through studies using this anergy model, we observed a significant change in Treg induction among OTII cells. We decided to pursue the role of Nrn1 in Treg cell development and function rather than the biology of T cell anergy as evidenced by subsequent experiments.

      Minor points “(6) On which markers are anergic cells sorted for RNAseq analysis?”

      Cells were sorted out based on their congenic marker marking Ctrl or Nrn1-/- OTII cells transferred into the host mice.  We did not specifically isolate anergic cells for sequencing.

      (2) Question regarding the validity of iTreg differentiation model.

      Major point: “(5) Figure 2A-C and Figure 3. The use of iTregs to try to understand what is happening in vivo is problematic. iTregs are cells that have probably no equivalent in vivo, and so may have no physiological relevance. In any case, they are different from pTreg cells generated in vivo. Working with pTreg may be challenging, that is why I would suggest generating data with purified nTreg. Moreover, it was shown in the article of Gonzalez-Figueroa 2021 that Nrn1-/- nTreg retained a normal suppressive function, which would not be what is concluded by the authors of this manuscript. Moreover, we do not even know what the % of Foxp3 cells is in the iTreg used (after differentiation and 20h of re-stimulation) and whether this % is the same between Ctlr and Nrn1 KO cells.”.

      We thank Reviewer #1 for their feedback. While it is true that iTregs made in vitro and pTregs generated in vivo display several distinctions (e.g., differences in Foxp3 expression stability), we strongly disagree with Reviewer #1’s statement that “The use of iTregs to try to understand what is happening in vivo is problematic. iTregs are cells that have probably no equivalent in vivo, and so may have no physiological relevance.” The induced Treg cell (iTreg) model was established over 20 years ago (Chen et al., 2003; Zheng et al., 2002), and the model is widely adopted with over 2000 citations. Further, it has been instrumental in understanding different aspects of regulatory T cell biology (Hurrell et al., 2022; John et al., 2022; Schmitt and Williams, 2013; Sugiura et al., 2022).

      Because we observed reduced pTreg generation in vivo, we chose to use the in vitro iTreg model system to understand the mechanistic changes involved in Treg cell differentiation and function, specifically neuritin’s role in this process. We have made no claim that iTreg cell biology is identical to that of pTreg cells generated in vivo or of nTreg cells. However, the iTreg culture system has proved to be a good in vitro system for deciphering molecular events involved in complex processes. As such, it remains a commonly used approach by many research groups in the Treg cell field (Hurrell et al., 2022; John et al., 2022; Sugiura et al., 2022). Moreover, applying the iTreg in vitro culture system has been instrumental in helping us identify the cell electrical state change in Nrn1-/- CD4 cells and revealed the biological link between Nrn1 and the ionotropic AMPA receptor (AMPAR), which we discuss further below. It is technically challenging to use nTreg cells for T cell electrical state studies due to their heterogeneous nature from development in an in vivo environment and the effect of manipulation during the nTreg cell isolation process, which can both affect the T cell electrical state.

      “Moreover, it was shown in the article of Gonzalez-Figueroa 2021 that Nrn1-/- nTreg retained a normal suppressive function, which would not be what is concluded by the authors of this manuscript.” 

      We have also carried out nTreg studies in vitro in addition to iTreg cells. Similar to Gonzalez-Figueroa et al.'s findings, we did not observe differences in suppression function between Nrn1-/- and WT nTreg using the in vitro suppression assay. However, Nrn1-/- nTreg cells revealed reduced suppression function in vivo (Fig. 2D-L). In fact, Gonzalez-Figueroa et al. observed reduced plasma cell formation after OVA immunization in Treg-specific Nrn1-/- mice, implicating reduced suppression from Nrn1-/- follicular regulatory T (Tfr) cells. Thus, our observation of the reduced suppression function of Nrn1-/- nTreg toward effector T cell expansion, as presented in Fig. 2D-L, does not contradict the results from Gonzalez-Figueroa et al. Rather, the conclusions of these two studies agree that Nrn1 can play important roles in immune suppression observable in vivo that are not captured readily by the in vitro suppression assay.

      “Moreover, we do not even know what the % of Foxp3 cells is in the iTreg used (after differentiation and 20h of re-stimulation) and whether this % is the same between Ctlr and Nrn1 KO cells.”

      We have stated in the manuscript on page 7 line 208 that “Similar proportions of Foxp3+ cells were observed in Nrn1-/- and Ctrl cells under the iTreg culture condition, suggesting that Nrn1 deficiency does not significantly impact Foxp3+ cell differentiation”. In the revised manuscript, we will include the data on the proportion of Foxp3+ cells before iTreg restimulation.

      (3) Confirmation of transcriptomic data regarding amino acids or electrolytes transport change

      Minor point “(3) Would not it be possible to perform experiments showing the ability of cells to transport amino acids or electrolytes across the plasma membrane? This would be a more interesting demonstration than transcriptomic data.”

      We appreciate Reviewer #1’s suggestion to “perform experiments showing the ability of cells to transport amino acids or electrolytes across the plasma membrane”. We have indeed already performed such experiments, corroborating the transcriptomics data on differential amino acid and nutrient transporter expression. Specifically, we loaded either iTreg or Th0 cells with membrane potential (MP) dye and measured the change in MP level after adding the complete set of amino acids (complete AA). Upon entry, the charge carried by AAs may transiently affect cell membrane potential. Different AA transporter expression patterns may produce different patterns of MP change upon AA entry, as shown in Author response image 1. We observed reduced MP change in Nrn1-/- iTreg cells compared to the Ctrl, whereas in Th0 cells, Nrn1-/- cells showed enhanced MP change compared to the Ctrl. We can certainly include these data in the revised manuscript.

      Author response image 1.

      Membrane potential change induced by amino acid entry. a. Nrn1-/- or WT iTreg cells were loaded with MP dye, and MP change was measured upon the addition of a complete set of AAs. b. Nrn1-/- or WT Th0 cells were loaded with MP dye, and MP change was measured upon the addition of a complete set of AAs.

      (4) EAE experiment data assessment

      Minor point “(5) Figure 5F. How are cells re-stimulated? If polyclonal stimulation is used, the experiment is not interesting because the analysis is done with lymph node cells. This analysis should either be performed with cells from the CNS or with MOG restimulation with lymph node cells.”

      In the EAE study, the Nrn1-/- mice exhibit similar disease onset but a protracted non-resolving disease phenotype compared to the WT control mice. Several factors may contribute to this phenotype: 1. Enhanced T effector cell infiltration/persistence in the central nervous system (CNS); 2. Reduced Treg cell-mediated suppression of the T effector cells in the CNS; 3. Protracted non-resolving inflammation at the immunization site, which has the potential to continue sending T effector cells into the CNS, contributing to persistent inflammation. Based on this reasoning, we examined the infiltrating T effector cell number and Treg cell proportion in the CNS. We also restimulated cells from draining lymph nodes close to the inflammation site, looking for evidence of persistent inflammation. When mice were harvested around day 16 after immunization, the inflammation at the local draining lymph node should be at the contraction stage. We stimulated cells with PMA and ionomycin in order to observe all potential T effector cells involved in the draining lymph node rather than only MOG antigen-specific cells. We disagree with Reviewer #1’s assumption that “This analysis should either be performed with cells from the CNS or with MOG restimulation with lymph node cells.” We think the experimental approach we have taken has been appropriately tailored to the biological questions we intended to answer.

      Experimental rigor and data presentation.

      (1) data labeling and additional supporting data

      Major points

      (2) The authors use Nrn1+/+ and Nrn1+/- cells indiscriminately as control cells on the basis of similar biology between Nrn1+/+ and Nrn1+/- cells at homeostasis. However, it is quite possible that the Nrn1+/- cells have a phenotype in situations of in vitro activation or in vivo inflammation (cancer, EAE). It would be important to discriminate Nrn1+/- and Nrn1+/+ cells in the data or to show that both cell types have the same phenotype in these conditions too.

      (3) Figure 1A-D. Since the authors are using the Nrp1 KO mice, it would be important to confirm the specificity of the anti-Nrn1 mAb by FACS. Once verified, it would be important to add FACS results with this mAb in Figures 1A-C to have single-cell and quantitative data as well.

      Minor points  

      (1) Line 119, 120 of the text. It is said that one of the most up-regulated genes in anergic cells is Nrn1 but the data is not shown.

      (2) For all figures showing %, the titles of the Y axes are written in an odd way. For example, it is written "Foxp3% CD4". It would be more conventional and clearer to write "% Foxp3+ / CD4+" or "% Foxp3+ among CD4+".

      (4) For certain staining (Figure 3E, H) it would be important to show the raw data, in addition to MFI or % values.

      We can adapt the labeling and provide additional data, including Nrn1 staining on Treg cells and flow graphs for pmTOR and pS6 staining (Fig. 3H), as requested by Reviewer #1.

      (2) Experimental rigor:

      General comments:

      “However, it is disappointing that reading this manuscript leaves an impression of incomplete work done too quickly.”

      We were discouraged to receive the comment, “this manuscript leaves an impression of incomplete work done too quickly.” Our study of this novel molecule began without any existing biological tools such as antibodies, knockout mice, etc. Over the past several years, we have established our own antibodies for Nrn1 detection, obtained and characterized Nrn1 knockout mice, and utilized multiple approaches to identify the molecular mechanism of Nrn1 function. Through the use of the in vitro iTreg system described in this manuscript, we identified the association of Nrn1 deficiency with cell electrical state change, potentially connected to AMPAR function. We have further corroborated our findings by generating Nrn1 and AMPAR T cell specific double knockout mice and confirmed that T cell specific AMPAR deletion could abrogate the phenotype caused by the Nrn1 deficiency (see Support Figure 2). We did not include the double knockout data in the current manuscript because AMPAR function has not yet been studied thoroughly in T cell biology, and we feel this topic warrants examination in its own right. However, the unpublished data support the finding that Nrn1 modulates the T cell electrical state and, consequently, metabolism, ultimately influencing tolerance and immunity. In its current form, the manuscript represents the first characterization of the novel molecule Nrn1 in anergic cells, Tregs, and effector T cells. While this work has led to several exciting additional questions, we disagree that the novel characterization we have presented is incomplete. We feel that our present data set, which squarely highlights Nrn1’s role as an important immune regulator while shedding unprecedented light on the molecular events involved, will be of considerable interest to a broad field of researchers.

      “Multiple models have been used, but none has been studied thoroughly enough to provide really conclusive and unambiguous data. For example, 5 different models were used to study T cells in vivo. It would have been preferable to use fewer, but to go further in the study of mechanisms.”

      We have indeed used multiple in vivo models to reveal Nrn1's function in Treg differentiation, Treg suppression function, T effector cell differentiation and function, and the overall impact on autoimmune disease. Because the impact of ion channel function is often context-dependent, we examined the biological outcome of Nrn1 deficiency in several in vivo contexts.  We would appreciate it if Reviewer#1 would provide a specific example, given the Nrn1 phenotype, of how to proceed deeper to investigate the electrical change in the in vivo models.

      “Major points

      (1) A real weakness of this work is the fact that in most of the results shown, there are few biological replicates with differences that are often small between Ctrl and Nrn1 -/-. The systematic use of student's t-test may lead to thinking that the differences are significant, which is often misleading given the small number of samples, which makes it impossible to know whether the distributions are Gaussian and whether a parametric test can be used. RNAseq bulk data are based on biological duplicates, which is open to criticism.”

      We respectfully disagree with Reviewer #1 on the question of statistical power and significance to our work. We have used 5-8 mice/group for each in vivo model and 3-4 technical replicates for the in vitro studies, with a minimum of 2-3 replicate experiments. These group sizes and replication numbers are in line with those seen in high-impact publications. While some differences between Ctrl and Nrn1-/- appear small, they have significant biological consequences, as evidenced by the various Nrn1-/- in vivo phenotypes. Furthermore, we believe we have subjected our data to the appropriate statistical tests to ensure rigorous analysis and representation of our findings.

      To Reviewer #2.

      We thank Reviewer #2 for the careful review of the manuscript. We especially appreciate the comments that “The characterizations of T cell Nrn1 expression both in vitro and in vivo are comprehensive and convincing. The in vivo functional studies of anergy development, Treg suppression, and EAE development are also well done to strengthen the notion that Nrn1 is an important regulator of CD4 responsiveness.”

      “The major weakness of this study stems from a lack of a clear molecular mechanism involving Nrn1.”

      We fully understand this comment from Reviewer #2. The main mechanism we identified contributing to the functional defect of Nrn1-/- T cells involves novel effects on the electric and metabolic state of the cells. Although we referenced neuronal studies that indicate Nrn1 is the auxiliary protein for the ionotropic AMPA-type glutamate receptor (AMPAR) and may affect AMPAR function, we did not provide any evidence in this manuscript as the topic requires further in-depth study.   

      For the benefit of this discussion, we include our preliminary Nrn1 and AMPAR double knockout data (Author response image 2), which indicates that abrogating AMPAR expression can compensate for the defect caused by Nrn1 deficiency in vitro and in vivo. This preliminary data supports the notion that Nrn1 modulates AMPAR function, which causes changes in T cell electric and metabolic state, influencing T cell differentiation and function.  

      Author response image 2.

      Deletion of AMPAR expression in T cells compensates for the defect caused by Nrn1 deficiency. Nrn1-/- mice were crossed with T cell-specific AMPAR knockout (AMPARfl/flCD4Cre+) mice. The following mice were generated and used in the experiment: T cell specific AMPAR-knockout and Nrn1 knockout mice (AKONKO), Nrn1 knockout mice (AWTNKO), Ctrl mice (AWTNWT). a. Deletion of AMPAR compensates for the iTreg cell defect observed in Nrn1-/- CD4 cells. iTreg live cell proportion, cell number, and Ki67 expression among Foxp3+ cells 3 days after aCD3 restimulation. b. Deletion of AMPAR in T cells abrogates the enhanced autoimmune response in Nrn1-/- mice in the EAE disease model. Mouse relative weight change and disease score progression after EAE disease induction.

      Ion channels can influence cell metabolism through multiple means (Vaeth and Feske, 2018; Wang et al., 2020). First, ion channels are involved in maintaining cell resting membrane potential. This electrical potential difference across the cell membrane is essential for various cellular processes, including metabolism (Abdul Kadir et al., 2018; Blackiston et al., 2009; Nagy et al., 2018; Yu et al., 2022). Second, ion channels facilitate the movement of ions across cell membranes. These ions are essential for various metabolic processes. For example, ions like calcium (Ca2+), potassium (K+), and sodium (Na+) play crucial roles in signaling pathways that regulate metabolism (Kahlfuss et al., 2020). Third, ion channel activity can influence cellular energy balance due to ATP consumption associated with ion transport to maintain ion balances (Erecińska and Dagani, 1990; Gerkau et al., 2019). This, in turn, can impact processes like ATP production, which is central to cellular metabolism. Thus, ion channel expression and function determine the cell’s bioelectric state and contribute to cell metabolism (Levin, 2021).

      Because the AMPAR function has not been thoroughly studied using a genetic approach in T cells, we do not intend to include the double knockout data in this manuscript before fully characterizing the T cell-specific AMPAR knockout mice.  

      “Although the biochemical and informatics studies are well-performed, it is my opinion that these results are inconclusive in part due to the absence of key "naive" control groups. This limits my ability to understand the significance of these data.

      Specifically, studies of the electrical and metabolic state of Nrn1-/- inducible Treg cells (iTregs) would benefit from similar data collected from wild-type and Nrn1-/- naive CD4 T cells.”

      We appreciate the reviewer’s comments. This comment reflects two concerns in data interpretation:

      (1) Are Nrn1-/- naïve T cells fundamentally different from WT cells? Does this fundamental difference contribute to the observed electrical and metabolic phenotype in iTreg or Th0 cells? This is a very good question, and we will perform the experiments as the reviewer suggested. While Nrn1 is expressed at a basal (low) level in naïve T cells, deletion of Nrn1 may cause changes in naïve T cell phenotype.

      (2) Is the Nrn1-/- phenotype caused by Nrn1 functional deficiency or due to the secondary effect of Nrn1 deletion, such as non-physiological cell membrane structure changes?

      We have done the following experiment to address this concern.  We have cultured WT T cells in the presence of Nrn1 antibody and compared the outcome with Nrn1-/- iTreg cells (Figure 3-figure supplement 2D,E,F). WT iTreg cells under antibody blockade exhibited similar changes as Nrn1-/- iTreg cells, confirming the physiological relevance of the Nrn1-/- phenotype.

      Manuscript Revision based on the Reviewer’s suggestions:

      Reviewer #1:

      Major points (3) Figure 1A-D. Since the authors are using the Nrp1 KO mice, it would be important to confirm the specificity of the anti-Nrn1 mAb by FACS. 

      Following the suggestion by Reviewer #1, we have included the Nrn1 Ab staining on activated Nrn1-/- CD4 cells in Figure 1D. We have also added the staining of cell surface Nrn1 on Treg cells in Figure 1-figure supplement 1D.

      Major point: (5) “Moreover, we do not even know what the % of Foxp3 cells is in the iTreg used (after differentiation and 20h of re-stimulation) and whether this % is the same between Ctlr and Nrn1 KO cells.”

      In the revised manuscript, we have included the proportion of Foxp3+ cells among Nrn1-/- and ctrl iTreg cells developed under the iTreg culture condition in Figure 2A.

      Minor points  

      (2) For all figures showing %, the titles of the Y axes are written in an odd way. For example, it is written "Foxp3% CD4". It would be more conventional and clearer to write "% Foxp3+ / CD4+" or "% Foxp3+ among CD4+".

      Following reviewer#1’s suggestion, we have changed the Y-axis label in all the relevant figures.

      (3) Would not it be possible to perform experiments showing the ability of cells to transport amino acids or electrolytes across the plasma membrane? This would be a more interesting demonstration than transcriptomic data.

      We appreciate Reviewer #1’s suggestion to “perform experiments showing the ability of cells to transport amino acids or electrolytes across the plasma membrane”. We have used AA-induced cellular MP changes to confirm differential AA transporter expression patterns and their impact on cellular MP levels. The data are included in the revised manuscript in Figure 3H and Figure 4K.

      (4) For certain staining (Figure 3E, H) it would be important to show the raw data, in addition to MFI or % values.

      We appreciate Reviewer #1’s suggestion and have included the histogram staining data for Figure 3E. We have moved the original Figure 3H to the supplemental figure and included the histogram staining data in Figure 3-figure supplement 1C. Similarly, we have included the histogram staining data in Figure 4-figure supplement 1C.

      Reviewer#2:

      “Although the biochemical and informatics studies are well-performed, it is my opinion that these results are inconclusive in part due to the absence of key "naive" control groups. This limits my ability to understand the significance of these data.

      Specifically, studies of the electrical and metabolic state of Nrn1-/- inducible Treg cells (iTregs) would benefit from similar data collected from wild-type and Nrn1-/- naive CD4 T cells.”

      We greatly appreciate Reviewer#2’s suggestion and have carried out experiments on naïve CD4 cells derived from Nrn1-/- and WT mice. We have compared membrane potential, AA-induced MP change between Nrn1-/- and WT naïve T cells, and the metabolic state of Nrn1-/- and WT naïve T cells by carrying out glucose stress tests and mitochondria stress tests using a seahorse assay.  Moreover, to investigate whether the phenotype revealed in Nrn1-/- CD4 cells was caused by a secondary effect of cell membrane structure change due to Nrn1 deletion, we carried out Nrn1 antibody blockade in WT CD4 cells and investigated the phenotypic change. These new results are included in Figure 3-figure supplement 2.

      Reference:

      Abdul Kadir, L., M. Stacey, and R. Barrett-Jolley. 2018. Emerging Roles of the Membrane Potential: Action Beyond the Action Potential. Front Physiol 9:1661.

      Blackiston, D.J., K.A. McLaughlin, and M. Levin. 2009. Bioelectric controls of cell proliferation: ion channels, membrane voltage and the cell cycle. Cell Cycle 8:3527-3536.

      Chappert, P., and R.H. Schwartz. 2010. Induction of T cell anergy: integration of environmental cues and infectious tolerance. Current opinion in immunology 22:552-559.

      Chen, W., W. Jin, N. Hardegen, K.J. Lei, L. Li, N. Marinos, G. McGrady, and S.M. Wahl. 2003. Conversion of peripheral CD4+CD25- naive T cells to CD4+CD25+ regulatory T cells by TGF-beta induction of transcription factor Foxp3. The Journal of experimental medicine 198:1875-1886.

      Erecińska, M., and F. Dagani. 1990. Relationships between the neuronal sodium/potassium pump and energy metabolism. Effects of K+, Na+, and adenosine triphosphate in isolated brain synaptosomes. J Gen Physiol 95:591-616.

      Fathman, C.G., and N.B. Lineberry. 2007. Molecular mechanisms of CD4+ T-cell anergy. Nat Rev Immunol 7:599-609.

      Gerkau, N.J., R. Lerchundi, J.S.E. Nelson, M. Lantermann, J. Meyer, J. Hirrlinger, and C.R. Rose. 2019. Relation between activity-induced intracellular sodium transients and ATP dynamics in mouse hippocampal neurons. The Journal of physiology 597:5687-5705.

      Hurrell, B.P., D.G. Helou, E. Howard, J.D. Painter, P. Shafiei-Jahani, A.H. Sharpe, and O. Akbari. 2022. PD-L2 controls peripherally induced regulatory T cells by maintaining metabolic activity and Foxp3 stability. Nature communications 13:5118.

      Jenkins, M.K., and R.H. Schwartz. 1987. Antigen presentation by chemically modified splenocytes induces antigen-specific T cell unresponsiveness in vitro and in vivo. The Journal of experimental medicine 165:302-319.

      John, P., M.C. Pulanco, P.M. Galbo, Jr., Y. Wei, K.C. Ohaegbulam, D. Zheng, and X. Zang. 2022. The immune checkpoint B7x expands tumor-infiltrating Tregs and promotes resistance to anti-CTLA-4 therapy. Nature communications 13:2506.

      Kahlfuss, S., U. Kaufmann, A.R. Concepcion, L. Noyer, D. Raphael, M. Vaeth, J. Yang, P. Pancholi, M. Maus, J. Muller, L. Kozhaya, A. Khodadadi-Jamayran, Z. Sun, P. Shaw, D. Unutmaz, P.B. Stathopulos, C. Feist, S.B. Cameron, S.E. Turvey, and S. Feske. 2020. STIM1-mediated calcium influx controls antifungal immunity and the metabolic function of nonpathogenic Th17 cells. EMBO molecular medicine 12:e11592.

      Levin, M. 2021. Bioelectric signaling: Reprogrammable circuits underlying embryogenesis, regeneration, and cancer. Cell 184:1971-1989.

      Nagy, E., G. Mocsar, V. Sebestyen, J. Volko, F. Papp, K. Toth, S. Damjanovich, G. Panyi, T.A. Waldmann, A. Bodnar, and G. Vamosi. 2018. Membrane Potential Distinctly Modulates Mobility and Signaling of IL-2 and IL-15 Receptors in T Cells. Biophys J 114:2473-2482.

      Quill, H., and R.H. Schwartz. 1987. Stimulation of normal inducer T cell clones with antigen presented by purified Ia molecules in planar lipid membranes: specific induction of a long-lived state of proliferative nonresponsiveness. Journal of immunology (Baltimore, Md. : 1950) 138:3704-3712.

      Schmitt, E.G., and C.B. Williams. 2013. Generation and function of induced regulatory T cells. Frontiers in immunology 4:152.

      Sugiura, A., G. Andrejeva, K. Voss, D.R. Heintzman, X. Xu, M.Z. Madden, X. Ye, K.L. Beier, N.U. Chowdhury, M.M. Wolf, A.C. Young, D.L. Greenwood, A.E. Sewell, S.K. Shahi, S.N. Freedman, A.M. Cameron, P. Foerch, T. Bourne, J.C. Garcia-Canaveras, J. Karijolich, D.C. Newcomb, A.K. Mangalam, J.D. Rabinowitz, and J.C. Rathmell. 2022. MTHFD2 is a metabolic checkpoint controlling effector and regulatory T cell fate and function. Immunity 55:65-81.e69.

      Vaeth, M., and S. Feske. 2018. Ion channelopathies of the immune system. Current opinion in immunology 52:39-50.

      Vanasek, T.L., S.L. Nandiwada, M.K. Jenkins, and D.L. Mueller. 2006. CD25+Foxp3+ regulatory T cells facilitate CD4+ T cell clonal anergy induction during the recovery from lymphopenia. Journal of immunology (Baltimore, Md. : 1950) 176:5880-5889.

      Wang, Y., A. Tao, M. Vaeth, and S. Feske. 2020. Calcium regulation of T cell metabolism. Current opinion in physiology 17:207-223.

      Yu, W., Z. Wang, X. Yu, Y. Zhao, Z. Xie, K. Zhang, Z. Chi, S. Chen, T. Xu, D. Jiang, X. Guo, M. Li, J. Zhang, H. Fang, D. Yang, Y. Guo, X. Yang, X. Zhang, Y. Wu, W. Yang, and D. Wang. 2022. Kir2.1-mediated membrane potential promotes nutrient acquisition and inflammation through regulation of nutrient transporters. Nature communications 13:3544.

      Zheng, S.G., J.D. Gray, K. Ohtsuka, S. Yamagiwa, and D.A. Horwitz. 2002. Generation ex vivo of TGF-beta-producing regulatory T cells from CD4+CD25- precursors. Journal of immunology (Baltimore, Md. : 1950) 169:4183-4189.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study addresses how faces and bodies are integrated in two STS face areas revealed by fMRI in the primate brain. It builds upon recordings and analysis of the responses of large populations of neurons to three sets of images, that vary face and body positions. These sets allowed the authors to thoroughly investigate invariance to position on the screen (MC HC), to pose (P1 P2), to rotation (0 45 90 135 180 225 270 315), to inversion, to possible and impossible postures (all vs straight), to the presentation of head and body together or in isolation. By analyzing neuronal responses, they found that different neurons showed preferences for body orientation, head orientation, or the interaction between the two. By using a linear support vector machine classifier, they show that the neuronal population can decode head-body angle presented across orientations, in the anterior aSTS patch (but not middle mSTS patch), except for mirror orientation.

      Strengths:

      These results extend prior work on the role of Anterior STS fundus face area in face-body integration and its invariance to mirror symmetry, with a rigorous set of stimuli revealing the workings of these neuronal populations in processing individuals as a whole, in an important series of carefully designed conditions.

      Minor issues and questions that could be addressed by the authors:

      (1) Methods. While monkeys certainly infer/recognize that individual pictures refer to the same pose with varying orientations based on prior studies (Wang et al.), I am wondering whether in this study monkeys saw a full rotation of each of the monkey poses as a video before seeing the individual pictures of the different orientations, during recordings.

      The monkeys had not been exposed to videos of a rotating monkey pose before the recordings. However, they were reared and housed with other monkeys, providing them with ample experience of monkey poses from different viewpoints.

      (2) Experiment 1. The authors mention that neurons are preselected as face-selective, body-selective, or both-selective. Do the Monkey Sum Index and ANOVA main effects change per Neuron type?

      We have performed a new analysis to assess whether the Monkey Sum Index is related to the response strength for the face versus the body as measured in the Selectivity Test of Experiment 1. To do this we selected face- and body-category selective neurons, as well as neurons responding selectively to both faces and bodies. First, we selected those neurons that responded significantly to either faces, bodies, or the two control object categories, using a split-plot ANOVA for these 40 stimuli. From those neurons, we selected face-selective ones having at least a twofold larger mean net response to faces compared to bodies (faces > 2 * bodies) and the control objects for faces (faces  > 2* objects). Similarly, a body-selective neuron was defined by a twofold larger mean net response to bodies compared to faces and the control objects for bodies. A body-and-face selective neuron was defined as having a twofold larger net response to the faces compared to their control objects, and to bodies compared to their control objects, with the ratio between mean response to bodies and faces being less than twofold. Then, we compared the distribution of the Monkey Sum Index (MSI) for each region (aSTS; mSTS), pose (P1, P2), and centering (head- (HC) or monkey-centered (MC)) condition. Too few body-and-face selective neurons were present in each combination of region, pose, and centering (a maximum of 7) to allow a comparison of their MSI distribution with the other neuron types. The Figure below shows the distribution of the MSI for the different orientation-neuron combinations for the body- and face-selective neurons (same format as in Figure 3a, main text). The number of body-selective neurons, according to the employed criteria, varied from 21 to 29, whereas the number of face-selective neurons ranged from 14 to 24 (pooled across monkeys). 
The data of the two subjects are shown in a different color and the number of cases for each subject is indicated (n1: number of cases for M1; n2: number of cases for M2). The arrows indicate the medians for the data pooled across the monkey subjects. For the MC condition, the MSI tended to be more negative (i.e. relatively less response to the monkey compared to the sum of the body and face responses) for the face compared to the body cells, but this was significant only for mSTS and P1 (p = 0.043; Wilcoxon rank sum test; tested after averaging the indices per neuron to avoid dependence of indices within a neuron). No consistent, nor significant tendencies were observed for the HC stimuli. This absence of a consistent relationship between MSI and face- versus body-selectivity is in line with the absence of a correlation between the MSI and face- versus body-selectivity using natural images of monkeys in a previous study (Zafirova Y, Bognár A, Vogels R. Configuration-sensitive face-body interactions in primate visual cortex. Prog Neurobiol. 2024 Jan;232:102545).
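      The selection criteria described above can be summarized in a short sketch. Only the twofold thresholds come from the response; the function name, argument names, labels, and example values are hypothetical, and the sketch assumes the neuron has already passed the split-plot ANOVA for responsiveness.

```python
def classify_neuron(faces, bodies, face_objects, body_objects):
    """Label a neuron from its mean net responses to four stimulus categories.

    Hypothetical helper: only the twofold criteria are taken from the text.
    face_objects / body_objects are the control objects for faces / bodies.
    """
    face_vs_control = faces > 2 * face_objects
    body_vs_control = bodies > 2 * body_objects

    # Face-selective: faces at least twofold larger than bodies and face controls.
    if face_vs_control and faces > 2 * bodies:
        return "face-selective"
    # Body-selective: bodies at least twofold larger than faces and body controls.
    if body_vs_control and bodies > 2 * faces:
        return "body-selective"
    # Both categories beat their controls, and the face/body ratio is below twofold.
    if face_vs_control and body_vs_control and faces < 2 * bodies and bodies < 2 * faces:
        return "body-and-face-selective"
    return "unclassified"


# Made-up mean net responses (spikes/s), for illustration only.
example = classify_neuron(faces=20, bodies=5, face_objects=4, body_objects=3)
```

      Checking the single-category criteria first keeps the three labels mutually exclusive, which matches the way the neuron counts are reported per type.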

      We did not perform a similar analysis for the main effects of the two-way ANOVA because the very large majority of neurons showed a significant effect of body orientation and thus no meaningful difference between the two neuron types can be expected.

      Author response image 1.

      (3) I might have missed this information, but the correlation between P1 and P2 seems to not be tested although they carry similar behavioral relevance in terms of where attention is allocated and where the body is facing for each given head-body orientation.

      Indeed, we did not compute this correlation between the responses to the sitting (P1) and standing (P2) pose avatar images. However, as pointed out by the reviewer, one might expect such correlations because of the same head orientations and body-facing directions. Thus, we computed the correlation between the 64 head-body orientation conditions of P1 and P2 for those neurons that were tested with both poses and showed a response for both poses (Split-plot ANOVA). This was performed for the Head-Centered and Monkey-Centered tests of Experiment 1 for each monkey and region. Note that not all neurons were tested with both poses (because of failure to maintain isolation of the single unit in both tests or the monkey stopped working) and not all neurons that were recorded in both tests showed a significant response for both poses, which is not unexpected since these neurons can be pose selective. The distribution of the Pearson correlation coefficients of the neurons with a significant response in both tests is shown in Figure S1. The median correlation coefficient was significantly larger than zero for each region, monkey, and centering condition (outcome of Wilcoxon tests, testing whether the median was different from zero (p1 = p-value for M1; p2: p-value for M2) in Figure), indicating that the effect of head and/or body orientation generalizes across pose. We have noted this now in the Results (page 12) and added the Figure (New Figure S1) in the Suppl. Material.
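      As a rough illustration of this analysis, the sketch below computes, for each simulated neuron, the Pearson correlation between its responses to the 64 head-body orientation conditions of the two poses, and then takes the median across neurons. The data are synthetic and all names and parameter values are assumptions; the Wilcoxon test on the median that is reported in the response is not implemented here.

```python
import math
import random
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)          # fixed seed so the sketch is reproducible
n_neurons = 30                  # hypothetical sample size
n_conditions = 64               # 8 head x 8 body orientations

coefficients = []
for _ in range(n_neurons):
    # Simulated tuning shared by both poses, plus independent noise per pose.
    tuning = [rng.gauss(10, 3) for _ in range(n_conditions)]
    responses_p1 = [t + rng.gauss(0, 2) for t in tuning]
    responses_p2 = [t + rng.gauss(0, 2) for t in tuning]
    coefficients.append(pearson(responses_p1, responses_p2))

median_r = median(coefficients)  # summary statistic across neurons
```

      A median clearly above zero, as in this simulation with shared tuning, is what generalization of orientation tuning across pose would look like at the population level.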

      (4) Is the invariance for position HC-MC larger in aSTS neurons compared to mSTS neurons, as could be expected from their larger receptive fields?

      Yes, the position tolerance of the interaction of body and head orientation was significantly larger for aSTS compared to mSTS neurons, as we described on pages 11 and 12 of the Results. This is in line with larger receptive fields in aSTS than in mSTS. However, we did not plot receptive fields in the present study.

      (5) L492 "The body-inversion effect likely results from greater exposure to upright than inverted bodies during development". Monkeys display more hanging upside-down behavior than humans, however, does the head appear more tilted in these natural configurations?

      Indeed, infant monkeys do spend some time hanging upside down from their mother's belly. While we lack quantitative data on this behavior, casual observations suggest that even young monkeys spend more time upright. The tilt of the head while hanging upside down can vary, just as it does in standing or sitting monkeys (as when they search for food or orient to other individuals). To our knowledge, no quantitative data exist on the frequency of head tilts in upright versus upside-down monkeys. Therefore, we refrain from further speculation on this interesting point, which warrants more attention.

      (6) Methods in Experiment 1. SVM. How many neurons are sufficient to decode the orientation?

      The number of neurons that are needed to decode the head-body orientation angle depends on which neurons are included, as we show in a novel analysis of the data of Experiment 1. We employed a neuron-dropping analysis, similar to Chiang et al. (Chiang FK, Wallis JD, Rich EL. Cognitive strategies shift information from single neurons to populations in prefrontal cortex. Neuron. 2022 Feb 16;110(4):709-721) to assess the positive (or negative) contribution of each neuron to the decoding performance. We performed cross-validated linear SVM decoding N times, each time leaving out a different neuron (using N-1 neurons; 2000 resamplings of pseudo-population vectors). We then ranked decoding accuracies from highest to lowest, identifying the ‘worst’ (rank 1) to ‘best’ (rank N) neurons. Next, we conducted N decodings, incrementally increasing the number of included neurons from 1 to N, starting with the worst-ranked neuron (rank 1) and sequentially adding the next (rank 2, rank 3, etc.). This analysis focused on zero versus straight angle decoding in the aSTS, as it yielded the highest accuracy. We applied it when training on MC and testing on HC for each pose. Plotting accuracy as a function of the number of included neurons suggested that less than half contributed positively to decoding. We show also the ten “best” neurons for each centering condition and pose. These have a variety of tuning patterns for head and body orientation suggesting that the decoding of head-body orientation angle depends on a population code. Notably, the best-ranked (rank N) neuron alone achieved above-chance accuracy. We have added this interesting and novel result to the Results (page 16) and Suppl. Material (new Figure S3).
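      The ranking logic of the neuron-dropping analysis can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: a nearest-centroid classifier is swapped in for the cross-validated linear SVM, the resampling of pseudo-population vectors is omitted, and the data, tuning values, and function names are made up. Separate simulated train and test sets stand in for training on MC and testing on HC.

```python
import random


def make_trials(rng, n_trials, means, noise_sd=1.0):
    """Simulate trials: each trial is a list with one response per neuron."""
    return [[rng.gauss(m, noise_sd) for m in means] for _ in range(n_trials)]


def accuracy(train, test, neurons):
    """Nearest-centroid decoding accuracy using only the listed neurons."""
    neurons = sorted(neurons)
    centroids = {
        label: [sum(t[i] for t in trials) / len(trials) for i in neurons]
        for label, trials in train.items()
    }
    correct = total = 0
    for label, trials in test.items():
        for trial in trials:
            x = [trial[i] for i in neurons]
            dist = {lab: sum((a - b) ** 2 for a, b in zip(x, c))
                    for lab, c in centroids.items()}
            correct += min(dist, key=dist.get) == label
            total += 1
    return correct / total


rng = random.Random(1)
n_neurons = 8
# Hypothetical tuning: neurons 0-2 separate the two angles, neurons 3-7 do not.
means = {"zero": [12, 8, 11, 10, 10, 10, 10, 10],
         "straight": [8, 12, 9, 10, 10, 10, 10, 10]}
train = {label: make_trials(rng, 40, m) for label, m in means.items()}
test = {label: make_trials(rng, 40, m) for label, m in means.items()}

all_neurons = list(range(n_neurons))
# Step 1: decode N times, each time leaving out a different neuron.
acc_without = {i: accuracy(train, test, [j for j in all_neurons if j != i])
               for i in all_neurons}
# Step 2: highest accuracy without a neuron means it helps least (rank 1 = worst).
ranking = sorted(all_neurons, key=lambda i: acc_without[i], reverse=True)
# Step 3: add neurons from worst to best and track decoding accuracy.
curve = [accuracy(train, test, ranking[:k]) for k in range(1, n_neurons + 1)]
```

      Plotting `curve` against the number of included neurons gives the kind of accuracy-versus-population-size plot described in the response; the last entries of `ranking` correspond to the "best" neurons.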

      (7) Figure 3D 3E. Could the authors please indicate for each of these neurons whether they show a main effect of face, body, or interaction, as well as their median corrected correlation to get a flavor of these numbers for these examples?

      We have indicated these now in Figure 3.

      (8) Methods and Figure 1A. It could be informative to precise whether the recordings are carried in the lateral part of the STS or in the fundus of the STS both for aSTS and mSTS for comparison to other studies that are using these distinctions (AF, AL, MF, ML).

      In experiment 1, the recording locations were not as medial as the fundus. For experiments 2 and 3, the ventral part of the fundus was included, as described in the Methods. We have added this to the Methods now (page 31).

      Wang, G., Obama, S., Yamashita, W. et al. Prior experience of rotation is not required for recognizing objects seen from different angles. Nat Neurosci 8, 1768-1775 (2005). https://doi-org.insb.bib.cnrs.fr/10.1038/nn1600

      Reviewer #2 (Public review):

      Summary:

      This paper investigates the neuronal encoding of the relationship between head and body orientations in the brain. Specifically, the authors focus on the angular relationship between the head and body by employing virtual avatars. Neuronal responses were recorded electrophysiologically from two fMRI-defined areas in the superior temporal sulcus and analyzed using decoding methods. They found that: (1) anterior STS neurons encode head-body angle configurations; (2) these neurons distinguish aligned and opposite head-body configurations effectively, whereas mirror-symmetric configurations are more difficult to differentiate; and (3) an upside-down inversion diminishes the encoding of head-body angles. These findings advance our understanding of how visual perception of individuals is mediated, providing a fundamental clue as to how the primate brain processes the relationship between head and body - a process that is crucial for social communication.

      Strengths:

      The paper is clearly written, and the experimental design is thoughtfully constructed and detailed. The use of electrophysiological recordings from fMRI-defined areas elucidated the mechanism of head-body angle encoding at the level of local neuronal populations. Multiple experiments, control conditions, and detailed analyses thoroughly examined various factors that could affect the decoding results. The decoding methods effectively and consistently revealed the encoding of head-body angles in the anterior STS neurons. Consequently, this study offers valuable insights into the neuronal mechanisms underlying our capacity to integrate head and body cues for social cognition-a topic that is likely to captivate readers in this field.

      Weaknesses:

      I did not identify any major weaknesses in this paper; I only have a few minor comments and suggestions to enhance clarity and further strengthen the manuscript, as detailed in the Private Recommendations section.

      Reviewer #3 (Public review):

      Summary:

      Zafirova et al. investigated the interaction of head and body orientation in the macaque superior temporal sulcus (STS). Combining fMRI and electrophysiology, they recorded responses of visual neurons to a monkey avatar with varying head and body orientations. They found that STS neurons integrate head and body information in a nonlinear way, showing selectivity for specific combinations of head-body orientations. Head-body configuration angles can be reliably decoded, particularly for neurons in the anterior STS. Furthermore, body inversion resulted in reduced decoding of head-body configuration angles. Compared to previous work that examined face or body alone, this study demonstrates how head and body information are integrated to compute a socially meaningful signal.

      Strengths:

      This work presents an elegant design of visual stimuli, with a monkey avatar of varying head and body orientations, making the analysis and interpretation straightforward. Together with several control experiments, the authors systematically investigated different aspects of head-body integration in the macaque STS. The results and analyses of the paper are mostly convincing.

      Weaknesses:

      (1) Using ANOVA, the authors demonstrate the existence of nonlinear interactions between head and body orientations. While this is a conventional way of identifying nonlinear interactions, it does not specify the exact type of the interaction. Although the computation of the head-body configuration angle requires some nonlinearity, it's unclear whether these interactions actually contribute. Figure 3 shows some example neurons, but a more detailed analysis is needed to reveal the diversity of the interactions. One suggestion would be to examine the relationship between the presence of an interaction and the neural encoding of the configuration angle.

      This is an excellent suggestion. To do this, one needs to identify the neurons that contribute to the decoding of head-body orientation angles. For that, we employed a neuron-dropping analysis, similar to Chiang et al. (Chiang FK, Wallis JD, Rich EL. Cognitive strategies shift information from single neurons to populations in prefrontal cortex. Neuron. 2022 Feb 16;110(4):709-721.) to assess the positive (or negative) contribution of each neuron to the decoding performance. We performed cross-validated linear SVM decoding N times, each time leaving out a different neuron (using N-1 neurons; 2000 resamplings of pseudo-population vectors). We then ranked decoding accuracies from highest to lowest, identifying the ‘worst’ (rank 1) to ‘best’ (rank N) neurons. Next, we conducted N decodings, incrementally increasing the number of included neurons from 1 to N, starting with the worst-ranked neuron (rank 1) and sequentially adding the next (rank 2, rank 3, etc.). This analysis focused on zero versus straight angle decoding in the aSTS, as it yielded the highest accuracy. We applied it when training on MC and testing on HC for each pose. Plotting accuracy as a function of the number of included neurons suggested that less than half contributed positively to decoding (see Figure S3). We examined the tuning for head and body orientation of the 10 “best” neurons (Figure S3). For half or more of those the two-way ANOVA showed a significant interaction. These are indicated by the red color in the Figure. They showed a variety of tuning patterns for head and body orientation, suggesting that the decoding of the head-body orientation angle results from a combination of neurons with different tuning profiles. Based on a suggestion from reviewer 2, we performed for each neuron of experiment 1 a one-way ANOVA with as factor head-body orientation angle. To do that, we combined all 64 trials that had the same head-body orientation angle. 
The percentage of neurons (required to be responsive in the tested condition) for which this one-way ANOVA was significant was low but larger than the expected 5% (Type 1 error), with a median of 16.5% (range: 3 to 23%) in aSTS and 8% for mSTS (range: 0-19%). However, a higher percentage of the 10 best neurons for each pose (indicated by the star) showed a significant one-way ANOVA for angle (for P1, MC: 50% (95% confidence interval (CI): 19% – 81%); P1, HC: 70% (CI: 35% - 93%); P2, MC: 70% (CI: 35% – 93%); P2, HC: 50% (CI: 19%-81%)). These percentages were significantly higher than expected for a random sample from the population of neurons for each pose-centering combination (expected percentages listed in the same order as above: 16%, 13%, 16%, and 10%; all outside CI). Thus, for at least half of the “best” neurons, the response differed significantly among the head-orientation angles at the single neuron level. Nonetheless, the tuning profiles were diverse, suggesting a population code for head-body orientation angle. We have added this interesting and novel result to the Results (page 16) and Suppl. Material (Figure S3).

      (2) Figure 4 of the paper shows a better decoding of the configuration angle in the anterior STS than in the middle STS. This is an interesting result, suggesting a transformation in the neural representation between these two areas. However, some control analyses are needed to further elucidate the nature of this transformation. For example, what about the decoding of head and body orientations - does absolute orientation information decrease along the hierarchy, accompanying the increase in configuration information?

      We have now performed two additional analyses, one in which we decoded the orientation of the head and another one in which we decoded the orientation of the body. We employed the responses to the avatar of experiment 1, using the same sample of neurons of which we decoded the head-body orientation angle. To decode the head orientation, the trials with identical head orientation, irrespective of their body orientation, were given the same label. For this, we employed only responses in the head-centered condition. To decode the body orientation, the trials with identical body orientation, irrespective of their head orientation, had the same label, and we employed only responses in the body-centered condition. The decoding was performed separately for each pose (P1 and P2) and region. We decoded either the responses of 20 neurons (10 randomly sampled from each monkey for each of the 1000 resamplings), 40 neurons (20 randomly sampled per monkey), or 60 neurons (30 neurons per monkey) since the sample of 60 neurons yielded close to ceiling performance for the body orientation decoding. For each pose, the body orientation decoding was worse for aSTS than for mSTS, although this difference reached significance only for P1 and for the 40 neurons sample of P2 (p < 0.025; two-tailed test; same procedure as employed for testing the significance of the decoding of whole-body orientation for upright versus inverted avatars (Experiment 3)). Face orientation decoding was significantly worse for aSTS compared to mSTS. These results are in line with the previously reported decreased decoding of face orientation in the anterior compared to mid-STS face patches (Meyers EM, Borzello M, Freiwald WA, Tsao D. Intelligent information loss: the coding of facial identity, head pose, and non-face information in the macaque face patch system. J Neurosci. 2015 May 6;35(18):7069-81), and decreased decoding of body orientation in anterior compared to mid-STS body patches (Kumar S, Popivanov ID, Vogels R. Transformation of Visual Representations Across Ventral Stream Body-selective Patches. Cereb Cortex. 2019 Jan 1;29(1):215-229). As mentioned by the reviewer, this contrasts with the decoding of the head-body orientation angle, which increases when moving more anteriorly. We mention this finding now in the Discussion (page 27) and present the new Figure S10 in the Suppl. Material.

      (3) While this work has characterized the neural integration of head and body information in detail, it's unclear how the neural representation relates to the animal's perception. Behavioural experiments using the same set of stimuli could help address this question, but I agree that these additional experiments may be beyond the scope of the current paper. I think the authors should at least discuss the potential outcomes of such experiments, which can be tested in future studies.

      Unfortunately, we do not have behavioral data. One prediction would be that the discrimination of head-body orientation angle, irrespective of the viewpoint of the avatar, would be more accurate for zero versus straight angles compared to the right versus left angles. We have added this to the Discussion (page 28).

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) P22 L373. It should read Figure S5C instead of S4C.

      Thanks; corrected.

      (2) Figure 7B. All inverted decoding accuracies, although significantly lower than upright decoding accuracies, appear significantly above baseline. Should the title be amended accordingly?

      Thanks for pointing this out. To avoid future misunderstanding we have changed the title to:

      “Integration of head and body orientations in the macaque superior temporal sulcus is stronger for upright bodies”

      (3) Discussion L432-33. "with some neurons being tuned to a particular orientation of both the head and the body". Wouldn't that be visible as a diagonal profile on the normalized net responses in Fig 3D? Or can the ANOVA evidence such a tuning?

      We meant to say that some neurons were tuned to a particular combination of head and body orientation, like the third aSTS example neuron shown in Figure 3D. We have corrected the sentence.

      Reviewer #2 (Recommendations for the authors):

      Major comment:

      This paper effectively demonstrates that the angular relationship between the head and body can be decoded from population responses in the anterior STS. In other words, these neurons encode information about the head-body angle. However, how exactly do these neurons encode this information? Given that the study employed electrophysiological recordings from a local population of neurons, it might be possible to provide additional data on the response patterns of individual neurons to shed light on the underlying encoding mechanisms.

      Although the paper already presents example response patterns (Figures 3D, E) and shows that STS neurons encode interactions between head and body orientations (Figure 3B), it remains unclear whether the angle difference between the head and body has a systematic effect on neuronal responses. For instance, a description of whether some neurons preferentially encode specific head-body angle differences (e.g., a "45-degree angle neuron"), or additional population analyses such as a one-way ANOVA with angle difference as the main effect (or two-way ANOVA with angle difference as one of the main effects), would be very informative. Such data could offer valuable insights into how individual neurons contribute to the encoding of head-body angle differences, a detail that may also be reflected in the decoding results. Alternatively, it is possible that the encoding of head-body angle is inherently complex and only discernible via decoding methods applied to population activity. Either scenario would provide interesting and useful information to the field.

      We have performed two additional analyses which are relevant to this comment. First, we attempted to relate the tuning for body and head orientation with the decoding of the head-body orientation angle. To do this, one needs to identify the neurons that contribute to the decoding of head-body orientation angles. For that, we employed a neuron-dropping analysis, similar to Chiang et al. (Chiang FK, Wallis JD, Rich EL. Cognitive strategies shift information from single neurons to populations in prefrontal cortex. Neuron. 2022 Feb 16;110(4):709-721.) to assess the positive (or negative) contribution of each neuron to the decoding performance. We performed cross-validated linear SVM decoding N times, each time leaving out a different neuron (using N-1 neurons; 2000 resamplings of pseudo-population vectors). We then ranked decoding accuracies from highest to lowest, identifying the ‘worst’ (rank 1) to ‘best’ (rank N) neurons. Next, we conducted N decodings, incrementally increasing the number of included neurons from 1 to N, starting with the worst-ranked neuron (rank 1) and sequentially adding the next (rank 2, rank 3, etc.). This analysis focused on zero versus straight angle decoding in the aSTS, as it yielded the highest accuracy. We applied it when training on MC and testing on HC for each pose. Plotting accuracy as a function of the number of included neurons suggested that less than half contributed positively to decoding (see Figure S3). We examined the tuning for head and body orientation of the 10 “best” neurons (Figure S3). For half or more of those the two-way ANOVA showed a significant interaction. These are indicated by the red color in the Figure. They showed a variety of tuning patterns for head and body orientation, suggesting that the decoding of the head-body orientation angle results from a combination of neurons with different tuning profiles.
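
      The ranking and incremental-inclusion steps of the neuron-dropping analysis described above can be sketched as follows. This is a minimal illustration, not the authors' code: `neuron_dropping` and `toy_decode` are hypothetical names, and `toy_decode` is a stand-in for the cross-validated linear SVM decoder, with invented per-neuron contributions.

```python
def neuron_dropping(n_neurons, decode):
    """Rank neurons by leave-one-out decoding, then add them back one by one.

    `decode` is any callable mapping a list of neuron indices to a decoding
    accuracy (in the response: a cross-validated linear SVM; here assumed).
    """
    # Step 1: decode N times, each time leaving out a different neuron.
    leave_one_out = [
        decode([j for j in range(n_neurons) if j != i]) for i in range(n_neurons)
    ]
    # Step 2: rank accuracies from highest to lowest. Dropping the "worst"
    # neuron gives the highest accuracy, so it receives rank 1.
    order = sorted(range(n_neurons), key=lambda i: leave_one_out[i], reverse=True)
    # Step 3: decode with 1..N neurons, starting from the worst-ranked one.
    curve = [decode(order[:k]) for k in range(1, n_neurons + 1)]
    return order, curve


# Hypothetical per-neuron contributions, used only to make the sketch runnable.
CONTRIBUTIONS = [0.10, -0.05, 0.00, 0.20]


def toy_decode(indices):
    """Stand-in decoder: chance level (0.5) plus summed contributions."""
    return min(1.0, max(0.0, 0.5 + sum(CONTRIBUTIONS[i] for i in indices)))


order, curve = neuron_dropping(len(CONTRIBUTIONS), toy_decode)
print(order)  # neuron indices from "worst" (rank 1) to "best" (rank N)
print(curve)  # accuracy as a function of the number of included neurons
```

      In this toy case the accuracy curve only rises once the neurons with a positive contribution are added, which is the pattern that plotting accuracy against the number of included neurons is meant to reveal.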

      Second, we have followed the suggestion of the reviewer to perform for each neuron of experiment 1 a one-way ANOVA with as factor head-body orientation angle. To do that, we combined all 64 trials that had the same head-body orientation angle. The percentage of neurons (required to be responsive in the tested condition) for which this one-way ANOVA was significant is shown in the Tables below for each region, separately for each pose (P1, P2), centering condition (MC = monkey-centered; HC = head-centered) and monkey subject (M1, M2). The percentages were low but larger than the expected 5% (Type 1 error), with a median of 16.5% (range: 3 to 23%) in aSTS and 8% for mSTS (range: 0-19%).

      Author response table 1.

      Interestingly, a higher percentage of the 10 best neurons for each pose (indicated by the star in the Figure above) showed a significant one-way ANOVA for angle (for P1, MC: 50% (95% confidence interval (CI): 19% – 81%); P1, HC: 70% (CI: 35% - 93%); P2, MC: 70% (CI: 35% – 93%); P2, HC: 50% (CI: 19%-81%)). These percentages were significantly higher than expected for a random sample from the population of neurons for each pose-centering combination (expected percentages listed in the same order as above: 16%, 13%, 16%, and 10%; all outside CI). Thus, for at least half of the “best” neurons, the response differed significantly among the head-orientation angles at the single neuron level. Nonetheless, the tuning profiles were quite diverse, suggesting population coding of head-body orientation angle. We have added this interesting and novel result to the Results (page 16) and Suppl. Material (Figure S3).
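
      The response does not state how the confidence intervals for the 10 best neurons were computed. As an assumption, the sketch below uses the exact (Clopper-Pearson) binomial interval, which appears consistent with the reported values for 5 and 7 significant neurons out of 10. The function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target, iters=60):
    """Bisection on [0, 1] for a function f that increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided CI for a binomial proportion (assumed method)."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2
    )
    return lower, upper


for k in (5, 7):
    lo, hi = clopper_pearson(k, 10)
    print(f"{k}/10 significant: CI {lo:.0%} to {hi:.0%}")
```

      An expected percentage (for example 16%) lying below the lower bound is what "outside CI" refers to in the paragraph above.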

      Minor comments:

      (1) Figure 4A, Fourth Row Example (Zero Angle vs. Straight Angle, Bottom of the P2 Examples): The order of the example stimuli might be incorrect- the 0° head with 180° body stimulus (leftmost) might be swapped with the 180° head with 0° body stimulus (5th from the left). While this ordering may be acceptable, please double-check whether it reflects the authors' intended arrangement.

      We have changed the order of the two stimuli in Figure 4A, following the suggestion of the reviewer.

      (2) Page 12, Lines 192-194: The text states, "Interestingly, some neurons (e.g. Figure 3D) were tuned to a particular combination of a head and body irrespective of centering." However, Figure 3D displays data for a total of 10 neurons. Could you please specify which of these neurons are being referred to in this context?

      The wording was not optimal. We meant to say that some neurons were tuned to a particular combination of head and body orientation, like the third aSTS example neuron of Figure 3D. We have rephrased the sentence and clarified which example neuron we referred to.

      (3) Page 28, Lines 470-471: The text states, "We observed no difference in response strength between anatomically possible and impossible configurations." Please clarify which data were compared for response strength, as I could not locate the corresponding analyses.

      The anatomically possible and impossible configurations differ in the head-body orientation angle. However, as we reported before in the Results, there was no effect of head-body orientation angle on mean response strength across poses (Friedman ANOVA; all p-values for both poses and centerings > 0.1). We have clarified this now in the Discussion (page 28).

      (4) Pages 40-43, Decoding Analyses: In experiments 2 and 3, were the decoding analyses performed on simultaneously recorded neurons? If so, such analyses might leverage trial-by-trial correlations and thus avoid confounds from trial-to-trial variability. In contrast, experiment 1, which used single-shank electrodes, would lack this temporal information. Please clarify how trial numbers were assigned to neurons in each experiment and how this assignment may have influenced the decoding performance.

      For the decoding analyses of experiments 2 and 3, we combined data from different daily penetrations, with only units from the same penetration being recorded simultaneously. In the decoding analyses of each experiment, the trials were assigned randomly to the pseudo-population vectors, shuffling on each resampling the trial order per neuron. This shuffling abolishes noise correlations in the analysis of each experiment.
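
      The trial shuffling described above can be sketched as follows. This is an illustration under assumed data structures: `responses[neuron][condition]` is a hypothetical layout holding the single-trial responses of one unit, and `build_pseudo_population` is a hypothetical name.

```python
import random


def build_pseudo_population(responses, rng):
    """Assemble pseudo-population vectors with per-neuron trial shuffling.

    responses[neuron][condition] is a list of single-trial responses.
    Returns vectors[condition][trial], each a list with one value per neuron.
    """
    conditions = list(responses[0].keys())
    # Use the smallest trial count so every vector has all neurons.
    n_trials = min(len(neuron[c]) for neuron in responses for c in conditions)
    vectors = {c: [[] for _ in range(n_trials)] for c in conditions}
    for neuron in responses:
        for c in conditions:
            trials = list(neuron[c])
            # Shuffle trial order independently for each neuron, so that
            # trial t of one neuron is paired with an arbitrary trial of
            # every other neuron. This removes noise correlations.
            rng.shuffle(trials)
            for t in range(n_trials):
                vectors[c][t].append(trials[t])
    return vectors


rng = random.Random(0)
# Three hypothetical neurons, two conditions, four trials each.
responses = [
    {"zero": [1, 2, 3, 4], "straight": [5, 6, 7, 8]},
    {"zero": [10, 20, 30, 40], "straight": [50, 60, 70, 80]},
    {"zero": [0.1, 0.2, 0.3, 0.4], "straight": [0.5, 0.6, 0.7, 0.8]},
]
vectors = build_pseudo_population(responses, rng)
print(vectors["zero"][0])  # one pseudo-population vector (3 neurons)
```

      Calling the function again on each resampling gives a new random pairing of trials across neurons, including neurons that were in fact recorded simultaneously.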

      (5) Page 41, Lines 792-802: The authors state that "To assess the significance of the differences in classification scores between pairs of angles ... we computed the difference in classification score between the two pairs for each resampling and the percentile of 0 difference corresponded to the p-value." In a two-sided test under the null hypothesis of no difference between the distributions, the conventional approach would be to compute the p-value as the proportion of resampled differences that are as extreme or more extreme than the observed difference. Since a zero difference might be relatively rare, relying solely on its percentile could potentially misrepresent the tail probabilities relevant to a two-sided test. Could you clarify how their method addresses this issue?

      This test is based on the computation of the distribution of the difference between classification accuracies across resamplings. This is similar to the computation of the confidence interval of a difference. Thus, we assess whether the theoretical zero value (= no difference; = null hypothesis) is outside the 2.5 and 97.5 percentile interval of the computed distribution of the empirically observed differences. We have now clarified in the Methods (page 41) that for a two-tailed test the computed p-value (the percentile of the zero value) should be smaller than 0.025.
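
      A minimal sketch of this percentile-of-zero test, assuming the per-resampling differences in classification accuracy are already available as a list. The function names and example values are hypothetical.

```python
def zero_percentile_p(differences):
    """Fraction of resampled differences on the far side of zero.

    Takes the smaller of the two tails, so the result does not depend on
    which condition is subtracted from which.
    """
    n = len(differences)
    frac_le_zero = sum(d <= 0 for d in differences) / n
    frac_ge_zero = sum(d >= 0 for d in differences) / n
    return min(frac_le_zero, frac_ge_zero)


def is_significant_two_tailed(differences, alpha=0.05):
    """Zero lies outside the central (1 - alpha) percentile interval."""
    return zero_percentile_p(differences) < alpha / 2


# Hypothetical example: 98 of 100 resamplings favour the first angle pair.
diffs = [0.1] * 98 + [-0.1] * 2
print(zero_percentile_p(diffs))          # 0.02
print(is_significant_two_tailed(diffs))  # True, since 0.02 < 0.025
```

      Comparing the single-tail fraction with 0.025 is equivalent to asking whether zero falls outside the 2.5 to 97.5 percentile interval of the resampled differences.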

      (6) Page 43, Lines 829-834: The manuscript explains: "The mean of 10 classification accuracies (i.e., of 10 resamplings) was employed to obtain a distribution (n=100) of the differences in classification accuracy ... The reported standard deviations of the classification accuracies are computed using also the means of 10 resamplings." I am unfamiliar with this type of analysis and am unclear about the rationale for calculating distributions and standard deviations based on the means of 10 resamplings rather than using the original distribution of classification accuracies. This resampling procedure appears to yield a narrower distribution and smaller standard deviations than the original data. Could you please justify this approach?

      The logic of the analysis is to reduce the noise in the data, by averaging across 10 randomly selected resamplings, but still keeping a sufficient number of data (100 values) for a test.
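
      The averaging step can be sketched as below. The response only states that means of 10 randomly selected resamplings were taken to obtain 100 values; that these come from 1000 resamplings split into disjoint groups is an assumption of this sketch, and the accuracies are simulated.

```python
import random
from statistics import mean, stdev


def means_of_groups(values, group_size, rng):
    """Shuffle the values, split into disjoint groups, return group means."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return [
        mean(shuffled[i:i + group_size])
        for i in range(0, len(shuffled) - group_size + 1, group_size)
    ]


rng = random.Random(0)
# Simulated classification accuracies for 1000 resamplings (assumed count).
accuracies = [rng.random() for _ in range(1000)]
group_means = means_of_groups(accuracies, 10, rng)

print(len(group_means))                       # 100 values remain for the test
print(stdev(accuracies), stdev(group_means))  # spread before and after averaging
```

      The overall mean is unchanged, while the spread of the 100 group means is smaller than that of the individual resamplings, which is the narrowing the reviewer points out.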

      Reviewer #3 (Recommendations for the authors):

      (1) Some sentences are too long and difficult to parse. For example, in line 177: "the correlations between the responses to the 64 head-body orientation conditions of the two centerings for the neuron and pose combinations showing significant head-body interactions for the two centerings were similar to those observed for the whole population."

      We have modified this sentence: For neuron and pose combinations with significant head-body interactions in both centerings, the correlations between responses to the 64 head-body orientation conditions were similar to those observed in the whole population.

      (2) The authors argue in line 485: "in our study, a search bias cannot explain the body-inversion effect since we selected responsive units using both upright and inverted images." However, the body-selective patches were localized using upright images, correct?

      The monkey-selective patches were localized using upright images indeed. However, we recorded in experiment 3 (and 2) also outside the localized patches (as we noted before in the Methods:  “In experiments 2 and 3 we recorded from a wider region, which overlapped with the two monkey patches and the recording locations of experiment 1”). Furthermore, the preference for upright monkey images is not an all-or-nothing phenomenon: most units still responded to inverted monkeys. Also, we believe it is likely that the mean responses to the inverted bodies in the monkey patches, defined by upright bodies versus objects, would be larger than those to objects and we would be surprised to learn that there is a patch selective for inverted bodies that we would have missed with our localizer.

      (3) Typo: line 447, "this independent"->"is independent"?

      Corrected.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1

      Public Review

      Summary:

      (1) This work describes a simple mechanical model of worm locomotion, using a series of rigid segments connected by damped torsional springs and immersed in a viscous fluid.

      (2) It uses this model to simulate forward crawling movement, as well as omega turns.

      Strengths:

      (3) The primary strength is in applying a biomechanical model to omega-turn behaviors.

      (4) The biomechanics of nematode turning behaviors are relatively less well described and understood than forward crawling.

      (5) The model itself may be a useful implementation to other researchers, particularly owing to its simplicity.

      Weaknesses:

      (6) The strength of the model presented in this work relative to prior approaches is not well supported, and in general, the paper would be improved with a better description of the broader context of existing modeling literature related to undulatory locomotion.

      (7) This paper claims to improve on previous approaches to taking body shapes as inputs.

      (8) However, the sole nematode model cited aims to do something different, and arguably more significant, which is to use experimentally derived parameters to model both the neural circuits that induce locomotion as well as the biomechanics and to subsequently compare the model to experimental data.

      (9) Other modeling approaches do take experimental body kinematics as inputs and use them to produce force fields, however, they are not cited or discussed.

      (10) Finally, the overall novelty of the approach is questionable.

      (11) A functionally similar approach was developed in 2012 to describe worm locomotion in lattices (Majmudar, 2012, Roy. Soc. Int.), which is not discussed and would provide an interesting comparison and needed context.

      9-11: The paper you recommended and our manuscript have some similarities and differences.

      Similarities

      Firstly, the components constituting the worm are similar in both models. ElegansBot models the worm as a chain of n rods, while the study by Majmudar et al. (2012) models it as a chain of n beads. Each bead in the Majmudar et al. model has a directional vector, making it very similar to ElegansBot's rod. However, there's a notable difference: in the Majmudar et al. model, each bead has an area for detecting contact between the obstacle and the bead, while in ElegansBot, the rod does not feature such an area.

      Secondly, the types of forces and torques acting on the components constituting the worm are similar. Each rod in ElegansBot receives frictional force, muscle force, and joint force. Each bead in the Majmudar et al. model receives a constraint force, viscous force, and a repulsive force from obstacles. Each rod in ElegansBot receives frictional torque, muscle torque, and joint torque. Each bead in the Majmudar et al. model receives elastic torque, constraint torque, drive torque, and viscous torque. The Majmudar et al. model's constraint force and torque are similar to ElegansBot's joint force and torque in that they prevent two connected components of the worm from separating. The Majmudar et al. model's viscous force and torque are similar to ElegansBot's frictional force and torque in that they are forces exchanged between the worm and its surrounding environment (ground surface). The Majmudar et al. model's drive torque is similar to ElegansBot's muscle force and muscle torque as a cause of the worm's motion. However, unlike ElegansBot, the Majmudar et al. model did not consider the force generating the drive torque, and there are differences in how each force and torque is calculated. This will be discussed in more detail below.

      Differences

      Firstly, the medium in which the worm locomotes is different. ElegansBot is a model describing motion in a homogeneous medium like agar or water without obstacles, while the Majmudar et al. model describes motion in water with circular obstacles fixed at each lattice point. This is because the purposes of the models are different. ElegansBot analyzes locomotion patterns based on the friction coefficient, while the Majmudar et al. model analyzes locomotion patterns based on the characteristics of the obstacle lattice, such as the distance between obstacles. Also, for this reason, the Majmudar et al. model's bead, unlike ElegansBot's rod, receives a repulsive force from obstacles.

      Secondly, the specific methods of calculating similar types of forces differ. ElegansBot calculates joint forces by substituting frictional forces, muscle forces, frictional torques, and muscle torques into an equation derived from differentiating a boundary condition equation twice over time, where two neighboring rods always meet at one point. This involves determining the process through which various forces and torques are transmitted across the worm. Specifically, it entails calculating how the frictional forces and torques, as well as the muscle forces and torques acting on each rod, are distributed throughout the entire length of the worm. In contrast, the Majmudar et al. model uses the Lagrange multiplier method, based on a boundary condition that the curve length determined by each bead's tangential angle does not change, to calculate the constraint force and torque before calculating the drive torque and viscous force. This implies that the Majmudar et al. model did not consider the mechanism by which the drive torque and viscous force received by one bead are distributed throughout the worm. ElegansBot's rod receives an anisotropic Stokes frictional force from the ground surface, while the Majmudar et al. model considered the frictional force according to the Navier-Stokes equation for incompressible fluid, assuming the fluid velocity at the bead's location as the bead's velocity.
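
      To illustrate what an anisotropic Stokes (linear) frictional force on a single rod means, the sketch below resolves the rod's velocity into components along and across its axis and applies a separate drag coefficient to each. It is not ElegansBot's implementation: the function name, the coefficient values, and the restriction to the translational force (no frictional torque) are assumptions.

```python
import math


def anisotropic_drag(angle, velocity, b_par, b_perp):
    """Linear drag on a rod with different coefficients along and across it.

    angle    : orientation of the rod axis in radians
    velocity : (vx, vy) of the rod centre
    b_par    : drag coefficient parallel to the rod (hypothetical value)
    b_perp   : drag coefficient perpendicular to the rod (hypothetical value)
    """
    tx, ty = math.cos(angle), math.sin(angle)  # unit tangent
    nx, ny = -ty, tx                           # unit normal
    vx, vy = velocity
    v_par = vx * tx + vy * ty                  # velocity along the rod
    v_perp = vx * nx + vy * ny                 # velocity across the rod
    fx = -(b_par * v_par * tx + b_perp * v_perp * nx)
    fy = -(b_par * v_par * ty + b_perp * v_perp * ny)
    return fx, fy


# Rod lying along x, moving diagonally; sideways motion is resisted
# ten times more strongly than lengthwise motion in this example.
print(anisotropic_drag(0.0, (1.0, 1.0), b_par=1.0, b_perp=10.0))
```

      When the perpendicular coefficient exceeds the parallel one, sideways slip is penalised more than lengthwise sliding, which is what allows an undulating body to generate net thrust.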

      Thirdly, unlike the Majmudar et al. model, ElegansBot considers the inertia of the worm components. Therefore, ElegansBot can simulate regardless of how low or high the ground surface's friction coefficient is. The Majmudar et al. model does not share this property.

      (12) The idea of applying biomechanical models to describe omega turns in C. elegans is a good one, however, the kinematic basis of the model as used in this paper (the authors do note that the control angle could be connected to a neural model, but don't do so in this work) limits the generation of neuromechanical control hypotheses.

      8, 12: We do not agree with the claim that ElegansBot could limit other researchers in generating neuromechanical control hypotheses. The term $\theta_{\mathrm{ctrl},i}^{(t)}$ used in our model is designed to be replaceable with neuromechanical control in the future.

      (13) The model may provide insights into the biomechanics of such behaviors, however, the results described are very minimal and are purely qualitative.

      (14-1) Overall, direct comparisons to the experiments are lacking or unclear.

      14-1: If you look at the text explaining Fig. 2 and 5 (Fig. 2 and 4 in old version), it directly compares the velocity, wave-number, and period as numerical indicators representing the behavior of the worm, between the experiment and ElegansBot.

      (14-2) Furthermore, the paper claims the value of the model is to produce the force fields from a given body shape, but the force fields from omega turns are only pictured qualitatively.

      13, 14-2: We gratefully accept the point that our analysis of the omega-turn is qualitative. Therefore, we have conducted additional quantitative analysis on the omega-turn and inserted the results into the new Fig. 4. We have considered the term 'Force field' as referring to the force vector received by each rod. We have created numerical indicators representing various behaviors of the worm and included them in the revised manuscript.

      (15) No comparison is made to other behaviors (the force experienced during crawling relative to turning for example might be interesting to consider) and the dependence of the behavior on the model parameters is not explored (for example, how does the omega turn change as the drag coefficients are changed).

      Thank you for the great idea. To compare behaviors, first, a clear criterion for distinguishing behaviors is needed. Therefore, we have created a new mathematical definition for behavior classification in the revised manuscript (“Defining Behavioral Categories” in Method). After that, we compared the force and power (energy consuming rate) between each forward locomotion, backward locomotion, and omega-turn (Fig. 4). And in the revised manuscript, we newly analyzed how the turning behavior changes with variations in the friction coefficients in Figs. S4-S7.

      (16) If the purpose of this paper is to recapitulate the swim-to-crawl transition with a simple model, and then apply the model to new behaviors, a more detailed analysis of the behavior of the model variables and their dependence on the variables would make for a stronger result.

      In our revised manuscript, we have quantitatively analyzed the changes occurring in turning behavior from water to agar, and the results are presented in Figs. S9 and S10.

      (17) In some sense, because the model takes kinematics as an input and uses previously established techniques to model mechanics, it is unsurprising that it can reproduce experimentally observed kinematics, however, the forces calculated and the variation of parameters could be of interest.

      (18) Relatedly, a justification of why the drag coefficients had to be changed by a factor of 100 should be explored.

      (19) Plate conditions are difficult to replicate and the rheology of plates likely depends on a number of factors, but is for example, changes in hydration level likely to produce a 100-fold change in drag? or something more interesting/subtle within the model producing the discrepancy?

      18, 19: As mentioned in the paper, we do not know if the friction coefficients in the study of Boyle et al. (2012) and the friction coefficients in the experiment of Stephens et al. (2016) are the same. In our revised manuscript, we have explored more in detail the effects of the friction coefficient's scale factor, and explained why we chose a scale factor of 1/100 (“Proper Selection of Friction Coefficients” in Supplementary Information). In summary, we analyzed the changes in trajectory due to scaling of the friction coefficient, and chose the scale factor 1/100 as it allowed ElegansBot to accurately reproduce the worm's trajectory while also being close to the friction coefficients in the Boyle et al. paper.

      (20) Finally, the language used to distinguish different modeling approaches was often unclear.

      (21) For example, it was unclear in what sense the model presented in Boyle, 2012 was a "kinetic model" and in many situations, it appeared that the term kinematic might have been more appropriate.

      Thank you for the feedback. As you pointed out, we have corrected that part to 'kinematic' in the revised manuscript.

      (22) Other phrases like "frictional forces caused by the tension of its muscles" were unclear at first glance, and might benefit from revision and more canonical usage of terms.

      We agree that the expression may not be immediately clear. This is due to the word limit for the abstract (the abstract of eLife VOR should be under 200 words, and our paper's abstract is 198 words), which forced us to convey the causality in a limited number of words. Therefore, although we will not change the abstract, the expression in question means that the muscle tension, which is the cause of the worm's locomotion, ultimately generates the frictional force between the worm and the ground surface.

      Recommendations For The Authors

      (23) As I stated in my public review, I think the paper could be made much stronger if a more detailed exploration of turning mechanics was presented.

      (24) Relatedly, rather than restricting the analysis to individual videos of turning behaviors, I wonder if a parameterized model of the turning kinematics would be fruitful to study, to try to understand how different turning gaits might be more or less energetically favorable.

      We thank the reviewer once again for their suggestion. Thanks to their proposal, we were able to conduct additional quantitative analysis on turning behavior.

      Reviewer #2

      Public Review

      Summary:

      (1) Developing a mechanical model of C. elegans is difficult to do from basic principles because it moves at a low (but not very small) Reynolds number, is itself visco-elastic, and often is measured moving at a solid/liquid interface.

      (2) The ElegansBot is a good first step at a kinetic model that reproduces a wide range of C. elegans motility behavior.

      Strengths:

      (3) The model is general due to its simplicity and likely useful for various undulatory movements.

      (4) The model reproduces experimental movement data using realistic physical parameters (e.g. drags, forces, etc).

      (5) The model is predictive (semi?) as shown in the liquid-to-solid gait transition.

      (6) The model is straightforward in implementation and so likely is adaptable to modification and addition of control circuits.

      Weaknesses:

      (7) Since the inputs to the model are the actual shape changes in time, parameterized as angles (or curvature), the ability of the model to reproduce a realistic facsimile of C. elegans motion is not really a huge surprise.

      (8) The authors do not include some important physical parameters in the model and should explain in the text these assumptions.

      (9. 1) The cuticle stiffness is significant and has been measured [1].

      (10. 2) The body of C. elegans is under high hydrostatic pressure which adds an additional stiffness [2].

      (11. 3) The visco-elasticity of C. elegans body has been measured. [3]

      Thank you for asking. The stiffness of C. elegans is an important consideration. We took this into account when creating ElegansBot, but did not explain it in the paper. The detailed explanation is as follows. C. elegans indeed has stiffness due to its cuticle and internal pressure. This stiffness is treated as a passive elastic force (elastic force term of lateral passive body force) in the paper of Boyle et al. (2012). However, the maximum spring constant of the passive elastic force is 1/20 of the maximum spring constant of the active elastic force. If we consider this fact in our model, the elastic term of the muscle torque is as follows: ( is the active torque elasticity coefficient, is the passive torque elasticity coefficient)

      where

      Therefore, there is no need to describe the active and passive terms separately in

      Furthermore, since , assuming , then and .

      (12) There is only a very brief mention of proprioception.

      (13) The lack of inclusion of proprioception in the model should be mentioned and referenced in more detail in my opinion.

      As you emphasized, proprioception is an important aspect in the study of C. elegans' locomotion. In our paper, its importance is briefly introduced with a sentence each in the introduction and discussion. However, our research is a model about the process of the creation of body motion originated from muscle forces, and it does not model the sensory system that senses body posture. Therefore, there is no mention of using proprioception in our paper's results section. What is mentioned in the discussion is that ElegansBot can be applied as the kinetic body model part in a combination model of a kinetic body model and a neuronal circuit model that receives proprioception as a sensory signal.

      (14) These are just suggested references.

      (15) There may be more relevant ones available.

      The papers you provided contain specific information about the Young's modulus of the C. elegans body. The first paper (Rahimi et al., 2022) measured the Young's modulus of the cuticle after chemically isolating it from C. elegans, while the second paper (Park et al., 2007) and third paper (Backholm et al., 2013) measured the elasticity and Young's modulus of C. elegans without separating the cuticle. Based on the Young's modulus provided in each paper (although the second and third papers did not measure stiffness in the longitudinal direction), we derived the elastic coefficient (assuming a worm radius of 25 μm, cuticle thickness of 0.5 μm, and 1/25 of longitudinal length of the cuticle of 40 μm). The range was quite broad, from 9.82 × 10¹¹ μg/sec² (from the first paper) to 2.16 × 10⁸ μg/sec² (from the third paper). Although the elastic coefficient value in our paper falls within this range, since the range of the elastic coefficient is wide, we think we can modify the elastic coefficient in our paper and will be able to reapply our model if more accurate values become known in the future.
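
      The conversion from Young's modulus to an elastic coefficient can be illustrated with a minimal sketch. It assumes the cuticle segment behaves as a thin-walled tube acting as an axial spring, k = E · (2πrt) / L, with the radius, thickness, and segment length stated above. This formula is an assumption, not one given in this response. The two moduli (500 MPa and 110 kPa) are also assumptions: they are not quoted here and were chosen only because, under this formula, they land on the endpoints of the quoted range. The function name is hypothetical.

```python
import math

# 1 N/m = 1 kg/s^2 = 1e9 micrograms/s^2
UG_PER_S2_PER_N_PER_M = 1e9


def cuticle_spring_constant(youngs_modulus_pa,
                            radius_um=25.0,
                            thickness_um=0.5,
                            segment_length_um=40.0):
    """Axial spring constant (in micrograms/s^2) of one cuticle segment.

    Assumes a thin-walled tube: cross-sectional area = 2 * pi * r * t,
    and a linear spring k = E * A / L. Both are illustrative assumptions.
    """
    area_m2 = 2.0 * math.pi * (radius_um * 1e-6) * (thickness_um * 1e-6)
    k_n_per_m = youngs_modulus_pa * area_m2 / (segment_length_um * 1e-6)
    return k_n_per_m * UG_PER_S2_PER_N_PER_M


if __name__ == "__main__":
    # Assumed moduli, chosen to bracket the range quoted in the response.
    for label, modulus_pa in (("isolated cuticle, assumed 500 MPa", 5.0e8),
                              ("whole worm, assumed 110 kPa", 1.1e5)):
        k = cuticle_spring_constant(modulus_pa)
        print(f"{label}: {k:.3g} micrograms/s^2")
```

      Under these assumptions the two cases come out near 9.82 × 10¹¹ and 2.16 × 10⁸ μg/sec², which shows how a spread of more than three orders of magnitude in the measured modulus carries over directly to the elastic coefficient.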

      Reviewer #3

      Public Review

      Summary:

      (1) A mechanical model is used with input force patterns to generate output curvature patterns, corresponding to a number of different locomotion behaviors in C. elegans.

      Strengths:

      (2) The use of a mechanical model to study a variety of locomotor sequences and the grounding in empirical data are strengths.

      (3) The matching of speeds (though qualitative and shown only on agar) is a strength.

      Weaknesses:

      (4) What is the relation between input and output data?

      ElegansBot takes the worm's body control angle as the input, and produces trajectory and force of each segment of the worm as the output.

      (5) How does the input-output relation depend on the parameters of the model?

      If 'parameter' is understood as vertical and horizontal friction coefficients, then the explanation for this can be found in Fig. 5 (Fig. 4 in the old version).

      (6) What biological questions are addressed and can significant model predictions be made?

      Our model provides an equation of motion that deciphers the locomotion of C. elegans, including turning behaviors, which were relatively less well understood.

      Recommendations For The Authors

      (7) The novelty and significance of the paper should be clarified.

      We have added quantitative analyses of turning behavior in the revised manuscript, and we hope this will be helpful to you.

      (8) Previously much more detailed models have been published, as compared to this one.

      We hope the reviewer can point out any previous model that we may have missed.

      (9) The mechanics here are simplified (e.g. no information about dorsal/ventral innervation but only a bending angle) setting limitations on the capacity for model predictiveness.

      (10) Such limitations should be discussed.

      We view the difference between dorsal/ventral innervation and bending angle not as a matter of simplification, but rather as a reflection of the hierarchy that our model implements. Our model does not consider dorsal/ventral innervation, but it uses the bending angle to reproduce behavior in various input and frictional environments, which signifies the strong predictiveness of ElegansBot (Figures 2, 3, and 5; Figures 2, 3, and 4 in the old version). Moreover, if the midline of C. elegans is incompressible, then modeling by dividing into dorsal/ventral, as opposed to modeling solely with the bending angle, does not increase the degrees of freedom of the worm model, and therefore does not increase its predictiveness.

      (11) The aims of the paper and results need to be supported quantitatively and analyzed through parameter sweeps and intervention.

      We have conducted additional quantitative analyses on turning behavior as suggested by Reviewer #1 (Fig. 4, S4-S7, S9, and S10).

      (12) The methods are given only in broad brushstrokes, and need to be much more clear (and ideally sharing all code).

      We have thoroughly detailed every aspect of this research, from deriving the physical constants of C. elegans, agar, and water to developing the formulas and proofs necessary for operating ElegansBot and its applications. This comprehensive information is all presented in the Results, Methods, and Supplementary Information sections, as well as in the source code. Moreover, we have already ensured that our research can be easily reproduced by providing detailed explanations and by making ElegansBot accessible through public software databases (PyPI, GitHub). To further aid in its application and understanding, especially for those less familiar with the subject, we have also included minimal code as examples in the database. This code is designed to simplify the process of reproducing the results of the paper, thereby making our research more accessible and understandable. Therefore, we believe that readers will easily gain significant assistance from the extensive information we have provided. Should readers require further help, they can always contact us, and we will be readily available to offer support.

      (13) The supporting figures and movies need to include a detailed analysis to evidence the claims.

      We have conducted and provided additional quantitative analyses on turning behavior as suggested by Reviewer #1 (Fig. 4, S4-S7, S9, and S10).

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, Chen et al. used cryo-ET and an in vitro reconstituted system to demonstrate that the autoinhibited form of LRRK2 can also assemble into filaments that wrap around the microtubule, although the filaments are typically shorter and less regular compared to the previously reported active-LRRK2 filaments. The structure revealed a new interface involving the N-terminal repeats that were disordered in the previous active-LRRK2 filament structure. The autoinhibited-LRRK2 filament also has different helical parameters compared to the active form.

      Strengths:

      The structure obtained in this study is the highest resolution of LRRK2 filaments done by subtomogram averaging, representing a major technical advance compared to the previous Cell paper from the same group. Overall, I think the data are well presented with beautiful graphic rendering, and valuable insights can be gained from this structural study.

      Weaknesses:

      (1) There are only three main figures, together with 9 supplemental figures. The authors may consider breaking the currently overwhelming Figures 1 and 3 into smaller figures and moving some of the supplemental figures to the main figure, e.g., Figure S7.

      (2) The key analysis of this manuscript is to compare the current structure with the previous active-LRRK2 filament structure. Currently, such a comparison is buried in Figure 3H. It should be part of Figure 1.

      We thank the reviewer for this suggestion. As suggested, we have rearranged the figures, split Figure 1 and 3 into smaller Figures, and moved the comparison analysis in Figure 3H to the new Figure 1. Specifically, the old Figure 1 is separated into two figures, introducing the model-building process and describing the two symmetric axes. The old Figure 3 is also separated into two small figures, describing the geometric analysis and model comparison, respectively.

      Reviewer #2 (Public review):

      The authors of this paper have done much pioneering work to decipher and understand LRRK2 structure and function, to uncover the mechanism by which LRRK2 binds to microtubules, and to study the roles that this may play in biology. Their previous data demonstrated that LRRK2 in the active conformation (pathogenic mutation or Type I inhibitor complex) bound to microtubule filaments in an ordered helical arrangement. This they showed induced a "roadblock" in the microtubule impacting vesicular trafficking. The authors have postulated that this is a potentially serious flaw with Type 1 inhibitors and that companies should consider generating Type 2 inhibitors in which the LRRK2 is trapped in the inactive conformation. Indeed the authors have published much data that LRRK2 complexed to Type 2 inhibitors does not seem to associate with microtubules and cause roadblocks in parallel experiments to those undertaken with type 1 inhibitors published above.

      In the current study, the authors have undertaken an in vitro reconstitution of microtubule-bound filaments of LRRK2 in the inactive conformation, which surprisingly revealed that inactive LRRK2 can also interact with microtubules in its auto-inhibited state. The authors' data shows that while the same interphases are seen with both the active LRRK2 and inactive microtubule bound forms of LRRK2, they identified a new interphase that involves the WD40-ARM-ANK- domains that reportedly contributes to the ability of the inactive form of LRRK2 to bind to microtubule filaments. The structures of the inactive LRRK2 complexed to microtubules are of medium resolution and do not allow visualisation of side chains.

      This study is extremely well-written and the figures are incredibly clear and well-presented. The finding that LRRK2 in the inactive autoinhibited form can be associated with microtubules is an important observation that merits further investigation. This new observation makes an important contribution to the literature and builds upon the pioneering research that this team of researchers has contributed to the LRRK2 fields. However, in my opinion, there is still significant work that could be considered to further investigate this question and understand the physiological significance of this observation.

      We thank the reviewer for the positive comments and we agree that more work can be done next to understand the physiological significance of the autoinhibited LRRK2 in cellular environments. We are actively working on understanding how the stability of autoinhibited full-length LRRK2 is regulated, especially how the transition between autoinhibited and active forms of LRRK2 can happen. Our in situ data (Watanabe et al. 2020) indicate that overexpressed hyperactive PD-mutant LRRK2 mainly adopts its active-like conformation in cells. Thus, learning how the state transition occurs will allow us to target autoinhibited LRRK2 specifically and efficiently in cells and study its structure and function in physiological conditions.

      Reviewer #3 (Public review):

      Summary:

      The manuscript by Chen et al. examines the structure of the inactive LRRK2 bound to microtubules using cryo-EM tomography. Mutations in this protein have been shown to be linked to Parkinson's Disease. It is already shown that the active-like conformation of LRRK2 binds to the MT lattice, but this investigation shows that full-length LRRK2 can oligomerize on MTs in its autoinhibited state with different helical parameters than were observed with the active-like state. The structural studies suggest that the autoinhibited state is less stable on MTs.

      Strengths:

      The protein of interest is very important biomedically and a novel conformational binding to microtubules is proposed.

      Weaknesses:

      (1) The structures are all low resolution.

      We thank the reviewer for the comments on both the strengths and weaknesses of the manuscript. We agree with the reviewer that higher resolution would provide more information about how LRRK2 interacts with microtubules and oligomerizes in its autoinhibited form. However, with the current resolution, our model-building benefited significantly from the published high-resolution models and the AlphaFold predictions. We used cryo-ET and subtomogram analysis to solve the structure because this filament is less regular than the right-handed active LRRK2 filament, preventing us from using conventional single-particle analysis. As highlighted by Reviewer #1, being able to push the resolution to sub-nanometer is an important advance reflecting state-of-the-art subtomogram analysis, especially for a heterogeneous sample. Notably, the microtubule reconstruction reached higher resolution, comparable to our previous single-particle studies on LRRK2-RCKW (Snead and Matyszewski et al.), confirming the data quality.

      (2) There are no measurements of the affinity of the various LRRK2 molecules (with and without inhibitors) to microtubules. This should be addressed through biochemical sedimentation assay.

      We thank the reviewer for the suggestion and we agree that learning the binding affinity between LRRK2 and microtubules would be informative. We attempted to purify LRRK2 carrying mutations on the WD40:ARM/ANK interface we identified in the manuscript. Unfortunately, for both LRRK2 and LRRK2<sup>I2020T</sup> with the N-terminal mutations (R521A/F573A/E854K), the yield and purity of the final samples were significantly worse than in our routine LRRK2 prep. Our chromatography and gel electrophoresis results indicate that the proteins are degrading during purification.

      Author response image 1.

      While we have attached the results here, and it would be interesting to investigate why N-terminal mutations destabilize LRRK2, we anticipate that significant efforts would be required for further experiments, which we respectfully consider outside of the scope of this manuscript. 

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) In Figure S9, the graphic definition of "chain length" in panel A is misleading. The authors can simply note in the figure legend that "chain length is the number of asymmetric units in a continuous chain".

      We thank the reviewer for the suggestion. The updated figure and legend have incorporated the changes.

      (2) In Figure S7B, the conformation changes of the 'G-loop' and the 'DYG' motifs are not so convincing at the current resolution.

      We thank the reviewer for pointing it out. We agree that our model resolution is not high enough to support the unbiased observation of the conformation changes of the key kinase motifs. In the revised manuscript, we avoided emphasizing the comparison between the two models. Instead, we state that for both the MLi-2 bound map and the GZD-824 bound map, the corresponding published high-resolution models fit into each kinase map, but the MLi-2 bound model doesn’t fit as well in the GZD-824 bound map, with the correlation value dropping from 0.44 to 0.4, supporting our statement that “full-length LRRK2 bound to microtubules is in its autoinhibited state in our reconstituted system”.

      Reviewer #2 (Recommendations for the authors):

      (1) Are there any cellular experiments that could be done to demonstrate that inactive LRRK2 associates with microtubules in cells?

      We thank the reviewer for pointing out this direction for future studies. We are studying the physiological significance of the autoinhibited LRRK2 in cells, but haven’t yet been successful at demonstrating physiological binding to microtubules. Further, as noted in our response to Reviewer #2's public review, we are also actively working on understanding how the stability of autoinhibited full-length LRRK2 is regulated, especially how the transition between autoinhibited and active forms of LRRK2 can happen. Our in situ data (Watanabe et al. 2020) indicate that overexpressed hyperactive PD-mutant LRRK2 mainly adopts its active-like conformation in cells. Thus, learning how the state transition occurs will allow us to target autoinhibited LRRK2 specifically and efficiently in cells and study its structure and function in physiological conditions.

      (2) Previous work that the authors and others have undertaken has suggested that only LRRK2 in its active conformation can associate with microtubule filaments and the authors have shown that this leads to a roadblock in vesicular transport only when LRRK2 is complexed with Type 1 but not Type 2 inhibitors. There seems to be some discrepancy here that is not addressed in the paper as based on the current results one would also expect LRRK2 bound to Type 2 inhibitors to induce roadblocks in microtubule filaments. How can this be explained?

      We thank the reviewer for raising this important question. Taking all of our published data together, we believe that LRRK2 can introduce roadblocks with Type 1 inhibitor bound in the active-like conformation, where N-terminus LRRK2 domains are flexible and don’t block the kinase active site. In other words, full-length LRRK2 can form roadblocks when it behaves more like the truncated LRRK2<sup>RCKW</sup> variant. The autoinhibited LRRK2 forms shorter and less stable oligomers on microtubules, making it harder to block transport. Consistent with this, our in situ LRRK2-microtubule structure was observed in cells where LRRK2 is in an active-like conformation, and the LRRK2 N-terminus appeared to be flexible and away from the microtubule when forming right-handed filaments.

      (3) Does the finding that inactive LRRK2 only binds to microtubules as a short filament, explain the differences between the inactive and active forms of LRRK2 binding to microtubules and causing roadblocks?

      We thank the reviewer for discussing this point with us and asking the question. As we replied in the previous comment, the reviewer’s conclusion explains how the roadblock phenomenon occurs only under certain circumstances. We expanded our discussion to add the following and address the question:

      “Notably, we previously demonstrated that active‐like LRRK2, when bound to a Type I inhibitor, can form roadblocks that impair vesicular transport. Since autoinhibited LRRK2 assembles into shorter, less stable oligomers on microtubules, we anticipate it will exert reduced road‐blocking effects in cells, regardless of the inhibitor bound.”

      (4) Could the authors undertake further characterization of the new WD40-ARM-ANK interphase that they have identified? Is this important for the binding of the autoinhibited mutant? Could mutants be made in this interphase to see if this prevents the autoinhibited but not the active conformation of LRRK2 binding to microtubules?

      We thank the reviewer for the comment. As mentioned in our response to Reviewer #3, public comment #2, we attempted multiple times to purify LRRK2 carrying mutations on the WD40:ARM/ANK interface we identified in the manuscript. Unfortunately, for both LRRK2 and LRRK2<sup>I2020T</sup> with the N-terminal mutations (R521A/F573A/E854K), the yield and purity of the final samples were significantly worse than in our routine LRRK2 prep. Our chromatography and gel electrophoresis results indicate that the proteins are degrading during purification.

      (5) The authors identify several disease-relevant missense mutations that appear to lie within the novel interphase that the authors have characterised in this study. Although this is discussed in the Discussion, some experimental data demonstrating how these missense mutations impact the ability of inactive LRRK2 to bind to microtubule filaments in the presence or absence of Type 1 and Type 2 compounds could provide further experimental data that emphasises the physiological importance of the results presented in this study.

      We thank the reviewer for discussing this interesting direction. The disease-relevant missense mutations can have a direct or indirect impact on the binding of autoinhibited LRRK2 to microtubules, and we agree that it would be interesting to test it out in the future. However, we anticipate that significant effort would be required for further experiments. Alas, our funding for this project ended suddenly and we want to report our results to the community.

      (6) For the data that is shown in Figure 1, could the authors explain how this differs from results in previous papers of the authors showing that the active form of LRRK2 binds microtubules? How does the binding observed here differ from that observed in the previous studies? To a non-specialist reader, the data looks fairly like what has previously been reported.

      We thank the reviewer for asking the question. As mentioned in the response to the public review, the detailed comparison between the data and the previous papers is described in Figure 3, and we agree that it is helpful to incorporate this information in Figure 1. In the revised manuscript, we have incorporated the comparison panel in Figure 1.

      (7) The finding that the autoinhibited LRRK2 forms short and sparse oligomers on microtubules raises the question of how physiological this observation is. Having some data that suggests that this is physiologically relevant would boost the impact of this study.

      We agree with the reviewer on this comment. As discussed in the response to the first comment from the reviewer, we have not been able to assess the physiological relevance of LRRK2 binding to microtubules in either active or inactive state, but continue to pursue this line of research. We are aware and regret that this lessens the impact of this work.

      (8) For the more general reader the authors could potentially better highlight why the key finding in this paper is important.

      We thank the reviewer for the suggestion. To further address the significance of the key findings, especially how it can open up more possibilities for inhibitor-based drug development, we expand our discussion section to include the following:

      “Understanding how Type I and Type II inhibitors’ binding to LRRK2 affects its mechanism is vital to the design of inhibitor-based PD drug development strategies. Our findings revealed that different LRRK2 kinase inhibitors bind to autoinhibited LRRK2 similarly either in solution or on microtubules. Furthermore, the observation of autoinhibited LRRK2 forming short, less stable oligomers on microtubules opens new possibilities to inhibit LRRK2 activity in PD patients. A Type I inhibitor specifically targeting autoinhibited LRRK2 may alleviate the effect of LRRK2 roadblocks on microtubules. Alternatively, a promising strategy of LRRK2 inhibitor design can focus on the stabilization of allosteric N-terminus blocking on the kinase domain, which favors the formation of autoinhibited LRRK2 oligomers on microtubules and causes fewer side effects.”

      Reviewer #3 (Recommendations for the authors):

      In the third paragraph of the introduction, expand on whether type-1 inhibitors, which "capture kinases in a closed, 'active-like' conformation", still inhibit the kinase activity.

      We thank the reviewer for the request to expand this paragraph. We added the following explanation for better understanding in the third paragraph:

      “Type-I inhibitors bind to the ATP binding site and target the kinase in its ‘active-like’ conformation, inhibiting its kinase activity.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The aim of this paper is to describe a novel method for genetic labelling of animals or cell populations, using a system of DNA/RNA barcodes.

      Strengths:

      • The authors' attempt at providing a straightforward method for multiplexing Drosophila samples prior to scRNA-seq is commendable. The perspective of being able to load multiple samples on a 10X Chromium without antibody labelling is appealing.

      • The authors are generally honest about potential issues in their method, and areas that would benefit from future improvement.

      • The article reads well. Graphs and figures are clear and easy to understand.

      We thank the reviewer for these positive comments.

      Weaknesses:

      • The usefulness of TaG-EM for phototaxis, egg laying or fecundity experiments is questionable. The behaviours presented here are all easily quantifiable, either manually or using automated image-based quantification, even when they include a relatively large number of groups and replicates. Despite their claims (e.g., L311-313), the authors do not present any real evidence about the cost- or time-effectiveness of their method in comparison to existing quantification methods.

      While the behaviors that were quantified in the original manuscript were indeed relatively easy to quantify through other methods, they nonetheless demonstrated that sequencing-based TaG-EM measurements faithfully recapitulated manual behavioral measurements. In response to the reviewer’s comment, we have added additional experiments that demonstrate the utility of TaG-EM-based behavioral quantification in the context of a more labor-intensive phenotypic assay (measuring gut motility via food transit times in Drosophila larvae, Figure 4, Supplemental Figure 7). We found that food transit times in the presence and absence of caffeine are subtly different and that, as with larger effect size behaviors, TaG-EM data recapitulates the results of the manual assay. This experiment demonstrates both that TaG-EM can be used to streamline labor-intensive behavioral assays (we have included an estimate of the savings in hands-on labor for this assay by using a multiplexed sequencing approach, Supplemental Figure 8) and that TaG-EM can quantify small differences between experimental groups. We also note in the discussion that an additional benefit of TaG-EM-based behavioral assays is that the observer is blinded as to the experimental conditions as they are intermingled in a single multiplexed assay. We have added the following text to the paper describing these experiments.

      Results:

      “Quantifying food transit time in the larval gut using TaG-EM

      Gut motility defects underlie a number of functional gastrointestinal disorders in humans (Keller et al., 2018). To study gut motility in Drosophila, we have developed an assay based on the time it takes a food bolus to transit the larval gut (Figure 4A), similar to approaches that have been employed for studying the role of the microbiome in human gut motility (Asnicar et al., 2021). Third instar larvae were starved for 90 minutes and then fed food containing a blue dye. After 60 minutes, larvae in which a blue bolus of food was visible were transferred to plates containing non-dyed food, and food transit (indicated by loss of the blue food bolus) was scored every 30 minutes for five hours (Supplemental Figure 7). 

      Because this assay is highly labor-intensive and requires hands-on effort for the entire five-hour observation period, there is a limit on how many conditions or replicates can be scored in one session (~8 plates maximum). Thus, we decided to test whether food transit could be quantified in a more streamlined and scalable fashion by using TaG-EM (Figure 4B). Using the manual assay, we observed that while caffeine-containing food is aversive to larvae, the presence of caffeine reduces transit time through the gut (Figure 4C, Supplemental Figure 7). This is consistent with previous observations in adult flies that bitter compounds (including caffeine) activate enteric neurons via serotonin-mediated signaling and promote gut motility (Yao and Scott, 2022). We tested whether TaG-EM could be used to measure the effect of caffeine on food transit time in larvae. As with prior behavioral tests, the TaG-EM data recapitulated the results seen in the manual assay (Figure 4D). Conducting the transit assay via TaG-EM enables several labor-saving steps. First, rather than counting the number of larvae with and without a food bolus at each time point, one simply needs to transfer non-bolus-containing larvae to a collection tube. Second, because the TaG-EM lines are genetically barcoded, all the conditions can be tested at once on a single plate, removing the need to separately count each replicate of each experimental condition. This reduces the hands-on time for the assay to just a few minutes per hour. A summary of the anticipated cost and labor savings for the TaG-EM-based food transit assay is shown in Supplemental Figure 8.”

      Discussion:

      “While the utility of TaG-EM barcode-based quantification will vary based on the number of conditions being analyzed and the ease of quantifying the behavior or phenotype by other means, we demonstrate that TaG-EM can be employed to cost-effectively streamline labor-intensive assays and to quantify phenotypes with small effect sizes (Figure 4, Supplemental Figure 8). An additional benefit of multiplexed TaG-EM behavioral measurements is that the experimental conditions are effectively blinded as the multiplexed conditions are intermingled in a single assay.”

      Methods:

      “Larval gut motility experiments

      Preparing Yeast Food Plates

      Yeast agar plates were prepared by making a solution containing 20% Red Star Active Dry Yeast 32oz (Red Star Yeast) and 2.4% Agar Powder/Flakes (Fisher) and a separate solution containing 20% Glucose (Sigma-Aldrich). Both mixtures were autoclaved with a 45-minute liquid cycle and then transferred to a water bath at 55ºC. After cooling to 55ºC, the solutions were combined and mixed, and approximately 5 mL of the combined solution was transferred into 100 x 15 mm petri dishes (VWR) in a PCR hood or contamination-free area. For blue-dyed yeast food plates, 0.4% Blue Food Color (McCormick) was added to the yeast solution. For the caffeine assays, 300 µL of a solution of 100 mM 99% pure caffeine (Sigma-Aldrich) was pipetted onto the blue-dyed yeast plate and allowed to absorb into the food during the 90-minute starvation period.

      Manual Gut Motility Assay

      Third instar Drosophila larvae were transferred to empty conical tubes that had been misted with water to prevent the larvae from drying out. After a 90-minute starvation period the larvae were moved from the conical to a blue-dyed yeast plate with or without caffeine and allowed to feed for 60 minutes. Following the feeding period, the larvae were transferred to an undyed yeast plate. Larvae were scored for the presence or absence of a food bolus every 30 minutes over a 5-hour period. Up to 8 experimental replicates/conditions were scored simultaneously. 

      TaG-EM Gut Motility Assay

      Third instar larvae were starved and fed blue dye-containing food with or without caffeine as described above. An equal number of larvae from each experimental condition/replicate were transferred to an undyed yeast plate. During the 5-hour observation period, larvae were examined every 30 minutes and larvae lacking a food bolus were transferred to a microcentrifuge tube labeled for the timepoint. Any larvae that died during the experiment were placed in a separate microcentrifuge tube and any larvae that failed to pass the food bolus were transferred to a microcentrifuge tube at the end of the experiment. DNA was extracted from the larvae in each tube and TaG-EM barcode libraries were prepared and sequenced as described above.”

      • Behavioural assays presented in this article have clear outcomes, with large effect sizes, and therefore do not really challenge the efficiency of TaG-EM. By showing a T-maze in Fig 1B, the authors suggest that their method could be used to quantify more complex behaviours. Not exploring this possibility in this manuscript seems like a missed opportunity.

      See the response to the previous point.

      • Experiments in Figs S3 and S6 suggest that some tags have a detrimental effect on certain behaviours or on GFP expression. Whereas the authors rightly acknowledge these issues, they do not investigate their causes. Unfortunately, this calls into question the overall suitability of TaG-EM, as other barcodes may also affect certain aspects of the animal's physiology or behaviour. Revising barcode design will be crucial to make sure that sequences with potential regulatory function are excluded.

      We have determined that the barcode (BC#8) that had no detectable Gal4-induced gene expression in Figure S6 (now Supplemental Figure 9) has a deletion in the GFP coding region that ablates GFP function. Interestingly, the expressed TaG-EM barcode transcript is still detectable in single cell sequencing experiments, but obviously this line cannot be used for cell enrichment (at least based solely on GFP expression from the TaG-EM construct). While it is unclear how this line came to have a lesion in the GFP gene, we have subsequently generated >150 additional TaG-EM stocks and we have tested the GFP expression of these newly established stocks by crossing them to Mhc-Gal4. All of the additional stocks had GFP expression in the expected pattern, indicating that the BC#8 construct is an outlier with respect to inducibility of GFP. We have added the following text to the results section to address this point:

      “No GFP expression was visible for TaG-EM barcode number 8, which upon molecular characterization had an 853 bp deletion within the GFP coding region (data not shown). We generated and tested GFP expression of an additional 156 TaG-EM barcode lines (Alegria et al., 2024), by crossing them to Mhc-Gal4 and observing expression in the adult thorax. All 156 additional TaG-EM lines had robust GFP expression (data not shown).”

      It is certainly the case that future improvements to the construct design may be necessary or desirable and that back-crossing could likely be used to alleviate line-to-line differences for specific phenotypes. We also address this point in the discussion with the following text:

      “We excluded this poor performing barcode line from the fecundity tests, however, backcrossing is often used to bring reagents into a consistent genetic background for behavioral experiments and could also potentially be used to address behavior-specific issues with specific TaG-EM lines. In addition, other strategies such as averaging across multiple barcode lines or permutation of barcode assignment across replicates could also mitigate such deficiencies.”

      • For their single-cell experiments, the authors have used the 10X Genomics method, which relies on sequencing just a short segment of each transcript (usually 50-250bp - unknown for this study as read length information was not provided) to enable its identification, with the matching paired-end read providing cell barcode and UMI information (Macosko et al., 2015). With average fragment length after tagmentation usually ranging from 300-700bp, a large number of GFP reads will likely not include the 14bp TaG-EM barcode. 

      The 10x Genomics 3’ workflows that were used for sequencing TaG-EM samples read the cell barcode and UMI in read one and the expressed RNA sequence in read two. We sequenced the samples shown in Figure 5 in the initial manuscript using a run configuration that generated 150 bp for read two. The TaG-EM barcodes are located just upstream of the poly-adenylation sites (based on the sequencing data, we observe two different poly-A sites and the TaG-EM barcode is located 35 and 60 bp upstream of these sites). Based on the location of the TaG-EM barcodes, 150 bp reads are sufficient to see the barcode in any GFP-associated read (when using the 3’ gene expression workflow). In addition to detecting the expression of the TaG-EM barcodes in the 10x Genomics gene expression library, it is possible to make a separate library that enriches the barcode sequence (similar to hashtag or CITE-Seq feature barcode libraries). We have added experimental data where we successfully performed an enrichment of the TaG-EM barcodes and sequenced this as a separate hashtag library (Supplemental Figure 18). We have added text to the results describing this work and also included detailed information in the methods for performing TaG-EM barcode enrichment during 10x library prep.

      Results:

      “In antibody-conjugated oligo cell hashing approaches, sparsity of barcode representation is overcome by spiking in an additional primer at the cDNA amplification step and amplifying the hashtag oligo by PCR. We employed a similar approach to attempt to enrich for TaG-EM barcodes in an additional library sequenced separately from the 10x Genomics gene expression library. Our initial attempts at barcode enrichment using spike-in and enrichment primers corresponding to the TaG-EM PCR handle were unsuccessful (Supplemental Figure 18). However, we subsequently optimized the TaG-EM barcode enrichment by 1) using a longer spike-in primer that more closely matches the annealing temperature used during the 10x Genomics cDNA creation step, and 2) using a nested PCR approach to amplify the cell-barcode and unique molecular identifier (UMI)-labeled TaG-EM barcodes (Supplemental Figure 18). Using the enriched library, TaG-EM barcodes were detected in nearly 100% of the cells at high sequencing depths (Supplemental Figure 19). However, although we used a polymerase that has been engineered to have high processivity and that has been shown to reduce the formation of chimeric reads in other contexts (Gohl et al., 2016), it is possible that PCR chimeras could lead to unreliable detection events for some cells. Indeed, many cells had a mixture of barcodes detected with low counts and single or low numbers of associated UMIs. To assess the reliability of detection, we analyzed the correlation between barcodes detected in the gene expression library and the enriched TaG-EM barcode library as a function of the purity of TaG-EM barcode detection for each cell (the percentage of the most abundant detected TaG-EM barcode, Supplemental Figure 19).
      For TaG-EM barcode detections where the most abundant barcode was a high percentage of the total barcode reads detected (~75%-99.99%), there was a high correlation between the barcode detected in the gene expression library and the enriched TaG-EM barcode library. Below this threshold, the correlation was substantially reduced.

      In the enriched library, we identified 26.8% of cells with a TaG-EM barcode reliably detected, a very modest improvement over the gene expression library alone (23.96%), indicating that at least for this experiment, the main constraint is sufficient expression of the TaG-EM barcode and not detection. To identify TaG-EM barcodes in the combined data set, we counted a positive detection as any barcode either identified in the gene expression library or any barcode identified in the enriched library with a purity of >75%. In the case of conflicting barcode calls, we assigned the barcode that was detected directly in the gene expression library. This increased the total fraction of cells where a barcode was identified to approximately 37% (Figure 6B).”
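
      The combined calling rule described above can be illustrated with a minimal Python sketch. It is not the code from the authors' repository: the function names, the use of per-cell read counts as input, and the assumption of at most one gene expression call per cell are hypothetical choices made for illustration. Only the 75% purity cutoff and the priority given to the gene expression library come from the text.

```python
from collections import Counter

PURITY_THRESHOLD = 0.75  # purity cutoff quoted in the text (>75%)


def enriched_call(barcode_counts):
    """Return (top_barcode, purity) for one cell's enriched-library counts.

    barcode_counts maps a TaG-EM barcode name to its read count in that cell
    (hypothetical input format). Purity is the fraction of all barcode reads
    that belong to the most abundant barcode.
    """
    total = sum(barcode_counts.values())
    if total == 0:
        return None, 0.0
    top, top_count = Counter(barcode_counts).most_common(1)[0]
    return top, top_count / total


def assign_barcode(gex_barcode, enriched_counts, threshold=PURITY_THRESHOLD):
    """Combine the two libraries for one cell.

    A barcode seen directly in the gene expression library always wins,
    which also resolves conflicting calls. Otherwise the enriched-library
    call is accepted only if its purity exceeds the threshold.
    """
    if gex_barcode is not None:
        return gex_barcode
    top, purity = enriched_call(enriched_counts)
    if top is not None and purity > threshold:
        return top
    return None
```

      Under these assumptions, a cell with no gene expression call and enriched-library counts of 90 reads for one barcode and 10 for another has a purity of 0.9 and is assigned the first barcode, whereas a 60/40 split leaves the cell unassigned.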

      Methods:

      “The resulting pool was prepared for sequencing following the 10x Genomics Single Cell 3’ protocol (version CG000315 Rev C). At step 2.2 of the protocol, cDNA amplification, 1 µl of TaG-EM spike-in primer (10 µM) was added to the reaction to amplify cDNA with the TaG-EM barcode. Gene expression cDNA and TaG-EM cDNA were separated using a double-sided SPRIselect (Beckman Coulter) bead clean up following 10x Genomics Single Cell 3’ Feature Barcode protocol, step 2.3 (version CG000317 Rev E). The gene expression cDNA was made into a library following the CG000315 Rev C protocol starting at section 3. Custom nested primers were used for enrichment of TaG-EM barcodes after cDNA creation using PCR. The following primers were tested (see Supplemental Figure 18):

      UMGC_IL_TaGEM_SpikeIn_v1:

      GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCTTCCAACAACCGGAAGT*G*A

      UMGC_IL_TaGEM_SpikeIn_v2:

      GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCAGCTTATAACTTCCAACAACCGGAAGT*G*A

      UMGC_IL_TaGEM_SpikeIn_v3:

      TGTGCTCTTCCGATCTGCAGCTTATAACTTCCAACAACCGGAAGT*G*A

      D701_TaGEM:

      CAAGCAGAAGACGGCATACGAGATCGAGTAATGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCAGC*T*T

      SI PCR Primer:

      AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGC*T*C

      UMGC_IL_DoubleNest:

      GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCAGCTTATAACTTCCAACAACCGG*A*A

      P5: AATGATACGGCGACCACCGA

      D701:

      GATCGGAAGAGCACACGTCTGAACTCCAGTCACATTACTCGATCTCGTATGCCGTCTTCTGCTTG

      D702:

      GATCGGAAGAGCACACGTCTGAACTCCAGTCACTCCGGAGAATCTCGTATGCCGTCTTCTGCTTG

      After multiple optimization trials, the following steps yielded ~96% on-target reads for the TaG-EM library (Supplemental Figure 18; note that for the enriched barcode data shown in Figure 6 and Supplemental Figure 19, a similar amplification protocol was used, with TaG-EM barcodes amplified from the gene expression library cDNA and not the SPRI-selected barcode pool). TaG-EM cDNA was amplified with the following PCR reaction: 5 µl purified TaG-EM cDNA, 50 µl 2x KAPA HiFi ReadyMix (Roche), 2.5 µl UMGC_IL_DoubleNest primer (10 µM), 2.5 µl SI_PCR primer (10 µM), and 40 µl nuclease-free water. The reaction was amplified using the following cycling conditions: 98ºC for 2 minutes, followed by 15 cycles of 98ºC for 20 seconds, 63ºC for 30 seconds, 72ºC for 20 seconds, followed by 72ºC for 5 minutes. After the first PCR, the amplified cDNA was purified with a 1.2x SPRIselect (Beckman Coulter) bead cleanup with 80% ethanol washes and eluted into 40 µL of nuclease-free water. A second round of PCR was run with the following reaction: 5 µl purified TaG-EM cDNA, 50 µl 2x KAPA HiFi ReadyMix (Roche), 2.5 µl D702 primer (10 µM), 2.5 µl P5 primer (10 µM), and 40 µl nuclease-free water. The reaction was amplified using the following cycling conditions: 98ºC for 2 minutes, followed by 10 cycles of 98ºC for 20 seconds, 63ºC for 30 seconds, 72ºC for 20 seconds, followed by 72ºC for 5 minutes. After the second PCR, the amplified cDNA was purified with a 1.2x SPRIselect (Beckman Coulter) bead cleanup with 80% ethanol washes and eluted into 40 µL of nuclease-free water. The resulting 3’ gene expression library and TaG-EM enrichment library were sequenced together following Scenario 1 of the BioLegend “Total-Seq-A Antibodies and Cell Hashing with 10x Single Cell 3’ Reagents Kit v3 or v3.1” protocol. Additional sequencing of the enriched TaG-EM library was also done following Scenario 2 from the same protocol.”
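
      As a quick arithmetic check on the reaction setup above, the following Python sketch sums the listed component volumes and computes final concentrations by simple dilution. The volumes and stock concentrations are taken from the text; the dictionary keys and the helper name `final_concentration` are illustrative only, and the same numbers apply to both PCR rounds because the two reactions differ only in primer identity.

```python
def final_concentration(stock_conc, stock_volume, total_volume):
    """Dilution formula C_final = C_stock * V_stock / V_total (same units in and out)."""
    return stock_conc * stock_volume / total_volume


# Component volumes in microlitres, as listed for each nested PCR round.
reaction_ul = {
    "purified TaG-EM cDNA": 5.0,
    "2x KAPA HiFi ReadyMix": 50.0,
    "primer 1 (10 uM stock)": 2.5,
    "primer 2 (10 uM stock)": 2.5,
    "nuclease-free water": 40.0,
}

total_ul = sum(reaction_ul.values())

# Each primer is diluted from a 10 uM stock into the full reaction volume.
primer_final_um = final_concentration(10.0, 2.5, total_ul)

# The 2x master mix should end up at 1x in the final reaction.
readymix_final_x = final_concentration(2.0, 50.0, total_ul)

# Total amplification cycles across the two nested rounds (15 + 10).
total_cycles = 15 + 10
```

      With the volumes as listed, each reaction totals 100 µl, each primer is at a final concentration of 0.25 µM, and the ReadyMix is at 1x.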

      When a given cell barcode is not associated with any TaG-EM barcode, then demultiplexing is impossible. This is a major problem, which is particularly visible in Figs 5 and S13. In 5F, BC4 is only detected in a couple of dozen cells, even though the Jon99Ciii marker of enterocytes is present in a much larger population (Fig 5C). Therefore, in this particular case, TaG-EM fails to detect most of the GFP-expressing cells. 

      Figure 5 in the original manuscript represented data from an experiment in which there were eight different TaG-EM barcoded samples present, including four replicates of the pan-midgut driver (each of which included enterocyte populations). One would not expect the BC4 enterocyte driver expression to be observed in all of the Jon99Ciii cells, since the majority of the GFP+ cells shown in the UMAP plot were likely derived from and are labeled by the pan-midgut driver-associated barcodes. Thus, the design and presentation of this particular experiment (in particular, the presence of eight distinct samples in the data set) is making the detection of the TaG-EM barcodes look sparser than it actually is. We have added a panel in both Figure 6B and Supplemental Figure 17B that shows the overall detection of barcodes in the enriched barcode library and gene expression library or the gene expression library only, respectively, for this experiment.

      However, the reviewer’s overall point regarding barcode detection is still valid in that if we consider all eight barcodes, we only see TaG-EM barcode labeling associated with about a quarter of all the cells in this gene expression library, or about 37% of cells when we include the enriched TaG-EM barcode library. While improving barcode detection will improve the yield and is necessary for some applications (such as robust detection of multiplets), we would argue that even at the current level of success this approach has significant utility. First, if one’s goal is to unambiguously label a cell cluster and trace it to a defined cell population in vivo, sparse labeling may be sufficient. Second, demultiplexing is still possible (as we demonstrate) but involves a trade-off in yield (not every cell is recovered and there is some extra sequencing cost as some sequenced cells cannot be assigned to a barcode).

      Similarly, in S13, most cells should express one of the four barcodes, however many of them (maybe up to half - this should be quantified) do not. Therefore, the claim (L277-278) that "the pan-midgut driver were broadly distributed across the cell clusters" is misleading. Moreover, the hypothesis that "low expressing driver lines may result in particularly sparse labelling" (L331-333) is at least partially wrong, as Fig S13 shows that the same Gal4 driver can lead to very different levels of barcode coverage.

      As described above, since this experiment included eight different TaG-EM barcodes expressed by five different drivers, the expectation is that only about half of the cells in Figure S13 (now Figure S20) should express a TaG-EM barcode. It is not clear why BC2 is underrepresented in terms of the number of cells labeled and BC7 is overrepresented. We agree with the reviewer that this should be described more accurately in the paper and that it does impact our interpretation related to driver strength and barcode detection. We have revised this sentence in the discussion and also added additional text in the results describing the within-driver variability seen in this experiment.

      Results text:

      “As expected, the barcodes expressed by the pan-midgut driver were broadly distributed across the cell clusters (Supplemental Figure 20). However, the number of cells recovered varied significantly among the four pan-midgut driver associated barcodes.”

      Discussion text:

      “It is likely that the strength of the Gal4 driver contributes to the labeling density. However, we also observed variable recovery of TaG-EM barcodes that were all driven by the same pan-midgut Gal4 driver (Supplemental Figure 20).”

      • Comparisons between TaG-EM and other, simpler methods for labelling individual cell populations are missing. For example, how would TaG-EM compare with expression of different fluorescent reporters, or a strategy based on the brainbow/flybow principle?

      The advantage of TaG-EM is that an arbitrarily large number of DNA barcodes can be used (contingent upon the availability of transgenic lines – we described 20 barcoded lines in our initial manuscript and we have now extended this collection to over 170 lines), while the number of distinguishable FPs is much lower. Brainbow/Flybow uses combinatorial expression of different FPs, but because this combinatorial expression is stochastic, tracing a single cell transcriptome to a defined cell population in vivo based on the FP signature of a Brainbow animal would likely not be possible (and would almost certainly be impossible at scale).

      • FACS data is missing throughout the paper. The authors should include data from their comparative flow cytometry experiment of TaG-EM cells with or without additional hexameric GFP, as well as FSC/SSC and fluorescence scatter plots for the FACS steps that they performed prior to scRNA-seq, at least in supplementary figures.

      We have added Supplemental Figures with the FACS data for all of the single cell sequencing data presented in the manuscript (Supplemental Figures 12 and 14).

      • The authors should show the whole data described in L229, including the cluster that they chose to delete. At least, they should provide more information about how many cells were removed. In any case, the fact that their data still contains a large number of debris and dead cells despite sorting out PI negative cells with FACS and filtering low abundance barcodes with Cellranger is concerning.

      This description was referring to the unprocessed Cellranger output (not filtered for low abundance barcodes). Prior to filtering for cell barcodes with high mitochondria or rRNA (or other processing in Seurat/Scanpy), we saw two clusters, one with low UMI counts and enrichment of mitochondrial genes (see Cellranger report below). 

      Author response image 1.

      These cell barcodes were removed by downstream quality filtering and the remaining cells showed expression of expected intestinal stem cell and enteroblast marker genes.

      Overall, although a method for genetic tagging cell populations prior to multiplexing in single-cell experiments would be extremely useful, the method presented here is inadequate. However, despite all the weaknesses listed above, the idea of barcodes expressed specifically in cells of interest deserves more consideration. If the authors manage to improve their design to resolve the major issues and demonstrate the benefits of their method more clearly, then TaG-EM could become an interesting option for certain applications.

      We thank the reviewer for this comment and hope that the above responses and additional experiments and data that we have added have helped to alleviate the noted weaknesses.

      Reviewer #2 (Public Review):

      In this manuscript, Mendana et al developed a multiplexing method - Targeted Genetically-Encoded Multiplexing or TaG-EM - by inserting a DNA barcode upstream of the polyadenylation site in a Gal4-inducible UAS-GFP construct. This Multiplexing method can be used for population-scale behavioral measurements or can potentially be used in single-cell sequencing experiments to pool flies from different populations. The authors created 20 distinctly barcoded fly lines. First, TaG-EM was used to measure phototaxis and oviposition behaviors. Then, TaG-EM was applied to the fly gut cell types to demonstrate its applications in single-cell RNA-seq for cell type annotation and cell origin retrieving.

      This TaG-EM system can be useful for multiplexed behavioral studies from next-generation sequencing (NGS) of pooled samples and for Transcriptomic Studies. I don't have major concerns for the first application, but I think the scRNA-seq part has several major issues and needs to be further optimized.

      Major concerns:

      (1) It seems the barcode detection rate is low according to Fig S9 and Fig 5F, J and N. Could the authors evaluate the detection rate? If the detection rate is too low, it can cause problems when it is used to decode cell types.

      See responses to Reviewer #1 on this topic above.  

      (2) Unsuccessful amplification of TaG-EM barcodes: The authors attempted to amplify the TaG-EM barcodes in parallel to the gene expression library preparation but encountered difficulties, as the resulting sequencing reads were predominantly off-target. This unsuccessful amplification raises concerns about the reliability and feasibility of this amplification approach, which could affect the detection and analysis of the TaG-EM barcodes in future experiments.

      As noted above, we have now established a successful amplification protocol for the TaG-EM barcodes. These data are shown in Figure 6 and Supplemental Figures 18-19, and we have included detailed information in the methods for performing TaG-EM barcode enrichment during 10x library prep. We have also included code in the paper’s Github repository for assigning TaG-EM barcodes from the enriched library to the associated 10x Genomics cell barcodes.

      (3) For Fig 5, the single-cell clusters are not annotated. It is not clear which cell types correspond to which clusters. So, it is difficult to evaluate the accuracy of the assignment of barcodes.

      We have added annotation information for the cell clusters based on expression of cell-type-specific marker genes (Figure 6A, Supplemental Figures 16-17).

      (4) The scRNA-seq UMAP in Fig 5 is a bit strange to me. The fly gut epithelium contains only a few major cell types, including ISC, EB, EC, and EE. However, the authors showed 38 clusters in fig 5B. It is true that some cell types, like EE (Guo et al., 2019, Cell Reports), have sub-populations, but I don't expect they will form these many subtypes. There are many peripheral small clusters that are not shown in other gut scRNAseq studies (Hung et al., 2020; Li et al., 2022 Fly Cell Atlas; Lu et al., 2023 Aging Fly Cell Atlas). I suggest the authors try different data-processing methods to validate their clustering result.

      For all of the single cell experiments, after doublet and ambient RNA removal (as suggested below), we have reclustered the datasets and evaluated different resolutions using Clustree. As the Reviewer points out, there are different EE subtypes, as well as regionalized expression differences in EC and other cell populations, so more than four clusters are expected (an analysis of the adult midgut identified 22 distinct cell types). With this revised analysis our results more closely match the cell populations observed in other studies (though it should be noted that the referenced studies largely focus on the adult and not the larval stage).  

      (5) Different gut drivers, PMC-, PC-, EB-, EC-, and EE-GAL4, were used. The authors should carefully characterize these GAL4 expression in larval guts and validate sequencing data. For example, does the ratio of each cell type in Fig 5B reflect the in vivo cell type ratio? The authors used cell-type markers mostly based on the knowledge from adult guts, but there are significant morphological and cell ratio differences between larval and adult guts (e.g., Mathur...Ohlstein, 2010 Science).

      We have characterized the PC driver which is highlighted in Supplemental Figure 13, and the EC and EE drivers which are highlighted in Figure 6G-N in detail in larval guts and have added this data to the paper (Supplemental Figure 21). The EB driver was not characterized histologically as EB-specific antibodies are not currently available. The PMG-Gal4 line exhibits strong expression throughout the larval gut (Figure 5B) and barcodes are recovered from essentially all of the larval gut cell clusters using this driver (Supplemental Figure 20). We don’t necessarily expect the ratios of cells observed in the scRNA-Seq data to reflect the ratios typically observed in the gut as we performed pooled flow sorting on a multiplexed set of eight genotypes, and driver expression levels, flow sorting, and possibly other processing steps could all influence the relative abundance of different cell types. However, detailed characterization of these driver lines did reveal spatial expression patterns that help explain aspects of the scRNA-Seq data. We have also added the following text to the paper to further describe the characterization of the drivers:

      Results:

      “Detailed characterization of the EC-Gal4 line indicated that although this line labeled a high percentage of enterocytes, expression was restricted to an area at the anterior and middle of the midgut, with gaps between these regions and at the posterior (Supplemental Figure 21). This could explain the absence of subsets of enterocytes, such as those labeled by betaTry, which exhibits regional expression in R2 of the adult midgut (Buchon et al., 2013).”

      “Detailed characterization of the EE-Gal4 driver line indicated that ~80-85% of Prospero-positive enteroendocrine cells are labeled in the anterior and middle of the larval midgut, with a lower percentage (~65%) of Prospero-positive cells labeled in the posterior midgut (Supplemental Figure 21). As with the enterocyte labeling, and consistent with the Gal4 driver expression pattern, the EE-Gal4 expressed TaG-EM barcode 9 did not label all classes of enteroendocrine cells and other clusters of presumptive enteroendocrine cells expressing other neuropeptides such as Orcokinin, AstA, and AstC, or neuropeptide receptors such as CCHa2 (not shown) were also observed.”

      Methods:

      “Dissection and immunostaining

      Midguts from third instar larvae of driver lines crossed to UAS-GFP.nls or UAS-mCherry were dissected in 1xPBS and fixed with 4% paraformaldehyde (PFA) overnight at 4ºC. Fixed samples were washed with 0.1% PBTx (1xPBS + 0.1% Triton X-100) three times for 10 minutes each and blocked in PBTxGS (0.1% PBTx + 3% Normal Goat Serum) for 2–4 hours at RT. After blocking, midguts were incubated in primary antibody solution overnight at 4ºC. The next day samples were washed with 0.1% PBTx three times for 20 minutes each and were incubated in secondary antibody solution for 2–3 hours at RT (protected from light) followed by three washes with 0.1% PBTx for 20 minutes each. One µg/ml DAPI solution prepared in 0.1% PBTx was added to the sample and incubated for 10 minutes followed by washing with 0.1% PBTx three times for 10 minutes each. Finally, samples were mounted on a slide glass with 70% glycerol and imaged using a Nikon AX R confocal microscope. Confocal images were processed using Fiji software. 

      The primary antibodies used were rabbit anti-GFP (A6455, 1:1000, Invitrogen), mouse anti-mCherry (3A11, 1:20, DSHB), mouse anti-Prospero (MR1A, 1:50, DSHB) and mouse anti-Pdm1 (Nub 2D4, 1:30, DSHB). The secondary antibodies used were goat anti-mouse and goat anti-rabbit IgG conjugated to Alexa 647 and Alexa 488 (1:200) (Invitrogen), respectively. Five larval gut specimens per Gal4 line were dissected and examined.”

      (6) Doublets are removed based on the co-expression of two barcodes in Fig 5A. However, there are also other possible doublets, for example, from the same barcode cells or when one cell doesn't have detectable barcode. Did the authors try other computational approaches to remove doublets, like DoubletFinder (McGinnis et al., 2019) and Scrublet (Wolock et al., 2019)?

      We have included DoubletFinder-based doublet removal in our data analysis pipeline. This is now described in the methods (see below).

      (7) Did the authors remove ambient RNA which is a common issue for scRNA-seq experiments?

      We have also used DecontX to remove ambient RNA. This is now described in the methods:

      “Datasets were first mapped and analyzed using the Cell Ranger analysis pipeline (10x Genomics). A custom Drosophila genome reference was made by combining the BDGP.28 reference genome assembly and Ensembl gene annotations. Custom gene definitions for each of the TaG-EM barcodes were added to the fasta genome file and .gtf gene annotation file. A Cell Ranger reference package was generated with the Cell Ranger mkref command. Subsequent single-cell data analysis was performed using the R package Seurat (Satija et al., 2015). Cells expressing fewer than 200 genes and genes expressed in fewer than three cells were filtered from the expression matrix. Next, percent mitochondrial reads, percent ribosomal reads, cell counts, and cell features were graphed to determine optimal filtering parameters. DecontX (Yang et al., 2020) was used to identify empty droplets, to evaluate ambient RNA contamination, and to remove empty cells and cells with high ambient RNA expression. DoubletFinder (McGinnis et al., 2019) was used to identify droplet multiplets and remove cells classified as multiplets. Clustree (Zappia and Oshlack, 2018) was used to visualize different clustering resolutions and to determine the optimal clustering resolution for downstream analysis. Finally, SingleR (Aran et al., 2019) was used for automated cell annotation with a gut single-cell reference from the Fly Cell Atlas (Li et al., 2022). The dataset was manually annotated using the expression patterns of marker genes known to be associated with cell types of interest. To correlate TaG-EM barcodes with cell IDs in the enriched TaG-EM barcode library, a custom Python script was used (TaGEM_barcode_Cell_barcode_correlation.py), which is available via Github: https://github.com/darylgohl/TaG-EM.”
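
      The initial expression-matrix filter described in the methods above (a minimum of 200 detected genes per cell and a minimum of three expressing cells per gene) can be sketched in plain Python as follows. This is an illustration only and not the authors' Seurat code: the function name, the nested-list input format, and the choice to compute both thresholds on the unfiltered matrix are assumptions, since the order of the two filters is not stated in the text.

```python
def filter_counts(counts, min_genes_per_cell=200, min_cells_per_gene=3):
    """Return (kept_gene_indices, kept_cell_indices).

    counts[g][c] is the UMI count for gene g in cell c (hypothetical layout).
    A gene counts as detected in a cell when its count is greater than zero.
    Cells with fewer than min_genes_per_cell detected genes are dropped, and
    genes detected in fewer than min_cells_per_gene cells are dropped.
    """
    n_genes = len(counts)
    n_cells = len(counts[0]) if counts else 0

    # Number of detected genes in each cell.
    genes_per_cell = [
        sum(1 for g in range(n_genes) if counts[g][c] > 0)
        for c in range(n_cells)
    ]
    # Number of cells in which each gene is detected.
    cells_per_gene = [
        sum(1 for c in range(n_cells) if counts[g][c] > 0)
        for g in range(n_genes)
    ]

    keep_cells = [c for c in range(n_cells) if genes_per_cell[c] >= min_genes_per_cell]
    keep_genes = [g for g in range(n_genes) if cells_per_gene[g] >= min_cells_per_gene]
    return keep_genes, keep_cells
```

      The default thresholds mirror the values quoted in the methods; smaller values can be passed for toy matrices.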

      (8) Why does TaG-EM barcode #4, driven by EC-GAL4, not label other classes of enterocyte cells such as betaTry+ positive ECs (Figures 5D-E)? similarly, why does TaG-EM barcode #9, driven by EE-GAL4, not label all EEs? Again, it is difficult to evaluate this part without proper data processing and accurate cell type annotation.

      As noted in the response to a comment by Reviewer #1 above, part of this apparent sparsity of labeling is due to the way that this experiment was designed and visualized. We have added a new Figure panel in both Figure 6B and Supplemental Figure 17B that shows the overall detection of barcodes in the enriched barcode library and gene expression library or the gene expression library only, respectively, to better illustrate the efficacy of barcode detection. See also the response to point 5 above. The lack of labelling of both betaTry+ ECs and subsets of EEs is consistent with the expression patterns of the EC-Gal4 and EE-Gal4 drivers.

      (9) For Figure 2, when the authors tested different combinations of groups with various numbers of barcodes. They found remarkable consistency for the even groups. Once the numbers start to increase to 64, barcode abundance becomes highly variable (range of 12-18% for both male and female). I think this would be problematic because the differences seen in two groups for example may be due to the barcode selection rather than an actual biologically meaningful difference.

      While there is some barcode-to-barcode variability for different amplification conditions, the magnitude of this variation is relatively consistent across the conditions tested. We looked at the coefficient of variation for the evenly pooled barcodes or for the staggered barcodes pooled at different relative levels. While the absolute magnitude of the variation is higher for the highly abundant barcodes in the staggered conditions, the CVs for these conditions (0.186 for female flies and 0.163 for male flies) were only slightly above the mean CV (0.125) for all conditions (see Supplemental Figure 3).

      We have added this analysis as Supplemental Figure 3 and added the following text to the paper:

      “The coefficients of variation were largely consistent for groups of TaG-EM barcodes pooled evenly or at different levels within the staggered pools (Supplemental Figure 3).”
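
      For readers who want to reproduce this kind of check, the coefficient of variation can be computed as in the minimal sketch below. The read fractions are hypothetical placeholders rather than data from the paper, and the use of the sample standard deviation is an assumption, since the response does not state which estimator was used.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean.

    Assumption: sample (n - 1) standard deviation; the source does not
    specify the estimator.
    """
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m

# Hypothetical read fractions (percent) for four barcodes pooled at equal input.
even_pool = [4.0, 5.0, 6.0, 5.0]
cv = coefficient_of_variation(even_pool)
print(f"CV = {cv:.3f}")
```

      Because the CV is scaled by the mean, it allows pools with very different absolute abundances (for example, the staggered pools) to be compared on the same footing.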

      (10) Barcode #14 cannot be reliably detected in the oviposition experiment. This suggests that the BC 14 fly line might have additional mutations in the attp2 chromosome arm that affect this behavior. Perhaps other barcode lines also have unknown mutations that would cause issues for other untested behaviors. One possible solution is to backcross all 20 lines to wild-type flies of the same genetic background for >7 generations so that all these lines have the same (or a very similar) genetic background. This strategy is common for aging and behavior assays.

      See response to Reviewer #1 above on this topic.

      Reviewer #3 (Public Review):

      The work addresses challenges in linking anatomical information to transcriptomic data in single-cell sequencing. It proposes a method called Targeted Genetically-Encoded Multiplexing (TaG-EM), which uses genetic barcoding in Drosophila to label specific cell populations in vivo. By inserting a DNA barcode near the polyadenylation site in a UAS-GFP construct, cells of interest can be identified during single-cell sequencing. TaG-EM enables various applications, including cell type identification, multiplet droplet detection, and barcoding experimental parameters. The study demonstrates that TaG-EM barcodes can be decoded using next-generation sequencing for large-scale behavioral measurements. Overall, the results are solid in supporting the claims and will be useful for a broader fly community. I have only a few comments below:

      We thank the reviewer for these positive comments.

      Specific comments:

      (1) The authors mentioned that the results of structure pool tests in Fig. 2 showed a high level of quantitative accuracy in detecting the TaG-EM barcode abundance. Although the data were generally consistent with the input values in most cases, there were some obvious exceptions such as barcode 1 (under-represented) and barcodes 15, 20 (overrepresented). It would be great if the authors could comment on these and provide a guideline for choosing the appropriate barcode lines when implementing this TaG-EM method.

      See the response to point 9 from Reviewer 2. Although there seem to be some systematic differences in barcode amplification, the coefficient of variation was relatively consistent across all of the barcode combinations and relative input levels that we examined. Our recommendation (described in the text) is to average across 3-4 independent barcodes (which yielded R² values of >0.99 with expected abundance in the structured pool tests).

      (2) In Supplemental Figure 6, the authors showed GFP antibody staining data with 20 different TaG-EM barcode lines. The variability in GFP antibody staining results among these different TaG-EM barcode lines concerns the use of these TaG-EM barcode lines for sequencing followed by FACS sorting of native GFP. I expected the native GFP expression would be weaker and much more variable than the GFP antibody staining results shown in Supplemental Figure 6. If this is the case, variation of tissue-specific expression of TaG-EM barcode lines will likely be a confounding factor.

      Aside from barcode 8, which had a mutation in the GFP coding sequence, we did not see significant variability in expression levels in the wing disc. Subtle differences seen in this figure most likely result from differences in larval staging. Similarly consistent native (unstained) GFP expression of the TaG-EM constructs was seen in crosses with Mhc-Gal4 (described above).

      (3) As the authors mentioned in the manuscript, multiple barcodes for one experimental condition would be a better experimental design. Could the authors suggest a recommended number of barcodes for each experimental condition? 3? 4? Or more?

      See response to Reviewer #3, point number 1 above.

      (3b) Also, it would be great if the authors could provide a short discussion on the cost of such TaG-EM method. For example, for the phototaxis assay, if it is much more expensive to perform TaG-EM as compared to manually scoring the preference index by videotaping, what would be the practical considerations or benefits of doing TaG-EM over manual scoring?

      While this will vary depending on the assay and the scale at which one is conducting experiments, we have added an analysis of labor savings for the larval gut motility assay (Supplemental Figure 8). We have also added the following text to the Discussion describing some of the trade-offs to consider in assessing the potential benefit of incorporating TaG-EM into behavioral measurements:

      “While the utility of TaG-EM barcode-based quantification will vary based on the number of conditions being analyzed and the ease of quantifying the behavior or phenotype by other means, we demonstrate that TaG-EM can be employed to cost-effectively streamline labor-intensive assays and to quantify phenotypes with small effect sizes (Figure 4, Supplemental Figure 8).”

      Recommendations for the authors:  

      While recognising the potential of the TaG-EM methodology, we had a few major concerns that the authors might want to consider addressing:

      As stated above, we are grateful to the reviewers and editor for their thoughtful comments. We have addressed many of the points below in our responses above, so we will briefly respond to these points and where relevant direct the reader to comments above.

      (1) We were concerned about the efficacy of TaG-EM in assessing more complex behaviours than oviposition and phototaxis. We note that Barcode #14 cannot be reliably detected in the oviposition experiment. This suggests that the BC 14 fly line might have additional mutations in the attp2 chromosome arm that affect this behavior. Perhaps other barcode lines also have unknown mutations that would cause issues for other untested behaviors. One possible solution is to back-cross all 20 lines to wild-type flies of the same genetic background for >7 generations so that all these lines have the same (or a very similar) genetic background. This strategy is common for aging and behavior assays.

      See response to Reviewer #1 and Reviewer #2, item 10, above.

      (2) We were unable to assess the drop-out rates of the TaG-EM barcode from the sequencing. The barcode detection rate is low (Fig S9 and Fig 5F, J and N). This would be a considerable drawback (relating to both experimental design and cost), if a large proportion of the cells could not be assigned an identity.

      See comments above addressing this point.

      (3) TaG-EM scRNA-seq on the larval gut is not very effective - the cells are not well annotated, the barcodes seem not to have labelled expected cell types (ECs and EEs), and there is no validation of the Gal4 drivers in vivo.

      See previous comments. We have addressed specific comments above on data processing and annotation, included a visualization of the overall effectiveness of labeling, added a protocol and data on enriched TaG-EM barcode libraries, and have added detailed characterization of the Gal4 drivers in the larval gut (Figure 6, Supplemental Figures 17-21).

      (4) A formal assessment of the cost-effectiveness would be an important consideration in broad uptake of the methodology.

      While this is difficult to do in a comprehensive manner given the breadth of potential applications, we have included estimates of labor savings for one of the behavioral assays that we tested (Supplemental Figure 8). We have also included a discussion of some of the factors that would make TaG-EM useful or cost-effective to apply for behavioral assays (see response to Reviewer #3, comment 3b, above). We have also added the following text to the discussion to address the cost considerations in applying TaG-EM for scRNA-Seq:

      “For single cell RNA-Seq experiments, the cost savings of multiplexing is roughly the cost of a run divided by the number of independent lines multiplexed, plus labor savings by also being able to multiplex upstream flow cytometry, minus loss of unbarcoded cells. Our experiments indicated that for the specific drivers we tested TaG-EM barcodes are detected in around one quarter of the cells if relying on endogenous expression in the gene expression library, though this fraction was higher (~37%) if sequencing an enriched TaG-EM barcode library in parallel (Figure 6, Supplemental Figures 18-19).”

      (5) Similarly, a formal assessment of the effect of the insertion on the variability in GFP expression and the behaviour needs to be documented.

      See responses to Reviewer #1, Reviewer #2, item 9, and Reviewer #3, item 2 above.

      Reviewer #1 (Recommendations For The Authors):

      (in no particular order of importance)

      • L84-85: the authors should either expand, or remove this statement. Indeed, lack of replicates is only true if one ignores that each cell in an atlas is indeed a replicate. Therefore, depending on the approach or question, this statement is inaccurate.

      This sentence was meant to refer to experiments where different experimental conditions are being compared and not to more descriptive studies such as cell atlases. We have revised this sentence to clarify.

      “Outside of descriptive studies, these costs are also a barrier to including replicates to assess biological variability; consequently, a lack of biological replicates derived from independent samples is a common shortcoming of single-cell sequencing experiments.”

      • L103-104: this sentence is unclear.

      We have revised this sentence as follows:

      “Genetically barcoded fly lines can also be used to enable highly multiplexed behavioral assays which can be read out using high throughput sequencing.”

      • In Fig S1 it is unclear why there are more than 20 different sequences in panel B where the text and panel A only mention the generation of 20 distinct constructs. This should be better explained.

      The following text was added to the Figure legend to explain this discrepancy:

      “Because the TaG-EM barcode constructs were injected as a pool of 29 purified plasmids, some of the transgenic lines had inserts of the same construct. In total 20 unique lines were recovered from this round of injection.”

      • It would be interesting to compare the efficiency of TaG-EM driven doublet removal (Fig 5A) with standard doublet-removing software (e.g., DoubletFinder, McGinnis et al., 2019).

      We have done this comparison, which is now shown in Supplemental Figure 15.

      • I would encourage the authors to check whether barcode representation in Fig S13 can be correlated to average library size, as one would expect libraries with shorter reads to be more likely to include the 14-bp barcode and therefore more accurately recapitulate TaG-EM barcode expression.

      These are not independent sequencing libraries, but rather data from barcodes that were multiplexed in a single flow sort, 10x droplet capture, and sequencing library. Thus, there must be some other variable that explains the differential recovery of these barcodes.

      • Fig 4A should appear earlier in the paper.

      We have moved Figure 4A from the previous manuscript (a schematic showing the detailed design of the TaG-EM construct) to Figure 1A in the revised version.

      Reviewer #2 (Recommendations For The Authors):

      Minor:

      (1) There is a typo for Fig S13 figure legends: BC1, BC1, BC3... should be BC1, BC2, BC3.

      Fixed.

      Reviewer #3 (Recommendations For The Authors):

      Comments to authors:

      (1) It would be great if the authors could provide an additional explanation on how these 29 barcode sequences were determined.

      Response: This information is in the Methods section. For the original cloned plasmids:

      “Expected construct size was verified by diagnostic digest with EcoRI and ApaLI. DNA concentration was determined using a Quant-iT PicoGreen dsDNA assay (Thermo Fisher Scientific) and the randomer barcode for each of the constructs was determined by Sanger sequencing using the following primers:

      SV40_post_R: GCCAGATCGATCCAGACATGA

      SV40_5F: CTCCCCCTGAACCTGAAACA”

      For transgenic flies, after DNA extraction and PCR enrichment (details also in the Methods section):

      “The barcode sequence for each of the independent transgenic lines was determined by Sanger sequencing using the SV40_5F and SV40_PostR primers.”

      (2) Why did the authors choose myr-GFP as the backbone instead of nls-GFP if the downstream application is to perform sequencing?

      We initially chose myr::GFP as we planned to conduct single cell and not single nucleus sequencing, and myr::GFP has the advantage of labeling cell membranes, which could facilitate the characterization or confirmation of cell type-specific expression, particularly in the nervous system. However, we have considered making a version of the TaG-EM construct with a nuclear targeted GFP (thereby enabling “NucEM”). In the Discussion, we mention this possibility as well as the possibility of using a second nuclear-GFP construct in conjunction with TaG-EM lines if nuclear enrichment is desired:

      “In addition, while the original TaG-EM lines were made using a membrane-localized myr::GFP construct, variants that express GFP in other cell compartments such as the cytoplasm or nucleus could be constructed to enable increased expression levels or purification of nuclei. Nuclear labeling could also be achieved by co-expressing a nuclear GFP construct with existing TaG-EM lines in analogy to the use of hexameric GFP described above.”

      Minor comments:

      (1) Line 193, Supplemental Figure 4 should be Supplemental Figure 5

      Fixed.

      (2) Scale bars should be added in Figure 4, Supplemental Figures 6, 7, and 8A.

      We have added scale bars to these figures and also included scale bars in additional Supplemental Figures detailing characterization of the gut driver lines.

      (3) Were Figure 4C and Supplemental Figure 7 data stained with a GFP antibody?

      No, this is endogenous GFP signal. This is now noted in the Figure legends.

      (4) Line 220, specify the three barcode lines (lines #7, 8, 9) in the text. 

      Added this information.

      Same for Lines 251-254. Line 258, which 8 barcode Gal4 line combinations?

      (5) Line 994, typo: (BC1, BC1, BC3, and BC7)-> (BC1, BC2, BC3, and BC7)

      Fixed.

      (6) Figure 5 F, J and N, add EC-Gal4, EB-Gal4, and EE-Gal4 above each panel to improve readability.

      We have added labels of the cell type being targeted (leftmost panels), the barcode, and the marker gene name to Figure 6 C-N.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Thank you for the detailed and constructive reviews. We revised the paper accordingly, and a point-by-point reply appears below. The main changes are:

      • An extended discussion section that places our work in context with other related developments in theory and modeling.

      • A new results section that demonstrates a substantial improvement in performance from a non-linear activation function. This led to addition of a co-author.

      • The mathematical proof that the resolvent of the adjacency matrix leads to the shortest path distances has been moved to a separate article, available as a preprint and attached to this resubmission. This allows us to present that work in the context of graph theory, and focus the present paper on neural modeling.

      Reviewer #1 (Public Review):

      This paper presents a highly compelling and novel hypothesis for how the brain could generate signals to guide navigation towards remembered goals. Under this hypothesis, which the authors call "Endotaxis", the brain co-opts its ancient ability to navigate up odor gradients (chemotaxis) by generating a "virtual odor" that grows stronger the closer the animal is to a goal location. This idea is compelling from an evolutionary perspective and a mechanistic perspective. The paper is well-written and delightful to read.

      The authors develop a detailed model of how the brain may perform "Endotaxis", using a variety of interconnected cell types (point, map, and goal cells) to inform the chemotaxis system. They tested the ability of this model to navigate in several state spaces, representing both physical mazes and abstract cognitive tasks. The Endotaxis model performed reasonably well across different environments and different types of goals.

      The authors further tested the model using parameter sweeps and discovered a critical level of network gain, beyond which task performance drops. This critical level approximately matched analytical derivations.

      My main concern with this paper is that the analysis of the critical gain value (gamma_c) is incomplete, making the implications of these analyses unclear. There are several different reasonable ways in which the Endotaxis map cell representations might be normalized, which I suspect may lead to different results. Specifically, the recurrent connections between map cells may either be an adjacency matrix, or a normalized transition matrix. In the current submission, the recurrent connections are an unnormalized adjacency matrix. In a previous preprint version of the Endotaxis manuscript, the recurrent connections between the map cells were learned using Oja's rule, which results in a normalized state-transition matrix (see "Appendix 5: Endotaxis model and the successor representation" in "Neural learning rules for generating flexible predictions and computing the successor representation", your reference 17). The authors state "In summary, this sensitivity analysis shows that the optimal parameter set for endotaxis does depend on the environment". Is this statement, and the other conclusions of the sensitivity analysis, still true if the learned recurrent connections are a properly normalized state-transition matrix?

      Yes, this is an interesting topic. In v.1 of our bioRxiv preprint we used Oja’s rule for learning, which will converge on a map connectivity that reflects the transition probabilities. The matrix M becomes a left-normalized or right-normalized stochastic matrix, depending on whether one uses the pre-synaptic or the post-synaptic version of Oja’s rule. This is explained well in Appendix 5 of Fang 2023.

      In the present version of the model we use a rule that learns the adjacency matrix A, not the transition matrix T. The motivation is that we want to explain instances of one-shot learning, where an agent acquires a route after traversing it just once. For example, we had found experimentally that mice can execute a complex homing route on the first attempt.

      An agent can establish whether two nodes are connected (adjacency) the very first time it travels from one node to the other. By contrast, it can evaluate the transition probability for that link only after trying this and all the other available links on multiple occasions. Hence the normalization terms in Oja’s rule, or in the rule used by Fang 2023, all involve some time-averaging over multiple visits to the same node. This implements a gradual learning process over many experiences, rather than a one-shot acquisition on the first experience.

      Still one may ask whether there are advantages to learning the transition matrix rather than the adjacency matrix. We looked into this with the following results:

      • The result that (1/γ − A)⁻¹ is monotonically related to the graph distances D in the limit of small γ (a proof now moved to the Meister 2023 preprint) holds also for the transition matrix T. The proof follows the same steps. So in the small gain limit, the navigation model would work with T as well.

      • If one uses the transition matrix to compute the network output (1/γ − T)⁻¹, then the critical gain value is γ_c = 1. It is well known that the largest eigenvalue of any Markov transition matrix is 1, and the critical gain γ_c is the inverse of that. This result is independent of the graph. So this offers the promise that the network could use the same gain parameter γ regardless of the environment.

      • In practice, however, the goal signal turned out to be less robust when based on T than when based on A. We illustrate this with the attached Author response image 1. This replicates the analysis in Figure 3 of the manuscript, using the transition matrix instead of the adjacency matrix. Some observations:

      • Panel B: The goal signal follows an exponential dependence on graph distance much more robustly for the model with A than with T. This holds even for small gain values where the exponential decay is steep.

      • Panel C: As one raises the gain closer to the critical value, the goal signal based on T scatters much more than when based on A.

      • Panels D, E: Navigation based on A works better than based on T. For example, using the highest practical gain value, and a readout noise of ϵ = 0.01, navigation based on T has a range of only 8 steps on this graph, whereas navigation based on A ranges over 12 steps, the full size of this graph.
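
      The two analytical points above (critical gain as the inverse of the largest eigenvalue, and monotonic ordering by graph distance at small gain) can be checked numerically with a minimal sketch such as the following. It is not the code from our notebook: the six-node graph, the function names, and the iteration counts are illustrative assumptions, and only the Python standard library is used.

```python
from collections import deque

def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def largest_eigenvalue(M, iters=2000):
    """Power iteration on M + I; the shift avoids oscillation on bipartite graphs."""
    x = [1.0 + 0.1 * k for k in range(len(M))]  # non-uniform positive start
    scale = 1.0
    for _ in range(iters):
        y = [yk + xk for yk, xk in zip(matvec(M, x), x)]  # (M + I) x
        scale = max(y)
        x = [yk / scale for yk in y]
    return scale - 1.0

def resolvent_column(M, source, gamma, iters=500):
    """Column `source` of (1/gamma - M)^(-1), obtained by iterating
    v <- u + gamma * M v (converges only for gamma below the critical gain)."""
    u = [1.0 if k == source else 0.0 for k in range(len(M))]
    v = u[:]
    for _ in range(iters):
        v = [uk + gamma * wk for uk, wk in zip(u, matvec(M, v))]
    return [gamma * vk for vk in v]

def graph_distances(A, source):
    """Breadth-first search distances from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j, w in enumerate(A[i]):
            if w > 0 and j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return [dist[k] for k in range(len(A))]

# Hypothetical environment: a 6-node path with one shortcut between nodes 1 and 4.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (1, 4)]
n = 6
A = [[0.0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1.0
T = [[a / sum(row) for a in row] for row in A]  # row-normalized transition matrix

gamma_c_A = 1.0 / largest_eigenvalue(A)  # depends on the graph
gamma_c_T = 1.0 / largest_eigenvalue(T)  # equals 1 for any Markov matrix

signal = resolvent_column(A, source=0, gamma=0.01)  # small-gain regime
dist = graph_distances(A, 0)
print(gamma_c_A, gamma_c_T)
print(list(zip(dist, signal)))
```

      In this sketch the critical gain for A depends on the chosen graph, whereas for T it is 1, and at the small gain used here every node that is closer to the source has a strictly larger resolvent entry than any node that is farther away.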

      We have added a section “Choice of learning rule” to explain this. The Author response image 1 is part of the code notebook on Github.

      Author response image 1.

      Overall, this paper provides a very compelling model for how neural circuits may have evolved the ability to navigate towards remembered goals, using ancient chemotaxis circuits.

      This framework will likely be very important for understanding how the hippocampus (and other memory/navigation-related circuits) interfaces with other processes in the brain, giving rise to memory-guided behavior.

      Reviewer #2 (Public Review):

      The manuscript presents a computational model of how an organism might learn a map of the structure of its environment and the location of valuable resources through synaptic plasticity, and how this map could subsequently be used for goal-directed navigation.

      The model is composed of 'map cells', which learn the structure of the environment in their recurrent connections, and 'goal cells', which store the location of valued resources with respect to the map cell population. Each map cell corresponds to a particular location in the environment due to receiving external excitatory input at this location. The synaptic plasticity rule between map cells potentiates synapses when activity above a specified threshold at the pre-synaptic neuron is followed by above-threshold activity at the post-synaptic neuron. The threshold is set such that map neurons are only driven above this plasticity threshold by the external excitatory input, causing synapses to only be potentiated between a pair of map neurons when the organism moves directly between the locations they represent. This causes the weight matrix between the map neurons to learn the adjacency for the graph of locations in the environment, i.e. after learning the synaptic weight matrix matches the environment's adjacency matrix. Recurrent activity in the map neuron population then causes a bump of activity centred on the current location, which drops off exponentially with the diffusion distance on the graph. Each goal cell receives input from the map cells, and also from a 'resource cell' whose activity indicates the presence or absence of a given valued resource at the current location. Synaptic plasticity potentiates map-cell to goal-cell synapses in proportion to the activity of the map cells at time points when the resource cell is active. This causes goal cell activity to increase when the activity of the map cell population is similar to the activity where the resource was obtained. The upshot of all this is that after learning the activity of goal cells decreases exponentially with the diffusion distance from the corresponding goal location. The organism can therefore navigate to a given goal by doing gradient ascent on the activity of the corresponding goal cell. The process of evaluating these gradients and using them to select actions is not modelled explicitly, but the authors point to the similarity of this mechanism to chemotaxis (ascending a gradient of odour concentration to reach the odour source), and the widespread capacity for chemotaxis in the animal kingdom, to argue for its biological plausibility.
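
      The mechanism summarized in this paragraph can be illustrated with a minimal sketch. It is a simplified reading of the description above and not the authors' implementation: the graph, the exploration walk, the gain value, and all function names are hypothetical, synapses are assumed to be set symmetrically, and the map is assumed to be fully learned before the goal synapses are set.

```python
def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def learn_map(walk, n):
    """One-shot rule: a map synapse is set to 1 the first time the agent
    steps directly between two locations (symmetry is an assumption)."""
    M = [[0.0] * n for _ in range(n)]
    for a, b in zip(walk, walk[1:]):
        M[a][b] = M[b][a] = 1.0
    return M

def map_activity(M, node, gamma, iters=200):
    """Steady state of v = u + gamma * M v with external input u at `node`."""
    u = [1.0 if k == node else 0.0 for k in range(len(M))]
    v = u[:]
    for _ in range(iters):
        v = [uk + gamma * wk for uk, wk in zip(u, matvec(M, v))]
    return v

def goal_signal(goal_weights, M, node, gamma):
    """Goal cell output: map activity at `node` weighted by the goal synapses."""
    return sum(g * a for g, a in zip(goal_weights, map_activity(M, node, gamma)))

def navigate(M, goal_weights, start, goal, gamma, max_steps=20):
    """Greedy ascent: step to the neighbouring location with the largest goal
    signal. Neighbours are read from M for convenience; in the model the
    agent would sample them by actually moving."""
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        here = path[-1]
        neighbours = [k for k, w in enumerate(M[here]) if w > 0]
        path.append(max(neighbours,
                        key=lambda k: goal_signal(goal_weights, M, k, gamma)))
    return path

# Hypothetical exploration of a 6-node environment (path 0..5 plus a 1-4 shortcut).
walk = [0, 1, 2, 3, 4, 5, 4, 1, 0]
gamma = 0.1  # assumed to lie well below the critical gain for this graph
M = learn_map(walk, n=6)

# The resource is found at node 5: goal synapses copy the map activity there.
goal_weights = map_activity(M, 5, gamma)

path = navigate(M, goal_weights, start=0, goal=5, gamma=gamma)
print(path)
```

      In this example the agent takes the shortcut through node 4 even though it first reached the resource along the longer route, which illustrates how the learned map, rather than the remembered trajectory, determines the path.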

      The ideas are interesting and the presentation in the manuscript is generally clear. The two principal limitations of the manuscript are: i) Many of the ideas that the model implements have been explored in previous work. ii) The mapping of the circuit model onto real biological systems is pretty speculative, particularly with respect to the cerebellum.

      Regarding the novelty of the work, the idea of flexibly navigating to goals by descending distance gradients dates back to at least Kaelbling (Learning to achieve goals, IJCAI, 1993), and is closely related to both the successor representation (cited in manuscript) and Linear Markov Decision Processes (LMDPs) (Piray and Daw, 2021, https://doi.org/10.1038/s41467-021-25123-3, Todorov, 2009 https://doi.org/10.1073/pnas.0710743106). The specific proposal of navigating to goals by doing gradient descent on diffusion distances, computed as powers of the adjacency matrix, is explored in Baram et al. 2018 (https://doi.org/10.1101/421461), and the idea that recurrent neural networks whose weights are the adjacency matrix can compute diffusion distances is explored in Fang et al. 2022 (https://doi.org/10.1101/2022.05.18.492543). Similar ideas about route planning using the spread of recurrent activity are also explored in Corneil and Gerstner (2015, cited in manuscript). Further exploration of this space of ideas is no bad thing, but it is important to be clear where prior literature has proposed closely related ideas.

      We have added a discussion section on “Theories and models of spatial learning” with a survey of ideas in this domain and how they come together in the Endotaxis model.

      Regarding whether the proposed circuit model might plausibly map onto a real biological system, I will focus on the mammalian brain as I don't know the relevant insect literature. It was not completely clear to me how the authors think their model corresponds to mammalian brain circuits. When they initially discuss brain circuits they point to the cerebellum as a plausible candidate structure (lines 520-546). Though the correspondence between cerebellar and model cell types is not very clearly outlined, my understanding is they propose that cerebellar granule cells are the 'map-cells' and Purkinje cells are the 'goal-cells'. I'm no cerebellum expert, but my understanding is that the granule cells do not have recurrent excitatory connections needed by the map cells. I am also not aware of reports of place-field-like firing in these cell populations that would be predicted by this correspondence. If the authors think the cerebellum is the substrate for the proposed mechanism they should clearly outline the proposed correspondence between cerebellar and model cell types and support the argument with reference to the circuit architecture, firing properties, lesion studies, etc.

      On further thought we agree that the cerebellum-like circuits are not a plausible substrate for the endotaxis algorithm. The anatomy looks compelling, but plasticity at the synapse is anti-Hebbian, and - as the reviewer points out - there is little evidence for recurrence among the inputs. We changed the discussion text accordingly.

      The authors also discuss the possibility that the hippocampal formation might implement the proposed model, though confusingly they state 'we do not presume that endotaxis is localized to that structure' (line 564).

      We have removed that confusing bit of text.

      A correspondence with the hippocampus appears more plausible than the cerebellum, given the spatial tuning properties of hippocampal cells, and the profound effect of lesions on navigation behaviours. When discussing the possible relationship of the model to hippocampal circuits it would be useful to address internally generated sequential activity in the hippocampus. During active navigation, and when animals exhibit vicarious trial and error at decision points, internally generated sequential activity of hippocampal place cells appears to explore different possible routes ahead of the animal (Kay et al. 2020, https://doi.org/10.1016/j.cell.2020.01.014, Redish 2016, https://doi.org/10.1038/nrn.2015.30). Given the emphasis the model places on sampling possible future locations to evaluate goal-distance gradients, this seems highly relevant.

      In our model, the possible future locations are sampled in real life, with the agent moving there or at least in that direction, e.g. via VTE movements. In this simple form the model has no provision for internal planning, and the animal never learns any specific route sequence. One can envision extending such a model with some form of sequence learning that would then support an internal planning mechanism. We mention this in the revised discussion section, along with citation of these relevant articles.

      Also, given the strong emphasis the authors place on the relationship of their model to chemotaxis/odour-guided navigation, it would be useful to discuss brain circuits involved in chemotaxis, and whether/how these circuits relate to those involved in goal-directed navigation, and the proposed model.

      The neural basis of goal-directed navigation is probably best understood in the insect brain. There the locomotor decisions seem to be initiated in the central complex, whose circuitry is getting revealed by the fly connectome projects. This area receives input from diverse sensory areas that deliver the signal on which the decisions are based. That includes the mushroom body, which we argue has the anatomical structure to implement the endotaxis algorithm. It remains a mystery how the insect chooses a particular goal for pursuit via its decisions. It could be revealing to force a change in goals (the mode switch in the endotaxis circuit) while recording from brain areas like the central complex. Our discussion now elaborates on this.

      Finally, it would be useful to clarify two aspects of the behaviour of the proposed algorithm:

      1) When discussing the relationship of the model to the successor representation (lines 620-627), the authors emphasise that learning in the model is independent of the policy followed by the agent during learning, while the successor representation is policy dependent. The policy independence of the model is achieved by making the synapses between map cells binary (0 or 1 weight) and setting them to 1 following a single transition between two locations. This makes the model unsuitable for learning the structure of graphs with probabilistic transitions, e.g. it would not behave adaptively in the widely used two-step task (Daw et al. 2011, https://doi.org/10.1016/j.neuron.2011.02.027) as it would fail to differentiate between common and rare transitions. This limitation should be made clear and is particularly relevant to claims that the model can handle cognitive tasks in general. It is also worth noting that there are algorithms that are closely related to the successor representation, but which learn about the structure of the environment independent of the subject's policy, e.g. the work of Kaelbling which learns shortest path distances, and the default representation in the work of Piray and Daw (both referenced above). Both these approaches handle probabilistic transition structures.

      Yes. Our problem statement assumes that the environment is a graph with fixed edge weights. The revised text mentions this and other assumptions in a new section “Choice of learning rule”.
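
      The reviewer's point about probabilistic transitions can be illustrated with a minimal sketch. It assumes a hypothetical two-step-like structure in which state 0 leads to state 1 with probability 0.7 (common) and to state 2 with probability 0.3 (rare). The function names, the probabilities, and the directed form of the rule are illustrative choices, not taken from the manuscript, which uses a symmetric rule on graphs with fixed edge weights.

```python
import random

def learn_binary(transitions, n_states):
    """One-shot binary rule: a synapse is set to 1 after a single observed transition."""
    M = [[0] * n_states for _ in range(n_states)]
    for i, j in transitions:
        M[i][j] = 1
    return M

def learn_counts(transitions, n_states):
    """Reference learner that keeps transition counts (not part of the endotaxis model)."""
    C = [[0] * n_states for _ in range(n_states)]
    for i, j in transitions:
        C[i][j] += 1
    return C

# Hypothetical experience: 200 departures from state 0,
# landing in state 1 (p = 0.7, common) or state 2 (p = 0.3, rare).
rng = random.Random(0)
transitions = [(0, 1 if rng.random() < 0.7 else 2) for _ in range(200)]

M = learn_binary(transitions, n_states=3)
C = learn_counts(transitions, n_states=3)

# After learning, the binary map holds the same weight for both edges,
# whereas the counts still separate the common from the rare transition.
print("binary weights:", M[0][1], M[0][2])
print("counts:        ", C[0][1], C[0][2])
```

      Under these assumptions the binary map records that both edges exist but carries no information about their relative frequency, which is the limitation the stated problem assumption (fixed edge weights) sets aside.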

      2) As the model evaluates distances using powers of the adjacency matrix, the resulting distances are diffusion distances, not shortest path distances. Though diffusion and shortest path distances are usually closely correlated, they can differ systematically for some graphs (see Baram et al. cited above).

      The recurrent network of map cells implements a specific function of the adjacency matrix, namely the resolvent (Eqn 7). We have a mathematical proof that this function delivers the shortest graph distances exactly, in the limit of small gain (γ in Eqn 7), and that this holds true for all graphs. For practical navigation in the presence of noise, one needs to raise the gain to something finite. Figure 3 analyzes how this affects deviations from the shortest graph distance, and how nonetheless the model still supports effective navigation over a surprising range. The mathematical details of the proof and further exploration of the resolvent distance at finite gain have been moved to a separate article, which is cited from here, and attached to the submission. The preprint by Baram et al. is cited in that article.
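
      The small-gain limit described above can be checked numerically with a short sketch. It assumes the resolvent has the form E = (I - γA)^(-1) for adjacency matrix A, and that a distance is read out as log E_ij / log γ; both are our reading of the response rather than formulas quoted from it, and the six-node ring graph and the two gain values are arbitrary illustrative choices.

```python
from collections import deque

import numpy as np

def ring_adjacency(n):
    """Adjacency matrix of an undirected ring with n nodes."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = 1
        A[(i + 1) % n, i] = 1
    return A

def bfs_distances(A):
    """All-pairs shortest path lengths by breadth-first search."""
    n = A.shape[0]
    D = np.full((n, n), np.inf)
    for s in range(n):
        D[s, s] = 0
        queue = deque([s])
        while queue:
            i = queue.popleft()
            for j in np.flatnonzero(A[i]):
                if np.isinf(D[s, j]):
                    D[s, j] = D[s, i] + 1
                    queue.append(j)
    return D

def resolvent_distances(A, gamma):
    """Assumed readout: log of the resolvent entries divided by log(gamma)."""
    n = A.shape[0]
    E = np.linalg.inv(np.eye(n) - gamma * A)  # equals sum_k gamma^k A^k below critical gain
    return np.log(E) / np.log(gamma)

A = ring_adjacency(6)              # critical gain is 1/2 (largest eigenvalue of A is 2)
D_true = bfs_distances(A)
D_small = resolvent_distances(A, 1e-3)   # very small gain
D_large = resolvent_distances(A, 0.4)    # finite gain, still below critical

print("max deviation, gamma = 1e-3:", np.max(np.abs(D_small - D_true)))
print("max deviation, gamma = 0.4: ", np.max(np.abs(D_large - D_true)))
```

      The reason the limit works in this sketch is that the leading term of E_ij is N γ^d, where d is the shortest path length and N the number of shortest paths, so the readout equals d + log N / log γ plus smaller terms, and the correction vanishes as γ goes to zero.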

      Reviewer #3 (Public Review):

      This paper argues that it has developed an algorithm conceptually related to chemotaxis that provides a general mechanism for goal-directed behaviour in a biologically plausible neural form.

      The method depends on substantial simplifying assumptions. The simulated animal effectively moves through an environment consisting of discrete locations and can reliably detect when it is in each location. Whenever it moves from one location to an adjacent location, it perfectly learns the connectivity between these two locations (changes the value in an adjacency matrix to 1). This creates a graph of connections that reflects the explored environment. In this graph, the current location gets input activation and this spreads to all connected nodes multiplied by a constant decay (adjusted to the branching number of the graph) so that as the number of connection steps increases the activation decreases. Some locations will be marked as goals through experiencing a resource of a specific identity there, and subsequently will be activated by an amount proportional to their distance in the graph from the current location, i.e., their activation will increase if the agent moves a step closer and decrease if it moves a step further away. Hence by making such exploratory movements, the animal can decide which way to move to obtain a specified goal.

      I note here that it was not clear what purpose, other than increasing the effective range of activation, is served by having the goal input weights set based on the activation levels when the goal is obtained. As demonstrated in the homing behaviour, it is sufficient to just have a goal connected to a single location for the mechanism to work (i.e., the activation at that location increases if the animal takes a step closer to it); and as demonstrated by adding a new graph connection, goal activation is immediately altered in an appropriate way to exploit a new shortcut, without the goal weights corresponding to this graph change needing to be relearnt.

      As the reviewer states, allowing a graded strengthening of multiple synapses from the map cells increases the effective range of the goal signal. We have now confirmed this in simulations. For example, in the analysis of Fig 3E, a single goal synapse enables perfect navigation only over a range of 7 steps, whereas the distributed goal synapses allow perfect navigation over the full 12 steps. This analysis is included in the code notebook on GitHub.

      Given the abstractions introduced, it is clear that the biological task here has been reduced to the general problem of calculating the shortest path in a graph. That is, no real-world complications such as how to reliably recognise the same location when deciding that a new node should be introduced for a new location, or how to reliably execute movements between locations are addressed. Noise is only introduced as a 1% variability in the goal signal. It is therefore surprising that the main text provides almost no discussion of the conceptual relationship of this work to decades of previous work in calculating the shortest path in graphs, including a wide range of neural- and hardware-based algorithms, many of which have been presented in the context of brain circuits.

      The connection to this work is briefly made in appendix A.1, where it is argued that the shortest path distance between two nodes in a directed graph can be calculated from equation 15, which depends only on the adjacency matrix and the decay parameter (provided the latter falls below a given value). It is not clear from the presentation whether this is a novel result. No direct reference is given for the derivation so I assume it is novel. But if this is a previously unknown solution to the general problem it deserves to be much more strongly featured and either way it needs to be appropriately set in the context of previous work.

      As far as we know this proposal for computing all-pairs-shortest-path is novel. We could not find it in textbooks or an extended literature search. We have discussed it with two graph theorist colleagues, who could not recall seeing it before, although the proof of the relationship is elementary. Inspired by the present reviewer comment, we chose to publish the result in a separate article that can focus on the mathematics and place it in the appropriate context of prior work in graph theory. For related work in the area of neural modeling please see our revised discussion section.

      Once this principle is grasped, the added value of the simulated results is somewhat limited. These show: 1) in practical terms, the spreading signal travels further for a smaller decay but becomes erratic as the decay parameter (map neuron gain) approaches its theoretical upper bound and decreases below noise levels beyond a certain distance. Both follow the theory. 2) that different graph structures can be acquired and used to approach goal locations (not surprising). 3) that simultaneous learning and exploitation of the graph only minimally affects the performance over starting with perfect knowledge of the graph. 4) that the parameters interact in expected ways. It might have been more impactful to explore whether the parameters could be dynamically tuned, based on the overall graph activity.

      This is a good summary of our simulation results, but we differ in the assessment of their value. In our experience, simulations can easily demolish an idea that seemed wonderful before exposure to numerical reality. For example, it is well known that one can build a neural integrator from a recurrent network that has feedback gain of exactly 1. In practical simulations, though, these networks tend to be fickle and unstable, and require unrealistically accurate tuning of the feedback gain. In our case, the theory predicts that there is a limited range of gains that should work, below the critical value, but large enough to avoid excessive decay of the signal. Simulation was needed to test what this practical range was, and we were pleasantly surprised that it is not ridiculously small, with robust navigation over a 10-20% range. Similarly, we did not predict that the same parameters would allow for effective acquisition of a new graph, learning of targets within the graph, and shortest-route navigation to those targets, without requiring any change in the operation of the network.

      Perhaps the most biologically interesting aspect of the work is to demonstrate the effectiveness, for flexible behaviour, of keeping separate the latent learning of environmental structure and the association of specific environmental states to goals or values. This contrasts (as the authors discuss) with the standard reinforcement learning approach, for example, that tries to learn the value of states that lead to reward. Examples of flexibility include the homing behaviour (a goal state is learned before any of the map is learned) and the patrolling behaviour (a goal cell that monitors all states for how recently they were visited). It is also interesting to link the mechanism of exploration of neighbouring states to observed scanning behaviours in navigating animals.

      The mapping to brain circuits is less convincing. Specifically, for the analogy to the mushroom body, it is not clear what connectivity (in the MB) is supposed to underlie the graph structure which is crucial to the whole concept. Is it assumed that Kenyon cell connections perform the activation spreading function and that these connections are sufficiently adaptable to rapidly learn the adjacency matrix? Is there any evidence for this?

      Yes, there is good evidence for recurrent synapses among Kenyon cells (map cells in the model), and for reward-gated synaptic plasticity at the synapses onto mushroom body output cells (goal cells in our model). We have expanded this material in the discussion section. Whether those functions are sufficient to learn the structure of a spatial environment has not been explored; we hope our paper might give an impetus, and are exploring behavioral experiments on flies with colleagues.

      As discussed above, the possibility that an algorithm like 'endotaxis' could explain how the rodent place cell system could support trajectory planning has already been explored in previous work so it is not clear what additional insight is gained from the current model.

      Please see our revised discussion section on “theories and models of spatial learning”. In short, some ingredients of the model have appeared in prior work, but we believe that the present formulation offers an unexpectedly simple end-to-end solution for all components of navigation: exploration, target learning, and goal seeking.

      Reviewer #1 (Recommendations For The Authors):

      Major concern:

      See the public review. How do the results change depending on whether the recurrent connections between map cells are an adjacency matrix vs. a properly normalized state-transition matrix? I'm especially asking about results related to critical gain (gamma_c), and the dependence of the optimal parameter values on the environment.

      Please see our response above including the attached reviewer figure.

      Minor concerns:

      It is not always clear when the learning rule is symmetric vs asymmetric (undirected vs directed graph), and it seems to switch back and forth. For example, line 127 refers to a directed graph; Fig 2B and the intro describe symmetric Hebbian learning. Most (all?) of the simulations use the symmetric rule. Please make sure it's clear.

      For simplicity we now use a symmetric rule throughout, as is appropriate for undirected graphs. We mention that a directed learning rule could be used to learn directed graphs. See the section on “choice of learning rule”.

      M_ij is not defined when it's first introduced (eq 4). Consider labeling the M's and the G's in Fig 2.

      Done.

      The network gain factor (gamma, eq 4) is distributed over both external and recurrent inputs (v = gamma(u + Mv)), instead of local to the recurrent weights like in the Successor Representation. This notational choice is obviously up to the authors. I raise slight concern for two reasons -- first, distributing gamma may affect some of the parameter sweep results (see major concern), and second, it may be confusing in light of how gamma is used in the SR literature (see reviewer's paper for the derivation of how SR is computed by an RNN with gain gamma).

      In our model, gamma represents the (linear) activation function of the map neuron, from synaptic input to firing output. Because the synaptic input comes from point cells and also from other map cells, the gain factor is applied to both. See for example the Dayan & Abbott book Eqn 7.11, which at steady state becomes our Eqn 4. In the formalism of Fang 2023 (Eqn 2), the factor γ is only applied to the recurrent synaptic input J ⋅ f, but somehow not to the place cell input ϕ. Biophysically, one could imagine applying the variable gain only to the recurrent synapses and not the feed-forward ones. Instead we prefer to think of it as modulating the gain of the neurons, rather than the synapses. The SR literature follows conventions from the early reinforcement learning papers, which were unconstrained by thinking about neurons and synapses. We have added a footnote pointing the reader to the uses of γ in different papers.
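
      The two gain conventions discussed here can be compared in a small linear sketch. It assumes the steady-state forms v = γ(u + Mv) (gain applied to all synaptic input, as in Eqn 4 quoted by the reviewer) and v = u + γMv (gain applied to the recurrent input only, as the response describes for Fang 2023). The function names, the four-node chain, and the gain value are hypothetical, and noise is left out.

```python
import numpy as np

def neuronal_gain_steady_state(M, u, gamma):
    """Solve v = gamma * (u + M @ v): gain acts on feed-forward and recurrent input."""
    n = len(u)
    return gamma * np.linalg.solve(np.eye(n) - gamma * M, u)

def recurrent_gain_steady_state(M, u, gamma):
    """Solve v = u + gamma * M @ v: gain acts on the recurrent input only."""
    n = len(u)
    return np.linalg.solve(np.eye(n) - gamma * M, u)

# Hypothetical example: a chain of four locations with the agent at node 0.
M = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
u = np.array([1.0, 0.0, 0.0, 0.0])
gamma = 0.3  # below 1 / (largest eigenvalue of M), which is about 0.618 for this chain

v_neuronal = neuronal_gain_steady_state(M, u, gamma)
v_recurrent = recurrent_gain_steady_state(M, u, gamma)

# The ratio is the same scalar (gamma) at every map cell.
print(v_neuronal / v_recurrent)
```

      In this noiseless linear setting the two conventions give map-cell activities that differ only by the overall factor γ, so the relative profile across locations is the same. Whether that equivalence carries over to the parameter sweeps depends on how noise is scaled, which this sketch does not address.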

      In eq 13, and simulations, noise is added to the output only, not to the activity of recurrently connected neurons. It is possible this underestimates the impact of noise since the same magnitude of noise in the recurrent network (map cells) could have a compounded effect on the output.

      Certainly. The equivalent output noise represents the cumulative effect of noise everywhere in the network. We argue that a cumulative effect of 1% is reasonable given the overall ability of animals at stimulus discrimination, which is also limited by noise everywhere in the network. This has been clarified in the text.

      Fig 3 E, F, it looks like the navigated distance may be capped. I ask because the error bars for graph distance = 12 are so small/nonexistent. If it's capped, this should be in the legend.

      Correct. 12 is the largest distance on this graph. This has been added to the caption.

      Fig 3D legend, what does "navigation failed" mean? These results are not shown.

      On those occasions the agent gets trapped at a local maximum of the goal signal other than the intended goal. We have removed that line as it is not needed to interpret the data.

      Line 446, typo (Lateron).

      Fixed.

      Line 475, I'm a bit confused by the discussion of birds and bats. Bird behavior in the real world does involve discrete paths between points. Even if they theoretically could fly between any points, there are costs to doing so, and in practice, they often choose discrete favorite paths. It is definitely plausible that animals that can fly could also employ Endotaxis, so it is confusing to suggest they don't have the right behavior for Endotaxis, especially given the focus on fruit flies later in the discussion.

      Good points, we removed that remark. Regarding fruit flies, they handle much important business while walking, such as tracking a mate, fighting rivals over food, and finding a good oviposition site.

      Section 9.3, I'm a bit confused by the discussion of cerebellum-like structures, because I don't think they have as dense recurrent connections as needed for the map cells in Endotaxis. Are you suggesting they are analogous to the output part of Endotaxis only, not the whole thing?

      Please see our reply in the public review. We have removed this discussion of cerebellar circuits.

      Line 541, "After sufficient exploration...", clarify that this is describing learning of just the output synapses, not the recurrent connections between map cells?

      We have revised this entire section on the arthropod mushroom body.

      In lines 551-556, the discussion is confusing and possibly not consistent with current literature. How can a simulation prove that synapses in the hippocampus are only strengthened among immediately adjacent place fields? I'd suggest either removing this discussion or adding further clarification. More broadly, the connection between Endotaxis and the hippocampus is very compelling. This might also be a good point to bring up BTSP (though you do already bring it up later).

      As suggested, we removed this section.

      Line 621 "The successor representation (at least as currently discussed) is designed to improve learning under a particular policy" That's not actually accurate. Ref 17 (reviewer's manuscript, cited here) is not policy-specific, and instead just learns the transition statistics experienced by the animal, using a biologically plausible learning rule that is very similar to the Endotaxis map cell learning rule (see our Appendix 5, comparing to Endotaxis, though that was referencing the previous version of the Endotaxis preprint where Oja's rule was used).

      We have edited this section in the discussion and removed the reference to policy-specific successor representations.

      Line 636 "Endotaxis is always on" ... this was not clear earlier in the paper (e.g. line 268, and the separation of different algorithms, and "while learning do" in Algorithm 2).

      The learning rules are suspended during some simulations so we can better measure the effects of different parts of endotaxis, in particular learning vs navigating. There is no interference between these two functions, and an agent benefits from having the learning rules on all the time. The text now clarifies this in the relevant sections.

      Section 9.6, I like the idea of tracing different connected functions. But when you say "that could lead to the mode switch"... I'm a bit confused about what is meant here. A mode switch doesn't need to happen in a different brain area/network, because winner-take-all could be implemented by mutual inhibition between the different goal units.

      This is an interesting suggestion for the high-level control algorithm. A Lorenzian view is that the animal’s choice of mode depends on internal states or drives, such as thirst vs hunger, that compete with each other. In that picture the goal cells represent options to be pursued, whereas the choice among the options occurs separately. But one could imagine that the arbitrage between drives happens through a competition at the level of goal cells: For example the consumption of water could lead to adaptation of the water cell, such that it loses out in the winner-take-all competition, the food cell takes over, and the mouse now navigates towards food. In this closed-loop picture, the animal doesn’t have to “know” what it wants at any given time, it just wants the right thing. This could eliminate the homunculus entirely! Of course this is all a bit speculative. We have edited the closing comments in a way that leaves open this possibility.

      Line 697-704, I need more step-by-step explanation/derivation.

      We now derive the properties of E step by step starting from Eqn (14). The proof that leads to Eqn 14 is now in a separate article (available as a preprint and attached to this submission).

      Reviewer #3 (Recommendations For The Authors):

      • Please include discussion and comparison to previous work of graph-based trajectory planning using spreading activation from the current node and/or the goal node. Here is a (far from comprehensive) list of papers that present similar algorithms:

      Glasius, R., Komoda, A., & Gielen, S. C. (1996). A biologically inspired neural net for trajectory formation and obstacle avoidance. Biological Cybernetics, 74(6), 511-520.

      Gaussier, P., Revel, A., Banquet, J. P., & Babeau, V. (2002). From view cells and place cells to cognitive map learning: processing stages of the hippocampal system. Biological cybernetics, 86(1), 15-28.

      Gorchetchnikov A, Hasselmo ME. A biophysical implementation of a bidirectional graph search algorithm to solve multiple goal navigation tasks. Connection Science. 2005;17(1-2):145-166

      Martinet, L. E., Sheynikhovich, D., Benchenane, K., & Arleo, A. (2011). Spatial learning and action planning in a prefrontal cortical network model. PLoS computational biology, 7(5), e1002045.

      Ponulak, F., & Hopfield, J. J. (2013). Rapid, parallel path planning by propagating wavefronts of spiking neural activity. Frontiers in computational neuroscience, 7, 98.

      Khajeh-Alijani, A., Urbanczik, R., & Senn, W. (2015). Scale-free navigational planning by neuronal traveling waves. PloS one, 10(7), e0127269.

      Adamatzky, A. (2017). Physical maze solvers. All twelve prototypes implement 1961 Lee algorithm. In Emergent computation (pp. 489-504). Springer, Cham.

      Please see our reply to the public review above, and the new discussion section on “Theories and models of spatial learning”, which cites most of these papers among others.

      • Please explain, if it is the case, why the goal cell learning (other than a direct link between the goal and the corresponding map location) and calculation of the overlapping 'goal signal' is necessary, or at least advantageous.

      Please see our reply in the public review above.

      • Map cells are initially introduced (line 84) as getting input from "only one or a few point cells". The rest of the paper seems to assume only one. Does it work when this is 'a few'? Does it matter that 'a few' is an option?

      We simplified the text here to “only one point cell”. A map cell with input from two distant locations creates problems. After learning the map synapses from adjacencies in the environment, the model now “believes” that those two locations are connected. This distorts the graph on which the graph distances are computed and introduces errors in the resulting goal signals. One can elaborate the present toy model with a much larger population of map cells that might convey more robustness, but that is beyond our current scope.
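
      The distortion described above can be made concrete with a toy sketch. It assumes that a map cell driven by two distant locations behaves like a single graph node obtained by merging those two locations; this node-contraction picture, the eight-node corridor, and all function names are illustrative assumptions rather than the manuscript's implementation.

```python
from collections import deque

def path_graph(n):
    """Corridor of n locations, each linked to its immediate neighbours."""
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}

def merge_nodes(graph, a, b):
    """Contract locations a and b into one (hypothetical) shared map cell labelled a."""
    merged = {}
    for node, nbrs in graph.items():
        key = a if node == b else node
        renamed = {a if m == b else m for m in nbrs}
        merged.setdefault(key, set()).update(renamed)
    merged[a].discard(a)  # drop any self-connection created by the merge
    return merged

def bfs_distance(graph, start, goal):
    """Shortest path length between two nodes, or None if unreachable."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return dist[node]
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return None

true_graph = path_graph(8)
believed_graph = merge_nodes(true_graph, 0, 7)  # one map cell driven by both ends of the corridor

d_true = bfs_distance(true_graph, 1, 6)
d_believed = bfs_distance(believed_graph, 1, 6)
print("true distance 1 -> 6:    ", d_true)
print("believed distance 1 -> 6:", d_believed)
```

      In this toy corridor the merged cell creates a shortcut that does not exist in the environment, so any goal signal computed on the believed graph would underestimate the distance between locations near the two ends.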

      • (line 539 on) Please explain what feature in the mushroom body (or other cerebellum-like) circuits is proposed to correspond to the learning of connections in the adjacency matrix in the model.

      Please see our response to this critique in the public review above. In the mushroom body, the Kenyon cells exhibit sparse responses and are recurrently connected. These would correspond to map cells in Endotaxis. For vertebrate cerebellum-like circuits, the correspondence is less compelling, and we have removed this topic from the discussion.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Gating of Kv10 channels is unique because it involves coupling between non-domain swapped voltage sensing domains, a domain-swapped cytoplasmic ring assembly formed by the N- and C-termini, and the pore domain. Recent structural data suggests that activation of the voltage sensing domain relieves a steric hindrance to pore opening, but the contribution of the cytoplasmic domain to gating is still not well understood. This aspect is of particular importance because proteins like calmodulin interact with the cytoplasmic domain to regulate channel activity. The effects of calmodulin (CaM) in WT and mutant channels with disrupted cytoplasmic gating ring assemblies are contradictory, resulting in inhibition or activation, respectively. The underlying mechanism for these discrepancies is not understood. In the present manuscript, Reham Abdelaziz and collaborators use electrophysiology, biochemistry and mathematical modeling to describe how mutations and deletions that disrupt inter-subunit interactions at the cytoplasmic gating ring assembly affect Kv10.1 channel gating and modulation by CaM. In the revised manuscript, additional information is provided to allow readers to identify within the Kv10.1 channel structure the location of E600R, one of the key channel mutants analyzed in this study. However, the mechanistic role of the cytoplasmic domains that this study focuses on, as well as the location of the ΔPASCap deletion and other perturbations investigated in the study remain difficult to visualize without additional graphical information. This can make it challenging for readers to connect the findings presented in the study with a structural mechanism of channel function.

      The authors focused mainly on two structural perturbations that disrupt interactions within the cytoplasmic domain, the E600R mutant and the ΔPASCap deletion. By expressing mutants in oocytes and recording currents using Two-Electrode Voltage-Clamp (TEVC), it is found that both ΔPASCap and E600R mutants have biphasic conductance-voltage (G-V) relations and exhibit activation and deactivation kinetics with multiple voltage-dependent components. Importantly, the mutant-specific component in the G-V relations is observed at negative voltages where WT channels remain closed. The authors argue that the biphasic behavior in the G-V relations is unlikely to result from two different populations of channels in the oocytes, because they found that the relative amplitude between the two components in the G-V relations was highly reproducible across individual oocytes that otherwise tend to show high variability in expression levels. Instead, the G-V relations for all mutant channels could be well described by an equation that considers two open states O1 and O2, and a transition between them; O1 appeared to be unaffected by any of the structural manipulations tested (i.e. E600R, ΔPASCap, and other deletions) whereas the parameters for O2 and the transition between the two open states were different between constructs. The O1 state is not observed in WT channels and is hypothesized to be associated with voltage sensor activation. O2 represents the open state that is normally observed in WT channels and is speculated to be associated with conformational changes within the cytoplasmic gating ring that follow voltage sensor activation, which could explain why the mutations and deletions disrupting cytoplasmic interactions affect primarily O2.

      Severing the covalent link between the voltage sensor and pore reduced O1 occupancy in one of the deletion constructs. Although this observation is consistent with the hypothesis that voltage-sensor activation drives entry into O1, this result is not conclusive. Structural as well as functional data has established that the coupling of the voltage sensor and pore does not entirely rely on the S4-S5 covalent linker between the sensor and the pore, and thus the severed construct could still retain coupling through other mechanisms, which is consistent with the prominent voltage dependence that is observed. If both states O1 and O2 require voltage sensor activation, it is unclear why the severed construct would affect state O1 primarily, as suggested in the manuscript, as opposed to decreasing occupancy of both open states. In line with this argument, the presence of Mg2+ in the extracellular solution affected both O1 and O2. This finding suggests that entry into both O1 and O2 requires voltage-sensor activation because Mg2+ ions are known to stabilize the voltage sensor in its most deactivated conformations. 

      We agree with the reviewer that access to both states requires a conformational change in the voltage sensor. This was stated in our revised article: “In contrast, to enter O2, all subunits must complete both voltage sensor transitions and the collective gating ring transition.” We interpret the two gating steps as sequential; the effective rotation of the intracellular ring would happen only once the sensor is in its fully activated position.

      We also agree that the S4-S5 segment cannot be the only interaction mechanism, as we demonstrated in our earlier work (Lörinczi et al., 2015; Tomczak et al., 2017).  

      Activation towards and closure from O1 is slow, whereas channels close rapidly from O2. A rapid alternating pulse protocol was used to take advantage of the difference in activation and deactivation kinetics between the two open components in the mutants and thus drive an increasing number of channels towards state O1. Currents activated by the alternating protocol reached larger amplitudes than those elicited by a long depolarization to the same voltage. This finding is interpreted as an indication that O1 has a larger macroscopic conductance than O2. In the revised manuscript, the authors performed single-channel recordings to determine why O1 and O2 have different macroscopic conductance. The results show that at voltages where the state O1 predominates, channels exhibited longer open times and overall higher open probability, whereas at more depolarized voltages where occupancy of O2 increases, channels exhibited more flickery gating behavior and decreased open probability. These results are informative but not conclusive because additional details about how experiments were conducted, and group data analysis are missing. Importantly, results showing inhibition of single ΔPASCap channels by a Kv10-specific inhibitor are mentioned but not shown or quantitated - these data are essential to establish that the new O1 conductance indeed represents Kv10 channel activity.

      We observed the activity of a channel compatible with Kv10.1 ΔPAS-Cap (long openings at low-moderate potentials, very short flickery activity at strong depolarizations) in 12 patches from oocytes obtained from different frog operations over a period of two and a half months once the experimental conditions could be established. As stated in the text, we did not proceed to generate amplitude histograms because we could not resolve clear single-channel events at strong depolarizations. Astemizole abolished the activity and (remarkably) strongly reduced the noise in traces at strong depolarizations, which we interpret as partially caused by flicker openings.

      Author response image 1.

      We include two example recordings of Astemizole application (100µM) on two different patches. Both recordings are performed at -60 mV (to decrease the likelihood that the channel visits O2) with 100 mM internal and 60 mM external K+. In both cases, the traces in Astemizole are presented in red.

      It is shown that conditioning pulses to very negative voltages result in mutant channel currents that are larger and activate more slowly than those elicited at the same voltage but starting from less negative conditioning pulses. In voltage-activated curves, O1 occupancy is shown to be favored by increasingly negative conditioning voltages. This is interpreted as indicating that O1 is primarily accessed from deeply closed states in which voltage sensors are in their most deactivated position. Consistently, a mutation that destabilizes these deactivated states is shown to largely suppress the first component in voltage-activation curves for both ΔPASCap and E600R channels.

      The authors then address the role of the hidden O1 state in channel regulation by calmodulin. Stimulating calcium entry into oocytes with ionomycin and thapsigargin, assumed to enhance CaM-dependent modulation, resulted in preferential potentiation of the first component in ΔPASCap and E600R channels. This potentiation was attenuated by including an additional mutation that disfavors deeply closed states. Together, these results are interpreted as an indication that calcium-CaM preferentially stabilizes deeply closed states from which O1 can be readily accessed in mutant channels, thus favoring current activation. In WT channels lacking a conducting O1 state, CaM stabilizes deeply closed states and is therefore inhibitory. It is found that the potentiation of ΔPASCap and E600R by CaM is more strongly attenuated by mutations in the channel that are assumed to disrupt interaction with the C-terminal lobe of CaM than mutations assumed to affect interaction with the N-terminal lobe. These results are intriguing but difficult to interpret in mechanistic terms. The strong effect that calcium-CaM had on the occupancy of the O1 state in the mutants raises the possibility that O1 can be only observed in channels that are constitutively associated with CaM. To address this, a biochemical pull-down assay was carried out to establish that only a small fraction of channels are associated with CaM under baseline conditions. These CaM experiments are potentially very interesting and could have wide physiological relevance. However, the approach utilized to activate CaM is indirect and could result in additional nonspecific effects on the oocytes that could affect the results.

      Finally, a mathematical model is proposed consisting of two layers involving two activation steps for the voltage sensor, and one conformational change in the cytoplasmic gating ring - completion of both sets of conformational changes is required to access state O2, but accessing state O1 only requires completion of the first voltage-sensor activation step in the four subunits. The model qualitatively reproduces most major findings on the mutants. Although the model used is highly symmetric and appears simple, the mathematical form used for the rate constants in the model adds a layer of complexity to the model that makes mechanistic interpretations difficult. In addition, many transitions that from a mechanistic standpoint should not depend on voltage were assigned a voltage dependence in the model. These limitations diminish the overall usefulness of the model which is prominently presented in the manuscript. The most important mechanistic assumptions in the model are not addressed experimentally, such as the proposition that entry into O1 depends on the opening of the transmembrane pore gate, whereas entry into O2 involves gating ring transitions - it is unclear why O2 would require further gating ring transitions to conduct ions given that the gating ring can already support permeation by O1 without any additional conformational changes.

      In essence, we agree with the reviewer; we have already addressed these points in our revised article:

      Regarding the voltage dependence we write “the κ/λ transition could reasonably be expected to be voltage independent because we related it to ring reconfiguration, a process that should occur as a consequence of a prior VSD transition. We have made some attempts to treat this transition as voltage independent but state-specific with upper-layer bias for states on the right and lower-layer bias for states on the left. This is in principle possible, as can already be gleaned from the similar voltage ranges of the left-right transition (α/β) and the κL/λ transition. However, this approach leads to a much larger number of free, less well constrained kinetic parameters and drastically complicated the parameter search. ” As you can see, we also formulated a strategy to free the model of the potentially spurious voltage dependence and (in bold here) explained why we did not follow this route in this study. 

      Regarding the need for gating ring transitions after O1, we wrote, “Thus, the underlying gating events can be separated into two steps: The first gating step involves only the voltage sensor without engaging the ring and leads to a pre-open state, which is non-conducting in the WT but conducting in our mutants. The second gating event operates at higher depolarizations, involves a change in the ring, and leads to an open state both in WT and in the mutants. ” 

      We interpret your statements to mean that you expect the conducting state to remain available once O1 is reached. However, the experimental evidence argues against the pore remaining available regardless of further gating steps beyond O1. The description of model construction is informative here: “... we could exclude many possible [sites at which O1 connects to closed states] because the attachment site must be sufficiently far away from the conventional open state [O2]. Otherwise, the transition from "O1 preferred" to "O2 preferred" via a few closed intermediate states is very gradual and never produces the biphasic GV curves [that we observed]. ” 

      In other words, voltage-dependent gating steps beyond the state that offers access to O1 appear to close the pore, after it was open. That might occur because only then (for states in which at least one voltage sensor exceeded the intermediate position) the ring is fixed in a particular state until all sensors completed activation. In the WT, closing the pore in deactivated states might rely on an interaction that is absent in the mutant because, at least in HERG: “the interaction between the PAS domain and the C-terminus is more stable in closed than in open KV11.1 (HERG) channels, and a single chain antibody binding to the interface between PAS domain and CNBHD can access its epitope in open but not in closed channels, strongly supporting a change in conformation of the ring during gating ”

      Reviewer #3 (Public Review):

      In the present manuscript, Abdelaziz and colleagues interrogate the gating mechanisms of Kv10.1, an important voltage-gated K+ channel in cell cycle and cancer physiology. At the molecular level, Kv10.1 is regulated by voltage and Ca-CaM. Structures solved using CryoEM for Kv10.1 as well as other members of the KCNH family (Kv11 and Kv12) show channels that do not contain a structured S4-S5 linker, therefore imposing a non-domain swapped architecture in the transmembrane region. However, the cytoplasmic N- and C- terminal domains interact in a domain swapped manner forming a gating ring. The N-terminal domain (PAS domain) of one subunit is located close to the intracellular side of the voltage sensor domain and interacts with the C-terminal domain (CNBHD domain) of the neighbor subunit. Mutations in the intracellular domains have a profound effect on channel gating. The complex network of interactions between the voltage-sensor and the intracellular domains makes the PAS domain a particularly interesting domain of the channel to study as responsible for the coupling between the voltage sensor domains and the intracellular gating ring.

      The coupling between the voltage-sensor domain and the gating ring is not fully understood and the authors aim to shed light into the details of this mechanism. In order to do that, they use well established techniques such as site-directed mutagenesis, electrophysiology, biochemistry and mathematical modeling. In the present work, the authors propose a two open state model that arises from functional experiments after introducing a deletion on the PAS domain (ΔPAS Cap) or a point mutation (E600R) in the CNBHD domain. The authors measure a bi-phasic G-V curve with these mutations and assign each phase as two different open states, one of them not visible on the WT and only unveiled after introducing the mutations.

      The hypothesis proposed by the authors could change the current paradigm in the current understanding for Kv10.1 and it is quite extraordinary; therefore, it requires extraordinary evidence to support it.

      STRENGTHS: The authors use adequate techniques such as electrophysiology and site-directed mutagenesis to address the gating changes introduced by the molecular manipulations. They also use appropriate mathematical modeling to build a Markov model and identify the mechanism behind the gating changes.

      WEAKNESSES: The results presented by the authors do not fully support their conclusions since they could have alternative explanations. The authors base their primary hypothesis on the bi-phasic behavior of a calculated G-V curve that does not match the tail behavior. The experimental conditions used in the present manuscript introduce uncertainties, weakening their conclusions and complicating the interpretation of the results. Therefore, their experimental conditions need to be revisited. 

      We respectfully disagree. We think that your suggestions for alternative explanations are addressed in the current version of the article. We will rebut them once more below, but we feel the need to point out that our arguments are already laid out in the revised article.

      I have some concerns related to the following points:

      (1) Biphasic gating behavior

      The authors use the TEVC technique in oocytes extracted surgically from Xenopus laevis frogs. The method is well established and is adequate to address ion channel behavior. The experiments are performed in chloride-based solutions, which present a handicap when measuring outward rectifying currents at very depolarizing potentials due to the presence of calcium-activated chloride channels expressed endogenously in the oocytes; these channels will open and rectify chloride intracellularly, adding to the outward rectifying traces during the test pulse. The authors calculate their G-V curves from the test pulse steady-state current instead of using the tail currents. The conductance measurements are normally taken from the 'tail current' because tails are measured at a fixed voltage, hence maintaining the driving force constant. 

      We respectfully disagree. In contrast to other channels, like HERG, a common practice for Kv10 is not to use tail currents. It is long known that in this channel, tail currents and test-pulse steady-state currents can appear to be at odds because the channels deactivate extremely rapidly, at the border of temporal resolution of the measurements and with intricate waveforms. This complicates the estimation of the instantaneous tail current. Therefore, the outward current is commonly used to estimate conductance (Terlau et al., 1996; Schönherr et al., 1999; Schönherr et al., 2002; Whicher and MacKinnon, 2019), while the latter authors also use the extreme of the tail for some mutants.

      Due to their activation at very negative voltage, the reversal potential in our mutants can be measured directly; we are, therefore, more confident with this approach. Nevertheless, we have determined the initial tail current in some experiments. The behavior of these is very similar to the average that we present in Figure 1. The biphasic behavior is unequivocally present.
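
      The conversion from test-pulse current to conductance discussed above can be sketched as follows. This is a minimal illustration, not the analysis code used for the article: it assumes the standard chord-conductance relation G = I / (V − E_rev), and the function names, units, and numerical values (including the reversal potential of −90 mV) are hypothetical placeholders.

```python
import math

def chord_conductance(currents_uA, voltages_mV, e_rev_mV):
    """Return G = I / (V - E_rev) in mS for each test voltage.

    Points where the driving force vanishes are returned as NaN,
    because the conductance is undefined at the reversal potential.
    """
    conductances = []
    for current, voltage in zip(currents_uA, voltages_mV):
        driving_force = voltage - e_rev_mV
        if abs(driving_force) < 1e-9:
            conductances.append(math.nan)
        else:
            conductances.append(current / driving_force)
    return conductances

def normalize_to_max(values):
    """Scale conductances so that the largest finite value equals 1."""
    finite = [v for v in values if not math.isnan(v)]
    peak = max(finite)
    return [v / peak for v in values]

# Hypothetical steady-state currents (uA) at three test voltages (mV).
voltages = [-60.0, 0.0, 60.0]
currents = [3.0, 45.0, 150.0]
g = chord_conductance(currents, voltages, e_rev_mV=-90.0)
g_norm = normalize_to_max(g)
```

      Because the driving force differs at every test voltage, this calculation depends on the value used for the reversal potential, which is why a directly measured reversal potential matters for this approach.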

      Author response image 2.

      Calculating the conductance from the traces should not be a problem, however, in the present manuscript, the traces and the tail currents do not agree. 

      The referee’s observation is perfectly in line with the long-standing experience of several labs working with KV10: tail current amplitudes in KV10 appear to be out of proportion for the WT open state (O2). Importantly, this is due to the rapid closure, which is not present in O1. As a consequence, the initial amplitude of tail currents from O1 is easier to estimate correctly, and these tails are much more obvious in the graphs. Taken together, these differences between O1 and O2 explain the misconception the reviewer describes next.

      The tail traces shown in Fig1E do not show an increasing current amplitude in the voltage range from +50mV to +120mV, they seem to have reached a 'saturation state', suggesting that the traces from the test pulse contain an inward chloride current contamination. 

      As stated in the text and indicated in Author response image 3, the tail currents in Figure 1E increase in amplitude between +50 and +120 mV, as can be seen in the examples below from different experiments (+50 is presented in black, +120 in red). As stated above, the increase is not as evident as in traces from other mutants because the predominance of O2 also implies a much faster deactivation.

      Author response image 3. 

      We are aware that Ca2+-activated Cl- currents can represent a problem when interpreting electrophysiological data in oocytes. In fact, we show in Supplement 1 to Figure 8 that this can be the case during the Ca2+-CaM experiments, where the increase in Ca2+ would certainly augment Cl- contribution to the outward current. This is why we performed these experiments in Cl--free solutions. As we show in Figure 8, the biphasic behavior was also present in those experiments. 

      Importantly, Cl--free bath solutions would not correct contamination during the tail, since this would correspond to Cl- exiting the oocyte. Yet, if the outward currents were contaminated by Cl-, one would expect the contamination to increase with larger depolarizations, as the typical Ca2+-activated Cl- current in oocytes does. As the reviewer states, this does not seem to be the case.

      In addition, this second component identified by the authors as a second open state appears after +50mV and seems to never saturate. The normalization to the maximum current level during the test pulse, exaggerates this second component on the calculated G-V curve. 

      We agree that this second component continues to increase; the reviewer brought this up in the first review, and we have already addressed this in our reply and in the discussion of the revised version: “This flicker block might also offer an explanation for a feature of the mutant channels, that is not explained in the current model version: the continued increase in current amplitude, hundreds of milliseconds into a strong depolarization (Supp. 4 to Fig. 9). If the relative stability of O2 and C2 continued to change throughout depolarization, such a current creep-up could be reproduced. However, this would require either the introduction of further layers of On ↔Cn states, or a non-Markovian modification of the model’s time evolution.” With non-Markovian, we mean a Langevin-type diffusive process. 

      It's worth noticing that the ΔPASCap mutant experiments on Fig 5 in Mes based solutions do not show that second component on the G-V.

      For the readers of this conversation, we would like to clarify that the reviewer likely refers to experiments shown in Fig. 5 of the initial submission but shown in Fig. 6 of the revised version (“Hyperpolarization promotes access to a large conductance, slowly activating open state.” Fig. 5 deals with single channels). We agree that these data look different, but this is because the voltage protocols are completely different (compare Fig. 6A (fixed test pulse, varied pre-pulse) and Fig. 2A (varied test pulse, fixed pre-pulse)). Therefore, no biphasic behavior is expected. 

      Because these results are the foundation for their two open state hypotheses, I strongly suggest that the authors repeat all their chloride-based experiments in Mes-based solutions to eliminate the undesired chloride contribution to the mutant currents and clarify the contribution of the mutations to the Kv10.1 gating.

      In summary, we respectfully disagree with all concerns raised in point (1). Our detailed arguments rebutting them are given above, but there is a more high-level concern about this entire exchange: the referee casts doubt on observations that are not new. Several labs have reported for a group of mutant KCNH channels: non-monotonic voltage dependence of activation (see, e.g., Fig. 6D in Zhao et al., 2017), multi-phasic tail currents (see e.g. Fig. 4A in Whicher and MacKinnon, 2019, in CHO cells where Cl- contamination is not a concern), and activation by high [Ca2+]i (Lörinczi et al., 2016). Our study replicates those observations and hypothesizes that the existence of an additional conducting state can alone explain all previously unexplained observations. We highlight the potency of this hypothesis with a Markov model that qualitatively reproduces all phenomena. We not only factually disagree with the individual points raised, but we also think that they don't touch on the core of our contribution.

      (2) Two step gating mechanism.

      The authors interpret the results obtained with the ΔPASCap and the E600R as two step gating mechanisms containing two open states (O1 and O2) and assign them to the voltage sensor movement and gating ring rotation respectively. It is not clear, however, how the authors assign the two open states.

      The results show how the first component is conserved amongst mutations; however, the second one is not. The authors attribute the second component, hence the second open state to the movement of the gating ring. This scenario seems unlikely since there is a clear voltage dependence of the second component that will suggest an implication of a voltage-sensing current.

      We do not suggest that the gating ring motion is not voltage dependent. We would like to point out that voltage dependence can be conveyed by voltage sensor coupling to the ring; this is the widely accepted theory of how the ring can be involved. Should the reviewer mean it in a narrow sense, that the model should be constructed such that all voltage-dependent steps occur before and independently of ring reconfiguration and that only then an additional step that reflects the (voltage-independent) reconfiguration solely, we would like to point the reviewer to the article, where we write: “the κ/λ transition could reasonably be expected to be voltage independent because we related it to ring reconfiguration, a process that should occur as a consequence of a prior VSD transition. We have made some attempts to treat this transition as voltage independent but state-specific with upper-layer bias for states on the right and lower-layer bias for states on the left. This is in principle possible, as can already be gleaned from the similar voltage ranges of the left-right transition (α/β) and the κL/λ transition. However, this approach leads to a much larger number of free, less well constrained kinetic parameters and drastically complicated the parameter search. ” As you can see, we also formulated a strategy to free the model from the potentially spurious voltage dependence and (in bold here) explained why we did not follow this route in this study. 

      The split channel experiment is interesting but needs more explanation. I assume the authors expressed the 2 parts of the split channel (1-341 and 342-end), however Tomczak et al showed in 2017 how the split presents a constitutively activated function with inward currents that are not visible here, this point needs clarification.

      As stated in the panel heading, the figure legend, and the main text, we did not use 1-341 and 342-end as done in Tomczak et al. Instead, “we compared the behavior of ∆2-10 and ∆2-10.L341Split,”. Evidently, the additional deletion (2-10) causes a shift in activation that explains the difference you point out. However, as we do not compare L341Split and ∆2-10.L341Split but ∆2-10 and ∆2-10.L341Split, our conclusion remains that “As predicted, compared to ∆2-10, ∆2-10.L341Split showed a significant reduction in the first component of the biphasic GV (Fig. 2C, D).” Remarkably, the behavior of the ∆3-9 L341Split described in Whicher and MacKinnon, 2019 (Figure 5) matches that of our ∆2-10 L341Split, which we think reinforces our case.

      Moreover, the authors assume that the mutations introduced uncover a new open state; however, the traces presented for the mutations suggest that other explanations are possible. Other gating mechanisms, like inactivation from the closed state, can be introduced by the mutations. The traces presented for ΔPASCap but especially E600R present clear 'hooked tails', a direct indicator of a population of inactive channels during the test pulse that recover from inactivation upon repolarization (Tristani-Firouzi M, Sanguinetti MC. J Physiol. 1998). 

      There is a possibility that we are debating nomenclature here. In response to the suggestion that all our observations could be explained by inactivation, we attempted a disambiguation of terms in the reply and the article. As the argument is brought up again without reference to our clarification attempts, we will try to be more explicit here:

      If, starting from deeply deactivated states, an open state is reached first, and then, following further activation steps, closed states are reached, this might be termed “inactivation”. In such a reading, our model features many inactivated states. The shortest version of such a model is C-O-I. It is for instance used by Raman and Bean (2001; DOI: 10.1016/S0006-3495(01)76052-3) to explain NaV gating in Purkinje neurons. If “inactivation” is meant in the sense that a gating transition exists, which is orthogonal to an activation/deactivation axis, and that after this orthogonal transition, an open state cannot be reached anymore, then all of the upper floor in our model is inactivated with respect to the open state O1. Finally, the state C2 is an inactivated state to O2. In this view, “inactivation” explains the observed phenomena. 

      However, we must disagree if the referee means that a parsimonious explanation exists in which a single conducting state is the only source for all observed currents.   

      There is a high-level reason: we found a single assumption that explains three different phenomena, while the inactivation hypothesis with one conducting state cannot explain one of them (the increase of the first component under raised CaM). But there is also a low-level reason: the tails in Tristani-Firouzi and Sanguinetti 1998 are fundamentally different from what we report herein in that they lack a third component. Thus, those tails are consistent with recovery from inactivation through a single open state, while a three-component tail is not. In the framework of a Markov model, the time constants of transitions from and to a given state (say O2), cannot change unless the voltage changes. During the tail current, the voltage does not change, yet we observe: 

      i) a rapid decrease with a time constant of at most a few milliseconds (Fig 9 S2, 1-> 2),  ii) a slow increase in current, peaking after approximately 25 milliseconds and iii) a relaxation to zero current with a time constant of >50 ms. 

      According to the reviewer’s suggestion, these processes on three timescales should all be explained by depopulating and repopulating the same open state while all rates are constant. There might well be a complicated multi-level state diagram with a single open state with different variants, like (open and open inactivated) that could produce triphasic tails with these properties if the system had not reached a steady state distribution at the end of the test pulse. It cannot, however, achieve it from an equilibrated system, and certainly, it cannot at the same time produce “biphasic activation” and “activation by CaM”. 
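
      The argument about time scales at constant voltage can be illustrated with a small calculation. The sketch below is not the model from the article: it assumes a hypothetical linear scheme C ↔ O ↔ I with invented rate constants, and uses the fact that a Markov scheme with N states relaxes with at most N − 1 exponential time constants while its rates stay fixed. For this three-state chain, the two non-zero relaxation rates are the roots of λ² − sλ + p = 0, with s the sum of all four rates and p = k_co·k_oi + k_co·k_io + k_oc·k_io.

```python
import math

def relaxation_time_constants(k_co, k_oc, k_oi, k_io):
    """Time constants of the linear scheme C <-> O <-> I at fixed voltage.

    All rates are in 1/ms and are held constant, as they would be during
    a tail pulse at a single voltage. Returns (tau_fast, tau_slow) in ms.
    """
    rate_sum = k_co + k_oc + k_oi + k_io
    rate_product = k_co * k_oi + k_co * k_io + k_oc * k_io
    # The discriminant is non-negative for a reversible linear chain;
    # max() only guards against rounding error.
    root = math.sqrt(max(rate_sum ** 2 - 4.0 * rate_product, 0.0))
    lambda_fast = (rate_sum + root) / 2.0
    lambda_slow = (rate_sum - root) / 2.0
    return 1.0 / lambda_fast, 1.0 / lambda_slow

# Invented rates (1/ms), chosen only to make the arithmetic easy to follow.
tau_fast, tau_slow = relaxation_time_constants(1.0, 2.0, 3.0, 4.0)
```

      With only two time constants available, this minimal scheme cannot generate three kinetically distinct tail components at one voltage; additional states are needed, which is the situation discussed above.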

      The results presented by the authors can be alternatively explained with a change in the equilibrium between the close to inactivated/recovery from inactivation to the open state. 

      Again, we disagree. The model construction explains in detail that the transition from the first to the second phase is not gradual. Shifting equilibria cannot reproduce this. We have extensively tested that idea and can exclude this possibility.

      Finally, the authors state that they do not detect "cumulative inactivation after repeated depolarization", but that is considering inactivation only from the open state and ignoring the possibility of closed-state inactivation or that, as in hERG, the channel inactivates faster than it activates (Smith PL, Yellen G. J Gen Physiol. 2002). 

      We respectfully disagree. We explicitly model an open state that inactivates faster (O2->C2) than it activates. Once more, this is stated in the revised article, which we point to for details. Again, this alternative mechanism does not have the potential to explain all three effects. As discussed above about the chloride contamination concerns, this inactivation hypothesis was mentioned in the first review round and, therefore, addressed in our reply and the revised article. We also explained that “inactivation” has no specific meaning in Markov models. In the absence of O1, all transitions towards the lower layer are effectively “inactivation from closed states”, because they make access to the only remaining open state less likely”. But this is semantics. What is relevant is that no network of states around a single open state can reproduce the three effects in a more parsimonious way than the assumption of the second open state does.

      (3) Single channel conductance.

      The single channels experiments are a great way to assess the different conductance of single channel openings, unfortunately the authors cannot measure accurately different conductances for the two proposed open states. The Markov Model built by the authors, disagrees with their interpretation of the experimental results assigning the exact same conductance to the two modeled open states. To interpret the mutant data, it is needed to add data with the WT for comparison and in presence of specific blockers. 

      We respectfully disagree. As previously shown, the conductance of the flickering wild-type open state is very difficult to resolve. Our recordings do not show that the two states have different single-channel conductances, and therefore the model assumes identical single-channel conductance. 

      The important point is that the single-channel recordings clearly show two different gating modes associated with the voltage ranges in which we predict the two open states. One has a smaller macroscopic current due to rapid flickering (aka “inactivation”). These recordings are another proof of the existence of two open states because the two gating modes occur.  Wild-type data can be found in Bauer and Schwarz, (2001, doi:10.1007/s00232-001-0031-3) or Pardo et al., (1998, doi:10.1083/jcb.143.3.767) for comparison.

      We appreciate the effort editors and reviewers invested in assessing the revised manuscript. Yet, we think that the demanded revision of experimental conditions and quantification methods contradicts the commonly accepted practice for KV10 channels. Some of the reviewer comments are skeptical about the biphasic behavior, which is an established and replicated finding for many mutants and by many researchers. The alternative explanations for these disbelieved findings are either “semantics” or cannot quantitatively explain the measurements. Therefore, only the demand for more explanations and unprecedented resolution in single-channel recordings remains. We share these sentiments.

      ———— The following is the authors’ response to the original reviews.

      (1) The authors must show that the second open state is not just an artifact of endogenous activity but represents the activity of the same EAG channels. I suggest that the authors repeat these experiments in Mes-based solutions. 

      (2) Along the same lines, it is necessary to show that these currents can be blocked using known EAG channel blockers such as astemizole. Ultimately, it will be important to demonstrate using single-channel analysis that these do represent two distinct open states separated by a closed state. 

      We have addressed these concerns using several approaches. The most substantial change is the addition of single-channel recordings on ΔPASCap. In those experiments, we could provide evidence of the two types of events in the same patch, and the presence of an outward current at -60 mV, 50 mV below the equilibrium potential for chloride. The channels were never detected in uninjected oocytes, and Astemizole silenced the activity in patches containing multiple channels. These observations, together with the maintenance of the biphasic behavior that we interpret as evidence of the presence of O1 in methanesulfonate-based solutions, strongly suggest that both O1 and O2 arise from the expression of KV10.1 mutants.

      (3) Currents should be measured by increasing the pulse lengths as needed in order to obtain the true steady-state G-V curves. 

      We agree that the endpoint of activation is ill-defined in the cases where a steady-state is not reached. This does indeed hamper quantitative statements about the relative amplitude of the two components. However, while the overall shape does change, its position (voltage dependence) would not be affected by this shortcoming. The data, therefore, supports the claim of the “existence of mutant-specific O1 and its equal voltage dependence across mutants.”

      (4) A more clear and thorough description should be provided for how the observations with the mutant channels apply to the behavior of WT channels. How exactly does state O1 relate to WT behavior, and how exactly do the parameters of the mathematical model differ between WT and mutants? How can this be interpreted at a structural level? What could be the structural mechanism through which ΔPASCap and E600R enable conduction through O1? It seems contradictory that O1 would be associated exclusively with voltage-sensor activation and not gating ring transitions, and yet the mutations that enable cation access through O1 localize at the gating ring - this needs to be better clarified. 

      We have undertaken a thorough rewriting of all sections to clarify the structural correlates that may explain the behavior of the mutants. In brief, we propose that when all four voltage sensors move towards the extracellular side, the intracellular ring maintains the permeation path closed until it rotates. If the ring is altered, this “lock” is incompetent, and permeation can be detected (page 34). By fixing the position of the ring, calmodulin would preclude permeation in the WT and promote the population of O1 in the mutants.

      (5) Rather than the t80% risetime, exponential fits should be performed to assess the kinetics of activation. 

      We agree that the assessment of kinetics by a t80% is not ideal. We originally refrained from exponential fits because they introduce other issues when used for processes that are not truly exponential (as is the case here). We had planned to perform exponential fits in this revised version, but because the activation process is not exponential, the time constants we could provide would not be accurate, and the result would remain qualitative as it is now. In the experiments where we did perform the fits (Fig. 3), the values obtained support the statement made. 

      (6) It is argued based on the G-V relations in Figure 2A that none of the mutations or deletions introduced have a major effect on state O1 properties, but rather affect state O2. However, the occupancy of state O2 is undetermined because activation curves do not reach saturation. It would be interesting to explore the fitting parameters on Fig.2B further to test whether the data on Fig 2A can indeed only be described by fits in which the parameters for O1 remain unchanged between constructs. 

      We agree that the absolute occupancy of O2 cannot be properly determined if a steady state is not reached. This is, however, a feature of the channel. During very long depolarizations in WT, the current visually appears to reach a plateau, but a closer look reveals that the current keeps increasing after very long depolarizations (up to 10 seconds; see, e.g., Fig. 1B in Garg et al., 2013, Mol Pharmacol 83, 805-813. DOI: 10.1124/mol.112.084384). Interestingly, although the model presented here does not account for this behavior, we propose changes in the model that could. “If the relative stability of O2 and C2 continued to change throughout the depolarization such a current creep-up could be reproduced. However, this would require either the introduction of further layers of On↔Cn states or a non-Markovian modification of the model’s evolution.” Page 34.

      (7) The authors interpret the results obtained with the mutants DPASCAP and E600R -tested before by Lorinczi et al. 2016, to disrupt the interactions between the PASCap and cNBHD domains- as a two-step gating mechanism with two open states. All the results obtained with the E600R mutant and DPASCap could also be explained by inactivation/recovery from inactivation behavior and a change in the equilibrium between the closed, closed/inactivated, and open states. Moreover, the small tails between +90 to +120 mV suggest channels accumulate in an inactive state (Fig 1E). It is not convincing that the two open-state model is the mechanism underlying the mutant's behavior.  

      We respectfully disagree with the notion that a single open state can provide a plausible explanation for "All the results obtained with the E600R mutant and DPASCap". We think that our new single channel results settle the question, but even without this direct evidence, a quantitative assessment of the triphasic tail currents all but excludes the possibility of a single open state. We agree that it is, in principle, possible to obtain some form of a multiphasic tail with a single open state using the scheme suggested in this comment: at the end of the test pulse, a large fraction of the channels must be accumulated in inactive states, and a few are in the open state. The hyperpolarization to -100mV then induces a rapid depopulation of the open state, followed by slower replenishments from the inactive state. Exactly this process occurs in our model, when C2 empties through O2 (Supp. 5 to Fig 9, E600R model variant). However, this alone is highly unlikely to quantitatively explain the measured tail currents, because of the drastically different time scales of the initial current decay (submillisecond to at most a few milliseconds lifetime) and the much slower transient increase in current (several tens of milliseconds) and the final decay with time constants of >100 ms (see for instance data in Fig. 1 E for E600R +50 to +120mV test pulse). To sustain the substantial magnitude of slowly decaying current by slow replenishment of an open state with a lifetime of 1 ms requires vast amounts of inactivated channels. A rough estimation based on the current integral of the initial decay and the current integral of the slowly decaying current suggests that at the end of the test pulse, the ratio inactivated/open channels would have to be 500 to 1500 for this mechanism to quantitatively explain the observed tail currents. 
To put this in perspective: This would suggest that without inactivation all the expressed channels in an oocyte would provide 6 mA current during the +100 mV test pulse. While theoretically possible, we consider this a less likely explanation than a second open state.
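
The rough estimate described above can be sketched in a few lines. This is only an illustration under stated assumptions: both tail components are treated as single exponentials, every channel is assumed to carry on average the same charge per passage through the single open state, and the amplitudes and time constants below are hypothetical placeholders rather than values measured in the paper.

```python
def exponential_charge(amplitude, tau_ms):
    """Integral from 0 to infinity of amplitude * exp(-t / tau_ms)."""
    return amplitude * tau_ms


def inactivated_to_open_ratio(fast_amp, fast_tau_ms, slow_amp, slow_tau_ms):
    """Ratio of the charge carried by the slow tail component to the charge
    carried by the initial fast decay.

    Under a single-open-state scheme, the fast component reflects channels
    that were open at the end of the test pulse, and the slow component
    reflects channels returning from inactivated states, so the charge ratio
    approximates the inactivated/open occupancy ratio.
    """
    q_fast = exponential_charge(fast_amp, fast_tau_ms)
    q_slow = exponential_charge(slow_amp, slow_tau_ms)
    return q_slow / q_fast


# Hypothetical, normalized values chosen only to illustrate the arithmetic:
# a 1 ms open-state lifetime and a slow decay with a 150 ms time constant.
ratio = inactivated_to_open_ratio(
    fast_amp=1.0, fast_tau_ms=1.0, slow_amp=5.0, slow_tau_ms=150.0
)
print(f"required inactivated/open ratio: {ratio:.0f}")
```

Because the ratio scales with the ratio of time constants, a millisecond open-state lifetime combined with a tail lasting more than 100 ms pushes the required inactivated pool to hundreds of times the open pool even for comparable amplitudes.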

      (8) Different models should be evaluated to establish whether the results in Figure 4 can also be explained by a model in which states O1 and O2 have the same conductance. It would be desirable if the conductance of both states were experimentally determined - noise analysis could be applied to estimate the conductance of both states. 

      In the modified model, O1 and O2 have the same single-channel conductance. The small conductance combined with the fast flickering did not allow an accurate determination, but we can state that there is no evidence that the single-channel conductance of the states is different.

      (9) Although not included, it looks like the model predicts some "conventional inactivation". This can be appreciated in Fig 8 and in the traces at -60 mV. Interestingly, the traces obtained in the absence of Cl- also undergo slow inactivation, or 'conventional inactivation' as referred to by the authors. Please revise the following statement: "Conventional inactivation was never detected in any mutants after repeated or prolonged depolarization. In the absence of inactivation, the pre-pulse dependent current increase at +40 mV could be related to changes in the relative occupancy of the open states".

      We have carefully edited the manuscript to address this concern. The use of the term inactivation admittedly represents a challenge. We agree that the state that results from the flickering block (C2) could be defined as “inactivated” because it is preceded by an open state. Yet, in that case, the intermediate states that the channel travels between O1 and O2 would also be sensu stricto “inactivated”, but only in the mutants. We have made this clear on page 17.

      Recommendations for improving the writing and presentation.

      (1) Methods section: Please state the reversal potential calculated for the solution used. It looks like the authors used an Instantaneous I-V curve method to calculate the reversal potential; if that's correct, please show the I-V and the traces together with the protocol used. 

      We have provided the calculated reversal potentials for excised patches. We cannot predict the reversal potential in whole oocytes because we have no control over the intracellular solution. The reversal potential was determined in the mutants through the current at the end of the stimulus because the mutants produced measurable inward currents. The differences in reversal potential were not significant among mutants.

      Pulse protocols have been added to the figures.

      (2) Figure 1 suggestion: Combine the two panels in panel D and move the F panel up so the figure gets aligned in the lower end.

      Thank you, this has been done.

      (3) Please clarify the rationale for using the E600R-specific mutant. I assume it is based on the effect reported by Lorinczi et al. 2016 and how this is similar to the DPASCap phenotype, or is it due to the impact of this mutation on the interactions between the N-term and the cNBHD?

      We have explained the rationale for the use of E600R explicitly on page 6.

      (4) Fig S1A is not present in the current version of the manuscript. Include a cartoon as well as a structural figure clearly depicting the perturbations introduced by E600R, ΔPASCap, and the other deletions that are tested. Additional structural information supporting the discussion would also be helpful to establish clearer mechanistic links between the experimental observations described here and the observed conformational changes between states in Kv10 channel structures. 

      We have corrected this omission, thank you for pointing it out.

      (5) It would be informative to see the traces corresponding to the I-V shown in Fig 7 A and B at the same indicated time points (0, 60, 150, and 300s). Did the authors monitor the Ca2+ signal rise after the I&T treatment to see if it coincides with the peak in the 60s? 

      In Figure 7 (now Figure 8) we used voltage ramps instead of discrete I-V protocols because of the long time required for recording the latter. This is stated on page 19. Ca2+ was monitored through Cl- current after ionomycin/thapsigargin. The duration of the Ca2+ increase was reproducible among oocytes and in good agreement with the changes observed in the biphasic behavior of the mutants (Supplement 1 to Figure 8).

      (6) Fig 4. Please state in the legend what the different color traces correspond to in E600R and DPASCap. Is there a reason to change the interpulse on DPASCap to -20mV and not allow this mutant to close? Please state. How do the authors decide the 10 ms interval for the experiments in Fig 2? 

      Thank you for pointing this out, we have added the description. We have explained why we use a different protocol for ΔPASCap and the reason for using 10 ms interval (we believe the referee means Figure 4) on page 12.  

      (7) Fig. 5. The pre-pulse is supposed to be 5 s, but the time scale doesn't correspond to a pre-pulse of 5 s before the test pulse to +40 mV. Has the pre-pulse been trimmed for representation purposes? If so, please state.

      The pre-pulse was 5s, but as the reviewer correctly supposed, the trace is trimmed to keep the +40 mV stimulus visible. This has now been clearly stated in the legend.

      (8) The mutant L322H is located within the S4 helix according to the Kv10.1 structure (PDB 5K7L), not in the 'S3-S4 linker'; please correct. 

      This has been done, thank you.

      The introduction of this mutant should also shift the voltage dependence toward more hyperpolarizing potentials (around 30mV, according to Schoenherr et al. 1999). It looks like that shift is present within the first component of the G-V. Still, since the max amplitude from the second component could be contaminated by endogenous Cl- currents, this effect is minimized. Repeating these experiments in the no Cl- solutions will help clarify this point and see the effect of the DPASCap and E600R in the background of a mutation that accelerates the transitions between the closed states (see Major comment 1). Did the authors record L322H alone for control purposes? 

      We have decided not to measure L322H alone or repeat the measurements in Cl--free solutions because we do not see a way to use the quantitative assessment of the voltage dependence of L322H and the L322H-variants of the eag domain mutants. Like in our answer to main point 3, we base our arguments not on the precise voltage dependence of the second component but on the shape of the G-V curves instead, specifically the consistent appearance of the first component and the local conductance minimum between the first and second components. After the introduction of L322H the first component is essentially absent.

      We think that the measurements of the L322H mutants cannot be interpreted as a hyperpolarizing shift in the first component. The peak of the first conductance component occurs around -20 mV in ΔPASCap and E600R (Fig. 7 C, D). After a -30mV shift, in L322H+DPASCap and L322H+E600R, this first peak would still be detected within the voltage range in our experiments, but it is not. A contamination of the second component would have little impact on this observation, which is why we refrain from the suggested measurements.  

      (9) The authors differentiate between an O1 vs. O2 state with different conductances, and maybe I missed it, but there's no quantitative distinction between the components; how are they different?

      Please see the response to the main comments 1 and 2. This has been addressed in single-channel recordings.

      (10) Please state the voltage protocols, holding voltages, and the solutions (K+ concentration and Cl- presence/absence) used for the experiments in the figure legends, so that the experiments presented are easier to interpret.

      Thank you, this has been done.

      (11) The authors state on page 7 that "with further depolarizations, the conductance initially declined to rise again in response to strong depolarizations. This finding matches the changes in amplitude of the tail currents, which, therefore, probably reflect a true change in conductance" However, the tails in the strong voltage range (+50 to +120 mV) for the E600R mutant argue against this result. Please review.

      The increase in the amplitude of the tail current is also present in E600R, but the relative increase is smaller. We have decided against rescaling these traces because the Figure is already rather complex. We indicated this fact with a smaller arrow and clarified it in the text (page 8).

      (12) The authors mention that the threshold of activation for the WT is around -20mV; however, the foot of the G-V is more around -30 or -40mV. Please revise. 

      Thank you. We have done this. 

      (13) The authors state on page 9 that the "second component occurs at progressively more depolarized potentials for increasingly larger N-terminal deletions". However, the E600R mutant, which keeps the N-terminus intact, has a shift as pronounced as that of DPASCap and larger than that of D2-10. How do the authors interpret this result?

      We have corrected this statement on page 10: “…the second component occurs at progressively more depolarized potentials for increasingly larger N-terminal deletions and when the structure of the ring is altered through disruption of the interaction between N- and C-termini (E600R)”.

      (14) The equation defined to fit the G-Vs, can also be used to describe the WT currents. If the O1 is conserved and present in the WT, this equation should also fit the WT data properly. The 1-W component shown could also be interpreted as an inactivating component that, in the WT, shifts the voltage-dependence of activation towards depolarizing potentials and is not visible. Still, the mutants do show it as if the transition from closed-inactivated states is controlled by interactions in the gating ring, and disturbing them does affect the transitions to the open state. 

      Out of the two open states in the mutant, O2 is the one that shares properties with the WT (e.g. it is inaccessible during Ca2+-CaM binding) while O1 is the open state with the voltage dependence that is conserved across the mutants. We, therefore, believe that this question is based on a mix-up of the two open states. We appreciate the core of the question: does the pattern in the mutants’ G-V curves find a continuation in the WT channel? 

      Firstly, the component that is conserved among mutants does not lead to current in the WT because the corresponding open state (O1) is not observed in WT. However, the gating event represented by this component should also occur in WT and –given its apparent insensitivity to eag domain mutations–  this gating step should occur in WT with the same voltage dependence as in all the mutants. This means that this first component sets a hard boundary for the most hyperpolarized G-V curve we can expect in the WT, based on our mutant measurements. Secondly, the second component shows a regular progression across mutants: The more intact the eag domain is, the more hyperpolarized the Vhalf values of transition term (1-W) and O2 activation. In Δ2-10, the transition term already almost coincides with O1 activation (estimated Vhalf values of -33.57 and -33.47 mV). A further shift of (1-W) in the WT is implausible because, if O1 activation is coupled to the earliest VSD displacement, the transition should not occur before O1 activation. Still, the second component might shift to more hyperpolarized values in the WT, depending on the impact of amino acids 2 to 10 on the second VSD transition.

      In summary, in WT the G-V should not be more hyperpolarized than the first component of the mutants, and the (1-W)-component probably corresponds to the Δ2-10 (1-W)-component. In WT the second component should be no more depolarized than the second component of Δ2-10. The WT G-V (Fig. 1B) meets all these predictions derived from the pattern in the mutant G-Vs: When we use Eq. 4 to fit the WT G-V with A1=0 (O1 is not present in WT) and the parameters of the transition term (1-W) fixed to the values attained in Δ2-10, we obtain a fit for the O2 component with Vhalf = +21 mV. This value nicely falls into the succession of Vhalf values for Δeag, ΔPASCap, and Δ2-10 (+103 mV, +80 mV, +52 mV) and, at the same time, it is not more hyperpolarized than the conserved first component (Vhalf -34 mV). Our measurements therefore support that the O2 component in the mutants corresponds to the single open state in the WT.

      (15) Page 15, the authors state that 'The changes in amplitude and kinetics in response to rising intracellular Ca2+ support our hypothesis that Ca-CaM stabilized O1, possibly by driving the channels to deep closed states (Fig 5 and 6)' (pg 15). This statement seems contradictory; I can't quite follow the rationale since Ca2+ potentiates the current (Fig 7), and the addition of the L322H mutant in Fig 7 makes the shift of the first component to negative potentials visible.

      Please check the rationale for this section. 

      We have explained this more explicitly in the discussion (page 32). “Because access to O1 occurs from deep closed states, this could be explained by an increased occupancy of such deactivated states in response to CaM binding. This appears to be the case since CaM induces a biphasic behavior in the mutant channels that show reduced access to deep closed states; thus, L322H mutants behave like the parental variants in the presence of Ca2+-CaM. This implies a mechanistic explanation for the effect of Ca2+-CaM on WT since favoring entry into deep closed states would result in a decrease in current amplitude in the absence of (a permeable) O1”.

      Also, Figs 5 and 6 seem miscited here. 

      Thank you, we have corrected this.

      (16) For Figure 5, it would be helpful if each of the current traces corresponding to a particular voltage had a different color. That way, it will be easier to see how the initial holding voltage modulates current. 

      We have considered this suggestion, and we agree that it would make it easier to follow. Yet, since we have identified the mutants with different colors, it would be inconsistent if we used another color palette for this Figure. Supplement 3 to Figure 9 shows the differences in a clearer way.

      (17) Add zero-current levels to all current traces.

      We have done this.

      (18) The mathematical model should be described better. Particularly, the states from which O1 can be accessed should be described more clearly, as well as whether the model considers any direct connectivity between states O1 and O2. The origin of the voltage-dependence for transitions that do not involve voltage-sensor movements should be discussed. Also, the separation of kappa into kappa-l and kappa-r should be described.

      We have extensively rewritten the description of the mathematical model to address these concerns.

      (19) Page 4, "reveals a pre-open state in which the transmembrane regions of the channel are compatible with ion permeation, but is still a nonconducting state". Also, page 27, "renders a hydrophobic constriction wider than 8 Å, enough to allow K+ flow, but still corresponds to a non-conducting state". These sentences are confusing - how can the regions be compatible with ion permeation, and still not be conducting? Is cation conductance precluded by a change in the filter, or elsewhere? How is it established that it represents a non-conducting state? 

      We have rephrased to clarify this apparent inconsistency. Page 4: “(…) in which the transmembrane regions of the channel are compatible with ion permeation (the permeation path is dilated, like in open states) but the intracellular gate is still in the same conformation as in closed states (Zhang et al., 2023).” Page 31: “The presence of an intact intracellular ring would preclude ionic flow in the WT, and its alteration would explain the permeability of this state in the mutants.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This is a useful report of a spatially-extended model to study the complex interactions between immune cells, fibroblasts, and cancer cells, providing insights into how fibroblast activation can influence tumor progression. The model opens up new possibilities for studying fibroblast-driven effects in diverse settings, which is crucial for understanding potential tumor microenvironment manipulations that could enhance immunotherapy efficacy. While the results presented are solid and follow logically from the model’s assumptions, some of these assumptions may require further validation, as they appear to oversimplify certain aspects in light of complex experimental findings, system geometry, and general principles of active matter research.

      We thank the editor for recognizing the usefulness of our work. This work does not aim to precisely describe the complexity of the tumor microenvironment in lung cancer, but rather to classify and rigorously calibrate a minimum number of parameters to the clinical data we collect and generate, and reproduce the global structures of the microenvironment. We identify different scenarios, and show how they depend on the local interactions within this framework. Although we started in the first version with coalescence in the main text and anisotropic geometry in the supporting information, we realized that we needed to provide more directions to better show how our model can be extended. Thus, in Section III-4 we added an analysis of a microenvironment with blood vessels, and showed how to introduce anisotropic friction as a function of fiber orientation, as well as active stress, paving the way for further studies, that would make our model more complex. However, in a first step, it is crucial to start with a limited number of parameters that can be rigorously determined, and this is how this first work was conceived.

      Public Reviews:

      Reviewer #1 (Public review):

      The authors present an important work where they model some of the complex interactions between immune cells, fibroblasts and cancer cells. The model takes into account the increased ECM production of cancer-associated fibroblasts. These fibres trap the cancer but also protect it from immune system cells. In this way, these fibroblasts’ actions both promote and hinder cancer growth. By exploring different scenarios, the authors can model different cancer fates depending on the parameters regulating cancer cells, immune system cells and fibroblasts. In this way, the model explores non-trivial scenarios. An important weakness of this study is that, though it is inspired by NSCLC tumors, it is restricted to modelling circular tumor lesions and does not explore the formation of ramified tumors, as in NSCLC. In this way, is only a general model and it is not clear how it can be adapted to simulate more realistic tumor morphologies.

      We thank the reviewer for highlighting the importance of our work. We acknowledge that although we provided anisotropic geometries and the study of the coalescence in the first version, more effort was needed to provide tools to extend our formalism to non-ideal cases. This is now added as Section III-4, where we analyze the impact of blood vessels and the anisotropic friction due to the nematic order of the fibers; this nematic order can also be used to introduce active nematic stress.

      Reviewer #2 (Public review):

      Summary:

      The authors develop a computational model (and a simplified version thereof) to treat an extremely important issue regarding tumor growth. Specifically, it has been argued that fibroblasts have the ability to support tumor growth by creating physical conditions in the tumor microenvironment that prevent the relevant immune cells from entering into contact with, and ultimately killing, the cancer cells. This inhibition is referred to as immune exclusion. The computational approach follows standard procedures in the formulation of models for mixtures of different material species, adapted to the problem at hand by making a variety of assumptions as to the activity of different types of fibroblasts, namely ”normal” versus ”cancer-associated”. The model itself is relatively complex, but the authors do a convincing job of analyzing possible behaviors and attempting to relate these to experimental observations.

      Strengths:

      As mentioned, the authors do an excellent job of analyzing the behavior of their model both in its full form (which includes spatial variation of the concentrations of the different cellular species) and in its simplified mean field form. The model itself is formulated based on established physical principles, although the extent to which some of these principles apply to active biological systems is not clear (see Weaknesses). The results of the model do offer some significant insights into the critical factors which determine how fibroblasts might affect tumor growth; these insights could lead to new experimental ways of unraveling these complex sets of issues and enhancing immunotherapy.

      We thank the referee for this summary and for recognizing the strengths of our paper.

      Weaknesses:

      Models of the form being studied here rely on a large number of assumptions regarding cellular behavior. Some of these seemed questionable, based on what we have learned about active systems. The problem of T cell infiltration as well as the patterning of the extracellular matrix (ECM) by fibroblasts necessarily involve understanding cell motion and cell interactions due e.g. to cell signaling. Adopting an approach based purely on physical systems driven by free energies alone does not consider the special role that active processes can play, both in motility itself and in the type of self-organization that can occur due to these cell-cell interactions. This to me is the primary weakness of this paper.

      We thank the referee for this important comment, which allows us to clarify this point. Although biological materials are out of equilibrium, their behavior often resembles that dictated by thermodynamics; hence the usefulness of constructing a free energy in terms of state variables. In a first approach to decipher the complex interactions and describe the different and sometimes non-trivial outcomes in this system, which involves many components, we must start by minimizing the number of parameters and identifying the complex processes that control the evolution of the system. The free energy that we build for this biological system therefore contains out-of-equilibrium processes that can be approximated by a ”close to equilibrium” description. Our approach is a classical one in the statistical physics of active systems, namely the effort to construct an equivalent free energy for out-of-equilibrium systems. This allows us to gain a clearer insight into those complex processes.

      We have added a sentence in the main text, section III.1, to clarify this point:

      “Building a free-energy density for a biological material is justified, because, although biological materials are out of equilibrium, their behavior often resembles that dictated by thermodynamics. It is therefore useful to write a free energy in terms of state variables.”

      Nevertheless, we recognize that we should have provided more tools for using our formalism by making it active. This is why we introduced the nematic order in the fibers in Section III-4. This nematic order can be used to introduce active stress, and we have cited previous works by some of us as references for building active processes out of it.

      We must also note that cell signaling has been introduced a minima in our system for providing the cue for the arrival of T-cells and NAFs from the boundaries. However, we found that although we had evoked the other role of the chemicals in the transformation from NAFs to CAFs in the text, details were not well explained. We have therefore corrected and added some explanations in the introduction of section III, and III.1, III.2.

      A separate weakness concerns the assumption that fibroblasts affect T cell behavior primarily by just making a more dense ECM. There are a number of papers in the cancer literature (see, for some examples, Carstens, J., Correa de Sampaio, P., Yang, D. et al. Spatial computation of intratumoral T cells correlates with survival of patients with pancreatic cancer. Nat Commun 8, 15095 (2017); Sun, Xiujie, Bogang Wu, Huai-Chin Chiang, Hui Deng, Xiaowen Zhang, Wei Xiong, Junquan Liu et al. ”Tumour DDR1 promotes collagen fibre alignment to instigate immune exclusion.” Nature 599, no. 7886 (2021): 673-678) that seem to indicate that density alone is not a sufficient indicator of T cell behavior. Instead, the organization of the ECM (for example, its anisotropy) could be playing a much more essential role than is given credit for here. This possibility is hinted at in the Discussion section but deserves much more emphasis.

      The referee is right in his comment, and we thank him for raising this issue. We have therefore introduced the anisotropic orientation of the fibers, which induces an anisotropic friction, in a new section III-4. In addition, the references pointed out were included in this section. However, although the anisotropy strongly influences the fate of the tumor when the fibers are oriented perpendicular to the surface of the cancer nest, it is less effective when the fibroblasts are oriented in the direction of the surface of the cancer nest. In the latter case, which is often the case before cancer cells reshape the tumor microenvironment, the matrix density should correlate with the friction.

      Finally, the mixed version of the model is, from a general perspective, not very different from many other published models treating the ecology of the tumor microenvironment (for a survey, see Arabameri A, Asemani D, Hadjati J (2018), A structural methodology for modeling immune-tumor interactions including pro-and anti-tumor factors for clinical applications. Math Biosci 304:48-61). There are even papers in this literature that specifically investigate effects due to allowing cancer cells to instigate changes in other cells from being tumor-inhibiting to tumor-promoting. This feature occurs not only for fibroblasts but also for example for macrophages which can change their polarization from M1 to M2. There needed to be some more detailed comparison with this existing literature.

      The referee is right that the first part of our approach, namely the dynamical system, may be common in this kind of system, and it needs to be mentioned. So we added the following sentence in the discussion: ”This is in line with several similar mathematical models, that study through this lens the inhibition/activation of the immune system by cancer cells either by means of compartmental nonlinear models similar to our dynamical system, for instance regarding macrophage recruitment and cytokine signaling {arabameri2018structural} {li2019computational}, or mixture models {fotso2024mixture}. We combine the two approaches in order to rigorously derive the parameters of the model and gain insights from both.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The authors should address the following points:

      Major issues

      (1) The shape of tumors simulated differs immensely from the observed tumors in Fig. 2. Here, the tumor is constituted by irregular domains, not dissimilar from domains in phase separating mixtures. The domains simulated are circular. Since the authors are using the space dependent model to model the increase in tumor cells with time in the different scenarios (immune-desert, immune-excluded, immune inflamed), it should explain how non-spherical tumor structures can be observed in these scenarios. The authors introduce tumor coalescence in page 28, however, it is not expected that the structures observed in Fig 2 are the result from different tumors merging and coalescing, because that would result from an unlikely large number of initial mutation events in the same region of the tissue. The authors should explain what mechanisms present in the model can lead to non-spherical forms.

      We agree with the reviewer that real tumors are rarely round contrary to what our numerics suggests. In fact, only the last figure of our paper in the supporting information was more appropriate for such a discussion. We are now adding discussions and new figures to better illustrate our spatial model, see Figure 6 and section III-4. The in situ geometry of tumors depends on the shape of the host organ, the diffusive (chemical) or advected species such as T cells and fibroblasts, and on the nutrients. Thus, in our case, only cancer cells are produced locally, but during growth the tumor is strongly constrained by the microenvironment, and thus the geometry of the domain we model in the numerics and its boundary conditions. This is also true for the chemicals responsible for growth, cellular advection and phenotypic transformation. Their concentration depends on a convection-diffusion equation and boundary conditions. For a tumor in situ, such as in the lung, the available space is a constraint that will dominate the final geometry of the tumor nests. We do not think that coalescence is controlled by mutational events, but most likely by the search for space necessary for growth. Compared to the first version, we add new figures (Figure 6) that show that the geometry of the organ, as well as the localization of blood vessels, are a cause of the irregularity of the tumor shapes. We also introduce orientational order, which as suggested in section III-4, can induce anisotropic friction and stresses, as well as anisotropic growth. We cite (Ackermann, Joseph, and Martine Ben Amar. ”Onsager’s variational principle in proliferating biological tissues, in the presence of activity and anisotropy.” The European Physical Journal Plus 138.12 (2023): 1103.) where we described active stresses and coupling related to anisotropic growth.

      (2) According to the authors, the model presented in equations (1) and onwards simulates the evolution of the fraction of tumor cells in the tissue. However the fraction of tumor cells, for example, depends itself on the variation of other cell types. For example, if fibroblasts were to proliferate with rate alpha, even without tumor cells proliferating, the fraction of tumor cells in the mixture should decrease as alpha times the tumor cells fraction. These terms are missing. The equations do not describe the evolution of the cells’ fractions but of the amount of cells of each type, normalised by the total carrying capacity of non-normal cells in the tissue. The text should be rewritten accordingly.

      We agree with the referee: our definition of cell density was not precise enough and may appear misleading. In paragraph II.1, we more explicitly introduce the term mass fraction, which is the correct physical quantity to introduce into the spatial model.

      ”All these cells have the same mass density and the sum of their mass fraction satisfies the relationship S = C + T + F<sub>NA</sub> + F<sub>A</sub> = 1-N, where N is a healthy non active component as healthy cells, for example.”

      It is less intuitive than ”number of cells per unit volume” but necessary for what follows (Section III).
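
The relation S = C + T + F_NA + F_A = 1 - N can be made concrete with a short sketch. The numerical values below are hypothetical and serve only to illustrate the bookkeeping: raising the activated fibroblast fraction lowers the healthy component N while the cancer mass fraction C itself is unchanged, even though the cancer share among the active components, C/S, drops.

```python
def healthy_fraction(C, T, F_NA, F_A):
    """Return N = 1 - S, where S = C + T + F_NA + F_A is the total mass
    fraction of the active components (all assumed to share one mass density)."""
    S = C + T + F_NA + F_A
    if not 0.0 <= S <= 1.0:
        raise ValueError("mass fractions must sum to a value between 0 and 1")
    return 1.0 - S


# Hypothetical mass fractions, for illustration only.
before = dict(C=0.2, T=0.1, F_NA=0.1, F_A=0.1)
after = dict(before, F_A=0.3)  # more activated fibroblasts, same C

N_before = healthy_fraction(**before)
N_after = healthy_fraction(**after)

# Cancer share among the active components, C / S, with S = 1 - N.
share_before = before["C"] / (1.0 - N_before)
share_after = after["C"] / (1.0 - N_after)

print(N_before, N_after, share_before, share_after)
```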

      (3) The authors start by calculating fixed points of different versions of the dynamical system without spatial dependence. They should explain what is the relevance of these fixed points: in a real situation, where the concentration of tumor fibroblasts and T-cells depend on position, in which conditions are these fixed points relevant?

      The referee is right and we will clarify this point: the dynamical analysis helps in understanding and predicting the scenarios that occur in the system. After all the steps of paragraph 2.2, we are left with 11 independent parameters for the dynamical system alone, not counting the parameters generated by the spatial modeling itself. Our estimates concern only lung cancer, and these parameters do not appear in the literature. The parameters introduced in Sec. III, which are more closely related to physical interactions such as friction, cell-cell adhesion, etc., can be found in the literature or can be estimated and thus measured in in vitro experiments (see Ackermann and Ben Amar, EPJP 2023, P. Benaroch, J. Nikolic et al. 2024, bioRxiv). So what are the fixed points for? They help to get the right numbers for the spatial analysis. To recover special features of cancer evolution, we need a model, but also correct estimates of the data in a code that is quite technical and heavy, with each simulation taking a certain amount of time. For users who only need rough predictions, the analysis in section 2 is sufficient.
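
      How fixed points guide a simulation can be illustrated on a deliberately small example. The two-variable system below is a hypothetical toy model, not the manuscript's equations: the variables C (tumor) and T (T-cells), the parameters r, k, s, d and the function names are all assumptions made for this sketch. It compares the analytic fixed points with the long-time state reached by a simple explicit Euler integration.

```python
# Toy two-variable system (hypothetical, not the manuscript's model):
#   dC/dt = r*C*(1 - C) - k*C*T   (logistic tumor growth minus killing by T-cells)
#   dT/dt = s - d*T               (constant T-cell supply, linear decay)

def fixed_points(r, k, s, d):
    """Return the analytic fixed points (C*, T*) of the toy system."""
    t_star = s / d
    points = [(0.0, t_star)]           # tumor-free state always exists
    c_star = 1.0 - k * t_star / r
    if c_star > 0.0:                   # coexistence state only if it is positive
        points.append((c_star, t_star))
    return points


def integrate(r, k, s, d, c0=0.01, tc0=0.0, dt=0.01, steps=5000):
    """Explicit Euler integration; returns the final (C, T)."""
    c, tc = c0, tc0
    for _ in range(steps):
        dc_dt = r * c * (1.0 - c) - k * c * tc
        dtc_dt = s - d * tc
        c += dt * dc_dt
        tc += dt * dtc_dt
    return c, tc


if __name__ == "__main__":
    # Weak and strong T-cell supply: the tumor persists or is eliminated.
    for supply in (0.2, 1.0):
        print(supply, fixed_points(1.0, 1.0, supply, 0.5),
              integrate(1.0, 1.0, supply, 0.5))
```

      In this toy setting the fixed-point formula already tells which supply values lead to persistence or elimination, so the costly time integration is only needed to confirm the prediction.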

      It is also important to note that the global result depends only on the source terms, and on the boundary conditions. This can be illustrated with a simple example: Consider the governing equation for the density of a component with velocity v and source term:

      Integrating the equation over a fixed volume V of surface S gives:

      This integrated equation can then be approximated by the dynamical system that we write. Thus, while the dynamical system does not give any information about the local structure of the system, it may be indicative of its global outcome.
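
      For reference, a standard balance law matching this description can be written as follows. The symbols ρ (density), Γ (source term) and n (outward unit normal on S) are assumed here, since the original expressions are not reproduced in the text.

```latex
\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho\, \mathbf{v}) = \Gamma ,
\qquad
\frac{d}{dt} \int_V \rho \, dV
  = - \oint_S \rho\, \mathbf{v} \cdot \mathbf{n} \, dS + \int_V \Gamma \, dV .
```

      The surface integral is the flux through the boundary and the volume integral of Γ is the net source, so in this form the global balance involves only these two contributions.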

      (4) On page 15, the authors identify that α<sub>NA</sub> is proportional to δ𝝐<sup>4</sup>. However, in equation (7), they replace α<sub>NA</sub> by δ𝝐<sup>4</sup> without the proportionality constant. This should be corrected.

      Thank you for your remark. This typo is now corrected.

      (5) The tumor cell movement should be much slower than the T-cells. Here, the authors assign a similar friction coefficient for the cancer cells and T-cells, for example. However, in lung cancer tumor cells are epithelial, and adhere to each other in the tissue. Their movement is very restricted by the basement membranes and by cell-cell adhesion. Immune cells and T-cells on the other hand move rapidly throughout the stroma. It is a gross simplification to not consider the low epithelial tissue mobility in the context of lung cancer.

      It is possible to assume different friction coefficients for each phase pair. This has been done in a previous publication, Ackermann et al., Physics Reports 2021. It is also possible to play with the cell-cell adhesion in the energy density and with the diffusion coefficient introduced in the Flory-Huggins free energy. Cell-cell adhesion is taken into account in the energy, and this makes the tumor a denser phase, while T-cells can move towards cancer cells to which they are attracted. In the last part of the paper, we show the role of an anisotropic friction due to a nematic order for activated fibroblasts and all the other cells.

      (6) What is the biological mechanism by which the T-cells form a colony with a surface tension? In the phase-field model, the authors have a surface tension assigned to the cancer cells, T-cells and fibroblasts. Can the authors justify biologically why do they consider these surface tensions?

      T-cells form a colony because they accumulate at the outer boundary of the tumor: they are attracted to it but cannot penetrate, owing to the strong cell-cell adhesion of the tumor cells in the nest. Adding a squared-gradient term is standard in continuous models: it limits sharp variations in cell density, which are not physical.
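
      The effect of the gradient-squared contribution can be illustrated numerically. The sketch below assumes a one-dimensional tanh interface profile and a penalty of the form (ε²/2)∫(∂φ/∂x)² dx; the profile shape and the names `profile`, `gradient_energy`, `eps` and `width` are illustrative assumptions, not taken from the manuscript.

```python
import math


def profile(x, width):
    """Smooth interface between phi=0 and phi=1 (tanh shape, an assumed form)."""
    return 0.5 * (1.0 + math.tanh(x / width))


def gradient_energy(width, eps=1.0, half_length=20.0, n=40001):
    """Approximate (eps^2 / 2) * integral of (dphi/dx)^2 dx.

    The derivative is estimated by a finite difference on each grid cell
    and the integral by a Riemann sum over the cells.
    """
    dx = 2.0 * half_length / (n - 1)
    total = 0.0
    for i in range(n - 1):
        x_left = -half_length + i * dx
        dphi = (profile(x_left + dx, width) - profile(x_left, width)) / dx
        total += dphi * dphi * dx
    return 0.5 * eps * eps * total


if __name__ == "__main__":
    # Compare a broad and a sharp interface with the analytic value eps^2 / (6 w).
    for w in (1.0, 0.1):
        print(w, gradient_energy(w), 1.0 / (6.0 * w))
```

      For this tanh profile the integral equals ε²/(6w), so an interface ten times sharper costs ten times more energy, which is how such a term discourages unphysically steep density variations.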

      Minor issues

      (a) Page 6 (end), characterisation of the fibre barrier produced by CAFs missing: what is the fibre density, how it can hinder the spread of cancer and T-cell motility? Is it so dense that it prevents ameboid movement? Can cells move through it using matrix degradation proteins?

      The fiber density corresponds to the fibrous organic extracellular matrix secreted by cancer-associated fibroblasts. In desmoplastic (highly fibrous) tumors such as PDAC or NSCLC, this extracellular matrix deposited around the tumor forms a physical barrier around the tumor nest, preventing cell migration as well as capillary and immune cell penetration. In these cases, the fibrous belt actually prevents ameboid movement and cells must deform significantly to migrate. The role of this barrier was demonstrated in particular in (Grout, John A., et al. “Spatial positioning and matrix programs of cancer-associated fibroblasts promote T-cell exclusion in human lung tumors.” Cancer Discovery 12.11 (2022): 2606-2625.). In later stages of cancer, the tumor may adapt and develop strategies to metastasize, such as matrix degradation. This matrix can be oriented, organized or disordered. To build a minimal model, we first considered an isotropic friction and also an anisotropic friction of the nematic belt, due to the activated fibroblasts. In the case of T-cells, as mentioned in section I.1, it is true that the biological literature also considers a phenotypic transformation of the T cells by the activated fibroblasts: this concerns their proliferative capacities, antigen recognition and cytotoxic function. To better document the different mechanisms, we add the following publication: Cancer associated fibroblasts-an impediment to effective anti-cancer T cell immunity, by Koppensteiner, Lilian and Mathieson, Layla and O’Connor, Richard A and Akram, Ahsan R, Frontiers in immunology (2022).

      However, our goal is to build a minimal model and to characterize and quantify the physical process in which CAFs are involved, namely the role of a physical barrier, as documented above.

      (b) Page 19 (Fig 3), in the figure legend it is written ”resting fibroblasts”, should be ”non-activated fibroblasts”.

      The referee is right: it will be better to write non-activated fibroblasts. This is now changed in the main text.

      (c) Page 21 (equation), what is dΩ? It is dr?

      We thank the referee for raising this point. The text was indeed ambiguous as sometimes dΩ was replaced by dr. To be clearer, all the elements of volume are now denoted dV, and the elements of surface of the system are denoted dS.

      In the article the units are in italic and should be in roman.

      Thank you for raising this point. It has been corrected.

      (d) Page 25 (beginning section III.3), the authors mention that the simulation is 2D, however, the simulation has radial symmetry. A 1D simulation in radial coordinates could simulate a 3D spherical system. Is the simulation of this section equivalent to a 1D radial simulation (in 2D)?

      The referee is right that in radial symmetry, a 1d equation may be written. We therefore present numerics with irregular shapes of the tumor nest in order to make the system fully 2d.

      (e) Page 26 (Fig 4). Legends inside the plots of plates A, B, C and D are not clear. Colorbar range of plates A and D is different. Would facilitate if the ranges were the same.

      The referee is right: the surface plots presented in figure 4 would be easier to compare with the same colorbar range for the legends. In fact, as the referee noted, the figures in A, B and C have the same legends, while the figure in D has a different one. This is due to the fact that D represents the case of the immune-inflamed tumor, where the cancer mass fraction nearly vanishes, resulting in values that are 3 orders of magnitude lower than those present in A, B and C. Therefore, they would disappear if the colorbar range were equal to the others. We insist more on the change of scale in the legend of Figure 4, in the new version.

      (f) Page 29 (Fig 5), would facilitate if the order of immune-desert, immune-excluded, immune-inflamed was maintained throughout the document. In this figure the immune-inflamed case appears first.

      We agree with the reviewer that following the same order in which the different cases are presented throughout the manuscript would be helpful in comparing the different figures. Therefore, we have modified Figure 5.

      (g) Page 31, the authors indicate that pharmacodynamics and pharmacokinetics are highly dependent on tumour spatial structure. Can they provide examples and citations?

      In the discussion, we have added references concerning pharmacodynamics.

      (h) Page 33 (Fig Sup2), would facilitate if the order of immune-desert, immune-excluded, immune-inflamed was maintained throughout the document.

      We thank the reviewer for pointing this out, the order of the different scenarios in Fig Sup 2 has now been changed.

      Reviewer #2 (Recommendations for the authors):

      Major points

      (1) Following on from the discussion in the public review, I feel that there are a number of critical issues that need to be addressed regarding modeling assumptions. I would like to understand why the authors believe it is possible to use a free energy-driven model of the microenvironment when many of the processes relevant for their study have an undeniably ”active media” flavor.

      The referee is right that processes in biology are active processes. However, it is a classical approach to model physical interactions between biological components, especially cell adhesion, with a free energy, as they often lead to quasi-stationary equilibrium-like patterns. The free-energy approach also has the advantage of allowing complex phenomena involving many components to be derived straightforwardly. Activity can indeed be introduced in such a framework, if we know that the fibroblasts transform into myo-fibroblasts, see for example our previous publication Ackermann and Ben Amar, EPJP 2023. However, in the interest of simplification and reduction of the number of free parameters, we have not considered further complications of the model here, as a minimal model allows us to distinguish the main processes that occur. Nevertheless, introducing activity more precisely, in the nematic approach already achieved for the friction, is a natural continuation of our work: see the new Section III-4, where we introduce the nematic order and indicate that active nematic stresses can be written from it.

      Next, I don’t understand the assumption that T cells do not proliferate once they detect neoantigens on the cancer cells; activation of T cells usually causes them to become more proliferative.

      We thank the referee for this question. The T-cell fraction has two origins: proliferation of T-cells in situ, in the stroma or inside the tumor nest, or external arrival from the sources, which we privilege. We recognize that a full analysis of the tumor microenvironment would require considering proliferation near the tumor, like many other processes, which is doable but requires more biological data. In addition, the proliferation of T-cells would be equivalent to increasing the killing abilities of T-cells, and these two effects overlap in our approach.

      In order to clarify this point, we modify the following sentence in Section II.2:

      “Although proliferation of cytotoxic T-cells has been observed, we do not consider explicitly proliferation in our study as we focus on their ability to infiltrate the tumor.”

      Rather, we consider that T-cells proliferate outside the domain boundaries, so that this proliferation is included in the boundary source contributions.

      Finally, the issue of whether the density of fibers is sufficient to understand the role of fibroblasts is not at all settled. There should be a full discussion of this issue including mentioning of the Nature paper (cited in the public review) that argues that orientation (and not density) is the key to the role of fibers, as well as the earlier cited work of Kalluri and collaborators on the role of ECM density in pancreatic cancer.

      We thank the referee for this remark. As we wrote above in the response to the public review, we introduced significant additions that aim to tackle this question in the article.

      (2) The authors present a picture of a tumor cell with fibroblasts apparently arrayed circumferentially around the tumor boundary and therefore blocking infiltration. This type of tumor structure has been seen before, for example in ”On the mechanism of long-range orientational order of fibroblasts.” Proceedings of the National Academy of Sciences 114, no. 34 (2017): 8974-8979, which should be cited. More importantly, in that paper the argument is made that positive feedback between fibroblasts and ECM geometry can cause structures like this to form. If this is indeed what is occurring, this would indicate the crucial importance of a mechanism beyond what is contained in the current model. This issue should therefore be discussed within this paper. This issue is of course connected to the previous point regarding the role of ECM structure beyond density.

      We completely agree that the interplay between the fibroblast layer and the tumor shapes the tumor boundary. One of the authors has worked recently on this precise topic (Aging and freezing of active nematic dynamics of cancer-associated fibroblasts by fibronectin matrix remodeling, C Jacques, J Ackermann, S Bell, C Hallopeau, CP Gonzalez, ... bioRxiv, 2023.11.22.568216; Ordering, spontaneous flows and aging in active fluids depositing tracks, S Bell, J Ackermann, A Maitra, R Voituriez, arXiv preprint arXiv:2409.05195). Since the fibroblast layer is an active material, it contributes to an anisotropic stress that can be introduced into the model. Our first strategy was to present the simplest modeling in order to focus on the most important interactions, such as cell-cell adhesion and cell-tissue adhesion. However, we recognize that those questions should be discussed in the text, and we discuss them in the new section III-4.

      Minor points

      There are also a number of more minor points to consider:

      (1) Since the parameter is taken to be O(1), why exactly does it matter how the other parameters scale with it?

      It is very important to compare the orders of magnitude of the other parameters once the selected parameter of order O(1) is really the driving parameter of the coupling. It gives a first picture of the main interactions that have to be considered.

      (2) I didn’t understand the relevance of referring specifically to IL 6 among many other possibly relevant signals, as is currently done on page 7.

      This corresponds to studies aiming to correlate lung cancer risks and the concentration of interleukins, mostly IL6 and IL8 (McKeown, D. J., et al. “The relationship between circulating concentrations of C-reactive protein, inflammatory cytokines and cytokine receptors in patients with non-small-cell lung cancer.” British journal of cancer 91.12 (2004): 1993-1995; Brenner, Darren R., et al. “Inflammatory cytokines and lung cancer risk in 3 prospective studies.” American journal of epidemiology 185.2 (2017): 86-95.), but in the absence of very detailed biological information, the modeling and its results are not modified if other chemicals intervene. We slightly modified the following phrase in section I.1:

      “In particular, in the family of inflammatory proteins, also called cytokines, Interleukin-6 (IL6) and Interleukin-8 (IL8) seem, among others, to stimulate the infiltration of CD8<sup>+</sup>.”

      (3) The authors need to mention the possibility of T-cell chemotaxis to the tumor being ”self-amplified” in the T cell system, as put forth in Galeano Niño, Jorge Luis, Sophie V. Pageon, Szun S. Tay, Feyza Colakoglu, Daryan Kempe, Jack Hywood, Jessica K. Mazalo et al. ”Cytotoxic T cells swarm by homotypic chemokine signalling.” eLife 9 (2020): e56554. This might again reveal a needed extension of the current modelling strategy.

      We thank the referee for his/her comment on the self-amplification of the T-cell population in the stroma, and we mention the indicated reference in our paper. This auto-chemotactic process, which induces more efficient recruitment towards the tumor, may be important for immunotherapy. Having T-cells arrive more efficiently at the site of the tumor will lead to a better outcome for the patient, if the swarming organization is maintained in a desmoplastic nematic stroma.

      (4) It is not obvious to me that in sub figures 3F and 3H the tumor is enroute to being totally eradicated, as is stated in the text. The blue lines seemed to asymptote at non-zero population values.

      Looking at sub-figures 3F and 3H, we stated in the main text that the tumor is eradicated because the corresponding population fraction approaches 0, or at least decays to values close to 0 (0.01/0.05 to be more precise). This is even more evident when compared with the other cases, where the tumor mass fraction reaches much higher values (up to 0.6), thus leading us to distinguish between these different scenarios.

      (5) The description of the interaction of cells with fibers as being increased friction might be misleading, as the real effect could be actual trapping in the network (as opposed to just slowing down the motion).

      We thank the referee for this question as it allows us to make an important distinction. Indeed, what the referee describes seems to correspond to a discrete event, namely a cell trapped in a network. However, coarse-graining the dynamics to the continuous level leads, in our view, to an effective friction between the two phases. Moreover, we have now also introduced an anisotropic friction, which can represent trapping. The velocities are not only directed around the tumor but can also be oriented towards the tumor, so that eventually the friction along the radius mimics trapping (see Fig.4 on top). We have introduced this anisotropic friction via a nematic model, see the appendix.
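
      A common way to write an anisotropic friction in terms of a nematic director is given below as a generic illustration. The symbols ξ∥, ξ⊥, the director n and the relative velocity v are assumptions for this sketch and need not match the form used in the appendix.

```latex
\xi_{ij} = \xi_{\parallel}\, n_i n_j + \xi_{\perp}\,\left(\delta_{ij} - n_i n_j\right),
\qquad
f_i = -\,\xi_{ij}\, v_j .
```

      With the director tangent to the tumor boundary and ξ⊥ much larger than ξ∥, motion along the radius is strongly damped while tangential motion is not, which is one way a continuum friction can mimic trapping.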

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Authors showed the presence of Mtb in human liver biopsy samples of TB patient and reported that chronic infection of Mtb causes immune-metabolic dysregulation. Authors showed that Mtb replicates in hepatocytes in a lipid rich environment created by up regulating transcription factor PPARγ. Authors also reported that Mtb protects itself from anti-TB drugs by inducing drug metabolising enzymes.

      Strengths:

      It has been shown that Mtb induces storage of triacylglycerol in macrophages by induction of WNT6/ACC2 which helps in its replication and intracellular survival, however, creation of favorable replicative niche in hepatocytes by Mtb is not reported. It is known that Mtb infect macrophages and induces formation of lipid-laden foamy macrophages which eventually causes tissue destruction in TB patient. In a recent article it has been reported that "A terpene nucleoside from M. tuberculosis induces lysosomal lipid storage in foamy macrophages" that shows how Mtb manipulates host defense mechanisms for its survival. In this manuscript, authors reported the enhancement of lipid droplets in Mtb infected hepatocytes and convincingly showed that fatty acid synthesis and triacylglycerol formation is important for growth of Mtb in hepatocytes. Authors also showed the molecular mechanism for accumulation of lipid and showed that the transcription factor associated with lipid biogenesis, PPARγ and adipogenic genes were upregulated in Mtb infected cells.

      The comparison of gene expression data between macrophages and hepatocytes by authors is important which indicates that Mtb modulates different pathways in different cell type as in macrophages it is related to immune response whereas, in hepatocytes it is related to metabolic pathways.

      Authors also reported that Mtb residing in hepatocytes showed drug tolerance phenotype due to up regulation of enzymes involved in drug metabolism and showed that cytochrome P450 monooxygenase that metabolize rifampicin and NAT2 gene responsible for N-acetylation of isoniazid were up regulated in Mtb infected cells.

      Weaknesses:

      There are reports of hepatic tuberculosis in pulmonary TB patients especially in immune-compromised patients, therefore finding granuloma in human liver biopsy samples is not surprising.

      Mtb infected hepatic cells showed induced DME and NAT and this could lead to enhanced metabolism of drug by hepatic cells; as a result, Mtb inside HepG2 cells get exposed to reduced drug concentration and show higher tolerance to drug. Authors mentioned that "hepatocyte resident Mtb may display higher tolerance to rifampicin". In my opinion higher tolerance to drug is possible only when DME of Mtb inside is up regulated or target is modified. Although, in the end authors mentioned that drug tolerance phenotype can be better attributed to host intrinsic factors rather than Mtb efflux pumps. It may be better if Drug tolerant phenotype section can be rewritten to clarify the facts.

      In the revised manuscript, by immune-staining authors convincingly showed that hepatocytes are a favourable niche for replication of MTb.

      Authors have rewritten the drug tolerant phenotype section which reads better.

      Overall, this paper has new and important information on how MTb establishes a favourable niche for growth in hepatocytes and creates a drug tolerant environment.

      We thank the reviewer for the thorough and insightful review.

      Reviewer #2 (Public review):

      The manuscript by Sarkar et al has demonstrated the infection of liver cells/hepatocytes with Mtb and the significance of liver cells in the replication of Mtb by reprogramming lipid metabolism during tuberculosis. Besides, the present study shows that similar to Mtb infection of macrophages (reviewed in Chen et al., 2024; Toobian et al., 2021), Mtb infects liver cells but with a greater multiplication owing to consumption of enhanced lipid resources mediated by PPARg that could be cleared by its inhibitors. The strength of the study lies in clinical evaluation of the presence of Mtb in human autopsied liver samples from individuals with miliary tuberculosis and presence of a clear granuloma-like structure. The interesting observation is of granuloma-like structure in liver which prompts further investigations in the field.

      The modulation of lipid synthesis during Mtb infection, such as PPARg upregulation, appears generic to different cell types including both liver cells and macrophage cells. It is also known that infection affects PPARγ expression and activity in hepatocytes. It is also known that this can lead to lipid droplet accumulation in the liver and the development of fatty liver disease (as shown for HCV). This study is in similar line for M.tb infection. As liver is the main site for lipid regulation, the availability of lipid resources is greater and higher is the replication rate. In short, the observations from the study confirm the earlier studies with these additional cell types. It is known that higher the lipid content, greater are Lipid Droplet-positive Mtb and higher is the drug resistance (Mekonnen et al., 2021). The DMEs of liver cells add further to the phenotype.

      Comments on revised version:

      The authors noted that even in experiments where mice were infected with lower CFUs, the presence of Mtb colonies could still be detected in the liver. It would be beneficial to include some experimental data related to this in the supplementary information, as it could provide valuable insights for the research field.

      We thank the reviewer for the in-depth evaluation of our manuscript and, as suggested, we will include the data where Mtb was detected in the liver at low CFUs.

      Reviewer #3 (Public review):

      In this revised manuscript, the authors explore how Mtb can infect hepatocytes and create a favorable niche associated with upregulation of the transcription factor PPARγ which presumably allows the bacteria to scavenge lipids from lipid droplets in host cells and upregulate drug-metabolizing enzymes to protect against its elimination. In response to the review, the authors have performed some additional immunostaining of hepatocytes, added more detail to figure legends, added experiments somewhat showing improved colocalization and staining, clarified several points and paragraphs, and updated the referenced literature and discussion.

      The current manuscript provides evidence that human miliary TB patients have infection of hepatocytes with Mtb, with evidence that the bacteria survive at least partially through upregulation of PPARγ, which significantly changes the lipid milieu of the cells. There is also an examination of transcriptomics and lipid metabolism in response to Mtb infection, as well as drug tolerance of Mtb inside hepatocytes. The current manuscript is an improvement over the previous one.

      However, although the manuscript is improved, tissue immunophenotyping of the various cells in the liver remains weak and unconvincing. This is truly a missed opportunity and lessens the rigor of the central findings and conclusions. As pointed out by another reviewer, literature has described different fates of Mtb in the liver. Given the tissue available to the authors, carefully dissecting the various cells that the bacteria are in (esp. hepatocytes versus Kupffer cells) is critical. The authors use only 2 generic markers and do not distinguish among cell types within the tissue slices. A review of the literature shows a variety of both human and mouse antibody markers. In fact, a liver atlas based on immunophenotyping has been published. Likewise, the authors comment on liver granulomas, but this is not justified without immunophenotyping.

      We would like to thank the reviewer for the in-depth and detailed suggestions. We would like to clarify that the primary aim of our study was to determine the localization of Mtb within hepatocytes and the downstream biological consequences. To this end, we employed two well-established and widely validated markers (ASGPR1 and albumin) that are consistently used to identify hepatocytes in both human and murine liver tissue. While we acknowledge the broader potential of comprehensive immunophenotyping, our focused approach was designed to specifically address the question of hepatocyte involvement, which the selected markers effectively support, as Reviewer 1 also noted.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      In my opinion this paper contains important information and no further information is required for this manuscript.

      We thank the reviewer for the insightful comments.

      Reviewer #2 (Recommendations for the authors):

      The authors noted that even in experiments where mice were infected with lower CFUs, the presence of Mtb colonies could still be detected in the liver. It would be beneficial to include some experimental data related to this in the supplementary information, as it could provide valuable insights for the research field.

      As suggested, we will include the data with the low CFUs in the updated manuscript.

      Reviewer #3 (Recommendations for the authors):

      • Line 340, the fact that PPARγ inhibition decreases bacterial load should not be surprising, as the authors cite several papers where this is already shown.

      • Line 379, the increased tolerance of Mtb to drugs in hepatocytes is only significant at the lower 2 concentrations, not at 5 ug/mL.

      • Fig S4F-H, the y axis is inappropriately not set to zero on the lower limit.

      • Fig S9B, the Y-axis states "relative" CFU, but there is no indication what the bars are normalized to, and the numbers are much more typical of standard CFU values. Was the "Relative" part left in by mistake?

      • Double check the ending of the figure legend for Figure S10 and S11.

      • Line 352, phenomenom [sic] is misspelled.

      • On re-read, several sentences throughout this manuscript need improvement regarding structure and grammar. I suggest careful editorial review.

      We thank the reviewer for pointing out these issues, which will be carefully corrected in the next version.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors showed the presence of Mtb in human liver biopsy samples of TB patients and reported that chronic infection of Mtb causes immune-metabolic dysregulation. Authors showed that Mtb replicates in hepatocytes in a lipid rich environment created by up regulating transcription factor PPARγ. Authors also reported that Mtb protects itself from anti-TB drugs by inducing drug metabolising enzymes.

      Strengths:

      It has been shown that Mtb induces storage of triacylglycerol in macrophages by induction of WNT6/ACC2 which helps in its replication and intracellular survival, however, creation of favorable replicative niche in hepatocytes by Mtb is not reported. It is known that Mtb infects macrophages and induces formation of lipid-laden foamy macrophages which eventually causes tissue destruction in TB patients. In a recent article it has been reported that "A terpene nucleoside from M. tuberculosis induces lysosomal lipid storage in foamy macrophages" that shows how Mtb manipulates host defense mechanisms for its survival. In this manuscript, authors reported the enhancement of lipid droplets in Mtb infected hepatocytes and convincingly showed that fatty acid synthesis and triacylglycerol formation is important for growth of Mtb in hepatocytes. The authors also showed the molecular mechanism for accumulation of lipid and showed that the transcription factor associated with lipid biogenesis, PPARγ and adipogenic genes were upregulated in Mtb infected cells.

      The comparison of gene expression data between macrophages and hepatocytes by authors is important which indicates that Mtb modulates different pathways in different cell type as in macrophages it is related to immune response whereas, in hepatocytes it is related to metabolic pathways.

      Authors also reported that Mtb residing in hepatocytes showed drug tolerance phenotype due to up regulation of enzymes involved in drug metabolism and showed that cytochrome P450 monooxygenase that metabolize rifampicin and NAT2 gene responsible for N-acetylation of isoniazid were up regulated in Mtb infected cells.

      We thank the reviewer for the positive feedback and for highlighting the strengths of our study.

      Weaknesses:

      There are reports of hepatic tuberculosis in pulmonary TB patients especially in immune-compromised patients, therefore finding granuloma in human liver biopsy samples is not surprising.

      Mtb infected hepatic cells showed induced DME and NAT and this could lead to enhanced metabolism of drug by hepatic cells; as a result, Mtb inside HepG2 cells get exposed to reduced drug concentration and show higher tolerance to drug. The authors mentioned that "hepatocyte resident Mtb may display higher tolerance to rifampicin". In my opinion higher tolerance to drugs is possible only when DME of Mtb inside is up regulated or the target is modified. Although, in the end authors mentioned that drug tolerance phenotype can be better attributed to host intrinsic factors rather than Mtb efflux pumps. It may be better if the Drug tolerant phenotype section can be rewritten to clarify the facts.

      We agree that several case studies regarding liver infection in pulmonary TB patients have been reported in the literature; however, this report is the first comprehensive study that establishes hepatocytes to be a favourable niche for Mtb survival and growth.

      Drug tolerance is a phenomenon exhibited by the bacteria and, during host-pathogen interactions, can be influenced by both intrinsic (bacterial) and extrinsic (host-mediated) factors. Multiple examples of tolerance being attributed to host-driven factors can be found in the literature (PMID: 32546788, PMID: 28659799, PMID: 32846197). Our studies demonstrate that Mtb infected hepatocytes create a drug tolerant environment by modulating the expression of drug modifying enzymes (DMEs) in the hepatocytes.

      As suggested by the reviewer, we will rewrite the drug tolerant phenotype section.

      Reviewer #2 (Public review):

      The manuscript by Sarkar et al has demonstrated the infection of liver cells/hepatocytes with Mtb and the significance of liver cells in the replication of Mtb by reprogramming lipid metabolism during tuberculosis. Besides, the present study shows that similar to Mtb infection of macrophages (reviewed in Chen et al., 2024; Toobian et al., 2021), Mtb infects liver cells but with a greater multiplication owing to consumption of enhanced lipid resources mediated by PPARg that could be cleared by its inhibitors. The strength of the study lies in the clinical evaluation of the presence of Mtb in human autopsied liver samples from individuals with miliary tuberculosis and the presence of a clear granuloma-like structure. The interesting observation is of granuloma-like structure in liver which prompts further investigations in the field.

      The modulation of lipid synthesis during Mtb infection, such as PPARg upregulation, appears generic to different cell types including both liver cells and macrophage cells. It is also known that infection affects PPARγ expression and activity in hepatocytes. It is also known that this can lead to lipid droplet accumulation in the liver and the development of fatty liver disease (as shown for HCV). This study is in a similar line for M.tb infection. As the liver is the main site for lipid regulation, the availability of lipid resources is greater and the replication rate is higher. In short, the observations from the study confirm the earlier studies with these additional cell types. It is known that the higher the lipid content, the greater the number of Lipid Droplet-positive Mtb and the higher the drug resistance (Mekonnen et al., 2021). The DMEs of liver cells add further to the phenotype.

      We thank the reviewer for emphasizing the strengths of our study and how it can lead to further investigations in the field.

      Reviewer #3 (Public review):

      This manuscript by Sarkar et al. examines the infection of the liver and hepatocytes during M. tuberculosis infection. They demonstrate that aerosol infection of mice and guinea pigs leads to appreciable infection of the liver as well as the lung. Transcriptomic analysis of HepG2 cells showed differential regulation of metabolic pathways including fatty acid metabolic processing. Hepatocyte infection is assisted by fatty acid synthesis in the liver and inhibiting this caused reduced Mtb growth. The nuclear receptor PPARg was upregulated by Mtb infection and inhibition or agonism of its activity caused a reduction or increase in Mtb growth, respectively, supporting data published elsewhere about the role of PPARg in lung macrophage Mtb infection. Finally, the authors show that Mtb infection of hepatocytes can cause upregulation of enzymes that metabolize antibiotics, resulting in increased tolerance of these drugs by Mtb in the liver.

      Overall, this is an interesting paper on an area of TB research where we lack understanding. However, some additions to the experiments and figures are needed to improve the rigor of the paper and further support the findings. Most importantly, although the authors show that Mtb can infect hepatocytes in vitro, they fail to describe how bacteria get from the lungs to the liver in an aerosolized infection. They also claim that "PPARg activation resulting in lipid droplets formation by Mtb might be a mechanism of prolonging survival within hepatocytes" but do not show a direct interaction between PPARg activation and lipid droplet formation and lipid metabolism, only that PPARg promotes Mtb growth. Thus, the correlations with PPARg appear to be there but causation, implied in the abstract and discussion, is not proven.

      The human photomicrographs are important and overall, well done (lung and liver from the same individuals is excellent). However, in lines 120-121, the authors comment on the absence of studies on the precise involvement of different cells in the liver. In this study there is no attempt to immunophenotype the nature of the cells harboring Mtb in these samples (esp. hepatocytes). Proving that hepatocytes specifically harbor the bacteria in these human samples would add significant rigor to the conclusions made.

      We thank the reviewer for nicely summarizing our manuscript.

      Our study establishes the involvement of liver and hepatocytes in pulmonary TB infection in mice. Understanding the mechanism of bacterial dissemination from the lung to the liver in aerosol infections demands a detailed separate study.

      Figures 6E and 6F show how a PPARγ agonist and antagonist modulate (increase and decrease, respectively) bacterial growth in hepatocytes (further supported by the CFU data in Supplementary Figure 9B). Likewise, the number of lipid droplets in hepatocytes increases and decreases upon treatment with the PPARγ agonist and antagonist, respectively, as shown in Figures 6G and 6H. Collectively, these studies provide strong evidence that PPARγ activation leads to more lipid droplets that support better Mtb growth.

      We thank the reviewer for finding our human photomicrographs convincing. In the manuscript, we provide evidence for the direct involvement of the hepatocytes (and liver) in Mtb infection. We have performed detailed immunophenotyping of hepatocytes in the mouse model with ASPGR1 (asialoglycoprotein receptor 1), and in the revised version of record we have further stained the infected hepatocytes with anti-albumin antibody.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      In my opinion drug tolerant phenotype section should be rewritten for better clarification. The manuscript contains important information about hepatic tuberculosis which are not reported yet.

      We have rewritten the drug tolerant phenotype section for better clarity.

      We appreciate the reviewer’s comments regarding the important information about hepatic tuberculosis.

      Reviewer #2 (Recommendations for the authors):

      The following are some observations and comments on the manuscript.

      (1) The study delves into the mechanisms related to hepatic TB/miliary TB; however, the introduction and discussion only describe and discuss the data in the context of pulmonary TB giving a sense that the mandate of the MS is the exploration of the role of liver cells in pulmonary TB. There appears a gap in the connection of findings from the Miliary TB to the pulmonary TB. A discussion of the conversion of pulmonary TB to extrapulmonary /hepatic TB in the light of the findings may be helpful.

      We have modified the discussion section to include possible mechanisms that convert pulmonary TB to hepatic TB in the light of our findings. Briefly, pulmonary tuberculosis (TB) can lead to miliary TB, probably through hematogenous dissemination, where Mtb spreads from the infected lungs into blood vessels from a primary lung focus, reactivated TB, or caseous necrosis. Once in blood vessels, the bacteria seed multiple organs, forming tiny granulomas characteristic of miliary TB. Liver involvement could occur either through direct hematogenous spread or through extrusion from nearby infected lymph nodes, leading to hepatic TB, which presents with granulomas and liver dysfunction. This spread underscores the severity of untreated pulmonary TB and the need for early intervention. Our in vivo infection data clearly show that pulmonary infection with Mtb in mice and guinea pigs can steadily lead to significant infection of the liver and metabolic abnormalities in the liver. The study further highlights the need for systematic studies of the route and mode of dissemination from lungs to liver, for a better pathophysiological understanding of the disease and for creating new therapeutic targets.

      (2) The authors show the presence of Mtb in the liver autopsies of miliary tuberculosis patients. It is well known that Mtb disseminates during the late stages to several organs and liver is a major site (Sharma et al. 2005; 10.1016/S1473-3099(05)70163-8). Other clinical observations also point to the fact that although Mtb infects liver cells, it is cleared (Thandi et al., 2018, https://doi.org/10.4049/jimmunol.200.Supp.173.20). As the samples are from miliary TB, it is expected that the bacterial load must have been very high before spreading to blood. It is known that once in blood, M.tb is expected to spread to various organs, especially highly vascular ones. Were any other tissues (especially with high vasculature) stained and verified? If yes, add to the supplementary data or discuss.

      Other tissues were not collected and stained during this study. Studies are currently underway to understand whether other vascularized organs also harbour Mtb. Besides, several studies have shown that Mtb can infect a wide range of organs such as brain, kidney, and bone marrow (PMID: 33142108, PMID: 28046053, PMID: 34269789) during miliary conditions.

      (3) It is not evident from this paper if hepatic infiltration occurs in pulmonary TB patients? It may therefore be important to discuss the status of liver infections in the primary pulmonary infection.

      Based on the available data from human biopsied liver samples, there is an indication of liver involvement in systemic tuberculosis (TB). However, to gain a more comprehensive understanding of hepatic infiltration in pulmonary TB patients, it is essential to conduct well-organized clinical studies. These studies should specifically target pulmonary TB patients and explore the extent and nature of liver involvement in these individuals. As suggested by the reviewer, this point is now included in the discussion.

      (4) Similarly, in the mice model, M.tb was shown to localize to liver when aerosolic infection was given. Were any other tissues, such as kidney, bone marrow etc, checked? Is it because of the high dose of M.tb against the standard challenge dose of 50-100 CFU? Further, since the study in the mouse model is to mimic a miliary tuberculosis of liver, did the dissemination occur via bloodstream and if mycobacteremia could be observed in infected mice.

      Studies are currently underway to understand the involvement of other organs such as kidney, brain, and bone marrow in the aerosol infection mouse model, and how dissemination to those distant organs occurs.

      The focus of the current study was to understand the role of liver in systemic tuberculosis with emphasis on hepatocytes as a key cell type to be infected. We have also conducted the experiments with lower CFUs and could detect the presence of Mtb colonies in liver, so we do not think that the infection of liver is dependent on the dose of infection.

      (5) There are studies in mouse model which infer that liver carried the lowest bacterial burden, was cleared the fastest, and it is established that as compared to sites persistently seeded by M. tuberculosis, in the liver the bacteria rarely infect cell types other than professional phagocytes. As the observations in this study are contrasting, the discussion section should include a critical comparative analysis to justify why in the conditions used in the study, the hepatocytes and not Kupffer cells are infected. Other than the morphological description to indicate M.tb infection of hepatocytes in the liver section (fig 1E), it will be good to show localization of M.tb specifically to hepatocytes by using hepatocyte specific marker. Unlike as reported, why was a clearance of M.tb not observed even after 10 weeks (figure 2B).

      While some studies show that Mtb is cleared quickly from the liver, several other studies report that the liver harbours Mtb even after 10 weeks post infection (PMID: 22359543, PMID: 21533158, PMID: 29242198). We have consistently observed Mtb infection of the liver after week 10 in our infection model.

      We have performed detailed immunophenotyping of hepatocytes in the mouse model with ASPGR1 (asialoglycoprotein receptor 1), and in the revised version of record we have further stained the isolated hepatocytes with anti-albumin antibody (albumin is a robust marker of hepatocyte identity) and have shown the presence of Mtb in them. The data have been included in the revised manuscript (Fig 2J).

      (6) While the result section mentions "individuals with miliary tuberculosis" (line 107), the legend of Figure 1 writes "Presence of Mtb in human pulmonary tuberculosis patients". This is confusing. Please clarify.

      We thank the reviewer for pointing it out. We have changed the figure legends to miliary tuberculosis, as most of the liver biopsy samples were obtained from miliary tuberculosis patients.

      (7) Supplementary Figure 2D: Corresponding control panel (uninfected) should be added, which will also verify the specificity of Ag85b. As it is known that Ag85B is secreted out from the bacteria and hence the detected signals may not confirm that Mtb is in hepatocytes. Ag85B per bacterium decreases by almost 10,000-fold at later stages of infection because of secretion (Ernst JD, Cornelius A, et al 2019 mBio). In Supl figure 2D, Ag85b signal seems to be present everywhere inside the cells. Hence, it is important that the control panel be added.

      We have included a control image below, which shows no Ag85B staining in the uninfected sample. While we acknowledge the reviewer’s comment, Ag85B has been consistently used as a marker for Mtb presence in multiple studies. Nargan et al. use Ag85B-based staining to characterize infection in both pulmonary and EPTB samples (PMID: 38880068). Jain et al. use Ag85B to characterize Mtb infection of mesenchymal stem cells in lung biopsy samples of pulmonary TB patients (PMID: 32546788).

      Author response image 1.

      Ag85B staining in uninfected mice shows no signal.

      (8) The kinetics experiments in Figure 3D-3G should have used time-lapse microscopy of a few of the infected cells, or the data should be represented in CFU. If we consider that the doubling time of H37Rv is about 22h to 24h, the data showing that MFI increases dramatically from 5 HPI to 120 HPI give the impression that the bacterial number inside the cells increased faster than its doubling time would allow.
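      The reviewer's doubling-time argument can be made concrete with a rough calculation. The sketch below is illustrative only: it assumes uninterrupted exponential growth with no lag phase and no killing, and the function name is hypothetical. It gives the largest fold increase in bacterial number that a 22 to 24 h doubling time permits between 5 and 120 HPI; comparing that bound with an MFI fold change further assumes that MFI scales linearly with bacterial number.

```python
def max_fold_increase(start_h: float, end_h: float, doubling_time_h: float) -> float:
    """Upper bound on fold increase under pure exponential growth.

    Assumes growth starts immediately at start_h and is never interrupted,
    so the real increase can only be equal to or smaller than this value.
    """
    if doubling_time_h <= 0:
        raise ValueError("doubling time must be positive")
    elapsed_h = end_h - start_h
    return 2 ** (elapsed_h / doubling_time_h)


# Time window and doubling times quoted in the reviewer's comment.
for doubling_time in (22.0, 24.0):
    bound = max_fold_increase(5.0, 120.0, doubling_time)
    print(f"doubling time {doubling_time:.0f} h -> at most {bound:.1f}-fold")
```

      Under these assumptions the bound is roughly 28- to 37-fold over the 115 h window, so an MFI increase well beyond that range would not be explained by replication alone.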

      We have added the modified plot. As suggested, the CFU of Mtb within HepG2, PHCs, THP-1, RAW 264.7 and BMDMs have been included in the revised version (Supplementary Figure 4 D-H).

      (9) What is the effect of C45 and T863 on Mtb growth in vitro? A direct effect of C45 and T863 on Mtb growth in vitro should be tested and ruled out. Is the representative image in Figure 5F from the DMSO or the C45 treated cells panel? Please specify it.

      As per the reviewer’s suggestion, we tested the effect of C45 (30 µM) and T863 (25 µM) on Mtb growth in vitro and did not find any difference in the growth kinetics. The representative image in Figure 5F shows DMSO treated cells.

      Author response image 2.

      Growth kinetics of Mtb in 7H9 medium with DMSO, C75 and T863

      (10) Supplementary Figure 6B: Correct the Y-axis label from mRNA levels to Fold change (normalised to control). Please do similar changes wherever required.

      We have made the necessary changes as per the suggestion of the reviewer.

      (11) Figure 7B and 7C: How was the normalization performed? Is the data normalized to the number of bacteria that entered the specific cell type or was normalized at 48hrs with respect to DMSO? DMSO alone data should be shown.

      In the drug tolerance assays, we have calculated the ratio of the bacterial burden in hepatocytes treated with drugs to that in hepatocytes treated with DMSO. Cells were infected for 48 hours, after which the infected cells were treated with the mentioned concentrations of isoniazid and rifampicin for 24 hours. CFU enumeration was conducted after this 24-hour treatment. Figure 7A gives a schematic of the experimental setup.

      % tolerant bacterial population = (A/B) × 100, where A is the CFU of Mtb from infected hepatocytes treated with drug and B is the CFU of Mtb from infected hepatocytes treated with DMSO. Thus the effect of MOI is negated.

      To provide further credence to the CFU data, we have also analysed these experiments by microscopy, where no cell death was observed under these conditions. Mouse BMDMs were used as a macrophage control. We have calculated the % tolerance as the ratio of the mean fluorescence intensity (MFI) of GFP-Mtb per hepatocyte treated with drug to the MFI of GFP-Mtb per hepatocyte treated with DMSO (control). More than 20 fields, each consisting of more than 4 infected cells, have been used for analysis, providing additional evidence of less killing of Mtb by anti-TB drugs in hepatocytes compared to BMDMs. All these details are included in the manuscript.
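      As a minimal sketch of the calculation above, the snippet below computes the percent tolerant population from a drug-treated CFU count (A) and a DMSO-treated CFU count (B). The function name and the CFU values are hypothetical and chosen only for illustration; they are not data from the study.

```python
def percent_tolerant(cfu_drug: float, cfu_dmso: float) -> float:
    """Percent tolerant bacterial population: (A / B) * 100.

    cfu_drug: CFU recovered from infected cells treated with drug (A).
    cfu_dmso: CFU recovered from infected cells treated with DMSO (B).
    """
    if cfu_dmso <= 0:
        raise ValueError("DMSO control CFU must be positive")
    return cfu_drug / cfu_dmso * 100.0


# Hypothetical CFU counts, for illustration only.
low_uptake = percent_tolerant(cfu_drug=2.5e4, cfu_dmso=1.0e5)
# Same cells with ten times more bacteria taken up: both A and B scale together.
high_uptake = percent_tolerant(cfu_drug=2.5e5, cfu_dmso=1.0e6)

print(f"{low_uptake:.1f}% vs {high_uptake:.1f}%")
```

      Because A and B come from the same infection, any difference in initial bacterial uptake scales both counts by the same factor and cancels in the ratio, which is the sense in which the effect of MOI is negated.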

      (12) While authors have shown the changes in mRNA levels of CYP3A4, CYP3A43, NAT2, the protein or activities of some of these should be measured to verify the effect.

      Studies are currently underway to understand the activities of the key proteins involved in isoniazid and rifampicin metabolism, and these will be published as a separate manuscript.

      Reviewer #3 (Recommendations for the authors):

      Additional comments are:

      • Figure 2D, the 20X and 40X magnifications do not look appreciably different in size. Please double-check that the correct images were used.

      We thank the reviewer for pointing it out; we have corrected it in the revised version.

      • Lines 162-164: The authors state almost 100% purity. However, the contour plot in 2F appears to show 2 cell populations. Figure 2G is missing a legend of which colors correspond to which staining (and again there appears to be highly variable staining).

      We agree with the reviewer that there are two contours observed in Figure 2F. Although both contours are positive for ASPGR1 protein, the level of expression of the ASPGR1 protein is variable. The corresponding confocal image (nucleus stained with DAPI and ASPGR1 stained with ASPGR1 antibody and Alexa Fluor 555 conjugated secondary antibody) also indicates variable staining of isolated primary hepatocytes, where some cells give a stronger signal than others, further visually confirming our statement. Moreover, several studies show differential expression of ASPGR1 protein in hepatocyte-like cells (PMID: 27143754).

      To further clarify and be more specific with respect to the identity of the hepatocytes, we have stained primary hepatocytes from infected mouse livers with albumin antibody (a stable marker for hepatocytes) and Ag85B (Figure 2J).

      Multiple figures throughout the manuscript, including this one, would benefit from the use of arrows to depict what is described in the legend and text more clearly, and the use of higher power insets to better define cell architecture. Finally, some images appear blurry to the eye. Improvements are needed throughout.

      As per the suggestion, we have modified the figures and figure legends for better clarity.

      • Lines 153-155. Albumin, AST and GGT appear to be significantly up at week 8, contradicting the statement that there is no change until week 10.

      We thank the reviewer for pointing it out and have made suitable changes in the write-up.

      • Lines 203-205: The authors state earlier that bacteria survive in macrophage phagosomes. Do the authors know the niche for bacteria in hepatocytes that enable them to continue to grow? Transcriptome data from HepG2 cells suggest perhaps a phagosomal pathway?

      We thank the reviewer for this insightful question. As rightly pointed out by the reviewer, the transcriptomic data indeed suggest changes in several important pathways such as macroautophagy, Golgi vesicular transport and vacuolar transport, which can affect the subcellular localisation of Mtb within hepatocytes. High-resolution microscopy studies of the subcellular localisation of labelled Mtb within primary hepatocytes, HepG2 and THP-1 have been conducted, and the % colocalization within different intracellular compartments has been measured. The image of colocalization of labelled Mtb within PHCs is shown below, along with the % colocalization within various compartments in PHCs, HepG2 and THP-1.

      Author response image 3.

      Colocalisation of Mtb-GFP with various intra-cellular markers within PHCs.

      Author response image 4.

      Percentage Colocalisation of Mtb-GFP with various intra-cellular markers within PHCs, HepG2 and THP-1.

      • Validation of some critical genes found in the HepG2 cells should be done by qRT-PCR in primary hepatocytes.

      Some of the key genes identified in HepG2 cells have been validated by qRT-PCR in primary hepatocytes at 24 hours post infection. The majority of the genes show a similar trend.

      Author response image 5.

      Gene expression analysis of the mentioned genes in Mtb infected PHCs as compared to the uninfected control.

      • Lines 259-260: The authors state a high degree of co-localization. The photomicrograph of a single cell in Fig. 5D is not convincing. I'm not even sure that they are really in the same subcellular compartment. Co-localization stated in Fig. S8B is also not convincing as shown.

      The image currently shown in figure 3D is a maximum intensity projection image of multiple z-stacks encompassing the entire cell.

      We agree with the reviewer with respect to Fig. S8B and will modify the text and the figure legend accordingly.

      Copywriting edits:

      • It is difficult to see individual gene names in Figures 4D and 4E. A higher resolution or larger font would be appreciated for the reader.

      An Excel file with the top differentially regulated genes at both 0 hours post infection and 48 hours post infection has been added.

      • Figure 5A has a shadow on the top right image.

      We have changed the image in the revised manuscript.

      • Figure 5E is difficult to read the labels on the axes; it would be better in general to make the labels separately instead of relying on the graphing software, since these labels can get stretched when the size of the graph is modified.

      We agree with the reviewer and have made necessary changes.

      • Line 163: should be "percent" and not "perfect."

      We thank the reviewer for pointing it out and have corrected it.

      • Line 190: is missing a period at the end of the sentence "...for further experiments"

      We thank the reviewer for pointing it out and have corrected it.

      • Line 332: should be "hepatocytes" instead of "hepatoctyte" [sic]

      We thank the reviewer for pointing it out and have corrected it.

    1. Author response:

      Reviewer #1 (Public Review):

      Summary:

      Li et al investigated how adjuvants such as MPLA and CpG influence antigen presentation at the level of the Antigen-presenting cell and MHCII : peptide interaction. They found that the use of MPLA or CpG influences the exogenous peptide repertoire presented by MHC II molecules. Additionally, their observations included the finding that peptides with low-stability peptide:MHC interactions yielded more robust CD4+ T cell responses in mice. These phenomena were illustrated specifically for 2 pattern recognition receptor activating adjuvants. This work represents a step forward for how adjuvants program CD4+ Th responses and provides further evidence regarding the expected mechanisms of PRR adjuvants in enhancing CD4+ T cell responses in the setting of vaccination.

      Strengths:

      The authors use a variety of systems to analyze this question. Initial observations were collected in an H pylori model of vaccination with a demonstration of immunodominance differences simply by adjuvant type, followed by analysis of MHC:peptide as well as proteomic analysis with comparison by adjuvant group. Their analysis returns to peptide immunization and analysis of strength of relative CD4+ T cell responses, through calculation of IC50 values and strength of binding. This is a comprehensive work. The logical sequence of experiments makes sense and follows an unexpected observation through to trying to understand that process further with peptide immunization and its impact on Th responses. This work will premise further studies into the mechanisms of adjuvants on T cells.

      Weaknesses:

      Comment 1. While MDP has a different manner of interaction as an adjuvant compared to CpG and MPLA, it is unclear why MDP has a different impact on peptide presentation and it should be further investigated, or at minimum highlighted in the discussion as an area that requires further investigation.

      Thank you for the suggestion. We investigated the reasons for the different effects of MDP on peptide presentation compared with those of CpG and MPLA. We found that the expression of some proteins involved in antigen processing and presentation, such as CTSS, H2-DM, Ifi30, and CD74, was substantially lower in the MDP-treated group than in the CpG- and MPLA-treated groups. To further confirm whether these proteins play a key role during adjuvant modification of peptide presentation, we knocked them down using shRNA and then performed immunopeptidomics. The original mass spectra and peptide spectrum matches have been deposited in the public proteomics repository iProX (https://www.iprox.cn/page/home.html) under accession number IPX0007611000. Unfortunately, the expected results for peptide presentation repertoires were not observed. Thus, we hypothesized that the different effects of MDP on peptide presentation might not result from differences in the expression of these proteins. We cannot exclude the possibility that some other proteins that may be important in this process were overlooked. We are still working on the mechanisms and do not have an exact conclusion. Thus, we did not present related data in this manuscript.

      The related statements were added in the Discussion section on page 13, lines 292–299: “In this study, we found that the peptide repertoires presented by APCs were significantly affected by the adjuvants CpG and MPLA, but not MDP. All three adjuvants belong to the PRR ligand adjuvant family. CpG and MPLA bind to TLRs and MDP is recognized by NOD2. Although the receptors are different, many common molecules are involved both in TLR and NLD pathway activation. Unfortunately, we did not demonstrate why the MDP had different impacts on peptide presentation compared with other adjuvants. Further investigation is required to clarify the mechanism by which MPLA, CpG, and MDP adjuvants modulate the presentation of peptides with different stabilities.”

      Comment 2. It is alluded by the authors that TLR activating adjuvants mediate selective, low affinity, exogenous peptide binding onto MHC class II molecules. However, this was not demonstrated to be related specifically to TLR binding. I wonder if some work with TLR deficient mice (TLR 4KO for example) could evaluate this phenomenon more specifically.

      Thank you for the suggestion. This is an important point that was overlooked in this study. Based on published research on the mechanisms of the PRR adjuvants CpG and MPLA, we believe that the effect of CpG and MPLA on selective epitope presentation by APCs requires binding to the corresponding receptor, although we did not state a definitive conclusion in the manuscript.

      To confirm that TLR-activating adjuvants affect the peptides presented on MHC molecules specifically through TLR binding, we have used CRISPR-Cas9 to knock out TLR4 and TLR9 in A20 cells and repeated the experiments, as suggested. We chose TLR4- and TLR9-knockout A20 cell lines instead of TLR-deficient mice because a large number of APCs are required for immunopeptidomics. Moreover, the data observed in this study were based on the A20 cell line. However, these experiments are time-consuming, and unfortunately we were unable to provide the data in time. In addition, we believe that elucidating the downstream molecular mechanisms of TLR activation is necessary, as mentioned in comment 1. All these data will be combined and reported in our upcoming publications.

      Comment 3. It is unclear to me if this observation is H pylori model/antigen-specific. It may have been nice to characterize the phenomenon with a different set of antigens as supplemental. Lastly, it is unclear if the peptide immunization experiment reveals a clear pattern related to high and low-stability peptides among the peptides analyzed.

      Q1: It is unclear to me if this observation is H. pylori model/antigen-specific. It may have been nice to characterize the phenomenon with a different set of antigens as supplemental.

      Thank you for the comment. To confirm the effect of the adjuvant on the exogenous peptide repertoire presented by MHC II molecules, a set of antigens from another bacterium, Pseudomonas aeruginosa, was used, and the experiments were repeated. The A20 cells were treated with CpG and pulsed with Pseudomonas aeruginosa antigens. Twelve hours later, MHC-II–peptide complexes were immunoprecipitated, and immunopeptidomics was performed. The data are shown below (Author response image 1). Information on the MHC-peptides from Pseudomonas aeruginosa is given in the Supplementary Table named “Table S3 Response to comment3”. A total of 713 and 205 bacterial peptides were identified in the PBS and CpG groups, respectively (Author response image 1A). The number of exogenous peptides in the CpG-treated group was significantly lower than that in the PBS-treated control group (Author response image 1B). A total of 568 bacterial peptides were presented only in the PBS group, 60 bacterial peptides were presented only in the CpG-treated group, and 145 bacterial peptides were presented in both groups (Author response image 1C). We then used the IEDB website to analyze the MHC-binding stability of the peptides presented in the adjuvant-treated group and of the peptides that were no longer presented after adjuvant stimulation (deficient peptides). We found that the IC50 values of the peptides in the adjuvant-treated group were much higher than those of the deficient peptides, which indicates that the peptides presented in the CpG-treated group have lower binding stability for MHC-II (Author response image 1D). These results indicate that the CpG adjuvant affects the presentation of exogenous peptides with high binding stability, which is consistent with the data reported in our manuscript. Using another set of antigens, we confirmed that our observations were not H. pylori model- or antigen-specific.
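      The peptide counts quoted above can be cross-checked with simple set arithmetic. The sketch below is illustrative only and uses just the totals reported in this response (713 in PBS, 205 in CpG, 145 shared); the function and variable names are hypothetical.

```python
def exclusive_counts(total_a: int, total_b: int, shared: int) -> tuple[int, int]:
    """Return (only_in_a, only_in_b) for a two-set Venn diagram."""
    if shared > min(total_a, total_b):
        raise ValueError("shared count cannot exceed either total")
    return total_a - shared, total_b - shared


# Totals reported for the Pseudomonas aeruginosa immunopeptidome.
only_pbs, only_cpg = exclusive_counts(total_a=713, total_b=205, shared=145)
print(only_pbs, only_cpg)  # 568 60
```

      The result matches the 568 and 60 peptides reported for the Venn diagram, which means the 60 peptides are those found in the CpG-treated group but not in the PBS group.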

      Author response image 1.

      MHC-II peptidome measurements in adjuvant-treated APCs pulsed with Pseudomonas aeruginosa antigens. (A) Total number of bacterial peptides identified in the PBS- and CpG-treated groups. (B) The number and length distribution of bacterial peptides in different groups were compared. (C) Venn diagrams showing the distribution of bacterial peptides in different groups. (D) IC50 of the presented, deficient, and co-presented peptides post-adjuvant stimulation from immunopeptidome binding to H2-IA and H2-IE were predicted using the IEDB website. High IC50 means low binding stability. *p<0.05, **p<0.01.

      Q2: Lastly, it is unclear if the peptide immunization experiment reveals a clear pattern related to high and low-stability peptides among the peptides analyzed.

      In this study, we used a peptide immunization experiment to evaluate the responses induced by the screened peptides with different stabilities. In addition to this method, tetramer staining and ELISA have been used to assess epitope-specific T-cell proliferation and cytokine secretion. Of these, tetramer staining is often used in studies involving model antigens. Because many peptides were screened in our study, synthesizing a sufficient number of tetramers was difficult. We believe that the experimental data obtained in this study support the conclusion, but we agree that applying additional methods would make the pattern clearer.

      Reviewer #2 (Public Review):

      Adjuvants boost antigen-specific immune responses to vaccines. However, whether adjuvants modulate the epitope immunodominance and the mechanisms involved in adjuvant's effect on antigen processing and presentation are not fully characterized. In this manuscript, Li et al report that immunodominant epitopes recognized by antigen-specific T cells are altered by adjuvants.

      Using MPLA, CpG, and MDP adjuvants and H. pylori antigens, the authors screened the dominant epitopes of Th1 responses in mice post-vaccination with different adjuvants and found that adjuvants altered antigen-specific CD4+ T cell immunodominant epitope hierarchy. They show that adjuvants, MPLA and CpG especially, modulate the peptide repertoires presented on the surface of APCs. Surprisingly, adjuvant favored the presentation of low-stability peptides rather than high-stability peptides by APCs. As a result, the low stability peptide presented in adjuvant groups elicits T cell response effectively.

      Thanks a lot for your comments.

      Reviewer #1 (Recommendations For The Authors):

      Recommendation 1. Figure 6: The peptides considered low affinity- it would be helpful to specify from which adjuvant they were collected from. When they are pooled it is unclear if we are analyzing peptides collected from adjuvanting with any of the three adjuvants studied.

      Thank you for the suggestion. The related description in Figure 6 has been modified in the revised manuscript. Data for the peptides identified in the MPLA- and CpG-treated groups are now shown separately.

      Recommendation 2. It is unclear to me why the A20 cell line is less preferred to the J774 line for the immunopeptidome analysis - can the authors expand on this?

      We apologize for not clearly explaining this in the original manuscript. In fact, the A20 cell line is better suited than the J774A.1 cell line for immunopeptidomics experiments: more MHC-II peptides were obtained from a smaller number of A20 cells. At the beginning of this study, we chose the J774A.1 cell line because it is a macrophage cell line. J774A.1 cells (up to 5×10<sup>8</sup>) were pulsed with the antigens, and MHC-II–peptide complexes were eluted from the cell surface for immunopeptidomics. Unfortunately, only a few hundred host peptides and no exogenous peptides were detected. Next, we tested the A20 cell line. In total, 10<sup>8</sup> A20 cells were used in this study. More than 3500 host peptides and approximately 50 exogenous peptides were identified. These data indicate that the A20 cell line was the better choice.

      To investigate the reasons for this, we measured MHC-II expression on the cell surface using FACS. Our purpose was to elute peptides from MHC–peptide complexes present on the cell surface, so low MHC expression would result in the elution of only a few peptides. We found that the MFI of MHC-II molecules on J774A.1 cells was about 500, whereas the MFI of MHC-II molecules on A20 cells was more than 300,000. These data indicate that MHC-II expression on A20 cells was much higher than that on J774A.1 cells. J774A.1 is a macrophage cell line. Macrophages have excellent antigen phagocytic capabilities; however, their ability to present antigens is relatively weak. MHC molecules on the macrophage cell surface can be upregulated by stimulation with some cytokines, for example, IFN-γ. In this study, we used adjuvants as stimulators and did not want to use additional cytokine stimulators. Thus, J774A.1 cells were not used in the present study.

      The related statements are reflected on page 6 lines 120–128 “We also selected another H-2d cell J774A.1, a macrophage cell line, for immunopeptidome analysis in this study. Briefly, 5×10<sup>8</sup> J774A.1 cells were used for immunopeptidomics. Moreover, fewer than 350 peptides were observed at a peptide spectrum match (PSM) level of < 1.0% false discovery rate (FDR). However, more than 5500 peptides were detected in 10<sup>8</sup> A20 cells at FDR < 1.0% (Figure S2A). CD86 and MHC-II molecule expression on J774A.1 cells was substantially lower than that on A20 cells (Figure S2B). Low MHC-II expression on J774A.1 cells could be the reason for the lack of peptides identified by LC–MS/MS. Thus, A20 cells instead of J774A.1 cells were used for the subsequent experiments.”

      Recommendation 3. Lines 172-177, can more details be provided about the whole proteome analysis? The plots are shown for relative representation of protein expression to PBS, but it is unclear to me what examples of these proteins are (IFN pathway, Ubiquitination pathway). Could these be confirmed by protein expression analyses in supplemental?

      Thank you for the suggestion. In this study, we conducted whole proteome analysis to investigate changes in protein expression across different pathways in the adjuvant groups. Through KEGG enrichment analysis, we compared the differential expression of MHC presentation pathway proteins (such as H2-M, Ifi30, CD74, CTSS, proteasome, and peptidase subunits) between the PBS- and adjuvant-treated groups using our proteome data. In addition, we focused on IFN and ubiquitination pathways that play crucial roles in antigen presentation modification and immune response. The proteins and their relative expression in these pathways are shown in Figure S4B. Details regarding the protein names and expressions are provided in Supplemental Table S2 of the revised manuscript.

      The original statements in the results “Then, we analyzed the whole proteome data to determine whether the proteins involved in antigen presentation and processing were altered. We found that proteins involved in antigen processing, peptidase function, ubiquitination pathway, and interferon (IFN) signaling were altered post adjuvants treatment, especially in MPLA and CpG groups (Figure 5C; Figure S4B and S4C). These data suggest that adjuvants MPLA and CpG may affect the antigen processing of APCs, resulting in fewer peptides presentation.” This has been revised on page 8 lines 172–182 as “We then investigated whole-proteome data to determine the evidence of adjuvant modification of antigen presentation. We focused on the proteins involved in antigen processing, peptidase function, ubiquitination pathway, and IFN signaling. The ubiquitination pathway and IFN signaling play crucial roles in the modification of antigen presentation and immune responses. Through KEGG enrichment analysis, we found that many proteins involved in antigen processing, peptidase function, ubiquitination pathways, and IFN signaling were altered after adjuvant treatment, particularly in the MPLA- and CpG-treated groups (Figure 5C; Figure S4B). The expression of each protein is shown in Figure S4C and Supplementary Table 2. These data suggest that MPLA and CpG adjuvants may affect the antigen processing of APCs, resulting in fewer peptide presentations.”

      Recommendation 4. Lines 212-218: I think there needs to be more discussion of interpretation here. Only one of the low-stability peptides required low concentrations for CD4+ T cell responses in vitro. What about the other peptides in the analysis? Perhaps if the data is taken together there is not a clear pattern?

      Thank you for the comment. In this study, epitope-specific CD4+ T cells were expanded in vitro from the spleens of peptide-pool-immunized mice. T-cell responses to individual peptides were detected using ICS and FACS. Only one peptide with low binding stability, recA #23, and one high-stability peptide, ureA #2, induced effective T-cell responses. The high-stability peptide ureA #3 induced low Th1 responses. The other peptides did not induce IFN-γ secretion by CD4+ T cells (data are shown in Author response image 2). Thus, we compared the strength of the IFN-γ responses induced by these three peptides at a set of low concentrations. Peptides that induced no response could not be included in this comparison.

      Author response image 2.

      The expanded CD4+ T cells from peptide-immunized mice were screened for their response to the peptides in an ICS assay.

      In this study, we used a peptide pool containing four low-stability peptides to vaccinate mice; however, only one peptide induced an effective CD4+ T-cell response. We speculate that the possible reasons are as follows. First, the number of peptides used for vaccination was too small. Only four low-stability peptides were synthesized and used to immunize mice. Three of these could not induce an effective T-cell response, possibly because of their low immunogenicity. If more peptides were synthesized and used, more peptides that induce T-cell responses might be observed. Second, epitope-specific T-cell responses are variable. Responses to subdominant peptides can be inhibited by the dominant peptide, and a subdominant peptide can become dominant when the peptide dose is changed or the dominant peptide is absent. Thus, we believe that responses to the other three peptides may be detected if mice are immunized with a peptide pool that does not contain the responding epitope.

      The corresponding statements have been added to the Discussion section on page 13 lines 287–291 as “Unfortunately, only one peptide, recA #23, with low binding stability and induced significant Th1 responses, was identified in this study. To further confirm that low-stability peptides can induce stronger and higher TCR-affinity antigen-specific T-cell clonotype responses than high-stability peptides, further studies should monitor more peptides with different stabilities.”

      Recommendation 5. There are some areas where additional editing to text would be beneficial due to grammar (eg lines 122-126; line 116, etc).

      The manuscript has been edited by a professional language editing company.

      Reviewer #2 (Recommendations For The Authors):

      Recommendation 1. It is interesting that there was no difference in IFNg responses induced by different adjuvants.

      Thank you for the comment. Possible reasons for the lack of difference in IFN-γ responses could be as follows. First, all adjuvants used in this study have been confirmed to effectively induce Th1 responses. Second, in this study, IFN-γ responses were examined using expanded antigen-specific T cells in vitro. The in vitro cell expansion efficiency may have affected these results.

      Recommendation 2. The data to support the claim that changes in exogenous peptide presentation among adjuvant groups were not due to differences in antigen phagocytosis is insufficient.

      Thank you for the comment. In this study, proteomics of A20 cells pulsed with antigens in the different adjuvant-treated groups was used to determine the exogenous antigens phagocytosed by the cells. In addition, we pulsed APCs with fluorescein isothiocyanate (FITC)-labeled OVA and measured antigen phagocytosis by APCs after treatment with different adjuvants. The MFI of FITC was measured by FACS at different time points. The data are shown below (Author response image 3). No obvious differences in FITC MFI were detected after adjuvant stimulation, indicating that antigen phagocytosis was almost the same among the adjuvant groups.

      A20 cells, used here as APCs, are a B-cell line. Antigen recognition and phagocytosis by B cells depend on the B-cell receptor (BCR) on the cell surface. The ability of BCRs to bind to different antigens varies, leading to significant differences in the phagocytosis of different antigens by B cells. Therefore, detecting the phagocytosis of a single antigen may not reflect the overall phagocytic state of the B cells. Thus, in this study, we used proteomics to detect exogenous proteins in B cells pulsed with H. pylori antigens, which contain thousands of components, to evaluate their overall phagocytic capacity. Only the proteomic data are presented in our manuscript.

      Author response image 3.

      Antigen phagocytosis by A20 cells was measured using FITC-labeled OVA. (A) A20 cells were pulsed with FITC-labeled OVA, and the MFI of FITC was measured after 1 h. (B) The MFI of FITC was measured at different time points after adjuvant stimulation.

      Recommendation 3. It is not clear how MPLA, CpG, and MDP adjuvants modulate the presentation of low vs high stability peptides.

      Thank you for pointing this out. We acknowledge that we did not clarify the mechanisms by which adjuvants affect the stability of the peptides presented by APCs.

      We performed experiments to detect the expression of proteins involved in antigen processing and presentation in the different adjuvant-treated groups. Furthermore, shRNAs were used to knock down the expression of key molecules. Immunopeptidomics was used to detect peptide presentation. Unfortunately, the expected results for peptide presentation repertoires were not observed. We are still working on the mechanisms.

      Please also see our response to comment 1 of reviewer 1.

      The related statements were added in the Discussion section on page 13, lines 292–299: “In this study, we found that the peptide repertoires presented by APCs were significantly affected by the adjuvants CpG and MPLA, but not MDP. All three adjuvants belong to the PRR ligand adjuvant family. CpG and MPLA bind to TLRs and MDP is recognized by NOD2. Although the receptors are different, many common molecules are involved both in TLR and NLR pathway activation. Unfortunately, we did not demonstrate why the MDP had different impacts on peptide presentation compared with other adjuvants. Further investigation is required to clarify the mechanism by which MPLA, CpG, and MDP adjuvants modulate the presentation of peptides with different stabilities.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public review): 

      Summary: 

      Govindan and Conrad use a genome-wide CRISPR screen to identify genes regulating retention of intron 4 in OGT, leveraging an intron retention reporter system previously described (PMID: 35895270). Their OGT intron 4 reporter reliably responds to O-GlcNAc levels, mirroring the endogenous splicing event. Through a genome-wide CRISPR knockout library, they uncover a range of splicing-related genes, including multiple core spliceosome components, acting as negative regulators of OGT intron 4 retention. They choose to follow up on SFSWAP, a largely understudied splicing regulator shown to undergo rapid phosphorylation in response to O-GlcNAc level changes (PMID: 32329777). RNA-sequencing reveals that SFSWAP depletion not only promotes OGT intron 4 splicing but also broadly induces exon inclusion and intron splicing, affecting decoy exon usage. While this study offers interesting insights into intron retention and O-GlcNAc signaling regulation, the RNA sequencing experiments lack the essential controls needed to provide full confidence to the authors' conclusions. 

      Strengths: 

      (1) This study presents an elegant genetic screening approach to identify regulators of intron retention, uncovering core spliceosome genes as unexpected positive regulators of intron retention. 

      (2) The work proposes a novel functional role for SFSWAP in splicing regulation, suggesting that it acts as a negative regulator of splicing and cassette exon inclusion, which contrasts with expected SR-related protein functions. 

      (3) The authors suggest an intriguing model where SFSWAP, along with other spliceosome proteins, promotes intron retention by associating with decoy exons. 

      We thank the reviewer for recognizing and detailing the strengths of our manuscript. 

      Weaknesses: 

      (1) The conclusions on SFSWAP impact on alternative splicing are based on cells treated with two pooled siRNAs for five days. This extended incubation time without independent siRNA treatments raises concerns about off-target effects and indirect effects from secondary gene expression changes, potentially limiting confidence in direct SFSWAP-dependent splicing regulation. Rescue experiments and shorter siRNA-treatment incubation times could address these issues. 

      We repeated our SFSWAP knockdown and analyzed both OGT e4-e5 junction splicing and SFSWAP transcript levels by RT-qPCR from day 2 to day 5 after siRNA treatment (now included in Sup. Fig. S4). We observed that the time point at which OGT intron 4 removal increases (day 2) coincides with the time at which SFSWAP transcript levels start to decrease, consistent with a direct effect of SFSWAP knockdown on OGT intron 4 splicing. Moreover, the effect of SFSWAP knockdown on OGT intron 4 splicing peaks between days 4 and 5, supporting our use of these longer time points to cast a wide net for SFSWAP targets.

      (2) The mechanistic role of SFSWAP in splicing would benefit from further exploration. Key questions remain, such as whether SFSWAP directly binds RNA, specifically the introns and exons (including the decoy exons) it appears to regulate. Furthermore, given that SFSWAP phosphorylation is influenced by changes in O-GlcNAc signaling, it would be interesting to investigate this relationship further. While generating specific phosphomutants may not yield definitive insights due to redundancy and also beyond the scope of the study, the authors could examine whether distinct SFSWAP domains, such as the SR and SURP domains, which likely overlap with phosphorylation sites, are necessary for regulating OGT intron 4 splicing. 

      We absolutely agree with the reviewer that the current work stops short of a detailed mechanistic study, and we have made every attempt to be circumspect in our interpretations to reflect that limitation. In addition, we are very interested in delving more deeply into the mechanistic aspects of this regulation. In fact, we have initiated many of the experiments suggested by the reviewer (and more), but in each case, rigorous interpretable results will require a minimum of another year’s time.

      For example, we have used crosslinking and biotin labeling techniques (using previously available reagents from Eclipsebio) to test whether SFSWAP binds RNA. The results were negative, but the lack of strong SFSWAP antibodies required that we use a transiently expressed myc-tagged SFSWAP. Therefore, this negative result could be an artifact of the exogenous expression and/or tagging. Given the difficulties of “proving the negative”, considerably more work will be required to substantiate this finding. As another example, we intend to develop a complementation assay as suggested. For an essential gene, the ideal complementation system employs a degron system, and we have spent months attempting to generate a homozygous AID-tagged SFSWAP. Unfortunately, we have so far found only heterozygotes. Of course, this could be because the tag interferes with function, because the insert was not efficiently incorporated by homologous repair, or because we simply haven’t yet screened a sufficient number of clones. We’re confident that these technical issues can be addressed, but they will take a significant amount of time to resolve. While we would ideally define a mechanism, we think that the data reported here outlining functions for SFSWAP in splicing represent a body of work sufficient for publication.

      (3) Data presentation could be improved (specific suggestions are included in the recommendations section). Furthermore, Excel tables with gene expression and splicing analysis results should be provided as supplementary datasheets. Finally, a more detailed explanation of statistical analyses is necessary in certain sections. 

      We have addressed all specific suggestions as detailed in the recommendations below.

      Reviewer #2 (Public review): 

      Summary: 

      The paper describes an effort to identify the factors responsible for intron retention and alternate exon splicing in a complex system known to be regulated by the O-GlcNAc cycling system. The CRISPR/Cas9 system was used to identify potential factors. The bioinformatic analysis is sophisticated and compelling. The conclusions are of general interest and advance the field significantly. 

      Strengths: 

      (1) Exhaustive analysis of potential splicing factors in an unbiased screen. 

      (2) Extensive genome wide bioinformatic analysis. 

      (3) Thoughtful discussion and literature survey. 

      We thank the reviewer for recognizing and detailing the strengths of our manuscript. 

      Weaknesses: 

      (1) No firm evidence linking SFSWAP to an O-GlcNAc specific mechanism. 

      We couldn’t agree more with this critique. Indeed, our intention at the outset of the screen was to find an O-GlcNAc sensor linking OGT splicing with O-GlcNAc levels. As often occurs with high-throughput screens, we didn’t find exactly what we were looking for, but the screen nonetheless pointed us to interesting biology. Prompted by our screen, we describe new insights into the function of SFSWAP, a relatively uncharacterized essential gene. Currently, we are testing other candidates from our screen, and we are performing additional studies to identify potential O-GlcNAc sensors.

      (2) Resulting model leaves many unanswered questions. 

      We agree (see Reviewer 1, point 2 response).  

      Reviewer #3 (Public review): 

      Summary: 

      The major novel finding in this study is that SFSWAP, a splicing factor containing an RS domain but no canonical RNA binding domain, functions as a negative regulator of splicing. More specifically, it promotes retention of specific introns in a wide variety of transcripts including transcripts from the OGT gene previously studied by the Conrad lab. The balance between OGT intron retention and OGT complete splicing is an important regulator of O-GlcNAc expression levels in cells. 

      Strengths: 

      An elegant CRISPR knockout screen employed a GFP reporter, in which GFP is efficiently expressed only when the OGT retained intron is removed (so that the transcript will be exported from the nucleus to allow for translation of GFP). Factors whose CRISPR knockdown causes decreased intron retention therefore increase GFP, and can be identified by sequencing RNA of GFP-sorted cells. SFSWAP was thus convincingly identified as a negative regulator of OGT retained intron splicing. More focused studies of OGT intron retention indicate that it may function by regulating a decoy exon previously identified in the intron, and that this may extend to other transcripts with decoy exons. 

      We thank the reviewer for recognizing the strengths of our manuscript. 

      Weaknesses: 

      The mechanism by which SFSWAP represses retained introns is unclear, although some data suggests it can operate (in OGT) at the level of a recently reported decoy exon within that intron.

      Interesting/appropriate speculation about possible mechanisms are provided and will likely be the subject of future studies. 

      We completely agree that this is a limitation of the current study (see above). Now that we have a better understanding of SFSWAP functions, we will continue to explore SFSWAP mechanisms as suggested. 

      Overall the study is well done and carefully described but some figures and some experiments should be described in more detail. 

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      (1) Clarify and add missing statistical details across the figures. For example, Figure S2 lacks statistical comparisons, and in Figures 4A and 4C the tests applied should be specified in the legend. 

      We have added appropriate statistical analysis wherever missing and edited figure legends to specify the tests used.

      (2) The authors are strongly encouraged to provide detailed tables of gene expression and alternative splicing analyses from RNA-Seq experiments (e.g., edgeR, rMATS, Whippet, and MAJIQ), as this would enhance transparency and facilitate data interpretation. 

      We have added tables for gene expression and alternate splicing analysis as suggested (Suppl. tables 3-6).

      (3) Although the legend sometimes indicates differently (e.g., Figure 3b, 5a, 5c, etc), the volcano plots showing the splicing changes do not contain a cutoff for marginally differential percent spliced in or intron retention values. 

      The legends have been edited to reflect the correct statistical and/or PSI cutoffs.

      (4) For consistency, use a consistent volcano plot format across all relevant figures (Figures 3b, 5a-c, S3, S4, S7, and S8), including cutoffs for differential splicing and the total count of up- and down-regulated events. 

      Due to different statistical frameworks and calculations employed by different alternate splicing pipelines, we could not use the same cutoffs for different pipelines.  However, we have now indicated the number of up- and down-regulated events for consistency among the volcano plots.

      (5) What is the overlap of differentially regulated events between the different analytical methodologies applied? 

      We analyzed the degree of overlap between the three pipelines used in the paper using a Venn diagram (added to Suppl. Fig. S7). However, as widely reported in the literature (e.g., Olofsson et al., 2023; Biochem Biophys Res Commun. 2023; doi: 10.1016/j.bbrc.2023.02.053), the degree of overlap between pipelines is quite low.
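
      The kind of overlap summarized by such a Venn diagram can be sketched as follows. This is a minimal Python illustration, not the analysis code used for the manuscript. The pipeline names are taken from the reviewer's list, while the event identifiers and the assumption that each pipeline's significant events have already been reduced to comparable keys are hypothetical.

```python
from itertools import combinations


def pairwise_overlap(calls):
    """Return the Jaccard index for every pair of pipelines.

    `calls` maps a pipeline name to the set of event keys that the
    pipeline reports as significant.
    """
    result = {}
    for a, b in combinations(sorted(calls), 2):
        union = calls[a] | calls[b]
        shared = calls[a] & calls[b]
        # Jaccard index: shared events divided by all events seen in either set.
        result[(a, b)] = len(shared) / len(union) if union else 0.0
    return result


# Hypothetical toy input; real keys would come from each tool's output tables.
calls = {
    "rMATS": {"ev1", "ev2", "ev3", "ev4"},
    "Whippet": {"ev2", "ev3", "ev5"},
    "MAJIQ": {"ev3", "ev6"},
}

overlaps = pairwise_overlap(calls)
# Events reported by all three pipelines (the center of the Venn diagram).
common_to_all = set.intersection(*calls.values())

for pair, jaccard in overlaps.items():
    print(pair, round(jaccard, 2))
print(sorted(common_to_all))
```

      In practice, the hard part is defining comparable event keys, because each tool describes splicing events differently; this may be one reason reported overlaps between pipelines tend to be low.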

      (6) To further substantiate your conclusions, additional validations of RNA-Seq splicing data, ideally visualized on an agarose gel, would be valuable, especially for exons and introns regulated by SFSWAP, and particularly for OGT decoy exons in Figure 4c. 

      We have not included these experiments as we focused on other critiques for this resubmission. Because the RNA-seq, RT-PCR and RT-qPCR data all align, we are confident that the products we are seeing are correctly identified and orthogonally validated (Figs 2d, 4a, 4b, and 4c).  

      (7) It would be more informative if the CRISPR screen data were presented in a format where both the adjusted p-value and LFC values of the hits are presented. Perhaps a volcano plot? 

      We have now included these graphs in revised Supplementary Figure S2. 

      (8) In Figure 2d, a cartoon showing primer binding sites for each panel could aid interpretation, particularly in explaining the unexpected simultaneous increase in OGT mRNA and intron retention upon SFSWAP knockdown. 

      We have added a cartoon showing primer binding sites similar to that shown in Fig. 4a.

      (9) Page 9, line 1, states that SFSWAP autoregulates its expression by controlling intron retention. Including a Sashimi plot would provide visual support for this claim. 

      The data suggesting that SFSWAP autoregulates its own transcript abundance were reported in Zachar et al. (1994), not in our own studies. Validation of those data with our RNA-seq data is confounded by the fact that we are using siRNAs to knock down the SFSWAP RNA at the transcript level (Fig. S15).

      (10) In the legend of Figure S2 the authors state that negative results are inconclusive because RNA knockdowns are not verified by western blotting or qRT-PCR. This is correct, but the reviewer would also argue that the positive results are also inconclusive as they are not supported by a rescue experiment to confirm that the effect is not due to off-target effects. 

      This is a fair point with respect to the siRNA experiments on their own. However, the CRISPR screen was performed with sgRNAs, and MAGeCK RRA scores are high only for those genes that have multiple sgRNAs that up-regulate the gene. Examination of the SFSWAP sgRNAs individually shows that three of four SFSWAP sgRNAs had false discovery rates ≤10<sup>-42</sup> for GFP upregulation. Thus, the siRNAs provide an additional orthogonal approach. It seems unlikely that the siRNAs and three independent sgRNAs would share the same off-target effects. Thus, these combined observations support the conclusion that SFSWAP loss leads to decreased OGT intron retention.

      (11) For clarity in Figure 3a, consider using differential % spliced in or intron retention bar plots with directionality (positive and negative axis) and labeling siSFSWAP as the primary condition. 

      (12) Consider presenting Figure 5D as a box plot with a Wilcoxon test for statistical comparison. 

      For both points 11 and 12, we have tried the graphs as the reviewer suggested. While these were good suggestions, in both cases we felt that the original plots presented the data more clearly (see Author response image 1).

      Author response image 1.

      (13) Please expand the Methods section to detail the Whippet and MAJIQ analyses. 

      We have expanded the methods section to include additional details of the alternate splicing analysis.

      (14) Include coordinates for the four possible OGT decoy exon combinations analyzed in the Methods section. 

      We have added the coordinates of all four decoy forms in the methods section.  

      (15) A section on SFSWAP mass spectrometry is listed in Methods but is missing from the manuscript. 

      This section has now been removed.

      Reviewer #2 (Recommendations for the authors): 

      This is an excellent contribution. The paper describes an effort to identify the factors responsible for intron retention and alternate exon splicing in a complex system known to be regulated by the O-GlcNAc cycling system. The CRISPR/Cas9 system was used to identify potential factors. The bioinformatic analysis is sophisticated and compelling. The conclusions are of general interest and advance the field significantly. 

      Some specific recommendations. 

      (1) The plots in Figure 3 describing SI and ES events are confusing to this reader. Perhaps the violin plot is not the best way to visualize these events. The same holds true for the histograms in the lower panel of Figure 3. Not sure what to make of these plots. 

      For Figure 3b, we include both scatter and violin plots to represent the same data in two distinct ways. For Figure 3d, we agree that these are not the simplest plots to understand, and we have spent significant time trying to come up with a better way of displaying these trends in GC content as they relate to SE and RI events. Unfortunately, we were unable to identify a clearer way to present these data. 

      (2) The model (Figure 6) is very useful but confusing. The legend and the Figure itself are somewhat inconsistent. The bottom line of the figure is apparent but I fear that the authors are trying to convey a more complete model than is apparent from this figure. Please revise. 

      We have simplified the figure from the previous submission. As mentioned above, we admit that mechanistic details remain unknown. However, we have tried to generate a model that reflects our data, adds some speculative elements to be tested in the future, but remains as simple as possible. We are not quite sure what the reviewer was referring to as “somewhat inconsistent”, but we have attempted to clarify the model in the revised Discussion and Figure legend.  

      (3) It is unclear how normalization of the RNA seq experiments was performed (eg. Figure S5 and 6).  

      The normalization differences in Fig. S5 and S6 (now Fig S8 and S9) were due to scaling differences during the use of rmats2sashimiplot software. We have now replaced Fig. S5 to reflect correctly scaled images.

      I am enthusiastic about the manuscript and feel that with some clarification it will be an important contribution. 

      Thank you for these positive comments about our study!

      Reviewer #3 (Recommendations for the authors): 

      (1) In Figure 1f, it is clear that siRNA-mediated knockdown of OGT greatly increases spliced RNA as the cells attempt to compensate by more efficient intron removal (three left lanes). However, there is no discussion of the various treatments with TG or OSMI. Might quantitation of these lanes not also show the desired effects of TG and OSMI on spliced transcript levels? 

      The strong effect of OGT knockdown masks the (comparatively modest) effects of subsequent inhibitor treatments on the reporter RNA. We have edited the results section to clarify this.

      (2) In Figure 2c, why is the size difference between spliced RNA and intron-retained RNA so different in the GFP-probed gel (right) compared with the OGT-probed gel (left)? Even recognizing that the GFP probe is directed against reporter transcripts, and the OGT probe (I think) is directed against endogenous OGT transcripts, shouldn't the difference between spliced and unspliced bands be the same, i.e., +/- the intron 4 sequence. Also, why does the GFP probe detect the unspliced transcript so poorly? 

      The fully spliced endogenous OGT mRNA is ~5.5 kb while the fully spliced reporter is only ~1.6kb, so the difference in size (the apparent shift relative to the mRNA) is quite different. Moreover, the two panels in Fig 2c are not precisely scaled to one another, so direct comparisons cannot be made. 

      The intron retained isoform does not accumulate to high levels in this reporter, a phenotype that we also observed with our GFP reporter designed to probe the regulation of the MAT2A retained intron (Scarborough et al., 2021). We are not certain about the reason for these observations, but suspect that the reporter RNA’s retained intron isoforms are less stable in the nucleus than their endogenous counterparts. Alternatively, the lack of splicing may affect 3´ processing of the transcripts so that they do not accumulate to the high levels observed for the wild-type genes. 

      (3) Please provide more information about the RNA-seq experiments. How many replicates were performed under each of the various conditions? The methods section says three replicates were performed for the UPF1/TG experiments; was this also true for the SFSWAP experiments?  

      All RNA-seq experiments were performed in biological triplicates. We have edited the methods section to clarify this.

      (4) Relatedly, the several IGV screenshots shown in Figure 3C presumably represent the triplicate RNA seq experiments. In part D, how many experiments does the data represent? Is it a compilation of three experiments? 

      Fig. 3d is derived from alternate splicing analysis performed on three biological replicates. We have added the number of replicates (n=3) on the figure to clarify this. We have also noted that the three IGV tracks represent biological replicates in the Figure legend for 3c.  

      (5) Please provide more details regarding the qRT-PCR experiments. 

      We have provided the positions of primer sets used for RT-qPCR analysis and cartoon depictions of target sites below the data wherever appropriate.

      (6) In the discussion of decoy exon function (in the Discussion section), several relevant observations are cited to support a model in which decoy exons promote assembly of splicing factors. One might also cite the finding that eCLIP profiling has found enriched binding of U2AF1 and U2AF2 at the 5' splice site region of decoy exons (reference 16). 

      Excellent point. This has now been added to the Discussion. 

      Minor corrections / clarifications: 

      (1) In the Figure 2A legend, CRISPR is misspelled. 

      Corrected.

      (2) In the discussion, the phrase "indirectly inhibits splicing of exons 4 and 5, but promoting stable unproductive assembly of the spliceosome", the word "but" should probably be "by". 

      Corrected.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      (1) Figure 2 and related text: it would be useful to explain more explicitly what is meant by "neurogenic" and "non-neurogenic" models. I presume that the total number of neurons in non-neurogenic models is lower than in neurogenic models because no new neurons are added. It would be useful to plot the number of GCs as a function of timesteps.

      We have clarified the distinction between neurogenic and non-neurogenic models in the text (Lines 142-145), explicitly noting that in non-neurogenic models, no new GCs are added, resulting in a lower total neuron count over time. In response to the reviewer’s suggestion, we generated a plot showing the number of GCs over time (see below). Because the neurogenic model exhibits a simple linear increase, we found this plot not especially informative for inclusion in the manuscript. However, we agree with the reviewer’s later comments that similar plots are useful for interpreting specific results, and we have included those where appropriate.

      Author response image 1.

      Number of GCs over time for neurogenic (solid line) and non-neurogenic (dotted line) networks

      (2) Figure 2F, G: memory declines dramatically when the number of GCs at enrichment onset increases beyond an optimum. Why?

      We have explained the reasoning more thoroughly in the text (Lines 174-177) and added a new supplemental figure to support this reasoning (Figure S2). As the number of GCs increases, the network becomes overly inhibited and the response of abGCs to the stimuli decreases (Fig S2A). This leads to a smaller population of GCs being able to integrate with the stimulus (Fig S2B) which is expected given the activity-dependent plasticity rule. Moreover, it can be seen in Fig S2C that for networks with increasing size, the GCs that do learn only connect to MCs that are driven strongest by the stimuli until they struggle to connect to any MCs at all.

      In principle, a homeostatic mechanism like synaptic scaling could reduce activity to restore balance, but such a mechanism would also likely disrupt existing memories. Alternatively, we suggest activity-dependent apoptosis as a superior homeostatic mechanism because it leads to a stable level of activity without substantially erasing existing memories.

      (3) The paragraph describing synaptic connectivity of abGCs (related to Figure 2H) is confusing. What is the directionality of synapses considered here: mitral-to-granule, or granule-to-mitral? The text is opaque here. Connectivity matrix in Figure 2H: who is presynaptic, who is postsynaptic? If I understand correctly, these questions are actually irrelevant because all mitral-granule synapses in the network are reciprocal. This should be pointed out explicitly in the figure legend. Generally: the fact that the network is fully reciprocal (if I understand correctly) is very important but not stated with sufficient emphasis. It should be stated very explicitly in the text that connectivity matrices are fully reciprocal, and an equation clarifying this point should be included in Methods.

      (6) Connectivity matrix: to what degree was connectivity between mitral and granule cells reciprocal (fraction of connections in either direction that were paired with a connection in the opposite direction between the same cell pair)? Was connectivity shaped by experience (enrichment) reciprocal?

      (7) Directly related to the above: it would be useful to show the disynaptic connectivity matrix between mitral cells and analyze its symmetry. For the symmetric component, it should then be analyzed what fraction of this can be attributed to the reciprocal synapses, and what fraction is contributed by connectivity via different granule cells. This should then be compared to models with biologically realistic fractions of reciprocal connections. Is the model proposed here consistent with a biologically realistic fraction of reciprocal synapses between mitral-granule cell pairs?

      We appreciate these insightful and detailed comments. We agree that the assumption that MC-GC synapses were fully reciprocal was not clearly stated. We now explicitly state this in the main text (lines 90-94, 369-370, Figure 2 caption) and methods (line 561), and emphasize its importance. As the reviewer points out, this is a simplifying assumption and does not fully reflect the biology because not all synapses are reciprocal in the true system. We also note that our synaptic plasticity model does not break the reciprocity assumption: all connections added or pruned during learning remain reciprocal. As a result, the disynaptic connectivity matrix (Bottom panel below, MCs sorted by stimulus as shown in the top panel) is always symmetric.

      We have now made these statements explicit in the main text and in the methods. Regarding functional consequences of this assumption, earlier work by our group has examined the impact of the degree of reciprocity of MC-GC synapses in a similar OB model (Chow, Wick & Riecke, Plos Comp Bio 2012). The study examined three different changes in reciprocity by (1) redirecting a fraction of the inhibitory connections of each GC to randomly chosen MCs instead of the MCs that drive that GC, (2) allowing heterogeneity in reciprocal weights so that there is no relationship between the strength of the MC -> GC synapse and the GC -> MC synapse, (3) reducing the level of self-inhibition a MC receives from the GCs that it excites. The model was found to be quite robust to each of these manipulations, suggesting that our present model likely remains functionally relevant even if biological reciprocity is partial. We reference this work now in the discussion, lines 490-492.

      Author response image 2.

      Disynaptic connectivity. Top: MC activity in response to the two stimuli, sorted by MC selectivity. Bottom: Disynaptic connectivity matrix (diagonal subtracted).
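
      The symmetry argument above can be illustrated with a minimal sketch. It assumes a binary MC-by-GC matrix `W` in which each entry stands for one reciprocal synapse; the function names, network sizes, and connection probability are hypothetical and are not taken from the manuscript. Under that assumption the disynaptic MC-to-MC matrix is the product of `W` with its own transpose, so it stays symmetric however synapses are added or pruned.

```python
import random

def random_reciprocal_connectivity(n_mc, n_gc, p, seed=0):
    """Binary MC-by-GC matrix (illustrative). W[i][j] == 1 means MC i and GC j
    share a reciprocal synapse, i.e. MC i excites GC j and GC j inhibits MC i."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for _ in range(n_gc)]
            for _ in range(n_mc)]

def disynaptic_matrix(W):
    """D[i][k] = number of GCs connected to both MC i and MC k.
    The diagonal (self-inhibition) is set to zero."""
    n_mc = len(W)
    D = [[0] * n_mc for _ in range(n_mc)]
    for i in range(n_mc):
        for k in range(n_mc):
            if i != k:
                D[i][k] = sum(a * b for a, b in zip(W[i], W[k]))
    return D

def is_symmetric(D):
    """True if D[i][k] == D[k][i] for every pair of MCs."""
    n = len(D)
    return all(D[i][k] == D[k][i] for i in range(n) for k in range(n))

# Hypothetical sizes, chosen only to keep the example small.
W = random_reciprocal_connectivity(n_mc=8, n_gc=20, p=0.3)
D = disynaptic_matrix(W)
print(is_symmetric(D))
```

      Because a single entry of `W` encodes both directions of a synapse, flipping any entry (adding or pruning a connection) cannot make the disynaptic matrix asymmetric in this sketch. Partial reciprocity would require two separate matrices, one per direction.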

      (4) How were mitral cells sorted in Figure 2H? This needs to be explained.

      (5) Directly related to the point above: the text mentions that synaptic connectivity between GCs of the "learning cluster" and mitral cells (which direction?) is increased for mitral cells responding by enrichment odors, but this is not shown in the figure. This statement suggests that mitral cells sorted to the bottom of the y-axis respond more strongly to enrichment odors, but the information is not given directly. Please provide more information to back up your statements.

      Indeed as the reviewer inferred, MCs in Figure 2H were sorted so that those that receive the strongest stimulation from the odor were at the bottom of the y-axis. We have clarified this in the Figure 2 caption and added a subplot to Figure 2H showing the average MC input to make this more explicit.

      (8) Apoptosis (Figure 4 and related text): paragraph 231ff is somewhat difficult to comprehend because the "number" of enrichments should really be the "frequency" of enrichments. In Figure 4, it is not mentioned explicitly that each enrichment is with different random new odors.

      We agree that the term “number” of enrichments was imprecise and have revised the text to refer instead to the frequency of enrichment events (Lines 255-267). We also clarified that in Figure 4, each enrichment corresponds to a different set of randomly sampled odors, and we now state this explicitly in both the Figure 4 legend and main text (Lines 260-261).

      (9) Apoptosis: apoptosis improves memory but the underlying reason remains opaque. A simple prediction of the data in Figure 4D and 4E is that the number of GCs in 4D is lower than in 4E. It would be helpful to show this. Furthermore, an obvious question that arises is whether a higher frequency of enrichments improves memories because the total number of granule cells is kept low, or because granule cells are removed specifically based on their activity (or both). This could be addressed easily by artificially removing a random subset of granule cells in a simulation such as 4E to match granule cell numbers to the case in 4D.

      Apoptosis improves learning because it reduces the total inhibition in the network by removing GCs and thus prevents the deficits in learning that occur in Fig. 2G as GCs accumulate in the network. As the reviewer inferred, the number of GCs in Figure 4D is lower than in 4E and this is now clarified in the text. This difference was shown implicitly in Supplementary Figure S4D (previously S3D), but we now explicitly reference this plot to support this point as well (Line 266).

      As the reviewer notes, there is a question of whether increased enrichment frequency improves memory because it limits the total number of GCs, or because apoptosis selectively removes GCs based on their activity, or both. Our model supports both mechanisms. Importantly, simply reducing GC numbers through random deletion will degrade existing memories: random removal erodes memory representations encoded by those GCs. In contrast, our age- and activity-dependent apoptosis rule targets a specific cohort of adult-born GCs. This selective removal minimizes damage to existing memories encoded by GCs outside of this cohort while keeping GC numbers within a regime that supports robust learning (as shown in Figure 2G).

      However, we note that if enrichment frequency becomes too high, even recent memories can be lost due to premature pruning of GCs that have not yet stabilized their synaptic connections. This tradeoff has been shown experimentally (Forest et al., Nat Comm 2019), and we reproduce it in our model (Figure S4).

      (10) Text related to Figure 5: "Learning flexibility...approached a steady state when the growth of the network started to saturate". Please show the growth (better: size) of the network (total number of GCs) for these simulations (and other panels in Figure 5). It would also be useful to show the total number of GCs in other figures (e.g. Figure 4; see above).

      We have now added a supplementary figure (Figure S6) that shows the total number of GCs over time for the simulations presented. This confirms that the network size approaches a steady state around the same time that learning flexibility begins to plateau, as noted in the original text (now line 275), and highlights the large number of GCs without apoptosis as well as the slightly reduced number of GCs in the permanent encoding model (line 312).

      (11) As much as I appreciate the comprehensive discussion of the results in a broader context, I feel that the discussion can be somewhat shortened. The section on lateral inhibition is not fully valid given that synaptic connectivity is reciprocal. I also feel that much of the final section (Model assumptions and outlook) can be dropped (except for the last paragraph), not because anything is irrelevant, but because these points have been made, often repeatedly, in the text above.

      We agree that the discussion could be streamlined and have revised the manuscript accordingly. Specifically, we have shortened the section on lateral inhibition and clarified that the OB relies predominantly on reciprocal connectivity (Line 370). We also agree that parts of the final section were repetitive and have removed these. However, to address comments by Reviewer 3, we also expanded on some of the model assumptions. We thank the reviewer for helping us improve the clarity and focus of the manuscript.

      (12) Figure 5: bolding every 5th curve is confusing.

      We have adjusted our figure accordingly.

      (13) "...we biased the dendritic field...": it would be helpful to explain the idea of a "dendritic field" in a bit more detail prior to this sentence.

      We have now noted, where we initially describe the model (Line 97), that a GC’s "dendritic field" refers to the subset of MCs with which it is capable of forming synaptic connections.

      Reviewer #3:

      (1) The authors find that a network with age-dependent synaptic plasticity outperforms one with constant age-independent plasticity and that having more GC per se is not sufficient to explain this effect. In addition, having an initial higher excitability of GCs leads to increased performance. To what degree the increased excitability of abGCs is conceptually necessarily independent of them having higher synaptic plasticity rates / fast synapses?

      We thank the reviewer for this question, as the difference between excitability and plasticity rate in memory formation is something we intended to highlight in this study. We have updated the text (Lines 157-198) to clarify this.

      At the cellular level, a neuron's excitability and its rate of synaptic plasticity are mechanistically distinct: excitability is governed by factors such as ion channel expression or membrane resistance, whereas plasticity rates are influenced by molecular pathways involved in synapse and dendritic spine formation and remodeling. While these are independent properties, they are functionally coupled: most synaptic plasticity rules are activity-dependent, so greater excitability can increase the likelihood of plasticity being induced but does not itself guarantee learning.

      Our model reflects this distinction. Increased excitability biases which neurons become activated and thus eligible to undergo plasticity, but actual learning still depends on the plasticity rate itself. This can be seen by comparing the model with constant plasticity and excitability (solid blue and green curves in Figure 2C) to the model with only transient excitability (solid blue and green lines in Figure 2E). In both cases, the strength and duration of the memory remain limited by the plasticity rate. We note additionally that, in this network, neurons compete to learn new stimuli: as GCs start to learn, they suppress MC activity through recurrent inhibition, which suppresses learning in other GCs that otherwise would have been in a position to learn the odor. As a result, there is not a significant increase in the overall number of neurons recruited to learn (Figure 2J). In a different network architecture, such as a feedforward network, we would not expect this to be the case; greater excitability in a population of neurons would likely increase the memory by increasing the number of neurons recruited to learn. Transiently enhanced excitability biases which neurons join the memory engram (Figure 2J), but the extent and rate of learning still depend on the plasticity rates themselves. We did note in the original text (now lines 284-286) that this bias in recruitment subtly increases memory stability, but the extent is not great. In principle, a model can be engineered to rely on transiently increased excitability to encode memories in orthogonal subpopulations of neurons, and this could resolve the flexibility-stability dilemma. However, in that case, the number of memories that can be stored within a short time would be bounded by the size of this subpopulation such that even if a large number of odors are presented, mature GCs cannot become part of the engram and the network would likely fail to learn the stimuli. However, when this was tested experimentally (Forest et al. Cereb Cor. 2020), it was found that mature GCs participated in the engram when the number of odors was sufficiently high. Our results are consistent with these experiments: for complex odor environments, neonatal GCs, which are mature during odor exposure, and abGCs both participate in the engrams.

      Author response image 3.

      Simulating learning in more complex odor environments. Top: enrichment consisted of three odor pairs presented sequentially in a random order. Bottom: enrichment consisted of five odor pairs. Left: discriminability of the odor pairs over time. Middle: connectivity between MCs (sorted by odor selectivity) and GCs (sorted by age). In both cases AbGCs develop a clear connectivity structure. In more complex environments neonatal GCs also start to develop a clear connectivity structure. Right: combined engram membership across all stimuli by GC age.

      In sum, transiently increased excitability alone will not make learning any faster, so a fast learning system must have a high plasticity rate. If this plasticity rate stays high, then memories stored in these neurons, even if no longer highly excitable, will be vulnerable as the neurons can still be driven above their plasticity threshold by moderately interfering stimuli and will thus be quickly forgotten. Conversely, if the reviewer is wondering if a greater increase in the plasticity rate of new neurons can compensate for a lack of excitability, this is not the case: if a newborn neuron is not sufficiently driven by the stimulus it will not learn regardless of how high its plasticity rate is.

      (2) The authors do not mention previous theoretical work on the specificity of mitral to granule cell interactions from several groups (Koulakov & Rinberg - Neuron, 2011; Gilra & Bhalla, PLoSOne, 2015; Grabska-Bawinska...Mainen, Pouget, Latham, Nat. Neurosci. 2017; Tootoonian, Schaefer, Latham, PLoS Comput. Biol., 2022), nor work on the relevance of top-down feedback from the olfactory cortex on the abGC during odor discrimination tasks (Wu & Komiyama, Sci. Adv. 2020), or of top-down regulation from the olfactory cortex on regulating the activity of the mitral/tufted cells in task engaged mice (Lindeman et al., PLoS Comput. Biol., 2024), or in naïve mice that encounter odorants (in the absence of specific context; Boyd, et al., Cell Rep, 2015; Otazu et al., Neuron 2015, Chae et al., Neuron, 2022). In particular, the presence of rich top-down control of granule cell activity (including of abGCs) puts into question the plausibility of one of the opening statements of the authors with respect to relying solely on local circuit mechanisms to solve the flexibility-stability dilemma. I think the discussion of this work is important in order to put into context the idea of specific interactions between the abGCs and the mitral cells.

      We thank the reviewer for these detailed and thorough comments, and whole-heartedly agree that it is important to discuss the listed studies in order to contextualize our work through the broader lens of how information is processed in the OB. We have expanded our discussion to further acknowledge and integrate insight from previous theoretical and experimental work cited by the reviewer. (Lines 361-366, 493-550)

      Regarding the importance of top-down feedback, we of course recognize that in practice cortical inputs play a critical role in abGC survival and synaptic integration. However, its nature is not quite clear and is likely variable across behavioral settings. In the paradigm that we study in the manuscript, there is likely no key reward value or contextual signal that is relayed to the OB. One plausible interpretation is that in this task, cortical feedback provides a random, variable baseline excitatory drive to GCs. This would likely be consistent with many of the listed studies, e.g.

      (1) Glomerular layer targeting of feedback would be explicitly unrelated to glomerular odor specificity, as in Boyd et al.

      (2) GC activity would decrease if these cortical inputs were silenced, resulting in stronger MC responses as in Otazu et al., Chae et al.

      (3) Silencing PCx during learning would prevent GCs from reaching activity-dependent plasticity thresholds, resulting in decreased spine density as in Wu & Komiyama.

      Likewise activating PCx would lead to increased spine density.

      In this interpretation, the effect of top-down input could be captured implicitly by adjusting model parameters such as activity or plasticity thresholds. For the purposes of our study, we opted to neglect these inputs in favor of model simplicity.

      Critically, even if top-down inputs play a substantially larger role, by perhaps even going as far as providing signals to abGCs to modulate their development, the core solution to the flexibility-stability dilemma that we describe stays local: we predict that the memory persists in the same network in which it was formed.

      (3) To what the degree of specific connectivity reflects a specific stimulus configuration, and is a good proxy for determining the stimulus discriminability and memory capacity in terms of temporal activity patterns (difference in latency/phase with respect to the respiration cycle, etc.) which may account to a substantial fraction of ability to discriminate between stimuli? The authors mention in the discussion that this is, indeed, an upper bound and specific connectivity is necessary for different temporal activity patterns, but a further expansion on this topic would help in understanding the limitations of the model.

      We thank the reviewer for raising this important point. Indeed, there have been several recent experimental studies indicating that much of the information needed for olfactory discrimination is encoded in the temporal activity patterns of mitral and tufted cells. Our model does not explicitly simulate these dynamics. It was for this reason that we defined memory in terms of the learned structure of the network rather than by firing rate activity. This is motivated by the idea that learned patterns of connectivity constrain the space of neural activity the network can support, and thus shape stimulus responses. We now make this limitation more explicit in the discussion and clarify that the specific MC–GC connectivity we analyze should be seen as a structural substrate that constrains the possible temporal transformations the network could support (Lines 492-506).

      (4) Reward or reward prediction error signals are not considered in the model. They however are ubiquitous in nature and likely to be encountered and shape the connectivity and activity patterns of the abGC-mitral cell network. Including a discussion of how the model may be adjusted to incorporate reward/error signals would strengthen the manuscript.

      We appreciate the reviewer’s suggestion and agree that reward and reward prediction error signals are critical components of many learning paradigms. We deliberately chose not to model associative learning, reward signals or top-down neuromodulation in this work. Our goal is to investigate the role of adult neurogenesis in a regime where its contribution has been shown to be experimentally necessary. Specifically, we focused on an unsupervised perceptual learning paradigm where adult neurogenesis is required for successful odor discrimination (Moreno et al. PNAS, 2009). In contrast, when the same odors are used in a rewarded learning paradigm, performance remains intact even when adult neurogenesis is ablated (Imayoshi et al., Nat. Neuro., 2008). This dissociation suggests that neurogenesis is dispensable in contexts where reward can guide learning. As such, we argue that isolating the contribution of local circuit dynamics in an unsupervised setting is critical to understanding what neurogenesis is uniquely enabling, especially given the evolutionary cost of maintaining it.

      We agree that extending this work to incorporate reward-driven plasticity or neuromodulatory influences would be a valuable direction for future research. In particular, it could help clarify how different learning paradigms engage distinct abGC cohorts (e.g., Mandairon et al., eLife 2018; Wu & Komiyama, Sci. Adv. 2020), and how task structure shapes memory allocation and engram composition. We have incorporated this into the discussion of extending our model to include top-down feedback (lines 539-553).

      Specific comments

      (1) Lines 84-86; 507-509; Eq(3): Sensory input is defined by a basal parameter of MCs spontaneous activity (Sspontaneus) and the odor stimuli input (Siodor) but is not clear from the main text or methods how sensory inputs (glomerular patterns) were modeled

      We now clarify in the Methods section "Stimulus model" how the sensory inputs were modeled. Specifically, odor-evoked inputs to mitral cells (Siodor) were generated either as Gaussian profiles across the mitral cell population (Figs. 2,3) or as sparser random patterns (Figs. 4,5). In Figures 2 and 3, the denser Gaussian stimuli require more GCs to learn the odors, aiding in visualization of the connectivity matrix (Figure 2H) and abGC recruitment plots (Figure 2I,J; Figure 3C,E). However, real olfactory stimuli activate a sparse set of MCs, so in Figures 4 and 5, where we address learning of many stimuli, we utilize sparser, binary stimuli delivered to only 10% of MCs, within the range of experimental data (Wachowiak and Cohen, Neuron, 2001). The fact that the stimuli are binary, however, is not realistic and leads to denser representations. This leads to a worst-case scenario for the model, as denser memory representations are easier to overwrite. These points have been added explicitly to the Methods section "Stimulus model" to improve clarity.
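
      The two stimulus families described above can be sketched as follows. This is an illustration under stated assumptions rather than the manuscript's implementation: the function names, the Gaussian centre and width, the amplitude, and the population size are all hypothetical. Only the Gaussian profile across MCs and the binary pattern reaching 10% of MCs come from the response.

```python
import math
import random

def gaussian_stimulus(n_mc, center, width, amplitude=1.0):
    """Dense stimulus: a Gaussian profile of input across the MC population.
    center, width and amplitude are illustrative parameters."""
    return [amplitude * math.exp(-((i - center) ** 2) / (2 * width ** 2))
            for i in range(n_mc)]

def sparse_binary_stimulus(n_mc, fraction=0.10, seed=0):
    """Sparse stimulus: binary input delivered to a random subset of MCs
    (10% by default, the fraction quoted in the response)."""
    rng = random.Random(seed)
    n_active = round(fraction * n_mc)
    active = set(rng.sample(range(n_mc), n_active))
    return [1.0 if i in active else 0.0 for i in range(n_mc)]

# Hypothetical population of 100 MCs.
dense = gaussian_stimulus(n_mc=100, center=50, width=5.0)
sparse = sparse_binary_stimulus(n_mc=100)
print(sum(1 for x in sparse if x > 0), "of", len(sparse), "MCs receive sparse input")
```

      In this sketch the sparse pattern drives every active MC at full strength, which is the binary simplification the response flags as unrealistic.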

      (2) Lines 118-122: The used perceptual learning task explanation is done only in the context of the discriminability of similar artificial stimuli using the Fisher discriminant and "Memory" metric. A detailed description of the logic of the perceptual learning task methods and objective, taking into account Comment 1, would help to better understand the model.

      We thank the reviewer for pointing out that we had not adequately described the task and have updated the main text (lines 125-132) and included a new methods section "Perceptual learning task" to describe it more explicitly. The experiments that inspired the simulation followed an ecological model of discrimination learning (Moreno et al. PNAS 2009): For one hour a day over a ten-day "enrichment period", two tea balls containing similar but distinct odors were suspended from the lid of each mouse's home cage. The mice engaged with the stimuli under self-directed conditions, therefore learning through natural experience. As a result, the mice use olfactory information to discriminate between the similar stimuli, a skill potentially relevant for navigation or social behaviors.

      In our simulations, we model these experiments as follows. During the enrichment period, the model is stimulated with a randomly selected stimulus chosen from a set of two similar stimuli, corresponding to a mouse choosing to sniff one of the tea balls. During enrichment, in between these bouts of "sniffing", the model only receives spontaneous activity, reflecting the temporal sparsity of sensory input even over the enrichment period. Outside of enrichment, the model again receives only spontaneous input.
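
      The input protocol described above can be sketched as a simple schedule generator. The per-step sniffing probability `p_sniff`, the step counts, and the labels "A" and "B" are hypothetical choices made for illustration; the response specifies only that one of two similar stimuli is chosen at random during enrichment and that the model otherwise receives spontaneous input.

```python
import random

def input_schedule(n_steps, enrich_start, enrich_end, p_sniff=0.2, seed=0):
    """Return one input label per time step.

    Inside the enrichment window a 'sniff' occurs with probability p_sniff
    (an assumed value) and presents odor A or B at random. At every other
    step the model receives spontaneous input only."""
    rng = random.Random(seed)
    schedule = []
    for t in range(n_steps):
        in_enrichment = enrich_start <= t < enrich_end
        if in_enrichment and rng.random() < p_sniff:
            schedule.append(rng.choice(["A", "B"]))
        else:
            schedule.append("spontaneous")
    return schedule

# Hypothetical run: 1000 steps with enrichment between steps 200 and 400.
schedule = input_schedule(n_steps=1000, enrich_start=200, enrich_end=400)
odor_steps = sum(1 for label in schedule if label != "spontaneous")
print(odor_steps, "odor presentations, all inside the enrichment window")
```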

      (3) Rapid re-learning of forgotten odor pair is enabled by sensory-dependent dendritic elaboration of neurons that initially encoded the odors and the observed re-learning would occur even if neurogenesis was blocked following the first enrichment and even though the initial learning did require neurogenesis. When this would ever occur in nature? The re-learning of an odor period? Why is this highlighted in the study?

      We believe that this sort of learning is certainly relevant in nature. To clarify: by “learning,” we do not refer to the memory of an entire “odor period”, but simply an altered mapping of specific stimuli. Therefore, forgetting could occur if these specific stimuli are absent from the environment for a period of time, and re-learning would occur when these stimuli are re-encountered. Natural odor environments are highly dynamic, as environmental conditions and social contexts change over time. The odors an animal encounters also depend strongly on its own behavior; as it explores different environments, it may be exposed to particular odors intermittently: it could encounter them in one location, then not return to that location for some time before returning again.

      Such natural variability in odor exposure makes the ability to forget and re-learn especially valuable, allowing the animal to prioritize relevant information while maintaining flexibility. To this end, we show in Figure 5G that the synaptic forgetting of odors is beneficial to the performance of the model because it reduces interference in the network. Re-learning enabled by adult neurogenesis is therefore a highly efficient strategy for memory storage and retrieval, which is why we emphasize it in this study.

      (4) Figure 2A: I understand that the ages shown at the bottom of the colored boxes represent the GC age. If so, find a better way to express that to avoid confusing 'GC ages' from the days shown in the perceptual learning task description (Figure 2B).

      We have updated the text in the figure to disambiguate the two. The “days” shown in the perceptual learning task description are now labeled “time relative to enrichment”.

      (5) Figure 2B: Clarify how the two-dimensional arrays are arranged to represent the patterns shown. Does each point of the array represent one neuron? If so, are these neurons re-arranged to help the readers visually differentiate patterns A and B? Are the patterns of activity of MCs in the model spatially and temporally sparse as observed in experimental work?

      In Figure 2B, each point in the two-dimensional array represents the activity of a single mitral cell. The layout is purely for visualization—neurons are re-arranged to make the differences between odor patterns A and B visually apparent. This ordering does not reflect anatomical position or model architecture. We revised the Figure 2 caption to say this explicitly.

      Regarding spatial sparseness, as we mentioned in the response to the reviewer’s comment (1), the activity of mitral cells in response to odors is spatially sparse in the model. Regarding temporal sparseness, the model is not spiking and does not include temporal dynamics within the timescale of the breath. However, odor input is delivered in discrete, odor-specific epochs interleaved with periods of no input, which leads to temporally structured activity patterns. This information has been made explicit in the new methods sections "Stimulus model" and "Perceptual learning task".

      (6) Figure 3C and Line 189: potential confusion between the color code mentioned in the legend for the enrichment and developing periods.

      It appeared to be a confusion in the text and has been corrected (Lines 212-213).

      (7) Figure 5F: For clarity, this would benefit from replacing the bold line with areas in the plot to depict the enrichment periods.

      We agree that shaded areas are clearer than the bolded line segments and have updated the figure accordingly. We appreciate the reviewer's suggestion.

      (8) Lines 380, 416: Potential role of cortical feedback and or neuromodulation depending on behavioral relevance or permanent exposure? Later mentioned in Lines 467 - 474.

      We have updated the text to acknowledge the role of potential cortical feedback and neuromodulation, now in lines 403-407.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      In this manuscript, Butkovic et al. perform a genome-wide association (GWA) study on Arabidopsis thaliana inoculated with the natural pathogen turnip mosaic virus (TuMV) in laboratory conditions, with the aim to identify genetic associations with virus infection-related parameters. For this purpose, they use a large panel of A. thaliana inbred lines and two strains of TuMV, one naïve and one pre-adapted through experimental evolution. A strong association is found between a region in chromosome 2 (1.5 Mb) and the risk of systemic necrosis upon viral infection, although the causative gene remains to be pinpointed.

      This project is a remarkable tour de force, but the conclusions that can be reached from the results obtained are unfortunately underwhelming. Some aspects of the work could be clarified, and presentation modified, to help the reader.

      (Recommendations For The Authors):

      • It is important to note that viral accumulation and symptom development do not necessarily correlate, and that only the former is a proxy for "virus performance". These concepts need to be clear throughout the text, so as not to mislead the reader.

      This has been explained better in lines 118-120, and “virus performance” has been removed.

      • Sadly, only indirect measures of the viral infection (symptoms) are used, and not viral accumulation. It is important to note that viral accumulation and symptom development do not necessarily correlate and that only the former is a proxy for "virus performance". These concepts need to be clear throughout the text, so as not to mislead the reader. The mention of "virus performance" in line 143 is therefore not appropriate, nor is the reference to viral replication and movement in the Discussion section.

      "Virus performance" was removed. Also, the reference to viral replication and movement in the Discussion section has been removed.

      Now we mention: “We did not measure viral accumulation, but note this is significantly correlated with intensity of symptoms within the Col-0 line (Corrêa et al. 2020), although it is not clear if this correlation occurs in all lines.”

      • Since symptoms are at the center of the screen, images representing the different scores in the arbitrary scales should ideally be shown.

      Different Arabidopsis lines would look different, and this could mislead a reader not familiar with the lines. To represent the criteria we used to establish the symptoms, we believe that a schematic representation is clearer to interpret. Here are some pictures of different lines showing varying symptoms:

      Author response image 1.

      • Statistical analyses could be added to the figures, to ease interpretation of the data presented.

      Statistical analysis can be found in methods. We prefer to keep the figure legend as short as possible.

      • The authors could include a table with the summary of the phenotypes measured in the panel of screened lines (mean values, range across the panel, heritability, etc.).

      These data are plotted in Fig. 1. We believe that repeating this information in tabular form would not contribute to the main message of the work. Phenotype data and the code to reproduce Figure 1 are available at GitHub (as stated in Data Availability), so anyone interested can freely explore the phenotypes of the screened lines.

      • The definition of the association peak found in chromosome 2 could be explained further: is the whole region (1.5 Mb) in linkage disequilibrium? How many genes are found within this interval, and how were the five strong candidates the authors mention in line 161 selected? It is also not clear which are these 5 candidates, apart from AT2G14080 and DRP3B - and among those in Table 1 (which, by the way, is cited only in the Discussion and not in the Results section)? Why were AT2G14080 and DRP3B in particular chosen?

      We have replaced Table 1 with an updated Table S1 listing all genes found within the range of significant SNPs for each peak. We now highlight a subset of these genes as candidate genes if they have functions related to disease resistance or defence, and mention them explicitly in the text (lines 173-179). We have explicitly described how this table was constructed in the methods (lines 525-538).

      • Concerning the validation of the association found in chromosome 2 (line 169 and onward): the two approaches followed cannot be considered independent validations; wouldn't using independent accessions, or an independent population (generated by the cross between two parental lines, showing contrasting phenotypes, for example) have been more convincing?

      We aim to compare the hypothesis that the association is due to a causal locus to the null hypothesis that the observed association is a fluke due to, for example, the small number of lines showing necrosis. If this null hypothesis is true then we would not expect to see the association if we run the experiment again using the same lines. An alternative hypothesis is that the genotype at the QTL and disease phenotypes are not directly causally linked, but are both correlated with some other factor, such as another QTL, or maternal effects. We agree that an independent sample would be required to exclude the latter hypothesis, but argue that the former is the more pertinent. We have edited the text to be explicit about the hypothesis we are testing, and altered the language to shift the focus from ‘validation’ to ‘confirming the robustness’ of the association (line 182).

      • Regarding the identification of the transposon element in the genomic region of AT2G14080: is the complementation of the knock-out mutant with the two alleles (presence/absence of the transposon) possible to confirm its potential role in the observed phenotype?

      This could be feasible but we cannot do it as none of the researchers can continue this project.

      • On the comparison between naïve and evolved viral strains: is the evolved TuMV more virulent in those accessions closer to Col-0?

      This is not something we have looked at but would certainly be an interesting follow-up investigation.

      • The Copia-element polymorphism is identified in an intron; the potential functional consequences of this insertion could be discussed. In the example the authors provide, the transposable element is inserted into the protein-coding sequence instead.

      We now state explicitly that such insertions are expected to influence expression; beyond that we can only speculate. We have removed the reference to the insertion in the coding sequence.

      • The authors state in line 398 that "susceptibility is unquestionably deleterious" - is this really the case? Are the authors considering susceptibility as the capacity to be infected, or to develop symptoms? Viral infections in nature are frequently asymptomatic, and plant viruses can confer tolerance to other stresses.

      We have toned down the expression and clarified our wording: “Given that potyvirus outbreaks are common in nature (Pagán et al., 2010) and susceptibility to symptomatic infection can be deleterious”

      Additional minor comments:

      • In Table 1, Wu et al., 2018 should refer to DRP2A and 2B, not 3B.

      We have removed Table 1 altogether.

      • Line 126: a 23% increase in symptom severity is mentioned, but how is this calculated, considering that severity is measured in four different categories?

      This is the change in mean severity of symptoms between the two categories.

      • Figure 1F: "...symptoms"

      Fixed.

      • Line 179: "...suggesting an antiviral role..."

      Changed.

      • Lines 288-300: This paragraph does not fit into the narrative and could be omitted.

      It has been removed and some of the info moved to the last paragraph of the Intro, when the two TuMV variants were presented.

      • Lines 335-337: The rationale here is unclear since DRP2B will also be in the background - wouldn't DRP2B and 3B be functionally redundant in the viral infection?

      Our results suggest that DRP3B is redundant with DRP2B for the ancestral virus but not for the evolved viral strain. We speculate that the evolved viral isolate may have acquired the capacity to recruit DRP3B for its replication and hence produces fewer symptoms when the plant protein is missing.

      We have spotted a mistake that may have added to the confusion. Originally the text said “In contrast, loss of function of DRP3B decreased symptoms relative to those in Col-0 in response to the ancestral, but not the evolved virus”. The correct statement is “In contrast, loss of function of DRP3B decreased symptoms relative to those in Col-0 in response to the evolved, but not the ancestral virus.”

      Reviewer #2 (Public Review):

      The manuscript presents a valuable investigation of genetic associations related to plant resistance against the turnip mosaic virus (TuMV) using Arabidopsis thaliana as a model. The study infects over 1,000 A. thaliana inbred lines with both ancestral and evolved TuMV and assesses four disease-related traits: infectivity, disease progress, symptom severity, and necrosis. The findings reveal that plants infected with the evolved TuMV strain generally exhibited more severe disease symptoms than those infected with the ancestral strain. However, there was considerable variation among plant lines, highlighting the complexity of plant-virus interactions.

      A major genetic locus on chromosome 2 was identified, strongly associated with symptom severity and necrosis. This region contained several candidate genes involved in plant defense against viruses. The study also identified additional genetic loci associated with necrosis, some common to both viral isolates and others specific to individual isolates. Structural variations, including transposable element insertions, were observed in the genomic region linked to disease traits.

      Surprisingly, the minor allele associated with increased disease symptoms was geographically widespread among the studied plant lines, contrary to typical expectations of natural selection limiting the spread of deleterious alleles. Overall, this research provides valuable insights into the genetic basis of plant responses to TuMV, highlighting the complexity of these interactions and suggesting potential avenues for improving crop resilience against viral infections.

      Overall, the manuscript is well-written, and the data are generally high-quality. The study is generally well-executed and contributes to our understanding of plant-virus interactions. I suggest that the authors consider the following points in future versions of this manuscript:

      1. Major allele and minor allele definition: When these two concepts are mentioned in the figure, there is no clear definition of the two words in the text. Especially for major alleles, there is no clear definition in the whole text. It is recommended that the author further elaborate on these two concepts so that readers can more easily understand the text and figures.

      We agree that the distinction between major/minor alleles and major/minor associations in our previous manuscript may have been confusing. In the current manuscript we now define the minor allele at a locus as the less-common allele in the population (line 167). We have removed references to major/minor associations, and instead refer to strong/weak associations.

      1. Possible confusion caused by three words (Major focus / Major association and major allele): Because there is no explanation of the major allele in the text, it may cause readers to be confused with these two places in the text when trying to interpret the meaning of major allele: major locus (line 149)/ the major association with disease phenotypes (line 183).

      See our response to the previous comment.

      1. Discussion: The authors could provide a more detailed discussion of how the research findings might inform crop protection strategies or breeding programs.

      We would prefer to refrain from speculating about future applications in breeding programs.

      (Recommendations For The Authors):

      1. Stacked bar chart for the Fig 1F. It is recommended that the author use the form of a stacked bar chart to display the results of Fig 1F. On the one hand, it can fit in with the format of Fig 1D/E/G, on the other hand, it can also display the content more clearly.

      We think the results are easier to interpret without the stacked bar chart.

      1. Language Clarity: While there are no apparent spelling errors, some sentences could be rewritten for greater clarity, especially when explaining the results in Figure 1 and Figure 2.

      We have reviewed these sections and attempted to improve clarity where that seemed appropriate.

      There are some possibilities to explore in the future. For example: clarity of mechanisms for the future. While the study identifies genetic associations, it lacks an in-depth exploration of the underlying molecular mechanisms. Elaborating on the mechanistic aspects would enhance the scientific rigor and practical applicability of the findings.

      Yes, digging into the molecular mechanisms is an ongoing task and will be published elsewhere. It was out of the scope of this already dense manuscript.  

      Reviewer #3 (Public Review):

      Summary of Work

      This paper conducts the largest GWAS study of A. thaliana in response to a viral infection. The paper identifies a 1.5 MB region in the chromosome associated with disease, including SNPs, structural variation, and transposon insertions. Studies further validate the association experimentally with a separate experimental infection procedure with several lines and specific T-DNA mutants. Finally, the paper presents a geographic analysis of the minor disease allele and the major association. The major take-home message of the paper is that structural variants and not only SNPs are important changes associated with disease susceptibility. The manuscript also makes a strong case for negative frequency-dependent selection maintaining a disease susceptibility locus at low frequency.

      Strengths and Weaknesses

      A major strength of this manuscript is the large sample sizes, careful experimental design, and rigor in the follow-up experiments. For instance, mentioning non-infected controls and using methods to determine if geographic locus associations were due to chance. The strong result of a GWAS-detected locus is impressive given the complex interaction between plant genotypes and strains noted in the results. In addition to the follow-up experiments, the geographic analysis added important context and broadened the scope of the study beyond typical lab-based GWAS studies. I find very few weaknesses in this manuscript.

      Support of Conclusions

      The support for the conclusions is exceptional. This is due to the massive amount of evidence for each statement and also due to the careful consideration of alternative explanations for the data.

      Significance of Work

      This manuscript will be of great significance in plant disease research, both for its findings and its experimental approach. The study has very important implications for genetic associations with disease beyond plants.

      (Recommendations For The Authors):

      Line 41 - Rephrase, not clear "being the magnitude and sign of the difference dependent on the degree of adaptation of the viral isolate to A. thaliana."

      Now it reads: “When inoculated with TuMV, loss-of-function mutant plants of this gene exhibited different symptoms than wild-type plants, where the scale of the difference and the direction of change between the symptomatology of mutant and wild-type plants depends on the degree of adaptation of the viral isolate to A. thaliana.”

      Line 236 - typo should read: "and 21-fold"

      Changed.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer 1:

      I would suggest that the authors focus on what I think is the main goal of the work, namely, to consider the whole cell contour when characterizing cell shape instead of only some points on the contour. A reference to the connection with Minkowski tensors and the biologically relevant mathematical consequences of this connection would suffice; a detailed definition of the Minkowski tensors does not seem to be necessary. Especially because you do not really use them. You could use the analysis of the simulation data to explain what the γ<sub>p</sub> miss and for which statements they would be sufficient.

      We argue that the explanation of Minkowski tensors is helpful and should remain in the Methods and materials section. There are two reasons: First, our argumentation relies on the robustness and stability properties of Minkowski tensors. Introducing q<sub>p</sub> without the connection to Minkowski tensors would not allow us to make these statements. Second, Minkowski tensors seem not to be well known in the community; otherwise measures like γ<sub>p</sub> would not have been introduced. Furthermore, readers not interested in the technical details could skip this part of the manuscript and go directly to the Results section. Concerning the questions of what the γ<sub>p</sub> miss and for which statements they would be sufficient, the answer from a purely mathematical point of view is rather simple: As γ<sub>p</sub> does not share these robustness and stability properties, it should not be used in any case! The provided results on computational and experimental data demonstrate the consequences of using such measures. In the case of the proposed nematic-hexatic transition in Armengol-Collado et al. (2023) the consequence is severe, as this transition is specific only to the method used and not to the underlying physics. A second aspect, which we now highlight further, is the influence of approximating a cell by a polygon. We demonstrate that this approximation is responsible for a strong hexatic order on the cellular scale in the considered MDCK data from Armengol-Collado et al. (2023).

      It is not clear to me what we should learn about the two tissue models by using q<sub>2</sub> and q<sub>6</sub> to quantify cell shape. Can you clearly formulate one or more conclusions?

      What we can learn from the research is how q<sub>p</sub> depends on the model parameters in the two tissue models:

      increases with higher activity or deformability

      decreases with higher activity or deformability.

      Furthermore, q<sub>2</sub> and q<sub>6</sub> are independent and describe distinct properties. Using these models as a basis to coarse-grain and derive continuous models on the tissue scale, these results indicate that more general p-atic liquid crystal theories should be used and the simplest nematic liquid crystal theories might not be sufficient.

      The experimental data and their analysis does not seem to add anything to the work. Do you report only data from independent measurements, or did you consider all images of a monolayer?

      We now also analyze experimental data from Armengol-Collado et al. (2023), which confirm our findings on the independence of q<sub>2</sub> and q<sub>6</sub> and also confirm that the proposed nematic-hexatic transition is specific to the use of γ<sub>p</sub> for characterizing the shape. Additional experimental data are therefore indeed no longer needed. Consequently, we skip the detailed analysis of these data and keep only the results in Fig 1 and Fig 2 and the corresponding figures in the appendix as illustrative examples.

      L13: ”P-atic liquid crystal theories offer new perspectives on how cells self-organize (...)” This is a difficult entry, because the average reader of eLife might not be familiar with p-atic liquid crystals.

      We agree that p-atic liquid crystals might not be familiar to the average reader. For this reason we introduce orientational order in the introduction with examples demonstrating that not only nematic, but also tetratic and hexatic order have been identified in tissue and introduce the different symmetries. Furthermore, we provide examples for p-atic liquid crystals from other fields and various references. In the conclusion, we also cite models for p-atic liquid crystal theories. Even if the average reader is not familiar with these theories, it should become evident that nematic order might not be sufficient to describe tissue as other symmetries are present as well.

      L32: ”nematic” needs to be introduced.

      Nematic order is already explained as rotational order with 180° symmetry. The references cited discuss nematic liquid crystals in the context of morphological changes in tissue. We therefore only added a standard textbook as a reference for liquid crystal theories and refrain from introducing it in more detail in the manuscript.

      Figure 1: Why do you show the data for q<sub>3</sub>, q<sub>4</sub>, and q<sub>5</sub>, which you do not really consider in this manuscript? Same for Figure 2. Why not combine the two figures? Furthermore, you show q<sub>p</sub> without having defined them yet.

      We consider all p = 2,3,4,5,6, but focus on p = 2,6 in the main text and p = 3,4,5 in the appendix. Figures 1 and 2 essentially only introduce the subject and help to relate p-atic order to cell shapes and introduce the methodology to analyze the data. Our conclusion is that all p can be important and should be considered in continuous descriptions of tissue.

      Equation 1: The notation is confusing: the domain of integration (C or ∂C) also appears as the variable you integrate.

      The equation is correct. The variable of integration is 1 or H and the domain of integration is C (cell) or ∂C (cell contour).

      L68: ”a snapshot of the considered monolayer of wild-type MDCK cells”. Did you analyse only one monolayer? Please, provide information about the number of monolayers that were imaged and how many cell shapes were analyzed.

      We have analyzed one monolayer and have added the missing information.

      L86: ”field-specific prefactors” I do not understand what is meant by these.

      Different communities (e.g. physics, mathematics, cosmology) use different prefactors in the definition. We have removed this statement.

      L89: ”Hadwiger’s characterization theorem”. What is this?

      This mathematical result is important for claiming robustness and stability; it can be found in the cited reference.

      L104: ”the essential property is the continuity”. Essential for what?

      Essential ”for our purpose” to characterize the shape of cells by a robust method.

      L120: ”the theory also guarantees robust description of p-atic orientation for p = 3,4,5,6,...” I do not understand what you mean.

      The previous examples only consider p = 2. However, the cited theoretical results also hold for p = 3,4,5,6,...

      Equations (5) and (6): You define ψ<sub>p</sub>(C) twice. Are the definitions equivalent? Why do you need both?

      This is not a different definition; equation (6) is a reformulation which is more useful for our purpose. But we indeed define ϑ<sub>p</sub> twice. We now use a new symbol to distinguish ϑ<sub>p</sub> in Equations 7 and 9.

      Figure 4: ”The visualization uses rotationally-symmetric direction fields (known as p-RoSy fields in computer graphics (Vaxman et al., 2016)).” I guess that you have used these fields already in Figure 1, so why introduce them only now?

      We have moved this comment to Figure 1.

      Figure 6: Using a few discrete values cannot illustrate continuity. Also, the ”jump” in γ<sub>p</sub> results from deleting a vertex, so I doubt that this is a fair comparison. Still, I think that it is important to point out to the reader that the value γ<sub>p</sub> depends on the number of vertices (here, I allow that two edges connected by a vertex are aligned).

      We adjusted the caption to make our point clearer. The last image is a triangle and, according to the definition of γ<sub>p</sub>, is therefore described by only three vertices. So, it is indeed a fair comparison. The reviewer is right that the value of γ<sub>p</sub> depends strongly on the number of vertices used; this is exactly the point that we are trying to make with this figure. Also, adding vertices artificially to make γ<sub>p</sub> continuous leads to more problems, as the values for γ<sub>p</sub> change if we change the number of vertices. But an equilateral triangle should be recognized as an equilateral triangle, no matter if there is an artificial fourth vertex or not. The triangle in our picture and the triangle that the reviewer mentioned (so our triangle with an artificial fourth vertex) both have the shape of an equilateral triangle, yet for one it is |γ<sub>3</sub>| = 1.0 and for the other one it is |γ<sub>3</sub>| = 0.935.

      While we agree with the reviewer's statement about continuity, we did not modify the sentence, as the meaning should be clear.
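
      The dependence of γ<sub>p</sub> on the number of vertices can be checked numerically. The sketch below assumes the vertex-based form γ<sub>p</sub> = Σ<sub>v</sub> |r<sub>v</sub>|<sup>p</sup> e<sup>ipθ<sub>v</sub></sup> / Σ<sub>v</sub> |r<sub>v</sub>|<sup>p</sup>, with r<sub>v</sub> measured from the mean of the vertex positions; this formula is not restated in this response and is an assumption here. The function name `gamma_p` and the placement of the artificial fourth vertex at the midpoint of one edge are also hypothetical choices.

```python
import numpy as np


def gamma_p(vertices, p):
    """Vertex-based shape function (assumed form, see the text above).

    gamma_p = sum_v |r_v|^p exp(i p theta_v) / sum_v |r_v|^p,
    where r_v is the vertex position relative to the mean of all vertices.
    """
    v = np.asarray(vertices, dtype=float)
    r = v - v.mean(axis=0)        # positions relative to the vertex centroid
    z = r[:, 0] + 1j * r[:, 1]    # each r_v as a complex number
    w = np.abs(z) ** p            # weights |r_v|^p
    return np.sum(w * np.exp(1j * p * np.angle(z))) / np.sum(w)


s = np.sqrt(3) / 2
triangle = [(0.0, 1.0), (-s, -0.5), (s, -0.5)]  # equilateral triangle
with_extra = triangle + [(0.0, -0.5)]           # same shape, extra vertex on the bottom edge

print(abs(gamma_p(triangle, 3)))    # 1 up to rounding
print(abs(gamma_p(with_extra, 3)))  # about 0.935
```

      Under these assumptions the extra vertex shifts the vertex centroid and changes the weights, so the same geometric shape yields a different value of |γ<sub>3</sub>|.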

      L160: The definition of the center of mass is incorrect as it is not that of an extended object whose contour is defined by a polygon, but only of the set of vertices. In Figure 6 you write ”the choice of the center of mass highly influences the value of γ<sub>p</sub>” - is there really a choice of the center of mass? I thought that it was uniquely defined.

      We here only repeat the definition from Armengol-Collado et al. (2023) in order to be able to directly compare our analyses with the results presented therein. We adjusted the caption to be more clear.

      L166: What is the weighting you refer to in Equation 9?

      We apologize, the reference is to Equation 8. We have modified this.

      L312: ”Quantifying orientational order in biological tissues can be realized by Minkowsky tensors”. As mentioned above, you do not really use them, but use Equation (7), which can be defined without reference to Minkowski tensors.

      Eq. (7) is part of the irreducible representations of the Minkowski tensors. Therefore the sentence is correct.

      L318: I do not quite understand the link between being able (or not) to compare q<sub>p</sub>’s for different values of p and the interpretability of q<sub>2</sub> and q<sub>6</sub>. Also, since you introduce q<sub>p</sub>, how can the question about their comparability be a recurrent challenge? Finally, would you agree that even though a comparison between the absolute values of q<sub>2</sub> and q<sub>6</sub> is inappropriate, one can still meaningfully compare relative changes as a parameter is changed or when comparing cells in different conditions?

      We have modified the sentence. Furthermore, we agree that one can still meaningfully compare relative changes as a parameter is changed, as we do. However, our claim that q<sub>2</sub> and q<sub>6</sub> are independent does not allow one to conclude any kind of nematic-hexatic phase transition. We have now provided further evidence using the published data of Armengol-Collado et al. (2023), which unequivocally supports this statement. We would also like to remark that the detection of a phase transition requires a single order parameter, which cannot exist as q<sub>2</sub> and q<sub>6</sub> are independent.

      We have further explained this in the main text.

      Figure 7: The axes are not labeled.

      We added the labels.

      L359: ”q<sub>2</sub> and q<sub>6</sub> values cluster tightly”, L362 ”q<sub>2</sub> and q<sub>6</sub> values become highly scattered” Please, quantify.

      We kept these formulations but have added statistical measures to these qualitative descriptions, see Supplementary Figures to Fig 7 for the distance correlation and the P-values of the distance correlation. These data support our claim of independence.

      L362: ”each q<sub>2</sub> value spans a broad range of q<sub>6</sub> values and vice versa, demonstrating their independence”. Please, use a quantitative test of statistical independence.

      We have added statistical information by using the distance correlation and statistical tests, see Supplementary Figures to Fig 7. Similar results are obtained for the Pearson correlation and corresponding tests. However, they are not included as the distance correlation is more general.
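
      For readers unfamiliar with the distance correlation, a minimal sketch of the statistic and a permutation test is given below. It is not the analysis code used for the supplementary figures: the function names, the number of permutations, and the synthetic input data are assumptions made for illustration only.

```python
import numpy as np


def _centered_distances(x):
    """Pairwise distance matrix of a 1D sample, double-centered."""
    d = np.abs(x[:, None] - x[None, :])
    return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation of two 1D samples of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = max((a * b).mean(), 0.0)         # squared distance covariance
    dvar2 = (a * a).mean() * (b * b).mean()  # product of squared distance variances
    return 0.0 if dvar2 == 0 else float(np.sqrt(dcov2 / np.sqrt(dvar2)))


def permutation_p_value(x, y, n_perm=199, seed=0):
    """P-value for the null hypothesis of independence, by permuting y."""
    rng = np.random.default_rng(seed)
    observed = distance_correlation(x, y)
    null = [distance_correlation(x, rng.permutation(y)) for _ in range(n_perm)]
    return (1 + sum(v >= observed for v in null)) / (1 + n_perm)


# Synthetic stand-ins for per-cell q_2 and q_6 values (not data from the manuscript).
rng = np.random.default_rng(1)
q2 = rng.random(200)
q6 = rng.random(200)
print(distance_correlation(q2, q6), permutation_p_value(q2, q6))
```

      Unlike the Pearson correlation, the population distance correlation is zero only if the two variables are independent, which is why it is the more general of the two measures.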

      L371: Please, define Q<sub>2</sub> and Q<sub>6</sub> in the main text.

      We have now added the definition to the Materials and methods section.

      L420: A reference seems to be missing.

      Thanks for pointing this out. This was a formatting error; we only wanted to cite Balasubramaniam et al. (2021).

      L425: ”strong dependence of cell shape on cell density”. But q<sub>6</sub> seems to be rather independent of density, see Figure 11. Also, what do you mean by ”strong”? Can you quantify?

      The dependence of the cell shape on the cell density is shown in detail in Eckert et al. (2023). Furthermore, the values for all p are needed to describe the cell shape. The change in q<sub>2</sub> therefore already indicates a change in the overall cell shape, even though q<sub>6</sub> barely changes. As we have now excluded these experimental results in favor of the experimental data also used in Armengol-Collado et al. (2023), we did not add further evaluations regarding cell density.

      L453 ”These divergences [nonmonotonic dependence of γ<sub>p</sub> on activity or deformability] highlight the limitations of γ<sub>p</sub> in capturing consistent patterns”. I am not sure to follow your argument here.

      Besides the quantitative differences seen when comparing Fig. 1 and Fig. 2 with the corresponding figures in the appendix, these results show qualitative differences. Using a method that is neither robust nor continuous leads to qualitatively different results. The nonmonotonic dependence of γ<sub>p</sub> is specific to the method, not to the underlying physics.

      Appendix 3 - Figure 20: It is not clear how to compare this figure to Figure 3e of Armengol-Collado et al 2023. Please, provide more details.

      Appendix 3 - Figure 20 (Appendix 3 - Figure 25 in the revised version) and Figure 3e in Armengol-Collado et al. (2023) cannot be directly compared. Fig. 3e shows results of experiments and multiphase field simulations for one parameter setting, whereas Fig. 20 shows results of the active vertex model for various parameter settings. Both, however, are analyzed using γ<sub>p</sub> and Γ<sub>p</sub>. We have added this computation, see Fig. 13, which indeed reproduces the results from Fig. 3e. We refrain from producing plots corresponding to Fig. 20 for the multiphase field model, as this first requires computing the vertices and no additional information can be expected.

      Reviewer 2:

      The manuscript lacks statistical information. The following should be addressed: How often have the experiments been performed? How many monolayers have been analyzed? How many time steps have been considered and in what duration? How many cells have been included in the analysis? What are the p-values to determine if q<sub>p</sub>’s (Figure 2, panel a) and γ<sub>p</sub>’s (Appendix 3-Figure 17, panel a) are significantly different? Same figures: How many cells and experiments have been considered here? Figure 11: What is the density of cells for each condition? Please provide the corresponding values. How significant are the differences? How many times has the experiment been repeated? Figure 12: Due to cell proliferation, the cell density changes over time. Does this need to be taken into account?

      We agree that the information we provided was only qualitative, and we have added the missing information. In particular, we added statistical information by using the distance correlation and statistical tests, see Supplementary Figures to Fig. 7. Similar results are obtained for the Pearson correlation and corresponding tests (not included). As we excluded the experimental results previously shown in Figure 11 and Figure 12 from the revised version in favor of the experimental data already published in Armengol-Collado et al. (2023), we did not add further statistics for them. We added the number of frames and cells in the text.

      The image analysis part of the Method section states that time-series were xy-drift corrected, and cells were tracked. However, the manuscript does not contain results of dynamical data, time-dependent analyses, or discussions of how q<sub>p</sub> changes over time. The authors mention that the fluidity of the tissue was confirmed by the MSD, neighbor number variance, and the self-intermediate scattering function, but none of the results are shown in the manuscript. I would like to ask the authors to provide the results and related content in the Method section.

      We have modified the description and removed all parts related to dynamical data. Because the manuscript already contains a large number of images, we refrain from providing all the results for the phase diagram that distinguishes the solid and fluid phases. These measures have been provided previously for the considered modeling approaches and serve here only as a side remark. Our results do not depend on an exact localization of a solid-fluid phase boundary.

      Additional information is missing in the Image analysis part of the Method section. Could the authors provide the information on the image analysis steps between obtaining the segmented image and inputting the parameters for the Minkowski tensor? This should include how the normal vectors have been determined and whether this has been done for all pixels along the contour.

      We added further details in the section Extraction of the contour in Experimental setup in Methods and Materials and also provide the code to compute q<sub>p</sub> for segmented images.

      The authors have analyzed low-resolution phase contrast images acquired with a 10x objective to experimentally support their introduced Minkowski tensors. This may have decreased the resolution of the cell boundary detection and its curvature. I strongly suggest imaging the tissue with higher magnification (40x or 63x) and/or fluorescent markers to visualize the cell boundaries in high quality. This would allow the authors to distinguish between circles and circle-like shapes (lines 432-434) and to further investigate differences between MDCK wild-type and MDCK E-cad KO cells.

      We agree that higher resolution of the images would be beneficial. However, we are convinced that this will not influence our findings. Instead of performing the experiments with higher magnification or using fluorescent markers, we have considered the experimental data from Armengol-Collado et al. (2023) to support our results.

      The authors have coarse-grained the shape function, Γ<sub>p</sub>, and have chosen the active vertex model (Appendix 3-Figure 20) for comparison with the Minkowski tensors, Q<sub>p</sub> (Appendix 2 Figure 13). In both figures, the hexatic-nematic crossover does not occur. Armengol-Collado et al. have previously reported that the Voronoi model failed to achieve the hexatic-nematic crossover and argued that this is due to the artificial enhancement of the polygon’s hexagonality, leading to high hexatic order at the tissue scale. Since the authors have used the Voronoi-tailing method (line 196), I would like to ask the authors to compare the multiphase field models for Γ<sub>p</sub> and Q<sub>p</sub> instead.

      We would like to mention that we do not consider a Voronoi model but an active vertex model. A Voronoi model is only used for initialization. The two models are certainly related but not identical, and claims made for a Voronoi model do not need to hold for an active vertex model. The suggested comparison for the multiphase field model is not an easy task, as it requires computing the vertices from the phase field variables. There are gaps between cells, and a reliable algorithm to identify the vertices is a task on its own. We therefore refrain from doing these calculations. Instead, we have used the experimental data from Armengol-Collado et al. (2023), for which the polygonal information is provided, see Figure 11. Especially for p = 6, strong differences can be seen when comparing the PDF obtained from the full shape with that obtained from the polygonal shape. Indeed, the strong hexatic order at the cellular scale is only a consequence of the approximation by polygons. Given this result, analysing the multiphase field data with γ<sub>p</sub> does not add any new information, as this first requires an approximation by polygons.

      The authors show the q<sub>p</sub> distributions for the experimental systems (Figure 2, Figure 11). For completeness, I would like to ask the authors to also coarse-grain q<sub>p</sub> and γ<sub>p</sub> of the experimental data as shown for the computational models in Appendix 2 - Figure 13 and Appendix 2 - Figure 14. It would be interesting to see if the hexatic-nematic crossover appears. I would recommend that the authors avoid using the Voronoi tailing of the experimental system, as this may fail to obtain the crossover as explained in (5) above. Instead, I suggest using the real vertex positions for γ<sub>p</sub>, which can be obtained from the segmented images.

      It remains open what is meant by ”the real vertex positions for γ<sub>p</sub>, which can be obtained from the segmented images”. Segmenting the images leads to smooth contours, partly even with gaps between cells. As the magnitude of γ<sub>p</sub> depends on the number of points used in the calculation, it is not meaningful to use all points of the contour for calculating γ<sub>p</sub>, as this would lead to artificially low values for |γ<sub>p</sub>|. Identifying the vertex positions for an approximating polygon is an issue of its own, and the consequence of this approximation is already mentioned above. For comparison, we therefore added the experimental data from Armengol-Collado et al. (2023). We used the provided vertex positions to compute q<sub>p</sub> and γ<sub>p</sub>, and we also segmented the raw data and used the resulting contours to compute q<sub>p</sub>. See Figure 11. These results confirm our findings and show that the proposed nematic-hexatic phase transition is specific to the use of γ<sub>p</sub> to characterize shape.

      In order to show that shape descriptors like the shape function, γ<sub>p</sub>, introduced by Armengol-Collado et al., ’fail to capture the nuance of irregular shapes’ (line 445), the authors have compared γ<sub>p</sub> with the Minkowski tensors, q<sub>p</sub>, using the same dataset (Figure 1 with Appendix 3 - Figure 16, Figure 2 with Appendix 3 - Figure 17, and Figure 4 with Appendix 3 - Figure 15 Appendix 3). I agree that γ<sub>p</sub> and q<sub>p</sub> are different, not showing identical values. However, I see no evidence in these figures that q<sub>p</sub> describes the symmetry of a cell better than γ<sub>p</sub>, since the values are similar and vary quite similarly between different p-atic orders. What is the quantitative difference that shows the failure of the shape function to capture the nuance of irregular shapes?

      The statement already follows from the mathematical properties of robustness and stability, which are illustrated in Fig. 6. The mentioned comparisons for simulation and experimental data only demonstrate that the lack of robustness and stability of γ<sub>p</sub> also leads to different results if applied to averages of cell measures. The differences are twofold: first, the approximation of cells by polygons leads to different results, and second, even for polygons different results follow, as only one of the two approaches is continuous. This has strong consequences for the proposed nematic-hexatic phase transition if coarse-grained. Our added results for the experimental data from Armengol-Collado et al. (2023) show that this behavior is not a physical feature but only specific to the use of γ<sub>p</sub>.

      The authors claim that the Minkowski tensors provide a ’reliable framework’ and that this framework ’opens new pathways for understanding the role of orientational symmetries in tissue mechanics and development’ (line 78-79). However, the p-atic orders in the experimental systems peak at very low orders of q<sub>p</sub> < 0.3, which may not allow conclusions about (non-)dominant orientational symmetry(ies) of cells. Can this framework be applied to experimental systems? Since the Minkowski tensors display the independence of the hexatic and nematic symmetry, the variations of cell shapes in experimental systems are too strong to provide any additional results (line 437), as stated by the authors, and no crossover was found, while the crossover was reported by Armengol-Collado et al., what new pathways can be opened to study tissues?

      We have added a comparison with experimental data from Armengol-Collado et al. (2023) and demonstrate that the proposed nematic-hexatic transition is only specific to the use of γ<sub>p</sub> for characterizing the shape. So our results first of all essentially close the ”pathway for understanding the role of orientational symmetries in tissue mechanics and development” that was proposed on the basis of this nematic-hexatic transition. On the other hand, even if q<sub>p</sub> peaks at relatively low values, the results demonstrate independence of the measures for different p’s, for two different modeling approaches and two different sets of experimental data. This motivates considering p-atic order for different p simultaneously. Such theories of ”multi”-p-atic liquid crystals, as proposed in the conclusions, are the mentioned new pathways.

      In principle, the introduced Minkowski tensors integrate the orientation of the normal vectors (Equation 6) and consider the perimeter of the contour (Equation 1). Do the tensors distinguish between convex and concave curvature since both are present in tissues? Does a square with 4 concave and a square with 4 convex edges (same curvature) have the same q<sub>p</sub> values?

      For the specific situation of a square with 4 concave or 4 convex edges even p would lead to the same orientation and the same value for q<sub>p</sub>, as even p have a 180 degree symmetry. Odd p would result in the same value for q<sub>p</sub> but in a different orientation ϑ<sub>p</sub>. In more general cases, e.g. shapes with concave and convex edges, no general statements can be made. In general the theoretical results on stability of q<sub>p</sub> only hold for convex shapes. However, as discussed in Methods and materials the known counterexamples for concave shapes are not relevant for cell shapes.

      In lines 169-172 and Figure 6, the authors report a jump in γ<sub>p</sub>. Why has the fourth vertex in the last image been removed? The vertices are essential for the calculation of γ<sub>p</sub>. If the fourth vertex is not removed, the following values result: γ<sub>3</sub> = 0.935 and γ<sub>4</sub> = 0.474, which leads to changes of the same order of magnitude as those of q<sub>p</sub>. I think it is therefore not the choice of the center of mass that ’heavily influences the value of γ<sub>p</sub>’, but the removal of the fourth vertex.

      We adjusted the caption to make our point more clear. The last image is a triangle and according to the definition of γ<sub>p</sub> is therefore described by only three vertices. The reviewer is right that the value of γ<sub>p</sub> has a strong dependency of the number of used vertices, this is exactly the point that we are trying to make with this figure. An equilateral triangle should be recognized as an equilateral triangle, no matter if there is an artificial fourth vertex or not. The triangle in our picture and the triangle that the reviewer described (so our triangle with an artificial fourth vertex) both have the shape of an equilateral triangle, yet for one |γ<sub>3</sub>| = 1.0 and for the other one it is |γ<sub>3</sub>| = 0.935. This can be seen even more clearly if even more artificial vertices on the outline of the equilateral triangle are added, which will decrease |γ<sub>3</sub>| even more. Furthermore, we think there was a misunderstanding regarding our statement about the center of mass. The general problem of γ<sub>p</sub> - so the dependence of the values on the number of vertices - is independent of the calculation of the center of mass. The exact values of γ<sub>p</sub> on the other hand depend on the choice of this. We follow Armengol-Collado et al. (2023) and use the mean of all vertex coordinates as center of mass. If the reviewer would use the center of mass of the equilateral triangle and do the same calculations the resulting values for γ<sub>p</sub> would be different. This is what we meant with ’heavily influences the value of γ<sub>p</sub>’.
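
      The dependence of |γ<sub>3</sub>| on the number of vertices can be checked numerically. The sketch below is an illustration under stated assumptions: it takes the shape function in the form γ<sub>p</sub> = Σ<sub>v</sub> |r<sub>v</sub>|<sup>p</sup> e<sup>ipφ<sub>v</sub></sup> / Σ<sub>v</sub> |r<sub>v</sub>|<sup>p</sup>, with vertex positions measured from the mean of all vertex coordinates (the convention attributed above to Armengol-Collado et al. (2023)), and it assumes that the artificial fourth vertex sits at the midpoint of one edge. The function name and the vertex coordinates are illustrative choices.

```python
import numpy as np


def shape_function(vertices, p):
    """Complex shape function gamma_p of a polygon given by its vertices.

    Assumption: vertex positions are taken relative to the mean of the
    vertex coordinates, and each vertex is weighted by |r|**p.
    """
    v = np.asarray(vertices, dtype=float)
    r = v - v.mean(axis=0)
    radius = np.hypot(r[:, 0], r[:, 1])
    phi = np.arctan2(r[:, 1], r[:, 0])
    weights = radius ** p
    return np.sum(weights * np.exp(1j * p * phi)) / np.sum(weights)


h = np.sqrt(3.0) / 2.0
# Equilateral triangle with circumradius 1.
triangle = [(0.0, 1.0), (-h, -0.5), (h, -0.5)]
# Same outline, one artificial vertex at the midpoint of the bottom edge.
one_extra = triangle + [(0.0, -0.5)]
# Same outline, artificial vertices at the midpoints of all three edges.
three_extra = triangle + [(0.0, -0.5), (h / 2.0, 0.25), (-h / 2.0, 0.25)]

for name, shape in [("triangle", triangle),
                    ("one extra vertex", one_extra),
                    ("three extra vertices", three_extra)]:
    print(name, abs(shape_function(shape, 3)), abs(shape_function(shape, 4)))
# |gamma_3| is 1 for the plain triangle, about 0.935 with one extra vertex,
# and about 0.778 with three extra vertices, although the outline is unchanged.
```

      Under these assumptions the one-extra-vertex case also gives |γ<sub>4</sub>| ≈ 0.474, matching the values quoted by the reviewer.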

      In Appendix 3 - Figure 18, the authors show that the shape function, γ<sub>6</sub>, exhibits a non-monotonic trend as a function of activity and deformability. I have no objection to this statement. However, I would like to ask the authors to check the values for γ<sub>6</sub>. In the bottom-left corner, for example, γ<sub>6</sub> = 0.55. This value seems very low to me. In Appendix 3-Figure 20, |Q<sub>6</sub>| for R/Rcell = 2 is already in this range, while |Q<sub>6</sub>| for R/Rcell = 1 (not shown), corresponding to γ<sub>6</sub>, must be even higher. Also, the parameters p<sub>6</sub> = 3.5 and v<sub>0</sub> = 0.1 should result in a nearly hexagonal lattice, which should be captured with high γ<sub>6</sub> values. I would expect γ<sub>6</sub> to be in the same range as q<sub>6</sub>.

      Many thanks for pointing this out. Two different points are addressed in this question. The first is whether |Γ<sub>p</sub>| is too high. We checked the values: |Γ<sub>p</sub>| = 0.5075 for R/R<sub>cell</sub> = 2, so it is lower than |γ<sub>6</sub>| = 0.58. The second is why γ<sub>p</sub> and q<sub>p</sub> are not in the same value range. You are right that for a perfectly hexagonal lattice both should give the same value, namely q<sub>6</sub> = |γ<sub>6</sub>| = 1.0. However, even at p<sub>6</sub> = 3.5 and v<sub>0</sub> = 0.1 this is no longer a perfectly hexagonal lattice, and q<sub>6</sub> and |γ<sub>6</sub>| drop at different rates as we move away from a perfect hexagon. As q<sub>p</sub> is stable and changes only slightly for slight changes in the shape, it makes sense that q<sub>p</sub> is still close to 1.0. We included an image, see below, of one time step for this parameter setting to show that the cells no longer form a perfect hexagonal lattice.

      Reviewer 3:

      Could the authors show why and how this method could bring new information which were missing so far in the understanding of morphogenesis in vitro and in vivo with the current quantification?

      The introduction provides examples of how orientational order and its topological defects can be linked to morphological changes in tissues. The orientational order emerges from the shape of the cells. Most commonly nematic order has been considered, but more recently also hexatic order and even a nematic-hexatic crossover on larger scales. This suggests a mechanical mechanism for morphogenesis, like a phase transition from hexatic to nematic, which would have consequences for the evolution of shape. We demonstrate that the measures q<sub>2</sub> and q<sub>6</sub> are independent. Furthermore, the proposed nematic-hexatic transition is only specific to the use of γ<sub>p</sub> for characterizing the shape and to the coarse-graining of the associated order. These measures are not robust and therefore should not be used. Results for the robust measures q<sub>p</sub> suggest considering all p in a coarse-grained theory that models morphological changes in tissues.

      Could authors show quantitative comparisons between available methods with the same sets of data and highlight pros and cons?

      Author response image 1.

      Screenshot from p<sub>6</sub> = 3.5 and v<sub>0</sub> = 0.1

      In addition to what was already done for the simulation data, we have added data from Armengol-Collado et al. (2023) and compared the results for q<sub>p</sub> and Q<sub>p</sub> with those for γ<sub>p</sub> and Γ<sub>p</sub>. The theoretical results and the illustrating example in Fig. 6 already show that there are no pros for γ<sub>p</sub>. Other methods belong to the class of bond-order methods and measure neighbor relations instead of shape. We already comment that these methods are inappropriate for classifying shape; see Methods and materials, last sentence, and Mickel et al. (2013) for a detailed discussion of why these methods are not robust.

      Instead of using phase contrast images, which exhibit curved cell-cell contours, could authors use data with E-cadherin staining instead - as used in many epithelial studies in vitro and in vivo? Could they show both images for wild type and for the E-cadherin KO cell lines with fluorescent readout?

      We are convinced that our results do not depend on the way to visualize the cell contours. Furthermore the images do not provide additional information. To further strengthen the experimental part of the manuscript, we instead analyzed data from Armengol-Collado et al. (2023).

      They confirm our findings.

      The authors acknowledge differences in density between cell lines p. 13 so this calls for new experiments with solid readouts and analysis using comparable experimental conditions.

      Additionally, we analyzed data from Armengol-Collado et al. (2023) which confirm our findings. Our results are now supported by two different modeling approaches and two different experimental settings. Because of redundancy we removed the original experimental data from the revised manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, the authors employed direct RNA sequencing with nanopores, enhanced by 5' end adaptor ligation, to comprehensively interrogate the human transcriptome at single-molecule and nucleotide resolution. They conclude that cellular stress induces prevalent 5' end RNA decay that is coupled to translation and ribosome occupancy. Contrary to the literature, they found that, unlike typical RNA decay models in normal conditions, stress-induced RNA decay is dependent on XRN1 but does not depend on the removal of the poly(A) tail. The findings presented are interesting but a substantial amount of work is needed to fully establish these paradigm-shifting findings.

      Strengths:

      These are paradigm-shifting observations using cutting-edge technologies.

      Weaknesses:

      The conclusions do not appear to be fully supported by the data presented.

      Our response to the reviewer comments is provided at the end of this document in the section "Recommendations For The Authors"

      Reviewer #2 (Public Review):

      In the manuscript "Full-length direct RNA sequencing uncovers stress-granule dependent RNA decay upon cellular stress", Dar, Malla, and colleagues use direct RNA sequencing on nanopores to characterize the transcriptome after arsenite and oxidative stress. They observe a population of transcripts that are shortened during stress. The authors hypothesize that this shortening is mediated by the 5'-3' exonuclease XRN1, as XRN1 knockdown results in longer transcripts. Interestingly, the authors do not observe a polyA-tail shortening, which is typically thought to precede decapping and XRN1-mediated transcript decay. Finally, the authors use G3BP1 knockout cells to demonstrate that stress granule formation is required for the observed transcript shortening.

      The manuscript contains intriguing findings of interest to the mRNA decay community. That said, it appears that the authors at times overinterpret the data they get from a handful of direct RNA sequencing experiments. To bolster some of the statements additional experiments might be desirable.

      A selection of comments:

      (1) Considering that the authors compare the effects of stress, stress granule formation, and XRN1 loss on transcriptome profiles, it would be desirable to use a single-cell system (and validated in a few more). Most of the direct RNAseq is performed in HeLa cells, but the experiments showing that stress granule formation is required come from U2OS cells, while short RNAseq data showing loss of coverage on mRNA 5'ends is reanalyzed from HEK293 cells. It may be plausible that the same pathways operate in all those cells, but it is not rigorously demonstrated.

      We agree with the reviewer that performing all experiments in a single cell system would be desirable. Presently, our core findings on 5’ RNA shortening are all obtained in HeLa cells: the identification of 5’ RNA shortening, the dependence of shortening on XRN1 (shown through XRN1 silencing), the suppression of shortening by translation inhibition, and now the relationship between 5’ shortening and deadenylation/decapping through experiments described further below. Our use of other cell lines is primarily to show that 5’ shortening is a general phenomenon, and we have now done this for U2OS cells, HEK293 cells, and primary 3T3 cells from mouse.

      Regarding stress granule formation, we are unfortunately restricted by the lack of available well-characterized resources. The ΔΔG3BP1/2 U2OS line is a well-characterized cell line that has been extensively used for stress granule-related experiments. We have therefore opted to use it and performed experiments to verify both the occurrence of stress-induced RNA shortening and the rescue in the absence of stress granules. The reproducibility and breadth of the cell lines used in our analysis make us confident in the generality of our findings.

      (2) An interesting finding of the manuscript is that polyA tail shortening is not observed prior to transcript shortening. The authors would need to demonstrate that their approach is capable of detecting shortened polyA tails. Using polyA purified RNA to look at the status of polyA tail length may not be ideal (as avidity to oligodT beads may increase with polyA tail length and therefore the authors bias themselves to longer tails anyway). At the very least, the use of positive controls would be desirable; e.g. knockdown of CCR4/NOT.

      We thank the reviewer for their comment. Previous studies, using in vitro transcribed RNA molecules, have shown that direct RNA sequencing can capture and quantify poly(A) tails of varying lengths (Krause et al. 2019). Specifically, a range of 10 to 150 nt has been tested and a high concordance between known and dRNA-Seq determined values was observed. Both tailfindR and nanopolish (used in this work) showed high poly(A) tail estimation accuracy.

      Regardless, we agree with the reviewer that our method depends on poly(A) tail capture and thus may be incomplete for fully quantifying poly(A) length changes. We therefore opted to replace these data and instead follow this and other reviewers’ suggestions and perform experiments following knockdown of CCR4/NOT using cells expressing a catalytically inactive CNOT8 (CNOT8*) dominant negative mutant (Chang et al. 2019). Our new data show that stress-induced 5’ end decay is indeed not dependent on prior removal of the poly(A) tail. Specifically, we find that transcript shortening is still observed upon oxidative stress in cells expressing CNOT8* compared to control cells. We present these new results in Fig. 3 and Sup. Fig 3. 

      (3) The authors use a strategy of ligating an adapter to 5' phosphorylated RNA (presumably the breakdown fragments) to be able to distinguish true mRNA fragments from artifacts of abortive nanopore sequencing. This is a fantastic approach to curating a clean dataset. Unfortunately, the authors don't appear to go through with discarding fragments that are not adapter-ligated (presumably to increase the depth of analysis; they do offer Figure 1e that shows similar changes in transcript length for fragments with adapter, compared to Figure 1d). It would be good to know how many reads in total had the adapter. Furthermore, it would be good to know what percentage of reads without adapters are products of abortive sequencing. What percentage of reads had 5'OH ends (could be answered by ligating a different adapter to kinase-treated transcripts). More read curation would also be desirable when building the metagene analysis - why do the authors include every 3'end of sequenced reads (their RNA purification scheme requires a polyA tail, so non-polyadenylated fragments are recovered in a non-quantitative manner and should be discarded).

      We thank the reviewer for appreciating our approach. The reviewer is correct that we do not discard reads that are not adapter-ligated. As the reviewer correctly mentions, this is to increase the sequencing depth. We have found that the ligation efficiency is very low, ~1-2% of total reads (now in Sup. Table 1), across all libraries, and so the percentage of REL5-ligated reads does not directly reflect the total amount of non-artifactual 5’ ends. Instead, we use these REL5-ligated reads as a subset of our data for which we have extremely high confidence in the true 5’ end. Our results show that non-ligated reads display the same length distribution as ligated ones, and that the results are reproducible regardless of read selection (e.g. Fig. 1c, e, Sup. Fig. 1k, l, Fig. 3b, c). This strong concordance between REL5-ligated and non-ligated reads suggests that our conclusions on 5’ end shortening are not substantially influenced by abortive sequencing or other artefactual creation of 5’ shortening. We have modified the text to clarify these points and have added plots using only ligated molecules for the relevant figures where this was not previously done (Sup. Fig. 1l, 3c).

      We agree with the reviewer that non-polyadenylated reads could be discarded from metagene analysis and we have performed this change in the revised version. Our conclusions following removal of non-polyadenylated reads remain unchanged (Sup. Fig. 1g).

      (4) The authors should come to a clear conclusion about what "transcript shortening" means. Is it exonucleolytic shortening from the 5'end? They cannot say much about the 3'ends anyway (see above). Or are we talking about endonucleolytic cuts leaving 5'P that then can be attached by XRN1 (again, what is the ratio of 5'P and 5'OH fragments; also, what is the ratio of shortened to full-length RNA)?

      We thank the reviewer for their suggestion. We have performed additional experiments to investigate the role of deadenylation and decapping by expressing dominant negative forms of the NOT8 deadenylase (NOT8*) and DCP2 decapping (DCP2*) enzyme in HeLa cells. Our results show that neither expression of NOT8* nor DCP2* can inhibit stress-induced transcript shortening following arsenite treatment (Fig. 3e-f). These new data suggest that neither deadenylation nor decapping is required for stress-induced RNA decay. Instead, our data are more compatible with endonucleolytic cleavage as the most likely mechanism for stress-induced RNA decay. We have incorporated these results in the text and present them in Fig. 3 and Sup. Fig. 3.

      (5) The authors should clearly explain how they think the transcript shortening comes about. They claim it does not need polyA shortening, but then do not explain where the XRN1 substrate comes from. Does their effect require decapping? Or endonucleolytic attacks?

      Please also refer to our answer to the previous comment (#4). Collectively, our results from a) the dominant negative expression of NOT8* and DCP2* that show no effect on stress-induced shortening and b) the rescue of transcript length upon translation initiation inhibition, indicate a potential endonucleolytic mechanism as a mediator of stress-induced RNA decay. However, we believe that extensive, further studies currently beyond the scope of this work, will be required to discover the nuclease and to dissect the exact molecular mechanisms that define the 5' ends of mRNAs upon stress-induced decay. We now discuss these points in the discussion.

      (6) XRN1 KD results in lengthened transcripts. That is not surprising as XRN1 is an exonuclease - and XRN1 does not merely rescue arsenite stress-mediated transcript shortening, but results in a dramatic transcript lengthening.

      The reviewer raises an intriguing point. Additional analysis of the data has shown that, in unstressed cells, XRN1 KD in fact leads to a modestly significant reduction in overall transcript length (Fig. 3b, c). This could be the result of an accumulation of intermediate cleavage products that would normally be degraded by XRN1, as previously described (Pelechano, Wei, and Steinmetz 2015; Ibrahim et al. 2018).

      Under stress, by contrast, we find that XRN1 KD shows a transcript length distribution almost identical to that of unstressed cells and significantly higher than that of siCTRL stressed cells (Fig. 3b, c). These results indicate that in the absence of XRN1, stress-induced decay is largely abolished. As the reviewer correctly points out, this seems to affect the majority of RNAs, which we believe is evidence of the general lack of specificity in the mechanism. Nevertheless, we find that transcripts that are the primary substrates of stress-induced shortening are substantially more lengthened than all other transcripts (Fig. 3e). This indicates that transcripts primarily affected by stress-induced decay are also lengthened the most in the absence of XRN1, and to a greater degree than expected from general XRN1 KD effects.

      Reviewer #3 (Public Review):

      The work by Dar et al. examines RNA metabolism under cellular stress, focusing on stress-granule-dependent RNA decay. It employs direct RNA sequencing with a Nanopore-based method, revealing that cellular stress induces prevalent 5' end RNA decay that is coupled to translation and ribosome occupancy but is independent of the shortening of the poly(A) tail. This decay, however, is dependent on XRN1 and enriched in the stress granule transcriptome. Notably, inhibiting stress granule formation in G3BP1/2-null cells restores the RNA length to the same level as wild-type. It suppresses stress-induced decay, identifying RNA decay as a critical determinant of RNA metabolism during cellular stress and highlighting its dependence on stress-granule formation.

      This is an exciting and novel discovery. I am not an expert in sequencing technologies or sequencing data analysis, so I will limit my comments purely to biology and not technical points. The PI is a leader in applying innovative sequencing methods to studying mRNA decay.

      One aspect that appeared overlooked is that poly(A) tail shortening per se does lead to decapping. It is shortening below a certain threshold of 8-10 As that triggers decapping. Therefore, I found the conclusion that poly(A) tail shortening is not required for stress-induced decay to be somewhat premature. For a robust test of this hypothesis, the authors should consider performing their analysis in conditions where CNOT7/8 is knocked down with siRNA.

      We agree with the reviewer. We have now performed experiments in cells expressing a well-characterized, catalytically inactive dominant negative NOT8 isoform (NOT8*) (Chang et al. 2019). Our new data show that stress-induced decay still occurs in cells expressing NOT8*. These results confirm our findings that stress-induced decay does not require deadenylation. We present these new results in Fig. 3 and Sup. Fig. 3.

      Similarly, as XRN1 requires decapping to take place, it necessitates the experiment where a dominant-negative DCP2 mutant is over-expressed.

      We agree with the reviewer and have performed this experiment as requested. Expression of a dominant negative DCP2 (DCP2*) isoform (Loh, Jonas, and Izaurralde 2013) in HeLa cells showed that decapping is also not required for stress-induced decay. We present these new results in Fig. 3 and Sup. Fig. 3.

      Are G3BP1/2 stress granules required for stress-induced decay or simply sites for storage? This part seems unclear. A very worthwhile test here would be to assess in XRN1-null background.

      We thank the reviewer for their comment. Our data show that stress-induced decay is not observed in ΔΔG3BP1/2 U2OS cells, which are unable to form stress granules (Fig. 6). This result suggests that G3BP1/2 SGs either a) are required for 5’ RNA shortening or b) preserve partially fragmented RNAs that would otherwise be rapidly degraded. We find the second option unlikely for two reasons. First, even if the fragments were rapidly degraded, we would still expect to find evidence of their presence in our data. However, Fig. 6f shows that the length distributions of ΔΔG3BP1/2 U2OS cells with and without arsenite are almost identical, arguing against the presence of such a pool of rapidly degrading RNAs. Second, if these RNAs were protected by SGs, they would be expected to be downregulated in the absence of SGs in ΔΔG3BP1/2 U2OS cells treated with arsenite. Our results contradict this hypothesis, as no association is found between the level of downregulation in arsenite-treated ΔΔG3BP1/2 U2OS cells and the observed stress-induced fragmentation in WT. Collectively, our results point towards G3BP1/2 stress granules being required for stress-induced decay. We have expanded on these points in the manuscript to clarify.

      Finally, the authors speculate that the mechanism of stress-induced decay may have evolved to relieve translational load during stress. But why degrade the 5' end when removing the cap may be sufficient? This returns to the question of assessing the role of decapping in this mechanism.

      The reviewer raises a very interesting point. Our new results, following expression of dominant negative DCP2, show that stress-induced decay does not require decapping. It is therefore plausible that a stress-induced co-translational mechanism cleaves mRNAs endonucleolytically to reduce the translational load. Such a mechanism would have several functional benefits: it would acutely reduce the translational load, degrade non-essential RNAs, preserve energy and release ribosomes for translation of the stress response program. We have expanded the discussion to mention these points.

      Recommendations for the authors:

      Reviewing Editor (Recommendations For The Authors):

      As you can see from the comments, although the reviewers appreciate the novelty of your findings, there was a consensus opinion from all reviewers that the authors overinterpreted their data, since they only have one assay and did not fully analyze it, as laid out in one of the reviewer's critiques. Some orthogonal validation of the "groundbreaking" claims is necessary. Examination of the effects of upstream events in 5'-to-3' decay, namely deadenylation, and decapping, would be necessary for a better understanding of the phenomena the authors describe. Many tools and approaches for studying this are described well in the literature (CNOT7-KD, dominant negative DCP2 E148Q, XRN1-null cell lines), so it is well within the authors' reach. Overall, while some of the evidence presented is novel and solid, for some of the claims there is only incomplete evidence.

      We thank the reviewers and the editor for their comments and suggestions. We have performed several additional experiments to further support our conclusions. We have notably investigated the role of deadenylation and decapping in the stress-induced decay by expressing dominant negative NOT8 and DCP2, respectively, as suggested. Our results show that neither deadenylation nor decapping is necessary for stress-induced transcript shortening, suggesting an endonucleolytic event. We believe that these additional experiments strengthen the main conclusions of our work. 

      Reviewer #1 (Recommendations For The Authors):

      Major comments:

      (1) The experiments were conducted in two unrelated cell lines, HeLa and U2OS. The authors should determine if the 5'end RNA decay in response to stress is also observed in normal human cells such as normal human diploid fibroblasts. Furthermore, it would be important to know if this mechanism is conserved between human and mouse cells. This can be tested in mouse embryonic fibroblasts.

      We thank the reviewer for their suggestion. We have now also performed experiments in the mouse embryonic fibroblast NIH 3T3 cell line. Our new results confirm that stress-induced 5’ end RNA decay is also observed in this primary cell line and is conserved between human and mouse (Sup. Fig. 1k, l).

      (2) The authors state that they monitored cell viability up to 24 hours after Arsenite treatment, but the data is shown up to 240 min (Suppl. 1a). Also, the Y-axis label of this Figure is "Active cells (%)". This should be changed to "Live cells (%)" if this is what they are referring to.

      We thank the reviewer for identifying this mistake. Cell viability was monitored up to 4 hours after arsenite treatment. We have corrected the text and modified the figure according to the reviewer’s suggestion.

      (3) Based on direct Nanopore-based RNA-seq the authors surprisingly found that RNAs in oxidative stress were globally shorter than unstressed cells. Since Nanopore-based RNA-seq will not detect RNAs that lack a poly A-tail, are they not missing out on RNAs that have already started getting degraded due to the loss of a poly A-tail? Also, I am not sure if they used a spike-in control which would be critical to claim global changes in RNA expression.

      We agree with the reviewer that our strategy does not capture RNA molecules without a poly(A) tail. Nevertheless, our data do identify shortening upon stress at the 5’ end of RNAs that retain poly(A) tails. We consider this direct evidence that decay at the 5’ end does not require prior removal of the poly(A) tail; otherwise, these molecules would not have been captured and observed. Indeed, our newly added data from cells expressing a well-characterized, catalytically inactive dominant negative NOT8 isoform (Chang et al. 2019) show that stress-induced decay occurs even upon silencing of the CCR4-NOT deadenylation complex. We present these results in Fig. 3 and Sup. Fig. 3.

      We would like to clarify that we did not use a spike-in control and thus refrain from claiming global changes in RNA expression. Instead, we compare relative ratios of groups of molecules within libraries that are internally normalized, we perform correlative comparisons that are invariant to normalization, and we perform differential gene expression analysis using established normalization schemes such as that of DESeq2 (Love, Huber, and Anders 2014).
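
      To illustrate why a within-library ratio does not depend on sequencing depth, the following is a minimal sketch rather than the analysis code used in the study. The read lengths, the 1000 nt threshold and the function names are hypothetical values chosen only for illustration.

```python
def fraction_shortened(lengths, threshold):
    """Fraction of reads in one library that are shorter than `threshold` nt."""
    if not lengths:
        raise ValueError("library has no reads")
    return sum(1 for n in lengths if n < threshold) / len(lengths)


def scale_library(lengths, factor):
    """Mimic deeper sequencing of the same sample by repeating each read `factor` times."""
    return [n for n in lengths for _ in range(factor)]


# Hypothetical read lengths (nt); not data from the manuscript.
unstressed = [2100, 1800, 950, 2500, 1200, 3000]
arsenite = [900, 1100, 400, 2400, 700, 1500]

# The fraction is computed inside each library, so multiplying the depth of
# one library leaves it unchanged and no spike-in is needed for this comparison.
shallow = fraction_shortened(arsenite, 1000)
deep = fraction_shortened(scale_library(arsenite, 5), 1000)
```

      Absolute read counts, in contrast, would change with depth, which is why claims about global abundance would require a spike-in control.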

      (4) Many graphs are confusing and inconsistent. For example, samples for Nanopore RNA-seq were prepared in triplicates. Biological or technical? The schematic in Figure 1a shows ISRIB but it appears from Figure 4 onwards. It is missing in the Figure 1 results and the Figure legend. The X-axis labels of many graphs are confusing. For example, Supplementary Figure 1d, 1e, 1g and 1h. It says transcript length but are these nucleotides? P-values are missing from many of these graphs. For some graphs, the authors compared Unstressed vs Arsenite (Figure 1), but in other panels they state No Ars vs 0.5 mM Ars (Fig. 3a) or Control vs Ars (Figure 5c). Likewise, in Figure 1b, Expression change (log2) is unstressed vs Arsenite or Arsenite vs unstressed?

      We thank the reviewer for identifying these inconsistencies in the presentation of our results. The replicates for nanopore RNA-seq experiments were biological. We have now clarified this point in the text. Furthermore, we have removed “ISRIB” from Fig. 1a to avoid any confusion. We have also made our labelling consistent across all figures, using “unstressed” for no arsenite treatment and “arsenite” or “+ Ars” for arsenite treatment.

      (5) The authors transfected cells with siCTRL or siXRN1 using electroporation and treated the cells 72 hours after transfection. Since XRN1 is an essential gene, it would be important to determine the viability of cells 72 hours after transfection. Along these lines, in Figure 3b, it would be important to determine the effect of XRN1 knockdown in unstressed cells. Currently, there are only 3 comparisons in Figure 3b - unstressed, siCTRL + Ars and siXRN1 + Ars, and this is insufficient to conclude the effects of XRN1 knockdown in the presence of Arsenite.

      We thank the reviewer for their suggestion. We have updated Fig. 3b and the text to show the requested conditions: siCTRL and siXRN1 with and without arsenite. While XRN2 is an essential gene for many organisms, XRN1 is not essential in mammalian cells and no increased cell death has been reported for XRN1-KO or -KD cells (Brothers et al. 2023). We have also tested different concentrations of siRNA (up to 40 nM) and monitored the cells for up to five days after transfection without observing any cell toxicity, in line with previous reports.

      (6) More broadly, the whole study is somewhat descriptive. The biological effect of 5'end mRNA shortening on gene expression is unclear. There is no data indicating how these changes in RNA lengths impact protein expression. Global quantitative proteomics would be critical to determine this.

      We thank the reviewer for their suggestion. To address this concern, we have performed additional experiments using cells expressing catalytically inactive forms of NOT8 (Chang et al. 2019) and DCP2 (Loh, Jonas, and Izaurralde 2013) to inhibit deadenylation and decapping. These experiments provide additional mechanistic details for 5’ shortening and suggest endonucleolytic cleavage as a critical step (Fig. 3 and Sup. Fig. 3). We agree that it would be interesting to study the fate of these shortened transcripts, notably regarding translation. However, given the complexity of the expected proteome changes that also follow global translation arrest under stress (Harding et al. 2003; Pakos-Zebrucka et al. 2016), we think that this work is beyond the scope of this manuscript and will be the subject of future studies.

      Minor comments:

      (1) Some of the affected RNAs can be validated in HeLa and other cell lines.

      We thank the reviewer for their suggestion. We have performed RT-qPCR on 3 different mRNAs that present 5’ shortening upon oxidative stress, using different primer sets located along each mRNA. We hypothesized that the closer a primer set is located to the 5’ end, the less abundant the corresponding region would be in arsenite-treated compared to untreated cells. Our results indeed show that the measured level of these mRNAs depends on the location of the primer set used for the qPCR: the closer it is to the 5’ end, the less abundant the mRNA is upon oxidative stress compared to control cells. We present these data, as well as a schematic representing the positions of the primers, in Sup. Fig. 2d.
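
      The comparison described above can be sketched with the standard 2^-ΔΔCt calculation. This is an assumption made for illustration: the response does not state which quantification method or reference amplicon was used, the Ct values below are hypothetical, and the formula assumes perfect doubling per cycle.

```python
def relative_abundance(ct_target_treated, ct_ref_treated,
                       ct_target_control, ct_ref_control):
    """Relative level of a target amplicon in treated vs control cells (2^-ddCt).

    Assumes 100% amplification efficiency for both amplicons.
    """
    d_ct_treated = ct_target_treated - ct_ref_treated
    d_ct_control = ct_target_control - ct_ref_control
    dd_ct = d_ct_treated - d_ct_control
    return 2.0 ** (-dd_ct)


# Hypothetical Ct values for two primer sets on the same mRNA, each measured
# against the same reference amplicon (Ct 18 in both conditions).
five_prime = relative_abundance(26.0, 18.0, 24.0, 18.0)   # amplicon near the 5' end
three_prime = relative_abundance(24.0, 18.0, 24.0, 18.0)  # amplicon near the 3' end
```

      With these illustrative numbers the 5’-proximal amplicon drops to a quarter of its control level while the 3’-proximal amplicon is unchanged, which is the pattern expected for 5’ shortening.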

      (2) The authors should check whether XRN1 also co-localizes in SGs.

      We thank the reviewer for their suggestion. We have performed immunofluorescence on U2OS and HeLa cells upon oxidative stress and did not observe co-localization of XRN1 with TIA-1, a marker of stress granules (see below). These results are consistent with Kedersha et al. (2005), who showed that XRN1 mainly co-localizes with processing bodies and is only weakly detectable in SGs in DU145 cells. We think that this result is beyond the scope of this study and thus decided to include it only for the reviewers.

      Author response image 1.

      Representative immunofluorescence merged image of HeLa (left panel) and U2OS (right panel) cells treated with sodium arsenite and labelled with anti-TIA1 (red), anti-XRN1 (green) antibodies and DAPI (blue). Scale bar 50 µm.

      (3) XRN1 should be knocked down with more than one siRNA.

      We thank the reviewer for this suggestion. Our results show that our XRN1 KD specifically rescues the length of the most shortened mRNAs (Fig. 3e). This is a highly specific effect that makes us confident it is not mediated by non-specific siRNA binding; thus, we do not consider it necessary to repeat the experiment.

      (4) There are typos in the text regarding Figure 6d, e, and f. Also, Supplementary Figure 4a.

      We thank the reviewer for identifying these mistakes. We have corrected the typos. 

      Reviewer #3 (Recommendations For The Authors):

      The authors should consider testing their hypotheses by arresting the decay pathway using the approaches I mentioned previously. As it stands, some conclusions are somewhat speculative.

      We have replied to the reviewer comments in the public review section. 

      References:

      • Brothers, William R., Farah Ali, Sam Kajjo, and Marc R. Fabian. 2023. “The EDC4-XRN1 Interaction Controls P-Body Dynamics to Link MRNA Decapping with Decay.” The EMBO Journal, August, e113933.

      • Chang, Chung-Te, Sowndarya Muthukumar, Ramona Weber, Yevgen Levdansky, Ying Chen, Dipankar Bhandari, Catia Igreja, Lara Wohlbold, Eugene Valkov, and Elisa Izaurralde. 2019. “A Low-Complexity Region in Human XRN1 Directly Recruits Deadenylation and Decapping Factors in 5’-3’ Messenger RNA Decay.” Nucleic Acids Research 47 (17): 9282–95.

      • Harding, Heather P., Yuhong Zhang, Huiquing Zeng, Isabel Novoa, Phoebe D. Lu, Marcella Calfon, Navid Sadri, et al. 2003. “An Integrated Stress Response Regulates Amino Acid Metabolism and Resistance to Oxidative Stress.” Molecular Cell 11 (3): 619–33.

      • Ibrahim, Fadia, Manolis Maragkakis, Panagiotis Alexiou, and Zissimos Mourelatos. 2018. “Ribothrypsis, a Novel Process of Canonical MRNA Decay, Mediates Ribosome-Phased MRNA Endonucleolysis.” Nature Structural & Molecular Biology 25 (4): 302–10.

      • Kedersha, Nancy, Georg Stoecklin, Maranatha Ayodele, Patrick Yacono, Jens Lykke-Andersen, Marvin J. Fritzler, Donalyn Scheuner, Randal J. Kaufman, David E. Golan, and Paul Anderson. 2005. “Stress Granules and Processing Bodies Are Dynamically Linked Sites of MRNP Remodeling.” The Journal of Cell Biology 169 (6): 871–84.

      • Krause, Maximilian, Adnan M. Niazi, Kornel Labun, Yamila N. Torres Cleuren, Florian S. Müller, and Eivind Valen. 2019. “Tailfindr: Alignment-Free Poly(A) Length Measurement for Oxford Nanopore RNA and DNA Sequencing.” RNA 25 (10): 1229–41.

      • Loh, Belinda, Stefanie Jonas, and Elisa Izaurralde. 2013. “The SMG5-SMG7 Heterodimer Directly Recruits the CCR4-NOT Deadenylase Complex to MRNAs Containing Nonsense Codons via Interaction with POP2.” Genes & Development 27 (19): 2125–38.

      • Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15 (12): 550.

      • Pakos-Zebrucka, Karolina, Izabela Koryga, Katarzyna Mnich, Mila Ljujic, Afshin Samali, and Adrienne M. Gorman. 2016. “The Integrated Stress Response.” EMBO Reports 17 (10): 1374–95.

      • Pelechano, Vicent, Wu Wei, and Lars M. Steinmetz. 2015. “Widespread Co-Translational RNA Decay Reveals Ribosome Dynamics.” Cell 161 (6): 1400–1412.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      The manuscript discusses the role of phosphorylated ubiquitin (pUb) by PINK1 kinase in neurodegenerative diseases. It reveals that elevated levels of pUb are observed in aged human brains and those affected by Parkinson's disease (PD), as well as in Alzheimer's disease (AD), aging, and ischemic injury. The study shows that increased pUb impairs proteasomal degradation, leading to protein aggregation and neurodegeneration. The authors also demonstrate that PINK1 knockout can mitigate protein aggregation in aging and ischemic mouse brains, as well as in cells treated with a proteasome inhibitor. While this study provided some interesting data, several important points should be addressed before being further considered.

      Strengths:

      (1) Reveals a novel pathological mechanism of neurodegeneration mediated by pUb, providing a new perspective on understanding neurodegenerative diseases.

      (2) The study covers not only a single disease model but also various neurodegenerative diseases such as Alzheimer's disease, aging, and ischemic injury, enhancing the breadth and applicability of the research findings.

      Weaknesses:

      (1) PINK1 has been reported as a kinase capable of phosphorylating Ubiquitin, hence the expected outcome of increased p-Ub levels upon PINK1 overexpression. Figures 5E-F do not demonstrate a significant increase in Ub levels upon overexpression of PINK1 alone, whereas the evident increase in Ub expression upon overexpression of S65A is apparent. Therefore, the notion that increased Ub phosphorylation leads to protein aggregation in mouse hippocampal neurons is not yet convincingly supported.

      Indeed, overexpression of sPINK1 alone resulted in minimal changes in Ub levels in the soluble fraction (Figure 5E), which is expected given that the soluble Ub pool remains relatively stable and buffered. However, sPINK1* overexpression led to a marked increase in Ub levels in the insoluble fraction, indicative of increased protein aggregation (Figure 5F). The molecular weight distribution of Ub in the insoluble fraction was predominantly below 70 kDa, suggesting that phosphorylation inhibits Ub chain elongation.

      To further validate this mechanism, we utilized the Ub/S65A mutant to antagonize Ub phosphorylation and observed a significant reduction in the intensity of aggregated bands at low molecular weights, indicating restored proteasomal activity. The observed increase in Ub levels in the soluble fraction upon Ub/S65A overexpression is likely due to enhanced ubiquitination driven by elevated Ub-S65A, and notably, Ub/S65A was also detectable using an antibody against wild-type Ub.

      Consistent with these findings, overexpression of Ub/S65E resulted in a further increase in Ub levels in the insoluble fraction, with intensified low molecular weight bands. The effect was even more pronounced than that observed with sPINK1 transfection, likely resulting from the complete phosphorylation mimicry achieved by Ub/S65E, compared to the relatively low levels of phosphorylation by PINK1.

      These findings collectively support the conclusion that sPINK1 promotes protein aggregation via Ub phosphorylation. We have updated the Results and Discussion sections to more clearly present the data and explain the various controls.

      (2) The specificity of PINK1 and p-Ub antibodies requires further validation, as a series of literature indicate that the expression of the PINK1 protein is relatively low and difficult to detect under physiological conditions.

      We acknowledge the challenges in achieving high specificity with commercially available and custom-generated antibodies targeting PINK1 and pUb, particularly given their low endogenous expression under physiological conditions. However, in our study, we observed robust immunofluorescent staining for PINK1 (Figures 1A, 1C, and 1G) and pUb (Figures 1B, 1D, and 1G) in human brain samples from Alzheimer's disease (AD) patients, as well as in mouse models of AD and cerebral ischemia. The clear visualization can be partly attributed to the pathological upregulation of PINK1 and pUb under disease conditions. Importantly, the images from pink1<sup>-/-</sup> mice exhibit much weaker staining.

      Additionally, we detected a significant elevation in pUb levels in aged mouse brains compared to younger ones (Figures 1E and 1F). In contrast, pink1<sup>-/-</sup> mice showed no change in pUb levels with aging, despite some background signal, demonstrating that pUb accumulation during aging is PINK1-dependent. Collectively, these results support the specificity of the antibodies used in detecting pathophysiological changes in PINK1 and pUb levels.

      For cultured cells, pink1<sup>-/-</sup> cells served as a negative control for both PINK1 (Figures 2B and 2C) and pUb (Figures 2D and 2E). While the pUb Western blot exhibited some nonspecific background, pUb levels in pink1<sup>-/-</sup> cells remained unchanged across all MG132 treatment conditions (Figures 2D and 2E), further attesting to the usability of the antibodies in conjunction with appropriate controls.

      We have updated the manuscript with higher-resolution images; individual image files have been uploaded separately.

      (3) In Figure 6, relying solely on Western blot staining and Golgi staining under high magnification is insufficient to prove the impact of PINK1 overexpression on neuronal integrity and cognitive function. The authors should supplement their findings with immunostaining results for MAP2 or NeuN to demonstrate whether neuronal cells are affected.

      We included NeuN immunofluorescent staining at 10, 30, and 70 days post-transfection in Figure 5—figure supplement 2. The results clearly demonstrate a significant loss of NeuN-positive cells in the hippocampus following Ub/S65E overexpression, while no apparent reduction was observed with sPINK1 transfection alone.

      We have also quantified MAP2 protein levels via Western blotting and examined the morphology of neuronal dendrites and synaptic structures using Golgi staining. These analyses revealed a significant reduction in MAP2 levels and synaptic damage upon sPINK1 or Ub/S65E overexpression (Figures 6F and 6H), consistent with the proteomics analysis (Figure 5—figure supplement 5). Notably, these detrimental effects could be rescued by co-expression of Ub/S65A, reinforcing the role of pUb in mediating these structural changes.

      Together, our findings from NeuN immunostaining, MAP2 protein analysis, proteomics analysis, and Golgi staining provide strong evidence for the impact of PINK1 overexpression and pUb elevation on neuronal integrity and synaptic structure.

      (4) The authors should provide more detailed figure captions to facilitate the understanding of the results depicted in the figures.

      Figure captions have been updated with more details incorporated in the revised manuscript.

      (5) While the study proposes that pUb promotes neurodegeneration by affecting proteasomal function, the specific molecular mechanisms and signaling pathways remain to be elucidated.

      The molecular mechanisms and signaling pathways through which pUb promotes neurodegeneration are likely multifaceted and interconnected. Our findings suggest that mitochondrial dysfunction plays a central role following sPINK1* overexpression. This is supported by (1) an observed increase in full-length PINK1, indicative of impaired mitochondrial quality control, and (2) proteomic data showing enhanced mitophagy at 30 days post-transfection, followed by substantial mitochondrial injuries at 70 days post-transfection (Figure 5—figure supplement 5 and Supplementary Data). The progressive mitochondrial damage caused by protein aggregates would exacerbate neuronal injury and degeneration.

      Additionally, reduced proteasomal activity may lead to the accumulation of inhibitory proteins that are normally degraded by the ubiquitin-proteasome system. Our proteomics analysis identified a >50-fold increase in CamK2n1 (UniProt ID: Q6QWF9), an endogenous inhibitor of CaMKII activation, following sPINK1* overexpression. The accumulation of CamK2n1 suppresses CaMKII activation, thereby inhibiting the CREB signaling pathway (Figure 7), which is essential for synaptic plasticity and neuronal survival. This disruption can further contribute to neurodegenerative processes.

      Thus, our findings underscore the complexity of pUb-mediated neurodegeneration and call for further investigation into downstream consequences.

      Reviewer #1 (Recommendations for the authors):

      Suggestions for improved or additional experiments, data or analyses.

      We have performed additional experiments to investigate how the impairment of ubiquitin-proteasomal activity contributes to neurodegeneration. Specifically, we investigated CamK2n1, an endogenous inhibitor of CaMKII, which is normally degraded by the proteasome to allow CaMKII activation. Our proteomics analysis revealed a significant (>50-fold) elevation of CamK2n1 following sPINK1 overexpression (Figure 5—figure supplement 5 and Supplementary Data).

      To validate this mechanism, we conducted immunofluorescence and Western blot analyses, demonstrating reduced levels of phosphorylated CaMKII (pCaMKII) and phosphorylated CREB (pCREB), as well as reduced levels of downstream proteins such as BDNF and ERK. These results have been incorporated into the revised manuscript (Figure 7).

      As the proteasome is crucial in maintaining proteostasis, its dysregulation would trigger neurodegeneration through multiple pathways, contributing to a broad cascade of pathological events.

      Reviewer #2 (Public review):

      Summary:

      The manuscript makes the claim that pUb is elevated in a number of degenerative conditions including Alzheimer's Disease and cerebral ischemia. Some of this is based on antibody staining which is poorly controlled and difficult to accept at this point. They confirm previous results that a cytosolic form of PINK1 accumulates following proteasome inhibition and that this can be active. Accumulation of pUb is proposed to interfere with proteostasis through inhibition of the proteasome. Much of the data relies on over-expression and there is little support for this reflecting physiological mechanisms.

      Weaknesses:

      The manuscript is poorly written. I appreciate this may be difficult in a non-native tongue, but felt that many of the problems are organizational. Less data of higher quality, better controls and incision would be preferable. Overall the referencing of past work is lamentable. Methods are also very poor and difficult to follow.

      Until technical issues are addressed I think this would represent an unreliable contribution to the field.

      (1) Antibody specificity and detection under pathological conditions

      We recognize the limitations of commercially available antibodies for detecting PINK1 and pUb. Nevertheless, our findings reveal a significant elevation in PINK1 and pUb levels under pathological conditions, such as Alzheimer's disease (AD) and ischemia. Additionally, we observed an increase in pUb level during brain aging, further demonstrating its relevance and a potentially causative role for this special pathological condition. Similarly, elevated pUb levels were observed for cultured cells following pharmacological treatment or oxygen-glucose deprivation (OGD).

      In contrast, in pink1<sup>-/-</sup> mice and HEK293 cells used as negative controls, PINK1 and pUb levels remained consistently low. Therefore, the observed elevations of PINK1 and pUb are associated with specific pathological conditions rather than with an antibody-detection anomaly.

      (2) Overexpression as a model for pathological conditions

      To investigate whether the inhibitory effects of sPINK1 on the ubiquitin-proteasome system (UPS) depend on its kinase activity, we employed a kinase-dead version of sPINK1* as a negative control. Given that PINK1 targets multiple substrates, we also investigated whether its effects on UPS inhibition were specifically mediated by ubiquitin phosphorylation. To this end, we used Ub/S65A (a phospho-null mutant) to block Ub phosphorylation by sPINK1, and Ub/S65E (a phospho-mimetic mutant) to mimic phosphorylated Ub. These well-defined controls ensured the robustness of our conclusions.

      Although overexpression does not perfectly replicate physiological conditions, it provides a valuable model for studying pathological scenarios such as neurodegeneration and brain aging, where pUb levels are elevated. For example, we observed a 30.4% increase in pUb levels in aged mouse brains compared to young brains (Figure 1F). Similarly, in our sPINK1 overexpression model, pUb levels increased by 43.8% and 59.9% at 30 and 70 days post-transfection, respectively, compared to controls (Figures 5A and 5C). Notably, co-expression of sPINK1* with Ub/S65A almost entirely prevented sPINK1* accumulation (Figure 5B), indicating that an active UPS can efficiently degrade this otherwise stable variant of sPINK1.

      Together, our findings demonstrate that sPINK1 accumulation inhibits UPS activity, an effect that can be reversed by the phospho-null Ub mutant. The overexpression model mimics pathological conditions and provides valuable insights into pUb-mediated proteasomal dysfunction.

      (3) Organization of the manuscript

      Following your suggestion, we have restructured the manuscript to present the key findings in a more logical and cohesive sequence:

      (a) Evidence for elevated PINK1 and pUb levels across a broad spectrum of pathological and neurodegenerative conditions;

      (b) The effects of pUb elevation in cultured cells, focusing on the proteasome;

      (c) Mechanistic insights into how pUb elevation inhibits proteasomal activity;

      (d) The absence of PINK1 and pUb alleviates protein aggregation;

      (e) Evidence for the causative relationship between elevated pUb levels and proteasomal inhibition;

      (f) Demonstration that pUb elevation directly contributes to neuronal degeneration;

      (g) Additional evidence explaining the mechanism of neuronal degeneration following sPINK1* overexpression: the downstream effects of elevated CamK2n1, an inhibitor of CaMKII, resulting from proteasomal inhibition.

      This reorganization should ensure a clear and progressive narrative, and enhance the overall coherence and impact of the revised manuscript.

      (4) Revisions to writing, referencing, and methodology

      We have made a great effort to enhance the clarity and flow of the manuscript, including the addition of references to appropriately acknowledge prior work. We have also expanded the Methods section with additional details to improve readability and ensure reproducibility. We believe these revisions effectively address the concerns raised and strengthen the overall quality of the manuscript.

      Reviewer #2 (Recommendations for the authors):

      Figure 1: PINK1 is a poorly expressed protein and difficult to detect by Western blot let alone by immunofluorescence. I have direct experience of the antibody used in this study and do not consider it reliable. There are much cleaner reagents out there, although they still have many challenges. The minimal requirement here is for the PINK1 antibody staining to be compared in wild-type and knockout mice. One would also expect to see a mitochondrial staining which would require higher magnification to be definitive, but it does not look like it to me. This is a key foundational figure and is unreliable. The pUb antibody also has a high background, see for example figure 2E.

      Under physiological conditions, PINK1 and pUb levels are indeed low, making their detection challenging. However, under pathological conditions, their expression is significantly elevated, correlating with disease severity. Given the limitations of available reagents, using appropriate controls is a standard approach in biological research.

      Nevertheless, we observed robust immunofluorescent staining for PINK1 (Figures 1A, 1C, and 1G) and pUb (Figures 1B, 1D, and 1G) in human brain samples from Alzheimer’s disease (AD) patients and mouse models of AD and cerebral ischemia. Compared to healthy controls, the significant elevation of PINK1 and pUb under these pathological conditions accounts for their clear visualization. To validate antibody specificity, we have included images from pink1<sup>-/-</sup> mice as negative controls (Figure 1C and 1D, third panel).

      Furthermore, we analyzed pUb levels in both young and aged mice, using pink1<sup>-/-</sup> mice as controls.

      Our results revealed a significant increase in pUb levels in aged wild-type mice (Figures 1E and 1F). In contrast, pink1<sup>-/-</sup> mice exhibited relatively low pUb levels, with no notable change between young and aged groups. These findings reinforce the conclusion that pUb accumulation during aging is dependent on PINK1.

      For HEK293 cells, pink1<sup>-/-</sup> cells were used as a negative control for assessing PINK1 (Figures 2B and 2C) and pUb levels (Figures 2D and 2E). While the pUb Western blot did show some nonspecific background, as you have noted, pUb levels significantly increased following MG132 treatment of the wildtype cells. In contrast, no such increase was observed in pink1<sup>-/-</sup> cells (Figure 2D and 2E). These results further validate the reliability of our findings.

      Regarding mitochondrial staining, we recognize that PINK1 localization can vary depending on the pathological context. For example, in Alzheimer’s disease, PINK1 exhibits relatively high nuclear staining, while in cerebral ischemia and brain aging, it is predominantly cytoplasmic and punctate. In contrast, in young, healthy mouse brains, PINK1 is more uniformly distributed. The observed elevation in pUb levels could arise from mitochondrial PINK1 or soluble sPINK1 in the cytoplasm, and it remains unclear whether nuclear PINK1 contributes to pUb accumulation. Investigating the role of PINK1 in different forms and subcellular localizations will be an important avenue for future research.

      To enhance clarity, we have updated our images and replaced them with higher-resolution versions in the revised manuscript.

      Please also confirm that the GAPDH loading controls represent the same gels, to my eye they do not match.

      We have reviewed all the bands, and confirmed that the GAPDH loading controls correspond to the same gels. For different gels, we use separate GAPDH loading controls. There are two experimental scenarios to consider:

      (1) When there is a large difference in molecular weight between target proteins, we cut the gel into sections and incubate each section with different antibodies separately.

      (2) When the molecular weight difference is small and cutting is not feasible, we first probe the membrane with one antibody, strip it, and then re-incubate the membrane with a second antibody.

      These approaches ensure accurate and reliable detection of target proteins with various molecular weights relative to GAPDH.

      1H. Ponceau.

      We have corrected the spelling.

      Figure 2 many elements are confirmation of work already reported and this must be made clearer in the text. 

      Indeed, the elevation of sPINK1 and pUb upon proteasomal inhibition has been previously reported, and these studies have been acknowledged (Gao, et al, 2016; Dantuma, et al, 2000). In the present study, we expand on these findings by conducting a detailed analysis of the time- and concentration-dependent effects of MG132 on sPINK1 and pUb levels, establishing a causative relationship between pUb accumulation and proteasomal inhibition. Furthermore, we demonstrate that sPINK1 overexpression and MG132-induced proteasomal inhibition exhibit no additive effect, indicating that both converge on the same pathway, resulting in the impairment of proteasomal activity.

      It has been established that ubiquitin phosphorylation inhibits Ub chain elongation (Wauer, et al, 2015). However, our study provides novel insights by identifying an additional mechanism: phosphorylated Ub also interferes with the noncovalent interactions between Ub chain and Ub receptors in the proteasome, which further contributes to the impairment of UPS function.

      The PINK1 kinase-dead mutant construction (Figure 2F) and the use of Ub-GFP as a proteasomal substrate were based on established methodologies, which have been appropriately cited in the manuscript (Beilina, et al, 2005 for KD sPINK1; Yamano, et al for endogenous PINK1; Samant, et al, 2018 and Dantuma, et al, 2000 for Ub-GFP probe). Similarly, our use of puromycin and BALA treatments follows previously reported protocols (Gao, et al, 2016), which allowed us to dissect the relative contributions of sPINK1* overexpression to proteasomal vs. autophagic dysfunction.

      As you have noted, our study has built upon prior findings while introducing new mechanistic insights into sPINK1 and pUb-mediated proteasomal dysfunction.

      2C 24h MG132 not recommended, most cells are dead by then.

      We used MG132 treatment for 24 hours to evaluate the time-course effects of proteasomal inhibition on PINK1 and pUb levels in HEK293 cells (Figures 2C and 2E). We did observe some decrease in both PINK1 and pUb levels at 24 hours compared to 12 hours, which may result from some extent of cell death at the longer treatment duration.

      In SH-SY5Y cells, we collected cells at 24 hours after MG132 administration (Figure 5—figure supplementary 1). Though protein aggregation was evident in these cells, we did not observe pronounced cell death under these conditions, justifying our treatment.

      Our findings are consistent with previous studies demonstrating that MG132 at 5 µM for 24 hours effectively induces proteasomal inhibition without substantial cytotoxicity. For example, studies using human esophageal squamous cancer cells have reported that this treatment condition inhibits cell proliferation while maintaining cell viability, with cell viability >70% after 24-hour treatment with 5 µM MG132 (Int J Mol Med 33: 1083-1088, 2014). 

      MG132 has been commonly used at concentrations ranging from 5 to 50 µM for durations of 1 to 24 hours, as stated at the vendor’s website (https://www.cellsignal.com/products/activatorsinhibitors/mg-132/2194).

      2I what is BALA do they mean bafilomycin. This is a v-ATPase inhibitor, not just an autophagy inhibitor.

      We appreciate the reviewer’s comment regarding the use of BALA in Figure 2I. To clarify, BALA refers to bafilomycin A1, a well-established v-ATPase inhibitor that blocks lysosomal acidification. While bafilomycin A1 is commonly used as an autophagy inhibitor, its primary mechanism involves inhibiting lysosomal function, which is critical for autophagosome-lysosome fusion and subsequent degradation of autophagic cargo.

      In our study, we used bafilomycin A1 in conjunction with puromycin to dissect the relative contributions of sPINK1 overexpression on proteasomal and autophagic activities. Puromycin induces protein misfolding and aggregation, causing stress on both degradation pathways. By inhibiting lysosomal function with bafilomycin A1 and blocking the protein degradation load at various stages, we can tell the relative contributions of autophagy and UPS pathways.

      We acknowledge that bafilomycin A1’s effects extend beyond autophagy, as it also inhibits v-ATPase activity. However, its inhibition of lysosomal degradation is integral to distinguishing autophagy’s contribution under the experimental conditions, and BALA treatment has been used extensively in previous studies (Mauvezin and Neufeld, 2015).

      We have further clarified this treatment in the revised manuscript.

      Figure 3. Legend or text needs to be more explicit about how chains have been produced. From what I can gather from methods only a single E2 has been trialed. Authors should use at least one of the criteria used by Wauer et al. (2014) to confirm the stoichiometry of phosphorylation. The concept that pUb can interfere with E2 discharging is not new, but not universal across E2s.

      We have cited in the manuscript that PINK1-mediated ubiquitin phosphorylation can interfere with ubiquitin chain elongation for certain E2 enzymes (Wauer et al., 2015). 

      To clarify, the focus of our current work is on how elevation of Ub phosphorylation impacts UPS activity, rather than exploring the broader effects of Ub phosphorylation on Ub chain elongation. For this reason, we have used the standard E2 that is well-established for generating K48-linked polyUb chains (Pickart CM, 2005). Moreover, our findings go further by demonstrating that phosphorylated K48-linked polyubiquitin exhibits weaker non-covalent interactions with proteasomal ubiquitin receptors. This dual effect, on both covalent chain elongation and non-covalent interactions, contributes to the observed reduction in ubiquitin-proteasome activity, a novel aspect of our study.

      To address the reviewer’s concerns, we have added details in the Methods section and figure legends regarding the generation of ubiquitin chains. Specifically, we used ubiquitin-activating enzyme E1 (UniProt ID: P22314) and ubiquitin-conjugating enzyme E2-25K (UniProt ID: P61086) to generate K48-linked ubiquitin chains. 

      Our ESI-MS analysis showed that only 1–2 phosphoryl groups were incorporated into the K48-linked tetra-ubiquitin chains (Figure 3—figure supplement 2). This is consistent with our in vivo findings, where pUb levels increased by 30.4% in aged mouse brains compared to young brains (Figure 1F). Notably, even sub-stoichiometric phosphorylation onto the K48-linked ubiquitin chain significantly weakens the non-covalent interactions with the proteasome (Figures 3E and 3H).
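
      The sub-stoichiometric labeling described above can be made concrete with a small calculation. The following is a minimal sketch, not the authors' analysis: it assumes at most one phosphoryl group per ubiquitin subunit (at Ser65) and uses the standard average mass of about 79.98 Da added per phosphoryl group (HPO3). The function name and the default of four subunits are illustrative.

```python
# Average mass added per phosphoryl group (HPO3), in daltons.
PHOSPHO_SHIFT_DA = 79.98


def phospho_summary(n_phospho, n_subunits=4):
    """Return (expected mass shift in Da, fraction of subunits phosphorylated).

    Assumes at most one phosphoryl group per ubiquitin subunit.
    """
    if not 0 <= n_phospho <= n_subunits:
        raise ValueError("n_phospho must be between 0 and n_subunits")
    return n_phospho * PHOSPHO_SHIFT_DA, n_phospho / n_subunits


# One or two phosphoryl groups on a tetra-ubiquitin chain.
for n in (1, 2):
    shift, fraction = phospho_summary(n)
    print(f"{n} phosphoryl group(s): +{shift:.2f} Da, {fraction:.0%} of subunits")
```

      Under these assumptions, one to two phosphoryl groups per tetra-ubiquitin corresponds to 25–50% of subunits carrying a phosphate, which is the sense in which the phosphorylation is sub-stoichiometric.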

      Figure 4. I could find no definition of the insoluble fraction, nor details on how it is prepared.

      The insoluble fraction primarily contains proteins that are aggregated or associated with hydrophobic interactions and cannot be solubilized by RIPA buffer. We have provided more details in the Methods of the revised manuscript about how the insoluble fraction was prepared. Our approach was based on established protocols for fractionating soluble and insoluble proteins from brain tissues (Wirths, 2017). Here is an outline of the procedure, which enables the separation and subsequent analysis of distinct protein populations:

      • Lysis and preparation of soluble fraction: Cells and brain tissues were lysed using RIPA buffer (Beyotime Biotechnology, cat# P0013B) containing protease (P1005) and phosphatase inhibitors (P1081) on ice for 30 minutes, with gentle vortexing every 10 minutes. Brain samples were homogenized using a precooled TissuePrep instrument (TP-24, Gering Instrument Company). Lysates were centrifuged at 12,000 rpm for 30 minutes at 4°C. The supernatant was collected as the soluble protein fraction.

      • Preparation of insoluble fraction: The pellet was resuspended in 20 µl of SDS buffer (2% SDS, 50 mM Tris-HCl, pH 7.5) and subjected to ultrasonic pyrolysis at 4°C for 8 cycles (10 seconds ultrasound, 30 seconds interval). The samples were then centrifuged at 12,000 rpm for 30 minutes at 4°C. The supernatant obtained after this step was designated as the insoluble protein fraction.

      • Protein quantification: Protein concentrations for both soluble and insoluble fractions were determined using the BCA Protein Assay Kit (Beyotime Biotechnology, cat# P0009).
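
      Because the centrifugation steps above are given in rpm, the applied force depends on the rotor. As a rough guide only, the sketch below converts rpm to relative centrifugal force with the standard relation RCF = 1.118 × 10⁻⁵ × r × rpm², where r is the rotor radius in cm. The radii used here are hypothetical examples; the rotor is not specified in the response.

```python
def rpm_to_rcf(rpm, rotor_radius_cm):
    """Relative centrifugal force (x g) from rotor speed and radius in cm."""
    return 1.118e-5 * rotor_radius_cm * rpm ** 2


# Hypothetical rotor radii; substitute the radius of the rotor actually used.
for radius_cm in (6.0, 8.0, 10.0):
    rcf = rpm_to_rcf(12_000, radius_cm)
    print(f"radius {radius_cm:.1f} cm at 12,000 rpm: about {rcf:,.0f} x g")
```

      Reporting the force in x g alongside rpm would let other laboratories reproduce the fractionation with a different centrifuge.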

      Figure 5. What is the transfection efficiency? How many folds is sPINK1 over-expressed? Typically, a neuron will have only a few hundred copies of PINK1 at the basal state. How much mutant ubiquitin is expressed relative to wild type, seeing the free ubiquitin signals on the gels might be helpful here, but they seem to have been cut off. 

      We appreciate the reviewer's insightful comments regarding transfection efficiency, the extent of sPINK1 overexpression, and the expression levels of mutant ubiquitin relative to wild-type ubiquitin. Below, we provide detailed responses to each point:

      Transfection Efficiency: Our immunofluorescent staining for NeuN, a neuronal marker, demonstrated that over 90% of NeuN-positive cells were co-localized with GFP (Figure 5—figure supplement 2), indicating a high transfection efficiency in our neuronal cultures.

      Extent of sPINK1 Overexpression: Quantifying the exact fold increase of sPINK1 upon overexpression is inherently difficult due to its low basal expression under physiological conditions, making the relative increase difficult to measure (small denominator effect). However, our Western blot analysis shows that ischemic events can cause a substantial elevation of PINK1 levels, including both full-length and cleaved forms (Figure 1H). This suggests that our overexpression model recapitulates the pathological increase in PINK1, making it a relevant system for studying disease mechanisms.

      From Figure 5B, it is evident that sPINK1 levels differ significantly between neurons overexpressing sPINK1 alone and those co-expressing sPINK1 + Ub/S65A (70 days post-transfection). Overexpression of sPINK1 alone results in multiple PINK1 bands, consistent with sPINK1, endogenous PINK1 (induced by mitochondrial damage), and ubiquitinated sPINK1. In comparison, co-expressing Ub/S65A leads to faint PINK1 bands, suggesting that in the presence of a functionally restored proteasome, overexpressed sPINK1 is rapidly degraded. Therefore, actual accumulation of sPINK1 depends on proteasomal activity, and the “over-expressed” PINK1 level can be comparable to levels observed under native, pathological conditions.
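
      The small denominator effect mentioned above can be illustrated numerically. This is a minimal sketch with invented signal values, not data from the study: it shows how the same overexpressed signal yields very different fold changes when the basal estimate shifts by a small absolute amount.

```python
def fold_change(overexpressed, basal):
    """Ratio of the overexpressed signal to the basal signal."""
    if basal <= 0:
        raise ValueError("basal signal must be positive")
    return overexpressed / basal


# Hypothetical band intensities in arbitrary units.
overexpressed_signal = 100.0
for basal_signal in (0.5, 1.0, 2.0):
    ratio = fold_change(overexpressed_signal, basal_signal)
    print(f"basal {basal_signal:.1f} -> {ratio:.0f}-fold")
```

      In this illustration, a basal estimate that varies between 0.5 and 2.0 units, which is well within the noise of a faint band, changes the apparent fold increase fourfold.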

      Expression Levels of Mutant Ubiquitin Relative to Wild-Type: Assessing the expression levels of mutant versus wild-type ubiquitin is indeed valuable. In Figure 5E, we observed a 38.9% increase in high-molecular-weight ubiquitin conjugates in the soluble fraction when comparing the sPINK1+Ub/S65A group to the control. This increase suggests that mutant ubiquitin is actively incorporated into polyubiquitin chains.

      Regarding free monomeric ubiquitin, its low abundance and rapid incorporation into polyubiquitin chains make it difficult to visualize in Western blots. Additionally, its low molecular weight and lower antibody binding valency further reduce its visibility.

      General: a number of effects are shown following over-expression but no case is made that these levels of pUb are ever attained physiologically. I am very unconvinced by these findings and think the manuscript needs to be improved at multiple levels before being added to the record.

      We understand the reviewer’s concerns regarding the relevance of pUb levels observed in our overexpression model. To clarify, our study is not focused on physiological levels of pUb, but rather on pathologically elevated levels, which have been documented in various neurodegenerative conditions. While overexpression is not a perfect replication of pathological states, it provides a valuable tool to investigate mechanisms that become relevant under disease conditions. Moreover, we have taken steps to ensure the validity of our findings and to address potential limitations associated with overexpression models:

      Pathological Relevance: In addition to previously published reports, we observed significant increases in PINK1 and pUb levels in human brain samples from Alzheimer's disease (AD) patients, as well as in mouse models of AD, cerebral ischemia (including the mouse middle cerebral artery occlusion model and the oxygen-glucose deprivation cell model), and aging (e.g., Figures 1E, 1F, and 1H). All these data show that pUb levels are elevated under pathological conditions. Our overexpression model mimics these pathological scenarios by recreating the high levels of pUb, which lead to the impairment of proteasomal activity and subsequent disruption of proteostasis.

      Use of Robust Controls: To ensure the reliability of our results and interpretations, we employed multiple controls for our experiments. We have used pink1<sup>-/-</sup> mice and cells to confirm that pUb accumulation is PINK1-dependent (Figures 1C and 2C). We have also included a kinase-dead sPINK1 mutant and the Ub/S65A phospho-null mutant to negate or counteract the specific roles of PINK1 activity and pUb in proteasomal dysfunction. In addition, we have used Ub/S65E as a phosphomimetic mutant, corresponding to 100% Ub phosphorylation.

      Importantly, we have compared sPINK1 overexpression with both baseline and disease-mimicking conditions to ensure that the observed effects are consistent with pathological changes. Furthermore, our findings are supported by complementary evidence from human brain samples, animal models, cell cultures, and molecular assays. By integrating these controls and approaches, we have provided mechanistic insights into how elevated pUb levels cause proteasomal impairment and contribute to neurodegeneration.

      Our findings elucidate how elevated pUb levels contribute to the disruption of proteostasis in neurodegenerative conditions. While overexpression may have limitations, it remains a powerful tool for dissecting pathological mechanisms and testing hypotheses. Our results align with and expand upon previous studies suggesting pUb as a biomarker of neurodegeneration (Hou, et al, 2018; Fiesel, et al, 2015), and provide mechanistic insights into how elevated pUb and sPINK1 drive a vicious feedforward cycle, ultimately leading to proteasomal dysfunction and neurodegeneration.

      We hope these clarifications highlight the relevance and rigor of our study, and welcome additional suggestions to improve the manuscript.

      Reviewer #3 (Public review):

      Summary:

      This study aims to explore the role of phosphorylated ubiquitin (pUb) in proteostasis and its impact on neurodegeneration. By employing a combination of molecular, cellular, and in vivo approaches, the authors demonstrate that elevated pUb levels contribute to both protective and neurotoxic effects, depending on the context. The research integrates proteasomal inhibition, mitochondrial dysfunction, and protein aggregation, providing new insights into the pathology of neurodegenerative diseases.

      Strengths:

      - The integration of proteomics, molecular biology, and animal models provides comprehensive insights.

      - The use of phospho-null and phospho-mimetic ubiquitin mutants elegantly demonstrates the dual effects of pUb.

      - Data on behavioral changes and cognitive impairments establish a clear link between cellular mechanisms and functional outcomes.

      Weaknesses:

      - While the study discusses the reciprocal relationship between proteasomal inhibition and pUb elevation, causality remains partially inferred.

      It has been well-established that protein aggregates, particularly neurodegenerative fibrils, can impair proteasomal activity (McDade, et al., 2024; Kinger, et al., 2024; Tseng, et al., 2008). Other contributing factors, including ATP depletion, reduced proteasome component expression, and covalent modifications of proteasomal subunits, can also lead to declined proteasomal function. Additionally, mitochondrial injury serves as an important source of elevated PINK1 and pUb levels. Recent studies have demonstrated that efficient mitophagy is essential to prevent pUb accumulation, whereas partial mitophagy failure results in elevated PINK1 levels (Chin, et al, 2023; Pollock, et al. 2024).

      While pathological conditions can impair proteasomal function and slow sPINK1 degradation, leading to its accumulation, our results demonstrate that overexpression of sPINK1 or PINK1 can initiate this cycle as well. Once initiated, the cycle becomes self-perpetuating, as sPINK1 and pUb accumulation progressively impair proteasomal function, leading to more protein aggregates and mitochondrial damage.

      Importantly, we show that co-expression of Ub/S65A effectively rescues cells from this cycle, which further illustrates the pivotal role of pUb in driving proteasomal inhibition and the causality between pUb elevation and proteasomal inhibition. At the animal level, pink1 knockout prevents protein aggregation under aging and cerebral ischemia conditions (Figures 1E and 1G). 

      Together, by controlling at protein, cell, and animal levels, our findings support this self-reinforcing and self-amplifying cycle of pUb elevation, proteasomal inhibition, protein aggregation, mitochondrial damage, and ultimately, neurodegeneration.

      - The role of alternative pathways, such as autophagy, in compensating for proteasomal dysfunction is underexplored.

      Indeed, previous studies have shown that elevated sPINK1 can enhance autophagy (Gao, et al., 2016), potentially compensating for impaired UPS function. One mechanism involves PINK1-mediated phosphorylation of p62, which enhances autophagic activity.

      In our study, we observed increased autophagic activity upon sPINK1 overexpression, as shown in Figure 2I (middle panel, without BALA). This increase in autophagy may facilitate the degradation of ubiquitinated proteins induced by puromycin, partially mitigating proteasomal dysfunction. This compensation might also explain why protein aggregation, though statistically significant, increased only slightly at 70 days post-sPINK1 transfection (Figure 5F). Additionally, we detected a mild but statistically insignificant increase in LC3II levels in the hippocampus of mouse brains at 70 days post-sPINK1 transfection (Figure 5—figure supplement 6), further supporting the notion of autophagy activation.

      However, while autophagy may provide some compensation, its effect is likely limited. The UPS and autophagy serve distinct roles in protein degradation:

      • Autophagy is a bulk degradation pathway, primarily targeting damaged organelles, intracellular pathogens, and protein aggregates, often in a non-selective manner.

      • The UPS, in contrast, is highly selective, degrading short-lived regulatory proteins, misfolded proteins, and proteins tagged for degradation via ubiquitination.

      Thus, while sPINK1 overexpression enhances autophagy-mediated degradation, it simultaneously impairs UPS-mediated degradation. This suggests that autophagy partially compensates for proteasomal dysfunction but is insufficient to counterbalance the UPS's selective degradation function. We have incorporated additional discussion in the revised manuscript.

      - The immunofluorescence images in Figure 1A-D lack clarity and transparency. It is not clear whether the images represent human brain tissue, mouse brain tissue, or cultured cells. Additionally, the DAPI staining is not well-defined, making it difficult to discern cell nuclei or staging. To address these issues, lower-magnification images that clearly show the brain region should be provided, along with improved DAPI staining for better visualization. Furthermore, the Results section and Figure legends should explicitly indicate which brain region is being presented. These concerns raise questions about the reliability of the reported pUb levels in AD, which is a critical aspect of the study's findings.

      We have taken steps to address the concerns regarding clarity and transparency in Figure 1A-D. The source of the tissue is indicated to the left of each image. For example, we have written “human brain with AD” at the left side of Figure 1A, and “mouse brains with AD” at the left side of Figure 1C.

      Briefly, the human brain samples in Figure 1 originate from the cingulate gyrus of Alzheimer’s disease (AD) patients. Our analysis revealed that PINK1 is primarily localized within cell bodies, whereas pUb is more abundant around Aβ plaques, likely in nerve terminals. For the mouse brain samples, we have now explicitly indicated in the figure legends and Results section that the images represent the neocortex of APP/PS1 mice, a mouse model relevant to AD pathology, as well as the corresponding regions in wild-type and pink1<sup>-/-</sup> mice. We have ensured that the brain regions and sources are clearly stated throughout the manuscript.

      Regarding image clarity, we have uploaded higher-resolution versions of the images in the revised manuscript to improve visualization of key features, including DAPI staining. We believe these revisions enhance the reliability and interpretability of our findings, particularly in relation to the reported pUb levels in AD. 

      - Figure 4B should also indicate which brain region is being presented.

      The images were taken for layer III-IV in the neocortex of mouse brains. We have included this information in the figure legend of the revised manuscript.

      Reviewer #3 (Recommendations for the authors):

      - Expand on the potential compensatory role of autophagy in response to proteasomal dysfunction.

      Upon proteasomal inhibition, cells may activate autophagy as an alternative pathway of degradation to help clear damaged or misfolded proteins. Autophagy is a bulk degradation process that targets long-lived proteins, damaged organelles, and aggregated proteins for lysosomal degradation. While this pathway can provide some compensation, it is distinct from the ubiquitin-proteasome system (UPS), which specializes in the selective degradation of short-lived regulatory proteins and misfolded proteins.

      In our study, we observed increased autophagic activity following sPINK1 overexpression (Figure 2J, middle panel, without BALA) and a slight, though statistically insignificant, increase in LC3II levels in the hippocampus of mouse brains at 70 days post-sPINK1 transfection (Figure 5—figure supplement 6). These findings suggest that autophagy is indeed upregulated as a compensatory response to proteasomal dysfunction, potentially facilitating the degradation of aggregated ubiquitinated proteins. Additionally, gene set enrichment analysis (GSEA) revealed similar enrichment of autophagy pathways at 30 and 70 days post-sPINK1 overexpression (Figure 5—figure supplement 5).

      However, the compensatory capacity of autophagy is likely limited. While autophagy can reduce protein aggregation, it is an inherently non-selective process and cannot fully replace the targeted functions of the UPS. Moreover, as we illustrate in Figure 7 of the revised manuscript, UPS is essential for degrading specific regulatory and inhibitory proteins and plays a critical role in cellular proteostasis, particularly in signaling regulation, cell cycle control, and stress responses.

      Together, while autophagy activation provides some degree of compensation, it cannot fully restore cellular proteostasis. The interplay between these two degradation pathways is an important area for future investigation. For the present study, our focus is on how pUb elevations impact proteasomal activity and elicit downstream effects.

      We have incorporated additional discussion of this topic in the revised manuscript.

      - Simplify the discussion of complex mechanisms to improve accessibility for readers.

      We have revised the Discussion to present the mechanisms in a more coherent and accessible manner, ensuring clarity for a broader readership. These revisions should make the discussion more intuitive while preserving the depth of our findings.

      - Statistical analyses could benefit from clarifying how technical replicates and biological replicates were accounted for across experiments.

      We have clarified our statistical analysis in the Methods section and figure legends, explicitly detailing how many biological replicates were accounted for across experiments. These revisions should enhance transparency and clarity, ensuring that our findings are robust and reproducible.
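
      One common way to keep technical replicates from inflating the sample size is to average them within each biological replicate before any test is run. The sketch below is an illustration of that convention under our own assumptions, not a description of the analysis in the manuscript; the animal labels and values are invented.

```python
from statistics import mean


def collapse_technical_replicates(measurements):
    """Average technical replicates so each biological replicate yields one value.

    `measurements` maps a biological replicate label (for example, one animal)
    to the list of its technical replicate readings.
    """
    return {label: mean(values) for label, values in measurements.items()}


# Hypothetical readings: three animals, each measured two or three times.
raw = {
    "mouse_1": [1.0, 2.0, 3.0],
    "mouse_2": [2.5, 3.5],
    "mouse_3": [4.0, 4.0, 4.0],
}
per_animal = collapse_technical_replicates(raw)
print(per_animal)
print("n for statistics =", len(per_animal))
```

      With this convention, n in each figure legend refers to the number of animals or independent cultures rather than the number of individual readings.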

      - The image in Figure 3D is too small to distinguish any signals. A larger and clearer image should be presented.

      We have enlarged the images in Figure 3D. Additionally, we have replaced figures with higher-resolution versions throughout the manuscript.

      - NeuN expression in Figure 4B differs between wildtype and pink-/- mice. Additional validation is needed to determine whether pink-/- enhances NeuN expression.

      The difference in NeuN immunofluorescence intensity between wild-type and pink1<sup>-/-</sup> mice in Figure 4B may simply result from variations in image acquisition rather than an actual difference in NeuN expression.

      Our single nuclei RNA-seq analyses of wild-type and pink1<sup>-/-</sup> mice at 3 and 18 months of age reveal no significant differences in NeuN expression at the transcript level (data provided below). This confirms that the observed variation in fluorescence intensity is unlikely to reflect an authentic upregulation of NeuN expression. Thus, factors like the concentration of antibody, image exposure and processing may contribute to differences in staining intensity.

      Author response image 1.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This valuable work analyzes how specialized cells in the auditory cells, known as the octopus cells, can detect coincidences in their inputs at the submillisecond time scale. While previous work indicated that these cells receive no inhibitory inputs, the present study unambiguously demonstrates that these cells receive inhibitory glycinergic inputs. The physiologic impact of these inputs needs to be studied further. It remains incomplete at present but could be made solid by addressing caveats related to similar sizes of excitatory postsynaptic potentials and spikes in the octopus neurons.

      We apologize for not explicitly describing the experimental methods and analysis procedures that ensure discrimination between action potentials and EPSPs. This has been addressed in responses to reviewer comments and amended in the manuscript.

      Reviewer #1 (Public Review):

      Kreeger and colleagues have explored the balance of excitation and inhibition in the cochlear nucleus octopus cells of mice using morphological, electrophysiological, and computational methods. On the surface, the conclusion, that synaptic inhibition is present, does not seem like a leap. However, the octopus cells have been in the past portrayed as devoid of inhibition. This view was supported by the seeming lack of glycinergic fibers in the octopus cell area and the lack of apparent IPSPs. Here, Kreeger et al. used beautiful immunohistochemical and mouse genetic methods to quantify the inhibitory and excitatory boutons over the complete surface of individual octopus cells and further analyzed the proportions of the different subtypes of spiral ganglion cell inputs. I think the analysis stands as one of the most complete descriptions of any neuron, leaving little doubt about the presence of glycinergic boutons.

      Kreeger et al then examined inhibition physiologically, but here I felt that the study was incomplete. Specifically, no attempt was made to assess the actual, biological values of synaptic conductance for AMPAR and GlyR. Thus, we don't really know how potent the GlyR could be in mediating inhibition. Here are some numbered comments:

      (1) "EPSPs" were evoked either optogenetically or with electrical stimulation. The resulting depolarizations are interpreted to be EPSPs. However previous studies from Oertel show that octopus cells have tiny spikes, and distinguishing them from EPSPs is tricky. No mention is made here about how or whether that was done. Thus, the analysis of EPSP amplitude is ambiguous.

      We agree that large EPSPs can be difficult to distinguish from an octopus cell’s short spikes during experiments. During analysis, we distinguished spikes from EPSPs by generating phase plots, which allow us to visualize the first derivative of the voltage trace on the y-axis and the value of the voltage on the x-axis at each moment in time. In the example shown below, four depolarizing events were electrically evoked in an octopus cell (panel A). The largest of these events (shown in orange in panels B-D) has an amplitude of ~9mV and could be a small spike. The first derivative of the voltage (panel C) reveals a bi-phasic response in the larger orange trace, where during the rising phase (mV/ms > 0) of the EPSP there is a second, sharper rising phase for the spike. Like more traditionally sized action potentials, phase plots for octopus cell spikes also reveal a sharp change in the rate of voltage change over time (Author response image 1 panel D, ✱) after the rising action of the EPSP begins to slow. EPSPs (shown in blue in panels B-D) lack the deflection in the phase plot. Not all cases were as unambiguous as this example. Therefore, our analysis only included subthreshold stimulation that unambiguously evoked EPSPs, not spikes. A brief description of this analysis has been added to the methods text (lines 625-627) and we have noted in the results section that both ChR2-evoked and electrically-evoked stimulation can produce small action potentials, which were excluded from analysis (lines 156-158).

      Author response image 1.
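      As an illustration only, the phase-plot criterion described in the response above could be sketched as follows. This is a minimal sketch and not the analysis code used in the study. The function names, the synthetic traces, and the rule "more than one positive peak in dV/dt marks a second rising phase" are assumptions introduced here, and the rule is only meaningful for noise-free or well-filtered traces.

```python
import math


def phase_plot(voltage_mv, dt_ms):
    """Return (V, dV/dt) pairs; dV/dt uses a central difference on interior samples."""
    return [
        (voltage_mv[i], (voltage_mv[i + 1] - voltage_mv[i - 1]) / (2.0 * dt_ms))
        for i in range(1, len(voltage_mv) - 1)
    ]


def count_rising_peaks(voltage_mv, dt_ms):
    """Count local maxima of dV/dt that occur while the voltage is rising (dV/dt > 0)."""
    dvdt = [slope for _, slope in phase_plot(voltage_mv, dt_ms)]
    count = 0
    for i in range(1, len(dvdt) - 1):
        if dvdt[i] > 0 and dvdt[i] > dvdt[i - 1] and dvdt[i] >= dvdt[i + 1]:
            count += 1
    return count


def looks_like_spike(voltage_mv, dt_ms):
    """Hypothetical criterion: more than one positive dV/dt peak marks a second rising phase."""
    return count_rising_peaks(voltage_mv, dt_ms) > 1


# Synthetic, noise-free traces (illustrative values only, not data from the study).
DT_MS = 0.01
TIMES_MS = [(i - 100) * DT_MS for i in range(600)]  # -1 ms to 4.99 ms


def alpha_epsp(t_ms, amp_mv=5.0, tau_ms=1.0):
    """Alpha-function EPSP that starts at t = 0 and peaks at t = tau."""
    if t_ms < 0:
        return 0.0
    return amp_mv * (t_ms / tau_ms) * math.exp(1.0 - t_ms / tau_ms)


def small_spike(t_ms, amp_mv=4.0, center_ms=0.6, width_ms=0.05):
    """Brief Gaussian bump standing in for a small spike on the EPSP rising phase."""
    return amp_mv * math.exp(-((t_ms - center_ms) ** 2) / (2.0 * width_ms ** 2))


epsp_only = [alpha_epsp(t) for t in TIMES_MS]
epsp_with_spike = [alpha_epsp(t) + small_spike(t) for t in TIMES_MS]
```

      Plotting the pairs returned by `phase_plot` with V on the x-axis and dV/dt on the y-axis gives the kind of phase plot described above; `looks_like_spike` only automates the check for a second, sharper rising phase.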

      (2) For this and later analysis, a voltage clamp of synaptic inputs would have been a simple alternative to avoid contaminating spikes or shunts by background or voltage-gated conductances. Yet only the current clamp was employed. I can understand that the authors might feel that the voltage clamp is 'flawed' because of the failure to clamp dendrites. But that may have been a good price to pay in this case. The authors should have at least justified their choice of method and detailed its caveats.

      We agree that data collected using voltage-clamp would have eliminated the confound of short action potentials and avoided the influence of voltage-gated conductances. The large-diameter and comparatively simple dendritic trees of octopus cells make them good morphological candidates for reliable voltage clamp. However, as suggested, we were concerned that the abundance of channels open at the neuron’s resting potential would make it difficult to sufficiently clamp dendrites. Ultimately, given the low input resistances of octopus cells and the fast kinetics of excitatory inputs, we determined that poor voltage-clamp conditions were likely to result in unclamped synaptic events with unpredictable distortions in kinetics and attenuation (To et al. 2022; PMID: 34480986; DOI: 10.1016/j.neuroscience.2021.08.024). We therefore chose to focus our efforts on current-clamp.

      Beyond the limits of both current-clamp and voltage-clamp, we chose to leave all conductances that influence EPSP dendritic propagation intact because our model demonstrates that active Kv and leak conductances shape and attenuate synaptic inputs as they travel through the dendritic tree (Supp. Fig. 4F-G). The addition of voltage-clamp recordings would not impact the conclusions we make about EPSP summation at the soma. Future studies will need to focus on a dendrite-centric view of local excitatory and inhibitory summation. For dendrite-centric experiments, dendritic voltage-clamp recordings are well suited to answer that set of questions.

      (3) The modeling raised several concerns. First, there is little presentation of assumptions, and of course, a model is entirely about its assumptions. For example, what excitatory conductance amplitudes were used? The same for inhibitory conductance? How were these values arrived at? The authors note that EPSGs and IPSGs had peaks at 0.3 and 3 ms. On what basis were these numbers obtained? The model's conclusions entirely depend on these values, and no measurements were made here that could have provided them. Parenthetical reference is made to Figure S5 where a range of values are tested, but with little explanation or justification.

      We apologize for not providing this information. We used our octopus neuron model to fit both EPSP and IPSP parameters to match experimental data. We have expanded the methods to include final values for the conductances (lines 649-651), which were adjusted to match experimental values seen in current-clamp recordings. We have also expanded the results section to describe each of the parameters we tuned (lines 203-222). An example of these adjustments is illustrated in Fig. 4F where the magnitude of inhibitory potentials at different conductances (100nS and 1nS) was compared to experimental data over a range of octopus cell input resistance conditions. Kinetic parameters were determined by aligning modeled PSPs to the rise times and full width at half maximum (FWHM) measurements from experiments under control and Kv block conditions. The experimental data for EPSPs and IPSPs that was used to fit the model is shown in Author response image 2 below.

      Author response image 2.
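      For readers who want to reproduce the kind of kinetic measurements mentioned above, rise time and full width at half maximum (FWHM) can be extracted from a single PSP roughly as follows. This is a minimal sketch under stated assumptions: the response does not say how rise time was defined, so the 10-90% convention, the function names, and the `polarity` argument (+1 for EPSPs, -1 for IPSPs) are all choices made here for illustration.

```python
def _rising_crossing(deflection, level, peak_idx):
    """Fractional sample index of the first upward crossing of `level` before the peak."""
    for i in range(1, peak_idx + 1):
        lo, hi = deflection[i - 1], deflection[i]
        if lo < level <= hi:
            return (i - 1) + (level - lo) / (hi - lo)
    raise ValueError("level not crossed on the rising phase")


def _falling_crossing(deflection, level, peak_idx):
    """Fractional sample index of the first downward crossing of `level` after the peak."""
    for i in range(peak_idx + 1, len(deflection)):
        hi, lo = deflection[i - 1], deflection[i]
        if hi >= level > lo:
            return (i - 1) + (hi - level) / (hi - lo)
    raise ValueError("level not crossed on the falling phase")


def psp_metrics(voltage_mv, dt_ms, baseline_mv, polarity=1):
    """Return amplitude, 10-90% rise time, and FWHM of one PSP.

    polarity=+1 treats the event as depolarizing (EPSP); polarity=-1 as
    hyperpolarizing (IPSP). Crossing times are linearly interpolated.
    """
    deflection = [polarity * (v - baseline_mv) for v in voltage_mv]
    peak_idx = max(range(len(deflection)), key=deflection.__getitem__)
    amplitude = deflection[peak_idx]
    t10 = _rising_crossing(deflection, 0.1 * amplitude, peak_idx)
    t90 = _rising_crossing(deflection, 0.9 * amplitude, peak_idx)
    half_up = _rising_crossing(deflection, 0.5 * amplitude, peak_idx)
    half_down = _falling_crossing(deflection, 0.5 * amplitude, peak_idx)
    return {
        "amplitude_mv": amplitude,
        "rise_10_90_ms": (t90 - t10) * dt_ms,
        "fwhm_ms": (half_down - half_up) * dt_ms,
    }
```

      Comparing these two numbers between modeled and recorded PSPs, under control and Kv-block conditions, is one plausible way to carry out the alignment described above.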

      (4) In experiments that combined E and I stimulation, what exactly were time courses of the conductance changes, and how 'synchronous' were they, given the different methods to evoke them? (had the authors done voltage clamp they would know the answers).

      We chose to focus data collection on voltage changes at the soma under physiological conditions to better understand how excitation and inhibition integrate at the somatic compartment. Our conclusions in the combined E and I stimulation experiments require the resting membrane properties of octopus cells to be intact to make physiologically-relevant conclusions. Our current-clamp data includes the critical impact of leak, Kv, and HCN conductances on this computation. Reliable voltage-clamp would necessitate the removal of the Kv and HCN conductances that shape PSP magnitude, shape, and speed. Because it was not necessary to measure the conductances and kinetics of specific channels, we chose to use current-clamp.

      Evoked IPSPs and EPSPs had cell-to-cell variability in their latencies to onset. Somatically-recorded optically-evoked inhibition under pharmacological conditions that changed cable properties had onset latencies between 2.5 and 4.3ms; electrically-evoked excitation under control conditions had latencies between 0.8 and 1.4ms. To overcome cell-to-cell timing variability, we presented a shuffled set of stimulation pairings that had a 3ms range of timings with 200µs intervals. As the evoked excitation and inhibition become more ‘synchronous’, the impact on EPSP magnitude and timing is greatest. Data presented in this paper were for the stimulation pairings that evoked a maximal shift in EPSP timing. On average, this occurred when the optical stimulation began ~1.2ms before electrical stimulation. Stimulation pairing times ranged between a 0ms offset and a 1.8ms offset at the extremes. An example of the shuffled stimulation pairings is shown in Author response image 3 below, and we have included information about the shuffled stimulus in the methods (lines 627-630).

      Author response image 3.
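      The shuffled stimulus set described above (a 3ms range of timings in 200µs steps) can be sketched as follows. This is an illustrative reconstruction, not the acquisition protocol itself: the response does not state where the 3ms window starts, so `start_us=-1000` is a placeholder, and the sign convention (positive offset means the optical stimulus leads the electrical stimulus) is an assumption. Offsets are kept in integer microseconds to avoid floating-point drift.

```python
import random


def shuffled_offsets_us(start_us, span_us=3000, step_us=200, seed=None):
    """Return every optical-to-electrical offset in the window, in random order.

    Offsets are integers in microseconds. A 3000 us span sampled every
    200 us yields 16 pairings, endpoints included.
    """
    offsets = list(range(start_us, start_us + span_us + 1, step_us))
    random.Random(seed).shuffle(offsets)
    return offsets


# Placeholder window start; the real value is not given in the response.
offsets = shuffled_offsets_us(start_us=-1000, seed=0)
```

      Each cell would then be tested at every offset in `offsets`, and the pairing that produces the largest shift in EPSP timing selected afterwards.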

      (5) Figure 4G is confusing to me. Its point, according to the text, is to show that changes in membrane properties induced by a block of Kv and HCN channels would not be expected to alter the amplitudes of EPSCs and IPSCs across the dendritic expanse. Now we are talking about currents (not shunting effects), and the presumption is that the blockers would alter the resting potential and thus the driving force for the currents. But what was the measured membrane potential change in the blockers? Surely that was documented. To me, the bigger concern (stated in the text) is whether the blockers altered exocytosis, and thus the increase in IPSP amplitude in blockers is due BOTH to loss of shunting and increase in presynaptic spike width. Added to this is that 4AP will reduce the spike threshold, thus allowing more ChR2-expressing axons to reach the threshold. Figure 4G does not address this point.

      These are valuable points that motivated us to improve the clarity of this figure and the corresponding text. We discussed two separate points in this paragraph and were not clear. Our intention with Figure 4G was to address concerns that using pharmacological blockers changes driving forces and may confound the measured change in magnitude of postsynaptic potentials. Membrane potentials hyperpolarized by approximately 8-10 mV after application of blockers. We corrected for this effect by adding a holding current to depolarize the neuron to its baseline resting potential. Text in the results (lines 187-190) and figure legends have been changed to clarify these points.

      We also removed any discussion of presynaptic effects from this portion of the text because our description was incomplete and we did not directly collect data related to these claims. We originally wrote, “While blocking Kv and HCN allowed us to reveal IPSPs at the soma, 4-AP increases the duration of the already unphysiological ChR2-evoked presynaptic action potential (Jackman et al., 2014; DOI: 10.1523/jneurosci.4694-13.2014), resulting in altered release probabilities and synaptic properties, amongst other caveats (Mathie et al., 1998; DOI: 10.1016/S0306-3623(97)00034-7)”. Ultimately, effects on exocytosis, presynaptic excitability, or release probability are only relevant for the experiments presented in Figure 4. Figure 4 serves as evidence that synaptic release of glycine elicits strychnine-sensitive inhibitory postsynaptic potentials in octopus cells. Concerns of presynaptic effects do not carry over to the data presented in Figure 5, as Kv and HCN were not blocked in these experiments. Therefore, we have removed this portion of the text.

      (6) Figure 5F is striking as the key piece of biological data that shows that inhibition does reduce the amplitude of "EPSPs" in octopus cells. Given the other uncertainties mentioned, I wondered if it makes sense as an example of shunting inhibition. Specifically, what are the relative synaptic conductances, and would you predict a 25% reduction given the actual (not modeled) values?

      We agree that both shunting and hyperpolarizing inhibition could play a role in the measured EPSP changes. Because we focused data collection on voltage changes at the soma under physiological conditions, we cannot calculate the relative synaptic conductances. Together, our experimental current-clamp results paired with estimates from the model provide compelling evidence for the change we observe in EPSPs. The relative weights of the synaptic conductances are a very interesting question, but this information is not necessary to answer the questions posed in this study, namely the impact of dendritic inhibition on the arrival of EPSPs in the soma.

      (7) Some of the supplemental figures, like 4 and 5, are hardly mentioned. Few will glean anything from them unless the authors direct attention to them and explain them better. In general, the readers would benefit from more complete explanations of what was done.

      We apologize for not fully discussing these figures in the results text. We have fully expanded the results section to detail the experiments and results presented in the supplement (lines 203-238).

      Reviewer #2 (Public Review):

      Summary:

      Kreeger et al. provided mechanistic evidence for flexible coincidence detection of auditory nerve synaptic inputs by octopus cells in the mouse cochlear nucleus. The octopus cells are specialized neurons that can fire repetitively at very high rates (> 800 Hz in vivo), yield responses dominated by the onset of sound for simple stimuli, and integrate auditory nerve inputs over a wide frequency span. Previously, it was thought that octopus cells received little inhibitory input, and their integration of auditory input depended principally on temporally precise coincidence detection of excitatory auditory nerve inputs, coupled with a low input resistance established by high levels of expression of certain potassium channels and hyperpolarization-activated channels.

      In this study, the authors used a combination of numerous genetic mouse models to characterize synaptic inputs and enable optogenetic stimulation of subsets of afferents, fluorescent microscopy, detailed reconstructions of the location of inhibitory synapses on the soma and dendrites of octopus cells, and computational modeling, to explore the importance of inhibitory inputs to the cells. They determined through assessment of excitatory and inhibitory synaptic densities that spiral ganglion neuron synapses are densest on the soma and proximal dendrite, while glycinergic inhibitory synaptic density is greater on the dendrites compared to the soma of octopus cells. Using different genetic lines, the authors further elucidated that the majority of excitatory synapses on the octopus cells are from type 1a spiral ganglion neurons, which have low response thresholds and high rates of spontaneous activity. In the second half of the paper, the authors employed electrophysiology to uncover the physiological response of octopus cells to excitatory and inhibitory inputs. Using a combination of pharmacological blockers, in vitro cellular recordings, and computational modeling, the authors conclude that glycine in fact evokes IPSPs in octopus cells; these IPSPs are largely shunted by the high membrane conductance of the cells under normal conditions and thus were not clearly evident in prior studies. Pharmacological experiments point towards a specific glycine receptor subunit composition. Lastly, Kreeger et al. demonstrated with in vitro recordings and computational modeling that octopus cell inhibition modulates the amplitude and timing of dendritic spiral ganglion inputs to octopus cells, allowing for flexible coincidence detection.

      Strengths:

      The work combines a number of approaches and complementary observations to characterize the spatial patterns of excitatory and inhibitory synaptic input, and the type of auditory nerve input to the octopus cells. The combination of multiple mouse lines enables a better understanding of, and helps to define, the pattern of synaptic convergence onto these cells. The electrophysiology provides excellent functional evidence for the presence of the inhibitory inputs, and the modeling helps to interpret the likely functional role of inhibition. The work is technically well done and adds an interesting dimension related to the processing of sound by these neurons. The paper is overall well written, and the experimental tests are well-motivated and easy to follow. The discussion is reasonable and touches on both the potential implications of the work as well as some caveats.

      Weaknesses:

      While the conclusions presented by the authors are solid, a prominent question remains regarding the source of the glycinergic input onto octopus cells. In the discussion, the authors claim that there is no evidence for D-stellate, L-stellate, and tuberculoventral cell (all local inhibitory neurons of the ventral and dorsal cochlear nucleus) connections to octopus cells, and cite the relevant literature. An experimental approach will be necessary to properly rule out (or rule in) these cell types and others that may arise from other auditory brainstem nuclei. Understanding which cells provide the inhibitory input will be an essential step in clarifying its roles in the processing of sound by octopus cells.

      We are glad that the reviewer agrees with the conclusions we have made and is interested in learning more about how these findings impact sound processing. We agree that defining the source of inhibition will dramatically shape our understanding of the computation octopus cells are making. However, this is not an easy task, given the small size of the octopus cell area, and will involve considerable additional work. Since the overall findings do not depend on knowing the source of inhibition, we have instead re-written the discussion to clarify the lack of evidence for intrinsic inhibitory inputs to octopus cells, in addition to presenting likely candidates. As genetic profiles of cochlear nucleus and other auditory brainstem neurons become available, we intend to make and utilize genetic mouse models to answer questions like this.

      The authors showed that type 1a SGNs are the most abundant inputs to octopus cells via microscopy. However, in Figure 3 they compare optical stimulation of all classes of ANFs, then compare this against stimulation of type 1b/c ANFs. While a difference in the paired-pulse ratio (and therefore, likely release probability) can be inferred by the difference between Foxg1-ChR2 and Ntng1-ChR2, it would have been preferable to have specific data with selective stimulation of type 1a neurons.

      We agree that complete genetic access to only the Ia population would have been the preferable approach, but we did not have an appropriate line when beginning these experiments. Because our results did not suggest a meaningful difference between the populations, we did not pursue further investigation once a line was available.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Besides the points mentioned in the main review:

      Minor

      (1) I really like the graphics and the immunohistological presentation.

      (2) Lines 316-319 say that octopus cells lack things like back-propagating spikes and dendritic Ca spikes. How do you know this?

      This statement was intended to be a summary of suggestions from the literature and lacked references and context as written. We have rewritten this section and clarified that our hypothesis was formed from data found in the literature (lines 334-337).

      (3) Spectrograms of Figure 6A...where were these data obtained?

      We recorded and visualized human-generated rhythmic tapping and high-frequency squeaking sounds using Audacity. The visualizations of rhythmic tapping and imitated vocalizations are meant to show two different types of multi-frequency stimuli we hypothesize would result in somatic summation within an octopus cell’s spike integration window, despite differences in timing. We rewrote the figure legend to explain more clearly what is shown and how it relates to the model in Figure 6.

      (4) 'on-path' and 'off-path' seem like jargon that may not be clear to the average reader.

      Thank you for pointing out our use of unapproachable jargon. We have replaced the terms in the figure with “proximal” and “distal” inhibition. In the main text, we now describe on-path and off-path together as the effect of the location of dendritic inhibition on somatically recorded EPSPs.

      (5) The paper could benefit from a table of modeled values.

      We have added specific details about the modelling in the text and clarified which modeled values were referenced from previous computational models and which were tuned to fit experimental data. Since most values were taken from a referenced publication, we did not add a table and instead point readers towards that source.

      (6) Figure S4A-C what currents were delivered to the modeled cells?

      The model cells were injected with a -0.8 nA DC current for 300 ms in current clamp mode. This information has been added to the figure legend.

      (7) In that figure "scaling factors" scale exactly which channels?

      The scaling factor scales the low-voltage-activated K<sup>+</sup> (ḡ<sub>KLT</sub>), high-threshold K<sup>+</sup> (ḡ<sub>KHT</sub>), fast transient K<sup>+</sup> (ḡ<sub>KA</sub>), and hyperpolarization-activated cyclic nucleotide-gated (HCN; ḡ<sub>h</sub>) conductances, but not the fast Na<sup>+</sup> (ḡ<sub>Na</sub>) or leak K<sup>+</sup> (ḡ<sub>leak</sub>) conductances. This information has been added to the text (lines 205-208 and 646-653).
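      The selective scaling described above can be written out as a short sketch. The dictionary keys and the baseline values (arbitrary units) are placeholders invented for illustration; they are not the conductances used in the model, which are given in the manuscript and the publication it references.

```python
# Conductances the scaling factor applies to, per the response above.
SCALED_CHANNELS = ("KLT", "KHT", "KA", "h")


def scale_conductances(g_bar, factor):
    """Return a copy of g_bar with only the Kv and HCN maximal conductances scaled.

    Na and leak conductances are passed through unchanged.
    """
    return {
        name: value * factor if name in SCALED_CHANNELS else value
        for name, value in g_bar.items()
    }


# Hypothetical baseline values in arbitrary units (placeholders, not model parameters).
baseline = {"KLT": 1.0, "KHT": 1.0, "KA": 1.0, "h": 1.0, "Na": 1.0, "leak": 1.0}
reduced = scale_conductances(baseline, 0.5)
```

      A factor below 1 mimics a partial reduction of Kv and HCN conductance while leaving spike-generating Na<sup>+</sup> and leak conductances untouched.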

      (8) In performing and modeling Kv/HCN block, do you know how complete the level of the block is?

      Since we cannot assess how complete the level of block is, we have changed the language in the text to clarify that we are reducing Kv and HCN channel conductance to the degree needed to increase resistance of the neuron (line 185).

      (9) More on this Figure S4. It is hardly referred to in the text except to say that it supports that blocking the Kv/HCN channels will enhance the IPSP. Given how large the figure is, can you offer more of a conclusion than that? Also, in the synaptic model in that figure, the IPSCs are presumably happening in current-clamp conditions, and the reduction in amplitude of the IPSC (as opposed to the increase in IPSP) is due to hyperpolarization. Can you simply state that so readers can track what this figure is showing? Other similar things: what is a transfer impedance? How is it measured? What do we take from the analysis?

      We have elaborated on our description of both Supp. Fig. 4 and Supp. Fig. 5 in the results section of the text (lines 203-238).

      (10) Figure S5 also needs a better explanation. E.g., in C-D, what does 'average' mean? The gray is an SD of this average? You modeled a range of values...but which ones are physiological? To me, this is a key point.

      We have elaborated on our description of both Supp. Fig. 4 and Supp. Fig. 5 in the results section of the text (lines 203-238).

      Reviewer #2 (Recommendations For The Authors):

      General:

      The images and 3-D reconstructions are visually stunning, but they are not colorblind-friendly and in some cases, hard to distinguish. This shows up particularly in the green and blue colors used in Figure 1. Also, better representative images could be used for Figure 1B.

      Thank you for pointing out that blue and green were difficult to distinguish in Figure 1H. We have outlined the green inhibitory puncta in this image to make them more distinguishable. We have also increased the resolution of the image in Figure 1B for better clarity. All other colors are selected from Wong, 2011 (PMID: 21850730; DOI: https://doi.org/10.1038/nmeth.1618).

      Supplemental Figure 1D: The low-power view is good to have, but the CN is too small and the image appears a bit noisy. An inset showing the CN on a larger scale (higher resolution image?) would be more convincing. In this image, I see what appear to be cells in the DCN labeled, which calls into question the purity of the source of optogenetic synaptic activation. It is also difficult to tell whether there are other cells labeled in the VCN. Such inputs would still be minor, but it would be good to be very clear about the expression pattern.

      To offer more information about the activity of the Ntng1<sup>Cre</sup> line in other regions of the auditory system, we increased the resolution of the image included in Supp. Fig. 1D and have also included an additional image (Supp. Fig. 1E) of a coronal section of the cochlear nucleus complex with Ntng1-tdT labelling. This image provides additional context for the cells labeled in the DCN. The text in the figure legend has been changed to clarify that some cells in the DCN were labeled (lines 118-120).

      We agree that in the Ntng1<sup>Cre</sup> experiments, there is the possibility of minor contamination from excitatory cells that express ChR2 outside of the spiral ganglion. This is also true for our Foxg1<sup>Cre</sup> and Foxg1<sup>Flp</sup> experiments, because these lines label cortical cells in addition to cochlear cells. However, we do not observe direct descending inputs from the cortex into the PVCN, making contamination from other Foxg1<sup>Cre</sup>-positive neurons unlikely. While non-cochlear inputs from the Ntng1<sup>Cre</sup> line are possible, evidence from both lines gives us confidence that we are not capturing inputs to octopus cells outside the cochlea. Central axons from Type I spiral ganglion neurons have VGLUT1+ synaptic terminals. When comparing the overlap between VGLUT1+ terminals and Foxg1-tdT labelling, we see full coverage. That is, all VGLUT+ terminals on octopus cells are co-labelled by Foxg1<sup>Cre</sup>-mediated expression of tdTomato. An example image is shown below. Here, an octopus cell soma is labeled with blue fluorescent Nissl stain and inputs to the cochlear nucleus complex are labeled with Foxg1<sup>Cre</sup>-dependent tdTomato (Foxg1-tdT; magenta). We have also immunolabeled for VGLUT1 puncta in green. This eliminates the possibility that VGLUT+ cells from outside the cochlea and cortex are sources of excitation to octopus cells.

      Author response image 4.

      Further, we have looked at expression of Ntng1-tdT and Foxg1-EYFP together in the octopus cell area. An example image is shown below. All Ntng1-tdT+ fibers (magenta) are also Foxg1-EYFP+ (green), suggesting that all Ntng1<sup>Cre</sup>-targeted inputs to octopus cells are a part of the Foxg1<sup>Cre</sup>-targeted input population, which are very likely to only be from the cochlea. We have expanded the results section to include information about the overlap in expression driven by the Ntng1<sup>Cre</sup> and Foxg1<sup>Flp</sup> lines.

      Author response image 5.

      Supplemental Figure 2 G: These are a bit hard to read. Perhaps use a different image, or provide a reference outline drawing telling us what is what.

      We have used a different image with a Thy1-YFP labeled octopus cell for clarity.

      In some places, the term "SGN" is used when referencing the axons and terminals within the CN, and without some context, this was occasionally confusing (SGN would seem to refer to the cell bodies). In some places in the text, it may be preferable to separate SGN, auditory nerve fibers (ANFs), and terminals, as entities for clarity.

      In order to make the study accessible to a broad neuroscience audience, we refer to the neurons of the spiral ganglion and their central axon projections using one name. We understand why, for those well acquainted with the auditory periphery, condensing terminology may feel awkward. However, for those readers unfamiliar with the anatomy of the cochlea and auditory nerve, we feel that the use of “SGN central axon” makes it clear that the “auditory nerve fibers” come from neurons in the spiral ganglion. This is clarified in the first paragraph of the introduction (lines 29-31) and in the methods (line 533).

      Specific: Numbers refer to the line numbers on the manuscript.

      L29-31: Cochlear nucleus neurons are more general in their responses than this sentence indicates. While we can all agree that they are specialized to carry (or improve upon) the representation of these specific features of sound, they also respond more generally to sounds that might not have specific information in any of these domains. They are not silos of neural computation, and their outputs become mixed and "re-represented" well before they reach the auditory cortex. Octopus cells are no exception to this. I suggest striking most of the first paragraph, and instead using the first sentence to lead into the second paragraph, and putting the last sentence (of the current first paragraph) at the end of the second (now first) paragraph.

      We agree with this assessment and have made major changes to the introduction in line with these suggestions.

      L33-46: A number of points in this paragraph need references (exp. line 41).

      We agree and have added references accordingly.

      L43: Not sure what is meant by "fire at the onset of the sound, breaking it up into its frequency components"?

      We changed this text as part of a major reworking of the introduction.

      L47-66: Again more citations are needed (at the end of sentence at line 55, probably moving some of the citations from the next sentence up).

      We agree and have added references accordingly.

      L51: The consistent orientation of octopus cell dendrites across the ANFs has been claimed in the literature (as mentioned here), but there are some (perhaps problematic - plane of sectioning?) counterexamples from the older Golgi-stained images, and even amongst intracellularly stained cells (for example see Recio-Spinoso and Rhode, 2020). This is important with regards to the broader hypothesis regarding traveling-wave compensation (e.g., McGinley et al; but also many others); if the cells are not all in the appropriate orientation then such compensation may be problematic. Likewise, the data from Lu et al., 2022, points towards a range of sensitivity to frequency-swept stimuli, some of which work in opposition to the traveling wave compensation hypothesis. It would seem that with the Thy1 mice, you have an opportunity to clarify the orientation. Figures 1A and 2A show a consistent dendritic orientation, assuming that these drawings are reconstructions of the cells as they were actually oriented in the tissue. Can you either comment on this or provide clearer evidence?

      We are happy to offer more information about the appearance of octopus cells in our preparations. In our hands, sparsely labeled octopus cells in Thy1-YFP-H mice show consistent dendritic orientation when visualized in a 15-degree parasagittal plane, with the most diversity apparent in cells with somas located more dorsally in the octopus cell area. We hypothesize that this is due to the limited area through which the central projections of spiral ganglion neurons (i.e. ANFs) must pass before they enter the dorsal cochlear nucleus and continue their tonotopic organization in that area.

      A caveat to studies without physiological or genetic identification of octopus cells is the assumption that all neurons in the octopus cell area are octopus cells. We find, especially along the borders of the octopus cell area, that stellate cells can be seen amongst octopus cells. Because stellate cell dendrites are not oriented like octopus cell dendrites, any stellate cells misidentified as octopus cells would appear to have poorly oriented dendrites. This may explain why some studies report this finding. In addition, it can be difficult to assess tonotopic organization because of the 3D trajectory of tightly bundled axons, which cannot be captured by a single section plane. Although a parasagittal plane of sectioning captures the tonotopic axis in one part of the octopus cell area, that same plane may be perpendicular to the axis at the opposing end.

      L67: canonical -> exceptional.

      Thank you for the suggestion. We have made this change in the introduction.

      L127: This paragraph was confusing on first reading. I don't think Supplemental Figure 1D shows the restricted pattern of expression very clearly. The "restricted to SGNs" might be better as "restricted to auditory nerve fibers" (except in the DCN, where there seem to be some scattered small cells?). A higher magnification image of the CN, but lower magnification than in panel E, would be helpful here.

      To avoid confusion, we have re-written this paragraph (lines 117-127) and included a higher magnification image of the CN in a revised Supp. Fig. 1.

      L168: Here, perhaps say ANFs instead of SGNs.

      As above, we have decided to describe ANFs as SGN central axons to make the anatomy more accessible to people unfamiliar with cochlear anatomy.

      L201-204: The IPSPs are surprisingly slow (Figures 5B, C), especially given the speed of the EPSPs/EPSCs in these cells. This is reminiscent of the asymmetry between EPSC and IPSC kinetics in bushy cells (Xie and Manis, 2014). The kinetics used in the model (3 ms; mentioned on line 624) however seem a bit arbitrary and no data is provided for the selection of that value. Were there any direct measurements of the IPSC kinetics (all of the traces in the paper are in the current clamp) that were used to justify this value?

      The kinetics of the somatically recorded IPSPs are subject to the effects of our pharmacological manipulations. EPSPs measured at the soma under control conditions are small in amplitude and rapid. With pharmacological reduction of HCN and Kv channels, EPSPs are larger and slower (please see figure in response to a similar question posed by Reviewer #1). We expect that the same change occurs in IPSP kinetics under pharmacological conditions. Our justification of the kinetics has been expanded in the methods section (lines 641-661).

      L594: Technically, this is a -11 mV junction potential, but thanks for including the information.

      We have corrected this in the text (line 618). Thank you for the close reading of all experimental and methodological details.

      L595: The estimated power of the LED illumination at the focal plane should be measured and indicated here.

      We measured the power of the LED illumination at the focal plane using a PM100D Compact Power and Energy Meter Console (Thorlabs), a S120C Photodiode Power Sensor (Thorlabs), and a 1000µm diameter Circular Precision Pinhole (Thorlabs). Light intensity at the focal plane ranged between 1.9 and 4.1mW/mm<sup>2</sup>, corresponding to 6% and 10% intensity on the Colibri5 system. We have reported these measurements in the results section (Lines 621-622).

      L609: One concern about the model is that the integration time of 25 microseconds is rather close to the relative shifts in latency. While I doubt it will make a difference (except in the number), it may be worth verifying (spot checks, at least) that running the model with a 5 or 10-microsecond step yields a similar pattern of latency shifts (e.g., Supplementary Figure 5, Figure 5).

      Also, it is not clear what temperature the model was executed at (I would presume 35C); this needs to be given, and channel Q10's listed.

      We realize that additional information is needed to fully understand the model and have added this to the results and the methods. The synaptic mechanism (.mod) files were obtained from Manis and Campagnola (2018) (PMID: 29331233; DOI: https://doi.org/10.1016/j.heares.2017.12.017). Q10 (3) and temperature (22°C) were also matched to parameters from Manis and Campagnola (2018). Because temperature is a critical factor for channel kinetics, we verified that our primary results remain consistent under conditions using a temperature of 35°C and a time step of 5µs, depicted below. Panel A illustrates the increase in IPSP as a function of glycine conductance under Kv+HCN block conditions at 35°C. As at 22°C, an increase in IPSP magnitude is absent in the control condition at 35°C. Panels B and C provide a direct comparison between the initial (i.e. 22°C) and suggested (i.e. 35°C) simulation conditions. Again we found that temperature does not have a major impact on the amplitude of IPSPs. Thus, results at 35°C do not change the conclusions we make from the model.
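
      For readers who want a feel for the size of the temperature effect discussed above, the conventional Q10 scaling of rate constants can be sketched as follows. This is an illustration only, assuming the usual form Q10^(ΔT/10); the function name `q10_factor` is hypothetical, and the model's own mechanism files may apply the scaling relative to a different reference temperature.

```python
def q10_factor(q10, temp_from_c, temp_to_c):
    """Multiplicative change in a rate constant between two temperatures,
    assuming the conventional Q10 form: q10 ** (delta_T / 10)."""
    return q10 ** ((temp_to_c - temp_from_c) / 10.0)


# Values quoted in the response: Q10 of 3, simulations at 22 and 35 degrees C.
factor = q10_factor(3.0, 22.0, 35.0)
print(f"rate constants scale by roughly {factor:.2f}x between 22 and 35 C")
```

      Under these assumptions, channel kinetics speed up roughly fourfold between the two simulated temperatures, which is why a spot check at 35°C is a meaningful test of the model's conclusions.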

      Author response image 6.

      The nominal conductance densities should at least be provided in a table (supplemental, in addition to including them in the deposited code). The method for "optimization" of the conductance densities to match the experimental recordings needs to be described; the parameter space can be quite large in a model such as this. The McGinley reference needs a number.

      We added a more thorough description of modeling parameters and justification of choices in the methods section of the text (lines 641-661). We have also added a reference number to the McGinley 2012 reference in the text.

      I think this is required by the journal:

      The model code, test results, and simulation results should be deposited in a public resource (Github would be preferable, but dryad, Zenodo, or Figshare could work), and the URL/doi for the resource provided in the manuscript. This includes the morphology swc/hoc file. The code should be in a form, and with a description, that readily allows an interested party with appropriate skills to download it and run it to generate the figures.

      We will upload the code and all associated simulation files to the ModelDB repository upon publication.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Response to Reviewer #1:

      Thank you for the careful reading and the positive evaluation of our manuscript. As you mentioned, the present study tried to address the question of how lost genomic functions could be compensated by evolutionary adaptation, indicating the potential mechanism of "constructive" rather than "destructive" evolution. Thank you for the instructive comments that helped us to improve the manuscript. We sincerely hope the revised manuscript and the following point-by-point response address your concerns.

      • Line 80 "Growth Fitness" is this growth rate?

      Yes. The sentence was revised as follows.

      (L87-88) “The results demonstrated that most evolved populations (Evos) showed improved growth rates, in which eight out of nine Evos were highly significant (Fig. 1B, upper).”

      • Line 94 a more nuanced understanding of r/K selection theory, allows for trade-ups between R and K, as well as trade-offs. This may explain why you did not see a trade-off between growth and carrying capacity in this study. See this paper https://doi.org/10.1038/s41396-023-01543-5. Overall, your evos lineages evolved higher growth rates and lower carrying capacity (Figures 1B, C, E). If selection was driving the evolution of higher growth rates, it may have been that there was no selective pressure to maintain high carrying capacity. This means that the evolutionary change you observed in carrying capacity may have been neutral "drift" of the carrying capacity trait, during selection for growth rate, not because of a trade-off between R and K. This is especially likely since carrying capacity declined during evolution. Unless the authors have convincing evidence for a tradeoff, I suggest they remove this claim.

      • Line 96 the authors introduce a previous result where they use colony size to measure growth rate, this finding needs to be properly introduced and explained so that we can understand the context of the conclusion.

      • Line 97 This sentence "the collapse of the trade-off law likely resulted from genome reduction." I am not sure how the authors can draw this conclusion, what is the evidence supporting that the genome size reduction causes the breakdown of the tradeoff between R and K (if there was a tradeoff)?

      Thank you for the reference information and the thoughtful comments. The recommended paper was newly cited, and the description of the trade-off collapse was deleted. Accordingly, the corresponding paragraph was rewritten as follows.

      (L100-115) “Intriguingly, a positive correlation was observed between the growth fitness and the carrying capacity of the Evos (Fig. 1D). It was somehow consistent with the positive correlations between the colony growth rate and the colony size of a genome-reduced strain 11 and between the growth rates and the saturated population size of an assortment of genome reduced strains 13. Nevertheless, the negative correlation between growth rate and carrying capacity, known as the r/K selection30,31 was often observed as the trade-off relationship between r and K in the evolution and ecology studies 32 33,34. As the r/K trade-off was proposed to balance the cellular metabolism that resulted from the cost of enzymes involved 34, the deleted genes might play a role in maintaining the metabolism balance for the r/K correlation. On the other hand, the experimental evolution (i.e., serial transfer) was strictly performed within the exponential growth phase; thus, the evolutionary selection was supposed to be driven by the growth rate without selective pressure to maintain the carrying capacity. The declined carrying capacity might have been its neutral "drift" but not a trade-off to the growth rate. Independent and parallel experimental evolution of the reduced genomes selecting either r or K is required to clarify the actual mechanisms.”

      • Line 103 Genome mutations. The authors claim that there are no mutations in parallel but I see that there is a 1199 base pair deletion in eight of the nine evo strains (Table S3). I would like the author to mention this and I'm actually curious about why the authors don't consider this parallel evolution.

      Thank you for your careful reading. According to your comment, we added a brief description of the 1199-bp deletion detected in the Evos as follows.

      (L119-122) “The number of mutations largely varied among the nine Evos, from two to 13, and no common mutation was detected in all nine Evos (Table S3). A 1,199-bp deletion of insH was frequently found in the Evos (Table S3, highlighted), which well agreed with its function as a transposable sequence.”

      • Line 297 Please describe the media in full here - this is an important detail for the evolution experiment. Very frustrating to go to reference 13 and find another reference, but no details of the method. Looked online for the M63 growth media and the carbon source is not specified. This is critical for working out what selection pressures might have driven the genetic and transcriptional changes that you have measured. For example, the parallel genetic change in 8/9 populations is a deletion of insH and tdcD (according to Table S3). This is acetate kinase, essential for the final step in the overflow metabolism of glucose into acetate. If you have a very low glucose concentration, then it could be that there was selection to avoid fermentation and devote all the pyruvate that results from glycolysis into the TCA cycle (which is more efficient than fermentation in terms of ATP produced per pyruvate).

      Sorry for the missing information on the medium composition, which was additionally described in the Materials and Methods. The glucose concentration in M63 was 22 mM, which was supposed to be enough for bacterial growth. Thank you for your intriguing thinking about linking the medium component to the genome mutation-mediated metabolic changes. As there was no experimental result regarding the biological function of gene mutation in the present study, please allow us to address this issue in our future work.

      (L334-337) “In brief, the medium contains 62 mM dipotassium hydrogen phosphate, 39 mM potassium dihydrogen phosphate, 15 mM ammonium sulfate, 15 μM thiamine hydrochloride, 1.8 μM Iron (II) sulfate, 0.2 mM magnesium sulfate, and 22 mM glucose.”

      • Line 115. I do not understand this argument "They seemed highly related to essentiality, as 11 out of 49 mutated genes were essential (Table S3)." Is this a significant enrichment compared to the expectation, i.e. the number of essential genes in the genome? This enrichment needs to be tested with a Hypergeometric test or something similar.

      • Also, "As the essential genes were known to be more conserved than nonessential ones, the high frequency of the mutations fixed in the essential genes suggested the mutation in essentiality for fitness increase was the evolutionary strategy for reduced genome." I do not think that there is enough evidence to support this claim, and it should be removed.

      Sorry for the unclear description. Yes, the mutations were significantly enriched in the essential genes (11 out of 45 genes) compared to the essential genes in the whole genome (286 out of 3290 genes). The improper description linking the mutation in essential genes to the fitness increase was removed, and an additional explanation on the ratio of essential genes was newly supplied as follows.

      (L139-143) “The ratio of essential genes in the mutated genes was significantly higher than in the total genes (286 out of 3290 genes, Chi-square test p=0.008). As the essential genes were determined according to the growth35 and were known to be more conserved than nonessential ones 36,37, the high frequency of the mutations fixed in the essential genes was highly intriguing and reasonable.”
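
      The reviewer's suggested hypergeometric check can be sketched as below. This is a minimal illustration, not the authors' analysis: the authors report a Chi-square test, so the p value here will differ from theirs. The counts are the ones quoted in this exchange (11 essential genes among 49 mutated genes; 286 essential genes among 3290 genes), and the function name `hypergeom_upper_tail` is hypothetical. The sketch assumes the mutated genes are a subset of the 3290 genes.

```python
from math import comb


def hypergeom_upper_tail(k, n_drawn, n_success, n_total):
    """P(X >= k) when n_drawn genes are sampled without replacement from
    n_total genes, of which n_success are essential."""
    denom = comb(n_total, n_drawn)
    upper = min(n_drawn, n_success)
    numer = sum(
        comb(n_success, i) * comb(n_total - n_success, n_drawn - i)
        for i in range(k, upper + 1)
    )
    # Python divides arbitrarily large integers exactly before rounding.
    return numer / denom


# Counts as quoted above; change them if the gene totals are revised.
p = hypergeom_upper_tail(11, 49, 286, 3290)
print(f"one-sided enrichment p = {p:.4g}")
```

      Because the counts are passed as arguments, the same function can be rerun with 45 instead of 49 mutated genes, the other figure given in the response.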

      • Line 124 Regarding the mutation simulations, I do not understand how the observed data were compared to the simulated data, and how conclusions were drawn. Can the authors please explain the motivation for carrying out this analysis, and clearly explain the conclusions?

      Random simulation was additionally explained in the Materials and Methods and the conclusion of the random simulation was revised in the Results, as follows.

      (L392-401) “The mutation simulation was performed with Python in the following steps. A total of 65 mutations were randomly generated on the reduced genome, and the distances from the mutated genomic locations to the nearest genomic scars caused by genome reduction were calculated. Subsequently, Welch's t-test was performed to evaluate whether the distances calculated from the random mutations were significantly longer or shorter than those calculated from the mutations that occurred in Evos. The random simulation, distance calculation, and statistic test were performed 1,000 times, which resulted in 1,000 p values. Finally, the mean of p values (μp) was calculated, and a 95% reliable region was applied. It was used to evaluate whether the 65 mutations in the Evos were significantly close to the genomic scars, i.e., the locational bias.”

      (L148-157) “Random simulation was performed to verify whether there was any bias or hotspot in the genomic location for mutation accumulation due to the genome reduction. A total of 65 mutations were randomly generated on the reduced genome (Fig. 2B), and the genomic distances from the mutations to the nearest genome reduction-mediated scars were calculated. Welch's t-test was performed to evaluate whether the genomic distances calculated from random mutations significantly differed from those from the mutations accumulated in the Evos. As the mean of p values (1,000 times of random simulations) was insignificant (Fig. 2C, μp > 0.05), the mutations fixed on the reduced genome were either closer or farther to the genomic scars, indicating there was no locational bias for mutation accumulation caused by genome reduction.”
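
      The simulation steps described in the two passages above can be sketched roughly as follows. This is a sketch under stated assumptions, not the authors' script: the genome is treated as circular, the scar coordinates and observed mutation positions are supplied by the caller, and the p value uses a normal approximation to the t distribution instead of an exact Welch's t-test, to keep the example free of external libraries. All function names are hypothetical.

```python
import bisect
import math
import random
import statistics


def nearest_scar_distance(pos, scars, genome_len):
    """Distance from pos to the closest scar on a circular genome.
    `scars` must be sorted in ascending order."""
    i = bisect.bisect_left(scars, pos)
    right = scars[i % len(scars)]   # wraps to the first scar past the end
    left = scars[i - 1]             # wraps to the last scar when i == 0
    return min((right - pos) % genome_len, (pos - left) % genome_len)


def welch_p_value(a, b):
    """Two-sided p value from Welch's t statistic.
    Assumption: normal approximation in place of the t distribution."""
    se2 = statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se2)
    return 2.0 * (1.0 - statistics.NormalDist().cdf(abs(t)))


def mean_p_value(observed_positions, scars, genome_len,
                 n_mut=65, n_sim=1000, seed=0):
    """Mean p value over n_sim rounds of n_mut randomly placed mutations."""
    rng = random.Random(seed)
    scars = sorted(scars)
    observed = [nearest_scar_distance(p, scars, genome_len)
                for p in observed_positions]
    p_values = []
    for _ in range(n_sim):
        simulated = [
            nearest_scar_distance(rng.randrange(genome_len), scars, genome_len)
            for _ in range(n_mut)
        ]
        p_values.append(welch_p_value(simulated, observed))
    return statistics.fmean(p_values)
```

      A mean p value above 0.05 would be read, as in the text, as no detectable difference between the observed and random distances to the scars.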

      • Line 140 The authors should give some background here - explain the idea underlying chromosomal periodicity of the transcriptome, to help the reader understand this analysis.

      • Line 142 Here and elsewhere, when referring to a method, do not just give the citation, but also refer to the methods section or relevant supplementary material.

      The analytical process (references and methods) was described in the Materials and Methods, and the reason we performed the chromosomal periodicity was added in the Results as follows.

      (L165-172) “As the E. coli chromosome was structured, whether the genome reduction caused the changes in its architecture, which led to the differentiated transcriptome reorganization in the Evos, was investigated. The chromosomal periodicity of gene expression was analyzed to determine the structural feature of genome-wide pattern, as previously described 28,38. The analytical results showed that the transcriptomes of all Evos presented a common six-period with statistical significance, equivalent to those of the wild-type and ancestral reduced genomes (Fig. 3A, Table S4).”

      • Line 151 "The expression levels of the mutated genes were higher than those of the remaining genes (Figure 3B)"- did this depend on the type of mutation? There were quite a few early stops in genes, were these also more likely to be expressed? And how about the transcriptional regulators, can you see evidence of their downstream impact?

      Sorry, we did not investigate the detailed regulatory mechanisms of the 49 mutated genes, which is beyond the scope of the present study. Fig. 3B shows a statistical comparison between 3225 and 49 genes. It does not mean that every mutated gene was expressed at a higher level than the others. The following sentences were added to address your concern.

      (L181-185) “As the regulatory mechanisms or the gene functions were supposed to be disturbed by the mutations, the expression levels of individual genes might have been either up- or down-regulated. Nevertheless, the overall expression levels of all mutated genes tended to be increased. One of the reasons was assumed to be the mutation essentiality, which remained to be experimentally verified.”

      • Line 199 onward. The authors used WGCNA to analyze the gene expression data of evolved organisms. They identified distinct gene modules in the reduced genome, and through further analysis, they found that specific modules were strongly associated with key biological traits like growth fitness, gene expression changes, and mutation rates. Did the authors expect that there was variation in mutation rate across their populations? Is variation from 3-16 mutations that they observed beyond the expectation for the wt mutation rate? The genetic causes of mutation rate variation are well understood, but I could not see any dinB, mutT,Y, rad, or pol genes among the discovered mutations. I would like the authors to justify the claim that there was mutation rate variation in the evolved populations.

      Thank you for the thought-provoking question. We do not think the mutation rates varied significantly across the nine populations because, as you noticed, no mutation occurred in the MMR genes. Our previous study showed that the spontaneous mutation rate of the reduced genome was higher than that of the wild-type genome (Nishimura et al., 2017, mBio). As nonsynonymous mutations were not detected in all nine Evos, the spontaneous mutation rate couldn't be calculated (because it should be evaluated according to the ratio of nonsynonymous and synonymous single-nucleotide substitutions in molecular evolution). Therefore, the mutation rate could not be discussed in the present study. The following sentence was added for a better understanding of the gene modules.

      (L242-245) “These modules M2, M10 and M16 might be considered as the hotspots for the genes responsible for growth fitness, transcriptional reorganization, and mutation accumulation of the reduced genome in evolution, respectively.”

      • Line 254 I get the idea of all roads leading to Rome, which is very fitting. However, describing the various evolutionary strategies and homeostatic and variable consequence does not sound correct - although I am not sure exactly what is meant here. Looking at Figure 7, I will call strategy I "parallel evolution", that is following the same or similar genetic pathways to adaptation and strategy ii I would call divergent evolution. I am not sure what strategy iii is. I don't want the authors to use the terms parallel and divergent if that's not what they mean. My request here would be that the authors clearly describe these strategies, but then show how their results fit in with the results, and if possible, fit with the naming conventions, of evolutionary biology.

      Thank you for your kind consideration and excellent suggestion. It's our pleasure to adopt your idea in our study. The evolutionary strategies were renamed according to your recommendation. Both the main text and Fig. 7 were revised as follows.

      (L285-293) “Common mutations22,44 or identical genetic functions45 were reported in the experimental evolution with different reduced genomes, commonly known as parallel evolution (Fig. 7, i). In addition, as not all mutations contribute to the evolved fitness 22,45, another strategy for varied phenotypes was known as divergent evolution (Fig. 7, ii). The present study accentuated the variety of mutations fixed during evolution. Considering the high essentiality of the mutated genes (Table S3), most or all mutations were assumed to benefit the fitness increase, partially demonstrated previously 20. Nevertheless, the evolved transcriptomes presented a homeostatic architecture, revealing the divergent to convergent evolutionary strategy (Fig. 7, iii).”

      Author response image 1.

      • Line 327 Growth rates/fitness. I don't think this should be called growth fitness- a rate is being calculated. I would like the authors to explain how the times were chosen - do the three points have to be during the log phase? Can you also explain what you mean by choosing three ri that have the largest mean and minor variance?

      Sorry for the confusing term usage. The fitness assay was changed to the growth assay. Choosing three ri that have the largest mean and minor variance was to avoid the occasional large values (blue circle), as shown in the following figure. In addition, the details of the growth analysis can be found at https://doi.org/10.3791/56197 (ref. 59), where the video of experimental manipulation, protocol, and data analysis is deposited. The following sentence was added in accordance.

      Author response image 2.

      (L369-371) “The growth rate was determined as the average of three consecutive ri, showing the largest mean and minor variance to avoid the unreliable calculation caused by the occasionally occurring values. The details of the experimental and analytical processes can be found at https://doi.org/10.3791/56197.”
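
      One way to read the selection rule above is sketched below. It is an interpretation, not the published protocol (see the cited https://doi.org/10.3791/56197 for that): it assumes each ri is the log-ratio of consecutive OD600 readings divided by the time interval, and it treats "minor variance" as a hypothetical cutoff `max_cv` on the coefficient of variation within each window of three consecutive ri.

```python
import math
import statistics


def stepwise_rates(times_h, od):
    """Assumed definition: r_i = ln(OD[i+1] / OD[i]) / (t[i+1] - t[i])."""
    return [
        math.log(od[i + 1] / od[i]) / (times_h[i + 1] - times_h[i])
        for i in range(len(od) - 1)
    ]


def growth_rate(times_h, od, max_cv=0.2):
    """Largest mean over windows of three consecutive r_i, preferring
    windows whose coefficient of variation is at most max_cv (hypothetical)."""
    rates = stepwise_rates(times_h, od)
    windows = [rates[i:i + 3] for i in range(len(rates) - 2)]
    if not windows:
        raise ValueError("need at least four OD readings")
    stable = [
        w for w in windows
        if statistics.fmean(w) > 0
        and statistics.pstdev(w) / statistics.fmean(w) <= max_cv
    ]
    # Fall back to all windows if none passes the stability cutoff.
    return max(statistics.fmean(w) for w in (stable or windows))
```

      The stability filter is what keeps a single occasional large ri from dominating the estimate, which is the stated purpose of the rule.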

      • Line 403 Chromosomal periodicity analysis. The windows chosen for smoothing (100kb) seem big. Large windows make sense for some things - for example looking at how transcription relates to DNA replication timing, which is a whole-genome scale trend. However, here the authors are looking for the differences after evolution, which will be local trends dependent on specific genes and transcription factors. 100kb of the genome would carry on the order of one hundred genes and might be too coarse-grained to see differences between evos lineages.

      Thank you for the advice. We agree that the present analysis focused on the global trend of gene expression. Varying the sizes may lead to different patterns. Additional analysis was performed according to your comment. The results showed that changes in window size (1, 10, 50, 100, and 200 kb) didn't alter the periodicity of the reduced genome, which agreed with the previous study on a different reduced genome MDS42 of a conserved periodicity (Ying et al., 2013, BMC Genomics). The following sentence was added in the Materials and Methods.

      (L460-461) “Note that altering the moving average did not change the max peak.”

      • Figures - the figures look great. Figure 7 needs a legend.

      Thank you. The following legend was added.

      (L774-777) “Three evolutionary strategies are proposed. Pink and blue arrowed lines indicate experimental evolution and genome reduction, respectively. The size of the open cycles represents the genome size. Black and grey indicate the ancestor and evolved genomes, respectively.”

      Response to Reviewer #2:

      Thank you for reviewing our manuscript and for your fruitful comments. We agree that our study leaned towards elaborating observed findings rather than explaining the detailed biological mechanisms. We focused on the genome-wide biological features rather than the specific biological functions. The underlying mechanisms indeed remained unknown, leaving the questions as you commented. We didn't perform the fitness assay on reconstituted (single and combinatorial) mutants because the research purpose was not to clarify the regulatory or metabolic mechanisms. It's why the RNA-Seq analysis provided the findings on genome-wide patterns and chromosomal view, which were supposed to be biologically valuable. We did understand your comments and complaints that the conclusions were biologically meaningless, as ALE studies that found the specific gene regulation or improved pathway was the preferred story in common, which was not the flow of the present study.

      For this reason, our revision may not address all these concerns. Considering your comments, we tried our best to revise the manuscript. The changes made were highlighted. We sincerely hope the revision and the following point-by-point response are acceptable.

      Major remarks:

      (1) The authors outlined the significance of ALE in genome-reduced organisms and important findings from published literature throughout the Introduction section. The description in L65-69, which I believe pertains to the motivation of this study, seems vague and insufficient to convey the novelty or necessity of this study i.e. it is difficult to grasp what aspects of genome-reduced biology that this manuscript intends to focus/find/address.

      Sorry for the unclear writing. The sentences were rewritten for clarity as follows.

      (L64-70) “Although the reduced growth rate caused by genome reduction could be recovered by experimental evolution, it remains unclear whether such an evolutionary improvement in growth fitness was a general feature of the reduced genome and how the genome-wide changes occurred to match the growth fitness increase. In the present study, we performed the experimental evolution with a reduced genome in multiple lineages and analyzed the evolutionary changes of the genome and transcriptome.”

      (2) What is the rationale behind the lineage selection described in Figure S1 legend "Only one of the four overnight cultures in the exponential growth phase (OD600 = 0.01~0.1) was chosen for the following serial transfer, highlighted in red."?

      The four wells (cultures of different initial cell concentrations) were measured every day, and only the well that showed OD600=0.01~0.1 (red) was transferred at four different dilution rates (e.g., 10-, 100-, 1000-, and 10000-fold). This resulted in four wells of different initial cell concentrations. Multiple dilutions ensured that at least one of the wells would show an OD600 within the range of 0.01 to 0.1 after the overnight culture. That well was then used for the next serial transfer. Fig. S1 provides the details of the experimental records. The experimental evolution was strictly controlled within the exponential phase, quite different from the commonly conducted ALE that transfers a single culture at a fixed dilution rate. Serial transfer with multiple dilution rates was previously applied in our evolution experiments and well described in Nishimura et al., 2017, mBio; Lu et al., 2022, Comm Biol; Kurokawa et al., 2022, Front Microbiol, etc. The following sentence was added in the Materials and Methods.

      (L344-345) “Multiple dilutions changing in order promised at least one of the wells within the exponential growth phase after the overnight culture.”
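
      The logic of the multiple-dilution transfer can be illustrated with a toy calculation. This is a sketch only, assuming purely exponential growth with no lag or saturation; the starting OD600, growth rate, and incubation time in the example are hypothetical values chosen for illustration, and `wells_in_window` is a hypothetical helper.

```python
import math


def wells_in_window(od_start, rate_per_h, hours,
                    dilutions=(10, 100, 1000, 10000),
                    window=(0.01, 0.1)):
    """Return the dilution factors whose well ends inside the OD600 window,
    assuming exponential growth for the whole incubation."""
    low, high = window
    growth = math.exp(rate_per_h * hours)
    return [d for d in dilutions if low <= od_start / d * growth <= high]


# Hypothetical example: transfer from OD600 = 0.05, 0.3 per hour, 20 hours.
print(wells_in_window(0.05, 0.3, 20))
```

      Because successive wells differ tenfold and the target window also spans one decade, at least one well lands in the window whenever the final densities of the dilution series bracket it.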

      (3) The measured growth rate of the end-point 'F2 lineage' shown in Figure S2 seemed comparable to the rest of the lineages (A1 to H2), but the growth rate of 'F2' illustrated in Figure 1B indicates otherwise (L83-84). What is the reason for the incongruence between the two datasets?

      Sorry for the unclear description. The growth rates shown in Fig. S2 were obtained during the evolution experiment using the daily transfer's initial and final OD600 values. The growth rates shown in Fig. 1B were obtained from the final population (Evos) growth assay and calculated from the growth curves (biological replication, N=4). Fig. 1B shows the precisely evaluated growth rates, and Fig. S2 shows the evolutionary changes in growth rates. Accordingly, the following sentence was added to the Results.

      (L84-87) “As the growth increases were calculated according to the initial and final records, the exponential growth rates of the ancestor and evolved populations were obtained according to the growth curves for a precise evaluation of the evolutionary changes in growth.”

      (4) Are the differences in growth rate statistically significant in Figure 1B?

      Eight out of nine Evos were significant, except F2. The sentences were rewritten and associated with the revised Fig. 1B, indicating significance.

      (L87-90) “The results demonstrated that most evolved populations (Evos) showed improved growth rates, in which eight out of nine Evos were highly significant (Fig. 1B, upper). However, the magnitudes of growth improvement were considerably varied, and the evolutionary dynamics of the nine lineages were somehow divergent (Fig. S2).”

      (5) The evolved lineages showed a decrease in their maximal optical densities (OD600) compared to the ancestral strain (L85-86). ALE could accompany changes in cell size and morphologies, (doi: 10.1038/s41586-023-06288-x; 10.1128/AEM.01120-17), which may render OD600 relatively inaccurate for cell density comparison. I suggest using CFU/mL metrics for the sake of a fair comparison between Anc and Evo.

      The method used to evaluate the carrying capacity (i.e., cell density, population size, etc.) does not change the results. Even CFU counting is unfair to living cells that cannot form colonies, and it is also unfair if the cell size changes. Optical density (OD600) provides the temporal changes of cell growth at 15-minute intervals, which allows an exact evaluation of the growth rate in the exponential phase. CFU counting is poor at recording temporal changes in the population, which tends to result in an inappropriate growth rate. Taken together, we believe that our method was reasonable and reliable. We hope you can accept this different approach.

      (6) Please provide evidence in support of the statement in L115-119. i.e. statistical analysis supporting that the observed ratio of essential genes in the mutant pool is not random.

      The statistical test was performed, and the following sentence was added.

      (L139-141) “The ratio of essential genes in the mutated genes was significantly higher than in the total genes (286 out of 3290 genes, Chi-square test p=0.008).”

      (7) The assumption that "mutation abundance would correlate to fitness improvement" described in L120-122: "The large variety in genome mutations and no correlation of mutation abundance to fitness improvement strongly suggested that no mutations were specifically responsible or crucially essential for recovering the growth rate of the reduced genome" is not easy to digest, in the sense that (i) the effect of multiple beneficial mutations are not necessarily summative, but are riddled with various epistatic interactions (doi: 10.1016/j.mec.2023.e00227); (ii) neutral hitchhikers are of common presence (you could easily find reference on this one); (iii) hypermutators that accumulate greater number of mutations in a given time are not always the eventual winners in competition games (doi: 10.1126/science.1056421). In this sense, the notion that "mutation abundance correlates to fitness improvement" in L120-122 seems flawed (for your perusal, doi: 10.1186/gb-2009-10-10-r118).

      Sorry for the improper description and confusing writing, and thank you for the helpful background on molecular evolution. The sentence was deleted, and the following one was added.

      (L145-146) “Nevertheless, it was unclear whether and how these mutations were explicitly responsible for recovering the growth rate of the reduced genome.”

      (8) Could it be possible that the large variation in genome mutations in independent lineages results from a highly rugged fitness landscape characterized by multiple fitness optima (doi: 10.1073/pnas.1507916112)? If this is the case, I disagree with the notion in L121-122 "that no mutations were specifically responsible or crucially essential" It does seem to me that, for example, the mutations in evo A2 are specifically responsible and essential for the fitness improvement of evo A2 in the evolutionary condition (M63 medium). Fitness assessment of individual (or combinatorial) mutants reconstituted in the Ancestral background would be a bonus.

      Thank you for the intriguing idea. The sentence was deleted. Please allow us to adapt your comment to the manuscript as follows.

      (L143-145) “The large variety of genome mutations fixed in the independent lineages might result from a highly rugged fitness landscape 38.”

      (9) L121-122: "...no mutations were specifically responsible or crucially essential for recovering the growth rate of the reduced genome". Strictly speaking, the authors should provide a reference case of wild-type E. coli ALE in order to reach definitive conclusions that the observed mutation events are exclusive to the genome-reduced strain. It is strongly recommended that the authors perform comparative analysis with an ALEed non-genome-reduced control for a more definitive characterization of the evolutionary biology in a genome-reduced organism, as it was done for "JCVI-syn3.0B vs non-minimal M. mycoides" (doi: 10.1038/s41586-023-06288-x) and "E. coli eMS57 vs MG1655" (doi: 10.1038/s41467-019-08888-6).

      The improper description was deleted in response to comments 7 and 8. The mentioned references were cited in the manuscript (refs 21 and 23). Thank you for the experimental advice. We are sorry that the comparison of wild-type and reduced genomes was not in the scope of the present study and will probably be reported soon in our future work.

      (10) L146-148: "The homeostatic periodicity was consistent with our previous findings that the chromosomal periodicity of the transcriptome was independent of genomic or environmental variation" A Previous study also suggested that the amplitudes of the periodic transcriptomes were significantly correlated with the growth rates (doi: 10.1093/dnares/dsaa018). Growth rates of 8/9 Evos were higher compared to Anc, while that of Evo F2 remained similar. Please comment on the changes in amplitudes of the periodic transcriptomes between Anc and each Evo.

      Thank you for the suggestion. The correlation between the growth rates and the amplitudes of chromosomal periodicity was statistically insignificant (p>0.05). This might be a result of the limited number of data points. Compared with the only nine data points in the present study, the previous study analyzed hundreds of transcriptomes associated with the corresponding growth rates, which is suitable for statistical evaluation. In addition, the changes in growth rates were larger in the previous study than in the present study, which might influence the significance. This is why we did not discuss the periodic amplitude.

      (11) Please elaborate on L159-161: "It strongly suggested the essentiality mutation for homeostatic transcriptome architecture happened in the reduced genome.".

      Sorry for the improper description. The sentence was rewritten as follows.

      (L191-193) “The essentiality of the mutations might have participated in maintaining the homeostatic transcriptome architecture of the reduced genome.”

      (12) Is FPKM a valid metric for between-sample comparison? The growing consensus in the community adopts Transcripts Per Kilobase Million (TPM) for comparing gene expression levels between different samples (Figure 3B; L372-379).

      Sorry for the unclear description. The FPKM indicated here was globally normalized, statistically equivalent to TPM. The following sentence was added to the Materials and Methods.

      (L421-422) “The resulting normalized FPKM values were statistically equivalent to TPM.”
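
      One common way to relate the two units, offered here only as a sketch, is to rescale each sample's FPKM values so that they sum to one million, which is the defining property of TPM. Whether this matches the global normalization the authors applied is an assumption; the function name and the gene values are hypothetical.

```python
def fpkm_to_tpm(fpkm):
    """Rescale one sample's FPKM values so that they sum to 1e6 (TPM)."""
    total = sum(fpkm.values())
    if total <= 0:
        raise ValueError("FPKM values must sum to a positive number")
    return {gene: value / total * 1e6 for gene, value in fpkm.items()}

# Hypothetical three-gene sample.
example_fpkm = {"geneA": 10.0, "geneB": 30.0, "geneC": 60.0}
example_tpm = fpkm_to_tpm(example_fpkm)
```

      Because every sample then has the same total, values for a given gene can be compared between samples on a common scale.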

      (13) Please provide % mapped frequency of mutations in Table S3.

      They were all 100%. The partially fixed mutations were excluded in the present study. The following sentence was added to the caption of Table S3.

      (Supplementary file, p 9) “Note that the entire population held the mutations, i.e., 100% frequency in DNA sequencing.”

      (14) To my knowledge, M63 medium contains glucose and glycerol as carbon sources. The manuscript would benefit from discussing the elements that impose selection pressure in the M63 culture condition.

      Sorry for the missing information on M63, which contains 22 mM glucose as the only carbon source. The medium composition was added in the Materials and Methods, as follows.

      (L334-337) “In brief, the medium contains 62 mM dipotassium hydrogen phosphate, 39 mM potassium dihydrogen phosphate, 15 mM ammonium sulfate, 15 μM thiamine hydrochloride, 1.8 μM Iron (II) sulfate, 0.2 mM magnesium sulfate, and 22 mM glucose.”

      (15) The RNA-Seq datasets for Evo strains seemed equally heterogenous, just as their mutation profiles. However, the missing element in their analysis is the directionality of gene expression changes. I wonder what sort of biological significance can be derived from grouping expression changes based solely on DEGs, without considering the magnitude and the direction (up- and down-regulation) of changes? RNA-seq analysis in its current form seems superficial to derive biologically meaningful interpretations.

      We agree that most studies discuss the direction of transcriptional changes. The present study aimed to capture a global view of the magnitude of transcriptome reorganization. Thus, the analyses focused on overall features, such as the abundance of DEGs, rather than on the details of the changes, e.g., the up- and down-regulation of DEGs. The biological meaning of this overview of the DEGs is how strongly genome-wide gene expression fluctuated, which falls short of an in-depth view of individual gene expression. The following sentence was added to indicate the limitation of the present analysis.

      (L199-202) “Instead of an in-depth survey on the directional changes of the DEGs, the abundance and functional enrichment of DEGs were investigated to achieve an overview of how significant the genome-wide fluctuation in gene expression, which ignored the details of individual genes.”

      Minor remarks

      (1) L41: brackets italicized "(E. coli)".

      It was fixed as follows.

      (L40) “… Escherichia coli (E. coli) cells …”

      (2) Figure S1. It is suggested that the x-axis of ALE monitor be set to 'generations' or 'cumulative generations', rather than 'days'.

      Thank you for the suggestion. Fig. S1 describes the experimental procedure, so "day" was used. Fig. S2 presents the evolutionary process, so "generation" was used, as you recommended here.

      (3) I found it difficult to digest through L61-64. Although it is not within the job scope of reviewers to comment on the language style, I must point out that the manuscript would benefit from professional language editing services.

      Sorry for the unclear writing. The sentences were revised as follows.

      (L60-64) “Previous studies have identified conserved features in transcriptome reorganization, despite significant disruption to gene expression patterns resulting from either genome reduction or experimental evolution 27-29. The findings indicated that experimental evolution might reinstate growth rates that have been disrupted by genome reduction to maintain homeostasis in growing cells.”

      (4) Duplicate references (No. 21, 42).

      Sorry for the mistake. It was fixed (leaving ref. 21).

      (5) Inconsistency in L105-106: "from two to 13".

      "From two to 13" was adopted from the language editing. It was changed as follows.

      (L119) “… from 2 to 13, …”

      Response to Reviewer #3:

      Thank you for reviewing our manuscript and for the helpful comments, which improved the strength of the manuscript. The recommended statistical analyses that directly support the statements in the manuscript were performed, whereas those that would constitute new results within the scope of further studies were not conducted. The changes made in the revision were highlighted. We sincerely hope the revised manuscript and the following point-by-point response address your concerns. You will find all your suggested statistical tests in our future work, which will report an extensive study on the experimental evolution of an assortment of reduced genomes.

      (1) Line 106 - "As 36 out of 45 SNPs were nonsynonymous, the mutated genes might benefit the fitness increase." This argument can be strengthened. For example, the null expectation of nonsynonymous SNPs should be discussed. Is the number of observed nonsynonymous SNPs significantly higher than the expected one?

      (2) Line 107 - "In addition, the abundance of mutations was unlikely to be related to the magnitude of fitness increase." Instead of just listing examples, a regression analysis can be added.

      Yes, it's significant. As a rough estimate, random mutations lead to ~33% nonsynonymous SNPs. Additionally, the regression is unreliable because there is no statistically significant correlation between the number of mutations and the magnitude of fitness increase. Accordingly, the corresponding sentences were revised with additional statistical tests.

      (L123-129) “As 36 out of 45 SNPs were nonsynonymous, which was highly significant compared to random mutations (p < 0.01), the mutated genes might benefit fitness increase. In addition, the abundance of mutations was unlikely to be related to the magnitude of fitness increase. There was no significant correlation between the number of mutations and the growth rate in a statistical view (p > 0.1). Even from an individual close-up viewpoint, the abundance of mutations poorly explained the fitness increase.”
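
      The response does not name the test behind p < 0.01. One simple possibility, sketched here as an assumption, is a one-sided exact binomial test of the observed 36 nonsynonymous SNPs out of 45 against the authors' rough null proportion of ~33%. The function name is hypothetical.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

observed_nonsyn = 36    # nonsynonymous SNPs reported in the response
total_snps = 45         # all SNPs reported in the response
null_fraction = 0.33    # the authors' rough estimate for random mutations

p_value = binomial_upper_tail(observed_nonsyn, total_snps, null_fraction)
```

      The result depends directly on the null proportion, so a different estimate of the expected nonsynonymous fraction would change the conclusion.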

      (3) Line 114 - "They seemed highly related to essentiality, as 11 out of 49 mutated genes were essential (Table S3)." Here, the information mentioned in line 153 ("the ratio of essential to all genes (302 out of 3,290) in the reduced genome.") can be used. Then a statistical test for a contingency table can be used.

      (4) Line 117 - "the high frequency of the mutations fixed in the essential genes suggested the mutation in essentiality for fitness increase was the evolutionary strategy for reduced genome." What is the expected number of fixed mutations in essential genes vs non-essential genes? Is the observed number statistically significantly higher?

      Sorry for the improper and insufficient information on the essential genes. Yes, it's significant. The statistical test was additionally performed. The corresponding part was revised as follows.

      (L134-146) “They seemed highly related to essentiality7 (https://shigen.nig.ac.jp/ecoli/pec/genes.jsp), as 11 out of 49 mutated genes were essential (Table S3). Although the essentiality of genes might differ between the wild-type and reduced genomes, the experimentally determined 302 essential genes in the wild-type E. coli strain were used for the analysis, of which 286 were annotated in the reduced genome. The ratio of essential genes in the mutated genes was significantly higher than in the total genes (286 out of 3290 genes, Chi-square test p=0.008). As the essential genes were determined according to the growth35 and were known to be more conserved than nonessential ones 36,37, the high frequency of the mutations fixed in the essential genes was highly intriguing and reasonable. The large variety of genome mutations fixed in the independent lineages might result from a highly rugged fitness landscape 38. Nevertheless, it was unclear whether and how these mutations were explicitly responsible for recovering the growth rate of the reduced genome.”
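
      To make the contingency-table comparison concrete, the sketch below runs a 2x2 chi-square test on the counts quoted above (11 essential among 49 mutated genes; 286 essential among 3,290 genes). How the authors built their table is not stated, so the layout here (mutated vs. non-mutated genes, no continuity correction by default) is an assumption, and the p-value need not reproduce the reported 0.008.

```python
import math

def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square test of independence for the table [[a, b], [c, d]] (1 d.f.)."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed[i][j] - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)  # continuity correction
            chi2 += diff ** 2 / expected
    # Survival function of the chi-square distribution with 1 d.f.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

# Assumed layout: all 49 mutated genes lie among the 3,290 annotated genes.
mutated_essential = 11
mutated_nonessential = 49 - 11
other_essential = 286 - 11
other_nonessential = (3290 - 286) - (49 - 11)

chi2_stat, p_val = chi_square_2x2(
    mutated_essential, mutated_nonessential, other_essential, other_nonessential
)
```

      Passing `yates=True` applies the continuity correction, which gives a somewhat larger p-value for the same counts.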

      (5) The authors mentioned no overlapping in the single mutation level. Is that statistically significant? The authors can bring up what the no-overlap probability is given that there are in total x number of fixed mutations observed (either theory or simulation is good).

      Sorry, but we are confused by this comment; it is unclear to us why a statistical simulation is needed. Firstly, the mutations were experimentally observed. The absence of overlapped mutated genes is an experimental fact, not a computational prediction. We are afraid our finding may have been over-interpreted as an evolutionary rule, which would indeed require statistical testing of its reliability; we did not conclude that evolution produces no overlapped mutations. Secondly, considering that 65 random mutations occurred in a ~3.9 Mb sequence, a statistical test would be meaningful only if overlapped mutations had been found experimentally. How often random mutations produce overlapped mutations in parallel evolutionary lineages as the number of lineages increases is an interesting question, but it seems to be out of the scope of the present study. We are happy to include the analysis in our ongoing study on the experimental evolution of reduced genomes.
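
      For scale, the figures quoted in this reply (65 mutations, ~3.9 Mb) can be put into a simple birthday-problem estimate of how likely it is that no two mutations share a site. This is only a sketch under strong assumptions that are not the authors': mutations are treated as independent point events falling uniformly along the genome, and "overlap" means the exact same position. The function name is hypothetical.

```python
import math

def prob_no_shared_site(n_mutations, genome_length):
    """Probability that n uniformly placed mutations all hit distinct sites."""
    log_p = sum(math.log1p(-i / genome_length) for i in range(n_mutations))
    return math.exp(log_p)

p_no_overlap = prob_no_shared_site(65, 3_900_000)
```

      Under these assumptions, seeing no shared site is the expected outcome. Real mutation rates vary along the genome, and overlap at the gene level is a different question with a much smaller effective number of targets.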

      (6) The authors mentioned no overlapping in the single mutation level. How about at the genetic level? Some fixed mutations occur in the same coding gene. Is there any gene with a significantly enriched number of mutations?

      No two mutations were fixed in the same gene with a biological function, as shown in Table S3. Regarding the coding regions, the only exception is the IS sequences, which are well known as transposable sequences without genetic function. The following description was added.

      (L119-122) “The number of mutations largely varied among the nine Evos, from 2 to 13, and no common mutation was detected in all nine Evos (Table S3). A 1,199-bp deletion of insH was frequently found in the Evos (Table S3, highlighted), which well agreed with its function as a transposable sequence.”

      (7) Line 151-156- It seems like the authors argue that the expression level differences can be just explained by the percentage of essential genes that get fixed mutations. One further step for the argument could be to compare the expression level of essential genes with vs without fixed mutations. Also, the authors can compare the expression level of non-essential genes with vs without fixed mutations. And the authors can report whether the differences in expression level became insignificant after the control of the essentiality.

      It's our pleasure that the essentiality intrigued you. Thank you for the analytical suggestion, which is exciting and valuable for our studies. As only 11 essential genes were detected here and "Mutation in essentiality" was an indication but not the conclusion of the present study, we would like to apply the recommended analysis to the datasets of our ongoing study to demonstrate this statement. Thank you again for your fruitful analytical advice.

      (8) Line 169- "The number of DEGs partially overlapped among the Evos declined significantly along with the increased lineages of Evos (Figure 4B). " There is a lack of statistical significance here while the word "significantly" is used. One statistical test that can be done is to use re-sampling/simulation to generate a null expectation of the overlapping numbers given the DEGs for each Evo line and the total number of genes in the genome. The observed number can then be compared to the distribution of the simulated numbers.

      Sorry for the inappropriate usage of the term. Whether it's statistically significant didn't matter here. The word "significant" was deleted as follows.

      (L205-206) “The number of DEGs partially overlapped among the Evos declined along with the increased lineages of Evos (Fig. 4B).”

      (9) Line 177-179- "In comparison,1,226 DEGs were induced by genome reduction. The common DEGs 177 of genome reduction and evolution varied from 168 to 540, fewer than half of the DEGs 178 responsible for genome reduction in all Evos" Is the overlapping number significantly lower than the expectation? The hypergeometric test can be used for testing the overlap between two gene sets.

      There is no expectation for how many DEGs would be reasonable. Not every experimentally obtained number needs to be statistically meaningful; that requirement is more commonly essential in computational and data science.

      (10) The authors should give more information about the ancestral line used at the beginning of experimental evolution. I guess it is one of the KHK collection lines, but I can not find more details. There are many genome-reduced lines. Why is this certain one picked?

      Sorry for the insufficient information on the reduced genome used for the experimental evolution. The following descriptions were added in the Results and the Materials and Methods, respectively.

      (L75-79) “The E. coli strain carrying a reduced genome, derived from the wild-type genome W3110, showed a significant decline in its growth rate in the minimal medium compared to the wild-type strain 13. To improve the genome reduction-mediated decreased growth rate, the serial transfer of the genome-reduced strain was performed with multiple dilution rates to keep the bacterial growth within the exponential phase (Fig. S1), as described 17,20.”

      (L331-334) “The reduced genome has been constructed by multiple deletions of large genomic fragments 58, which led to an approximately 21% smaller size than its parent wild-type genome W3110.”

      (11) How was the saturated density in Figure 1 actually determined? In particular, the fitness assay of growth curves is 48h. But it seems like the experimental evolution is done for ~24 h cycles. If the Evos never experienced a situation like a stationary phase between 24-48h, and if the author reported the saturated density 48 h in Figure 1, the explanation of the lower saturated density can be just relaxation from selection and may have nothing to do with the increase of growth rate.

      Sorry for the unclear description. Yes, you are right. The evolution was performed within the exponential growth phase (keeping cell division constant), which means the Evos never experienced the stationary phase (saturation). The final evolved populations were subjected to the growth assay to obtain the entire growth curves for calculating the growth rate and the saturated density. Whether the decreased saturated density and the increased growth rate were in a trade-off relationship remained unclear. The corresponding paragraph was revised as follows.

      (L100-115) “Intriguingly, a positive correlation was observed between the growth fitness and the carrying capacity of the Evos (Fig. 1D). It was somehow consistent with the positive correlations between the colony growth rate and the colony size of a genome-reduced strain 11 and between the growth rates and the saturated population size of an assortment of genome reduced strains 13. Nevertheless, the negative correlation between growth rate and carrying capacity, known as the r/K selection30,31 was often observed as the trade-off relationship between r and K in the evolution and ecology studies 32 33,34. As the r/K trade-off was proposed to balance the cellular metabolism that resulted from the cost of enzymes involved 34, the deleted genes might play a role in maintaining the metabolism balance for the r/K correlation. On the other hand, the experimental evolution (i.e., serial transfer) was strictly performed within the exponential growth phase; thus, the evolutionary selection was supposed to be driven by the growth rate without selective pressure to maintain the carrying capacity. The declined carrying capacity might have been its neutral "drift" but not a trade-off to the growth rate. Independent and parallel experimental evolution of the reduced genomes selecting either r or K is required to clarify the actual mechanisms.”

      (12) What annotation of essentiality was used in this paper? In particular, the essentiality can be different in the reduced genome background compared to the WT background.

      Sorry for the unclear definition of the essential genes. They are strictly limited to the 302 essential genes experimentally determined in the wild-type E. coli strain. Detailed information can be found at the following website: https://shigen.nig.ac.jp/ecoli/pec/genes.jsp. We agree that the essentiality could differ between the WT and reduced genomes. Identifying the essential genes in the reduced genome would be an exhaustively large undertaking. The information on the essential genes defined in the present study was added as follows.

      (L134-139) “They seemed highly related to essentiality7 (https://shigen.nig.ac.jp/ecoli/pec/genes.jsp), as 11 out of 49 mutated genes were essential (Table S3). Although the essentiality of genes might differ between the wild-type and reduced genomes, the experimentally determined 302 essential genes in the wild-type E. coli strain were used for the analysis, of which 286 were annotated in the reduced genome.”

      (13) The fixed mutations in essential genes are probably not rarely observed in experimental evolution. For example, fixed mutations related to RNA polymerase can be frequently seen when evolving to stressful environments. I think the author can discuss this more and elaborate more on whether they think these mutations in essential genes are important in adaptation or not.

      Thank you for your careful reading and the suggestion. As you mentioned, we noticed that mutations in RNA polymerases (rpoA, rpoB, and rpoD) were identified in three Evos. As they were not shared across all Evos, we didn't discuss the contribution of these mutations to evolution. Instead of the individual functions of the mutated essential genes, we focused on the enriched gene functions related to the transcriptome reorganization because they were the common feature observed across all Evos and were linked to whole metabolic or regulatory pathways, which are supposed to be more biologically reasonable and interpretable. The following sentence was added to clarify our thinking.

      (L268-273) “In particular, mutations in the essential genes, such as RNA polymerases (rpoA, rpoB, rpoD) identified in three Evos (Table S3), were supposed to participate in the global regulation for improved growth. Nevertheless, the considerable variation in the fixed mutations without overlaps among the nine Evos (Table 1) implied no common mutagenetic strategy for the evolutionary improvement of growth fitness.”

      (14) In experimental evolution to new environments, several previous literature also show that long-term experimental evolution in transcriptome is not consistent or even reverts the short-term response; short-term responses were just rather considered as an emergency plan. They seem to echo what the authors found in this manuscript. I think the author can refer to some of those studies more and make a more throughput discussion on short-term vs long-term responses in evolution.

      Thank you for the advice. It is unclear to us what the short-term and long-term responses mentioned in this comment refer to. The term "response" usually denotes phenotypic or transcriptional changes within a few hours after an environmental fluctuation, which are generally non-genetic (no mutation). In comparison, long-term or short-term experimental "evolution" is associated with genetic changes (mutations). Concerning evolution (not the response), to our knowledge, long-term experimental evolution (>10,000 generations) has been performed only with the wild-type genome, whereas short-term experimental evolution (500~2,000 generations) has more often been conducted with both wild-type and reduced genomes. Previous landmark studies have intensively discussed comparisons of the wild-type and reduced genomes. Our study was restricted to the reduced genome, which was constructed differently from the reduced genomes used in the reported studies. The experimental evolution of those reduced genomes has been performed in the presence of additives, e.g., antibiotics, alternative carbon sources, etc. That is, neither the genomic backgrounds nor the evolutionary conditions were comparable, and comparing studies with nothing in common seems unproductive. We sincerely hope the recommended topics can be addressed in our future work.

      Some minor suggestions

      • Figures S3 & Table S2 need an explanation of the abbreviations of gene categories.

      Sorry for the missing information. Figure S3 and Table S3 were revised to include the names of gene categories. The figure is pasted below for quick reference.

      Author response image 3.

      • I hope the authors can re-consider the title; "Diversity for commonality" does not make much sense to me. For example, it can be simply just "Diversity and commonality."

      Thank you for the suggestion. The title was simplified as follows.

      (L1) “Experimental evolution for the recovery of growth loss due to genome reduction.”

      • It is not easy for me to locate and distinguish the RNA-seq vs DNA-seq files in DRA013662 at DDBJ. Could you make some notes on what RNA-seq actually are, vs what DNA-seq files actually are?

      Sorry for the mistakes in the DRA number of DNA-seq. DNA-seq and RNA-seq were deposited separately with the accession IDs of DRA013661 and DRA013662, respectively. The following correction was made in the revision.

      (L382-383) “The raw datasets of DNA-seq were deposited in the DDBJ Sequence Read Archive under the accession number DRA013661.”

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer 1 (Public Review):

      1. The name of the new method "inter-haplotype distance" is more confusing than helpful, as the haplotype information is not critical for implementing this method. First, the mutation spectrum is aggregated genome-wide regardless of the haplotypes where the mutations are found. Second, the only critical haplotype information is that at the focal site (i.e., the locus that is tested for association): individuals are aggregated together when they belong to the same "haplotype group" at the focal site. However, for the classification step, haplotype information is not really necessary: individuals can be grouped based on their genotypes at the given locus (e.g., AA vs AB). As the authors mentioned, this method can be potentially applied to other mutation datasets, where haplotype information may well be unavailable. I hope the authors can reconsider the name and remove the term "haplotype" (perhaps something like "inter-genotype distance"?) to avoid giving the wrong impression that haplotype information is critical for applying this method.

      We appreciate the reviewer's concern about the name of our method. The reviewer is correct that haplotype information is not critical for our method to work, and as a result we've decided to simply rename the approach to "aggregate mutation spectrum distance" (abbreviated AMSD). For simplicity, we refer to the method as IHD throughout our responses to reviewers, but the revised manuscript now refers to AMSD.

      1. The biggest advantage of the IHD method over QTL mapping is alleviation of the multiple testing burden, as one comparison tests for any changes in the mutation spectrum, including simultaneous, small changes in the relative abundance of multiple mutation types. Based on this, the authors claim that IHD is more powerful to detect a mutator allele that affects multiple mutation types. Although logically plausible, it is unclear under what quantitative conditions IHD can actually have greater power over QTL. It will be helpful to support this claim by providing some simulation results.

      This comment prompted us to do a more detailed comparison of IHD vs. QTL power under conditions that are more similar to those observed in the BXD cohort. While preparing the original manuscript, we assumed that IHD might have greater power than QTL mapping in a population like the BXDs because some recombinant inbred lines have accumulated many more germline mutations than others (see Figure 1 in Sasani et al. 2022, Nature). In a quantitative trait locus scan (say, for the fraction of C>A mutations in each line) each BXD's mutation data would be weighted equally, even if a variable number of mutations was used to generate the phenotype point estimate in each line.

      To address this, we performed a new series of simulations in which the average number of mutations per haplotype was allowed to vary. At the low end, some BXDs accumulated as few as 100 total germline mutations, while others have accumulated as many as 2,000. Thus, instead of simulating a mean number of mutations on each simulated haplotype, we allowed the mean number of mutations per haplotype to vary from N to 20N. By simulating a variable count of mutations on each haplotype, we could more easily test the benefits of comparing aggregate, rather than individual, mutation spectra between BXDs.

      In these updated simulations, we find that IHD routinely outperforms QTL mapping under a range of parameter choices (see Author Response image 1). Since IHD aggregates the mutation spectra of all haplotypes with either B or D alleles at each locus in the genome, the method is much less sensitive to individual haplotypes with low mutation counts. We include a mention of these updated simulations on lines 135-138 and describe the updated simulations in greater detail in the Materials and Methods (lines 705-715).

      Author response image 1.

      Power of IHD and QTL mapping on simulated haplotypes with variable counts of mutations. We simulated germline mutations on the specified number of haplotypes (as described in the manuscript) but allowed the total number of mutations per haplotype to vary by a factor of 20.
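
      The aggregation step described above can be sketched as follows: at one locus, sum the mutation spectra of all haplotypes carrying each allele, then measure the distance between the two aggregate spectra. The use of cosine distance, the function names, the six-category spectra, and all counts are assumptions made for illustration; the response itself does not specify the distance metric.

```python
import math

def aggregate_spectrum(spectra):
    """Element-wise sum of per-haplotype mutation count vectors."""
    return [sum(column) for column in zip(*spectra)]

def cosine_distance(u, v):
    """1 minus the cosine similarity of two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def locus_distance(genotypes, spectra):
    """Distance between aggregate spectra of B and D haplotypes at one locus."""
    group_b = [s for g, s in zip(genotypes, spectra) if g == "B"]
    group_d = [s for g, s in zip(genotypes, spectra) if g == "D"]
    return cosine_distance(aggregate_spectrum(group_b), aggregate_spectrum(group_d))

# Hypothetical haplotypes; the first entry of each spectrum is the C>A count.
spectra = [
    [10, 10, 10, 10, 10, 10],        # few mutations, flat spectrum
    [100, 100, 100, 100, 100, 100],  # many mutations, flat spectrum
    [30, 10, 10, 10, 10, 10],        # few mutations, C>A enriched
    [300, 100, 100, 100, 100, 100],  # many mutations, C>A enriched
]
linked = ["B", "B", "D", "D"]    # D allele tracks the C>A enrichment
unlinked = ["B", "D", "B", "D"]  # alleles unrelated to the enrichment

d_linked = locus_distance(linked, spectra)
d_unlinked = locus_distance(unlinked, spectra)
```

      Because counts are summed before the comparison, haplotypes with many mutations contribute more to each aggregate spectrum than haplotypes with few, which is the property credited above with reducing sensitivity to low-count haplotypes.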

      1. The flip side of this advantage of IHD is that, when a significant association is detected, it is not immediately clear which mutation type is driving the signal. Related to this, it is unclear how the authors reached the point that "...the C>A mutator phenotype associated with the locus on chromosome 6", when they only detected significant IHD signal at rs46276051 (on Chr6), when conditioning on D genotypes at the rs27509845 (on Chr4) and no significant signal for any 1-mer mutation type by traditional mapping. The authors need to explain how they deduced that C>A mutation is the major source of the signal. In addition, beyond C>A mutations, can mutation types other than C>A contribute to the IHD signal at rs46276051? More generally, I hope the authors can provide some guidelines on how to narrow a significant IHD signal to specific candidate mutation type(s) affected, which will make the method more useful to other researchers.

      We thank the reviewer for pointing out this gap in our logic. We omitted specific instructions for narrowing down an IHD signal to specific mutation type(s) for a few reasons. First, this can be addressed using mutational signature analysis methods that are in widespread use. For example, upon identifying one or more candidate mutator loci, we can enter the mutation spectra of samples with each possible mutator genotype into a program (e.g., SigProfilerExtractor) to determine which combinations of mutation types occur proportionally more often in the genomes that harbor mutators (see Figure 3c in our manuscript). A second approach for narrowing down an IHD signal, highlighted in Figure 3a (and now described in the text of the Results section at lines 256-261), is to simply test which mutation type proportion(s) differ significantly between groups of samples with and without a candidate mutator (for example, with a Chi-square test of independence for each mutation type).

      Although this second approach incurs a multiple testing burden, the burden is offset somewhat by using IHD to identify mutator loci, rather than performing association tests for every possible mutation type to begin with. Although Figure 3a only shows the significant difference in C>A fraction among BXDs with different mutator locus genotypes, Figure 3-figure supplement 1 shows the complete set of 1-mer spectrum comparisons. It is possible that this second approach would not prove very useful in the case of a mutator with a “flat” signature (i.e., a mutator that slightly perturbs the rates of many different mutation types), but in our case it clearly shows which mutation type is affected.
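
      A minimal sketch of the second approach (one test of independence per mutation type), written in Python with only the standard library. The function names, the dictionary layout, and any counts passed in are hypothetical and are not taken from the manuscript's code. The statistic is the uncorrected Pearson chi-square on a 2x2 table (this mutation type vs. all other types, in samples with vs. without the candidate mutator), with the p-value taken from the chi-square distribution with one degree of freedom.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def per_type_tests(with_mutator, without_mutator):
    """Test each mutation type against all others.

    Both arguments map a mutation type (e.g. "C>A") to an aggregate count.
    Returns a dict: mutation type -> (statistic, p_value).
    """
    total_with = sum(with_mutator.values())
    total_without = sum(without_mutator.values())
    results = {}
    for mut_type, a in with_mutator.items():
        c = without_mutator[mut_type]
        results[mut_type] = chi_square_2x2(a, total_with - a, c, total_without - c)
    return results
```

      Because one test is run per mutation type, dividing the significance threshold by the number of types tested is one simple way to account for the multiple testing burden.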

      1. To account for differential relatedness between the inbred lines, the authors regressed the cosine distance between the two aggregate mutation spectra on the genome-wide genetic similarity and took the residual as the adjusted test metric. What is the value of the slope from this regression? If significantly non-zero, this would support a polygenic architecture of the mutation spectrum phenotype, which could be interesting. If not, is this adjustment really necessary? In addition, is the intercept assumed to be zero for this regression, and does such an assumption matter? I would appreciate seeing a supplemental figure on this regression.

      The reviewer raises a good question. We find that the slope of the "distance vs. genetic similarity" regression is significantly non-zero, though the slope estimate itself is small. A plot of cosine distance vs. genome-wide genetic similarity (using all BXDs) is shown below in Author response image 2:

      Author response image 2.

      Relationship between cosine distance and genetic similarity in the BXDs. As described in the Materials and Methods, we computed two values at each marker in the BXDs: 1) the cosine distance between the aggregate mutation spectra of BXDs with either B or D genotypes at the marker, and 2) the correlation between genome-wide D allele frequencies in BXDs with either B or D genotypes at the marker. We then regressed these two values across all genome-wide markers.

      This result indicates that if two groups of BXDs (one with D genotypes and one with B genotypes at a given locus) are more genetically similar, their mutation spectra are also more similar. Since the regression slope estimate is significantly non-zero (p < 2.2e-16), we believe that it's still worth using residuals as opposed to raw cosine distance values. This result also suggests that there may be a polygenic effect on the mutation spectrum in the BXDs.
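
      The adjustment described above can be sketched as an ordinary least-squares fit of cosine distance on genetic similarity, keeping the residual at each marker as the adjusted statistic. This is an illustration rather than the manuscript's implementation: the function name is hypothetical, and fitting a free intercept is an assumption.

```python
def ols_residuals(similarity, distance):
    """Regress distance on similarity (one value of each per marker).

    Returns (slope, intercept, residuals). The intercept is estimated
    rather than fixed at zero; that choice is an assumption of this sketch.
    """
    n = len(similarity)
    if n != len(distance) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(similarity) / n
    mean_y = sum(distance) / n
    sxx = sum((x - mean_x) ** 2 for x in similarity)
    if sxx == 0:
        raise ValueError("similarity values are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(similarity, distance))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(similarity, distance)]
    return slope, intercept, residuals
```

      With an estimated intercept the residuals sum to zero, so the adjusted statistic measures how much larger or smaller a marker's distance is than expected given the relatedness of its two genotype groups.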

      We have also generated a plot showing the cosine distance between the mutation spectra of every possible pair of BXDs, regressed against the genetic similarity between each of those pairs (Author response image 3). Here, the potential polygenic effects on mutation spectra similarity are perhaps more obvious.

      Author response image 3.

      Pairwise cosine distance between BXD mutation spectra as a function of genetic similarity. We computed two values for every possible pair of n = 117 BXDs: 1) the cosine distance between the samples' individual 1-mer mutation spectra and 2) the correlation coefficient between the samples' genome-wide counts of D alleles.

      Private Comments

      1. It will also be useful to see how the power of IHD and QTL mapping depend on the allele frequency of the mutator allele and the sample size, as mutator alleles are likely rare or semi-rare in natural populations (such as the human de novo mutation dataset that the authors mentioned).

      This is another good suggestion. In general, we'd expect the power of both IHD and QTL mapping to decrease as a function of mutator allele frequency. At the same time, we note that the power of these scans should mostly depend on the absolute number of carriers of the mutator allele and less on its frequency. In the BXD mouse study design, we observe high frequency mutators but also a relatively small sample size of just over 100 individuals. In natural human populations, mutator frequencies might be orders of magnitude smaller, but sample sizes may be orders of magnitude larger, especially as new cohorts of human genomes are routinely being sequenced. So, we expect to have similar power to detect a mutator segregating at, say, 0.5% frequency in a cohort of 20,000 individuals, as we would to detect a mutator segregating at 50% frequency in a dataset of 200 individuals.

      To more formally address the reviewer's concern, we performed a series of simulations in which we simulated a population of 100 haplotypes. We assigned the same average number of mutations to each haplotype but allowed the allele frequency of the mutator allele to vary between 0.1, 0.25, and 0.5. The results of these simulations are shown in Author response image 4 and reveal that AMSD tends to have greater power than QTL mapping at lower mutator allele frequencies. We now mention these simulations in the text at lines 135-138 and include the simulation results in Figure 1-figure supplement 4.

      Author response image 4.

      Power of AMSD and QTL mapping on simulated haplotypes with variable marker allele frequencies. We simulated germline mutations on the specified number of haplotypes (as described in the manuscript), but simulated genotypes at the mutator allele such that "A" alleles were at the specified allele frequency.

      1. In the Methods section of "testing for epistasis between the two mutator loci", it will be helpful to explicitly lay out the model and assumptions in mathematical formulae, in addition to the R scripts. For example, are the two loci considered independent when their effects on mutation rate are multiplicative or additive? Given the R scripts provided, it seems that the two loci are assumed to have multiplicative effects on the mutation rate, and that the mutation count follows a Poisson distribution with mean being the mutation rate times ADJ_AGE (i.e., the mutation opportunity times the number of generations of an inbred line). However, this is not easily understandable for readers who are not familiar with the R language. In addition, I hope the authors can be more specific when discussing the epistatic interaction between the two loci by explicitly saying "synergistic effects beyond multiplicative effects on the C>A mutation rate".

      The reviewer raises a good point about the clarity of our descriptions of tests for epistasis. We have now added a more detailed description of these tests in the section of the Materials and Methods beginning at line 875. We have also added a statement to the text at lines 289-291: “the combined effects of D genotypes at both loci exceed the sum of marginal effects of D genotypes at either locus alone.” We hope that this will help clarify the results of our tests for statistical epistasis.
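
      To make the additive vs. multiplicative distinction raised by the reviewer concrete, the sketch below computes two simple interaction contrasts from the C>A mutation rates of the four two-locus genotype classes. It is not the Poisson regression used in the manuscript; the function name and argument layout are hypothetical, and the rates are assumed to be mutations per generation that have already been normalized for inbreeding duration.

```python
def interaction_contrasts(rate_bb, rate_db, rate_bd, rate_dd):
    """Compare the double-D rate with additive and multiplicative expectations.

    Arguments are rates for genotype classes at (locus 1, locus 2):
    B/B, D/B, B/D and D/D.

    Returns (multiplicative_ratio, additive_excess):
      * multiplicative_ratio > 1 means the D/D rate exceeds the product of
        the single-locus fold changes (synergy beyond multiplicative).
      * additive_excess > 0 means the D/D rate exceeds the sum of the
        single-locus rate differences.
    """
    if min(rate_bb, rate_db, rate_bd) <= 0:
        raise ValueError("reference rates must be positive")
    multiplicative_ratio = (rate_dd * rate_bb) / (rate_db * rate_bd)
    additive_excess = rate_dd - rate_db - rate_bd + rate_bb
    return multiplicative_ratio, additive_excess
```

      A set of rates can show a positive additive excess while the multiplicative ratio is exactly 1, which is why it matters which scale is used when describing an interaction as epistatic.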

      Reviewer 2 (Public Review):

      1. The main limitation of the approach is that it is difficult to see how it might be applied beyond the context of mutation accumulation experiments using recombinant inbred lines. This is because the signal it detects, and hence its power, is based on the number of extra accumulated mutations linked to (i.e. on the same chromosome as) the mutator allele. In germline mutation studies of wild populations the number of generations involved (and hence the total number of mutations) is typically small, or else the mutator allele becomes unlinked from the mutations it has caused (due to recombination), or is lost from the population altogether (due to chance or perhaps selection against its deleterious consequences).

      The reviewer is correct that as it currently exists, IHD is mostly limited to applications in recombinant inbred lines (RILs) like the BXDs. This is due to the fact that IHD assumes that each diploid sample harbors one of two possible genotypes at a particular locus and ignores the possibility of heterozygous genotypes for simplicity. In natural, outbreeding populations, this assumption will obviously not hold. However, as we plan to further iterate on and improve the IHD method, we hope that it will be applicable to a wider variety of experimental systems in the future. We have added additional caveats about the applicability of our method to other systems in the text at lines 545-550.

      Private Comments

      1. On p. 8, perhaps I've misunderstood but it's not clear in what way the SVs identified were relevant to the samples used in this dataset - were the founder strains assembled? Is there any chance that additional SVs were present, e.g. de novo early in the accumulation line?

      Our description of this structural variation resource could have been clearer. The referenced SVs were identified in Ferraj et al. (2023) by generating high-quality long read assemblies of inbred laboratory mice. Both DBA/2J and C57BL/6J (the founder strains for the BXD resource) were included in the Ferraj et al. SV callset. We have clarified our description of the callset at lines 247-248.

      It is certainly possible that individual BXD lines have accumulated de novo structural variants during inbreeding. However, these "private" SVs are unlikely to produce a strong IHD association signal (via linkage to one of the ~7,000 markers) at either the chromosome 4 or chromosome 6 locus, since we only tested markers that were at approximately 50% D allele frequency among the BXDs.

      1. On p. 13, comparing the IHD and QTL approaches, regarding the advantage of the former in that it detects the combined effect of multiple k-mer mutation types, would it not be straightforward to aggregate counts for different types in a QTL setting as well?

      The mutation spectrum is a multi-dimensional phenotype (6-dimensional if using the 1-mer spectrum, 96-dimensional if using the 3-mer spectrum, etc.). Most QTL mapping methods use linear models to test for associations between genotypes and a 1-dimensional phenotype (e.g., body weight, litter size). In the past, we used QTL mapping to test for associations between genotypes and a single element of the mutation spectrum (e.g., the rate of C>A mutations), but there isn't a straightforward way to aggregate or collapse the mutation spectrum into a 1-dimensional phenotype that retains the information contained within the full 1-mer or 3-mer spectrum. For that reason, we developed the "aggregate mutation spectrum" approach, as it preserves information about the complete mutation spectrum in each group of strains.

      The reviewer is correct that we could also aggregate counts of different mutation types to, say, perform a QTL scan for the load of a specific mutational signature. For example, we could first perform standard mutational signature analysis on our dataset and then test for QTLs associated with each signature that is discovered. However, this approach would not solve the second problem that our method is designed to solve: the appropriate weighting of samples based on how many mutations they contain.

      1. pp. 15-16: In the discussion of how you account for relatedness between strains, I found the second explanation (on p. 16) much clearer. It would be interesting to know how much variance was typically accounted for by this regression?

      As shown in the response to Reviewer 1, genotype similarity between genotype groups (i.e., those with either D or B genotypes at a marker) generally explains a small amount of variance in the cosine distance between those groups (R² ≈ 0.007). However, since the slope term in that regression is significantly non-zero, correcting for this relationship should still improve our power relative to using raw cosine distance values that are slightly confounded by this relationship.

      1. Similarly, in the section on Applying the IHD method to the BXDs (pp. 18-19), I think this description was very useful, and some or all of this description of the experiment (and how the DNMs in it arise) could profitably be moved to the introduction.

      We appreciate the reviewer’s feedback about the details of the BXD cohort. Overall, we feel the description of the BXDs in the Introduction (at lines 65-73) is sufficient to introduce the cohort, though we now add some additional detail about variability in BXD inbreeding duration (at lines 89-93) to the Introduction as well, since it is quite relevant to some of the new simulation results presented in the manuscript.

      1. A really minor one, not sure if this is for the journal or the authors, but it would be much better to include both page and line numbers in any version of an article for review. My pdf had neither!

      We apologize for the lack of page/line numbers in the submitted PDF. We have now added line numbers to the revised version of the manuscript.

      Reviewer 3 (Public Review):

      1. Under simulated scenarios, the authors' new IHD method is not appreciably more powerful than conventional QTL mapping methods. While this does not diminish the rigor or novelty of the authors' findings, it does temper enthusiasm for the IHD method's potential to uncover new mutators in other populations or datasets. Further, adaptation of this methodology to other datasets, including human trios or multigenerational families, will require some modification, which could present a barrier to broader community uptake. Notably, BXD mice are (mostly) inbred, justifying the authors' consideration of just two genotype states at each locus, but this decision prevents out-of-the-box application to outbred populations and human genomic datasets. Lastly, some details of the IHD method are not clearly spelled out in the paper. In particular, it is unclear whether differences in BXD strain relatedness due to the breeding epoch structure are fully accounted for in permutations. The method's name - inter-haplotype distance - is also somewhat misleading, as it seems to imply that de novo mutations are aggregated at the scale of sub-chromosomal haplotype blocks, rather than across the whole genome.

      The reviewer raises very fair concerns. As mentioned in response to a question from Reviewer 1, we performed additional simulation experiments that demonstrate the improved power of IHD (as compared to QTL mapping) in situations where mutation counts are variable across haplotypes or when mutator alleles are present at allele frequencies <50% (see Author response images 1 and 4, as well as new supplements to Figure 1 in the manuscript). However, the reviewer is correct that the IHD method is not applicable to collections of outbred individuals (that is, individuals with both heterozygous and homozygous genotypes), which will limit its current applications to datasets other than recombinant inbred lines. We have added a mention of these limitations to the Results at lines 138-141 and the Discussion at lines 545-550, but plan to iterate on the IHD method and introduce new features that enable its application to other datasets. We have also explicitly stated that we account for breeding epochs in our permutation tests in the Materials and Methods at lines 670-671. Both Reviewer 1 and Reviewer 3 raised concerns about the name of our method, and we have therefore changed “inter-haplotype distance” to “aggregate mutation spectrum distance” throughout the manuscript.

      1. Nominating candidates within the chr6 mutator locus requires an approach for defining a credible interval and excluding/including specific genes within that interval as candidates. Sasani et al. delimit their focal window to 5 Mb on either side of the SNP with the most extreme P-value in their IHD scan. This strategy suffers from several weaknesses. First, no justification for using a 10 Mb window, as opposed to, e.g., a 5 Mb window or a window size delimited by a specific threshold of P-value drop, is given, rendering the approach rather ad hoc. Second, within their focal 10 Mb window, the authors prioritize genes with annotated functions in DNA repair that harbor protein coding variants between the B6 and D2 founder strains. While the logic for focusing on known DNA repair genes is sensible, this locus also houses an appreciable number of genes that are not functionally annotated, but could, conceivably, perform relevant biological roles. These genes should not be excluded outright, especially if they are expressed in the germline. Further, the vast majority of functional SNPs are non-coding (including the likely causal variant at the chr4 mutator previously identified in the BXD population). Thus, the authors' decision to focus most heavily on coding variants is not well-justified. Sasani et al. dedicate considerable speculation in the manuscript to the likely identity of the causal variant, ultimately favoring the conclusion that the causal variant is a predicted deleterious missense variant in Mbd4. However, using a 5 Mb window centered on the peak IHD scan SNP, rather than a 10 Mb window, Mbd4 would be excluded. Further, SNP functional prediction accuracy is modest [e.g., PMID 28511696], and exclusion of the missense variant in Ogg1 due to its benign prediction is potentially premature, especially given the wealth of functional data implicating Ogg1 in C>A mutations in house mice. Finally, the DNA repair gene closest to the peak IHD SNP is Rad18, which the authors largely exclude as a candidate.

      We agree that the use of a 10 Mb window, rather than an empirically derived confidence interval, is a bit arbitrary and ad hoc. To address this concern, we have implemented a bootstrap resampling approach (Visscher et al. 1996, Genetics) to define confidence intervals surrounding IHD peaks. We have added a description of the approach to the Materials and Methods at lines 609-622, but a brief description follows. In each of N trials (here, N = 10,000), we take a bootstrap sample of the BXD phenotype and genotype data with replacement. We then perform an IHD scan on the chromosome of interest using the bootstrap sample and record the position of the marker with the largest cosine distance value (i.e., the "peak" marker). After N trials, we calculate the 90% confidence interval of bootstrapped peak marker locations; in other words, we identify the locations of two genotyped markers, between which 90% of all bootstrap trials produced an IHD peak. We note that bootstrap confidence intervals can exhibit poor "coverage" (a measure of how often the confidence intervals include the "true" QTL location) in QTL mapping studies (see Manichaikul et al. 2006, Genetics), but feel that the bootstrap is more reasonable than simply defining an ad hoc interval around an IHD peak.
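
      The bootstrap procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the manuscript's code: `scan` is a hypothetical callable that takes a resampled list of samples (each carrying its own genotype and mutation data) and returns the position of the peak marker, and the interval is taken as the central percentile range of the bootstrapped peak positions.

```python
import math
import random


def bootstrap_peak_interval(samples, scan, n_trials=1000, level=0.90, seed=0):
    """Percentile bootstrap interval for the position of a scan peak.

    samples  : list of per-sample records (genotypes and mutations together)
    scan     : hypothetical callable, resampled list -> peak marker position
    n_trials : number of bootstrap resamples
    level    : fraction of bootstrap peaks the interval should contain
    """
    rng = random.Random(seed)
    peaks = []
    for _ in range(n_trials):
        # Resample whole samples with replacement so that genotype and
        # phenotype data stay paired.
        resampled = rng.choices(samples, k=len(samples))
        peaks.append(scan(resampled))
    peaks.sort()
    tail = (1.0 - level) / 2.0
    lower = peaks[math.floor(tail * (n_trials - 1))]
    upper = peaks[math.ceil((1.0 - tail) * (n_trials - 1))]
    return lower, upper
```

      Fixing the random seed makes the interval reproducible; the number of trials trades run time against the stability of the interval endpoints.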

      The new 90% confidence interval surrounding the IHD peak on chromosome 6 is larger than the original (ad hoc) 10 Mbp window, now extending from around 95 Mbp to 114 Mbp. Notably, the new empirical confidence interval excludes Mbd4. We have accordingly updated our Results and Discussion sections to acknowledge the fact that Mbd4 no longer resides within the confidence interval surrounding the IHD peak on chromosome 6 and have added additional descriptions of genes that are now implicated by the 90% confidence interval. Given the uncertainties associated with using bootstrap confidence intervals, we have retained a brief discussion of the evidence supporting Mbd4 in the Discussion but focus primarily on Ogg1 as the most plausible candidate.

      The reviewer raises a valid concern about our treatment of non-DNA repair genes within the interval surrounding the peak on chromosome 6. We have added more careful language to the text at lines 219-223 to acknowledge the fact that non-annotated genes in the confidence interval surrounding the chromosome 6 peak may play a role in the epistatic interaction we observed.

      The reviewer also raises a reasonable concern about our discussions of both Mbd4 and Ogg1 as candidate genes in the Discussion. Since Mbd4 does not reside within the new empirical bootstrap confidence interval on chromosome 6 and given the strong prior evidence that Ogg1 is involved in C>A mutator phenotypes (and is in the same gene network as Mutyh), we have reframed the Discussion to focus on Ogg1 as the most plausible candidate gene (see lines 357-360).

      Using the GeneNetwork resource, we also more carefully explored the potential effects of noncoding variants on the C>A mutator phenotype we observed on chromosome 6. We have updated the Results at lines 240-246 and the Discussion at lines 439-447 to provide more evidence for regulatory variants that may contribute to the C>A mutator phenotype. Specifically, we discovered several strong-effect cis-eQTLs for Ogg1 across multiple tissues, at which D genotypes are associated with decreased Ogg1 expression. Given new evidence that the original mutator locus we discovered on chromosome 4 harbors an intronic mobile element insertion that significantly affects Mutyh expression (see Ferraj et al. 2023, Cell Genomics), it is certainly possible that the mutator phenotype associated with genotypes on chromosome 6 may also be mediated by regulatory, rather than coding, variation.

      1. Additionally, some claims in the paper are not well-supported by the authors' data. For example, in the Discussion, the authors assert that "multiple mutator alleles have spontaneously arisen during the evolutionary history of inbred laboratory mice" and that "... mutational pressure can cause mutation rates to rise in just a few generations of relaxed selection in captivity". However, these statements are undercut by data in this paper and the authors' prior publication demonstrating that a number of candidate variants are segregating in natural mouse populations. These variants almost certainly did not emerge de novo in laboratory colonies, but were inherited from their wild mouse ancestors. Further, the wild mouse population genomic dataset used by the authors falls far short of comprehensively sampling wild mouse diversity; variants in laboratory populations could derive from unsampled wild populations.

      The reviewer raises a good point. In our previous publication (Sasani et al. 2022, Nature), we hypothesized that Mutyh mutator alleles had arisen in wild, outbreeding populations of Mus musculus, and later became fixed in inbred strains like DBA/2J and C57BL/6J. However, in the current manuscript, we included a statement about mutator alleles "spontaneously arising during the evolutionary history of inbred laboratory mice" to reflect new evidence (from Ferraj et al. 2023, Cell Genomics) that the mutator allele we originally identified in Mutyh may not be wild derived after all. Instead, Ferraj et al. suggest that the C>A mutator phenotype we originally identified is caused by an intronic mobile element insertion (MEI) that is present in DBA/2J and a handful of other inbred laboratory strains. Although this MEI may have originally occurred in a wild population of mice, we wanted to acknowledge the possibility that both the original Mutyh mutator allele, as well as the new mutator allele(s) we discovered in this manuscript, could have arisen during the production and inbreeding of inbred laboratory lines. We have also added language to the Discussion at lines 325-327 to acknowledge that the 67 wild mice we analyzed do not comprise a comprehensive picture of the genetic diversity present in wild-derived samples.

      We have added additional language to the Discussion at lines 349-357 in which we acknowledge that the chromosome 6 mutator allele might have originated in either laboratory or wild mice and elaborate on the possibility that mutator alleles with deleterious fitness consequences may be more likely to persist in inbred laboratory colonies.

      1. Finally, the implications of discovering a mutator whose expression is potentially conditional on the genotype at a second locus are not raised in the Discussion. While not a weakness per se, this omission is perceived to be a missed opportunity to emphasize what, to this reviewer, is one of the most exciting impacts of this work. The potential background dependence of mutator expression could partially shelter it from the action of selection, allowing the allele to persist in populations. This finding bears on theoretical models of mutation rate evolution and may have important implications for efforts to map additional mutator loci. It seems unfortunate to not elevate these points.

      We agree and have added additional discussion of the possibility that the C>A mutator phenotypes in the BXDs are a result of interactions between the expression of two DNA repair genes in the same base-excision network to the Discussion section at lines 447-449.

      Private comments

      1. The criteria used to determine or specify haplotype size are not specified in the manuscript. I mention this above but reiterate here as this was a big point of confusion for me when reading the paper. Haplotype length is an important consideration for overall power and for proper extension of this method to other systems/populations.

      We may not have been clear enough in our description of our method, and as suggested by Reviewer 1, the name "inter-haplotype distance" may also have been a source of confusion. At a given marker, we compute the aggregate mutation spectrum in BXDs with either B or D genotypes using all genome-wide de novo mutations observed in those BXDs. Since the BXDs were inbred for many generations, we expect that almost all de novo germline mutations observed in an RIL are in near-perfect linkage with the informative genotypes used for distance scans. Thus, the "haplotypes" used in the inter-haplotype distance scans are essentially the lengths of entire genomes.

      1. Results, first paragraph, final sentence. I found the language here confusing. I don't understand how one can compute the cosine distance at single markers, as stated. I'm assuming cosine distance is computed from variants residing on haplotypes delimited by some defined window surrounding the focal marker?

      As discussed above, we aggregate all genome-wide de novo mutations in each group of BXDs at a given marker, rather than only considering DNMs within a particular window surrounding the marker. The approach is discussed in greater detail in the caption of Figure 1.
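
      The per-marker computation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the manuscript's implementation: the function names and the ordering of the six 1-mer mutation types are hypothetical, and each line's spectrum is assumed to be a list of genome-wide de novo mutation counts in that order.

```python
import math

# Hypothetical ordering of the six 1-mer mutation types.
MUTATION_TYPES = ["C>A", "C>G", "C>T", "A>C", "A>G", "A>T"]


def aggregate_spectrum(spectra):
    """Sum per-line mutation counts into a single aggregate spectrum."""
    return [sum(s[i] for s in spectra) for i in range(len(MUTATION_TYPES))]


def cosine_distance(a, b):
    """1 minus the cosine similarity of two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare an empty spectrum")
    return 1.0 - dot / (norm_a * norm_b)


def distance_at_marker(genotypes, spectra):
    """Cosine distance between aggregate spectra of B and D lines at one marker.

    genotypes : list of "B" or "D", one per line, at the marker of interest
    spectra   : list of per-line count lists (all genome-wide mutations)
    """
    b_lines = [s for g, s in zip(genotypes, spectra) if g == "B"]
    d_lines = [s for g, s in zip(genotypes, spectra) if g == "D"]
    return cosine_distance(aggregate_spectrum(b_lines), aggregate_spectrum(d_lines))
```

      Because cosine distance is unchanged when a vector is rescaled, raw aggregate counts can be compared directly, and lines with many mutations contribute more to a group's aggregate spectrum than lines with few.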

      1. Nominating candidates for the chr6 locus, Table 1. It would be worth confirming that the three prioritized candidates (Setmar, Ogg1, and Mbd4) all show germline expression.

      Using the Mouse Genome Informatics online resource, we confirmed that all prioritized candidate genes (now including Setmar and Ogg1, but not Mbd4) are expressed in the male and female gonads, and mention this in the Results at lines 228 and 233-234.

      1. Does the chr6 peak on the C>A LOD plot (Figure 2- figure supplement 1) overlap the same peak identified in the IHD scan? And, does this peak rise to significance when using alpha = 0.05? Given that the goal of these QTL scans is to identify loci that interact with the C>A mutator on chr4, it is reasonable to hypothesize that the mutation impact of epistatic loci will also be restricted to C>A mutations. Therefore, I am not fully convinced that the conservative alpha = 0.05/7 threshold is necessary.

      The chromosome 6 peak in Figure 2-figure supplement 1 does, in fact, overlap the peak marker we identified on chromosome 6 using IHD. One reason we decided to use a more conservative alpha of (0.05 / 7) is that we wanted these results to be analogous to the ones we performed in a previous paper (Sasani et al. 2022, Nature), in which we first identified the mutator locus on chromosome 4. However, the C>A peak does not rise to genome-wide significance if we use a less conservative alpha value of 0.05 (see Author response image 5). As discussed in our response to Reviewer 1, we find that QTL mapping is not as powerful as IHD when haplotypes have accumulated variable numbers of germline mutations (as in the BXDs), which likely explains the fact that the peak on chromosome 6 is not genome-wide significant using QTL mapping.

      Author response image 5.

      QTL scan for the fraction of C>A mutations in BXDs harboring D alleles at the locus near Mutyh. The QTL scan was performed at a genome-wide significance alpha of 0.05, rather than 0.05/7.

      1. Is there significant LD between the IHD peaks on chr6 and chr4 across the BXD? If so, it could suggest that the signal is driven by cryptic population structure that is not fully accounted for in the authors' regression-based approach. If not, this point may merit an explicit mention in the text as an additional validation for the authenticity of the chr6 mutator finding.

      This is a good question. We used the scikit-allel Python package to calculate linkage disequilibrium (LD) between all pairs of genotyped markers in the BXD cohort, and found that the two peak loci (on chromosomes 4 and 6) exhibit weak LD (r² = 4e-5). We have added a mention of this to the main text of the Results at lines 212-213. That being said, we do not think the chromosome 6 mutator association (or the apparent epistasis between the alleles on chromosomes 4 and 6) could be driven by cryptic population structure. Unlike in human GWAS and other association studies in natural populations, there is no heterogeneity in the environmental exposures experienced by different BXD subpopulations. In humans, population structure can create spurious associations (e.g., between height and variants that are in LD and are most common in Northern Europe), but this requires the existence of a phenotypic gradient caused by genetic or environmental heterogeneity that is not likely to exist in the context of inbred laboratory mice that are all the progeny of the same two founder strains.
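
      For readers who want to reproduce a pairwise LD value without a dedicated package, the sketch below computes the squared Pearson correlation between genotypes at two markers using only the Python standard library. It is a stand-in for, not a reproduction of, the scikit-allel calculation; the function name is hypothetical, and genotypes are assumed to be coded 0 for homozygous B and 1 for homozygous D in each line.

```python
def ld_r_squared(geno_a, geno_b):
    """Squared Pearson correlation between genotype codes at two markers.

    geno_a, geno_b : equal-length lists of 0/1 codes, one entry per line.
    """
    n = len(geno_a)
    if n != len(geno_b) or n < 2:
        raise ValueError("need two equal-length genotype vectors")
    mean_a = sum(geno_a) / n
    mean_b = sum(geno_b) / n
    var_a = sum((a - mean_a) ** 2 for a in geno_a)
    var_b = sum((b - mean_b) ** 2 for b in geno_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("r^2 is undefined for a monomorphic marker")
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(geno_a, geno_b))
    return cov * cov / (var_a * var_b)
```

      Values near 0 indicate that genotypes at the two markers are nearly independent across lines, and values near 1 indicate that they are almost always inherited together.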

      1. Discussion, last sentence of the "Possible causal alleles..." section: I don't understand how the absence of the Mariner-family domain leads the authors to this conclusion. Setmar is involved in NHEJ, which to my knowledge is not a repair process that is expected to have a specific C>A mutation bias. I think this is grounds enough for ruling out its potential contributions, in favor of focusing on other candidates, (e.g., Mbd4 and Ogg1).

      The reviewer raises a good point. Our main reason for mentioning the absence of the Mariner-family domain is that even if NHEJ were responsible for the C>A mutator phenotype, it likely wouldn't be possible for Setmar to participate in NHEJ without the domain. However, the reviewer is correct that NHEJ is not expected to cause a C>A mutation bias, and we have added a mention of this to the text as well at lines 379-382.

      1. Discussion, second to last paragraph of section "Mbd4 may buffer...": The authors speculate that reduced activity of Mbd4 could modulate rates of apoptosis in response to DNA damage. This leads to the prediction that mice with mutator alleles at both Mutyh and Mbd4 should exhibit higher overall mutation rates compared to mice with other genotypes. This possibility could be tested with the authors' data.

      The reviewer raises a good question. As mentioned above, however, we implemented a new approach to calculate confidence intervals surrounding distance peaks and found that this empirical approach (rather than the ad hoc 10-Mbp window approach we used previously) excluded Mbd4 from the credible interval. Although we still mention Mbd4 as a possible candidate (since it still resides within the 10 Mbp window), we have refactored the Discussion section to focus primarily on the evidence for Ogg1 as a candidate gene on chromosome 6.

      In any case, we do not observe that mice with mutator alleles at both the chromosome 4 and chromosome 6 loci have higher overall mutation rates compared to mice with other genotype combinations. This may not be terribly surprising, however, since C>A mutations only comprise about 10% of all possible mutations. Thus, given the variance in other 1-mer mutation counts, even a substantial increase in the C>A mutation rate might not have a detectable effect on the overall mutation rate. Indeed, in our original paper describing the Mutyh mutator allele (Sasani et al. 2022, Nature), we did not identify any QTL for the overall mutation rate in the BXDs and found that mice with the chromosome 4 mutator allele only exhibited a 1.11X increase in their overall mutation rates relative to mice without the mutator allele.

      1. Methods, "Accounting for BXD population structure": An "epoch-aware" permutation strategy is described here, but it is not clear when (and whether) this strategy is used to determine significance of IHD P-values.

      We have added a more explicit mention of this to the Methods section at lines 670-671, as we do, in fact, use the epoch-aware permutation strategy when calculating empirical distance thresholds.
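
      A minimal sketch of what an epoch-aware permutation could look like is given here, assuming that per-line values (for example, mutation spectra) are reassigned only among lines belonging to the same epoch. The function name `permute_within_epochs` and the toy labels are hypothetical and do not come from the authors' code.

```python
import random
from collections import defaultdict


def permute_within_epochs(values, epochs, rng):
    """Return a copy of `values` shuffled only among samples sharing an epoch."""
    by_epoch = defaultdict(list)
    for idx, epoch in enumerate(epochs):
        by_epoch[epoch].append(idx)

    permuted = list(values)
    for indices in by_epoch.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # Move each value to a (possibly different) slot in the same epoch.
        for src, dst in zip(indices, shuffled):
            permuted[dst] = values[src]
    return permuted


rng = random.Random(0)                            # fixed seed for repeatability
spectra = ["a1", "a2", "a3", "b1", "b2", "c1"]    # placeholder per-line values
epochs = [1, 1, 1, 2, 2, 3]                       # hypothetical epoch labels
shuffled = permute_within_epochs(spectra, epochs, rng)
```

      Restricting the shuffle to within-epoch swaps keeps any epoch-level structure intact in every permuted data set, so the null distribution reflects that structure rather than ignoring it.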

      1. The simulation scheme employed for power calculations is highly specific to the BXD population. This is not a weakness, and perfectly appropriate to the study population used here. However, it does limit the transferability of the power analyses presented in this manuscript to other populations. This limitation may merit an explicit cautionary mention to readers who may aspire to port the IHD method over to their study system.

      This is true. Our simulation strategy is relatively simple and makes a number of assumptions about the simulated population of haplotypes (allele frequencies normally distributed around 0.5, expected rates of each mutation type, etc.). In response to concerns from Reviewer 1, we performed an updated series of simulations in which we varied some of these parameters (mutator allele frequencies, mean numbers of mutations on haplotypes, etc.). However, we have added a mention of the simulation approach's limitations and specificity to the BXDs to the text at lines 545-550.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This valuable study uses a novel experimental design to elegantly demonstrate how we exploit stimulus structure to overcome working memory capacity limits. While the behavioural evidence is convincing, the neural evidence is incomplete, as it only provides partial support for the proposed information compression mechanism. This study will be of interest to cognitive neuroscientists studying structure learning and memory.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Huang and Luo investigated whether regularities between stimulus features can be exploited to facilitate the encoding of each set of stimuli in visual working memory, improving performance. They recorded both behavioural and neural (EEG) data from human participants during a sequential delayed response task involving three items with two properties: location and colour. In the key condition ('aligned trajectory'), the distance between locations of successively presented stimuli was identical to their 'distance' in colour space, permitting a compression strategy of encoding only the location and colour of the first stimulus and the relative distance of the second and third stimulus (as opposed to remembering 3 locations and 3 colours, this would only require remembering 1 location, 1 colour, and 2 distances). Participants recalled the location and colour of each item after a delay.
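
      The compression idea summarised above can be sketched as follows, under the assumption that each feature takes one of nine values on a ring and that a trajectory is a signed step around that ring. The function names and the example sequences are hypothetical and are not taken from the study.

```python
N_VALUES = 9  # assumed ring size: nine possible values per feature


def compress(locations, colours):
    """Encode aligned sequences as (first location, first colour, shared steps).

    Returns None when the two trajectories differ, i.e. when the
    sequence cannot be compressed this way.
    """
    loc_steps = [(b - a) % N_VALUES for a, b in zip(locations, locations[1:])]
    col_steps = [(b - a) % N_VALUES for a, b in zip(colours, colours[1:])]
    if loc_steps != col_steps:
        return None
    return locations[0], colours[0], loc_steps


def decompress(code):
    """Rebuild both item sequences from the compressed code."""
    first_loc, first_col, steps = code
    locations, colours = [first_loc], [first_col]
    for step in steps:
        locations.append((locations[-1] + step) % N_VALUES)
        colours.append((colours[-1] + step) % N_VALUES)
    return locations, colours


aligned_code = compress([2, 5, 4], [7, 1, 0])     # shared steps: 4 numbers stored
misaligned_code = compress([2, 5, 4], [7, 1, 3])  # steps differ: no compression
```

      In the aligned example, six remembered values reduce to one location, one colour, and two steps, which is the saving the reviewer describes.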

      Consistent with the compression account, participants' location and colour recall errors were correlated and were overall lower compared to a non-compressible condition ('misaligned trajectory'). Multivariate analysis of the neural data permitted decoding of the locations and colours during encoding. Crucially, the relative distance could also be decoded - a necessary ingredient for the compression strategy.

      Strengths:

      The main strength of this study is a novel experimental design that elegantly demonstrates how we exploit stimulus structure to overcome working memory capacity limits. The behavioural results are robust and support the main hypothesis of compressed encoding across a number of analyses. The simple and well-controlled design is suited to neuroimaging studies and paves the way for investigating the neural basis of how environmental structure is detected and represented in memory. Prior studies on this topic have primarily studied behaviour only (e.g., Brady & Tenenbaum, 2013).

      Thanks for the positive comments and excellent summary.

      Weaknesses:

      The main weakness of the study is that the EEG results do not make a clear case for compression or demonstrate its neural basis. If the main aim of this strategy is to improve memory maintenance, it seems that it should be employed during the encoding phase. From then on, the neural representation in memory should be in the compressed format. The only positive evidence for this occurs in the late encoding phase (the re-activation of decoding of the distance between items 1 and 2, Fig. 5A), but the link to behaviour seems fairly weak (p=0.068).

      Thanks for raising this important concern. The reviewer is correct that in principle subjects should employ the compression strategy during the encoding phase when sequence stimuli are presented, yet our results show that the 1-2 trajectory could only be decoded during the late encoding phase.

      Meanwhile, subjects could not get enough information to form the compression strategy for the location and color sequences until the appearance of the 3rd item. Specifically, based on the 1st and 2nd items alone, they only learn whether the 1st-2nd trajectories are congruent between location and color features. They could not predict whether this congruence would also apply to the incoming 2nd-3rd trajectory. This is exactly what we found in the neural decoding results. The 1st-2nd trajectory could be decoded after the 2nd item presentation, and the 2nd-3rd trajectory appears after the 3rd item onset. Most critically, the 1st-2nd trajectory is reactivated after the 3rd item, but only for the aligned condition, implicating formation of the full-sequence compression strategy wherein the previously formed 1st-2nd trajectory is reactivated to be connected to the 2nd-3rd trajectory.

      Regarding the difference between higher- and lower-correlation groups, previously we used the time window based on the overall 2nd-3rd neural reactivations, which might not be sensitive to reactivation strength. We have now re-selected the time window based on the higher-correlation group (bootstrap test, p = 0.037, two-sided).

      Results have been updated (Figure 5; Results, Page 16). Interpretations about the formation of compression strategy during encoding phase have been added to Results (Page 15-16) and Discussion (Page 18).

      Stronger evidence would be showing decoding of the compressed code during memory maintenance or recall, but this is not presented. On the contrary, during location recall (after the majority of memory maintenance is already over), colour decoding re-emerges, but in the un-compressed item-by-item code (Fig. 4B). The authors suggest that compression is consolidated at this point, but its utility at this late stage is not obvious.

      Thank you for raising this important question. We apologize for previously omitting neural evidence for the compressive account during maintenance.

      The reason we did not perform neural decoding during maintenance is that previous EEG/MEG studies including our own failed to reveal robust and sustained time-resolved memory decoding during this period. This is posited to arise from “activity-silent” WM states, wherein memories are not necessarily retained in sustained firing but silently stored within connection weights of WM networks (Stokes, Trends Cogn. Sci., 2015; Rose, Curr Dir Psychol Sci, 2020). Our previous work showed that by transiently perturbing the 'activity-silent' WM using a retrocue or neutral impulse, memories could be reactivated and robustly decoded from neural activities (Huang et al., eLife, 2021). However, due to the lack of transient events during retention in the current design, we do not expect robust decoding results during maintenance. As shown below (AB), this is indeed what we have observed, i.e., no robust neural decoding of trajectories during retention.

      We further used alpha-band (8-11 Hz) neural activities, which have been shown to carry WM information (de Vries et al., Trends Cogn. Sci, 2020; Foster et al., Curr. Biol, 2016; Fukuda et al., J. Neurophysiol, 2016; Sutterer et al., PLOS Biol., 2019) to perform decoding analysis of compression trajectories during maintenance. As shown below, the alpha-band decoding results are indeed stronger than raw activities. Importantly, as shown below (CD), the aligned condition indeed showed significant and long-lasting decoding of compression trajectories (1st-2nd, 2nd-3rd) during retention, while the misaligned condition only showed decoding at the beginning (GH), which might be due to the non-specific offset response of the 3rd item. The results, although not as clear as those during encoding and recalling periods, support the reviewer’s hypothesis that the compressive strategy, if exploited, would be demonstrated during both encoding and maintenance periods. New results and related discussion have been added (Page 16, Supplementary Figure 4).

      With regards to the observed item-by-item color replay during location recall, the reviewer was concerned that this was not consistent with the compressive account, given the lack of trajectory decoding.

      First, item sequences stored in compressive formats need to be converted to sequences during serial recall. In other words, even though color and location sequences are retained in a compressive format (i.e., common 1st-2nd, 2nd-3rd trajectories) throughout the encoding and retention phases, they should be transferred to two sequences as outputs. This is exactly why we performed decoding analysis on individual color and location items rather than trajectories.

      Second and most importantly, we observed serial replay of color sequences when recalling locations. In our view, these results constitute strong evidence for common structure, since the spontaneous color replay during location recall for the aligned condition highlights the close binding between color and location sequences stored in WM. In fact, item-by-item serial replay has been well acknowledged as a critical neural index of cognitive maps, not only for spatial navigation but also for higher-order tasks (e.g., Liu et al., Cell, 2019; Liu et al., Science, 2021). Therefore, spontaneous color sequence replay during location sequence recall supports their shared underlying cognitive map.

      Finally, spontaneous serial replay is also correlated with the reactivation of compressive trajectories during encoding (Supplementary Figure 3). This further indicates that serial replay during recalling is associated with memory reorganization formed during encoding.

      Taken together, we posit that memories need to be converted to sequences as outputs, which leads to serial reactivations during recalling. Importantly, the observed spontaneous replay of color sequences for the aligned condition provides strong evidence supporting the associations between color and location sequences in WM.

      We have now added relevant interpretations and discussions (Page 11&13).

      Reviewer #2 (Public Review):

      Summary:

      In this study, the authors wanted to test if using a shared relational structure by a sequence of colors in locations can be leveraged to reorganize and compress information.

      Strength:

      They applied machine learning to EEG data to decode the neural mechanism of reinstatement of visual stimuli at recall. They were able to show that when the location of colors is congruent with the semantically expected location (for example, green is closer to blue-green than purple) the related color information is reinstated at the probed location. This reinstatement was not present when the location and color were not semantically congruent (meaning that x displacement in color ring location did not displace colors in the color space to the same extent) and semantic knowledge of color relationship could not be used for reducing the working memory load or to benefit encoding and retrieval in short term memory.

      Weakness:

      The experiment and results did not address any reorganization of information or neural mechanism of working memory (that would be during the gap between encoding and retrieval).

      We apologize for not presenting clear neural evidence for memory reorganization, particularly neural decoding during WM maintenance and retrieval, in the previous version. Below, we explain why the findings provide converging neural evidence for WM reorganization based on a shared cognitive map.

      First, during the encoding phase when location and color sequences are serially presented, our results reveal reactivation of the 1st-2nd trajectories upon the onset of the 3rd item when location and color sequences are aligned with each other. The reactivation of 1st-2nd trajectory right after the emergence of 2nd-3rd trajectory for aligned but not for misaligned sequences strongly supports WM reorganization, since only stimulus sequences that could be compressed based on shared trajectories (aligned condition) show the co-occurrence of 1st-2nd and 2nd-3rd trajectories. Moreover, the relevance of 1st-2nd reactivation to behavioral measurements of color-location reorganization (i.e., behavioral trajectory correlation, Figure 5D) further indicates its link to WM reorganization.

      Second, the reason we originally did not perform neural decoding during maintenance is that previous EEG/MEG studies including our own failed to reveal robust and sustained time-resolved memory decoding during this period. This is posited to arise from “activity-silent” WM states, wherein memories are not necessarily retained in sustained firing but silently stored within connection weights of WM networks (Stokes, Trends Cogn. Sci., 2015; Wolff et al., Nat. Neurosci, 2017; Rose et al., Curr Dir Psychol Sci, 2020). Our previous work showed that by transiently perturbing the 'activity-silent' WM using a retrocue or neutral impulse, memories could be reactivated and robustly decoded from neural activities (Huang et al., eLife, 2021). However, due to the lack of transient events during retention in the current design, we do not expect robust decoding results during maintenance. As shown in Supplementary Figure 4(AB), this is indeed what we have observed, i.e., no robust neural decoding of trajectories during retention.

      We then used alpha-band (8-11 Hz) neural activities, which have been found to carry WM information (de Vries et al., Trends Cogn. Sci, 2020; Foster et al., Curr. Biol, 2016; Fukuda et al., J. Neurophysiol, 2016; Sutterer et al., PLOS Biol., 2019) to perform decoding analysis of compression trajectories during maintenance. As shown below, the alpha-band decoding results are indeed stronger than raw activities. Importantly, as shown in Supplementary Figure 4(CD), the aligned condition indeed showed significant and long-lasting decoding of compression trajectories (1st-2nd, 2nd-3rd) during retention, while the misaligned condition only showed decoding at the beginning (GH), which might be due to the non-specific offset response of the 3rd item. The results, although not as clear as those during encoding and recalling periods, thus also support WM reorganization.

      Finally, during the recalling period, we observed automatic serial replay of color sequences when recalling locations. In our view, these results constitute strong evidence for common structure, since the spontaneous color replay during location recall for the aligned condition highlights the close binding between color and location sequences stored in WM. In fact, item-by-item serial replay has been well acknowledged as a critical neural index of cognitive maps, not only for spatial navigation but also for higher-order tasks (e.g., Liu et al., Cell, 2019; Liu et al., Science, 2021). Therefore, spontaneous replay of color sequence during location recall supports their shared underlying cognitive map. Moreover, the spontaneous serial replay is correlated with the reactivation of compressive trajectories during encoding (Supplementary Figure 3). This further indicates that serial replay during recalling is associated with memory reorganization formed during encoding.

      Taken together, we have added updated results about the maintenance period (Page 16, Supplementary Figure 4) and included clarifications and interpretations about why the findings during the encoding and retrieval periods support the WM reorganization view (Page 15-16).

      There was also a lack of evidence to rule out that the current observation can be addressed by schematic abstraction instead of the utilization of a cognitive map.

      The likely impact of the initial submission of the study would be in the utility of the methods that would be helpful for studying a sequence of stimuli at recall. The paper was discussed in a narrow and focused context, referring to limited studies on cognitive maps and replay. The bigger picture and long history of studying encoding and retrieval of schema-congruent and schema-incongruent events is not discussed.

      We agree with the reviewer that the cognitive map referred to here could be understood as schematic abstraction. A cognitive map refers to the internal representation of spatial relations in a specific environment (Tolman 1948). Schematic abstraction denotes a broader range of circumstances, whereby the gist or structure of multiple environments or episodes can be integrated (Bartlett, 1932; Farzanfar et al., Nat. Rev. Neurosci, 2023).

      In other words, a schema refers to a highly abstract framework of prior knowledge that captures common patterns across related experiences, which does not necessarily occur in a spatial framework as cognitive maps do. Meanwhile, in the current design, we specifically manipulate the consistency of spatial trajectory distance between color and location sequences. Therefore, we would argue that cognitive map is a more conservative and appropriate term to frame our findings.

      Relevant discussions have been added (Page 3&19).

      We apologize for the lack of a more general discussion and have added schema-related literature. Thanks for the suggestion.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Do time-frequency-domain data (e.g., alpha-band power) in the delay provide evidence for delay-period decoding of trajectory lengths? This might strengthen the case for compression.

      Thanks for the suggestion. We now performed decoding analysis of the delay period based on alpha-band power. As shown in supplementary figure 4, both the 1st-2nd and 2nd-3rd trajectories could be decoded for the aligned condition.

      Added in supplementary figure 4 and Page 16.  

      (2) Do participants erroneously apply the compression strategy in the misaligned condition? This would not show up in the trajectory error correlation analysis, but might be visible when examining correlations between raw trajectory lengths.

      Thanks for raising this interesting suggestion. To test the hypothesis, we chose a typical misaligned condition where the 1st-2nd trajectory distances are the same between location and color sequences, while the 2nd-3rd trajectory distances are different between the two features.

      In this case, participants might exploit the compression strategy for the first two items and erroneously apply the strategy to the 3rd item. If so, we would expect better memory performance for the first two items but worse memory for the 3rd item, compared to the rest of the misaligned trials. As shown below, the 1st-2nd aligned trials showed marginally significantly higher performance than misaligned trials for the first two items (t(32) = 1.907, p = 0.066, Cohen’s d = 0.332). However, we did not find significantly worse performance for the 3rd item between the two conditions (t(32) = -0.4847, p = 0.631, Cohen’s d = -0.084). We observed a significant interaction between the last two items and the alignment effect (t(32) = 2.082, p = 0.045, Cohen’s d = 0.362), indicating a trend toward applying the wrong compression strategy to the 3rd item.

      Author response image 1.
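
      The effect sizes reported for these three tests are numerically consistent with the paired-sample relation d = t / √n with n = 33 (df = 32). That the authors computed Cohen's d this way is an assumption; the sketch below only checks the arithmetic, and the function name is hypothetical.

```python
from math import sqrt


def cohens_d_from_paired_t(t_value, df):
    """Cohen's d for a paired (or one-sample) t-test: d = t / sqrt(n), n = df + 1."""
    n = df + 1
    return t_value / sqrt(n)


# t statistics quoted in the response, all with df = 32.
d_first_two = cohens_d_from_paired_t(1.907, 32)
d_third = cohens_d_from_paired_t(-0.4847, 32)
d_interaction = cohens_d_from_paired_t(2.082, 32)
```

      Rounded to three decimals, these values match the reported 0.332, -0.084, and 0.362.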

      (3a) Some more detail on some of the methods might help readers. For instance, did trajectories always move in a clockwise direction? Could the direction reverse on the third item? If not, did this induce a response bias? Could such a bias possibly account for the trajectory error correlations?

      Sorry for the unclear statement. For each trial, both the color and location features of the three items are randomly selected from nine possible values without any constraint on direction. That is to say, the trajectories can move in a clockwise or anticlockwise direction, and the direction can also reverse on the third item in some trials. Thus, we think the current design actually helps to reduce the influence of response bias. Taking a step back, if trajectory error correlations were due to response bias, we should expect consistently significant correlations for all conditions, instead of observing significant correlations only for the 1st-2nd and 2nd-3rd trajectories but not for the 1st-3rd trajectory, and only in the aligned trajectory condition but not in the misaligned condition. Therefore, we think the trajectory error correlations cannot be simply explained by response bias.

      Details have been added (Page 23).

      (3b) Is the colour wheel always oriented the same way for a participant? If so, given there are only nine colors, it seems possible that colors are mapped to locations and remembered in a location code instead. This does not seem to be a problem in principle for the behavioural findings, but might change the interpretation of what is being decoded from the EEG. If this is a possibility then this might be acknowledged.

      The color wheel is always oriented the same way for each participant. We agree with the reviewer that participants may tend to map colors to locations and remember them in a location code. We don’t have sufficient evidence to rule out this possibility. One possible way to test it would be to run another experiment in which the color wheel orientation is varied during the response period. Meanwhile, we would like to point out that the underlying logic of the current design is based on the facts that thinking spatially is intuitive and that spatial metaphors like “location” and “distance” are commonly used to describe the world, e.g., the well-known mental number line (Dehaene et al., JEP: General, 1993). Therefore, we expected participants to associate or integrate location and color maps based on trajectory distance.

      The reviewer is correct that the color decoding would reflect spatial location rather than the genuine color feature. This is actually the point of the experimental design, whereby two irrelevant features could be possibly combined within a common cognitive map. Without the realignment of the two feature maps defined in space, subjects could not at all form the strategy to compress the two sequences. In other words, decoding of color sequences could be understood as neural representation of a series of corresponding locations along the ring that are independent of the physical locations of the items.

      Interpretations and clarifications have been added (Page 23&26).

      (4) Does the discretisation of the stimulus distribution (to only 9 possible locations) make the compression strategy easier to use? If the features had been continuously distributed across the location/colour circle, would participants still pick up on and use the shared trajectory structure?

      Thanks for the question. Without further data, it’s hard to say whether the discretization of the stimulus distribution would make the compression strategy easier to use or not, compared to continuous distribution. Both outcomes seem possible. On the one hand, discrete stimulus distribution would result in discrete trajectory distribution, which helps participants to realize the common trajectory strategy. On the other hand, discrete stimulus distribution would result in category or label representation, which may weaken the effectiveness of structure compression strategy. We postulate that our findings could be generalized to continuous trajectories in a cognitive map within certain resolution.

      (5a) Minor point: I disagree that avoiding the same points for location and colour for a given item allows them to be independently decoded. I would argue the contrary - this kind of constraint should create a small anti-correlation that in principle could lead to spurious decoding of one variable (although this seems unlikely here).

      We appreciate the concern. As mentioned above, with a discrete stimulus distribution (9 possible values for both color and location domains), it is quite possible that a fraction of trials would share the same values in location and color. Therefore, the neural decoding for one domain might be confounded by the other domain. To dissociate their neural representations, we imposed the constraint that color and location could not occupy the same value for a given item.

      We agree that this kind of constraint might create a small anti-correlation, even though it is not observed here. Future studies using continuous stimulus distribution would reduce the correlation or anti-correlation between stimuli.

      (5b) Very minor point: 1,000 permutations for significance testing seems on the low side. Since some of the p-values are close to 0.05 it may be worth running more permutations.

      Thanks for this suggestion. We got similar results using 1000 or 10000 permutations.
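
      The reviewer's concern about precision near p = 0.05 can be made concrete with the binomial standard error of a Monte Carlo p-value estimate, sqrt(p(1 - p) / n). The sketch below is a generic calculation under that standard approximation, not the authors' analysis, and the function name is hypothetical.

```python
from math import sqrt


def permutation_p_standard_error(p, n_permutations):
    """Approximate Monte Carlo standard error of an estimated permutation p-value."""
    return sqrt(p * (1.0 - p) / n_permutations)


se_1000 = permutation_p_standard_error(0.05, 1000)
se_10000 = permutation_p_standard_error(0.05, 10000)
```

      Increasing the number of permutations tenfold shrinks the standard error by a factor of about √10, which matters most for p-values that sit close to the significance threshold.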

      (6) Missing reference: H. H. Li et al., 2021 (line 213) seems not to be on the list of references.

      Sorry for the mistake. Added.

      Reviewer #2 (Recommendations For The Authors):

      The study aimed to discuss the working memory mechanism, instead, it seems to be focused on the encoding and recall strategies after a short while, I recommend updating the manuscript to refer to the relevant cognitive mechanism.

      There was a strong voice on the effect of using the cognitive map in working memory, without any tests on if indeed a cognitive map was used (for example the novel link between stimuli and how a cognitive map can be used to infer shortcuts). Was the participant required to have any mental map beyond the schema of the shown color ring?

      In the current experiment, to discuss if the effect is driven by utilizing a cognitive map or schematic abstraction of color-relatedness, further analysis is required to possibly assess the effects of schema on neural activity and behavior. Namely,

      (1) Was there any reinstatement of schematically congruent (expected) colors that were probed by location 1, at locations 2 and 3 in the MAT condition?

      Thanks for pointing out this possibility. However, we don’t think there will be stable color expectations given location information under the MAT condition. First, as the trajectory distance varied on a trial-by-trial basis, no prior common trajectory knowledge could be used to make inference about the current stimuli in individual trial. Second, the starting points for color and location (1st item) were randomly and independently selected, such that color sequence could not be predicted based on the location sequence for both aligned and misaligned conditions.

      (2) Given that response time can be a behavioral marker of schematic conflict, was the response time faster for congruent than incongruent conditions?

      Thanks for this question. Unfortunately, due to the experimental design, the response time could not be used as a behavioral marker to infer mental conflicts, since participants were not required to respond as fast as possible. Instead, they took their own pace to reproduce sequences without time limit. They could even take a short break before submitting their response to initiate the next trial.

      (3) In case you cannot rule out that utilizing schema is the cognitive mechanism that supports working memory performance (the behavior), please add the classical literature (on the memory of schematically congruent and incongruent events) to the discussion.

      Thanks for this suggestion. We have now added the relevant literature (Page 3&19).

      (4) On page 6, 'common structure in the cognitive map' is the schema, isn't it?

      Correct. Based on our understanding, ‘common structure in the cognitive map’ is a spatial schema.

      (5) In Figure 2 EFG, would you please use a mixed effect model or show evidence that all participants demonstrated a correlation between the location trajectory error and color trajectory error?

      Thanks for the suggestion. We have added the mixed effect model results, which are consistent with Figure 2EFG (AT: 1st-2nd trajectory, β = 0.071, t = 4.215, p < 0.001; 2nd-3rd trajectory, β = 0.077, t = 3.570, p < 0.001; 1st-3rd trajectory, β = 0.019, t = 1.118, p = 0.264; MAT: 1st-2nd trajectory, β = 0.031, t = 1.572, p = 0.116; 2nd-3rd trajectory, β = 0.002, t = 0.128, p = 0.898; 1st-3rd trajectory, β = -0.017, t = -1.024, p = 0.306).

      In general, doesn't such correlation just show that good participants/trials were good (some did well in the study and some did poorly throughout?)

      We don’t think the trajectory error correlation results just reveal that some participants did well and some participants did poorly. If that were the case, we should not observe a significant correlation in Figure 2D, where we first compute the correlation for each participant and then test its significance at the group level. Indeed, the trajectory error correlation between color and location domains characterizes the consistent changes between the two domains.

      It is worth noting that the correlation was estimated with signed trajectory errors in the color and location domains, which means that we tested whether the errors in the two domains varied consistently in the same direction, i.e., whether a remembered trajectory longer than the actual trajectory in the location domain would predict a longer remembered trajectory in the color domain.

      Moreover, as shown in Figure 2EFG, by dividing trials into 4 bins according to the location trajectory error for each participant and pooling the data across participants, we observed 4 clusters along x-axis (location trajectory error). This suggests that participants’ memory performance is rather consistent instead of being extremely good or bad. Besides, if trajectory error correlation is due to different overall memory performance between participants, we should observe significant trajectory error correlations both in AT and MAT conditions, instead of only under AT condition and for 1st-2nd and 2nd-3rd trajectories but not for 1st-3rd trajectory.
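
      The argument above (correlate within each participant first, then test at the group level) can be illustrated with a minimal sketch on simulated data. Everything here is hypothetical and not taken from the study: the function names, the number of participants and trials, the noise levels, and the use of a Fisher z-transform followed by a one-sample t statistic are all assumptions chosen for illustration. The point is only that a shared per-participant bias inflates the pooled correlation but leaves the within-participant correlations centered on zero.

```python
import math
import random
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def group_level_t(per_subject_r):
    """One-sample t statistic (against zero) on Fisher z-transformed r values."""
    z = [math.atanh(r) for r in per_subject_r]
    return mean(z) / (stdev(z) / math.sqrt(len(z)))


def simulate(coupling, n_subjects=30, n_trials=60, seed=0):
    """Hypothetical data: each subject has an overall bias shared by both domains.

    `coupling` sets how strongly the trial-by-trial location error carries
    over to the color error within a subject (0.0 means no carry-over).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_subjects):
        bias = rng.gauss(0.0, 2.0)  # between-subject difference only
        loc_dev = [rng.gauss(0.0, 1.0) for _ in range(n_trials)]
        loc = [bias + d for d in loc_dev]
        col = [bias + coupling * d + rng.gauss(0.0, 1.0) for d in loc_dev]
        data.append((loc, col))
    return data


def pooled_r(data):
    """Correlation after pooling all trials, ignoring who produced them."""
    loc = [v for l, _ in data for v in l]
    col = [v for _, c in data for v in c]
    return pearson_r(loc, col)


def within_subject_t(data):
    """Correlate within each subject, then test the r values at group level."""
    return group_level_t([pearson_r(l, c) for l, c in data])
```

      With `coupling=0.0` the pooled correlation is large purely because of between-participant differences, whereas the group-level statistic on within-participant correlations stays near zero; it becomes large only when `coupling` is nonzero.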

      In Figure 2 G, is the marginal error just too big to be sensitive? I am not sure what we are learning here, please clarify.

      Sorry for the confusion. To examine this possibility, we excluded errors which are beyond 2.5 * σ, and still observed non-significant 1st-3rd trajectory error correlation between color and location domains (r = 0.119, p = 0.167).
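
      The exclusion step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, σ is taken as the sample standard deviation around the mean, and a trial is dropped if its error exceeds the criterion in either domain. The response does not specify these details.

```python
from statistics import mean, stdev


def within_k_sigma(values, k=2.5):
    """Flag each value as True if it lies within k sample SDs of the mean."""
    m, s = mean(values), stdev(values)
    return [abs(v - m) <= k * s for v in values]


def exclude_outlier_trials(loc_err, col_err, k=2.5):
    """Drop trials whose error is beyond k*sigma in either domain (assumption)."""
    keep_loc = within_k_sigma(loc_err, k)
    keep_col = within_k_sigma(col_err, k)
    kept = [
        (l, c)
        for l, c, ok_l, ok_c in zip(loc_err, col_err, keep_loc, keep_col)
        if ok_l and ok_c
    ]
    return [l for l, _ in kept], [c for _, c in kept]
```

      The correlation between the two returned lists can then be recomputed to check whether extreme errors were masking an effect.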

      The 1st-3rd trajectory showed nonsignificant behavioral correlation and neural representation, which suggests that the current sequential memory task encourages participants to organize information by relying more on adjacent items and their distance. Thus, we think the 1st-3rd trajectory serves as a control trajectory, which helps us not only exclude other possible explanations (e.g., systematic response bias) but also validate the current findings at both the behavioral and neural levels.

      Results and statements have now been added (Pages 10-11).

      Author response image 2.

      (6) Regarding the first lines on page 11, did you do qualitative research to know if less information was encoded in congruent conditions?

      The current experimental design is inspired by studies of the mental compression of spatial sequences from Dehaene’s lab (Amalric et al., 2017; Roumi et al., 2021), which propose that the human brain compresses spatial sequences using an abstract language and formalize the minimal description length of a sequence as the “language-of-thought complexity.” Based on this evidence, we think less information is required to describe the congruent condition than the incongruent condition. This idea is supported by the better memory performance in the congruent condition. Unfortunately, we could not quantify how much less information was encoded in the congruent condition.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This work by Ding et al uses agent-based simulations to explore the role of the structure of molecular motor myosin filaments in force generation in cytoskeletal structures. The focus of the study is on disordered actin bundles which can occur in the cell cytoskeleton and have also been investigated with in vitro purified protein experiments.

      Strengths:

      The key finding is that cooperative effects between multiple myosin filaments can enhance both total force and the efficiency of force generation (force per myosin). These trends were possible to obtain only because the detailed structure of the motor filaments with multiple heads is represented in the model.

      We appreciate your comments about the strength of our study. 

      Weaknesses:

      It is not clearly described what scientific/biological questions about cellular force production the work answers. There should be more discussion of how their simulation results compare with existing experiments or can be tested in future experiments.

      Please see our response to the comment (1) below.

      The model assumptions and scientific context need to be described better.

      We apologize for the insufficient descriptions about the model and the scientific context. We revised the manuscript to better explain model assumptions and scientific context as described in our responses below.

      The network contractility seems to be a mere appendix to the bundle contractility which is presented in much more detail.

      Please see our response to the comment (6) below.

      Reviewer #1 (Recommendations for the authors):

      (1) It is not clearly described what scientific/biological questions about cellular force production the work answers. There should be more discussion of how their simulation results compare with existing experiments, or can be tested in future experiments. The authors do briefly mention Reference 4 where different myosin isoforms were used, but it is not clear that these experiments support the scalings predicted in this work in Figures 3-6. Also, the experiments in Ref. 4 apparently did not involve passive crosslinkers (ACPs) which are key in this study.

      Thank you for the comment. In the 5th paragraph of the discussion section of the original manuscript, we applied our findings to understand how structural differences between ventral stress fibers and actin arcs could affect force generation. In addition, at the end of the discussion section, we mentioned that experiments with artificially-made myosin thick filaments could be used for verifying our results. 

      The experiments in Ref. 4 were the only ones with which we could directly compare our results. In a previous study, actomyosin bundles were experimentally created with ACPs (K.L. Weirich et al., Biophys J, 2021, 120: 1957-1970), but the motions of myosin thick filaments were the only quantities measured in the experiments. In general, measuring forces generated by in vitro actomyosin bundles is very challenging. This is why the predictions from our model are particularly valuable for understanding the force generation of actomyosin structures.

      (2) The architecture of the bundles seems to be prescribed by hand in these simulations. Several well-known stochastic aspects of the dynamics of actin and actin-binding proteins are not included in the model. For example, there is no remodeling of the actin structures through actin polymerization and depolymerization, or crosslink (ACP) binding and unbinding. Can the authors comment on why these effects could be neglected for the questions they want to address?

      Thank you for the comment. We previously showed that the force generation process in actomyosin networks and bundles is affected by actin dynamics (Q. Yu et al., Biophys J, 2018, 115: 2003-2013) and the unbinding of ACPs (T. Kim, Biomech Model Mechanobiol, 2015, 14(2): 345-355 and W. Jung et al., Comput Part Mech, 2015, 2(4): 317-327). 

      However, we did not include the actin dynamics and the ACP unbinding in the current study to clearly understand the effects of the structural properties of thick filaments on the force generation process. We have learned that the stochastic behaviors of cytoskeletal components lead to noisier results, which requires us to run a much larger number of simulations to obtain statistically convincing data. We added the following paragraph in the discussion section of the revised manuscript:

      “Although this study focused mainly on parameters related to motor structures, we expect that other parameters would affect the force generation process. For example, as we showed before, a decrease in ACP density would reduce forces by deteriorating connectivity between filaments. With very low ACP density, some of neighboring motors may not have ACPs between them, thus adding up their forces as shown in Fig. 2. However, such low ACP density may not maintain the structure of bundles or cross-linked networks well. In addition, the force-dependent unbinding of ACPs could change the spatial distribution of ACPs during force generation. If they behave as a slip bond which unbinds more frequently with higher forces, ACPs may not stay between two motors for long time due to high tension. Then, forces generated by two motors may have a higher chance to add up. By contrast, if they behave as a catch bond which unbinds less frequently with larger forces, more ACPs will be recruited between two motors, reducing a chance to add up forces. The length of actin filaments is unlikely to affect the force generation process significantly unless filaments are very short. Additionally, as we showed before, actin turnover would reduce forces by competing with motor activities, change connectivity between filaments over time, and prevent motors from being stalled for long time, all of which could affect force generation.”

      (3) The present study is confined to the fixed density of motors and ACPs. However, these can be easily varied in in vitro experiments. Works such as Reference 4 show an optimum in contractility vs myosin concentration. Myosins act not only to slide actin filaments but also crosslink them.

      Can the authors vary myosin concentration to demonstrate such effects in their model?

      As the reviewer pointed out, there is a belief that myosin thick filaments can serve as crosslinkers as well. However, unless there is a fraction of dead myosins (which remain bound on filaments without walking) or myosins dwell at the barbed ends of filaments for a very long time, it seems very hard for bundles or networks to generate large forces. A previous experiment showed that active myosins increase the viscosity of actin networks, not their elasticity (D. Humphrey et al., Nature, 2002, 416: 413-416). Computer simulations with reasonable assumptions did not show significant force generation without cross-linkers. We have tested systems with a large number of motors and a few cross-linkers in previous studies (T. Kim, Biomech Model Mechanobiol, 2015, 14(2): 345-355 and W. Jung et al., Comput Part Mech, 2015, 2(4): 317-327). We observed that a large force/stress was generated momentarily, but it relaxed very quickly. We expect similar outcomes if we try such conditions in the current study.

      (4) Why is there a (factor of 1.5-2) discrepancy in the measured (Ftot) and estimated (Fest) force values in Figure 4-6? How can the authors improve their scaling arguments to capture this? What about the estimated efficiency?

      Thank you for the comment. Indeed, there was a discrepancy between the actual and estimated forces. When the estimated force was calculated, we used the z positions of motors without consideration of the actual bundle geometry with multiple filaments. For example, if two motors are located on the opposite sides of the bundle (i.e., if they are located far from each other in x or y direction), forces generated by them may not counterbalance each other. Then, the estimated force can be smaller than the actual force because counterbalance between motors can be overcounted. The original manuscript had the following sentences to clarify this point: “F<sub>est</sub> was generally smaller than F<sub>tot</sub> because this analysis does not account for actual bundle geometry consisting of multiple F-actins; if two motors are located far from each other in x or y direction, they may not counterbalance or add up forces. Nevertheless, we found that F<sub>est</sub> captures the overall dependence of F<sub>tot</sub> on parameters well.”

      (5) Several choices of parameter values used in the simulations are not clear:

      a) Why consider F actin of 140 nm specifically? Actin can come in a range of lengths. How do their results depend upon the length scale of actin?

      It seems that there is a misunderstanding. 140 nm is the equilibrium length of one actin segment in our model. The actual F-actin consists of multiple actin segments. The length of F-actin was 9 μm in bundle simulations and 10 μm (average) in network simulations. We expect that the general tendency of our results would not change with different filament lengths. However, if the filament length becomes too short, the force generation process would be impaired due to a lack of connectivity between filaments.

      b) Similarly, very specific values of myosin backbone length (42 nm), number of myosin heads (8), number of arms (24), and Actin Cross-linking Proteins (ACPs). What informs these values and how will the results change if they are different? It is not especially clear how an "Arm" differs from "heads" and what kind of coarse-graining is involved.

      In the “model overview” section of the original manuscript, we mentioned the following to clarify the definitions of motor arms and motor heads: 

      “To mimic the structure of bipolar filaments, each motor has a backbone, consisting of serially linked segments, and two arms on each endpoint of the backbone segments that represent 8 myosin heads (N<sub>h</sub> = 8).”

      We devised this coarse-graining scheme of myosin thick filaments in our previous work (T. Kim, Biomech Model Mechanobiol, 2015, 14(5): 1143-1155). Through extensive tests, we showed that force generation and motor behaviors are largely independent of coarse-graining level. In other words, a motor with the same value of N<sub>h</sub>N<sub>a</sub> leads to similar outcomes regardless of the value of N<sub>a</sub>. However, in a bundle with multiple filaments, each motor has a sufficient number of arms to ensure simultaneous interactions with those filaments. This is why we decided to use N<sub>h</sub> = 8 and N<sub>a</sub> = 24.

      To match the length of thick filaments and the total number of heads (N<sub>h</sub>N<sub>a</sub>) in the model with real myosin thick filaments, we have used 42 nm for each backbone length. Varying this length is equivalent to a variation in L<sub>sp</sub> that we did for Fig. 6.

      We used high ACP density to ensure connections between all neighboring pairs of actin filaments. We already showed how the presence of ACPs affects the force generation process in Fig. 2 using two actin filaments. It is expected that a variation of ACP density would affect our results to some extent. Since the main focus of the current study is the structural properties of motors, we did not explore the effects of ACP density. We hope that the reviewer understands our intention.
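
      As a consistency check on the coarse-graining numbers quoted in this response, the arithmetic can be sketched as below. The geometric reading is an assumption rather than something stated explicitly: serially linked backbone segments are taken to share endpoints, so that n segments have n + 1 endpoints, each carrying two arms. Under that reading, 24 arms correspond to 11 segments of 42 nm, which matches the L<sub>M</sub> = 462 nm quoted for the reference condition elsewhere in this response. The function and constant names are illustrative.

```python
N_H = 8               # myosin heads represented by one arm (from the response)
N_A = 24              # arms per motor (from the response)
SEGMENT_NM = 42       # backbone segment length in nm (from the response)
ARMS_PER_ENDPOINT = 2  # "two arms on each endpoint" (from the response)


def backbone_geometry(n_arms=N_A, segment_nm=SEGMENT_NM):
    """Return (number of backbone segments, backbone length in nm).

    Assumes adjacent segments share an endpoint, so the number of
    segments is one less than the number of endpoints.
    """
    endpoints = n_arms // ARMS_PER_ENDPOINT
    segments = endpoints - 1
    return segments, segments * segment_nm


# Total number of heads represented by one motor, N_h * N_a.
total_heads = N_H * N_A
```

      Keeping N<sub>h</sub>N<sub>a</sub> fixed while changing N<sub>a</sub> is the coarse-graining trade-off described above: fewer arms means each arm stands in for more heads.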

      (6) The manuscript focuses on disordered bundles with only one figure on networks. However, actin fibers also ubiquitously exist as disordered networks, and it is important to explore in more detail the contractile forces in such network arrangements.

      We appreciate the comment. Because we plan to delve into the effects of motor structures on force generation in networks in a follow-up study, we showed only minimal results in the current study to demonstrate the generality of our findings. We hope that the reviewer understands our intention and plan.

      It is not described very clearly how these networks were generated.

      We apologize for the lack of explanation about how the networks were generated. We added the following section to the Supplementary Text of the revised manuscript:

      “Network assembly

      Unlike F-actin in bundle simulations, F-actin in network simulations is formed by stochastic processes as in our previous studies. The formation of F-actin is initiated from a nucleation event with a constant rate constant, k<sub>n,A</sub>, with the appearance of one cylindrical segment in a random position with a random orientation perpendicular to the z direction. The polymerization of F-actin is simulated by adding cylindrical segments at the barbed end of existing filaments with a rate constant, k<sub>p,A</sub>. The ratio of k<sub>n,A</sub> to k<sub>p,A</sub> is adjusted to result in the average filament length of ~10 μm. The rest of the assembly process is identical to that described in the main text.”

      Crosslinked biopolymers like actin typically form disordered elastic networks with their coordination number below the rigidity percolation threshold (z=4 in 2D); see, for example, the review by Broedersz and MacKintosh, Rev. Mod. Phys. 2013. Such networks should exist in the bending-dominated regime, where bending forces play a vital role in force propagation. Was that observed in the simulations? Why or why not?

      We appreciate the comment. We are aware of the bending-dominated regime and indeed showed the importance of the bending stiffness of actin filaments at low shear strain level in our previous work (T. Kim et al., PLOS Comput Biol, 2009, 5(7): e1000439). In case of active networks with motors, such a bending-dominated regime has not been observed without external shear strain. Instead, buckling of actin filaments was found to be essential for breaking symmetry between tensile and compressive forces developed by motor activities. We have shown that the free contraction of networks is inhibited if filament bending stiffness is increased substantially (J. Li et al., Soft Matter, 2017, 13: 3213-3220 and T. Bidone et al., PLOS Comput Biol, 2017, 13(1): e1005277). We expect that contractile forces generated by bundles or networks will be reduced significantly if we highly increase bending stiffness. However, considering the focus of the current study is on the structural properties of motors, we did not perform such simulations. 

      (7) It would be interesting to see the simulated predictions of the bundle or network contraction dynamics. This can be done by changing to free boundary conditions so that the bundle can contract.

      Thank you for the suggestion. We have previously investigated the free contraction of actomyosin networks with different motor density and ACP density (J Li et al., Soft Matter, 2017, 13: 3213). We observed that the rate of network contraction was higher with more motors and ACPs. However, we did not test the effects of the structural properties of thick filaments in the previous study. We plan to investigate the effects in future studies because the focus of the current study is the force generation process. Please note that in the discussion section of the original manuscript, we mentioned the following:

      “Although we focused on force generation, the contractile behaviors of actomyosin structures (i.e., a decrease in length) have also been of great interest. Our model can be used to study such contractile behaviors by deactivating the periodic boundary condition and removing connection between one end of bundle/network and a domain boundary as done previously [20]. To achieve higher contractile speed with the same total number of myosin heads, the existence of multiple contractile units would be better as suggested in a previous work [4]. This means that there is a trade-off between force generation and contractile speed. Previous studies also showed that the contractile speed of networks is proportional to motor density [18, 43, 51]. We may be able to use our model to systematically investigate how the contractile speed is regulated by parameters that we tested in this study, including the number, distribution, length, and structure of motors.”

      Minor suggestions for improvement:

      (1) What are the vertical markers in Figures 1E and F? They should be labelled. If they are crosslinkers, it is not clear why the color is different from Figure 1A and B.

      We believe that the reviewer meant Figs. 2E, F. Those vertical lines are indeed ACPs (crosslinkers). We changed the color of ACPs in Fig. 1A and Fig. 2B-D to purple to be consistent. In addition, we changed the colors of two filaments in Figs. 2B-D slightly to be consistent with Fig. 2E.

      (2) To help understanding, please include a figure showing how forces are measured.

      We added Fig. S1 in the revised manuscript to explain how the bundle force is calculated.

      (3) It should be possible to extend the scaling arguments to predict what is the crossover myosin density (N_M) in Figure 4a at which the efficiency changes from going as 1/N_M to saturating. 

      As the reviewer might have observed, the slope of the efficiency in Fig. 4A gradually changes, rather than showing a sharp transition. Thus, it is hard to define one crossover myosin density. 

      Similarly, what are the slopes in Figure 6a-b?

      We drew the reference lines in those two plots. Unfortunately, we do not have explanations about the origin of these slopes.

      (4) Some more explanation for the observed values should be added. Figure 4: Why does efficiency plateau at a value close to 0.8 in (A)? 

      We assume that the reviewer meant the plateau of η close to 0.08, not 0.8. Our speculation for the origin of this plateau value is related to L<sub>M</sub> (= 462 nm under the reference condition). Ideally, ~43 motors are required to cover the entire length of the bundle (= 20 μm). Under this condition, η is ~0.023. Although this is not 0.08, we believe that these two values are related to each other. For example, if we increase L<sub>M</sub>, this plateau level would increase. We added the following sentences in the result section of the revised manuscript:

      “The plateau level of η at ~0.08 is related to the minimum number of motors required for saturating an entire bundle, implying that the plateau level would be higher if each motor is longer.”
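
      The numbers in this response can be reproduced with a short calculation. Two of the inputs come from the text (L<sub>M</sub> = 462 nm and a bundle length of 20 μm). The relation η ≈ 1/N for N motors tiling the bundle end to end is an assumption made here because it reproduces the quoted ~0.023; it is not given as a formula in the response. The function names are illustrative.

```python
BUNDLE_LENGTH_NM = 20_000  # bundle length of 20 um, from the response
MOTOR_LENGTH_NM = 462      # L_M under the reference condition, from the response


def motors_to_cover(bundle_nm=BUNDLE_LENGTH_NM, motor_nm=MOTOR_LENGTH_NM):
    """Number of motors needed to tile the bundle end to end without overlap."""
    return bundle_nm / motor_nm


def efficiency_if_tiled(bundle_nm=BUNDLE_LENGTH_NM, motor_nm=MOTOR_LENGTH_NM):
    """Assumed efficiency when the tiled motors contribute one motor's force.

    This 1/N reading is a hypothesis consistent with the quoted value,
    not a formula taken from the manuscript.
    """
    return 1.0 / motors_to_cover(bundle_nm, motor_nm)
```

      Under this reading, a longer motor reduces the number needed to cover the bundle and therefore raises the estimate, which agrees with the direction of the authors' statement that the plateau would be higher if each motor were longer.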

      Figure 5: Overlapping between motors seems to increase the total force applied by them because of cooperative effects. However, it is not abundantly clear why that should peak at a value of f = 0.06.

      As shown in Fig. 5B, smaller f always results in higher F<sub>tot</sub> due to higher level of cooperative overlap. The minimum value of f we tested in this study was 0.06, so F<sub>tot</sub> was maximal at f = 0.06.

      (5) Why is the network force expected to scale approximately as sqrt(N_M)? Is it because of the 2D geometry where the number of motors along the x or y-direction scale as sqrt(N_M)?

      We initially thought that the weaker dependence of the total force on N<sub>M</sub> was related to the random orientations of motors. However, if the network is fully saturated with motors, the inclusion of more motors will increase forces in both x and y directions almost linearly, resulting in the direct proportionality of F<sub>tot</sub> to N<sub>M</sub>. Our new hypothesis for weaker dependence is consistent with the reviewer’s speculation; the network is not fully saturated even with 1000 motors, so the entire regime shown in Fig. 7B corresponds to that with N<sub>M</sub> < 100 in Fig. 4A where similar weaker dependence on N<sub>M</sub> was observed. We added the following sentence in the result section of the revised manuscript to clarify this point:

      “the average number of motors in each direction which can experience the cooperative overlap would be ~. Maximal N<sub>M</sub> tested with the network was ~2,500, so the dependence of F<sub>tot</sub> on N<sub>M</sub> with the network is similar to that with N<sub>M</sub> < ~50 with the bundle (Fig. 4A).”

      (6) Figures 6 D and A: Figure 6D suggests that there is a more full overlap in the cases where there was a longer bare zone or larger spacing between motor arms. However, the quantification of the total force in A shows that the force is highest for the case where LM was increased by increasing the number of arms. Why do the authors think that is? I would expect from the explanation in Fig 6D that the Lsp and Lbz would be higher than Na in Fig 6A.

      Fig. 6D shows a difference in the level of the cooperative overlap () between two motors. As the reviewer pointed out, the case with more arms shows the lowest , resulting in the lowest as we showed in Fig. S2B. However, as shown in Eq. 7, the total force is a function of both N<sub>a</sub> and . Thus, due to higher N<sub>a</sub> and lower , the force in the case with different N<sub>a</sub> can be similar to that in the case with different L<sub>bz</sub>. In the original manuscript, we had the following sentence to explain how the force can be similar between the two cases:

      “Thus, was higher (Fig. S2B, blue), resulting in higher F<sub>tot</sub> and η despite smaller N<sub>a</sub>.”

      Reviewer #2 (Public review):

      Summary:

      In this study, the authors use a mechanical model to investigate how the geometry and deformations of myosin II filaments influence their force generation. They introduce a force generation efficiency that is defined as the ratio of the total generated force and the maximal force that the motors can generate. By changing the architecture of the myosin II filaments, they study the force generation efficiency in different systems: two filaments, a disorganized bundle, and a 2D network. In the simple two-filament systems, they found that in the presence of actin crosslinking proteins motors cannot add up their force because of steric hindrances. In the disorganized bundle, the authors identified a critical overlap of motors for cooperative force generation. This overlap is also influenced by the arrangement of the motor on the filaments and influenced by the length of the bare zone between the motor heads.

      Strengths:

      The strength of the study is the identification of organizational principles in myosin II filaments that influence force generation. It provides a complementary mechanistic perspective on the operation of these motor filaments. The force generation efficiency and the cooperative overlap number are quantitative ways to characterize the force generation of molecular motors in clusters and between filaments. These quantities and their conceptual implications are most likely also applicable in other systems.

      Thank you for the comments about the strength of our study. 

      Weaknesses:

      The detailed model that the authors present relies on over 20 numerical parameters that are listed in the supplement. Because of this vast amount of parameters, it is not clear how general the findings are. On the other hand, it was not obvious how specific the model is to myosin II, meaning how well it can describe experimental findings or make measurable predictions. The model seems to be quantitative, but the interpretation and connection to real experiments are rather qualitative in my point of view.

      As the reviewer mentioned, all agent-based computational models for simulating the actin cytoskeleton inevitably involve such a large number of parameters. Some of the parameter values are not well known, so we have tuned our parameter values carefully by comparing our results with experimental observations in our previous studies since 2009. We were aware of the importance of a rigorous representation of the unbinding and walking rates of myosin motors, so we implemented in our model the parallel cluster model, which can predict those rates with consideration of the mechanochemical rates of myosin II. Thus, we are confident that our motors represent myosin II.

      In our manuscript, our results were compared with prior observations in Ref. 4 (Thoresen et al., Biophys J, 2013) several times. In particular, larger force generation with more myosin heads per thick filament was consistent between the experiment and our simulations. 

      Our study can make various predictions. First, our study explains why non-muscle myosin II in stress fibers shows focal distributions rather than uniform distributions; if the motors stay close together, they can generate much larger forces in the stress fibers via the cooperative overlap. Our study also predicts a difference between bipolar structures (found in skeletal muscle myosins and non-muscle myosins) and side polar structures (found in smooth muscle myosins) in terms of the likelihood of the cooperative overlap. As shown below, myosin filaments with the bipolar structure can add up their forces better than those with the side polar structure when their overlap level is the same.

      Author response image 1.

       

      It was often difficult for me to follow what parameters were changed and what parameters were set to what numerical values when inspecting the curve shown in the figures. The manuscript could be more specific by explicitly giving numbers. For example, in the caption for Figure 6, instead of saying "is varied by changing the number of motor arms, the bare zone length, the spacing between motor arms", the authors could be more specific and give the ranges: "is varied by changing the number of motor arms from ... to .., the bare zone length from .. to..., and the spacing between motor arms from .. to ..".

      This unspecificity is also reflected in the text: "We ran simulations with a variation in either L<sub>sp</sub> or L<sub>bz</sub>" What is the range of this variation? "When L<sub>M</sub> was similar" similar to what? "despite different N<sub>M</sub>." What are the different values for N<sub>M</sub>? These are only a few examples that show that the text could be way more specific and quantitative instead of qualitative descriptions.

      We appreciate the comment. In the revised manuscript, we specified the range of the variation in each parameter.

      In the text, after equation (2) the authors discuss assumptions about the binding of the motor to the actin filament. I think these model-related assumptions and explanations should be discussed not in the results section but rather in the "model overview" section.

      Thank you for pointing this out. In the original manuscript, we described all the details of the model in the Supplementary Material. We feel that the assumptions about interactions between motors and actin filaments are too detailed to be included in the model overview section.

      The lines with different colors in Figure 2A are not explained. What systems and parameters do they represent?

      The different colors used in Fig. 2A were used for distinguishing 20 cases. We added the explanation about the colors in the figure caption in the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      To guarantee the reproducibility of the results, I recommend that the authors publish their simulation code on GitHub.

      We appreciate the reviewer’s suggestion. Following the suggestion, we prepared and posted the code on GitHub, as mentioned in the Data Availability section of the revised manuscript: “The source code of our model is available on GitHub: https://github.com/ktyman2/ThickFilament”

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer 1

      “The exact levels of inhibition, excitation, and neuromodulatory inputs to neural networks are unknown. Therefore, the work is based on fine-tuned measures that are indirectly based on experimental results. However, obtaining such physiological information is challenging and currently impossible. From a computational perspective it is a challenge that in theory can be solved. Thus, although we have no ground-truth evidence, this framework can provide compelling evidence for all hypothesis testing research and potentially solve this physiological problem with the use of computers.”

      Response: We agree with the reviewer. This work was intended to determine the feasibility of reverse engineering motor unit firing patterns using neuron models with a high degree of realism. Given that the results support this feasibility, our model and technique will serve both to construct new hypotheses and to test them.

      • Common input structure lines 115

      I agree with the following concepts, but I would specify that there is not only one dominant common input. It has been shown that there are multiple common inputs to the same motor nuclei (e.g., the two inputs are orthogonal and are shared with a subset of the active motoneurons) particularly for agonist motoneuron pools of synergistic muscles. On the hand muscles the authors are correct that there is only one dominant common input. Moreover, there is also some animal work suggesting that common inputs is just an epiphenomenon. This is completely in contradiction to what we observe in-vivo in the firing patterns of motor units, but perhaps worth mentioning and discussing.

      Response: Thanks for emphasizing this point. We have cited a recent reference discussing the important issue of common drive and the possibility of more than one source. Our simulations assume the net form of the excitatory input to all motoneurons in the pool is the same, except for noise. This net form (which produces the linear CST output in each case) essentially represents the sum of all inputs, both descending and sensory. Our results show the same overall pattern as human data, i.e., that all motor unit firing patterns have similar trajectories (again allowing for the impact of noise). Future studies will consider separating excitatory inputs into different sources.

      It is interesting that the authors mention suprathreshold rate modulation. Could the authors just discuss more on how the model would respond to a simulated suprathreshold current for all simulated motoneurons (i.e., like the ones generated during a suprathreshold-injected current or voluntary maximal feedforward movement?)

      Response: Thank you for this point. Our use of the term “suprathreshold” was not correct. We meant “suprathreshold” to refer to the amount of input above the recruitment threshold. We have decided to remove this term, so the sentence now reads “…so less is available for rate modulation…”.

      194 a full point is missing.

      Response: We addressed the error.

      204-231 and 232-259, these two paragraphs have been copied twice.

      Response: We addressed the error.

      Line 475 typo

      Response: We addressed the error.

      591 It would be interesting to add the time it takes a standard computer with known specs and a supercomputer to run one batch of simulations (i.e., how long one of the 6,300,000 simulations takes).

      Response: Each simulation took about 20 minutes of real time. Assuming a standard computer with 16 processor cores using a microarchitecture similar to that of Bebop (Intel Broadwell architecture), the standard computer could run 16 simulations at a time (one simulation assigned per core). It would take this computer about 15 years to complete all 6.3M simulations.
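
      The estimate above can be checked with a short back-of-the-envelope calculation. The sketch below uses only the figures quoted in the response (20 minutes per simulation, 6.3M simulations, 16 cores) and assumes perfect parallel scaling with no scheduling overhead; the function name is illustrative only.

```python
# Figures quoted in the response above.
MINUTES_PER_SIMULATION = 20
TOTAL_SIMULATIONS = 6_300_000
CORES = 16  # one simulation assigned per core

MINUTES_PER_YEAR = 60 * 24 * 365


def years_to_finish(total, minutes_each, parallel):
    """Wall-clock years needed, assuming ideal scaling across cores."""
    total_minutes = total * minutes_each / parallel
    return total_minutes / MINUTES_PER_YEAR


estimate_years = years_to_finish(TOTAL_SIMULATIONS, MINUTES_PER_SIMULATION, CORES)
print(f"Estimated wall-clock time: {estimate_years:.1f} years")
```

      Under these assumptions the result is consistent with the figure of about 15 years given above.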

      594 I don't understand why there are 6M simulations, could the authors provide more info on the combinations and why there are 6M simulations.

      Response: The 6M simulations are the total number of simulations that were performed for this work. A detailed explanation can be found in the section “Machine learning inference of motor pool characteristics” at line 591. Briefly, there were 315,000 simulations of a pool of 20 motoneurons (20 x 315,000 = 6.3 million). The 315,000 simulations were required to run all possible combinations of 15 patterns of inhibition, 5 of neuromodulation, 7 of distribution of excitatory inputs, and 30 different repeats of synaptic noise with different seeds. In addition, there were 20 iterations for each of these combinations to generate a linear CST output (as illustrated in Fig. 3): 15 x 5 x 7 x 30 x 20 = 315,000.
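
      The counting argument above can be written out explicitly. The following is a minimal sketch; the dictionary keys are descriptive labels chosen here for illustration, not identifiers from the authors' code.

```python
from math import prod

# Number of levels per factor, as stated in the response above.
grid = {
    "inhibition_patterns": 15,
    "neuromodulation_levels": 5,
    "excitatory_input_distributions": 7,
    "synaptic_noise_seeds": 30,
    "iterations_for_linear_cst": 20,
}
MOTONEURONS_PER_POOL = 20

pool_simulations = prod(grid.values())
total_simulations = pool_simulations * MOTONEURONS_PER_POOL

print(f"Pool simulations: {pool_simulations:,}")
print(f"Total (x {MOTONEURONS_PER_POOL} motoneurons): {total_simulations:,}")
```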

      In several simulations it seems that there was a lot of fine-tuning of inputs to match the measured motor unit firing pattern. Have the authors ever considered a fully black-box AI approach? If they think it is interesting, maybe it could spice up the discussion.

      Response: We agree that AI has potential for reverse engineering the whole system and we are looking into adding it to a future version of this algorithm as an alternative. We started with a simple but powerful grid search to enhance our understanding of the interaction between inputs, neuron properties and outputs.

      Reviewer 2

      Comment 1:

      “First, I believe that the relation between individual motor neuron behavioral characteristics (delta F, brace height etc.) and the motor neuron input properties can be illustrated more clearly. Although this is explained in the text, I believe that this is not optimally supported by figures. Figure 6 to some extent shows this, but figures 8 and 9 as well as Table 1 show primarily the goodness of fit rather than the actual fit.”

      Response: We agree with the reviewer that showing the relationship between the motor neuron behavioral characteristics (delta F, brace height etc.) and the motor neuron input properties would be a great addition to the manuscript. Because the regression models have multiple dimensions (7 inputs and 3 outputs) it is difficult to show the relationship in a static image. We thought it best to show the goodness of fit even though it is more abstract and less intuitive. We added a supplemental diagram to Figure 8 to show the structure of the reverse engineered model that was fit (see Figure 8D).

      Author response image 1.

      Figure 8. Residual plots showing the goodness of fit of the different predicted values: (A) Inhibition, (B) Neuromodulation and (C) excitatory Weight Ratio. The summary plots are for the models showing the highest R² results in Table 1. The predicted values are calculated using the features extracted from the firing rates (see Figure 7, section Machine learning inference of motor pool characteristics and Regression using motoneuron outputs to predict input organization). Diagram (D) shows the multidimensionality of the RE models (see Model fits), which have 7 feature inputs (see Feature Extraction) predicting 3 outputs (Inhibition, Neuromodulation and Weight Ratio).

      Comment 2:

      “Second, I would have expected the discussion to have addressed specifically the question of which of the two primary schemes (push-pull, balanced) is the most prevalent. This is the main research question of the study, but it is to some degree left unanswered. Now that the authors have identified the relation between the characteristics of motor neuron behaviors (which has been reported in many previous studies), why not exploit this finding by summarizing the results of previous studies (at least a few representative ones) and discuss the most likely underlying input scheme? Is there a consistent trend towards one of the schemes, or are both strategies commonly used?”

      Response: We agree with the reviewer that our discussion should have addressed which of the two primary schemes (push-pull or balanced) is the most prevalent. At first glance, the upper right of Figure 6 looks the most realistic when compared to real data. We would thus expect the push-pull scheme to dominate for the given task.

      We added a brief section (Push-Pull vs Balance Motor Command) in the discussion to address the reviewer’s comments. This section is not exhaustive but frames the debate using relevant literature. We are also now preparing to deploy these techniques on real data.

      Comment 3:

      In addition, it seems striking to me that highly non-linear excitation profiles are necessary to obtain a linear CST ramp in many model configurations. Although somewhat speculative, one may expect that an approximately linear relation is desired for robust and intuitive motor control. It seems to me that humans generally have a good ability to accurately grade the magnitude of the motor output, which implies that either a non-linear relation has been learnt (complex task), or that the central nervous system can generally rely on a somewhat linear relation between the neural drive to the muscle and the output (simpler task).

      Response: We agree with the reviewer, and we were surprised by these results. Our motoneuron pool is equipped with persistent inward currents (PICs) which are nonlinear. Therefore, for the motoneuron to produce a linear output the central nervous system would have to incorporate these nonlinearities into its commands.

      Following this reasoning, it could be interesting to report also for which input scheme, the excitation profile is most linear. I understand that this is not the primary aim of the study, but it may be an interesting way to elaborate on the finding that in many cases non-linear excitation profiles were needed to produce the linear ramp.

      This is a very interesting point. The most realistic firing patterns – with respect to human data – are found in the parameter regions in the upper right in Figure 6, which in fact produce the most nonlinear input (see push-pull pattern in Figure 4C). However, in future studies we hope to separate the total motor command illustrated here into descending and feedback commands. This may result in a more linear descending drive.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      The study provides valuable insights into allosteric regulation of BTK, a non-receptor protein kinase, challenging previous models. Using a variety of biophysical and functional techniques, the paper presents evidence that the N-terminal PH-TH domain of BTK exists in a conformational ensemble surrounding a compact SH3-SH2-kinase core, that the BTK kinase domain can form partially active dimers, and that the PH domain can form a novel inhibitory interface after SH2/SH3 disengagement. Overall the presented evidence is solid, but the EM results may be over-interpreted and the work would benefit from additional functional validation.

      We made every effort in our descriptions of the cryoEM data presented for full-length BTK not to overinterpret the results. In essence this is not an ideal EM target, but given the failure by us and others to capture the full-length multi-domain protein crystallographically, we decided that the cryoEM data, albeit low resolution, are useful to the field.

      Reviewer #1 (Public Review):

      The manuscript by Lin et al describes a wide biophysical survey of the molecular mechanisms underlying full-length BTK regulation. This is a continuation of this lab's excellent work on deciphering the myriad levels of regulation of BTKs downstream of their activation by plasma membrane localised receptors.

      The manuscript uses a synergy of cryo EM, HDX-MS and mutational analysis to delve into the role of how the accessory domains modify the activity of the kinase domain. The manuscript essentially has three main novel insights into BTK regulation.

      1) Cryo EM and SAXS show that the PHTH region is dynamic compared to the conserved Src module.

      2) A 2nd generation tethered PH-kinase construct crystal of BTK reveals a unique orientation of the PH domain relative to the kinase domain, that is different from previous structures.

      3) A new structure of the kinase domain dimer shows how trans-phosphorylation can be achieved.

      Excitingly these structural works allow for the generation of a model of how BTK can act as a strict coincidence sensor for both activated BCR complex as well as PIP3 before it obtains full activity. To my eye the most exciting result of this work is describing how the PH domain can inhibit activity once the SH3/SH2 domain is disengaged, allowing for an additional level of regulatory control.

      I have very few experimental concerns as the methods and figures are well-described and clear. As the authors are potentially saying that the previously solved PH domain-kinase interface is artefactual, additional evidence strengthening their model would be helpful to resolve any possible controversies.

      We do not argue that the previously solved PH domain-kinase interface is artefactual. Instead we point out that the PH/kinase interface identified in the prior structure is incompatible with the contacts between the SH3 and kinase domains in autoinhibited BTK. This then leads us to the suggestion that a PH/kinase inhibitory interaction may instead occur upon dissociation of the SH3-SH2 cassette from the kinase domain. Our data support that model. Moreover, our data suggest the PHTH domain is dynamic, likely not settling into one particular autoinhibitory state. Thus, it is possible the previously solved PH/kinase structure exists within the conformational ensemble of a range of PH/kinase domain interactions. In an effort to clarify our thinking we added two sentences to the Discussion (pg. 19).

      Reviewer #2 (Public Review):

      In this study, multiple biophysical techniques were employed to investigate the activation mechanism of BTK, a multi-domain non-receptor protein kinase. Previous studies have elucidated the inhibitory effects of the SH3 and SH2 domains on the kinase and the potential activation mechanism involving the membrane-bound PIP3 inducing transient dimerization of the PH-TH domain, which binds to lipids.

      The primary focus of the present study was on three new constructs: a full-length BTK construct, a construct where the PH-TH domain is connected to the kinase domain, and a construct featuring a kinase domain with a phosphomimetic at the autophosphorylation site Y551. The authors aimed to provide new insights into the autoinhibition and allosteric control of BTK.

      The study reports that SAXS analysis of the full-length BTK protein construct, along with cryoEM visualization of the PH-TH domain, supports a model in which the N-terminal PH-TH domain exists in a conformational ensemble surrounding a compact/autoinhibited SH3-SH2-kinase core. This finding is interesting because it contradicts previous models proposing that each globular domain is tightly packed within the core.

      Furthermore, the authors present a model for an inhibitory interaction between the N-lobe of the kinase and the PH-TH domain. This model is based on a study using a tethered complex with a longer tether than a previously reported construct where the PH-TH domain was tightly attached to the kinase domain (ref 5). The authors argue that the new structure is relevant. However, this assertion requires further explanation and discussion, particularly considering that the functional assays used to assess the impact of mutating residues within the PH-TH/kinase domain contradict the results of the previous study (ref 5).

      In our hands BTK activity is not significantly affected by mutation of just two residues, R133 and Y134. It is somewhat difficult to compare the previously reported activity assay for the same BTK mutant (Wang et al. ref 5, Figure 4D) with the data we report here. For unexplained reasons, the time scale for the quantitative assay in the previous work is truncated to 50 minutes for the R133/Y134 mutant data compared to 120 minutes for all of the other activity data reported in that figure. In our data, if we qualitatively examine the differences in a representative progress curve at 50 minutes between WT and the double R133/Y134 mutant (see Figure 6a, dark blue and pink traces), one might conclude that the R133/Y134 mutation is activating BTK. However, when we calculate the average kinase activity rate ± standard error for three independent experiments, we find that the difference between WT and the double R133/Y134 mutant is not significant (see Figure 6b and c). Thus, instead of making any assertions about the previously published data, we are trying to be as rigorous as possible in the presentation and interpretation of our own data.
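
      To illustrate the kind of summary described above (average rate ± standard error across three independent experiments), here is a minimal sketch. The rate values are hypothetical placeholders, not data from the manuscript, and the Welch t statistic is shown only as one common way to compare two small samples; the response does not state which statistical test was actually used.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_sem(rates):
    """Return (mean, standard error of the mean) for replicate measurements."""
    return mean(rates), stdev(rates) / sqrt(len(rates))


def welch_t(a, b):
    """Welch's t statistic for two independent samples (no p-value computed)."""
    mean_a, sem_a = mean_and_sem(a)
    mean_b, sem_b = mean_and_sem(b)
    return (mean_a - mean_b) / sqrt(sem_a ** 2 + sem_b ** 2)


# Hypothetical rates (arbitrary units) from three independent experiments.
wild_type = [1.00, 1.20, 0.90]
double_mutant = [1.10, 1.40, 0.95]

for label, rates in (("WT", wild_type), ("R133/Y134 mutant", double_mutant)):
    m, sem = mean_and_sem(rates)
    print(f"{label}: {m:.2f} ± {sem:.2f}")
print(f"Welch t = {welch_t(double_mutant, wild_type):.2f}")
```

      A single progress curve can suggest a difference that disappears once replicate-to-replicate variability is taken into account, which is the point made in the response.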

      In addition, throughout the manuscript we tried to be very careful in our discussion of our data and that published previously, to avoid conclusive statements about the previously described interface. After all, one of our overriding conclusions is that the N-terminal region of BTK is highly dynamic. See response to reviewer 1 above.

      Additionally, the study presents the structure of the kinase domain with swapped activation loops in a dimeric form, representing a previously unseen structure along the trans-phosphorylation pathway. This structure holds potential relevance. To better understand its significance, employing a structure/function approach like the one described for the PH-TH/kinase domain interface would be beneficial.

      We completely agree with this comment and are pursuing such studies now.

      Overall, this study contributes to our understanding of the activation mechanism of BTK and sheds light on the autoinhibition and allosteric control of this protein kinase. It presents new structural insights and proposes novel models that challenge previous understandings. However, further investigation and discussion would significantly strengthen the study.

      As indicated we are pursuing further investigation and felt that the body of work presented here is sufficient for a single manuscript.

      Reviewer #3 (Public Review):

      Yin-wei Lin et al set out to visualize the inactive conformation of full-length Bruton's Tyrosine Kinase (BTK), a molecule that has evaded high-resolution structural studies in its full-length form to this date. An open question in the field is how the Pleckstrin Homology-Tec Homology (PHTH) domain inhibits BTK activity, with multiple competing models in the field. The authors used a complementary set of biophysical techniques combined with well-thought-out stabilizing mutations to obtain structural insights into BTK regulation in its full-length form. They were able to crystallize the full-length construct of BTK but unfortunately, the PHTH was not resolved yielding a structure similar to that previously obtained in the field. The investigation of the same construct by SAXS yielded an elongated structural model, consistent with previous SAXS studies. Using cryo-EM the authors obtained a low-resolution model for the FL BTK with a loosely connected density assigned to the dynamic PHTH around the compact SH2-SH3-Kinase Domain (KD) core. To gain further molecular insights into PHTH-KD interactions the authors followed a previously reported strategy and generated a fusion of PHTH-KD with a longer linker, yielding a crystal structure with a novel PHTH-KD interface which they tested in biochemical assays. Lastly, Yin-wei Lin et al crystallized the BTK KD in a novel partially active state in a "face-to-face" dimer with kinases exchanging the activation loops, although partially disordered, being theoretically perfectly positioned for transphosphorylation. Overall this presents a valiant effort to gain molecular insights into what clearly is a dynamic regulatory motif on BTK and is a valuable addition to the field.

      However, this work can be improved by considering these points:

      1) The cryo-EM reconstructions are potentially over-interpreted. The reported resolution for all of the analyzed reconstructions is better than 8Å, at which point helices should be recognized as well-resolved structural elements. In the current view/depiction of the cryo-EM maps/models it is hard to see such structural features and it would be great if the authors could include a panel showing maps at higher thresholds to show correspondence between the helices in the kinase C lobe and the cryo-EM maps. Otherwise, the overall positioning of the models within the cryo-EM maps is hard to evaluate and may very well be wrong. (Fig 4, S2).

      First, we fully recognize the model is low-resolution and we are careful in our discussion of the cryo-EM data to use language that acknowledges the limitations of the model. Nevertheless, this is the model we have (specific data processing points are discussed below).

      The resolution numbers are from the Fourier Shell Correlation (FSC) curve given by cryoSPARC at the end of refinement. We acknowledge the reviewer’s comment that the resolution could be overestimated in that calculation, but our main focus is to show that the overall domain arrangement of the autoinhibited BTK core (Src-module) fits into the reconstructions.

      We tested visualizing the maps at higher thresholds, but the secondary structures of the reconstructions were still not well resolved. We realize that with the current reconstructions we do not have the structural details to correctly orient and fit individual domains; this is why we chose to simply fit the available crystal structure of the autoinhibited BTK SH3-SH2-kinase core into the maps.

      2) With the above in mind, if the maps are not at the point where helices are well resolved, it may be beneficial to low-pass filter the maps to a more conservative resolution for fitting, analysis, and representation. (Fig 4, S2).

      Using maps low-pass filtered to 10 Å, or unsharpened maps, the fit of the BTK model to the map does not change significantly.

      3) It would be valuable to get a quantitative metric on the model/map fitting for the cryo-EM work. One good package for this is Situs which provides cross-correlation values for the top orthogonal fits, without user input for initial fitting. This would again increase confidence in the correctness of model positioning on the map. (Fig 4, S2).

      Thank you for this suggestion. We tested the colores feature (Exhaustive One-At-A-Time 6D Search) in Situs to perform model-to-map fitting without user input, as the reviewer suggested. The highest-ranked fit is identical to what we presented in the manuscript. The cross-correlation values calculated with the “Fit-in-map” tool in Chimera and the “collage” function in Situs are given below. We now indicate this step in the caption to Figure 4.

      Author response table 1.

      4) It would be great to see 2D class averages from the particles contributing to each of the 3D classes. Theoretically, a clear bright "blob" (hypothesized to be the PHTH domain) should be observable in the 2D class averages. In the current 2D class averages that region is unconvincingly weak. (Fig 4, S2).

      We attempted to improve both 2D and 3D reconstructions by feeding the particles from each 3D class through many cycles of 2D classification and selection to exclude ‘bad’ particles, but neither the 2D class averages nor the 3D reconstructions could be improved.

      We agree the feature that appears in the 2D class averages is weak. The BTK protein is only 77 kD in size and is highly dynamic and flexible. Thus, in reality this is not an ideal system for cryo-EM. As well, the PHTH domain itself is quite small, and NMR data, acquired in the context of a different project, provide evidence that the isolated PHTH domain is dynamic in solution (NMR linewidths vary throughout the protein, suggesting intermediate exchange). Nevertheless, given the inability to capture the PHTH domain in crystal structures of full-length BTK, we reasoned that cryo-EM could provide some insight. In the future we anticipate building on these data to include inhibitory binding partners of BTK; however, such an effort is beyond the scope of the current work.

      5) It seems like there was quite a large circular mask applied during 2D classification. Are authors confident that the weak density attributed to the PHTH domain is not neighboring particles making their way into the extraction box? It would be great if the authors would trim their particle stack with a very stringent interparticle distance cutoff (or report the cutoff in the manuscript if already done so) to minimize this possibility.

      We initially picked particles using a small radius (100 Å), and stringently selected 2D classes with particles that contained only density aligning to the core SH3-SH2-kinase domains. We found, however, that 3D ab initio reconstruction always resulted in an additional density located at different positions around the larger core density. The structure of a single BTK PHTH domain fits into that additional remote density. Given the additional density that consistently appeared in 3D reconstructions, we went back and picked particles using a larger circular mask (200 Å). Subsequent 2D classification and 3D reconstruction from this analysis gave similar results and are presented in the manuscript.

      Regardless of the mask radius, we used stringent conditions for particle picking and checked for the presence of duplicates. An interparticle distance cutoff of 0.1 to 0.5 times the particle diameter was used and resulted in fewer particles, but the extended density remained. We also made use of template picking (2D class averages) to repick the particles and found no significant difference in the number of particles or quality of 2D classifications.
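
      As an illustration of what an interparticle distance cutoff does, the sketch below drops any pick that lies closer than a chosen fraction of the particle diameter to a pick that has already been kept. This is a generic greedy filter written for clarity, not the routine used by any particular cryo-EM package; the function name, coordinates and diameter are hypothetical.

```python
from math import hypot


def drop_close_picks(picks, particle_diameter, fraction=0.5):
    """Keep picks that are at least fraction * diameter from every kept pick.

    picks: iterable of (x, y) coordinates in the same units as the diameter.
    Picks are processed in order, so earlier picks take priority.
    """
    cutoff = fraction * particle_diameter
    kept = []
    for x, y in picks:
        if all(hypot(x - kx, y - ky) >= cutoff for kx, ky in kept):
            kept.append((x, y))
    return kept


# Hypothetical picks (in Å) on one micrograph; the second nearly duplicates the first.
example_picks = [(0.0, 0.0), (10.0, 0.0), (200.0, 0.0), (200.0, 120.0)]
filtered = drop_close_picks(example_picks, particle_diameter=100.0, fraction=0.5)
print(f"{len(example_picks)} picks before, {len(filtered)} after filtering")
```

      The pairwise comparison scales quadratically with the number of picks, which is acceptable for a sketch; a spatial grid or tree would be the usual choice for millions of particles.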

      6) The cryo-EM processing may benefit from more stringent particle picking. The authors picked over 2M particles from 750 micrographs which likely represents very heavy overpicking. I would encourage the authors to re-pick the micrographs with 2D class averages and use more stringent metrics to reduce the overpicking. This may result in higher-resolution reconstructions. (Fig 4, S2).

      This was an effort to maximize the number of particles extracted. After multiple rounds of 2D classification and selection to exclude empty and junk particles, the final number of particles selected for 3D ab initio reconstruction was only 68,788, with only ~20K particles for each 3D reconstruction. Thus, we are not concerned that we overpicked particles. This approach is described in Supp Figure S2.

      7) The Dmax from SAXS for the Full Length BTK is at 190Å. It would be great if the authors could make a cartoon of what domain arrangement may satisfy this distance, as it is quite extended for such a small particle. Can the authors rule out dimerization at SAXS concentrations? (Fig 1).

      SAXS data for full-length, wild-type BTK have been previously published (Márquez et al., EMBO J. (2003) 22:4616-4624). Our data for WT BTK are consistent with those published previously (and we have cited this previous work). In that work, the authors attribute the ~200 Å Dmax value to an elongated BTK conformation where the domains of BTK are arranged in a linear fashion (a figure showing this domain arrangement is provided by Márquez et al., precluding the need for such a cartoon here).

      In the present work we take advantage of targeted mutations to stabilize the autoinhibited SH3-SH2-kinase core, and the Dmax value that we report for this more autoinhibited version of full-length BTK (FL 4P1F) is ~150 Å. Notwithstanding the low resolution of both SAXS and cryoEM, it is notable that superposition of the cryoEM models in Figure 4c & d gives a distance of ~150 Å between the PHTH domains from the two models.

      Finally, we cannot completely rule out that a small fraction of full-length BTK is forming dimers. However, in our experience purifying and working with this protein, we find that purified and concentrated monomeric full-length BTK proteins (as high as 15 mg/ml) are quite stable and remain monomeric and free of aggregation even after sitting at 4°C for more than a week. Here the BTK SAXS data were collected within 24 hours after the samples were thawed.

      8) In Figure S1 (C) it seems that the curves are just scattering curves with Guinier plots in the inserts, but are labeled as Guinier plots in the legend. The Guinier plots for some samples (FL 4P1F) show signs of aggregation, which may complicate the analysis, it could be beneficial to redo.

      We thank the reviewer for pointing out our mistake in the presentation of the SAXS data. We have now replaced the plots in Figure S1c with the correct scattering profiles for each construct, with the Guinier insets shown. We revised the label of this panel to “Scattering profile and Guinier plots (insets)”.

      In addition, we re-processed the FL 4P1F data by performing buffer subtraction with a different buffer-alone scattering dataset, which was also collected during the original data acquisition. The data quality after reprocessing was significantly improved (see new scattering profiles and Guinier plots for full-length BTK in Supplementary Figure S1). Protein stability (see above) and the current data quality therefore suggest that aggregation is not complicating the SAXS analysis.

      9) Have the authors verified that the activation loop mutations that they introduce do not disrupt the PHTH binding as they previously reported an activation loop on BTK to interact with PHTH, an interaction they do not see here? If so, a citation would be helpful in the text. If not, testing this would strengthen the paper.

      The same activation loop mutations were included in the constructs used in the previous solution studies of the PHTH/kinase domain interaction by NMR and HDX (see ref [11]). We clarify this point in the methods section. As well, all but one of the sequence changes introduced into the activation loop are at positions at the ‘base’ of the activation loop and therefore are not surface exposed. Only one amino acid change is on the exposed part of the activation loop (V555T).

      10) Can the authors comment on the surfaces which are accessible and inaccessible to the PHTH in the crystal (Fig 3E)? The fact that PHTH doesn't adopt a stable conformation in the solvent channel to some degree indicates that the accessible interaction surfaces are not suitable for PHTH interactions, as the "effective concentration" of the PHTH would be quite high. Are these surfaces consistent with the cryo-EM analysis?

      This is an excellent point and we did state the following in describing the crystallization results:

      “the crystallography results are consistent with a flexible N-terminal PHTH domain with the caveat that the domain swapped dimer organization might limit native autoinhibitory contacts between the PHTH and SH3-SH2-kinase regions.”

      In the domain swapped dimer seen in the crystal, a symmetry related molecule does partially block the G-helix region of the kinase domain while the activation loop and C-helix in the N-lobe remain accessible. Our previous solution studies (ref [11]) pointed to the G-helix as part of the interaction interface in addition to the activation loop and part of the N-lobe. We have now modified the sentence above to more clearly describe which parts of the kinase domain are inaccessible in the crystal and the possible ramifications of the steric environment on PHTH domain mobility in the crystal (see pg. 10). That said, all of our previous HDX data show little protection in the PHTH domain in full-length BTK (mapping of the PHTH/kinase interaction was only possible in trans using excess PHTH domain), and so our data can be best summarized by concluding that the PHTH domain visits a number of conformational states and makes transient contacts with various regions of the kinase domain (dependent upon whether the SH3-SH2 region is engaged or not). This is similar to the ‘fuzzy’ intramolecular contacts described for the N-terminal region of the SRC family. Like the SRC family, BTK (and other TEC kinases) contains a long disordered linker between the N-terminal region and the compact SH3-SH2-kinase core.

      11) For the novel active state dimer of the Kinase Domain it would be great to see some functional validation of the dimerization interface. It is structurally certainly quite suggestive, but without such experiments the functional significance is unclear. If appropriate mutations have been published previously a citation would be helpful.

      We completely agree. We scoured the literature and our own functional assay results from many years, but the appropriate mutations to test the functional significance of the kinase domain dimer have not been reported or previously studied in our lab. We are therefore actively pursuing this line of investigation now.

      Reviewer #1 (Recommendations For The Authors):

      I have the following proposed experiments/analysis that should help.

      1) To better validate the putative PH-kinase interface seen, the authors should try some alphafold multimer / rosettaTTFold modelling of just the PHTH module with the kinase domain. The advantage of this is that it will test how conserved over evolution the potential interface is, and will help to decipher discrepancies between the two structures. This may end up being similar to what is seen in Akt (in this case the alphafold prediction does not match the allosteric inhibitor structure, or the nanobody bound structure), but this could help provide additional insight into how the PH domain interacts.

      We have applied AlphaFold to this system. The PHTH-kinase fusion sequence was fed to AlphaFold and the separate PHTH and kinase domains to AlphaFold-Multimer. The results provide a range of ‘complexes’, none of which recapitulate the PHTH/kinase interface reported here or that reported by Wang et al. in previous work. Three of five results from AlphaFold-Multimer place the PHTH domain on the activation loop face of the kinase domain, consistent with the previous solution data pointing to a similar regulatory interface. This is interesting, but our experience in applying AlphaFold to dynamic, conformationally heterogeneous systems is that the results need to be considered with caution. For that reason we did not include any of the AlphaFold predictions in the manuscript.

      Evolutionary conservation is discussed further in the next section:

      2) Could the authors provide a detailed evolutionary analysis of the binding surface between the PHTH and kinase domains and include this in Fig. 5? This would also help interpret the likelihood of this interface.

      This is an excellent question and we have in fact previously published a detailed evolutionary analysis of the BTK kinase domain in collaboration with Kannan Natarajan (see Amatya et al., PNAS, 2019, [ref 11]). In that work we found that evolutionarily conserved residues on the kinase domain map to the activation loop face, supporting the solution data that the PHTH interacts with the kinase domain across the activation loop face. That work predated AlphaFold, but it is interesting that, to the extent that AlphaFold predicts anything, it seems to converge on the PHTH domain contacting the activation loop face.

      In the context of our current work, and this question from the reviewer, we re-examined the evolutionary analysis carried out previously and find that BTK (or TEC family) specific residues on the kinase domain do not appear at the newly identified PHTH/kinase interface we report here. We could speculate that since the ‘back’ of the kinase domain N-lobe interacts with multiple binding partners (SH3, SH2-linker and PHTH), evolutionary pressures may have resulted in a certain degree of plasticity to allow recognition of multiple binding partners.

      Evolutionary analysis of the BTK PH domain was also carried out previously and shows that the conserved sites map to the phospholipid binding pocket of the PH domain. The analysis did not include TH domain residues. Since we find the TH domain contributes to the PHTH/kinase interface in our crystal structure, we do not have the data at this time to do a thorough analysis, but we appreciate this comment and can address this in future work with collaborators.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank the reviewers and editors for their careful read of our paper, and appreciate the thoughtful comments.

      Both reviewers agreed that our work had several major strengths: the large dataset collected in collaboration across ten labs, the streamlined processing pipelines, the release of code repositories, the multi-task neural network, and that we definitively determined that electrode placement is an important source of variability between datasets.

      However, a number of key potential improvements were noted: the reviewers felt that a more standard model-based characterization of single neuron responses would benefit our reproducibility analysis, that more detail was needed about the number of cells, sessions, and animals, and that more information was needed to allow users to deploy the RIGOR standards and to understand their relationship to other metrics in the field.

      We agree with these suggestions and have implemented many major updates in our revised manuscript. Some highlights include:

      (1) A new regression analysis that specifies the response profile of each neuron, allowing a comparison of how similar these are across labs and areas (See Figure 7 in the new section, “Single neuron coefficients from a regression-based analysis are reproducible across labs”);

      (2) A new decoding analysis (See Figure 9 in the section, “Decodability of task variables is consistent across labs, but varies by brain region”);

      (3) A new RIGOR notebook to ease useability;

      (4) A wealth of additional information about the cells, animals and sessions in each figure;

      (5) Many new additional figure panels in the main text and supplementary material to clarify the specific points raised by the reviewers.

      Again, we are grateful to the reviewers and editors for their helpful comments, which have significantly improved the work. We are hopeful that the many revisions we have implemented will be sufficient to change the “incomplete” designation that was originally assigned to the manuscript.

      Reviewer #1 (Public review):

      Summary:

      The authors explore a large-scale electrophysiological dataset collected in 10 labs while mice performed the same behavioral task, and aim to establish guidelines to aid reproducibility of results collected across labs. They introduce a series of metrics for quality control of electrophysiological data and show that histological verification of recording sites is important for interpreting findings across labs and should be reported in addition to planned coordinates. Furthermore, the authors suggest that although basic electrophysiology features were comparable across labs, task modulation of single neurons can be variable, particularly for some brain regions. The authors then use a multi-task neural network model to examine how neural dynamics relate to multiple interacting task- and experimenter-related variables, and find that lab-specific differences contribute little to the variance observed. Therefore, analysis approaches that account for correlated behavioral variables are important for establishing reproducible results when working with electrophysiological data from animals performing decision-making tasks. This paper is very well-motivated and needed. However, what is missing is a direct comparison of task modulation of neurons across labs using standard analysis practice in the field, such as generalized linear models (GLMs). This can potentially clarify how much behavioral variance contributes to the neural variance across labs, and more accurately estimate the scale of the issues of reproducibility in behavioral systems neuroscience, where conclusions often depend on these standard analysis methods.

      We fully agree that a comparison of task-modulation across labs is essential. To address this, we have performed two new analyses and added new corresponding figures to the main text (Figures 7 and 9). As the reviewer hoped, this analysis did indeed clarify how much behavioral variance contributes to the variance across labs. Critically, these analyses suggested that our results were more reproducible than the more traditional analyses would indicate.

      Additional details are provided below (See detailed response to R1P1b).

      Strengths:

      (1) This is a well-motivated paper that addresses the critical question of reproducibility in behavioural systems neuroscience. The authors should be commended for their efforts.

      (2) A key strength of this study comes from the large dataset collected in collaboration across ten labs. This allows the authors to assess lab-to-lab reproducibility of electrophysiological data in mice performing the same decision-making task.

      (3) The authors' attempt to streamline preprocessing pipelines and quality metrics is highly relevant in a field that is collecting increasingly large-scale datasets where automation of these steps is increasingly needed.

      (4) Another major strength is the release of code repositories to streamline preprocessing pipelines across labs collecting electrophysiological data.

      (5) Finally, the application of MTNN for characterizing functional modulation of neurons, although not yet widely used in systems neuroscience, seems to have several advantages over traditional methods.

      Thanks very much for noting these strengths of our work.

      Weaknesses:

      (1) In several places the assumptions about standard practices in the field, including preprocessing and analyses of electrophysiology data, seem to be inaccurately presented:

      a) The estimation of how much the histologically verified recording location differs from the intended recording location is valuable information. Importantly, this paper provides citable evidence for why that is important. However, histological verification of recording sites is standard practice in the field, even if not all studies report them. Although we appreciate the authors' effort to further motivate this practice, the current description in the paper may give readers outside the field a false impression of the level of rigor in the field.

      We agree that labs typically do perform histological verification. Still, our methods offer a substantial improvement over standard practice, and this was critical in allowing us to identify errors in targeting. For instance, we used new software, LASAGNA, which is an innovation over the traditional, more informal approach to localizing recording sites. Second, the requirement that two independent reviewers concur on each proposed location for a recording site is also an improvement over standard practice. Importantly, these reviewers use electrophysiological features to more precisely localize electrodes, when needed, which is an improvement over many labs. Finally, most labs use standard 2D atlases to identify recording location (a traditional approach); our use of a 3D atlas and a modern image registration pipeline has improved the accuracy of identifying the true placement of probes in 3D space.

      Importantly, we don’t necessarily advocate that all labs adopt our pipeline; indeed, this would be infeasible for many labs. Instead, our hope is that the variability in probe trajectory that we uncovered will be taken into account in future studies. Here are 3 example ways in which that could happen. First, groups hoping to target a small area for an experiment might elect to use a larger cohort than previously planned, knowing that some insertions will miss their target. Second, our observation that some targeting error arose because experimenters had to move probes due to blood vessels will impact future surgeries: when an experimenter realizes that a blood vessel is in the way, they might still re-position the probe, but they can also adjust its trajectory (e.g., changing the angle) knowing that even little nudges to avoid blood vessels can have a large impact on the resulting insertion trajectory. Third, our observation of a 7 degree deviation between stereotaxic coordinates and Allen Institute coordinates can be used for future trajectory planning steps to improve accuracy of placement. Uncovering this deviation required many insertions and our standardized pipeline, but now that it is known, it can be easily corrected without needing such a pipeline.

      We thank the reviewer for bringing up this issue and have added new text (and modified existing text) in the Discussion to highlight the innovations we introduced that allowed us to carefully quantify probe trajectory across labs (lines 500 - 515):

      “Our ability to detect targeting error benefited from an automated histological pipeline combined with alignment and tracing that required agreement between multiple users, an approach that greatly exceeds the histological analyses done by most individual labs. Our approach, which enables scalability and standardization across labs while minimizing subjective variability, revealed that much of the variance in targeting was due to the probe entry positions at the brain surface, which were randomly displaced across the dataset. … Detecting this offset relied on a large cohort size and an automated histological pipeline, but now that we have identified the offset, it can be easily accounted for by any lab. Specifically, probe angles must be carefully computed from the CCF, as the CCF and stereotaxic coordinate systems do not define the same coronal plane angle. Minimizing variance in probe targeting is another important element in increasing reproducibility, as slight deviations in probe entry position and angle can lead to samples from different populations of neurons. Collecting structural MRI data in advance of implantation could reduce targeting error, although this is infeasible for most labs. A more feasible solution is to rely on stereotaxic coordinates but account for the inevitable off-target measurements by increasing cohort sizes and adjusting probe angles when blood vessels obscure the desired location.”

      b) When identifying which and how neurons encode particular aspects of stimuli or behaviour in behaving animals (when variables are correlated by the nature of the animal's behaviour), it has become the standard in behavioral systems neuroscience to use GLMs - indeed many labs participating in the IBL also have a long history of doing this (e.g., Steinmetz et al., 2019; Musall et al., 2023; Orsolic et al., 2021; Park et al., 2014). The reproducibility of results when using GLMs is never explicitly shown, but the supplementary figures to Figure 7 indicate that results may be reproducible across labs when using GLMs (as it has similar prediction performance to the MTNN). This should be introduced as the first analysis method used in a new dedicated figure (i.e., following Figure 3 and showing results of analyses similar to what was shown for the MTNN in Figure 7). This will help put into perspective the degree of reproducibility issues the field is facing when analyzing with appropriate and common methods. The authors can then go on to show how simpler approaches (currently in Figures 4 and 5) - not accounting for a lot of uncontrolled variabilities when working with behaving animals - may cause reproducibility issues.

      We fully agree with the reviewer's suggestion. We have addressed their concern by implementing a Reduced-Rank Regression (RRR) model, which builds upon and extends the principles of Generalized Linear Models (GLMs). The RRR model retains the core regression framework of GLMs while introducing shared, trainable temporal bases across neurons, enhancing the model’s capacity to capture the structure in neural activity (Posani, Wang, et al., bioRxiv, 2024). Importantly, Posani, Wang et al compared the predictive performance of GLMs vs the RRR model, and found that the RRR model provided (slightly) improved performance, so we chose the RRR approach here.

      We highlight this analysis in a new section (lines 350-377) titled, “Single neuron coefficients from a regression-based analysis are reproducible across labs”. This section includes an entirely new Figure (Fig. 7), where this new analysis felt most appropriate, since it is closer in spirit to the MTNN analysis that follows (rather than as a new Figure 3, as the reviewer suggested). As the reviewer hoped, this analysis provides some reassurance that including many variables when characterizing neural activity furnishes results with improved reproducibility. We now state this in the Results and the Discussion (line 456-457), highlighting that these analyses complement the more traditional selectivity analyses, and that using both methods together can be informative.

      When the authors introduce a neural network approach (i.e. MTNN) as an alternative to the analyses in Figures 4 and 5, they suggest: 'generalized linear models (GLMs) are likely too inflexible to capture the nonlinear contributions that many of these variables, including lab identity and spatial positions of neurons, might make to neural activity'. This is despite the comparison between MTNN and GLM prediction performance (Supplement 1 to Figure 7) showing that the MTNN is only slightly better at predicting neural activity compared to standard GLMs. The introduction of new models to capture neural variability is always welcome, but the conclusion that standard analyses in the field are not reproducible can be unfair unless directly compared to GLMs.

      In essence, it is really useful to demonstrate how different analysis methods and preprocessing approaches affect reproducibility. But the authors should highlight what is actually standard in the field, and then provide suggestions to improve from there.

      Thanks again for these comments. We have also edited the MTNN section slightly to accommodate the addition of the previous new RRR section (line 401-402).

      (2) The authors attempt to establish a series of new quality control metrics for the inclusion of recordings and single units. This is much needed, with the goal to standardize unit inclusion across labs that bypasses the manual process while keeping the nuances from manual curation. However, the authors should benchmark these metrics to other automated metrics and to manual curation, which is still a gold standard in the field. The authors did this for whole-session assessment but not for individual clusters. If the authors can find metrics that capture agreed-upon manual cluster labels, without the need for manual intervention, that would be extremely helpful for the field.

      We thank the reviewer for their insightful suggestions regarding benchmarking our quality control metrics against manual curation and other automated methods at the level of individual clusters. We are indeed, as the reviewer notes, publishing results from spike sorting outputs that have been automatically but not manually verified on a neuron-by-neuron basis. To get to the point where we trust these results to be of publishable quality, we manually reviewed hundreds of recordings and thousands of neurons, refining both the preprocessing pipeline and the single-unit quality metrics along the way. All clusters, both those passing QCs and those not passing QCs, are available to review with detailed plots and quantifications at https://viz.internationalbrainlab.org/app (turn on “show advanced metrics” in the upper right, and navigate to the plots furthest down the page, which are at the individual unit level). We would emphasize that these metrics are definitely imperfect (and fully-automated spike sorting remains a work in progress), but so is manual clustering. Our fully automated approach has the advantage of being fully reproducible, which is absolutely critical for the analyses in the present paper. Indeed, if we had actually done manual clustering or curation, one would wonder whether our results were actually reproducible independently. Nevertheless, it is not part of the present manuscript’s objectives to validate or defend these specific choices for automated metrics, which have been described in detail elsewhere (see our Spike Sorting whitepaper, https://figshare.com/articles/online_resource/Spike_sorting_pipeline_for_the_International_Brain_Laboratory/19705522?file=49783080). It would be a valuable exercise to thoroughly compare these metrics against a careful, large, manually-curated set, but doing this properly would be a paper in itself and is beyond the scope of the current paper.

      We also acknowledge that our analyses studying reproducibility across labs could, in principle, result in more or less reproducibility under a different choice of metrics, which we now describe in the Discussion (line 469-470):

      “Another significant limitation of the analysis presented here is that we have not been able to assess the extent to which other choices of quality metrics and inclusion criteria might have led to greater or lesser reproducibility.”

      (3) With the goal of improving reproducibility and providing new guidelines for standard practice for data analysis, the authors should report the n of cells, sessions, and animals used in plots and analyses throughout the paper, both to aid understanding of the variability in the plots and to set a good example.

      We wholeheartedly agree and have added the number of cells, mice and sessions for each figure. This information is included as new tabs in our quality control spreadsheet (https://docs.google.com/spreadsheets/d/1_bJLDG0HNLFx3SOb4GxLxL52H4R2uPRcpUlIw6n4n-E/). This is referred to in line 158-159 (as well as its original location on line 554 in the section, “Quality control and data inclusion”).

      Other general comments:

      (1) In the discussion (line 383) the authors conclude: 'This is reassuring, but points to the need for large sample sizes of neurons to overcome the inherent variability of single neuron recording'. - Based on what is presented in this paper we would rather say that their results suggest that appropriate analytical choices are needed to ensure reproducibility, rather than large datasets - and they need to show whether using standard GLMs actually allows for reproducible results.

      Thanks. The new GLM-style RRR analysis in Figure 7, following the reviewer’s suggestion, does indeed indicate improved reproducibility across labs. As described above, we see this new analysis as complementary to more traditional analyses of neural selectivity and argue that the two can be used together. The new text (line 461) states:

      “This is reassuring, and points to the need for appropriate analytical choices to ensure reproducibility.”

      (2) A general assumption in the across-lab reproducibility questions in the paper relies on intralab variability vs across-lab variability. An alternative measure that may better reflect experimental noise is across-researcher variability, as well as the amount of experimenter experience (if the latter is a factor, it could suggest researchers may need more training before collecting data for publication). The authors state in the discussion that this is not possible. But maybe certain measures can be used to assess this (e.g. years of conducting surgeries/ephys recordings etc)?

      We agree that understanding experimenter-to-experimenter variability would be very interesting and indeed we had hoped to do this analysis for some time. The problem is that typically, each lab employed one trainee to conduct all the data collection. This prevents us from comparing outcomes from two different experimenters in the same lab. There are exceptions to this, such as the Churchland lab in which 3 personnel (two postdocs and a technician) collected the data. However, even this fortuitous situation did not lend itself well to assessing experimenter-to-experimenter variation: the Churchland lab moved from Cold Spring Harbor to UCLA during the data collection period, which might have caused variability that is totally independent of experimenter (e.g., different animal facilities). Further, once at UCLA, the postdoc and technician worked closely together, alternating roles in animal training, surgery and electrophysiology. We believe that the text in our current Discussion (line 465-468) accurately characterizes the situation:

      “Our experimental design precludes an analysis of whether the reproducibility we observed was driven by person-to-person standardization or lab-to-lab standardization. Most likely, both factors contributed: all lab personnel received standardized instructions for how to implant head bars and train animals, which likely reduced personnel-driven differences.”

      Quantifying the level of experience of each experimenter is an appealing idea and we share the reviewer’s curiosity about its impact on data quality. Unfortunately, quantifying experience is tricky. For instance, years of conducting surgeries is not an unambiguously determinable number. Would we count an experimenter who did surgery every day for a year as having the same experience as an experimenter who did surgery once/month for a year? Would we count a surgeon with expertise in other areas (e.g., windows for imaging) in the same way as surgeons with expertise in ephys-specific surgeries? Because of the ambiguities, we leave this analysis to be the subject of future work; this is now stated in the Discussion (line 476).

      (3) Figure 3b and c: Are these plots before or after the probe depth has been adjusted based on physiological features such as the LFP power? In other words, is the IBL electrophysiological alignment toolbox used here and is the reliability of location before using physiological criteria or after? Beyond clarification, showing both before and after would help the readers to understand how much the additional alignment based on electrophysiological features adjusts probe location. It would also be informative if they sorted these penetrations by which penetrations were closest to the planned trajectory after histological verification.

      The plots in Figure 3b and 3c reflect data after the probe depth has been adjusted based on electrophysiological features. This adjustment incorporates criteria such as LFP power and spiking activity to refine the trajectory and ensure precise alignment with anatomical landmarks. The trajectories have also been reviewed and confirmed by two independent reviewers. We have clarified this in line 180 and in the caption of Figure 3.

      To address this concern, we have added a new panel c in Figure 3 supplementary 1 (also shown below) that shows the LFP features along the probes prior to using the IBL alignment toolbox. We hope the reviewer agrees that a comparison of panels (a) and (c) below makes clear the improvement afforded by our alignment tools.

      In Figure 3 and Figure 3 supplementary 1, as suggested, we have also now sorted the probes by those that were closest to the planned trajectory. This way of visualizing the data makes it clear that as the distance from the planned trajectory increases, the power spectral density in the hippocampal regions becomes less pronounced and the number of probes that have a large portion of the channels localized to VISa/am, LP and PO decreases. We have added text to the caption to describe this. We thank the reviewer for this suggestion and agree that it will help readers to understand how much the additional alignment (based on electrophysiological features) adjusts probe location.

      (4) In Figures 4 and 6: If the authors use a 0.05 threshold (alpha) and a cell simply has to be significant on 1/6 tests to be considered task modulated, that means that they have a false positive rate of ~30% (0.05*6=0.3). We ran a simple simulation looking for significant units (from random null distribution) from these criteria which shows that out of 100,000 units, 26,500 units would come out significant (false positive rate: 26.5%). That is very high (and unlikely to be accepted in most papers), and therefore it is not surprising that the fraction of task-modulated units across labs is highly variable. This high false positive rate may also have implications for the investigation of the spatial position of task-modulated units (as effects of the spatial position may drown in falsely labelled 'task-modulated' cells).

      Thank you for this concern. The different tests were kept separate, so we did not consider a neuron modulated if it was significant in only one out of six tests, but instead we asked whether a neuron was modulated according to test one, whether it was modulated according to test two, etc., and performed further analyses separately for each test. Thus, we are only vulnerable to the ‘typical’ false positive rate of 0.05 for any given test. We made this clearer in the text (lines 232-236) and hope that the 5% false positive rate seems more acceptable.
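
      The arithmetic behind this exchange can be checked with a short sketch. It assumes six independent tests at alpha = 0.05 and null p-values that are uniform on [0, 1]; the function names are illustrative and are not taken from the manuscript's code.

```python
import random


def familywise_rate(alpha: float, n_tests: int) -> float:
    """Probability that at least one of n independent null tests is significant."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_any_significant(alpha: float, n_tests: int, n_units: int, seed: int = 0) -> float:
    """Fraction of null units flagged when 'significant on any test' is the criterion."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_units):
        # Under the null, each test's p-value is assumed uniform on [0, 1].
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_units


analytic = familywise_rate(0.05, 6)
simulated = simulate_any_significant(0.05, 6, 100_000)
print(f"analytic: {analytic:.4f}, simulated: {simulated:.4f}")
```

      Under these assumptions, an "any of six" criterion gives roughly 26.5% false positives, in line with the reviewer's simulation, whereas analysing each test separately keeps the rate at alpha for that test.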

      (5) The authors state from Figure 5b that the majority of cells could be well described by 2 PCs. The distribution of R2 across neurons is almost uniform, so depending on what R2 value one considers a 'good' description, that is the fraction of 'good' cells. Furthermore, movement onset has now been well-established to be affecting cells widely and in large fractions, so while this analysis may work for something with global influence - like movement - more sparsely encoded variables (as many are in the brain) may not be well approximated with this suggestion. The authors could expand this analysis into other epochs like activity around stimulus presentation, to better understand how this type of analysis reproduces across labs for features that have a less global influence.

      We thank the reviewer for the suggestion and fully agree that the window used in our original analysis would tend to favor movement-driven neurons. To address this, we repeated the analysis, this time using a window centered around stimulus onset (from 0.5 s before stimulus onset until 0.1 s after stimulus onset). As the reviewer suspected, far fewer neurons were active in this window and consequently far fewer were modelled well by the first two PCs, as shown in Author response image 1b (below). Similar to our original analysis using the post-movement window, we found mixed results for the stimulus-centered window across labs. Interestingly, regional differences were weaker in this new analysis compared to the original analysis of the post-movement window. We have added a sentence to the results describing this. Because the results are similar to the post-movement window main figure, we would prefer to restrict the new analysis only to this point-by-point response, in the hopes of streamlining the paper.

      Author response image 1.

      PCA analysis applied to a stimulus-aligned window ([-0.5, 0.1] sec relative to stim onset). Figure conventions as in main text Fig 5. Results are comparable to the post-movement window analysis; however, regional differences are weaker here, possibly because fewer cells were active in the pre-movement window. We added panel j here and in the main figure, showing cell-number-controlled results. That is, for each test, every compared class (say, the labs in a region) was subsampled to the neuron count of the smallest class; this sampling was repeated 1000 times and the p-values were combined via Fisher’s method, overall resulting in far fewer significant differences across laboratories and, independently, regions.

      (6) Additionally, in Figure 5i: could the finding that one can only distinguish labs when taking cells from all regions, simply be a result of a different number of cells recorded in each region for each lab? It makes more sense to focus on the lab/area pairing as the authors also do, but not to make their main conclusion from it. If the authors wish to do the comparison across regions, they will need to correct for the number of cells recorded in each region for each lab. In general, it was a struggle to fully understand the purpose of Figure 5. While population analysis and dimensionality reduction are commonplace, this seems to be a very unusual use of it.

      We agree that controlling for varying cell numbers is a valuable addition to this analysis. We added panel j in Fig. 5 showing cell-number-controlled test results of panel i. That is, for a given statistical comparison, we subsample every compared class to the cell count of the smallest class, run the test, and repeat this sampling 1000 times, before combining the p-values using Fisher’s method. This cell-number-controlled version of the tests resulted in clearly fewer significant differences across distributions, as seen similarly for the pre-movement window in panel j of Author response image 1. We hope this clarifies our aim to illustrate that low-dimensional embedding of cells’ trial-averaged activity can show how regional differences compare with laboratory differences.
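
      The cell-number control described here can be sketched as follows. This is a minimal illustration and not the authors' code: `test_fn` stands in for whichever two-sample test is used (KS tests in the text), and the helper names are hypothetical. Fisher's method formally assumes independent, strictly positive p-values, which repeated subsamples of the same cells only approximate.

```python
import math
import random


def fisher_combine(p_values):
    """Combine p-values (each > 0) with Fisher's method: X = -2 * sum(ln p) ~ chi2(2k)."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X / 2
    if half_stat == 0.0:
        return 1.0  # every p-value equals 1
    # For even degrees of freedom 2k the chi-square survival function is
    # exp(-h) * sum_{i<k} h**i / i!, evaluated in log space to avoid overflow.
    log_terms = [
        -half_stat + i * math.log(half_stat) - math.lgamma(i + 1) for i in range(k)
    ]
    top = max(log_terms)
    return min(1.0, math.exp(top) * sum(math.exp(t - top) for t in log_terms))


def size_controlled_pvalue(groups, test_fn, n_repeats=1000, seed=0):
    """Subsample every group to the smallest group size, test, repeat, then combine."""
    rng = random.Random(seed)
    n_min = min(len(g) for g in groups)
    p_values = []
    for _ in range(n_repeats):
        subsampled = [rng.sample(g, n_min) for g in groups]
        p_values.append(test_fn(*subsampled))
    return fisher_combine(p_values)
```

      Passing the test as a callable keeps the subsampling logic independent of the particular statistic being compared across labs or regions.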

      As a complementary statistical analysis to the shown KS tests, we fitted a linear-mixed-effects model (statsmodels.formula.api.mixedlm) to the first and second PC for both activity windows (“Move”: [-0.5,1] first movement aligned; “Stim”: [-0.5,0.1] stimulus onset aligned), independently. Author response image 2 (in this rebuttal only) is broadly in line with the KS results, showing more regional than lab influences on the distributions of first PCs for the post-movement window.

      Author response image 2:

      Linear mixed effects model results for two PCs and two activity windows. For the post-movement window (“Move”), regional influences are significant (red color in plots) for all but one region while only one lab has a significant model coefficient for PC1. For PC2 more labs and three regions have significant coefficients. For the pre-movement window (“Stim”) one region for PC1 or PC2 has significant coefficients. The variance due to session id was smaller than all other effects (“eids Var”). “Intercept” shows the expected value of the response variable (PC1, PC2) before accounting for any fixed or random effects. All p-values were grouped as one hypothesis family and corrected for multiple comparisons via Benjamini-Hochberg.
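
      The Benjamini-Hochberg step mentioned in this caption can be sketched as below. This is a generic standard-library implementation for illustration, not the authors' code; the function name benjamini_hochberg and the default alpha of 0.05 are assumptions. It treats all supplied p-values as a single hypothesis family, as the caption describes.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return BH-adjusted p-values and reject flags, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Step down from the largest p-value, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    rejected = [q <= alpha for q in adjusted]
    return adjusted, rejected
```

      A call such as benjamini_hochberg(all_pvalues) would take the coefficient p-values pooled across both PCs and both windows; statsmodels.stats.multitest.multipletests with method="fdr_bh" offers an equivalent library routine.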

      (7) In the discussion the authors state: " Indeed this approach is a more effective and streamlined way of doing it, but it is questionable whether it 'exceeds' what is done in many labs.

      Classically, scientists trace each probe manually with light microscopy and designate each area based on anatomical landmarks identified with Nissl or DAPI stains together with gross landmarks. When not automated with 2-PI serial tomography and anatomically aligned to a standard atlas, this is a less effective process, but it is not clear that it is less precise, especially in studies before Neuropixels where active electrodes were located in a much smaller area. While more effective, transforming into a common atlas does make additional assumptions about warping the brain into the standard atlas, especially in cases where the brain has been damaged/lesioned. Readers can appreciate the effectiveness and streamlining provided by these new tools without the need to invalidate previous approaches.

      We thank the reviewer for highlighting the effectiveness of manual tracing methods used traditionally. Our intention in the statement was not to invalidate the precision or value of these classical methods but rather to emphasize the scalability and streamlining offered by our pipeline. We have revised the language to more accurately reflect this (line 500-504):

      “Our ability to detect targeting error benefited from an automated histological pipeline combined with alignment and tracing that required agreement between multiple users, an approach that greatly exceeds the histological analyses done by most individual labs. Our approach, which enables scalability and standardization across labs while minimizing subjective variability, revealed that much of the variance in targeting was due to the probe entry positions at the brain surface, which were randomly displaced across the dataset.”

      (8) What about across-lab population-level representation of task variables, such as in the coding direction for stimulus or choice? Is the general decodability of task variables from the population comparable across labs?

      Excellent question, thanks! We have added the new section “Decodability of task variables is consistent across labs, but varies by brain region” (line 423-448) and Figure 9 in the revised manuscript to address this question. In short, yes, the general decodability of task variables from the population is comparable across labs, providing additional reassurance of reproducibility.

      Reviewer #2 (Public review):

      Summary:

      The authors sought to evaluate whether observations made in separate individual laboratories are reproducible when they use standardized procedures and quality control measures. This is a key question for the field. If ten systems neuroscience labs try very hard to do the exact same experiment and analyses, do they get the same core results? If the answer is no, this is very bad news for everyone else! Fortunately, they were able to reproduce most of their experimental findings across all labs. Despite attempting to target the same brain areas in each recording, variability in electrode targeting was a source of some differences between datasets.

      Major Comments:

      The paper had two principal goals:

      (1) to assess reproducibility between labs on a carefully coordinated experiment

      (2) distill the knowledge learned into a set of standards that can be applied across the field.

      The manuscript made progress towards both of these goals but leaves room for improvement.

      (1) The first goal of the study was to perform exactly the same experiment and analyses across 10 different labs and see if you got the same results. The rationale for doing this was to test how reproducible large-scale rodent systems neuroscience experiments really are. In this, the study did a great job showing that when a consortium of labs went to great lengths to do everything the same, even decoding algorithms could not clearly discern laboratory identity from the raw data. However, the amount of coordination between the labs was so great that these findings are hard to generalize to the situation where similar (or conflicting!) results are generated by two labs working independently.

      Importantly, the study found that electrode placement (and thus likely also errors inherent to the electrode placement reconstruction pipeline) was a key source of variability between datasets. To remedy this, they implemented a very sophisticated electrode reconstruction pipeline (involving two-photon tomography and multiple blinded data validators) in just one lab, and all brains were sliced and reconstructed in this one location. This is a fantastic approach for ensuring similar results within the IBL collaboration, but makes it unclear how much variance would have been observed if each lab had attempted to reconstruct their probe trajectories themselves using a mix of histology techniques, from conventional brain slicing to light sheet microscopy to MRI.

      This approach also raises a few questions. The use of standard procedures, pipelines, etc. is a great goal, but most labs are trying to do something unique with their setup. Bigger picture, shouldn't highly "significant" biological findings, akin to the discovery of place cells or grid cells, be so clear and robust that they can be identified with different recording modalities and analysis pipelines?

      We agree, and hope that this work may help readers understand what effect sizes may be considered “clear and robust” from datasets like these. We certainly support the reviewer’s point that multiple approaches and modalities can help to confirm any biological findings, but we would contend that a clear understanding of the capabilities and limitations of each approach is valuable, and we hope that our paper helps to achieve this.

      Related to this, how many labs outside of the IBL collaboration have implemented the IBL pipeline for their own purposes? In what aspects do these other labs find it challenging to reproduce the approaches presented in the paper? If labs were supposed to perform this same experiment, but without coordinating directly, how much more variance between labs would have been seen? Obviously investigating these topics is beyond the scope of this paper. The current manuscript is well-written and clear as is, and I think it is a valuable contribution to the field. However, some additional discussion of these issues would be helpful.

      We thank the reviewer for raising this important issue. We know of at least 13 labs that have implemented the behavioral task software and hardware that we published in eLife in 2021, and we expect that over the next several years labs will also implement these analysis pipelines (note that it is considerably cheaper and faster to implement software pipelines than hardware). In particular, a major goal of the staff in the coming years is to continue and improve the support for pipeline deployment and use. However, our goal in this work, which we have aimed to state more clearly in the revised manuscript, was not so much to advocate that others adopt our pipeline, but instead to use our standardized approach as a means of assessing reproducibility under the best of circumstances (see lines 48-52): “A high level of reproducibility of results across laboratories when procedures are carefully matched is a prerequisite to reproducibility in the more common scenario in which two investigators approach the same high-level question with slightly different experimental protocols.”

      Further, a number of our findings are relevant to other labs regardless of whether they implement our exact pipeline, a modified version of our pipeline, or something else entirely. For example, we found probe targeting to be a large source of variability. Our ability to detect targeting error benefited from an automated histological pipeline combined with alignment and tracing that required agreement between multiple users, but now that we have identified the offset, it can be easily accounted for by any lab. Specifically, probe angles must be carefully computed from the CCF, as the CCF and stereotaxic coordinate systems do not define the same coronal plane angle. Relatedly, we found that slight deviations in probe entry position can lead to samples from different populations of neurons. Although this took large cohort sizes to discover, knowledge of this discovery means that future experiments can plan for larger cohort sizes to allow for off-target trajectories, and can re-compute probe angle when the presence of blood vessels necessitates moving probes slightly. These points are now highlighted in the Discussion (lines 500-515).

      Second, the proportion of responsive neurons (a quantity often used to determine that a particular area subserves a particular function), sometimes failed to reproduce across labs. For example, for movement-driven activity in PO, UCLA reported an average change of 0 spikes/s, while CCU reported a large and consistent change (Figure 4d, right most panel, compare orange vs. yellow traces). This argues that neuron-to-neuron variability means that comparisons across labs require large cohort sizes. A small number of outlier neurons in a session can heavily bias responses. We anticipate that this problem will be remedied as tools for large scale neural recordings become more widely used. Indeed, the use of 4-shank instead of single-shank Neuropixels (as we used here) would have greatly enhanced the number of PO neurons we measured in each session. We have added new text to Results explaining this (lines 264-268):

      “We anticipate that the feasibility of even larger scale recordings will make lab-to-lab comparisons easier in future experiments; multi-shank probes could be especially beneficial for cortical recordings, which tend to be the most vulnerable to low cell counts since the cortex is thin and is the most superficial structure in the brain and thus the most vulnerable to damage. Analyses that characterize responses to multiple parameters are another possible solution (See Figure 7).”

      (2) The second goal of the study was to present a set of data curation standards (RIGOR) that could be applied widely across the field. This is a great idea, but its implementation needs to be improved if adoption outside of the IBL is to be expected. Here are three issues:

      (a) The GitHub repo for this project (https://github.com/int-brain-lab/paper-reproducible-ephys/) is nicely documented if the reader's goal is to reproduce the figures in the manuscript. Consequently, the code for producing the RIGOR statistics seems mostly designed for re-computing statistics on the existing IBL-formatted datasets. There doesn't appear to be any clear documentation about how to run it on arbitrary outputs from a spike sorter (i.e. the inputs to Phy).

      We agree that clear documentation is key for others to adopt our standards. To address this, we have added a section at the end of the README of the repository that links to a jupyter notebook (https://github.com/int-brain-lab/paper-reproducible-ephys/blob/master/RIGOR_script.ipynb) that runs the RIGOR metrics on a user’s own spike sorted dataset. The notebook also contains a tutorial that walks through how to visually assess the quality of the raw and spike sorted data, and computes the noise level metrics on the raw data as well as the single cell metrics on the spike sorted data.

      (b) Other sets of spike sorting metrics that are more easily computed for labs that are not using the IBL pipeline already exist (e.g. "quality_metrics" from the Allen Institute ecephys pipeline [https://github.com/AllenInstitute/ecephys_spike_sorting/blob/main/ecephys_spike_sorting/modules/quality_metrics/README.md] and the similar module in the Spike Interface package [https://spikeinterface.readthedocs.io/en/latest/modules/qualitymetrics.html]). The manuscript does not compare these approaches to those proposed here, but some of the same statistics already exist (amplitude cutoff, median spike amplitude, refractory period violation).

      There is a long history of researchers providing analysis algorithms and code for spike sorting quality metrics, and we agree that the Allen Institute’s ecephys code and the Spike Interface package are the current options most widely used (but see also, for example, Fabre et al. https://github.com/Julie-Fabre/bombcell). Our primary goal in the present work is not to advocate for a particular implementation of any quality metrics (or any spike sorting algorithm, for that matter), but instead to assess reproducibility of results, given one specific choice of spike sorting algorithm and quality metrics. That is why, in our comparison of yield across datasets (Fig 1F), we downloaded the raw data from those comparison datasets and re-ran them under our single fixed pipeline, to establish a fair standard of comparison. A full comparison of the analyses presented here under different choices of quality metrics and spike sorting algorithms would undoubtedly be interesting and useful for the field - however, we consider it to be beyond the scope of the present work. It is therefore an important assumption of our work that the result would not differ materially under a different choice of sorting algorithm and quality metrics. We have added text to the Discussion to clarify this limitation:

      “Another significant limitation of the analysis presented here is that we have not been able to assess the extent to which other choices of quality metrics and inclusion criteria might have led to greater or lesser reproducibility.”

      That said, we still intend for external users to be able to easily run our pipelines and quality metrics.

      (c) Some of the RIGOR criteria are qualitative and must be visually assessed manually. Conceptually, these features make sense to include as metrics to examine, but would ideally be applied in a standardized way across the field. The manuscript doesn't appear to contain a detailed protocol for how to assess these features. A procedure for how to apply these criteria for curating non-IBL data (or for implementing an automated classifier) would be helpful.

      We agree. To address this, we have provided a notebook that runs the RIGOR metrics on a user’s own dataset, and contains a tutorial on how to interpret the resulting plots and metrics (https://github.com/int-brain-lab/paper-reproducible-ephys/blob/master/RIGOR_script.ipynb).

      Within this notebook there is a section focused on visually assessing the quality of both the raw data and the spike sorted data. The code in this section can be used to generate plots, such as raw data snippets or the raster map of the spiking activity, which are typically used to visually assess the quality of the data. In Figure 1 Supplement 2 we have provided examples of such plots that show different types of artifactual activity that should be inspected.

      Other Comments:

      (1) How did the authors select the metrics they would use to evaluate reproducibility? Was this selection made before doing the study?

      Our metrics were selected on the basis of our experience and expertise with extracellular electrophysiology. For example: some of us previously published on epileptiform activity and its characteristics in some mice (Steinmetz et al. 2017), so we included detection of that type of artifact here; and, some of us previously published detailed investigations of instability in extracellular electrophysiological recordings and methods for correcting them (Steinmetz et al. 2021, Windolf et al. 2024), so we included assessment of that property here. These metrics therefore represent our best expert knowledge about the kinds of quality issues that can affect this type of dataset, but it is certainly possible that future investigators will discover and characterize other quality issues.

      The selection of metrics was primarily performed before the study (we used these assessments internally before embarking on the extensive quantifications reported here), and in cases where we refined them further during the course of preparing this work, it was done without reference to statistical results on reproducibility but instead on the basis of manual inspection of data quality and metric performance.

      (2) Was reproducibility within-lab dependent on experimenter identity?

      We thank the reviewer for this question. We have addressed it in our response to R1 General comment 2, as follows:

      We agree that understanding experimenter-to-experimenter variability would be very interesting and indeed we had hoped to do this analysis for some time. The problem is that typically, each lab employed one trainee to conduct all the data collection. This prevents us from comparing outcomes from two different experimenters in the same lab. There are exceptions to this, such as the Churchland lab in which 3 personnel (two postdocs and a technician) collected the data. However, even this fortuitous situation did not lend itself well to assessing experimenter-to-experimenter variation: the Churchland lab moved from Cold Spring Harbor to UCLA during the data collection period, which might have caused variability that is totally independent of experimenter (e.g., different animal facilities). Further, once at UCLA, the postdoc and technician worked closely together, alternating roles in animal training, surgery and electrophysiology. We believe that the text in our current Discussion (line 465-468) accurately characterizes the situation:

      “Our experimental design precludes an analysis of whether the reproducibility we observed was driven by person-to-person standardization or lab-to-lab standardization. Most likely, both factors contributed: all lab personnel received standardized instructions for how to implant head bars and train animals, which likely reduced personnel-driven differences.”

      Quantifying the level of experience of each experimenter is an appealing idea and we share the reviewer’s curiosity about its impact on data quality. Unfortunately, quantifying experience is tricky. For instance, years of conducting surgeries is not an unambiguously determinable number. Would we count an experimenter who did surgery every day for a year as having the same experience as an experimenter who did surgery once/month for a year? Would we count a surgeon with expertise in other areas (e.g., windows for imaging) in the same way as surgeons with expertise in ephys-specific surgeries? Because of the ambiguities, we leave this analysis to be the subject of future work; this is now stated in the Discussion (line 476).

      (3) They note that UCLA and UW datasets tended to miss deeper brain region targets (lines 185-188) - they do not speculate why these labs show systematic differences. Were they not following standardized procedures?

      Thank you for raising this point. All researchers across labs were indeed following standardised procedures. We note that our statistical analysis of probe targeting coordinates and angles did not reveal a significant effect of lab identity on targeting error, even though we noted the large number of mis-targeted recordings in UCLA and UW to help draw attention to the appropriate feature in the figure. Given that these differences were not statistically significant, we can see how it was misleading to call out these two labs specifically. While the overall probe placement surface error and angle error both show no such systematic difference, the magnitude of surface error showed a non-significant tendency to be higher for samples in UCLA & UW, which, compounded with the direction of probe angle error, caused these probe insertions to land in a final location outside LP & PO.

      This shows how subtle differences in probe placement & angle accuracy can lead to compounded inaccuracies at the probe tip, especially when targeting deep brain regions, even when following standard procedures. We believe this is driven partly by the accuracy limit or resolution of the stereotaxic system, along with slight deviations in probe angle, occurring during the setup of the stereotaxic coordinate system during these recordings.

      We have updated the relevant text in lines 187-190 as follows, to clarify:

      “Several trajectories missed their targets in deeper brain regions (LP, PO), as indicated by gray blocks, despite the lack of significant lab-dependent effects in targeting as reported above. These off-target trajectories tended to have both a large displacement from the target insertion coordinates and a probe angle that unfavorably drew the insertions away from thalamic nuclei (Figure 2f).”
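
      The compounding of surface displacement and angle error with depth can be illustrated with simple geometry. The sketch below is a hedged worked example, not an analysis from the manuscript: the function name tip_offset_um is hypothetical, the probe is treated as a straight line, and the two error sources are assumed to point in the same direction (the worst case), so that the lateral offset at a given depth is the surface error plus depth times the tangent of the angle error.

```python
import math


def tip_offset_um(surface_error_um, angle_error_deg, depth_um):
    """Worst-case lateral offset (in microns) at a given depth along the probe.

    Assumes a straight probe and that the surface displacement and the
    angular deviation push the trajectory in the same direction.
    """
    angular_term = depth_um * math.tan(math.radians(angle_error_deg))
    return surface_error_um + angular_term


# Hypothetical values chosen only to show the depth dependence.
shallow = tip_offset_um(surface_error_um=300.0, angle_error_deg=5.0, depth_um=1000.0)
deep = tip_offset_um(surface_error_um=300.0, angle_error_deg=5.0, depth_um=4000.0)
```

      With the same surface and angle errors, the offset at the deeper site is larger because the angular term scales linearly with depth, which is consistent with the deeper targets (LP, PO) being the ones most often missed.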

      (4) The authors suggest that geometrical variance (difference between planned and final identified probe position acquired from reconstructed histology) in probe placement at the brain surface is driven by inaccuracies in defining the stereotaxic coordinate system, including discrepancies between skull landmarks and the underlying brain structures. In this case, the use of skull landmarks (e.g. bregma) to determine locations of brain structures might be unreliable and provide an error of ~360 microns. While it is known that there is indeed variance in the position between skull landmarks and brain areas in different animals, the quantification of this error is a useful value for the field.

      We thank the reviewer for their thoughtful comment and are glad that they found the quantification of variance useful for the field.

      (5) Why are the thalamic recording results particularly hard to reproduce? Does the anatomy of the thalamus simply make it more sensitive to small errors in probe positioning relative to the other recorded areas?

      We thank the reviewer for raising this interesting question. We believe that they are referring to Figure 4: indeed when we analyzed the distribution of firing rate modulations, we saw some failures of reproducibility in area PO (bottom panel, Figure 4h). However, the thalamic nuclei were not, in other analyses, more vulnerable to failures in reproducibility. For example, in the top panel of Figure 4h, VisAM shows failures of reproducibility for modulation by the visual stimulus. In Fig. 5i, area CA1 showed a failure of reproducibility. We fear that the figure legend title in the previous version (which referred to the thalamus specifically) was misleading, and we have revised this. The new title is, “Neural activity is modulated during decision-making in five neural structures and is variable between laboratories.” This new text more accurately reflects that there were a number of small, idiosyncratic failures of reproducibility, but that these were not restricted to a specific structure. The new analysis requested by R1 (now in Figure 7) provides further reassurance of overall reproducibility, including in the thalamus (see Fig. 7a, right panels; lab identity could not be decoded from single neuron metrics, even in the thalamus).

      Reviewer #1 (Recommendations for the authors):

      (1) Figure font sizes and formatting are variable across panels and figures. Please streamline the presentation of results.

      Thank you for your feedback. We have remade all figures with the same standardized font sizes and formatting.

      (2) Please correct the noncontinuous color scales in Figures 3b and 3d.

      Thank you for pointing this out; we have fixed the color bar.

      (3) In Figures 5d and g, the error bars are described as: 'Error bands are standard deviation across cells normalised by the square root of the number of sessions in the region'. How does one interpret this error? It seems to be related to the standard error of the mean (std/sqrt(n)) but instead of using the n from which the standard deviation is calculated (in this case across cells), the authors use the number of sessions as n. If they took the standard deviation across sessions this would be the sem across sessions, and interpretable (as sem*1.96 is the 95% parametric confidence interval of the mean). Please justify why these error bands are used here and how they can be interpreted - it also seems like it is the only time these types of error bands are used.

      We agree and, for clarity, now use the standard error across cells; the error bars do not change dramatically either way.
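
      For reference, the standard error and the parametric 95% interval the reviewer mentions can be computed as in the sketch below. It is a generic illustration with a hypothetical function name (sem_and_ci95), assuming one value per cell at a given time point; it is not the plotting code behind Figure 5.

```python
import math
import statistics


def sem_and_ci95(values):
    """Return (mean, SEM, 95% CI) for a list of per-cell values."""
    n = len(values)
    mean = statistics.fmean(values)
    # Sample standard deviation divided by sqrt(n), with n the same set of
    # units (here cells) that the standard deviation was computed over.
    sem = statistics.stdev(values) / math.sqrt(n)
    half_width = 1.96 * sem  # normal approximation for the 95% interval
    return mean, sem, (mean - half_width, mean + half_width)
```

      The key point is that the n in the denominator matches the population over which the standard deviation is taken, which is what makes the band interpretable as an interval on the mean.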

      (4) It is difficult to understand what is plotted in Figures 5e,h, please unpack this further and clarify.

      Thank you for pointing this out. We have added additional explanation in the figure caption (See caption for Figure 5c) to explain the KS test.

      (5) In lines 198-201 the authors state that they were worried that Bonferroni correction with 5 criteria would be too lenient, and therefore used 0.01 as alpha. I am unsure whether the authors mean that they are correcting for multiple comparisons across features or areas. Either way, 0.01 alpha is exactly what a Bonferroni corrected alpha would be when correcting for either 5 features or 5 areas: 0.05/5=0.01. Or do they mean they apply the Bonferroni correction to the new 0.01 alpha: i.e., 0.01/5=0.002? Please clarify.

      Thank you, that was indeed written confusingly. We considered all tests and regions as a whole, so 7 tests * 5 regions = 35 tests, which would result in a very strong Bonferroni correction. Indeed, if one considers the different tests individually, the correction we apply from 0.05 to 0.01 can be considered as correcting for the number of regions, which we now highlight better. We apply no further corrections of any kind to our alpha=0.01. We clarified this in the manuscript in all relevant places (lines 205-208, 246, 297-298, and 726-727).
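
      The arithmetic behind the two candidate corrections is short enough to spell out. The snippet below is only a worked illustration with a hypothetical helper name (bonferroni_alpha); the counts of 5 regions and 7 tests are taken from the response above.

```python
def bonferroni_alpha(family_alpha, n_comparisons):
    """Per-comparison threshold under a Bonferroni correction."""
    return family_alpha / n_comparisons


# Correcting only for the 5 regions gives the alpha used in the manuscript.
per_region = bonferroni_alpha(0.05, 5)

# Correcting for every test in every region (7 tests x 5 regions = 35)
# gives a much stricter threshold, roughly 0.0014.
all_tests = bonferroni_alpha(0.05, 7 * 5)
```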

      (6) Did the authors take into account how many times a probe was used/how clean the probe was before each recording? Was this streamlined between labs? This can have an effect on yield and quality of recording.

      We appreciate the reviewer highlighting the potential impact of probe use and cleanliness on recording quality and yield. While we did not track the number of times each probe was used, we ensured that all probes were cleaned thoroughly after each use using a standardized cleaning protocol (Section 16: Cleaning the electrode after data acquisition in Appendix 2: IBL protocol for electrophysiology recording using Neuropixels probe). We acknowledge that tracking the specific usage history of each probe could provide additional insights, but unfortunately we did not track this information for this project. In prior work the re-usability of probes has been quantified, showing insignificant degradation with use (e.g. Extended Data Fig 7d from Jun et al. 2017).

      (7) Figure 3, Supplement1: DY_013 missed DG entirely? Was this included in the analysis?

      Thank you for this question. We believe the reviewer is referring to the lack of a prominent high-amplitude LFP band in this mouse, and lack of high-quality sorted units in that region. Despite this, our histology did localize the recording trajectory to DG. This recording did pass our quality control criteria overall, as indicated by the green label, and was used in relevant analyses.

      The lack of normal LFP features and neuron yield might reflect the range of biological variability (several other sessions also have relatively weak DG LFP and yield, though DY_013 is the weakest), or could reflect some damage to the tissue, for example as caused by local bleeding. Because we could not conclusively identify the source of this observation, we did not exclude it.

      (8) Given that the authors argue for using the MTNN over GLMs, it would be useful to know exactly how much better the MTNN is at predicting activity in the held-out dataset (shown in Figure 7, Supplement 1). It looks like a very small increase in prediction performance between MTNN and GLMs, is it significantly different?

      The average variance explained on the held-out dataset, as shown in Figure 8–Figure Supplement 1 Panel B, is 0.065 for the GLMs and 0.071 for the MTNN. As the reviewer correctly noted, this difference is not significant. However, one of the key advantages of the MTNN over GLMs lies in its flexibility to easily incorporate covariates, such as electrophysiological characteristics or session/lab IDs, directly into the analysis. This feature is particularly valuable for assessing effect sizes and understanding the contributions of various factors.

      (9) In line 723: why is the threshold for mean firing rate for a unit to be included in the MTNN results so high (>5Hz), and how does it perform on units with lower firing rates?

      We thank the reviewer for pointing this out. The threshold for including units with a mean firing rate above 5 Hz was set because most units with firing rates below this threshold were silent in many trials, and reducing the number of units helped keep the MTNN training time reasonable. Based on this comment, we ran the MTNN experiments including all units with firing rates above 1 Hz, and the results remained consistent with our previous conclusions (Figure 8). Crucially, the leave-one-out analysis consistently showed that lab and session IDs had effect sizes close to zero, indicating that both within-lab and between-lab random effects are small and comparable.

      Reviewer #2 (Recommendations for the authors):

      (1) Most of the more major issues were already listed in the above comments. The strongest recommendation for additional work would be to improve the description and implementation of the RIGOR statistics such that non-IBL labs that might use Neuropixels probes but not use the entire IBL pipeline might be able to apply the RIGOR framework to their own data.

      We thank the reviewer for highlighting the importance of making the RIGOR statistics more accessible to a broader audience. We agree that improving the description and implementation of the RIGOR framework is essential for supporting non-IBL labs that use Neuropixels probes. To address this, we created a Jupyter notebook with step-by-step guidance that is not dependent on the IBL pipeline. This tool (https://github.com/int-brain-lab/paper-reproducible-ephys/blob/develop/RIGOR_script.ipynb) is publicly available through the repository, accompanied by example datasets and usage tutorials.

      (2) Table 1: How are qualitative features like "drift" defined? Some quantitative statistics like "presence ratio" (the fraction of the dataset where spikes are present) already exist in packages like ecephys_spike_sorting. Who measured these qualitative features? What are the best practices for doing these qualitative analyses?

      At the probe level, we compute the estimate of the relative motion of the electrodes to the brain tissue at multiple depths along the electrode. We overlay the drift estimation over a raster plot to detect sharp displacements as a function of time. Quantitatively, the drift is the cumulative absolute electrode motion estimated during spike sorting (µm). We clarified the corresponding text in Table 1.

      The qualitative assessments were carried out by IBL staff and experimentalists. We have now provided code to run the RIGOR metrics along with an embedded tutorial, to complement the supplemental figures we have shown about qualitative metric interpretation.
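
      The two quantitative definitions mentioned in this exchange can be sketched as follows. This is an illustrative reading only, not the RIGOR or ecephys_spike_sorting implementation: we assume that the drift is the sum of absolute changes between successive displacement estimates, and that the presence ratio is the fraction of equal-width time bins containing at least one spike. All names and the bin count are hypothetical.

```python
def cumulative_drift_um(displacement_um):
    """Sum of absolute changes between successive displacement estimates (µm)."""
    return sum(abs(b - a) for a, b in zip(displacement_um, displacement_um[1:]))


def presence_ratio(spike_times_s, duration_s, n_bins=100):
    """Fraction of equal-width time bins that contain at least one spike."""
    bin_width = duration_s / n_bins
    occupied = {
        min(int(t / bin_width), n_bins - 1)  # clamp a spike at t == duration_s into the last bin
        for t in spike_times_s
        if 0 <= t <= duration_s
    }
    return len(occupied) / n_bins


# Hypothetical displacement estimates at one depth, sampled over time (µm).
drift = cumulative_drift_um([0.0, 2.0, 1.0, 4.0])

# Hypothetical spike times (s) in a 10 s recording, using 10 bins of 1 s each.
ratio = presence_ratio([0.5, 1.5, 1.7, 9.9], duration_s=10.0, n_bins=10)
```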

      (3) Table 1: What are the units for the LFP derivative?

      We thank the reviewer for noting that the unit was missing. The unit (decibel per unit of space) is now in the table.

      (4) Table 1: For "amplitude cutoff", the table says that "each neuron must pass a metric". What is the metric?

      We have revised the table to include this information. This metric was designed to detect potential issues in amplitude distributions caused by thresholding during deconvolution, which could result in missed spikes. There are quantitative thresholds on the distribution of the low tail of the amplitude histogram relative to the high tail, and on the relative magnitude of the bins in the low tail. We now reference the methods text from the table, which includes a more extended description and gives the specific threshold numbers. Also, the metric and thresholds are more easily understood with graphical assistance; see the IBL Spike Sorting Whitepaper for this (Fig. 17 in that document and nearby text; https://doi.org/10.6084/m9.figshare.19705522.v4). This reference is now also cited in the text.

      (5) Figure 2: In panel A, the brain images look corrupted.

      Thanks; in the revised version we have changed the filetype to improve the quality of the panel image.

      (6) Figure 7: In panel D, make R2 into R^2 (with a superscript)

      Panel D y-axis label has been revised to include superscript (note that this figure is now Figure 8).

      Works Cited

      Julie M.J. Fabre, Enny H. van Beest, Andrew J. Peters, Matteo Carandini, and Kenneth D. Harris. Bombcell: automated curation and cell classification of spike-sorted electrophysiology data, July 2023. URL https://doi.org/10.5281/zenodo.8172822.

      James J. Jun, Nicholas A. Steinmetz, Joshua H. Siegle, Daniel J. Denman, Marius Bauza, Brian Barbarits, Albert K. Lee, Costas A. Anastassiou, Alexandru Andrei, Çağatay Aydın, Mladen Barbic, Timothy J. Blanche, Vincent Bonin, João Couto, Barundeb Dutta, Sergey L. Gratiy, Diego A. Gutnisky, Michael Häusser, Bill Karsh, Peter Ledochowitsch, Carolina Mora Lopez, Catalin Mitelut, Silke Musa, Michael Okun, Marius Pachitariu, Jan Putzeys, P. Dylan Rich, Cyrille Rossant, Wei-lung Sun, Karel Svoboda, Matteo Carandini, Kenneth D. Harris, Christof Koch, John O’Keefe, and Timothy D. Harris. Fully integrated silicon probes for high-density recording of neural activity. Nature, 551(7679):232–236, Nov 2017. ISSN 1476-4687. doi: 10.1038/nature24636. URL https://doi.org/10.1038/nature24636.

      Simon Musall, Xiaonan R. Sun, Hemanth Mohan, Xu An, Steven Gluf, Shu-Jing Li, Rhonda Drewes, Emma Cravo, Irene Lenzi, Chaoqun Yin, Björn M. Kampa, and Anne K. Churchland. Pyramidal cell types drive functionally distinct cortical activity patterns during decision-making. Nature Neuroscience, 26(3):495–505, Mar 2023. ISSN 1546-1726. doi: 10.1038/s41593-022-01245-9. URL https://doi.org/10.1038/s41593-022-01245-9.

      Ivana Orsolic, Maxime Rio, Thomas D Mrsic-Flogel, and Petr Znamenskiy. Mesoscale cortical dynamics reflect the interaction of sensory evidence and temporal expectation during perceptual decision-making. Neuron, 109(11):1861–1875.e10, April 2021.

      Hyeong-Dong Park, Stéphanie Correia, Antoine Ducorps, and Catherine Tallon-Baudry. Spontaneous fluctuations in neural responses to heartbeats predict visual detection. Nature Neuroscience, 17(4):612–618, Apr 2014. ISSN 1546-1726. doi: 10.1038/nn.3671. URL https://doi.org/10.1038/nn.3671.

      Lorenzo Posani, Shuqi Wang, Samuel Muscinelli, Liam Paninski, and Stefano Fusi. Rarely categorical, always high-dimensional: how the neural code changes along the cortical hierarchy. bioRxiv, 2024. doi: 10.1101/2024.11.15.623878. URL https://www.biorxiv.org/content/early/2024/12/09/2024.11.15.623878.

      Nicholas A. Steinmetz, Christina Buetfering, Jerome Lecoq, Christian R. Lee, Andrew J. Peters, Elina A. K. Jacobs, Philip Coen, Douglas R. Ollerenshaw, Matthew T. Valley, Saskia E. J. de Vries, Marina Garrett, Jun Zhuang, Peter A. Groblewski, Sahar Manavi, Jesse Miles, Casey White, Eric Lee, Fiona Griffin, Joshua D. Larkin, Kate Roll, Sissy Cross, Thuyanh V. Nguyen, Rachael Larsen, Julie Pendergraft, Tanya Daigle, Bosiljka Tasic, Carol L. Thompson, Jack Waters, Shawn Olsen, David J. Margolis, Hongkui Zeng, Michael Hausser, Matteo Carandini, and Kenneth D. Harris. Aberrant cortical activity in multiple gcamp6-expressing transgenic mouse lines. eNeuro, 4(5), 2017. doi: 10.1523/ENEURO.0207-17.2017. URL https://www.eneuro.org/content/4/5/ENEURO.0207-17.2017.

      Nicholas A. Steinmetz, Peter Zatka-Haas, Matteo Carandini, and Kenneth D. Harris. Distributed coding of choice, action and engagement across the mouse brain. Nature, 576(7786):266–273, Dec 2019. ISSN 1476-4687. doi: 10.1038/s41586-019-1787-x. URL https://doi.org/10.1038/s41586-019-1787-x.

      Nicholas A. Steinmetz, Cagatay Aydin, Anna Lebedeva, Michael Okun, Marius Pachitariu, Marius Bauza, Maxime Beau, Jai Bhagat, Claudia Böhm, Martijn Broux, Susu Chen, Jennifer Colonell, Richard J. Gardner, Bill Karsh, Fabian Kloosterman, Dimitar Kostadinov, Carolina Mora-Lopez, John O’Callaghan, Junchol Park, Jan Putzeys, Britton Sauerbrei, Rik J. J. van Daal, Abraham Z. Vollan, Shiwei Wang, Marleen Welkenhuysen, Zhiwen Ye, Joshua T. Dudman, Barundeb Dutta, Adam W. Hantman, Kenneth D. Harris, Albert K. Lee, Edvard I. Moser, John O’Keefe, Alfonso Renart, Karel Svoboda, Michael Häusser, Sebastian Haesler, Matteo Carandini, and Timothy D. Harris. Neuropixels 2.0: A miniaturized high-density probe for stable, long-term brain recordings. Science, 372(6539):eabf4588, 2021. doi: 10.1126/science.abf4588. URL https://www.science.org/doi/abs/10.1126/science.abf4588.

      Charlie Windolf, Han Yu, Angelique C. Paulk, Domokos Meszéna, William Muñoz, Julien Boussard, Richard Hardstone, Irene Caprara, Mohsen Jamali, Yoav Kfir, Duo Xu, Jason E. Chung, Kristin K. Sellers, Zhiwen Ye, Jordan Shaker, Anna Lebedeva, Manu Raghavan, Eric Trautmann, Max Melin, João Couto, Samuel Garcia, Brian Coughlin, Csaba Horváth, Richárd Fiáth, István Ulbert, J. Anthony Movshon, Michael N. Shadlen, Mark M. Churchland, Anne K. Churchland, Nicholas A. Steinmetz, Edward F. Chang, Jeffrey S. Schweitzer, Ziv M. Williams, Sydney S. Cash, Liam Paninski, and Erdem Varol. Dredge: robust motion correction for high-density extracellular recordings across species. bioRxiv, 2023. doi: 10.1101/2023.10.24.563768. URL https://www.biorxiv.org/content/early/2023/10/29/2023.10.24.563768.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      This study extends the previous interesting work of this group to address the potentially differential control of movement and posture. Their earlier work explored a broad range of data to make the case for a downstream neural integrator hypothesized to convert descending velocity movement commands into postural holding commands. Included in that data were observations from people with hemiparesis due to stroke. The current study uses similar data but pushes into a different, but closely related direction, suggesting that these data may address the independence of these two fundamental components of motor control. I find the logic laid out in the second sentence of the abstract ("The paretic arm after stroke is notable for abnormalities both at rest and during movement, thus it provides an opportunity to address the relationships between control of reaching, stopping, and stabilizing") less than compelling, but the study does make some interesting observations. Foremost among them, is the relation between the resting force postural bias and the effect of force perturbations during the target hold periods, but not during movement. While this interesting observation is consistent with the central mechanism the authors suggest, it seems hard to me to rule out other mechanisms, including peripheral ones. 

      Response 1.1. Thank you for your comments, which we address in detail below and in our response to Recommendations to the authors (see pp. 15-19 of this letter). We would first like to clarify the motivation behind our use of a stroke population to understand the interactions between the control of reaching and holding. We agree that this idea can be laid out in a more compelling way.

      The fact that stroke patients usually display issues with their control of both reaching and holding allows for within-individual comparisons of those two modes of control. Further, the magnitude of abnormalities is relatively large, making it easier to measure, compare and investigate effects. Importantly, these two modes of control can be differentially affected after stroke (also pointed out by Reviewer 2, point 4 in Comments to the Authors). Finally, this kind of work – examining interactions between positive signs of stroke (such as abnormal posture or synergy) vs. negative signs (such as loss of motor control) – needs to be done in humans, as positive signs are relatively absent even in primates (Tower, 1940).

      We have changed our abstract (changes shown below in red), and our intro (expanding the second paragraph, lines 75-76), to lay out our motivation more clearly.

      From the abstract:

      “The paretic arm after stroke exhibits different abnormalities during rest vs. movement, providing an opportunity to ask whether control of these behaviors is independently affected in stroke.”

      On the other hand, the relation between force bias and the well-recognized flexor synergy seems rather self-evident, and I don't see that these results add much to that story.

      Response 1.2. While it seems natural that these biases would be the resting expression of abnormal flexor synergies (given their directionality towards the body, as shown in Figures 2-3, and the other similarities we demonstrate in Figure 8), we do not believe it is self-evident. These biases are measured at rest, with the patient passively moved and held still, whereas abnormal synergies emerge when the patient actively tries to move. The lack of relationship we find between these resting force biases and active movement underlines that the relation between force bias and flexor synergy should not be taken as self-evident, making it worthwhile to examine it (as we motivate in lines 589-596 and show in Figure 8).

      The paradox here is that, in spite of a relationship between force bias and flexor synergy (itself manifesting during attempted movement), there seems to be no relationship between force bias and direct measures of active movement (Figures 5,6). This is the paradox that inspired our conceptual model (Figure 9) and inspires us to further investigate the conditions under which these two systems are intermingled or kept separate. We thus find it to be a helpful element in the story.

      I am also struck by what seems to be a contradiction between the conclusions of the current and former studies: "These findings in stroke suggest that moving and holding still are functionally separable modes of control" and "the commands that hold the arm and finger at a target location depend on the mathematical integration of the commands that moved the limb to that location." The former study is mentioned here only in passing, in a single phrase in the discussion, with no consideration of the relation between the two studies. This is odd and should be addressed. 

      Response 1.3. While these two sets of findings are not contradictory, we understand how they can appear as such without providing context. We now discuss the relationship between our present study and the previous one more directly (lines 66-70 and 663-669 of the revised manuscript).

      The previous study examined how the control of movement informs the control of holding after the movement was over; the current study examines whether abnormalities in holding, measured at rest after the arm is passively moved to the rest position, affect active movement. There are thus two important distinctions:

      First, directionality of potential effects: here we examine the effect of (abnormalities in) holding control upon movement, but the 2020 study (Albert et al., 2020) examines the effects of movement upon holding control. Stroke patient data in the 2020 study showed that, under CST damage, while the reach controller is disrupted, the hold controller can continue to integrate the malformed reach commands faithfully. In line with this, we proposed a model where the postural controller system sits downstream of the moving controller (Figure 7G in the 2020 paper). We thus did not claim, in 2020, that integration of movement commands is the only way to determine posture control, as we stated explicitly back then, e.g. (emphasis ours):

      “Equations (1) and (2) describe how the integration of move activity may relate to changes in hold commands, but does not specify the hold command at the target.”

      In short, finding no effect of holding abnormalities upon movement (present finding) does not mean there is no potential effect of movement upon holding (2020 finding). This is something we had alluded to in the Discussion but not clarified, which we do now (see edits at the end of our response to this point).

      Second, active vs. passive movement: here, we measure holding control at rest (Experiment 1). The 2020 study shows that endpoint forces reflect the integration of learned dynamics exerted during active movement that led to the endpoint position. However, in Experiment 1, there is no active reaching to integrate, as the robot passively moves the arm to the held position. Thus, resting postural forces measured in Experiment 1 could not reflect the integration of reach commands that led to each rest position.  

      Thus, the two sets of findings are not contradictory. Taking our current and 2020 findings together suggests that active holding control would reflect both the integration of movement control that led to assuming the held position and the force biases measured at rest.

      Hence our decision to describe these two systems as functionally separable: while these systems can interact, the effects of post-stroke malfunctions in each can be independent depending on the function and conditions at hand. This does not make this a limited finding: being able to dissociate post-stroke impairment based on each of these two modes of control may inform rehabilitation, and also importantly, understanding the conditions in which these two modes of control become separable can substantially advance our understanding of both how different stroke signs interact with each other and how motor control is assembled in the healthy motor system. Figure 9 illustrates our conceptual model behind this and may serve as a blueprint to further dissect these circuits in the future.

      We discuss these issues briefly in lines 663-669 in our Discussion section, reproduced below for convenience:

      “It should be noted, however, that having distinct neural circuits for reaching and holding does not rule out interactions between them. For example, we recently demonstrated how arm holding control reflects the integration of motor commands driving the preceding active movement that led to the hold position, in both healthy participants and patients with hemiparesis (Albert et al., 2020). However, in that paper, we did not claim that this integration is the only source of holding control. Indeed, in Experiment 1 of the current study, we used passive movement to bring the arm to each probed position, which means that the postural biases could not be the result of integration of motor commands.” 

      And, we have adjusted our Introduction to provide pertinent context regarding our 2020 work (first paragraph, lines 66-70 of the updated manuscript).

      A minor wording concern I had is that the term "holding still" is frequently hard to parse. A couple of examples: "These findings in stroke suggest that moving and holding still are functionally separable modes of control." This example is easily read, "moving and holding [continue to be] functionally separable". Another: "...active reaching and holding still in the same workspace, " could be "...active reaching and holding [are] still in the same workspace." Simply "holding", "posture" or "posture maintenance" would all be better options.

      Response 1.4. Thank you for your suggestion. Following your comment, we have abbreviated this term to simply “holding”, both in the title and throughout the text.

      Reviewer #2 (Public Review):

      Summary: 

      Here the authors address the idea that postural and movement control are differentially impacted with stroke. Specifically, they examined whether resting postural forces influenced several metrics of sensorimotor control (e.g., initial reach angle, maximum lateral hand deviation following a perturbation, etc.) during movement or posture. The authors found that resting postural forces influenced control only following the posture perturbation for the paretic arm of stroke patients, but not during movement. They also found that resting postural forces were greater when the arm was unsupported, which correlated with abnormal synergies (as assessed by the Fugl-Meyer). The authors suggest that these findings can be explained by the idea that the neural circuitry associated with posture is relatively more impacted by stroke than the neural circuitry associated with movement. They also propose a conceptual model that differentially weights the reticulospinal tract (RST) and corticospinal tract (CST) to explain greater relative impairments with posture control relative to movement control, due to abnormal synergies, in those with stroke.

      Strengths: 

      The strength of the paper is that they clearly demonstrate with the posture task (i.e., active holding against a load) that the resting postural forces influence subsequent control (i.e., the path to stabilize, time to stabilize, max. deviation) following a sudden perturbation (i.e., sudden removal of the load). Further, they can explain their findings with a conceptual model, which is depicted in Figure 9.

      Weaknesses: 

      Current weaknesses and potential concerns relate to i) not displaying or reporting the results of healthy controls and non-paretic arm in Experiment 2 and ii) large differences in force perturbation waveforms between movement (sudden onset) and posture (sudden release), which could potentially influence the results and/or interpretation.

      Response 2.0. Thank you for your assessment, and for pointing out ways to improve our paper. We address the weaknesses and potential concerns in detail below.

      Larger concerns

      (1) Additional analyses to further support the interpretation. In Experiment 1 the authors present the results for the paretic arm, non-paretic arm, and controls. However, in Experiment 2 for several key analyses, they only report summary statistics for the paretic arm (Figure 5D-I; Figure 6D-E; Figure 7F). It is understood that the controls have much smaller resting postural force biases, but they are still present (Figure 3B). It would strengthen the position of the paper to show that controls and the non-paretic arm are not influenced by resting postural force biases during movement and particularly during posture, while acknowledging the caveat that the resting positional forces are smaller in these groups. It is recommended that the authors report and display the results shown in Figure 5D-I; Figure 6D-E; Figure 7F for the controls and non-paretic arm. If these results are all null, the authors could alternatively place these results in an additional supplementary. 

      Response 2.1a. Thank you for your recommendations. We agree both on the value of these analyses and the caveat associated with them: these resting postural force biases are substantially smaller for the non-paretic and control data (for example, the magnitude of resting biases in the supported condition is 2.8±0.4N for the paretic data, but only 1.8±0.4N and 1.3±0.2N for the non-paretic and control data, respectively; the difference is even greater in the unsupported condition, though this is not the one being compared to Experiment 2).

      We now conduct a comprehensive series of supplementary analyses, including the examination of non-paretic and control data for all three components of Experiment 2 (unperturbed reaches; pulse perturbations; and active holding control). These are mentioned in the Results (lines 422-424, 512-513, and 574-574 of the revised manuscript) and illustrated in the supplementary materials: Supplementary Figures S5-1, S6-1, and S7-1 contain the main analyses (comparisons of instances with the most extreme resting biases for each individual) for the unperturbed reach analysis, pulse perturbation analysis, and active holding control analysis, respectively.

      We find that non-paretic and control data do not display effects of resting biases upon unperturbed reaching control (Figure S5-1) or control against a pulse perturbation early during movement (Figure S6-1) – as is the case with the paretic data. Unlike the paretic data, non-paretic and control data do not display evidence of an influence of their resting force biases upon active holding control either (Figure S7-1). For the non-paretic data, however, these influences are nominally in the same direction as in the paretic data. Given that resting biases are substantially weaker for the non-paretic case, it is possible that a similar relationship exists but requires increased statistical power to discern. Moreover, it is possible that the effect of resting biases is non-linear, with small biases effectively kept in check so that their impact upon active holding control is even less than a linearly scaled version of the impact of the stronger, paretic-side biases. This can be the subject of future work.

      Please also note that, following your recommendation (Recommendations to the Authors, point 2.1), we have conducted secondary analyses which estimate sensitivity to resting bias using all datapoints, validating our main analyses; these analyses were also performed for control and non-paretic data, with similar results (Response 2.A.1).

      Further, the results could be further boosted by reporting/displaying additional analyses. In Figure 6D the authors performed a correlation analysis. Can they also display the same analysis for initial deviation and endpoint deviation for the data shown in Figure 5D-F & 5G-I, as well for 7F for the path to stabilization, time to stabilization, and max deviation? This will also create consistency in the analyses performed for each dependent variable across the paper.

      Response 2.1b. Here, we set out to test whether resting biases affect movement. It is best to do this using a within-individual comparison design, rather than using across-individual correlations: while correlation analyses can in general be informative, they obscure within-individual effects, which are the main comparisons of interest in our study. Consider a participant with strong resting bias towards one direction, tested on opposing perturbations; averaging these responses for each individual would mostly cancel out any effects of resting biases. Even if we were to align responses to the direction of the perturbation before averaging, the power of correlation analyses may be diluted by inter-individual differences in other factors, such as overall stiffness.

      Thus, our analysis design was instead focused on examining the differential effects of resting posture biases within each individual’s data. We compared the most extreme opposing/aligned or clockwise/counter-clockwise instances within each individual, specifically to assess these differential effects. In our revised version, we have further reinforced these analyses to include all data rather than the most extreme instances (see response 2.A.1.a to the Reviewer’s recommendation to the authors) where we performed correlations of within-individual resting posture vs. the corresponding dependent variables and compared the resulting slopes. 

      The across-individual correlation analyses add little to that for the reasons we outlined above. At the same time, it is possible they can be helpful in e.g. illustrating across-individual variability. We thus now include across-individual correlation analyses for all dependent variables, but, given their limited value, only in the supplementary material. This also means that, for consistency, we moved the correlation analysis in Figure 6 to the corresponding supplementary figure as well (Figure S6-3).

      In addition, following the Reviewer’s comment about consistency in the analyses performed for each dependent variable across the paper, we added within-individual comparisons for settling time following the pulse perturbations (Figure 6D, right).
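
      The logic of the within-individual slope analysis described in this response can be illustrated with a minimal sketch. It is not our analysis code: the participant labels, the toy values, and the use of an ordinary least-squares slope are assumptions made for illustration only.

```python
from statistics import mean


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def within_individual_slopes(data):
    """Sensitivity of the dependent variable to resting bias, per participant."""
    return {pid: ols_slope(bias, dv) for pid, (bias, dv) in data.items()}


# Hypothetical data: for each participant, the resting bias (N) at each probed
# position and the corresponding dependent variable (e.g. a deviation in cm).
toy = {
    "p1": ([-2, 0, 2], [-1.0, 0.0, 1.0]),
    "p2": ([-1, 0, 1], [4.0, 5.0, 6.0]),
}

slopes = within_individual_slopes(toy)
group_mean_slope = mean(slopes.values())

# Per-participant averages, as used by an across-individual correlation.
mean_bias = {pid: mean(bias) for pid, (bias, _) in toy.items()}
```

      In this toy case both participants have a mean bias of zero, so an across-individual correlation on averaged data carries no information about sensitivity to bias, whereas the within-individual slopes are clearly positive.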

      (2) Inconsistency in perturbations that would differentially impact muscle and limb states during movement and posture. It is well known that differences in muscle state (activation / preloaded, muscle fiber length and velocity) and limb state (position and velocity) impact sensorimotor control (Pruszynski, J. A., & Scott, S. H. (2012). Experimental brain research, 218, 341-359.). Of course, it is appreciated that it is not possible to completely control all states when comparing movement and posture (i.e., muscle and limb velocity). However, using different perturbations differentially impacts muscle and limb states. Within this paper, the authors used very different force waveforms for movement perturbations (i.e., 12 N peak, bell-shaped, 0.7ms duration -> sudden force onset to push the limb; Figure 6A) and posture perturbations (i.e., 6N, 2s ramp up -> 3s hold -> sudden force release that resulted in limb movement; Figure 4) that would differentially impact muscle (and limb) states. Preloaded muscle (as in the posture perturbation) has a very different response compared to muscle that has little preload (as in the movement perturbations, where muscles that would resist a sudden lateral perturbation would likely be less activated since they are not contributing to the forward movement). Would the results hold if the same perturbation had been used for both posture and movement (e.g., 12 N pulse for both experiments)? It is recommended that the authors comment and discuss in the paper why they chose different perturbations and how that might impact the results. 

      Response 2.2a. We agree that it can be impossible to completely control all states when comparing movement and posture. We would also like to stress that these perturbations were not designed so that responses are directly compared to each other (though of course there is an indirect comparison in the sense that we show influence of biases in one type of perturbation but not the other). Instead, Experiment 2 tried to implement a probe optimized for each motor control modality (moving vs. holding). However, the Reviewer has a point that the potential impact of differences between the perturbations is important to discuss in the paper.

      The Reviewer points out two potentially interesting differences between the two perturbations. First, the magnitude (6N for the posture perturbation vs. 12N for the pulse perturbation); second, the presence of background load in the posture perturbation, in contrast to the pulse perturbation.

      For the movement perturbation, we used a 12-N, 70-ms pulse. This perturbation and scaled versions have been tested before in both control and patient populations (Smith et al., 2000; Fine and Thoroughman, 2006). For the holding perturbation, we used a background load to ensure that active holding control is engaged, and the duration of the probe (holding for about 5s) made using a stronger perturbation impractical – maintaining a background load at, say, 12N for that long could lead to increased fatigue.

      The question raised by the Reviewer, whether the findings would be the same if the same, 12-N pulse were used to probe both moving and holding control, is interesting to investigate. We would expect the same qualitative findings (i.e. there would still be a connection between resting posture and active holding control when the latter were probed with a 12N pulse). Recent work provides more specific insight into what to expect. Our posture perturbation task is similar to the Unload Task in (Lowrey et al., 2019), whereby a background torque is released, whereas our pulse perturbation is more similar to their Load Task, whereby a torque is imposed against no background load (though it is a step perturbation rather than a pulse). Lowrey et al., 2019 find that their Unload task is harder than the Load task, with 2x the fraction of patient trials classified as failed (with failure defined as task performance being outside of the 95% confidence interval for controls), though there are still clear effects for the Load task. 

      This suggests that the potential effects of using a pulse-like perturbation to probe posture control would likely be weaker in magnitude, all other things being equal. At the same time, however, the Load and Unload tasks in Lowrey et al., 2019 were perturbations of the same magnitude; it is thus also likely that the reduction in effect would be mitigated, or reversed, by the fact that we would be using a 12N instead of a 6N perturbation.

      A relevant consequence of the Lowrey et al., 2019 findings is that the Unload paradigm is superior in its ability to detect impairment in static, posture perturbations, and thus provides a better signal to detect potential relationships with resting posture biases. This is not surprising, as a background load further engages the control of active holding, which is what we were trying to probe in the first place.

      But then why not use the same paradigm (preloading and release) for movement? There are two main reasons. First, requiring a background load throughout the experiment is unfeasible due to fatigue. Second, for the holding perturbation, we wanted to ensure that the postural control system is meaningfully engaged when the perturbation hits, hence we picked the background load. Were we to impose the same during moving – i.e. impose a lateral background load on the movement – we could be engaging posture control on top of movement control. This preloading would reduce the degree to which the pulse probe isolates movement control, and lead to intrusion of the posture control system in the movement task by design. This relates to what the Reviewer proposes in the comment below: preloading may result in postural biases, i.e. engage posture control; see below, where we argue this interpretation is within the scope of our conceptual model rather than a counter to it.

      We now explain the rationale behind our perturbation design in the Methods section (lines 211-220).
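
      For illustration, the two perturbation time courses discussed above can be sketched as follows. The magnitudes and durations are taken from this exchange (12 N, 70 ms pulse; 6 N load with a 2 s ramp, a 3 s hold, then sudden release); the raised-cosine pulse shape, the linear ramp, and all function names are assumptions, not the robot control code.

```python
import math


def pulse_force(t, peak_n=12.0, duration_s=0.070):
    """Bell-shaped force pulse (N) at time t (s); raised-cosine shape is assumed."""
    if t < 0 or t > duration_s:
        return 0.0
    return peak_n * 0.5 * (1.0 - math.cos(2.0 * math.pi * t / duration_s))


def hold_release_force(t, load_n=6.0, ramp_s=2.0, hold_s=3.0):
    """Background load (N) at time t (s): linear ramp (assumed), hold, sudden release."""
    if t < 0:
        return 0.0
    if t < ramp_s:
        return load_n * t / ramp_s
    if t < ramp_s + hold_s:
        return load_n
    return 0.0  # load removed abruptly at t = ramp_s + hold_s


# Sample both waveforms at a few time points.
pulse_peak = pulse_force(0.035)          # midpoint of the pulse
mid_ramp = hold_release_force(1.0)       # halfway up the ramp
during_hold = hold_release_force(3.0)    # load fully applied
after_release = hold_release_force(5.0)  # load removed
```

      The sketch makes the contrast explicit: the movement probe is a brief force onset from zero load, whereas the posture probe is the abrupt removal of a sustained load.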

      Relatedly, an alternative interpretation of the results is that preloading muscle for stroke patients, whether by supporting the weight of one's arm (experiment 1) or statically resisting a load prior to force release (experiment 2), leads to a greater postural force bias that can subsequently influence control. It is recommended that the authors comment on this. 

      Response 2.2b. We find this interpretation valid, but we do not see how it meaningfully differs from the framework we propose. We already state that the RST may be tailored for both posture/holding control and the production of large forces (which would include muscle preloading):

      “Thus, the accumulated evidence suggests that the RST could control posture and large force production in the upper limb.” (lines 698-699 in the current version)

      “the RST, in contrast, is weighted more towards slower postural control and generation of large isometric forces” (lines 724-726 in the current version)

      And, we discuss other conditions where the RST is involved in large force production, such as power grip, and how these interact with the role of the RST in posture/holding control (lines 758-768 in the current version).

      To better explain our model, we now provide the two examples mentioned by the reviewer along with our description of the proposed role for the RST (lines 726-727):

      “…the RST, in contrast, is weighted more towards slower postural control and generation of large isometric forces (such as vertical forces for arm support, or horizontal forces for holding the arm still against a background load like in our posture/release perturbation trials).”

      We note, however, that we find resting posture abnormalities even in the presence of arm support, suggesting the involvement of the RST in holding control even when the forces involved (and the need to preload the muscle) are small.

      Reviewer #3 (Public Review): 

      The authors attempt to dissociate differences in resting vs active vs perturbed movement biases in people with motor deficits resulting from stroke. The analysis of movement utilizes techniques that are similar to previous motor control studies in both humans and non-human primates, to assess impairments related to sensorimotor injuries. In this regard, the authors provide additional support to the extensive literature describing movement abnormalities in patients with hemiparesis both at rest and during active movement. The authors describe their intention to separate out the contribution of holding still at a position vs active movement as a demonstration that these two aspects of motor control are controlled by two separate control regimes.

      Strengths: 

      (1) The authors utilize a device that is the same or similar to devices previously used to investigate motor control of movement in normal and impaired conditions in humans and non-human primates. This allows comparisons to existing motor control studies. 

      (2) Experiment 1 demonstrates resting flexion biases both in supported and unsupported forelimb conditions. These biases show a correlated relationship with FM-UE scores, suggesting that the degree of motor impairment and the degree of resting bias are related.

      (3) The stroke patient participant population had a wide range of both levels of impairment and time since stroke, including both sub-acute and chronic cases allowing the results to be compared across impairment levels.

      The authors describe several results from their study: 1. Postural biases were systematically toward the body (flexion) and increased with distance from the body (when the arm was more extended) and were stronger when the arm was unsupported. 2. These postural biases were correlated with FM-UE score. 3. They found no evidence of postural biases impacting movement, even when that movement was perturbed. 4. When holding a position at the end of a movement, if the position was perturbed opposite of the direction of bias, movement back to the target was improved compared to the perturbation in the direction of bias. Taken together, the authors suggest that there are at least two separate motor controls for tasks at rest versus with motion. Further, the authors propose that these results indicate that there is an imbalance between cortical control of movement (through the corticospinal tracts) and postural control (through the reticulospinal tract).

      Response 3.1. Thank you for pointing out some of the strengths of our work and summarizing our findings. A minor clarification we would like to make, related to (3), is that, while our study did enroll two patients towards the end of the subacute stage (2-3 months), the rest of the population were at the chronic stage, at one year and beyond. We thus find it very unlikely that time after stroke was the primary driver of differences in impairment in the population we studied.

      There are several weaknesses related to the interpretation of the results:

      In Experiment 1, the participants are instructed to keep their limbs in a passive position after being moved. The authors show that, in the impaired limb, these resting biases are significantly higher when the limb is unsupported and increase when the arm is moved to a more extended position.

      When supported by the air sled, the arm is in a purely passive position, not requiring the same antigravity response so will have less RST but also less CST involvement. While the unsupported task invokes more involvement of the reticulospinal tract (RST), it likely also has significantly higher CST involvement due to the increased difficulty and novelty of the task.

      If there were an imbalance in CST regulating RST as proposed by the authors, the bias should be higher in the supported condition as there should be relatively less CST activation/involvement/modulation leading to less moderating input onto the RST and introducing postural biases. In the unsupported condition, there is likely more CST involvement, potentially leading to an increased modulatory effect on RST. If the proportion of CST involvement significantly outweighs the RST activation in the unsupported task, then it isn't obvious that there is a clear differentiation of motor control. As the degree of resting force bias and FM-UE score are correlated, an argument could be made that they are both measuring the impairment of the CST unrelated to any RST output. If it is purely the balance of CST integrity compared to RST, then the degree of bias should have been the same in both conditions. In this idea of controller vs modulator, it is unclear when this switch occurs or how to weigh individual contributions of CST vs. extrapyramidal tracts. Further, it isn't clear why less modulation on the RST would lead only to abnormal flexion.

      Response 3.2. Our model posits two mechanisms by which CST impairment would lead to increased RST involvement. The first – which is the one discussed by the Reviewer here - is a direct one, whereby weaker modulation of the RST by the CST leads to increased RST involvement. The second is an indirect one, whereby the incapacity of CST to drive sufficient motor output to deal with tasks eventually leads to increased RST drive.

      The reviewer suggests it is likely that the unsupported task demands increased activation through both the CST and the RST. If that were the case, however, it would exaggerate the effects of CST/RST imbalance after stroke compared to healthy motor control: if task conditions (lack of support) required higher CST involvement, then CST damage would have an even larger effect. In turn, this would lead to even higher RST involvement and further diminish the ability of the CST to moderate the RST. Thus, RST-driven biases would be higher in the unsupported condition.

      And, given that the CST itself is damaged and has to deal with an even greater RST activation, we would not expect that the proportion of CST involvement would outweigh RST activation, but the opposite. In fact, a series of relatively recent findings suggest just this. For example,

      • Zaaimi et al. (2012) showed that unilateral CST lesions in monkeys lead to significant increases in the excitability of the contralesional RST. Interestingly, this effect was present in flexors but not extensors, potentially explaining why less modulation and/or overactivation of the RST would primarily lead to abnormal flexion.

      • McPherson et al. (further discussed in point 2.A.23, by Reviewer 2 – Recommendations to the Authors) showed that, after stroke, contralesional activity (which would include the ipsilateral RST) increases relative to ipsilesional activity (which would include the contralateral CST) (McPherson et al., 2018). The same study also provides evidence that FM-UE may primarily reflect RST-driven impairment. The ipsilateral(RST)/contralateral(CST) balance, expressed as a laterality index, correlated with FM-UE, with lower FM-UE for indices indicating higher RST involvement. (Interestingly, the slope of this relationship was steeper when the laterality of brain activation patterns was examined under tasks with less arm support, mirroring the steeper FM-UE vs resting bias slope when arm support is absent, as shown in our Figure 8).

      • Wilkins et al. (2020) found that providing less support (i.e. requiring increased shoulder abduction) increases ipsilateral activation (representing RST) relative to contralateral activation (representing CST).
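
      The ipsilateral/contralateral balance referred to in these studies is typically summarized as a laterality index. As a minimal sketch, assuming the common convention LI = (C - I)/(C + I), where C and I are hypothetical non-negative contralateral and ipsilateral activation measures (the cited studies may define or scale the index differently):

```python
def laterality_index(contralateral, ipsilateral):
    """Illustrative laterality index, assuming LI = (C - I) / (C + I).

    Both inputs are hypothetical, non-negative activation measures.
    Under this assumed convention, +1 means purely contralateral activity
    and -1 means purely ipsilateral activity.
    """
    if contralateral < 0 or ipsilateral < 0:
        raise ValueError("activation measures must be non-negative")
    total = contralateral + ipsilateral
    if total == 0:
        raise ValueError("at least one activation measure must be positive")
    return (contralateral - ipsilateral) / total


# Made-up numbers, for illustration only (not data from any study).
mostly_contralateral = laterality_index(3.0, 1.0)
mostly_ipsilateral = laterality_index(1.0, 3.0)
```

      Under this assumed sign convention, a shift toward ipsilateral activity moves the index toward -1, which in the framework discussed here would correspond to relatively greater RST involvement.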

      This resting bias could be explained by an imbalance in the activation of flexors vs extensors which follows the results that this bias is larger as the arm is extended further, and/or in a disconnect in sensory integration that is overcome during active movement. Neither would necessitate separate motor control for holding vs active movement. 

      Response 3.3. We do not think that either of these points necessarily argues against our model. First, the resting biases we observe are clearly pointed towards increased flexion, and can thus be seen as the outcome of an imbalance in the activation of flexors vs. extensors at rest. This imbalance between flexors/extensors can also be explained by the CST/RST imbalance posited by our conceptual model: in their study of CST lesions in the monkey, Zaaimi et al. (2012) found increased RST activation for flexors but not extensors, suggesting that RST over-involvement may specifically lead to flexor abnormalities. Second, overcoming a disconnect in sensory integration may be one way the motor system switches between separate controllers; how this switch happens is not examined by our conceptual model.

      In Experiment 2, the participants are actively moving to and holding at targets for all trials while being supported by the air sled. Even with the support, the paretic participants all showed start- and endpoint force biases around the movement despite not showing systematic deviations in force direction during active movement start or stop. There could be several factors that limit systematic deviations in force direction. The most obvious is that the measured biases are significantly higher when the limb is unsupported and by testing with a supported limb the authors are artificially limiting any effect of the bias.

      Response 3.4. We do expect, in line with what the reviewer suggests, that any potential effects would be stronger in the unsupported condition. We chose to test active motor control with arm support because running Experiment 2 without arm support would pose challenges, particularly with our most impaired patients, given the duration of Experiment 2 (~2 hours, about 1 hour with each arm) and the expected fatigue that would ensue.

      However, a key characteristic of our comparisons is that we are comparing Experiment 2 active control data under arm support, against Experiment 1 resting bias data also under arm support. While Experiment 1 measured biases without arm support as well, these are not used for this comparison. And, while resting biases are weaker with arm support, they are still clear and significant; yet they do not lead to detectable changes in active movement.

      At the same time, we do not rule out that, if we were to repeat Experiment 2 without arm support, we could find some systematic deviation in the direction of resting bias in movement control. Our conceptual model, in fact, suggests that this may be the case, as we described in lines 618-620 of our original manuscript. The idea here is that, when arm support is not provided, the increased strength requirements lead to increased drive through the RST, to the point that posture control (and its abnormalities) spills into movement control (Figure 9). We now better clarify this position in our Discussion (lines 744-750):

      “The interesting implication of this conceptual model is that synergies are in fact postural abnormalities that spill over into active movement when the CST can no longer modulate the increased RST activation that occurs when weight support is removed (i.e. resting biases may influence active reaching in absence of weight support). Supporting this idea, a study found increased ipsilateral activity (which primarily represents activation via the descending ipsilateral RST (Zaaimi et al., 2012)) when the paretic arm had reduced support compared to full support (McPherson et al., 2018).”

      It is also possible that significant adaptation or plasticity with the CST or rubrospinal tracts could give rise to motor output that already accounts for any intrinsic resting bias.  

      Response 3.5. This kind of adaptation – regardless of the tracts potentially involved – is an issue we examined in our experiment. As we talk about in our Results (lines 458-460 in the updated manuscript), with most of our patient population in the chronic stage, it could be likely that their motor system adapted to those biases to the point that movement planning took them into account, thereby limiting their effect. This motivated us to examine responses to unpredictable perturbations during movement (Figure 6) where we still find lack of an obvious effect of resting biases upon reaching control. We thus believe that our findings are not explained by this kind of adaptation, though we agree it would be of great interest for future work to compare resting biases and reaching control in acute vs. chronic stroke populations to examine the degree to which stroke patients adapt to these biases as they recover.

      In any case, the results from the reaching phase of Experiment 2 do not definitively show that directional biases are not present during active reaching, just that the authors were unable to detect them with their design. The authors do acknowledge the limitations in this design (a 2D constrained task) in explaining motor impairment in 3D unconstrained tasks. 

      Response 3.6. It is, of course, an inherent limitation of a negative finding that it cannot be proven. What we show here is that there is no hint of intrusion of resting posture abnormalities upon active movement in spite of these resting posture abnormalities being substantial and clearly demonstrated even under arm support. To allow for the maximum bandwidth to detect any such effects, we specifically chose to compare the most extreme instances (resting bias-wise) for each individual, and yet we did not find any relationship between biases and active reaching.

      This suggests that, even if these biases could be in some form present during active movement, their effect would be minimal and thus limited in meaningfully explaining post-stroke impairment in active movement under arm support.

      Note that, as we already discuss, our conceptual model (Figure 9) suggests that the degree to which directional biases would be present in active reaching may be influenced by arm support (or the specific movements examined – hence our limitation in not examining 3D movement). Thus we do not claim that this independence is absolute. Examples include the last line of the passage quoted right above, and the summary statement of our Discussion quoted below (lines 639-641):

      “…which raises the possibility that the observed dissociation of movement and posture control for planar weight-supported movements may break down for unsupported 3D arm movements.”

      Finally, we now more explicitly acknowledge that abnormal resting biases may influence active movement in the absence of arm support (see Response 3.4).

      It would have been useful, in Experiment 2, to use FM-UE scores (and time from injury) as a factor to determine the relationship between movement and rest biases. Using a GLMM would have allowed a similar comparison to Experiment 1 of how impairment level is related to static perturbation responses. While not a surrogate for imaging tractography data showing a degree of CST involvement in stroke, FM-UE may serve as an appropriate proxy so that this perturbation at hold responses may be put into context relative to impairment.

      Response 3.7. Here the Reviewer suggests we use FM-UE scores as a proxy for CST integrity. We do not think this analysis would be particularly helpful in our case for a number of reasons:

      First, while FM-UE is a general measure of post-stroke impairment, it was designed to track – among other things – the emergence and resolution of abnormal synergies, a sign assumed to result from abnormally high RST outflow (McPherson et al., 2018; McPherson and Dewald, 2022). In line with this, the FM-UE scales with EMG-based measures of synergy abnormality (Bourbonnais et al., 1989). Impairments in dexterity, a sign associated with damage to the CST (Lawrence and Kuypers, 1968; Porter and Lemon, 1995; Duque et al., 2003), dissociate from synergy abnormalities when compared under arm support as we do here (Levin, 1996; Hadjiosif et al., 2022). This means that FM-UE would be a stronger proxy for RST activity and thus not a direct proxy for CST integrity, particularly when one wants to dissociate RST-specific vs. CST-specific abnormalities. In fact, as we discuss in Response 3.2 above, there are a number of studies supporting this idea: for example, Zaaimi et al. (2012) show that relative RST activation – the balance between ipsilateral excitability, primarily reflecting the RST, and contralateral excitability, primarily reflecting the CST – scales with FM-UE.

      Second, this kind of analysis would obscure within-individual effects, since FM-UE scores are, of course, assigned to each individual. This is the same issue as doing across-individual correlation analyses in general (see Response 2.1b). A strong resting force bias would have opposite effects on opposing perturbations; averaging across subjects would occlude these effects.

      Third, while FM-UE is a good measure of synergy abnormality, weakness alone could also give an abnormal FM-UE (Avni et al., 2024).

      The Reviewer also suggests we use time from injury for this analysis. Time from injury can indeed potentially be an important factor. However, this analysis would not be appropriate for our dataset, since the effective variation in recovery stage within our population is limited: our sample is essentially chronic (only two patients were examined within the subacute stage – at 2 and 3 months after stroke - with everybody else examined more than a year after stroke) with the “positive” elements of their phenotype (and FM-UE itself) essentially plateaued (Twitchell, 1951; Cortes et al., 2017). We thus would not expect to see any meaningful effects of time from injury within our population. It would be an excellent question for future work to investigate both resting biases and their relationship to reaching in acute/subacute patients, and examine whether the trajectory of resting biases (both emergence and abatement due to recovery) follows the one for abnormal synergies.

      It is not clear that even in the static perturbation trials that the hold (and subsequent move from perturbation) is being driven by reticulospinal projections. Given a task where ~20% of the trials are going to be perturbed, there is likely a significant amount of anticipatory or preparatory signaling from the CST. How does this balance with any proposed contribution that the RST may have with increased grip?

      Response 3.8. We included our response to this as part of Response 3.2. In brief, while we cannot rule out that these tasks may recruit increased CST signaling, this would tend to increase, rather than reduce, the effects of post-stroke impairment: the requirement for increased signaling from a CST that is damaged would magnify the effects of this damage, in turn leading to increased recruitment of other tracts, such as the RST.

      In general, the weakness of the interpretation of the results with respect to the CST/RST framework is that it is necessary to ascribe relative contributions of different tracts to different phases of movement and hold using limited or indirect measures. Barring any quantification of this data during these tasks, different investigators are likely to assess these contributions in different ways and proportions limiting the framework's utility.

      Response 3.9. We believe that our Responses 3.2-3.6 put our findings in fair perspective, and the edits undertaken based on the Reviewer’s comments have clarified our position as to how the dissociation between holding and moving control may break down. We do agree, however, that our framework would be strengthened by the use of direct measures of CST/RST connectivity in future research. We present our conceptual model as a comprehensive explanation of our findings and how they blend with current hypotheses regarding the role of these two tracts in motor control after stroke. As such, it provides a blueprint towards future research that more directly measures or modulates CST and RST involvement, using tools such as tractography or non-invasive brain stimulation.

      Recommendations for the authors:   

      Reviewer #1 (Recommendations For The Authors):

      L226 “…of this issue, we repeated the analysis of Figure 7F (a) by excluding these four patients…”.  Should this be three, based on the previous sentence? 

      Response 1.A.1. Thank you for pointing out this typo, which is now corrected. The analysis in question (Figure S1 in the original submission, now re-numbered as Figure S7-4) excluded the three patients mentioned in the previous sentence.

      L254 “…the hand was held in a more distal position. The postural force biases were strongest when…”  Could this be "extended" rather than distal? See my later comment about the inadequate description of targets.

      Response 1.A.2. The reviewer is correct that the arm will tend to be more extended in the distal targets. However, since these positions were defined in extrinsic coordinates, we think the terms distal/proximal are also appropriate. In either case, we now clarify these definitions in the text (see Response 1.A.3 below).

      L263 “…contained both distal and proximal targets, and, importantly, they were also the movement…”.  Distal/proximal targets were never described as part of the task. 

      Response 1.A.3. We improved our description by (i) changing the wording above to “represented positions both distal and proximal to the body,”, (ii) doing the same in our Methods (line 175) and (iii) indicating distal/proximal targets in Figure 3A (bottom right of panel A).

      L378 “…the pulse perturbation. We hypothesized that, should resting postural forces play a role, they…”  L379 “…would tend to reduce the effect of the pulse if they were in the opposite direction, and…”  Not really obvious why. A reduction in the displacement caused by a force pulse might be caused by different stiffness or viscosity, but not by a linear, time-invariant force bias. This situation is different from that of "moving the arm through a high-postural bias area vs. a low-postural bias area" where it would encounter time- (actually spatially) varying forces and varying amounts of displacement. Clarify the logic if this is a critical point.

      Response 1.A.4. We thank the Reviewer for highlighting this point of potential confusion. We now clarify that these postural bias forces are neuromuscular in origin (Kanade-Mehta et al., 2023), and likely result from an expression of abnormal synergy, at least under static conditions. In this case, we hypothesized that force pulses acting against the gradient of the postural bias field would act to stretch the already active muscles, which would lead to a further increase in postural resistance due to inherent length-tension properties of active muscle. By contrast, force pulses acting along the gradient of the postural bias field would act to shorten the same active muscles, which would lead to a reduction in postural resistance. The data did not support this in the case of force pulses imposed during movement. We note, however, that similar effects would affect responses to static perturbations as well, wherein we do find an effect of resting biases. We now better explain this reasoning (lines 479-482).

      L466-468 “resting postural force). In short, our perturbations revealed that resting flexor biases switched on after movement was over, providing evidence for separate control between moving and holding still.”

      I do not think the authors have presented clear evidence that forces, "switch on", implying the switch to a different controller which they posit. This could as easily be a nonlinear or time-varying property of a single controller (admittedly, the latter possibility overlaps broadly with their idea of distinct, interacting controllers). An example that the authors are certainly aware of is that of muscle "thixotropy" a purely peripheral mechanism due to the dynamics of crossbridge cycling that causes resting muscle to be stiffer than moving muscle, changing with a time constant of ~1-2 seconds. Neither this particular example nor changing levels of contraction (more likely during the unpredictable force perturbations) would be in the direction to explain the main observation here -- a point perhaps worth making, together with the stretch reflex comments. 

      Response 1.A.5. Thank you for this perspective. Indeed, it might be that “switching on” represents a shift along a nonlinear property of the same controller: in the extreme, if this nonlinearity is a step (on/off) function, this single controller would be functionally identical to two separate controllers. We thus cannot tell if these controllers are distinct in the strict sense. What we argue here is that, no matter the underlying controller architecture (two distinct controllers or two distinct modes of the same controller), the control of reaching vs. holding can be functionally separable even after stroke. In line with this idea, we used a more nuanced phrasing (e.g. “separable functional modes for moving vs. holding”) throughout our manuscript, and we have now edited out a mention of “separate controllers” to be consistent with this.
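
      The equivalence described above can be illustrated with a toy sketch. It assumes a hypothetical gating rule based on hand speed and hypothetical gains; none of the names or values below come from the manuscript. A single controller whose output is blended by a step (on/off) gate produces exactly the same commands as an explicit switch between a movement controller and a holding controller.

```python
# Toy illustration only: gains, threshold, and the speed-based gate are
# hypothetical assumptions, not the model or parameters of the study.

SPEED_THRESHOLD = 0.01  # assumed speed below which the hand counts as "still"


def move_command(error):
    """Hypothetical movement-mode command: proportional to position error."""
    return 5.0 * error


def hold_command(error, resting_bias):
    """Hypothetical holding-mode command: stiffer, plus a resting bias force."""
    return 20.0 * error + resting_bias


def step_gate(speed):
    """Step nonlinearity: 1.0 when (nearly) still, 0.0 when moving."""
    return 1.0 if abs(speed) < SPEED_THRESHOLD else 0.0


def single_controller(error, speed, resting_bias):
    """One controller whose two modes are blended by the step gate."""
    g = step_gate(speed)
    return (1.0 - g) * move_command(error) + g * hold_command(error, resting_bias)


def two_controllers(error, speed, resting_bias):
    """Two separate controllers with an explicit switch between them."""
    if abs(speed) < SPEED_THRESHOLD:
        return hold_command(error, resting_bias)
    return move_command(error)


# (error, speed, resting_bias) triples chosen arbitrarily for illustration.
samples = [(0.1, 0.5, 2.0), (0.1, 0.0, 2.0), (-0.3, 0.005, 1.5), (0.0, -0.2, 1.5)]
outputs_match = all(
    single_controller(e, s, b) == two_controllers(e, s, b) for e, s, b in samples
)
```

      Because the gate only ever takes the values 0 and 1, observing the output alone cannot distinguish the two architectures, which is why the response speaks of functionally separable modes rather than strictly distinct controllers. A smooth gate (for example, a sigmoid of speed) would instead blend the two modes near the threshold.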

      Moreover, thank you for pointing out the example of thixotropy, showing how peripheral mechanisms could interact with central control. As you point out, this effect would not explain the main observation here: in fact, if stiffness were substantially higher during rest or holding (instead of moving) that would reduce the impact of the static perturbation, making it harder to detect any effects of resting biases compared to the moving perturbation case.

      L480 “…during movement (Sukal et al., 2007). Yet, Experiment 2 found no relationship between resting…” L481 “…postural force biases and active movement control. To further investigate this apparent…”  The methods of the two studies seem fairly similar, but this question warrants a more careful comparison. How did the size of the two workspaces compare? What about the magnitude of the exerted forces? The movement condition in this study was done with the limb entirely supported. Under that condition, the Sukal study also found fairly small effects of the range of motion.

      Response 1.A.6. Sukal et al., 2007 did not directly measure exerted forces, but instead compared the active range of motion under different loading conditions. They used the extent of reach area to quantify the effect of abnormal synergies, with a more extended active range of motion signifying a reduced effect of abnormal synergies. As the Reviewer points out, Sukal et al. found fairly small effects of synergies upon the range of motion when arm support was provided (the reach area for the paretic side was found to be about 85% of the nonparetic side under full arm support, though the two were statistically significantly different; see Figure 5 of their paper). They found an increasing effect of synergies as arm support was reduced: on average, the reach area when participants had to fully support the arm was less than 50% of the reach area when full arm support was given (comparing the 0% vs. 100% active support conditions [i.e. 100% vs. 0% external support] in their Figure 5). As we discuss in our paper, this effect of arm support upon synergy mirrors the one we found for resting postures.

      To compare our workspace with the one in Sukal et al., we overlaid our workspace (the array of positions for which the posture biases were measured, for a typical participant from Experiment 1) on the one they used as shown in their Figure 4. Note that their figure only shows an example participant, and thus our ability to compare is limited by the fact that each participant can vary widely in terms of their impairment, and assumptions had to be made to prepare this overlay (e.g. that (0,0) represents the position of the right acromion point). 

      For this example, and our assumptions, our workspace was smaller, with the main points of interest (red dots, the movement start/end points used for Experiment 2) within the Sukal et al. workspace. That our workspace is smaller is not surprising, given that the area in Sukal et al. represents the limit of what can be reached, and thus motor control *has* to be examined in a subset of that area.

      Author response image 1.

      Comparing the two study methodologies, however, suggests an advantage of measuring resting biases in terms of sensitivity and granularity: first, resting biases can be clearly detected even under arm support (something we point out in our Discussion, lines 715-717); second, they can measure abnormalities at any point in the workspace, rather than a binary within/without the reach area. The resting bias approach may thus be a more potent tool to probe the shared bias/synergy mechanisms we propose here.

      Figure 2 

      Needs color code. 

      The red dots could be bigger.

      Response 1.A.7. We have increased the size of the red dots and added a color code to explain the levels illustrated by the contours. We also expanded our caption to better explain this illustration.

      Figure 3

      Labeling is confusing. Drop the colored words (from both A and B), and stick to the color legend. Consider using open and filled symbols (and bars) to represent arm support or lack thereof. The different colored ovals are very hard to distinguish.

      Response 1.A.8. We find these recommendations improve the readability of Figure 3 and we have thus adopted them - see updated Figure 3.

      Figure 4

      Not terribly necessary.  

      Response 1.A.9. While this figure is indeed redundant based on our descriptions in the text, we kept it as we believe it can be useful in clarifying the different stages of movement we examine.

      Figure 5 

      Tiny blue and green arrows are impossible to distinguish. 

      Although the general idea is clear, E and H are not terribly intuitive.  Add distance scale bars for D-I. 

      Response 1.A.10. For improved contrast, we now use red and blue (also in line with comment below regarding Figure 7), and switched to brighter colors in general. To make E and H more intuitive and easier to follow, we expanded the on-panel legend. Thank you for pointing out that distance scale bars are missing; we have now added them (panels EFHI).

      Figure 6 

      Panel E inset is too small. 

      Response 1.A.11. We have now moved the inset to the right and enlarged it.

      Figure 7 

      Green and blue colors are not good. 

      Response 1.A.12. For improved contrast, we now use red and blue.

      Figure 8 

      Delete or move to supplement? 

      Response 1.A.13. We respectfully disagree. While the relationships in these data are also captured by the ANOVA, we believe these scatter plots offer a better overview of the relationships between force biases and FM-UE across different conditions.

      Really minor

      L113 “…participants' lower arm was supported using a custom-made air-sled (Figure 1C). Above the  participant's…” 

      Response 1.A.14. We put the apostrophe after the s so as to refer to participants in general (plural).

      L117 ”…subject-produced forces on the handle were recorder using a 6-axis force transducer.”  recorded 

      Response 1.A.14. Thank you for pointing out this error which we have now corrected.

      L136 “…2013), Experiment 1 assessed resting postural forces by passively moving participants to>…”  The experiment did not move the participant. 

      Response 1.A.15. We now fix this issue: “by having the robot passively move…”

      L248 “…experiment blocks: two with each arm, with or without arm weight support (provided by an air experimental…”

      Response 1.A.16. We have now corrected this.

      L364 “…responses to mid-movement perturbations. In 1/3 of randomly selected reaching movements…”  Obviously, you mean 1/3 of all movements: "One-third of the reaching movements were chosen randomly"  

      Response 1.A.17. We now clarify: “In 1/3 of reaching movements in Experiment 2, chosen randomly”. Also please note our response to Reviewer 2, point 10: we now report the exact number of trials for which each kind of perturbation was present.

      L609 “Damage to the CST after stroke reduces its moderating influence upon the RST (Figure 9,…”  "its" refers to the subject, "Damage", not "CST".

      Response 1.A.18. We have changed this to “Post-stroke damage to the CST reduces the moderating influence the CST has upon the RST”.

      Reviewer #2 (Recommendations For The Authors):

      (1) Throughout, the authors cleverly selected the most opposed and most aligned resting postural force biases to perform a within-subject analysis. However, this approach excludes a lot of data. The authors could perform an additional within-subject analysis. For each participant they could correlate lateral resting posture force bias to each dependent variable, utilizing all the trials of a participant. 

      Response 2.A.1a. Thank you for appreciating our analysis design and for suggesting additional analyses. We focused our within-subject analysis design on the most extreme instances, as we believe that this approach would offer the best opportunity to detect any potential effects of resting biases. We reasoned that, since resting biases tend to be relatively small for most locations in the workspace, taking all biases into account would inject a disproportionate amount of noise in our analysis, which would in turn diminish our ability to detect any potential relationships. This could be because small biases lead to small effects, but also because small biases may themselves be more likely to reflect measurement noise in the first place. Note that our study talks about separability of active reaching from resting abnormalities based on lack of relationships between the two. While one cannot definitively prove a negative, it is also important to take the approach that maximizes the ability to detect any such relationship if there were one. We believe taking the most extreme instances fulfills that role.

      However, as the Reviewer points out, this approach also excludes a substantial amount of data. We agree that our findings could be further strengthened by exploring additional within-subject analyses that utilize all trials. Thus, following the reviewer’s suggestion, we estimated the sensitivity of each dependent variable to lateral resting posture force bias. Specifically, we estimated the slope of this relationship for each individual (separately for paretic and non-paretic data) using linear regression, and assessed whether the average slope is significant for each group (paretic data, non-paretic data, and control data).

      This secondary analysis replicated our main findings: lack of relationship between posture biases and active reaching control (both for unperturbed and perturbed movement), and a significant relationship between posture biases and active holding control. In addition, in line with main point 2.1 by the reviewer, we performed the same analyses for non-paretic and control data. While there are no definitive conclusions to be made for these cases (as was likely, given that the resting force biases are smaller, as also pointed out by the Reviewer in 2.1) these data are worthy of discussion, with potentially interesting insights (for example, there are hints that the connection between resting biases and active holding control is present in the non-paretic arm as well, and may be explored in future research).

      We have included these analyses in the supplementary materials, and we point to them in the main text. Specifically:

      First, in line with our main analyses in Figure 5, we find no effect (the average slope is insignificant) for start and endpoint biases upon the corresponding reaching angles. This is now mentioned in lines 425-434 of the Results, and illustrated in Figure S5-2. There was a lack of effect for the non-paretic and control data as well.

      Second, in line with our main analyses in Figure 6, we find no effect of start biases upon responses to the pulse (Figure S6-2, mentioned in lines 513-517 of the Results). As above, there was no effect of non-paretic or control data either.

      And, finally, in line with our main analysis in Figure 7, we find an effect of resting biases upon performance for the static perturbation (Figure S7-2, mentioned in lines 578-586 of the Results). Interestingly, there is a suggestion that resting biases may affect static perturbation responses in the non-paretic data as well based on the relationship between posture bias and maximum deviation, but not the other two metrics. Given the lack of consistency of resting bias effects across the three dependent variables examined, however, our current data are unable to give a definite answer as to whether the connection between resting biases and active holding control is also present in the non-paretic side. Our hypothesis is that, since resting abnormalities and their effects are the pathological over-manifestations of mechanisms inherent in the motor system in general, such a relationship would exist. Answering this question, however, would require an experiment design better tailored to detect relationships in the non-paretic arm, where resting biases are weaker.

      We thank the Reviewer for their suggestions and believe that these additional analyses provide a more complete picture of the data, and their consistency with our main results reinforces the message of the paper.

      Then, they can report the percentage of participants that display significant correlations separately for the paretic, nonparetic, and control arms. 

      Response 2.A.1b. We note that, even in cases where the average slope (across individuals) is significant, the individual slopes themselves are usually not significant, likely due to the large amount of noise for datapoints corresponding to weak resting biases. To further examine this, we performed additional analyses whereby we examined slopes by (a) pooling all participant data together (centered separately for each individual), and then (b) taking a further step to normalize each participant’s data not only by centering but also by adjusting for each individual’s variability along each axis (i.e. assessing the slope between z-scores of resting bias vs. z-scores of each dependent variable). These two analyses confirmed our finding that resting biases interacted with active motor control, with significant slopes between resting biases and outcome variables. (a) Pooling all data together: path to stabilization: p = 0.032; time to stabilization: p = 1.4×10⁻⁵; maximum deviation: p = 0.021. (b) Pooling and normalizing: path to stabilization: p = 0.0013; time to stabilization: p = 8.6×10⁻⁶; maximum deviation: p = 0.00056. The latter analysis showed an even stronger connection between resting bias and active holding control, probably due to better accounting for differences in the range of resting biases across participants. For simplicity, however, we only provide the across-individual slope comparisons in the paper.

      (2) An important aspect of all the analyses is that they rely heavily on estimates of the resting postural force bias. How stable are these resting postural force biases at the individual level? The authors could assess this by reporting within-subject variance for both the magnitude and direction of the resting postural force bias.

      Response 2.A.2. Thank you for your suggestion. We now assess the individual-level variance in error across measurements for patients’ paretic data using an ANOVA: the variance that remains after all other factors (same probe location; same arm support condition; same participant) are taken into account. We found that individual-level measurement variance explained a mere 9.0% of total variance for resting bias magnitude. (We note that the same figure was 20.2% for the non-paretic data, in line with the weaker average biases which would be more susceptible to noise). We now note this in the Methods, as part of the new subsection “Stability of resting posture bias measurements in Experiment 1” (lines 266-273).
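      The idea behind this measure can be sketched as a ratio of sums of squares. This is an assumption-laden illustration: it treats every participant × support × location combination as one cell (a full cell-means model), which may differ from the exact ANOVA specification we used, and the record layout is hypothetical.

```python
from collections import defaultdict
from statistics import mean

def residual_variance_fraction(records):
    """Fraction of total sum of squares left within repeated measurements.

    records: iterable of (participant, support, location, bias_magnitude).
    Repeated measurements sharing the first three fields form one cell.
    """
    cells = defaultdict(list)
    for participant, support, location, magnitude in records:
        cells[(participant, support, location)].append(magnitude)
    values = [v for cell in cells.values() for v in cell]
    grand_mean = mean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_residual = sum((v - mean(cell)) ** 2 for cell in cells.values() for v in cell)
    return ss_residual / ss_total

# Hypothetical toy data: two probe locations, two repeated measurements each.
toy = [
    ("P1", "support", "loc1", 1.0),
    ("P1", "support", "loc1", 3.0),
    ("P1", "support", "loc2", 5.0),
    ("P1", "support", "loc2", 7.0),
]
fraction = residual_variance_fraction(toy)
```

      A small fraction indicates that repeated measurements under identical conditions agree closely relative to the spread across conditions.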

      (3) Does resting postural force bias influence hand movement immediately following force release from the postural perturbation? This could be assessed before any volitional responses by examining the velocity of the hand during the first 50 ms following the postural perturbation.

      Response 2.A.3. The influence seems fairly rapid, within the first 100ms as shown to the right. Here we plot hand deviation in the direction of the perturbation for the most-opposed (red) vs. most-aligned (blue) instances to examine when these curves become different. The bottom plots show the difference between these two, whereas shading indicates SEM (note that these curves are referenced to the average deviation in the last 0.5 s before force release). The rightmost plots zoom in to make it easier to see how responses to the most opposed vs. most aligned instances diverge.

      To detect the earliest post-perturbation timepoint for which this effect was significant, we performed paired t-tests at each timestep, and found that the two responses were systematically and significantly different from 95 ms after perturbation onset onwards. For reference, the same method detected a response at 25 ms for the most aligned instances and 40 ms for the most opposed instances.
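      A minimal sketch of such an onset-detection procedure is given below. Two assumptions are made for illustration: "systematically different" is read as the first timestep from which every later timestep is also significant, and significance is decided by comparing |t| with a caller-supplied critical value rather than by computing p-values. Function names and data are hypothetical.

```python
from math import inf, sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic across participants at a single timestep."""
    diffs = [x - y for x, y in zip(a, b)]
    m, s = mean(diffs), stdev(diffs)
    if s == 0:  # identical differences: no variance to scale by
        return 0.0 if m == 0 else inf
    return m / (s / sqrt(len(diffs)))

def earliest_sustained_divergence(opposed, aligned, t_crit):
    """First timestep index from which |t| > t_crit at all later timesteps.

    opposed, aligned: lists over time, each holding per-participant values.
    Returns None if the final timestep is not significant.
    """
    significant = [abs(paired_t(o, a)) > t_crit for o, a in zip(opposed, aligned)]
    onset = None
    for i in range(len(significant) - 1, -1, -1):
        if not significant[i]:
            break
        onset = i
    return onset

# Hypothetical toy data: 3 timesteps x 4 participants.
opposed = [[0.0, 0.1, -0.1, 0.0], [1.0, 1.1, 0.9, 1.0], [2.0, 2.1, 1.9, 2.0]]
aligned = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
onset_index = earliest_sustained_divergence(opposed, aligned, t_crit=3.182)
```

      Here 3.182 is the two-sided 5% critical value of the t distribution with 3 degrees of freedom; multiplying the returned index by the sampling interval gives the onset time.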

      We have now added Supplementary Figure S7-4 with short commentary in the Supplementary Materials.

      (4) Abstract. lines 7-9. At a glance (and when reading the manuscript linearly) this sentence is unclear. If the paretic arm is compromised across rest and movement, how does that afford the opportunity to address the relationship between reaching, stopping, and stabilizing when all could be impacted? It might be useful to specify that these factors may be impacted differently relative to one another with stroke, providing an opportunity to better understand the differences between movement and postural control. 

      Response 2.A.4. Thank you for pointing out this issue (also related to Reviewer 1’s point – Response 1.1). We have changed this to more clearly reflect our reasoning and highlight that the issue is that stroke can differentially impact reaching vs. holding, copied below:

      “The paretic arm after stroke exhibits different abnormalities during rest vs. movement, providing an opportunity to ask whether control of these behaviors is independently affected in stroke.”

      (5) Line 27. It is perhaps more appropriate to say conceptual model than simply 'model'.  

      Response 2.A.5. Thank you for your suggestion, which we have adopted throughout the manuscript.

      (6) Line 122-125. Figure 1A caption. The authors should specify that resting posture force biases occur when the limb or hand is physically constrained in a specific position. 

      Response 2.A.6. Thank you for pointing this out – we have clarified the caption:

      “If one were to physically constrain the hand in a position away from the resting posture, the torques involved in each component of the abnormal resting posture translate to a force on the hand (blue arrow);”

      (7) Line 147. Why was the order not randomized or counterbalanced? 

      Response 2.A.7. We prioritized paretic data, as the primary analyses and comparisons in our paper involved resting posture biases and active movement with the paretic arm. We note that our primary analyses, which rely on paretic-paretic comparisons, would not be affected by paretic vs. non-paretic ordering effects. However, ordering effects could potentially affect comparisons between paretic and non-paretic data. We now note the reasoning behind the absence of counterbalancing, and mention the potential limitation in interpreting paretic to non-paretic comparisons in lines 124-129 of the Methods.

      (8) Line 172. 12N is the peak force of the pulse?

      Response 2.A.8. The reviewer is correct; we have clarified our description (line 463 in the updated manuscript):

      “a 70 ms bell-shaped force pulse which was 12N at its peak”

      (9) Line 175. What is a clockwise pulse? Was the force vector rotating in direction over time so that it was always acting orthogonally to the movement, or did it always act leftwards or rightwards?

      Response 2.A.9. The force vector was not rotating in direction over time. Here, we used clockwise/counterclockwise to indicate rightwards/leftwards with respect to the ideal movement direction – the line from start position to target (which is what we understand the Reviewer means by “always act rightwards or leftwards”). We have clarified the text to indicate this (lines 193-195):

      …was applied by the robot lateral to the ideal movement direction (i.e. the direction formed between the center of the start position and the center of the target) after participants reached 2cm away from the starting position (Smith and Shadmehr, 2005; Fine and Thoroughman, 2006).

      (10) Lines 177-182. It might be useful to explicitly mention the frequency of each of the perturbations, just for ease of the reader. 

      Response 2.A.10. We have added this information to our Methods (lines 206-210):

      Thus, in summary, each 96-movement block consisted of 64 unperturbed movements and 32 movements perturbed with a force pulse (16 clockwise, and 16 counter-clockwise). For 20 out of the 96 movements in each block, the hold period was extended to test the hold perturbation (4 trials for each of the 5 target locations, each one of the 4 trials testing one perturbation direction as shown in Figure 7C).
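      As a quick consistency check, the block composition quoted above can be tallied as follows. The variable names are illustrative only; the counts come from the Methods text.

```python
# Trial counts per block, as stated in the Methods summary.
MOVEMENTS_PER_BLOCK = 96
UNPERTURBED = 64
PULSE = {"clockwise": 16, "counter-clockwise": 16}
TARGET_LOCATIONS = 5
HOLD_DIRECTIONS = 4  # one hold-perturbation trial per direction at each target

pulse_total = sum(PULSE.values())
hold_total = TARGET_LOCATIONS * HOLD_DIRECTIONS

# Unperturbed and pulse-perturbed movements account for the whole block.
assert UNPERTURBED + pulse_total == MOVEMENTS_PER_BLOCK

pulse_fraction = pulse_total / MOVEMENTS_PER_BLOCK  # the "1/3 of reaching movements"
hold_fraction = hold_total / MOVEMENTS_PER_BLOCK    # share with an extended hold period
```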

      (11) Line 191. Lines 188-190. It would be useful to see a sample of several of these force traces over time (0-5s) that were used to make the average for a position. That would give insight into the stability of the forces of a participant for one of the postures. These traces could be shown in Figure 2.

      Response 2.A.11. Thank you for your suggestion. We have added these panels to Figure 1 (as Figure 2 was already large). Each panel illustrates the three measurements taken at similar positions (closest to midline, distal from the body) and the same condition (paretic arm, with arm support given) for one participant (same participants as in Figure 2). Solid lines indicate the force on the x-axis (positive values indicate forces towards the left), whereas dashed lines indicate the force on the y-axis (positive values indicate forces towards the body). The shaded area indicates the part averaged in order to estimate the resting bias, illustrating how resting biases were relatively stable by the 2s mark. Note that these examples include one trial (blue traces in the third panel) which was rejected following visual inspection as described in Materials and Methods – Data Exclusion Criteria (“trials where forces appeared unstable and/or there was movement during the robot hold period”). We find this helpful as this illustrates (and motivates) one component of our methodology.

      (12) Line 196. Figure 1D (not 1E).  

      Response 2.A.12. Thank you for catching this error, which we have now corrected.

      (13) Line 215: The authors mentioned similar results. Were there any different results that impacted interpretation? Some evidence of this, similar to and in addition to Supplementary 1, would be helpful. 

      Response 2.A.13. We repeated our analyses without these exclusion criteria, with no impact on the interpretation. We now include versions of the main outcome panels from Figures 5, 6, and 7 in the supplementary materials calculated without this outlier exclusion (Figures S5-E, S6-E, and S7-E, respectively).

      (14) Line 231: Perhaps better to explicitly state the furthest three positions are being across as the distal targets for the ANOVA. 

      Response 2.A.14. Thank you for your suggestion. We now explicitly clarify this in line 276:

      “distal targets [furthest three positions] vs. proximal targets [closest two positions]”

      (15) Figure 3B, lines 265. Clearly, these are different, but the authors should report statistics. 

      Response 2.A.15. We now report these numbers (lines 339-346 of the revised manuscript, which also include statistics related to bias direction as described in 2.A.17 below).

      (16) Figure 2 should have a heat map scale.  

      Response 2.A.16. We have now added this (also Response 1.A.7), including an explanation of what the heat map represents in the caption.

      (17) Figure 3C: It would be useful to quantify and plot the direction of the resting force bias vector. 

      Response 2.A.17. Thank you for your suggestion. We have expanded Figure 3 to include the average direction of the resting force bias vector (note the readjustment of colors following Reviewer 1’s comment: striped bars indicate No Support data, and full bars indicate Support data, with the colors being the same). The direction of the force bias vector, however, may not be very informative in cases where the magnitude is small (and the signal-to-noise ratio is small), whereas averaging the direction of the force bias vector across different positions for one participant may average out systematic variations in this direction across different locations. Nevertheless, the average direction appears generally towards the body (around -90°, or 6 o’clock) even in the non-paretic and control data (though the noise – as suggested by the size of the errorbars – is much higher in the latter cases, especially when the arm is supported). This is a (weak) suggestion that these resting biases may be present, though much subdued, in the nonparetic limb and healthy individuals; further work will be needed to elucidate this.

      (18) Line 428. It is not significantly longer compared to controls. Can the authors slightly revise this sentence?

      Response 2.A.18. We have revised this sentence (lines 529-532):

      Patients showed impaired capacity to resist and recover from this perturbation (the abrupt release of the imposed force). The time to stabilization for the paretic side (0.94±0.05s) was longer compared to the non-paretic side (0.79±0.03s, p = 0.024) and controls (0.78±0.06s, though this was statistically marginal, p = 0.061) as shown in Figure 7E, left.

      (19) Line 541. It is unclear how these data support the idea of three distinct controllers. Can the authors please clarify? 

      Response 2.A.19. Here, we compared our findings to previous ideas about distinct controllers, and discuss a potential fusion of these ideas with ours. Specifically, we find that holding is distinct from both initial reaching and coming to a stop. Previous work argues that initial reaching and coming to a stop are themselves distinct (Ghez et al., 2007; Jayasinghe et al., 2022). Combining these two sets of arguments, we arrive at the possibility of three distinct controllers. 

      (20) It would be useful if the authors provided a definition of synergy, as well as distinguishing between muscle and movement synergies. 

      Response 2.A.20. We now provide this in lines 591-594:

      Here, “synergies” refer to abnormal co-activation patterns across joints that manifest as the patient tries to move – for example, the elbow involuntarily flexing as the patient tries to abduct their shoulder (Twitchell, 1951; Brunnstrom, 1966). 

      (21) Line 592-593. The wording of this sentence could be improved. 

      Response 2.A.21. We have switched this sentence to active voice for more clarity:

      Thus, while full weight support reduces both resting flexor biases and movement-related flexor synergies, this reduction seems more complete for synergies rather than resting biases.

      (22) Figure 9. In the left column, it should read normal synergies and normal resting posture.  

      Response 2.A.22. We intentionally used the same terminology, as the idea behind our conceptual model is that these patterns, which manifest as well-recognized abnormal synergies and abnormal resting postures in stroke, may be present in the healthy motor system as well, but kept in check by CST moderating the RST. At the same time, we recognize that, by definition, synergies and posture in controls are the “normal” reference point against which “abnormal” synergies and posture are defined after stroke. To clarify this issue, we thus decided to forgo the use of the term “abnormal” in the figure, and instead refer to “synergistic movement” and “synergistic resting posture”.

      (23) Figure 9. With stroke, is RST upregulated, a decreased influence of CST, or both? All seem plausible.

      Response 2.A.23a. We believe both can be happening. From previous work (e.g. McPherson et al., 2018) it seems safe to say that RST upregulation is the case, whereas one would also expect a decreased CST influence owing to the damage the CST sustains from the stroke. The relative weight of these influences would be interesting to elucidate in future work.

      I have not read the paper, but did McPherson et al., 2018 test these different hypotheses?  

      Response 2.A.23b. The main point of McPherson et al., 2018 is that increased synergy expression is due to increased RST involvement, rather than reduced CST influence. However, McPherson et al. do not show separate increases/reductions in RST/CST activity; they show that contralesional activity relative to ipsilesional activity is increased (using a laterality index). While it does seem that RST is upregulated in this case, this does not exclude the possibility that CST influence is reduced as well.

      We also noticed that the citation itself, while mentioned in the text, was missing from the bibliography. This is now fixed.

      For Figure 9, McPherson is cited as they provide evidence for the idea that RST involvement increases when arm support is decreased. This evidence is both direct (e.g. in their Figure 3 where they show that “Stroke participants exhibited increased activity in the contralesional (R) hemisphere as SABD loading increased” [i.e. arm support was reduced]) and indirect: they connect synergies to RST involvement, and also show increased synergies with reduced arm support (also shown multiple times previously). Both these arguments suggest that arm support reduces RST involvement. We have clarified the relevant sentence:

      The interesting implication of this conceptual model is that synergies are in fact postural abnormalities that spill over into active movement when the CST can no longer modulate the increased RST activation that occurs when weight support is removed. Supporting this idea, McPherson et al. found increased ipsilateral activity (which primarily represents activation via the descending RST (Zaaimi et al., 2012)) when the paretic arm had reduced support compared to full support (McPherson et al., 2018).

      Reviewer #3 (Recommendations For The Authors):

      For Experiment 2, it is not immediately clear how the within-subject values are being pooled and compared across the different conditions. For instance, in the static perturbation trials, there are four blocks with 20 perturbation trials per block per arm (80 total per arm) with each location and direction once per block. For each participant, the comparison is between the location/direction that was most opposed (although this doesn't look accurately represented in Fig 7F). Therefore, the within-subject comparison is 4 trials per participant? Were these values averaged or pooled? It is a little odd that the SD for all the within-subjects trials are identical or nearly identical across conditions especially when looking at the example patient data in 7B and 7F.  

      Response 3.A.1. For static perturbation trials, the within-subject comparison involves 8 trials per participant: 4 trials corresponding to the perturbation direction/position combination with resting bias most opposed to the perturbation, and 4 trials corresponding to the perturbation direction/position combination with resting bias most aligned with the perturbation. These values were averaged for each individual. We have expanded our methods to make this part of our data analysis clear (lines 284-296) for all types of comparisons (unperturbed movement, pulse perturbation, static perturbations – now referred to as “release perturbation”).

      The across-subject SDs for the average resting forces for each one of these two conditions, shown in Figure 7F are indeed identical. This is due to how these two instances (most aligned vs. most resistive) were selected: because the perturbation directions come in pairs that exactly oppose each other (Figure 7B), if one were to select the position with the most opposing resting bias, that would mean that the combination with same position and the oppositely-directed perturbation would be the one with the most assistive resting bias. Hence the resting biases selected for the most opposing/assistive instances would be equal in magnitude and opposite to each other for each participant, as illustrated in Figure 7F, whereby the most-opposed bias for each individual is exactly opposite to the corresponding most-aligned bias for the same individual. We have added a brief commentary about this on the caption (lines 551-554), reproduced below:

      Note how the most-opposed resting bias for each patient is equal and opposite to their most-aligned resting bias. This is because the same resting bias, when projected along the direction of two oppositely-directed perturbations (illustrated in C), would oppose one with the same magnitude with which it would align with the other.
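      The geometric argument can be verified with a small worked example. The bias vector, the direction angles, and the sign convention (positive projection meaning aligned with the perturbation) are hypothetical choices made for illustration; a single position is used for simplicity.

```python
from math import cos, radians, sin

def projection(bias, direction_deg):
    """Signed component of a 2-D resting bias along a perturbation direction."""
    ux, uy = cos(radians(direction_deg)), sin(radians(direction_deg))
    return bias[0] * ux + bias[1] * uy

def most_opposed_and_aligned(bias, directions_deg):
    """Smallest (most opposed) and largest (most aligned) projection."""
    projections = [projection(bias, d) for d in directions_deg]
    return min(projections), max(projections)

# Hypothetical bias pointing mostly toward the body, and four perturbation
# directions arranged as two exactly opposing pairs.
bias = (0.5, -2.0)
directions = [0, 90, 180, 270]
opposed, aligned = most_opposed_and_aligned(bias, directions)
```

      Because each direction has an exact opposite in the set, the projections come in pairs of equal magnitude and opposite sign, so `opposed` equals `-aligned` up to floating-point error. The across-subject SDs of the two selected sets are therefore identical.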

      Importantly, following suggestions by Reviewer 2 (see point 2.A.1), we now provide supplementary analyses that use the entirety of the relevant data, rather than the most extreme instances, which provide evidence supporting our main findings (Figures S5-2, S6-2, and S7-2).

      The printed colors in Figure 3 are very muddled and hard to read/interpret, especially in panel A. 

      Response 3.A.2. Thank you for pointing out this issue, also raised by Reviewer 1. We have adjusted the colors to be more distinct from each other and look clear both in print and on-screen, making use of dashed lines and stripes rather than different shades.

      I think it would improve readability and interpretation if Figure 8 and the results related to FM-UE were contained within the description of results for Experiment 1.

      Response 3.A.3. Thank you for this suggestion. This is actually a debate we had among ourselves earlier, and we can see merits to either ordering. It is very arguable that moving Figure 8 and the FM-UE results within the rest of Experiment 1 may improve readability somewhat. However, we believe that presenting these results at the end better serves to illustrate the apparent paradox between the lack of direct connection between resting biases and active movement on one hand, and the relationship between resting biases and abnormal synergies on the other. We believe that this better sets the stage to present our conceptual model, which explains this paradox based on the role arm support plays in modulating the expression of both resting biases and abnormal synergies.

      Additional changes/corrections not outlined above

      Figure 1D displayed a right arm, but showed a target array (red dots) for a left arm paradigm. We now flip the target array shown for consistency.

      We corrected Figure 6C, which accidentally used an earlier definition of settling time that was based on lateral stabilization throughout the entire movement, rather than focusing on the period immediately following the pulse. The intended definition of settling time (as we had described in the Methods, lines 204-206 of original submission) focuses on lateral corrections specific to the pulse (rather than corrections when the participant approaches the endpoint) and better matches the one for settling time for the release (static) perturbation trials. Note that this change did not affect the (lack of) relationship between settling time and resting force bias, both across individuals (correlation plots now in Figure S6-1) and within individuals (now shown in the right part of panel 6D). Also in panel C, an error in the scaling for the maximum lateral deviation in the pulse direction (right side of the panel) is now corrected.

      In addition, we made minor edits throughout the text to improve readability.

      References

      Albert ST, Hadjiosif AM, Jang J, Zimnik AJ, Soteropoulos DS, Baker SN, Churchland MM, Krakauer JW, Shadmehr R (2020) Postural control of arm and fingers through integration of movement commands. Elife 9:e52507.

      Avni I, Arac A, Binyamin-Netser R, Kramer S, Krakauer JW, Shmuelof L (2024) The Kinematics of 3D Arm Movements in Sub-Acute Stroke: Impaired Inter-Joint Coordination is Attributable to Both Weakness and Flexor Synergy Intrusion. Neurorehabil Neural Repair 38:646–658.

      Bourbonnais D, Vanden Noven S, Carey KM, Rymer WZ (1989) Abnormal spatial patterns of elbow muscle activation in hemiparetic human subjects. Brain 112:85–102.

      Brunnstrom S (1966) Motor testing procedures in hemiplegia: based on sequential recovery stages. Phys Ther 46:357–375.

      Cortes JC, Goldsmith J, Harran MD, Xu J, Kim N, Schambra HM, Luft AR, Celnik P, Krakauer JW, Kitago T (2017) A Short and Distinct Time Window for Recovery of Arm Motor Control Early After Stroke Revealed With a Global Measure of Trajectory Kinematics. Neurorehabil Neural Repair 31:552–560.

      Duque J, Thonnard J, Vandermeeren Y, Sébire G, Cosnard G, Olivier E (2003) Correlation between impaired dexterity and corticospinal tract dysgenesis in congenital hemiplegia. Brain 126:732–747.

      Fine MS, Thoroughman KA (2006) Motor Adaptation to Single Force Pulses: Sensitive to Direction but Insensitive to Within-Movement Pulse Placement and Magnitude. J Neurophysiol 96:710–720.

      Ghez C, Scheidt R, Heijink H (2007) Different Learned Coordinate Frames for Planning Trajectories and Final Positions in Reaching. J Neurophysiol 98:3614–3626.

      Hadjiosif AM, Branscheidt M, Anaya MA, Runnalls KD, Keller J, Bastian AJ, Celnik PA, Krakauer JW (2022) Dissociation between abnormal motor synergies and impaired reaching dexterity after stroke. J Neurophysiol 127:856–868.

      Jayasinghe SA, Scheidt RA, Sainburg RL (2022) Neural Control of Stopping and Stabilizing the Arm. Front Integr Neurosci 16.

      Kanade-Mehta P, Bengtson M, Stoeckmann T, McGuire J, Ghez C, Scheidt RA (2023) Spatial mapping of posture-dependent resistance to passive displacement of the hypertonic arm post-stroke. J NeuroEngineering Rehabil 20:163.

      Lawrence DG, Kuypers HG (1968) The functional organization of the motor system in the monkey: II. The effects of lesions of the descending brain-stem pathways. Brain 91:15–36.

      Levin MF (1996) Interjoint coordination during pointing movements is disrupted in spastic hemiparesis. Brain 119:281–293.

      Lowrey CR, Bourke TC, Bagg SD, Dukelow SP, Scott SH (2019) A postural unloading task to assess fast corrective responses in the upper limb following stroke. J NeuroEngineering Rehabil 16:1–17.

      McPherson JG, Chen A, Ellis MD, Yao J, Heckman C, Dewald JP (2018) Progressive recruitment of contralesional cortico-reticulospinal pathways drives motor impairment post stroke. J Physiol 596:1211–1225.

      McPherson LM, Dewald JP (2022) Abnormal synergies and associated reactions post-hemiparetic stroke reflect muscle activation patterns of brainstem motor pathways. Front Neurol 13:934670.

      Porter R, Lemon R (1995) Corticospinal function and voluntary movement. Oxford University Press.

      Smith MA, Brandt J, Shadmehr R (2000) Motor disorder in Huntington’s disease begins as a dysfunction in error feedback control. Nature 403:544.

      Smith MA, Shadmehr R (2005) Intact ability to learn internal models of arm dynamics in Huntington’s disease but not cerebellar degeneration. J Neurophysiol 93:2809–2821.

      Tower SS (1940) Pyramidal lesion in the monkey. Brain 63:36–90.

      Twitchell TE (1951) The restoration of motor function following hemiplegia in man. Brain 74:443–480.

      Wilkins KB, Yao J, Owen M, Karbasforoushan H, Carmona C, Dewald JP (2020) Limited capacity for ipsilateral secondary motor areas to support hand function post-stroke. J Physiol 598:2153–2167.

      Zaaimi B, Edgley SA, Soteropoulos DS, Baker SN (2012) Changes in descending motor pathway connectivity after corticospinal tract lesion in macaque monkey. Brain 135:2277–2289.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      In their manuscript, Gerlevik et al. performed an integrative analysis of clinical, genetic and transcriptomic data to identify MDS subgroups with distinct outcomes. The study was based on the building of an "immunoscore" and then combined with genotype and clinical data to analyze patient outcomes using multi-omics factor analysis. 

      Strengths: Integrative analysis of RNA-seq, genotyping and clinical data 

      Weaknesses: Validation of the bioinformatic pipeline is incomplete 

      Major comments: 

      (1) This study considered two RNA-seq data sets publicly available and generated in two distinct laboratories. Are they comparable in terms of RNA-seq technique: polyA versus rRNA depletion, paired-end sequencing, fragment length? 

      We want to reemphasize that the main point of this study is not to compare the BMMNC cohort with the HSPC cohort. These datasets are not comparable because they were collected from different cell types, and we should not expect them to match. We analysed them in parallel only to check how much HSPCs contribute to the molecular signatures we see in BMMNC samples. However, we agree with the reviewer that similar RNA-seq experimental techniques should be employed to control for confounding factors. Here is the information that we found for the HSPC and BMMNC RNA-seq studies:

      HSPC RNA-seq cohort: Total RNA was extracted using TRIzol (Thermo Scientific), and Sequencing was performed on an Illumina HiSeq4000 with 100-bp paired-end reads.

      BMMNC RNA-seq cohort: The RNA was extracted with TRIzol reagent (Thermo Scientific). RNA-sequencing libraries were prepared from poly(A)-selected RNA and were sequenced using Illumina HiSeq 2000 or 2500 platform with 100-bp paired-end reads. 

      The only difference between the two cohorts is that one cohort includes total RNAs, whereas the other has polyA-selected RNAs. Since the gene set signatures use the expression of protein-coding genes, which all have polyA tails and are included in total RNA libraries, the analysis will not be affected by total vs. polyA-selected RNA-seq techniques.

      (2) Data quality control (figure 1): the authors must show in a graph whether the features (dimensions) of factor 1 were available for each BMMNC and CD34+ samples.  

      By features of Factor 1, we think the reviewer means the features with high weights for Factor 1 in BMMNC and CD34+ samples. Figure 2c-d clearly illustrates the important features and their associations with Factor 1 for all samples in both cohorts. The samples are the columns of the two heatmaps.

      (3) How to validate the importance of "immunoscore"? If GSEA of RNA-seq data was performed in the entire cohort, in the SF3B1-mutated samples or SRSF2-mutated samples (instead of patients having a high versus low level of factor 1 shown in Sup Fig. 4), what would be the ranking of Hallmarks or Reactome inflammatory terms among the others? 

      Our GSEA analysis was an attempt to validate the importance of our identified factors. As described in the paper, Factor 1 represents a combination of immunology scores (or “immunoscores”) in the CD34+ cohort. Applying GSEA, we identified upregulation of inflammation-related pathways, chemokines, and neutrophils in patients with high (4th quartile) versus low (1st quartile) levels of Factor 1. Interestingly, sorting patients by Factor 1 resulted in a similar pattern based on gene signature scores (Figure 2d).
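
      The high-versus-low comparison described above can be illustrated with a minimal sketch. It assumes each patient has a single Factor 1 value and that "1st quartile" and "4th quartile" mean values at or below the first cut point and at or above the third. The function name, the patient IDs, and the quartile convention (Python's default `statistics.quantiles` method) are assumptions for illustration, not the procedure used in the study.

```python
import statistics


def split_extreme_quartiles(factor_by_patient):
    """Return (low, high): patient IDs in the 1st and 4th quartile of a factor."""
    # quantiles(..., n=4) returns the three cut points between the quartiles.
    q1, _, q3 = statistics.quantiles(factor_by_patient.values(), n=4)
    low = sorted(p for p, v in factor_by_patient.items() if v <= q1)
    high = sorted(p for p, v in factor_by_patient.items() if v >= q3)
    return low, high


# Hypothetical Factor 1 values for eight patients (synthetic, not study data).
example = {f"p{i}": float(i) for i in range(1, 9)}
low, high = split_extreme_quartiles(example)
```

      The two returned groups would then serve as the two phenotype classes passed to an enrichment tool; patients in the middle two quartiles are left out of the comparison.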

      To show that Factor 1 generated by MOFA is important and different from known MDS categories such as SF3B1 and SRSF2 mutants, we performed GSEA in SF3B1-mutated vs. SF3B1-WT samples and SRSF2-mutated vs. SRSF2-WT samples in the CD34+ cohort. As shown in Author response image 1, we did not see the upregulation of inflammation and interferon pathways in SF3B1 and SRSF2 mutant MDS.

      Author response image 1.

      GSEA showed no upregulation of inflammation and interferon pathways for SF3B1 and SRSF2 mutant in CD34+ cohort.  

      (4) To decipher cell-type composition of BMMNC and CD34+ samples, the authors used van Galen's data (2019; supplementary table 3). Cell composition is expressed as the proportion of each cell population among the others. Surprisingly, the authors found that the promonocyte-like score was increased in SF3B1-mutated samples and not in SRSF2-mutated samples, which are frequently co-mutated with TET2 and associated with a CMML-like phenotype. Is there a risk of bias if bone marrow subpopulations such as megakaryocytic-erythroid progenitors or early erythroid precursors are not considered?

      We thank the reviewer for their insightful comment about CMML and the high prevalence of SRSF2 mutation (> 45%) in CMML cases. Using single-cell RNA sequencing and high-parameter flow cytometry, Ferrall-Fairbanks et al. (DOI: 10.1158/2643-3230.BCD-21-0217) recently showed that CMML can be classified into three differentiation trajectories: monocytic, megakaryocyte-erythroid progenitor (MEP), and normal-like. One hallmark of monocytic-biased trajectory was the enrichment of inflammatory granulocyte–macrophage progenitor (GMP)-like cells, which we observed through our analysis for SRSF2 mutants (Figure 6a).

      Unfortunately, van Galen's data does not provide any gene set for MEP, and there is no single-cell RNA-seq atlas for MDS that could be used to calculate the MEP score. Also, we compared the Promono-like and GMP-like gene sets from van Galen's data, and we could not find any overlap, meaning that Promono-like is not specific enough to capture the signatures coming from the more differentiated progenitors such as GMPs. Therefore, as described in the paper, we focused on GMP-like rather than Promono-like.

      (5) Figures 2a and 2b indicated that the nature of retrotransposons identified in BMMNC and CD34+ was different. ERVs were not detected in CD34+ cells. Are ERVs not reactivated in CD34+ cells? Is there a bias in the sequencing or bioinformatic method?

      As described above, the two cohorts' sequencing methods, read length, etc., are identical. CD34+ RNA-seq is total RNA-seq that includes both polyA and non-polyA RTE transcripts. Therefore, the chance of bias and missing RTE signatures in the CD34+ cohort is very low. L1 and Alu, which are shared between the two cohorts, are the two RTE families that are still active and make new insertions in humans. Our interpretation is that ERV activation in BM is associated with immune cells. As shown by Au et al. (DOI: 10.1016/j.ccell.2021.10.001), several ERV loci had expression in purified immune cell subsets in renal cell carcinoma samples, potentially explaining ERV upregulation in tumours responding to treatment as those biopsies had increased tumour infiltration.

      (6) What is the impact of factor 1 on survival? Is it different between BMMNC and CD34+ cells considering the distinct composition of factor 1 in CD34+ and BMMNC?

      As shown in Table 1, Factor 1 in the BMMNC cohort is associated with overall survival (P-val < 0.05) in multivariate analysis but not in univariate analysis. We did not observe any association between Factor 1 and event-free survival in the BMMNC cohort. Also, none of the 10 factors identified by MOFA in the BM CD34+ cohort showed a significant association with MDS overall survival (Supplementary Table 5).

      (7) In Figure 1e, genotype contributed to the variance in the CD34+ cell analyses more importantly than in the BMMNC. Because the patients are different in the two cohorts, differences in the variance could be explained either by a greater variability of the type of mutations in CD34 or an increased frequency of poor prognosis mutations in CD34+ compared to BMMNC. The genotyping data must be shown.

      The genotype has already been reported in Supplementary Table 2. In fact, the number of inspected genes was much higher in the BMMNC cohort (17 genes) than in the CD34+ cohort (3 genes). Therefore, the BMMNC cohort has greater variability in the type of mutations than the CD34+ cohort. For the CD34+ cohort, we only had mutations for three spliceosome genes, where most cases (n=28) were SF3B1 mutants with good prognosis. We think that the result makes sense because the lower the genetic variability, the more homogeneous the groups and the greater the chance that one factor or a group of factors can explain the genetic variance.

      (8) Fig. 2a-b: Features with high weight are shown for each factor. For factor 9, features seemed to have a low weight (Fig. 1b and 1c). However, factor 9 was predictive of EFS and OS in the BMMNC cohort. What are the features driving the prognostic value of factor 9? 

      As shown in Figure 3b, the main features are RTE expression from the LTR:ERV1, SINE:MIR, and SINE:Alu families.

      (9) The authors also provided microarray analyses of CD34+ cell. It could be interesting to test more broadly the correlation between features identified by RNA-seq or microarrays. 

      The microarray data did not come with any genetic information or clinical data except survival information. Therefore, we could not apply MOFA to the microarray data. However, we did generate gene signature scores from the microarray data and investigated the relationship of inflammatory chemokine and cytokine scores and IFN-I signature scores with MDS survival (Figure 3c and 4c).

      (10) The authors should discuss the relevance of immunosenescence features in the context of SRSF2 mutation and extend the discussion to the interest of their pipeline for patient diagnosis and follow up under treatments. 

      We have added the below text to the discussion:

      Recent studies have shown that the expression of programmed death-ligand 1 (PD-L1) protein is significantly elevated in senescent cells (DOIs: 10.1128/mcb.00171-22, 10.1172/JCI156250, 10.1038/s41586-022-05388-4). Increased PD-L1 protein levels protect senescent cells from being cleared by cytotoxic immune cells that express the PD-1 checkpoint receptor. In fact, activation of the PD-1 receptor inhibits the cytotoxic capabilities of CD8 + T and NK cells, increasing immunosenescence.   

      Notably, patients with MDS who possess particular somatic mutations, such as those in the TP53, ASXL1, SETBP1, TET2, SRSF2, and RUNX1 genes, have an increased propensity to react favourably to PD-1/PD-L1 inhibitors (DOIs: 10.1111/bjh.17689, https://doi.org/10.1182/blood2020-141100) confirming that many cellular and molecular mechanisms, known to promote cellular senescence, including alteration of splicing machinery, are crucial stimulators of the expression of PD-L1 protein. Interestingly, in our analysis, we also observed a correlation between the senescence gene signature score and the expression of the PD-L1 gene in CD34+ cells (Supplementary Figure 7), supporting the previous findings linking PD-L1 gene expression to cellular senescence.

      The immunology and ageing features extracted from the MDS transcriptomic data used in our analysis pipeline can enhance the conventional risk-scoring systems for MDS by providing new insights into this disease, particularly in the context of inflammation and ageing. For some patients, the clinical and genetic features may remain relatively the same until follow-up. Still, the transcriptomic features might differ considerably from the baseline diagnosis, affecting the course of treatment.    

      Reviewer #2 (Public Review): 

      The authors performed a Multi-Omics Factor Analysis (MOFA) of two published MDS patient cohorts, one from bone marrow mononuclear cells (BMMNCs) and CD34 cells (ref 17) and another from CD34+ cells (ref 15), with three data modalities (clinical, genotype, and transcriptomics). Seven different views, including immune profile, inflammation/aging, Retrotransposon (RTE) expression, and cell-type composition, were derived from these modalities to attempt to identify the latent factors with significant impact on MDS prognosis.

      SF3B1 was found to be the only mutation among 13 mutations in the BMMNC cohort that indicated a significant association with high inflammation. This trend was also observed to a lesser extent in the CD34+ cohort. The MOFA factor representing inflammation showed a good prognosis for MDS patients with high inflammation. In contrast, SRSF2 mutant cases showed a granulocyte-monocyte progenitor (GMP) pattern and high levels of senescence, immunosenescence, and malignant myeloid cells, consistent with their poor prognosis. Also, MOFA identified RTE expression as a risk factor for MDS. They proposed that this work showed the efficacy of their integrative approach to assess MDS prognostic risk that 'goes beyond all the scoring systems described thus far for MDS'. 

      Several issues need clarification and response: 

      (1) The authors do not provide adequate known clinical and molecular information which demonstrates prognostic risk of their sample cohorts in order to determine whether their data and approach 'goes beyond all the scoring systems described thus far for MDS'. For example, what data have the authors that their features provide prognostic data independent of the prior known factors related to prognosis (eg, marrow blasts, mutational, cytogenetic features, ring sideroblasts, IPSS-R, IPSS-M, MDA-SS)?

      We agree with the reviewer that we did not generate a new cumulative risk score and compare it with the conventional risk scores for MDS. However, we identified individual MOFA factors, which are risk or protective factors for MDS, based on survival analysis in the BMMNC cohort. One reason that we did not generate our independent, cumulative score and compare it with other scores was that we did not receive any conventional risk score for the BMMNC cohort. However, we had access to all the clinical and genetic variables from the BMMNC cohort (except for three patients) that were required to calculate IPSS-R; hence, we calculated the IPSS-R in our resubmission for the BMMNC cohort. We made three IPSS-R risk categories by combining low and very low as low risk, and high and very high as high risk, and keeping intermediate as intermediate risk. Our survival analysis of these three categories showed a clear match between IPSS-R score and MDS survival (Author response image 2a).
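
      The three-category grouping described above amounts to a simple lookup. The sketch below is illustrative only: the five IPSS-R category labels are the standard ones, while the dictionary and function names are hypothetical and do not come from the analysis code.

```python
# Collapse the five standard IPSS-R categories into the three risk groups
# described in the response (names here are illustrative assumptions).
IPSSR_TO_GROUP = {
    "very low": "low",
    "low": "low",
    "intermediate": "intermediate",
    "high": "high",
    "very high": "high",
}


def collapse_ipssr(category):
    """Map an IPSS-R category label to low / intermediate / high risk."""
    key = category.strip().lower()
    if key not in IPSSR_TO_GROUP:
        raise ValueError(f"unknown IPSS-R category: {category!r}")
    return IPSSR_TO_GROUP[key]
```

      Raising an error on an unrecognised label, rather than returning a default, keeps patients with missing or misspelled categories from being silently placed in a risk group.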

      We then investigated the relationship between Factors 2, 4, and 9 from MOFA and the three IPSS-R risk groups. Integration of IPSS-R risk groups with factor values confirmed the finding in the manuscript that Factors 4 and 9 generally exert a protective influence over the MDS risk, whilst higher levels of Factor 2 predict a high-risk MDS (Author response image 2b). However, we see many outliers in all three factors, indicating that some patients were assigned to the wrong IPSS-R categories because IPSS-R calculation is based on clinical and genetic variables and does not include the transcriptomics data for coding and non-coding genomic regions.

      Author response image 2.

      Comparison of IPSS-R risk categories and MOFA risk and protective factors.

      (2) A major issue in analyzing this paper relates to the specific patient composition from whom the samples and data were obtained. The cells from the Shiozawa paper (ref 17) comprise a substantial number of CMML patients. Thus, what evidence have the authors that much of the data from the BMMNCs from these patients and mutant SRSF2 related predominantly to their monocytic differentiation state?

      We thank the reviewer for the insightful comment about the monocytic differentiation state of CMML and SRSF2 mutant cases. The BMMNC cohort has 11 CMML and 17 SRSF2 mutant cases, of which six are shared between the two groups. We have divided the patients into four groups: CMML only, SRSF2 mutant only, CMML and SRSF2 mutant, and others. We have generated boxplots for all cellular composition gene signature scores for these groups and compared the scores between these groups. As explained above, Ferrall-Fairbanks et al. (DOI: 10.1158/2643-3230.BCD-21-0217) recently showed that CMML can be classified into three differentiation trajectories: monocytic, megakaryocyte-erythroid progenitor (MEP), and normal-like. One hallmark of monocytic-biased trajectory was the enrichment of inflammatory granulocyte–macrophage progenitor (GMP)-like cells, which we observed through our analysis for the CMML cases with SRSF2 mutation (Author response image 3).

      Author response image 3.

      Cellular composition gene signature scores for CMML and SRSF2 mutant versus other cases. CMML cases with SRSF2 mutation show significantly higher GMP and GMP-like scores compared to other MDS cases.
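
      The four-way grouping used for this comparison follows directly from two yes/no flags per patient. The sketch below is an illustration under assumptions: the function name and the synthetic flag list are hypothetical, and only the counts stated in the response (11 CMML, 17 SRSF2 mutant, six shared) are taken from the text.

```python
from collections import Counter


def assign_group(is_cmml, has_srsf2_mutation):
    """Place a patient into one of the four comparison groups."""
    if is_cmml and has_srsf2_mutation:
        return "CMML and SRSF2 mutant"
    if is_cmml:
        return "CMML only"
    if has_srsf2_mutation:
        return "SRSF2 mutant only"
    return "other"


# Synthetic (is_cmml, has_srsf2_mutation) flags reproducing the stated counts:
# 6 shared cases, 11 - 6 = 5 CMML only, 17 - 6 = 11 SRSF2 mutant only.
cases = [(True, True)] * 6 + [(True, False)] * 5 + [(False, True)] * 11
counts = Counter(assign_group(cmml, srsf2) for cmml, srsf2 in cases)
```

      With these counts, the "CMML only" group has five cases and the "SRSF2 mutant only" group has eleven, which is worth keeping in mind when reading between-group differences in the boxplots.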

      (3) In addition, as the majority of patients in the Shiozawa paper have ring sideroblasts (n=59), thus potentially skewing the data toward consideration mainly of these patients, for whom better outcomes are well known.  

      We disagree with the reviewer. We used 94 BMMNC samples from Shiozawa’s paper, of which 19 cases had Refractory Anemia with Ring Sideroblasts (RARS), 4 cases had Refractory Anemia with Ring Sideroblasts and thrombocytosis (RARS-T), and 5 cases had Refractory cytopenia with multilineage dysplasia and ring sideroblasts (RCMD-RS). In total, we had 28 cases (~30%) with Ring Sideroblasts (RS), which is not a large enough fraction to skew the data.

      (4) Further, regarding this patient subset, what evidence have the authors that the importance of the SF3B1 mutation was merely related to the preponderance of sideroblastic patients from whom the samples were analyzed? 

      We had 34 SF3B1 mutant cases, of which 25 had Ring Sideroblasts (RS). The total number of cases with RS in the BMMNC cohort was 28. Therefore, the BMMNC cohort is not an RS-dominant cohort, and not all SF3B1 mutants were RS cases. Furthermore, it was recently shown by Ochi et al. (DOI: 10.1038/s41598-022-18921-2) that RS is a consequence of the SF3B1 K700E mutation rather than a cause that could account for the importance of SF3B1.

      (5) An Erratum was reported for the Shiozawa paper (Shiozawa Y, Malcovati L, Gallì A, et al. Gene expression and risk of leukemic transformation in myelodysplasia. Blood. 2018 Aug 23;132(8):869-875. doi: 10.1182/blood-2018-07-863134) that resulted from a coding error in the construction of the logistic regression model for subgroup prediction based on the gene expression profiles of BMMNCs. This coding error was identified after the publication of the article. The authors should indicate the effect this error may have had on the data they now report.

      Thank you for bringing this important issue to our attention. The error resulted from a mistake in the construction of the logistic regression model for subgroup prediction based on the gene expression profiles of BMMNCs. However, this issue does not affect our result because we analysed the expression data from scratch and generated our own gene signature scores. Also, the error has no impact on the genetics and clinical information that we received from the authors.

      (6) What information have the authors as to whether the differing RTE findings were not predominantly related to the differentiation state of the cell population analyzed (ie higher in BM MNCs vs CD34, Fig 1)? What control data have the authors regarding these values from normal (non-malignant) cell populations?

      As described above, L1 and Alu, the two RTE families shared between the two cohorts, are still active and make new insertions in humans (Figure 2a-b). Our interpretation is that ERV activation in BM is associated with immune cells. This interpretation is further supported by the findings of Au et al. (DOI: 10.1016/j.ccell.2021.10.001), where several ERV loci had expression in purified immune cell subsets in renal cell carcinoma samples.

      Unfortunately, neither of these two cohorts included normal (non-malignant) cell populations. We think that MOFA's unbiased way of modelling the heterogeneity is sufficient to capture the RTE derepressed phenotype of a subset of MDS cases compared to others, and we do not need normal cases to further support the finding.

      (7) The statement in the Discussion regarding the effects of SRSF2 mutation is speculative and should be avoided. Many other somatic gene mutations have known stronger effects on prognosis for MDS.

      One aim of this study is to identify specific immune signatures associated with SRSF2 and SF3B1 mutations, which are highly prevalent in MDS. Although other mutations, such as TP53, may have a stronger correlation with poor survival, numerous studies have demonstrated a clear link between SRSF2 mutations and poor prognosis.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study provides an important cell atlas of the gill of the mussel Gigantidas platifrons using a single nucleus RNA-seq dataset, a resource for the community of scientists studying deep sea physiology and metabolism and intracellular host-symbiont relationships. The work, which offers solid insights into cellular responses to starvation stress and molecular mechanisms behind deep-sea chemosymbiosis, is of relevance to scientists interested in host-symbiont relationships across ecosystems.

      Public Reviews:

      Reviewer #1 (Public Review):

      Wang et al have constructed a comprehensive single nucleus atlas for the gills of the deep sea Bathymodioline mussels, which possess intracellular symbionts that provide a key source of carbon and allow them to live in these extreme environments. They provide annotations of the different cell states within the gills, shedding light on how multiple cell types cooperate to give rise to the emergent functions of the composite tissues and the gills as a whole. They pay special attention to characterizing the bacteriocyte cell populations and identifying sets of genes that may play a role in their interaction with the symbiotes.

      Wang et al sample mussels from 3 different environments: animals from their native methane-rich environment, animals transplanted to a methane-poor environment to induce starvation, and animals that have been starved in the methane-poor environment and then moved back to the methane-rich environment. They demonstrated that starvation had the biggest impact on bacteriocyte transcriptomes. They hypothesize that the upregulation of genes associated with lysosomal digestion leads to the digestion of the intracellular symbiont during starvation, while the non-starved and reacclimated groups more readily harvest the nutrients from symbiotes without destroying them.

      Strengths:

      This paper makes available a high-quality dataset that is of interest to many disciplines of biology. The unique qualities of this non-model organism and the collection of conditions sampled make it of special interest to those studying deep sea adaptation, the impact of environmental perturbation on Bathymodioline mussels populations, and intracellular symbiotes. The authors do an excellent job of making all their data and analysis available, making this not only an important dataset but a readily accessible and understandable one.

      The authors also use a diverse array of tools to explore their data. For example, the quality of the data is augmented by the use of in situ hybridizations to validate cluster identity and KEGG analysis provides key insights into how the transcriptomes of bacteriocytes change.

      The authors also do a great job of providing diagrams and schematics to help orient non-mussel experts, thereby widening the audience of the paper.

      We thank the reviewer for the valuable feedback on our study. We are grateful that the reviewers found our work to be interesting and we appreciate their thorough evaluation of our research. Their constructive comments will be considered as we continue to develop and improve our study.

      Weaknesses:

      One of the main weaknesses of this paper is the lack of coherence between the images and the text, with some parts of the figures never being referenced in the body of the text. This makes it difficult for the reader to interpret how they fit in with the author's discussion and assess confidence in their analysis and interpretation of data. This is especially apparent in the cluster annotation section of the paper.

      We appreciate the feedback and suggestions provided by the reviewer, and we have revised our manuscript to make it more accessible to general audiences.

      Another concern is the linking of the transcriptomic shifts associated with starvation with changes in interactions with the symbiotes. Without examining and comparing the symbiote population between the different samples, it cannot be concluded that the transcriptomic shifts correlate with a shift to the 'milking' pathway and not other environmental factors. Without comparing the symbiote abundance between samples, it is difficult to disentangle changes in cell state that are due to their changing interactions with the symbiotes from other environmental factors.

      We are grateful for the valuable feedback and suggestions provided by the reviewer. Our keen interest lies in understanding symbiont responses, particularly at the single-cell level. However, it's worth noting that existing commercial single-cell RNA-seq technologies rely on oligo dT priming for reverse transcription and barcoding, thus omitting bacterial gene expression information from our dataset. We hope that advancements in technology will soon enable us to perform an integrated analysis encompassing both host and symbiont gene expression.

      Additionally, conclusions in this area are further complicated by using only snRNA-seq to study intracellular processes. This is limiting since cytoplasmic mRNA is excluded and only nuclear reads are sequenced after the organisms have had several days to acclimate to their environment and major transcriptomic shifts have occurred.

      We appreciate the comments shared by the reviewer and agree that scRNA-seq provides more comprehensive transcriptional information by targeting the entire mRNA of the cell. However, we would like to highlight that snRNA-seq has some unique advantages over scRNA-seq. Notably, snRNA-seq allows for simple snap-freezing of collected samples, facilitating easier storage, particularly for samples obtained during field trips involving deep-sea animals and other ecologically significant non-model animal samples. Additionally, unlike scRNA-seq, snRNA-seq eliminates the need for tissue dissociation, which often involves prolonged enzymatic treatment of deep-sea animal tissue/cells under atmospheric pressure. This process can potentially lead to the loss of sensitive cells or alterations in gene expression. Moreover, snRNA-seq procedures are not constrained by the size and shape of animal cells, which makes the technology well suited to constructing cell atlases of animal tissues. Consequently, we consider snRNA-seq a flexible and suitable choice for the objectives of our current research.

      Reviewer #2 (Public Review):

      Wang, He et al. shed insight into the molecular mechanisms of deep-sea chemosymbiosis at the single-cell level. They do so by producing a comprehensive cell atlas of the gill of Gigantidas platifrons, a chemosymbiotic mussel that dominates the deep-sea ecosystem. They uncover novel cell types and find that the gene expression of bacteriocytes, the symbiont-hosting cells, supports two hypotheses of host-symbiont interactions: the "farming" pathway, where symbionts are directly digested, and the "milking" pathway, where nutrients released by the symbionts are used by the host. They perform an in situ transplantation experiment in the deep sea and reveal transitional changes in gene expression that support a model where starvation stress induces bacteriocytes to "farm" their symbionts, while recovery leads to the restoration of the "farming" and "milking" pathways.

      A major strength of this study includes the successful application of advanced single-nucleus techniques to a non-model, deep-sea organism that remains challenging to sample. I also applaud the authors for performing an in situ transplantation experiment in a deep-sea environment. From gene expression profiles, the authors deftly provide a rich functional description of G. platifrons cell types that is well-contextualized within the unique biology of chemosymbiosis. These findings offer significant insight into the molecular mechanisms of deep-sea host-symbiont ecology, and will serve as a valuable resource for future studies into the striking biology of G. platifrons.

      The authors' conclusions are generally well-supported by their results. However, I recognize that the difficulty of obtaining deep-sea specimens may have impacted experimental design. In this area, I would appreciate more in-depth discussion of these impacts when interpreting the data.

      We thank the reviewer for their valuable feedback on our study. We're grateful that the reviewers found our work interesting, and we appreciate their thorough evaluation of our research. We'll consider their constructive comments as we continue to develop and improve our study.

      Because cells from multiple individuals were combined before sequencing, the in situ transplantation experiment lacks clear biological replicates. This may potentially result in technical variation (ie. batch effects) confounding biological variation, directly impacting the interpretation of observed changes between the Fanmao, Reconstitution, and Starvation conditions. It is notable that Fanmao cells were much more sparsely sampled. It appears that fewer cells were sequenced, resulting in the Starvation and Reconstitution conditions having 2-3x more cells after doublet filtering. It is not clear whether this is due to a technical factor impacting sequencing or whether these numbers are the result of the unique biology of Fanmao cells. Furthermore, from Table S19 it appears that while 98% of Fanmao cells survived doublet filtering, only ~40% and ~70% survived for the Starvation and Reconstitution conditions respectively, suggesting some kind of distinction in quality or approach.

      There is a pronounced divergence in the relative proportions of cells per cell type cluster in Fanmao compared to Reconstitution and Starvation (Fig. S11). This is potentially a very interesting finding, but it is difficult to know if these differences are the expected biological outcome of the experiment or the fact that Fanmao cells are much more sparsely sampled. The study also finds notable differences in gene expression between Fanmao and the other two conditions; a key finding is that bacteriocytes had the largest Fanmao-vs-starvation distance (Fig. 6B). But it is also notable that for every cell type, one or both comparisons against Fanmao produced greater distances than comparisons between Starvation and Reconstitution (Fig. 6B). Again, it is difficult to interpret whether Fanmao's distinctiveness from the other two conditions is underlain by fascinating biology or technical batch effects. Without biological replicates, it remains challenging to disentangle the two.

      As highlighted by the reviewer, our experimental design involves pooling multiple biological samples within a single treatment state before sequencing. We acknowledge the concern regarding the absence of distinct biological replicates and the potential impact of batch effects on result interpretation. While we recognize the merit of conducting multiple sequencing runs for a single treatment to provide genuine biological replicates, we contend that batch effects may not exert a strong influence on the observed patterns.

      In addition, we applied a bootstrap sampling algorithm to assess whether the gene expression patterns within a cluster are more similar than those between clusters. This algorithm involves selecting a portion of cells per cluster and examining whether this subset remains distinguishable from other clusters. Our assumption was that if different samples exhibited distinct expression patterns due to batch effect, the co-assignment probabilities of a cluster would be very low. This expectation was not met in our data, as illustrated in Fig. S2. The lack of significantly low co-assignment probabilities within clusters suggests that batch effects may not exert a strong influence on our results.
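
      To make the co-assignment idea concrete, the following is a minimal sketch in plain Python. It is not the published algorithm of Singh & Zhai: as a simplifying assumption, each bootstrap round keeps a fraction of the cells in every cluster, computes cluster centroids from that subset, and assigns the held-out cells to the nearest centroid. The function names (`coassignment`, `centroid`, `sq_dist`) and the parameters `frac` and `n_boot` are hypothetical.

```python
import random


def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def coassignment(cells, labels, frac=0.8, n_boot=100, seed=0):
    """Return {true_cluster: {assigned_cluster: probability}}.

    Assumption (not from the manuscript): held-out cells are assigned to
    the nearest centroid computed from a random `frac` of each cluster.
    """
    rng = random.Random(seed)
    clusters = sorted(set(labels))
    counts = {c: {d: 0 for d in clusters} for c in clusters}
    for _ in range(n_boot):
        centroids, held_out = {}, []
        for c in clusters:
            idx = [i for i, lab in enumerate(labels) if lab == c]
            rng.shuffle(idx)
            n_keep = max(1, round(frac * len(idx)))
            centroids[c] = centroid([cells[i] for i in idx[:n_keep]])
            held_out.extend(idx[n_keep:])
        for i in held_out:
            nearest = min(clusters, key=lambda c: sq_dist(cells[i], centroids[c]))
            counts[labels[i]][nearest] += 1
    probs = {}
    for c in clusters:
        total = sum(counts[c].values())
        probs[c] = {d: (counts[c][d] / total if total else 0.0) for d in clusters}
    return probs
```

      Under this simplification, a cluster whose diagonal entry `probs[c][c]` stays close to 1 remains distinguishable after resampling, whereas a cluster split by a batch effect would be expected to show a low diagonal value.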

      Indeed, we acknowledge a noticeable shift in the expression patterns of certain cell types, such as the bacteriocyte. However, this is not universally applicable across all cell types. For instance, the UMAP figure in Fig. 6A illustrates a substantial overlap among basal membrane cell 2 from Fanmao, Starvation, and Reconstitution treatments, and the centroid distances between the three treatments are subtle, as depicted in Fig. 6B. This consistent pattern is also observed in DEPC, smooth muscle cells, and the food groove ciliary cells.
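
      As an illustration of how a centroid distance between treatments can be computed for one cell type, here is a small sketch. The metric and the embedding space behind Fig. 6B are not specified in this response, so Euclidean distance between per-treatment centroids in a low-dimensional embedding is an assumption, and the function names are hypothetical.

```python
import math
from itertools import combinations


def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def treatment_centroid_distances(coords, treatments):
    """Pairwise Euclidean distances between per-treatment centroids.

    coords:     embedding coordinates of the cells of ONE cell type
    treatments: treatment label of each cell (same order as coords)
    """
    groups = {}
    for xy, t in zip(coords, treatments):
        groups.setdefault(t, []).append(xy)
    cents = {t: centroid(pts) for t, pts in groups.items()}
    return {
        (a, b): math.dist(cents[a], cents[b])
        for a, b in combinations(sorted(cents), 2)
    }
```

      Small distances for all three treatment pairs would correspond to the overlapping pattern described for basal membrane cell 2, while a large Fanmao-vs-Starvation value would correspond to the shift described for bacteriocytes.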

      The reviewer also noted variations in the number of cells per treatment. Specifically, Fanmao sequencing yielded fewer than 10 thousand cells, whereas the other two treatments produced 2-3 times more cells after quality control (QC). It is highly probable that the technician loaded different quantities of cells into the machine for single-nucleus sequencing—a not uncommon occurrence in this methodology. While loading more cells may increase the likelihood of doublets, it is crucial to emphasize that this should not significantly impact the expression patterns post-QC. It's worth noting that overloading samples has been employed as a strategic approach to capture rare cell types, as discussed in a previous study (reference: 10.1126/science.aay0267).

      The reviewer highlighted the discrepancy in cell survival rates during the 'doublet filtering' process, with 98% of Fanmao cells surviving compared to approximately 40% and 70% for the Starvation and Reconstitution conditions, respectively. It's important to clarify that the reported percentages reflect the survival of cells through a multi-step QC process employing various filtering strategies.

      Post-doublet removal, we filtered out cells with <100 or >2500 genes and <100 or >6000 unique molecular identifiers (UMIs). Additionally, genes with <10 UMIs in each data matrix were excluded. The observed differences in survival rates for Starvation and Reconstitution cells can be attributed to the total volume of data generated in Illumina sequencing. Specifically, we sequenced approximately 91 GB of data for Fanmao, ~196 GB for Starvation, and ~249 GB for Reconstitution. As a result, the qualified data obtained for Starvation and Reconstitution conditions was only about twice that of Fanmao due to the limited data volume.
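
      The thresholds above can be sketched as a simple filter. This is an illustrative plain-Python version, not the pipeline actually used: the data layout (`{cell_id: {gene_id: umi_count}}`) and the function name `qc_filter` are hypothetical, and applying the gene filter after the cell filter is an assumption, since the order is not stated here.

```python
MIN_GENES, MAX_GENES = 100, 2500   # genes detected per cell
MIN_UMIS, MAX_UMIS = 100, 6000     # UMIs per cell
MIN_GENE_UMIS = 10                 # total UMIs per gene in the matrix


def qc_filter(counts):
    """Apply the cell and gene thresholds to one count matrix.

    counts: {cell_id: {gene_id: umi_count}} (hypothetical layout)
    Returns a new matrix with failing cells and genes removed.
    """
    # Step 1: drop cells outside the gene-count or UMI-count windows.
    kept_cells = {}
    for cell, genes in counts.items():
        n_genes = sum(1 for v in genes.values() if v > 0)
        n_umis = sum(genes.values())
        if MIN_GENES <= n_genes <= MAX_GENES and MIN_UMIS <= n_umis <= MAX_UMIS:
            kept_cells[cell] = genes

    # Step 2 (assumed order): drop genes with fewer than 10 UMIs in total.
    gene_totals = {}
    for genes in kept_cells.values():
        for gene, v in genes.items():
            gene_totals[gene] = gene_totals.get(gene, 0) + v
    kept_genes = {g for g, total in gene_totals.items() if total >= MIN_GENE_UMIS}

    return {
        cell: {g: v for g, v in genes.items() if g in kept_genes}
        for cell, genes in kept_cells.items()
    }
```

      Comparing `len(qc_filter(counts))` with `len(counts)` for each treatment gives the kind of multi-step survival rate discussed above, as opposed to a rate for doublet removal alone.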

      The reviewer also observed a divergence in the relative proportions of cells per cell type cluster in Fanmao compared to Reconstitution and Starvation, as depicted in Fig. S1. This discrepancy may hold true biological significance, presenting a potentially intriguing finding. However, our discussion on this pattern was rather brief, as we acknowledge that the observed differences could be influenced by the sample preparation process for dissection and digestion. It is crucial to consider that cutting a slightly different area during dissection may result in variations in the proportion of cells obtained. While we recognize the potential impact of this factor, we do not think that the sparsity of sampling alone could significantly affect the relative proportions of cells per cell type.

      In conclusion, we acknowledge the reviewer's suggestion that sequencing multiple individual samples per treatment condition would have been ideal, rather than pooling them together. However, the homogenous distribution observed in UMAP and the consistent results obtained from bootstrap sampling suggest that the impact of batch effects on our analyses is likely not substantial. Additionally, based on our understanding, the smaller number of cells in the Fanmao sample should not have any significant effect on the differing proportions of cells or on the expression patterns within each cluster.

      Reviewer #3 (Public Review):

      Wang et al. explored the unique biology of the deep-sea mussel Gigantidas platifrons to understand the fundamental principles of animal-symbiont relationships. They used single-nucleus RNA sequencing, together with validation and visualization, to identify many of the important cellular and molecular players that allow these organisms to survive in the deep sea. They demonstrate that a diversity of cell types supports the structure and function of the gill, including bacteriocytes (specialized epithelial cells that host sulfur-oxidizing or methane-oxidizing symbionts) as well as a suite of other cell types including supportive, ciliary, and smooth muscle cells. By transplanting mussels from a methane-rich habitat to methane-limited environments, the authors showed that starved mussels may consume endosymbionts, whereas mussels in methane-rich environments upregulated genes involved in glutamate synthesis. These data add to the growing body of literature showing that organisms control their endosymbionts in response to environmental change.

      The conclusions of the data are well supported. The authors adapted a technique that would have been technically impossible in their field environment by preserving the tissue and then performing nuclear isolation after the fact. The use of single-nucleus sequencing opens the possibility of new cellular and molecular biology that is not possible to study in the field. Additionally, the in-situ data (both WISH and FISH) are high-quality and easy to interpret. The use of cell-type-specific markers along with a symbiont-specific probe was effective. Finally, the SEM and TEM were used convincingly for specific purposes in the case of showing the cilia that may support water movement.

      We appreciate the valuable feedback provided by the reviewer on our study. It is encouraging to know that our work was found to be interesting and that they conducted a thorough evaluation of our research. We will take their constructive comments into account as we strive to develop and enhance our study. We thank the reviewer for all the input.

      The one particular area for clarification and improvement surrounds the concept of a proliferative progenitor population within the gill. The authors imply that three types of proliferative cells within gills have long been known, but their study may be the first to recover molecular markers for these putative populations. The markers the authors present for gill posterior end budding zone cells (PEBZCs) and dorsal end proliferation cells (DEPCs) are not intuitively associated with cell proliferation and some additional exploration of the data could be performed to strengthen the argument that these are indeed proliferative cells. The authors do utilize a trajectory analysis tool called Slingshot which they claim may suggest that PEBZCs could be the origin of all gill epithelial cells, however, one of the assumptions of this analysis is that differentiated cells are developed from the same precursor PEBZC population.

      However, these conclusions do not detract from the overall significance of the work of identifying the relationship between symbionts and bacteriocytes and how these host bacteriocytes modulate their gene expression in response to environmental change. It will be interesting to see how similar or different these data are across animal phyla. For instance, the work of symbiosis in cnidarians may converge on similar principles or there may be independent ways in which organisms have been able to solve these problems.

      We are grateful for the valuable comments and suggestions provided by the reviewer. All suggestions have been carefully considered, and the manuscript has been revised accordingly. We particularly value the reviewer's insights regarding the characterization of the G. platifrons gill proliferative cell populations. In a separate research endeavor, we have conducted experiments utilizing both cell division and cell proliferation markers on these proliferative cell populations. While these results are not incorporated into the current manuscript, we would be delighted to share our preliminary findings with the reviewer. Our preliminary results indicate that the proliferative cell populations exhibit positivity for cell proliferation markers and contain a significant number of mitotic cells.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Further experiments are needed to link the changes in transcriptomes of Bathymodioline mussels in the different environmental conditions to changes in their interactions with symbiotes. For example, quantifying the abundance and comparing the morphology of symbiotes between the environmental conditions would lend much support for shifting between milking and farming strategies. Without analyzing the symbiotes and comparing them across populations, it is difficult to comment on the mechanisms of interactions between symbiotes and the hosts. Without this analysis, this data is better suited towards comments about the general effect of environmental perturbation and stress on gene expression in these mussels.

      We appreciate the reviewer’s comments. We are also very curious about the symbiont responses, especially at the single-cell level. However, all the current commercial single-cell RNA-seq technologies are based on oligo dT priming for reverse transcription and barcoding. Therefore, the bacterial gene expression information is omitted from our dataset. Hopefully, with the development of technology, we could conduct an integrated analysis of both host and symbiont gene expression soon.

      Additionally, clarification is needed on which types of symbiotes are being looked at. Are they MOX or SOX populations? Are they homogenous? What are the concentrations of sulfur at the sampled sites?

      We thank you for your valuable comments and suggestions. Gigantidas platifrons harbors a MOX endosymbiont population characterized by a single 16S rRNA phylotype. We apologize for any confusion resulting from our previous wording. To clarify, we have revised lines 57-59 of our introduction.

      In the text and images, consider using standardized gene names and leaving out the genome coordinates. This would greatly help with readability. Also, be careful to properly follow gene naming and formatting conventions (ie italicizing gene names and symbols).

      We appreciate the reviewer’s insightful comments. In model animals, gene nomenclature often stems from forward genetic approaches, such as the identification of loss-of-function mutants. These gene names, along with their protein products, typically correspond to unique genome coordinates. Conversely, in non-model invertebrates (e.g., Gigantidas platifrons in the present study), gene prediction relies on a combination of bioinformatics methods, including de novo prediction, homolog-based prediction, and transcriptomics mapping. Subsequently, the genes are annotated by identifying their best homologs in well-characterized databases. Given that different genes may encode proteins with similar annotated functions, we chose to include both the gene ID (genome coordinates) and the gene name in our manuscript. This dual labeling approach ensures that our audience receives accurate and comprehensive information regarding gene identification and annotation.

      Additionally, extending KEGG analysis to the atlas annotation section could help strengthen the confidence of annotations. For example, when identifying bacteriocyte populations, the functional categories of individual marker genes (lysosomal proteases, lysosomal traffic regulators, etc) are used to justify the annotation. Presenting KEGG support that these functional categories are upregulated in this population relative to others would help further support how you characterize this cluster by showing it's not just a few specific genes that are enriched in this cell group, but rather an overall functionality.

      We appreciate the valuable suggestion provided by the reviewer. Indeed, incorporating KEGG analysis into the atlas annotation section could further enhance the confidence in our annotations. However, in our study, we encountered some limitations that impeded us from conducting a comprehensive KEGG enrichment analysis.

      Firstly, the number of differentially expressed genes (DEGs) that we identified for certain cell populations was relatively small, making it challenging to meet the threshold required for meaningful KEGG enrichment analysis. For instance, among the 97 marker genes identified for the Bacteriocyte cluster, only two genes, Bpl_scaf_59648-4.5 (lysosomal alpha-glucosidase-like) and Bpl_scaf_52809-1.6 (lysosomal-trafficking regulator-like isoform X1), were identified as lysosomal genes. To generate reliable KEGG enrichments, a larger number of genes is typically required.

      Secondly, single-nucleus sequencing, as employed in our study, tends to yield a relatively smaller number of genes per cell compared to bulk RNA sequencing. This limited gene yield can make it challenging to achieve sufficient gene representation for rigorous KEGG enrichment analysis.

      Furthermore, many genes in the genome still lack comprehensive annotation, both in terms of KEGG and GO annotations. In our dataset, out of the 33,584 genes obtained through single-nuclei sequencing, 26,514 genes have no KEGG annotation, and 25,087 genes have no GO annotation. This lack of annotations further restricts the comprehensive application of KEGG analysis in our study.
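
      The annotation coverage implied by these counts can be worked out directly. The short calculation below uses only the three numbers quoted above (the helper name `coverage` is hypothetical) and shows that roughly 21% of the genes carry a KEGG annotation and roughly 25% carry a GO annotation.

```python
TOTAL_GENES = 33_584   # genes obtained through single-nuclei sequencing
NO_KEGG = 26_514       # genes without a KEGG annotation
NO_GO = 25_087         # genes without a GO annotation


def coverage(total, unannotated):
    """Return (number of annotated genes, percentage of annotated genes)."""
    annotated = total - unannotated
    return annotated, 100.0 * annotated / total


kegg_n, kegg_pct = coverage(TOTAL_GENES, NO_KEGG)
go_n, go_pct = coverage(TOTAL_GENES, NO_GO)
print(f"KEGG-annotated: {kegg_n} genes ({kegg_pct:.1f}%)")
print(f"GO-annotated:   {go_n} genes ({go_pct:.1f}%)")
```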

      The claim that VEPCs are symbiote free is not demonstrated. Additional double in situs are needed to show that markers of this cell type localize in regions free of symbiotes.

      We appreciate your comments and suggestions. In Figure 5B, our results demonstrate that the bacteriocytes (green fluorescent signal) are distant from the VEPCs, which are located around the tip of the gill filaments (close to the food groove). We have revised our Figure 5B to make it clear.

      Additionally, it does not seem like trajectory analysis is appropriate for these sampling conditions. Generally, to create trajectories confidently, more closely sampled time points are needed to sufficiently parse out the changes in expression. More justification is needed for the use of this type of analysis here and a discussion of the limitations should be mentioned, especially when discussing the hypotheses relating to PEBZCs, VEPCs, and DEPCs.

      We greatly appreciate your thoughtful commentary. It is important to acknowledge that in the context of a developmental study, incorporating more closely spaced time points indeed holds great value. In our ongoing project investigating mouse development, for instance, we have implemented time points at 24-hour intervals. However, in the case of deep-sea adult animals, we hypothesized a slower transcriptional shift in such an extreme environment, which led us to opt for a time interval of 3-7 days. Examining the differential expression profiles among the three treatments, we observed that most cell types exhibited minimal changes in their expression profiles. For the cell types strongly impacted by in situ transplantation, the expression profiles per cell type still exhibited high overlap in the UMAP analysis (Figure 6a), thus enabling meaningful comparisons. Nevertheless, we recognize that our sampling strategy may not be flawless. Additionally, the challenging nature of conducting in situ transplantation at 1000-meter depths limited the number of sampling occasions available to us. We sincerely appreciate your input and understanding.

      Finally, more detail should be added on the computational methods used in this paper. For example, the single-cell genomics analysis protocol should be expanded on so that readers unfamiliar with BD single-cell genomics handbooks could replicate the analysis. More detail is also needed on what criteria and cutoffs were used to calculate marker genes. Also, please be careful to cite the algorithms and software packages mentioned in the text.

      Acknowledged, thank you for highlighting this. In essence, the workflow closely resembles the 10x Genomics workflow, although the latter uses different software (i.e., Cell Ranger). We explain the workflow in more detail below, while noting that this information may no longer be relevant for newer users of BD or individuals who are not acquainted with BD, given that the workflow underwent a complete overhaul in the summer of 2023.

      References to lines

      Line 32: typo "..uncovered unknown tissue heterogeny" should read "uncovering" or "and uncovered"

      Overall abstract could include more detail of findings (ex: what are the "shifts in cell state" in line 36 that were observed)

      We apologize for the mistakes, and have revised the manuscript accordingly.

      Line 60: missing comma "...gill filament structure, but also"

      We apologize for the mistakes, and have revised the manuscript accordingly.

      Line 62-63: further discussion here, or in the relevant sections of the specific genes identified in the referenced bulk RNA-seq project could help strengthen confidence in annotation

      We appreciate the comment, and have revised the manuscript accordingly.

      Line 112: what bootstrapping strategy? Applied to what?

      This is a bootstrap sampling algorithm, developed in a recent bioRxiv paper, that assesses the robustness of each cell cluster. (Singh, P. & Zhai, Y. Deciphering Hematopoiesis at single cell level through the lens of reduced dimensions. bioRxiv, 2022.2006.2007.495099 (2022). https://doi.org:10.1101/2022.06.07.495099)

      Lines 127-129: What figures demonstrate the location of the inter lamina cells? Are there in situs that show this?

      We apologize for any errors; the referencing of figures in the manuscript has been revised for clarity.

      Lines 185-190: does literature support these as markers of SMCs? Are they known smooth muscle markers in other systems?

      We characterized the SMCs by the expression of LDL-associated protein, angiotensin-converting enzyme-like protein, and the "molecular spring" titin-like protein, all of which are commonly found in human vascular smooth muscle cells. Based on this analysis, we hypothesize that these cells belong to the smooth muscle cell category.

      Line 201: What is meant by "regulatory roles"?

      In this context, we are discussing the expression of genes encoding regulatory proteins, such as SOX transcription factors and secreted-frizzled proteins.

      Line 211: which markers disappeared? What in situs show this?

      We apologize for the mistakes, and have revised the manuscript accordingly.

      Line 211: typo, "role" → "roll"

      We apologize for the mistakes, and have revised the manuscript accordingly.

      Line 214: what are these "hallmark genes"

      We apologize for the mistakes. Here we are referring to the genes listed in Figure 4B. We have revised the manuscript accordingly.

      Line 220: are there meristem-like cells in metazoans? If so, this would be preferable to a comparison with plants.

      In this context, we are discussing the morphological characteristics of gill proliferative cell populations found in filibranch bivalves. These populations, namely PEPC, VEPC, and DEPC, consist of cells exhibiting morphological traits akin to those of plant cambial-zone meristem cells. These cells typically display small, round shapes with a high nucleus-to-plasma ratio. We acknowledge that while these terms are utilized in bivalve studies (citations below), they lack the robust support seen in model systems backed by molecular biology evidence. The present snRNA-seq data, however, may offer valuable cell markers for future comprehensive investigations.

      Leibson, N. L. & Movchan, O. T. Cambial zones in gills of Bivalvia. Mar. Biol. 31, 175-180 (1975). https://doi.org:10.1007/BF00391629

      Wentrup, C., Wendeberg, A., Schimak, M., Borowski, C. & Dubilier, N. Forever competent: deep-sea bivalves are colonized by their chemosynthetic symbionts throughout their lifetime. Environ. Microbiol. 16, 3699-3713 (2014). https://doi.org:10.1111/1462-2920.12597

      Cannuel, R., Beninger, P. G., McCombie, H. & Boudry, P. Gill Development and its functional and evolutionary implications in the blue mussel Mytilus edulis (Bivalvia: Mytilidae). Biol. Bull. 217, 173-188 (2009). https://doi.org:10.1086/BBLv217n2p173

      Line 335: what is slingshot trajectory analysis? Does this differ from the pseudotime analysis?

      Slingshot is an algorithm that uses the principal graph of the cells to infer trajectories. It models trajectories as curves on the principal graph, capturing the progression and transitions between different cellular states.

      Both Slingshot and pseudotime analysis aim to infer cellular trajectories. Slingshot focuses on capturing branching patterns and is fully compatible with graphs generated by dimensionality reduction methods such as UMAP and PHATE, whereas pseudotime analysis aims to order cells along a continuous trajectory and does not rely on dimensionality reduction graphs. We used both in the manuscript for different purposes.

      Line 241: introduce FISH methodology earlier in the paper, when in situ images are first referenced

      We appreciate the comment, and have revised the manuscript accordingly.

      Line 246-249: can you quantify the decrease in signal or calculate the concentration of symbiotes in the cells? Was 5C imaged whole? This can impact the fluorescent intensity in tissues of different thicknesses.

      We appreciate your comment. In Figure 5C, most of the typical gill filament region is visible (the ventral tip of the gill filament, and the mid part of the gill filament) except for the dorsal end. The gill filament of bathymodioline mussels exhibits a simple structure: a single layer of bacteriocytes grows on the basal membrane. Consequently, the gill slices have a fairly uniform thickness (with two layers of bacteriocytes and one layer of interlamina cells in between), minimizing any potential impact on fluorescent intensity. As of now, detailed quantification of intracellular symbionts may necessitate continuous TEM or ultra-resolution confocal sections to 3D reconstruct the bacteriocytes, which may exceed the scope of the current study. Therefore, fluorescent intensity remains the only method available to us for estimating bacterial density/distribution across the gill filament.

      Line 249: What is meant by 'environmental gradient?'

      Here we are referring to the gases needed for the symbionts’ chemosynthesis. We have revised the manuscript to make it clear.

      Lines 255-256: Were the results shown in the TEM images previously known? Not clear what novel information is conveyed in images Fig 5 C and D

      In Fig. 5C and D, we provide high-quality SEM and TEM images of a typical bacteriocyte, showcasing its morphology and subcellular machinery with clarity. These electron microscopy images offer the audience a comprehensive introduction to the cellular function of bacteriocytes. Additionally, they serve as supportive evidence for the bacteriocytes' snRNA-seq data.

      Line 295-296: Can you elaborate on what types of solute carrier genes have been shown to be involved with symbioses?

      We appreciate the comment, and have revised the manuscript accordingly. The putative functions of the solute carriers can be found in Figure 5I.

      Line 297-301: Which genes from the bulk RNA-seq study? Adding more detail and references in cluster annotation would help readers better understand the justifications.

      We appreciate the comment, and have revised the manuscript accordingly.

      Line 316 -322: Can you provide the values of the distances?

      We now provide the values in the main text, in addition to Fig. 6B, as well as in a supplementary table (Supplementary Table S19).

      Line 328: What are the gene expression patterns?

      We observed genes that are up- and down-regulated in Starvation and Reconstitution.

      Line 334-337: A visualization of the different expression levels of the specific genes in clusters between sites might be helpful to demonstrate the degree of difference between sites.

      We have prepared a new supplementary file showing the different expression levels.

      Line 337: Citation needed

      We appreciate the comment. Here, we hypothesize the cellular responses based on the genes’ functions and their expression patterns.

      Line 402-403: Cannot determine lineages from data presented. Need lineage tracing over time to determine this

      We acknowledge the necessity of conducting lineage tracing over time to validate this hypothesis. In practical terms, however, it is difficult to obtain samples for such a test. It may be easier to test this hypothesis in their shallow-sea relatives, although this too is very difficult in practice.

      413-414: What are the "cell-type specific responses to environmental change"? It could be interesting to present these results in the "results and discussion" section

      These results are shown in Supplementary Figure S8.

      Line 419-424: Sampling details might go better earlier on in the paper, when the sampling scheme is introduced.

      We appreciate the comments. Here, we are discussing the limitations of our current study, not sampling details.

      Line 552: What type of sequencing? Paired end? How long?

      We conducted 150bp paired-end sequencing.

      556-563: More detail here would be useful to readers not familiar with the BD guide. Also be careful to cite the software used in analysis!

      The provided guide and handbook elucidate the intricacies of gene name preparation, data alignment to the genome, and the generation of an expression matrix. It is worth mentioning that we relied upon outdated versions of the aforementioned resources during our data analysis phase, as they were the only ones accessible to us at the time. However, we have since become aware of a newer pipeline available this year, rendering the information presented here of limited significance to other researchers utilizing BD.

      Many thanks for the kind reminder. We have now included a reference for STAR. All other software was cited accordingly. To our knowledge, there is no scholarly publication on the BD pipeline that we can cite.

      Line 577-578: How was the number of clusters determined? What is meant by "manually combine the clusters?" If cells were clustered by hand, more detail on the method is needed, as well as direct discussion and justification in the body of the paper.

      It would be more appropriate to emphasize the determination of cell types rather than clusters. The clusters were identified using a clustering function, as mentioned in the manuscript. It's important to note that the clustering function (in our case, the FindClusters function of Seurat) provides a general overview based on diffuse gene expression. Technically speaking, there is no guarantee that one cluster corresponds to a single cell type. Therefore, it is crucial to manually inspect the clustering results to assign clusters to the appropriate cell types. In some cases, multiple clusters may be assigned to the same cell type, while in other cases, a single cluster may need to be further subdivided into two or more cell types or sub-cell types, depending on the specific circumstances.

      For studies conducted on model species such as humans or mice, highly and specifically expressed genes within each cluster can be compared to known marker genes of cell types mentioned in previous publications, which generally suffices for annotation purposes. However, in the case of non-model species like Bathymodioline mussels, there is often limited information available about marker genes, making it challenging to confidently assign clusters to specific cell types. In such situations, in situ hybridisation proves to be incredibly valuable. In our study, WISH was employed to visualise the expression and morphology of marker genes within clusters. When WISH revealed the expression of marker genes from a cluster in a specific type of cell, we classified that cluster as a genuine cell type. Moreover, if WISH demonstrated uniform expression of marker genes from different clusters in the same cell, we assigned both clusters to the same cell type.
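
      The manual step described above amounts to a many-to-one mapping from clusters to cell types. A minimal sketch follows; the cluster IDs and the mapping are purely hypothetical (they are not the assignments used in the manuscript), and in practice each entry would be justified by marker genes and WISH.

```python
def assign_cell_types(cluster_labels, cluster_to_cell_type):
    """Relabel each cell's cluster ID with its curated cell type.

    Clusters missing from the mapping are flagged rather than silently
    merged, so they can be inspected (for example by WISH) later.
    """
    return [
        cluster_to_cell_type.get(c, f"unassigned_cluster_{c}")
        for c in cluster_labels
    ]


# Hypothetical curation: clusters 0 and 3 show the same marker expression
# in the same cells by WISH, so both are assigned to one cell type.
curated = {0: "bacteriocyte", 3: "bacteriocyte", 1: "smooth muscle cell"}
cells = [0, 1, 3, 3, 2]
print(assign_cell_types(cells, curated))
```

      Keeping the mapping as explicit data, rather than editing labels by hand, makes the manual merging reproducible and easy to report in the Methods.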

      We expanded the description of the strategy in the Method section.

      Line 690-692: When slices were used, what part of the gill were they taken from?

      We sectioned the gill around the mid part which could represent the mature bacteriocytes.

      References to figures:

      General

      Please split the fluorescent images into different channels with an additional composite. It is difficult to see some of the expression patterns. It would also make it accessible to colorblind readers.

      We appreciate the comments and suggestions from the reviewer. We have converted our figures to CMYK colour which will help the colorblind audiences to read our paper.

      Please provide the number of replicates for each in situ and what proportion of those displayed the presented pattern.

      We appreciate the reviewer’s comments. We have explained this in the Materials and Methods section of the manuscript.

      Figure 2.C' is a fantastic summary and really helps the non-mussel audience understand the results. Adding schematics like this to Figures 3-5 would be helpful as well.

      We value the reviewer's comments. We propose that Figures 3K, 4C, and 5A-D could offer similar schematic explanations to assist the audience.

      Figure 2:

      Figures 2.C-F, 2.C', 2.H-J are not referenced in the text. Adding in discussions of them would help strengthen your discussions on the cluster annotation

      We appreciate the reviewer's comments. We have revised the manuscript accordingly.

      In 2.B. 6 genes are highlighted in red and said to be shown in in situs, but only 5 are shown.

      We apologize for the mistake. We did not include the WISH result for 20639-0.0 in the present study. We have changed the label to black.

      Figure 3:

      Fig 2C-E not mentioned.

      We appreciate the reviewer's comments. We have revised the manuscript accordingly.

      In 3.B 8 genes are highlighted in red and said to be shown in in situs. Only 6 are.

      The results of the WISH were provided in Supplementary Figures S4 and S5.

      Figure 3.K is not referenced in the legend.

      We appreciate the comment, and have revised the manuscript accordingly.

      Figure 4:

      In Figure D, it might be helpful to indicate the growth direction.

      We appreciate the comment, and have revised the manuscript accordingly by adding an arrow in panel D to indicate growth direction.

      4F: A double in situ with the symbiote marker is needed to demonstrate the nucleolin-like positive cells are symbiote free.

      We appreciate the comment. The symbiont free region could be found in Figure 5A.

      Figure 5:

      In 5.A, quantification of symbiote concentration would help support your conclusion that they are denser around the edges.

      We appreciate the comment. As we mentioned above, detailed quantification of intracellular symbionts may necessitate continuous TEM or ultra-resolution confocal sections to reconstruct the bacteriocytes in 3D, which may exceed the scope of the current study. Therefore, fluorescence intensity remains the only method available to us for estimating bacterial density and distribution across the gill filament.

      In 5.D, the annotation is not clear. Adding arrows like in 5.C would be helpful.

      We appreciate the comment, and have revised the manuscript accordingly.

      A few genes in 5.F are not mentioned in the paper body when listing other genes. Mentioning them would help provide more support for your clustering.

      We appreciate the comment, and have revised the manuscript accordingly.

      Is 5.I meant to be color coded with the gene groups from 5.F? Color Coding the gene names, rather than organelles or cellular structures might portray this better and help visually strengthen the link between the diagram and your dot plot.

      We appreciate the suggestions. We've experimented with color-coding the gene names, but some colors are less discernible against a white background.

      Figure 6:

      6.B Is there a better way to visualize this data? The color coding is confusing given the pairwise distances. Maybe heatmaps?

      We attempted a heatmap, as shown in the figure below. However, all co-authors agree that a bar plot provides clearer visualization than the heatmap. We agree that the color scheme may be confusing because it uses the same colors as the individual treatments, so we have changed the colors.

      Author response image 1.

      Figure 6.D: Why is the fanmao sample divided in the middle?

      Fig 6C shows that single-cell trajectories include branches. The branches occur because cells execute alternative gene expression programs. Thus, in Fig 6D, we show changes for genes that are significantly branch dependent in both lineages at the same time. Specifically, in cluster 2, the genes are upregulated during starvation but downregulated during reconstitution. Conversely, genes in cluster 1 are downregulated during starvation but upregulated during reconstitution. It's of note that Fig 6D displays only a small subset of significantly branch-dependent genes.

      Figure 6.D: Can you visualize the expression in the same format as in figures 2-5?

      We appreciate the comments from the reviewer. As far as we know, this heatmap is the best format to demonstrate this type of gene expression profile.

      Supplementary Figure S2:

      Please provide a key for the cell type abbreviations

      We appreciate the comment, and have added the abbreviations of cell types accordingly.

      Supplementary Figures S4 and S5:

      What part of the larger images are the subsetted image taken from?

      We appreciate the comment. These images were taken from the ventral tip and the middle of the gill slices, respectively. We have revised the figure legends to make this clear.

      Supplemental Figure S7:

      If clusters 1 and 2 show genes up and downregulated during starvation, what do clusters 4 and 3 represent?

      Cluster 1: genes that are clearly upregulated during starvation and downregulated during reconstitution. Cluster 4: genes that are downregulated during reconstitution but not clearly upregulated during starvation.

      Cluster 2: genes that are upregulated during reconstitution. Cluster 3: genes that are clearly downregulated during starvation.

      Author response table 1.

      Supplemental Figure S8:

      This is a really interesting figure that I think shows some of the results really well! Maybe consider moving it to the main figures of the paper?

      We appreciate the comments and suggestions. We concur with the reviewer on the significance of the results presented. However, considering the length of this manuscript, we have prioritized the inclusion of the most pertinent information in the main figures. Supplementary materials containing additional figures and details on the genes involved in these pathways are provided for interested readers.

      Supplemental Figure S11:

      Switching the axes might make this image easier for the reader to interpret. Additionally, calculating the normalized contribution of each sample to each cluster could help quantify the extent to which bacteriocytes are reduced when starving.

      Thank you for the insightful suggestion, which we have implemented as detailed below. We acknowledge the importance of understanding the changes in bacteriocyte proportions across different treatments. However, it's crucial to note that the percentage of cells per treatment is highly influenced by factors such as the location of digestion and sequencing, as previously mentioned.

      Author response image 2.

      Reviewer #2 (Recommendations For The Authors):

      The following are minor recommendations for the text and figures that may help with clarity:

      Fig. 3K: This figure describes water flow induced by different ciliary cells. It is not clear what the color of the arrows corresponds to, as they do not match the UMAP (i.e. the red arrow) and this is not indicated in the legend. Are these colours meant to indicate the different ciliary cell types? If so it would be helpful to include this in the legend.

      We appreciate the reviewer's comments and suggestions. The arrows indicate the water flow that might be agitated by certain types of cilia. We have revised our figure and figure legends to make this clear.

      Line 369: The incorrect gene identifier is given for the mitochondrial trifunctional enzyme. This gene identifier is identical to the one given in line 366, which describes long-chain-fatty-acid-ligase ACSBG2-like (Bpl_scaf_28862-1.5).

      We appreciate the reviewer's comments and suggestions. We have revised our manuscript accordingly.

      Line 554: The Bioproject accession number (PRJNA779258) does not appear to lead to an existing page in any database.

      We appreciate the reviewer's comments and suggestions. We have released this Bioproject to the public.

      Line 597-598: it would be helpful to know the specific number of cells that the three sample types were downsampled to, and the number of cells remaining in each cluster, as this can affect the statistical interpretation of differential expression analyses.

      The number of cells per cluster in our analysis ranged from 766 to 14633. To mitigate potential bias introduced by varying cell numbers, we implemented downsampling, restricting the number of cells per cluster to no more than 3500. This was done to ensure that the differences between clusters remained less than 5 times. We experimented with several downsampling strategies, exploring cell limits of 4500 and 2500, and consistently observed similar patterns across these variations.
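
      The capping step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy cluster labels, and the use of uniform random sampling without replacement are assumptions, since the response does not state how cells were selected. With the reported extremes (766 and 14633 cells), a cap of 3500 brings the largest-to-smallest ratio from roughly 19 down to below 5.

```python
import random


def downsample_clusters(cells_by_cluster, cap=3500, seed=0):
    """Keep at most `cap` cells per cluster (hypothetical helper).

    Assumption: cells are drawn uniformly at random without replacement.
    """
    rng = random.Random(seed)
    capped = {}
    for cluster, cells in cells_by_cluster.items():
        if len(cells) > cap:
            capped[cluster] = rng.sample(cells, cap)
        else:
            # Clusters at or below the cap are kept unchanged.
            capped[cluster] = list(cells)
    return capped


def size_ratio(cells_by_cluster):
    """Ratio between the largest and the smallest cluster size."""
    sizes = [len(cells) for cells in cells_by_cluster.values()]
    return max(sizes) / min(sizes)


# Toy data using only the smallest and largest cluster sizes quoted above.
toy = {"smallest": list(range(766)), "largest": list(range(14633))}
capped = downsample_clusters(toy, cap=3500)

print("ratio before capping:", round(size_ratio(toy), 2))
print("ratio after capping:", round(size_ratio(capped), 2))
```

      Repeating the call with `cap=4500` or `cap=2500` mirrors the sensitivity check mentioned in the response.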

      Data and code availability:

      The supplementary tables and supplementary data S1 appear to be the final output of the differential expression analyses. Including the raw data (e.g. reads) and/or intermediate data objects (e.g. count matrices, R objects), in addition to the code used to perform the analyses, may be very helpful for replication and downstream use of this dataset. As mentioned above, the Bioproject accession number appears to be incorrect.

      We appreciate the reviewer's comments and suggestions. Regarding our sequencing data, we have deposited all relevant information with the National Center for Biotechnology Information (NCBI) under Bioproject PRJNA779258. Additionally, we have requested the release of the Bioproject. Furthermore, as part of this round of revision, we have included the count matrices for reference.

      Reviewer #3 (Recommendations For The Authors):

      As noted in the public review, my only major concerns are around the treatment of progenitor cell populations. I am sympathetic to the challenges of these experiments but suggest a few possible avenues to the authors.

      First, there could be some demonstration that these cells in G. platifrons are indeed proliferative, using EdU incorporation labeling or a conserved epitope such as the phosphorylation of serine 10 in histone 3. It appears in Mytilus galloprovincialis that proliferating cell nuclear antigen (PCNA) and phospho-histone H3 have previously been used as good markers for proliferative cells (Maiorova and Odintsova 2016). The use of any of these markers along with the cell type markers the authors recover for PEBZCs for example would greatly strengthen the argument that these are proliferative cells.

      If performing these experiments would not be currently possible, the authors could use some computation approaches to strengthen their arguments. Based on conserved cell cycle markers and the use of Cell-Cycle feature analysis in Seurat could the authors provide evidence that these progenitors occupy the G2/M phase at a greater percentage than other cells? Other than the physical position of the cells is there much that suggests that these are proliferative? While I am more convinced by markers in VEPCs the markers for PEBZCs and DEPCs are not particularly compelling.

      While I do not think the major findings of the paper hinge on this, comments such as "the PBEZCs gave rise to new bacteriocytes that allowed symbiont colonization" should be taken with care. It is not clear that the PBEZCs are proliferative and there does not seem to be any direct evidence that PBEZCs (or DEPCs or VEPCS for that manner) are the progenitor cells through any sort of labeling or co-expression studies.

      We appreciate the comments and suggestions from the reviewer. We have considered all the suggestions and have revised the manuscript accordingly. We especially appreciate the reviewer’s suggestions about the characterisation of the G. platifrons gill proliferative cell populations. In a separate research project, we have tested both cell division and cell proliferation markers on the proliferative cell populations. Though we are not able to include these results in the current manuscript, we are happy to share our preliminary results with the reviewer. Our results demonstrate that the proliferative cell populations, particularly the VEPCs, are positive for cell proliferation markers and contain a high number of mitotic cells.

      Author response image 3.

      Finally, there is a body of literature that has examined cell proliferation and zones of proliferation in mussels (such as Piquet, B., Lallier, F.H., André, C. et al. Regionalized cell proliferation in the symbiont-bearing gill of the hydrothermal vent mussel Bathymodiolus azoricus. Symbiosis 2020) or other organisms (such as Bird, A. M., von Dassow, G., & Maslakova, S. A. How the pilidium larva grows. EvoDevo. 2014) that could be discussed.

      We appreciate the comments and suggestions from the reviewer. We have considered all the suggestions and have revised the manuscript accordingly (line 226-229).

      Minor comments also include:

      Consider changing the orientation of diagrams in Figure 2C' in relationship to Figure 2C and 2D-K.

      We appreciate the comments and suggestions from the reviewer. Figure 2 has been reorganized.

      For the diagram in Figure 3K, please clarify if the arrows drawn for the direction of inter lamina water flow is based on gene expression, SEM, or some previous study.

      We are grateful for the reviewer's valuable feedback and suggestions. The arrows in the figure indicate the direction of water flow that could be affected by specific types of cilia. Our prediction is based on both gene expression and SEM results. To further clarify this point, we have revised the figure legend of Fig. 3.

      Please include a label for the clusters in Figure 5E for consistency.

      We have revised our Figure 5E to keep our figures consistent.

      Please include a note in the Materials and Methods for Monocle analysis in Figure 6.

      We conducted the Monocle analyses using Monocle 2 and Monocle 3 in the R environment. We have revised the Materials and Methods to include further information on Figure 6.

      In Supplement 2, the first column is labeled PEBC while the first row is labeled PEBZ versus all other rows and columns have corresponding names. I am guessing this is a typo and not different clusters?

      We appreciate the great effort of the reviewer in reviewing our manuscript. We have corrected the typo in the revised version.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      1. The most important concern that I have refers to the FDTD simulations to characterize the ZMW, as shown in Appendix 2, Figure 4. So far, the explanations given in the caption of Figure 4 are confusing and misleading: the authors should provide more detailed explanations on how the simulations were performed and the actual definition of the parameters used. In particular:

      a. lines 1330-1332: it is not clear to me how the fluorescence lifetime can be calculated from the detected signal S (z), and why they are horizontal, i.e., no z dependence? Which lifetimes are the authors referring to?

      b. lines 1333-1335: Where do these values come from? And how do they relate to panels D & E? From what I can see in these panels the lifetimes are highly dependent on z and show the expected reduction of lifetime inside the nanostructures.

      c. lines 1336-1337: Why the quantum yield of the dyes outside the ZMW differs from those reported in the literature? In particular the changes of quantum yield and lifetime for Alexa 488 are very large (also mentioned in the corresponding part of Materials & Methods but not explained in any detail).

      We thank the Reviewer for his detailed questions on the FDTD simulations. We have now added the missing equation related to the computation of signal-averaged fluorescence lifetimes from the FDTD simulations. Specifically to the three points raised:

      a) The fluorescence lifetime is indeed not calculated from the detected signal S(z), but from the radiative and non-radiative rates in the presence of the ZMW as given in eq. 9-10. However, we use the detected signal S(z) to compute the average fluorescence lifetime over the whole z-profile of the simulation box, which we relate to the experimentally measured fluorescence lifetimes as given in Appendix 7, Figure 1. We have now added the equation to compute the signal-weighted fluorescence lifetimes, which we denote as <𝜏>S , in eq. 13 in the methods. To clarify this point, we have added the symbol <𝜏>S to the plots in Appendix 2, Figure 4 D-E and Appendix 7, Figure 1 C-D.

      b) The estimated lifetimes were obtained as the signal-weighted average over the lifetime profiles, (<𝜏>S) as given in the new eq. 13. All plotted quantities, i.e., the detection efficiency η, quantum yield ϕ, detected signal S(z), and fluorescence lifetime, are computed from the radiative and loss rates obtained from the FDTD simulation according to eqs. 8-11. To make this clearer, we have now added the new Appendix 2 – Figure 5 which shows the z-profiles of the quantities (radiative and loss rates) used to derive the experimental observables.

      c) There are multiple reasons for the differences of the quantum yields of the two analytes used in this study compared to the literature values. For cyanine dyes such as Alexa647, it is well known that steric restriction (as e.g. caused by conjugation to a biomolecule) can lead to an increase of the quantum yield and fluorescence lifetime. We observe a minor increase of the fluorescence lifetime for Alexa647 from the literature value of 1.17 ns to a value of 1.37 ns when attached to Kap95, which is indicative of this effect. In the submitted manuscript, this was discussed in the methods in lines 936-938 (lines 938-945 in the revised manuscript). For the dye Alexa488, which is used to label the BSA protein, this effect is absent. Instead, we observe (as the Reviewer correctly notes) a quite drastic reduction of the fluorescence lifetime compared to the unconjugated dye from 4 ns to 2.3 ns. In cases where a single cysteine is labeled on a protein, such a drastic reduction of the quantum yield usually indicates the presence of a quenching moiety in proximity of the labeling site, such as tryptophane, which acts via the photo-induced electron transfer mechanism. Indeed, BSA contains two tryptophanes that could be responsible for the low quantum yield of the conjugated dyes. The situation is complicated by the fact that BSA contains 35 cysteines that can potentially be labeled (although 34 are involved in disulfide bridges). The labeled BSA was obtained commercially and the manufacturer lists the degree of labeling as ~6 dye molecules per protein, with a relative quantum yield of 0.2 compared to the standard fluorescein. This corresponds to an absolute quantum yield of ~0.16, which is low compared to the literature value for Alexa488 of ~0.8.

      Based on the measured fluorescence lifetime, we estimate a quantum yield of 0.46, which is higher than the photometrically obtained value of 0.16 reported by the manufacturer. Fully quenched, nonfluorescent dyes will not contribute to the lifetime measurement but are detected in the photometric quantum yield estimates. The difference between the lifetime-based and photometric quantum yield estimates thus suggests that part of the fluorophores are almost fully quenched. While it is unknown where the dyes are attached to the protein, the low quantum yield could be indicative of dye-dye interactions via π-π stacking, which can often lead to non-fluorescent dimers. This is supported by the fact that the manufacturer reports color differences between batches of labeled protein, which indicate spectral shifts of the absorption spectrum when dye-dye adducts are formed by π-π stacking. We have now added a short discussion of this effect in lines 938-941. We note that the conclusions drawn on the quenching effect of the metal nanostructure remain valid despite the drastic reduction of the quantum yield for Alexa488, which leads to a further quantum yield reduction of the partly quenched reference state.
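
      The two quantum yield estimates for Alexa488 on BSA can be reproduced with a short calculation. This is a sketch under stated assumptions rather than the authors' procedure: the lifetime-based value assumes that the quantum yield scales linearly with the fluorescence lifetime (unchanged radiative rate), and the reference quantum yield of 0.8 for the fluorescein standard is not stated in the response but is implied by the conversion of the relative value 0.2 into the absolute value of about 0.16. The function names are hypothetical.

```python
def quantum_yield_from_lifetime(tau_measured_ns, tau_reference_ns, qy_reference):
    """Lifetime-based quantum yield estimate.

    Assumption: the radiative rate is unchanged, so the quantum yield
    scales with the ratio of measured to reference lifetime.
    """
    return qy_reference * tau_measured_ns / tau_reference_ns


def absolute_from_relative(qy_relative, qy_standard):
    """Convert a quantum yield given relative to a standard into an absolute one."""
    return qy_relative * qy_standard


# Values quoted in the response: free dye 4 ns and ~0.8, conjugated dye 2.3 ns.
qy_lifetime = quantum_yield_from_lifetime(2.3, 4.0, 0.8)

# Manufacturer value 0.2 relative to fluorescein; 0.8 for the standard is an assumption.
qy_photometric = absolute_from_relative(0.2, 0.8)

print("lifetime-based estimate:", round(qy_lifetime, 2))
print("photometric estimate:", round(qy_photometric, 2))
```

      The lifetime-based estimate only reflects dyes that still emit, whereas the photometric value averages over all absorbing dyes, which is why the gap between the two points to a dark, fully quenched subpopulation.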

      2) A second important concern refers to Figure 3: Why is there so much variability in the burst intensities reported in panels C, D? They should correspond to single molecule translocation events and thus all have comparable intensity values. In particular, the data shown for BSA in panel D is highly puzzling, since it not only reflects a reduced number of bursts (which is the main finding) but also very low intensity values, suggesting a high degree of quenching of the fluorophore being proximal to the metal on the exit side of the pore. In fact, the count rates for BSA on the uncoated pore range from 50-100 kcounts/s, while on the coated pores they barely reach 30 kcounts/s, a clear indication of quenching. Importantly, and in direct relation to this, could the authors exclude the possibility that the low event rates measured on BSA are largely due to quenching of the dye by getting entangled in the Nsp mesh just underneath the pore but in close contact to the metal?

      The Reviewer raises a valid concern, but further analysis shows that this is unproblematic. Notably, the burst intensities are in fact not reduced, in contrast to the visual impression obtained from the time traces shown in the figure. The time trace of the BSA intensity is visually dominated by high-intensity bursts which mask the low-intensity bursts in the plot. In contrast, in Figure 3 the reduced number of BSA events results in a sparser distribution of the intensity spikes, which allows low-intensity events to be seen. Different to the visual inspection, the spike-detection algorithm does not exhibit any bias in terms of the duration or the number of photons of the detected events between the different conditions for both BSA and Kap95, as shown in the new Appendix 7 – Figure 1. Using FCS analysis it can be tested whether the event duration varies between the different conditions shown in Figure 3 C-D. This did not show a significant difference in the estimated diffusion time for BSA (Appendix 7 – Figure 1 C,D). Contrary to the suggestion of the Reviewer, we also do not observe any indication of quenching by the metal between uncoated and Nsp1-coated pores for BSA. Such quenching should result in differences of the fluorescence lifetimes, which however is not evident in our experimental data (Appendix 7 – Figure 1 F).

      3) Line 91: I suggest the authors remove the word "multiplexed" detection since it is misleading. Essentially the authors report on a two-color excitation/detection scheme which is far from being really multiplexing.

      We have changed the word to “simultaneous” now and hope this avoids further confusion.

      4) Line 121: why are the ZMW fabricated with palladium? Aluminum is the gold-standard to reduce light transmissivity. An explanation for the choice of this material would be appreciated by the community.

      In a previous study (Klughammer and Dekker, Nanotechnology, 2021), we established that palladium can have distinct advantages compared to other ZMW metals such as aluminum and gold, most prominently, an increased chemical stability and reduced photoluminescence. For this study, we chose palladium over aluminum as it allowed the use of simple thiol chemistry for surface modification. In the beginning of the project, we experimented with aluminum pores as well. We consistently found that the pores got closed after measuring their ionic conductance in chlorine-containing solutions such as KCl or PBS. This problem was avoided by choosing palladium.

      5) Lines 281-282: This statement is somewhat misleading, since it reads such that the molecules stay longer inside the pore. However, if I understand correctly, these results suggest that Kap95 stays closer to the metal on the exit side. This is because measurements are being performed on the exit side of the pore as the excitation field inside the pore is quite negligible.

      We thank the Reviewer for this comment and have clarified the text in lines 290-292 as suggested to: “(…) this indicates that, on the exit side, Kap95 diffuses closer to the pore walls compared to BSA due to interactions with the Nsp1 mesh”

      6) Lines 319-320: Although the MD simulations agree with the statement being written here, the variability could be also due to the fact that the proteins could interact in a rather heterogenous manner with the Nsp mesh on the exit side of the pore, transiently trapping molecules that then would stay longer and/or closer to the metal altering the emission rate of the fluorophores. Could the authors comment on this?

      The variation mentioned in the text refers to a pore-to-pore variation and thus needs to be due to a structural difference between individual pores. This effect would also need to be stable for the full course of an experiment, typically hours. We did not find any changes of the kind suggested by the Reviewer in the fluorescence lifetimes measured on individual pores. We think that the suggested mechanism would show up as distinct clusters in Appendix 7 – Figure 1 E,F, where we found no trace of such a change. If we understand correctly, the Reviewer suggests a mechanism, not based on changes in the Nup layer density, that would lead to a varying amount of trapping of proteins close to the surface. Such a behavior should show up in the diffusion time of each pore (Appendix 7 – Figure 1 C,D), where, however, we find no trace of such an effect.

      7) Lines 493-498: These claims are actually not supported by the experimental data shown in this contribution: a) No direct comparison in terms of signal-to-noise ratio between fluorescence-based and conductance-based readouts has been provided in the ms. b) I would change the word multiplexed by simultaneous since it is highly misleading. c) The results shown are performed sequentially and thus low throughput. d) Finally, the use of unlabeled components is dubious since the detection schemes relies on fluorescence and thus requiring labeling.

      We thank the Reviewer for pointing this out.

      a) We have now added a section in appendix 3 that discusses the signal-to-noise ratios. In brief, there are three observations that led us to conclude that ZMWs provide beneficial capabilities to resolve individual events from the background:

      1. The signal-to-background ratio was determined to be 67±53 for our ZMW data of Kap95 which is an order of magnitude higher compared to the ~5.6 value for a conductance-based readout.

      2. The detection efficiency for ZMWs is independent of the Kap95 occupancy within the pore. This is different from conductance based approaches that have reduced capability to resolve individual Kap95 translocations at high concentrations.

      3. The fraction of detected translocations is much higher for ZMWs than for conductance-based data (where lots of translocations occur undetected) and matches closer to the theoretical predictions.

      b) We have changed the wording accordingly.

      c) We agree with the Reviewer that our method is still low throughput. However, the throughput is markedly increased compared to previous conductance-based nanopore measurements. This is because we can test many (here up to 8, but potentially many more) pores per chip in one experiment, whereas conductance-based readouts are limited to a single pore. We have now changed the wording to “increased throughput” in line 507 to avoid confusion.

      d) We agree that only labeled components can be studied directly with our methods. However, the effect of unlabeled analytes can be assessed indirectly without any perturbation of the detection scheme due to the specificity of the fluorescent labeling. This is distinct from previous nanopore approaches using a conductance-based readout that lack specificity. In our study, we have for example used this advantage of our approach to access event rates at high concentrations (1000nM Kap95, 500nM BSA) and large pore diameters by reducing the fraction of labeled analyte in the sample. Finally, the dependence of the BSA leakage rate as a function of the concentration of Kap95 (Figure 6) relies on a specific readout of BSA events in the presence of large amounts of Kap95, which would be impossible in conductance-based experiments.

      8) Line 769: specify the NA of the objective. Using a very long working distance would also affect the detection efficiency. Have the authors considered the NA of the objective on the simulations of the detection efficiency? This information should be included and it is important as the authors are detecting single molecule events.

      We used an NA of 1.1 for the simulation of the Gaussian excitation field in the FDTD simulations, corresponding to the NA of the objective lens used in the experiments and as specified in the methods. The Reviewer is correct that the NA also affects the absolute detection efficiency of the fluorescence signal due to the finite opening angle of the collection cone of ~56˚. In our evaluation of the simulations, we have neglected this effect for simplicity, because the finite collection efficiency of the objective lens represents only an additional constant factor that does not depend on the parameters of the simulated system, such as the pore diameter. Instead, we focused solely on the effect of the ZMW and defined the detection efficiency purely based on the fraction of the signal that is emitted towards the detection side and can potentially be detected in the experiment, which also provides the benefit that the discussed numbers are independent of the experimental setup used.

      We have now clarified this in the methods text on lines 917-920.
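
      The quoted opening angle of ~56˚ follows from the definition of the numerical aperture, NA = n·sin(θ). The sketch below assumes a water immersion medium with n = 1.33, which is not stated in the response, and the function names are hypothetical. The collected fraction is computed for an idealized isotropic emitter only, to show why the objective contributes a constant factor. It does not model the emission pattern near the metal aperture.

```python
import math


def collection_half_angle_deg(na, n_medium):
    """Half-angle of the collection cone from NA = n * sin(theta)."""
    return math.degrees(math.asin(na / n_medium))


def isotropic_collected_fraction(na, n_medium):
    """Fraction of the full sphere covered by the collection cone.

    Assumption: isotropic emission, used only as an idealized illustration.
    """
    theta = math.asin(na / n_medium)
    return (1.0 - math.cos(theta)) / 2.0


# NA = 1.1 is from the response; n = 1.33 (water) is an assumption.
half_angle = collection_half_angle_deg(1.1, 1.33)
fraction = isotropic_collected_fraction(1.1, 1.33)

print("half-angle (deg):", round(half_angle, 1))
print("isotropic collected fraction:", round(fraction, 2))
```

      Because this factor depends only on the objective and the immersion medium, it is the same for every simulated pore diameter, consistent with the authors' choice to leave it out of the detection efficiency.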

      9) Line 831: I guess that 1160ps is a mistake, right?

      This is not a mistake. We performed a tail fit of the fluorescence decay curves, meaning that the initial rise of the decay was excluded from the fit. The initial part of the fluorescence decay is dominated by the instrument response function (IRF) of the system, with an approximate width of ~500 ps. To minimize the influence of the IRF on the tail fit, we excluded the first ~1 ns of the fluorescence decay.

      10) Lines 913-917: Why are the quantum yield of Alexa 488 and lifetime so much reduced as compared to the published values in literature?

      See answer to point 1. We have added a short discussion at lines 938-941 where we speculate that the reduced quantum yield is most likely caused by dye-dye interactions due to the high degree of labeling of ~6 dyes per protein.

      11) Lines 1503-1509: The predicted lifetimes with the Nsp-1 coating have not been shown in Appendix 2 - Figure 4. How have they been estimated?

      We have not performed predictions of fluorescence lifetimes in the presence of an Nsp1 coating. Predictions of the fluorescence lifetime in the absence of the Nsp1 coating were obtained by assuming a uniform occupancy of the molecules over the simulation box. A prediction of the fluorescence lifetimes in the presence of the Nsp1 coating would require a precise knowledge of the spatial distribution of analytes, which depends, among other factors, on the extension of the Nsp1 brushes and the interaction strengths with the FG repeats. While simulations provide some insights on this, we consider a quantitative comparison of predicted and measured fluorescence lifetimes in the presence of the Nsp1 coating beyond the scope of the present study.
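
      The signal-weighted average lifetime referred to above can be sketched numerically. The response does not reproduce eq. 13, so the form used here, the lifetime profile weighted by the detected signal S(z) and normalized by the total signal on a uniform z grid, is an assumption consistent with the description "signal-weighted average over the lifetime profiles" under uniform occupancy. The profile values are hypothetical and chosen only for illustration.

```python
def signal_weighted_lifetime(signal, lifetime):
    """Signal-weighted mean lifetime on a uniform z grid.

    Assumed form: <tau>_S = sum(S_i * tau_i) / sum(S_i).
    Uniform occupancy means each z position is weighted only by S(z).
    """
    if len(signal) != len(lifetime):
        raise ValueError("signal and lifetime profiles must have equal length")
    total_signal = sum(signal)
    if total_signal <= 0:
        raise ValueError("total detected signal must be positive")
    weighted_sum = sum(s * t for s, t in zip(signal, lifetime))
    return weighted_sum / total_signal


# Hypothetical z-profiles: low signal and short lifetime near the metal,
# higher signal and longer lifetime further away from it.
signal_profile = [0.1, 0.5, 1.0, 1.0]
lifetime_profile_ns = [0.5, 0.9, 1.3, 1.4]

tau_s = signal_weighted_lifetime(signal_profile, lifetime_profile_ns)
print("signal-weighted lifetime (ns):", round(tau_s, 3))
```

      A non-uniform analyte distribution, as expected with an Nsp1 coating, would enter as an additional weight on each z position, which is the quantity the authors state is not known precisely enough for a quantitative prediction.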

      12) Lines 1534-1539: I disagree with this comment, since the measurements reported here have been performed outside the nano-holes, and thus the argument of Kap95 translocating along the edges of the pore and being responsible for the reduced lifetime does not make sense to me.

      In accordance with our answer to point 5 above, we have now changed the interpretation to the proximity of Kap95 to the metal surface on the exit side, rather than speculating on the path that the protein takes through the pore (lines 1662-1664), as follows:

      “This indicates that, in the presence of Nsp1, Kap95 molecules diffuse closer to or spend more time in proximity of the metal nanoaperture on the exit side.”

      Reviewer #2:

      (Numbers indicate the line number.)

      48: should cite more recent work: Timney et al. 2016, Popken et al. 2015

      59: should cite Zilman et al 2007, Zilman et al 2010

      62: should cite Zilman et al 2010

      We thank the Reviewer for the suggestions and have added them to the manuscript now.

      65: one should be careful in making statements that the "slow" phase is immobile, as it is likely rapidly exchanging NTRs with the "fast" phase.

      We have removed this description and replaced it by “This 'slow phase' exhibits a reduced mobility due to the high affinity of NTRs to the FG-Nup mesh.” to avoid misunderstanding.

      67: Schleicher 2014 does not provide evidence of dedicated channels

      We agree with the Reviewer and therefore moved the reference to an earlier position in the sentence.

      74-75: must cite work by Lusk & Lin et al on origami nanochannels

      We thank the Reviewer for this suggestion. We have now added a reference to the nanotraps of Shen et al. 2021, JACS, in line 75. In addition, we now also refer to Shen et al. 2023, NSMB, in the discussion where viral transport is discussed.

      77: Probably Jovanovic-Talisman (2009)?

      We thank the Reviewer for pointing out this typo.

      93: should cite Auger & Montel et al., PRL 2014

      We thank the Reviewer for pointing out this reference. To give proper credit to previous ZMW work, we have now incorporated a sentence in lines 100-102 citing this reference.

      111-112: there appears to be some internal inconsistency between this interpretation and the BSA transport mostly taking place through the "central hole" (as seems to be implied by Equation (3)). Probably it should be specified explicitly that the "central hole" in large channels is a "void".

      We thank the Reviewer for this suggestion and have added a clarifying sentence.

      115-177: This competition was studied in Jovanovic-Talisman 2009 and theoretically analysed in Zilman et al Plos Comp Biol 2010. The differences in the results and the interpretation should be discussed.

      We agree. This is discussed in the Discussion section (around line 594), where we have now added the reference to Zilman et al.

      Figure 2 Caption: "A constant flow..." - is it clear that this flow does not generate hydrodynamic flow through the pore?

      The Reviewer raises an important point. Indeed, the pressure difference over the membrane generates a hydrodynamic flow through the pore that leads to a reduction of the event rate compared to when no pressure is applied. However, as all experiments were performed under identical pressures, one can expect a proportional reduction of the absolute event rates due to the hydrodynamic flow against the concentration gradient. In other words, this will not affect the conclusions drawn on the selectivity, as it is defined as a ratio of event rates.

      We have now added additional data on the influence of the hydrodynamic flow on the translocation rate in Appendix 3 – Figure 2, where we have measured the signal of free fluorophores at high concentration on the exit side of the pore as a function of the applied pressure. The data show a linear dependence of the signal reduction on the applied pressure. At the pressure of 50 mbar used for the experiments, we see a ~5% reduction compared to the absence of pressure, implying that the reported absolute event rates are underestimated by only ~5%. Additionally, we have added such data for Kap95 translocations, which show a similar (although less consistent) effect. Measuring the event rate at zero flow is difficult, since this leads to an accumulation of fluorophores on the detection side.
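
      The cancellation argument above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: it assumes the linear pressure dependence described above with a ~5% reduction at 50 mbar, and it assumes that the same factor applies to both event rates. The function names and the example event rates are hypothetical and are not measured values.

```python
def flow_reduction_factor(pressure_mbar, reduction_at_50_mbar=0.05):
    """Fraction of the zero-pressure event rate that remains at a given pressure.

    Assumes a linear dependence on pressure, anchored at ~5% loss at 50 mbar.
    """
    return 1.0 - reduction_at_50_mbar * (pressure_mbar / 50.0)


def rate_ratio(rate_a, rate_b):
    """Ratio of two event rates (the selectivity is defined as such a ratio)."""
    return rate_a / rate_b


# Hypothetical zero-pressure event rates (arbitrary units), for illustration only.
rate_a_no_flow = 40.0
rate_b_no_flow = 8.0

factor = flow_reduction_factor(50.0)

ratio_no_flow = rate_ratio(rate_a_no_flow, rate_b_no_flow)
# Both rates are scaled by the same factor, so the ratio is unchanged.
ratio_with_flow = rate_ratio(rate_a_no_flow * factor, rate_b_no_flow * factor)
```

      Under these assumptions the absolute rates drop by ~5%, while the ratio is the same with and without flow, whichever rate is placed in the numerator.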

      Figure 3: it would help to add how long is each translocation, and what is the lower detection limit. A short explanation of why the method detects actual translocations would be good

      Unfortunately, our method cannot assess the duration of a translocation event, since we only see the particle as it exits the pore. Instead, the measured event duration is determined by the time it takes for the particle to diffuse out of the laser focus. This is confirmed by FCS analysis of translocation events, which shows the same order of magnitude of diffusion times as for free diffusion (Appendix 7 – Figure 1 C,D), in contrast to a massively reduced diffusion time within a nanopore. In Figure 2D we show the detection efficiency at different locations around the ZMW as obtained from FDTD simulations and discuss the light blocking. This shows that the vast majority of the fluorescence signal comes from the laser-illuminated side, and therefore only particles that have translocated through the ZMW are detected, as presented in lines 170-190. A more detailed discussion of the optical properties of Pd nanopores is given in Yang et al. 2023, bioRxiv (https://doi.org/10.1101/2023.06.26.546504).

      This point also explains why we see actual translocations: since the light is blocked by the ZMW, fluorophores can only be detected after they have translocated. On parts of the membrane without pores, and upstream, the number of spikes found in a time trace was negligibly small. Additionally, if a significant part of the signal were contributed by fluorescence leaking from the dark top side, there should be no difference in BSA event rate between small open pores and Nsp1 pores, which is not what we observed.

      With respect to the lower detection limit for events: in the burst search algorithm, we require a false-positive rate lower than 1 event in 100. Additionally, as described in Klughammer and Dekker, Nanotechnology (2021), we apply an empirical filter to remove events with a low signal-to-noise ratio, i.e., events that contain fewer than 5 detected photons per event or have too low an event rate. The event detection algorithm itself sets no lower limit on the duration of an event. Such a limit is instead set by the instrument and the maximum frequency at which it can detect photons, and it is below 1 μs. In practice, we do not find events shorter than 10 μs, as can be seen in the distribution of events, from which the detection limits can also be estimated (Appendix 7 – Figure 1 A and B).

      Equation (1): this is true only for passive diffusion without interactions (see eg Hoogenboom et al Physics Reports 2021 for review). Using it for pores with interactions would predict, for instance, that the inhibition of the BSA translocation comes from the decrease in D which is not correct.

      We agree with the Reviewer that this equation would not reproduce the measured data in a numerically correct way. We included it to justify why we subsequently fit a quadratic function to the data. As we write in line 260 we only used the quadratic equation “as a guide to the eye and for numerical comparison” and specifically don’t claim that this fully describes the translocation process. In this quadratic function, we introduced a scaling factor α that can be fitted to the data and thus incorporates deviations from the model. In appendix 5 we added a more elaborate way to fit the data including a confinement-based reduction of the diffusion coefficient (although not incorporating interactions). Given the variations of the measured translocation rates, the data is equally well described by both the simple and the more complex model function.

      Equation (1): This is not entirely exact, because the concentration at the entrance to the pore is lower than the bulk concentration, which might introduce corrections

      We agree with the Reviewer and have added that the concentration difference Δc is measured at the pore entrance and exit, and this may be lower than the bulk concentration. As described in our reaction to the Reviewer’s previous comment, equation (1) only serves as a justification to use the quadratic dependence and any deviations in Δc are absorbed into the prefactor α in equation (2).

      Equation (3): I don't understand how this is consistent with the further discussion of BSA translocation. Clearly BSA can translocate through the pore even if the cross-section is covered by the FG nups (through the "voids" presumably?).

      The Reviewer raises an important point here. Equation 3 can only be used for a pore radius r > rprot + b. With b determined to be 11.5 nm and rprot = 3.4 nm for BSA, this requires r > 15 nm. We would like to stress, however, that b does not directly give the height of a rigid Nsp1 ring but is related to the configuration of the Nsp1 inside the pore. Equation (3) (and equation (2)) were chosen because even these simple equations could fit the experimentally measured translocation rates well, and not because they would accurately model the setup in the pore. As we found from the simulations, the BSA translocations at low pore diameters presumably happen through transient openings of the mesh. Averaged over time, the stochastic opening of these voids gives rise to the observed translocation rate.

      296-297: is it also consistent with the simulations?

      We compare the experimental and simulated b values in lines 387-388. We obtained b = 9.9 ± 0.1 nm from the simulations (from fitting the translocation rates, not from measuring the extension of the Nsp1 molecules) and 11.5 ± 0.4 nm from the experiments, which we consider to be in good agreement.

      331: has it been established that the FG nups equilibrate on the microsecond scale?

      As an example, we have analyzed the simulation trajectory of the most dense nanopore (diameter = 40 nm, grafting = 1/200 nm2). In Author response image 1 we show for each of the Nsp1-proteins how the radius of gyration (Rg) changes in time over the full trajectory (2 μs + 5 μs). As expected, the Rg values reached the average equilibrium values very well within 2 μs simulation time, showing that the FG-Nups indeed equilibrate on the (sub)microsecond scale.

      Author response image 1.
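
      For readers who want to repeat this kind of equilibration check on their own trajectories, a minimal sketch is given here. It is not the analysis code used for the manuscript: it assumes each frame is a plain list of (x, y, z) bead coordinates for one chain, weights all beads equally, and judges equilibration simply by comparing the mean Rg in consecutive time windows. All function names are hypothetical.

```python
import math


def radius_of_gyration(coords):
    """Rg of one chain from bead coordinates [(x, y, z), ...], equal bead weights."""
    n = len(coords)
    center = [sum(c[i] for c in coords) / n for i in range(3)]
    squared = sum(sum((c[i] - center[i]) ** 2 for i in range(3)) for c in coords)
    return math.sqrt(squared / n)


def rg_time_series(trajectory):
    """Rg for every frame of a trajectory (a list of frames)."""
    return [radius_of_gyration(frame) for frame in trajectory]


def window_means(series, n_windows=4):
    """Mean of the series in consecutive, equally sized windows.

    Trailing samples that do not fill a whole window are ignored.
    """
    size = len(series) // n_windows
    if size == 0:
        raise ValueError("series is shorter than the number of windows")
    return [sum(series[k * size:(k + 1) * size]) / size for k in range(n_windows)]
```

      If the window means stop drifting after the first window or two, the chain can be regarded as equilibrated on that time scale under this simple criterion.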

      334-347: the details of the method should be explained explicitly in the supplementary (how exactly voids distributions are estimated and the PMF are calculated etc)

      The void analysis was performed with the software obtained from the paper of Winogradoff et al. In our Methods we provide an overview of how this software calculates the void probability maps and how these are converted into PMFs. For a more detailed description of how exactly the analysis algorithm is implemented in the software, we refer the reader to the original work. The analysis codes with the input files that were used in this manuscript have been made public ( https://doi.org/10.4121/22059227.v1 ) along with the manuscript.

      Equation (4) is only an approximation (which works fine for high barriers but not the low ones). Please provide citations/derivation.

      To our knowledge, the Arrhenius relation is a valid approximation for our nanopore simulations. We are not aware that it should fail for low barriers and cannot find mention of this in the literature. It would be helpful if the Reviewer could point us to the relevant literature.
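
      As a point of reference for this exchange, an Arrhenius-type relation converts an energy barrier into a rate through an exponential factor. The sketch below assumes the generic form k = k0 · exp(−ΔE / kBT) with the barrier expressed in units of kBT. The prefactor and the barrier values are hypothetical, and the exact form of Equation (4) in the manuscript is not reproduced here.

```python
import math


def arrhenius_rate(k0, barrier_in_kbt):
    """Rate for a barrier given in units of k_B*T: k = k0 * exp(-barrier)."""
    return k0 * math.exp(-barrier_in_kbt)


K0 = 1.0  # hypothetical prefactor (rate without any barrier), arbitrary units
barriers_in_kbt = [0.5, 1.0, 2.0, 5.0]  # hypothetical barrier heights

# Rate relative to the barrier-free rate for each barrier height.
relative_rates = {b: arrhenius_rate(K0, b) / K0 for b in barriers_in_kbt}
```

      The sketch only illustrates the functional form; it does not address the regime of validity raised in the comment.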

      Figure 4: how was transport rate for Kaps calculated?

      As mentioned in lines 388-391, we assumed that the Kap95 translocation rate through Nsp1-coated pores is equal to that for open pores, as we did not observe any significant hindrance of Kap95 translocation by the Nsp1 mesh in the experiment (Figure 4 A,C).

      378: It's a bit strange to present the selectivity ratio as prediction of the model when only BSA translocation rate was simulated (indirectly).

      We agree with the Reviewer that ideally we should also simulate the Kap95 translocation rate to obtain an accurate selectivity measure of the simulated nanopores. However, as the experiments showed very similar Kap95 translocation rates for open pores and Nsp1-coated pores, we believe it is reasonable to take the Kap95 rates for open and Nsp1-pores to be equal.

      Figure 5C and lines 397: I am a bit confused how is this consistent with Figure 4D?

      Figure 5C and figure 4D both display the same experimental data, where 4D only focuses on a low diameter regime. In relation to line 397 (now 407), the Nsp1 mesh within the 60-nm pore dynamically switches between closed configurations and configurations with an open channel. When taking the temporal average of these configurations, we find that the translocation rate is higher than for a closed pore but lower than for a fully open pore. The stochastic opening and closing of the Nup mesh results in the continuous increase of the translocation rates with increasing diameter, which is in contrast to a step-wise increase that would be expected from an instantaneous collapse of the Nsp1 mesh at a certain pore diameter.

      428-439: Please discuss the differences from Jovanovic-Talisman 2009.

      We discuss extensively in lines 598-620 how our results on the Kap95-induced change of the BSA translocation rate relate to previous literature.

      440: How many Kaps are in the pore at different concentrations?

      This is a very interesting question that we were, unfortunately, not able to answer within the scope of this project. With our fluorescence-based methods we could not determine this number because the excitation light does not reach well into the nanopore.

      In our previous work on Nsp1-coated SiN nanopores using conductance measurements, we quantified the drop in conductance at increasing concentrations of Kap95 (Fragasso et al., 2023, NanoResearch, http://dx.doi.org/10.1007/s12274-022-4647-1). From this, we estimated that on average ~20 Kap95 molecules are present in a pore with a diameter of 55 nm at a bulk concentration of 2 µM. In these experiments, however, the height of the pore was only ~20 nm, much less than the 100-nm-long channel used here, and the grafting density of 1 per 21 nm2 was high compared to the grafting density here of 1 per 300 nm2. Assuming that the Kap95 occupancy scales linearly with the number of binding sites (FG repeats) in the vicinity of the pore, and hence with the number of Nsp1 molecules bound to the pore, we would expect ~7 Kap95 molecules in a pore of similar diameter under saturating (> 1 µM) concentrations.

      On the other hand, the simulations showed that the density of Nsp1 within the pore is equal to the density within the 20-nm thick SiN pores (line 380). For the longer channel and lower grafting density used here, Nsp1 was also more constrained to the pore compared to thinner pores used in previous studies (Fragasso et al., 2023, NanoResearch), where the grafted protein spilled out from the nanopores. Thus assuming that the Kap95 occupancy depends on the protein density in the pore volume rather than the total protein amount grafted to the pore walls, we would estimate a number of 100 Kap95 molecules per pore.

      These varying numbers already show that we cannot accurately provide an estimate of the Kap95 occupancy within the pore from our data due to limitations of the ZMW approach.
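
      The two estimates above follow from simple scaling of the reference value of ~20 Kap95 molecules, and the arithmetic can be sketched as follows. This is an illustration under the stated assumptions only (same pore diameter, occupancy proportional either to the number of grafted Nsp1 molecules or to the pore volume); the function names are hypothetical.

```python
def scale_by_nsp1_count(n_ref, height_ref_nm, area_per_nsp1_ref_nm2,
                        height_nm, area_per_nsp1_nm2):
    """Occupancy proportional to the number of grafted Nsp1 molecules.

    For equal pore diameter, the number of grafted molecules scales with the
    pore height divided by the wall area available per molecule.
    """
    return (n_ref * (height_nm / height_ref_nm)
            * (area_per_nsp1_ref_nm2 / area_per_nsp1_nm2))


def scale_by_pore_volume(n_ref, height_ref_nm, height_nm):
    """Occupancy proportional to the pore volume at equal Nsp1 density and diameter."""
    return n_ref * (height_nm / height_ref_nm)


# Reference: ~20 Kap95 in a ~20 nm high pore grafted at 1 Nsp1 per 21 nm^2.
# Here: 100 nm long channel grafted at 1 Nsp1 per 300 nm^2.
estimate_by_count = scale_by_nsp1_count(20, 20, 21, 100, 300)   # about 7
estimate_by_volume = scale_by_pore_volume(20, 20, 100)          # about 100
```

      The more than tenfold spread between the two results reflects the difference between the two scaling assumptions rather than any measured quantity.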

      445: how is this related to the BSA translocation increase?

      For the calculation of the selectivity ratio, we assumed the normalized Kap95 translocation rate to be independent of the Kap95 concentration. Hence, the observed trends of the selectivity ratios at different concentrations of Kap95, as shown in Figure 6 D, are solely due to a change in the BSA translocation rate at different concentrations of Kap95, as given in Figure 6 B,C.

      462-481: it's a bit confusing how this interfaces with the "void" analysis ( see my previous comments)

      We agree that the phenomenological descriptions in terms of transient openings (small, dynamic voids) that for larger pores become a constantly open channel (a single large, static void) might cause some confusion to the reader. In the last part of the results, we aimed to relate the loss of the BSA rate to a change of the Nsp1 mesh. We acknowledge that the model of a rim of Nsp1 and an open center described in Figure 5F is highly simplified. We now explain this in the revised paper at lines 483-486 by referring to an effective layer thickness, which holds true under the simplifying assumption of a central transport channel.

      Figure 6D: I think the illustration of the effect of kaps on the brush is somewhat misleading: at low pore diameters, it is possible that the opposite happens: the kaps concentrate the polymers towards the center of the pore. It should be also made clear that there are no kaps in simulations (if I understand correctly?)

      Indeed, at small pore diameters we think it would be possible to observe what the Reviewer describes. The illustration should only indicate what we think is happening for large pore diameters where we observed the opening of a central channel. To avoid confusion, we now shifted the sketches to panel G where the effective layer thickness is discussed.

      Indeed, as stated in lines 331-340 no Kap95 or BSA molecules were present in the simulations. We have now clarified this point in lines 872-876.

      518: Please provide more explanation on the role of hydrodynamics pressure.

      We have now performed additional experiments and quantified the effect of the pressure to be a ~5% reduction of the event rates, as described in the answer to a previous question above.  

      Reviewer #3 (Recommendations For The Authors):

      No experiments have been performed with the Ran-Mix regeneration system. It would be beneficial to add Ran-Mix to the trans compartment and see how this would affect Kap95 translocation events frequency and passive cargo diffusion. As the authors note in their outlook, this setup offers an advantage in using Ran-Mix and thus could also be considered here or in a future follow-up study.

      We thank the Reviewer for this suggestion. We think, however, that it is beyond the scope of this paper and an interesting subject for a follow-up study.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study and associated data are compelling, novel, important, and well-carried out. The study demonstrates a novel finding that different chemotherapeutic agents can induce nucleolar stress, which manifests with varying cellular and molecular characteristics. The study also proposes a mechanism for how a novel type of nucleolar stress driven by CDK inhibitors may be regulated. The study sheds light on the importance of nucleolar stress in defining the on-target and off-target effects of chemotherapy in normal and cancer cells.

      We are thankful to the reviewers and the editor for their feedback and thorough assessment of our work. Our responses to the comments and suggestions are below.

      Reviewer #1 (Public Review):

      The study titled "Distinct states of nucleolar stress induced by anti-cancer drugs" by Potapova and colleagues demonstrates that different chemotherapeutic agents can induce nucleolar stress, which manifests with varying cellular and molecular characteristics. The study also proposes a mechanism for how a novel type of nucleolar stress driven by CDK inhibitors may be regulated. As a reviewer, I appreciate the unbiased screening approach and I am enthusiastic about the novel insights into cell biology and the implications for cancer research and treatment. The study has several significant strengths: i) it highlights the understudied role of nucleolar stress in the on- and off-target effects of chemotherapy; ii) it defines novel molecular and cellular characteristics of the different types of nucleolar stress phenotypes; iii) it proposes novel modes of action for well-known drugs. However, there are several important points that should be addressed:

      • The rationale behind choosing RPE cells for the screen is unclear. It might be more informative to use cancer cells to study the effects of chemotherapeutic agents. Alternatively, were RPE cells selected to evaluate the side effects of these agents on normal cells? Clarifying these points in the introduction and discussion would guide the reader.

      RPE1, a non-cancer-derived cell line, was chosen for this study to evaluate the effects of anticancer drugs on normal nucleolar function, with the underlying premise that nucleolar stress in normal cells can contribute to non-specific toxicity. This clarification is added to the introduction. Another factor in selecting a normal cell line for the drug screen and subsequent experiments was the spectrum of known and unknown genetic and metabolic alterations present in various cancer cell lines. These variables are often unique to a particular cancer cell line and may or may not impact nucleolar proteome and function. Therefore, the nucleolar stress response can be influenced by the spectrum of alterations inherent to each cancer. Our primary focus was to determine the impact of these drugs under normal conditions.

      That said, the selected hits of main drug classes were validated in a panel of cell lines that included two other hTERT lines (BJ5TA and CHON-002) and two cancer lines (DLD1 and HCT116). In cancer cells, the starting nucleolar normality scores were lower than in hTERT cells, suggesting that genetic and metabolic changes in these cells may indeed affect nucleolar morphology. Nonetheless, all drugs from a panel of selected hits from different target classes were validated in both cancer cell lines (Fig. 2F).

      • Figure 2F indicates that DLD1 and HCT116 cells are less sensitive to nucleolar changes induced by several inhibitors, including CDK inhibitors. It would be crucial to correlate these differences with cell viability. Are these differences due to cell-type sensitivity or variations in intracellular drug levels? Assessing cell viability and intracellular drug concentration for the same drugs and cells would provide valuable insights.

      One of the reasons for the reduced magnitude of the effects of selected drugs in DLD1 and HCT116 cells is their lower baseline normality scores compared to hTERT cells (now shown in Sup. Fig. 1B-C). Other potential factors include proteomic and metabolic shifts and alterations in signaling pathways that control ribosome production. The less-likely possibility of variations in intracellular drug levels cannot be excluded, but measuring this for every compound in every cell line was not feasible in this study. These limitations are now noted in the results section.

      Regarding the point about viability - our initial screen output, in addition to normality scores, included cell count (cumulative count of cells in all imaged fields), which serves as a proxy for viability. By this measure, all hit compounds in our screen were cytostatic or cytotoxic in RPE1 cells (Fig. 2C). The impact of these drugs on the viability of cancer cells that can have various degrees of addiction to ribosome biogenesis merits a separate study of a large cancer cell line panel.

      • Have the authors interpreted nucleolar stress as the primary cause of cell death induced by these drugs? When cells treated with CDK inhibitors exhibit the dissociated nucleoli phenotype, is this effect reversible? Is this phenotype indicative of cell death commitment? Conducting a washout experiment to measure the recovery of nucleolar function and cell viability would address these questions.

      Whether nucleolar toxicity is the primary cause of cytotoxicity for a given chemotherapy drug is an incisive and thought-provoking question. Our screen did not discern whether the cytotoxic effects of our hits were due to inhibition of their intended targets, their impact on the nucleolus, or a combined effect. This point is now mentioned in the results section. Regarding the reversibility of the nucleolar disassembly phenotype seen with CDK inhibitors: in the case of flavopiridol, which is a reversible CDK inhibitor, we demonstrated that nucleoli re-assembled within 4-6 hours after the drug was washed out. An example of this is shown in Sup. Figure 3 and in Video 5. For these experiments, cells were pretreated with the drug for 5 hours, not long enough to cause cell death.

      • The correlation between the loss of Treacle phosphorylation and nucleolar stress upon CDK inhibition is intriguing. However, it remains unclear how these two events are related. Would Treacle knockdown yield the same nucleolar phenotype as CDK inhibition? Moreover, would point mutations that abolish Treacle phosphorylation prevent its interaction with Pol-I? Experiments addressing these questions would enhance our understanding of the correlation/causation between Treacle phosphorylation and the effects of CDK inhibition on nucleolar stress.

      We agree that the Treacle finding is interesting and warrants further investigation. In our attempts to knock down Treacle with siRNA, its protein levels were reduced by no more than 50%, which was not sufficient to cause a strong nucleolar stress response. Therefore, these data were not incorporated into the manuscript. However, in our view, Treacle is unlikely to be the only nucleolar CDK substrate whose dephosphorylation is causing the “bare scaffold” phenotype caused by the transcriptional CDK inhibitors. Our phospho-proteomics studies identified multiple nucleolar CDK substrates with established roles in the formation of the nucleolus. For instance, the granular component protein Ki-67 was also dephosphorylated on multiple sites and dispersed throughout the nucleus (shown in Sup. Fig 4). Given that CDKs typically phosphorylate many substrates that can have multiple phosphorylation sites, identifying a sole protein or phosphorylation site responsible for nucleolar disassembly may be an unattainable target.

      Overall, this study is significant and novel as it sheds light on the importance of nucleolar stress in defining the on-target and off-target effects of chemotherapy in normal and cancer cells.

      Thank you, we appreciate the positive and constructive assessment of our study.

      Reviewer #2 (Public Review):

      This is an interesting study with high-quality imaging and quantitative data. The authors devise a robust quantitative parameter that is easily applicable to any experimental system. The drug screen data can potentially be helpful to the wider community studying nucleolar architecture and the effects of chemotherapy drugs. Additionally, the authors find Treacle phosphorylation as a potential link between CDK9 inhibition, rDNA transcription, and nucleolar stress. Therefore I think this would be of broad interest to researchers studying transcription, CDKs, nucleolus, and chemotherapy drug mechanisms. However, the study has several weaknesses in its current form as outlined below.

      1) Overall the study seems to suffer from a lack of focus. At first, it feels like a descriptive study aimed at characterizing the effect of chemotherapy drugs on the nucleolar state. But then the authors dive into the mechanism of CDK inhibition and then suddenly switch to studying biophysical properties of nucleolus using NPM1. Figure 6 does not enhance the story in any way; on the contrary, the findings from Fig. 6 are inconclusive and therefore could lead to some confusion.

      This study was specifically designed to examine a broad range of chemotherapy drugs. The newly created nucleolar normality score enabled us to measure nucleolar stress precisely and in high throughput. Our primary objective was to find drugs that disrupt the normal nucleolar morphology and then study in-depth the most interesting and novel hits. We have made revisions to emphasize that these are the primary focal points of the manuscript.

      As context, we were motivated to explore the biophysical properties of the nucleolus because they are thought to underlie its formation and function, which also suggested a potential predictive value for modeling nucleolar responses to drug treatments. For this, we edited the RPE1 cell line by endogenously tagging NPM1, a granular component protein that behaves in line with the phase-separation paradigm in vitro and when over-expressed. We fully expected to confirm that its behavior in vivo would be consistent with LLPS, but instead found that even in an untreated scenario, the dynamics of endogenous NPM1 could not be fully explained by the phase separation theory (Fig. 6 A-C). Our message is that accurately predicting drug responses using the nucleolar normality score as a readout, based on our current understanding of the biophysical forces governing nucleolar assembly, is unworkable. For instance, normality scores decrease and NPM1 dynamics increase radically when CDKs are inhibited, without changes in NPM1 concentration or concentrations of other protein components (Fig.6 E-H). These observations are important because they highlight our gaps in understanding the relative contribution of phase separation versus active assembly in nucleolar formation. We believe that these observations are worth sharing with the scientific community.

      2) The justification for pursuing CDK inhibitors is not clear. Some of the top hits in the screen were mTOR, PI3K, HSP90, Topoisomerases, but the authors fail to properly justify why they chose CDKi over other inhibitors.

      We decided to focus on CDK inhibitors for several reasons. First, their effects were completely new and unexpected, suggesting the existence of an unknown mechanism regulating nucleolar structure and function. In addition, CDK inhibitors caused a very strong and distinct nucleolar stress phenotype with the lowest normality scores that merited its own term, the “bare scaffold” phenotype. One more reason for pursuing CDK-inhibiting drugs was their high rate of failure in clinics because of the intense and hard-to-explain toxicity. We suspect that this toxicity may be due at least in part to their profound effect on nucleolar organization and ribosome production throughout the body. We stated this rationale more explicitly in the manuscript.

      3) In addition to poor justification, it seems like a very superficial attempt at deciphering the mechanism of CDK9i-mediated nucleolar stress. I think the most interesting part of the study is the link between CDK9, Pol I transcription, and nucleolar stress. But the data presented is not entirely convincing. There are several important controls missing as detailed below.

      We agree with the reviewer that follow-up studies of CDK9, Pol I, and nucleolar stress connection are important long-term goals. However, the primary objective of this study was to ascertain the scope of anticancer agents that can cause nucleolar stress and the establishment of nucleolar stress categories. This is an important advance and could serve as the foundation for a standalone in-depth study or multiple studies. We have included the complete screen, proteomics, and phospho-proteomics results (Sup. Tables 1, 2, and 3), which will enable other investigators to mine the screen information based on their specific interests. Furthermore, we have made multiple text revisions to clarify rationale and interpretation, and incorporated additional data that strengthen the manuscript.

      4) The authors did not test if inhibition of CDK7 and/or CDK12 also induces nucleolar stress. CDK7 and CDK12 are also major kinases of RNAPII CTD, just like CDK9. Importantly, there are well-established inhibitors against both these kinases. It is not clear from the text whether these inhibitors were included in the screen library.

      Our anticancer compound library contained CDK7 inhibitor THZ1⦁2HCL, and it was a hit at both 1 and 10 uM concentrations (Sup. Table 1). However, its nucleolar stress phenotype was morphologically distinct from CDK9 inhibitors, resembling the stress caps phenotype instead of the bare scaffold phenotype. We did not pursue CDK7 because of its two hard-to-separate functions: in addition to its role as an RNAPII CTD kinase, it also acts as a CDK-activating kinase (CAK) by promoting the associations of multiple CDKs with their cyclin partners. This dual role of CDK7 makes the interpretation of THZ1-induced nucleolar stress phenotype difficult because it could be attributed to either or both of these functions. Moreover, it was reported to cause DNA damage, which may explain why it causes stress caps. An image depicting nucleolar stress phenotype caused by THZ1⦁2HCL is provided in Author response image 1.

      Author response image 1.

      Control and THZ1-treated RPE1 cells, images from screen plates.

      We are not aware of specific inhibitors of CDK12, as they also reportedly inhibit CDK13. None of the CDK12/CDK13 inhibitors were present in our library, therefore we can neither confirm nor exclude the possible involvement of these kinases in regulating nucleolar structure. Many other existing CDK inhibitors were absent from our library. Our work highlights the importance of assessing their potential to induce nucleolar stress and offers an approach for this assessment.

      5) In Figure 4E, the authors show that Pol I is reduced in nucleolus/on rDNA. The authors should include an orthogonal method like chromatin fractionation and/or ChIP

      We acknowledge the reviewer’s request for additional validation of reduced occupancy of rDNA by Pol I. Nucleolar chromatin fractionation in cells treated with CDK inhibitors is unlikely to work due to nearly complete nucleolar disassembly. Chromatin immunoprecipitation would require finding and validating a suitable ChIP-grade antibody. Moreover, the evaluation of repetitive regions by ChIP is non-trivial and error-prone. To help address this request and further confirm the POLR1A immunofluorescence results in 4E, we included additional immunofluorescence data obtained with a different POLR1A antibody (Sup. Fig. 3D), and the results were similar.

      6) In Fig. 5D, in vitro kinase lacks important controls. The authors should include S to A mutants of Treacle S1299A/S1301A to demonstrate that CDK9 phosphorylates these two residues specifically.

      7) To support their model, the authors should test if overexpression of Treacle mutants S1299A/S1301A can partially phenocopy the nucleolar stress seen upon CDK9 inhibition. This would considerably strengthen the authors' claim that reduced Treacle phosphorylation leads to Pol I dissociation from rDNA and consequently leads to nucleolar stress.

      8) Additionally, it would be interesting if S1299D/S1301D mutants could partially rescue CDK9 inhibition.

      Points (6-8):

      We reiterate that transcriptional CDKs target multiple nucleolar proteins, and the observed phenotype might be due to the combined effects of de-phosphorylation of multiple substrates. We concur that deconstructing the role of Treacle phosphorylation sites is very interesting and warrants further in-depth studies. The phospho-proteomics enrichment method, while an effective first-pass strategy, might not capture 100% of the phosphorylated sites. Treacle is a phospho-protein with an abundance of serine and threonine residues. It could potentially have been selectively dephosphorylated on more sites than were detected by this method. Therefore, the suggested mutations may not be the exclusive contributors responsible for the functional phenotype. Additionally, overexpressing Treacle impairs the viability of RPE1 cells, complicating the interpretation of experiments involving overexpression of both wild-type and mutant proteins. A conceivable strategy would involve generating phosphomimetic and non-phosphorylatable mutants by gene editing, studying their interactions by biochemical approaches, and determining their impact on nucleolar function, but this may take years of additional work. We hope that our work will inspire further studies that explore Treacle phosphorylation and other functions of transcriptional CDKs in nucleolar formation.

      Thank you for the thoughtful review and suggestions.

      Reviewer #2 (Recommendations For The Authors):

      1) The manuscript could be re-organized to focus on 'CDK9-Treacle-Pol I-nucleolar stress' as the central part of the story.

      While we acknowledge this suggestion, it's important to emphasize that the primary focus of this manuscript is on the identification of anticancer drugs that induce nucleolar stress and the establishment of nucleolar stress categories.

      2) Include a "no ATP" control in the in vitro kinase assay and indicate molecular sizes.

      We provided an additional kinase assay (Sup. Fig. 4B) that includes no-ATP control lanes and a fragment of a Coomassie blue-stained gel showing molecular weight markers. The no-ATP control assays (lanes 4 and 5) were blank, as expected. Molecular weight markers were added to all other kinase assays based on the known sizes of the isolated Pol II holoenzyme subunits Rpb1 (191 kDa) and Rpb2 (138 kDa).

      3) For in vitro phosphorylation, please provide an explanation for using CDK9/cyclin K instead of Cyclin T1 which is the predominant cyclin for CDK9

      Recombinant CDK9/cyclin K complex was used for in vitro kinase assays for a technical reason: CDK9/cyclin T obtained from the same vendor appeared to be of low quality, as it showed only minimal activity toward our positive control, the isolated Pol II complex. The kinase assays using recombinant CDK9/cyclin T in parallel with CDK9/cyclin K are now presented in Sup. Fig. 4B. The first two assays in this experiment contained Pol II as a substrate, and it is evident that Pol II was phosphorylated much more strongly by CDK9/cyclin K than by CDK9/cyclin T (comparing lane 1 vs lane 2). Therefore, the lack of detectable Treacle phosphorylation by CDK9/cyclin T (lane 7), in contrast to strong phosphorylation by CDK9/cyclin K (lane 6), was likely attributable to poor reagent quality rather than physiological differences. We can conclude that CDK9/cyclin K reliably phosphorylates Treacle in vitro, but CDK9/cyclin T kinase assays were inconclusive.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      Point 1: While the manuscript is methodologically sound, the following aspects of image acquisition and data analysis need to be clarified to ensure replicability and reproducibility. The authors state that the sample is a "population-derived adult lifespan sample", the lack of demographic information makes it impossible to know if the sample is truly representative. Though this may seem inconsequential, education may impact both cognitive performance and functional activation patterns. Moreover, the authors do not report race/ethnicity in the manuscript. This information is essential to ensure representativeness in the sample. It is imperative that barriers to study participation within minoritized groups are addressed to ensure rigor and reproducibility of findings.

      First, the section Methods-Participants has been updated to refer readers to a prior article where the sample’s demographics are broken down into nine decile age groups (see Wu et al. 2023 Table 1), including information about their education levels. Secondly, we have updated the Data Availability section text to indicate that all Cam-CAN IDs are included in the available OSF datasets, allowing anyone to verify additional participant demographics described in the Cam-CAN protocol article (Shafto et al., 2014). Third, we have updated the Participants section text to refer to another prior study that reported on the representativeness of the Cam-CAN sample indicating that at least some elements of the sample have been independently deemed as representative (e.g., Sex).

      Page-24

      “A healthy population-derived adult lifespan human sample (N = 223; ages approximately uniformly distributed from 19 - 87 years; females = 112; 50.2%) was collected as part of the Cam-CAN study (Stage 3 cohort; Shafto et al., 2014). Participants were fluent English speakers in good physical and mental health, based on the Cam-CAN cohort’s exclusion criteria which includes poor mini mental state examination, ineligibility for MRI and medical, psychiatric, hearing or visual problems. Throughout analyses, age is defined at the Home Interview (Stage 1; Shafto et al., 2014). The study was approved by the Cambridgeshire 2 (now East of England–Cambridge Central) Research Ethics Committee and participants provided informed written consent. Further demographic information of the sample is reported in Wu et al. (2023) and is openly available (see section Data Availability) with a recent report indicating the representativeness of the sample across sexes (Green et al., 2018).”

      Page-30

      “Raw and minimally pre-processed MRI (i.e., from automatic analysis; Taylor et al., 2017) and behavioural data are available by submitting a data request to Cam-CAN (https://camcan-archive.mrc-cbu.cam.ac.uk/dataaccess/). The univariate and multivariate ROI data, and behavioural data, can be downloaded from the Open Science Framework, which includes Cam-CAN participant identifiers allowing the retrieval of any additional demographic data (https://osf.io/v7kmh), while the analysis code is available on GitHub.”

      Point 2: For the whole-brain analysis in which the ROIs were derived, the authors used a threshold-free cluster enhancement (TFCE; Smith & Nichols 2009). The methodological paper cited suggests that individuals' TFCE image should still be corrected for multiple comparisons using the following: "to correct for multiple comparisons, one [...] has to build up the null distribution (across permutations of the input data) of the maximum (across voxels) TFCE score, and then test the actual TFCE image against that. Once the 95th percentile in the null distribution is found then the TFCE image is simply thresholded at this level to give inference at the p < 0.05 (corrected) level." (Smith & Nichols, 2009). Although the authors mention that clusters were estimated using 2000 permutations, there is no mention of the TFCE image itself being thresholded. While this would impact the overall size of the ROIs used in the study, the remaining analyses are methodologically sound.

      We have updated the text to detail the t=1.97 (i.e., p = .05) threshold we applied before interpretation of the resultant TFCE images to the section: Experimental Design & Statistical Analysis. This threshold value can also be verified in the analytics code that is referenced on GitHub from the section Data Availability within the requisite toolbox functions: https://github.com/kamentsvetanov/CommonalityAnalysis/blob/main/code/ca_vba_tfce_threshold.m#L24 and https://github.com/kamentsvetanov/CommonalityAnalysis/blob/main/code/external/ca_matlab_tfce_transform.m

      Page-30

      “For whole-brain voxelwise analyses, clusters were estimated using threshold-free cluster enhancement (TFCE; Smith & Nichols 2009) with 2000 permutations and the resulting images were thresholded at a t-statistic of 1.97 before interpretation.”
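
      As a rough illustration of the max-statistic correction quoted in Point 2 (Smith & Nichols, 2009), the sketch below builds a null distribution from the maximum score of each permuted map and thresholds an observed map at its 95th percentile. It assumes the TFCE-enhanced score maps have already been computed; the function names and toy values are hypothetical, and this is not the code in the GitHub toolbox referenced in the response.

```python
import math


def max_stat_threshold(null_maps, alpha=0.05):
    """Return the (1 - alpha) quantile of the per-permutation maximum score.

    null_maps: one list of voxel scores per permutation (assumed precomputed).
    """
    maxima = sorted(max(scores) for scores in null_maps)
    # Smallest order statistic with at least (1 - alpha) of the maxima at or below it.
    k = math.ceil((1.0 - alpha) * len(maxima)) - 1
    return maxima[max(k, 0)]


def threshold_map(observed, threshold):
    """Zero out voxels whose score does not exceed the corrected threshold."""
    return [score if score > threshold else 0.0 for score in observed]


# Hypothetical toy data: 20 permutations of a 3-voxel map.
null_maps = [[float(i), float(i) / 2, 0.0] for i in range(1, 21)]
threshold = max_stat_threshold(null_maps, alpha=0.05)

observed = [25.0, 10.0, 19.5]
surviving = threshold_map(observed, threshold)
```

      Because the null distribution is built from the maximum across voxels, a single threshold controls the family-wise error rate over the whole map, which is the property the reviewer's quotation describes.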

      Point 3: The authors should consider moving the ROI section to results. The way the manuscript currently reads, the ROIs seem to be derived a priori as opposed to being derived from activation maps in the current study.

      After consideration of this point, we have decided to leave the methodological details regarding the definition of ROIs in the methods, to maintain the focus of the Results section. However, we have improved signposting in the results section to highlight that the ROIs were derived from the overlapped activation maps.

      Page-8

      “Crucially, two areas of the brain showed spatially-overlapping positive effects of age and performance, which is suggestive of an age-related compensatory response (Figure 2A yellow intersection). These were in bilateral cuneal cortex (Figure 2B magenta) and bilateral frontal cortex (Figure 2B brown), the latter incorporating parts of the middle frontal gyri and anterior cingulate. Therefore, based on traditional univariate analyses, these are two candidate regions for age-related functional compensation (Cabeza et al. 2013; 2018). Accordingly, we defined regions of interest within these two regions using the overlap activation maps (see section: ROIs) to be used for subsequent univariate and multivariate analysis.”

      Point 4: The manuscript can be strengthened by explaining why the authors chose a greedy search algorithm over a dynamic Bayesian model.

      The text is updated to refer to appropriateness of the computationally efficient greedy search implementation, due to the size of the fMRI cohort dataset.

      Page-28

      “The pattern weights specifying the mapping of data features to the target variable are optimized with a greedy search algorithm using a standard variational scheme (Friston et al., 2007) which was particularly appropriate given the large dataset.”

      Reviewer #2:

      Point 1: However, it might have been nice to see an analysis of a more crystallised intelligence task included too, as a contrast since this is an area that does not demonstrate such a decline (and perhaps continues to improve over aging).

      We (Samu et al., 2017) have previously investigated, but failed to find, univariate evidence for functional compensation in this cohort’s performance on a sentence comprehension task that is more closely aligned to a measure of crystallised intelligence. Based on the additional previous studies where we have applied these types of univariate and multivariate criteria of functional compensation (Morcom & Henson, 2018; Knights et al., 2021), we have consistently observed that the uni-/multivariate effects are in the same direction. Therefore, we would not initially expect a different conclusion here, where the univariate and multivariate effects suggest different outcomes. Notably, the univariate analysis approach in Samu et al. (2017) did differ from focusing on the age x behaviour interaction term here, so it could still be worth future investigation, but it does seem less likely that evidence of compensation would be observed than for fluid intelligence. However, as the Reviewer suggests, such a task may make another good contrast to show evidence against the existence of functional compensation (as in Morcom & Henson, 2018; Knights et al., 2021).

      Point 2: Figure 1B: Consider adding coefficients describing relationships to plots.

      Annotations of the coefficients have been added to Figure 1B.

      Point 3: Figure 2C. The scale of the axis for RSFA-Scales cuneal cortex ROI activations should be the same as the other 3 plots.

      Figure axes are updated such that ROIs are on matching scales, according to whether data were RSFA-scaled or not.

      Point 4: Figure 2C. Adding in the age ranges for each of the three groups following the tertile split may be informative to the reader.

      The age group tertile definition used for Figure 2C visualisations is now added to the Figure description.

      Page-10

      “Figure 2. Univariate analysis. (A) Whole-brain effects of age and performance. Age (green) and performance (red) positively predicted unique aspects of increased task activation, with their spatial overlap (yellow) being overlaid on a template MNI brain, using p < 0.05 TFCE. (B) Intersection ROIs. A bilateral cuneal (magenta) and frontal cortex (brown) ROI were defined from voxels that showed a positive and unique effect of both age and performance (yellow map in Figure 2A). (C) ROI Activation. Activation (raw = left; RSFA-scaled = right) is plotted against behavioural performance based on a tertile split between three age groups (19-44, 45-63 & 64-87 years).”

      Reviewer #3:

      Point 1: [Public Review] 1) I don't quite follow the argumentation that compensatory recruitment would need to show via non-redundant information carried by any given non-MDN region (cf. p14). Wouldn't the fact that a non-MDN region carries task-related information be sufficient to infer that it is involved in the task and, if activated increasingly with increasing age, that its stronger recruitment reflects compensation, rather than inefficiency or dedifferentiation? Put differently, wouldn't "more of the same" in an additional region suffice to qualify as compensation, as compared to the "additional information in an additional region" requirement set by the authors? As a consequence, in my honest opinion, showing that decoding task difficulty from non-MDN ROIs works better with higher age would already count as evidence for compensation, rather than asking for age-related increases in decoding boosts obtained from adding such ROIs. It would be interesting to see whether the arguably redundant frontal ROI would satisfy this less demanding criterion. At any rate, it seems useful to show whether the difference in log evidence for the real vs. shuffled models is also related to age.

      We agree with the logic for conducting a weaker assessment of functional compensation whereby a brain region does not necessarily have to provide a unique contribution beyond that of the ordinarily activated task-relevant network. However, although non-unique recruitment is predicted by a compensation theory, it can also be explained by a nonspecific mechanism that recruits multiple regions in tandem. In contrast, unique additional recruitment is compatible with compensation but not with nonspecific recruitment. In this article, and those prior (Morcom & Henson, 2018; Knights et al. 2021), we have also deliberately avoided using the specific kind of analysis proposed (i.e., testing for an effect of age on differential log evidence) because these would involve applying statistical tests directly to the log evidence, a variable that is already a statistical test output.

      Nevertheless, temporarily putting these caveats aside, we did run the suggested test. Results from multiple regression showed that using log evidence from frontal cortex models still did not meet this less demanding criterion for functional compensation as there was an effect of age in the opposite direction to that expected by functional compensation: there was a significant negative effect of age (t(218) = -7.95, p < .001) indicating that as age increased, the difference in log evidence decreased. This effect is visualised below for transparency, but we preferred not to add this information to the article because we do not wish to encourage using this kind of analysis for the reason mentioned above. Thus, although our main multivariate test of interest is stringent, the additional step of mapping log evidence back to the boost-likelihood categories (e.g., boost vs. no difference to model performance) lends itself to the more appropriate logistic regression statistical approach.

      Author response image 1.

      Negative effect of age on MVB log evidence model outcomes for frontal cortex.

      A different approach that could be taken to assess a more lenient definition of functional compensation would be to analyse the effects of age on the spread of multivariate responses predicting task difficulty (i.e., standard deviation of fitted MVB voxel weights; also see Morcom & Henson, 2018; Knights et al., 2021) specifically from models that only include the candidate ‘compensation’ ROIs.

      Accordingly, these analyses and their discussion have been added to the article. To summarise, these analyses showed that (1) the frontal cortex still did not show evidence of functional compensation (i.e., there was a negative effect of age, as in Morcom & Henson, 2018) and (2) there was no effect of age for the cuneal ROI, implying that the original model comparison approach (i.e., Figure 2C in the manuscript now) can provide more sensitivity for detecting evidence of functional compensation (perhaps because of the importance of including task-relevant network responses when building decoding models).

      Page-15

      “As a final analysis, we also tested a more lenient definition of functional compensation, whereby the multivariate contribution from the “compensation ROI” does not necessarily need to be above and beyond that of the task-relevant network (Morcom & Henson, 2018; Knights et al., 2021). To do this, we again assessed whether age was associated with an increase in the spread (standard deviation) of the weights over voxels, for smaller models containing only the cuneal or frontal ROI. This tested whether increased age led to more voxels carrying substantial information about task difficulty, a pattern predicted by functional compensation (but also consistent with non-specific additional recruitment). In this case, the results of this test did not support functional compensation, as there was no effect detected for the cuneal cortex and even a negative effect of age for the frontal cortex where the spread of the information across voxels was lower for older age (Figure 3C; Table 2).”

      Page-21

      “The age- and performance-related activation in our frontal region satisfied the traditional univariate criteria for functional compensation, but our multivariate (MVB) model comparison analysis showed that additional multivariate information beyond that in the MDN was absent in this region, which is inconsistent with the strongest definition of compensation. In fact, the results from the spread analysis showed that as age increased, this frontal area processed less, rather than more, multivariate information about the cognitive outcome (Figure 3C) as previously observed in two (memory) tasks for a comparable ROI within the same Cam-CAN cohort (Morcom & Henson, 2018).”

      Page-24

      “This said, univariate criteria for functional compensation will continue to play a role in hypothesis testing. For instance, the over-additive interaction observed in the cuneal cortex - where the increase in activity with better performance is more pronounced in older adults - offers stronger evidence of compensation compared to the simple additive effect of age and performance observed in the frontal cortex (Figure 2C). So far, the two studies that have combined these rigorous univariate, behavioral and multivariate approaches to assess functional compensation (i.e., Knights et al., 2021; the present study) have generally found converging evidence regardless of the method used. However, it is important to note that the MVB approach uniquely shifts the focus from individual differences to the specific task-related information that compensatory neural activations are assumed to carry and provides a specific test of region- (or network-) unique information. With further studies, it may also be that multivariate approaches prove more sensitive for detecting compensation effects than when using mean responses over voxels (e.g., Friston et al., 1995) particularly since over-additive effects are challenging to observe because compensatory effects are typically ‘partial’ and do not fully restore function (for review see Scheller et al., 2014; Morcom & Johnson, 2015). Within the multivariate analysis options themselves, it is also interesting to highlight that the stringent MVB boost likelihood analysis could detect functional compensation unlike the more lenient analysis focusing on the spread of MVB voxel weights. This suggests the importance of including task-relevant network responses when building decoding models to assess compensation.”

      Page-32

      “Alongside the MVB boost analysis, we also included an additional measure using the spread (standard deviation) of voxel classification weights (Morcom & Henson, 2018). This measure indexes the absolute amplitude of voxel contributions to the task, reflecting the degree to which multiple voxels carry substantial task-related information. When related to age this can serve as a multivariate index of information distribution, unlike univariate analyses. However, it is worth highlighting that even if an ROI shows an effect of age on this spread measure, such an effect could instead be explained by a non-specific mechanism that represents the same information in tandem across multiple regions (rather than reflecting compensation) as seen previously (Knights et al., 2021; also see Morcom & Johnson, 2015). Thus, it is the MVB boost analysis that is the most compelling assessment of functional compensation because it can directly detect novel information representation.”
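
      The spread measure described above can be sketched as follows: take the standard deviation of each participant's fitted voxel weights, then relate that spread to age. This is a minimal illustration under stated assumptions. The weights here are hypothetical toy values rather than MVB output, the population standard deviation is an arbitrary choice, and a single-predictor slope stands in for the multiple regression reported in the manuscript.

```python
from statistics import mean, pstdev


def weight_spread(weights):
    """Spread of one participant's voxel weights (population SD, an assumption)."""
    return pstdev(weights)


def slope(x, y):
    """Least-squares slope of y on x for a single predictor."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical toy data: one list of fitted voxel weights per participant.
ages = [20, 50, 80]
voxel_weights = [
    [-2.0, 0.0, 2.0],
    [-1.0, 0.0, 1.0],
    [-0.5, 0.0, 0.5],
]

spreads = [weight_spread(w) for w in voxel_weights]
age_slope = slope(ages, spreads)  # negative for these toy values
```

      A positive slope would be the pattern predicted by compensation (more voxels carrying substantial information with age); the toy values were chosen to give a negative slope, the direction reported for the frontal ROI.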

      Point 2: [Public Review] 2) Relatedly, does the observed boost in decoding by adding the cuneal ROI (in older adults) really reflect "additional, non-redundant" information carried by this ROI? Or could it be that this boost is just a statistical phenomenon that is obtained because the cuneus just happens to show a more clear-cut, less noisy difference in hard vs. easy task activation patterns than does the MDN (which itself may suffer from increased neural inefficiency in older age), and thus the cuneus improves decoding performance without containing additional (novel) pieces of information (but just more reliable ones)? If so, the compensation account could still be maintained by reference to the less demanding rationale for what constitutes compensation laid out above.

      We agree that this is a possibility and have added this as an additional explanation to the Discussion. We have also discussed why we think it is a less likely possibility, but do concede that it cannot be ruled out currently.

      Page-20

      “Another possibility is that the age-related increases in fMRI activations (for hard versus easy) in one or both of our ROIs do not reflect greater fMRI signal for hard problems in older than younger people, but rather lower fMRI signal for easy problems in the older. Without a third baseline condition, we cannot distinguish these two possibilities in our data. However, a reduced “baseline” level of fMRI signal (e.g., for easy problems) in older people is consistent with other studies showing an age-related decline in baseline perfusion levels, coupled with preserved capacity of cerebrovascular reactivity to meet metabolic demands of neuronal activity at higher cognitive load (Calautti et al., 2001; Jennings et al., 2005). Though age-related decline in baseline perfusion occurs in the cuneal cortex (Tsvetanov et al., 2021), the brain regions showing modulation of behaviourally-relevant Cattell fMRI activity by perfusion levels did not include the cuneal cortex (Wu et al., 2023). This suggests that the compensatory effects in the cuneus are unlikely to be explained by age-related hypo-perfusion, consistent with the minimal effect here of adjusting for RSFA (Figure 2C).

      One final possibility is whether the observed boost in decoding from adding the cuneal ROI simply reflects less noisy task-related information (i.e., a better signal-to-noise ratio (SNR)) than the MDN and, consequently, the boosted decoding is the result of more resilient patterns of information (rather than the representation of additional information) based on a steeper age-related decline of SNR in the MDN. Overall then, as none of the explanations above agree with all aspects of the results, to functionally explain the role of the cuneal cortex in this task would require further investigation.”

      Point 3: [Public Review] 3) On page 21, the authors state that "...traditional univariate criteria alone are not sufficient for identifying functional compensation." To me, this conclusion is quite bold as I'd think that this depends on the univariate criterion used. For instance, it could be argued that compensation should be more clearly indicated by an over-additive interaction as observed for the relationship of cuneal activity with age and performance (i.e., the activity increase with better performance becomes stronger with age), rather than by an additive effect of age and performance as observed for the prefrontal ROI (see Fig. 2C). In any case, I'd appreciate it if the authors discussed this issue and the relationship between univariate and multivariate results in more detail (e.g. how differences in sensitivity between the two approaches may have contributed), in particular since the sophisticated multivariate approach used here is not widely established in the field yet.

      We have now considered this point further in a section of the Discussion (which is merged with points 1 & 2 above) about the relevance and distinction of univariate / multivariate criteria for functional compensation. As described in text below, whilst we agree that univariate / behavioural approaches have a role in testing functional compensation, we still view the MVB boost analysis to be a particularly compelling approach for assessing this theory.

      Page-22

      “This said, univariate criteria for functional compensation will continue to play a role in hypothesis testing. For instance, the over-additive interaction observed in the cuneal cortex - where the increase in activity with better performance is more pronounced in older adults - offers evidence of compensation compared to the simple additive effect of age and performance observed in the frontal cortex (Figure 2C). However, the conclusions that can be drawn from age-related differences in cross-sectional associations of brain and behaviour are limited, mainly because individual performance differences are largely lifespan-stable (see Lindenberger et al., 2011; Morcom & Johnson, 2015). So far, the two studies that have combined these univariate-behavioral and multivariate approaches to assess functional compensation (i.e., Knights et al., 2021; the present study) have generally found converging evidence regardless of the method used. However, it is important to note that the MVB approach uniquely shifts the focus from individual differences to the specific task-related information that compensatory neural activations are assumed to carry. With further studies, it may also be that multivariate approaches prove more sensitive for detecting compensation effects than when using mean responses over voxels (e.g., Friston et al., 1995) particularly since over-additive effects are challenging to observe because compensatory effects are typically ‘partial’ and do not fully restore function. Within the multivariate analysis options themselves, it is also interesting to highlight that the stringent MVB boost likelihood analysis could detect functional compensation unlike the more lenient analysis focusing on the spread of MVB voxel weights. This suggests the importance of including task-relevant network responses when building decoding models to assess compensation.”

      Point 4: [Public Review] 4) As to the exclusion of poorly performing participants (see p24): If only based on the absolute number of errors, wouldn't you miss those who worked (overly) slowly but made few errors (possibly because of adjusting their speed-accuracy tradeoff)? Wouldn't it be reasonable to define a criterion based on the same performance measure (correct - incorrect) as used in the main behavioural analyses?

      This is a good point, though if we were to exclude participants using a chance level exclusion rate based on the formulae used for measuring behavioural performance, this removes identical subjects to those originally excluded. Based on this, the text has been updated to reflect this more parsimonious approach for defining exclusion criteria.

      Page-25

      “In a block design, participants completed eight 30-second blocks which contained a series of puzzles from one of two difficulty levels (i.e., four hard and four easy blocks completed in an alternating block order; Figure 1A). The fixed block time allowed participants to attempt as many trials as possible. Therefore, to balance speed and accuracy, behavioural performance was measured by subtracting the number of incorrect from correct trials and averaging over the hard and easy blocks independently (i.e., ((hard correct - hard incorrect) + (easy correct - easy incorrect))/2; Samu et al., 2017). For assessing reliability and validity, behavioural performance (total number of puzzles correct) was also collected from the same participants during a full version of the Cattell task (Scale 2 Form A) administered outside the scanner at Stage 2 of the Cam-CAN study (Shafto et al., 2014). Both the in- and out-of-scanner measures were z-scored. We excluded participants (N = 28; 17 females) who performed at chance level ((correct + incorrect) / incorrect < 0.5) on the fMRI task, leading to the same subset as reported in Samu et al. (2017).”
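
      The performance measure quoted above, ((hard correct - hard incorrect) + (easy correct - easy incorrect))/2 followed by z-scoring, can be sketched as below. This is an illustration only: the trial counts are hypothetical, the inputs are assumed to be per-condition values for each participant, and the use of the sample standard deviation for z-scoring is an assumption that the quoted text does not specify.

```python
from statistics import mean, stdev


def performance(hard_correct, hard_incorrect, easy_correct, easy_incorrect):
    """Average of (correct - incorrect) over the hard and easy conditions."""
    hard = hard_correct - hard_incorrect
    easy = easy_correct - easy_incorrect
    return (hard + easy) / 2


def z_score(values):
    """Standardise values to mean 0 and SD 1 (sample SD, an assumption)."""
    mu = mean(values)
    sd = stdev(values)
    return [(v - mu) / sd for v in values]


# Hypothetical counts per participant:
# (hard correct, hard incorrect, easy correct, easy incorrect)
participants = [
    (10, 2, 20, 1),
    (6, 4, 18, 2),
    (12, 0, 22, 0),
]

scores = [performance(*p) for p in participants]
z_scores = z_score(scores)
```

      Subtracting incorrect from correct trials penalises fast but inaccurate responding within the fixed block time, which is how the measure balances speed against accuracy.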

      Point 5: [Public Review] 5) Did the authors consider testing for negative relationships between performance and brain activity, given that there is some literature arguing that neural efficiency (i.e. less activation) is the hallmark of high intelligence (i.e. high performance levels in the Cattell task)? If that were true, at least for some regions, the set of ROIs putatively carrying task-related information could be expanded beyond that examined here. If no such regions were found, it would provide some evidence bearing on the neural efficiency hypothesis.

      No, we did not test for negative relationships between performance and brain activity in this study. However, in Wu et al. (2023) we did specifically test for this, and neither of the relevant results reported in section 3.3.1 (i.e., unique relationship between activity and performance) nor section 3.3.2 (i.e., age-related relationship between activity and performance) showed the queried direction of effects. Note that the negative effect in section 3.3.2 (Age U Performance) is a more unique suppression effect representing a positive relationship between performance and activity, where this becomes stronger as age is added to the model.

      Point 6: [Recommendations for the authors] 1) Page 26: It is not quite clear how the authors made sure their age and performance covariates functioned as independent regressors in the univariate group-level GLM, given the correlation between age and performance (i.e. shared variance).

      We included age and performance as covariates (of the age x performance effect of interest) by simply including these as independent regressors in the group-level GLM design matrix in addition to the interaction term (i.e., activity ~ age*performance + covariates equivalent to activity ~ age:performance + age + performance + covariates; Wilkinson & Roger 1973 notation), allowing us to examine the unique variance explained by each predictor (Table 1 and Table 2) and to control for their shared variance.

      We should note that while the GLM approach we used accounts for unique and shared effects, it does not explicitly report shared effects in its standard output. To directly examine shared variance, one would need to employ commonality analysis. For reference, results from a commonality analysis on this task have been previously reported in Wu et al. (2023).

      Prompted by this point, we have made some further minor improvements to help ensure our methodological steps are reproducible, as highlighted below.

      Page-30

      “Continuous age and behavioural performance variables were standardised and treated as linear predictors in multiple regression throughout the behavioural (Figure 1B), wholebrain voxelwise (Figure 1C/2A), univariate (Table 1; Figure 1B/2B) and MVB (Table 2; Figure 3) analyses. Throughout, sex was included as a covariate. The models, including interaction terms, can be described, according to Wilkinson & Roger’s (1973) notation, as activity ~ age * performance + covariates (which is equivalent to activity ~ age:performance + age + performance + covariates), allowing us to examine the unique variance explained by each predictor (Table 1) and to control for their shared variance. For whole-brain voxelwise analyses, clusters were estimated using threshold-free cluster enhancement (TFCE; Smith & Nichols 2009) with 2000 permutations and the resulting images were thresholded at a t-statistic of 1.97 before interpretation. Bonferroni correction was applied to a standard alpha = 0.05 based on the two ROIs (cuneal and frontal) that were examined. For Bayes Factors, interpretation criteria norms were drawn from Jarosz & Wiley (2014).”
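      To make the model notation concrete, the sketch below expands a Wilkinson & Rogers `a * b` term into its main effects and interaction, builds one row of the corresponding design matrix, and computes the Bonferroni-adjusted alpha for two ROIs. All function names are hypothetical, and the column order of the design row is an assumption made for illustration.

```python
def expand_star(a: str, b: str) -> list[str]:
    """Expand `a * b` into the equivalent terms `a`, `b`, and `a:b`."""
    return [a, b, f"{a}:{b}"]


def design_row(age_z: float, performance_z: float, sex: int) -> list[float]:
    """One design-matrix row: intercept, age, performance, age:performance, sex.

    `age_z` and `performance_z` are assumed to be standardised already;
    `sex` is a 0/1 covariate.
    """
    return [1.0, age_z, performance_z, age_z * performance_z, float(sex)]


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test alpha after Bonferroni correction."""
    return alpha / n_tests


terms = expand_star("age", "performance")
row = design_row(0.5, -2.0, 1)
roi_alpha = bonferroni_alpha(0.05, 2)  # two ROIs: cuneal and frontal
```

      Because the main effects sit in the same model as the interaction column, each coefficient reflects the variance unique to that predictor, which is the sense in which the shared variance is controlled for.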

      Point 7: [Recommendations for the authors] 2) Figure 3: I suggest changing the subheading in panel B to "Joint vs. MDN-only Model," in line with the wording in the main text.

      The subheading of Figure 3B is updated as suggested to `Joint vs. MDN-only Model`.

      Point 8: [Recommendations for the authors] 3) In Figures 1C and 2A, MNI z coordinates should be added to the section views. The appreciation of Figure 2B could be enhanced by adding some rendering with a sagittal (medial and/or lateral) view.

      The slice mosaics in Figure 1C and 2A are now updated with each slice’s MNI Z coordinates and mentioned in the figure descriptions.

      Point 9: [Recommendations for the authors] 4) Page 7 (l. 135): What exactly is meant by "lateral occipital temporal cortex"?

      The text is updated to specify the anatomical landmarks that were used for guidance when referring to activation within the lateral occipital temporal cortex, based on ROI criteria definitions used in Knights, Mansfield et al. (2021):

      Page-7 Line-135:

      “Additional activation was observed bilaterally in the inferior/ventral and lateral occipital temporal cortex (i.e., a cluster around the lateral occipital sulcus that extended anteriorly beyond the anterior occipital sulcus), likely due to the visual nature of the task.”

      Point 10: [Recommendations for the authors] 5) On p18ff. (ll. 259-318) the authors discuss in quite some detail how the age-related decoding boost seen with the cuneus ROI can be functionally explained, but it seems like none of the explanations agrees with all aspects of the results. While this is not a major problem for the paper, it may be advisable if this part of the discussion ends with a clearer statement that this issue is not fully solved yet and provides material for future research.

      A more direct sentence has been added to make it clear that future investigation will be needed to explain the role of the cuneal cortex here.

      Page-20 Line-322:

      “Another possibility is that the age-related increases in fMRI activations (for hard versus easy) in one or both of our ROIs do not reflect greater fMRI signal for hard problems in older than younger people, but rather lower fMRI signal for easy problems in the older. Without a third baseline condition, we cannot distinguish these two possibilities in our data. However, a reduced “baseline” level of fMRI signal (e.g., for easy problems) in older people is consistent with other studies showing an age-related decline in baseline perfusion levels, coupled with preserved capacity of cerebrovascular reactivity to meet metabolic demands of neuronal activity at higher cognitive load  (Calautti et al., 2001; Jennings et al., 2005). Though age-related decline in baseline perfusion occurs in the cuneal cortex (Tsvetanov et al., 2021), the brain regions showing modulation of behaviourally-relevant Cattell fMRI activity by perfusion levels did not include the cuneal cortex (Wu et al., 2021). This suggests that the compensatory effects in the cuneus are unlikely to be explained by age-related hypo-perfusion, consistent with the minimal effect here of adjusting for RSFA (Figure 2C). Overall then, as none of the explanations above agree with all aspects of the results, to functionally explain the role of the cuneal cortex in this task will require further investigation.”

      Point 11: [Recommendations for the authors] 6) The threshold choice for Bayesian log evidence (> 3) should be motivated in some more detail, rather than just pointing to a book reference, as there is no established convention in the field, the choice may depend on the type of data and/or analysis, and a sizeable part of the readership may not be deeply familiar with the particular Bayesian approach used here.

      Text is updated to further clarify our motivation for using the criterion of a log evidence difference > 3:

      Page-29

      “The outcome measure was the log evidence for each model (Morcom & Henson, 2018; Knights et al., 2021). To test whether activity from an ROI is compensatory, we used an ordinal boost measure (Morcom & Henson, 2018; Knights et al., 2021) to assess the contribution of that ROI for the decoding of task-relevant information (Figure 3B). Specifically, Bayesian model comparison assessed whether a model that contains activity patterns from a compensatory ROI and the MDN (i.e., a joint model) boosted the prediction of task-relevant information relative to a model containing the MDN only. The compensatory hypothesis predicts that the likelihood of a boost to model decoding will increase with older age. The dependent measure, for each participant, was a categorical recoding of the relative model evidence to indicate the outcome of the model comparison. The three possible outcomes were: a boost to model evidence for the joint vs. MDN-only model (difference in log evidence > 3), ambiguous evidence for the two models (difference in log evidence between -3 and 3), or a reduction in evidence for the joint vs. MDN-only model (difference in log evidence < -3). These values were selected because a log difference of three corresponds to a Bayes Factor of approximately 20, which is generally considered strong evidence (Lee & Wagenmakers, 2014). Further, with uniform priors, this chosen criterion (difference in log evidence > 3) corresponds to a p-value of p<~.05 (since the natural logarithm of 20 is approximately three, as evidence for the alternative hypothesis).”
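      The categorical recoding described above can be sketched as follows. This is an illustrative reading of the three-outcome rule, not the authors' code; the function names and the example log-evidence values are hypothetical.

```python
import math

LOG_EVIDENCE_THRESHOLD = 3.0  # difference in log evidence treated as decisive


def classify_boost(log_evidence_joint: float, log_evidence_mdn_only: float,
                   threshold: float = LOG_EVIDENCE_THRESHOLD) -> str:
    """Recode the joint vs. MDN-only comparison into one of three outcomes."""
    difference = log_evidence_joint - log_evidence_mdn_only
    if difference > threshold:
        return "boost"
    if difference < -threshold:
        return "reduction"
    return "ambiguous"


def bayes_factor(log_evidence_difference: float) -> float:
    """Bayes Factor implied by a difference in (natural) log evidence."""
    return math.exp(log_evidence_difference)


# Hypothetical log evidences for three participants.
outcomes = [
    classify_boost(-100.0, -104.5),  # joint model better by 4.5
    classify_boost(-100.0, -99.0),   # difference of -1.0
    classify_boost(-110.0, -100.0),  # joint model worse by 10
]
factor_at_threshold = bayes_factor(LOG_EVIDENCE_THRESHOLD)  # roughly 20
```

      The per-participant outcome can then serve as the ordinal dependent measure when testing whether the likelihood of a boost increases with age.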

      Point 12: [Recommendations for the authors] 7) Adding page numbers would be helpful.

      Page numbers have been added to the manuscript file – apologies for this oversight.

      References

      Green, E., Bennett, H., Brayne, C., & Matthews, F. E. (2018). Exploring patterns of response across the lifespan: The Cambridge Centre for Ageing and Neuroscience (Cam-CAN) study. BMC Public Health, 18, 1-7.

      Knights, E., Mansfield, C., Tonin, D., Saada, J., Smith, F. W., & Rossit, S. (2021). Hand-selective visual regions represent how to grasp 3D tools: brain decoding during real actions. Journal of Neuroscience, 41(24), 5263-5273.

      Samu, D., Campbell, K. L., Tsvetanov, K. A., Shafto, M. A., & Tyler, L. K. (2017). Preserved cognitive functions with age are determined by domain-dependent shifts in network responsivity. Nature Communications, 8(1), 14743.

      Shafto, M. A., Tyler, L. K., Dixon, M., Taylor, J. R., Rowe, J. B., Cusack, R., ... & Cam-CAN. (2014). The Cambridge Centre for Ageing and Neuroscience (Cam-CAN) study protocol: a cross-sectional, lifespan, multidisciplinary examination of healthy cognitive ageing. BMC Neurology, 14, 1-25.

      Wu, S., Tyler, L. K., Henson, R. N., Rowe, J. B., & Tsvetanov, K. A. (2023). Cerebral blood flow predicts multiple demand network activity and fluid intelligence across the adult lifespan. Neurobiology of Aging, 121, 1-14.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Reviewer):

      It is not clear from the analysis presented in the paper how persistent those environmentally induced changes are: do they remain with the bats until the end of their lives?

      Currently, the long-term effects of enrichment on the bats remain uncertain. Preliminary results suggest that these differences may persist throughout the bats’ lifetimes; however, further data analysis is ongoing to determine the extent of these effects. We now also address this point in the Discussion of the manuscript.

      Reviewer #2 (Public Reviewer):

      (1) Assessing personality metrics and the indoor paradigm: While I applaud this effort and think the metrics used are justified, I see a few issues in the results as they are currently presented:

      (a) [Major] I am somewhat concerned that here, the foraging box paradigm is being used for two somewhat conflicting purposes: (1) assessing innate personality and (2) measuring changes in personality as a result of experience. If the indoor foraging task is indeed meant to measure and reflect both at the same time, then perhaps this can be made more explicit throughout the manuscript. In this circumstance, I think the authors could place more emphasis on the fact that the task, at later trials/measurements, begins to take on the character of a "composite" measure of personality and experience.

      Personality traits should generally be stable over time, but personality can also somewhat change with experience. We used the foraging box to assess individual personality, but we also examined the assumption that what we are measuring is a proxy of personality and hence is stable over time. We now clarify this in the manuscript. 

      (b) [Major] Although you only refer to results obtained in trials 1 and 2 when trying to estimate "innate personality" effects, I am a little worried that the paradigm used to measure personality, i.e. the stable components of behavior, is itself affected by other factors such as age (in the case of activity, Fig. 1C3, S1C1-2), the environment (see data re trial 3), and experience outdoors (see data re trials 4/5).

      We found that boldness was the most consistent trait, showing persistence between trials 1 and 5, i.e., 144 days apart on average. We thus also used boldness as the primary parameter for assessing the effects of personality on outdoor behavior. While we evaluated other traits for completeness, boldness was the only one that consistently met the criteria for personality, which is why we focused on it in our analyses. The other traits, which were not stable over time, could be used to assess the effects of experience on behavior.

      Ideally, a study that aims to disentangle the role of predisposition from early-life experience would have a metric for predisposition that is relatively unchanging for individuals, which can stand as a baseline against a separate metric that reflects behavioral differences accumulated as a result of experience.

      I would find it more convincing that the foraging box paradigm can be used to measure personality if it could be shown that young bats' behavior was consistent across retests in the box paradigm prior to any environmental exposure across many baseline trials (i.e. more than 2), and that these "initial settings" were constant for individuals. I think it would be important to show that personality is consistent across baseline trials 1 and 2. This could be done, for example, by reproducing the plots in Fig. 1C1-3 while plotting trial 1 against trial 2. (I would note here that if a significant, positive correlation were to be found (as I would expect) between the measures across trial 1 and 2, it is likely that we would see the "habituation effect" the authors refer to expressed as a steep positive slope on the correlation line (indicating that bold individuals on trial 1 are much bolder on trial 2).)

      We agree and thus used boldness, which was found to be stable over five trials (three of which were without external experience). We note that even though boldness as we measured it increased over time, the differences between individuals remained similar, and this is what is expected from personality traits measured in the same paradigm several times (after the animal acquires experience).

      (c) Related to the previous point, it was not clear to me why the data from trial 2 (the second baseline trial) was not presented in the main body of the paper, and only data from trial 1 was used as a baseline.

      We added a main figure showing the correlation between the two baseline trials.

      In the supplementary figure and table, you show that the bats tended to exhibit more boldness and exploratory behavior, but fewer actions, in trial 2 as compared with trial 1. You explain that this may be due to habituation to the experimental setup, however, the precise motivation for excluding data from trial 2 from the primary analyses is not stated. I would strongly encourage the authors to include a comparison of the data between the baseline trials in their primary analysis (see above), combine the information from these trials to form a composite baseline against which further analyses are performed, or further justify the exclusion of data as a baseline.

      We had no intention of excluding data from baseline 2. As we have shown several times before (e.g., Harten, 2021), bats’ boldness as we measure it in the box experiment increases over sessions performed close together in time. This means that boldness in trial 2 was higher than in trials 1 and 3, which made the data less suitable for a linear model. Moreover, our measurement of boldness is capped (with a maximum of 1), again making it less suitable for a linear model. However, following the reviewer’s question, we have now run all analyses with trial 2’s data included; not only did the results remain the same, but some of the models fit better (based on the AIC criterion). We added this information to the revised manuscript.

      (2) Comparison of indoor behavioral measures and outdoor behavioral measures Regarding the final point in the results, correlation between indoor personality on Trial 4 and outdoor foraging behavior: It is not entirely clear to me what is being tested (neither the details of the tests nor the data or a figure are plotted). Given some of the strong trends in the data - namely, (1) how strongly early environment seems to affect outdoor behavior, (2) how strongly outdoor experience affects boldness, measured on indoor behavior (Fig. 1D) - I am not convinced that there is no relationship, as is stated here, between indoor and outdoor behavior. If this conclusion is made purely on the basis of a p-value, I would suggest revisiting this analysis.

      We agree that the relationship between indoor personality measures and outdoor foraging behavior is of great interest and had expected to find some correspondence between the two. To test this, we conducted multiple GLM analyses using the different indoor behavioral traits as predictors of outdoor behaviors. These analyses did not reveal any significant correlations. We also performed a separate analysis using PC1 (derived from the indoor behavioral variables) as a predictor, and again found no significant associations with outdoor behavior.

      We were indeed surprised by this outcome. It is possible that the behavioral traits we assessed indoors (boldness, exploration, and activity) do not fully capture the dimensions of behavior that are most relevant to foraging in the wild. For example, traits such as neophobia or decision-making under risk, which we did not assess directly, may have had stronger predictive value for outdoor behavior. We now highlight this point more clearly in the Discussion and acknowledge the possibility that alternative or additional personality traits might have revealed meaningful relationships.

      (3) Use of statistics/points regarding the generalized linear models While I think the implementation of the GLMM models is correct, I am not certain that the interpretation of the GLMM results is entirely correct for cases where multivariate regression has been performed (Tables 4s and S1, and possibly Table 3). (You do not present the exact equation you used for each model (this would be a helpful addition to the methods), therefore it is somewhat difficult to evaluate if the following critique properly applies, however...)

      The "estimate" for a fixed effect in a regression table gives the difference in the outcome variable for a 1 unit increase in the predictor variable (in the case of numeric predictors) or for each successive "level" or treatment (in the case of categorical variables), compared to the baseline, the intercept, which reflects the value of the outcome variable given by the combination of the first value/level of all predictors. Therefore, for example, in Table 4a - Time spend outside: the estimate for Bat sex: male indicates (I believe) the difference in time spent outside for an enriched male vs. an enriched female, not, as the authors seem to aim to explain, the effect of sex overall. Note that the interpretation of the first entry, Environmental condition: impoverished, is correct. I refer the authors to the section "Multiple treatments and interactions" on p. 11 of this guide to evaluating contrasts in G/LMMS: https://bbolker.github.io/mixedmodelsmisc/notes/contrasts.pdf

      We are not certain we fully understand the comment; however, if our understanding is correct, we respectfully disagree. A GLM analysis without interaction terms, as conducted in our study, functions as a multiple linear regression, wherein each factor's estimate reflects its individual effect on the dependent variable. For example, in the case of sex, it examines the effect of sex on the time spent outside independently of enrichment. An interaction term would be needed to test sex*enrichment. We have added the models’ formulas, and we hope this clarifies our approach.
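      The point under discussion can be illustrated with a small additive model fitted to toy data. The sketch below is a simplification under stated assumptions: it uses ordinary least squares rather than the GLMs reported in the manuscript, the data are invented, and all names are hypothetical. It shows that, without an interaction term, the treatment-coded estimate for sex is the same male-female difference at either level of the environmental condition.

```python
def solve_linear_system(a: list[list[float]], b: list[float]) -> list[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def fit_ols(rows: list[list[float]], y: list[float]) -> list[float]:
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Invented toy data: (impoverished, male, time_outside_hours).
observations = [
    (0, 0, 5.0), (0, 0, 5.4),
    (0, 1, 4.1), (0, 1, 3.9),
    (1, 0, 3.0), (1, 0, 3.2),
    (1, 1, 2.3), (1, 1, 1.7),
]
# Additive model: time ~ 1 + condition + sex (no interaction term).
design = [[1.0, float(imp), float(male)] for imp, male, _ in observations]
response = [t for _, _, t in observations]
intercept, b_impoverished, b_male = fit_ols(design, response)


def predict(impoverished: int, male: int) -> float:
    return intercept + b_impoverished * impoverished + b_male * male


sex_effect_enriched = predict(0, 1) - predict(0, 0)
sex_effect_impoverished = predict(1, 1) - predict(1, 0)
```

      In this additive setting the intercept still refers to the reference cell (enriched females), but the sex estimate is constrained to be identical across conditions; only an added sex-by-condition interaction would let it differ.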

      Reviewer #1 (Recommendations for the authors):

      I would recommend the following:

      (1) As video tracking and behavioral analysis software is widespread, it would be great to see this applied to the bats' indoor behavior to answer questions such as: how does the bat's velocity, heading, or acceleration correlate with the behavioral measures of boldness, activity, or exploration? In the same vein, can one infer boldness, activity, or exploration from measured bat velocity or other parameters? I think this will make the indoor behavior more quantitative.

      In a tent of the size used in our study, bats’ flight behavior tends to be highly stereotypical: they typically perch on the wall, take off, circle the tent—sometimes multiple times—and then either land or not, and enter or not. Flight velocity is largely determined by individual maneuverability and the physical constraints of the space; thus, precise tracking is unlikely to provide further insight into boldness. In contrast, decision-making behaviors—such as whether to land or enter—more accurately reflect personality traits, as we have shown previously (Harten et al., 2018). Moreover, accurate 3D tracking in such an environment is possible but definitely not easy due to the many blind-spots resulting from the cameras being inside the 3D volume.  Nonetheless, we quantified flight activity and assessed its correlation with the other behavioral axes. As it was highly correlated with general activity, we did not include it as an independent parameter in the main analysis. However, in response to the reviewer’s suggestion, we now present this analysis in the Supplementary Materials.

      (2) It is not clear whether the bats come from the same genetic background. They might, but it is not mentioned in the methods under the experimental subjects.

      We have shown in the past that there are no familial relations in a randomly caught sample of bats in the colony where we usually work (Harten et al., 2018). The bats were caught in three unrelated wild colonies. The text referring to the table was clarified in the revised manuscript.

      (3) It will be great to include the author's thoughts about mechanisms underlying those environmentally induced changes in behavior in the discussion section along with how this will affect the bats' social foraging abilities. Another question that comes to mind is whether growing up with a large number of bats constitute an enriched environment in itself.

      We agree that this could count as an enrichment, and we thus ensured similar group sizes in both groups for this reason. We clarify this in the revised manuscript. 

      We have elaborated on the underlying mechanisms in the discussion, focusing on how they contribute to behavioral changes.

      Reviewer #2 (Recommendations for the authors):

      (1) Outdoor foraging behavior

      If I understand correctly, the data you display in Fig. 3A is only from the 2nd to 3rd weeks of exploration, i.e. just before the first post-exploration trial.

      What does the data look like for the second outdoor exploration data, i.e. before the final trial?

      Is there a specific reason why these measures were only computed on the GPS data from the 3rd week outside? If so, can this sampling of the data be motivated or briefly addressed (in the methods and wherever else necessary)?

      To allow a comparison between individuals, we had to restrict ourselves to a period for which we had data from many individuals (some disappeared later on).

      Following the reviewer's suggestion, we added a supplementary figure including days 21-26.

      I would find it important and of great interest to see movement maps for more animals, as these give very rich information that is not entirely captured by the three proxies of outdoor activity.

      Are these four exemplary animals sampled from both seasons?

      Did you check to see if there were any overall differences in outdoor foraging behavior as a function of the season in which the bats were captured?

      Yes, the samples represent individuals from both tested years. This was clarified, and additional examples were included in a supplementary figure.

      Variable of time spent outdoors: You mention that you did not include the nights that the bat spent in the colony in these calculations. Did you also look to see if 'the number of nights when the bats left the colony' predicted the bat's earlier enrichment treatment? This could also be interesting to consider.

      In response to the reviewer’s comment, we conducted an additional analysis to test whether the proportion of nights each bat spent foraging outside the roost was predicted by its earlier environmental condition (enriched vs. impoverished). We also examined whether sex or age influenced this variable. This analysis showed no significant effect of environmental condition, sex, or age on the proportion of nights spent foraging outside the roost.

      [Following on point 3 in public review...]

      When wishing to discuss the effect/significance of predictors overall, it is common to present the modelling results as an analysis of variance table. See, for example, the two-way anova section (p. 182) in the book Practical Regression and ANOVA using R: https://cran.r-project.org/doc/contrib/Faraway-PRA.pdf

      I think the output of passing the model object to an "anova" yields the table that you may be looking for, where the variance accounted for by a predictor is given overall, and not just relative to the first level of all predictors. Naturally, this information can be used in combination with the information provided by the raw model output presented in the paper.

      I assume you have done this analysis in R, but am not sure, as the statistical software used is not mentioned. There are several packages in R that allow users to quickly plot the graphical interaction of the parameters they use in models, which aids in interpreting results. It would be good to check results of model fitting in this manner.

      Relatedly, I was unable to locate the data and code for this paper using the DOI provided. Neither searching the internet using the doi nor entering the doi on the Mendeley Data website returned the right results. I tried searching Mendeley Data using the senior author's last name, but the most recent entry does not appear to be from this paper. https://data.mendeley.com/datasets/fr48bmnhxj/1

      We thank the reviewer for the helpful comment. The analysis was indeed conducted in MATLAB, and this has now been clarified in the manuscript. We have also revised the result tables to improve clarity and included the exact formulas used for each model. Regarding the data availability, the reviewer is correct — the dataset had not yet been published at the time of submission. It is now available at the provided DOI link.

      ### Suggestions and questions for the present paper, grouped thematically:

      [Major] Expansion and development of results: I thought there were many interesting and suggestive points in this data that could be expanded upon. I mention some of these here. While the authors of course do not need to implement all of these suggestions, I think the paper would benefit from a more substantial presentation of this rich data set:

      (a) Individual differences as such are not emphasized in the paper so much, as the analyses, particularly those expressed as boxplots, are grouped. The scatter plots in Figure 1 give the richest insight into how individual behavior changes throughout the course of the experiment. I would advocate for the authors to show additional comparisons using such scatter plots (perhaps in the supplementary, if needed).

      We thank the reviewer and have added scatter plots to Figure 2.

      (b) In the second paragraph of the results, the authors introduce the concept of a pareto front and that of personality archetypes (lines 101-107). I found this very interesting, but these concepts were never reiterated upon later in the results or in the discussion. In fact, at many points, I found myself curious as to how the three indoor measures of personality might be combined to form a composite measure of personality (and likewise for outdoor measures). Have you tried to combine measures into a composite and tried to measure whether this composite metric provides any additional insight into these phenomena? For example, what if you mapped the starting position of each bat as a point in a three-dimensional space, given by the three personality measures, and then evaluated their trajectory through this space with measurements taken at later trials. Could innate personality be interpreted as the starting vector in this space (measured across the two baseline trials)? 

      Following the reviewer’s (justified) curiosity, we ran a PCA on the behavioral data from trials 1 and 5 and found a significant correlation between the individual scores on PC1. This can be thought of as a measurement that takes both boldness and exploration into account (the weight of activity was very low). We added this information to the revised manuscript and also use this new behavioral parameter as a measure of predisposition in the models (instead of exploration and activity).
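      A composite score of this kind can be sketched as below. This is only one plausible way to set up such an analysis and is not the authors' procedure: pooling both trials before extracting PC1, the power-iteration method, the invented trait values, and all function names are assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def standardise_columns(rows: list[list[float]]) -> list[list[float]]:
    """Z-score each trait (column) across all rows."""
    stats = [(mean(col), stdev(col)) for col in zip(*rows)]
    return [[(x - m) / s for x, (m, s) in zip(row, stats)] for row in rows]


def first_principal_axis(rows: list[list[float]], iterations: int = 200) -> list[float]:
    """Leading eigenvector of the covariance matrix, found by power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    n, k = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(k)]
           for i in range(k)]
    axis = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        product = [sum(cov[i][j] * axis[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in product))
        axis = [x / norm for x in product]
    return axis


def pearson(x: list[float], y: list[float]) -> float:
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


# Invented values for six bats: [boldness, exploration, activity].
trial_1 = [[0.2, 0.3, 5], [0.4, 0.5, 3], [0.6, 0.6, 6],
           [0.8, 0.9, 4], [0.5, 0.4, 5], [0.9, 0.8, 3]]
trial_5 = [[0.3, 0.4, 4], [0.5, 0.5, 5], [0.7, 0.8, 3],
           [0.9, 1.0, 6], [0.6, 0.6, 4], [1.0, 0.9, 5]]

pooled = standardise_columns(trial_1 + trial_5)
axis = first_principal_axis(pooled)
scores = [sum(v * w for v, w in zip(row, axis)) for row in pooled]
scores_trial_1, scores_trial_5 = scores[:6], scores[6:]
consistency = pearson(scores_trial_1, scores_trial_5)
```

      A high correlation between an individual's PC1 scores in the two trials would indicate that the composite axis is stable over time, which is the property required of a personality measure.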

      Could environmental exposure be quantified as a warping of the trajectory through this space? Finally, could outdoor experience also be incorporated to evaluate how an individual arrives at its final measurement of personality combined with experience (trial 5)?

      The paper currently tries to explain outdoors behavior given personality and not vice versa. While this is a very interesting suggestion, we feel that adding this analysis would make the premise of the paper less clear and since the paper is already somewhat complex, we prefer to leave this analysis for a future study. 

      Examining the 3D trajectories of the individuals through the personality space did not reveal any immediately clear pattern (triangles mark the first trial and colours depict the environmental treatment).

      Author response image 1.

      Related to this point: I think the strongest part of the paper is the result showing that bats exposed to enriched environments explore farther, more often, and over larger distances than bats that were raised in an impoverished environment.

      We completely agree and have tried to emphasize this further.

      (c) While these results of the outdoor GPS tracking are very clear, I wish that more information were extracted from the tracking data, which is incredibly rich and certainly can be used to derive many interesting parameters beyond those that the authors have shown here. Examples might include: distance travelled (as opposed to estimated km2 or farthest point), a metric of navigational ability (how much "dead reckoning" the animal engages in). I even wonder if the areas or landmarks visited by the enriched bats might be found to be more complex, challenging, or richer by some measure.

      This study was a first step, aiming to establish a connection between early exposure and outdoor foraging.

      We agree that many more analyses can be done and that those related to navigation capabilities are indeed missing. We are still collecting data on these bats and hope to present a more advanced analysis with a time span of years.

      (d) Related to the above point: I find it very interesting that in 3 of the 4 bats for which you show exemplary movement data (Fig. 3, panels B and C), they appear to travel to the farthest distances and cover the most ground early on, and become more "conservative" in their flight paths on later evenings. This point is not explored in the discussion, nor related to earlier measurements.

      During the first months of exploration, bats will occasionally perform long exploratory flights in between bouts of shorter flights where they return to nearby familiar trees. This behavior can be seen in more detail in Harten et al. (Science, 2020). We are currently quantifying this more carefully for another study.

      (e) Finally, my points about the possible strength of a composite measure of the three personality metrics is related to my concern about one of the conclusions, which is that innate personality does not have an effect on outdoor foraging behavior. I think the manner in which this was tested statistically is likely to bias the results against finding such a result given that personality metrics are used to predict outdoor behaviors in an individual manner (6 models in total, each examining a single comparison of predisposition to outdoor behavior), while both indoor personality metrics (Fig 1B) and outdoor behaviors appear to be correlated with each other (Table 5).

      Are there other analyses you have performed that are not presented in the paper and that have led you to conclude that there is no relationship here?

      We agree with the reviewer that our findings do not exclude an effect of innate personality on foraging but only suggest no such effect for the parameters we measured. That said, we did expect to find an effect of boldness because this parameter has been shown to differentiate much between groups (Harten et al., 2018), and to correlate with other parameters of behavior. We were therefore surprised to find no significant effects, as we had anticipated observing some differences.

      Following the reviewer’s previous comment we now also tested another predisposition parameter – the PC1 score and also found that it did not explain foraging. 

      (f) Personality measured before and after early environmental exposure (related to point (a) above): I find it interesting that the positive correlation in boldness between baseline and post-enrichment or baseline and post-release suggests that the individuals that were the most bold remained bold (and likewise for less adventurous individuals). The correlation for activity, too, still suggests that more active individuals early in life are likely to remain very active after enrichment, even accounting for the fact that activity is confounded with age.

      Perhaps you could place some emphasis on the fact that the initial variation between individuals also appears to be relatively stable over repeated trials. You might also consider measuring this directly (population variance over successive trials; relationship of population variance on indoor measures vs. outdoor measures...)

      Yes – this is a main point of interest. We further emphasize this in the revised manuscript.

      (g) Effect of indoor behavior following early experience on outdoor behavior: You evaluate the effect of predisposition (measured on baseline trial 1) and environmental condition on measures of outdoor activity (Table 4). I wonder if you also tried using indoor behavioral measures measured on the post-enrichment trial 3 to predict outdoor foraging behavior.

      Assuming that these measures are in fact reflecting a combination of predisposition and accumulated experience, then measurements at this closer time point may tell you how the combination of innate traits and early acquired experience affect behavior in the wild.

      We appreciate the reviewer’s insightful suggestion to test whether indoor behavior from post-enrichment Trial 3, reflecting both innate traits and experience, predicts outdoor foraging behavior. We conducted this analysis, but found that the boldness in Trial 3 did not significantly predict any of the outdoor activity measures.

      (2) [Minor] Age/development: While the authors discuss the effect of their manipulations on behavioral measures, they do not much discuss the effect of age.

      I think it would be important to include at some point a mention of the developmental stages of Rousettus, giving labels to certain age ranges, e.g. pup, juvenile, adult, and to provide more context about the stages at which bats were tested in the discussion. Presently, age is only really mentioned as an explanation for declining activity levels, but I wonder if it might also have an influence on boldness.

      It would also be very elegant, for figures where age is given in days, to additionally label them with these stages.

      All bats were juveniles during the trials (approximately 4 to 8 months old), so they could not be divided into distinct age groups. To assess the effect of age, it was included as a predictor (in days) in the GLM analysis.

      (3) [Major] Effect of early experience and outdoor experience on the indoor task: In the paragraph on lines 278-285, you argue that the effect of seeing early-enriched bats exhibit more boldness in trial 5 was likely due to post-sampling bias...

      I tend to disagree with this conclusion. I actually find this result both interesting and intuitive - that bats that were exposed to an enriched environment and have had experience in the wild, show much bolder activity on a familiar indoor foraging test (i.e. outside experience has made the animals bolder than before) (Fig 1, lines 159-161, Fig. S1). I did not notice this possibility mentioned in the discussion of the results.

      I also do not fully understand this argument. Could you please explain further?

      We accept the reviewer's comment and have updated the manuscript (lines 336-346), explaining the two hypotheses more clearly and arguing that it is difficult to tell them apart with the current data.

      [Minor] You also say that "this difference... can be seen in Figure 2 when examining only the bats that had remained until the last trial (Figure 2A2)." Do you mean supplementary Figure S1 A2? In fact, I am entirely unclear on what data is plotted in the supplementary Figure S1 and what differentiates the two columns of figures and the two models presented in the supplementary table. Did you plot data similar to that in Figure 2, with only bats that were present for all trials, but not show this data?

      There was a mistake: what was previously referred to as 2A2 is actually S2 A2.

      On the right side—only among the individuals with GPS data—the change is already evident at Baseline 2, where only the bolder individuals remain. If you have suggestions for a better analysis approach, we would be happy to hear them.

      ### Minor points

      General points regarding figures:

      For Figures 2 and 3A1-3 (as well as Fig. S1): Authors must show the raw data points over the box plots. It is very difficult to interpret the data and conclusions without being able to see the true distribution.

      Done

      For all figures showing grouped individual data, please annotate all panels or sets of boxplots with the number of bats whose data entered into each, as it is a little difficult to keep track of the changing sample sizes across experimental stages.

      In response to the reviewer’s request, we have added individual data points to all boxplots in the relevant figures, which makes it easier to track changes in sample size across experimental stages. While numerical annotations are not included on the figures, the exact number of bats contributing to each group is provided in the Methods section (Table 8), ensuring this information is readily accessible to readers.

      Unless I've missed the reason behind differences in axis labelling across the figures, it seems that trials are not always referred to consistently. E.g. Fig. 1 labels say "Trial 1 (baseline)" and Fig. 2 labels say "Baseline 1 0 days." I'm not entirely sure if these correspond to exactly the same data. If so, perhaps the labels can be made uniform. I think the descriptive ones (Baseline 1, Post-enrichment...) may be more helpful to the reader than providing the trial number (Trial 1, etc....).

      Done

      Figure 1:

      Very good Fig. 1A and 1B.

      For panels C1-3 & D, I think it would make it easier for the reader if the personality measure labels were placed at the top of each panel, e.g. "Boldness (entrance proportion)". The double axis labels are not only harder to read, they are also redundant, as the personality measure label repeats on both axes.

      Done

      Panel C1: For the first panel in this sequence, I think it would be elegant to include an annotation in the figure that indicates what the datapoints lying on either side of the dashed line means, i.e. "bolder after enrichment treatment" in the upper left corner, and "bolder before enrichment treatment" in the bottom right corner.

      Panel C2: It appears as though many of the data points in this panel overlap, and it appears to me that the blue data points in particular are overlaid by the orange ones. I am guessing this happens because proportion values based on entrances to only 6 boxes end up giving a more "discrete" looking distribution. I wonder if you can find a way to allow all the data to be visible by, e.g., jittering the data slightly; if there is rounding being done to the proportions, perhaps don't round them so that minute differences will allow them to escape the overlap; or possibly split the panel by enrichment treatment.

      Caption for C1-3: it may be helpful to mention the correlation line color scheme: "enriched (blue lines), the impoverished (orange lines)". The caption also says positive correlations were found for "both environments together," but this correlation line is not shown. Perhaps mention "(not shown)" or show line. Please rephrase the sentence "Dashed line represents the Y=X line." for more transparency and clarity. I understand you mean an "equality" or "unity" line, but perhaps you can explicitly state the information that this line provides, something like e.g. "Dashed line indicates equal values measured on both trials."

      We added the line for reference and corrected the caption.

      Figure 3:

      Panels B1-C2: I would suggest giving these panels supertitles that indicate that B panels are enriched, C panels are impoverished, and that each panel is data from a different individual.

      The legend was revised to describe the figure more clearly.

      General points regarding tables:

      Please revisit tables for formatting and typos, particularly in Table 4. Please also revise table captions for clarity. E.g. "first exploration as predisposition" to "Exploration (Baseline 1)" or similar

      Done

      Supplementary Tables and Figure: these are missing captions and explanations.

      The missing parts were added and corrected.

      Points of clarification/style:

      It would seem to me more logical to present the results shown in Table 3 before those in Table 2, given that the primary in-lab manipulation is discussed with relation to Table 3, and the analysis in Table 2 is discussed rather as a limitation (though I believe this result can be expanded upon further, see above).

      For the activity metric, I would suggest showing this data as actions/hour instead of actions/minute. I think it is much more intuitive to consider, for example, that a bat makes 2 actions every hour, than that it makes 0.002 actions per minute.

      Done

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The authors present a model for multisensory correlation detection that is based on the neurobiologically plausible Hassenstein-Reichardt detector. It modifies their previously reported model (Parise & Ernst, 2016) in two ways: a bandpass (rather than lowpass) filter is initially applied and the filtered signals are then squared. The study shows that this model can account for synchrony judgement, temporal order judgement, etc. in two new data sets (acquired in this study) and a range of previous data sets.

      Strengths:

      (1) The model goes beyond descriptive models such as cumulative Gaussians for TOJ and differences in cumulative Gaussians for SJ tasks by providing a mechanism that builds on the neurobiologically plausible Hassenstein-Reichardt detector.

      (2) This modified model can account for results from two new experiments that focus on the detection of correlated transients and frequency doubling. The model also accounts for several behavioural results from experiments including stochastic sequences of A/V events and sine wave modulations.

      Additional thoughts:

      (1) The model introduces two changes: bandpass filtering and squaring of the inputs. The authors emphasize that these changes allow the model to focus selectively on transient rather than sustained channels. But shouldn't the two changes be introduced separately? Transients may also be detected for signed signals.

      We updated the original model because our new psychophysical evidence demonstrates the fundamental role of unsigned transients for multisensory perception. While the original model received input from sustained unimodal channels (low-pass filters), the new version receives input from unsigned unimodal transient channels. Transient channels are normally modelled through bandpass filters (to remove the DC and high-frequency signal components) and squaring (to remove the sign). While these may appear as two separate changes in the model, they are, in fact, a single one: the substitution of sustained with unsigned transient channels (for a similar approach, see Stigliani et al. 2017, PNAS). Either change alone would not be sufficient to implement a transient channel that accounts for the present results.

      That said, we were also concerned with introducing too many changes in the model at once. Indeed, we simply modelled the unimodal transient channels as a single band-pass filter followed by squaring. This is already a stripped-down version of the unsigned transient detectors proposed by Adelson and Bergen in their classic Motion Energy model. The original model consisted of two biphasic temporal filters 90 degrees out of phase (i.e., quadrature filters), whose output is later combined. While a simpler implementation of the transient channels was sufficient in the present study, the full model may be necessary for other classes of stimuli (including speech, Parise, 2024, bioRxiv). Therefore, for completeness, we now include in the Supplementary Information a formal description of the full model, and validate it by simulating our two novel psychophysical studies. See Supplementary Information “The quadrature MCD model” section and Supplementary Figure S8.
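
      As a rough illustration of the "band-pass filter followed by squaring" front end described above, the following Python sketch feeds two unsigned transient channels into a Reichardt-style correlation stage. It is a minimal sketch under stated assumptions, not the authors' implementation: the first-order filter shapes, the time constants, the function names (`low_pass`, `unsigned_transient`, `mcd`, `step`) and the two read-outs (`corr`, `lag`) with their sign convention are all hypothetical choices made for illustration.

```python
def low_pass(signal, tau, dt):
    """First-order (leaky-integrator) low-pass filter; tau and dt in seconds."""
    alpha = dt / (tau + dt)
    state, out = 0.0, []
    for x in signal:
        state += alpha * (x - state)
        out.append(state)
    return out


def unsigned_transient(signal, dt, tau_fast=0.02, tau_slow=0.1):
    """Band-pass filter (difference of two low-pass filters) followed by squaring.

    The band-pass stage removes the sustained (DC) component; the squaring
    removes the sign, so equal-sized increments and decrements give the
    same response. Time constants are assumed values, not fitted ones.
    """
    fast = low_pass(signal, tau_fast, dt)
    slow = low_pass(signal, tau_slow, dt)
    return [(f - s) ** 2 for f, s in zip(fast, slow)]


def mcd(audio, video, dt, tau_delay=0.2):
    """Reichardt-style correlation of two unsigned transient channels.

    Returns (corr, lag). In this sketch, corr is larger when the transients
    coincide and lag is positive when audio leads video (assumed convention).
    """
    a = unsigned_transient(audio, dt)
    v = unsigned_transient(video, dt)
    a_slow = low_pass(a, tau_delay, dt)  # temporally smeared audio channel
    v_slow = low_pass(v, tau_delay, dt)  # temporally smeared video channel
    u1 = [x * y for x, y in zip(a_slow, v)]  # subunit favouring audio-first
    u2 = [x * y for x, y in zip(a, v_slow)]  # subunit favouring video-first
    n = len(u1)
    corr = sum(p * q for p, q in zip(u1, u2)) / n
    lag = sum(p - q for p, q in zip(u1, u2)) / n
    return corr, lag


def step(onset, dt, duration=2.0, amplitude=1.0):
    """Signal that is 0 before `onset` (seconds) and `amplitude` afterwards."""
    n = int(round(duration / dt))
    return [amplitude if i * dt >= onset else 0.0 for i in range(n)]
```

      For example, `mcd(step(0.5, 0.001), step(0.6, 0.001), 0.001)` compares an audio step at 0.5 s with a video step at 0.6 s; because of the squaring, flipping the polarity of either step leaves both read-outs unchanged, which is the property the response attributes to unsigned transient channels.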

      (2) Because the model is applied only to rather simple artificial signals, it remains unclear to what extent it can account for AV correlation detection for naturalistic signals. In particular, speech appears to rely on correlation detection of signed signals. Can this modified model account for SJ or TOJ judgments for naturalistic signals?

      It can. In a recent series of studies we have demonstrated that a population of spatially-tuned MCD units can account for audiovisual correlation detection for naturalistic stimuli, including speech (e.g. the McGurk Illusion). Once again, unsigned transients were sufficient to replicate a variety of previous findings. We have now extended the discussion to cover this recent research: Parise, C. V. (2024). Spatiotemporal models for multisensory integration. bioRxiv, 2023-12.

      Even Nidiffer et al. (2018), which is explicitly modelled by the authors, report a significant difference in performance for correlated and anti-correlated signals. This seems to disagree with the results of study 1 reported in the current paper and the model's predictions. How can these contradicting results be explained? If the brain detects correlation on signed and unsigned signals, is a more complex mechanism needed to arbitrate between those two?

      We believe the reviewer here refers to our Experiment 2 (where, like Nidiffer et al. (2018), we used periodic stimuli), not Experiment 1 (which consists of step stimuli). We were also puzzled by the difference between our Experiment 2 and Nidiffer et al. (2018): we induced frequency doubling, whereas Nidiffer et al. did not. Based on quantitative simulations, we concluded that this difference could be attributed to the fact that while Nidiffer included on each trial an intensity ramp in their periodic audiovisual stimuli, we did not. As a result, when considering the ramp (unlike in Nidiffer’s analyses), all audiovisual signals used by Nidiffer were positively correlated (irrespective of frequency and phase offset), while our signals in Experiment 2 were sometimes correlated and other times not (depending on the phase offset). This important simulation is included in Supplementary Figure S7; we also have now updated the text to better highlight the role of the pedestal in determining the direction of the correlation.

      (3) The number of parameters seems quite comparable for the authors' model and descriptive models (e.g. PSF models). This is because time constants require refitting (at least for some experimental data sets) and the correlation values need to be passed through a response mode (i.e. probit function) to account for behavioural data. It remains unclear how the brain adjusts the time constants to different sensory signals.

      This is a deep question. For simplicity, here the temporal constants were fitted to the empirical psychometric functions. To avoid overfitting, whenever possible we fitted such parameters over some training datasets, while trying to predict others. However, in some cases, it was necessary to fit the temporal constants to specific datasets. This may suggest that the temporal tuning of those units is not crystalised to some pre-defined values, but is adjusted based on recent perceptual history (e.g., the sequence of trials and stimuli participants are exposed to during the various experiments).

      For transparency, here we show how varying the tuning of the temporal constants of the filters affects the goodness of fit of our new psychophysical experiments (Supplementary Figure S8). As can be readily appreciated, the relative temporal tuning of the unimodal transient detectors was critical, though their absolute values could vary over a range of about 15 to over 100 ms. The tuning of the low-pass filters of the correlation detector (not shown here) displayed much lower temporal sensitivity over a range from 0.1 s to over 1 s.

      This simulation shows the impact of temporal tuning in our simulations; however, the question remains as to how such a tuning gets selected in the first place. An appealing explanation relies on natural scene statistics: units are temporally tuned to the most common audiovisual stimuli. Although our current empirical evidence does not allow us to quantitatively address this question, in previous simulations (see Parise & Ernst, 2016, Supplementary Figure 8), by analogy with visual motion adaptation, we show how the temporal constants of our model can dynamically adjust and adapt to recent perceptual history. We hope these new and previous simulations address the question about the nature of the temporal tuning of the MCD units.

      (4) Fujisaki and Nishida (2005, 2006) proposed mechanisms for AV correlation detection based on the Hassenstein-Reichardt motion detector (though not formalized as a computational model).

      This is correct: Fujisaki and Nishida (2005, 2007) also hypothesized that AV synchrony could be detected using a mechanism analogous to motion detection. Interestingly, however, they ruled out such a hypothesis, as their “data do not support the existence of specialized low-level audio-visual synchrony detectors”. Yet, along with our previous work (Parise & Ernst, 2016, where we explicitly modelled the experiments of Fujisaki and Nishida), the present simulations quantitatively demonstrate that a low-level AV synchrony detector is instead sufficient to account for audiovisual synchrony perception and correlation detection. We now credit Fujisaki and Nishida in the modelling section for proposing that AV synchrony can be detected by a cross-correlator.

      Finally, we believe the reviewer is referring to the 2005 and 2007 studies of Fujisaki and Nishida (not 2006); here are the full references of the two articles we are referring to:

      Fujisaki, W., & Nishida, S. Y. (2005). Temporal frequency characteristics of synchrony–asynchrony discrimination of audio-visual signals. Experimental Brain Research, 166, 455-464.

      Fujisaki, W., & Nishida, S. Y. (2007). Feature-based processing of audio-visual synchrony perception revealed by random pulse trains. Vision Research, 47(8), 1075-1093.

      Reviewer #2 (Public Review):

      Summary:

      This is an interesting and well-written manuscript that seeks to detail the performance of two human psychophysical experiments designed to look at the relative contributions of transient and sustained components of a multisensory (i.e., audiovisual) stimulus to their integration. The work is framed within the context of a model previously developed by the authors and is now somewhat revised to better incorporate the experimental findings. The major takeaway from the paper is that transient signals carry the vast majority of the information related to the integration of auditory and visual cues, and that the Multisensory Correlation Detector (MCD) model not only captures the results of the current study but is also highly effective in capturing the results of prior studies focused on temporal and causal judgments.

      Strengths:

      Overall the experimental design is sound and the analyses are well performed. The extension of the MCD model to better capture transients makes a great deal of sense in the current context, and it is very nice to see the model applied to a variety of previous studies.

      Weaknesses:

      My one major issue with the paper revolves around its significance. In the context of a temporal task(s), is it in any way surprising that the important information is carried by stimulus transients? Stated a bit differently, isn't all of the important information needed to solve the task embedded in the temporal dimension? I think the authors need to better address this issue to punch up the significance of their work.

      In hindsight, it may appear unsurprising that transient signals carry most information for audiovisual integration. Yet, somewhat unexpectedly, this has never been investigated using perhaps the most diagnostic psychophysical tools for perceived crossmodal timing, namely temporal order and simultaneity judgments, along with carefully designed experiments with quantitative predictions for the effect of either channel. The fact that the results conform to intuitive expectations further supports the value of the present work: it empirically grounds what is intuitively expected. This offers solid psychophysical evidence that one can build on for future advancements. Importantly, developing a model that builds on our new results and uses the same parameters to predict a variety of classic experiments in the field further supports the current approach.

      If “significance” is intended as shaking previous intuitions or theories, then no: this is not a significant contribution. If instead, by significance we intend to build a solid empirical and theoretical ground for future work, then we believe this study is not significant, it is foundational. We hope that this work's significance is better captured in our discussion.

      On a side note, there is an intriguing factor around transient vs. sustained channels: what matters is the amount of change, not the absolute stimulus intensity. Previous studies, for example, have suggested a positive cross modal mapping between auditory loudness and visual lightness or brightness [Odegaard et al., 2004]. This study, conversely, challenges this view and demonstrates that what matters for multisensory integration in time is not the intensity of a stimulus, but changes thereof.

      In a more minor comment, I think there also needs to be a bit more effort into articulating the biological plausibility/potential instantiations of this sustained versus transient dichotomy. As written, the paper suggests that these are different "channels" in sensory systems, when in reality many neurons (and neural circuits) carry both on the same lines.

      The reviewer is right, in our original manuscript we glossed over this aspect. We have now expanded the introduction to discuss their anatomical basis. However, we are not assuming any strict dichotomy between transient and sustained channels; rather, our results and simulations demonstrate that transient information is sufficient to account for audiovisual temporal integration.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Related to point 2 of the public review, can the authors provide additional results showing that the model can also account for naturalistic signals and more complex stochastic signals?

      While working on this manuscript, we were also working in parallel on a project related to audiovisual integration of naturalistic signals. A pre-print is available online [Parise, 2024, bioRxiv], and the related study is now discussed in the conclusions.

      (2) As noted in the public review, Fujisaki and Nishida (2005, 2006) already proposed mechanisms for AV correlation detection based on the Hassenstein-Reichardt motion detector. Their work should be referenced and discussed.

      We have now acknowledged the contribution of Fujisaki and Nishida in the modelling section, when we first introduce the link between our model and the Hassenstein-Reichardt detectors.

      (3) Experimental parameters: Was the phase shift manipulated in blocks? If yes, what about temporal recalibration?

      To minimise the effect of temporal recalibration, the order of trials in our experiments was randomised. Nonetheless, we can directly assess potential short-term recalibration effects by plotting our psychophysical responses against both the current SOA, and that of the previous trials. The resulting (raw) psychometric surfaces below are averaged across observers (and conditions for Experiment 1). In all our experiments, responses are obviously dependent on the current SOA (x-axis). However, the SOA of the previous trials (y-axis) does not seem to meaningfully affect simultaneity and temporal order judgments. The psychometric curves above the heatmaps represent the average psychometric functions (marginalized over the SOA of the previous trial).

      All in all, the present analyses demonstrate negligible temporal recalibration across trials, likely owing to the random sequence of lags or phase shifts. Therefore, when estimating the temporal constants of the model, it seems reasonable to ignore the potential effects of temporal recalibration. To avoid increasing the complexity of the present manuscript, we would prefer not to include the present analyses in the revised version.

      Author response image 1.

      Effect of previous trial. Psychometric surfaces for Experiments 1 and 2 plotted against the lag in the current vs. the previous trial. While psychophysical responses are strongly modulated by the lag in the current trial (horizontal axis), they are relatively unaffected by the lag in the previous trial (vertical axis).
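
      The previous-trial analysis described above amounts to binning each response by the SOA of the current trial and the SOA of the trial before it. The Python sketch below shows one way to build such a psychometric surface and its marginal; it is an illustrative assumption about the bookkeeping, not the authors' analysis code, and the function names and the binary response coding are hypothetical.

```python
from collections import defaultdict


def previous_trial_surface(soas, responses):
    """Mean response for each (current SOA, previous SOA) pair.

    soas      -- SOA of every trial, in presentation order
    responses -- matching binary responses (e.g. 1 = "synchronous")
    The first trial is skipped because it has no predecessor.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for i in range(1, len(soas)):
        cell = (soas[i], soas[i - 1])
        totals[cell] += responses[i]
        counts[cell] += 1
    return {cell: totals[cell] / counts[cell] for cell in counts}


def marginal_psychometric(soas, responses):
    """Mean response per current SOA, pooled over the previous-trial SOA."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for i in range(1, len(soas)):
        totals[soas[i]] += responses[i]
        counts[soas[i]] += 1
    return {soa: totals[soa] / counts[soa] for soa in counts}
```

      If responses depend only on the current SOA, every row of the surface (fixed previous SOA) should resemble the marginal function; systematic differences between rows would instead point to trial-by-trial recalibration.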

      (4) The model predicts no differences for experiment 1 and this is what is empirically observed. Can the authors support these null results with Bayes factors?

      This is a good suggestion: we have now added a Bayesian repeated measures ANOVA to the analyses of Experiment 1. As expected, these analyses provide further, though mild, evidence in support of the null hypothesis (see Table S2). For completeness, the new Bayesian analyses are presented alongside the previous frequentist ones in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors aim to consider the effects of phonotactics on the effectiveness of memory reactivation during sleep. They have created artificial words that are either typical or atypical and showed that reactivation improves memory for the latter but not the former.

      Comment 1:

      Strengths:

      This is an interesting design and a creative way of manipulating memory strength and typicality. In addition, the spectral analysis on both the wakefulness data and the sleep data is well done. The article is clearly written and provides a relevant and comprehensive overview of the literature and of how the results contribute to it.

      We thank the reviewer for his/her positive evaluation of our manuscript. 

      Comment 2:

      Weaknesses:

      (1) Unlike most research involving artificial language or language in general, the task engaged in this manuscript did not require (or test) learning of meaning or translation. Instead, the artificial words were arbitrarily categorised and memory was tested for that categorisation. This somewhat limits the interpretation of the results as they pertain to language science, and qualifies comparisons with other language-related sleep studies that the manuscript builds on.

      We thank the reviewer for this comment. We agree that we did not test for meaning or translation but used a categorization task in which we trained subjects to discriminate artificial words according to their reward associations (rewarded vs. non-rewarded). Previous language studies (Batterink et al., 2014; Batterink and Paller, 2017; Reber, 1967) used artificial words to investigate implicit learning of hidden grammar rules. In those studies, the researchers examined generalization of the previously learned grammar knowledge by testing subjects’ ability to correctly categorize a novel set of artificial words into rule-congruent versus rule-incongruent words. These differences from our study design might limit the comparability between the results of previous language studies of artificial grammar learning and our findings. We now discuss this aspect as a limitation of our novel paradigm.

      We added the following sentences to the discussion on p.14, ll. 481-488:

      Based on our paradigm, we investigated categorization learning of artificial words according to their reward associations (rewarded vs. unrewarded) and did not study aspects of generalization learning of artificial grammar rules (Batterink et al., 2014; Batterink and Paller, 2017; Reber, 1967). This difference might limit the comparability between these previous language-related studies and our findings. However, the use of artificial words with distinct phonotactical properties provided a successful way to manipulate learning difficulty and to investigate the influence of word properties on TMR, whereas our reward categorization learning paradigm had the advantage of increasing the relevance of word learning through incentives.

      Comment 3:

      (2) The details of the behavioural task are hard to understand as described in the manuscript. Specifically, I wasn't able to understand when words were to be responded to with the left or right button. What were the instructions? Were half of the words randomly paired with left and half with right and then half of each rewarded and half unrewarded? Or was the task to know if a word was rewarded or not and right/left responses reflected the participants' guesses as to the reward (yes/no)? Please explain this fully in the methods, but also briefly in the caption to Figure 1 (e.g., panel C) and in the Results section.

      We thank the reviewer for this comment and added sentences to the document to provide further explanation. We instructed the participants to respond to each word by left- and right-hand button presses, where one button means the word is rewarded and the other button means the word is unrewarded. The assignment of left- and right-hand button presses to their meanings (rewarded versus unrewarded) differed across subjects. In the beginning, they had to guess. Then, over trial repetitions with feedback at the end of each trial, they learned to respond correctly according to the rewarded/unrewarded associations of the words.

      We added the following sentences to the results section on p.5, ll. 161-168: 

      As a two-alternative forced-choice task, we assigned left- and right-hand button presses to the rewarded and the unrewarded word category, counterbalanced across subjects. We instructed the participants to respond to each word by left- or right-hand button presses, where one button means the word is rewarded (gain of money points) and the other button means the word is unrewarded (avoid the loss of money points). In the beginning, they had to guess. Through three presentations of each word in randomized order, with feedback at the end of each trial, they learned to respond correctly according to the rewarded/unrewarded associations of the words (Fig. 1c).

      We added the following sentences to the caption of Figure 1 on p.6, ll. 188-194:

      As a two-alternative forced-choice task, responses of left- and right-hand button presses were assigned to the rewarded and the unrewarded word category, respectively. The participants were instructed to respond to each word by left- or right-hand button presses, where one button means the word is rewarded (gain of money points) and the other button means the word is unrewarded (avoid the loss of money points). d) Feedback matrix with the four answer types (hits: rewarded and correct; CR, correct rejections: unrewarded and correct; misses: rewarded and incorrect; FA, false alarms: unrewarded and incorrect) with regard to the response and the reward assignment of the word.

      We added the following sentences to the methods on p.19, ll. 687-692:  

      As a two-alternative forced-choice task, we assigned left- and right-hand button presses to the rewarded and the unrewarded word category, counterbalanced across subjects. We instructed the participants to respond to each word by left- or right-hand button presses, where one button means the word is rewarded (gain of money points) and the other button means the word is unrewarded (avoid the loss of money points).

      Comment 4:  

      (3) Relatedly, it is unclear how reward or lack thereof would translate cleanly into a categorisation of hits/misses/correct rejections/false alarms, as explained in the text and shown in Figure 1D. If the item was of the non-rewarded class and the participant got it correct, they avoided loss. Why would that be considered a correct rejection, as the text suggests? It is no less of a hit than the rewarded-correct, it's just the trial was set up in a way that limits gains. This seems to mix together signal detection nomenclature (in which reward is uniform and there are two options, one of which is correct and one isn't) and loss-aversion types of studies (in which reward is different for two types of stimuli, but for each type you can have H/M/CR/FA separably). Again, it might all stem from me not understanding the task, but at the very least this required extended explanations. Once the authors address this, they should also update Fig 1D. This complexity makes the results relatively hard to interpret and the merit of the manuscript hard to assess. Unless there are strong hypotheses about reward's impact on memory (which, as far as I can see, are not at the core of the paper), there should be no difference in the manner in which the currently labelled "hits" and "CR" are deemed - both are correct memories. Treating them differently may have implications on the d', which is the main memory measure in the paper, and possibly on measures of decision bias that are used as well.

      We thank the reviewer for this comment, which gives us the opportunity to clarify. As explained in the previous comment, for our two-alternative forced-choice task, we instructed the participants to press one button when they thought the presented word was rewarded and the other button when they thought the word was unrewarded. Based on this instruction, we applied signal detection theory (SDT), because the subjects had the task to detect when reward was present or to reject when reward was absent. Therefore, we considered correct responses of words of the rewarded category as hits and words of the unrewarded category as correct rejections (see Table below). However, the reviewer is correct that, in addition to false alarms, we also punished incorrect responses by subtracting money points, to control for alternative task strategies of the participants other than learning the reward associations of the words. We agree that further explanation/argumentation to introduce our nomenclature is necessary.

      Author response table 1.

      We adjusted the results section on p.5, ll. 169-177:

      To obtain a measurement of discrimination memory with respect to the potential influence of the response bias, we applied signal detection theory (Green and Swets, 1966). Because we instructed the participants to respond to each word by left- or right-hand button presses, with one button meaning that reward is present and the other that reward is absent, we considered correct responses of words of the rewarded category as hits and words of the unrewarded category as correct rejections. Accordingly, we assigned the responses with regard to the reward associations of the words to the following four response types: hits (rewarded, correct); correct rejections (unrewarded, correct); misses (rewarded, incorrect); and false alarms (unrewarded, incorrect). Depending on their responses, subjects received money points (Fig. 1d).
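
      The four response types map onto the sensitivity (d') and bias (c-criterion) measures used in the analyses. The following is a minimal sketch of that computation, assuming the standard equal-variance signal detection definitions. The function name and the log-linear correction for extreme rates are illustrative assumptions and are not taken from the manuscript.

```python
from statistics import NormalDist


def sdt_measures(hits, misses, correct_rejections, false_alarms):
    """Return (d_prime, criterion) from the four response counts.

    Assumption: a log-linear correction (add 0.5 to each count, 1 to each
    total) keeps the rates away from 0 and 1, where the z-transform is
    undefined. The manuscript does not state which correction was used.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)

    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)
    # Under this sign convention, positive values indicate a tendency
    # to answer "unrewarded"; negative values a tendency to answer "rewarded".
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion
```

      With equal hit and false-alarm rates the sketch returns a d' of zero, and a participant who answers "rewarded" more often than "unrewarded" obtains a negative criterion under this sign convention.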

      Comment 5:

      (4) The study starts off with a sample size of N=39 but excludes 17 participants for some crucial analyses. This is a high number, and it's not entirely clear from the text whether exclusion criteria were pre-registered or decided upon before looking at the data. Having said that, some criteria seem very reasonable (e.g., excluding participants who were not fully exposed to words during sleep). It would still be helpful to see that the trend remains when including all participants who had sufficient exposure during sleep. Also, please carefully mention for each analysis what the N was.

      Our study was not pre-registered. Including all the subjects independent of low prememory performance, but with respect to a decent number of reactivations (> 160 reactivations, every word at least 2 times), resulted in a new dataset with 15 and 13 participants of the high- and low-PP cueing condition, respectively. Here, statistical analyses no longer revealed a significant overnight change in memory performance in the high-PP cueing condition (Δ memory (d'): t(14) = 1.67, p = 0.12), whereas the increase of the bias in decision making towards risk avoidance remained significant (Δ bias (c-criterion): t(14) = 3.36, p = 0.005).

      We modified and added the following sentences to the discussion on p.13, ll. 456-458:

      Our study has limitations due to a small sample size and between-subject comparisons. The criteria of data analyses were not pre-registered and the p-values of our behavior analyses were not corrected for multiple comparisons.

      Comment 6:             

      (5) Relatedly, the final N is low for a between-subjects study (N=11 per group). This is adequately mentioned as a limitation, but since it does qualify the results, it seemed important to mention it in the public review.

      We agree with the reviewer that the small sample size and the between subject comparisons represent major limitations of our study. Accordingly, we now discussed these limitations in more detail by adding alternative explanations and further suggestions for future research to overcome these limitations.        

      We added the following sentences to the discussion about the limitations on p.14, ll. 465-488: 

      To control for potential confounders despite the influence of difficulty in word learning on TMR, we compared parameters of sleep, the pre-sleep memory performance and the vigilance shortly before the post-sleep memory test, revealing no significant group differences (see Table S1 and S2). Nevertheless, we cannot rule out that other individual trait factors differed between the groups, such as the individual susceptibility to TMR. To rule out these alternative explanations based on individual factors, we suggest for future research to replicate our study by conducting a within-subject design with cueing of subsets of previously learned low- and high-PP words providing all conditions within the same individuals as shown in other TMR studies (Cairney et al., 2018; Schreiner and Rasch, 2015).

      Comment 7:

      (6) The linguistic statistics used for establishing the artificial words are all based on American English, and are therefore in misalignment with the spoken language of the participants (which was German). The authors should address this limitation and discuss possible differences between the languages. Also, if the authors checked whether participants were fluent in English they should report these results and possibly consider them in their analyses. In all fairness, the behavioural effects presented in Figure 2A are convincing, providing a valuable manipulation test.

      We thank the reviewer for pointing to the misalignment between the German-speaking participants and the used artificial words based on American English. Further, we did not assess the English language capability of the participants to control it as a potential confounder, whereas comparative control analyses revealed no significant differences between the two cueing groups in pre-sleep memory performance (see Table S1).

      We now discussed these comments as limitations on p.14, ll. 473-481: 

      Further, we used artificial words based on American English in combination with German-speaking participants, whereas language differences of pronunciation and phoneme structures might affect word perception and memory processing (Bohn and Best, 2012). On the other hand, both languages are considered to belong to the same language family (Eberhard et al., 2019) and the phonological distance between English and German is quite short compared, for example, to Korean (Luef and Resnik, 2023). Thus, major common phonological characteristics across both languages are still preserved. In addition, our behavior analyses revealed robust word discrimination learning and distinct memory performance according to different levels of phonotactic probabilities, providing evidence of successful experimental manipulation.

      Comment 8:

      (7) With regard to the higher probability of nested spindles for the high- vs low-PP cueing conditions, the authors should try and explore whether what the results show is a general increase for spindles altogether (as has been reported in the past to be correlated with TMR benefit and sleep more generally) or a specific increase in nested spindles (with no significant change in the absolute numbers of post-cue spindles). In both cases, the results would be interesting, but differentiating the two is necessary in order to make the claim that nesting is what increased rather than spindle density altogether, regardless of the SW phase.

      We conducted additional analyses based on detected sleep spindles to provide additional data according to this question. 

      We added the following section to the supplementary data on pp. 31-32, ll. 1007-1045:  

      After conducting a sleep spindle detection (frequency range of 12-16Hz, see methods for details), we compared the sleep spindle density between the TMR conditions of high- and low-PP showing no significant difference (see Fig. S8a and Table S9). Next, we subdivided the detected sleep spindles into coupled and uncoupled sleep spindles with the previously detected slow waves (SW; analyses of Fig. 4). Sleep spindles were defined as coupled when their amplitude peak occurred during the SW up-state phase (0.3 to 0.8s time-locked to the SW troughs). A two-way mixed design ANOVA on the amplitude size of the sleep spindles with the cueing group as a between-subject factor (high-PP-cued vs. low-PP-cued) and SW-coupling as a within-subject factor (coupled vs. uncoupled) showed a significant interaction effect (cueing group × SW-coupling: F(1,20) = 4.51, p = 0.046, η2 = 0.18), a significant main effect of SW-coupling (F(1,20) = 85.02, p < 0.001, η2 = 0.81), and a trend of significance of the main effect of the cueing group (F(1,20) = 3.54, p = 0.08). Post-hoc unpaired t-tests revealed a significantly higher amplitude size of the coupled sleep spindles of the cueing group of high- compared to low-PP (t(20) = 2.13, p = 0.046, Cohen’s d = 0.91; Fig. S8b) and no significant group difference of the uncoupled sleep spindles (t(20) = 1.62, p = 0.12). An additional comparison of the amount of coupled sleep spindles between the cueing groups revealed no significant difference (see Table S9).

      Here, we found that detected sleep spindles coupled to the SW up-state phase occurred with higher amplitude after TMR presentations of the high-PP words in comparison to the low-PP words, whereas the sleep spindle density and the amount of sleep spindles coupled to the SW up-state phase did not differ between the cueing conditions.

      We added the following sentences to the methods on pp. 22-23, ll. 822-839:  

      Sleep spindle analyses 

      We detected fast sleep spindles by band-pass filtering (12-16Hz) the signal of the Pz electrode during the auditory cueing trials in the time windows of -2 to 8s according to stimulus onsets. The amplitude threshold was calculated individually for each subject as 1.25 standard deviations (SDs) from the mean. The beginning and end times of the sleep spindles were then defined as the points at which the amplitude fell below 0.75 SDs before and after the detected sleep spindle. Only sleep spindles with a duration of 0.5-3 s were included in subsequent analyses. 
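
      The thresholding step described above can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be an amplitude envelope of the already band-pass filtered (12-16 Hz) Pz signal, "1.25 SDs from the mean" is read as mean + 1.25 SD of that envelope, and the function and parameter names are hypothetical. The filtering and envelope extraction themselves are not shown.

```python
from statistics import mean, pstdev


def detect_spindles(envelope, sfreq, detect_sd=1.25, edge_sd=0.75,
                    min_dur=0.5, max_dur=3.0):
    """Return a list of (start_s, end_s) tuples for candidate spindles.

    envelope: amplitude envelope of the 12-16 Hz filtered signal (assumed input)
    sfreq:    sampling rate in Hz
    """
    mu, sd = mean(envelope), pstdev(envelope)
    detect_thr = mu + detect_sd * sd  # a sample above this starts a detection
    edge_thr = mu + edge_sd * sd      # the event extends while above this

    events = []
    n = len(envelope)
    i = 0
    while i < n:
        if envelope[i] > detect_thr:
            # Walk outwards until the envelope falls below the edge threshold.
            start = i
            while start > 0 and envelope[start - 1] >= edge_thr:
                start -= 1
            end = i
            while end < n - 1 and envelope[end + 1] >= edge_thr:
                end += 1
            duration = (end - start + 1) / sfreq
            # Keep only events within the 0.5-3 s duration criterion.
            if min_dur <= duration <= max_dur:
                events.append((start / sfreq, (end + 1) / sfreq))
            i = end + 1
        else:
            i += 1
    return events
```

      Using a lower edge threshold than detection threshold lets a detected event include its rising and falling flanks, which is why the duration criterion is applied only after the event boundaries are found.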

      To compare the sleep spindle densities between the different cueing conditions of high- and low-PP, we computed the grand average sleep spindle density distribution in number per trial with a bin size of 0.5s from -0.5 to 6s time-locked to stimulus onset in each condition (see Fig. S8a and Table S9).     

      Based on the detected slow waves and sleep spindles, we defined coupling events when the positive amplitude peak of a detected sleep spindle occurred during the slow wave up-state phase in a time window of 0.3 to 0.8s according to the trough of a slow wave.
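
      The coupling definition can be expressed as a simple time-window check. The sketch below assumes that spindle peak times and slow-wave trough times are already available in seconds on a common time axis; the function names are hypothetical.

```python
def is_coupled(spindle_peak, sw_troughs, window=(0.3, 0.8)):
    """True if the spindle peak falls 0.3-0.8 s after any slow-wave trough."""
    lo, hi = window
    return any(lo <= spindle_peak - trough <= hi for trough in sw_troughs)


def split_by_coupling(spindle_peaks, sw_troughs, window=(0.3, 0.8)):
    """Partition spindle peak times into (coupled, uncoupled) lists."""
    coupled = [p for p in spindle_peaks if is_coupled(p, sw_troughs, window)]
    uncoupled = [p for p in spindle_peaks if not is_coupled(p, sw_troughs, window)]
    return coupled, uncoupled
```

      For a single trough at 10.0 s, a spindle peaking at 10.5 s counts as coupled, whereas peaks at 10.1 s or 12.0 s do not.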

      We computed the averaged amplitude size of each detected sleep spindle by calculating the mean of the absolute amplitude values of all negative and positive peaks within a detected sleep spindle (see Fig. S8b).

      We added the following sentences to the results on p.10, ll. 338-343:  

      By conducting additional analyses based on detection of fast sleep spindles (12-16Hz; see methods), we confirmed that fast sleep spindles during the SW up-states (from 0.3 to 0.8s after the SW trough) occurred with significantly higher amplitude after the cueing presentation of high- compared to low-PP words, whereas parameters of sleep spindle density and the amount of sleep spindles coupled to the SW up-state did not differ between the cueing conditions (see Fig. S8 and Table S9).

      Reviewer #2 (Public Review):

      Summary:

      The work by Klaassen & Rasch investigates the influence of word learning difficulty on sleep-associated consolidation and reactivation. They elicited reactivation during sleep by applying targeted memory reactivation (TMR) and manipulated word learning difficulty by creating words more similar (easy) or more dissimilar (difficult) to our language. In one group of participants, they applied TMR of easy words and in another group of participants, they applied TMR of difficult words (between-subjects design). They showed that TMR leads to higher memory benefits in the easy compared to the difficult word group. On a neural level, they showed an increase in spindle power (in the up-state of an evoked response) when easy words were presented during sleep.

      Comment 9:

      Strengths:

      The authors investigate a research question relevant to the field, that is, which experiences are actually consolidated during sleep. To address this question, they developed an innovative task and manipulated difficulty in an elegant way.

      Overall, the paper is clearly structured, and results and methods are described in an understandable way. The analysis approach is solid.

      We thank the reviewer for his/her positive evaluation of our manuscript.

      Weaknesses:

      Comment 10:

      (1) Sample size

      For a between-subjects design, the sample size is too small (N = 22). The main finding (also found in the title "Difficulty in artificial word learning impacts targeted memory reactivation") is based on an independent samples t-test with 11 participants/group.

      The authors explicitly mention the small sample size and the between-subjects design as a limitation in their discussion. Nevertheless, making meaningful inferences based on studies with such a small sample size is difficult, if not impossible.

      We agree with the reviewer that the small sample size and the between subject comparisons represent major limitations of our study. Accordingly, we now discussed these limitations in more detail by adding alternative explanations and further suggestions for future research to overcome these limitations.        

      We added the following sentences to the discussion about the limitations on p.14, ll. 465-473: 

      To control for potential confounders despite the influence of difficulty in word learning on TMR, we compared parameters of sleep, the pre-sleep memory performance and the vigilance shortly before the post-sleep memory test, revealing no significant group differences (see Table S1 and S2). Nevertheless, we cannot rule out that other individual trait factors differed between the groups, such as the individual susceptibility to TMR. To rule out these alternative explanations based on individual factors, we suggest for future research to replicate our study by conducting a within-subject design with cueing of subsets of previously learned low- and high-PP words providing all conditions within the same individuals as shown in other TMR studies (Cairney et al., 2018; Schreiner and Rasch, 2015).

      Comment 11:

      (2) Choice of task

      Though the task itself is innovative, there would have been tasks better suited to address the research question. The main disadvantage the task and the operationalisation of memory performance (d') have is that single-trial performance cannot be calculated. Consequently, choosing individual items for TMR is not possible.

      Additionally, TMR of low vs. high difficulty is conducted between subjects (and independently of pre-sleep memory performance) which is a consequence of the task design.

      The motivation for why this task has been used is missing in the paper.

      We used a reward task combined with TMR because previous studies revealed beneficial effects of reward-related information on sleep-dependent memory consolidation and reactivation (Asfestani et al., 2020; Fischer and Born, 2009; Lansink et al., 2009; Sterpenich et al., 2021). In addition, we wanted to increase the motivation of the participants, as they could receive additional monetary compensation according to their learning and memory task performances. Furthermore, we designed the task with the aim of translating it to operant conditioning in rats (see research proposal: https://data.snf.ch/grants/grant/168602). However, the task turned out to be too difficult to translate to rats, so we developed a different learning paradigm for the animal study (Klaassen et al., 2021) of this cross-species research project.

      We added the following sentence to the introduction on p.4, ll. 134-137:

      To consider the beneficial effect of reward related information on sleep dependent memory consolidation and reactivation (Asfestani et al., 2020; Fischer and Born, 2009; Lansink et al., 2009; Sterpenich et al., 2021), we trained healthy young participants to categorize these words into rewarded and unrewarded words to gain and to avoid losses of money points.  

      Reviewer #3 (Public Review):

      Summary:

      In this study, the authors investigated the effects of targeted memory reactivation (TMR) during sleep on memory retention for artificial words with varying levels of phonotactical similarity to real words. The authors report that the high phonotactic probability (PP) words showed a more pronounced EEG alpha decrease during encoding and were more easily learned than the low PP words. Following TMR during sleep, participants who had been cued with the high PP TMR, remembered those words better than 0, whilst no such difference was found in the other conditions. Accordingly, the authors report higher EEG spindle band power during slow-wave up-states for the high PP as compared to low PP TMR trials. Overall, the authors conclude that artificial words that are easier to learn, benefit more from TMR than those which are difficult to learn.

      Comment 12 & 13:

      Strengths:

      (1) The authors have carefully designed the artificial stimuli to investigate the effectiveness of TMR on words that are easy to learn and difficult to learn due to their levels of similarity with prior word-sound knowledge. Their approach of varying the level of phonotactic probability enables them to have better control over phonotactical familiarity than in a natural language and thus to disentangle which properties of word learning contribute to TMR success.

      (2) The use of EEG during wakeful encoding and sleep TMR sheds new light on the neural correlates of high PP vs. low PP both during wakeful encoding and cue-induced retrieval during sleep.

      We thank the reviewer for his/her positive evaluation of our manuscript.

      Weaknesses:

      Comment 14:

      (1) The present analyses are based on a small sample and comparisons between participants. Considering that the TMR benefits are based on changes in memory categorization between participants, it could be argued that the individuals in the high PP group were more susceptible to TMR than those in the low PP group for reasons other than the phonotactic probabilities of the stimuli (e.g., these individuals might be more attentive to sounds in the environment during sleep). While the authors acknowledge the small sample size and between-subjects comparison as a limitation, a discussion of an alternative interpretation of the data is missing.

      We agree with the reviewer that the small sample size and the between subject comparisons represent major limitations of our study. We thank the reviewer for this helpful comment and now discussed these limitations in more detail by adding alternative explanations and further suggestions for future research to overcome these limitations.

      We added the following sentences to the discussion on p.14, ll. 465-473: 

      To control for potential confounders despite the influence of difficulty in word learning on TMR, we compared parameters of sleep, the pre-sleep memory performance and the vigilance shortly before the post-sleep memory test, revealing no significant group differences (see Table S1 and S2). Nevertheless, we cannot rule out that other individual trait factors differed between the groups, such as the individual susceptibility to TMR. To rule out these alternative explanations based on individual factors, we suggest for future research to replicate our study by conducting a within-subject design with cueing of subsets of previously learned low- and high-PP words providing all conditions within the same individuals as shown in other TMR studies (Cairney et al., 2018; Schreiner and Rasch, 2015).

      Comment 15:

      (2) While the one-tailed comparison between the high PP condition and 0 is significant, the ANOVA comparing the four conditions (between subjects: cued/non-cued, within-subjects: high/low PP) does not show a significant effect. With a non-significant interaction, I would consider it statistically inappropriate to conduct post-hoc tests comparing the conditions against each other. Furthermore, it is unclear whether the p-values reported for the t-tests have been corrected for multiple comparisons. Thus, these findings should be interpreted with caution.

      We thank the reviewer for this comment, which gives us the opportunity to correct our analyses and to clarify them with additional description. Indeed, we first investigated overnight changes in behavioral performance within the four conditions, conducting t-tests against 0 of Δ-values of d' and c-criterion. Whereas for all our statistical analyses the p-value was set at p < 0.05 for two-tailed testing, we did not correct the p-values of our behavior analyses for multiple comparisons. To subsequently investigate differences between conditions, we conducted additional ANOVAs. We agree with the reviewer that without significant results of the ANOVA, post-hoc analyses should not be conducted. Taking into account the recommendation of reviewer 1 as well, we now included only post-hoc pairwise comparisons when the interaction effect of the ANOVA revealed at least a trend of significance (p < 0.1).

      We removed the following post-hoc analyses from the results section on p.9, ll. 291-295: 

      Additional post-hoc pairwise comparisons revealed a significant difference between the high-PP cued and low-PP uncued (high-PP cued vs. low-PP uncued: t(10) = 2.43, p = 0.04), and no difference to other conditions (high-PP cued vs. high-PP uncued: t(20) = 1.28, p = 0.22; high-PP cued vs. low-PP cued: t(20) = 1.57, p = 0.13).

      Further, we mentioned the lack of correction for multiple comparisons as a limitation of our results in the discussion on p.13, ll. 456-458:  

      The criteria of data analyses were not pre-registered and the p-values of our behavior analyses were not corrected for multiple comparisons.

      We added the following sentences to the methods p.23, ll. 842-849:

      To analyze overnight changes of sleep behavioral data within TMR conditions, we first conducted dependent sample t-tests against 0 of Δ-values (post-sleep test minus pre-sleep test) of d' and c-criterion (see Fig. 3). Two-way mixed design ANOVAs were computed to compare Δ-values between TMR conditions. After confirming at least a trend of significance (p < 0.1) for the interaction effect, we conducted post-hoc pairwise comparisons by independent and dependent sample t-tests. For all behavior statistical analyses, the p-value was set at p < 0.05 for two-tailed testing. A p-value < 0.1 and > 0.05 was reported as a trend of significance.
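
      The first step of this procedure, testing Δ-values against 0, can be sketched as follows. The helper names are hypothetical, and only the t statistic and degrees of freedom are computed here. Obtaining the two-tailed p-value additionally requires the CDF of Student's t distribution, which is outside the Python standard library and is therefore left out of this sketch.

```python
from math import sqrt
from statistics import mean, stdev


def overnight_deltas(pre_scores, post_scores):
    """Δ-value per participant: post-sleep score minus pre-sleep score."""
    if len(pre_scores) != len(post_scores):
        raise ValueError("pre and post scores must be paired")
    return [post - pre for pre, post in zip(pre_scores, post_scores)]


def t_against_zero(deltas):
    """Return (t statistic, degrees of freedom) for a test of the mean Δ against 0."""
    n = len(deltas)
    if n < 2:
        raise ValueError("at least two participants are required")
    standard_error = stdev(deltas) / sqrt(n)  # stdev uses the n-1 denominator
    return mean(deltas) / standard_error, n - 1
```

      Because the Δ-values are paired differences, this one-sample test against 0 is equivalent to a dependent-sample t-test between the pre- and post-sleep scores.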

      Comment 16:

      (3) With the assumption that the artificial words in the study have different levels of phonotactic similarity to prior word-sound knowledge, it was surprising to find that the phonotactic probabilities were calculated based on an American English lexicon whilst the participants were German speakers. While it may be the case that the between-language lexicons overlap, it would be reassuring to see some evidence of this, as the level of phonotactic probability is a key manipulation in the study.

      We thank the reviewer for pointing to the misalignment between the German-speaking participants and the used artificial words based on American English. In line with this recommendation, we added a more detailed argumentation to the manuscript about the assumption of our study that major common phonetic characteristics across both languages are still preserved.

      We now discussed these aspects on p.14, ll. 473-481:

      Further, we used artificial words based on American English in combination with German-speaking participants, whereas language differences of pronunciation and phoneme structures might affect word perception and memory processing (Bohn and Best, 2012). On the other hand, both languages are considered to belong to the same language family (Eberhard et al., 2019) and the phonological distance between English and German is quite short compared, for example, to Korean (Luef and Resnik, 2023). Thus, major common phonological characteristics across both languages are still preserved. In addition, our behavior analyses revealed robust word discrimination learning and distinct memory performance according to different levels of phonotactic probabilities, providing evidence of successful experimental manipulation.

      Comment 17:

      (4) Another manipulation in the study is that participants learn whether the words are linked to a monetary reward or not, however, the rationale for this manipulation is unclear. For instance, it is unclear whether the authors expect the reward to interact with the TMR effects.

      We used a reward task combined with TMR because previous studies revealed beneficial effects of reward-related information on sleep-dependent memory consolidation and reactivation (Asfestani et al., 2020; Fischer and Born, 2009; Lansink et al., 2009; Sterpenich et al., 2021). In addition, we wanted to increase the motivation of the participants, as they could receive additional monetary compensation according to their learning and memory task performances. Furthermore, we designed the task with the aim of translating it to operant conditioning in rats (see research proposal: https://data.snf.ch/grants/grant/168602). However, the task turned out to be too difficult to translate to rats, so we developed a different learning paradigm for the animal study (Klaassen et al., 2021) of this cross-species research project.

      We added the following sentence to the introduction on p.4, ll. 134-137:

      To consider the beneficial effect of reward related information on sleep dependent memory consolidation and reactivation (Asfestani et al., 2020; Fischer and Born, 2009; Lansink et al., 2009; Sterpenich et al., 2021), we trained healthy young participants to categorize these words into rewarded and unrewarded words to gain and to avoid losses of money points.  

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Comment 18:

      (1) Please clearly define all linguistics terms - and most importantly the term "phonotactics" - at first use.

      We thank the reviewer for this recommendation and we added the definition of phonotactics and further reduced the diversity of linguistic terms to improve readability. 

      We added the following sentences to the beginning of the introduction on p.3, ll. 72-76:

      One critical characteristic of similarity to pre-existing knowledge in auditory word processing is its speech sound (phoneme) pattern. In phonology as the field of language specific phoneme structures, phonotactics determines the constraints of word phoneme composition of a specific language.

      Comment 19:

      (2) Some critical details about the methods should be included in the Results section to make it comprehensible. For example, the way the crucial differences between G1-4 words should be addressed in the Results, not only in Figure 1.

      According to the recommendation, we added this information to the results section.  We added the following sentences to the results section on p.4, ll. 145-154:

      To study the impact of difficulty in word learning on TMR, we developed a novel learning paradigm. We formed four sets of artificial words (40 words per set; see Table S3 and S4) consisting of different sequences of two vowels and two consonants. Here, we subdivided the alphabet into two groups of consonants (C1: b, c, d, f, g, h, j, k, l, m; C2: n, p, q, r, s, t, v, w, x, z) and vowels (V1: a, e, i; V2: o, u, y). Four-letter-words were created by selecting letters from the vowel and consonant groups according to four different sequences (G1: C1, V1, V2, C2; G2: C1, V1, C2, V2; G3: V1, C1, C2, V2; G4: V1, C1, V2, C2; Fig. 1a; see methods for further details). Comparison analyses between the sets revealed significant differences in phonotactic probability (PP; Fig. 1b; unpaired t-tests: G1 / G2 > G3 / G4, p < 0.005, values of Cohen’s d > 0.71).
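
      The construction rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: V1 is taken to be the lowercase vowels a, e, i, the function names and the random sampling are hypothetical, and the actual stimulus selection (see methods) may have applied further constraints.

```python
import random

# Letter groups as listed in the text (V1 assumed to be a, e, i).
C1 = list("bcdfghjklm")
C2 = list("npqrstvwxz")
V1 = list("aei")
V2 = list("ouy")

# Positional templates for the four word sets G1 to G4.
TEMPLATES = {
    "G1": (C1, V1, V2, C2),
    "G2": (C1, V1, C2, V2),
    "G3": (V1, C1, C2, V2),
    "G4": (V1, C1, V2, C2),
}


def make_word(set_name, rng):
    """Draw one letter per position from the groups of the given template."""
    return "".join(rng.choice(group) for group in TEMPLATES[set_name])


def make_set(set_name, n_words=40, seed=0):
    """Collect n_words distinct four-letter words (hypothetical sampling)."""
    rng = random.Random(seed)
    words = set()
    # Each template allows 10 * 3 * 3 * 10 = 900 words, so 40 are reachable.
    while len(words) < n_words:
        words.add(make_word(set_name, rng))
    return sorted(words)
```

      Calling `make_set("G3")` would return 40 words that start with a V1 vowel and end with a V2 vowel, which is the structural property that separates the G3 / G4 sets from the G1 / G2 sets.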

      Comment 20:

      (3) Was scoring done both online and then verified offline? If so, please note that.

      We have now included this information.

      We adjusted the method section on p.21, ll. 765-769:   

      The sleep stages of NREM 1 to 3 (N1 to N3), wake, and REM sleep were scored offline and manually according to the criteria of the American Academy of Sleep Medicine (AASM) by visual inspection of the signals of the frontal, central, and occipital electrodes over 30s epochs (Iber et al., 2007). Based on offline scoring, we confirmed TMR exposure during N2 and N3 and no significant differences (p-values > 0.05) of sleep parameters between the cueing groups (see Table S2).  

      Comment 21:

      (4) In Figure 2, please arrange the panel letters in an easier-to-read way (e.g., label upper right panel b with a different letter).

      We have now rearranged the panel letters according to the recommendation.

      We adjusted Figure 2 on p.8, ll. 242-258:     

      Comment 22:

      (5) In the first paragraph on TMR effects, please note which memory measure you are comparing (i.e., d').

      We added this information according to the recommendation.  

      We adjusted the sentence of the results on p.8, ll. 260-263:

      To examine whether TMR during sleep impacts memory consolidation of discrimination learning with respect to learning difficulty, we calculated the overnight changes by subtracting the pre- from the post-sleep memory performance based on d'-values of the reactivated sequences (cued) and non-reactivated sequences (uncued).
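
      For readers less familiar with signal detection measures, the following Python sketch shows how d', the c-criterion, and an overnight change could be computed from hit and false-alarm rates. It uses the standard textbook definitions and is an assumption about the computation, not a description of the authors' analysis code; the function names are hypothetical.

```python
from statistics import NormalDist

# Inverse of the standard normal CDF (z-transform of a proportion).
_z = NormalDist().inv_cdf


def d_prime(hit_rate, fa_rate):
    """Sensitivity: z(hit rate) minus z(false-alarm rate)."""
    return _z(hit_rate) - _z(fa_rate)


def c_criterion(hit_rate, fa_rate):
    """Response bias: positive values indicate a conservative criterion."""
    return -0.5 * (_z(hit_rate) + _z(fa_rate))


def overnight_change(pre_sleep, post_sleep):
    """Delta value: post-sleep minus pre-sleep performance."""
    return post_sleep - pre_sleep
```

      Rates of exactly 0 or 1 have no finite z-value, so in practice they need a correction before these functions are applied; which correction was used is not stated here.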

      Comment 23:

      (6) Please show the pre-sleep and post-sleep test scores for both word categories (not only the delta). It may be best to show this as another data point in Fig 2a, but it may be helpful to also see this split between cued and uncued.

      We added the pre-sleep and post-sleep test scores with the individual data points as an additional figure. 

      We added the following figure to the supplementary data on p.28, ll. 936-940:  

      Comment 24:

      (7) In the sentence "An additional two-way mixed design ANOVA on the same values with cueing as a between-subject factor (cued vs. uncued) ...", a more exact phrasing for the last parentheses would probably be "(high-PP-Cued vs Low-PP-Cued)". Both groups were cued.

      We thank the reviewer for pointing this out. According to the recommendation, we corrected the descriptions of the two-way mixed design ANOVAs. In addition, we detected that conditions had been wrongly assigned to the ANOVAs and corrected the reported values.

      We adjusted the sentences and corrected the values on p.9, ll. 271-275 and ll. 289-291: 

      An additional two-way mixed design ANOVA on the same values with the factor cueing (cued vs. uncued) as a within-subject factor and group as a between-subject factor revealed trends of significance (p < 0.1) for the interaction (cueing × group: F(1,20) = 3.47, p = 0.08) and the main effect of group (F(1,20) = 3.28, p = 0.09). The main effect of cueing was not significant (F(1,20) = 0.58, p = 0.46).

      An ANOVA on c-criterion changes showed no significant effects (interaction cueing × group: F(1,20) = 2.66, p = 0.12; main effect cueing  F(1,20) = 2.08, p = 0.17; main effect group F(1,20) = 0.38, p = 0.55).

      Comment 25:

      (8) In the same ANOVA, please mention that there is a trend toward an interaction effect. If there wasn't one, the post-hoc comparison would be unwarranted. Please consider noting other p<0.1 pvalues as a trend as well, for consistency.

      In response to this recommendation, we now include post-hoc pairwise comparisons only after confirming at least a trend toward an interaction effect in these ANOVAs, and we consistently report a p-value < 0.1 and > 0.05 as a trend of significance.

      We added the following sentences to the methods p.23, ll. 844-849:

      Two-way mixed design ANOVAs were computed to compare Δ-values between TMR conditions. After confirming at least a trend of significance (p < 0.1) for the interaction effect, we conducted post-hoc pairwise comparisons by independent and dependent sample t-tests. For all behavior statistical analyses, the p-value was set at p < 0.05 for two-tailed testing. A p-value < 0.1 and > 0.05 was reported as a trend of significance.
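
      The reporting rule stated above can be written down as a small decision function. The sketch below is illustrative only; the function names are hypothetical, and treating p = 0.05 exactly as a trend is an assumption, since the text leaves that boundary case open.

```python
def classify_p(p, alpha=0.05, trend=0.1):
    """Label a two-tailed p-value according to the stated reporting rule."""
    if p < alpha:
        return "significant"
    if p < trend:
        return "trend"
    return "not significant"


def posthoc_warranted(p_interaction, trend=0.1):
    """Post-hoc pairwise tests are run only if the interaction is at least a trend."""
    return p_interaction < trend
```

      Applied to the values reported in this response, the cueing × group interaction for Δ d' (p = 0.08) would qualify for post-hoc tests, whereas the interaction for the c-criterion (p = 0.12) would not.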

      We removed the following post-hoc analyses from the results section on p.9, ll. 291-295: 

      Additional post-hoc pairwise comparisons revealed a significant difference between the high-PP cued and low-PP uncued (high-PP cued vs. low-PP uncued: t(10) = 2.43, p = 0.04), and no difference to other conditions (high-PP cued vs.: high-PP uncued t(20) = 1.28, p = 0.22; low-PP cued t(20) = 1.57, p = 0.13).

      Comment 26:      

      (9) Please consider adding an analysis correlating spindle power with memory benefit across participants. Even if it is non-significant, it is important to report given that some studies have found such a relationship.

      According to this recommendation, we conducted additional correlation analyses.

      We added the following sentences to the manuscript into the results (pp. 10-11, ll. 346-349), the discussion (p.12, ll. 413-417), and the methods (p.23, ll. 864-867):   

      Whereas we found a significant group difference in spindle power nested during SW up-states,   conducting further whole sample (n = 22) correlation analyses between the individual spindle power values of the significant cluster and the overnight changes of behavior measurements revealed no significant correlations (Δ d': r = 0.16, p = 0.48; Δ c-criterion: r = 0.19, p = 0.40).

      In addition to our result of the significant group difference, we failed to find significant correlations between SW nested spindle power values and overnight changes in behavior measurements, whereas previous studies reported associations of SW and spindle activities during sleep with the integration of new memories in pre-existing knowledge networks (Tamminen et al., 2013, 2010).

      By using the same extracted power values (0.3 to 0.8s; 11-14Hz; Pz, P3, P4, O2, P7) per subject, we performed whole sample (n = 22) Pearson correlation analyses between these power values and the overnight changes of behavior measurements of the cued condition (Δ d' and Δ c-criterion).
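
      A whole-sample Pearson correlation of this kind can be sketched as follows. The code is a generic illustration with hypothetical function names, not the authors' pipeline; it returns the correlation coefficient and the corresponding t statistic with n - 2 degrees of freedom.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences of numbers."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two sequences of equal length >= 3")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_from_r(r, n):
    """t statistic (df = n - 2) for testing r against zero."""
    return r * sqrt((n - 2) / (1 - r ** 2))
```

      With n = 22, a coefficient as small as r = 0.16 corresponds to a t value well below 1, which fits the non-significant correlations reported above.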

      Reviewer #2 (Recommendations For The Authors):

      (1) Choice of task

      Comment 27:      

      In general, I find your task well-designed and novel. In light of your research question, however, I wonder why you chose this task. When you outlined the research question in the introduction, I expected a task similar to Schreiner et al. (2015). For example, participants have to associate high PP words with each other and low PP words. The advantage here would be that you could test the benefits of TMR in a within-subjects design (for example, cueing half of the remembered high and half of the remembered low PP words).

      Please see our previous response at comment 14.    

      Comment 28:

      Why did you decide to introduce a reward manipulation?

      Please see our previous response at comment 11.    

      Comment 29:

      Why did you do the cueing on a category level (cueing all high PP or all low PP words instead of single word cueing or instead of cueing 20 reward high-PP, 20 unrewarded high-PP plus 20 reward low-PP and 20 unrewarded low-PP)? Both alternatives would have provided you the option to run your statistics within participants.

      Please see our previous response at comment 14.    

      Comment 30:

      (2) Between-subjects design and small sample size.

      Why did you decide on a between-subjects design that severely reduces your power?

      Why did you just collect 22 participants with such a design? Were there any reasons for this small sample size? Honestly, I think publishing a TMR study with healthy participants and such a small sample size (11 participants for some comparisons) is not advisable.

      Please see our previous response at comment 14.

      Comment 31:

      (3) Encoding performance.

      Is d' significantly above 0 in the first repetition round? I would assume that the distinction between rewarded and non-rewarded words is just possible after the first round of feedback.

      Indeed, conducting t-tests against 0 revealed significantly increased d'-values in the first repetition round (2nd presentation) in both PP conditions (high-PP: 0.85 ± 0.09, t(32) = 9.17, p < 0.001; low-PP: 0.62 ± 0.09, t(32) = 6.83, p < 0.001).  
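
      For reference, a t-test against 0 of the kind reported here reduces to dividing the sample mean by its standard error. The Python sketch below shows this under the usual assumptions of a one-sample t-test; the function name is hypothetical and the code is not taken from the authors' analysis.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(values, mu0=0.0):
    """Return the t statistic and degrees of freedom for a test against mu0."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    sem = stdev(values) / sqrt(n)  # standard error of the mean
    return (mean(values) - mu0) / sem, n - 1
```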

      Comment 32:

      (4) Encoding response options

      If you want to you could make it more explicit what exactly the response options are. I assume that one button means a word has a high reward and the other button means a word has a low reward. Making it explicit increases the understanding of the results section.

      Please see our previous response at comment 3.

      Comment 33:           

      (5) Alpha desynchronisation.

      Relative change

      Why did you subtract alpha power during the 1st presentation from alpha power during 2nd and 3rd presentation? You baseline-corrected already and individually included the 1st, 2nd, and 3rd repetition in your behavioural analysis.

      Based on this analysis, we aimed to examine the relative change in alpha power between PP-conditions of memory-relevant word repetitions. Therefore, to extract memory relevant changes of EEG activities, the first word presentation of naive stimulus processing could serve as a more representative baseline condition covering the time-window of interest of 0.7 to 1.9 s after the stimulus onset compared to a baseline condition before stimulus onset (-1 to -0.1s). 

      To explain the rationale of the analyses with the baseline condition more clearly, we added this information to the results section on p.7, ll. 222-226:

      We obtained the changes in power values by subtracting the first from the second and third presentation for the high- and low-PP condition, respectively. Here, the first word presentation of naive stimulus processing served us with a more representative baseline condition covering the time-window of interest of 0.7 to 1.9 s after the stimulus onset to examine relevant changes of encoding.  
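
      The subtraction described above can be illustrated with a short Python sketch. The function name and the input layout (one power value per presentation, first presentation first) are assumptions made for the example.

```python
def power_change(power_by_presentation):
    """Subtract the first presentation from every later presentation.

    power_by_presentation: alpha power for presentations 1, 2, 3, ...
    Returns the changes for presentations 2, 3, ... relative to presentation 1.
    """
    baseline = power_by_presentation[0]
    return [p - baseline for p in power_by_presentation[1:]]
```

      Negative values then indicate lower alpha power (stronger desynchronization) than during the first, naive presentation of the word.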

      Comment 34:

      (6) Alpha desynchronisation as a neural correlate of encoding depth & difficulty?

      "In addition to the behavior results, these EEG results indicate differences between PP conditions in desynchronization of alpha oscillations, as an assumed neural correlate of encoding depth."

      Given that the low-PP words are more difficult to learn, I was expecting to see higher alpha desynchronisation in the low-PP relative to the high-PP words. Could you outline in a bit more detail how your findings fit into the literature (e.g., Simon Hanslmayr did a lot of work on this)?

      I would also advise you to add citations e.g., after your sentence in the quote above ("as an assumed neural correlate of encoding depth").

      We thank the reviewer for the recommendation, which gives us the opportunity to discuss in more detail how our results relate to previous findings.

      We added additional sentences to the discussion on p.13, ll. 441-455:    

      Additional studies linked alpha desynchronization to cognitive effort and cognitive load (Proskovec et al., 2019; Zhu et al., 2021). So, one could assume to observe higher alpha desynchronization in the more difficult to learn condition of low-PP compared to high-PP. On the other hand, numerous studies investigating oscillatory correlates of learning and memory showed that alpha desynchronization is associated with memory across different tasks, modalities and experimental phases of encoding and retrieval (Griffiths et al., 2016, 2021, 2019a, 2019b; Hanslmayr et al., 2009; Michelmann et al., 2016). Strikingly, Griffiths and colleagues (Griffiths et al., 2019a) revealed by simultaneous EEG-fMRI recordings a negative correlation between the occurrence of patterns of stimulus-specific information detected by fMRI and cortical alpha/beta suppression. Here, the authors suggested that a decrease of alpha/beta oscillations might represent the neuronal mechanism of unmasking the task-critical signal by simultaneous suppression of task-irrelevant neuronal activities to promote information processing. Following this interpretation, we assume that over the course of learning elevated memory processing of the easier to learn stimuli is associated with enhanced information processing and thus accompanied by higher cortical alpha desynchronization in comparison to the more difficult to learn stimuli.

      In addition, we added the mentioned quote on p.7, ll. 239-240:

      In addition to the behavior results, these EEG results indicate differences between PP conditions in desynchronization of alpha oscillations, as an assumed neural correlate of encoding depth (Griffiths et al., 2021; Hanslmayr et al., 2009).

      Comment 35:

      (7) Exclusion criterion.

      Why did you use a d' > 0.9 as a criterion for data inclusion?

      This criterion ensured that each included subject had a pre-sleep memory performance of d' > 1.05 in at least one PP condition, which corresponds to a general accuracy rate of 70%.

      Accordingly, we adjusted these sentences of the method section on p.19, ll. 677-680: 

      Data were excluded from subjects who did not reach the minimal learning performance of d' > 1.05 during the pre-sleep memory test in at least one of the two PP conditions, whereas this threshold value corresponds to accuracy rates of 70% (n = 5). In addition, we excluded one subject who showed a negative d' in one PP condition of the pre-sleep memory test (n = 1). 
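
      One way to see why d' = 1.05 lines up with roughly 70% accuracy is the textbook relation for an unbiased observer with equal numbers of rewarded and unrewarded items, where the proportion correct equals Φ(d'/2). The Python sketch below works through that relation; both the unbiased criterion and the equal item numbers are assumptions of the example, and the response does not state how the correspondence was derived.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def accuracy_from_d_prime(d):
    """Proportion correct for an unbiased observer (c = 0), equal trial numbers."""
    return _STD_NORMAL.cdf(d / 2)


def d_prime_from_accuracy(accuracy):
    """Inverse relation: d' that yields the given proportion correct."""
    return 2 * _STD_NORMAL.inv_cdf(accuracy)
```

      Under these assumptions, `d_prime_from_accuracy(0.70)` comes out at about 1.05, matching the threshold used for inclusion.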

      Comment 36:

      (8) Coherence of wording.

      When you talk about your dependent variable (d') you sometimes use sensitivity. I would stick to one term.

      We replaced the word sensitivity with d'.    

      (9) Criterion

      Comment 37:

      Why do you refer to a change in criterion (Figure 3b, axis labels) as a change in memory? Do you think the criterion says something about memory?

      We corrected the axis label of Figure 3b and deleted here the word memory.

      Comment 38:

      Additionally, why did you analyse the effect of TMR on the criterion? Do you expect the criterion to change due to sleep-dependent memory consolidation? This section would benefit from more explanation. Personally, I am very interested in your thoughts and your hypothesis (if you had one, if not that is also fine but then, make it explicit that it was an exploratory analysis).

      By conducting exploratory analyses of overnight changes of the c-criterion measurements, we aimed to examine the bias of decision-making to provide comprehensive data according to the framework of the signal detection theory. Regarding the previous literature showing mainly beneficial effects of sleep on learning and memory, we focused with our hypothesis on d' and explored additionally the c-criterion.

      Despite our task design with gains/hits of +10 money points and losses/FAs of -8 (instead of -10), the subjects showed already during the pre-sleep memory task significant biases towards loss avoidance in both PP conditions (t-tests against 0: high-PP: 0.44 ± 0.07, t(21) = 5.63, p < 0.001; low-PP: 0.47 ± 0.09, t(21) = 5.51, p < 0.001). As already reported in the preprint, we found an additional significant increase of c-criterion by TMR solely for the high-PP words (see Fig. 3b). Even by integrating subjects with poor pre-sleep memory performance (high-PP-cueing group: n = 15; low-PP-cueing group: n = 13), t-tests against 0 revealed a significant increase of the high-PP cueing condition (t(14) = 3.36, p = 0.005) and no significant overnight changes in the other conditions (high-PP uncued: t(12) = 1.39, p = 0.19; low-PP cued: t(12) = 1.47, p = 0.17; low-PP uncued: t(14) = -0.20, p = 0.84). These exploratory findings on c-criterion suggest potential applications of TMR to affect decision-making biases in combination with reward learning.      

      We revised the manuscript mentioning the exploratory character of the c-criterion analyses of the results on p.9, ll. 282-283 and of the discussion on p.12, ll. 400-402:  

      We examined next as an exploratory analysis whether TMR conditions influence biases in decision-making.

      By conducting an additional exploratory analysis, we observed a significant change of the decision bias in the cueing condition of the easy to learn words and no overnight changes in the other conditions.

      Comment 39:

      (10) You detected SWs in the time range of 0-6 sec post sound stimulation. How was the distribution of all detected SW down-states in this time range? (You could plot a histogram for this.)

      We have now illustrated the detected SWs in the time range of 0 to 6 s after stimulus onset.

      We added a histogram to the supplementary section on p.30, ll. 982-986:  

      Reviewer #3 (Recommendations For The Authors):

      Comment 40:

      (1) In line with the weakness outlined above, I would recommend including a discussion of how the between-subject comparison and small sample size could affect the results and provide alternative interpretations.

      Please see our previous response at comment 14.

      Comment 41:

      (2) Regarding my point about statistical comparisons, I would recommend that the authors follow best practice guidelines for post-hoc tests and multiple comparisons. In Figures 3a and b, I would also recommend removing the stars indicating significance from the post-hoc tests (if this is what they reflect). Perhaps this link will be useful: https://www.statology.org/anova-post-hoc-tests/

      Please see our previous response at comment 15.    

      Comment 42:

      (3) Furthermore, to address any doubts about the possible phonotactic probability differences between languages, I would recommend that the authors show whether the languages overlap, the level of English fluency in the German-speaking participants, and/or another way of reassuring that this is unlikely to have affected the results.

      Please see our previous response at comment 7.    

      Comment 43:

      (4) In the introduction, I would recommend that the authors outline a clear rationale for the reward/no reward manipulation.

      Please see our previous response at comment 11.    

      Comment 44:

      (5) Figure 1c: Please include what response options participants had, e.g., 'rewarded/not rewarded'. This would make the type of categorization clearer to the reader.

      Please see our previous response at comment 3.

      Comment 45:

      (6) It is unclear whether the additional ANOVA conducted on the time and frequency of the identified clusters included all channels or only the channels contributing to the cluster. Consider clarifying this in the relevant methods and results. Furthermore, I would recommend labelling this as a posthoc test as this analysis was guided by an initial peak at the data and the timings, frequencies, and channels of interest were not selected a-priori.

      We thank the reviewer for this recommendation and labelled the additional repeated-measure ANOVA as a post-hoc test. Further, we mentioned the channels used for this analysis (Pz and Cz).

      We adjusted the results section on p.7, ll. 230-233 and the methods section on p.23, ll. 858-860:            

      A post-hoc repeated-measure ANOVA on alpha power changes (merged over Pz and Cz electrodes) with PP (high vs. low) and presentations (2 to 3) as within-subjects factors revealed a main effect of PP (F(1,32) = 5.42, p = 0.03, η2 = 0.15), and a significant interaction (F(1,32)  = 7.38, p = 0.01, η2 = 0.19; Fig. 2e).

      After confirming the existence of a significant cluster, we conducted an additional post-hoc repeated-measure ANOVA with averaged values of the identified time and frequency range of interest and merged over the Pz and Cz electrodes (see Fig. 2e).

      Comment 46:

      (7) Figure 3: To better illustrate within- vs. between-subjects comparisons and promote transparency, please add individual points and lines between the within-subjects conditions.

      According to this recommendation, we changed Figure 3 to add the individual data points, connected by lines.

      We modified Figure 3 on p.9, ll. 299-303:  

      Comment 47:

      (8) For the SW density time-bin analyses, please include statistics for all comparisons (i.e., through 0 s to 3 s) and say whether these were corrected for multiple comparisons.

      According to this recommendation, we have now included statistics for all comparisons.

      We added Table S6 to the supplementary data on p.29, l.962:

      Comment 48:

      (9) Consider reporting effect sizes.

      We thank the reviewer for this recommendation and have now added effect sizes for significant results.

      Comment 49:

      (10) For transparency and replicability, consider including a list of the four stimulus sets including their phoneme and biphone probabilities.

      We included a list of the four stimulus sets with their phoneme and biphone probabilities.

      We added Table S3 and Table S4 to the supplementary data on pp. 26-27:

      References

      Asfestani MA, Brechtmann V, Santiago J, Peter A, Born J, Feld GB. 2020. Consolidation of Reward Memory during Sleep Does Not Require Dopaminergic Activation. J Cogn Neurosci 32:1688–1703. doi:10.1162/JOCN_A_01585

      Batterink LJ, Oudiette D, Reber PJ, Paller KA. 2014. Sleep facilitates learning a new linguistic rule. Neuropsychologia 65:169–79. doi:10.1016/j.neuropsychologia.2014.10.024

      Batterink LJ, Paller KA. 2017. Sleep-based memory processing facilitates grammatical generalization: Evidence from targeted memory reactivation. Brain Lang 167:83–93. doi:10.1016/J.BANDL.2015.09.003

      Bohn OS, Best CT. 2012. Native-language phonetic and phonological influences on perception of American English approximants by Danish and German listeners. J Phon 40:109–128. doi:10.1016/J.WOCN.2011.08.002

      Cairney SA, Guttesen A á. V, El Marj N, Staresina BP. 2018. Memory Consolidation Is Linked to Spindle-Mediated Information Processing during Sleep. Curr Biol 28:948-954.e4. doi:10.1016/j.cub.2018.01.087

      Eberhard DM, Simons GF, Fennig CD. 2019. Ethnologue: Languages of the world . SIL International. Online version: http://www.ethnologue.com.

      Fischer S, Born J. 2009. Anticipated reward enhances offline learning during sleep. J Exp Psychol Learn Mem Cogn 35:1586–1593. doi:10.1037/A0017256

      Green DM, Swets JA. 1966. Signal detection theory and psychophysics. Oxford, England: John Wiley.

      Griffiths B, Mazaheri A, Debener S, Hanslmayr S. 2016. Brain oscillations track the formation of episodic memories in the real world. Neuroimage 143:256–266. doi:10.1016/j.neuroimage.2016.09.021

      Griffiths BJ, Martín-Buro MC, Staresina BP, Hanslmayr S, Staudigl T. 2021. Alpha/beta power decreases during episodic memory formation predict the magnitude of alpha/beta power decreases during subsequent retrieval. Neuropsychologia 153. doi:10.1016/j.neuropsychologia.2021.107755

      Griffiths BJ, Mayhew SD, Mullinger KJ, Jorge J, Charest I, Wimber M, Hanslmayr S. 2019a. Alpha/beta power decreases track the fidelity of stimulus specific information. Elife 8. doi:10.7554/eLife.49562

      Griffiths BJ, Parish G, Roux F, Michelmann S, van der Plas M, Kolibius LD, Chelvarajah R, Rollings DT, Sawlani V, Hamer H, Gollwitzer S, Kreiselmeyer G, Staresina B, Wimber M, Hanslmayr S. 2019b. Directional coupling of slow and fast hippocampal gamma with neocortical alpha/beta oscillations in human episodic memory. Proc Natl Acad Sci U S A 116:21834–21842. doi:10.1073/pnas.1914180116

      Hanslmayr S, Spitzer B, Bäuml K-H. 2009. Brain oscillations dissociate between semantic and nonsemantic encoding of episodic memories. Cereb Cortex 19:1631–40. doi:10.1093/cercor/bhn197

      Iber C, Ancoli‐Israel S, Chesson AL, Quan SF. 2007. The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications. Westchester, IL: American Academy of Sleep Medicine.

      Klaassen AL, Heiniger A, Sánchez PV, Harvey MA, Rainer G. 2021. Ventral pallidum regulates the default mode network, controlling transitions between internally and externally guided behavior. Proc Natl Acad Sci U S A 118:1–10. doi:10.1073/pnas.2103642118

      Lansink CS, Goltstein PM, Lankelma J V., McNaughton BL, Pennartz CMA. 2009. Hippocampus leads ventral striatum in replay of place-reward information. PLoS Biol 7. doi:10.1371/JOURNAL.PBIO.1000173

      Luef EM, Resnik P. 2023. Phonotactic Probabilities and Sub-syllabic Segmentation in Language Learning. Theory Pract Second Lang Acquis 9:1–31. doi:10.31261/TAPSLA.12468

      Michelmann S, Bowman H, Hanslmayr S. 2016. The Temporal Signature of Memories: Identification of a General Mechanism for Dynamic Memory Replay in Humans. PLoS Biol 14:e1002528. doi:10.1371/journal.pbio.1002528

      Proskovec AL, Heinrichs-Graham E, Wilson TW. 2019. Load Modulates the Alpha and Beta Oscillatory Dynamics Serving Verbal Working Memory. Neuroimage 184:256. doi:10.1016/J.NEUROIMAGE.2018.09.022

      Reber AS. 1967. Implicit learning of artificial grammars. J Verbal Learning Verbal Behav 6:855–863. doi:10.1016/S0022-5371(67)80149-X

      Schreiner T, Rasch B. 2015. Boosting vocabulary learning by verbal cueing during sleep. Cereb Cortex 25:4169–4179. doi:10.1093/cercor/bhu139

      Sterpenich V, van Schie MKM, Catsiyannis M, Ramyead A, Perrig S, Yang H-D, Van De Ville D, Schwartz S. 2021. Reward biases spontaneous neural reactivation during sleep. Nat Commun 12:1–11. doi:10.1038/s41467-021-24357-5

      Tamminen J, Lambon Ralph MA, Lewis PA. 2013. The role of sleep spindles and slow-wave activity in integrating new information in semantic memory. J Neurosci 33:15376–15381. doi:10.1523/JNEUROSCI.5093-12.2013

      Tamminen J, Payne JD, Stickgold R, Wamsley EJ, Gaskell MG. 2010. Sleep spindle activity is associated with the integration of new memories and existing knowledge. J Neurosci 30:14356–60. doi:10.1523/JNEUROSCI.3028-10.2010

      Zhu Y, Wang Q, Zhang L. 2021. Study of EEG characteristics while solving scientific problems with different mental effort. Sci Rep 11. doi:10.1038/S41598-021-03321-9

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      In this study, the researchers aimed to investigate the cellular landscape and cell-cell interactions in cavernous tissues under diabetic conditions, specifically focusing on erectile dysfunction (ED). They employed single-cell RNA sequencing to analyze gene expression patterns in various cell types within the cavernous tissues of diabetic individuals. The researchers identified decreased expression of genes associated with collagen or extracellular matrix organization and angiogenesis in several cell types, including fibroblasts, chondrocytes, myofibroblasts, valve-related lymphatic endothelial cells, and pericytes. They also discovered a newly identified marker, LBH, that distinguishes pericytes from smooth muscle cells in mouse and human cavernous tissues. Furthermore, the study revealed that pericytes play a role in angiogenesis, adhesion, and migration by communicating with other cell types within the corpus cavernosum. However, these interactions were found to be significantly reduced under diabetic conditions. The study also investigated the role of LBH and its interactions with other proteins (CRYAB and VIM) in maintaining pericyte function and highlighted their potential involvement in regulating neurovascular regeneration. Overall, the manuscript is well-written and the study provides novel insights into the pathogenesis of ED in patients with diabetes and identifies potential therapeutic targets for further investigation.

      Reviewer #2 (Public Review):

      Summary: In this manuscript, the authors performed single cell RNA-sequencing of cells from the penises of healthy and diabetes mellitus model (STZ injection-based) mice, identified Lbh as a marker of penis pericytes, and report that penis-specific overexpression of Lbh is sufficient to rescue erectile function in diabetic animals. In public human single cell RNA-seq datasets, the authors report that LBH is similarly specific to pericytes and downregulated in diabetic patients. Additionally, the authors report discovery of CRYAB and VIM1 as protein interacting partners with LBH.

      The authors' contributions are of interest to the erectile dysfunction community and their Lbh overexpression experiments are especially interesting and well-conducted. However, claims in the manuscript regarding the specificity of Lbh as a pericyte marker, the mechanism by which Lbh overexpression rescues erectile function, cell-cell interactions impaired by diabetes, and protein-interaction partners require qualification or further evidence to justify.

      Major claims and evidence:

      1) Marker gene specificity and quantification: One of the authors' major contributions is the identification of Lbh as a marker of pericytes in their data. The authors present qualitative evidence for this marker gene relationship, but it is unclear from the data presented if Lbh is truly a specific marker gene for the pericyte lineage (either based on gene expression or IF presented in Fig. 2D, E). Prior results (see Tabula Muris Consortium, 2018) suggest that Lbh is widely expressed in non-pericyte cell types, so the claims presented in the manuscript may be overly broad. Even if Lbh is not a globally specific marker, the authors' subsequent intervention experiments argue that it is still an important gene worth studying.

      Answer: We appreciate this comment. In our scRNAseq data for the mouse cavernosum tissues, previously known markers such as Rgs5, Pdgfrb, Cspg4, Kcnj8, Higd1b, and Cox4i2 were not expressed exclusively in pericytes, whereas Lbh exhibited a pericyte-specific expression pattern (Fig. 2 and Supplementary Fig. 5). LBH expression was easily distinguishable from α-SMA, not only in mouse cavernosum but also in dorsal artery and dorsal vein tissues within penile tissues. This distinctive expression pattern of LBH was also observed in human cavernous pericytes (Fig. 5). We then examined Lbh expression patterns in various mouse tissues using the mouse single-cell atlas (Tabula Muris), although endothelial and pericyte clusters were not subclustered in most tissues from Tabula Muris. To identify pericytes, we relied on the expression pattern of known marker genes (Pecam1 for endothelial cells; Rgs5, Pdgfrb, and Cspg4 for pericytes). Lbh was expressed in pericytes of the bladder, heart and aorta, kidney, and trachea, but not as specifically as in penile pericytes (Supplementary Fig. 6A-D). However, it is worth noting that other known pericyte markers also did not exhibit exclusive expression in pericytes across the tissues we analyzed. Therefore, in certain tissues, particularly in mouse penile tissues, Lbh may be a valuable marker in conjunction with other established pericyte marker genes for distinguishing pericytes.

      2) Cell-cell communication and regulon activity changes in the diabetic penis: The authors present cell-cell communication analysis and TF regulon analysis in Fig 3 and report differential activities in healthy and DM mice. These results are certainly interesting, however, no statistical analyses are performed to justify claimed changes in the disease state and no validations are performed. It is therefore challenging to interpret these results, and the relevant claims do not seem well supported.

      Answer: In response to these helpful suggestions, we calculated statistical significance and performed experimental validation. CellphoneDB permutes the cluster labels of all cells 1000 times and, at each permutation, calculates mean(mean(molecule 1 in cluster X), mean(molecule 2 in cluster Y)) for each interaction pair and for each pairwise comparison between two cell types. We only considered interactions in which the difference in means calculated by these permutations was greater than 0.25-fold between diabetic and normal samples. We also considered interactions with a P-value < 0.05 to be significant.
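
      The label-permutation idea described in this answer can be illustrated with a minimal sketch. It is not CellphoneDB's actual code: the function names, the toy expression values, and the one-sided p-value (the share of permuted scores at least as large as the observed one) are assumptions made for illustration only.

```python
import random
from statistics import mean


def interaction_score(expr_a, expr_b, labels, cluster_x, cluster_y):
    """Mean of (mean of molecule 1 in cluster X, mean of molecule 2 in cluster Y)."""
    a_in_x = [v for v, lab in zip(expr_a, labels) if lab == cluster_x]
    b_in_y = [v for v, lab in zip(expr_b, labels) if lab == cluster_y]
    return mean([mean(a_in_x), mean(b_in_y)])


def permutation_p_value(expr_a, expr_b, labels, cluster_x, cluster_y,
                        n_perm=1000, seed=0):
    """Shuffle cluster labels across all cells and compare each permuted score
    with the observed one (hypothetical helper, one-sided empirical p-value)."""
    rng = random.Random(seed)
    observed = interaction_score(expr_a, expr_b, labels, cluster_x, cluster_y)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # label counts per cluster are preserved
        if interaction_score(expr_a, expr_b, shuffled, cluster_x, cluster_y) >= observed:
            hits += 1
    return observed, hits / n_perm


# Toy data (invented): molecule 1 is high in "PC", molecule 2 is high in "EC".
labels = ["PC"] * 5 + ["EC"] * 5 + ["FB"] * 10
ligand = [5.0] * 5 + [0.1] * 15
receptor = [0.1] * 5 + [4.0] * 5 + [0.1] * 10

score, p_value = permutation_p_value(ligand, receptor, labels, "PC", "EC")
print(score, p_value)
```

      Running the same test separately on normal and diabetic cells would give one score per condition, which could then be compared against a fold-change cutoff like the one mentioned in the answer.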

      To assess differential regulon activities of transcription factor (SCENIC) between diabetic and normal pericytes, we utilized a generalized linear model with scaled activity scores for each cell as input. These scaled regulon activity values for angiogenesis-related TFs exhibited differences between diabetic and normal pericytes. The results of the generalized linear model revealed that Klf5, Egr1, and Junb were TFs with significantly altered regulon activities in diabetic pericytes. Experimental data indicated that the expression level of Lmo2, Junb, Elk1, and Hoxd10 was higher (Hoxd10) or lower (Lmo2, Junb, Elk1) in diabetic pericytes compared to normal pericytes (Supplementary Fig. 9). We have added the scaled regulon activity values and statistical significance in Fig. 3E.
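
      To show what testing a condition effect on scaled regulon activity can look like, here is a hedged sketch. It assumes a Gaussian model with identity link and a single binary covariate (diabetic vs normal), which reduces to least squares, and it uses a large-sample normal approximation for the p-value. The authors' exact model specification is not given in the response, and the activity values below are invented.

```python
from statistics import NormalDist, mean, pstdev


def zscore(values):
    """Scale activity scores to mean 0 and unit (population) standard deviation."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def condition_effect(scores, is_diabetic):
    """Fit score ~ intercept + beta * condition by least squares.

    With one binary covariate, beta equals the difference in group means.
    Returns (beta, standard error, two-sided p-value from a normal approximation).
    """
    g1 = [s for s, d in zip(scores, is_diabetic) if d]
    g0 = [s for s, d in zip(scores, is_diabetic) if not d]
    m1, m0 = mean(g1), mean(g0)
    beta = m1 - m0
    # Residual sum of squares around each group's fitted mean.
    rss = sum((s - m1) ** 2 for s in g1) + sum((s - m0) ** 2 for s in g0)
    sigma2 = rss / (len(scores) - 2)
    se = (sigma2 * (1 / len(g1) + 1 / len(g0))) ** 0.5
    z = beta / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return beta, se, p


# Invented per-cell regulon activity scores for one transcription factor.
normal_cells = [1.0, 1.2, 0.8, 1.1, 0.9]
diabetic_cells = [0.2, 0.4, 0.0, 0.3, 0.1]
activity = zscore(normal_cells + diabetic_cells)
condition = [False] * len(normal_cells) + [True] * len(diabetic_cells)

beta, se, p_value = condition_effect(activity, condition)
print(beta, se, p_value)
```

      With as few cells as in this toy example, a t distribution would be the more appropriate reference; the normal approximation is used here only to keep the sketch dependency-free.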

      3) Rescue of ED by Lbh overexpression: This is a striking and very interesting result that warrants attention. By simple overexpression of the pericyte marker gene Lbh, the authors report rescue of erectile function in diabetic animals. While mechanistic details are lacking, the phenomenon appears to have a large effect size and the experiments appear sophisticated and well conducted. If anything, the authors appear to underplay the magnitude of this result.

      Answer: We appreciate this comment. Therefore, we have added relevant clarification in the revised manuscript discussion section to emphasize the importance of LBH overexpression on rescuing ED as follows: “To test our hypothesis, we utilized the diabetes-induced ED mouse model, commonly employed in various studies focusing on microvascular complications associated with type 1 diabetes. We observed that the overexpression of LBH in diabetic mice led to the restoration of reduced erectile function by enhancing neurovascular regeneration. However, this study primarily demonstrated the observed phenomenon without delving into the detailed mechanisms. Nonetheless, these results of LBH on erections provide us with new strategies for treating ED and should be of considerable concern.” (Please see revised ‘Discussion’)

      4) Mechanistic claims for rescue of ED by Lbh overexpression: The authors claim that cell type-specific effects on MPCs are responsible for the rescue of erectile function induced by Lbh overexpression. This causal claim is unsupported by the data, which only show that Lbh overexpression influences MPC performance. In vivo, it's likely that Lbh is being over expressed by diverse cell types, any of which could be the causal driver of ED rescue. In fact, the authors report rescue of cell type abundance in endothelial cells and neuronal cells. Therefore, it cannot be concluded that MPC effects alone or in principal are responsible for ED rescue.

      Answer: We agree with these claims. Therefore, we have added relevant clarifications in the discussion section of the revised manuscript. Our findings suggest that LBH can affect the function of cavernous pericytes, although we cannot definitively specify which particular cavernous cell types are affected by the overexpressed LBH, whether it be cavernous endothelial cells, smooth muscle cells, or others. Subsequent research will be required to conduct more comprehensive mechanistic investigations, such as in vitro studies using cavernous endothelial cells, smooth muscle cells, and fibroblasts to address these knowledge gaps. (Please see revised ‘Discussion’)

      5) Protein interaction data: The authors claim that CRYAB and VIM1 are novel interacting partners of LBH. However, the evidence presented (2 blots in Fig. 6A,B) lack the relevant controls. It is possible that CRYAB and VIM1 are cross-reactive with the anti-LBH antibody or were not washed out completely. The abundance of bands on the Coomassie stain in Fig. 6A suggests that either event is plausible. Therefore, the evidence presented is insufficient to support the claim that CRYAB and VIM1 are protein interacting partners of LBH.

      Answer: We agree with these claims. Therefore, we have added the relevant controls (Input) and performed Co-IP (IP: CRYAB or VIM, WB: LBH) to demonstrate that CRYAB and VIM1 are not simply cross-reactive antigens to the LBH antibody. Our results show that we can detect the expression of CRYAB and VIM after LBH IP, and we also detect the expression of LBH after CRYAB and VIM IP. In addition, our results show that the binding of LBH to VIM is higher than that of LBH to CRYAB. Regardless, these results indicate that the binding of CRYAB or VIM to LBH is not a random phenomenon. (Please see revised ‘Result’ and ‘Figure 6B’)

      Impact: These data will trigger interest in Lbh as a target gene within the erectile dysfunction community.

      Reviewer #3 (Public Review):

      Bae et al. described the key roles of pericytes in cavernous tissues in diabetic erectile dysfunction using both mouse and human single-cell transcriptomic analysis. Erectile dysfunction (ED) is caused by dysfunction of the cavernous tissue and affects a significant proportion of men aged 40-70. The most common treatment for ED is phosphodiesterase 5 inhibitors; however, these are less effective in patients with diabetic ED. Therefore, there is an unmet need for a better understanding of the cavernous microenvironment, cell-cell communications in patients with diabetic ED, and the development of new therapeutic treatments to improve the quality of life.

      Pericytes are mesenchymal-derived mural cells that directly interact with capillary endothelial cells (ECs). They play a vital role in the pathogenesis of erectile function as their interactions with ECs are essential for penile erection. Loss of pericytes has been associated with diabetic retinopathy, cancer, and Alzheimer's disease and has been investigated in relation to the permeability of cavernous blood vessels and neurovascular regeneration in the authors' previous studies. This manuscript explores the mechanisms underlying the effect of diabetes on pericyte dysfunction in ED. Additionally, the cellular landscape of cavernous tissues and cell type-specific transcriptional changes were carefully examined using both mouse and human single-cell RNA sequencing in diabetic ED. The novelty of this work lies in the identification of a newly identified pericyte (PC)-specific marker, LBH, in mouse and human cavernous tissues, which distinguishes pericytes from smooth muscle cells. LBH not only serves as a cavernous pericyte marker, but its expression level is also reduced in diabetic conditions. The LBH-interacting proteins (Cryab and Vim) were further identified in mouse cavernous pericytes, indicating that these signaling interactions are critical for maintaining normal pericyte function. Overall, this study demonstrates the novel marker of pericytes and highlights the critical role of pericytes in diabetic ED.

      Reviewer #1 (Recommendations For The Authors):

      1) The methods are poorly written. It lacks specific information on the sample size, experimental design, and data analysis methods employed. The absence of these crucial details makes it difficult to evaluate the robustness and reliability of the findings.

      Answer: We agree with the reviewer’s suggestion and have revised the Methods section of our manuscript, adding detailed information and references. Sample sizes are now reported in the figure legends. (Please see revised ‘Method’, Figure Legend, and Supplementary information.)

      2) The cell number in the scRNA-seq analysis is small (~12000) and some minor cell types are probably underrepresented. It is not clear whether the authors pooled the cells from different mice as one sample, or replicates in different groups have been included. It will be helpful to label different samples in the UMAP. The authors should repeat the experiments with more replicates to increase the cell number and validate the findings.

      Answer: We understand the reviewer's concern, but due to the small size of mouse penile tissue, we had to pool 5 corpus cavernosum tissues for each group for scRNA-seq analysis. Moreover, mouse penile tissue is highly resistant, which posed challenges for dissociating the tissue and isolating single cells using conventional single-cell separation methods. Consequently, we had to increase the enzyme concentration to finally obtain 12,894 cells. Rather than repeating the scRNAseq analysis on the same mouse model, we validated our findings in human cavernous single-cell transcriptome data. This analysis allowed us to confirm the presence of pericytes in human corpus cavernosum, the specific expression of LBH in human cavernous pericytes, and the relevant GO terms associated with pericyte functions (Figure 5). We have added this information to the ‘Method’ section (Please see revised ‘Method’).

      3) Functional studies are lacking to justify how manipulating LBH expression or its interacting proteins might lead to effective therapeutic approaches for diabetic ED.

      Answer: We performed a functional study to evaluate whether LBH expression might lead to effective therapeutic approaches for diabetic ED, as shown in Figure 4G. Assessment of intracavernous pressure (ICP) is the most representative test for evaluating erectile function. Therefore, we modulated LBH expression in the penis of diabetic mice and assessed erectile function by ICP. However, we have not performed ICP studies or related in vitro studies (migration and survival experiments) to assess whether LBH-interacting proteins have the same effect.

      4) Although the abstract identifies novel targets for potential interventions, such as LBH and its interacting proteins, the clinical relevance of these findings remains uncertain. The authors should include a discussion regarding the translation of these discoveries into therapeutic strategies or their potential impact on patients with diabetes and ED.

      Answer: We appreciate the reviewer's suggestion and have added a discussion as per the reviewer’s recommendation (Please see revised ‘Discussion’).

      5) While the study highlights the importance of pericytes in penile erection, it fails to mention the broader context of other cell types involved in the pathogenesis of ED. Neglecting to discuss potential contributions from endothelial cells, smooth muscle cells, or neural elements limits the comprehensive understanding of the cellular interactions underlying diabetic ED.

      Answer: We agree with the reviewer's suggestion and have added a discussion regarding the significance of other cell populations in penile tissues, such as endothelial cells, smooth muscle cells, fibroblasts, and neural elements, along with the rationale for our focus on pericytes. (Please see revised ‘Discussion’).

      Reviewer #2 (Recommendations For The Authors):

      We congratulate the authors on an interesting study. We were especially excited to see their Lbh overexpression results. However, we felt other claims in the paper could benefit from additional investigation, analysis, and statistical rigor. We have provided a set of suggestions for improvement below.

      Major points:

      1) Pericyte marker gene proposal: See public review for commentary on the following suggested experiments. The authors should perform binary classification analysis using Lbh and report the performance of this gene as a marker (e.g. using the area under the receiver operating characteristic, accuracy, precision and recall). Further, they should consider performing this analysis for all other genes in their data to determine whether Lbh is the best marker gene.

      Answer: We appreciate this comment. Using the FindMarkers function, we measured AUC scores for Rgs5, Pln, Ednra, Npy1r, Atp1b2, and Gpc3, which quantify how well a binary classifier based on each gene distinguishes pericytes from the other cell types in mouse penile tissues. Rgs5 had the highest AUC, but Rgs5 was also expressed in SMCs in our data. Pln, Ednra, Gpc3, and Npy1r also seemed to be candidate markers, but a literature search excluded these genes because they are also expressed in the SMCs of other tissues or in different cell types. The AUC score of Lbh was over 0.7, its expression in SMCs has not been reported in previous studies, and we ultimately confirmed experimentally that Lbh is specific to penile pericytes. We have added this to the manuscript.
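
      For readers unfamiliar with how a single gene is scored as a binary classifier, the AUC can be computed directly from expression values: it is the probability that a randomly chosen pericyte expresses the gene more highly than a randomly chosen non-pericyte, with ties counted as one half. The sketch below illustrates that definition with invented values. It is not the FindMarkers implementation, and `marker_auc` is a hypothetical helper name.

```python
def marker_auc(expression, is_target):
    """AUC of one gene for separating target cells from all other cells.

    Uses the pairwise (Mann-Whitney) definition: the fraction of
    (target, non-target) cell pairs in which the target cell has the
    higher expression, counting ties as 0.5.
    """
    pos = [e for e, t in zip(expression, is_target) if t]
    neg = [e for e, t in zip(expression, is_target) if not t]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented expression values: three target cells followed by four other cells.
expression = [3.0, 2.0, 0.0, 0.0, 0.0, 1.0, 0.0]
is_pericyte = [True, True, True, False, False, False, False]

print(marker_auc(expression, is_pericyte))
```

      An AUC of 0.5 means the gene carries no information about cell identity, and 1.0 means perfect separation, so a threshold such as 0.7 sits between the two.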

      Author response table 1.

      Robust differential expression analysis should also be performed for this gene (if not all) and the statistics should be reported, given known issues with the statistical approach used by the authors for differential expression (see: Squair 2021, 10.1038/s41467-021-25960-2). The authors' should also report the number of cells involved in these comparisons, as the number of pericytes in the data (Fig 1B) appears quite small.

      Answer: We appreciate this comment. We used “MAST” to identify differentially expressed genes. This test is often used to find DEGs in single-cell RNA data. However, because the pseudobulk method has advantages over the single-cell DEG method (Squair 2021, 10.1038/s41467-021-25960-2), we additionally performed DEG analysis with DESeq2 to confirm whether Lbh can distinguish pericytes from other cell types in penile tissue. Even when tested with DESeq2, Lbh expression was significantly higher in pericytes than in other penile cell types (adjusted P-value = 2.694475e-07 for pericytes vs SMCs, adjusted P-value = 3.700118e-58 for pericytes vs the other cell types). Mouse penile tissue is small, and it contains fewer pericytes than fibroblasts and chondrocytes. In our mouse penile scRNAseq data, the number of pericytes is as follows: normal: 58, diabetes: 116. Despite the limited number of cells, we were able to establish statistical significance in our analyses.
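
      The pseudobulk step referred to here amounts to summing raw counts over all cells that share a sample and a cell type before a bulk method such as DESeq2 is applied. The sketch below shows only that aggregation. The sample identifiers, the counts, and the `pseudobulk` helper are hypothetical, and the response does not state how the authors grouped their cells.

```python
from collections import defaultdict


def pseudobulk(cell_counts, sample_ids, cell_types):
    """Sum per-cell raw counts into one profile per (sample, cell type)."""
    totals = defaultdict(lambda: defaultdict(int))
    for counts, sample, ctype in zip(cell_counts, sample_ids, cell_types):
        for gene, n in counts.items():
            totals[(sample, ctype)][gene] += n
    return {key: dict(genes) for key, genes in totals.items()}


# Invented raw counts for four cells from two hypothetical samples.
cell_counts = [
    {"Lbh": 4, "Acta2": 0},
    {"Lbh": 6, "Acta2": 1},
    {"Lbh": 0, "Acta2": 9},
    {"Lbh": 5, "Acta2": 0},
]
sample_ids = ["sample_1", "sample_1", "sample_1", "sample_2"]
cell_types = ["Pericyte", "Pericyte", "SMC", "Pericyte"]

profiles = pseudobulk(cell_counts, sample_ids, cell_types)
print(profiles)
```

      The aggregated profiles, rather than individual cells, then act as the observations in the differential expression test, which is why the number of independent samples matters for this approach.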

      Immunostaining results in Fig. 2D, E should likewise be quantified. At present, it's unclear that LBH and aSMA are mutually exclusive as claimed. The authors should also investigate Lbh expression in public single cell genomics data, rather than performing candidate gene literature searches. For example, the Tabula Muris suggests Lbh is expressed widely outside pericytes.

      Answer: For Figure 2D and E, the aim of these analyses was to assess the distribution of LBH and other cellular markers to see whether they overlap and whether they can be distinguished. We think that some of the overlapping staining in the tissue may be caused by multilayered cellular structures, so staining within cells would be more convincing. Therefore, we quantified the percentage of LBH- or α-SMA-expressing pericytes and the relative expression in smooth muscle cells by cell staining (Supplementary Fig. 5E). We found that only 3% of smooth muscle cells expressed LBH, 67% of mouse cavernous pericytes (MCPs) expressed α-SMA, and more than 97% of MCPs expressed LBH. These results therefore support the specific expression of LBH in MCPs. This information was added as ‘Supplementary Fig. 5E’ (Please see revised ‘Supplementary information’). We also examined Lbh expression patterns in various mouse tissues using the public mouse single-cell atlas (Tabula Muris), and provided a detailed response in reviewer 2’s public review 1.

      Even if Lbh is not the best marker, the authors' intervention experiment still motivates study of the gene, but these analyses would help contextualize the result for readers.

      2) Statistical analyses for cell-cell communication and TF regulon analysis: See public review for context on these comments. The authors should perform statistical tests to evaluate the significance of differences detected for each of these analyses. For example, generalized linear models can be used to assess the significance of TF regulon activity scores from SCENIC, and permutation tests can be used to measure the significance of cell-cell interaction score changes. Without these statistical tests, it's challenging for a reader to interpret whether the results reported are meaningful or within the realm of experimental noise.

      Answer: We appreciate this comment. We calculated statistical significance for the TF regulon analyses as suggested by the reviewer and described the statistical calculation method for cell-cell communication in detail. We provided a detailed response in reviewer 2’s public review 2.

      3) Mechanism of ED rescue by Lbh overexpression: To support this claim, the authors would need to perform an experiment where Lbh is over expressed specifically in MPCs (using e.g. a specific promoter on their LTV construct, or a transgenic line with a cell type-specific Cre-Lox system). Absent these data, the claim should be removed.

      Answer: We agree with the reviewer's suggestion and we have reworked the claim that ‘LBH overexpression is affected by pericytes during ED recovery’ and have added relevant clarification in the Discussion section to clearly state that LBH overexpression may affect many cavernosum cells, such as cavernous endothelial cells, smooth muscle cells, fibroblasts, and pericytes (Please see revised ‘Result’ and ‘Discussion’)

      4) Protein interaction claims: This experiment would require that the authors perform a similar pull-down with LBH KO cells and or a reciprocal Co-IP (e.g. IP: CRYAB or VIM1, WB: LBH) to demonstrate CRYAB and VIM1 are not simply cross-reactive antigens to their LBH antibody. Further, these experiments appear to only have a single replicate for each condition. The authors should either remove associated claims, or perform a Co-IP experiment with the relevant controls with sufficient replication.

      Answer: We agree with the claims. Therefore, we have included the necessary controls (Input) and performed Co-IP (IP: CRYAB or VIM1, WB: LBH) to demonstrate that CRYAB and VIM1 are not simply cross-reactive antigens to the LBH antibody. Our results show that we can detect the expression of CRYAB and VIM after LBH IP, and we also detect the expression of LBH after CRYAB and VIM IP. In addition, our results show that the binding of LBH to VIM is higher than that of LBH to CRYAB. Regardless, these results indicate that the binding of CRYAB or VIM to LBH is not a random phenomenon. Additionally, all IP experiments were replicated at least three times. (Please see revised ‘Result’ and ‘Figure 6B’)

      Minor Points:

      • The reference "especially in men" on line 56 seems odd given that only males can experience penile erectile dysfunction.

      Answer: We agree with the reviewer's suggestion and have removed the description 'especially male' (Please see revised ‘Introduction’)

      • Line 109, it's unclear what genes showed altered expression in Schwann cells.

      Answer: We apologize for the confusion. There were no significant differentially expressed genes between normal and diabetes in Schwann cells. We revised this part in the manuscript. (Schwann cells showed an increased expression compared to normal cells in diabetes, though not significant. In Schwann cells, there were no significant DEGs between diabetic and normal cells.)

      • It would be helpful for readers to see an analysis of the cell types that are transduced in the Lbh overexpression experiment in vivo. At present, some pericyte specificity is implied, but not demonstrated.

      Answer: We appreciate this comment. Our findings suggest that LBH can affect the function of cavernous pericytes, although we cannot definitively conclude which specific cavernous cell types are affected by the overexpressed LBH, whether it be cavernous endothelial cells, smooth muscle cells, or others. Subsequent research will be required to conduct more comprehensive mechanistic investigations, such as in vitro studies using cavernous endothelial cells, smooth muscle cells, and fibroblasts, to address these knowledge gaps. These points are also mentioned in the manuscript.

      • To improve clarity and enhance readability, define abbreviations before their initial usage in the text. For instance, in the second paragraph of the Introduction, the abbreviation 'ECs' is used without prior definition. It can be inferred that it is referring to endothelial cells, mentioned in parentheses in the subsequent sentence.

      Answer: We agree with the reviewer's suggestion to expand acronyms and ensure that all acronyms are defined in the revised manuscript before they are used for the first time in the text (Please see revised Manuscript).

      • It is important to include relevant references that align with the content being discussed. For example, in the Introduction, pericytes are described as being involved in various processes such as angiogenesis, vasoconstriction, and permeability. The text refers to a single reference, a review by Gerhardt and Betsholtz, which primarily focuses on pericytes' role in regulating angiogenesis. Adding additional sources, such as the review by Bergers and Song (Neuro Oncol., 2005), is recommended.

      Answer: We agree with the reviewer's suggestion and have added the reference as the reviewer recommended (Please see revised Manuscript and reference).

      • Figure 3E: it is stated that a panel of 53 angiogenesis factors were tested, it is stated that only MMP3 showed increased expression. However, various unlabeled spots appear to show changed expression patterns. It would be helpful to show a summary graph with the relative intensities of the full array of factors tested.

      Answer: We agree with the reviewer’s suggestion and now show the density of all spots in the angiogenesis array in Supplementary Table 1. The spots we selected had an expression density of at least 1500 and a change ratio greater than 1.2. (Please see revised ‘Supplementary information’)
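
      The selection rule stated in this answer can be written as a small filter. The sketch rests on two assumptions that the response does not spell out: that the density cutoff applies to the brighter of the two conditions, and that the change ratio is the larger density divided by the smaller one. Spot names and density values are invented.

```python
def select_spots(spots, min_density=1500.0, min_ratio=1.2):
    """Return names of array spots that pass both selection criteria.

    `spots` maps a spot name to (control density, treated density).
    Assumed rule: the brighter condition reaches `min_density`, and the
    ratio of the larger to the smaller density exceeds `min_ratio`.
    """
    selected = []
    for name, (control, treated) in spots.items():
        low, high = sorted((control, treated))
        if high < min_density or low <= 0:
            continue  # too faint, or ratio undefined
        if high / low > min_ratio:
            selected.append(name)
    return selected


# Invented densities for three hypothetical spots.
spots = {
    "spot_A": (2000.0, 3000.0),  # bright and changed by 1.5-fold
    "spot_B": (1000.0, 1400.0),  # changed, but below the density cutoff
    "spot_C": (2000.0, 2100.0),  # bright, but changed by only 1.05-fold
}

print(select_spots(spots))
```

      Reporting the full density table alongside such a filter lets readers check how sensitive the selected set is to the two cutoffs.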

      Reviewer #3 (Recommendations For The Authors):

      Detailed statistical power calculation

      Data availability statement (were both mouse and human scRNA data deposited in GEO with a token, and when will they be released to the public?)

      Answer: Human scRNA data have been deposited in GEO under accession number GSE206528. Our mouse scRNA dataset has been uploaded to KoNA and is available for download (https://www.kobic.re.kr/kona/review?encrypt_url=amlod2FucGFya3xLQUQyMzAxMDEz)

      Major concerns about this work

      1) In the single cell RNAseq data collected for mouse diabetic ED (Fig 1B), FB are the most abundant cell population compared to PC, EC, SMC and other clusters. The rationale for studying FB clusters (in Figure 1, D-F) instead of the PC cluster is unclear. Which cluster's DEGs did the authors annotate for Fig 1G-H?

      Answer: We understand the reviewer's suggestion and confusion. Although other major cell populations in penile tissue, such as smooth muscle cells, endothelial cells, and fibroblasts, have been extensively studied, pericytes have mainly been investigated in the context of the central nervous system (CNS). For example, in the CNS, pericytes are involved in maintaining the integrity of the blood-brain barrier (BBB) [PMID: 27916653], regulating blood flow at capillary junctions [PMID: 33051294], and promoting neuroinflammatory processes [PMID: 31316352], and their dysfunction is considered an important factor in the progression of vascular diseases such as Alzheimer's disease [PMID: 24946075]. However, little is known about the role of pericytes in penile tissue [PMID: 35865945; PMID: 36009395; PMID: 26044953]. To explore the role of pericytes in repairing the corpus cavernosum vascular and neural tissues damaged by DM, we focused on pericytes, which are multipotent perivascular cells that contribute to the generation and repair of various tissues in response to injury. Although recent studies have shown that pericytes are involved in physiological mechanisms of erection, little is known about the detailed mechanisms. We have also added this rationale to the Discussion.

      No single-cell-level study has previously been conducted in mouse penile tissues. Therefore, before delving into pericytes, we aimed to identify overall transcriptome differences between normal and diabetic conditions in mouse penile tissues. We presented the analyses of FB, which make up the largest proportion of the cell types in the mouse penis, in Fig. 1D-F. The analysis of other cell types is provided in Supplementary Fig. 1-4. Fig. 1G-H show GO terms for the fibroblast clusters. We added this information to the figure.

      2) Fig 2 is the critical data to show Lbh is a cavernous PC specific marker. More PC violin plots to identify PC cluster such as Cspg4, Kcnj8, Higd1b, Cox4i2 and more SMC violin plots to identify SMC cluster such as Acta2, Myh11, Tagln, Actg2 should be used for inclusion and exclusion of PC (the same concern applied to human scRNAseq in Fig 5B).

      Answer: We appreciate this comment. We examined the expression of other marker genes of pericytes and SMCs. Although some marker genes were rarely expressed in the mouse penis data (Kcnj8, Higd1b), the expression of marker genes tended to be relatively high in each cluster. The expression of Cspg4 and Cox4i2 was higher in pericytes than in SMCs, while the expression of Acta2, Myh11, and Tagln was higher in SMCs than in pericytes. Actg2 was specifically expressed in SMCs. Through the gene set enrichment test as well as the expression of known cell type marker genes, we confirmed that the annotation of pericytes and SMCs was appropriate (Fig. 2B and Fig. 5C). We added the violin plots of these marker genes in Supplementary Fig. 5.

      Author response image 1.

      (Mouse)

      In human penis data, ACTA2 and MYH11 were expressed in SMCs, pericytes, and myofibroblasts, as in the previous paper [PMID: 35879305]. Among pericyte markers, the number of cells expressing KCNJ8 and HIGD1B was small. The cluster we annotated as pericyte was double positive for pericyte markers CSPG4 and COX4I2. ACTG2, a marker for SMC, was expressed more highly in SMC than in pericytes and myofibroblasts. As in the mouse penis data, we identified that the annotation of each cell type was appropriate through the gene set enrichment test in the human penis data. We added the violin plots of CSPG4, COX4I2, and ACTG2 in Supplementary Fig. 11.

      Author response image 2.

      (Human)

      When exploring Lbh expression levels in "Database of gene expression in adult mouse brain and lung vascular and perivascular cells" from https://betsholtzlab.org/VascularSingleCells/database.html, Lbh is not uniquely expressed in PC, suggesting its tissue-specific expression level. This difference should be discussed in the Discussion section.

      Answer: We appreciate this valuable comment. For the answer to this comment, we extensively analyzed Lbh expression patterns in various mouse tissues using the public mouse single-cell atlas (Tabula Muris) as also suggested by Reviewer 2. Please see our detailed response in reviewer 2’s public review 1.

      3) In prior studies on PC morphology and location (PMID: 21839917), they reside in capillaries (diameter less than 10um) or distal vessels (diameter less than 25um) and have oval cell body and long processes. Due to the non-specificity of Pdgfrb, SMC are positive for Pdgfrb staining (this has been shown in many publications that SMC are Pdgfrb+; unfortunately, NG2 antibody also stains for both PC and SMC). Therefore, the LBH immunostaining (in Fig 2D and 2E of large-sized vessels) are very likely for SMC identity, not PC. PC should be in close contact with CD31+ ECs in healthy conditions. The LBH immunostaining of PC in both mouse and human tissues (Fig 4) must be replaced and better characterized.

      Answer: We agree with the reviewer's suggestion. As is widely known, pericytes are primarily located in capillaries, where they surround endothelial cells of blood vessels. However, recent discoveries have identified cells with pericyte-like characteristics in the walls of large blood vessels, challenging the traditional concept [PMID: 27268036]. In our study, we observed minimal overlap in staining between LBH and α-SMA, suggesting that the cells expressing LBH were not smooth muscle cells but possibly pericyte-like cells in large vessels. In small vessels within the bladder, kidney, and even the aorta, we found LBH-expressing cells surrounding CD31-expressing vessels, consistent with the known characteristics of pericytes. Further research is needed to understand the differences in LBH expression and its characteristics in both large and small blood vessels. We have added discussion and references on this issue (Please see revised ‘Discussion’ and ‘Reference’)

      4) How were the mouse cavernous pericytes isolated? What is their purity?

      Answer: In response to the reviewer's question, we isolated mouse cavernous pericytes following our own and other previously published methods. We used pigment epithelium-derived factor (PEDF), which removes non-pericytic cells [PMID: 30929324, 23493068]. Although we have no purity measurements such as FACS, other staining results support the notion that this method yields pericytes with a notably high level of purity. (Please see ‘Method’ section).

      5) Can mouse scRNAseq cell-cell communication in Fig 3 be reproducible in human scRNAseq cell-cell communication? The results in human ED are more clinically significant than in mouse data.

      Answer: In the human scRNAseq data, the difference in angiogenesis-related interactions between normal and diabetic samples was not as significant as in the mouse data. Because the cell type composition of the human and mouse penis is not completely identical, there are limitations in comparing cell-cell interactions. However, in the human penis data, some interactions related to angiogenesis between pericytes and other cell types were decreased in diabetes compared to normal (boxed parts).

      Author response image 3.

      6) Fibroblasts also express Vim. Murine PC VIM/CRYAB( should be written as Vim/Cryab as mouse proteins) direct interaction with Lbh is unclear from Lbh IP as Fig 6A red boxes showed a wide range of sizes. Where is the band for Lbh? Do human PC LBH interact with VIM/CRYAB?

      Answer: We agree with the reviewer's comment. VIM is a type III intermediate filament protein expressed in many cell types. We have added the relevant controls (Input) and performed Co-IP (IP: CRYAB or VIM, WB: LBH) to demonstrate CRYAB and VIM are not simply cross-reactive antigens to their LBH antibody. In western blot study, the LBH band was expressed between 35 kDa-48 kDa. From Figure 6A, we detected CRYAB in band 1 and VIM in bands 2 and 3. This may be due to the formation of dimers or multimers by VIM. We did not use human PCs for IP studies because IP requires large amounts of protein, making IP studies using human pericyte challenging. Nevertheless, the interaction between LBH and CRYAB in humans has been reported through fluorescent resonance energy transfer assay and affinity chromatography technology assay [PMID:34000384, PMID:20587334].

      7) In Fig 6H and I, why does CRYAB expression significantly reduce in vitro and in vivo under diabetic conditions, whereas VIM expression significantly increases?

      Answer: As the reviewer pointed out, and as we discuss in the manuscript, CRYAB is known to promote angiogenesis. Diabetes reduces CRYAB expression, so angiogenesis may be impaired. Furthermore, VIM is a multifunctional protein that interacts with several other proteins under various pathophysiological conditions. Several reports show that VIM expression is increased under diabetic conditions [PMID: 28348116 and PMID: 32557212]. In addition, VIM deficiency protects against obesity and insulin resistance in patients with type 2 diabetes. We therefore hypothesize that exogenous LBH may bind to the increased VIM in diabetic conditions and inactivate its effects, thereby achieving the protective effect. This needs to be tested in further studies.

      8) The therapeutic strategies targeting (Lbh-Cryab-Vim) on mouse diabetic ED model is not investigated and need to be further validated and discussed.

      Answer: As the reviewers pointed out, in this study, we did not evaluate the targeted therapeutic strategy for LBH-CRYAB-VIM in a mouse diabetic ED model. We only identified the binding potential of these three proteins. Evaluation of this treatment strategy requires further study. For example, we can employ shRNA lentivirus, either alone or in combination, to downregulate CRYAB expression [PMID: 31612679] in normal mice, utilize a lentiviral vector CMV-GFP-puro-vimentin to overexpress Vimentin [PMID: 36912679], and then treat the mice with LBH to evaluate whether the LBH effect still exists (in vivo erectile function study and in vitro angiogenesis assay). We include this information in the Discussion section as a limitation of this study (Please see revised ‘Discussion’).

      9) The Discussion of current knowledge of pericytes in diabetic ED and other diseases and the significance of this study as well as clinical implications, should be expanded.

      Answer: As the reviewers pointed out, we have expanded the current knowledge of pericytes in diabetic ED and other diseases (CNS disease) and clinical implications as follows: “Although other major cell populations in penile tissue such as smooth muscle cells, endothelial cells, and fibroblasts have been extensively studied, pericytes have mainly been investigated in the context of the central nervous system (CNS). For example, in the CNS, pericytes are involved in maintaining the integrity of the blood-brain barrier (BBB), regulating blood flow at capillary junctions, and promoting neuroinflammatory processes, and their dysfunction is considered an important factor in the progression of vascular diseases such as Alzheimer's disease. But little is known about the role of pericytes in penile tissue.” (Please see revised ‘Discussion’).

      10) How many clinical samples were used? How many times did each experiment repeat?

      Answer: As the reviewers pointed out, the clinical samples’ information was added in the ‘method’ section. A total of four human samples were used in this study (‘human corpus cavernosum tissues were obtained from two patients with congenital penile curvature (59-year-old and 47-year-old) who had normal erectile function during reconstructive penile surgery and two patients with diabetic ED (69-year-old and 56-year-old) during penile prosthesis implantation.’). For in vivo study, we quantified four different fields from human samples.

      Minor concerns

      1) Fig 1A, why normal mouse's body size is the same as DM?

      Answer: As the reviewer pointed out, in Figure 1A, while the size of normal mice and DM mice may not appear significantly different, there are indeed notable differences in body weight and size. The body weight of the normal mice we used was about 30 grams, while the body weight of DM mice was generally less than 24 grams. We found that we had omitted the physiological and metabolic parameters from the in vivo studies (ICP function study). Therefore, we have added them in Supplementary Table 2 (Please see revised ‘Supplementary information’).

      2) The label and negative, and positive controls for Fig 6B are missing.

      Answer: We thank the reviewer for pointing this out. We have added the relevant controls (Input) and performed Co-IP (IP: CRYAB or VIM1, WB: LBH) to demonstrate CRYAB and VIM1 are not simply cross-reactive antigens to their LBH antibody, and all IP was replicated at least 3 times. (Please see revised ‘Result’ and ‘Figure 6B’)

      3) The limitation of this study and future work should be discussed.

      Answer: As the reviewer pointed out, we have added the limitation of this study and future direction in the discussion section (Please see revised ‘Discussion’).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors report an fMRI investigation of the neural mechanisms by which selective attention allows capacity-limited perceptual systems to preferentially represent task-relevant visual stimuli. Specifically, they examine competitive interactions between two simultaneously-presented items from different categories, to reveal how task-directed attention to one of them modulates the activity of brain regions that respond to both. The specific hypothesis is that attention will bias responses to be more like those elicited by the relevant object presented on its own, and further that this modulation will be stronger for more dissimilar stimulus pairs. This pattern was confirmed in univariate analyses that measured the mass response of a priori regions of interest, as well as multivariate analyses that considered the patterns of evoked activity within the same regions. The authors follow these neuroimaging results with a simulation study that favours a "tuning" mechanism of attention (enhanced responses to highly effective stimuli, and suppression for ineffective stimuli) to explain this pattern.

      Strengths:

      The manuscript clearly articulates a core issue in the cognitive neuroscience of attention, namely the need to understand how limited perceptual systems cope with complex environments in the service of the observer's goals. The use of a priori regions of interest, and the inclusion of both univariate and multivariate analyses as well as a simple model, are further strengths. The authors carefully derive clear indices of attentional effects (for both univariate and multivariate analyses) which makes explication of their findings easy to follow.

      Weaknesses:

      There are some relatively minor weaknesses in presentation, where the motivation behind some of the procedural decisions could be clearer. There are some apparently paradoxical findings reported -- namely, cases in which the univariate response to pairs of stimuli is greater than to the preferred stimulus alone -- that are not addressed. It is possible that some of the main findings may be attributable to range effects: notwithstanding the paradox just noted, it seems that a floor effect should minimise the range of possible attentional modulation of the responses to two highly similar stimuli. One possible limitation of the modelled results is that they do not reveal any attentional modulation at all under the assumptions of the gain model, for any pair of conditions, implying that as implemented the model may not be correctly capturing the assumptions of that hypothesis.

      We thank the reviewer for the constructive comments. In response, in the current version of the manuscript we have improved the presentation. We further discuss how the response in paired conditions is in some cases higher than the response to the preferred stimulus in this letter. For this, we provide a vector illustration, and a supplementary figure of the sum of weights to show that the weights of isolated-stimulus responses for each category pair are not bound to the similarity of the two isolated responses.

      Regarding the simulation results, we have clarified that the univariate effect of attention is not the attentional modulation itself, but the change in the amount of attentional modulation in the two paired conditions. We provide an explanation for this in this letter below, and have changed the term “attentional modulation” to “univariate shift” in the manuscript to avoid the confusion.

      Reviewer #2 (Public Review):

      Summary:

      In an fMRI study requiring participants to attend to one or another object category, either when the object was presented in isolation or with another object superimposed, the authors compared measured univariate and multivariate activation from object-selective and early visual cortex to predictions derived from response gain and tuning sharpening models. They observed a consistent result across higher-level visual cortex that more-divergent responses to isolated stimuli from category pairs predicted a greater modulation by attention when attending to a single stimulus from the category pair presented simultaneously, and argue via simulations that this must be explained by tuning sharpening for object categories.

      Strengths:

      - Interesting experiment design & approach - testing how category similarity impacts neural modulations induced by attention is an important question, and the experimental approach is principled and clever.

      - Examination of both univariate and multivariate signals is an important analysis strategy.

      - The acquired dataset will be useful for future modeling studies.

      Weaknesses:

      - The experimental design does not allow for a neutral 'baseline' estimate of neural responses to stimulus categories absent attention (e.g., attend fixation), nor of the combination of the stimulus categories. This seems critical for interpreting results (e.g., how should readers understand univariate results like that plotted in Fig. 4C-D, where the univariate response is greater for 2 stimuli than one, but the analyses are based on a shift between each extreme activation level?).

      We are happy to clarify our research rationale. We aimed to compare responses in paired conditions when the stimuli were kept constant while varying the attentional target. After we showed that the change in the attentional target resulted in a response change, we compared the amount of this response change across different stimulus category pairs to investigate the effect of representation similarity between the target and the distractor on the response modulation caused by attentional shift. While an estimate of the neural responses in the absence of attention might be useful for other modeling studies, it would not provide us with more information than the current data to answer the question of this study.

      Regarding the univariate results in Fig. 4C-D (and other equivalent ROI results in the revised version) and our analyses, we did not impose any limit on the estimated weights of the two isolated responses in the paired response and thus the sum of the two weights could be any number. We however see that the name “weighted average”, which implies a sum of weights capped at one, has been misleading. We have now changed the name of this model to “linear combination” to avoid confusion.

      Previous studies (Reddy et al., 2009, Doostani et al., 2023) using a similar approach have shown a related results pattern: the response to multiple stimuli is higher than the average, but lower than the sum of the isolated responses, which is exactly what our results suggest. We have added discussion on this topic in the Results section in lines 409-413 for clarification:

      “Note that the response in paired conditions can be higher or lower than the response to the isolated more preferred stimulus (condition Mat), depending on the voxel response to the two presented stimuli, as previously reported (Doostani et al. 2023). This is consistent with previous studies reporting the response to multiple stimuli to be higher than the average, but lower than the sum of the response to isolated stimuli (Reddy et al. 2009).”

      We are not sure what the reviewer means by “each extreme activation level”. Our analyses are based on all four conditions. The two isolated conditions are used to calculate the distance measures and the two paired conditions are used for calculating the shift index. Please note that either the isolated or the paired conditions could show the highest response and we see both cases in our data. For example, as shown in Figure 4A in EBA, the isolated Body condition and the paired BodyatCar condition show the highest activation levels for the Body-Car pair, whereas in Figure 4C, the two paired conditions (BodyatCat and BodyCatat) elicit the highest response.

      - Related, simulations assume there exists some non-attended baseline state of each individual object representation, yet this isn't measured, and the way it's inferred to drive the simulations isn't clearly described.

      We agree that the simulations assume a non-attended baseline state, and that we did not measure that state empirically. We needed this non-attended response in the simulations to test which attention mechanism led to the observed results. Thus, we generated the non-attended response using the data reported in previous neural studies of object recognition and attention in the visual cortex (Ni et al., 2012, Bao and Tsao, 2018). Note that the simulations are checking for the profile of the modulations based on category distance. Thus, they do not need to exactly match the real isolated responses in order to show the effect of gain and tuning shift on the results. We include the clarification and the range of neural responses and attention parameters used in the simulations in the revised manuscript in lines 327-333:

      “To examine which attentional mechanism leads to the effects observed in the empirical data, we generated the neural response to unattended object stimuli as a baseline response in the absence of attention, using the data reported by neural studies of object recognition in the visual cortex (Ni et al., 2012, Bao and Tsao, 2018). Then, using an attention parameter for each neuron and different attentional mechanisms, we simulated the response of each neuron to the different task conditions in our experiment. Finally, we assessed the population response by averaging neural responses.”

      - Some of the simulation results seem to be algebraic (univariate; Fig. 7; multivariate, gain model; Fig. 8)

      This is correct. We have used algebraic equations for the effect of attention on neural responses in the simulations. In fact, thinking about the two models of gain and tuning shift leads to the algebraic equations, which in turn logically leads to the observed results, if no noise is added to the data. The simulations are helpful for visualizing these logical conclusions. Also, after assigning different noise levels to each condition for each neuron, the results are not algebraic anymore which is shown in updated Figure 7 and Figure 8.

      - Cross-validation does not seem to be employed - strong/weak categories seem to be assigned based on the same data used for computing DVs of interest - to minimize the potential for circularity in analyses, it would be better to define preferred categories using separate data from that used to quantify - perhaps using a cross-validation scheme? This appears to be implemented in Reddy et al. (2009), a paper implementing a similar multivariate method and cited by the authors (their ref 6).

      Thank you for pointing out the missing details about how we used cross-validation. In the univariate analysis, we did use cross validation, defining preferred categories and calculating category distance on one half of the data and calculating the univariate shift on the other half of the data. Similarly, we employed cross-validation for the multivariate analysis by using one half of the data to calculate the multivariate distance between category pairs, and the other half of the data to calculate the weight shift for each category pair. We have now added this methodological information in the revised manuscript.
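
      The split-half logic described above can be sketched as follows. This is only an illustration under stated assumptions: the dictionary keys, the alternating-run split, and the use of a simple difference as the "shift" are hypothetical choices made for the example, not the definitions used in the manuscript.

```python
from statistics import mean

def split_half_shift(runs):
    """Split-half estimate for one category pair (x, y) in one ROI.

    `runs` is a list of dicts, one per run, holding mean ROI responses
    under hypothetical keys: 'x_iso', 'y_iso' (isolated conditions) and
    'x_att_pair', 'y_att_pair' (paired conditions, x or y attended).
    """
    half_a, half_b = runs[0::2], runs[1::2]  # alternate runs into two halves

    # Half A only: decide which category is more preferred and how far
    # apart the two isolated responses are.
    x_iso = mean(r["x_iso"] for r in half_a)
    y_iso = mean(r["y_iso"] for r in half_a)
    x_is_preferred = x_iso >= y_iso
    distance = abs(x_iso - y_iso)

    # Half B only: change in the paired response when attention moves
    # from the less preferred to the more preferred category.
    x_att = mean(r["x_att_pair"] for r in half_b)
    y_att = mean(r["y_att_pair"] for r in half_b)
    shift = (x_att - y_att) if x_is_preferred else (y_att - x_att)
    return distance, shift
```

      Because the preference label and the distance come from one half and the shift from the other, noise that happens to inflate one isolated response cannot also inflate the shift measured for that category.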

      - Multivariate distance metric - why is correlation/cosine similarity used instead of something like Euclidean or Mahalanobis distance? Correlation/cosine similarity is scale-invariant, so changes in the magnitude of the vector would not change distance, despite this likely being an important data attribute to consider.

      Since we are considering response patterns as vectors in each ROI, there is no major difference between the two measures for similarity. Using Euclidean distance as a measure of distance (i.e. inverse of similarity), we observed the same relationship between weight shift and category Euclidean distance. There was a positive correlation between weight shift and the Euclidean category distance in all ROIs (ps < 0.01, ts > 2.9) except for V1 (p = 0.5, t = 0.66). We include this information in the revised manuscript in the Results section lines 513-515:

      “We also calculated category distance based on the Euclidean distance between response patterns of category pairs and observed a similarly positive correlation between the weight shift and the Euclidean category distance in all ROIs (ps < 0.01, ts > 2.9) except V1 (p = 0.5, t = 0.66).”
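
      The reviewer's point about scale invariance can be made concrete with a small sketch. The two "response patterns" below are made-up numbers used only for illustration: doubling one pattern leaves the correlation-based distance unchanged but changes the Euclidean distance.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two response patterns."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def correlation_distance(u, v):
    """1 minus the Pearson correlation between two response patterns."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return 1.0 - num / den

# Hypothetical voxel patterns for two categories in one ROI.
body = [1.0, 2.0, 4.0, 3.0]
car = [2.0, 1.0, 3.0, 5.0]
car_scaled = [2.0 * x for x in car]  # same pattern, twice the amplitude

print(correlation_distance(body, car), correlation_distance(body, car_scaled))
print(euclidean_distance(body, car), euclidean_distance(body, car_scaled))
```

      Reporting both measures, as done in the response above, therefore checks that the relationship with weight shift does not depend on whether response amplitude enters the distance.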

      - Details about simulations implemented (and their algebraic results in some cases) make it challenging to interpret or understand these results. E.g., the noise properties of the simulated data aren't disclosed, nor are precise (or approximate) values used for simulating attentional modulations.

      We clarify that the average response to each category was based on previous neurophysiology studies (Ni et al., 2012, Bao and Tsao, 2018). The attentional parameter was also chosen based on previous neurophysiology (Ni et al., 2012) and human fMRI (Doostani et al., 2023) studies of visual attention by randomly assigning a value in the range from 1 to 10. We have included the details in the Methods section in lines 357-366:

      “We simulated the action of the response gain model and the tuning sharpening model using numerical simulations. We composed a neural population of 4 × 10^5 neurons in equal proportions body-, car-, cat- or house-selective. Each neuron also responded to object categories other than its preferred category, but to a lesser degree and with variation. We chose neural responses to each stimulus from a normal distribution with the mean of 30 spikes/s and standard deviation of 10 and each neuron was randomly assigned an attention factor in the range between 1 and 10 using a uniform distribution. These values are comparable with the values reported in neural studies of attention and object recognition in the ventral visual cortex (Ni et al. 2012, Bao and Tsao 2018). We also added Poisson noise to the response of each neuron (Britten et al. 1993), assigned randomly for each condition of each neuron.”
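
      A minimal sketch of such a simulation is given below, using only the parameters quoted above (responses drawn from a normal distribution with mean 30 and standard deviation 10, attention factors uniform between 1 and 10). Everything else is an assumption made for illustration: the population is much smaller than in the manuscript, the largest draw is assigned to the preferred category, the two attention rules are one simple reading of "response gain" and "tuning sharpening", and the noise is a Gaussian approximation to Poisson noise rather than an exact Poisson sampler.

```python
import random

CATEGORIES = ["body", "car", "cat", "house"]

def make_neuron(preferred, rng):
    """Build one model neuron with a baseline rate per category."""
    # Four draws from N(30, 10) spikes/s, clipped at zero. Assumption:
    # the largest draw becomes the response to the preferred category.
    draws = sorted((max(rng.gauss(30.0, 10.0), 0.0) for _ in CATEGORIES),
                   reverse=True)
    others = [c for c in CATEGORIES if c != preferred]
    rng.shuffle(others)
    rates = dict(zip([preferred] + others, draws))
    return {"preferred": preferred, "rates": rates,
            "beta": rng.uniform(1.0, 10.0)}  # attention factor

def attended_response(neuron, stimulus, mechanism):
    """Noise-free response to an attended stimulus under one mechanism."""
    base, beta = neuron["rates"][stimulus], neuron["beta"]
    if mechanism == "gain":
        # Assumed gain rule: scale up whatever stimulus is attended.
        return base * beta
    if mechanism == "tuning":
        # Assumed sharpening rule: enhance the preferred category,
        # suppress the others.
        return base * beta if stimulus == neuron["preferred"] else base / beta
    raise ValueError(f"unknown mechanism: {mechanism}")

def noisy(rate, rng):
    """Gaussian approximation to Poisson noise (variance equal to mean)."""
    return max(rng.gauss(rate, rate ** 0.5), 0.0)

def population_response(neurons, stimulus, mechanism, rng):
    """Average noisy response across the population."""
    total = sum(noisy(attended_response(n, stimulus, mechanism), rng)
                for n in neurons)
    return total / len(neurons)

rng = random.Random(0)
neurons = [make_neuron(CATEGORIES[i % 4], rng) for i in range(4000)]
for mechanism in ("gain", "tuning"):
    print(mechanism, population_response(neurons, "body", mechanism, rng))
```

      The sketch covers only isolated attended stimuli; how responses to paired stimuli are combined is not specified in this letter and is left out rather than guessed.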

      - Eye movements do not seem to be controlled nor measured. Could it be possible that some stimulus pairs result in more discriminable patterns of eye movements? Could this be ruled out by some aspect of the results?

      Subjects were instructed to direct their gaze towards the fixation point. Given the variation in the pose and orientation of the stimuli, it is unlikely that eye movements would help with the task. Eye movements have been controlled in previous experiments with individual stimulus presentation (Xu and Vaziri-Pashkam, 2019) and across attentional tasks in which colored dots were superimposed on the stimuli (Vaziri-Pashkam and Xu, 2017) and no significant difference for eye movement across categories or conditions was observed. As such, we do not think that eye movements would play a role in the results we are observing here.

      - A central, and untested/verified, assumption is that the multivariate activation pattern associated with 2 overlapping stimuli (with one attended) can be modeled as a weighted combination of the activation pattern associated with the individual stimuli. There are hints in the univariate data (e.g., Fig. 4C; 4D) that this might not be justified, which somewhat calls into question the interpretability of the multivariate results.

      If the reviewer is referring to the higher response in the paired compared to the isolated conditions, as explained above, we have not forced any limit on the sum of the estimated weights to equal 1 or 2. Therefore, our model is an estimation of a linear combination of the two multivariate patterns in the isolated conditions. In fact, Reddy et al. (reference 6) reported that while the combination is closer to a weighted average than to a weighted sum, the sum of the weights is on average larger than 1. In Figure 4C and 4D the responses in the paired conditions are higher than either of the isolated-condition responses. This suggests that the weights for the linear combination of isolated responses in the multivariate analysis should add up to more than one. This is what we find in our results. We have added a supplementary figure to Figure 6, depicting the sum of weights for different category pairs in all ROIs. The figure illustrates that in each ROI, the sum of weights is greater than 1 for some category pairs. It is however noteworthy that we normalized the weights in each condition by the sum of weights to calculate the weight shift in our analysis. The amount of the weight shift was therefore not affected by the absolute value of the weights.
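
      The unconstrained linear combination and the normalization step can be sketched as below. The patterns and weights are made-up numbers, and the weight shift is written here as the change in the normalized weight of one stimulus when the attentional target changes, which is an assumed form for illustration and may differ in detail from the index used in the manuscript.

```python
def fit_weights(pair, iso_x, iso_y):
    """Least-squares a1, a2 such that pair ~ a1*iso_x + a2*iso_y.

    No intercept and no constraint on a1 + a2; solved through the
    2x2 normal equations.
    """
    sxx = sum(a * a for a in iso_x)
    syy = sum(b * b for b in iso_y)
    sxy = sum(a * b for a, b in zip(iso_x, iso_y))
    sxp = sum(a * p for a, p in zip(iso_x, pair))
    syp = sum(b * p for b, p in zip(iso_y, pair))
    det = sxx * syy - sxy ** 2
    a1 = (sxp * syy - syp * sxy) / det
    a2 = (syp * sxx - sxp * sxy) / det
    return a1, a2

def normalized(a1, a2):
    """Divide each weight by the sum of weights."""
    total = a1 + a2
    return a1 / total, a2 / total

# Hypothetical isolated patterns and paired patterns built from them.
iso_x = [1.0, 0.0, 2.0, 1.0]
iso_y = [0.0, 1.0, 1.0, 2.0]
pair_x_att = [0.9 * x + 0.5 * y for x, y in zip(iso_x, iso_y)]  # x attended
pair_y_att = [0.4 * x + 0.8 * y for x, y in zip(iso_x, iso_y)]  # y attended

w_x_att = fit_weights(pair_x_att, iso_x, iso_y)
w_y_att = fit_weights(pair_y_att, iso_x, iso_y)

# Assumed weight shift: change in the normalized weight of x when the
# attentional target moves from y to x.
weight_shift = normalized(*w_x_att)[0] - normalized(*w_y_att)[0]
print(w_x_att, w_y_att, weight_shift)
```

      In this example both fitted weight pairs sum to more than 1, yet the shift is computed on normalized weights and so does not depend on that sum.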

      - Throughout the manuscript, the authors consistently refer to "tuning sharpening", an idea that's almost always used to reference changes in the width of tuning curves for specific feature dimensions (e.g., motion direction; hue; orientation; spatial position). Here, the authors are assaying tuning to the category (across exemplars of the category). The link between these concepts could be strengthened to improve the clarity of the manuscript.

      The reviewer brings up an excellent point. Whereas tuning curves have been extensively used for feature dimensions such as stimulus orientation or motion direction, here, we used the term to describe the variation in a neuron’s response to different object stimuli.

      With a finite set of object categories, as is the case in the current study, the neural response in object space is discrete, rather than a continuous curve illustrated for features such as stimulus orientation. However, since more preferred and less preferred features (objects in this case) can still be defined, we illustrated the neural response using a hypothetical curve in object space in Figure 3 to show how it relates with other stimulus features. Therefore, here, tuning sharpening refers to the fact that the response to the more preferred object categories has been enhanced while the response to the less preferred stimulus categories is suppressed.

      We clarify this point in the revised manuscript in the Discussion section lines 649-659:

      “While tuning curves are commonly used for feature dimensions such as stimulus orientation or motion direction, here, we used the term to describe the variation in a neuron’s response to different object stimuli. With a finite set of object categories, as is the case in the current study, the neural response in object space is discrete, rather than a continuous curve illustrated for features such as stimulus orientation. The neuron might have tuning for a particular feature such as curvature or spikiness (Bao et al., 2020) that is present to different degrees in our object stimuli in a continuous way, but we are not measuring this directly. Nevertheless, since more preferred and less preferred features (objects in this case) can still be defined, we illustrate the neural response using a hypothetical curve in object space. As such, here, tuning sharpening refers to the fact that the response to the more preferred object categories has been enhanced while the response to the less preferred stimulus categories is suppressed.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      a. The authors should address the apparent paradox noted above (and report whether it is seen in other regions of interest as well). On what model would the response to any pair of stimuli exceed that of the response to the preferred stimulus alone? This implies some kind of Gestalt interaction whereby the combined pair generates a percept that is even more effective for the voxels in question than the "most preferred" one?

      The response to a pair of stimuli can exceed the response to each of the stimuli presented in isolation if the voxel is responsive to both stimuli and as long as the voxel has not reached its saturation level. This phenomenon has been reported in many previous studies (Zoccolan et al., 2005, Reddy et al., 2009, Ni et al., 2012, Doostani et al., 2023) and can be modeled using a linear combination model which does not limit the weights of the isolated responses to equal 1 (Doostani et al., 2023). Note that the “most preferred” stimulus does not necessarily saturate the voxel response, thus the response to two stimuli could be more effective based on voxel responsiveness to the second stimulus.

      As for the current study, the labels “more preferred” and “less preferred” are only relatively defined (as explained in the Methods section), meaning that the more preferred stimulus is not necessarily the most preferred stimulus for the voxels. Furthermore, the presented stimuli are semi-transparent and presented with low-contrast, which moves the responses further away from the saturation level. Based on reported evidence for multiple-stimulus responses, responses to single stimuli are in many cases sublinearly added to yield the multiple-stimulus response (Zoccolan et al., 2005, Reddy et al., 2009, Doostani et al., 2023). This means that the multiple-stimulus response is lower than the sum of the isolated responses and not lower than each of the isolated responses. Therefore, it is not paradoxical to observe higher responses in paired conditions compared to the isolated conditions. We observe similar results in other ROIs, which we provide as supplementary figures to Figure 4 in the revised manuscript.

      We address this observation and similar reports in previous studies in the Results section of the revised manuscript in lines 409-413:

      “Note that the response in paired conditions can be higher or lower than the response to the isolated more preferred stimulus (condition Mat), depending on the voxel preference for the two presented stimuli, as previously reported (Doostani et al., 2023). This is consistent with previous studies reporting the response to multiple stimuli to be higher than the average, but lower than the sum of the response to isolated stimuli (Reddy et al., 2009).”

      b. Paradox aside, I wondered to what extent the results are in part explained by range limits. Take two categories that evoke a highly similar response (either mean over a full ROI, or in the multivariate sense). That imposes a range limit such that attentional modulation, if it works the way we think it does, could only move responses within that narrow range. In contrast, the starting point for two highly dissimilar categories leaves room in principle for more modulation.

      We do not believe that the results can be explained by range limits because responses in paired conditions are not limited by the isolated responses, as can be observed in Figure 4. However, to rule out the possibility of the similarity between responses in isolated conditions affecting the range within which responses in paired conditions can change, we turned to the multivariate analysis. We used the weight shift measure as the change in the weight of each stimulus with the change in the attentional target. In this method, no matter how close the two isolated vectors are, the response to the pair could still have a whole range of different weights of the isolated responses. We have plotted an example illustration of two-dimensional vectors for better clarification. Here, the vectors Vxat and Vyat denote the responses to the isolated x and y stimuli, respectively, and the vector Pxaty denotes the response to the paired condition in which stimulus x is attended. The weights a1 and a2 are illustrated in the figure, which are equal to regression coefficients if we solve the equation Pxaty = [a1 a2] [x y]’. While the weight values depend on the amplitude of and the angle between the three vectors, they are not limited by a lower angle between Vxat and Vyat.

      We have updated Figure 2 in the manuscript to avoid the confusion. We have also added a figure including the sum of weights for different category pairs in different regions, showing that the sum of weights is not dependent on the similarity between the two stimuli. The conclusions based on the weight shift are therefore not confounded by the similarity between the two stimuli.
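
      The two-dimensional argument above can be checked numerically with a short sketch. The vectors, angles, and weights are made-up values chosen for illustration: whether the two isolated vectors are 5 or 90 degrees apart, any pair of weights used to build the paired vector is recovered, so a small angle does not restrict the range of possible weights.

```python
import math

def weights_2d(p, vx, vy):
    """Solve p = a1*vx + a2*vy exactly for 2-D vectors (Cramer's rule)."""
    det = vx[0] * vy[1] - vx[1] * vy[0]
    a1 = (p[0] * vy[1] - p[1] * vy[0]) / det
    a2 = (vx[0] * p[1] - vx[1] * p[0]) / det
    return a1, a2

def unit(angle_deg):
    """Unit vector at the given angle from the horizontal axis."""
    rad = math.radians(angle_deg)
    return (math.cos(rad), math.sin(rad))

results = {}
for angle in (5.0, 90.0):               # similar vs dissimilar categories
    vx, vy = (1.0, 0.0), unit(angle)    # isolated responses to x and y
    for true_w in ((0.8, 0.2), (0.2, 0.8)):
        p = (true_w[0] * vx[0] + true_w[1] * vy[0],
             true_w[0] * vx[1] + true_w[1] * vy[1])
        results[(angle, true_w)] = weights_2d(p, vx, vy)

for key, value in results.items():
    print(key, value)
```

      With noisy data a small angle would make the estimates less precise, which is a separate issue from the range of values the weights can take.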

      c. Finally, related to the previous point, while including V1 is a good control, I wonder if it is getting a "fair" test here, because the range of responses to the four categories in this region, in terms of (dis)similarity, seems compressed relative to the other categories.

      We believe that V1 is getting a fair test because the single-subject range of category distance in V1 is similar to LO, as can be observed in Author response image 1:

      Author response image 1.

      Range of category distance in each ROI averaged across participants

      The reason that V1 is showing a more compressed distance range on the average plot is that the category distance in V1 is not consistent among participants. Although the average plots are shown in Figure 5 and Figure 6, we tested statistical significance in each ROI based on single-subject correlation coefficients.

      Please also note that a more compressed range of dissimilarity does not necessarily lead to a less strong effect of category distance on the effect of attention. For instance, while LO shows a more compressed dissimilarity range for the presented categories compared to the other object selective regions, it shows the highest correlation between weight shift and category distance. Furthermore, as illustrated in Figure 5, no significant correlation is observed between univariate shift and category distance in V1, even though the range of the univariate distance in V1 is similar to LO and pFs, where we observed a significant correlation between category distance and univariate shift.

      d. In general, the manuscript does a very good job explaining the methods of the study in a way that would allow replication. In some places, the authors could be clearer about the reasoning behind those methodological choices. For example: - How was the sample size determined?

      Estimating conservatively based on the smallest amount of attentional modulation we observed in a previous study (Doostani et al., 2023), we chose a medium effect size (0.3). For a power of 0.8, the minimum number of participants should be 16. We have added the explanation to the Methods section in lines 78-81:

      “We estimated the number of participants conservatively based on the smallest amount of attentional modulation observed in our previous study (Doostani et al., 2023). For a medium effect size of 0.3 and a power of 0.8, we needed a minimum number of 16 participants.”

      - Why did the authors choose those four categories? What was the evidence that would suggest these would span the range of similarities needed here?

      We chose these four categories based on a previous behavioral study reporting the average reaction time of participants when detecting a target from one category among distractors from another category (Xu and Vaziri-Pashkam, 2019). Ideally the experiment should include as many object categories as possible. However, since we were limited by the duration of the experiment, the number of conditions had to be controlled, leading to a maximum of 4 object categories. We chose two animate and two inanimate object categories to include categories that are more similar and more different based on previous behavioral results (Xu and Vaziri-Pashkam, 2019). We included body and house categories because they are both among the categories to which highly responsive regions exist in the cortex. We chose the two remaining categories based on their similarity to body and house stimuli. In this way, for each category there was another category that elicited similar cortical responses, and two categories that elicited different responses. While we acknowledge that the chosen categories do not fully span the range of similarities, they provide an observable variety of similarities in different ROIs which we find acceptable for the purposes of our study.

      We include this information in the Methods section of the revised manuscript in lines 89-94:

      “We included body and house categories because there are regions in the brain that are highly responsive and unresponsive to each of these categories, which provided us with a range of responsiveness in the visual cortex. We chose the two remaining categories based on previous behavioral results to include categories that provided us with a range of similarities (Xu and Vaziri-Pashkam, 2019). Thus, for each category there was a range of responsiveness in the brain and a range of similarity with the other categories.”

      - Why did the authors present the stimuli at the same location? This procedure has been adopted in previous studies, but of course, it does also move the stimulus situation away from the real-world examples of cluttered scenes that motivate the Introduction.

      We presented the stimuli at the same location because we aimed to study the mechanism of object-based attention and this experimental design helped us isolate it from spatial attention. We do not think that our design moves the stimulus situation away from real-world examples in such a way that our results are not generalizable. We include real-world instances, as well as a discussion on this point, in the Discussion section of the revised manuscript, in lines 611-620:

      “Although examples of superimposed cluttered stimuli are not very common in everyday life, they still do occur in certain situations, for example reading text on the cellphone screen in the presence of reflection and glare on the screen or looking at the street through a patterned window. Such instances recruit object-based attention which was the aim of this study, whereas in more common cases in which attended and unattended objects occupy different locations in space, both space-based and object-based attention may work together to resolve the competition between different stimuli. Here we chose to move away from usual everyday scenarios to study the effect of object-based attention in isolation. Future studies can reveal the effect of target-distractor similarity, i.e. proximity in space, on space-based attention and how the effects caused by object-based and space-based attention interact.”

      - While I'm not concerned about this (all relevant comparisons were within-participants) was there an initial attempt to compare data quality from the two different scanners?

      We compared the SNR values of the two groups of participants and observed no significant difference between these values (ps > 0.34, ts < 0.97). We have added this information to the Methods section.

      Regarding the observed effect, we performed a t-test between the results of the participants from the two scanners. For the univariate results, the observed correlation between univariate attentional modulation and category distance was not significantly different for participants of the two scanners in any ROIs (ps > 0.07, ts < 1.9). For the multivariate results, the observed correlation between the weight shift and multivariate category distance was not significantly different in any ROIs (ps > 0.48, ts < 0.71) except for V1 (p-value = 0.015, t-value = 2.75).

      We include a sentence about the comparison of the SNR values in the preprocessing section in the revised manuscript.

      e. There are a couple of analysis steps that could be applied to the existing data that might strengthen the findings. For one, the authors have adopted a liberal criterion of p < 0.001 uncorrected to include voxels within each ROI. Why, and to what extent is the general pattern of findings robust over more selective thresholds? Also, there are additional regions that are selective for bodies (fusiform body area) and scenes (occipital place area and retrosplenial cortex). Including these areas might provide more diversity of selectivity patterns (e.g. different responses to non-preferred categories) that would provide further tests of the hypothesis.

      We selected this threshold to allow for selection of a reasonable number of voxels in each hemisphere across all participants. To check whether the effect is robust over more selective thresholds, we exemplarily redefined the left EBA region using p < 0.0001 and p < 0.00001 and observed that the weight shift effect remained equivalent. We have made a note of this analysis in the Results section. As for the additional regions suggested by the reviewer, we chose not to include them because they could not be consistently defined in both hemispheres of all participants. Please note that the current ROIs also show different responses to non-preferred categories (e.g. in LO and pFs). We include this information in the Methods section in lines 206-207:

      “We selected this threshold to allow for selection of a reasonable number of voxels in each hemisphere across all participants.”

      And in the Results section in lines 509-512:

      “We performed the analysis including only voxels that had a significantly positive GLM coefficient across the runs and observed the same results. Moreover, to check whether the effect is robust over more selective thresholds for ROI definition, we redefined the left EBA region with p < 0.0001 and p < 0.00001 criteria. We observed a similar weight shift effect for both criteria.”

      f. One point the authors might address is the potential effect of blocking the paired conditions. If I understood right, the irrelevant item in each paired display was from the same category throughout a block. To what extent might this knowledge shape the way participants attend to the task-relevant item (e.g. by highlighting to them certain spatial frequencies or contours that might be useful in making that particular pairwise distinction)? In other words, are there theoretical reasons to expect different effects if the irrelevant category is not predictable?

      We believe that the participants’ knowledge about the distractor does not significantly affect our results because our results are in agreement with previous behavioral data (Cohen et al., 2014, Xu and Vaziri-Pashkam, 2019), in which the distractor could not be predicted. These reports suggest there is a theoretical reason to expect similar effects if the participants could not predict the distractor. To directly test this, one would need to perform an fMRI experiment using an event-related design, an interesting avenue for future research.

      We have made a note of this point in the Discussion section of the revised manuscript in lines 621-626:

      “Please note that we used a blocked design in which the target and distractor categories could be predicted across each block. While it is possible that the current design has led to an enhancement of the observed effect, previous behavioral data (Cohen et al., 2014, Xu and Vaziri-Pashkam, 2019) have reported the same effect in experiments in which the distractor was not predictable. To study the effect of predictability on fMRI responses, however, an event-related design is more appropriate, an interesting venue for future fMRI studies.”

      g. The authors could provide behavioural data as a function of the specific category pairs. There is a clear prediction here about which pairs should be more or less difficult.

      We provide the behavioral data as a supplementary figure to Figure 1 in the revised manuscript. However, we do not see differences in behavior for the different category pairs. This is because our fMRI task was designed to make sure the participants could properly attend to the target in all conditions. The task was rather easy across all conditions and, due to the ceiling effect, there was no significant difference between behavioral performance for different category pairs. However, the effect of category pair on behavior has been previously tested and reported in a visual search paradigm with the same categories (Xu and Vaziri-Pashkam, 2019), which was in fact the basis for our choice of categories in this study (as explained in response to point “d” above).

      h. Figure 4 shows data for EBA in detail; it would be helpful to have a similar presentation of the data for the other ROIs as well.

      We provide data for all ROIs as figure supplements 1-4 to Figure 4 in the revised manuscript.

      i. For the pFs and LOC ROIs, it would be helpful to have an indication of what proportion of voxels was most/least responsive to each of the four categories. Was this a relatively even balance, or generally favouring one of the categories?

      In LO, the proportion of voxels most responsive to each of the four categories was relatively even for Body (31%) and House (32%) stimuli, which was higher than the proportion of Car- and Cat-preferring voxels (18% and 19%, respectively). In pFs, 40% of the voxels were house-selective, while the proportion was relatively even for voxels most responsive to bodies, cars, and cats with 21%, 17%, and 22% of the voxels, respectively. We include the percentage of voxels most responsive to each of the four categories in each ROI as Appendix 1-table 1.

      j. Were the stimuli in the localisers the same as in the main experiment?

      No, we used different sets of stimuli for the localizers and the main experiment. We have added the information in line 146 of the Methods section.

      Reviewer #2 (Recommendations For The Authors):

      (1) Why are specific ROIs chosen? Perhaps some discussion motivating these choices, and addressing the possible overlap between these and retinotopic regions (based on other studies, or atlases - Wang et al, 2015) would be useful.

      Considering that we used object categories, we decided to look at general object-selective regions (LO, pFS) as well as regions that are highly selective for specific categories (EBA, PPA). We also looked at the primary visual cortex as a control region. We have added this clarification in the Methods section lines 128-133:

      “Considering that we used object categories, we investigated five different regions of interest (ROIs): the object-selective areas lateral occipital cortex (LO) and posterior fusiform (pFs) as general object-selective regions, the body-selective extrastriate body area (EBA) and the scene-selective parahippocampal place area (PPA) as regions that are highly selective for specific categories, and the primary visual cortex (V1) as a control region. We chose these regions because they could all be consistently defined in both hemispheres of all participants and included a large number of voxels.”

      (2) The authors should consider including data on the relative prevalence of voxels preferring each category for each ROI (and/or the mean activation level across voxels for each category for each ROI). If some ROIs have very few voxels preferring some categories, there's a chance the observed results are a bit noisy when sorting based on those categories (e.g., if a ROI has essentially no response to a given pair of categories, then there's not likely to be much attentional modulation detectable, because the ROI isn't driven by those categories to begin with).

      We thank the reviewer for the insightful comment.

      We include the percentage of voxels most responsive to each of the four categories in each ROI in the Appendix (Appendix 1-table 1, please see the answer to point “i” of the first reviewer).

      We also provide a table of average activity across voxels for each category in all ROIs as Appendix 1-table 2.

      As shown in the table, voxels show positive activity for all categories in all ROIs except for PPA, where voxels show no response to body and cat stimuli. This might explain why we observed a marginally significant correlation between weight shift and category distance in PPA only. As the reviewer mentions, since this region does not respond to body and cat stimuli, we do not observe a significant change in response due to the shift in attention for some pairs. We include the table in the Appendix and add the explanation to the Results section of the revised manuscript in lines 506-508:

      “Less significant results in PPA might arise from the fact that PPA shows no response to body and cat stimuli and little response to car stimuli (Appendix 1-table 2). Therefore, it is not possible to observe the effect of attention for all category pairs.”

      a. Related - would it make sense to screen voxels for inclusion in analysis based on above-baseline activation for one or both of the categories? [could, for example, imagine you're accidentally measuring from the motor cortex - you'd be able to perform this analysis, but it would be largely nonsensical because there's no established response to the stimuli in either isolated or combined states].

      We performed all the analyses including only voxels that had a significantly positive GLM coefficient across the runs and the results remained the same. We have added the explanation in the Results section in lines 509-510.

      (3) Behavioral performance is compared against chance level, but it doesn't seem that 50% is chance for the detection task. The authors write on page 4 that the 1-back repetition occurred between 2-3 times per block, so it doesn't seem to be the case that each stimulus had a 50% chance of being a repetition of the previous one.

      We apologize for the mistake in our report. We have reported the detection rate for the target-present trials (2-3 per block), not the behavioral performance across all trials. We have modified the sentence in the Results section.

      (4) Authors mention that the stimuli are identical for 2-stimulus trials where each category is attended (for a given pair) - but the cue is different, and the cue appears as a centrally-fixated word for 1 s. Is this incorporated into the GLM? I can't imagine this would have much impact, but the strict statement that the goals of the participant are the only thing differentiating trials with otherwise-identical stimuli isn't quite true.

      The word cue was not incorporated as a separate predictor into the GLM. As the reviewer notes, the signals related to the cue and stimuli are mixed. But given that the cues are brief and in the form of words rather than images, they are unlikely to have an effect on the response in the regions of interest.

      To be more accurate, we have included the clarification in the Methods section in lines 181-182:

      “We did not enter the cue to the GLM as a predictor. The obtained voxel-wise coefficients for each condition are thus related to the cue and the stimuli presented in that condition.”

      And in the Results section in lines 425-428:

      “It is important to note that since the cue was not separately modeled in the GLM, the signals related to the cue and the stimuli were mixed. However, given that the cues were brief and presented in the form of words, they are unlikely to have an effect on the responses observed in the higher-level ROIs.”

      (5) Eq 5: I expected there to be some comparison of a and b directly as ratios (e.g., a_1 > b_1, as shown in Fig. 2). The equations used here should be walked through more carefully - it's very hard to understand what this analysis is actually accomplishing. I'm not sure I follow the explanation of relative weights given by the authors, nor how that maps onto the delta_W quantity in Equation 5.

      We provide a direct comparison of a and b, as well as a more thorough clarification of the analysis, in the Methods section in lines 274-276:

      “We first projected the paired vector on the plane defined by the isolated vectors (Figure 2A) and then determined the weight of each isolated vector in the projected vector (Figure 2B).”

      And in lines 286-297:

      “A higher a1 compared to a2 indicates that the paired response pattern is more similar to Vxat compared to Vyat, and vice versa. For instance, if we calculate the weights of the Body and Car stimuli in the paired response related to the simultaneous presentation of both stimuli, we can write in the LO region: VBodyatCar = 0.81 VBody + 0.31 VCar, VBodyCarat = 0.43 VBody + 0.68 VCar. Note that these weights are averaged across participants. As can be observed, in the presence of both body and car stimuli, the weight of each stimulus is higher when attended compared to the case when it is unattended. In other words, when attention shifts from body to car stimuli, the weight of the isolated body response (VBody) decreases in the paired response. We can therefore observe that the response in the paired condition is more similar to the isolated body response pattern when body stimuli are attended and more similar to the isolated car response pattern when car stimuli are attended.”

      And lines 303-306:

      “As shown here, even when body stimuli are attended, the effect of the unattended car stimuli is still present in the response, shown in the weight of the isolated car response (0.31). However, this weight increases when attention shifts towards car stimuli (0.68 in the attended case).”

      We also provide more detailed clarification for the 𝛥w and the relative weights in lines 309-324:

      “To examine whether this increase in the weight of the attended stimulus was constant or depended on the similarity of the two stimuli in cortical representation, we defined the weight shift as the multivariate effect of attention:

      𝛥w = a1/(a1+a2) – b1/(b1+b2)    (5)

      Here, a1, a2, b1, and b2 are the weights of the isolated responses, estimated using Equation 4. We calculate the weight of the isolated x response once when attention is directed towards x (a1), and a second time when attention is directed towards y (b1). In each case, we calculate the relative weight of the isolated x in the paired response by dividing the weight of the isolated x by the sum of weights of x and y (a1+a2 when attention is directed towards x, and b1+b2 when attention is directed towards y). We then define the weight shift, Δw, as the change in the relative weight of the isolated x response in the paired response when attention shifts from x to y. A higher Δw for a category pair indicates that attention is more efficient in removing the effect of the unattended stimulus in the pair. We used relative weights as a normalized measure to compensate for the difference in the sum of weights for different category pairs. Thus, using the normalized measure, we calculated the share of each stimulus in the paired response. For instance, considering the Body-Car pair, the share of the body stimulus in the paired response was equal to 0.72 and 0.38, when body stimuli were attended and unattended, respectively. We then calculated the change in the share of each stimulus caused by the shift in attention using a simple subtraction (Equation 5: Δw=0.34 for the above example of the Body-Car pair in LO) and used this measure to compare between different pairs.”

      We hope that this clarification makes it easier to understand the multivariate analysis and the weight shift calculation in Equation 5.
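
      As a worked illustration of Equation 5, the sketch below recomputes the relative weights and the weight shift from the participant-averaged LO weights quoted above for the Body-Car pair. The function names `relative_weight` and `weight_shift` are hypothetical and this is not the study's analysis code. Because it starts from averaged weights rather than per-participant values, the results may differ slightly in the last digit from the shares reported in the text.

```python
def relative_weight(w_x, w_y):
    """Share of stimulus x in the paired response: w_x / (w_x + w_y)."""
    return w_x / (w_x + w_y)

def weight_shift(a1, a2, b1, b2):
    """Equation 5: change in the relative weight of x when attention moves from x to y."""
    return relative_weight(a1, a2) - relative_weight(b1, b2)

# Participant-averaged LO weights quoted in the text for the Body-Car pair.
a1, a2 = 0.81, 0.31  # body attended: weights of isolated body and car responses
b1, b2 = 0.43, 0.68  # car attended: weights of isolated body and car responses

share_attended = relative_weight(a1, a2)
share_unattended = relative_weight(b1, b2)
delta_w = weight_shift(a1, a2, b1, b2)

print(f"share of body when attended:   {share_attended:.3f}")
print(f"share of body when unattended: {share_unattended:.3f}")
print(f"weight shift:                  {delta_w:.3f}")
```

      The weight shift obtained this way rounds to the 0.34 quoted for this pair.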

      We additionally provide the values of the weights (a1, b1, a2, and b2) for each category pair averaged across participants as Appendix 1-table 4.

      (6) For multivariate analyses (Fig. 6A-E), x axis is normalized (pattern distance based on Pearson correlation), while the delta_W does not seem to be similarly normalized.

      We calculated ΔW by dividing the weights in each condition by the sum of weights in that condition. Thus, we use relative weights which are always in the range of 0 to 1, and ΔW is thus always in the range of -1 to 1. This means that both axes are normalized. Note that even if one axis were not normalized, the relationship between the independent and the dependent variables would remain the same despite the change in the range of the axis.
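
      The statement that rescaling one axis leaves the relationship unchanged follows from the invariance of the Pearson correlation under positive linear transformations. A minimal sketch with made-up numbers (the arrays `distance` and `shift` are hypothetical and unrelated to the study's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values for six category pairs.
distance = rng.uniform(0.0, 1.0, size=6)
shift = 0.5 * distance + rng.normal(0.0, 0.05, size=6)

# Correlation on the original scale and after rescaling and offsetting one axis.
r_original = np.corrcoef(distance, shift)[0, 1]
r_rescaled = np.corrcoef(distance, 10.0 * shift + 3.0)[0, 1]

print(np.isclose(r_original, r_rescaled))
```

      Note that the bound of 0 to 1 on each relative weight, and hence -1 to 1 on ΔW, holds when the estimated weights are non-negative; the rescaling argument does not depend on that assumption.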

      (7) Simulating additional scenarios like attention to both categories just increasing the mean response would be helpful - is this how one would capture results like those shown in some panels of Fig. 4?

      We did not have a condition in which participants were asked to attend to both categories. Therefore it was not useful for our simulations to include such a scenario. Please also note that the goal of our simulations is not to capture the exact amount of attentional modulation, but to investigate the effect of target-distractor similarity on the change in attentional modulation (univariate shift and weight shift).

      As for the results in some panels of Figure 4, we have explained the reason underlying higher responses in paired conditions compared to isolated conditions in response to the “weaknesses” section of the second reviewer. We hope that these points satisfy the reviewer’s concern regarding the results in Figure 4 and our simulations.

      (8) Lines 271-276 - the "latter" and "former" are backwards here I think.

      We believe that the sentence was correct, but confusing. We have rephrased the sentence to avoid the confusion in lines 371-376 of the revised manuscript:

      “We modeled two neural populations: a general object-selective population in which each voxel shows preference to a particular category and voxels with different preferences are mixed in with each other (similar to LO and pFS), and a category-selective population in which all voxels have a similar preference for a particular category (similar to EBA and PPA).”

      (9) Line 314 - "body-car" pair is mentioned twice in describing the non-significant result in PPA ROI.

      Thank you for catching the typo. We have changed the second Body-Car to Body-Cat.

      (10) Fig. 5 and Fig. 6 - I was expecting to see a plot that demonstrated variability across subjects rather than across category pairs. Would it be possible to show the distribution of each pair's datapoints across subjects, perhaps by coloring all (e.g.) body-car datapoints one color, all body-cat datapoints another, etc? This would also help readers better understand how category preferences (which differ across ROIs) impact the results.

      We demonstrated variability across category pairs rather than subjects because we aimed to investigate how the variation in the similarity between categories (i.e. category distance) affected the univariate and multivariate effects of attention. The variability across subjects is reflected in the error bars in the bar plots of Figure 5 and Figure 6.

      Here we show the distribution of each category pair’s data points across subjects by using a different color for each pair:

      Author response image 2.

      Univariate shift versus category distance including single-subject data points in all ROIs.

      Author response image 3.

      Weight shift versus category distance including single-subject data points in all ROIs.

      As can be observed in the figures, category preference has little impact on the results. Rather, the similarity in the preference (in the univariate case) or the response pattern (in the multivariate case) to the two presented categories is what impacts the amount of the univariate shift and the weight shift, respectively. For instance, in EBA we observe a low amount of attentional shift both for the Body-Cat pair, with two stimuli for which the ROI is highly selective, and the Car-House pair, including stimuli to which the region shows little response. A similar pattern is observed in the object-selective regions LO and pFs which show high responses to all stimulus categories.

      We believe that the figures including the data points related to all subjects are not strongly informative. However, we agree that using different colors for each category pair helps the readers better understand that category preference has little impact on the results in different ROIs. We therefore present the colored version of Figure 5 and Figure 6 in the revised manuscript, with a different color for each category pair.

      (11) Fig. 5 and Fig. 6 use R^2 as a dependent variable across participants to conclude a positive relationship. While the positive relationship is clear in the scatterplots, which depict averages across participants for each category pair, it could still be the case that there are a substantial number of participants with negative (but predictive, thus high positive R^2) slopes. For completeness and transparency, the authors should illustrate the average slope or regression coefficient for each of these analyses.

      We concluded the positive relationship and calculated the significance in Figure 5 and Figure 6 using the correlation r rather than r^2. This is why the result was not significantly positive in V1. We acknowledge that the use of r-squared in the bar plot leads to confusion. We have therefore changed the bar plots to show the correlation coefficient instead of the r-squared. Furthermore, we have added a table of the correlation coefficient for all participants in all ROIs for the univariate and weight shift analyses supplemental to Figure 5 and Figure 6, respectively.
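
      The reviewer's concern, that r-squared hides the sign of the relationship while r does not, can be made concrete with a short sketch. The data below are invented for illustration only: two hypothetical participants with mirror-image slopes produce identical r-squared values but correlation coefficients of opposite sign.

```python
import numpy as np

# Hypothetical category distances and a fixed perturbation (illustrative values only).
x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
noise = np.array([0.02, -0.01, 0.03, -0.02, 0.01, -0.03])

y_positive = 2.0 * x + noise  # participant with a positive slope
y_negative = -y_positive      # participant with the mirrored, negative slope

r_positive = np.corrcoef(x, y_positive)[0, 1]
r_negative = np.corrcoef(x, y_negative)[0, 1]

print(f"r:   {r_positive:+.3f} vs {r_negative:+.3f}")
print(f"r^2: {r_positive**2:.3f} vs {r_negative**2:.3f}")
```

      Testing r against zero across participants therefore distinguishes the two cases, whereas averaging r-squared would not.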

      (12) No statement about data or analysis code availability is provided

      Thanks for pointing this out. The fMRI data is available on OSF. We have added a statement about it in the Data Availability section of the revised manuscript in line 669.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The Notch signaling pathway plays an important role in many developmental and disease processes. Although well-studied there remain many puzzling aspects. One is the fact that as well as activating the receptor through trans-activation, the transmembrane ligands can interact with receptors present in the same cell. These cis-interactions are usually inhibitory, but in some cases, as in the assays used here, they may also be activating. With a total of 6 ligands and 4 receptors, there is potentially a wide array of possible outcomes when different combinations are co-expressed in vivo. Here the authors set out to make a systematic analysis of the qualitative and quantitative differences in the signaling output from different receptor-ligand combinations, generating sets of "signaling" (ligand expressing) and "receiving" (receptor +/- ligand expressing cells).

      The readout of pathway activity is transcriptional, relying on the fusion of GAL4 in the intracellular part of the receptor. Positive ligand interactions result in the proteolytic release of Gal4 that turns on the expression of H2B-citrine. As an indicator of ligand and receptor expression levels, they are linked via TA to H2B mCherry and H2B mTurq expression respectively. The authors also manipulate the expression of the glycosyltransferase Lunatic-Fringe (LFng) that modifies the EGF repeats in the extracellular domains impacting their interactions. The testing of multiple ligand-receptor combinations at varying expression levels is a tour de force, with over 50 stable cell lines generated, and yields valuable insights although as a whole, the results are quite complex.

      Strengths:

      Taking a reductionist approach to testing systematically differences in the signaling strength, binding strength, and cis-interactions from the different ligands in the context of the Notch1 and Notch 2 receptors (they justify well the choice of players to test via this approach) produces a baseline understanding of the different properties and leads to some unexpected and interesting findings. Notably:

      -                Jag1 ligand expressing cells failed to activate Notch1 receptor although were capable of activating Notch2. Conversely, Jag2 cells elicited the strongest activation of both receptors. The results with Jag1 are surprising also because it exhibits some of the strongest binding to plate-bound ligands. The failure to activate Notch1 has major functional significance and it will be important in the future to understand the mechanistic basis.

      -                Jagged ligands have the strongest cis-inhibitory effects and the receptors differ in their sensitivity to cis-inhibition by Dll ligands. These observations are in keeping with earlier in vivo and cell culture studies. More referencing of those would better place the work in context but it nicely supports and extends previous studies that were conducted in different ways.

      -                Responses to most trans-activating ligands showed a degree of ultrasensitivity but this was not the case for cis-interactions where effects were more linear. This has implications for the way the two mechanisms operate and for how the signaling levels will be impacted by ligand expression levels.

      -                Qualitatively similar results are obtained in a second cell line, suggesting they reflect fundamental properties of the ligands/receptors.

      We appreciate the positive and constructive feedback.

      Weaknesses:

      One weakness is that the methods used to quantify the expression of ligands and receptors rely on the co-translation of tagged nuclear H2B proteins. These may not accurately capture surface levels/correctly modified transmembrane proteins. In general, the multiple conditions tested partly compensate for the concerns - for example, as Jag1 cells do activate Notch2 even if they do not activate Notch1 some Jag1 must be getting to the surface. But even with Notch2, Jag1 activities are on the lower side, making it important to clarify, especially given the different outcomes with the plated ligands. Similarly, is the fact that all ligands "signalled strongest to Notch2" an inherent property or due to differences in surface levels of Notch 2 compared to Notch1? The results would be considerably strengthened by calibration of the ligand/receptor levels (and ideally their sub-cellular localizations). Assessing the membrane protein levels would be relatively straightforward to perform on some of the basic conditions because their ligand constructs contain Flag tags, making it plausible to relate surface protein to H2B, and there are antibodies available for Notch1 and Notch2.

      We agree that mCherry fluorescence does not provide a direct readout of active surface ligand levels. As the reviewer points out, the ability of Jag1 to activate Notch2 demonstrates that expressed Jag1 is competent for signaling. Further, in some cases, Jag1-Notch2 activation can be comparable to Dll1-Notch2 activation (Figure 2A). Following the reviewer’s suggestion, we performed a Western blot for multiple expression levels for each of three surface ligands (Dll1, Dll4, Jag1) (Figure 2—figure supplement 2). This blot revealed a signal for surface expression of Jag1. Interpretation is complicated by the expected dependence of the efficiency of surface protein purification on the number of primary amines in the protein, which varies among these ligands, and qualitatively correlates with the staining intensity. While this makes quantitative interpretation difficult, this result further supports the notion that Jag1 is present on the cell surface. Finally, we note that high signaling activity need not, in general, directly correlate with surface expression levels. In fact, one study showed an example in which increased ligand activity occurred with decreased basal ligand surface levels (Antfolk et al., 2017). While one would ideally like to know all parameters of the system, including surface protein levels, rates of recycling, etc. the perspective taken here is that the net effect of these many post-translational processing steps can be subsumed into the overall relationship between the expression of the protein (which, in our case, is read out by the co-translational reporter) and its activity, which is relevant for the behavior of developmental circuits, among other systems. To address this comment, we now explicitly mention the limitation of mCherry as a proxy for surface protein, and add a reference to previous work highlighting the relationship between surface levels and ligand activity.

      In terms of the dependence of signaling on Notch levels, the metric of signaling activity used here is explicitly normalized by the mTurquoise co-translational reporter of Notch expression to account for differences in receptor expression across receiver clones. We have added a new figure to show the variation in expression (Figure 1—figure supplement 1A) and to demonstrate this normalization (Figure 1—figure supplement 5). Having said that, as the reviewer correctly points out, we cannot directly address the dependence on surface receptor levels with mTurquoise alone. To address this comment, we have added a figure that shows cotranslational and surface receptor expression for a subset of our receiver clones (Figure 1—figure supplement 1B). Although antibody binding strengths may vary, it appears unlikely that higher surface levels could explain most ligands’ preferential activation of Notch2 over Notch1, since Notch2 levels were lower than Notch1 levels in both surface expression and cotranslational expression.

      Cis-activation as a mode of signaling has only emerged from these synthetic cell culture assays raising questions about its physiological relevance. Cis-activation is only seen at the higher ligand (Dll1, Dll4) levels, how physiological are the expression levels of the ligands/receptors in these assays? Is it likely that this would make a major contribution in vivo? Is it possible that the cells convert themselves into "signaling" and "receiving" sub-populations within the culture by post-translational mechanism? Again some analysis of the ligand/receptors in the cultures would be a valuable addition to show whether or not there are major heterogeneities.

      The cis-activation results in this paper are, as the reviewer points out, conducted in synthetic cell culture assays. Cis-activation is observed across a large dynamic range of ligand expression, possibly including non-physiologically high levels. However, our previous work (Nandagopal et al, eLife 2019) showed that cis-activation does not require over-expression, as it occurred in unmodified Caco-2 and NMuMG cells with their endogenous ligand and receptor expression levels. As shown here in Figure 4B, cis-activation for Notch2 increases monotonically and is substantial even at intermediate ligand concentrations. In other cases, cis-activation is maximal at intermediate concentrations. We agree that the in vivo role remains unclear, and is difficult to determine due to the typical close contacts among cells in tissues. Therefore, these assays do not speak to in vivo relevance. Note that we can, however, rule out the possibility of trans signaling between well-mixed cell populations at these densities (Figure 4A).

      It is hard to appreciate how much cell-to-cell variability in the "output" there is. For example, low "outputs" could arise from fewer cells becoming activated or from all cells being activated less. As presented, only the latter is considered. That may be already evident in their data, but not easy for the reader to distinguish from the way they are presented. For example, in many of the graphs, data have been processed through multiple steps of normalization. Some discussion/consideration of this point is needed.

      We agree that in different experiments changes in a mean response can reflect changes in fraction of activated cells, or level of activation or some combination of both. In this work, most assays were conducted by flow cytometry, which provides a full distribution of cellular responses. We provided distributions for some experiments in the supplementary figures (i.e., Figure 4—figure supplement 1, and Figure 5—figure supplement 4). The sheer number of experiments and samples prevents us from displaying all underlying histograms. Therefore, we have provided all flow data sets in an extensive archive that is publicly available on data.caltech.edu (https://doi.org/10.22002/gjjkn-wrj28).
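      The distinction between fewer cells responding and all cells responding less can be made concrete with a minimal sketch. The values and the threshold below are hypothetical and are not taken from the deposited flow data; the function name is illustrative only. It shows that two populations with the same mean can differ entirely in the fraction of activated cells.

```python
from statistics import mean


def decompose_response(values, threshold):
    """Split a population mean into the fraction of responding cells
    and the mean reporter level among those responders.

    `threshold` is a hypothetical gate separating "on" from "off" cells.
    """
    responders = [v for v in values if v > threshold]
    fraction_on = len(responders) / len(values)
    mean_on = mean(responders) if responders else 0.0
    return fraction_on, mean_on, mean(values)


# Two hypothetical populations with identical means but different structure.
few_strong = [0.0] * 8 + [10.0] * 2   # 20% of cells strongly activated
all_weak = [2.0] * 10                 # every cell weakly activated

print(decompose_response(few_strong, threshold=1.0))
print(decompose_response(all_weak, threshold=1.0))
```

      Applied to a full single-cell distribution, the same split indicates whether a change in the mean reflects the responder fraction, the responder level, or both.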

      Impact:

      Overall, cataloging the outcomes from the different ligand-receptor combinations, both in cis and trans, yields a valuable baseline for those investigating their functional roles in different contexts. There is still a long way to go before it will be possible to make a predictive model for outcomes based on expression levels, but this work gives an idea about the landscape and the complexities. This is especially important now that signaling relationships are frequently hypothesized based on single-cell transcriptomic data. The results presented here demonstrate that the relationships are not straightforward when multiple players are involved.

      We appreciate this concise impact summary, and agree with its conclusions.

      Reviewer #2 (Public Review):

      Summary:

      In this manuscript, the authors extend their previous studies on trans-activation, cis-inhibition (PMID: 25255098), and cis-activation (PMID: 30628888) of the Notch pathway. Here they create a large number of cell lines using CHO-K1 and C2C12 cells expressing either Notch1-Gal4 or Notch2-Gal4 receptors which express a fluorescent protein upon receptor activation (receiver cells). For cis-inhibition and cis-activation assays, these cells were engineered to express one of the four canonical Notch ligands (Dll1, Dll4, Jag1, Jag2) under tetracycline control. Some of the receiver cells were also transfected with a Lunatic fringe (Lfng) plasmid to produce cells with a range of Lfng expression levels. Sender cells expressing all of the canonical ligands were also produced. Cells were mixed in a variety of co-culture assays to highlight trans-activation, cis-activation, and cis-inhibition. All four ligands were able to trans-activate Notch1 and Notch2, except Jag1 did not transactivate Notch1. Lfng enhanced trans-activation of both Notch receptors by Dll1 and Dll4, and inhibited Notch1 activation by Jag2 and Notch2 activation by both Jag1 and Jag2. Cis-expression of all four ligands was predominantly inhibitory, but Dll1 and Dll4 showed strong cis-activation of Notch2. Interestingly, cis-ligands preferentially inhibited trans-activation by the same ligand, with varying effects on other trans-ligands.

      Strengths:

      This represents the most comprehensive and rigorous analysis of the effects of canonical ligands on cis- and trans-activation, and cis-inhibition, of Notch1 and Notch2 in the presence or absence of Lfng so far. Studying cis-inhibition and cis-activation is difficult in vivo due to the presence of multiple Notch ligands and receptors (and Fringes) that often occur in single cells. The methods described here are a step towards generating cells expressing more complex arrays of ligands, receptors, and Fringes to better mimic in vivo effects on Notch function.

      In addition, the fact that their transactivation results with most ligands on Notch1 and 2 in the presence or absence of Lfng were largely consistent with previous publications provides confidence that the authors' assays are working properly.

      We appreciate the thoughtful comments and feedback.

      Weaknesses:

      It was unusual that the engineered CHO cells expressing Notch1-Gal4 were not activated at all by co-culture with Jag1-expressing CHO cells. Many previous reports have shown that Jag1 can activate Notch1 in co-culture assays, including when Notch1 was expressed in CHO cells. Interestingly, when the authors used Jag1-Fc in a plate coating assay, it did activate Notch1 and could be inhibited by the expression of Lfng.

      In our assays, we do in fact also see some signaling of Jag1 to Notch1, especially when dLfng is coexpressed (Figure 2—figure supplement 4, formerly Figure 2—figure supplement 3). While these levels are lower than those observed for other ligand-receptor combinations, they are significantly elevated compared to baseline. In specific natural contexts, it will be important to determine whether the weak but non-zero Jag1-Notch1 signaling acts negatively to suppress signaling from other ligands, or provides weak but potentially functionally important levels of signaling. Evidence for both modes exists in the literature. To address this, we have expanded the discussion of Jag1-Notch1 signaling and added references to other work on Jag1-Notch1 signaling to the Discussion section.

      The cell surface level of the ligands was determined by flow cytometry of a co-translated fluorescent protein. Some calibration of the actual cell surface levels with the fluorescent protein would strengthen the results.

      This issue was also raised by Reviewers #1 and #3. Please see responses to Reviewer #1, above.

      Reviewer #3 (Public Review):

      Summary:

      This manuscript reports a comprehensive analysis of Notch-Delta/Jagged signaling inclusive of the human Notch1 and Notch2 receptors and DLL1, DLL4, JAG1, and JAG2 ligands. Measurements encompassed signaling activity for ligand trans-activation, cis-activation, cis-inhibition, and activity modulation by Lfng. The most striking observations of the study are that JAG1 has no detectable activity as a Notch1 ligand when presented on a cell (though it does have activity when immobilized on a surface), even though it is an effective cis-inhibitor of Notch1 signaling by other ligands, and that DLL1 and DLL4 exhibit cis-activating activity for Notch1 and especially for Notch2. Notwithstanding the artificiality of the system and some of its shortcomings, the results should nevertheless be a valuable resource for the Notch signaling community.

      Strengths:

      (1)  The work is systematic and comprehensive, addressing questions that are of importance to the community of researchers investigating mammalian Notch proteins, their activation by ligands, and the modulation of ligand activity by LFng.

      (2)  A quantitative and thorough analysis of the data is presented.

      Weaknesses:

      (1) The manuscript is primarily descriptive and does not delve into the underlying, mechanistic origin or source of the different ligand activities.

      We agree that the goals of this paper were largely to discover the range of signaling modes that occur. A mechanistic analysis would be beyond the scope of this work, but we agree it is an important next step.

      (2) The amount of ligand or receptor expressed is inferred from the flow cytometry signal of a co-translated fluorescent protein-histone fusion, and is not directly measured. The work would be more compelling if the amount of ligand present on the cell surface were directly measured with anti-ligand antibodies, rather than inferred from measurements of the fluorescent protein-histone fusion.

      This issue was also raised by Reviewers #1 and #2. Please see responses to Reviewer #1, above.

      (3) It would be helpful to see plots of the raw activity data before transformation and normalization, because the plots present data after several processing steps, and it is not clear how the processed data relate to the original values determined in each measurement.

      We included examples showing how raw data are processed in Figure 4—figure supplement 1 and Figure 5—figure supplement 4. The sheer number of experiments precludes including similar figures for all data sets. However, all raw and processed data and the data analysis code are publicly available (https://doi.org/10.22002/gjjkn-wrj28).

      (4) The authors use sparse plating of engineered cells with parental cells (expressing no ligand or receptor) to measure cis-activation. However, the cells divide within the culture period of 22-24 h and can potentially trans-activate each other.

      If measured cis-activation signal arises solely from trans-activation, then the measured cis-activation signal per cell should increase with cell density, since trans-activation per cell does depend on cell density (Figure 4A). However, for the strongest cis-activators (Dll1- and Dll4-Notch2), signaling magnitude is similar when these cells are cultured sparsely or at confluence, which would otherwise allow efficient trans signaling (Figure 5A). Thus, for Dll1- and Dll4-Notch2 receivers, total signaling strength per cell depends little or not at all on the opportunity to signal intercellularly. Moreover, cis-activation signal for the Dll1- and Dll4-Notch2 combinations exceeded the maximum trans-signaling levels we could achieve for the same receivers when cis-ligand was suppressed (Figure 4B). These results argue that cis interactions dominate signaling in this context. However, we have not ruled out the possibility that trans-signaling between sister cells after division contributes to the comparatively weak cis-activation observed for Notch1 receivers.

      Reviewer #1 (Recommendations For The Authors):

      As outlined in the public review, there is a question of whether the nuclear H2B accurately reflects the surface levels of the transmembrane proteins (ligand and receptor). Clearly, it would not be feasible to check levels in all of the experimental conditions, but some baseline conditions should be analyzed.

      We addressed this above.

      Reviewer #2 (Recommendations For The Authors):

      (1)  As mentioned above, it was unusual that Jag1 did not activate Notch1 in co-culture assays, but did activate Notch1 in plate-coating assays. The authors should add some text to the Discussion to explain why they think this is happening in their engineered cells. One possibility is that the CHO cells express Manic fringe (Mfng) which is known to reduce Jag1-Notch1 activation. Data for Mfng levels in CHO cells were not included in Supplemental Table 2. Knocking down all three Fringes in CHO cells might increase Jag1-Notch1 activation.

      This is already addressed in a sentence in the results: “Strikingly, while Jag1 sender cells failed to activate Notch1 receivers above background (Figure 2D), plate-bound Jag1-ext-Fc activated Notch1 only ~3-fold less efficiently than it activated Notch2 (Figure 3B-D). This suggests that the natural endocytic activation mechanism, or potential differences in tertiary structure between the expressed and recombinant Jag1 extracellular domains, could play roles in preventing Jag1-Notch1 signaling in coculture.” Regarding the point about Mfng, we added a note to Supplementary Table about other CHO-K1 expression data.

      (2) Figure 1-supplemental figure 1: Both the Notch1-Jag1 and Notch1-Jag2 cells show high expression of Jag1 in low 4epi, but any higher concentration reduces to control levels. How much of a problem is this for interpreting your data?

      This was not the ideal behavior, but by binning cells by co-translational reporters for ligand expression, we were able to obtain enough cells in intermediate bins. (Note: Figure 1—figure supplement 1 is now Figure 1—figure supplement 2.)

      (3)  Figure 1C legend: Are these stably-expressing cells or Tet-off cells? Please state in legend.

      The figure legend has been updated.

      (4)  Figure 1E: How long is the knockdown of Rfng and Lfng effective? Does it affect the expression of Lfng later?

      siRNA effects generally last for at least 72-96 hours, so we do not anticipate this being an issue.

      (5) Page 9: "Lfng significantly decreased trans-activation of both receptors by Jag1 (>2.5-fold)". If there is no Jag1-Notch1 activation, how can Lfng decrease trans-activation?

      We added a note in the main text to clarify that while Jag1-Notch1 signaling is relatively low, it can still be detectably decreased.

      (6) Figure 4A legend: Please define what "2.5k ea senders and Rec" means. In the text, it says "To focus on cis-interactions alone, we then cultured receiver cells at low density, amid an excess of wildtype CHO-K1 cells" (page 14).

      This was clarified in the text.

      (7)  Page 14: "By contrast, Notch2 was cis-activated by both Dll1 and Dll4, to levels exceeding those produced by trans-activation by high-Dll1 senders (Figure 4B, lower left)." Where is the trans-activation data? 4B, lower right?

      We updated this reference in the main text.

      (8)  Page 16: "For Notch2-Dll1 and Notch2-Dll4, single cell reporter activities correlated with cis-ligand expression, regardless of whether cells were pre-induced at a high or low culture density (Figure 4D)." It appears that Notch2-Dll1 has lower Notch activation at sparse culture than confluent.

      We agree that the level of signaling is, on average, lower in sparse than in confluent culture. This is explained by the sensitivity of the Tet-OFF promoter to culture density (Figure 4—figure supplement 2). However, the key point of this experiment is the positive correlation, which is consistent with cis-activation, and inconsistent with the pre-generation of NEXT hypothesis diagrammed in Figure 4C, which would not be expected to produce such a correlation.

      (9a) For the creation of the C2C12-Nkd cells: Has genomic sequencing been done to confirm editing of Notch2 and Jag1 loci?

      We confirmed the knockdown but did not do genomic sequencing.

      (9b) The gel in Figure 7-Supplement 1C is not adequate for showing loss of Jag1. It should be repeated.

      In this case, we have only the single gel. We added a note in the figure legend that no duplicate was performed.

      (10) Figure 7A: Which Fringes are expressed in C2C12 cells? You should provide a rationale for knocking down just Rfng.

      Figure 7—figure supplement 1A shows the levels of expression in C2C12. Note that Mfng is not highlighted because its levels were undetectable.

      (11) Figure 7-Supplement 1D: This is confusing. Notch2 levels are not reduced in the left panel, and Notch1 and Notch2 levels are not reduced in the right panel?

      C2C12-Nkd cells exhibit reduced levels of Notch1 and Notch3. This can be seen in Figure 7—figure supplement 1A. Panel D presents the results of additional siRNA knockdown, performed to prevent subsequent up-regulation of Notch1 and Notch3 during the assay. These knockdown results were variable, as shown. The Notch2 siRNA knockdown was not essential for these experiments, but was performed even though Notch2 levels were very low to begin with. In the revision, we have added this note to the Methods.

      Reviewer #3 (Recommendations For The Authors):

      (1) The results section of the manuscript is very dense and difficult to follow, as are the figure legends.

      We appreciate the criticism, and regret that it is not easier to read in its current form.

      (2) The authors could emphasize areas of concordance with published results (where available) to place their artificial, engineered system into a better biological context. Are there any examples of studies in whole organisms where cis-activation plays a role?

      We are not aware of examples of cis-activation in whole organisms at this point.

      (3) How do the authors rationalize the different responses of Notch1 to cell-presented Jag1 as opposed to immobilized Jag1, where its signal strength is second in rank order on a molar basis?

      This comment was addressed above in response to the first recommendation from Reviewer #2.

      It is also difficult to understand Figure 2—figure supplement 3B, in which it appears that Jag1 induces a Notch1 reporter response when LFng is knocked down (dLfng), and how those data relate to the inactive response to Jag1 shown in the main figures.

      The issue here is a difference of normalization. Figure 2A in the main text is normalized to the sender expression level, i.e. relative signaling strength. By contrast, Figure 2—figure supplement 4B (previously Figure 2—figure supplement 3B) shows absolute signaling activity, which can appear higher because it does not normalize for ligand expression. For Jag1-Notch1 signaling in particular, substantial signaling required very high levels of Jag1. We have added a new figure to demonstrate these two types of normalization (Figure 2—figure supplement 1A).

      See Author response image 1 below for a direct comparison of these two normalization modes using data from both Figure 2A and Figure 2—figure supplement 4B. Note how the Jag1-Notch1 signaling activities that are nonzero in the top plot go to zero in the bottom plot as a result of normalizing the values to ligand expression.

      Author response image 1. Comparison of normalization modes in Figure 2A and Figure 2—figure supplement 4B (formerly 3B). Normalized trans-activation signaling activities for different ligand-receptor combinations (with dLfng only), either with further normalization to ligand expression (bottom row) or without further normalization (top row). Normalized signaling activity is defined as reporter activity (mCitrine, A.U.) divided by cotranslational receptor expression (mTurq2, A.U.), normalized to the strongest biological replicate-averaged signaling activity across all ligand-receptor-Lfng combinations in this experiment. Saturated data points, defined here as those with normalized signaling activity over 0.75 in both dLfng and Lfng conditions, were excluded. Colors indicate the identity of the trans-ligand expressed by cocultured sender cells. Error bars denote bootstrapped 95% confidence intervals (Methods), in this case sampled from the number of biological replicates given in the legend—n1 (for Notch1) or n2 (for Notch2). See Methods and Figure 2A caption for more details. Note that the only difference between this figure and the new Figure 2—figure supplement 1A is that this figure additionally includes the Jag1-high data from Figure 2—figure supplement 4B.
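      The two normalization modes described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the deposited analysis code: all numbers are hypothetical, the function names are invented for this example, and the percentile bootstrap is an assumed stand-in for the procedure described in the Methods.

```python
import random
from statistics import mean


def normalized_activity(reporter, receptor):
    """Reporter signal (mCitrine, A.U.) divided by the cotranslational
    receptor reporter (mTurq2, A.U.)."""
    return reporter / receptor


def scale_to_max(activities):
    """Rescale replicate-averaged activities so the strongest equals 1."""
    top = max(activities.values())
    return {name: value / top for name, value in activities.items()}


def relative_strength(activity, ligand_expression):
    """Further normalize a signaling activity to sender ligand expression."""
    return activity / ligand_expression


def bootstrap_ci(replicates, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean
    (an assumed procedure; see the lead-in)."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(replicates, k=len(replicates))) for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical replicate-averaged activities for two combinations.
raw = {
    "Dll1-Notch2": normalized_activity(reporter=800.0, receptor=400.0),
    "Jag1-Notch1": normalized_activity(reporter=100.0, receptor=400.0),
}
absolute = scale_to_max(raw)

# Hypothetical ligand expression: Jag1 needs far more ligand for its signal.
ligand = {"Dll1-Notch2": 1.0, "Jag1-Notch1": 50.0}
relative = {k: relative_strength(absolute[k], ligand[k]) for k in absolute}

print(absolute)
print(relative)
print(bootstrap_ci([0.8, 1.0, 1.2]))
```

      In this toy example the Jag1-Notch1 value is clearly nonzero before division by ligand expression and close to zero afterwards, which is the behavior the two rows of the image are meant to contrast.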

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, Herrmannova et al explore changes in translation upon individual depletion of three subunits of the eIF3 complex (d, e and h) in mammalian cells. The authors provide a detailed analysis of regulated transcripts, followed by validation by RT-qPCR and/or Western blot of targets of interest, as well as GO and KEGG pathway analysis. The authors confirm prior observations that eIF3, despite being a general translation initiation factor, functions in mRNA-specific regulation, and that eIF3 is important for translation re-initiation. They show that global effects of eIF3e and eIF3d depletion on translation and cell growth are concordant. Their results support and extend previous reports suggesting that both factors control translation of 5'TOP mRNAs. Interestingly, they identify MAPK pathway components as a group of targets coordinately regulated by eIF3 d/e. The authors also discuss discrepancies with other reports analyzing eIF3e function.

      Strengths:

      Altogether, a solid analysis of eIF3 d/e/h-mediated translation regulation of specific transcripts. The data will be useful for scientists working in the Translation field.

      Weaknesses:

      The authors could have explored in more detail some of their novel observations, as well as their impact on cell behavior.

      The manuscript has improved with the new corrections. I appreciate the authors' attention to the minor comments, which have been fully solved. The authors have not, however, provided additional experimental evidence that uORF-mediated translation of Raf-1 mRNA depends on an intact eIF3 complex, nor have they addressed the consequences of such regulation for cell physiology. While I understand that this is a subject of follow-up research, the authors could have at least included their explanations/ speculations regarding major comments 2-4, which in my opinion could have been useful for the reader.

      Our explanations/speculations regarding major comments 2 and 3 were included in the Discussion. We apologize for this misunderstanding, as we thought that we were supposed to explain our ideas only in the responses. We did not discuss comment 4, however, because we are not sure what the true effect is and did not want to speculate freely in our manuscript. We thank this reviewer for the insightful comments and understanding.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations For The Authors):

      Major comments:

      (1) The authors report the potential translational regulation of Raf kinase by re-initiation. It would be interesting to show that Raf is indeed regulated by uORF-mediated translation, and that this is dependent on an intact eIF3 complex. Analyzing the potential consequences of Raf1 regulation for cancer cell proliferation or apoptosis would be a plus.

      We agree that this is an interesting and likely possibility. In fact, another clue that translation of Raf1 is regulated by uORFs comes from Bohlen et al. 2023 (PMID: 36869665), who showed that RAF1 translation is dependent on PRRC2 proteins (which promote leaky scanning through these uORFs). We noted in the Discussion that our results from eIF3d/e/hKD and the PRRC2A/B/CKD partly overlap. Whether eIF3 and PRRC2 cooperate to regulate translation of this important mRNA is a subject of our follow-up research.

      (2) The authors show that eIF3 d/e -but not 3h- has an effect on cell proliferation. First, this indicates that proliferation does not fully correlate with eIF3 integrity. Depletion of eIF3d does not affect the integrity of eIF3, yet the effects on proliferation are similar to those of eIF3e. What is the possibility that changes in proliferation reflect functions of eIF3d outside the eIF3 complex? What could be the real consequences of disturbing eIF3 integrity for the mammalian cell? Please, discuss.

      Yes, proliferation does not fully correlate with eIF3 integrity. Downregulation of eIF3 subunits that leads to disintegration of the eIF3 YLC core (a, b, c, g, i) has a more detrimental effect on growth and translation than downregulation of the peripheral subunits (e, k, l, f, h, m). Our previous studies (Wagner et al. 2016, PMID: 27924037 and Herrmannová et al. 2020, PMID: 31863585) indicate that the YLC core of eIF3 can partially support translation even without its peripheral subunits. In this respect eIF3d (as a peripheral subunit) is a striking exception, suggesting it may have some specialized function(s). We do not know whether this function resides outside of the eIF3 complex, but we do not think so, mainly because in the absence of its interaction partner eIF3e, eIF3d is rapidly degraded. Therefore, it is not very likely that eIF3d exists alone outside of the eIF3 complex with moonlighting functions elsewhere. We think that eIF3d, as a head-interacting subunit close to an important head ribosomal protein RACK1 (a landing pad for regulatory proteins), is a target of signaling pathways, which may make it important for translation of specific mRNAs. In support of these thoughts, eIF3d (in the context of entire eIF3) together with DAP5 was shown to promote translation by an alternate cap-dependent (eIF4F-independent) mechanism (Lee et al. 2016, PMID: 27462815; de la Parra et al. 2018, PMID: 30076308). In addition, eIF3d function (also in the context of entire eIF3) was shown to be regulated by stress-triggered phosphorylation (Lamper et al. 2020, PMID: 33184215).

      (3) Figure 6D: Surprisingly, reduced levels of ERK1/2 upon eIF3d/e-KD are compensated by increased phosphorylation of ERK1/2 and net activation of c-Jun. Please comment on the functional consequences of buffering mechanisms that the cell deploys in order to counteract compromised eIF3 function. Why would the cell activate precisely the MAPK pathway to compensate for a compromised eIF3 function?

      This we do not know. We can only speculate that when translation is compromised, cells try to counteract it in two ways: 1) they produce more ribosomes to increase translational rates and 2) activate MAPK signaling to send pro-growth signals, which can in the end further boost ribosome biogenesis.

      (4) Regarding DAP-sensitive transcripts, can the authors discuss in more detail the role of eIF3d in alternative cap-dependent translation versus re-initiation? Are these transcripts being translated by a canonical cap- and uORF-dependent mechanism or by an alternative cap-dependent mechanism?

      This is indeed not an easy question. On one hand, it was shown that DAP5 facilitates translation re-initiation after uORF translation in a canonical cap-dependent manner. This mechanism is essential for translation of the main coding sequence (CDS) in mRNAs with structured 5' leaders and multiple uORFs (Weber et al. 2022, PMID: 36473845; David et al., 2022, PMID: 35961752). On the other hand, DAP5 was proposed to promote alternative, eIF4F-independent but cap-dependent translation, as it can substitute for the function of the eIF4F complex in cooperation with eIF3d (de la Parra et al., 2018, PMID: 30076308; Volta et al., 2021, PMID: 34848685). Overall, these observations paint a picture too complex for us to propose a clear scenario of what is going on between these two proteins on individual mRNAs. We speculate that both mechanisms are taking place and that the specific mechanism of translation initiation differs for differently arranged mRNAs.

      Minor comments:

      (5) Figure S2C: why is there a strong reduction of the stop codon peak for 3d and 3h KDs?

      We have checked the riboWaltz profiles of all replicates (in the Supplementary data we are showing only a representative replicate I) and the stop codon peak differs a lot among the replicates. We think that this way of plotting was optimized for calculation and visualization of P-sites and triplet periodicity and thus is not suitable for this type of comparison among samples. Therefore, we have performed our own analysis where the 5’ ends of reads are used instead of P-sites and triplicates are averaged and normalized to CDS (please see below), so that all samples can be compared directly in one plot (same as Fig. S13A but for the stop codon). We can see that the stop codon peak really differs and is the smallest for eIF3hKD. However, these changes are in the range of 20% and we are not sure about their biological significance. We therefore refrain from drawing any conclusions. In general, a reduced stop codon peak may signal faster termination or increased stop codon readthrough, but the latter should be accompanied by an increased ribosome density in the 3’UTR, which is not the case. A defect in termination efficiency would instead be manifested by an increased stop codon peak.

      Author response image 1.

       

      (6) Figures 5 and S8: Adding a vertical line at 'zero' in all cumulative plots will help the reader understand the author's interpretation of the data. 

      We have added a dashed grey vertical line at zero as requested. However, for interpretation of these plots, the reader should focus on the colored curve and whether it is shifted with respect to the grey curve (background) or not. A shift to the right indicates increased expression, while a shift to the left indicates decreased expression. The reported p-value then indicates the statistical significance of the shift.

      (7) The entire Figure 2 are controls that can go to Supplementary Material. The clustering of Figure S3B could be shown in the main Figure, as it is a very easy read-out of the consistent effects of the KDs of the different eIF3 subunits under analysis.

      We have moved the entire Figure 2 to Supplementary Material as suggested (the original panels can be found as Supplementary Figures 1B, 1C and 3A). Figure S3B is now the main Figure 2E. 

      (8) There are 3 replicates for Ribo-Seq and four for RNA-Seq. Were these not carried out in parallel, as it is usually done in Ribo-seq experiments? Why is there an extra replicate for RNA-Seq?

      Yes, the three replicates were carried out in parallel. We decided to add the fourth replicate in RNA-Seq to increase data robustness, as RNA-Seq is used for normalization of FP to calculate the TE, which was the main metric analyzed in this article. We had the option to add the fourth replicate because we originally prepared five biological replicates for all samples, but after performing the control experiments, we selected only the 3 best replicates for the Ribo-Seq library preparation and sequencing.

      (9) Please, add another sheet in Table S2 with the names of all genes that change only at the translation (RPF) levels.

      As requested, we have added three extra sheets (one for each downregulation) for differential FP with adjusted P < 0.05 in Spreadsheet S2. We also provide the complete unfiltered differential expression data (sheet named “all data”), so that readers can filter out any relevant data based on their interest.

      (10) Page 5, bottom: ' ...we showed that the expression of all 12 eIF3 subunits is interconnected such that perturbance of the expression of one subunit results in the down-regulation of entire modules...'. This is not true for eIF3d, as shown in Fig1B and mentioned in Results.

      This reviewer is correct. By this generalized statement, we were trying to summarize our previous results from Wagner et al., 2014, PMID: 24912683; Wagner et al., 2016, PMID: 27924037 and Herrmannová et al., 2020, PMID: 31863585. The eIF3d downregulation is the only exception that does not affect expression of any other eIF3 subunit. Therefore, we have rewritten this paragraph accordingly: “We recently reported a comprehensive in vivo analysis of the modular dynamics of the human eIF3 complex (Wagner et al, 2020; Wagner et al, 2014; Wagner et al., 2016). Using a systematic individual downregulation strategy, we showed that the expression of all 12 eIF3 subunits is interconnected such that perturbance of the expression of one subunit results in the down-regulation of entire modules leading to the formation of partial eIF3 subcomplexes with limited functionality (Herrmannova et al, 2020). eIF3d is the only exception in this respect, as its downregulation does not influence expression of any other eIF3 subunit.”

      (11) Page 10, bottom: ' The PCA plot and hierarchical clustering... These results suggest that eIF3h depletion impacts the translatome differentially than depletion of eIF3e or eIF3d.' This is already obvious in the polysome profiles of Figure S2C.

      We agree that this result is not surprising given the polysome profile and growth phenotype analyses of eIF3hKD. Still, we think that the PCA plot and hierarchical clustering results represent valuable controls. Nonetheless, we rephrased this section to note that this result agrees with the polysome profile analysis: “The PCA plot and hierarchical clustering (Figure 2A and Supplementary Figure 4A) showed clustering of the samples into two main groups: Ribo-Seq and RNA-seq, and also into two subgroups; NT and eIF3hKD samples clustered on one side and eIF3eKD and eIF3dKD samples on the other. These results suggest that the eIF3h depletion has a much milder impact on the translatome than depletion of eIF3e or eIF3d, which agrees with the growth phenotype and polysome profile analyses (Supplementary Figure 1A and 1D).”

      (12) Page 12: ' As for the eIF3dKD "unique upregulated" DTEGs, we identified one interesting and unique KEGG pathway, the ABC transporters (Supplementary Figure 5A, in green).' This sentence is confusing, as there are more pathways that are significant in this group, so it is unclear why the authors consider it 'unique'.

      The eIF3dKD “unique upregulated” group comprises genes with increased TE only in eIF3dKD but not in eIF3eKD or eIF3hKD (500 genes, Fig 2G). All these 500 genes were examined for enrichment in the KEGG pathways, and the top 10 significant pathways were reported (Fig S6A). However, 8 out of these 10 pathways were also significantly enriched in other gene groups examined (e.g. eIF3d/eIF3e common). Therefore, the two remaining pathways (“ABC transporters” and “Other types of O-glycan biosynthesis”) are truly unique for eIF3dKD. We wanted to highlight the ABC transporters group in particular because we find it rather interesting (for the reasons mentioned in the article). We have corrected the sentence in question to avoid confusion: “Among the eIF3dKD “unique upregulated” DTEGs, we identified one interesting KEGG pathway, the ABC transporters, which did not show up in other gene groups (Supplementary Figure 6A, in green). A total of 12 different ABC transporters had elevated TE (9 of them are unique to eIF3dKD, while 3 were also found in eIF3eKD), 6 of which (ABCC1-5, ABCC10) belong to the C subfamily, known to confer multidrug resistance with alternative designation as multidrug resistance protein (MRP1-5, MRP7) (Sodani et al, 2012).

      Interestingly, all six of these ABCC transporters were upregulated solely at the translational level (Supplementary Spreadsheet S2).”    

      (13) Note typo ('Various') in Figure 4A.

      Corrected

      (14) The introduction could be shortened.

      This is a very subjective requirement. In fact, when this manuscript was reviewed in NAR, we were asked by two reviewers to expand it substantially. Because several different research topics come together in this work, e.g. translational regulation, the eIF3 structure and function, and MAPK/ERK signaling, we are convinced that all of them demand a comprehensive introduction for non-experts in each of these topics. Therefore, with all due respect to this reviewer, we ultimately did not shorten it.

      Reviewer #2 (Recommendations For The Authors):

      - In Figure 2, it would be useful to know why eIF3d is destabilized by eIF3e knockdown - is it protein degradation and why do the eIF3d/e knockdowns not more completely phenocopy each other when there is the same reduction to eIF3d as in the eIF3d knockdown sample?

      Yes, we do think that protein degradation lies behind the eIF3d destabilization in the eIF3eKD, but we have not yet directly demonstrated this. However, we have shown that eIF3d mRNA levels are not altered in eIF3eKD and that Ribo-Seq data indicate no change in TE or FP for eIF3d-encoding mRNA in eIF3eKD. Nonetheless, it is important to note (and we discuss it in the article) that eIF3d levels in eIF3dKD are lower than eIF3d levels in eIF3eKD (please see Supplementary Figure 1C). In fact, we believe that this is one of the main reasons for the differences between the eIF3d and eIF3e knockdowns.

      - The western blots in Figures 4 and 6 show modest changes to target protein levels and would be strengthened by quantification.

      We have added the quantifications as requested by this reviewer and Reviewer 3.

      - For Figure 4, this figure would be strengthened by experiments showing if the increase in ribosomal protein levels is correlated with actual changes to ribosome biogenesis.

      As suggested, we performed polysome profiling in the presence of EDTA to monitor changes in the 60S/40S ratio, indicating a potential imbalance in the biogenesis of individual ribosome subunits. We found that it was not affected (Figure 3G). In addition, we performed the same experiment, normalizing all samples to the same number of cells (cells were carefully counted before lysis). In this way, we confirmed that eIF3dKD and eIF3eKD cells indeed contain a significantly increased number of ribosomes, in agreement with the western blot analysis (Figure 3H).

      - In Figure 6, there needs to be a nuclear loading control.

      This experiment was repeated with Lamin B1 used as a nuclear loading control – it is now shown as Fig. 5F.

      - For Figure 8, these findings would be strengthened using luciferase reporter assays where the various RNA determinants are experimentally tested. Similarly, 5′ TOP RNA reporters would have been appreciated in Figure 4.

      This is indeed a logical continuation of our work, and it is the current work in progress of one of the PhD students. We apologize, but we consider this time- and resource-demanding analysis beyond the scope of this article.

      Reviewer #3 (Recommendations For The Authors):

      (1) Within the many effects observed, it is mentioned that eIF3d is known to be overexpressed while eIF3e is underexpressed in many cancers, but knockdown of either subunit decreases MDM2 levels, which would be expected to increase P53 activity and decrease tumor cell transformation. In contrast, they also report that 3e/3d knockdown dramatically increases levels of cJUN, presumably due to increased MAPK activity, and is expected to increase protumor gene expression. Additional discussion is needed to clarify the significance of the findings, which are a bit confusing.

      This is indeed true. However, considering the complexity of eIF3, the largest initiation factor among all, as well as the broad portfolio of its functions, it is perhaps not so surprising that the observed effects are complex and may seem even contradictory in respect to cancer. To acknowledge that, we expanded the corresponding part of discussion as follows: “Here, we demonstrate that alterations in the eIF3 subunit stoichiometry and/or eIF3 subcomplexes have distinct effects on the translatome; for example, they affect factors that play a prominent (either positive or negative) role in cancer biology (e.g., MDM2 and cJUN), but the resulting impact is unclear so far. Considering the complex interactions between these factors as well as the complexity of the eIF3 complex per se, future studies are required to delineate the specific oncogenic and tumor suppressive pathways that play a predominant role in mediating the effects of perturbations in the eIF3 complex in the context of neoplasia.”

      (2) There are places in the text where the authors refer to changes in transcriptional control when RNA levels differ, but transcription versus RNA turnover wasn't tested, e.g. page 16 and Figure S10, qPCR does not confirm "transcriptional upregulation in all three knockdowns" and page 19 "despite apparent compensatory mechanisms that increase their transcription."

      This is indeed true; the sentences in question were corrected. The term “increased mRNA levels” is now used instead of “transcriptional upregulation” (increased mRNA stabilization is also possible).

      (3) Similarly, the authors suggest that steady-state LARP1 protein levels are unaffected based on ribosome footprint counts (page 21). It is incorrect to assume this, because ribosome footprints can be elevated due to stalling on RNA that isn't being translated and doesn't yield more protein, and because levels of translated RNA/synthesized proteins do not always reflect steady-state protein levels, especially in mutants that could affect lysosome levels and protein turnover. Also page 12, 1st paragraph suggests protein production is down when ribosome footprints are changed.

      Yes, we are well aware of this known limitation of Ribo-seq analysis. Therefore, the steady-state protein levels of our key hits were verified by western blotting. In addition, we have removed the sentence about LARP1 because it was based on Ribo-Seq data only without experimental evaluation of the steady-state LARP1 protein levels.

      (4) The translation buffering effect is not clear in some Figures, e.g. S6, S8, 8A, and B. The authors show a scheme for translationally buffered RNAs being clustered in the upper right and lower left quadrants in S4H (translation up with transcript level down and v.v.), but in the FP versus RNA plots, the non-TOP RNAs and 4E-P-regulated RNAs don't show this behavior, and appear to show a similar distribution to the global changes. Some of the right panels in these figures show modest shifts, but it's not clear how these were determined to be significant. More information is needed to clarify, or a different presentation, such as displaying the RNA subsets in the left panels with heat map coloring to reveal whether RNAs show the buffered translation pattern defined in purple in Figure S4H, or by reporting a statistical parameter or number of RNAs that show behavior out of total for significance. Currently the conclusion that these RNAs are translationally buffered seems subjective since there are clearly many RNAs that don't show changes, or show translation-only or RNA-only changes.

      We would like to clarify that S4H does not indicate a necessity for changes in FPs in the buffered subsets. Although opposing changes in total mRNA and FPs are classified as buffering, often we also consider the scenario where there are changes to the total mRNA levels not accompanied by changes in ribosome association.

      In figure S6, the scatterplots indicate a high density of genes shifted towards negative fold changes on the x-axis (total mRNA). This is also reflected in the empirical cumulative distribution functions (ecdfs) for the log2 fold changes in total mRNA in the far right panels of A and B, and the lack of changes in log2 fold change for FPs (middle panels). Similarly, in figure S8, the scatterplots indicate a density of genes shifted towards positive fold changes on the x-axis for total mRNA. The ecdfs also demonstrate that there is a significant directional shift in log2 fold changes in the total mRNA that is not present to a similar degree in the FPs, consistent with translational offsetting. It is rightly pointed out that not all genes in these sets follow the same pattern of regulation. We have revised the title of Supplementary Figure S6 (now S7) to reflect this. However, we would like to emphasize that these figures are not intended to communicate that all genes within these sets of interest are regulated in the same manner, but rather that when considered as a whole, the predominant effect seen is that of translational offsetting (directional shifts in the log2 fold change distribution of total mRNA that are not accompanied by similar shifts in FP mRNA log2 fold changes).

      The significance of these differences was determined by comparing the ecdfs of the log2 fold changes for the genes belonging to a particular set (e.g. non-TOP mTOR-sensitive, p-eIF4E-sensitive) against all other expressed genes (background) using a Wilcoxon rank sum test. This allows identification of significant shifts in the distributions that have a clear directionality (if there is an overall increase, or decrease, in fold changes of FPs or total mRNA compared to background). If log2 fold changes are different from background, but without a clear directionality (equally likely to be increased or decreased), the test will not yield a significant result. This approach allows assessment of the overall behavior of gene signatures within a given dataset in a manner that is completely threshold-independent, such that it does not rely on classification of genes into different regulatory categories (translation only, buffering, etc.) based on significance or fold-change cut-offs (as in S4H). Therefore, we believe that this unbiased approach is well-suited for identifying cases when there are many genes that follow similar patterns of regulation within a given dataset.
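      The behavior described above (a directional shift is detected, a symmetric spread around the background is not) can be illustrated with a minimal Python sketch. This is not the analysis code used for the manuscript: the function names, the toy log2 fold-change values, and the choice of a tie-corrected normal approximation without continuity correction are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(gene_set, background):
    """Two-sided Wilcoxon rank sum test via the normal approximation.

    Returns (z, p). A positive z means the gene set tends to have larger
    values than the background.
    """
    n1, n2 = len(gene_set), len(background)
    n = n1 + n2
    pooled = list(gene_set) + list(background)
    ranks = average_ranks(pooled)
    w = sum(ranks[:n1])                      # rank sum of the gene set
    mean_w = n1 * (n + 1) / 2                # expected rank sum under H0
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (w - mean_w) / sqrt(var_w)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical log2 fold changes (illustrative values only).
background = [i / 10 for i in range(-20, 21)]     # centered on zero
shifted_set = [2.5 + i / 10 for i in range(10)]   # directional increase
dispersed_set = [-3.0, -2.5, 2.5, 3.0]            # altered, but no direction

for name, genes in [("shifted", shifted_set), ("dispersed", dispersed_set)]:
    z, p = rank_sum_test(genes, background)
    print(f"{name}: z = {z:.2f}, p = {p:.3g}")
```

      In this toy example the uniformly increased set gives a large positive z, whereas the set that is equally spread in both directions gives z = 0, which matches the point that the test responds only to shifts with a clear directionality.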

      (5) Page 10-"These results suggest that eIF3h depletion impacts the translatome differentially than depletion of eIF3e or eIF3d" ...These results suggest that eIF3h has less impact on the translatome, not that it does so differently. If it were changing translation by a different mechanism, I would not expect it to cluster with control.

      This sentence was rewritten as follows: “The PCA plot and hierarchical clustering (Figure 2A and Supplementary Figure 4A) showed clustering of the samples into two main groups: Ribo-Seq and RNA-seq, and also into two subgroups; NT and eIF3hKD samples clustered on one side and eIF3eKD and eIF3dKD samples on the other. These results suggest that the eIF3h depletion has a much milder impact on the translatome than depletion of eIF3e or eIF3d, which agrees with the growth phenotype and polysome profile analyses (Supplementary Figure 1A and 1D).”

      Other minor issues:

      (1) There are some typos: Figure 2 leves, Figure 4 variou,

      Corrected.

      (2) Figure 3, font for genes on volcano plot too small

      Perhaps; however, the resolution of this image is high enough that any part of it can be enlarged at will. In our opinion, a larger font would take up too much space, which would reduce the informativeness of this graph.

      (3) Figure S5, highlighting isn't defined.

      The figure legend for S5A (now S6A) states: “Less significant terms ranking 11 and below are in grey. Terms specifically discussed in the main text are highlighted in green.” Perhaps it was overlooked by this reviewer.

      (4) At several points the authors refer to "the MAPK signaling pathway", suggesting there is a single MAPK that is affected, e.g in the title, page 3, and other places when it seems they mean "MAPK signaling pathways" since several MAPK pathways appear to be affected.

      We apologize for any terminological inaccuracies. There are indeed several MAPK pathways operating in cells. In our study, we focused mainly on the MAPK/ERK pathway. The confusion probably stems from the fact that the corresponding term in the KEGG pathway database is labeled "MAPK signaling pathway" and this term, although singular, includes all MAPK pathways. We have carefully reviewed the entire article and have corrected the term used accordingly to either: 1) MAPK pathways in general, 2) the MAPK/ERK pathway for this particular pathway, or 3) "MAPK signaling pathway", where the KEGG term is meant.

      (5) Some eIF3 subunit RNAs have TOP motifs. One might expect 3e and 3h levels to change as a function of 3d knockdown due to TOP motifs but this is not observed. Can the authors speculate why the eIF3 subunit levels don't change but other TOP RNAs show TE changes? Is this true for other translation factors, or just for eIF3, or just for these subunits? Could the Western blot be out of linear range for the antibody or is there feedback affecting eIF3 levels differently than the other TOP RNAs, or a protein turnover mechanism to maintain eIF3 levels?

      This is indeed a very interesting question. In addition to the mRNAs encoding ribosomal proteins, we examined all TOP mRNAs and added an additional sheet to the S2 supplemental spreadsheet with all TOP RNAs listed in (Philippe et al., 2020, PMID: 32094190). According to our Ribo-Seq data, we could expect to see increased protein levels of eIF3a and eIF3f in eIF3dKD and eIF3eKD, but this is not the case, as judged from extensive western blot analysis performed in (Wagner et al. 2016, PMID: 27924037). Indeed, we cannot rule out the involvement of a compensatory mechanism monitoring and maintaining the levels of eIF3 subunits at steady-state, increasing or decreasing them if necessary, which could depend on the TOP motif-mediated regulation. However, we think that in our KDs, all non-targeted subunits that lose their direct binding partner in eIF3 due to siRNA treatment become rapidly degraded. For example, co-downregulation of subunits d, k and l in eIF3eKD is very likely caused by protein degradation as a result of a loss of their direct binding partner, eIF3e. Since we showed that the yeast eIF3 complex assembles co-translationally (Wagner et al. 2020, PMID: 32589964), and there is no reason to think that mammalian eIF3 differs in this regard, our working hypothesis is that free subunits that are not promptly incorporated into the eIF3 complex are rapidly degraded, and the presence or absence of the TOP motif in the 5’ UTR of their mRNAs has no effect. As for the other TOP mRNAs, translation factors eEF1B2, eEF1D, eEF1G and eEF2 have significantly increased FPs in both eIF3dKD and eIF3eKD, but we did not check their protein levels by western blotting, so we cannot conclude anything specific.

    1. Author response:

      The following is the authors’ response to the original reviews.

      The detailed, thorough critique provided by the three reviewers is very much appreciated. We believe the manuscript is greatly improved by the changes we have made based on those reviews. The major changes are described below, followed by a point by point response.

      Major Changes:

      (1) We revised our model (old Fig. 10; new Fig. 9) to keep the explanation focused on the data shown in the current study. Specifically, references to GTP/GDP states of Rab3A and changes in the presynaptic quantum have been removed and the mechanisms depicted are confined to pre- or post-synaptic Rab3A participating in either controlling release of a trophic factor that regulates surface GluA2 receptors (pre- or postsynaptic) or directly affecting fusion of GluA2-receptor containing vesicles (postsynaptic).

      (2) We replaced all cumulative density function plots and ratio plots, based on multiple quantile samples per cell, with box plots of cell means. This affects new Figures 1, 2, 3, 5, 6, 7 and 8. All references to “scaling,” “divergent scaling,” or “uniform scaling,” have been removed. New p values for comparison of means are provided above every box plot in Figures 1, 2, 3, 5, 6, 7 and 8. The number of cultures is provided in the figure legends.

      (3) We have added frequency to Figures 1, 2 and 8. Frequency values overall are more variable, and the effect of activity blockade less robust, than for mEPSC amplitudes. We have added text indicating that the increase in frequency after activity blockade was significant in neurons from cultures prepared from WT in the Rab3A+/- colony but not cultures prepared from KO mice (Results, lines 143 to 147, new Fig. 1G, H). The TTX-induced increase in frequency was significant in the NASPM experiments before NASPM, but not after NASPM (Results, lines 231 to 233, new Fig. 3, also cultures from WT in Rab3A+/- colony). The homeostatic plasticity effect on frequency did not reach significance in WT on WT glia cultures or WT on KO glia cultures, possibly due to the variability of frequency, combined with smaller sample sizes (Results, lines 400 to 403, new Fig. 8). In the cultures prepared from WT mice in the Rab3A+/Ebd colony, there was a trend towards higher frequency after TTX that did not reach statistical significance, and in cultures prepared from mutant mice, the p value was large, suggesting disruption of the effect, which appears to be due to an increase in frequency in untreated cultures, similar to the behavior of mEPSC amplitudes in neurons from mutant mice (Results, lines 161-167). In sum, the effect of activity on frequency requires Rab3A and Ca2+-permeable receptors, and is mimicked by the presence of the Rab3A Earlybird mutant. We have also added a discussion of these results (Discussion, lines 427-435).

      (4) In the revised manuscript we have added analysis of VGLUT1 levels for the same synaptic sites at which we previously analyzed GluA2 levels, and these data are described in Results, lines 344 to 371, and appear in new Table 2. In contrast to previous studies, we did not find any evidence for an increase in VGLUT1 levels after activity blockade. We reviewed those studies to determine whether there might be differences in the experimental details that could explain the lack of effect we observed. In (De Gois et al., 2005), the authors measured mRNA and performed western blots to show increases in VGLUT1 after TTX treatment in older rat cortical cultures (DIV 19). The study performs immunofluorescence imaging of VGLUT1 but only after bicuculline treatment (it decreases), not after TTX treatment. In (Wilson et al., 2005), the hippocampal cultures are treated with AP5, not TTX, and the VGLUT1 levels in immunofluorescence images are reported relative to synapsin I. That the type of activity blockade matters is illustrated by the failure of Wilson and colleagues to observe a consistent increase in VGLUT1/Synapsin ratio in cultures treated with AMPA receptor blockade (NBQX; supplementary information). These points have been added to the Discussion, lines 436 to 447.
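      The reason for replacing quantile-based plots with box plots of cell means (major change 2) can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the amplitudes, cell labels and helper names are hypothetical, and it simply contrasts treating every mEPSC as an independent observation with using one mean per cell.

```python
from math import sqrt
from statistics import mean, stdev


def cell_means(recordings):
    """Collapse each cell's mEPSC amplitudes (pA) to a single mean value."""
    return [mean(events) for events in recordings.values()]


def sem(values):
    """Standard error of the mean, with n = number of values supplied."""
    return stdev(values) / sqrt(len(values))


# Hypothetical amplitudes (pA) for three cells from one condition.
control = {
    "cell_1": [10.0, 12.0, 11.0, 13.0],
    "cell_2": [14.0, 15.0, 16.0, 15.0],
    "cell_3": [9.0, 10.0, 11.0, 10.0],
}

# Pooling every event inflates n (12 events instead of 3 cells).
pooled = [amp for events in control.values() for amp in events]
per_cell = cell_means(control)

print("n pooled events:", len(pooled), "SEM:", round(sem(pooled), 3))
print("n cells:", len(per_cell), "SEM:", round(sem(per_cell), 3))
```

      With pooled events the apparent standard error shrinks because n counts events rather than cells, which is the n-inflation concern raised in the review; using cell means keeps n equal to the number of recorded cells.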

      Reviewer #1:

      (1) (model…is not supported by the data), (2) (The analysis of mEPSC data using quantile sampling…), (3) (…statistical analysis of CDFs suffers from n-inflation…), (4) (How does recording noise and the mEPSC amplitude threshold affect “divergent scaling?”) (5) (…justification for the line fits of the ratio data…), (7) (A comparison of p-values between conditions….) and (10) (Was VGLUT intensity altered in the stainings presented in the manuscript?)

      The major changes we made, described above, address Reviewer #1’s points. The remaining points are addressed below.

      (6) TTX application induces a significant increase in mEPSC amplitude in Rab3A-/- mice in two out of three data sets (Figs. 1 and 9). Hence, the major conclusion that Rab3A is required for homeostatic scaling is only partially supported by the data. 

      The p values based on CDF comparisons were problematic, but the point we were making is that they were much larger for amplitudes measured in cultures prepared from Rab3A-/- mice (Fig. 1, p = 0.04) compared to those from cultures prepared from Rab3A+/+ mice (Fig. 1, p = 4.6 × 10⁻⁴). Now that we are comparing means, there are no significant TTX-induced effects on mEPSC amplitudes for Rab3A-/- data. However, acknowledging that some increase after activity blockade remains, we describe homeostatic plasticity as being impaired or not significant, rather than abolished, by loss of Rab3A (Abstract, lines 37 to 39; Results, lines 141 to 143; Discussion, lines 415 to 418).

      (8) There is a significant increase in baseline mEPSC amplitude in Rab3AEbd/Ebd (15 pA) vs. Rab3AEbd/+ (11 pA) cultures, but not in Rab3A-/- (13.6 pA) vs. Rab3A+/- (13.9 pA). Although the nature of scaling was different between Rab3AEbd/Ebd vs. Rab3AEbd/+ and Rab3AEbd/Ebd with vs. without TTX, the question arises whether the increase in mEPSC amplitude in Rab3AEbd/Ebd is Rab3A dependent. Could a Rab3A independent mechanism occlude scaling?

      The Reviewer is concerned that the increase in mEPSC amplitude in the presence of the Rab3A point mutant may be through a ‘non-Rab3A’ mechanism (a concern raised by the lack of such effect in cultures from the Rab3A-/- mice), and secondly, that the already large mEPSC cannot be further increased by the homeostatic plasticity mechanism. It must always be considered that a mutant with an altered genetic sequence may bind to novel partners, causing activities that would not be either facilitated or inhibited by the original molecule. We have added this caveat to Results, lines 180 to 186. We added that a number of other manipulations, implicating individual molecules in the homeostatic mechanism, have caused an increase in mEPSC amplitude at baseline, potentially nonspecifically occluding the ability of activity blockade to induce a further increase (Results, lines 186 to 189). Still, it is a strong coincidence that the novel activity of the mutant Rab3A would affect mEPSC amplitude, the same characteristic that is affected by activity blockade in a Rab3A-dependent manner, a point which we added to Results, lines 189 to 191.

      (9) Figure 4: NASPM appears to have a stronger effect on mEPSC frequency in the TTX condition vs. control (-40% vs -15%). A larger sample size might be necessary to draw definitive conclusions on the contribution of Ca2+-permeable AMPARs.

      Our results, even with the modest sample size of 11 cells, are clear: NASPM does not disrupt the effect of TTX treatment on mEPSC amplitude (new Fig. 3A). It also looks like there is a greater magnitude effect of NASPM on frequency in TTX-treated cells; we note this, but point out that nevertheless, these mEPSCs are not contributing to the increase in mEPSC amplitude (Results, lines 238-241).

      (11) The change in GluA2 area or fluorescence intensity upon TTX treatment in controls is modest. How does the GluA2 integral change?

      We had reported that GluA2 area showed the most prominent increase following activity blockade, with intensity changing very little. When we examined the integral, it closely matched the change in area. We have added the values for integral to new Fig. 5 D, H; new Fig. 6 A-C; new Fig. 7 A-C and new Table 1 (for GluA2) and new Table 2 (for VGLUT1). These results are described in the text in the following places: Results, lines 289-292; 298-299; 311-319; 328-324. For VGLUT1, both area and intensity changed modestly, and the integral appeared to be a combination of the two, being higher in magnitude and resulting in smaller p values than either area or intensity (Results, lines 344-348; 353-359; new Table 2).

      (12) The quantitative comparison between physiology and microscopy data is problematic. The authors report a mismatch in ratio values between the smallest mEPSC amplitudes and the smallest GluA2 receptor cluster sizes (l. 464; Figure 8). Is this comparison affected by the fluorescence intensity threshold? What was the rationale for a threshold of 400 a.u. or 450 a.u.? How does this threshold compare to the mEPSC threshold of 3 pA.

      This concern is partially addressed by no longer comparing the rank ordered mEPSC amplitudes with the rank ordered GluA2 receptor characteristics. We had used multiple thresholds in the event that an experiment was not analyzable with the chosen threshold (this in fact happened for VGLUT1, see end of this paragraph). We created box plots of the mean GluA2 receptor cluster size, intensity and integral, for experiments in which we used all three thresholds, to determine if the effect of activity blockade was different depending on which threshold was applied, and found that there was no obvious difference in the results (Author response image 1). Nevertheless, since there is no need to use a different threshold for any of the 6 experiments (3 WT and 3 KO), for new Figures 5, 6 and 7 we used the same threshold for all data, 450; described in Methods, lines 746 to 749. For VGLUT1 levels, it was necessary to use a threshold of 400 for Rab3A+/+ Culture #1, but a threshold of 200 for the other five experiments (Methods, lines 751-757). The VGLUT1 immunofluorescent sites in Culture #1 had higher levels overall, and the low threshold caused the entire AOI to be counted as the synapse, which clearly included background levels outside of the synaptic site. Conversely, to use a threshold of 400 on the other experiments meant that the synaptic site found by the automated measurement tool was much smaller than what was visible by eye. In our judgement it would have been meaningless to adhere to a single threshold for VGLUT1 data.
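
      As a minimal sketch of why the choice of threshold matters for the three measurements discussed above, the following assumes that "area" is the number of pixels above threshold, "intensity" is their mean value, and "integral" is their sum. These definitions, the function name `cluster_metrics`, and the pixel values are illustrative assumptions and are not taken from the automated measurement tool used in the study.

```python
def cluster_metrics(pixels, threshold):
    """Return (area, mean_intensity, integral) for pixels above threshold.

    pixels: 2D list of intensity values (arbitrary units) for one area of interest.
    Area is a pixel count here; converting it to square microns would require
    the pixel size, which is not modeled in this sketch.
    """
    above = [v for row in pixels for v in row if v > threshold]
    area = len(above)
    integral = sum(above)
    mean_intensity = integral / area if area else 0.0
    return area, mean_intensity, integral


# Hypothetical 4x4 area of interest, NOT data from the study.
aoi = [
    [120, 180, 210, 150],
    [190, 460, 520, 230],
    [210, 480, 610, 420],
    [130, 220, 430, 160],
]

# A low threshold counts background pixels as part of the site;
# a high threshold shrinks the site to its brightest core.
for threshold in (200, 400, 450):
    print(threshold, cluster_metrics(aoi, threshold))
```

      Under these assumptions the integral equals area multiplied by mean intensity, which is one way to read the observation that the integral behaves like a combination of the two.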

      Author response image 1.

      Using different thresholds does not substantially alter GluA2 receptor cluster size data. A) Rab3A+/+ Culture #1, size data for three different thresholds, depicted above each graph. B) Rab3A+/+ Culture #2, size data for three different thresholds, depicted above each graph. Note scale bar in A is different from B, to highlight differences for different thresholds. (Culture #3 was only analyzed with 450 threshold).

      The conclusion that an increase in AMPAR levels is not fully responsible for the observed mEPSC increase is mainly based on the rank-order analysis of GluA2 intensity, yielding a slope of ~0.9. There are several points to consider here: (i) GluA2 fluorescence intensity did increase on average, as did GluA2 cluster size.

      (ii) The increase in GluA2 cluster size is very similar to the increase in mEPSC amplitude (each approx. 18-20%). (iii) Are there any reports that fluorescence intensity values are linearly reporting mEPSC amplitudes (in this system)? Antibody labelling efficiency, and false negatives of mEPSC recordings may influence the results. The latter was already noted by the authors.

      Our comparison between mEPSC amplitude and GluA2 receptor cluster characteristics has been reexamined in the revised version using means rather than rank-ordered data in rank-order plots or ratio plots. Importantly, all of these methods revealed that in one out of three WT cultures (Culture #3) GluA2 receptor cluster size (old Fig. 8, old Table 1; new Fig. 6, new Table 1), intensity and integral (new Fig. 6, new Table 1) values decreased following activity blockade while in the same culture, mEPSC amplitudes increased. It is based on this lack of correspondence that we conclude that increases in mEPSC amplitude are not fully explained by increases in GluA2 receptors, and suggest there may be other contributors. These points are made in the Abstract (lines 108-110); Results (lines 319 to 326; 330-337; 341-343) and the Discussion (lines 472 to 474). To our knowledge, there are not any reports that quantitatively compare receptor levels (area, intensity or integrals) to mEPSC amplitudes in the same cultures. We examined the comparisons very closely for 5 studies that used TTX to block activity and examined receptor levels using confocal imaging at identified synapses (Hou et al., 2008; Ibata et al., 2008; Jakawich et al., 2010a; Xu and Pozzo-Miller, 2017; Dubes et al., 2022). We were specifically looking for whether the receptor data were more variable than the mEPSC amplitude data, as we found. However, for 4 of the studies, sample sizes were very different so that we cannot simply compare the p values. Below is a table of the comparisons.

      Author response table 1.

      In Xu 2017 the sample sizes are close enough that we feel comfortable concluding that the receptor data were slightly more variable (p < 0.05) than mEPSC data (p<0.01) but recognize that it is speculative to say our finding has been confirmed. A discussion of these articles is in Discussion, lines 456-474.

      (iv) It is not entirely clear if their imaging experiments will sample from all synapses. Other AMPAR subtypes than GluA2 could contribute, as could kainate or NMDA receptors.

      While our imaging data only examined GluA2, we used the application of NASPM to demonstrate Ca2+-permeable receptors did not contribute quantitatively to the increase in mEPSC amplitude following TTX treatment. Since GluA3 and GluA4 are also Ca2+-permeable, the findings in new Figure 3 (old Fig. 4) likely rule out these receptors as well. There are also reports that Kainate receptors are Ca2+-permeable and blocked by NASPM (Koike et al., 1997; Sun et al., 2009), suggesting the NASPM experiment also rules out the contribution of Kainate receptors. Finally, given our recording conditions, which included normal magnesium levels in the extracellular solution as well as TTX to block action-potential evoked synaptic transmission, NMDA receptors would not be available to contribute currents to our recordings due to block by magnesium ions at resting Vm. These points have been added to the Methods section, lines 617 to 677 (NMDA); 687-694 (Ca2+-permeable AMPA receptors and Kainate receptors).

      Furthermore, the statement “complete lack of correspondence of TTX/CON ratios” is not supported by the data presented (l. 515ff). First, under the assumption that no scaling occurs in Rab3A-/-, the TTX/CON ratios show a 20-30% change, which indicates the variation of this readout. Second, the two examples shown in Figure 8 for Rab3A+/+ are actually quite similar (culture #1 and #2, particularly when ignoring the leftmost section of the data, which is heavily affected by the raw values approaching zero).

      We are no longer presenting ratio plots in the revised manuscript, so our conclusion that mEPSC amplitude data do not always correspond to GluA2 receptor data no longer rests on differences in the behavior of TTX/CON ratio values, but only on the difference in direction of the TTX effect in one out of three cultures. We agree with the reviewer that the ratio plots are much more sensitive to differences between control and treated values than the rank order plot, and we feel these differences are important. For example, there is still a homeostatic increase in the Rab3A-/- cultures, and the effect is still divergent rather than uniform. But the comparison of ratio data will be presented elsewhere.

      (13) Figure 7A: TTX CDF was shifted to smaller mEPSC amplitude values in Rab3A-/- cultures. How can this be explained?

      While this result is most obvious in CDF plots, we still observe a trend towards smaller mEPSC amplitudes after TTX treatment in two of three individual cultures prepared from Rab3A-/- mice when comparing means (new Fig. 7, Table 1) which did not reach statistical significance for the pooled data (new Fig. 5, new Table 1). There was not any evidence of this decrease in the larger data set (new Fig. 1) nor for Rab3A-/- neurons on Rab3A+/+ glia (new Fig. 8). Given that this effect is not consistent, we did not comment on it in the revised manuscript. It may be that there is a non-Rab3A-dependent mechanism that results in a decrease in mEPSC amplitude after activity blockade, which normally pulls down the magnitude of the activity-dependent increase typically observed. But studying this second component would be difficult given its magnitude and inconsistent presentation.

      Reviewer #1 (Recommendations For the Authors):

      (1) Abstract, last sentence: The conclusion of the present manuscript should be primarily based on the results presented. At present, it is mainly based on a previous publication by the authors.

      We have revised the last sentence to reflect actual findings of the current study (Abstract, lines 47 to 49).

      (2) Line 55: “neurodevelopmental”

      This phrase has been removed.

      (3) Line 56: “AMPAergic” should be replaced by AMPAR-mediated

      This sentence was removed when all references to “scaling” were removed; no other instances of “AMPAergic” are present.

      (4) Figure 9: The use of BioRender should be disclosed in the Figure Legend.

      We used BioRender in new Figures 3, 7 and 8, and now acknowledge BioRender in those figure legends.

      (5) Figure legends and results: The number of cultures should be indicated for each comparison.

      Number of cultures has been added to the figure legends.

      (6) Line 289: A comparison of p-values between conditions does not allow any meaningful conclusions.

      Agreed, therefore we have removed CDFs and the KS test comparison p values. All comparisons in the revised manuscript are for cell means.

      (7) Line 623ff: The argument referring to NMJ data is weak, given that different types of receptors are involved.

      We still think it is valid to point out that Rab3A is required for the increase in mEPC at the NMJ but that ACh receptors do not increase (Discussion, lines 522 to 525). We are not saying that postsynaptic receptors do not contribute in cortical cultures, only that there could be another Rab3A-dependent mechanism that also affects mEPSC amplitude.

      (8) Plotting data points outside of the ranges should be avoided (e.g., Fig. 2Giii, 7F).

      These two figures are no longer present in the revised manuscript. In revising figures, we made sure no other plots have data points outside of the ranges.

      (9) The rationale for investigating Rab3AEbd/Ebd remains elusive and should be described.

      A rationale for investigating Rab3AEbd/Ebd is that if the results are similar to the KO, it strengthens the evidence for Rab3A being involved in homeostatic synaptic plasticity. In addition, since its phenotype of early awakening was stronger than that demonstrated in Rab3A KO mice (Kapfhamer et al., 2002), it was possible we would see a more robust effect. These points have been added to the Results, lines 118 to 126.

      (10) Figures 3 and 4, as well as Figure 5 and 6 could be merged.

      In the revised version, Figure 3 has been eliminated since its main point was a difference in scaling behavior. Figure 4 has been expanded to include a model of how NASPM could reduce frequency (new Fig. 3.) Images of the pyramidal cell body have been added to Figure 5 (new Fig. 4), and Figure 6 has been completely revised and now includes pooled data for both Rab3A+/+ and Rab3A-/- cultures, for mEPSC amplitude, GluA2 receptor cluster size, intensity and integral.

      (11) Figure 5: The legend refers to MAP2, but this is not indicated in the figure.

      MAP2 has now been added to the labels for each image and described in the figure legend (new Fig. 4).

      Reviewer #2:

      Technical concerns:

      (1) The culture condition is questionable. The authors saw no NMDAR current present during spontaneous recordings, which is worrisome since NMDARs should be active in cultures with normal network activity (Watt et al., 2000; Sutton et al., 2006). It is important to ensure there is enough spiking activity before doing any activity manipulation. Similarly it is also unknown whether spiking activity is normal in Rab3AKO/Ebd neurons.

      In the studies cited by the reviewer, NMDA currents were detected under experimental conditions in which magnesium was removed. In our recordings, we have normal magnesium (1.3 mM) and also TTX, which prevents the necessary depolarization to allow inward current through NMDA receptors. This point has been added to our Methods, lines 674 to 677. We acknowledge we do not know the level of spiking in cultures prepared from Rab3A+/+, Rab3A-/- or Rab3AEbd/Ebd mice. Given the similar mEPSC amplitude for untreated cultures from WT and KO studies, we think it unlikely that activity was low in the latter, but it remains a possibility for untreated cultures from Rab3AEbd/Ebd mice, where mEPSC amplitude was increased. These points are added to the Methods, lines 615 to 622.

      (2) Selection of mEPSC events is not conducted in an unbiased manner. Manually selecting events is insufficient for cumulative distribution analysis, where small biases could skew the entire distribution. Since the authors claim their ratio plot is a better method to detect the uniformity of scaling than the well-established rank-order plot, it is important to use an unbiased population to substantiate this claim.

      We no longer include any cumulative distributions or ratio plot analysis in the revised version. We have added the following text to Methods, lines 703 to 720:

      “MiniAnalysis selects many false positives with the automated feature when a small threshold amplitude value is employed, due to random fluctuations in noise, so manual re-evaluation of the automated process is necessary to eliminate false positives. If the threshold value is set high, there are few false positives but small amplitude events that visually are clearly mEPSCs are missed, and manual re-evaluation is necessary to add back false negatives or the population ends up biased towards large mEPSC amplitudes. As soon as there is a manual step, bias is introduced. Interestingly, a manual re-evaluation step was applied in a recent study that describes their process as ‘unbiased’ (Wu et al., 2020). In sum, we do not believe it is currently possible to perform a completely unbiased detection process. A fully manual detection process means that the same criterion (“does this look like an mEPSC?”) is applied to all events, not just the false positives, or the false negatives, which prevents the bias from being primarily at one end or the other of the range of mEPSC amplitudes. It is important to note that when performing the MiniAnalysis process, the researcher did not know whether a record was from an untreated cell or a TTX-treated cell.”

      (3) Immunohistochemistry data analysis is problematic. The authors only labeled dendrites without doing cell-fills to look at morphology, so it is questionable how they differentiate branches from pyramidal neurons and interneurons. Since glutamatergic synapse on these two types of neuron scale in the opposite directions, it is crucial to show that only pyramidal neurons are included for analysis.

      We identified neurons with a pyramidal shape and a prominent primary dendrite at 60x magnification without the zoom feature. This should have been made clear in the description of imaging. We have added an image of the two selected cells to our figure of dendrites (old Fig. 5, new Fig. 4), and described this process in the Methods, lines 736 to 739, and Results, lines 246 to 253. Given the morphology of the neurons selected it is highly unlikely that the dendrites we analyzed came from interneurons.

      Conceptual Concerns

      The only novel finding here is the implicated role for Rab3A in synaptic scaling, but insights into mechanisms behind this observation are lacking. The authors claim that Rab3A likely regulates scaling from the presynaptic side, yet there is no direct evidence from data presented. In its current form, this study’s contribution to the field is very limited.

      We have demonstrated that loss of Rab3A and expression of a Rab3A point mutant disrupt homeostatic plasticity of mEPSC amplitudes, and that in the absence of Rab3A, the increase in GluA2 receptors at synaptic sites is abolished. Further, we show that this effect cannot be through release of a factor, like TNFα, from astrocytes. In the new version, we add the finding that VGLUT1 is not increased after activity blockade, ruling out this presynaptic factor as a contributor to homeostatic increases in mEPSC amplitude. We show for the first time by examining mEPSC amplitudes and GluA2 receptors in the same cultures that the increases in GluA2 receptors are not as consistent as the increases in mEPSC amplitude, suggesting the possibility of another contributor to homeostatic increases in mEPSC amplitude. We first proposed this idea in our previous study of Rab3A-dependent homeostatic increases in mEPC amplitudes at the mouse neuromuscular junction. In sum, we dispute that there is only one novel finding and that we have no insights into mechanism. We acknowledge that we have no direct evidence for regulation from the presynaptic side, and have removed this claim from the revised manuscript. We have retained the Discussion of potential mechanisms affecting the presynaptic quantum and evidence that Rab3A is implicated in these mechanisms (vesicle size, fusion pore kinetics; Discussion, lines 537 to 563). One way to directly show that the amount of transmitter released for an mEPSC has been modified after activity blockade is to demonstrate that a fast off-rate antagonist has become less effective at inhibiting mEPSCs (because the increased glutamate released outcompetes it; see Liu et al., 1999 and Wilson et al., 2005 for example experiments). This set of experiments is underway but will take more time than originally expected, because we are finding surprisingly large decreases in frequency, possibly the result of mEPSCs with very low glutamate concentration that are completely inhibited by the dose used. Once mEPSCs are lost, it is difficult to compare the mEPSC amplitude before and after application of the antagonist. Therefore we intend to include this experiment in a future report, once we determine the reason for the frequency reduction or find a dose where this does not occur.

      (1) Their major argument for this is that homeostatic effects on mEPSC amplitudes and GluA2 cluster sizes do not match. This is inconsistent with reports from multiple labs showing that upscaling of mEPSC amplitude and GluA2 accumulation occur side by side during scaling (Ibata et al., 2008; Pozo et al., 2012; Tan et al., 2015; Silva et al., 2019). Further, because the acquisition and quantification methods for mEPSC recordings and immunohistochemistry imaging are entirely different (each with its own limitations in signal detection), it is not convincing that the lack of proportional changes must signify a presynaptic component.

      Within the analyses in the revised manuscript, which are now based only on comparison of cell/dendrite means, we find a very good match in the magnitude of increase for the pooled data of mEPSC amplitudes and GluA2 receptor cluster sizes (+19.7% and +20.0% respectively; new Table 1). However, when looking at individual cultures, we had one of three WT cultures in which mEPSC amplitude increased 17.2% but GluA2 cluster size decreased 9.5%. This result suggests that while activity blockade does lead to an increase in GluA2 receptors, the effect is more variable than that for mEPSC amplitude. We went back to published studies to see if this has been previously observed, but found that it was difficult to compare because the sample sizes were different for the two characteristics (see Author response table 1). We included these particular 5 studies because they use the same treatment (TTX), examine receptors using imaging of identified synaptic sites, and record mEPSCs in their cultures (although the authors do not indicate that imaging and recordings are done simultaneously on the same cultures). Only one of the studies listed by the Reviewer is among these five (Ibata et al., 2008). The study by Tan et al. (2015) uses western blots to measure receptors; the study by Silva et al. (2019) blocks activity using a combination of AMPA and NMDA receptor blockers; the study by Pozo et al. (2012) correlates mEPSC amplitude changes with imaging, not in response to activity blockade but in response to changing the expression of GluA2. While it may seem like splitting hairs to reject studies that use other treatment protocols, there is ample evidence that the mechanisms of homeostatic plasticity depend on how activity was altered; see the following studies for several examples of this (Sutton et al., 2006; Soden and Chen, 2010; Fong et al., 2015). A discussion of the 5 articles we selected is in the revised manuscript, Discussion, lines 456 to 474. In sum, we provide evidence that activity blockade is associated with an overall increase in GluA2 receptors; what we propose is that this increase, being more variable, does not fully explain the increase in mEPSC amplitude. However, we acknowledge that the disparity could be explained by the differences in limitations of the two methods (Discussion, lines 469-472).

      (2) The authors also speculate in the discussion that presynaptic Rab3A could be interacting with retrograde BDNF signaling to regulate postsynaptic AMPARs. Without data showing Rab3A-dependent presynaptic changes after TTX treatment, this argument is not compelling. In this retrograde pathway, BDNF is synthesized in and released from dendrites (Jakawich et al., 2010b; Thapliyal et al., 2022), and it is entirely possible for postsynaptic Rab3A to interfere with this process cell-autonomously.

      We have added the information that Rab3A could control BDNF from the postsynaptic cell and included the two references provided by the reviewer, Discussion, lines 517 to 518. We have added new evidence, recently published, that the Rab3 family has been shown to regulate targeting of EGF receptors to rafts (among other plasma membrane molecules), with Rab3A itself clearly present in nonneuronal cells (Diaz-Rohrer et al., 2023) (added to Discussion, lines 509 to 515).

      (3) The authors propose that a change in AMPAR subunit composition from GluA2-containing ones to GluA1 homomers may account for the distinct changes in mEPSC amplitudes and GluA2 clusters. However, their data from the NASPM wash-in experiments clearly show that the GluA1 homomer contributions have not changed before and after TTX treatment.

      We have revised this section in the Discussion, lines 534 to 536, to clarify that any change due to GluA1 homomers should have been detectable by a greater ability of NASPM to reverse the TTX-induced increase.

      Reviewer #2 (Recommendations for the Authors):

      For authors to have more convincing arguments in general, they will need to clarify/improve certain details in their data collection by addressing the above technical concerns. Additionally, the authors should design experiments to test whether Rab3A regulates scaling from pre- or post-synaptic site. For example, they could sparsely knock out Rab3A in WT neurons to test the postsynaptic possibility. On the other hand, their argument for a presynaptic role would be much more compelling if they could show whether there are clear functional changes such as in vesicle sizes and release probability in the presynaptic terminal of Rab3AKO neurons.

      An important next step is to identify whether Rab3A is acting pre- or post-synaptically (Discussion, lines 572 to 573), but these experiments will be undertaken in the future. It would not add much to simply show vesicle size is altered in the KO (and we do not necessarily expect this since mEPSC amplitude is normal in the KO). It will be very difficult to establish that vesicle size is changing with activity blockade and that this change is prevented in the Rab3A KO, because we are looking for a ~25% increase in vesicle volume, which would correspond to a ~7.5% increase in diameter. Finally, we do not believe demonstrating changes in release probability tells us anything about a presynaptic role for Rab3A in regulating the size of the presynaptic quantum.
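
      The relationship between the volume and diameter figures quoted above can be checked with a short calculation. This is a sketch that assumes vesicles are spheres, so that volume scales with the cube of diameter; the function name is hypothetical.

```python
def diameter_change_for_volume_change(volume_fraction):
    """Fractional change in sphere diameter for a fractional change in volume.

    Volume is proportional to diameter cubed, so the diameter scales with
    the cube root of the volume ratio.
    """
    return (1.0 + volume_fraction) ** (1.0 / 3.0) - 1.0


# A 25% increase in volume corresponds to a diameter increase of under 8%.
print(f"{diameter_change_for_volume_change(0.25):.1%}")
```

      Under the spherical assumption this comes out a little under 8%, in line with the approximate value given above, and it illustrates why such a change would be hard to resolve by measuring diameters.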

      Reviewer #3 (Public Review)

      Weaknesses: However, the rather strong conclusions on the dissociation of AMPAR trafficking and synaptic response are made from somewhat weaker data. The key issue is the GluA2 immunostaining in comparison with the mEPSC recordings. Their imaging method involves only assessing puncta clearly associated with a MAP2 labeled dendrite. This is a small subset of synapses, judging from the sample micrographs (Fig. 5). To my knowledge, this is a new and unvalidated approach that could represent a particular subset of synapses not representative of the synapses contributing to the mEPSC change (they are also sampling different neurons for the two measurements; an additional unknown detail is how far from the cell body were the analyzed dendrites for immunostaining.) While the authors acknowledge that a sampling issue could explain the data, they still use this data to draw strong conclusions about the lack of AMPAR trafficking contribution to the mEPSC amplitude change. This apparent difference may be a methodological issue rather than a biological one, and at this point it is impossible to differentiate these. It will unfortunately be difficult to validate their approach. Perhaps if they were to drive NMDA-dependent LTD or chemLTP, and show alignment of the imaging and ephys, that would help. More helpful would be recordings and imaging from the same neurons but this is challenging. Sampling from identified synapses would of course be ideal, perhaps from 2P uncaging combined with SEP-labeled AMPARs, but this is more challenging still. But without data to validate the method, it seems unwarranted to make such strong conclusions such as that AMPAR trafficking does not underlie the increase in mEPSC amplitude, given the previous data supporting such a model.

      In the new version, we soften our conclusion regarding the mismatch between GluA2 receptor levels and mEPSC amplitudes, now only stating that receptors may not be the sole contributor to the TTX effect on mEPSC amplitude (Discussion, lines 472 to 474). With our analysis in the new version focusing on comparisons of cell means, the GluA2 receptor cluster size and the mEPSC amplitude data match well in magnitude for the data pooled across the 3 matched cultures (20.0% and 19.7%, respectively, see new Table 1). However, in one of the three cultures the direction of change for GluA2 receptors is opposite that of mEPSC amplitudes (Table 1, Culture #3, -9.5% vs +17.2%, respectively).
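
      To illustrate how a pooled comparison of means can show a clear increase while one culture moves in the opposite direction, here is a minimal sketch. All numbers are hypothetical and are NOT data from the study; `percent_change` is an illustrative helper, not the analysis code used for Table 1.

```python
from statistics import mean


def percent_change(control, treated):
    """Percent change of the treated mean relative to the control mean."""
    control_mean = mean(control)
    return 100.0 * (mean(treated) - control_mean) / control_mean


# Hypothetical per-cell means (arbitrary units) as (control, TTX) pairs.
cultures = {
    "culture_1": ([10.0, 12.0, 11.0], [14.0, 13.0, 15.0]),
    "culture_2": ([9.0, 11.0, 10.0], [12.0, 13.0, 11.0]),
    "culture_3": ([12.0, 10.0, 11.0], [10.0, 9.0, 11.0]),
}

# Per-culture effect: the sign can differ from one culture to the next.
for name, (con, ttx) in cultures.items():
    print(name, round(percent_change(con, ttx), 1))

# Pooled effect: all cells from all cultures combined before averaging.
pooled_con = [v for con, _ in cultures.values() for v in con]
pooled_ttx = [v for _, ttx in cultures.values() for v in ttx]
print("pooled", round(percent_change(pooled_con, pooled_ttx), 1))
```

      In this made-up example the pooled change is positive even though the third culture decreases, which is the pattern of pooled agreement with a single divergent culture described above.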

      It is unlikely that the lack of matching of homeostatic plasticity in one culture, but very good matching in two other cultures, can be explained by an unvalidated focus on puncta associated with MAP2 positive dendrites. We chose to restrict analysis of synaptic GluA2 receptors to the primary dendrite in order to reduce variability, reasoning that we are always measuring synapses for an excitatory pyramidal neuron, synapses that are relatively close to the cell body, on the consistently identifiable primary dendrite. We measured how far this was for the two cells depicted in old Figure 5 (new Fig. 4). Because we always used the 5X zoom window, which is a set length, and positioned it within ~10 microns of the cell body, these cells give a ballpark estimate for the usual distances. For the untreated cell, the average distance from the cell body was 38.5 ± 2.8 µm; for the TTX-treated cell, it was 42.4 ± 3.2 µm (p = 0.35, Kruskal-Wallis test). We have added these values to the Results, lines 270 to 274.

      We did not mean to propose that AMPA receptor levels do not contribute at all to mEPSC amplitude, and we acknowledge there are clear cases where the two characteristics change in parallel (for example, in the study cited by Reviewer #2, (Pozo et al., 2012), increases in GluA2 receptors due to exogenous expression are closely matched by increases in mEPSC amplitudes.) What our matched culture experiments demonstrate is that in the case of TTX treatment, both GluA2 receptors and mEPSC amplitudes increase on average, but sometimes mEPSC amplitudes can increase in the absence of an increase in GluA2 receptors (Culture #3, Rab3A+/+ cultures), and sometimes mEPSC amplitudes do not increase even though GluA2 receptor levels do increase (Culture #3, Rab3A-/- cultures). Therefore, it would not add anything to our argument to examine receptors and mEPSCs in NMDA-dependent LTP, a different plasticity paradigm in which changes in receptors and mEPSCs may more closely align. It has been demonstrated that mEPSCs of widely varying amplitude can be recorded from a single synaptic site (Liu and Tsien, 1995), so we would need to measure a large sample of individual synapse recordings to detect a modest shift in average values due to activity blockade. In addition, it would be essential to express fluorescent AMPA receptors in order to correlate receptor levels in the same cells we record from (or at the same synapses). And yet, even after these heroics, one is still left with the issue that the two methods, electrophysiology and fluorescent imaging, have distinct limitations and sources of variability that may obscure any true quantitative correlation.

      Other questions arise from the NASPM experiments, used to justify looking at GluA2 (and not GluA1) in the immunostaining. First, there is a frequency effect that is quite unclear in origin. One would expect NASPM to merely block some fraction of the post-synaptic current, and not affect pre-synaptic release or block whole synapses. It is also unclear why the authors argue this proves that NASPM was at an effective concentration (lines 399-400). Further, the amplitude data show a strong trend towards smaller amplitude. The p value for both control and TTX neurons was 0.08 – it is very difficult to argue that there is no effect. And the decrease is larger in the TTX neurons. Considering the strong claims for a presynaptic locus and the use of this data to justify only looking at GluA2 by immunostaining, these data do not offer much support of the conclusions. Between the sampling issues and perhaps looking at the wrong GluA subunit, it seems premature to argue that trafficking is not a contributor to the mEPSC amplitude change, especially given the substantial support for that hypothesis. Further, even if trafficking is not the major contributor, there could be shifts in conductance (perhaps due to regulation of auxiliary subunits) that does not necessitate a pre-synaptic locus. While the authors are free to hypothesize such a mechanism, it would be prudent to acknowledge other options and explanations.

      We have created a model cartoon to explain how NASPM could reduce mEPSC frequency (new Fig. 3D). mEPSCs that arise from a synaptic site that has only Ca2+-permeable AMPA receptors will be completely blocked by NASPM, if the NASPM concentration is maximal. The reason we conclude that we have sufficient NASPM reaching the cells is that the frequency is decreased, as expected if there are synaptic sites with only Ca2+-permeable AMPA receptors. We previously were not clear that there is an effect of NASPM on mEPSC amplitude, although it did not reach statistical significance (new Fig. 3B). Where there is no effect is on the TTX-induced increase in mEPSC amplitude, which remains after the acute NASPM application (new Fig. 3A). We have revised the description of these findings in Results, lines 220 to 241. In reviewing the literature further, we could find no previous studies demonstrating an increase in conductance in GluA2 or Ca2+-impermeable receptors, only in GluA1 homomers. In other words, any conductance change would have been due to a change in GluA1 homomers, and should have been visible as a disruption of the homeostatic plasticity by NASPM application. We have added text to Results, lines 211 to 217; 236-241; Discussion, lines 420 to 422; 526-536 and Methods, lines 685 to 695 regarding this point.

      The frequency data are missing from the paper, with the exception of the NASPM dataset. The mEPSC frequencies should be reported for all experiments, particularly given that Rab3A is generally viewed as a pre-synaptic protein regulating release. Also, in the NASPM experiments, the average frequency is much higher in the TTX treated cultures. Is this statistically above control values?

      This comment is addressed by the major change #3, above.

      Unaddressed issues that would greatly increase the impact of the paper:

      (1) Is Rab3A acting pre-synaptically, post-synaptically or both? The authors provide good evidence that Rab3A is acting within neurons and not astrocytes. But knowing where it is acting (pre or post) would aid substantially in understanding its role (and particularly the hypothesized and somewhat novel idea that the amount of glutamate released per vesicle is altered in HSP). They could use sparse knockdown of Rab3A, or simply mix cultures from KO and WT mice (with appropriate tags/labels). The general view in the field has been that HSP is regulated post-synaptically via regulation of AMPAR trafficking, and considerable evidence supports this view. The more support for their suggestion of a pre-synaptic site of control, the better.

      This is similar to the request of Reviewer #2, Recommendations to the Authors. An important next step is to identify whether Rab3A is working pre- or postsynaptically. However, it is possible that it is acting pre-synaptically to anterogradely regulate trafficking of AMPAR, as we have depicted in our model, new Fig. 9. To demonstrate that the presynaptic quantum is being altered, we would need to show that vesicle size is increased, or the amount of transmitter being released during an mEPSC is increased after activity blockade. To that end, we are currently performing experiments using a fast off-rate antagonist. As described above in response to Reviewer #2’s Conceptual Concerns, we find dramatic decreases in frequency not explained by the 30-60% inhibition observed for the largest amplitude mEPSCs, which suggests the possibility that small mEPSCs are more sensitive than large mEPSCs and therefore may have less transmitter. Due to these complexities and the delay while we test other antagonists to see if the effect is specific to fast-off rate antagonists, we are not including these results here.

      (2) Rab3A is also found at inhibitory synapses. It would be very informative to know if HSP at inhibitory synapses is similarly affected. This is particularly relevant as at inhibitory synapses, one expects a removal of GABARs and/or a decrease of GABA-packaging in vesicles (ie the opposite of whatever is happening at excitatory synapses.). If both processes are regulated by Rab3A, this might suggest a role for this protein more upstream in the signaling, an effect only at excitatory synapses would argue for a more specific role just at these synapses.

      It will be important to determine if homeostatic synaptic plasticity at inhibitory synapses on excitatory neurons is sensitive to Rab3A deletion, especially in light of the fact that unlike many of the other molecules implicated in homeostatic increases in mEPSCs, Rab3A is not a molecule known to be selective for glutamate receptor trafficking (in contrast to Arc/Arg3.1 or GRIP1, for example). Such a study would warrant its own publication.

      Reviewer #3 (Recommendations for the Authors):

      There are a number of minor points or suggestions for the authors:

      Is RIM1 part of this pathway (or expected to be)? Some discussion of this would be nice.

      RIM, Rab3-interacting molecule, has been implicated at the Drosophila neuromuscular junction in a presynaptic form of homeostatic synaptic plasticity in which evoked release is increased after block of postsynaptic receptors (Muller et al., 2012), a plasticity that also requires Rab3-GAP (Muller et al., 2011). To our knowledge there is no evidence that RIM is involved in the homeostatic plasticity of mEPSC amplitude after activity blockade by TTX. The Rim1a KO does not have a change in mEPSC amplitude relative to WT (Calakos et al., 2004), but that is not unexpected given the normal mEPSC amplitude in neurons from cultures prepared from Rab3A-/- mice in the current study. It would be interesting to look at homeostatic plasticity in cortical cultures prepared from Rim1a or other RIM deletion mice, but we have not added these points to the revised manuscript since there are a number of directions one could go in attempting to define the molecular pathway and we feel it is more important to discuss the potential location of action and physiological mechanisms.

      Is the Earlybird mutation a GOF? More information about this mutation would help.

      We have added a description of how the Earlybird mutation was identified, in a screen for rest:activity mutants (Results, lines 118 to 123). Rab3A Earlybird mice have a shortened circadian period, shifting their wake cycle earlier and earlier. When Rab3A deletion mice were tested in the same activity raster plot measurements, the shift was smaller than that for the Earlybird mutant, suggesting the possibility that it is a dominant negative mutation.

      The high K used in the NASPM experiments seems a bit unusual. Have the authors done high K/no drug controls to see if this affects the synapses in any way?

      We used the high K based on previous studies that indicated the blocking effect of the Ca2+-permeable receptor blockers was use dependent (Herlitze et al., 1993; Iino et al., 1996; Koike et al., 1997). We reasoned that a modest depolarization would increase the frequency of AMPA receptor mEPSCs and allow access of the NASPM. We have added this point to the Methods, lines 695 to 708.

      The NASPM experiments do not show that GluA1 does not contribute (line 401), only that GluA1 homomers are not contributing (much – see above). GluA1/A2 heteromers are quite likely involved. Also, the SEM is missing from the WT pre/post NASPM data.

      Imaging of GluA2-positive sites will not distinguish between GluA2 homomers and GluA2-GluA1 heteromers, so we have added this clarification to Results, lines 242 to 246. We have remade the NASPM pre-post line plots so that the mean values and error bars are more visible (new Fig. 3B, C).

      It seems odd to speculate based on non-significant findings (line 650-1), with lower significance (p = 0.11) than findings being dismissed in the paper (NASPM on mEPSC amplitude; p = 0.08).

      We did not mean to dismiss the effect of NASPM on mEPSC amplitude (new Fig. 3B), rather, we dismiss the effect of NASPM on the homeostatic increase in mEPSC amplitude caused by TTX treatment (new Fig. 3A). We have emphasized this distinction in Results, lines 223 to 225, and Discussion, lines 420 to 422, as well as adding that the stronger effect of NASPM on frequency after TTX treatment suggests an activity-dependent increase in the number of synapses expressing only Ca2+ permeable homomers (Results, lines 236 to 241; Discussion, lines 431 to 435).

      Fig. 4 could be labeled better (to make it clear that B is amplitude and C is freq from the same cells).

      Fig. 4 has been revised—now the amplitude and frequency plots from the same condition (new Fig. 3, B, C; CON or TTX) are in a vertical line and the figure legend states that the frequency data are from the same cells as in Fig. 3A.

      The raw amplitude data seems a bit hidden in the inset panels – I would suggest these data are at least as important as the cumulative distributions in the main panel. Maybe re-organizing the figures would help.

      We have removed all cumulative distributions, rank order plots, and ratio plots. The box plots are now full size in new Figures 1, 2, 5, 6, 7 and 8.

      I’m not sure I would argue in the paper that 12 cells a day is a limiting issue for experiments. It doesn’t add anything and doesn’t seem like that high a barrier. It is fine to just say it is difficult and therefore there is a limited amount of data meeting the criteria.

      We have removed the comment regarding difficulty.

      Calakos N, Schoch S, Sudhof TC, Malenka RC (2004) Multiple roles for the active zone protein RIM1alpha in late stages of neurotransmitter release. Neuron 42:889-896.

      De Gois S, Schafer MK, Defamie N, Chen C, Ricci A, Weihe E, Varoqui H, Erickson JD (2005) Homeostatic scaling of vesicular glutamate and GABA transporter expression in rat neocortical circuits. J Neurosci 25:7121-7133.

      Diaz-Rohrer B, Castello-Serrano I, Chan SH, Wang HY, Shurer CR, Levental KR, Levental I (2023) Rab3 mediates a pathway for endocytic sorting and plasma membrane recycling of ordered microdomains. Proc Natl Acad Sci U S A 120:e2207461120.

      Dubes S, Soula A, Benquet S, Tessier B, Poujol C, Favereaux A, Thoumine O, Letellier M (2022) miR-124-dependent tagging of synapses by synaptopodin enables input-specific homeostatic plasticity. EMBO J 41:e109012.

      Fong MF, Newman JP, Potter SM, Wenner P (2015) Upward synaptic scaling is dependent on neurotransmission rather than spiking. Nat Commun 6:6339.

      Herlitze S, Raditsch M, Ruppersberg JP, Jahn W, Monyer H, Schoepfer R, Witzemann V (1993) Argiotoxin detects molecular differences in AMPA receptor channels. Neuron 10:1131-1140.

      Hou Q, Zhang D, Jarzylo L, Huganir RL, Man HY (2008) Homeostatic regulation of AMPA receptor expression at single hippocampal synapses. Proc Natl Acad Sci U S A 105:775-780.

      Ibata K, Sun Q, Turrigiano GG (2008) Rapid synaptic scaling induced by changes in postsynaptic firing. Neuron 57:819-826.

      Iino M, Koike M, Isa T, Ozawa S (1996) Voltage-dependent blockage of Ca(2+)-permeable AMPA receptors by joro spider toxin in cultured rat hippocampal neurones. J Physiol 496 (Pt 2):431-437.

      Jakawich SK, Neely RM, Djakovic SN, Patrick GN, Sutton MA (2010a) An essential postsynaptic role for the ubiquitin proteasome system in slow homeostatic synaptic plasticity in cultured hippocampal neurons. Neuroscience 171:1016-1031.

      Jakawich SK, Nasser HB, Strong MJ, McCartney AJ, Perez AS, Rakesh N, Carruthers CJ, Sutton MA (2010b) Local presynaptic activity gates homeostatic changes in presynaptic function driven by dendritic BDNF synthesis. Neuron 68:1143-1158.

      Kapfhamer D, Valladares O, Sun Y, Nolan PM, Rux JJ, Arnold SE, Veasey SC, Bucan M (2002) Mutations in Rab3a alter circadian period and homeostatic response to sleep loss in the mouse. Nat Genet 32:290-295.

      Koike M, Iino M, Ozawa S (1997) Blocking effect of 1-naphthyl acetyl spermine on Ca(2+)-permeable AMPA receptors in cultured rat hippocampal neurons. Neurosci Res 29:27-36.

      Liu G, Tsien RW (1995) Properties of synaptic transmission at single hippocampal synaptic boutons. Nature 375:404-408.

      Liu G, Choi S, Tsien RW (1999) Variability of neurotransmitter concentration and nonsaturation of postsynaptic AMPA receptors at synapses in hippocampal cultures and slices. Neuron 22:395-409.

      Muller M, Pym EC, Tong A, Davis GW (2011) Rab3-GAP controls the progression of synaptic homeostasis at a late stage of vesicle release. Neuron 69:749-762.

      Muller M, Liu KS, Sigrist SJ, Davis GW (2012) RIM controls homeostatic plasticity through modulation of the readily-releasable vesicle pool. J Neurosci 32:16574-16585.

      Pozo K, Cingolani LA, Bassani S, Laurent F, Passafaro M, Goda Y (2012) beta3 integrin interacts directly with GluA2 AMPA receptor subunit and regulates AMPA receptor expression in hippocampal neurons. Proc Natl Acad Sci U S A 109:1323-1328.

      Silva MM, Rodrigues B, Fernandes J, Santos SD, Carreto L, Santos MAS, Pinheiro P, Carvalho AL (2019) MicroRNA-186-5p controls GluA2 surface expression and synaptic scaling in hippocampal neurons. Proc Natl Acad Sci U S A 116:5727-5736.

      Soden ME, Chen L (2010) Fragile X protein FMRP is required for homeostatic plasticity and regulation of synaptic strength by retinoic acid. J Neurosci 30:16910-16921.

      Sun HY, Bartley AF, Dobrunz LE (2009) Calcium-permeable presynaptic kainate receptors involved in excitatory short-term facilitation onto somatostatin interneurons during natural stimulus patterns. J Neurophysiol 101:1043-1055.

      Sutton MA, Ito HT, Cressy P, Kempf C, Woo JC, Schuman EM (2006) Miniature neurotransmission stabilizes synaptic function via tonic suppression of local dendritic protein synthesis. Cell 125:785-799.

      Tan HL, Queenan BN, Huganir RL (2015) GRIP1 is required for homeostatic regulation of AMPAR trafficking. Proc Natl Acad Sci U S A 112:10026-10031.

      Thapliyal S, Arendt KL, Lau AG, Chen L (2022) Retinoic acid-gated BDNF synthesis in neuronal dendrites drives presynaptic homeostatic plasticity. Elife 11.

      Wilson NR, Kang J, Hueske EV, Leung T, Varoqui H, Murnick JG, Erickson JD, Liu G (2005) Presynaptic regulation of quantal size by the vesicular glutamate transporter VGLUT1. J Neurosci 25:6221-6234.

      Wu YK, Hengen KB, Turrigiano GG, Gjorgjieva J (2020) Homeostatic mechanisms regulate distinct aspects of cortical circuit dynamics. Proc Natl Acad Sci U S A 117:24514-24525.

      Xu X, Pozzo-Miller L (2017) EEA1 restores homeostatic synaptic plasticity in hippocampal neurons from Rett syndrome mice. J Physiol 595:5699-5712.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This important study enhances our understanding of the effects of landscape context on grassland plant diversity and biomass. Notably, the authors use a well-designed field sampling method to separate the effects of habitat loss and fragmentation per se. Most of the data and analyses provide solid support for the findings that habitat loss weakens the positive relationship between grassland plant richness and biomass.

      Response: Thanks very much for organizing the review of the manuscript. We are grateful to you for the recognition. We have carefully analyzed all comments of the editors and reviewers and revised our manuscript to address them. All comments and recommendations are helpful and constructive for improving our manuscript. We have described in detail our response to each comment below.

      In addition to the reviewers' assessments, we have the following comments on your paper.

      (1) Some of the results are not consistent between figures. The relationships between overall species richness and fragmentation per se are not consistent between Figs. 3 and 5. The relationships between aboveground biomass and habitat loss are not consistent between Figs. 4 and 5. How shall we interpret these inconsistent results?

      Response: Thanks for your insightful comments. The reason for these inconsistencies is that the linear regression model did not take into account the complex causal relationships (including direct and indirect effects) among the different influencing factors. The results in Figures 3 and 4 just represent the pairwise relationship pattern and relative importance, respectively. The causal effects of habitat loss and fragmentation per se on plant richness and above-ground biomass should be interpreted based on the structural equation model results (Figure 6). We have revised the data analysis to resolve these inconsistencies (Lines 225-228).

      In the revised manuscript, we have added the interpretation for these inconsistent results. The inconsistent effects between Figures 3 and 6 suggest that fragmentation per se actually had a positive effect on plant richness after accounting for the effects of habitat loss and environmental factors simultaneously.

      The inconsistent effects between Figures 4 and 6 are because the effects of habitat loss and fragmentation per se on above-ground biomass were mainly mediated by plant richness and environmental factors, which had no significant direct effect (Figure 6). Thus, habitat loss and fragmentation per se showed no significant relative effects on above-ground biomass after controlling the effects of plant richness and environmental factors (Figure 4).
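      The difference between a pairwise relationship and an effect estimated after accounting for a correlated predictor can be shown with simulated data. The following is a minimal sketch, not the structural equation model used in the manuscript: the effect sizes, the correlation between habitat loss and fragmentation per se, and the sample size are hypothetical values chosen so that the two estimates diverge.

```python
import random

def centered(xs):
    """Return xs with its mean subtracted."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(42)
n = 5000  # hypothetical sample size, far larger than the 130 landscapes in the study

# Assumed data-generating process: fragmentation is correlated with habitat loss,
# habitat loss lowers richness, and fragmentation per se raises it.
loss = [rng.gauss(0, 1) for _ in range(n)]
frag = [0.7 * l + (1 - 0.7 ** 2) ** 0.5 * rng.gauss(0, 1) for l in loss]
richness = [-0.4 * l + 0.28 * f + rng.gauss(0, 0.5) for l, f in zip(loss, frag)]

L, F, Y = centered(loss), centered(frag), centered(richness)

# Pairwise slope of richness on fragmentation alone
pairwise_slope = dot(F, Y) / dot(F, F)

# Slope of richness on fragmentation with habitat loss in the same model
# (two-predictor least squares solved from the normal equations)
det = dot(L, L) * dot(F, F) - dot(L, F) ** 2
partial_slope = (dot(L, L) * dot(F, Y) - dot(L, F) * dot(L, Y)) / det

print(f"pairwise slope: {pairwise_slope:.3f}")
print(f"slope controlling for habitat loss: {partial_slope:.3f}")
```

      Under these assumptions the pairwise slope is close to zero, while the slope estimated with habitat loss in the model is clearly positive, which is the kind of pattern the response describes for Figures 3 and 6.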

      (2) One of the fragmentation indices, mean patch area metric, seems to be more appropriate as a measure of habitat loss, because it represents "a decrease in grassland patch area in the landscape".

      Response: Thanks for your insightful comments. We apologize for causing this confusion. The mean patch area metric in our study represents the mean size of grassland patches in the landscape for a given grassland amount. Previous studies have often used the mean patch metric as a measure of fragmentation, which can reflect the processes of local extinction in the landscape (Fahrig, 2003; Fletcher et al., 2018). We have revised the definition of the mean patch area metric and added its ecological implication in the revised manuscript to clarify this confusion.

      (3) It is important to show both the mean and 95% CI (or standard error) of the slope coefficients regarding to Figs. 3 and 6.

      Response: Thanks for your suggestions. We have added the 95% confidence intervals to Figure 3 and Figure 6 in the revised manuscript.

      (4) It would be great to clarify what patch-level and landscape-level studies are in lines 302-306. Note that this study assesses the effects of landscape context on patch-level variables (i.e., plot-based plant richness and plot-based grassland biomass) rather than landscape-level variables (i.e., the average or total amount of biomass in a landscape).

      Response: Thanks for your insightful comment. We agree with your point that our study investigated the effect of fragmented landscape context (habitat loss and fragmentation per se) on plot-based plant richness and plot-based above-ground biomass rather than landscape-level variables.

      Therefore, we no longer discuss the differences between the patch-level and landscape-level studies here, and instead focus on the different ecological impacts of habitat loss and fragmentation per se in the revised manuscript.

      Line 369-374:

      “Although habitat loss and fragmentation per se are generally highly associated in natural landscapes, they are distinct ecological processes that determine decisions on effective conservation strategies (Fahrig, 2017; Valente et al., 2023). Our study evaluated the effects of habitat loss and fragmentation per se on grassland plant diversity and above-ground productivity in the context of fragmented landscapes in the agro-pastoral ecotone of northern China, with our results showing the effects of these two facets to not be consistent.”

      (5) One possible way to avoid the confusion between "habitat fragmentation" and "fragmentation per se" could be to say "habitat loss and fragmentation per se" when you intend to express "habitat fragmentation".

      Response: Thanks for your constructive suggestions. To avoid this confusion, we no longer mention habitat fragmentation in the revised manuscript but instead express it as habitat loss and fragmentation per se.

      Reviewer #1 (Public Review):

      This is a well-designed study that explores the BEF relationships in fragmented landscapes. Although there are massive studies on BEF relationships, most of them were conducted at local scales, few considered the impacts of landscape variables. This study used a large dataset to specifically address this question and found that habitat loss weakened the BEF relationships. Overall, this manuscript is clearly written and has important implications for BEF studies as well as for ecosystem restoration.

      Response: We are grateful to you for the recognition and constructive comments. All the comments and suggestions are very constructive for improving this manuscript. We have carefully revised the manuscript following your suggestions. All changes are marked in red font in the revised manuscript.

      My only concern is that the authors should clearly define habitat loss and fragmentation. Habitat loss and fragmentation are often associated, but they are different terms. The authors consider habitat loss a component of habitat fragmentation, which is not reasonable. Please see my specific comments below.

      Response: We agree with your point. In the revised manuscript, we no longer consider habitat loss and fragmentation per se as two facets of habitat fragmentation. We have clearly defined habitat loss and fragmentation per se and explicitly evaluated their relative effects on plant richness, above-ground biomass, and the BEF relationship.

      Reviewer #1 (Recommendations For The Authors):

      Title: It is more proper to say habitat loss, rather than habitat fragmentation.

      Response: Thanks for your suggestion. We have revised the title to “Habitat loss weakens the positive relationship between grassland plant richness and above-ground biomass”

      Line 22, remove "Anthropogenic", this paper is not specifically discussing habitat fragmentation driven by humans.

      Response: Thanks for your suggestion. We have removed the “Anthropogenic” from this sentence.

      Line 26, revise to "we investigated the effects of habitat loss and fragmentation per se on plant richness... in grassland communities by using a structural equation model".

      Response: Thanks for your suggestion. We have revised this sentence.

      Line 25-28:

      “Based on 130 landscapes identified by a stratified random sampling in the agro-pastoral ecotone of northern China, we investigated the effects of landscape context (habitat loss and fragmentation per se) on plant richness, above-ground biomass, and the relationship between them in grassland communities using a structural equation model.”

      Line 58-60, habitat fragmentation generally involves habitat loss, but habitat loss is independent of habitat fragmentation, it is not a facet of habitat fragmentation.

      Response: Thanks for your insightful comment. We no longer consider habitat loss and fragmentation per se as two facets of habitat fragmentation. In the revised manuscript, we consider habitat loss and fragmentation as two different processes in fragmented landscapes.

      Line 65-67, this sentence is not very relevant to this paragraph and can be deleted.

      Response: Thanks for your suggestion. We have deleted this sentence from the paragraph.

      Line 87-90, these references are mainly based on microorganisms, are there any references based on plants? These references are more relevant to this study. In addition, this is a key mechanism mentioned in this study, this section needs to be strengthened with more evidence and further exploration.

      Response: Thanks for your comment and suggestion. We have added some plant-based references here to strengthen the evidence for the mechanism by which habitat specialisation determines the BEF relationship.

      Line 89-95:

      “In communities, specialists with specialised niches in resource use may contribute complementary roles to ecosystem functioning, whereas generalists with unspecialised in resource use may contribute redundant roles to ecosystem functioning due to overlapping niches (Dehling et al., 2021; Denelle et al., 2020; Gravel et al., 2011; Wilsey et al., 2023). Therefore, communities composed of specialists should have a higher niche complementarity effect in maintaining ecosystem functions and a more significant BEF relationship than communities composed of generalists.”

      Denelle, P., Violle, C., DivGrass, C., Munoz, F. 2020. Generalist plants are more competitive and more functionally similar to each other than specialist plants: insights from network analyses. Journal of Biogeography 47: 1922-1933.

      Dehling, D.M., Bender, I.M.A., Blendinger, P.G., Böhning-Gaese, K., Muñoz, M.C., Neuschulz, E.L., Quitián, M., Saavedra, F., Santillán, V., Schleuning, M., Stouffer, D.B. 2021. Specialists and generalists fulfil important and complementary functional roles in ecological processes. Functional Ecology 35: 1810-1821.

      Wilsey, B., Martin, L., Xu, X., Isbell, F., Polley, H.W. 2023. Biodiversity: Net primary productivity relationships are eliminated by invasive species dominance. Ecology Letters.

      Line 129-130, Although you can use habitat loss in the discussion or the introduction, here preferably use habitat amount or habitat area, rather than habitat loss in this case. Habitat loss represents changes in habitat area, but the remaining grasslands could be the case of natural succession or other processes, rather than loss of natural habitat.

      Response: Thanks for your insightful comment. We agree with your point. In the revised manuscript, we have explicitly stated that habitat loss was represented by the loss of grassland amount in the landscape.

      Since the remaining grassland fragments in this region were mainly caused by grassland loss due to human activities such as cropland expansion (Chen et al., 2019; Yang et al., 2020), we used the percentage of non-grassland cover in the landscape to represent habitat loss in our study.

      Line 132-135:

      “Habitat loss was represented by the loss of grassland amount in the landscape. As the remaining grassland fragments in this region were mainly caused by grassland loss due to human activities such as cropland expansion (Chen et al., 2019; Yang et al., 2020), the percentage of non-grassland cover in the landscape was used in our study to represent habitat loss.”
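      As a minimal sketch of this metric, assuming the landscape is available as a classified land-cover grid (the grid, the class codes and the function name below are hypothetical and not taken from the manuscript), the percentage of non-grassland cover can be computed as:

```python
def habitat_loss_percent(cover, grassland=1):
    """Percentage of cells in a land-cover grid that are not grassland."""
    cells = [c for row in cover for c in row]
    if not cells:
        raise ValueError("empty landscape")
    non_grass = sum(1 for c in cells if c != grassland)
    return 100.0 * non_grass / len(cells)

# Hypothetical landscape: 1 = grassland, 0 = cropland, 2 = other cover
landscape = [
    [1, 1, 0, 0],
    [1, 0, 0, 2],
    [1, 1, 1, 0],
]
loss = habitat_loss_percent(landscape)
print(loss)  # 6 of 12 cells are non-grassland, so 50.0
```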

      Lines 245-246, please also give more details of the statistical results, such as n, r value et al in the text.

      Response: Thanks for your suggestion. We have added the details of the statistical results in the revised manuscript.

      Line 283-290:

      “Habitat loss was significantly negatively correlated with overall species richness (R = -0.21, p < 0.05, Figure 3a) and grassland specialist richness (R = -0.41, p < 0.01, Figure 3a), but positively correlated with weed richness (R = 0.31, p < 0.01, Figure 3a). Fragmentation per se was not significantly correlated with overall species richness and grassland specialist richness, but was significantly positively correlated with weed richness (R = 0.26, p < 0.01, Figure 3b). Habitat loss (R = -0.39, p < 0.01, Figure 3c) and fragmentation per se (R = -0.26, p < 0.01, Figure 3d) were both significantly negatively correlated with above-ground biomass.”

      Fig. 5, is there any relationship between habitat amount and fragmentation per se in this study?

      Response: Thanks for your insightful comment. We have considered a causal relationship between habitat loss and fragmentation per se in the structural equation model. We have discussed this relationship in the revised manuscript.

      Line 290-293, how about the BEF relationships with different fragmentation levels? I may have missed something somewhere, but it was not shown here.

      Response: Thanks for your insightful comment. We have added the BEF relationships with different fragmentation per se levels here.

      Line 323-340:

      “The linear regression models showed that habitat loss had a significant positive modulating effect on the positive relationship between plant richness and above-ground biomass, and fragmentation per se had no significant modulating effect (Figure 5). The positive relationship between plant richness and above-ground biomass weakened with increasing levels of habitat loss, strengthened and then weakened with increasing levels of fragmentation per se.

      Author response image 1.

      Relationships between grassland plant richness and above-ground biomass at different levels of habitat loss and fragmentation per se from 130 landscapes in the Tabu River Basin, a typical agro-pastoral ecotone of northern China: (a) high habitat loss and low fragmentation per se, (b) high habitat loss and moderate fragmentation per se, (c) high habitat loss and high fragmentation per se, (d) moderate habitat loss and low fragmentation per se, (e) moderate habitat loss and moderate fragmentation per se, (f) moderate habitat loss and high fragmentation per se, (g) low habitat loss and low fragmentation per se, (h) low habitat loss and moderate fragmentation per se. The R2 values in each panel are from linear regression models. The n in each panel is the number of surveying sites used in the linear regression models. The blue solid and dashed trend lines represent the significant and not significant effects, respectively. The shaded area around the trend line represents the 95% confidence interval. * represent significance at the 0.05 level. ** represent significance at the 0.01 level.”
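      The stratified comparison described in this caption can be sketched with simulated data. This is an illustration under assumptions, not a reproduction of the authors' models: the data-generating process (a richness-biomass slope that declines linearly with habitat loss), the number of sites and the use of terciles to define levels are all hypothetical.

```python
import random

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

rng = random.Random(7)
sites = []
for _ in range(900):                          # hypothetical number of sites
    loss = rng.uniform(0, 100)                # % non-grassland cover
    richness = rng.uniform(5, 30)             # species per plot
    slope = 4.0 - 0.03 * loss                 # assumed: BEF slope declines with habitat loss
    biomass = 50 + slope * richness + rng.gauss(0, 10)
    sites.append((loss, richness, biomass))

# Split sites into low, moderate and high habitat-loss levels (terciles)
sites.sort(key=lambda s: s[0])
third = len(sites) // 3
levels = {
    "low": sites[:third],
    "moderate": sites[third:2 * third],
    "high": sites[2 * third:],
}

# Fit the richness-biomass regression separately within each level
slopes = {
    name: ols_slope([s[1] for s in grp], [s[2] for s in grp])
    for name, grp in levels.items()
}
for name, value in slopes.items():
    print(f"{name} habitat loss: slope = {value:.2f}")
```

      With these assumed parameters the fitted slope is steepest at low habitat loss and shallowest at high habitat loss, which is the pattern described as a weakening of the positive richness-biomass relationship.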

      Discussion

      The Discussion (Section 4.2) needs to be revised and focused on your key findings, it is habitat loss, not fragmentation per se, that weakens the BEF relationships.

      Response: Thanks for your insightful comment and suggestion. In the revised manuscript, we have rephrased the Discussion (Section 4.2) to mainly discuss the inconsistent effects of habitat loss and fragmentation per se on the BEF relationship.

      Line 414-416:

      “4.2 Habitat loss rather than fragmentation per se weakened the magnitude of the positive relationship between plant diversity and ecosystem function”

      The R2 in the results are low (e.g., Fig. 3), please also mention other variables that might influence the observed pattern in the Discussion, such as soil and topography, though I understand it is difficult to collect such data in this study.

      Response: Thanks for your insightful comment and suggestion. We agree with you and reviewer 3 that the impact of environmental factors should also be considered.

      Therefore, we have considered two environmental factors related to water and temperature (soil water content and land surface temperature) in the analysis and discussed their impacts on plant diversity and above-ground biomass in the revised manuscript.

      Lines 344-345, its relative importance was stronger in the intact landscape than that of the fragmented landscape?

      Response: We apologize for causing this confusion. We have rephrased this sentence.

      Line 422-426:

      “Our study found grassland plant diversity showed a stronger positive impact on above-ground productivity than landscape context and environmental factors. This result is consistent with findings by Duffy et al. (2017) in natural ecosystems, indicating grassland plant diversity has an important role in maintaining grassland ecosystem functions in the fragmented landscapes of the agro-pastoral ecotone of northern China.”

      Reviewer #2 (Public Review):

      Summary:

      In this manuscript, Yan et al. assess the effect of two facets of habitat fragmentation (i.e., habitat loss and habitat fragmentation per se) on biodiversity, ecosystem function, and the biodiversity-ecosystem function (BEF) relationship in grasslands of an agro-pastoral ecotone landscape in northern China. The authors use stratified random sampling to select 130 study sites located within 500m-radius landscapes varying along gradients of habitat loss and habitat fragmentation per se. In these study sites, the authors measure grassland specialist and generalist plant richness via field surveys, as well as above-ground biomass by harvesting and dry-weighting the grass communities in each 3 x 1m2 plots of the 130 study sites. The authors find that habitat loss and fragmentation per se have different effects on biodiversity, ecosystem function and the BEF relationship: whereas habitat loss was associated with a decrease in plant richness, fragmentation per se was not; and whereas fragmentation per se was associated with a decrease in above-ground biomass, habitat loss was not. Finally, habitat loss, but not fragmentation per se was linked to a decrease in the magnitude of the positive biodiversity-ecosystem functioning relationship, by reducing the percentage of grassland specialists in the community.

      Strengths:

      This study by Yan et al. is an exceptionally well-designed, well-written, clear and concise study shedding light on a longstanding, important question in landscape ecology and biodiversity-ecosystem functioning research. Via a stratified random sampling approach (cf. also "quasi-experimental design" Butsic et al. 2017), Yan et al. create an ideal set of study sites, where habitat loss and habitat fragmentation per se (usually highly correlated) are decorrelated and hence, separate effects of each of these facets on biodiversity and ecosystem function can be assessed statistically in "real-world" (and not experimental, cf. Duffy et al. 2017) communities. The authors use adequate and well-described methods to investigate their questions. The findings of this study add important empirical evidence from real-world grassland ecosystems that help to advance our theoretical understanding of landscape-moderation of biodiversity effects and provide important guidelines for conservation management.

      Weaknesses:

      I found only a few minor issues, mostly unclear descriptions in the study that could be revised for more clarity.

      Response: Thanks very much for your review of the manuscript. We are grateful to you for the recognition. All the comments and suggestions are very insightful and constructive for improving this manuscript. We have carefully studied the literature you recommend and revised the manuscript carefully following your suggestions. All changes are marked in red font in the revised manuscript.

      Reviewer #2 (Recommendations For The Authors):

      Specific comments

      (1) Some aspects of the Methods section were not entirely clear to me, could you revise them for more clarity?

      (a) Whereas you describe 4 main facets of fragmentation per se that are used to create the PC1 as a measure of overall fragmentation per se, it looks as if this PC1 is mainly driven by 3 facets only (ED, PD and AREA_MN), and patch isolation (nearest neighbour distance, ENN) having a relatively low loading on PC1 (Figure A1). I think it would be good to discuss this fact and the consequences of it, that your definition of fragmentation is focused more on edge density, patch density and mean patch area, and less on patch isolation in your Discussion section?

      Response: Thanks for your insightful comment and suggestion. We agree with your point. We have discussed this fact and its implications for understanding the effects of fragmentation per se in our study.

      Line 384-389:

      “However, it is important to stress that the observed positive effect of fragmentation per se does not imply that increasing the isolation of grassland patches would promote biodiversity, as the metric of fragmentation per se used in our study was more related to patch density, edge density and mean patch area while relatively less related to patch isolation (Appendix Table A1). The potential threats from isolation still need to be carefully considered in the conservation of biodiversity in fragmented landscapes (Haddad et al., 2015).”

      (b) Also, from your PCA in Figure A1, it seems that positive values of PC1 mean "low fragmentation", whereas high values of PC1 mean "high fragmentation", however, in Figure A2, the inverse is shown (low values of PC1 = low fragmentation, high values of PC1 = high fragmentation). Could you clarify in the Methods section, if you scaled or normalized the PC1 to match this directionality?

      Response: We apologize for the confusion. To be consistent with the direction of change in fragmentation per se, we took the inverse of PC1 as a single fragmentation per se index, which was positively correlated with patch density, edge density, and the mean nearest-neighbor distance metric, and negatively correlated with mean patch area (Appendix Figure A1 and Table A1). We have clarified this point in the Methods section.

      Line 160-163:

      “We took the inverse of the PC1 as a single fragmentation per se index, which was positively correlated with patch density, edge density, mean nearest-neighbor distance metric, and negatively with mean patch area (Appendix Figure A1 and Table A1).”
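
      The direction change described above can be illustrated with a minimal sketch. It assumes that "inverse" means reversing the sign of the PC1 scores; the scores and patch densities below are hypothetical values chosen only for illustration, not data from the study.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical PC1 scores and patch densities for five landscapes.
pc1 = [2.1, 1.0, 0.2, -0.9, -2.4]
patch_density = [0.5, 1.2, 1.9, 2.7, 3.8]

# Assumption: "inverse of PC1" is read here as sign reversal.
frag_index = [-score for score in pc1]

r_pc1 = pearson(pc1, patch_density)           # negative in this example
r_index = pearson(frag_index, patch_density)  # same magnitude, positive
```

      Reversing the sign leaves the magnitude of every correlation unchanged and only flips its direction, so higher index values correspond to higher patch density.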

      (c) On line 155 you describe that you selected at least 20 landscapes using stratified sampling from each of the eight groups of habitat amount and fragmentation combination. Could you clarify: 1) did you randomly sample within these groups with a minimum distance condition or was it a non-random selection according to other criteria? (I think you could move the "To prevent overlapping landscapes..." sentence up here to the description of the landscape selection process) 2) Why did you write "at least 20 landscapes" - were there in some cases more or less landscapes selected? 130 study landscapes divided by 8 groups only gives you 16.25, hence, at least for some groups there were less than 20 landscapes? Could you describe your final dataset in more detail, i.e. the number of landscapes per group and potential repercussions for your analysis?

      Response: Thanks for your insightful comments. In the revised manuscript, we have rephrased the method to provide more detail for the sampling landscape selection.

      (1) Line 169-172

      We randomly selected at least 20 grassland landscapes with a minimum distance condition using stratified sampling from each of the remaining eight grassland types as alternative sites for field surveys. The minimum distance between each landscape was at least 1000 m to prevent overlapping landscapes and potential spatial autocorrelation.

      (2) Line 184-191

      We selected at least 20 grassland landscapes of each type to ensure enough alternative sites for the field survey. This was necessary because some of the selected sites were not natural grasslands (for example, abandoned agricultural land), and field surveys might not be permitted at others.

      Thus, we finally established 130 sites in the field survey. The types of the 130 sites were: 19 high-moderate, 14 high-low, 19 moderate-high, 16 moderate-moderate, 18 moderate-low, 16 low-high, 17 low-moderate, 11 low-low habitat amount and fragmentation per se.
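
      The selection procedure described in this response can be sketched as follows. This is a minimal illustration, not the authors' workflow: the function name, the candidate coordinates, and the greedy acceptance rule are assumptions, and the distance is assumed to be measured between landscape centres. Only the 1000 m threshold, the target of 20 landscapes per type, and the final site counts come from the text.

```python
import math
import random

def select_landscapes(candidates, n_per_group, min_dist_m, seed=0):
    """Randomly accept up to n_per_group centres per stratum, keeping every
    accepted centre at least min_dist_m away from all earlier ones."""
    rng = random.Random(seed)
    shuffled = candidates[:]
    rng.shuffle(shuffled)            # random visiting order within the pool
    selected, counts = [], {}
    for group, x, y in shuffled:
        if counts.get(group, 0) >= n_per_group:
            continue                 # this stratum already has enough sites
        if all(math.hypot(x - sx, y - sy) >= min_dist_m
               for _, sx, sy in selected):
            selected.append((group, x, y))
            counts[group] = counts.get(group, 0) + 1
    return selected

# Hypothetical candidate cells: (stratum, easting in m, northing in m).
_rng = random.Random(42)
groups = ["high-low", "low-high"]
candidates = [(_rng.choice(groups), _rng.uniform(0, 50_000),
               _rng.uniform(0, 50_000)) for _ in range(2000)]
chosen = select_landscapes(candidates, n_per_group=20, min_dist_m=1000)

# Final site counts as reported in the response (habitat amount,
# then fragmentation per se).
site_counts = {
    "high-moderate": 19, "high-low": 14, "moderate-high": 19,
    "moderate-moderate": 16, "moderate-low": 18, "low-high": 16,
    "low-moderate": 17, "low-low": 11,
}
total_sites = sum(site_counts.values())
```

      The reported counts sum to 130 across the eight types, which addresses the reviewer's arithmetic: each type ended up with fewer than 20 surveyed sites because the 20 or more candidates per type served as a pool of alternatives.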

      (d) On line 166, you describe that you established 130 sites of 30 m by 30 m - I assume they were located (more or less) exactly in the centre of the selected 500 m - radius landscapes? Were they established so that they were fully covered with grassland? And more importantly, how did you establish the 10 m by 10 m areas and the 1 m2 plots within the 30 m by 30 m sites? Did you divide the 30 m by 30 m areas into three rectangles of 10 m by 10 m and then randomly established 1 m2 plots? Were the 1 m2 plots always fully covered with grassland/was there a minimum distance to edge criterion? Please describe with more detail how you established the 1 m2 study sites, and how many there were per landscape.

      Response: Thanks for your insightful comments. In the revised manuscript, we have provided more detailed information on how we set up the 130 sites of 30 m by 30 m and the three plots of 1 m by 1 m at each site.

      (1) As these 130 sites were selected based on the calculation of the moving window, they were located (more or less) exactly in the centre of the 500-m radius buffer.

      (2) These sites were fully covered with grassland because their size (30 m by 30 m) was the same as the size of the grassland cell (30 m by 30 m) used in the calculation of the moving window.

      (3) We randomly set up three 1 m * 1 m plots in a flat topographic area at the 10 m * 10 m centre of each site. Thus, there was a minimum distance of 10 m to the edge for each 1 m * 1 m plot.

      (4) There are three 1 m * 1 m plots per landscape.

      Line 182-191:

      “Based on the alternative sites selected above, we established 130 sites (30 m * 30 m) between late July to mid-August 2020 in the Tabu River Basin in Siziwang Banner, Inner Mongolia Autonomous Region (Figure 1). The types of the 130 sites were: 19 high-moderate, 14 high-low, 19 moderate-high, 16 moderate-moderate, 18 moderate-low, 16 low-high, 17 low-moderate, 11 low-low habitat amount and fragmentation per se. In order to exclude the impact of historical agricultural activities, the habitat type of the established sites was natural grasslands with regional vegetation characteristics. Each site was not abandoned agricultural land, and there was no sign of agricultural reclamation.

      At the 10 m * 10 m center of each site, we randomly set up three 1 m * 1 m plots in a flat topographic area to investigate grassland vascular plant diversity and above-ground productivity.”

      (e) Line 171: could you explain what you mean by reclaimed?

      Response: Thanks for your comment. “Reclaimed” refers to historical agricultural activities. We have rephrased this sentence to make it more explicit.

      Line 186-189:

      “In order to exclude the impact of historical agricultural activities, the habitat type of the established sites was natural grasslands with regional vegetation characteristics. Each site was not abandoned agricultural land, and there was no sign of agricultural reclamation.”

      (f) Line 188 ff.: Hence your measure of productivity is average-above ground biomass per 1 m2. I think it would add clarity if you highlighted this more explicitly.

      Response: Thanks for your suggestion. We have highlighted that productivity in our study was the average above-ground biomass of the three 1 m * 1 m plots at each site.

      Line 215-217:

      “For each site, we calculated the mean vascular plant richness of the three 1 m * 1 m plots, representing the vascular plant diversity, and mean above-ground biomass of the three 1 m * 1 m plots, representing the above-ground productivity.”
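
      As a small worked example of the site-level aggregation described above, the sketch below averages three plot values per site. The plot values and variable names are hypothetical, and the biomass unit is assumed only for illustration.

```python
from statistics import mean

# Hypothetical measurements from the three 1 m * 1 m plots of one site.
plot_richness = [11, 12, 13]          # vascular plant species per plot
plot_biomass = [95.0, 100.0, 105.0]   # dry above-ground biomass per plot

site_richness = mean(plot_richness)   # site-level plant diversity
site_biomass = mean(plot_biomass)     # site-level above-ground productivity
```

      Each of the 130 sites therefore contributes one richness value and one biomass value to the analyses.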

      (2) All figures are clear and well-designed!

      (a) Just as a suggestion: in Figures 3 and 6, you could maybe add the standard errors of the mean as well?

      Response: Thanks for your suggestion. In the revised manuscript, we have added the standard errors of the mean in Figures 3 and 6.

      (b) Figure 4: Could you please clarify: Which models were the optimal models on which these model-averaged standardized parameter estimates were based on? And hence, the optimal models contained all 4 predictors (otherwise, no standardized parameter estimate could be calculated)? Or do these model-averaged parameters take into account all possible models (and not only the optimal ones)?

      Response: Thanks for your suggestion. We selected the four optimal models based on the AICc value to calculate the model-averaged standardized parameter estimates. The four optimal models contained all predictors in Figure 4. We have added the four optimal models in Appendix Table A3.

      Appendix:

      Author response table 1.

      Four optimal models of landscape context, environment factors, and plant diversity affecting above-ground biomass.

      Note: AGB: above-ground biomass; HL: habitat loss; FPS: fragmentation per se; SWT: soil water content; LST: land surface temperature; GSR: grassland specialist richness; WR: weed richness; **: significance at the 0.01 level.
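
      The response does not state how the estimates were averaged across the four optimal models. A common choice is to weight each model by its Akaike weight derived from AICc, and the sketch below assumes that scheme. The AICc values and coefficients are hypothetical and are not taken from Appendix Table A3.

```python
import math

def akaike_weights(aicc_values):
    """Convert AICc values into Akaike weights that sum to 1."""
    best = min(aicc_values)
    rel = [math.exp(-0.5 * (a - best)) for a in aicc_values]
    total = sum(rel)
    return [r / total for r in rel]

def averaged_estimate(weights, estimates):
    """Weighted mean of one coefficient across the candidate models."""
    return sum(w * b for w, b in zip(weights, estimates))

# Hypothetical AICc values for four models, and the hypothetical
# standardised estimate of one predictor in each of them.
aicc = [100.0, 100.8, 101.5, 101.9]
estimates = [0.42, 0.39, 0.45, 0.40]

weights = akaike_weights(aicc)
avg = averaged_estimate(weights, estimates)
```

      Because all four models contain every predictor, each coefficient can be averaged over the same set of models without deciding how to treat models in which a predictor is absent.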

      (c) Please add in all Figures (i.e., Figures 4, 5 and 6, Figure 6 per "high, moderate and low-class") the number of study units the analyses were based on.

      Response: Thanks for your suggestion. In the revised manuscript, we have added the number of study units the analyses were based on in all Figures.

      (d) Figure 6: I think it would be more consistent to add a second plot where the BEF-relationship is shown for low, moderate and high levels of habitat fragmentation per se. Could you also add a clearer description in the Methods and/or Results section of how you assessed if habitat amount or fragmentation per se affected the BEF-relationship? I.e. based on the significance of the interaction term (habitat amount x species richness) in a linear model?

      Response: Thanks for your insightful comment and suggestion. We have added a second plot in Figure 5 to show the BEF relationship at low, moderate and high levels of fragmentation per se.

      Line 328-340:

      Author response image 2.

      Relationships between grassland plant richness and above-ground biomass at different levels of habitat loss and fragmentation per se from 130 landscapes in the Tabu River Basin, a typical agro-pastoral ecotone of northern China: (a) high habitat loss and low fragmentation per se, (b) high habitat loss and moderate fragmentation per se, (c) high habitat loss and high fragmentation per se, (d) moderate habitat loss and low fragmentation per se, (e) moderate habitat loss and moderate fragmentation per se, (f) moderate habitat loss and high fragmentation per se, (g) low habitat loss and low fragmentation per se, (h) low habitat loss and moderate fragmentation per se. The R2 values in each panel are from linear regression models. The n in each panel is the number of survey sites used in the linear regression models. The blue solid and dashed trend lines represent significant and non-significant effects, respectively. The shaded area around the trend line represents the 95% confidence interval. * represents significance at the 0.05 level. ** represents significance at the 0.01 level.

      We determined whether habitat loss and fragmentation per se moderated the BEF relationship by testing the significance of their interaction term with plant richness. We have added a clearer description in the Methods section of the revised manuscript.

      Line 245-250:

      “We then assessed the significance of interaction terms between habitat loss and fragmentation per se and plant richness in the linear regression models to evaluate whether they modulate the relationship between plant richness and above-ground biomass. Further, we used a piecewise structural equation model to investigate the specific pathways in which habitat loss and fragmentation per se modulate the relationship between plant richness and above-ground biomass.”
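
      To make the role of the interaction term concrete, the sketch below fits biomass ~ richness + habitat loss + richness × habitat loss by ordinary least squares on hypothetical, noise-free data. It only recovers the coefficients; the significance test applied in the manuscript is omitted, and all names and values are assumptions made for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_interaction(richness, habitat_loss, biomass):
    """OLS coefficients: intercept, richness, habitat loss, interaction."""
    rows = [[1.0, r, h, r * h] for r, h in zip(richness, habitat_loss)]
    k = 4
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(rows, biomass))
           for i in range(k)]
    return solve(xtx, xty)   # normal equations (X'X) beta = X'y

# Hypothetical data: the richness slope shrinks as habitat loss grows.
richness = [5, 10, 15, 20, 5, 10, 15, 20]
habitat_loss = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
biomass = [50 + 4 * r - 10 * h - 3 * r * h
           for r, h in zip(richness, habitat_loss)]

b0, b_rich, b_loss, b_inter = fit_interaction(richness, habitat_loss, biomass)
```

      A negative interaction coefficient means the richness-biomass slope becomes shallower as habitat loss increases, which is the pattern a weakened BEF relationship would produce.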

      (3) While reading your manuscript, I missed a discussion on the potential non-linear effects of habitat amount and fragmentation per se. In your study, it seems that the effects of habitat amount and fragmentation per se on biodiversity and ecosystem function are quite linear, which contrasts previous research highlighting that intermediate levels of fragmentation/heterogeneity could maximise spatial asynchrony, biodiversity and ecosystem function (e.g. Redon et al. 2014, Thompson & Gonzalez 2016, Tscharntke et al. 2012, Wilcox et al. 2017). I think it would add depth to your study if you discussed your finding of linear effects of habitat amount and fragmentation on biodiversity, ecosystem functioning and BEF. For example:

      Response: Thanks for your constructive suggestions. We have carefully studied the literature (e.g. Redon et al. 2014, Thompson & Gonzalez 2016, Tscharntke et al. 2012, Wilcox et al. 2017), which highlights that intermediate levels of fragmentation/heterogeneity could maximise spatial asynchrony, biodiversity and ecosystem function.

      In the revised manuscript, we have added the discussion about the linear positive effects of fragmentation on plant diversity and above-ground productivity and discussed possible reasons for this linear effect.

      Line 402-413:

      “In our study, a possible mechanism for the positive impacts of fragmentation per se on plant diversity and above-ground productivity (indirect positive impact via plant diversity) is that fragmentation per se increases the habitat heterogeneity in the landscape, which can promote biodiversity through spatial asynchrony and spatial insurance effects (Tscharntke et al., 2012). Previous studies indicated that heterogeneity typically has nonlinear effects on biodiversity and ecosystem function, as moderate heterogeneity can maximise spatial asynchrony (Redon et al., 2014; Wilcox et al., 2017). However, our study did not observe nonlinear patterns between fragmentation per se and plant diversity and above-ground productivity. This may be due to the low spatial heterogeneity of this area as a result of agricultural intensification (Benton et al., 2003; Chen et al., 2019). The gradient of fragmentation per se in our study may not cover the optimal heterogeneity levels for maximising plant diversity and above-ground productivity (Thompson and Gonzalez, 2016).”

      Meanwhile, we also discussed the nonlinear pattern of the BEF relationship with increasing levels of fragmentation per se to add depth to the discussion.

      Line 442-451:

      “In addition, our study found that the BEF relationship showed a nonlinear pattern with increasing levels of fragmentation per se. For a given level of habitat loss, the positive BEF relationship was strongest at moderate fragmentation per se level and became neutral at high fragmentation per se level. This can be explained by the increased spatial asynchrony at moderate fragmentation per se level, which can promote niche complementary among species in the community and thus strengthen the BEF relationship (Gonzalez et al., 2020; Thompson and Gonzalez, 2016; Tscharntke et al., 2012). The neutral BEF relationship at high fragmentation per se level may be due to edge effects enhancing environmental filtering, thereby leading to functional redundancy among species and decoupling the BEF relationship (Fetzer et al., 2015; Hu et al., 2016; Zambrano et al., 2019).”

      (a) Line 74-75: I was wondering if you also thought of spatial insurance effects or spatial asynchrony effects that can emerge with habitat fragmentation, which could lead to increased ecosystem functioning as well? (refs. above).

      Response: Thanks for your constructive suggestions. In the revised manuscript, we have explicitly considered the spatial insurance effect or spatial asynchrony as the important mechanism for fragmentation per se to increase plant diversity, ecosystem function, and the BEF relationship.

      Line 74-77:

      “In theory, habitat loss and fragmentation per se can regulate ecosystem function and the BEF relationship by altering species composition, interactions, and spatial asynchrony regardless of changes in species richness (Liu et al., 2018; Thompson and Gonzalez, 2016; Tscharntke et al., 2012).”

      Line 402-408:

      “In our study, a possible mechanism for the positive impacts of fragmentation per se on plant diversity and above-ground productivity (indirect positive impact via plant diversity) is that fragmentation per se increases the habitat heterogeneity in the landscape, which can promote biodiversity through spatial asynchrony and spatial insurance effects (Tscharntke et al., 2012). Previous studies indicated that heterogeneity typically has nonlinear effects on biodiversity and ecosystem function, as moderate heterogeneity can maximise spatial asynchrony (Redon et al., 2014; Wilcox et al., 2017).”

      Line 442-451:

      “In addition, our study found that the BEF relationship showed a nonlinear pattern with increasing levels of fragmentation per se. For a given level of habitat loss, the positive BEF relationship was strongest at moderate fragmentation per se level and became neutral at high fragmentation per se level. This can be explained by the increased spatial asynchrony at moderate fragmentation per se level, which can promote niche complementary among species in the community and thus strengthen the BEF relationship (Gonzalez et al., 2020; Thompson and Gonzalez, 2016; Tscharntke et al., 2012). The neutral BEF relationship at high fragmentation per se level may be due to edge effects enhancing environmental filtering, thereby leading to functional redundancy among species and decoupling the BEF relationship (Fetzer et al., 2015; Hu et al., 2016; Zambrano et al., 2019).”

      (b) I was wondering if this result of linear effects could also be the result of a fragmentation gradient that does not cover the whole range of potential values? Maybe it would be good to compare the gradient in habitat fragmentation in your study with a theoretical minimum and maximum, considering that there might be an optimal medium degree of fragmentation.

      Response: Thanks for your insightful comment. We agree with your point that the linear effect of fragmentation per se in our study may be due to the fact that the gradient of fragmentation per se in this region may not cover the optimal heterogeneity levels for maximising spatial asynchrony. This is mainly because the agricultural intensification in the agro-pastoral ecotone of northern China could lead to lower spatial heterogeneity in this region. We have explicitly discussed this point in the revised manuscript.

      Line 406-413:

      “Previous studies indicated that heterogeneity typically has nonlinear effects on biodiversity and ecosystem function, as moderate heterogeneity can maximise spatial asynchrony (Redon et al., 2014; Wilcox et al., 2017). However, our study did not observe nonlinear patterns between fragmentation per se and plant diversity and above-ground productivity. This may be due to the low spatial heterogeneity of this area as a result of agricultural intensification (Benton et al., 2003; Chen et al., 2019). The gradient of fragmentation per se in our study may not cover the optimal heterogeneity levels for maximising plant diversity and above-ground productivity (Thompson and Gonzalez, 2016).”

      (4) Some additional suggestions:

      (a) Line 3: Maybe add "via reducing the percentage of grassland specialists in the community"?

      Response: Thanks for your suggestion. We have revised this sentence.

      Line 19:

      “Habitat loss can weaken the positive BEF relationship via reducing the percentage of grassland specialists in the community”

      (b) Lines 46-48: Maybe add "but see: Duffy, J.E., Godwin, C.M. & Cardinale, B.J. (2017). Biodiversity effects in the wild are common and as strong as key drivers of productivity. Nature."

      Response: Thanks for your suggestion. We have added this reference here.

      Line 47-49:

      “When research expands from experiments to natural systems, however, BEF relationships remain unclear in the natural assembled communities, with significant context dependency (Hagan et al., 2021; van der Plas, 2019; but see Duffy et al., 2017).”

      (c) Lines 82-87 and lines 90-93: Hence, your study actually is in contrast to these findings, i.e., fragmented landscapes do not necessarily have a lower fraction of grassland specialists? If yes, could you highlight this more explicitly?

      Response: Thanks for your insightful comment. We have explicitly highlighted this point in the revised manuscript.

      Line 434-439:

      “Meanwhile, our study demonstrates that habitat loss, rather than fragmentation per se, can decrease the degree of habitat specialisation by leading to the replacement of specialists by generalists in the community, thus weakening the BEF relationship. This is mainly because fragmentation per se did not decrease the grassland specialist richness in this region, whereas habitat loss decreased the grassland specialist richness and led to the invasion of more weeds from the surrounding farmland into the grassland community (Yan et al., 2022; Yan et al., 2023).”

      (d) Line 360: Could you add some examples of these multiple ecosystem functions you refer to?

      Response: Thanks for your suggestion. We have added some examples of these multiple ecosystem functions here.

      Line 456-457:

      “Therefore, future studies are needed to focus on multiple ecosystem functions, such as below-ground productivity, litter decomposition, soil carbon stocks, etc.”

      Reviewer #3 (Public Review):

      Summary:

      The authors aim to determine how landscape context impacts the community BEF relationship. They found habitat loss and fragmentation per se have inconsistent effects on biodiversity and ecosystem function. Habitat loss rather than fragmentation per se can weaken the positive BEF relationship by decreasing the degree of habitat specialization of the community.

      Strengths:

      The authors provide a good background, and they have a good grasp of habitat fragmentation and BEF literature. A major strength of this study is separating the impacts of habitat loss and fragmentation per se using the convincing design selection of landscapes with different combinations of habitat amount and fragmentation per se. Another strength is considering the role of specialists and generalists in shaping the BEF relationship.

      Response: We are grateful to you for the recognition and constructive comments. All the comments and suggestions are very constructive for improving this manuscript. We have carefully revised the manuscript following your suggestions. All changes are marked in red font in the revised manuscript.

      Weaknesses:

      (1) The authors used five fragmentation metrics in their study. However, the choice of these fragmentation metrics was not well justified. The ecological significance of each fragmentation metric needs to be differentiated clearly. Also, these fragmentation metrics may be highly correlated with each other and redundant. I suggest the authors test the collinearity of these fragmentation metrics in their effects on biodiversity and ecosystem function.

      Response: Thanks for your constructive suggestion. The fragmentation metrics used in our study represent different processes by which habitat breaks apart in the landscape, and they are widely used in previous studies (Fahrig, 2003; Fahrig, 2017). In the revised manuscript, we have provided more detailed information about the ecological significance of these fragmentation indices.

      Line 142-148:

      “The patch density metric reflects the breaking apart of habitat in the landscape, which is a direct reflection of the definition of fragmentation per se (Fahrig et al., 2019). The edge density metric reflects the magnitude of the edge effect caused by fragmentation (Fahrig, 2017). The mean patch area metric and the mean nearest-neighbor distance metric are associated with the area and distance effects of island biogeography, respectively, reflecting the processes of local extinction and dispersal of species in the landscape (Fletcher et al., 2018).”

      Meanwhile, we have calculated the variance inflation factor (VIF) for each fragmentation metric to assess their collinearity. The VIFs of these fragmentation metrics were all less than four, suggesting no substantial multicollinearity among them as predictors of biodiversity and ecosystem function.

      Author response table 2.

      Variance inflation factors of habitat loss and fragmentation per se indices for influencing plant richness and above-ground biomass.
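
      For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining predictors. The sketch below uses the simplest case of two predictors, in which R² equals the squared Pearson correlation between them. The metric values are hypothetical and the two-predictor simplification is an assumption made to keep the example short.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical values of two fragmentation metrics in six landscapes.
patch_density = [1, 2, 3, 4, 5, 6]
edge_density = [2, 1, 4, 3, 6, 5]

vif = vif_two_predictors(patch_density, edge_density)
```

      With more than two predictors, R² would instead come from a multiple regression of each metric on all the others.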

      (2) I found the local environmental factors were not considered in the study. As the authors mentioned in the manuscript, temperature and water also have important impacts on biodiversity and ecosystem function in the natural ecosystem. I suggest the authors include the environmental factors in the data analysis to control their potential impact, especially in the structural equation model.

      Response: Thanks for your constructive suggestion. We agree with you that environmental factors should be considered in our study. In the revised manuscript, we have integrated two environmental factors related to water and temperature (soil water content and land surface temperature) into the data analysis to control their potential impact. The main results and conclusions of the revised manuscript are consistent with those of the previous manuscript.

      Reviewer #3 (Recommendations For The Authors):

      (1) L60-63. The necessity to distinguish between habitat loss and fragmentation per se is not clearly stated. More information about biodiversity conservation strategies can be given here.

      Response: Thanks for your suggestion. In the revised manuscript, we have provided more evidence about the importance of distinguishing between habitat loss and fragmentation per se for biodiversity conservation.

      Line 62-67:

      “Habitat loss is often considered the major near-term threat to the biodiversity of terrestrial ecosystems (Chase et al., 2020; Haddad et al., 2015), while the impact of fragmentation per se remains debated (Fletcher Jr et al., 2023; Miller-Rushing et al., 2019). Thus, habitat loss and fragmentation per se may have inconsistent ecological consequences and should be considered simultaneously to establish effective conservation strategies in fragmented landscapes (Fahrig et al., 2019; Fletcher et al., 2018; Miller-Rushing et al., 2019).”

      (2) L73-77. The two sentences are hard to follow. Please rephrase to improve the logic. And I don't understand the "however" here. There is no twist.

      Response: Thanks for your suggestion. We have rephrased the two sentences to improve their logic.

      Line 74-79:

      “In theory, habitat loss and fragmentation per se can regulate ecosystem function and the BEF relationship by altering species composition, interactions, and spatial asynchrony regardless of changes in species richness (Liu et al., 2018; Thompson and Gonzalez, 2016; Tscharntke et al., 2012). This is because species in communities are not ecologically equivalent and may respond differently to habitat loss and fragmentation per se, and contribute unequally to ecosystem function (Devictor et al., 2008; Wardle and Zackrisson, 2005).”

      (3) L97. Are grasslands really the largest terrestrial ecosystem? Isn't it the forest?

      Response: We apologize for the confusion. We have rephrased this sentence.

      Line 101-104:

      “Grasslands have received considerably less attention, despite being one of the largest terrestrial ecosystems, and suffering severe fragmentation due to human activities, such as agricultural reclamation and urbanisation (Fardila et al., 2017).”

      (4) Fig. 1: are the four sample plots presented in panel b from panel a? Please add a scale bar to panel b.

      Response: Thanks for your comment. The four sample plots presented in panel b are from panel a in Figure 1. We have also added the scale bar in panel b.

      (5) L105. This statement is too specific. Please remove and consider merging this paragraph with the next.

      Response: Thanks for your suggestion. We have removed this sentence and merged this paragraph with the next.

      (6) L157. The accuracy and kappa value of the supervised classification should be given.

      Response: Thanks for your suggestion. We have added the accuracy and kappa value of the supervised classification in the revised manuscript.

      Line 176-177:

      “The overall classification accuracy was 84.3 %, and the kappa coefficient was 0.81.”
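
      For readers unfamiliar with the two statistics, the sketch below shows how overall accuracy and Cohen's kappa are derived from a confusion matrix. The two-class matrix is hypothetical and does not reproduce the reported values of 84.3 % and 0.81.

```python
def accuracy_and_kappa(cm):
    """Overall accuracy and Cohen's kappa from a square confusion matrix
    (rows: reference classes, columns: classified classes)."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(k)) / total
    row_tot = [sum(row) for row in cm]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_tot, col_tot)) / total ** 2
    return observed, (observed - expected) / (1 - expected)

# Hypothetical validation counts: grassland vs. non-grassland.
confusion = [[45, 5],
             [10, 40]]

acc, kappa = accuracy_and_kappa(confusion)
```

      Kappa discounts the agreement expected by chance, so it is never higher than the overall accuracy for the same matrix.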

      (7) I would recommend the authors provide the list of generalists and specialists surveyed in the supplementary. Readers may not be familiar with the plant species composition in this area.

      Response: Thanks for your suggestion. We agree with your point. We have provided the list of generalists and specialists surveyed in the Appendix Table A4.

      Line 282-283:

      “A total of 130 vascular plant species were identified in our study sites, including 91 grassland specialists and 39 weeds (Appendix Table A4).”

      (8) Fig.4, it is better to add the results of variation partition to present the relative contribution of habitat fragmentation, environmental factors, and plant diversity.

      Response: Thanks for your suggestion. We have integrated the landscape context, environmental factors, and plant diversity into the multi-model averaging analysis and redrawn Figure 4 to present their relative importance for above-ground biomass.

      Line 313-319:

      Author response image 3.

      Standardised parameter estimates and 95% confidence intervals for landscape context, plant diversity, and environmental factors affecting above-ground biomass from 130 landscapes in the Tabu River Basin, a typical agro-pastoral ecotone of northern China. Standardised estimates and 95% confidence intervals are calculated by the multi-model averaging method based on the four optimal models affecting above-ground biomass (Appendix Table A3). ** represent significance at the 0.01 level.

      (9) Please redraw Fig.2 and Fig.5 to integrate the environmental factors. Add the R-square to Fig 5.

      Response: Thanks for your suggestion. We have integrated two environmental factors into the structural equation model and redrawn Figure 2 and Figure 5 in the revised manuscript. We have also added the R-square to Figure 5.

      (10) L354. The authors should be careful to claim that habitat loss could reduce the importance of plant diversity to ecosystem function. This pattern observed may depend on the type of ecosystem function studied.

      Response: Thanks for your suggestion. We have avoided this claim in the revised manuscript and explicitly discussed the importance of simultaneously focusing on multiple ecosystem functions, such as below-ground productivity, litter decomposition, soil carbon stocks, etc.

      Line 454-457:

      “This inconsistency can be explained by trade-offs between different ecosystem functions that may differ in their response to fragmentation per se (Banks-Leite et al., 2020). Therefore, future studies are needed to focus on multiple ecosystem functions, such as below-ground productivity, litter decomposition, soil carbon stocks, etc.”

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      The work is a useful contribution towards understanding the role of archaeal and plant D-aminoacyl-tRNA deacylase 2 (DTD2) in deacylation and detoxification of D-Tyr-tRNATyr modified by various aldehydes produced as metabolic byproducts in plants. It integrates convincing results from both in vitro and in vivo experiments to address the long-standing puzzle of why plants outperform bacteria in handling reactive aldehydes and suggests a new strategy for stress-tolerant crops. The impact of the paper is limited by the fact that only one modified D-aminoacyl tRNA was examined, in lack of evidence that plant eEF1A mimics EF-Tu in protecting L-aminoacyl tRNAs from modification, and in failure to measure accumulation of toxic D-aminoacyl tRNAs or impairment of translation in plant cells lacking DTD2.

      We have now addressed all the drawbacks as follows:

      ‘only one modified D-aminoacyl tRNA was examined’

      We wish to clarify that only D-Leu (Yeast), D-Asp (Bacteria, Yeast), D-Tyr (Bacteria, Cyanobacteria, Yeast) and D-Trp (Bacteria) show toxicity in vivo in the absence of known DTD (Soutourina J. et al., JBC, 2000; Soutourina O. et al., JBC, 2004; Wydau S. et al., JBC, 2009). D-Tyr-tRNATyr is used in the field as a model substrate to test DTD activity because of the conserved toxicity of D-Tyr in various organisms. DTD2 has been shown to recycle D-Asp-tRNAAsp and D-Tyr-tRNATyr with the same efficiency both in vitro and in vivo (Wydau S. et al., NAR, 2007), and it also recycles acetaldehyde-modified D-Phe-tRNAPhe and D-Tyr-tRNATyr in vitro, as shown in our earlier work (Mazeed M. et al., Science Advances, 2021). We have earlier shown that DTD1, another conserved chiral proofreader across bacteria and eukaryotes, acts via a side chain independent mechanism (Ahmad S. et al., eLife, 2013). To check the biochemical activity of DTD2 on D-Trp-tRNATrp, we have now done D-Trp, D-Tyr and D-Asp toxicity rescue experiments by expressing the archaeal DTD2 in dtd null E. coli cells. We found that DTD2 could rescue D-Trp toxicity as efficiently as D-Tyr and D-Asp toxicity (Figure: 1). Considering its action on multiple side chains with different chemistry and size, it can be proposed with reasonable confidence that DTD2 also operates in a side chain independent manner.

      Author response image 1.

      DTD2 recycles multiple D-aa-tRNAs with different side chain chemistry and size. Growth of wildtype (WT), dtd null strain (∆dtd), and Pyrococcus horikoshii DTD2 (PhoDTD2) complemented ∆dtd strains of E. coli K12 cells with 500 µM IPTG along with A) no D-amino acids, B) 2.5 mM D-tyrosine, C) 30 mM D-aspartate and D) 5 mM D-tryptophan.

      ‘lack of evidence that plant eEF1A mimics EF-Tu in protecting L-aminoacyl tRNAs from modification’

      To understand the role of plant eEF1A in protecting L-aa-tRNAs from aldehyde modification, we have done a thorough sequence and structural analysis. We analysed the aa-tRNA bound elongation factor structure from bacteria (PDB id: 1TTT) and found that the side chain of the amino acid in the amino acid binding site of EF-Tu projects outward (Figure: 2A; 3A). In addition, the amino group of the amino acid is tightly selected by the main chain atoms of the elongation factor, leaving no space for aldehydes to enter and modify the L-aa-tRNAs and Gly-tRNAs (Figure: 2B; 3B). Modelling of D-amino acids (D-phenylalanine and the smallest chiral amino acid, D-alanine) in the same site shows serious clashes with main chain atoms of EF-Tu, indicating D-chiral rejection during aa-tRNA binding by the elongation factor (Figure: 2C-E). Next, we superimposed the tRNA bound mammalian eEF-1A cryoEM structure (PDB id: 5LZS) on the bacterial structure to understand the structural differences in terms of tRNA binding and found that the elongation factor binds tRNA in a similar way (Figure: 3C-D). Modelling of D-alanine in the amino acid binding site of eEF-1A shows serious clashes with main chain atoms, indicating a general theme of D-chiral rejection during aa-tRNA binding by the elongation factor (Figure: 2F; 3E). Structure-based sequence alignment of elongation factors from bacteria, archaea and eukaryotes (both plants and mammals) shows strict conservation of the amino acid binding site (Figure: 2G). This suggests that eEF-1A will mimic EF-Tu in protecting L-aa-tRNAs from reactive aldehydes. Minor differences near the amino acid side chain binding site (as indicated in Wolfson and Knight, FEBS Letters, 2005) might induce amino acid specific binding differences (Figure: 3F). However, those changes will have no influence when a D-chiral amino acid enters the pocket, as the whole side chain would clash with the active site.
      We have now included this sequence and structural conservation analysis in our revised manuscript (in text: line no 107-129; Figure: 2 and S2). Overall, our structural analysis suggests a conserved mode of aa-tRNA selection by the elongation factor across life forms; therefore, our biochemical results with bacterial elongation factor Tu (EF-Tu) reflect the protective role of the elongation factor in general across species.

      Author response image 2.

      Elongation factor enantio-selects L-aa-tRNAs through D-chiral rejection mechanism. A) Surface representation showing the cocrystal structure of EF-Tu with L-Phe-tRNAPhe. Zoomed-in image showing the binding of L-phenylalanine with side chain projected outside of binding site of EF-Tu (PDB id: 1TTT). B) Zoomed-in image of amino acid binding site of EF-Tu bound with L-phenylalanine showing the selection of amino group of amino acid through main chain atoms (PDB id: 1TTT). C) Modelling of D-phenylalanine in the amino acid binding site of EF-Tu shows severe clashes with main chain atoms of EF-Tu. Modelling of smallest chiral amino acid, alanine, in the amino acid binding site of EF-Tu shows D) no clashes with L-alanine and E) clashes with D-alanine. F) Modelling of D-alanine in the amino acid binding site of eEF-1A shows clashes with main chain atoms. (*Represents modelled molecule). G) Structure-based sequence alignment of elongation factor from bacteria, archaea and eukaryotes (both plants and animals) showing conserved amino acid binding site residues. (Key residues are marked with red star).

      Author response image 3.

      Elongation factor protects L-aa-tRNAs from aldehyde modification. A) Cartoon representation showing the cocrystal structure of EF-Tu with L-Phe-tRNAPhe (PDB id: 1TTT). B) Zoomed-in image of amino acid binding site of EF-Tu bound with L-phenylalanine (PDB id: 1TTT). C) Cartoon representation showing the cryoEM structure of eEF-1A with tRNAPhe (PDB id: 5LZS). D) Image showing the overlap of EF-Tu:L-Phe-tRNAPhe crystal structure and eEF-1A:tRNAPhe cryoEM structure (r.m.s.d. of 1.44 Å over 292 Cα atoms). E) Zoomed-in image of amino acid binding site of eEF-1A with modelled L-alanine (PDB id: 5LZS). (*Modelled) F) Overlap showing the amino acid binding site residues of EF-Tu and eEF-1A. (EF-Tu residues are marked in black and eEF-1A residues are marked in red).

      ‘failure to measure accumulation of toxic D-aminoacyl tRNAs or impairment of translation in plant cells lacking DTD2’

      We agree that measuring the accumulation of D-aa-tRNA adducts in plant cells lacking DTD2 is important. We tried extensively to characterise these adducts in dtd2 mutant plants through Northern blotting as well as mass spectrometry. However, owing to the lack of information about the affected tissue (root or shoot), the identity of the aa-tRNA and its location (cytosolic or organellar), we have so far been unsuccessful in identifying them from plants. Efforts are still underway to identify them in plant systems lacking DTD2. In the meantime, we have used a bacterial surrogate system, E. coli, as used earlier in Mazeed M. et al., Science Advances, 2021, to show the accumulation of D-aa-tRNA adducts in the absence of dtd. We could identify the accumulation of both formaldehyde and MG modified D-aa-tRNA adducts via mass spectrometry (Figure: 4). These results are now included in the revised manuscript (in line no: 190-197 and Figure: S5).

      Author response image 4.

      Loss of DTD results in accumulation of modified D-aminoacyl adducts on tRNAs in E. coli. Mass spectrometry analysis showing the accumulation of aldehyde modified D-Tyr-tRNATyr in A) Δdtd E. coli, B) formaldehyde and D-tyrosine treated Δdtd E. coli, and C) MG and D-tyrosine treated Δdtd E. coli. ESI-MS based tandem fragmentation analysis for unmodified and aldehyde modified D-Tyr-tRNATyr in D) Δdtd E. coli, E) and F) formaldehyde and D-tyrosine treated Δdtd E. coli, G) and H) MG and D-tyrosine treated Δdtd E. coli.

      Response to Public Reviews:

      We are grateful for the reviewers’ positive feedback and their comments and suggestions on this manuscript. Reviewer 1 has indicated two weaknesses and Reviewer 2 has none. We have now addressed all the concerns of the Reviewers.

      Reviewer #1 (Public Review):

      Summary:

      This work is an extension of the authors' earlier work published in Sci Adv in 2021, wherein the authors showed that DTD2 deacylates N-ethyl-D-aminoacyl-tRNAs arising from acetaldehyde toxicity. In this study, the authors investigate the role of archaeal/plant DTD2 in the deacylation/detoxification of D-Tyr-tRNATyr modified by multiple other aldehydes and methylglyoxal (produced by plants). Importantly, the authors take their biochemical observations to plants, to show that deletion of the DTD2 gene from a model plant (Arabidopsis thaliana) makes it sensitive to aldehyde supplementation in the media, especially in the presence of D-Tyr. These conclusions are further supported by the observation that the model plant shows increased tolerance to aldehyde stress when DTD2 is overproduced from the CaMV 35S promoter. The authors propose a model for the role of DTD2 in the evolution of land plants. Finally, the authors suggest that transgenic crops carrying DTD2 may offer a strategy for stress-tolerant crop development. Overall, the authors present a convincing story, and the data are supportive of the central theme of the story.

      We are happy that the reviewer found our work convincing and thank the reviewer for finding our data supportive of the central theme of the manuscript.

      Strengths:

      Data are novel and they provide a new perspective on the role of DTD2, and propose possible use of the DTD2 lines in crop improvement.

      We are grateful for this positive comment on the manuscript.

      Weaknesses:

      (a) Data obtained from a single aminoacyl-tRNA (D-Tyr-tRNATyr) have been generalized to imply that what is relevant to this model substrate is true for all other D-aa-tRNAs (term modified aa-tRNAs has been used synonymously with the modified Tyr-tRNATyr). This is not a risk-free extrapolation. For example, the authors see that DTD2 removes modified D-Tyr from tRNATyr in a chain-length dependent manner of the modifier. Why do the authors believe that the length of the amino acid side chain will not matter in the activity of DTD2?

      We thank the reviewer for bringing up this important point. As mentioned above, we wish to clarify that only half of the aminoacyl-tRNA synthetases are known to charge D-amino acids and only D-Leu (Yeast), D-Asp (Bacteria, Yeast), D-Tyr (Bacteria, Cyanobacteria, Yeast) and D-Trp (Bacteria) show toxicity in vivo in the absence of known DTD (Soutourina J. et al., JBC, 2000; Soutourina O. et al., JBC, 2004; Wydau S. et al., JBC, 2009). D-Tyr-tRNATyr is used in the field as a model substrate to test DTD activity because of the conserved toxicity of D-Tyr in various organisms. DTD2 has been shown to recycle D-Asp-tRNAAsp and D-Tyr-tRNATyr with the same efficiency both in vitro and in vivo (Wydau S. et al., NAR, 2007). Moreover, we have previously shown that it recycles acetaldehyde-modified D-Phe-tRNAPhe and D-Tyr-tRNATyr in vitro (Mazeed M. et al., Science Advances, 2021). We have earlier shown that DTD1, another conserved chiral proofreader across bacteria and eukaryotes, acts via a side chain independent mechanism (Ahmad S. et al., eLife, 2013). To check the biochemical activity of DTD2 on D-Trp-tRNATrp, we have now done D-Trp, D-Tyr and D-Asp toxicity rescue experiments by expressing the archaeal DTD2 in dtd null E. coli cells. We found that DTD2 could rescue D-Trp toxicity as efficiently as D-Tyr and D-Asp toxicity (Figure 1). Considering its action on multiple side chains with different chemistry and size, it can be proposed with reasonable confidence that DTD2 also operates in a side chain independent manner.

      (b) While the use of EFTu supports that the ternary complex formation by the elongation factor can resist modifications of L-Tyr-tRNATyr by the aldehydes or other agents, in the context of the present work on the role of DTD2 in plants, one would want to see the data using eEF1alpha. This is particularly relevant because there are likely to be differences in the way EFTu and eEF1alpha may protect aminoacyl-tRNAs (for example see description in the latter half of the article by Wolfson and Knight 2005, FEBS Letters 579, 3467-3472).

      We thank the reviewer for bringing up this important point. As mentioned above, to understand the role of plant eEF1A in protecting L-aa-tRNAs from aldehyde modification, we have done a thorough sequence and structural analysis. We analysed the aa-tRNA bound elongation factor structure from bacteria (PDB id: 1TTT) and found that the side chain of the amino acid in the amino acid binding site of EF-Tu projects outward (Figure: 2A; 3A). In addition, the amino group of the amino acid is tightly selected by the main chain atoms of the elongation factor, leaving no space for aldehydes to enter and modify the L-aa-tRNAs and Gly-tRNAs (Figure: 2B; 3B). Modelling of D-amino acids (D-phenylalanine and the smallest chiral amino acid, D-alanine) in the same site shows serious clashes with main chain atoms of EF-Tu, indicating D-chiral rejection during aa-tRNA binding by the elongation factor (Figure: 2C-E). Next, we superimposed the tRNA bound mammalian eEF-1A cryoEM structure (PDB id: 5LZS) on the bacterial structure to understand the structural differences in terms of tRNA binding and found that the elongation factor binds tRNA in a similar way (Figure: 3C-D). Modelling of D-alanine in the amino acid binding site of eEF-1A shows serious clashes with main chain atoms, indicating a general theme of D-chiral rejection during aa-tRNA binding by the elongation factor (Figure: 2F; 3E). Structure-based sequence alignment of elongation factors from bacteria, archaea and eukaryotes (both plants and mammals) shows strict conservation of the amino acid binding site (Figure: 2G). Minor differences near the amino acid side chain binding site (as indicated in Wolfson and Knight, FEBS Letters, 2005) might induce amino acid specific binding differences (Figure: 3F). However, those changes will have no influence when a D-chiral amino acid enters the pocket, as the whole side chain would clash with the active site.
      We have now included this sequence and structural conservation analysis in our revised manuscript (in text: line no 107-129; Figure: 2 and S2). Overall, our structural analysis suggests a conserved mode of aa-tRNA selection by the elongation factor across life forms; therefore, our biochemical results with bacterial elongation factor Tu (EF-Tu) reflect the protective role of the elongation factor in general across species.

      Reviewer #2 (Public Review):

      In bacteria and mammals, metabolically generated aldehydes become toxic at high concentrations because they irreversibly modify the free amino group of various essential biological macromolecules. However, these aldehydes can be present in extremely high amounts in archaea and plants without causing major toxic side effects. This fact suggests that archaea and plants have evolved specialized mechanisms to prevent the harmful effects of aldehyde accumulation.

      In this study, the authors show that the plant enzyme DTD2, originating from archaea, functions as a D-aminoacyl-tRNA deacylase. This enzyme effectively removes stable D-aminoacyl adducts from tRNAs, enabling these molecules to be recycled for translation. Furthermore, they demonstrate that DTD2 serves as a broad detoxifier for various aldehydes in vivo, extending its function beyond acetaldehyde, as previously believed. Notably, the absence of DTD2 makes plants more susceptible to reactive aldehydes, while its overexpression offers protection against them. These findings underscore the physiological significance of this enzyme.

      We thank the reviewer for the positive comments on the manuscript.

      Response to recommendation to authors:

      Reviewer #1 (Recommendations For The Authors):

      I enjoyed reading the manuscript entitled, "Archaeal origin translation proofreader imparts multi aldehyde stress tolerance to land plants" from the Sankaranarayanan lab. This work is an extension of their earlier work published in Sci Adv in 2021, wherein they showed that DTD2 deacylates N-ethyl-D-aminoacyl-tRNAs arising from acetaldehyde toxicity. Now, the authors of this study (Kumar et al.) investigate the role of archaeal/plant DTD2 in the deacylation/detoxification of D-Tyr-tRNATyr modified by multiple other aldehydes and methylglyoxal (which are produced during metabolic reactions in plants). Importantly, the authors take their biochemical observations to plants, to show that deletion of the DTD2 gene from a model plant (Arabidopsis thaliana) makes it sensitive to aldehyde supplementation in the media, especially in the presence of D-Tyr. These conclusions are further supported by the observation that the model plant shows increased tolerance to aldehyde stress when DTD2 is overproduced from the CaMV 35S promoter. The authors propose a model for the role of DTD2 in the evolution of land plants. Finally, the authors suggest that transgenic crops carrying DTD2 may offer a strategy for stress-tolerant crop development. Overall, the authors present a convincing story, and the data are supportive of the central theme of the story.

      We are happy that the reviewer enjoyed our manuscript and found our work convincing. We would also like to thank the reviewer for finding our data supportive of the central theme of the manuscript.

      I have the following observations that require the authors' attention.

      1) The title of the manuscript will be more appropriate if revised to, "Archaeal origin translation proofreader, DTD2, imparts multialdehyde stress tolerance to land plants".

      Both reviewers suggested changing the title. We have now changed the title based on Reviewer 2’s suggestion.

      2) Abstract (line 19): change, "physiologically abundantly produced" to "physiologically produced".

      As per the reviewer’s suggestion, we have now changed it to "physiologically produced".

      3) Introduction (line 50): delete, 'extremely'.

      We have removed the word 'extremely' from the Introduction.

      4) Line 79: change, "can be utilized" to "may be explored".

      We have changed "can be utilized" to "may be explored" as suggested by the reviewers.

      5) Results in general:

      (a) Data obtained from a single aminoacyl-tRNA (D-Tyr-tRNATyr) have been generalized to imply that what is relevant to this model substrate is true for all other D-aa-tRNAs (term modified aa-tRNAs has been used synonymously with the modified D-Tyr-tRNATyr). This is a risky extrapolation. For example, the authors see that DTD2 removes modified D-Tyr from tRNATyr in a chain-length dependent manner of the modifier. Why do the authors believe that the length of the amino acid side chain will not matter in the activity of DTD2?

      We thank the reviewer for bringing up this important point. As mentioned above, we wish to clarify that only half of the aminoacyl-tRNA synthetases are known to charge D-amino acids and only D-Leu (Yeast), D-Asp (Bacteria, Yeast), D-Tyr (Bacteria, Cyanobacteria, Yeast) and D-Trp (Bacteria) show toxicity in vivo in the absence of known DTD (Soutourina J. et al., JBC, 2000; Soutourina O. et al., JBC, 2004; Wydau S. et al., JBC, 2009). D-Tyr-tRNATyr is used in the field as a model substrate to test DTD activity because of the conserved toxicity of D-Tyr in various organisms. DTD2 has been shown to recycle D-Asp-tRNAAsp and D-Tyr-tRNATyr with the same efficiency both in vitro and in vivo (Wydau S. et al., NAR, 2007). Moreover, we have previously shown that it recycles acetaldehyde-modified D-Phe-tRNAPhe and D-Tyr-tRNATyr in vitro (Mazeed M. et al., Science Advances, 2021). We have earlier shown that DTD1, another conserved chiral proofreader across bacteria and eukaryotes, acts via a side chain independent mechanism (Ahmad S. et al., eLife, 2013). To check the biochemical activity of DTD2 on D-Trp-tRNATrp, we have now done D-Trp, D-Tyr and D-Asp toxicity rescue experiments by expressing the archaeal DTD2 in dtd null E. coli cells. We found that DTD2 could rescue D-Trp toxicity as efficiently as D-Tyr and D-Asp toxicity (Figure 1). Considering its action on multiple side chains with different chemistry and size, it can be proposed with reasonable confidence that DTD2 also operates in a side chain independent manner.

      (b) Interestingly, the authors do suggest (in the Materials and Methods section) that the experiments were performed with Phe-tRNAPhe as well as Ala-tRNAAla. If what is stated in Materials and Methods is correct, these data should be included to generalize the observations.

      We regret the confusing statement. We wish to clarify that L- and D-Tyr-tRNATyr were used for the TLC-based aldehyde modification assays, the EF-Tu based protection assays and the deacylation assays; D-Phe-tRNAPhe was used to characterise aldehyde-based modification by mass spectrometry; and L-Ala-tRNAAla was used to check the modification propensity of multiple aldehydes. We used multiple aa-tRNAs to emphasize that aldehyde-based modifications are not specific to the identity of the aa-tRNA. All the data obtained with the respective aa-tRNAs are included in the manuscript.

      (c) While the use of EFTu supports that the ternary complex formation by the elongation factor can resist modifications of L-Tyr-tRNATyr by the aldehydes or other agents, in the context of the present work on the role of DTD2 in plants, one would want to see the data using eEF1alpha. This is particularly relevant because there are likely to be differences in the way EFTu and eEF1alpha may protect aminoacyl-tRNAs (for example see description in the latter half of the article by Wolfson and Knight 2005, FEBS Letters 579, 3467-3472).

      We thank the reviewer for bringing up this important point. As mentioned above, to understand the role of plant eEF1A in protecting L-aa-tRNAs from aldehyde modification, we have done a thorough sequence and structural analysis. We analysed the aa-tRNA bound elongation factor structure from bacteria (PDB id: 1TTT) and found that the side chain of the amino acid in the amino acid binding site of EF-Tu projects outward (Figure: 2A; 3A). In addition, the amino group of the amino acid is tightly selected by the main chain atoms of the elongation factor, leaving no space for aldehydes to enter and modify the L-aa-tRNAs and Gly-tRNAs (Figure: 2B; 3B). Modelling of D-amino acids (D-phenylalanine and the smallest chiral amino acid, D-alanine) in the same site shows serious clashes with main chain atoms of EF-Tu, indicating D-chiral rejection during aa-tRNA binding by the elongation factor (Figure: 2C-E). Next, we superimposed the tRNA bound mammalian eEF-1A cryoEM structure (PDB id: 5LZS) on the bacterial structure to understand the structural differences in terms of tRNA binding and found that the elongation factor binds tRNA in a similar way (Figure: 3C-D). Modelling of D-alanine in the amino acid binding site of eEF-1A shows serious clashes with main chain atoms, indicating a general theme of D-chiral rejection during aa-tRNA binding by the elongation factor (Figure: 2F; 3E). Structure-based sequence alignment of elongation factors from bacteria, archaea and eukaryotes (both plants and mammals) shows strict conservation of the amino acid binding site (Figure: 2G). Minor differences near the amino acid side chain binding site (as indicated in Wolfson and Knight, FEBS Letters, 2005) might induce amino acid specific binding differences (Figure: 3F). However, those changes will have no influence when a D-chiral amino acid enters the pocket, as the whole side chain would clash with the active site.
      We have now included this sequence and structural conservation analysis in our revised manuscript (in text: line no 107-129; Figure: 2 and S2). Overall, our structural analysis suggests a conserved mode of aa-tRNA selection by the elongation factor across life forms; therefore, our biochemical results with bacterial elongation factor Tu (EF-Tu) reflect the protective role of the elongation factor in general across species.

      6) Results (line 89): Figure: 1C-G (not B-G).

      As correctly pointed out by the reviewer(s), we have changed it to Figure: 1C-G.

      7) Results (line 91): Figure: S1B-G (not C-G).

      We wish to clarify that this is correct.

      8) Line 97: change, "propionaldehyde" to "propionaldehyde (Figure: 1H)".

      As per the reviewer’s suggestion, we have now changed, "propionaldehyde" to "propionaldehyde (Figure: 1H)".

      9) Line 124: The statement, "DTD2 cleaved all modified D-aa-tRNAs at 50 pM to 500 nM range (Figure: 2A_D)" is not consistent with the data presented. For example, Figure 2D does not show any significant cleavage. Figure S2A-B also does not show cleavage.

      We thank the reviewers for pointing this out. We have changed the sentence to “DTD2 cleaved majority of aldehyde modified D-aa-tRNAs at 50 pM to 500 nM range".

      10) Line 131: Cleavage observed in Fig. S2E is inconsistent with the generalized statement on DTD1.

      We wish to clarify that the minimal activity seen in Fig. S2E is inconsistent with the general trend of DTD1’s biochemical activity seen on modified D-aa-tRNAs. In addition, we have earlier shown that D-aa-tRNA fits snugly in the active site of DTD1 (Ahmad S. et al., eLife, 2013) whereas the modified D-aa-tRNA cannot bind due to the space constraints in the active site of DTD1 (Mazeed M. et al., Science Advances, 2021). Therefore, this minimal activity could be the result of a technical error during this biochemical experiment and can be considered as no activity.

      11) Lines 129-133: Citations of many figure panels particularly in the supplementary figures are inconsistent with generalized statements. This section requires a major rewrite or rearrangement of the figure panels (in case the statements are correct).

      We thank the reviewers for bringing forth this point and we have accordingly modified the statement into “DTD2 from archaea recycled short chain aldehyde-modified D-aa-tRNA adducts as expected (Figure: 3E-G) and, like DTD2 from plants, it did not act on aldehyde-modified D-aa-tRNAs longer than three chains (Figure: 3H; S3C-D; S4G-L)”.

      12) Line 142: I don't believe one can call PTH a proofreader. Its job is to recycle tRNAs from peptidyl-tRNAs.

      We thank the reviewer for raising this very important point. This is now corrected.

      13) Line 145: change, "DTD2 can exert its protection for" to "DTD2 may exert protection from".

      As per the reviewer’s suggestion, we have now changed "DTD2 can exert its protection for" to "DTD2 may exert protection from".

      14) Line 148: change, "a homozygous line (Figure: 3A) and checked for" to "homozygous lines (Figure: 3A) and checked them for".

      As per the reviewer’s suggestion, we have now changed, "a homozygous line (Figure: 3A) and checked for" to "homozygous lines (Figure: 3A) and checked them for".

      15) Line 148: Change the sentence beginning with dtd2 as follows: Similar to earlier results (refs. 30-32), dtd2-/- (dtd2 hereafter) plants were susceptible to ethanol (Figure: S4A), confirming the non-functionality of the DTD2 gene in dtd2 plants.

      As per the reviewer’s suggestion, we have now changed the sentence accordingly.

      16) Line 161: change, "linked" to "associated".

      As per the reviewer’s suggestion, we have now changed "linked" to "associated".

      17) Lines 173-176: It would be interesting to know how well the DTD2 OE lines do in comparison to the other known transgenic lines developed with, for example, ADH, ALDH, or AOX lines. Any ideas would help appreciate the observation with DTD2 OE lines!

      We greatly appreciate the reviewer’s suggestion. We have not yet performed comparison experiments with other transgenic lines. However, such comparisons could be made in future studies with DTD2 OE lines.

      18) Line 194: change, "necessary" with "present".

      As per the reviewer’s suggestion, we have now replaced "necessary" with "present".

      19) Line 210: what is meant by 'huge'? Would 'significant' sound better?

      As per the reviewer’s suggestion, we have now replaced "huge" with "significant".

      20) Lines 239-243: This needs to be rephrased. Isn't alpha carbonyl of the carboxyl group that makes ester bond with the -CCA end of the tRNA required for DTD2 activity as well? Are you referring to the carbonyl group in the moiety that modifies the alpha-amino group? Please clarify. The cited reference (no. 64) of Atherly does not talk about it.

      We regret the confusing statement. To clarify, we were referring to the carbonyl carbon of the modification following the amino group of the amino acid in aa-tRNAs (Figure: 5). We have now included a figure (Figure: S4Q of the revised manuscript) comparing the position of the carbonyl group for better clarity. The cited reference (Atherly A. G., Nature, 1978) shows the activity of PTH on peptidyl-tRNAs, and peptidyl-tRNAs possess a carbonyl carbon at the alpha position following the amino group of the amino acid in L-aa-tRNAs.

      Author response image 5.

      Figure showing the difference in the position of carbonyl carbon in acetonyl and acetyl modification on aa-tRNAs.

      21) Line 261: thrive (not thrives).

      As per the reviewer’s suggestion, we have now changed it to thrive.

      22) In Fig3A: second last lane, it should be dtd-/-:: AtDTDH150A (not dtd-/-:: AtDTDH150A).

      We thank the reviewer for pointing this out; we have corrected it.

      23). Materials and methods: Please clarify which experiments used tRNAPhe, tRNAAla, PheRS, etc. Also, please carefully check all other details provided in this section.

      As per the reviewer’s suggestion, we would like to provide a table below explaining the use of different substrates as well as enzymes in our experiments.

      Author response table 1.

      24) Figure legends (many places): p values higher than 0.05 (not less than) are denoted as ns.

      We thank the reviewer for pointing this out. We have corrected it.

      Reviewer #2 (Recommendations For The Authors):

      I have only minor comments for the authors:

      Title: I would replace "Archeal origin translation proofreader" with " A translation proofreader of archeal origin"

      As per the reviewer’s suggestion, we have now changed the title.

      Abstract: This section could benefit from some rewriting. For instance, at the outset, the initial logical connection between the first and second sentences of the abstract is somewhat unclear. At the very least, I would suggest swapping their order to enhance the narrative flow. Later in the text, the term "chiral proofreading systems" is introduced; however, it is only in a subsequent sentence that these systems are explained to be responsible for removing stable D-aminoacyl adducts from tRNA. Providing an immediate explanation of these systems would enhance the reader's comprehension. The authors switch from the past participle tense to the present tense towards the end of the text. I would recommend that they choose one tense for consistency. In the final sentence, I would suggest toning down the statement and replacing "can be used" with "could be explored." (https://www.nature.com/articles/d41586-023-02895-w). The same comment applies to the introduction, line 79.

      As per the reviewer’s suggestion, we have now changed the abstract appropriately.

      General note: Conventionally, the use of italics is reserved for the specific species "Arabidopsis thaliana," while the broader genus "Arabidopsis" is not italicized.

      We thank the reviewer for this pertinent suggestion. This is now corrected in the revised version of our manuscript.

      General note: I would advise the authors against employing bold characters in conjunction with colors in the figures.

      We thank the reviewer for this suggestion. We have now changed it appropriately in the revised version of our manuscript.

      Figure 1A: I recommend including the concentrations of the various aldehydes used in the experiment within the figure legend. While this information is available in the materials and methods section, it would be beneficial to have it readily accessible when analyzing the figure.

      As per the reviewer’s suggestion, we have now included the concentrations in the figure legend.

      Figure 1I, J: some error bars are invisible.

      We thank the reviewer for pointing this out; we have corrected it.

      Figure 2M: The table could be simplified by removing aldehydes for which it was not feasible to demonstrate activity. The letter "M" within the cell labeled "aldehydes" appears to be a typographical error, presumably indicating the figure panel.

      As per the reviewer’s suggestion, we have now changed this appropriately.

      Figure 3: For consistency with the other panels in the figure, I recommend including an additional panel to display the graph depicting the impact of MG on germination.

      As per the reviewer’s suggestion, we have now changed this appropriately.

      Figure 4: Considering that only one plant is presented, it would be beneficial to visualize the data distribution for the other plants used in this experiment, similar to what the authors have done in panel A of the same figure.

      We thank the reviewer for bringing up this point. We wish to clarify that we performed the experiment with multiple plants. However, for the sake of clarity, we have shown representative images only. Moreover, we have included the quantitative data for multiple plants in Figure 3C-G.

      Figure 5E: The authors may consider presenting a chronological order of events as they believe they occurred during evolution.

      We thank the reviewer for the suggestion. However, it is very difficult to pinpoint the chronology of the events. Aldehydes are lethal to living systems because of their hyperreactivity, and such systems would require immediate solutions to survive. Therefore, we think that both the problem (toxic aldehyde production) and its solution (expansion of the aldehyde-metabolising repertoire and recruitment of archaeal DTD2) might have appeared simultaneously.

      Figure 6: The model appears somewhat crowded, which may affect its clarity and ease of interpretation. The authors might also consider dividing the legend sentence into two separate sentences for better readability.

      As per the reviewer’s suggestion, we have now changed this appropriately.

      Line 149: I recommend explicitly stating that ethanol metabolism produces acetaldehyde. This clarification will help the general reader immediately understand why DTD2 mutant plants are sensitive to ethanol.

      As per the reviewer’s suggestion, we have now changed this appropriately.

      Line 289: there is a typographical error, "promotor" instead of the correct term "promoter.".

      We thank the referee for pointing this out; we have now corrected it.

      Figure S5: The root morphology of DTD2 OE plants appears to exhibit some differences compared to the WT, even in the absence of a high concentration of aldehydes. It would be valuable if the authors could comment on these observed differences unless they have already done so, and I may have overlooked it.

      We thank the referee for pointing this out. We do see minor differences in root morphology, but they are more pronounced with aldehyde treatments. The reason for this phenotype remains elusive, and we are investigating the role of DTD2 in root development in detail in further studies.

      Some Curiosity Questions (not mandatory for manuscript acceptance):

      1) Do DTD2 OE plants display an earlier flowering phenotype than wild-type Col-0?

      We have not done detailed phenotyping of DTD2 OE plants. However, our preliminary observations suggest no differences in flowering pattern as compared to wild-type Col-0.

      2) What is the current understanding of the endogenous regulation of DTD2?

      We have not done detailed analysis to understand the endogenous regulation of DTD2.

      3) Could the protective phenotype of DTD2 OE plants in the presence of aldehydes be attributed to additional functions of this enzyme beyond the removal of stable D-aminoacyl adducts from tRNAs?

      Based on the available evidence regarding the biochemical activity and in vivo phenotypes of DTD2, it appears that removal of stable D-aminoacyl adducts from tRNA is key for the protective phenotype of DTD2 OE.

      A Suggestion for Future Research (not required for manuscript acceptance):

      The authors could explore the possibility of overexpressing DTD2 in pyruvate decarboxylase transgenic plants and assess whether this strategy enhances flood tolerance without incurring a growth penalty under normal growth conditions.

      We thank the referee for this interesting suggestion for future research. We will surely keep this in mind while exploring the flood tolerance potential of DTD2 OE plants.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, Yue et al. re-processed publicly available DNA methylation data (published in 2012 and 2017 from the Meissner lab) from pre- and post-implantation mouse embryos. Against the global wave of genome-wide reduction of DNA methylation occurring during pre-implantation development, they detected a slight increase (~1% on average) of DNA methylation at gene promoter regions during the transition from 8-cell to blastocyst stage. They claim that many such promoters are located in the X chromosome. Subsequently, they knocked down Dnmt3b (presumably because of its upregulation during the transition from the 8-cell to blastocyst stage) and detected the aberrant patterning of H3K27me3 in the mutant female embryos. Based on this observation, they claim that imprinted X-chromosome inactivation is impaired in the Dnmt3b-Kd pre-implantation embryos. Finally, they propose a model where such an increase of DNA methylation together with H3K27me3 regulates imprinted X-chromosome inactivation in the pre-implantation embryos. While their observation is of potential interest, the current version of the work fails to provide enough evidence to support their conclusions. Below are suggestions and comments on the manuscript.

      Major issues:

      (1) Sex of the embryos of the genome-wide bisulfite-sequencing data

      The authors re-analyzed publicly available genome-wide DNA methylation data from the Meissner lab published in 2012 and 2017. The former used reduced representation bisulfite sequencing (RRBS) and the latter used whole-genome bisulfite sequencing (WGBS). Based mainly on the RRBS data, Yue et al. detected de novo DNA methylated promoters during the transition from 8-cell to blastocyst against the global wave of genome-wide DNA demethylation. They claim that such promoter regions are enriched at the "inactive" X chromosome. However, it would be difficult to discuss DNA methylation at inactive X-chromosomes as the RRBS data were derived from a mixture of male and female embryos. It would also be notable that the increase of DNA methylation at these promoter regions is ~1% on average. Such a slight increase in DNA methylation during pre-implantation development could also be due to the developmental variations between the embryos or between the sexes of embryos.

      Thanks so much for your insightful comments. Whether de novo DNA methylation occurs in a sex-dimorphic manner is of significance for our study. Based on your comments, we have added a reanalysis of publicly available single-cell multi-omics sequencing (COOL-seq) data from early mouse embryos (Guo et al., 2017). The results showed that both male and female embryonic cells gain DNA methylation during the transition from the 8-cell stage to ICM (Figure 1—figure supplement 1C-D; Lines 112-115 in the revised manuscript).

      With regard to the increase in the promoter region, many previous studies have revealed that promoters and overlapping CGI regions, especially high-CpG promoters, consistently show low levels of DNA methylation (Auclair et al., 2014; Borgel et al., 2010; Dahlet et al., 2020). These relatively low basal levels make the increase appear slight. We have therefore added relevant statements to clarify this point and rewritten the sentences in the revised manuscript (Lines 116-118, 125-127 in the revised manuscript).

      In addition, using the single-cell COOL-seq data, we specifically reanalyzed the DNA methylation changes on the X chromosome in female embryos. The X chromosome showed a more notable increase than the autosomes, and the female X chromosome showed a higher DNA methylation level than the male X chromosome (Figure 3—figure supplement 2A-B; Lines 203-206 in the revised manuscript).

      Thanks again for your insightful and constructive comments that significantly strengthen our evidence. We have added these results in the revised manuscript.

      (2) Imprinted X-chromosome inactivation and evaluation of H3K27me3 (related to Figures 2C, D; 3F; Figure2-supplement 2 F, G; Figure3-supplement 3G)

      Based on the slight change in the H3K27me3 signals in the Dnmt3b-Kd blastocysts, the authors claim that imprinted X-chromosome inactivation is impaired in the mutant embryo. It would be not easy to reach this conclusion from such a rough analysis of H3K27me3 presented in Figure 2C, D. Rigorous quantification/evaluation of the H3K27me3 signals in the Dnmt3b-Kd embryos should be considered. Additional evidence for the impairment of H3K27me3 in the mutant embryos should also be provided (expression of a subset of X-linked genes by RNA-FISH or RT-PCR etc.). Though technically challenging, high-resolution genome-wide approach such as ChIP-seq of H3K27me3 in the Dnmt3b-kd female embryos (with traceable SNPs between maternal and paternal X chromosome to distinguish inactive and active X-chromosome) could more precisely evaluate regions that lose H3K27me3 in the X-chromosome (de novo DNA methylated promoters from 8-cell to blastocyst, for example).

      Thanks so much for your insightful comments that make our results more convincing. The H3K27me3 domain is a classic marker for the establishment of XCI, which is achieved through X chromosome-wide heterochromatinization and transcriptional repression (Chow and Heard, 2009; Heard et al., 2004; Huynh and Lee, 2005). Thus, in the present study, we performed immunostaining for H3K27me3 domains to evaluate the iXCI status in blastocysts, as previously reported (Fukuda et al., 2014; Gontan et al., 2018; Inoue et al., 2010; Tan et al., 2016). Based on your comments, we have added another statistical method to quantify the establishment of iXCI, i.e., the percentage of H3K27me3-positive and -negative cells among total trophoblast cells in female blastocysts with or without Dnmt3b knockdown. The result also indicated that Dnmt3b knockdown led to a significant loss of H3K27me3 domains from total trophoblast cells. Similarly, new data based on statistical analyses of total trophoblast cells have been added to the results for Dnmt3b knockout and 5-aza-dC treatment (Figure 3F; Figure 3—figure supplement 3D, H in the revised manuscript).

      To clarify the significance and reliability of detecting H3K27me3 domains, we have added a schematic diagram depicting the process of iXCI initiation and establishment, as well as the experimental design and workflow, to make our results easier to understand (Figure 3C in the revised manuscript).

      In addition, we agree that additional evidence will benefit the conclusion. Thus, we have reanalyzed the RNA-seq and H3K27me3 ChIP-seq data from the extraembryonic ectoderm (ExE) of single E6.5 embryos that underwent Dnmt3a/3b knockout, because the preimplantation iXCI status is maintained in extraembryonic cells (Chen et al., 2019; Galupa and Heard, 2015; Schulz and Heard, 2013). The results showed that the chromosome-wide loss of DNA methylation induced by Dnmt knockout led to a nearly complete loss of H3K27me3 on the paternal X chromosome (which is specifically inactivated in iXCI), along with a notable transcriptional upregulation across the chromosome. By contrast, these changes were not observed on the maternal X chromosome.

      We have added this result in the revised manuscript (Lines 253-261; Figure 3—figure supplement 4A in the revised manuscript).

      (3) Analysis of the developmental potential of Dnmt3b-kd embryos

      While the authors claim that Dnmt3b-mediated de novo DNA methylation plays an important role in imprinted X-chromosome inactivation, it remains unclear whether the analysis presented in Figure 4 is derived from "female" embryos. This analysis seemed confusing as the authors claim that de novo DNA methylation in the promoter regions during the transition from 8-cell to blastocyst regulates imprinted X-chromosome inactivation, but this should not happen in the male embryos. Was the impairment of embryonic proliferation and differentiation observed in both male and female embryos? Or is this specific to the female embryos? We think that the sex of the embryos would be critical for the analysis presented in Figure 4.

      Thanks so much for your constructive comments, which make our results smoother and clearer. Figure 4 mainly presents the developmental role of minor de novo methylation based on the integrated analysis of DNA methylation and gene expression dynamics from the 8-cell stage to ICM. Because our data indicated that both male and female embryos undergo minor de novo methylation (Figure 1—figure supplement 1C-D in the revised manuscript), this section mainly focused on genome-wide, general changes rather than on sex-dimorphic consequences.

      To avoid possible confusion, we have reorganized the RESULTS AND DISCUSSION section and presented this section as Figure 2 in the revised manuscript, before the chromosomal distribution analysis and the subsequent experiments relevant to iXCI.

      Reviewer #2 (Public Review):

      Summary:

      Here, Yue et al. set out to determine if the low DNMT3B expression that is observed prior to de novo DNA methylation (before the blastocyst stage) has a function. Re-analyzing existing DNA methylation data from Smith et al. (2012) they find a small DNA methylation gain over a subset of promoters and gene bodies, occurring between the 8-cell and blastocyst stages, and refer to this as "minor de novo DNA methylation". They attempt to assess the relevance/functionality of this minor DNA methylation gain, and report reduced H3K27me3 in Dnmt3b knockdown (KD) trophoblast cells that normally undergo imprinted X-chromosome inactivation (iXCI) before the blastocyst stage. In addition, they assess the proliferation, differentiation, metabolic function, implantation rate, and live birth rate of Dnmt3b KD blastocysts.

      Strengths:

      Working with early embryos is technically demanding, making the well-designed experiments from this manuscript useful to the epigenetics community. Particularly, the DNMT3B expression and 5-mC staining at different embryonic stages.

      Thanks for your positive evaluation. We have revised the manuscript based on your comments, and the items that need to be addressed in detail are explained in the point-by-point response to each comment.

      Weaknesses:

      - Throughout the manuscript, please represent DNA methylation changes as delta DNA methylation instead of fold change.

      Thanks so much for your constructive comments. We have represented DNA methylation changes as “ΔDNA methylation” (Figure 2—figure supplement 1A; Figure 3—figure supplement 1A; Figure 3—figure supplement 3I in the revised manuscript).
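      The difference between the two representations can be illustrated with a minimal sketch. The function names and the methylation percentages below are hypothetical and are not values from the manuscript; the point is only that the same absolute gain yields very different fold changes depending on the basal level, which is why ΔDNA methylation is easier to interpret for regions with low starting methylation.

```python
def delta_methylation(before: float, after: float) -> float:
    """Absolute change in methylation (after minus before), in the same units as the inputs."""
    return after - before


def fold_change(before: float, after: float) -> float:
    """Ratio of methylation levels; undefined when the starting level is zero."""
    if before == 0:
        raise ValueError("fold change is undefined for a starting level of zero")
    return after / before


# Hypothetical percentages: a lowly methylated promoter and a highly methylated region
low_basal = (1.0, 2.0)     # 1% -> 2%
high_basal = (40.0, 41.0)  # 40% -> 41%

for before, after in (low_basal, high_basal):
    print(before, after, delta_methylation(before, after), fold_change(before, after))
```

      Both hypothetical regions gain one percentage point, yet the fold change is 2.0 in the first case and about 1.025 in the second.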

      - Detailed methods on the re-analysis of the DNA methylation data from Smith et al. 2012 are missing from the materials and methods section. Was a minimum coverage threshold used?

      Thanks so much for your reminder. We have added relevant statements and provided the details of the coverage criteria in the Bioinformatics analysis subsection of the Materials and methods section, as follows: RRBS data of mouse embryos (2-cell embryos, 4-cell embryos, 8-cell embryos, ICM, and E6.5 embryos) were downloaded from the published article by Smith et al. (Smith et al., 2012) (accession number: GSE34864). The methylation level was calculated as the number of “methylated” reads (reported as C) divided by the total number of “methylated” and “unmethylated” reads (reported as C or T). The genomic region information was downloaded from the mm9 Repeat Masker. As described in the published article, promoters were defined as 1 kb up- and downstream of the TSS and classified into high-density CpG promoters (HCP), intermediate-density CpG promoters (ICP) and low-density CpG promoters (LCP). Only CpG sites with at least fivefold coverage were included in the methylation analysis. We have added relevant information in the revised manuscript (Lines 462-470 in the revised manuscript).
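      The calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, and averaging per-CpG levels across a region is an assumed aggregation that the response does not specify. Only the C / (C + T) ratio and the fivefold coverage threshold come from the text.

```python
from typing import Iterable, Optional, Tuple

MIN_COVERAGE = 5  # "at least fivefold coverage", as stated in the response


def cpg_methylation(c_reads: int, t_reads: int,
                    min_coverage: int = MIN_COVERAGE) -> Optional[float]:
    """Methylation level of one CpG: C reads / (C reads + T reads).

    Returns None when the site does not reach the coverage threshold.
    """
    coverage = c_reads + t_reads
    if coverage < min_coverage:
        return None
    return c_reads / coverage


def region_methylation(sites: Iterable[Tuple[int, int]],
                       min_coverage: int = MIN_COVERAGE) -> Optional[float]:
    """Mean methylation over the covered CpGs of a region (assumed aggregation).

    `sites` holds (C reads, T reads) pairs; returns None if no site passes the filter.
    """
    levels = [cpg_methylation(c, t, min_coverage) for c, t in sites]
    kept = [level for level in levels if level is not None]
    if not kept:
        return None
    return sum(kept) / len(kept)


# Hypothetical promoter with three CpGs; the second is dropped (coverage 2 < 5)
print(region_methylation([(5, 5), (1, 1), (10, 0)]))
```

      In this hypothetical example the two retained sites have levels of 0.5 and 1.0, so the region value is their mean.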

      - Detailed methods on the establishment and validation of Dnmt3b KO blastocysts and 5-aza-dC treated blastocysts are missing (related to Figure 2).

      Thanks so much for your detailed reminder. In the present study, we used a well-established Dnmt3b-deficient mouse model (Okano et al., 1999) to validate the role of minor de novo DNA methylation in iXCI establishment. Heterozygous Dnmt3b<sup>+/-</sup> mice, which carry one mutant allele of Dnmt3b, were obtained from the Mutant Mouse Resource & Research Centers (MMRRC, NIH). Homozygous embryos were obtained by intercrossing Dnmt3b<sup>+/-</sup> male and female mice. Genotyping of collected embryos was performed by PCR using primers designed according to the gene targeting strategy, following the MMRRC genotyping protocol (https://www.med.unc.edu/mmrrc/genotyping-protocols/mmrrc-center-protocol-29886/). We have provided the detailed methods in the revised manuscript (Lines 350-354; 391-393 in the revised manuscript). In addition, we added a schematic diagram depicting the processes of embryo collection and detection (Figure 3—figure supplement 3A in the revised manuscript).

      Similarly, we have provided relevant details of 5-aza-dC supplementation in the revised manuscript (Lines 412-415 in the revised manuscript) and added a schematic diagram depicting the details of experimental design and processes (Figure 3—figure supplement 3E in the revised manuscript).

      - Detailed methods on the re-analysis of the ChIPseq data from Liu et al. 2016 are missing from the materials and methods section.

      Thank you for pointing this out. The bigwig files of H3K27me3 ChIP-seq data were downloaded from the published article by Liu et al. (Liu et al., 2016) (accession number: GSE73952). These signal tracks were generated using the MACS2 (v2.0.10.20131216) pileup function and normalized to 1 million reads for visualization, as described in the original publication. We have added relevant information to the MATERIALS AND METHODS section in the revised manuscript (Lines 474-479 in the revised manuscript).
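      The normalization to 1 million reads amounts to a simple rescaling, sketched below. This illustrates the arithmetic only and is not MACS2 code or the original authors' script; the function name, the input values, and the assumption that the scaling uses the total read count of the library are all hypothetical.

```python
from typing import List, Sequence


def scale_to_million(pileup: Sequence[float], total_reads: int) -> List[float]:
    """Rescale a pileup signal so that it corresponds to a library of 1 million reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    factor = 1_000_000 / total_reads
    return [value * factor for value in pileup]


# Hypothetical pileup values from a library of 2 million reads
print(scale_to_million([2, 4, 0], 2_000_000))
```

      Scaling every track to the same nominal depth makes signal heights comparable across samples sequenced to different depths.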

      - Some of the data represented in bar graphs does not look convincing/significant. Maybe this data can be better represented differently, such as in box plots or violin plots, which would better represent the data.

      Thanks so much for your comments, which improve the presentation of our results. The relevant results have been changed into box plots in the revised manuscript (Figure 3E; Figure 3—figure supplement 3C; Figure 3—figure supplement 3G in the revised manuscript). In addition, to strengthen our evidence, we have added an alternative statistical method to quantify the establishment of iXCI, i.e., the percentage of H3K27me3-positive and -negative cells among total trophoblast cells in female blastocysts with or without Dnmt3b knockdown (Figure 3F; Figure 3—figure supplement 3D, H in the revised manuscript).

      - The relevance and rationale for experiments using 5-aza-dC treatment is unclear.

      Thanks so much for reminding us to make our results more informative and convincing. 5-aza-dC is a well-established global DNA hypomethylating agent that efficiently inhibits the activity of all DNMTs, and thus has been frequently used to study the maintenance of DNA methylation and de novo DNA methylation (Maslov et al., 2012; Oka et al., 2005).

      In our study, to validate the function of minor de novo DNA methylation in iXCI, we took advantage of 5-aza-dC-induced DNMT inhibition. Although 5-aza-dC inhibits various DNMTs, it allowed us to treat embryos transiently and specifically during the window of minor de novo DNA methylation (from the 8-cell to blastocyst stage). We have added these statements, as well as a schematic diagram depicting the experimental design, in the revised manuscript to make the rationale of our experiments clearer and easier to understand (Lines 183-188; Figure 3—figure supplement 3E in the revised manuscript).

      References

      Auclair, G., Guibert, S., Bender, A. and Weber, M. (2014). Ontogeny of CpG island methylation and specificity of DNMT3 methyltransferases during embryonic development in the mouse. Genome Biol. 15, 545.

      Borgel, J., Guibert, S., Li, Y., Chiba, H., Schubeler, D., Sasaki, H., Forne, T. and Weber, M. (2010). Targets and dynamics of promoter DNA methylation during early mouse development. Nat. Genet. 42, 1093-1100.

      Chen, Z., Yin, Q., Inoue, A., Zhang, C. and Zhang, Y. (2019). Allelic H3K27me3 to allelic DNA methylation switch maintains noncanonical imprinting in extraembryonic cells. Sci Adv 5, eaay7246.

      Chow, J. and Heard, E. (2009). X inactivation and the complexities of silencing a sex chromosome. Curr. Opin. Cell Biol. 21, 359-366.

      Dahlet, T., Argueso Lleida, A., Al Adhami, H., Dumas, M., Bender, A., Ngondo, R. P., Tanguy, M., Vallet, J., Auclair, G., Bardet, A. F., et al. (2020). Genome-wide analysis in the mouse embryo reveals the importance of DNA methylation for transcription integrity. Nat Commun 11, 3153.

      Fukuda, A., Tomikawa, J., Miura, T., Hata, K., Nakabayashi, K., Eggan, K., Akutsu, H. and Umezawa, A. (2014). The role of maternal-specific H3K9me3 modification in establishing imprinted X-chromosome inactivation and embryogenesis in mice. Nat Commun 5, 5464.

      Galupa, R. and Heard, E. (2015). X-chromosome inactivation: new insights into cis and trans regulation. Curr. Opin. Genet. Dev. 31, 57-66.

      Gontan, C., Mira-Bontenbal, H., Magaraki, A., Dupont, C., Barakat, T. S., Rentmeester, E., Demmers, J. and Gribnau, J. (2018). REX1 is the critical target of RNF12 in imprinted X chromosome inactivation in mice. Nat Commun 9, 4752.

      Guo, F., Li, L., Li, J., Wu, X., Hu, B., Zhu, P., Wen, L. and Tang, F. (2017). Single-cell multi-omics sequencing of mouse early embryos and embryonic stem cells. Cell Res. 27, 967-988.

      Heard, E., Chaumeil, J., Masui, O. and Okamoto, I. (2004). Mammalian X-chromosome inactivation: an epigenetics paradigm. Cold Spring Harb. Symp. Quant. Biol. 69, 89-102.

      Huynh, K. D. and Lee, J. T. (2005). X-chromosome inactivation: a hypothesis linking ontogeny and phylogeny. Nat. Rev. Genet. 6, 410-418.

      Inoue, K., Kohda, T., Sugimoto, M., Sado, T., Ogonuki, N., Matoba, S., Shiura, H., Ikeda, R., Mochida, K., Fujii, T., et al. (2010). Impeding Xist expression from the active X chromosome improves mouse somatic cell nuclear transfer. Science 330, 496-499.

      Liu, X. Y., Wang, C. F., Liu, W. Q., Li, J. Y., Li, C., Kou, X. C., Chen, J. Y., Zhao, Y. H., Gao, H. B., Wang, H., et al. (2016). Distinct features of H3K4me3 and H3K27me3 chromatin domains in pre-implantation embryos. Nature 537, 558-562.

      Maslov, A. Y., Lee, M., Gundry, M., Gravina, S., Strogonova, N., Tazearslan, C., Bendebury, A., Suh, Y. and Vijg, J. (2012). 5-aza-2'-deoxycytidine-induced genome rearrangements are mediated by DNMT1. Oncogene 31, 5172-5179.

      Oka, M., Meacham, A. M., Hamazaki, T., Rodic, N., Chang, L. J. and Terada, N. (2005). De novo DNA methyltransferases Dnmt3a and Dnmt3b primarily mediate the cytotoxic effect of 5-aza-2'-deoxycytidine. Oncogene 24, 3091-3099.

      Okano, M., Bell, D. W., Haber, D. A. and Li, E. (1999). DNA methyltransferases Dnmt3a and Dnmt3b are essential for de novo methylation and mammalian development. Cell 99, 247-257.

      Schulz, E. G. and Heard, E. (2013). Role and control of X chromosome dosage in mammalian development. Curr. Opin. Genet. Dev. 23, 109-115.

      Smith, Z. D., Chan, M. M., Mikkelsen, T. S., Gu, H. C., Gnirke, A., Regev, A. and Meissner, A. (2012). A unique regulatory phase of DNA methylation in the early mammalian embryo. Nature 484, 339-344.

      Tan, K., An, L., Miao, K., Ren, L., Hou, Z., Tao, L., Zhang, Z., Wang, X., Xia, W., Liu, J., et al. (2016). Impaired imprinted X chromosome inactivation is responsible for the skewed sex ratio following in vitro fertilization. Proc. Natl. Acad. Sci. U. S. A. 113, 3197-3202.

      Reviewer #1 (Recommendations For The Authors):

      Title

      It would be hard to understand what "co"-regulates means. Does this mean DNA methylation and H3K27me3 co-regulate imprinted X-chromosome inactivation? If so, the title can be reworded.

      Thanks for your insightful comments. The title has been changed to “A wave of minor de novo DNA methylation initiates in mouse 8-cell embryos and co-regulates imprinted X-chromosome inactivation with H3K27me3” (Line 2 in the revised manuscript).

      Text

      (1) As DNA methylation analysis is a primary part of this study, how they processed DNA methylation data can be added to the "Bioinformatics analysis" in the MATERIALS AND METHODS section.

      Thanks for your kind reminder. We have added relevant information in the Materials and methods section in the revised manuscript (Lines 462-474 in the revised manuscript).

      (2) It seems that recent literature has not been cited in the manuscript. Specifically, none of the papers after 2018 were cited. Recent relevant papers should also be cited throughout the manuscript.

      Thanks so much for your reminder. We have added more recent literature to update the relevant information, such as the evidence supporting the causal role between DNA methylation and XCI (Lines 225-228, 264-265 in the revised manuscript); the concurrent enrichment of DNA methylation and H3K27me3 in genes subject to XCI (Lines 301-303 in the revised manuscript); the dominant role of de novo methylation in X chromosome (Lines 253-256 in the revised manuscript), etc.

      (3) Line 56: The first report that describes the dynamics of DNMT3B expression in pre-implantation embryonic development (Hirasawa et al., 2007) is missing. This paper should be cited.

      We apologize for the oversight. We have added the relevant reference and rewritten the sentence in the revised manuscript (Lines 56-57 in the revised manuscript). We believe you meant the report by Hirasawa et al. (2008), which presented the expression and subcellular localization of Dnmt3a and Dnmt3b in mouse oocytes and preimplantation embryos.

      (4) Line 98: It would be good to mention that the data were derived from reduced representation bisulfite sequencing as the authors used whole-genome bisulfite sequencing data from the same research group as well.

      Thanks for your kind reminder. As you suggested, we have added a description in the revised manuscript to emphasize that these data were derived from reduced representation bisulfite sequencing, while the other data were derived from whole-genome bisulfite sequencing (Lines 98-99, 111 in the revised manuscript).

      (5) Line 101: We first... "the preferential target of DNMT3B (Auclair et al., 2014; Borgel et al., 2010)". More recent literature (Baubec et al., 2016, Duymich et al., 2016, for example) showed that the preferential target of DNMT3B is not a promoter but a gene body. This sentence should be reworded.

      Thanks so much for your detailed reminder. As you have pointed out, “preferential target” is an inaccurate statement. Besides promoters, gene bodies and other elements also undergo de novo DNA methylation (Auclair et al., 2014; Dahlet et al., 2020; Duymich et al., 2016).

      We have rewritten the sentence as follows in the revised manuscript: “Promoter regions are important target sites of DNMT3B (Choi et al., 2011). The acquisition of DNA methylation in promoters, especially in intermediate and low CpG promoters, during implantation is largely dependent on DNMT3B and plays an important role in regulating developmental genes (Auclair et al., 2014; Borgel et al., 2010; Dahlet et al., 2020). Thus, among genomic regions that may undergo de novo DNA methylation, we initially focused our analysis on DNA methylation dynamics of promoters...” (Lines 100-106 in the revised manuscript)

      (6) Lines 108-109: It would be good to mention that these data were derived from whole-genome bisulfite sequencing.

      Thanks for your kind reminder. As aforementioned, we have added a description in the revised manuscript to distinguish between data derived from reduced representation bisulfite sequencing and whole-genome bisulfite sequencing (Lines 98-99, 111 in the revised manuscript).

      (7) Line 141: rXCI should be defined.

      Thanks for your kind reminder. We have added full descriptions and further necessary information about iXCI and rXCI to make our statements clearer and easier to understand (Lines 210-213 in the revised manuscript). In addition, we carefully checked the relevant descriptions throughout the manuscript, and each abbreviation (such as “ICM”) is now defined at its first occurrence. We have also replaced abbreviations that appear only once in the manuscript with their full terms (Lines 122, 212 in the revised manuscript).

      (8) Lines 145-149: The role of DNA methylation for imprinted X-inactivation has already been reported (Chiba et al., 2008). The relevant sentences should be reworded.

      Thanks so much for reminding us of the important earlier literature that explores the relationship between DNA methylation and XCI. However, the primary aim and hypothesis of the study by Chiba et al. are different from those of our study. Chiba et al. focused on whether DNA methylation is the imprinting mark responsible for monoallelic expression of Xist (the initiation event of iXCI), while our study focused on the role of DNA methylation in achieving X chromosomal heterochromatinization (the late event of iXCI).

      In detail, the study by Chiba et al. mainly explored why Xist is specifically expressed from the paternal allele and why iXCI occurs specifically on the paternal X chromosome in mouse preimplantation embryos. Because previous studies had suggested that genomic imprinting of Xist is established during oogenesis (Oikawa et al., 2014; Tada et al., 2000), Chiba et al. tested whether the DNA methylation imprinting established during oogenesis is responsible for the monoallelic expression of Xist in preimplantation embryos. Analyses of DNA methyltransferase maternal knockout embryos revealed that oocyte DNA methylation is dispensable for Xist imprinting (Chiba et al., 2008). A follow-up study by Inoue et al. identified a broad H3K27me3 enrichment within the Xist 5’ region, which is established during oocyte growth and persists through preimplantation development, as the imprinting mark of Xist (Inoue et al., 2017). This series of studies is very important and allows us to understand the mechanism underlying paternal allele-specific iXCI in mouse preimplantation embryos and extraembryonic tissues.

      However, the hypothesis of our study is different. Based on the finding of minor de novo DNA methylation and its preferential distribution on the X chromosome, we speculated that the minor de novo methylation, which occurs from the 8-cell to blastocyst stage, may participate in achieving X chromosomal heterochromatinization. Although DNA methylation is essential for maintaining X chromosome-wide transcriptional silencing in rXCI, its role in iXCI remains controversial; it has even been suggested that DNA methylation is not required for achieving iXCI, because preimplantation embryos undergo global and massive DNA demethylation.

      We have reorganized this paragraph and added relevant statements to make the background and discussion clearer and easier to understand (Lines 217-234 in the revised manuscript).

      (9) Lines 164-165: Information regarding Dnmt3b KO is missing. Did the authors generate an original KO line or use an already published one? It should be explicitly stated.

      Thank you so much for your kind reminder. The Dnmt3b heterozygous mice were obtained from the Mutant Mouse Resource & Research Centers (MMRRC), and Dnmt3b knockout (KO) embryos were generated by mating Dnmt3b heterozygous females with heterozygous males. The genotyping of Dnmt3b KO embryos was performed by PCR following the MMRRC genotyping protocol (https://www.med.unc.edu/mmrrc/genotyping-protocols/mmrrc-center-protocol-29886/). The relevant information has been added to the MATERIALS AND METHODS section in the revised manuscript (Lines 350-354; 391-393 in the revised manuscript).

      (10) Line 165: chemical-induced inhibition of DNMT3B. As 5-aza-dC also blocks DNMT3A and DNMT1, this sentence should be reworded.

      Thank you for your valuable comments. 5-aza-dC is a well-established global DNA hypomethylating agent that efficiently inhibits the activity of all DNMTs, and it has been frequently used to study both the maintenance of DNA methylation and de novo DNA methylation (Maslov et al., 2012; Oka et al., 2005). Thus, although its inhibitory effect is common to the various DNMTs, chemical inhibition of DNMTs has the advantage of allowing us to treat embryos transiently, specifically during the window of minor de novo DNA methylation (the 8-cell to blastocyst stage). We have rewritten the relevant sentences in the revised manuscript (Lines 183-188 in the revised manuscript).

      (11) Lines 171-174: "The role of de novo methylation in iXCI...". This possibility was already tested in the previous study from the Sasaki lab (Chiba et al., 2008).

      As mentioned above, the primary aim and hypothesis of the study by Chiba et al. are different from those of our study. Chiba et al. mainly explored why Xist is specifically expressed from the paternal allele and why iXCI occurs specifically on the paternal X chromosome in mouse preimplantation embryos, so they tested whether the DNA methylation imprinting established during oogenesis is responsible for this monoallelic expression of Xist in preimplantation embryos (the initiation event of iXCI).

      By contrast, based on the finding of minor de novo DNA methylation and its preferential distribution on the X chromosome, we speculated that the minor de novo DNA methylation, which occurs from the 8-cell to blastocyst stage, may participate in achieving X chromosomal heterochromatinization (the late event of iXCI).

      Thanks so much for reminding us of this important literature, which makes our discussion more informative. We have reorganized this paragraph by rewriting or adding relevant statements to make the background and discussion clearer and easier to understand (Lines 217-231 in the revised manuscript). In addition, to avoid repetition and make our discussion more concise, we have removed the similar sentences at the end of this paragraph.

      (12) Lines 198-200: "Given DNA methylation...". These citations mention a general relationship between DNA methylation and H3K27me3 in cells in culture. As I believe the authors focus on X-chromosome inactivation in the female embryos, more relevant papers that discuss the order of the events for the establishment of H3K27me3 and DNA methylation in the inactive X-chromosome can be cited.

      Thanks so much for your comment to improve our discussion. It has been thought that during the late phase of rXCI in fully differentiated cells, gene silencing is achieved by PRC2 complex-induced H3K27me3, and then is further stably maintained by the redundant action of multiple layers of epigenetic modifications, including DNA methylation, to reach the maximum level of chromatin compaction (Chow and Heard, 2009; Heard et al., 2004; Pintacuda and Cerase, 2015). In line with this, a recent multifaceted analysis showed that DNA methylation and H3K27me3 are concurrently enriched in genes subject to XCI (Balaton and Brown, 2021). We have added these statements in the revised manuscript (Lines 295-303 in the revised manuscript).

      (13) Line 241: As 5-aza-dC blocks both de novo and maintenance DNA methylation, this sentence should be reworded.

      Thank you for your kind reminder. As you have mentioned above, 5-aza-dC is a well-established global DNA hypomethylating agent that efficiently inhibits the activity of all DNMTs, and it has been frequently used to study both the maintenance of DNA methylation and de novo DNA methylation (Maslov et al., 2012; Oka et al., 2005). Thus, although its inhibitory effect is common to the various DNMTs, chemical inhibition of DNMTs has the advantage of allowing us to treat embryos transiently, specifically during the window of minor de novo DNA methylation (the 8-cell to blastocyst stage). We have rewritten the relevant sentences in the revised manuscript (Lines 183-188 in the revised manuscript).

      Figures

      (1) Figure 1C, D: Do the rows in C and D show the corresponding genes?

      Figure 1C and D represent the DNA methylation changes of promoters (C) and gene bodies (D), respectively, during the transition from the 8-cell to blastocyst stage. The two datasets were analyzed independently, and the rows do not show corresponding genes. Since we have focused on the minor de novo methylation in promoter regions, the gene body results have been removed from the revised manuscript to avoid confusion.

      (2) Figure 1G: Yy2 promoter gained DNA methylation during the transition from 8-cell to the blastocyst stage. Is this a representative locus for the de novo methylated promoters that are shown in Figure 1F where an increase of DNA methylation is about ~1% on average? Another representative locus could be shown instead of this gene promoter.

      Thanks so much for your detailed reminder. The inconsistency between the global methylation change and the bisulfite sequencing analysis of Yy2 may be due to methodological details, such as C-T conversion efficiency, the number of picked colonies, etc. Since we have confirmed the presence of minor de novo DNA methylation using different publicly available data, we have removed this result from the revised manuscript to avoid ambiguity.

      (3) Figures 2C and 3A: It would be helpful to mention what the arrowheads mean.

      Thanks so much for your detailed reminder. In Figure 2C, the arrowhead indicates the H3K27me3 domain and the blank arrowhead indicates a blastomere without the H3K27me3 domain. In Figure 3A, the arrowhead indicates the Xist RNA domain and the blank arrowhead indicates a blastomere without the Xist RNA domain. We have added this information in the revised manuscript (Lines 736-738, 747-749 in the revised manuscript).

      (4) Figure 3-figure supplement 2B: It would be hard to see whether H3K27me3 is enriched at the promoter regions of presented genes. It would be helpful to show the values for the Y-axis as in panel A.

      Thanks for your helpful reminder. We have added the scales to the figure to improve the result presentation (Figure 4—figure supplement 2B in the revised manuscript).

      (5) Figure 4-figure supplement 2: 5-aza-dC blocks not only the activity of DNMT3B but also DNMT1, and DNMT3A (all these DNMTs are expressed during pre-implantation embryos, see Hirasawa et al., 2007). This part can be omitted from the manuscript.

      Thanks for your insightful comments. As you have mentioned above, the relevance and rationale for experiments using 5-aza-dC treatment should be clarified. 5-aza-dC is a well-established global DNA hypomethylating agent that efficiently inhibits the activity of all DNMTs, and thus has been frequently used to study both the maintenance of DNA methylation and de novo DNA methylation (Maslov et al., 2012; Oka et al., 2005).

      In our study, to validate the function of minor de novo DNA methylation in iXCI and blastocyst development, we took advantage of 5-aza-dC-induced DNMT inhibition, which allows us to treat embryos transiently, specifically during the window of minor de novo DNA methylation (the 8-cell to blastocyst stage), despite its lack of specificity among the DNMTs.

      Based on these considerations, we would like to retain this result and hope for your understanding.

      We have added these statements in the revised manuscript to make the rationale of our experiments clearer and easier to understand (Lines 183-188 in the revised manuscript) and added a schematic diagram depicting the experimental design (Figure 3—figure supplement 3E in the revised manuscript).

      Reviewer #2 (Recommendations For The Authors):

      Recommendations/concerns in the text:

      - Line 106, it is unclear what is meant by "in line with this"? Gene body DNA methylation is a characteristic of active transcription, so why would a gain in DNA methylation at promoters be in line with a gain in DNA methylation over gene bodies?

      Thank you so much for your comments, which pointed out our ambiguous statement. We meant that both the promoter and gene body regions, albeit in small proportions, gain DNA methylation during the transition from the 8-cell to blastocyst stage. In response to the comment by Reviewer #1, and since we have focused on the minor de novo methylation in promoter regions, the gene body results have been removed from the revised manuscript to avoid confusion.

      - Line 111 & 114, can 6% DNA methylation really be considered "relatively hypermethylated" compared to 3% DNA methylation that is referred to as "more hypomethylated"?

      We apologize for our unclear and ambiguous statements. Here we focused on the promoter regions. Many previous studies have revealed that, compared with gene bodies and other genomic elements, promoters and overlapping CGI regions, especially high CpG promoters, generally show low levels of DNA methylation. We have added relevant statements to clarify this information, and rewritten the sentences in the revised manuscript (Lines 100-106, 116-118, 121, 124 in the revised manuscript).

      - Line 124, there are a number of processes identified, why only mention one in the text? Suggest changing writing to be more accurate, indicating what was included for the GO analysis and using the words "enriched for ... processes". Saying it may be linked to a process is an overstatement and not supported by further experiments/data.

      Thank you so much for your detailed comments, which make our results more informative. We have checked the relevant description and addressed your suggestions as follows: by performing gene ontology enrichment analysis separately on genes that undergo minor or major de novo DNA methylation, we noticed that, besides many important basic processes common to the two waves of de novo DNA methylation, genes subject to minor de novo DNA methylation were enriched for processes such as organic substance transport, chromosome organization, and cell fate specification (Lines 129-134 in the revised manuscript).

      - Lines 149 - 152: sentence/message unclear.

      We apologize for the ambiguous description. We have corrected the relevant descriptions as follows: To identify the biological function of minor de novo DNA methylation in iXCI, we knocked down Dnmt3b in preimplantation embryos by microinjecting Dnmt3b siRNA into zygotes (Lines 234-236 in the revised manuscript).

      - Lines 162-164: the data in Figure 2C/D does not support this statement, as it does not show H3K27me3 loss specifically at the inactive X-chromosome.

      Thanks so much for your insightful comments. Despite the global enrichment of H3K27me3, the H3K27me3 domain detected by immunostaining is a classic marker for the establishment of XCI through X chromosome-wide heterochromatinization and transcriptional repression (Chow and Heard, 2009; Heard et al., 2004; Huynh and Lee, 2005). Thus, we used immunostaining for H3K27me3 domains to evaluate iXCI establishment in blastocysts, as previously reported (Fukuda et al., 2014; Gontan et al., 2018; Inoue et al., 2010; Tan et al., 2016). To make our results more convincing, we have added another statistical method to quantify the establishment of iXCI, i.e., the percentage of H3K27me3-positive and -negative trophoblast cells relative to total trophoblast cells in female blastocysts with or without Dnmt3b knockdown.

      In addition, we have added a schematic diagram depicting the process of iXCI initiation and establishment, as well as the experimental design and workflow, to make the results easier to understand.

      We also agree with your comment that additional evidence would benefit the conclusion. To strengthen the evidence, and to test whether DNA methylation loss has a prolonged effect on iXCI, we reanalyzed the RNA-seq and H3K27me3 ChIP-seq data from the extraembryonic ectoderm (ExE) of single E6.5 embryos that underwent Dnmt3a/3b knockout, because the preimplantation iXCI status is maintained in extraembryonic cells (Chen et al., 2019; Galupa and Heard, 2015; Schulz and Heard, 2013). The results showed that chromosome-wide loss of DNA methylation led to a nearly complete loss of H3K27me3 on the paternal X chromosome (which is specifically inactivated in iXCI), along with a notable transcriptional upregulation across the chromosome. By contrast, these changes were not observed on the maternal X chromosome (Lines 253-261; Figure 3—figure supplement 4A in the revised manuscript).

      - Lines 169-174: sentence/message unclear.

      As mentioned above, we have reorganized this paragraph by rewriting or adding statements about DNA methylation and XCI, to make the background and discussion clearer and easier to understand (Lines 217-234 in the revised manuscript). In addition, to avoid repetition and make our discussion more concise, we have removed the similar sentences at the end of this paragraph.

      - Lines 177-179: this statement is too bold. The data does not support "direct evidence".

      Thank you for your detailed reminder. We have rewritten the sentence to avoid confusion and overstatement (Lines 262-268 in the revised manuscript).

      - Line 198: these are not all enzymes, but could be referred to as chromatin modifiers.

      We apologize for the ambiguous description. As you suggested, we have corrected “enzymes” to “chromatin modifiers” (Lines 284, 287 in the revised manuscript).

      - Line 199: this statement is not correct in all contexts. There are many studies showing antagonism between DNA methylation and H3K27me3.

      Thanks so much for your careful review. As you have pointed out, the reported relationship between DNA methylation and H3K27me3 differs among studies and remains largely controversial. Under certain circumstances, DNA methylation antagonizes H3K27me3 at promoters by excluding the binding of components of PRC2 (the main complex responsible for H3K27me3 deposition) to their targets (Bartke et al., 2010; Jermann et al., 2014), while other studies have presented alternative evidence that PRC2 and DNA methylation cooperate to achieve silencing (Hagarman et al., 2013; Vire et al., 2006). Thus, the relationship between DNA methylation and histone modifications is thought to be complex, possibly depending on cell type and/or genomic region. Both antagonism and coordination can be observed at different regulatory elements in mouse ES cells (King et al., 2016).

      We apologize for our incomplete statement, which arose because we mainly focused on their synergistic relationship. We have refined this section by rewriting relevant sentences and adding necessary statements (Lines 288-303 in the revised manuscript).

      - Lines 228-230: the developmental significance of DNA methylation homeostasis is already well-established. Please reference relevant papers showing this here.

      Thank you for this helpful suggestion. We have reorganized this section. Relevant references that highlight the developmental significance of DNA methylation homeostasis have been added. The sentence has been rewritten and moved to the end of this paragraph (Lines 159-161 in the revised manuscript).

      - Line 238: an explanation/rationale for looking at energy metabolism is lacking.

      Thank you for your comments, which help make our results easier to understand. The examination of energy metabolism was mainly based on the integrated analysis of DNA methylation and gene expression from 8-cell embryos to the ICM, which was used to test the potential short- and long-term developmental consequences of minor de novo DNA methylation. Bioinformatic analysis suggested that many basic processes, such as cell differentiation, cell cycle and metabolic regulation, may be regulated by minor de novo DNA methylation. Among the enriched genes, several are related to energy metabolism. In addition, because energy metabolism is crucial for supporting embryo differentiation and development, and oxidative phosphorylation (OXPHOS) metabolism is highly activated during the blastocyst stage (Zhao et al., 2021), we next examined the energy metabolism, particularly OXPHOS activity, of Dnmt3b-KD embryos. We have refined the section by rewriting the relevant sentence and adding necessary statements (Lines 175-179 in the revised manuscript).

      - Lines 246-248: Looking at the data in Figure 2 figure supplement 2, this statement is simply not true with regards to DNMT3B protein, and also global DNA methylation level is reduced in the Dnmt3b KD blastocyst, which could lead to defective major de novo DNA methylation.

      Thanks for your careful review. We have rewritten the sentence to make our statement more accurate and to avoid overstatement (Lines 188-190 in the revised manuscript).

      Recommendations/concerns relating to figures:

      Figure 1:

      - Of all genic promoters, how many were included in the analysis (contained sufficient coverage)? What cut-off/thresholds were used to consider DNA methylation gain at a promoter?

      Thanks for your comments. In total, 11662 promoters were analyzed. Promoter methylation is generally low, particularly at the 8-cell stage, when minor de novo methylation has just been initiated. Because of these relatively low basal levels, the increase before the blastocyst stage appears slight. To capture such slight changes, we used a relaxed threshold based on ΔDNA methylation. Only CpG sites with at least fivefold coverage were included in the methylation analysis, based on data from Smith et al. (Smith et al., 2012). ΔDNA methylation greater or less than 0 was defined as gain or loss of DNA methylation, respectively. We have added this information in the revised manuscript (Lines 462-470 in the revised manuscript).
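
      The thresholding rule described above can be sketched as follows. This is a minimal illustration, not our actual pipeline: the function names, the per-CpG (methylated reads, total reads) input format, and the use of an unweighted mean over covered CpG sites are assumptions made for the example. Only the fivefold coverage cutoff and the sign-based definition of gain and loss come from the text.

```python
from statistics import mean

MIN_COVERAGE = 5  # fivefold coverage cutoff stated in the response


def promoter_level(cpgs, min_coverage=MIN_COVERAGE):
    """Mean methylation of a promoter over sufficiently covered CpG sites.

    cpgs: list of (methylated_reads, total_reads) tuples (hypothetical format).
    Returns None when no CpG site passes the coverage cutoff.
    """
    levels = [m / t for m, t in cpgs if t >= min_coverage]
    return mean(levels) if levels else None


def classify_delta(level_8cell, level_icm):
    """Label a promoter by the sign of its delta methylation (ICM minus 8-cell)."""
    if level_8cell is None or level_icm is None:
        return "not covered"
    delta = level_icm - level_8cell
    if delta > 0:
        return "gain"
    if delta < 0:
        return "loss"
    return "unchanged"
```

      Because any positive ΔDNA methylation counts as a gain under this rule, the classification is deliberately permissive and does not by itself provide a statistical significance estimate.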

      - Does an average methylation level of 0.02 represent 2% DNA methylation? Presuming yes, is the average 1.5% DNA methylation gain at promoters real? And meaningful? Especially compared to the gain in DNA methylation that takes place between ICM and E6.5 (Figure 1 Figure Supplement 1 D)

      As you have pointed out, an average methylation level of 0.02 represents 2% DNA methylation. As mentioned above, promoters exhibited an average DNA methylation gain of 1.5% during the transition from the 8-cell stage to the ICM. The slight increase may be mainly due to the relatively low basal levels. As you expected, compared with the comprehensive de novo DNA methylation during implantation, preimplantation de novo methylation is slighter and occurs at a small proportion of promoter regions, so we designated it as minor de novo DNA methylation. It should also be mentioned that a proportion of these promoters continue to gain massive DNA methylation during implantation. We have refined the relevant sentences to provide more detailed information on our results (Lines 125-127 in the revised manuscript).

      - Why is there a focus on promoters (which are not the preferential target of DNMT3B)?

      Thanks so much for your detailed reminder. As you have pointed out, “preferential target” is an inaccurate statement. Besides promoters, gene bodies and other elements also undergo de novo DNA methylation (Auclair et al., 2014; Dahlet et al., 2020; Duymich et al., 2016). We focused on the promoter regions based on the following considerations: (1) Promoter regions are important target sites of DNMT3B (Choi et al., 2011); (2) The acquisition of DNA methylation in promoters, especially in intermediate and low CpG promoters, during implantation is largely dependent on DNMT3B and plays an important role in regulating developmental genes (Auclair et al., 2014; Borgel et al., 2010; Dahlet et al., 2020). We have rewritten the relevant sentence in the revised manuscript (Lines 100-106 in the revised manuscript).

      - Figure 1H shows that promoters that gain DNA methylation during the "minor de novo DNA methylation" continue to gain DNA methylation during "de novo DNA methylation". Is the ~1.5% DNA methylation gain just the slow start of the main de novo DNA methylation wave?

      Your comments are very helpful for improving the description of our results. In the present study, our analysis indicated that a small proportion of promoters initially gain methylation during the transition from the 8-cell stage to the ICM. This finding challenges current knowledge: (1) de novo DNA methylation occurs during implantation, by which globally hypomethylated blastocysts acquire genome-wide DNA methylation (Borgel et al., 2010; Dahlet et al., 2020; Smith et al., 2012); (2) during preimplantation development, embryos undergo massive and global DNA demethylation.

      To distinguish our finding from the current knowledge of the timing and dynamics of DNA methylation during early development, we have designated the methylation gain during the transition from the 8-cell to blastocyst stage as minor de novo DNA methylation.

      We agree with your point that most of the promoters undergoing minor de novo methylation continue to gain DNA methylation during implantation, as shown in Fig. 1F. We have refined the relevant statement in the revised manuscript (Lines 125-127 in the revised manuscript).

      - The GO analysis performed for Figure 1H, what was used as input? Promoters of genes that gain DNA methylation as identified in 1C?

      Thank you for your comments. For the GO analysis shown in Figure 1H, we used as input the genes whose promoter regions gained or lost DNA methylation, respectively, during the transition from the 8-cell stage to the ICM (identified in Figure 1C). This information has been clarified in the revised manuscript to ensure accuracy (Lines 129-134 in the revised manuscript).

      - Figure 1 figure supplement 1, is there only a fold change as threshold or also a calculated significance (eg. p-value/FDR)?

      Thanks for your valuable comments. Considering the relatively low DNA methylation levels at promoter regions, and the slight changes occurring during preimplantation embryo development, we used a relaxed threshold based on ΔDNA methylation. Only CpG sites with at least fivefold coverage were included in the methylation analysis, based on data from Smith et al. (Smith et al., 2012). ΔDNA methylation greater or less than 0 was defined as gain or loss of DNA methylation, respectively. We have replaced the relevant figures and added this information in the revised manuscript (Figure 1—figure supplement 1D-E; Lines 125-127 in the revised manuscript).

      - To confirm DNMT3B is responsible for the DNA methylation gain: DNMT3B KD/KO followed by promoter DNA methylation analysis to confirm the promoters that gain DNA methylation between 8 cell and ICM don't gain DNA methylation in the absence of DNMT3B.

      We agree with your comment that additional evidence would benefit the conclusion. To strengthen the evidence, we reanalyzed the RNA-seq and H3K27me3 ChIP-seq data from the extraembryonic ectoderm (ExE) of single E6.5 embryos that underwent Dnmt3a/3b knockout, because the preimplantation iXCI status is maintained in extraembryonic cells (Chen et al., 2019; Galupa and Heard, 2015; Schulz and Heard, 2013). The results showed that chromosome-wide loss of DNA methylation led to a nearly complete loss of H3K27me3 on the paternal X chromosome (which is specifically inactivated in iXCI), together with a notable transcriptional upregulation across the chromosome. By contrast, these changes were not observed on the maternal X chromosome. We have added this result in the revised manuscript (Lines 253-261; Figure 3—figure supplement 4A in the revised manuscript).

      Figure 2:

      - Figure 2A: label missing for what the numbers on the y-axis represent.

      Thank you for pointing this out. We apologize for the oversight. We have added the y-axis label in Figure 2A to clarify what the numbers represent, making the figure easier to understand (Figure 3A in the revised manuscript).

      - Figure 2B: y-axis is % of methylated promoters compared to all promoters?

      Thank you for your suggestion. The y-axis in Figure 2B indeed represents the percentage of de novo methylated promoters relative to all promoters. As you have suggested, we have clarified this labeling in the revised manuscript (Figure 3B in the revised manuscript).
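
      For clarity, the quantity plotted on this y-axis can be sketched as below. This is an illustrative calculation only: the function name and the (chromosome, gained) input format are hypothetical, and the example records are made up rather than taken from our data.

```python
from collections import Counter


def percent_gained_by_chrom(promoters):
    """Percentage of promoters that gained methylation, per chromosome.

    promoters: iterable of (chromosome, gained) pairs, where gained is True
    when the promoter's delta methylation is greater than 0 (hypothetical format).
    """
    total = Counter()
    gained = Counter()
    for chrom, is_gain in promoters:
        total[chrom] += 1
        if is_gain:
            gained[chrom] += 1
    return {chrom: 100.0 * gained[chrom] / total[chrom] for chrom in total}


# Made-up example records, for illustration only.
example = [
    ("chrX", True), ("chrX", True), ("chrX", False), ("chrX", False),
    ("chr1", True), ("chr1", False), ("chr1", False), ("chr1", False),
]
result = percent_gained_by_chrom(example)
```

      The denominator is all analyzed promoters on a given chromosome, so chromosomes with different promoter counts remain comparable.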

      - What is the delta DNA methylation gain specifically for X-linked promoters?

      Thanks so much for your reminder. To provide more convincing evidence, we reanalyzed single-cell COOL-seq data and specifically examined the DNA methylation changes at X chromosomal promoters in female embryos. The X chromosome showed a more notable increase in de novo methylated promoters than the autosomes, and the female X chromosome showed higher DNA methylation levels than the male X chromosome (Figure 3—figure supplement 2A-B; Lines 203-206 in the revised manuscript).

      - Figure 2C: include representative images of separate channels to better see the signal of CDX2 and H3K27me3. Quantification would be better represented with box plots.

      Thank you for your helpful suggestions. We have added separate channel images in the revised manuscript. Additionally, we have adjusted the quantification to be represented as box plots, as you have suggested, to improve the accuracy and interpretability of the data presentation (Figure 3D-F in the revised manuscript).

      - Figure 2C: Does the H3K27me3 signal overlap with the location of the inactive X-chromosome (is there maybe denser DAPI or do IF combined with Xist RNA-FISH)?

      Thanks so much for your insightful comments. Despite the global enrichment of H3K27me3, the H3K27me3 domain detected by immunostaining is a classic marker for the establishment of XCI through X chromosome-wide heterochromatinization and transcriptional repression (Chow and Heard, 2009; Heard et al., 2004; Huynh and Lee, 2005). Thus, we used immunostaining for H3K27me3 domains to evaluate iXCI establishment in blastocysts, as previously reported (Fukuda et al., 2014; Gontan et al., 2018; Inoue et al., 2010; Tan et al., 2016). We attempted co-staining of H3K27me3 IF and Xist FISH, but were hindered by technical challenges, and we hope for your understanding. However, as mentioned above, H3K27me3 is a well-accepted marker for determining XCI status.

      In addition, to make our results more convincing, we have added an alternative statistical method to quantify the establishment of iXCI, i.e., the percentage of H3K27me3-positive and -negative trophoblast cells relative to total trophoblast cells in female blastocysts with or without Dnmt3b knockdown (Figure 3F; Lines 243-244 in the revised manuscript).
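
      The per-blastocyst percentage described above amounts to the following calculation. This is a minimal sketch: the function name and the cell counts in the example are hypothetical and do not reproduce our measurements.

```python
def percent_h3k27me3_positive(n_positive, n_negative):
    """Percentage of trophoblast cells carrying an H3K27me3 domain.

    n_positive / n_negative: counts of trophoblast cells with / without an
    H3K27me3 domain in one female blastocyst (hypothetical inputs).
    """
    total = n_positive + n_negative
    if total == 0:
        raise ValueError("no trophoblast cells were counted")
    return 100.0 * n_positive / total


# Hypothetical counts for a single blastocyst.
example_percent = percent_h3k27me3_positive(30, 10)
```

      The percentage of H3K27me3-negative cells is the complement of this value, so the two measures always sum to 100 for a given blastocyst.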

      - Figure 2 figure supplement 2A: relative expression of Dnmt3b?

      Thanks for your detailed reminder. The data represent the relative expression level of Dnmt3b, as noted in the original figure legend. Based on your comments, we have added the gene name to the y-axis label. Similarly, the protein name has also been added to make the results more informative (Figure 2 figure supplement 2A, C, E in the revised manuscript).

      - Figure 2 figure supplement 2B/C: in the text, line 153, it is stated that "Dnmt3b mRNA and protein levels were significantly reduced in morulae, but not in blastocysts compared to those of negative control (NC) group". These figures do not support that statement. The IF images show a loss of DNMT3B in the Dnmt3b KD blastocysts. The IF quantification seems to have fewer datapoints for the blastocyst, and looking at the bar graphs, there seems to be a trend towards reduced DNMT3B in both the morula and blastocyst, which would also explain the reduction in DNA methylation in both stages as shown in Figure 2 figure supplement 2D/E.

      Thanks so much for your careful reviewing that makes our statements more accurate. We have rewritten the sentence in the revised manuscript as follows: Dnmt3b mRNA and protein levels were significantly reduced in morulae, and tended to be lower in blastocysts compared to those of the negative control (NC) group. In addition, we have removed “transient” from the original statement “The transient inhibition of Dnmt3b” (Lines 168-170 in the revised manuscript).

      - Figure 2 figure supplement 2F/G: include representative IF images with separation of all channels and the merged image.

      Thank you for your suggestion. We have added the representative immunofluorescence (IF) images with separate channels and merged image in the revised manuscript (Figure 3—figure supplement 3B, F in the revised manuscript).

      - Figure 2 figure supplement 2H: Instead of showing log2FC in methylation levels, delta methylation would be more informative. Are these genes already inactivated at the 8-cell stage? Or are they active and become inactivated by the gain in DNA methylation? Doing qPCR for these genes, or looking at published RNAseq data would be informative. What happens to the expression of these genes in the Dnmt3b KD?

      Thanks for your suggestions. We now present DNA methylation changes as “ΔDNA methylation”. During mouse preimplantation development, iXCI is initiated in early cleavage-stage female embryos and depends on Xist upregulation around the 4- to 8-cell stage; Xist then specifically coats the paternal X chromosome and finally leads to chromosome-wide silencing via heterochromatinization in early blastocysts. Thus, these non-escaping genes, which are subject to XCI, would not yet be inactivated at the 8-cell stage.

      Author response image 1.

      The processes of iXCI initiation and establishment (left panel), and dynamics of total expression levels of X chromosome in male and female preimplantation embryos (right panel, note that X-dosage is balanced between sexes until the early blastocyst stage).

      As you expected, most of these representative non-escaping genes are downregulated during the transition from the 8-cell to the blastocyst stage, consistent with their gain of DNA methylation. Additionally, since the preimplantation iXCI status is maintained in extraembryonic cells (Galupa and Heard, 2015; Schulz and Heard, 2013), we further reanalyzed the published RNA-seq data in extraembryonic ectoderm (ExE) of E6.5 single embryos that underwent DNA methyltransferase knockout (Chen et al., 2019). The results showed that chromosome-wide loss of DNA methylation led to chromosome-wide transcriptional upregulation on the paternal X chromosome, including the loci of these non-escaping genes. We have added this result to the revised manuscript (Figure 3—figure supplement 3J; Figure 3—figure supplement 4A-B; Lines 253-261 in the revised manuscript).
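
      The reviewer's point that Δ methylation is more informative than log2FC can be illustrated with a small sketch. The gene names and methylation fractions below are hypothetical, chosen only to show that two promoters with the same fold change can differ tenfold in absolute methylation gain.

```python
import math


def delta_methylation(before: float, after: float) -> float:
    """Absolute change in methylation fraction (after minus before)."""
    return after - before


def log2_fold_change(before: float, after: float) -> float:
    """log2 ratio of methylation fractions; undefined for non-positive values."""
    if before <= 0 or after <= 0:
        raise ValueError("log2 fold change needs positive methylation values")
    return math.log2(after / before)


# Hypothetical promoter methylation fractions: (8-cell, blastocyst).
promoters = {"gene_A": (0.02, 0.08), "gene_B": (0.20, 0.80)}

for name, (eight_cell, blastocyst) in promoters.items():
    d = delta_methylation(eight_cell, blastocyst)
    fc = log2_fold_change(eight_cell, blastocyst)
    print(f"{name}: delta = {d:.2f}, log2FC = {fc:.2f}")
```

      Both hypothetical promoters show a fourfold increase, yet only gene_B gains a large fraction of methylated CpGs, which is the distinction that Δ methylation makes visible.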

      Figure 3:

      - Figure 3 figure supplement 1: representative IF image missing.

      Thanks for your kind reminder. We have added the representative IF images in the revised manuscript to provide a clearer illustration of the data (Figure 4—figure supplement 1A in the revised manuscript).

      - Figure 3 figure supplement 2B: scales are missing for the H3K27me3 ChIP-seq data (are the 8-cell and ICM tracks set to the same scale?). It looks like the ICM track is cut off at the top (peaks not fully displayed) and the data looks very sparse. A more informative analysis would be to do peak calling over promoters and compare 8-cell with ICM.

      Thanks for your detailed reminder. We apologize for the missing scale bars in the H3K27me3 ChIP-seq data. The 8-cell and ICM tracks were set to the same scale, and we have now added scales to the figure in the revised manuscript to improve the presentation of the results. Regarding the apparently flattened peaks, this visual effect is not caused by the track being cut off, but rather by zooming into a specific region in the extended IGV files.

      These results are based on the reanalysis of publicly available data from pooled embryos, which provide only suggestive, not direct, evidence for the role of DNA methylation in promoting X-linked H3K27me3 enrichment in iXCI.

      To provide more convincing evidence, we reanalyzed the RNA-seq and H3K27me3 ChIP-seq data from extraembryonic ectoderm (ExE) of E6.5 female embryos that underwent Dnmt3a/3b knockout, because the preimplantation iXCI status is maintained in extraembryonic cells (Chen et al., 2019; Galupa and Heard, 2015; Schulz and Heard, 2013). The results showed that Dnmt knockout led to a nearly complete loss of H3K27me3 on the paternal X chromosome (which is specifically inactivated in iXCI), accompanied by notable transcriptional upregulation across the chromosome. By contrast, these changes were not observed on the maternal X chromosome (Figure 3—figure supplement 4 in the revised manuscript). We have added these results to the revised manuscript.

      - Figure 3E: Given all tested proteins give a positive signal, it would have been good to include a negative control chromatin protein that is known to not interact with DNMT3B. Given both PRC2 and DNMT3B are chromatin-binding proteins, can the signal be a result of close proximity instead of a direct interaction?

      In the present study, to test the interaction between DNMT3B and PRC2 core components, we have used in situ proximity ligation assay (PLA), an increasingly popular technique for detecting the close proximity of two proteins in fixed samples using two primary antibodies (Alsemarz et al., 2018).

      Author response image 2.

      Schematic diagram of the principle of the in situ PLA.

      Compared with the classical co-immunoprecipitation (Co-IP) method, in situ PLA has advantages in (1) detecting low-input samples or proteins expressed at low levels, which is extremely difficult using Co-IP; and (2) providing in situ or subcellular information on protein-protein interactions. However, it should be noted that the maximal distance allowing this reaction is 40 nm, which is not small enough to demonstrate a physical interaction between the two antigens, but is sufficient to support very close “proximity”.

      In our study, in situ PLA, including the design of the negative controls, was performed in accordance with the manufacturer’s instructions for the Duolink® In Situ Red Starter Kit (MilliporeSigma): “Technical negative controls included incubation with each primary antibody separately and no primary antibody”. We have refined the relevant sentence in the revised manuscript (Lines 308-310 in the revised manuscript).

      - Figure 3G: It would have been good to include a negative control, and DNase/benzonase to exclude DNA/RNA-mediated protein interaction.

      - (Of note, there have been previous studies reporting an interaction between PRC2 and DNMT3B in other cell types, such as in Weigert et al. 2023, but unfortunately, they don't seem to use DNase/benzonase either).

      The Co-IP analysis of DNMT3B and PRC2 core components in differentiated female ES cells was presented as additional supportive evidence. Because Co-IP analysis is extremely difficult for preimplantation embryos, we used in situ PLA to detect their interaction. However, the maximal distance allowing the in situ PLA reaction is 40 nm, which is not small enough to demonstrate a physical interaction (Alsemarz et al., 2018). Thus, we added a Co-IP analysis using differentiated female ES cells, in which rXCI occurs upon differentiation.

      Considering the supporting role of this result, we have moved it from the main figure to a supplemental figure (Figure 4—figure supplement 3H in the revised manuscript).

      - Figure 3 figure supplement 3G: what were the ESCs differentiated into? Did the Dnmt3b KO or Dnmt3a/b DKO show any differentiation defect?

      The mouse ESC line PGK12.1 is a well-established ex vivo model of rXCI. Under standard culture conditions, PGK12.1 cells are normally fated to neuroectodermal commitment.

      Author response image 3.

      Immunostaining of NESTIN, a neuroectodermal stem cell marker molecule, and NANOG in undifferentiated and differentiated PGK12.1 ESCs respectively.

      No differentiation defects have been observed in either Dnmt3b KO or Dnmt3a/3b DKO ESCs in our study. Dnmt KO/DKO/TKO ES cell lines have been successfully used as the model of interaction of DNA methylation and H3K27me3 deposition (King et al., 2016).

      Figure 4:

      - Figure 4B: Is there an explanation for seeing similar total cell numbers in Figure 4B, but showing decreased proliferation in Figure 4A?

      Thank you for your insightful comments. The EdU cell proliferation assay labels cells during the S phase of the cell cycle, as 5-ethynyl-2′-deoxyuridine (EdU) is incorporated into newly synthesized DNA. This labeling identifies cells undergoing DNA synthesis, but these cells may not have completed mitosis at the time of detection. As a result, the total cell number may not immediately reflect the decrease in proliferation observed in the treated group. To address this point, we have rewritten the sentences in the revised manuscript (Lines 174-175 in the revised manuscript).

      References

      Alsemarz, A., Lasko, P. and Fagotto, F. J. B. (2018). Limited significance of the in situ proximity ligation assay. bioRxiv, 411355.

      Auclair, G., Guibert, S., Bender, A. and Weber, M. (2014). Ontogeny of CpG island methylation and specificity of DNMT3 methyltransferases during embryonic development in the mouse. Genome Biol. 15, 545.

      Balaton, B. P. and Brown, C. J. (2021). Contribution of genetic and epigenetic changes to escape from X-chromosome inactivation. Epigenetics Chromatin 14, 30.

      Bartke, T., Vermeulen, M., Xhemalce, B., Robson, S. C., Mann, M. and Kouzarides, T. (2010). Nucleosome-interacting proteins regulated by DNA and histone methylation. Cell 143, 470-484.

      Borgel, J., Guibert, S., Li, Y., Chiba, H., Schubeler, D., Sasaki, H., Forne, T. and Weber, M. (2010). Targets and dynamics of promoter DNA methylation during early mouse development. Nat. Genet. 42, 1093-1100.

      Chen, Z., Yin, Q., Inoue, A., Zhang, C. and Zhang, Y. (2019). Allelic H3K27me3 to allelic DNA methylation switch maintains noncanonical imprinting in extraembryonic cells. Sci Adv 5, eaay7246.

      Chiba, H., Hirasawa, R., Kaneda, M., Amakawa, Y., Li, E., Sado, T. and Sasaki, H. (2008). De novo DNA methylation independent establishment of maternal imprint on X chromosome in mouse oocytes. Genesis 46, 768-774.

      Choi, S. H., Heo, K., Byun, H. M., An, W., Lu, W. and Yang, A. S. (2011). Identification of preferential target sites for human DNA methyltransferases. Nucleic Acids Res. 39, 104-118.

      Chow, J. and Heard, E. (2009). X inactivation and the complexities of silencing a sex chromosome. Curr. Opin. Cell Biol. 21, 359-366.

      Dahlet, T., Argueso Lleida, A., Al Adhami, H., Dumas, M., Bender, A., Ngondo, R. P., Tanguy, M., Vallet, J., Auclair, G., Bardet, A. F., et al. (2020). Genome-wide analysis in the mouse embryo reveals the importance of DNA methylation for transcription integrity. Nat Commun 11, 3153.

      Duymich, C. E., Charlet, J., Yang, X. J., Jones, P. A. and Liang, G. N. (2016). DNMT3B isoforms without catalytic activity stimulate gene body methylation as accessory proteins in somatic cells. Nat Commun 7, 11453.

      Fukuda, A., Tomikawa, J., Miura, T., Hata, K., Nakabayashi, K., Eggan, K., Akutsu, H. and Umezawa, A. (2014). The role of maternal-specific H3K9me3 modification in establishing imprinted X-chromosome inactivation and embryogenesis in mice. Nat Commun 5, 5464.

      Galupa, R. and Heard, E. (2015). X-chromosome inactivation: new insights into cis and trans regulation. Curr. Opin. Genet. Dev. 31, 57-66.

      Gontan, C., Mira-Bontenbal, H., Magaraki, A., Dupont, C., Barakat, T. S., Rentmeester, E., Demmers, J. and Gribnau, J. (2018). REX1 is the critical target of RNF12 in imprinted X chromosome inactivation in mice. Nat Commun 9, 4752.

      Hagarman, J. A., Motley, M. P., Kristjansdottir, K. and Soloway, P. D. (2013). Coordinate regulation of DNA methylation and H3K27me3 in mouse embryonic stem cells. PLoS One 8, e53880.

      Heard, E., Chaumeil, J., Masui, O. and Okamoto, I. (2004). Mammalian X-chromosome inactivation: an epigenetics paradigm. Cold Spring Harb. Symp. Quant. Biol. 69, 89-102.

      Huynh, K. D. and Lee, J. T. (2005). X-chromosome inactivation: a hypothesis linking ontogeny and phylogeny. Nat. Rev. Genet. 6, 410-418.

      Inoue, A., Jiang, L., Lu, F. and Zhang, Y. (2017). Genomic imprinting of Xist by maternal H3K27me3. Genes Dev. 31, 1927-1932.

      Inoue, K., Kohda, T., Sugimoto, M., Sado, T., Ogonuki, N., Matoba, S., Shiura, H., Ikeda, R., Mochida, K., Fujii, T., et al. (2010). Impeding Xist expression from the active X chromosome improves mouse somatic cell nuclear transfer. Science 330, 496-499.

      Jermann, P., Hoerner, L., Burger, L. and Schubeler, D. (2014). Short sequences can efficiently recruit histone H3 lysine 27 trimethylation in the absence of enhancer activity and DNA methylation. Proc. Natl. Acad. Sci. U. S. A. 111, E3415-3421.

      King, A. D., Huang, K., Rubbi, L., Liu, S., Wang, C. Y., Wang, Y., Pellegrini, M. and Fan, G. (2016). Reversible Regulation of Promoter and Enhancer Histone Landscape by DNA Methylation in Mouse Embryonic Stem Cells. Cell Rep. 17, 289-302.

      Maslov, A. Y., Lee, M., Gundry, M., Gravina, S., Strogonova, N., Tazearslan, C., Bendebury, A., Suh, Y. and Vijg, J. (2012). 5-aza-2'-deoxycytidine-induced genome rearrangements are mediated by DNMT1. Oncogene 31, 5172-5179.

      Oikawa, M., Inoue, K., Shiura, H., Matoba, S., Kamimura, S., Hirose, M., Mekada, K., Yoshiki, A., Tanaka, S., Abe, K., et al. (2014). Understanding the X chromosome inactivation cycle in mice: a comprehensive view provided by nuclear transfer. Epigenetics-Us 9, 204-211.

      Oka, M., Meacham, A. M., Hamazaki, T., Rodic, N., Chang, L. J. and Terada, N. (2005). De novo DNA methyltransferases Dnmt3a and Dnmt3b primarily mediate the cytotoxic effect of 5-aza-2'-deoxycytidine. Oncogene 24, 3091-3099.

      Pintacuda, G. and Cerase, A. (2015). X Inactivation Lessons from Differentiating Mouse Embryonic Stem Cells. Stem Cell Rev Rep 11, 699-705.

      Schulz, E. G. and Heard, E. (2013). Role and control of X chromosome dosage in mammalian development. Curr. Opin. Genet. Dev. 23, 109-115.

      Smith, Z. D., Chan, M. M., Mikkelsen, T. S., Gu, H. C., Gnirke, A., Regev, A. and Meissner, A. (2012). A unique regulatory phase of DNA methylation in the early mammalian embryo. Nature 484, 339-344.

      Tada, T., Obata, Y., Tada, M., Goto, Y., Nakatsuji, N., Tan, S., Kono, T. and Takagi, N. (2000). Imprint switching for non-random X-chromosome inactivation during mouse oocyte growth. Development 127, 3101-3105.

      Tan, K., An, L., Miao, K., Ren, L., Hou, Z., Tao, L., Zhang, Z., Wang, X., Xia, W., Liu, J., et al. (2016). Impaired imprinted X chromosome inactivation is responsible for the skewed sex ratio following in vitro fertilization. Proc. Natl. Acad. Sci. U. S. A. 113, 3197-3202.

      Vire, E., Brenner, C., Deplus, R., Blanchon, L., Fraga, M., Didelot, C., Morey, L., Van Eynde, A., Bernard, D., Vanderwinden, J. M., et al. (2006). The Polycomb group protein EZH2 directly controls DNA methylation. Nature 439, 871-874.

      Zhao, J., Yao, K., Yu, H., Zhang, L., Xu, Y., Chen, L., Sun, Z., Zhu, Y., Zhang, C., Qian, Y., et al. (2021). Metabolic remodelling during early mouse embryo development. Nat Metab 3, 1372-1384.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1

      (1) In the "Introduction" section, an important aspect that requires attention pertains to the discussion surrounding the heterodimerization of CXCR4 and CCR5. Notably, the manuscript overlooks a recent study (https://doi.org/10.1038/s41467-023-42082-z) elucidating the mechanism underlying the formation of functional dimers within these G protein-coupled receptors (GPCRs)…The inclusion of this study within the manuscript would significantly enrich the contextual framework of the work, offering readers a comprehensive understanding of the current knowledge surrounding the structural dynamics and functional implications of CXCR4 and CCR5 heterodimerization.

      We thank the reviewer for his/her recommendation to enrich the contextual framework of our study. The Nature Communications paper by Di Marino et al. was published after we sent the first version of our manuscript to eLife, and therefore was not included in the discussion. As the reviewer rightly indicates, this paper elucidates the mechanism underlying the formation of functional dimers within CCR5 and CXCR4. Using metadynamics approaches, the authors emphasize the importance of distinct transmembrane regions for dimerization of the two receptors. In particular, CXCR4 shows two low energy dimer structures and the TMVI-TMVII helices are the preferred interfaces involved in the protomer interactions in both cases. Although the study uses in silico techniques, it also includes the molecular binding mechanism of CCR5 and CXCR4 in the membrane environment, as the authors generate a model in which the receptors are immersed in a 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine (POPC) phospholipid bilayer with 10% cholesterol. This is an important point in this study, as membrane lipids also interact with membrane proteins, and the lipid composition affects CXCR4 oligomerization (Gardeta S.R. et al. Front. Immunol. 2023). In particular, Di Marino et al. find a cholesterol molecule placed in-between the two CXCR4 protomers where it engages a series of hydrophobic interactions with residues including Leu132, Val214, Leu216 and Phe249. Then, the polar head of cholesterol forms an H-bond with Tyr135 that further stabilizes protomer binding. In our hands, the F249L mutation in CXCR4 reverted the antagonism of AGR1.137, suggesting that the compound binds, among others, this residue. We should, nonetheless, indicate that we analyzed receptor oligomerization and not CXCR4 dimerization, which was the main object of the Di Marino et al. study. 
It is therefore also plausible that other residues than those described as essential for CXCR4 dimerization might participate in receptor oligomerization. We can speculate that AGR1.137 might affect cholesterol binding to CXCR4 and, therefore, alter dimerization/oligomerization. Additionally, the CXCR4 x-ray structure with PDB code 3ODU (Wu B. et al. Science, 2010) experimentally shows the presence of two fatty acid molecules in contact with both TMV and TMVI. These molecules closely interact with hydrophobic residues in the protein, thereby stabilizing it in a hydrophobic environment. Although more experiments will be needed to clarify the mechanism involved, our results suggest that cholesterol and/or other lipids also play an important role in CXCR4 oligomerization and function, as seen for other GPCRs (Jakubik J. & ElFakahani E.E. Int J Mol Sci. 2021). However, we should also consider that other factors not included in the analysis by Di Marino et al. can also affect CXCR4 oligomerization; for instance, the co-expression of other chemokine receptors and/or other GPCRs that heterodimerize with CXCR4 might affect CXCR4 dynamics at the cell membrane, similar to other membrane proteins such as CD4, which also forms complexes with CXCR4 (Martinez-Muñoz L. et al. Mol. Cell 2018).

      The revised discussion contains references to the study by Di Marino et al. to enrich the contextual framework of our data.

      (2) In "various sections" of the manuscript, there appears to be confusion surrounding the terminology used to refer to antagonists. It is recommended to provide a clearer distinction between allosteric and orthosteric antagonists to enhance reader comprehension. An orthosteric antagonist typically binds to the same site as the endogenous ligand, directly blocking its interaction with the receptor. On the other hand, an allosteric antagonist binds to a site distinct from the orthosteric site, inducing a conformational change in the receptor that inhibits the binding of the endogenous ligand. By explicitly defining the terms "allosteric antagonist" and "orthosteric antagonist" within the manuscript, readers will be better equipped to discern the specific mechanisms discussed in the context of the study.

      The behavior of the compounds described in our manuscript (AGR1.135 and AGR1.137) fits with the definition of allosteric antagonists, as they bind on a site distinct from the orthosteric site, although they only block some ligand-mediated functions and not others. This would mean that they are not formally antagonists and should not be considered as allosteric compounds, as their binding on CXCR4 does not alter CXCL12 binding, although they might affect its affinity. In this sense, our compounds respond much better to the concept of negative allosteric modulators (Gao Z.-G. & Jacobson K.A. Drug Discov. Today Technol. 2013). They act by binding on a site distinct from the orthosteric site and selectively block some downstream signaling pathways but not others induced by the same endogenous agonist.

      To avoid confusion and to clarify the role of the compounds described in this study, we now refer to them as negative allosteric modulators along the manuscript.

      (3) In the Results section, the computational approach employed for "screening small compounds targeting CXCR4, particularly focusing on the inhibition of CXCL12-induced CXCR4 nanoclustering", requires clarification due to several points of incomprehension. The following recommendations aim to address these concerns and enhance the overall clarity of the section:

      (1) Computational Approach and Binding Mode Description: 

      -Explicitly describe the methodology for identifying the pocket/cleft area in angstroms (Å) on the CXCR4 protein structure. Include details on how the volume of the cleft enclosed by TMV and TMVI was determined, as this information is not readily apparent in the provided reference (https://doi.org/10.1073/pnas.1601278113).

      The identification of the cleft was based on the observations by Wu et al. (Wu B. et al. Science 2010), who described the presence of bound lipids in the area formed by TMV and TMVI, and those of Wescott et al. (Wescott M.P. et al. Proc. Natl. Acad. Sci. 2016) on the importance of TMVI in the transmission of conformational changes promoted by CXCL12 on CXCR4 towards the cytoplasmic surface of the receptor to link the binding site with signaling activation. Collectively, these results, and our previous data on the critical role of the N-terminus region of TMVI for CXCR4 oligomerization (Martinez-Muñoz L. et al. Mol. Cell 2018), focused our in silico screening on this region. Once we detected that several compounds bound CXCR4 in this region, the cleft properties were calculated by subtracting the compound structure. The resulting PDB was analyzed using the PDBsum server (Laskowski R.A. et al. Protein Sci. 2018). Volume calculations were obtained using the server's analysis of surface clefts by SURFNET (Laskowski R. A. J. Mol. Graph. 1995). The theoretical interaction surface between the selected compounds and CXCR4 and the atomic distances between the protein residues and the compounds were calculated using the PISA server (Krissinel E. & Henrick K. J. Mol. Biol. 2007) (Fig. I, only for review purposes). The analysis of the cleft occupied by AGR1.135 showed two independent cavities of 434 Å³ and 1,381 Å³ that were not connected to the orthosteric site. In the case of AGR1.137, the data revealed two distinct clefts of 790 Å³ and 580 Å³ (Fig. I, only for review purposes). These details have been included in the revised manuscript (New Fig. 1A, Supplementary Fig 8A, B).

      (4) Clarify the statement regarding the cleft being "surface exposed for interactions with the plasma membrane," particularly in the context of its embedding within the membrane.

      For GPCRs, transmembrane domains represent binding sites for bioactive lipids that play important functional and physiological roles (Huwiler A. & Zangemeister-Wittke U. Pharmacol. Ther. 2018). The channel between TMV and TMVI connects the orthosteric chemokine binding pocket to the lipid bilayer and is occupied by an oleic acid molecule, according to the CXCR4 structure published in 2010 (Wu B. et al. Science 2010). In addition, the target region contains residues involved in cholesterol (and perhaps other lipids) engagement (Di Marino et al. Nat. Commun. 2023). Taken together, these data support our statement that the cleft supports interactions between CXCR4 molecules and the plasma membrane. 

      Moreover, the data of Di Marino et al. also support that CCR5 and CXCR4 have a symmetric and an asymmetric binding mode. Therefore, either dimeric structure has the possibility to form trimers, tetramers, and even oligomers by using the free binding interface to complex with another protomer. This hypothesis suggests that the interaction of dimers to form oligomers should involve residues distinct from those included in the dimeric conformation.

      The sentence has been modified in the revised manuscript to clarify comprehension.

      (5) Discuss the rationale behind targeting the allosteric binding pocket instead of the orthosteric pocket, outlining potential advantages and disadvantages.

      The advantages and disadvantages of using negative allosteric modulators vs orthosteric antagonists have now been included in the revised discussion.

      The majority of GPCR-targeted drugs function by binding to the orthosteric site of the receptor, and are agonists, partial agonists, antagonists or inverse agonists. These orthosteric compounds can have off-target effects and poor selectivity due to highly homologous receptor orthosteric sites and to abrogation of spatial and/or temporal endogenous signaling patterns. 

      The alternative is to use allosteric modulators, which can tune the functions associated with the receptors without affecting the orthosteric site. They can be positive, negative or neutral modulators, depending on their effect on the functionality of the receptor (Foster D.J. & Conn P.J. Neuron 2017). For example, the use of a negative allosteric modulator of a chemokine receptor to dampen pathological signaling events, while retaining full signaling for non-pathological activities, might limit adverse effects (Kohout T.A. et al. J. Biol. Chem. 2004). In this case, the negative allosteric modulator 873140 blocks CCL3 binding on CCR5 but does not alter CCL5 binding (Watson C. et al. Mol. Pharmacol. 2005). In other cases, allosteric modulators can stabilize a particular receptor conformation and block others. The mechanism of action of the FDA-approved anti-HIV-1 CCR5 allosteric modulator maraviroc is attributed to its ability to modulate CCR5 dimer populations and their subsequent subcellular trafficking and localization to the cell membrane (Jin J. et al. Sci. Signal. 2018). Two CCR5 dimeric conformations that are imperative for membrane localization were present in the absence of maraviroc; however, an additional CCR5 dimer conformation was discovered after the addition of maraviroc, and all homodimeric conformations were further stabilized. This finding is consistent with the observation that CCR5 dimers and oligomers inhibit HIV host-cell entry, likely by preventing the HIV-1 co-receptor formation.

      It is well known that GPCRs activate G proteins, but they also recruit additional proteins (e.g., β-arrestins) that induce signaling cascades which, in turn, can direct specific subsets of cellular responses independent of G protein activation (Eichel K. et al. Nature 2018) and are responsible for either therapeutic or adverse effects. Allosteric modulators can thus be used to block these adverse effects without influencing the therapeutic benefits. This was the case in the design of G protein-biased agonists for the kappa opioid receptor, which maintain the desirable antinociceptive and antipruritic effects and eliminate the sedative and dissociative effects in rodent models (Brust T.F. et al. Sci. Signal 2016).

      (6) Provide the PDB ID of the CXCR4 structure used as a template for modeling with SwissModel. Explain the decision to model the structure from the amino acid sequence and suggest an alternative approach, such as utilizing AlphaFold structures and performing classical molecular dynamics with subsequent clustering for the best representative structure.

      The PDB entry used as a template for modeling CXCR4 was 3ODU. This information was already included in the material and methods section. At the time we performed these analyses, there were several crystallographic structures of CXCR4 in complex with different molecules and peptides deposited at the PDB. None of them included a full construct containing the complete receptor sequence to provide a suitable sample for X-ray structure resolution, as the N- and C-terminal ends of CXCR4 are very flexible loops. In addition, the CXCR4 constructs contained T4 lysozyme inserted between helices TMV and TMVI to increase the stability of the protein, a common strategy used to facilitate crystallogenesis of GPCRs (Zou Y. et al. PLoS One 2012). Therefore, we generated a CXCR4 homology model using the SWISS-MODEL server (Waterhouse A. et al. Nucleic Acids Res. 2018). This program reconstructed the loop between TMV and TMVI, a domain particularly important in this study that was not present in any of the crystal structures available in the PDB. The model structure was, nonetheless, still incomplete, as it began at P27 and ended at S319 because the terminal ends were not resolved in the crystal structure used as a template. Nevertheless, we considered that these terminal ends were not involved in CXCR4 oligomerization.

      As AlphaFold was not available at the time we initiated this project, we did not use it. However, we have now updated our workflow to current methods and predicted the structure of the target using AlphaFold (Jumper J. et al. Nature 2021) and the sequence available under UniProt entry P61073. We prepared the ligands using OpenBabel (O’Boyle N.M. et al., J. Cheminformatics 2011), with Gasteiger charge assignment, and generated 10 conformers for each input ligand using the OpenBabel genetic algorithm. We then prepared the target structure with OpenMM, removing all waters and possible heteroatoms, and adding all missing atoms. We next predicted the target binding pockets with fPocket (Le Guilloux V. et al. BMC Bioinformatics 2009), p2rank (Krivak R. & Hoksza, J. Cheminformatics 2018), and AutoDock autosite (Ravindranath P.A. & Sanner M.F. Bioinformatics 2016). We chose only those pockets between TMV and TMVI (see answer to point 3). We merged the results of the three programs into so-called consensus pockets, where two pockets are considered sufficiently similar if at least 75% of their surfaces are shared (del Hoyo D. et al. J. Chem. Inform. Model. 2023). Among the consensus pockets, one was significantly larger than the others and was therefore selected. We then docked the ligand conformers in this pocket using AutoDock GPU (Santos-Martins D. et al. J. Chem. Theory Comput. 2021), LeDock (Liu N & Xu Z., IOP Conf. Ser. Earth Environ. Sci. 2019), and Vina (Eberhardt J. et al. J. Chem. Inf. Model. 2021). The number of docked poses varied from 210 to 287. We scored each pose with the Vina score using ODDT (Wójcikowski M. et al. J. Cheminform. 2015). Then, we clustered the different solutions into groups whose maximum RMSD was 1 Å. This resulted in 40 clusters; the representative of each cluster was the pose with the maximum Vina score, and these representatives confirmed that the selected compounds bound this pocket (Author response image 1).
When required, we calculated the binding affinity using Schrödinger’s MM-GBSA procedure (Greenidge P.A. et al. J. Chem. Inf. Model. 2013), in two ways: first, assuming that the ligand and target are fixed; second, with an energy minimization of all the atoms within a distance of 3 Å from the ligand. This information has now been included in the revised version of the manuscript.
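The pose-clustering step described above can be illustrated with a minimal sketch. It is not the code used in the study: the function names, the greedy complete-linkage strategy, the absence of symmetry correction, and the convention that a higher score is better (raw Vina energies in kcal/mol would need to be negated first) are all assumptions made for illustration.

```python
import math

def rmsd(pose_a, pose_b):
    """RMSD between two poses given as equal-length lists of (x, y, z) tuples.

    Assumes both poses share the receptor frame and atom order (no alignment,
    no symmetry correction).
    """
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must have the same number of atoms")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(pose_a, pose_b))
    return math.sqrt(squared / len(pose_a))

def cluster_poses(poses, scores, max_rmsd=1.0):
    """Greedy complete-linkage grouping of docking poses.

    Poses are visited from best to worst score (assumption: higher is better).
    A pose joins the first cluster in which it lies within max_rmsd of every
    member; otherwise it starts a new cluster.
    """
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    clusters = []
    for i in order:
        for members in clusters:
            if all(rmsd(poses[i], poses[j]) <= max_rmsd for j in members):
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def representatives(clusters):
    """The first member of each cluster is its best-scored pose."""
    return [members[0] for members in clusters]

# Toy example: two nearly identical poses and one distant pose.
poses = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.2, 0.0, 0.0), (1.2, 0.0, 0.0)],
    [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0)],
]
scores = [7.1, 8.3, 6.0]
clusters = cluster_poses(poses, scores)
print(clusters, representatives(clusters))
```

Because poses are visited in score order, each cluster's representative is simply its first member; other clustering schemes (for example hierarchical clustering on a full RMSD matrix) could give different groupings.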

      Author response image 1.

      AGR1.135 docking in CXCR4 using the updated protocol for ligand docking. Cartoon representation colored in gray with TMV and TMVI shown in blue and pink, respectively. AGR1.135 is shown in stick representation with carbons in yellow, oxygens in red and nitrogens in blue.

      (7) Specify the meaning of "minimal interaction energy" and where (if present) the interaction scores are reported in the text.

      By minimal interaction energy we mean the best docking score obtained in our docking studies. These data were not included in the previous manuscript due to space restrictions but are now included in the revised manuscript.

      (8) You performed docking studies using GLIDE to identify potential binding sites for the small compounds on the CXCR4 protein. The top-scoring binders were then subjected to further refinement using PELE simulations. However, I realize that a detailed description of the specific binding modes of these compounds was not provided in the text. Please make the description of binding poses more detailed

      Firstly, to assess the reliability of this method, a PELE study was carried out for the control molecule IT1t, which is a small drug-like isothiourea derivative that has been crystallized in complex with CXCR4 (PDB code: 3ODU). IT1t is a CXCR4 antagonist that binds to the CXCL12 binding cavity and inhibits HIV-1 infection (Das D. Antimicrob. Agents Chemother. 2015; Dekkers S. et al. J. Med. Chem. 2023). Of the five best trajectories, two had clearly better binding energies and corresponded to almost the same predicted pose of the molecule. Although the predicted binding mode was not exactly the same as the one in the crystal structure, the approximation was very good, which validates the approach. Although PELE is a suitable technique to find potential binding sites, the predicted poses must be subsequently refined using docking programs.

      Analyzing the best trajectories for the remaining ligands, at least one of the best-scored poses was always located at the orthosteric binding site of CXCR4. Even though these poses showed good binding energies, they were discarded as the in vitro biological experiments indicated that the compounds were unable to block CXCL12 binding or CXCL12-mediated inhibition of cAMP release or CXCR4 internalization. Collectively, these data indicated that the selected compounds did not behave as orthosteric inhibitors of CXCR4. The CXCL12 binding pocket is the biggest cavity in CXCR4, and so PELE may tend to place the molecules near it. However, all the compounds presented other feasible binding sites with a comparable binding energy.

      AGR1.135 and AGR1.137 showed interesting poses between TMV and TMVI with very good binding energy (-51.4 and -37.2 kcal/mol, respectively). This was precisely the region we had previously selected for the in silico screening, as previously described (see response to point 3).

      AGR1.131 showed two poses with low binding energy that were placed between helices TMI and TMVII (-43.6 kcal/mol) and between helices TMV and TMVI (-39.8 kcal/mol). This compound was unable to affect CXCL12-mediated chemotaxis and was therefore used as an internal negative control, as it was selected in the in silico screening with the same criteria as the other compounds but failed to alter any CXCL12-mediated functions. PELE studies nonetheless provided different binding sites for each molecule, which had to be further studied using docking to obtain a more accurate binding mode. In line with the previous comment, we repeated the analysis using AlphaFold and the rest of the procedure described (see our response to point 6) and calculated the binding energies for all the compounds using Schrödinger’s MM-GBSA procedure (Greenidge P.A. et al. J. Chem. Inf. Model. 2013). Calculations were performed in two ways: first, assuming that the ligand and target are fixed; second, with an energy minimization of all the atoms within a distance of 3 Å from the ligand. With the first method, AGR1.135 and AGR1.137 showed poses between TMV and TMVI with -56.4 and -62.4 kcal/mol, respectively, and AGR1.131 had a pose between TMI and TMVII with -61.6 kcal/mol. With the second method, AGR1.135 and AGR1.137 showed poses between TMV and TMVI with -57.9 and -67.6 kcal/mol, respectively, and AGR1.131 had a pose between TMI and TMVII with -62.2 kcal/mol.

      This information is now included in the text.
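As an illustration of the second MM-GBSA variant, the selection of the region to be minimized (atoms within 3 Å of the ligand) can be sketched as follows. This is only a geometric sketch and not the Schrödinger procedure itself; the function name and the toy coordinates are hypothetical.

```python
import math

def atoms_near_ligand(receptor_atoms, ligand_atoms, cutoff=3.0):
    """Return indices of receptor atoms within `cutoff` Å of any ligand atom.

    Both inputs are lists of (x, y, z) coordinates in Å.
    """
    return [
        i
        for i, r in enumerate(receptor_atoms)
        if any(math.dist(r, l) <= cutoff for l in ligand_atoms)
    ]

# Toy coordinates (hypothetical): the first two receptor atoms are close to the ligand.
receptor = [(0.0, 0.0, 0.0), (2.5, 0.0, 0.0), (10.0, 0.0, 0.0)]
ligand = [(0.0, 0.0, 2.9), (4.0, 0.0, 0.0)]
print(atoms_near_ligand(receptor, ligand))
```

Only the atoms returned by such a selection would be allowed to relax, while the rest of the system stays fixed, which is why the two calculation modes can yield slightly different energies.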

      (9) (2) Experimental Design:-Justify the choice of treating Jurkat cells with a concentration of 50 μM of the selected compound. Consider exploring different concentrations and provide a rationale for the selected dosage. Additionally, clearly identify the type of small compound used in the initial experiment.

      The revised version contains a new panel in Fig. 1B to show a more detailed kinetic analysis with different concentrations (1-100 µM) of the compounds in the Jurkat migration experiments. In all cases, 100 µM nearly completely abrogated cell migration, but in order to reduce the amount of DMSO added to the cells we selected 50 µM for further experiments, as it was the concentration that inhibited 50-75% of ligand-induced cell migration. Regarding the type of small compounds used in the initial experiments, they were compounds included in the library described in reference #24 (Sebastian-Pérez V. et al. Med. Biol. Chem. 2017), which contains heterocyclic compounds. We would note that we do not consider AGR1.137 a final compound. We think that there is scope to develop AGR1.137-based second-generation compounds with greater solubility in water, greater specificity or affinity for CXCR4, and to evaluate delivery methods to hopefully increase activity.

      (10) Avoid reporting details in rounded parentheses within the text; consider relocating such information to the Materials and Methods section or figure captions for improved readability.

      Most of the rounded parentheses within the text have been eliminated in the revised version of the manuscript to improve readability.

      (11) Elaborate on the virtual screening approach using GLIDE software, specifying the targeted site and methodology employed.

      For the virtual screening, we used the Glide module (SP and XP function scoring) included in the Schrödinger software package, utilizing the corresponding 3D target structure and our MBC library (Sebastián-Pérez V et al. J. Chem. Inf. Model. 2017). The center of the catalytic pocket was selected as the centroid of the grid. In the grid generation, a scaling factor of 1.0 in van der Waals radius scaling and a partial charge cutoff of 0.25 were used. A rescoring of the SP poses of each compound was then performed with the XP scoring function of Glide. The XP mode in Glide was used in the virtual screening, the ligand sampling was flexible, Epik state penalties were added and an energy window of 2.5 kcal/mol was used for ring sampling. In the energy minimization step, the distance-dependent dielectric constant was 4.0 with a maximum number of minimization steps of 100,000. In the clustering, poses were considered duplicates and discarded if both the RMS deviation was less than 0.5 Å and the maximum atomic displacement was less than 1.3 Å.
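The duplicate-pose criterion stated above (RMS deviation below 0.5 Å and maximum atomic displacement below 1.3 Å) can be written out as a short sketch. This is an illustrative reimplementation, not Glide's internal code; the function names are hypothetical and the poses are assumed to share atom order and coordinate frame.

```python
import math

def is_duplicate(pose_a, pose_b, rmsd_cut=0.5, disp_cut=1.3):
    """Two poses are duplicates only if BOTH criteria hold:
    RMS deviation < rmsd_cut (Å) and maximum atomic displacement < disp_cut (Å).
    """
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must have the same number of atoms")
    displacements = [math.dist(p, q) for p, q in zip(pose_a, pose_b)]
    rms = math.sqrt(sum(d * d for d in displacements) / len(displacements))
    return rms < rmsd_cut and max(displacements) < disp_cut

def drop_duplicates(poses):
    """Keep a pose only if it is not a duplicate of an already kept pose."""
    kept = []
    for pose in poses:
        if not any(is_duplicate(pose, k) for k in kept):
            kept.append(pose)
    return kept

# Toy example with a 10-atom linear "ligand".
reference = [(float(i), 0.0, 0.0) for i in range(10)]
shifted = [(x + 0.1, y, z) for x, y, z in reference]   # small rigid shift
one_moved = reference[:-1] + [(9.0, 0.0, 1.4)]         # one atom moved by 1.4 Å

print(is_duplicate(reference, shifted))    # small RMSD and small displacement
print(is_duplicate(reference, one_moved))  # RMSD below 0.5 Å, but one atom moved too far
print(len(drop_duplicates([reference, shifted, one_moved])))
```

The third pose shows why the second criterion matters: a single flipped substituent can leave the overall RMSD low while still representing a distinct binding mode.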

      (12) Provide clarity on the statement that AGR1.131 "theoretically" binds the same motif, explaining the docking procedure used for this determination.

      In the in silico screening, AGR1.131 was one of the 40 selected compounds that showed, according to the PELE analysis (see answer to point 8), a pose with low binding energy (-39.8 kcal/mol) between the TMV and TMVI helices, which is the area selected for the screening. It nonetheless also showed a best pose placed between helices TMI and TMVII (-43.7 kcal/mol) using the initial workflow. In conclusion, although AGR1.131 could also be placed in the TMV-TMVI region, the most favorable pose was in the area between TMI and TMVII. In addition, the compound was included in the biological screening, where it did not affect CXCL12-mediated chemotaxis. We thus decided to use it as an internal negative control, as it has a skeleton very similar to AGR1.135 and AGR1.137 and can interact with the TM domains of CXCR4 without promoting biological effects. This statement has been clarified in the revised text.

      (13) Toxicity Testing:

      -Enhance the explanation of the approach to testing the toxicity of the compound in Jurkat cells. Consider incorporating positive controls to strengthen the assessment and clarify the experimental design.

      All the selected compounds in the in silico screening were initially tested for propidium iodide incorporation in treated cells in a toxicity assay, and some of them were discarded for further experiments (e.g., AGR1.103 and VSP3.1).

      Jurkat cell viability was further evaluated by cell cycle analysis using propidium iodide. Supplementary Fig. 1B included the percentage of each cell cycle phase, and data indicated no significant differences between the treatments tested. Nevertheless, at the suggestion of the reviewer, and to clarify this issue, positive controls inducing Jurkat cell death (staurosporine and hydrogen peroxide) have also been included in the new Supplementary Fig. 2. The new figure also includes a table showing the percentage of cells in each cell-cycle phase.

      (14) In the Results section concerning "AGR1.135 and AGR1.137 blocking CXCL12-mediated CXCR4 nanoclustering and dynamics", several points can be improved to enhance clarity and coherence: 1. Specificity of Low Molecular Weight Compounds:  

      -Clearly articulate how AGR1.135 and AGR1.137 specifically target homodimeric CXCR4 and provide an explanation for their lack of impact on heterodimeric CXCR4-CCR5 in that region.

      First of all, we should clarify that when we talk about receptor nanoclustering, oligomers refer to complexes including 3 or more receptors and, therefore, the residues involved in these interactions can differ from those involved in receptor dimerization. Moreover, our FRET experiments did not indicate that the compounds alter receptor dimerization (see new Supplementary Fig. 7). Of note, mutant receptors unable to oligomerize can still form dimers (Martínez-Muñoz L. et al. Mol. Cell 2018; García-Cuesta E.M. et al. Proc. Natl. Acad. Sci. USA 2022). Additionally, we believe that these oligomers can also include other chemokine receptors/proteins expressed at the cell membrane, which we are currently studying using different models and techniques.

      We have results supporting the existence of CCR5/CXCR4 heterodimers (Martínez-Muñoz L et al. Proc. Natl. Acad. Sci. USA 2014), in line with the data published by Di Marino et al. However, in the current study we have not evaluated the impact of the selected compounds on CXCR4 complexes other than CXCR4 oligomers. Our Jurkat cells do not express CCR5 and, therefore, we cannot discuss whether AGR1.137 affects CCR5/CXCR4 heterodimers. The chemokine field is very complex and most receptors can form dimers (homo- and heterodimers) as well as oligomers (Martinez-Muñoz L. et al. Pharmacol. Ther. 2011) when co-expressed. Evaluating different receptor combinations in the same experiment is a complex task, as the number of potential combinations between distinct expressed receptors makes the analysis very difficult. We started with CXCR4 as a model, to continue later with other possible CXCR4 complexes. In addition, for the analysis of CCR5/CXCR4 dynamics, it is much better to use dual-TIRF techniques, which allow the simultaneous detection of two distinct molecules coupled to different fluorochromes.

      Regarding the data of Di Marino et al., it is possible that the compounds might also affect heterodimeric conformations of CXCR4. This aspect has also been broached in the revised discussion. We would again note that we evaluated CXCR4 oligomers and not monomers or dimers; this is especially relevant when we compare the residues involved in these processes as they might differ depending on the receptor conformation considered. This issue was also hypothesized by Di Marino et al. (see our response to point 4).

      (15) When referring to "unstimulated" cells, provide a more detailed explanation to elucidate the experimental conditions and cellular state under consideration.

      Unstimulated cells refer to the cells in basal conditions, that is, cells in the absence of CXCL12. For TIRF-M experiments, transiently-transfected Jurkat cells were plated on glass-bottomed microwell dishes coated with fibronectin; these are the unstimulated cells. To observe the effect of the ligand, dishes were coated as above plus CXCL12 (stimulated cells). We have clarified this point in the material and methods section of the revised version.

      (16) 2. Paragraph Organization

      -Reorganize the second paragraph to eliminate redundancy and improve overall flow. A more concise and fluid presentation will facilitate reader comprehension and engagement.

      The second paragraph has been reorganized to improve overall flow.

      (17) Ensure that each paragraph contributes distinct information, avoiding repetition and redundancy.

      We have carefully revised each paragraph of the manuscript to avoid redundancy.

      (18) 3. Claim of Allosteric Antagonism:

      -Exercise caution when asserting that "AGR1.135 and AGR1.137 behave as allosteric antagonists of CXCR4" based on the presented results. Consider rephrasing to reflect that the observed effects suggest the potential allosteric nature of these compounds, acknowledging the need for further investigations and evidence.

      To avoid misinterpretations on the effect of the compounds on CXCR4, as we have commented in our response to point 2, we have substituted the term allosteric inhibitors with negative allosteric modulators, which refer to molecules that act by binding a site distinct from the orthosteric site, and selectively block some downstream signaling pathways, whereas others induced by the same endogenous or orthosteric agonist are unaffected (Gao Z.-G. & Jacobson K.A. Drug Discov. Today Technol. 2013). Our data indicate that the selected small compounds do not block ligand binding or G protein activation or receptor internalization, but inhibit receptor oligomerization and ligand-mediated directed cell migration.

      (19) In the Results section discussing the "incomplete abolition of CXCR4-mediated responses in Jurkat cells by AGR1.135 and AGR1.137", several points can be refined for better clarity and completeness:  1. Inclusion of Positive Controls: 

      -Consider incorporating positive controls in relevant experiments to provide a comparative benchmark for assessing the impact of AGR1.135 and AGR1.137. This addition will strengthen the interpretation of results and enhance the experimental rigor. 

      The in vivo experiments (Fig. 7E,F) used AMD3100, an orthosteric antagonist of CXCR4, as a positive control. We also included AMD3100, as a positive control of inhibition when evaluating the effect of the compounds on CXCL12 binding (Fig. 3, new Supplementary Fig. 3). The revised version of the manuscript also includes the effect of this inhibitor on other relevant CXCL12-mediated responses such as cell migration (Fig. 1B), receptor internalization (Fig. 3A), cAMP production (Fig. 3C), ERK1/2 and AKT phosphorylation (Supplementary Fig. 4), actin polymerization (Fig. 4A), cell polarization (Fig. 4B, C) and cell adhesion (Fig. 4D), to facilitate the interpretation of the results and improve the experimental rigor.

      (20) 2. Clarification of Terminology: 

      -Clarify the term "CXCR4 internalizes" by providing context, perhaps explaining the process of receptor internalization and its relevance to the study.

      We refer to CXCR4 internalization as a CXCL12-mediated endocytosis process that results in reduction of CXCR4 levels on the cell surface. We use CXCR4 internalization in this study for two purposes. First, for CXCR4 and other chemokine receptors, internalization processes are mediated by ligand-induced clathrin vesicles (Venkatesan et al. 2003), a process that triggers CXCR4 aggregation in these vesicles. We have previously determined that the oligomers of receptors detected by TIRF-M remain unaltered in cells treated with inhibitors of clathrin vesicle formation and of internalization processes (Martinez-Muñoz L. et al. Mol. Cell 2018). Moreover, we have described a mutant CXCR4 that cannot form oligomers but internalizes normally in response to CXCL12 (Martinez-Muñoz L. et al. Mol. Cell 2018). The observation in this manuscript of normal CXCL12-mediated endocytosis in the presence of the negative allosteric modulators of CXCR4 that abrogate receptor oligomerization reinforces the idea that the oligomers detected by TIRF are not related to receptor aggregates involved in endocytosis. Second, receptor internalization is not affected by the allosteric compounds, indicating that they downregulate some CXCL12-mediated signaling events but not others (new Fig. 3).

      All these data have been included in the revised discussion of the manuscript.

      (21) Elaborate on the meaning of "CXCL12 triggers normal CXCR4mut internalization" to enhance reader understanding.

      We have previously described a triple-mutant CXCR4 (K239L/V242A/L246A; CXCR4mut). The mutant residues are located in the N-terminal region of TMVI, close to the cytoplasmic region, thus limiting the CXCR4 pocket described in this study (see our response to point 3). This mutant receptor dimerizes but neither oligomerizes in response to CXCL12 nor supports CXCL12-induced directed cell migration, although it can still trigger some Ca2+ flux and is internalized after ligand activation (Martinez-Muñoz L. et al. Mol. Cell 2018).  We use the behavior of this mutant (CXCR4mut) to show that the CXCR4 oligomers and the complexes involved in internalization processes are not the same and to explain why we evaluated CXCR4 endocytosis in the presence of the negative allosteric modulators.

      As we indicated in a previous answer to the reviewer, these issues have been re-elaborated in the revised version.

      (22) 3. Discrepancy in CXCL12 Concentration:

      -Address the apparent discrepancy between the text stating, "...were stimulated with CXCL12 (50 nM, 37°C)," and the figure caption (Fig. 3A) reporting a concentration of 12.5 nM. Rectify this inconsistency and provide an accurate and clear explanation.

      We apologize for this error, which is now corrected in the revised manuscript. With the exception of the cell migration assays in Transwells, where the optimal concentration was established at 12.5 nM, in the remaining experiments the optimal concentration of CXCL12 employed was 50 nM. These concentrations were optimized in previous work from our laboratory using the same type of experiment. We should also remark that in the experiments using lipid bilayers or TIRF-M, CXCL12 is used to coat the plates and therefore it is difficult to determine the real concentration of the ligand that is retained on the surface of the plates after the washing steps performed prior to adding the cells. In addition, we use 100 nM CXCL12 to create the gradient in the chambers used to perform the directed-cell migration experiments.

      (23) 4. Speculation on CXCL12 Binding:

      -Refrain from making speculative statements, such as "These data suggest that none of the antagonists alters CXCL12 binding to CXCR4," unless there is concrete evidence presented up to that point. Clearly outline the results that support this conclusion.

      Figure 3B and Supplementary Figure 3 show CXCL12-ATTO700 binding by flow cytometry in cells pretreated with the negative allosteric modulators. We have also included AMD3100, the orthosteric antagonist, as a control for inhibition. While these experiments showed no major effect of the compounds on CXCL12 binding, we cannot rule out small changes in the affinity of the interaction between CXCL12 and CXCR4. Consequently, we have rewritten these statements.

      (24) 5. Corroboration of Data:

      -Specify where the corroborating data from immunostaining and confocal analysis are reported, ensuring readers can access the relevant information to support the conclusions drawn in this section.

      In agreement with the suggestion of the reviewer, the revised manuscript includes data from immunostaining and confocal analysis to complement Fig. 4B (new Fig. 4C). The revised version also includes some representative videos for the TIRF experiments shown in Figure 2 to improve clarity.

      (25) In the Results section concerning "AGR1.135 and AGR1.137 antagonists and their direct binding to CXCR4", several aspects need clarification and refinement for a more comprehensive and understandable presentation: 1. Workflow Clarification:

      -Clearly articulate the workflow used for assessing the binding of AGR1.135 and AGR1.137 to CXCR4. Address the apparent contradiction between the inability to detect a direct interaction and the utilization of Glide for docking in the TMV-TMVI cleft.

      To address the direct interaction of the compounds with CXCR4, we intentionally avoided the modification of the small compounds with different labels, which could affect their properties. We therefore attempted a fluorescence spectroscopy strategy to formally prove the ability of the small compounds to bind CXCR4, but this failed because AGR1.135 is yellow in color, which interfered with the determinations. We also tried a FRET strategy (see new Supplementary Fig. 7) and detected a significant increase in FRET efficiency of CXCR4 homodimers when AGR1.135 was evaluated, but again the yellow color interfered with FRET determinations. Moreover, AGR1.137 did not modify FRET efficiency of CXCR4 dimers. Therefore, we were unable to detect the interaction of the compounds with CXCR4 directly.

      We elected to develop an indirect strategy. In silico, we evaluated the binding site using docking and molecular dynamics to predict the most promising CXCR4 binding residues involved in the interaction with the selected compounds. Next, we generated point mutant receptors of the predicted residues and re-evaluated the behavior of the allosteric antagonists in a CXCL12-induced cell migration experiment. We first discarded those CXCR4 mutants that were not expressed on the cell membrane as well as those that were not functional when activated with CXCL12. Using this strategy, we eliminated the interference due to the physical properties of the compounds and demonstrated that if the antagonism of a compound is reversed in a particular CXCR4 mutant it is because the mutated residue participates or interferes with the interaction between CXCR4 and the compound, thus assuming (albeit indirectly) that the compound binds CXCR4.

      To select the specific mutations included in the analysis, our strategy was to generate point mutations in residues present in the TMV-TMVI pocket of CXCR4 that were not directly proposed as critical residues involved in chemokine engagement, signal initiation, signal propagation, or G protein-binding, based on the extensive mutational study published by Wescott et al. (Wescott M.P. et al. Proc. Natl. Acad. Sci. USA 2016).

      (26) Provide a cohesive explanation of the transition from docking evaluation to MD analysis, ensuring a transparent representation of the methodology.

      Based on the aim of this work, the workflow shown in Author response image 2 was proposed to predict the binding mode of the selected molecules. Firstly, a CXCR4 model was generated to reconstruct some unresolved parts of the protein structure; then a binding site search using PELE software was performed to identify the most promising binding sites; subsequently, docking studies were performed to refine the binding mode of the molecules; and finally, molecular dynamics simulations were run to determine the most stable poses and predict the residues that we should mutate to test that the compounds interact with CXCR4.

      Author response image 2.

      Workflow followed to determine the binding mode of the studied compounds.

      (27) 2. Choice of Software and Techniques:

      -Justify the use of "AMBER14" and the PELE approach, considering their potential obsolescence.

      These experiments were performed five years ago when the project was initiated. As the reviewer indicates, the AMBER14 and PELE approaches might now be considered obsolete. Thus, we have predicted the structure of the target using AlphaFold (Jumper J. et al. Nature 2021) and the sequence available under UniProt entry P61073. The complete analysis performed (see our response to point 4) confirmed that the compounds bound the selected pocket, as we had originally determined using PELE. These new analyses have been incorporated into the revised manuscript.

      (28) -Discuss the role of the membrane in the receptor-ligand interaction. Elaborate on how the lipidic double layer may influence the binding of small compounds to GPCRs embedded in the membrane.

      Biological membranes are vital components of living organisms, providing a diffusion barrier that separates cells from the extracellular environment, and compartmentalizing specialized organelles within the cell. In order to maintain the diffusion barrier and to keep it electrochemically sealed, a close interaction of membrane proteins with the lipid bilayer is necessary. It is well known that this is important, as many membrane proteins undergo conformational changes that affect their transmembrane regions and that may regulate their activity, as seen with GPCRs (Daemen F.J. & Bonting S.L., Biophys. Struct. Mech. 1977; Gether U. et al. EMBO J. 1997). The lateral and rotational mobility of membrane lipids supports the sealing function while allowing for the structural rearrangement of membrane proteins, as they can adhere to the surface of integral membrane proteins and flexibly adjust to a changing microenvironment. In the case of the first atomistic structure of CXCR4 (Wu B. et al. Science 2010), it was indicated that for dimers, monomers interact only at the extracellular side of helices V and VI, leaving at least a 4-Å gap between the intracellular regions, which is presumably filled by lipids. In particular, they indicated that the channel between TMV and TMVI that connects the orthosteric chemokine binding pocket to the lipid bilayer is occupied by an oleic acid molecule. Recently, Di Marino et al., analyzing the dimeric structure of CXCR4, found a cholesterol molecule placed in between the two protomers, where it engages a series of hydrophobic interactions with residues located in the area between TMI and TMVI (Leu132, Val214, Leu216, Leu246, and Phe249). The polar head of cholesterol forms an H-bond with Tyr135 that further stabilizes its binding mode. This finding confirms that cholesterol might play an important role in mediating and stabilizing receptor dimerization, as seen in other GPCRs (Pluhackova, K., et al. PLoS Comput. Biol. 2016). 
In addition, we have previously observed that, independently of the structural changes on CXCR4 triggered by lipids, the local lipid environment also regulates CXCR4 organization, dynamics and function at the cell membrane and modulates chemokine-triggered directed cell migration. Prolonged treatment of T cells with bacterial sphingomyelinase promoted the complete and sustained breakdown of sphingomyelins and the accumulation of the corresponding ceramides, which altered both membrane fluidity and CXCR4 nanoclustering and dynamics. Under these conditions, CXCR4 retained some CXCL12-mediated signaling activity but failed to promote efficient directed cell migration (Gardeta S.R. et al. Front. Immunol. 2022). Collectively, these data demonstrate the key role that lipids play in the stabilization of CXCR4 conformations and in regulating its lateral mobility, thereby influencing its associated functions. These considerations have been included in the revised version of the manuscript.

      (29) 3. Stable Trajectories and Binding Mode Superimposition -Specify the criteria for defining "stable trajectories" to enhance reader understanding.

      There are several ways to describe the stability of an MD simulation, based on the convergence of energies, distances or ligand-target interactions, among others. In this work, we use the expression “stable trajectories” to refer to simulations in which the ligand trajectory converges and the ligand RMSD does not fluctuate more than 0.25 Å. This definition is now included in the revised text.
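One way to make this stability criterion operational is sketched below. The 0.25 Å threshold comes from the definition above; how "fluctuation" is measured (here, the range of the ligand RMSD over the last half of the trajectory) and the function name are assumptions made only for illustration.

```python
def is_stable(ligand_rmsd, tail_fraction=0.5, max_fluctuation=0.25):
    """Return True if the ligand RMSD (Å, one value per frame) has converged.

    Assumption: convergence is judged on the final `tail_fraction` of frames,
    and fluctuation is taken as max - min of the RMSD over that window.
    """
    if not ligand_rmsd:
        raise ValueError("empty RMSD series")
    start = int(len(ligand_rmsd) * (1 - tail_fraction))
    tail = ligand_rmsd[start:]
    return max(tail) - min(tail) <= max_fluctuation

# Toy series: the first plateaus near 2.1 Å, the second keeps drifting.
converged = [0.0, 0.8, 1.5, 1.9, 2.0, 2.1, 2.05, 2.0, 2.1, 2.15]
drifting = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
print(is_stable(converged), is_stable(drifting))
```

Other reasonable choices, such as the standard deviation around the window mean, would give a similar classification for clearly converged or clearly drifting trajectories but may differ in borderline cases.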

      (30)  Clarify the meaning behind superimposing the two small compounds and ensure that the statement in the figure caption aligns with the information presented in the main text.

      We apologize for the error in the previous Fig. 5A and in its legend. The figure was created by superimposing the protein component of the poses for the two compounds, AGR1.135 and AGR1.137, rather than the compounds themselves. As panel 5A was confusing, we have modified all Fig. 5 in the revised manuscript to improve clarity.

      (31) 4. Volume Analysis and Distances:

      -Provide details on how the volume analysis was computed and how distances were accounted for. Consider adding a figure to illustrate these analyses, aiding reader comprehension.

      The cleft search and analysis were performed using the default settings of SURFNET (Laskowski R.A. J. Mol. Graph. 1995) included in the PDBsum server (Laskowski R.A. et al. Trends Biochem. Sci. 1997). The first run of the input model for CXCR4 3ODU identified a promising cleft of 870 Å³ in the lower half of the region flanked by TMV and TMVI, highlighting this area as a possible small molecule binding site (Fig. I, only for review purposes). Analysis of the cleft occupied by AGR1.135 showed two independent cavities of 434 Å³ and 1381 Å³ that were not connected to the orthosteric site. The same procedure for AGR1.137 revealed two distinct clefts of 790 Å³ and 580 Å³, respectively (Fig. I, only for review purposes). Analysis of the atomic distances between the protein residues and the compounds was performed using the PISA server (Krissinel E. & Henrick K. J. Mol. Biol. 2007). (Please see our response to point 3 and the corresponding figure).

      (32) 5. Mutant Selection and Relevance:

      -Clarify the rationale behind selecting the CXCR4 mutants used in the study. Consider justifying the choice and exploring the possibility of performing an alanine (ALA) scan for a more comprehensive mutational analysis.  

      The selection of the residues to be mutated along the cleft was first based on their presence in the proposed cleft and the direct interaction of the compounds with them, either by hydrogen bonding or by hydrophobic interactions. Secondly, none of the mutated residues belonged to the critical residues involved in transmitting the signal generated by the interaction of CXCL12 with the receptor. In any case, mutants producing a non-functional CXCR4 at the cell membrane were discarded after FACS analysis and chemotaxis experiments. Finally, the length and nature of the resulting mutations were designed mainly to occlude the cleft by introducing long residues such as lysines (I204K, L208K) or to alter hydrophobic interactions by changing the carbon side chain composition of the residues in the cleft. We agree that an alanine scan would have been an alternative strategy to evaluate the residues involved in the interactions of the compounds.

      (33) Reevaluate the statement regarding the relevance of the Y256F mutation for the binding of AGR1.137. If there is a significant impact on migration in the mutant (Fig. 6B), elaborate on the significance in the context of AGR1.137 binding.

      In the revised discussion we provide more detail on the relevance of Y256F mutation for the binding of AGR1.137 as well as for the partial effect of G207I and R235L mutations. The predicted interactions for each compound are depicted in new Fig. 6 C, D after LigPlot+ analysis (Laskowski R.A. & Swindells M.B. J. Chem. Inf. Model. 2011), showing that AGR1.135 interacted directly with the receptor through a hydrogen bond with Y256. When this residue was mutated to F, one of the anchor points for the compound was lost, weakening the potential interaction in the region of the upper anchor point.

      It is not clear how the Y256F mutation will affect the binding of AGR1.137, but other potential contacts cannot be ruled out since that portion of the compound is identical in both AGR1.135 and AGR1.137. This is especially true for its neighboring residues in the alpha helix, F249, L208, as shown in 3ODU structure (Fig. 6D), which are shown to be directly implicated in the interaction of both compounds. Alternatively, we cannot discard that Y256 interacts with other TMs or lipids stabilizing the overall structure, which could reverse the effect of the mutant at a later stage (Author response image 3).

      Author response image 3.

      Cartoon representation of Y256 and its intramolecular interactions in the CXCR4 X-ray structure 3ODU. The TMV helix is colored in blue and TMVI in pink.

      (34) Address the apparent discrepancy in residue involvement between AGR1.135 and AGR1.137, particularly if they share the same binding mode in the same cleft.

      AGR1.135 and AGR1.137 exhibit comparable yet distinct binding modes, engaging with CXCR4 within a molecular cavity formed by TMV and TMVI. AGR1.135 binds to CXCR4 through three hydrogen bonds, two on the apical side of the compound that interact with residues TMV-G207 and TMVI-Y256 and one on the basal side that interacts with TMVI-R235 (Fig. 5A). This results in a more extended and rigid conformation when sharing hydrogen bonds, with both TMs occupying a surface area of 400 Å² and a length of 20 Å in the cleft between TMV and TMVI (Supplementary Fig. 8A). AGR1.137 exhibits a distinct binding profile, interacting with a more internal region of the receptor. This interaction involves the formation of a hydrogen bond with TMIII-V124, which induces a conformational shift in the TMVI helix towards an active conformation (Fig. 5B; Supplementary Fig. 13). Moreover, AGR1.137 may utilize the carboxyl group of V124 in TMIII and overlap with AGR1.135 binding in the cavity, interacting with the other 19 residues dispersed between TMV and TMVI to create an interaction surface of 370 Å² along 20 Å (Supplementary Fig. 8B). This is illustrated in the new Fig. 5B. AGR1.137 lacks the phenyl ring present in AGR1.135, resulting in a shorter compound with greater difficulty in reaching the lower part of TMVI where R235 sits.

      Author response image 4.

      AGR1.135 and AGR1.137 interaction with TMV and TMVI.  The model shows the location of the compounds within the TMV-VI cleft, illustrated by a ribbon and stick representation. The CXCR4 segments of TMV and TMVI are represented in blue and pink ribbons respectively, and side chains for some of the residues defining the cavity are shown in sticks. AGR1.135 and AGR1.137 are shown in stick representation with carbon in yellow, nitrogen in blue, oxygen in red, and fluorine in green. Hydrogen bonds are indicated by dashed black lines, while hydrophobic interactions are shown in green. The figure reproduces the panels A, B of Fig. 5 in the revised manuscript.

      (35) In the Results section regarding "AGR1.137 treatment in a zebrafish xenograft model", the following points can be refined for clarity and completeness: 1. Cell Line Choice for Zebrafish Xenograft Model:

      -Explain the rationale behind the choice of HeLa cells for the zebrafish xenograft model when the previous experiments primarily focused on Jurkat cells. Address any specific biological or experimental considerations that influenced this decision.

      As far as we know, there are no available models of tumors in zebrafish using Jurkat cells. We looked for a tumoral cell system that expresses CXCR4 and could be transplanted into zebrafish. HeLa cells are derived from a human cervical tumor, express a functional CXCR4, and have been previously used for tumorigenesis analyses in zebrafish (Brown H.K. et al. Expert Opin. Drug Discover. 2017; You Y. et al. Front. Pharmacol. 2020). These cells grow in the fish and disseminate through the ventral area and can be used to determine primary tumor growth and metastasis. Nonetheless, we first analyzed in vitro the expression of a functional CXCR4 in these cells (Supplementary Fig. 10A), whether AGR1.137 treatment specifically abrogated CXCL12-mediated directed cell migration (Fig. 7A, B), and whether it affected cell proliferation (Supplementary Fig. 10B). As HeLa cells reproduce the in vitro effects detected for the compounds in Jurkat cells, we used this model in zebrafish. These issues were already discussed in the first version of our manuscript.

      (36) 2. Toxicity Assessment in Zebrafish Embryos: 

      -Clarify the basis for stating that AGR1.137 is not toxic to zebrafish embryos. Consider referencing the Zebrafish Embryo Acute Toxicity Test (ZFET) and provide relevant data on lethal concentration (LC50) and non-lethal toxic phenotypes such as pericardial edema, head and tail necrosis, malformation, brain hemorrhage, or yolk sac edema.

      Tumor growth and metastasis kinetics within the zebrafish model have been extensively evaluated in many publications (White R. et al. Nat. Rev. Cancer. 2013; Astell K.R. and Sieger D. Cold Spring Harb. Perspect. Med. 2020; Chen X. et al. Front. Cell Dev. Biol. 2021; Weiss JM. et al. eLife 2022; Lindhal G. et al. NPJ Precis. Oncol. 2024). Our previous experience using this model shows that tumors start having a more pronounced proliferation and lower degree of apoptosis from day 4 onwards, but we cannot keep the tumor-bearing larvae for that long for ethical reasons and also because we do not see much scientific benefit in unnecessarily extending the experiments. Anti-proliferative or pro-apoptotic effects of drugs can still be observed within the three days, even if this is then commonly seen as a larger reduction (instead of a smaller growth, as is commonly seen in, for example, mouse tumor models) compared to controls. Initially we characterized the evolution of implanted tumors in our system and how much they metastasize over time in the absence of treatment before testing the compounds (Author response image 5).

      The in vivo experiments were planned to validate efficacious concentrations of the investigated drugs rather than to derive in vivo IC50 or other values, which require testing of multiple doses. We have, however, included an additional concentration to show concentration-dependence and therefore on-target specificity of the drugs in the revised version of the manuscript (data also being elaborated in ongoing experiments). At this stage, we believe that adding the LC50 does not provide interesting new knowledge, and it is standard to only show results from the experimental endpoint (in our case 3 days post implantation). We agree that showing these new data points strengthens the manuscript and facilitates independent evaluation and conclusions to be drawn from the presented data. We have created new graphs where datapoints for each compound dose are shown.  

      Author response image 5.

      Evolution of the tumors and metastasis over time in the absence of any treatment. HeLa cells were labeled with 8 µg/mL Fast-DiI™ oil and then implanted in the dorsal perivitelline space of 2-day-old zebrafish embryos. Tumors were imaged within 2 hours of implantation and re-imaged every 24 h for three days. Changes in tumor size were evaluated as tumor area at day 1, 2 and 3 divided by tumor area at day 0, and metastasis was evaluated as the number of cells disseminated to the caudal hematopoietic plexus at day 1, 2 and 3 divided by the number of cells at day 3.
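
      The normalization described in this legend can be sketched as follows. This is a minimal illustration only: the function name and all numerical values are hypothetical and are not data from the study.

```python
from typing import Sequence


def normalize(values: Sequence[float], ref_index: int) -> list[float]:
    """Divide every value in a time series by the value at ref_index."""
    ref = values[ref_index]
    if ref <= 0:
        raise ValueError("reference value must be positive")
    return [v / ref for v in values]


# Hypothetical tumor areas (arbitrary units) at days 0, 1, 2 and 3.
areas = [2.0, 3.0, 4.0, 5.0]
# Relative tumor size: area at each day divided by the area at day 0.
relative_size = normalize(areas, ref_index=0)[1:]
print(relative_size)  # [1.5, 2.0, 2.5]

# Hypothetical counts of disseminated cells at days 1, 2 and 3,
# normalized to the day-3 count as stated in the legend.
counts = [5, 10, 20]
relative_metastasis = normalize(counts, ref_index=2)
print(relative_metastasis)  # [0.25, 0.5, 1.0]
```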

      Regarding the statement that AGR1.137 was not toxic, this was based on visual inspection of the zebrafish larvae at the end of the experiment, which also revealed a lack of drug-related mortality in these experiments. There are a number of differences in how our experiment was run compared with the standardized ZFET. ZFET evaluates toxicity from 0 hours post-fertilization to 1 or 2 days post-fertilization, whereas here we exposed zebrafish from 2 days post-fertilization to 5 days post-fertilization. The ZFET furthermore requires that the embryos are raised at 26ºC whereas we kept the temperature as close as possible to a physiologically relevant temperature for the tumor cells (36ºC). In the ZFET, embryos are incubated in 96-well plates whereas for our studies we required larger wells to be able to manipulate the larvae and avoid well edge-related imaging artefacts, and we therefore used 24-well plates. As such, the ZFET was for various reasons not applicable to our experimental settings. As we were not interested in rigorously determining the LD50 or other toxicity-related measurements, as our focus was instead on efficacy and we found that the targeted dose was tolerated, we did not evaluate multiple doses, including lethal doses of the drug, and are therefore not able to determine an LD50/LC50. We also did not find drug-induced non-lethal toxic phenotypes in this study, and so we cannot elaborate further on such phenotypes other than to simply state that the drug is well tolerated at the given doses. Therefore, the reference to ZFET in the manuscript was eliminated.

      (37) If supplementary information is available, consider providing it for a comprehensive understanding of toxicity assessments. 

      The effective concentration used in the zebrafish study was derived from the in vitro experiments. That being said, and as elaborated in our response to comment 36, we have added data for one additional dose to show the dose-dependent regulation of tumor growth and metastasis. 

      (38) 3. Optimization and Development of AGR1.137: 

      -Justify the need for further optimization and development of AGR1.137 if it has a comparable effect to AMD3100. Explain the specific advantages or improvements that AGR1.137 may offer over AMD3100. 

      AGR1.137 is highly hydrophobic and is very difficult to handle, particularly in in vivo assays; thus, for the negative allosteric modulators to be used clinically, it would be very important to increase their solubility in water. Contrastingly, AMD3100 is a water-soluble compound. Before using the zebrafish model, we performed several experiments in mice using AGR1.137, but the inhibitory results were highly variable, probably due to its hydrophobicity. We also believe that it would be important to increase the affinity of AGR1.137 for CXCR4, as the use of lower concentrations of the negative allosteric modulator would limit potential in vivo side effects of the drug. On the other hand, we are also evaluating distinct administration alternatives, including encapsulation of the compounds in different vehicles. These alternatives may also require modifications of the compounds. 

      AMD3100 is an orthosteric inhibitor and therefore blocks all the signaling cascades triggered by CXCL12. For instance, we observed that AMD3100 treatment blocked CXCL12 binding, cAMP inhibition, calcium flux, cell adhesion and cell migration (Fig. 3, Fig. 4), whereas the effects of AGR1.137 were restricted to CXCL12-mediated directed cell migration. Although AMD3100 was well tolerated by healthy volunteers in a single-dose study, it also promoted some mild and reversible events, including white blood cell count elevations and variations of urine calcium just beyond the reported normal range (Hendrix C.W. et al. Antimicrob. Agents Chemother. 2000). To treat viral infections, continuous daily dosing requirements of AMD3100 were impractical due to severe side effects including cardiac arrhythmias (De Clercq E. Front Immunol. 2015). For AMD3100 to be used clinically, it would be critical to control the timing of administration. In addition, side effects after long-term administration are a potential problem. Shorter-term usage and lower doses would be fundamental keys to its success in clinical use (Liu T.Y. et al. Exp. Hematol. Oncol. 2016). The use of a negative allosteric modulator that blocks cell migration but does not affect other signaling pathways triggered by CXCL12 would be, at least in theory, more specific and produce fewer side effects. These ideas have been incorporated into the revised discussion to reflect potential advantages or improvements that AGR1.137 may offer over AMD3100.

      (39) 4. Discrepancy in AGR1.137 and AMD3100 Effects:

      -Discuss the observed discrepancy where AGR1.137 exhibits similar effects to AMD3100 but only after 48 hours. Provide insights into the temporal dynamics of their actions and potential implications for the experimental design.

      Images and data shown in Fig. 7E, F correspond to days 0 and 3 after HeLa cell implantation (tumorigenesis) and only to day 3 in the case of metastasis data. The revised version contains the effect of two distinct doses of the compounds (10 and 50 µM, for AGR1.135 and AGR1.137 and 1 and 10 µM for AMD3100). 

      (40) In the "Discussion" section, there are several points that require clarification and refinement to enhance the overall coherence and depth of the analysis: 1. Reduction of Side-Effects:

      -Provide a more detailed explanation of how the identified compounds, specifically AGR1.135 and AGR1.137, contribute to the reduction of side effects. Consider discussing specific mechanisms or characteristics that differentiate these compounds from existing antagonists.

      The sentence indicating that AGR1.135 and AGR1.137 contribute to reduce side effects is entirely speculative, as we have no experimental evidence to support it. We have therefore corrected this in the revised version. The origin of the sentence was that orthosteric antagonists typically bind to the same site as the endogenous ligand, thus blocking its interaction with the receptor. Therefore, orthosteric inhibitors (i.e. AMD3100) block all signaling cascades triggered by the ligand and therefore their functional consequences. However, the compounds described in this project are essentially negative allosteric modulators, that is, they bind to a site distinct from the orthosteric site, inducing a conformational change in the receptor that does not alter the binding of the endogenous ligand, and therefore block some specific receptor-associated functions without altering others. We observed that AGR1.137 blocked receptor oligomerization and directed cell migration whereas CXCL12 still bound CXCR4, triggered calcium mobilization, did not inhibit cAMP release or promoted receptor internalization. This is why we speculated on the limitation of side effects. The statements have been nonetheless revised in the new version of the manuscript.

      (41) 2. Binding Site Clarification:

      -Address the apparent discrepancy between docking the small compounds in a narrow cleft formed by TMV and TMVI helices and the statement that AGR1.131 binds elsewhere. Clarify the rationale behind this assertion

      After the in silico screening, a total of 40 compounds were selected. These compounds showed distinct degrees of interaction with the cleft formed by TMV and TMVI and even with other potential interaction sites on CXCR4, with the exception of the ligand binding site according to the data described by Wescott et al. (PNAS 2016 113:9928-9933), as this possibility was discarded in the initial approach of the in silico screening. According to PELE analysis, AGR1.131 was one of the 40 selected compounds that showed a pose with low binding energy, -39.8 kcal/mol, between TMV and TMVI helices, that is, it might interact with CXCR4 through the selected area for the screening. It nonetheless also showed a best pose placed between helices TMI and TMVII, -43.7 kcal/mol. In any case, the compound was included in the biological screening, where it was unable to impact CXCL12-mediated chemotaxis (Fig. 1B). We then focused on AGR1.135 and AGR1.137, as they showed a higher inhibitory effect on CXCL12-mediated migration, and on AGR1.131 as an internal negative control. AGR1.131 has a skeleton very similar to the other compounds (Fig. 1C) and can interact with the TM domains of CXCR4 without promoting effects. None of the three compounds affected CXCL12 binding, or CXCL12-mediated inhibition of cAMP release, or receptor internalization. However, whereas AGR1.135 and AGR1.137 blocked CXCL12-mediated CXCR4 oligomerization and directed cell migration towards CXCL12 gradients, AGR1.131 had no effect in these experiments (Fig. 3, Fig. 4).

      Next, we performed additional theoretical calculations (PELE, docking, MD) to inspect in detail the potential binding modes of active and inactive molecules. Based on these additional calculations, we identified that whereas AGR1.135 and AGR1.137 showed preferential binding to the molecular pocket between TMV and TMVI, the best pose for AGR1.131 was located between TMI and TMVII, as the initial experiments indicated. These observations and data have been clarified in the revised discussion.

      (42) 3. Impact of Chemical Modifications:

      -Discuss the consequences of the distinct chemical groups in AGR1.135, AGR1.137, and AGR1.131, specifically addressing how variations in amine length and chemical nature may influence binding affinity and biological activity. Provide insights into the potential effects of these modifications on cellular responses and the observed outcomes in zebrafish. 

      The main difference between AGR1.131 and the other two compounds is the higher flexibility of AGR1.131 due to the additional CH2 linker, together with the lack of a piperazine ring. The additional CH2 linking the phenyl ring increases the flexibility of AGR1.131 when compared with AGR1.135 and AGR1.137, and the absence of the piperazine ring might be responsible for its lack of activity, as it makes this compound able to bind to CXCR4 (Fig. 1C).

      AGR1.137 was chosen in a second round. The additional presence of the tertiary amine (in the piperazine ring) allows the formation of quaternary ammonium salts in the aqueous medium and its substituents to increase its solubility (Fig 1C). This characteristic might be related to the absence of toxic effects of the compound in the zebrafish model.

      (43) 4. Existence of Distinct CXCR4 Conformational States: 

      -Provide more detailed support for the statement suggesting the "existence of distinct CXCR4 conformational states" responsible for activating different signaling pathways. Consider referencing relevant studies or experiments that support this claim.

      Classical models of GPCR allostery and activation, which describe an equilibrium between a single inactive and a single signaling-competent active conformation, cannot account for the complex pharmacology of these receptors. The emerging view is that GPCRs are highly dynamic proteins, and ligands with varying pharmacological properties differentially modulate the balance between multiple conformations.

      Just as a single photograph from one angle cannot capture all aspects of an object in movement, no one biophysical method can visualize all aspects of GPCR activation. In general, there is a tradeoff between high-resolution information on the entire protein versus dynamic information on limited regions. In the former category, crystal and cryo-electron microscopy (cryoEM) structures have provided comprehensive, atomic-resolution snapshots of scores of GPCRs both in inactive and active conformations, revealing conserved conformational changes associated with activation. However, different GPCRs vary considerably in the magnitude and nature of the conformational changes in the orthosteric ligand-binding site following agonist binding (Venkatakrishnan A.J.V. et al. Nature 2016). Spectroscopic and computational approaches provide complementary information, highlighting the role of conformational dynamics in GPCR activation (Latorraca N.R.V. et al. Chem. Rev 2017). In the absence of agonists, the receptor population is typically dominated by conformations closely related to those observed in inactive-state crystal structures (Manglik A. et al. Cell 2015). While agonist binding drives the receptor population towards conformations similar to those in active-state structures, a mixture of inactive and active conformations remains, reflecting “loose” or incomplete allosteric coupling between the orthosteric and transducer pockets (Dror R.O. et al. Proc. Natl. Acad. Sci. USA 2011). Surprisingly, for some GPCRs, and under some experimental conditions, a substantial fraction of unliganded receptors already reside in an active-like conformation, which may be related to their level of basal or constitutive signaling (Staus D.P. et al. J. Biol. Chem. 2019; Ye L. et al. Nature 2016). In our case, the negative allosteric modulators did not alter ligand binding and had only minor effects on specific CXCL12-mediated functions such as inhibition of cAMP release or receptor internalization, among others, but failed to regulate CXCL12-mediated actin dynamics and receptor oligomerization. Collectively, these data suggest that the described compounds alter the active conformation of CXCR4 and therefore support the presence of distinct receptor conformations that explain a partial activation of the signaling cascade.

      All these observations are now included in the revised discussion of the manuscript.

      (44) 5. Equilibrium Shift and Allosteric Ligands: 

      -Clarify the statement about "allosteric ligands shifting the equilibrium to favor a particular receptor conformation". Support this suggestion with references or experimental evidence

      In a previous answer (see our response to point 2), we explain why we define the compounds as negative allosteric modulators. These compounds do not bind the orthosteric binding site or a site distinct from the orthosteric site that alters the ligand-binding site. Their effect should be due to changes in the active conformation of CXCR4, which allow some signaling events whereas others are blocked. Our functional data thus support that through the same receptor the compounds separate distinct receptor-mediated signaling cascades, that is, our data suggest that CXCR4 has a conformational heterogeneity. It is known that GPCRs exhibit more than one “inactive” and “active” conformation, and the endogenous agonists stabilize a mixture of multiple conformations. Biased ligands or allosteric modulators can achieve their distinctive signaling profiles by modulating this distribution of receptor conformations (Wingler L.M. & Lefkowitz R.J. Trends Cell Biol. 2020). For instance, some analogs of angiotensin II do not appreciably activate Gq signaling (e.g., increases in IP3 and Ca2+) but still induce receptor phosphorylation, internalization, and mitogen-activated protein kinase (MAPK) signaling (Wei H. et al. Proc. Natl. Acad. Sci. USA 2003). Some of these ligands activate Gi and G12 in bioluminescence resonance energy transfer (BRET) experiments (Namkung Y. et al. Sci. Signal. 2018). A similar observation was described in the case of CCR5, where some chemokine analogs promoted G protein subtype-specific signaling bias (Lorenzen E. et al. Sci. Signal 2018). Structural analyses of distinct GPCRs in the presence of different ligands show that the magnitude and nature of the conformational changes in the orthosteric ligand-binding site following agonist binding vary considerably (Venkatakrishnan A.J.V. et al. Nature 2016). Yet, these changes modify conserved motifs in the interior of the receptor core and induce common conformational changes in the intracellular site involved in signal transduction. That is, these modifications might be considered distinct receptor conformations.

      The revised discussion contains some of these interpretations to support our statement about the stabilization of a particular receptor conformation triggered by the negative allosteric modulators. 

      (45) 6. Refinement of Binding Mode: 

      -Clarify the workflow for obtaining the binding mode, particularly the role of GLIDE and PELE. Clearly explain how these software tools were used in tandem to refine the binding mode. 

      The sequential computational workflow applied in this project included: i) protein model construction, ii) virtual screening (Glide), iii) PELE, iv) docking (AutoDock and Glide) and v) molecular dynamics (AMBER).

      Glide was applied for the structure-based virtual screening to explore which compounds could fit and interact with the previously selected binding site.

      After the identification of theoretically active compounds (modulators of CXCR4), additional calculations were done to identify a potential binding site. PELE was used for this purpose, to study how the compounds could bind over the whole surface of the target (TMV-TMVI). By applying PELE, we avoided biasing the calculation, and we found that the trajectories with better interaction energies identified the cleft between TMV and TMVI as the binding site for AGR1.135 and AGR1.137, and not for AGR1.131. AGR1.131 showed a pose with low binding energy, -39.8 kcal/mol, between TMV and TMVI helices, that is, it might interact with CXCR4 in the selected area for the screening. But it also showed a better pose placed between helices TMI and TMVII, -43.7 kcal/mol (see our response to point 41). These data have now been confirmed using Schrödinger’s MM-GBSA procedure (see our response to points 6 and 8). In any case, the compound was included in the biological screening, where it was unable to affect CXCL12-mediated chemotaxis (Fig. 1B). Docking and MD simulations were then performed to study and refine the specific binding mode in this cavity. These data were important for choosing the CXCR4 mutations required to test whether the compounds reversed their behavior. In these experiments we also confirmed that AGR1.131 had a better pose on the TMI-TMVII region.

      (46) 7. Impact of Compound Differences on CXCR4-F249L mutant: 

      -Provide visual aids, such as figures, and additional experiments to support the statement about differences in the behavior of AGR1.135 and AGR1.137 on cells expressing CXCR4-F249L mutant. Elaborate on the closer interaction suggested between the triazole group of AGR1.137 and the F249 residue

      At the reviewer’s suggestion, Fig. 5 has been modified to incorporate a closer view of the interactions identified and new panels in new Fig. 6 have been added to show in detail the effect of the mutations selected on the structure of the cleft between TMV and TMVI. The main difference between AGR1.135 and AGR1.137 is how the triazole group interacts with F249 and L216 (Author response image 6). In AGR1.137, the three groups are aligned in a parallel organization, which appears to be more effective. This might be due to a better adaptation of this compound to the cleft since there is only one hydrogen bond with V124. In AGR1.135, the compound interacts with the phenyl ring of F249 and has a stronger interaction at the apical edge to stabilize its position in the cleft. However, there is still an additional interaction present. When changing F249 to L (Fig. VIIA, B, only for review purposes) and showing the two most likely rotamers resulting from the mutation, it is observed that rotamer B is in close proximity to the compound, which may cause the binding to either displace or adopt an alternative conformation that is easier to bind into the cleft. As previously mentioned, it is likely that AGR1.135 can displace the mutant rotamer and bind into the cleft more easily due to its higher affinity.

      Author response image 6.

      Cartoon representation of the interaction of CXCR4 F249L mutant with AGR1.135 (A) and AGR1.137 (B). The two most probable conformations of Leucine rotamers are represented in cyan A and B conformations. Van der Waals interactions are depicted in cyan dashed lines, hydrogen bonds in black dashed lines. CXCR4 segments of TMV and TMVI are colored in blue and pink, respectively.

      (47) In the "Materials and Methods" section, the computational approach for the "discovery of CXCR4 modulators" requires significant revision and clarification. The following suggestions aim to address the identified issues: 1. Structural Modeling: 

      -Reconsider the use of SWISS-MODEL if there is an available PDB code for the entire CXCR4 structure. Clearly articulate the rationale for choosing one method over the other and explain any limitations associated with the selected approach. 

      The SWISS-MODEL server allows automated comparative modeling of 3D protein structures and pioneered the field of automated modeling. At the time we started this project, it was the most accurate method to generate reliable 3D protein structure models.

      As explained above, we have now predicted the structure of the target using AlphaFold (Jumper J. et al, Nature 2021) and performed several additional experiments that confirm that the small compounds bind the selected pocket as the original strategy indicated (see our response to point 6). (Fig. II, only for review purposes).

      (48) 2. Parametrization of Small Compounds:

      -Provide a detailed description of the parametrization process for the small compounds used in the study. Specify the force field and parameters employed, considering the obsolescence of AMBER14 and ff14SB. Consider adopting more contemporary force fields and parameterization strategies. 

      When we performed these experiments, some years ago, the force fields applied (ff14SB, AMBER14 used in MD or OPLS2004 in docking with Glide) were well accepted and were gold standards. It is, however, true that the force fields have evolved in the past few years. Moreover, in the case of the MD simulations, to consider the parameters of the ligands that are not contained within the force field, we performed an additional parameterization as a standard methodology. We then generated an ab initio optimization of the ligand geometry, defining as basis sets B3LYP 6-311+g(d), using Gaussian 09, Revision A.02, and then a single point energy calculation of ESP charges, with HF 6-311+g(d) on the optimized structure. As the last step of the parametrization, the antechamber module was used to adapt these charges and additional parameters for MD simulations.

      (49) 3. Treatment of Lipids and Membrane: 

      -Elaborate on how lipids were treated in the system. Clearly describe whether a membrane was included in the simulations and provide details on its composition and structure. Address the role of the membrane in the study and its relevance to the interactions between CXCR4 and small compounds 

      To stabilize CXCR4 and more accurately reproduce the real environment in the MD simulation, the system was embedded in a lipid bilayer using the Membrane Builder tool (Sunhwan J. et al. Biophys. J. 2009) from the CHARMM-GUI server. The membrane was composed of 175 molecules of the phospholipid 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine (POPC) in each leaflet. The protein-membrane complex was solvated with TIP3 water molecules. Chloride ions were added up to a concentration of 0.15 M in water, and sodium ions were added to neutralize the system. This information was previously described in detail.
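
      For orientation, the system size implied by this description can be estimated with the short sketch below. It is a back-of-the-envelope illustration under stated assumptions, not the procedure used by CHARMM-GUI: the water count is hypothetical (it is not reported here), and the ion estimate uses the common approximation that liquid water is about 55.5 M.

```python
POPC_PER_LEAFLET = 175   # from the description above
WATER_MOLARITY = 55.5    # mol/L, approximate molarity of liquid water


def total_lipids(per_leaflet: int = POPC_PER_LEAFLET) -> int:
    """A bilayer has two leaflets."""
    return 2 * per_leaflet


def ion_count(n_water: int, conc_molar: float = 0.15) -> int:
    """Approximate number of ions of one species needed to reach conc_molar."""
    return round(conc_molar * n_water / WATER_MOLARITY)


n_water = 20_000  # hypothetical number of water molecules, for illustration only
print(total_lipids())      # 350
print(ion_count(n_water))  # 54
```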

      (50) 4. Molecular Dynamics Protocol: 

      -Provide a more detailed and coherent explanation of the molecular dynamics protocol. Clarify the specific steps, parameters, and conditions used in the simulations. Ensure that the protocol aligns with established best practices in the field.

      Simulations were calculated on an Asus 1151 h170 LVX-GTX-980Ti workstation, with an Intel Core i7-6500 K Processor (12 M Cache, 3.40 GHz) and 16 GB DDR4 2133 MHz RAM, equipped with a Nvidia GeForce GTX 980Ti available for GPU (Graphics Processing Unit) computations. MD simulations were performed using AMBER14 (Case D.A. et al. AMBER 14, Univ. of California, San Francisco, USA, 2014) with ff14SB (Maier J.A. et al. J. Chem. Theory Comput. 2015) and lipid14 (Dickson C. J. et al. J. Chem. Theory Comput. 2014) force fields in the NPT thermodynamic ensemble (constant pressure and temperature). Minimization was performed using 3500 Steepest Descent steps and 4500 Conjugate Gradient steps three times, firstly considering only hydrogens, next considering only water molecules and ions, and finally minimizing all atoms. Equilibration raised the system temperature from 0 to 300 K at constant volume, with everything but ions and water molecules held fixed. After thermalization, several density equilibration phases were performed. In the production phase, 50 ns MD simulations without position restraints were calculated using a time step of 2 fs. Trajectories of the most interesting poses were extended to 150 ns. All bonds involving hydrogen atoms were constrained with the SHAKE algorithm (Lippert R.A. et al. J. Chem. Phys. 2007). A cutoff of 8 Å was used for the Lennard-Jones interaction and the short-range electrostatic interactions. Berendsen barostat (Berendsen H.J. et al. J. Chem. Phys. 1984) and Langevin thermostat were used to regulate the system pressure and temperature, respectively. All trajectories were processed using CPPTRAJ (Roe D.R. & Cheatham III T.E. J. Chem. Theory Comput. 2013) and visualized with VMD (Visual Molecular Dynamics) (Humphrey W. et al. J. Mol. Graphics. 1996). To reduce the complexity of the data, Principal Component Analysis (PCA) was performed on the trajectories using CPPTRAJ.
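
      As a quick check of the production settings quoted above, the number of integration steps implied by a 2 fs time step can be worked out directly. This is only an arithmetic sketch; the helper name `n_steps` is hypothetical and is not part of AMBER or of the protocol itself.

```python
def n_steps(sim_time_ns, time_step_fs):
    """Number of integration steps needed to cover sim_time_ns at time_step_fs."""
    fs_per_ns = 1_000_000  # 1 ns = 10^6 fs
    return round(sim_time_ns * fs_per_ns / time_step_fs)


# Production runs (50 ns) and extended runs (150 ns), both at 2 fs per step.
production_steps = n_steps(50, 2)   # 25,000,000 steps
extended_steps = n_steps(150, 2)    # 75,000,000 steps
```

      The extended trajectories therefore require three times as many steps as the standard production runs.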

      (51) Consider updating the molecular dynamics protocol to incorporate more contemporary methodologies, considering advancements in simulation techniques and software.

      In our answer to points 6 and 47, we describe why we used the approach based on Swiss-model and PELE analysis and how we have now used AlphaFold and other more contemporary methodologies to confirm that the small compounds bind the selected pocket.

      (52) Figure 1A: 

      •  Consider switching to a cavity representation for CXCL12 to enhance clarity and emphasize the cleft.

      Fig. 1A has been modified to emphasize the cleft.

      (53) Explicitly show the TMV-TMVI cleft in the figure for a more comprehensive visualization. 

      In Fig. 1A we have added an insert to facilitate TMV-TMVI visualization.

      (54) Figure 1B: 

      •  Clearly explain the meaning of the second DMSO barplot to avoid confusion. 

      To clarify this panel, we have modified the figure and the figure legend. Panel B now includes a complete titration of the three compounds analyzed in the manuscript. The first bar shows cell migration in the absence of both treatment with AMD3100 and stimulation with CXCL12. The second bar shows migration in response to CXCL12 in the absence of AMD3100. The third bar shows the effect of AMD3100 on CXCL12-induced migration, as a known control of inhibition of migration. We hope that this new representation of the data is clearer.

      (55) Figure 1C: 

      •  Provide a clear legend explaining the significance of the green shading on the small compounds. 

      The legend for Fig. 1C has been modified according to the reviewer’s suggestion.

      (56) Figure 2: 

      •  Elaborate on the role of fibronectin in the experiment and explain the specific contribution of CD86-AcGFP.

      The ideal situation for TIRF-M determinations is to employ cells on a physiological substrate complemented with or without chemokines. Fibronectin is a substrate widely used in different studies that allows cell adhesion, mimicking a physiological situation. Jurkat cells express alpha4beta1 and alpha5beta1 integrins that mediate adhesion to fibronectin (Seminario M.C. et al. J. Leuk. Biol. 1999).

      Regarding the use of CD86-AcGFP in TIRF-M experiments, we currently determine the number of receptors in individual trajectories of CXCR4 using, as a reference, the MSI value of CD86-AcGFP, which strictly showed a single photobleaching step (Dorsch S. et al. Nat Methods 2009).

      We preferred to use CD86-AcGFP in cells instead of AcGFP on glass, to exclude any potential effect of the different photodynamics exhibited by AcGFP when bound directly to glass. In any case, this issue has been clarified in the revised version.

      (57) Figure 3D: 

      •  Include a plot for the respective band intensity to enhance data presentation 

      The plot showing the band intensity analysis of the experiments shown in Fig. 3D was already included in the original version (see old Supplementary Fig. 3). However, in the revised version, we include these plots in the same figure as panels 3E and 3F.  As a control of inhibition of CXCL12 stimulation, we have also included a new figure (Supplementary Fig. 4) showing the effect of AMD3100 on CXCL12-induced activation of Akt and ERK as analyzed by western blot.

      (58) Consider adding AMD3100 as a control for comparison. 

      In agreement with the reviewer’s suggestion, we have added the effect of AMD3100 in most of the functional experiments performed.

      (59) Figure 4: 

      •  Address the lack of positive controls in Figure 4 and consider their inclusion for a more comprehensive analysis. 

      DMSO bars correspond to the control of the experiment, as they represent the effect of CXCL12 in the absence of any allosteric modulator. As previously described in this point-by-point reply, DMSO bars correspond to the control performed with the solvent with which the small compounds, at maximum concentration, are diluted.  Therefore, they show the effect of the solvent on CXCL12 responses. In any case, and in order to facilitate the comprehension of the figure we have also added the controls in the absence of DMSO to demonstrate that the solvent does not affect CXCL12-mediated functions, together with the effect of the orthosteric inhibitor AMD3100. In addition, we have also included representative images of the effect of the different compounds on CXCL12-induced polarization (Fig. 4C).

      (60) In Figure 4A, carefully assess overlapping error bars and ensure accurate interpretation. If necessary, consider alternative representation. 

      We have tried alternative representations of the data in Fig. 4A, but in all cases the figure was unclear. We believe that the way we represent the data in the original manuscript is the clearest and most appropriate. Nevertheless, we have now included significance values as a table annexed to the figure, as well as the effect of AMD3100, as a control of inhibition.

      (61) Supplementary Figure 1A: 

      •  Improve the clarity of bar plots for better understanding. Consider reordering them from the most significant to the least. 

      This was a good idea, and therefore Supplementary Fig. 1A has been reorganized to improve clarity.

      (62) Supplementary Figure 1C: 

      •  Clarify the rationale behind choosing the 12.5 nM concentration and explain if different concentrations of CXCL12 were tested. 

      In old Supplementary Fig. 1C, we used untreated cells, that is, CXCL12 was not present in the assay. These experiments were performed to test the potential toxicity of DMSO (solvent) or the negative allosteric modulators on Jurkat cells. The 12.5 nM concentration of CXCL12 mentioned in the figure legend applied only to panels A and B, as indicated in the figure legend. We previously optimized this concentration for Jurkat cells using different concentrations of CXCL12 between 5 and 100 nM. Nevertheless, we have reorganized old Supplementary Fig. 1 and clarified the figure legend to avoid misinterpretations (see Supplementary Fig. 1A, B and Supplementary Fig. 2A, B).

      (63) Explain the observed reduction in fluorescence intensity for AGR1.135. 

      The cell cycle analysis has been moved from Supplementary Fig. 1C to a new Supplementary Fig. 2.  It now includes the flow cytometry panels to show fluorescence intensity as a function of the number of cells analyzed (Panel 1A) as well as a table (panel B) with the percentage of cells in each phase of the cell cycle. We believe that the apparent reduction in fluorescence that the reviewer observes is mainly due to the number of events analyzed. However, we have changed the flow cytometry panels for others that are more representative and included a table with the mean of the different results. When we determined the percentage of cells in each cell cycle phase, we observed that it looks very similar in all the experimental conditions. That is, none of the compounds affected any of the cell cycle phases. We have also included the effect of H2O2 and staurosporine as control compounds inducing cell death and cell cycle alteration of Jurkat cells.

      (64) Supplementary Table 1: 

      •  Include a column specifying the scoring for each compound to provide a clear reference for readers. 

      To provide a clear reference for readers, we have now included the inhibitory effect of each compound on Jurkat cell migration in the revised version of this table.

      (65) Minor Points 

      Page 2 - Abstract: Rephrase the first sentence of the abstract to enhance fluidity. 

      Although the entire manuscript was revised by a professional English editor, we appreciate the valuable comments of this reviewer and we have corrected these issues accordingly.

      (66) Page 2 - Abstract: Explicitly define "CXCR4" as "C-X-C chemokine receptor type 4" the first time it appears.

      We have not used C-X-C chemokine receptor type 4 the first time it appears in the abstract. CXCR4 is an acronym normally accepted to identify this chemokine receptor, and it is used as CXCR4 in many articles published in eLife. However, we introduce the complete name the first time it appears in the introduction.

      (67) Page 2 - Abstract: Explicitly define "CXCL12" as "C-X-C motif chemokine 12" the first time it is mentioned. 

      As we have discussed in the previous response, we have not used C-X-C motif chemokine 12 the first time CXCL12 appears in the abstract, as it is a general acronym normally accepted to identify this specific chemokine, even in eLife papers. However, we introduce the complete name the first time it appears in the introduction section.

      (68) Page 2 - Abstract: Explicitly define "TMV and TMVI" upon its first mention.

      The acronym TM has been defined as “Transmembrane” in the revised version.

      (69) Page 2 - Abstract: Review the use of "in silico" in the sentence for accuracy and consider revising if necessary.

      With the term “in silico” we want to refer to those experiments performed on a computer or via computer simulation software. We have carefully reviewed its use in the new version of the manuscript.

      (70) Page 2 - Abstract: Add a comma after "compound" in the sentence, "We identified AGR1.137, a small compound that abolishes...".

      A comma after “compound” has been added in the revised sentence.

      (71) Page 2 - Significance Statement: Rephrase the first sentence of the "Significance Statement" to avoid duplication with the abstract.

      The first sentence of the Significance Statement has been revised to avoid duplication with the abstract. 

      (72) Page 2 - Significance Statement: Break down the lengthy sentence, "Here, we performed in silico analyses..." for better readability. 

      The sentence starting with “Here, we performed in silico analyses…” has been broken down in the revised manuscript.

      (73) Page 2 - Introduction: Replace "Murine studies" with a more specific term for clarity.

      The term “murine studies” is normally used to refer to experimental studies developed in mice. We have nonetheless rephrased the sentence.

      (74) Page 3 - Introduction: Rephrase the sentence for clarity: "Finally, using a zebrafish model, ..."

      The sentence has now been rephrased for clarity.

      (75) Results-AGR1.135 and AGR1.137 block CXCL12-mediated CXCR4 nanoclustering and dynamics: 

      Rephrase the sentence for clarity: "Retreatment with AGR1.135 and AGR1.137, but not with AGR1.131, substantially impaired CXCL12-mediated receptor nanoclustering.”

      The sentence has been rephrased for clarity.

      (76) Results - AGR1.135 and AGR1.137 incompletely abolish CXCR4-mediated responses in Jurkat cells: Clarify the sentence: "In contrast to the effect promoted by AMD3100, a binding-site antagonist of CXCR4..."

      The sentence has been modified for clarity.

      (77) Consider using "orthosteric" instead of "binding-site" antagonist.

      The term orthosteric is now used throughout to refer to a binding site antagonist.

      (78) Discussion: Use the term "in silico" only when necessary.

      We have carefully reviewed the use of “in silico” in the manuscript.

      (79) Discussion: Clarify the sentence: "...not affect neither CXCR2-mediated cell migration...". Confirm if "CXCL12" is intended.

      The sentence refers to the chemokine receptor CXCR2, which binds the chemokine CXCL2. To test the specificity of the compounds for the CXCL12/CXCR4 axis, we evaluated CXCL2-mediated cell migration. The results indicated that the CXCL2/CXCR2 axis was not affected by the negative allosteric modulators, whereas CXCL12-mediated cell migration was blocked. The sentence has been clarified in the new version of the manuscript.

      (80) Figure 4B: Bold the "B" in the figure label for consistency.

      The “B” in Fig. 4B has been bolded.

      Reviewer #2

      (1) Fig 2. The SPT data is sub-optimal in its presentation as well as analysis. Example images should be shown. The analysis and visualization of the data should be reconsidered for improvements. Graphs with several hundreds, in some conditions over 1000 tracks, per condition are very hard to compare. The same (randomly selected representative set) number of data points should be shown for better visualization. Also, more thorough analyses like MSD or autocorrelation functions are lacking - they would allow enhanced overall representation of the data.

      In agreement with the reviewer’s comment, we have modified the representation of Fig. 2. We have carefully read the paper published by Lord et al. (Lord S. J. et al., J. Cell Biol. 2020) and we have applied their recommendations for this type of data. We have also included, as supplementary material, representative videos of the TIRF-M experiments to allow readers to visualize the original images. MSD analyses were performed to determine all D1-4 values. According to Manzo & García-Parajo (Manzo C. & García-Parajo M.F. Rep. Prog. Phys. 2015), due to the finite trajectory length, the MSD curve at large tlag has poor statistics and deviates from linearity. However, the diffusion coefficient (D1-4) can be estimated by fitting the short-tlag region of the MSD plot, which gives a more accurate idea of the behavior of the particles. Accordingly, we show D1-4 values and not MSD data. 

      Due to the space restrictions, it is very difficult to include all the figures generated, but, only for review purposes, we included in this point-by-point reply some representative plots of the MSD values as a function of the time from individual trajectories showing different types of motion obtained in our experiments (Author response image 7).

      Author response image 7.

      Representative MSD plots from individual trajectories of CXCR4-AcGFP showing different types of motion: A) confined, B) Brownian/Free, C) direct transport of CXCR4-AcGFP particles diffusing at the cell membrane detected by SPT-TIRF in resting JKCD4 cells.
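
      The short-lag fit used to obtain D1-4 can be illustrated with a minimal sketch. It assumes 2D trajectories given as lists of (x, y) positions sampled at a constant frame time, and the usual relation MSD = 4·D·t + offset for two-dimensional diffusion. The function names (`msd`, `linear_fit`, `short_lag_diffusion`) are hypothetical and do not correspond to the software used for the analysis in the manuscript.

```python
def msd(track, lag):
    """Time-averaged mean squared displacement of one 2D track at a given frame lag."""
    squared = [
        (x2 - x1) ** 2 + (y2 - y1) ** 2
        for (x1, y1), (x2, y2) in zip(track, track[lag:])
    ]
    return sum(squared) / len(squared)


def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def short_lag_diffusion(track, frame_time, n_lags=4):
    """Estimate D from the first n_lags MSD points (lags 1..n_lags).

    Assumes MSD = 4*D*t + offset (2D motion), so D = slope / 4.
    The track must contain more than n_lags positions.
    """
    lags = range(1, n_lags + 1)
    times = [lag * frame_time for lag in lags]
    values = [msd(track, lag) for lag in lags]
    slope, _intercept = linear_fit(times, values)
    return slope / 4.0
```

      Restricting the fit to the first four lags uses only the part of the MSD curve that is averaged over the most displacement pairs, which is the reason given above for reporting D1-4 instead of full MSD curves.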

      Further analysis, such as the classification based on particle motion, has not been included in this article. This classification uses the moment scaling spectrum (MSS), described by Ewers H. et al. 2005 PNAS, and requires particles with longer trajectories (>50 frames). Only for review purposes, we include a figure showing the percentage of the MSS-based particle motion classification for each condition. As expected, most particles with long trajectories are confined, with a slight increase in the percentage upon CXCL12 stimulation in all conditions, except in cells treated with AGR1.137 (Author response image 8).

      Author response image 8.

      Effects of the negative allosteric modulators on the Types of Motion of CXCR4. Percentage of single trajectories with different types of motion, classified by MSS (DMSO: 58 particles in 59 cells on FN; 314 in 63 cells on FN+CXCL12; AGR1.131: 102 particles in 71 cells on FN; 258 in 69 cells on FN+CXCL12; AGR1.135: 86 particles in 70 cells on FN; 120 in 77 cells on FN+CXCL12; AGR1.137: 47 particles in 66 cells on FN; 74 in 64 cells on FN+CXCL12) n = 3.
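
      For readers unfamiliar with the moment scaling spectrum, the idea behind the MSS-based classification can be sketched as follows. For each moment order p, the p-th moment of the displacement is assumed to scale as a power of the lag time with exponent γ(p); the slope of γ(p) against p is then the MSS slope, with reference values of 0.5 for free diffusion and 1 for ballistic motion, lower values indicating confinement. This is a simplified illustration under those assumptions: the function names, the choice of orders 0 to 6 and the use of one third of the track length as the maximum lag are hypothetical, not the settings used for the classification above.

```python
import math


def moment(track, lag, order):
    """Mean of |displacement|**order over all position pairs separated by `lag` frames."""
    values = [
        math.hypot(x2 - x1, y2 - y1) ** order
        for (x1, y1), (x2, y2) in zip(track, track[lag:])
    ]
    return sum(values) / len(values)


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def mss_slope(track, frame_time, max_order=6):
    """Slope of the moment scaling spectrum for one 2D track.

    Needs a track long enough for at least two lags and a particle that
    actually moves (a zero moment cannot be log-transformed).
    """
    max_lag = len(track) // 3
    lags = range(1, max_lag + 1)
    log_times = [math.log(lag * frame_time) for lag in lags]
    orders = list(range(max_order + 1))
    exponents = []
    for order in orders:
        log_moments = [math.log(moment(track, lag, order)) for lag in lags]
        # Scaling exponent gamma(order) from the log-log fit.
        exponents.append(fit_slope(log_times, log_moments))
    return fit_slope(orders, exponents)
```

      Because higher-order moments are estimated from the same limited set of displacements, the slope becomes unreliable for short tracks, which is consistent with the requirement for trajectories longer than 50 frames mentioned above.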

      (2) Fig 3. The figure legends have inadequate information on concentrations and incubation times used, both for the compounds and other treatments like CXCL12 and forskolin. For the Western blot data, also the quantification should be added to the main figure. The compounds, particularly AGR1.137 seem to lead to augmented stimulation of pAKT and pERK. This should be discussed

      The Fig. 3 legend has been corrected in the revised manuscript. Fig. 3D now contains representative western blots and the densitometry evaluation of these experiments. As the reviewer indicates, the western blot shown also suggests augmented stimulation of pAKT and pERK in cells treated with AGR1.137. However, as shown in the densitometry analysis, no significant differences were noted between the data obtained with each compound. As a control of inhibition of CXCL12 stimulation, we have included a new Supplementary Fig. 4 showing the effect of AMD3100 on CXCL12-induced activation of Akt and ERK as analyzed by western blot.

      (3) Fig. 4 immunofluorescence data on polarization as well as the flow chamber data lack the representative images of the data. The information on the source of the T cells is missing. Not clear if this experiment was done on bilayers or on static surfaces.

      Representative images for the data shown in Figure 4B have been added in the revised figure (Fig. 4C). The experiments in Fig. 4B were performed on static surfaces. As indicated in the material and methods section, primary T cell blasts were added to fibronectin-coated glass slides and were then stimulated or not with CXCL12 (5 min at 37 °C) before being fixed, permeabilized and stained with phalloidin. Primary T cell blasts were generated from PBMCs isolated from buffy coats that were activated in vitro with IL-2 and PHA as indicated in the material and methods section.

      (4) The data largely lacks titration of different concentrations of the compounds. How were the effective concentration and treatment times determined? What happens at higher concentrations? It is important to show, for instance, if the CXCR12 binding gets inhibited at higher concentrations. most experiments were performed with 50 uM, but HeLa cell data with 100 uM. Why and how was this determined? 

      The revised version contains a new panel in Fig. 1B to show a more detailed kinetic analysis with different concentrations (1-100 µM) of the compounds in the migration experiments using Jurkat cells. We chose 50 µM for further studies as it was the concentration that inhibited 50-75% of the ligand-induced cell migration. 

      We have also included the effect of two doses of the compounds (10 and 50 µM) in the zebrafish model as well as AMD3100 (1 and 10 µM) as control (new Fig. 7D, E). Tumors were imaged within 2 hours of implantation and tumor-bearing embryos were treated with either vehicle (DMSO) alone, AGR1.131 or AGR1.137 at 10 and 50 µM or AMD3100 at 1 and 10 µM for three days, followed by re-imaging.

      Regarding the amount of CXCL12 used in these experiments, with the exception of cell migration assays in Transwells, where the optimal concentration was established at 12.5 nM, in all the other experiments the optimal concentration of CXCL12 employed was 50 nM. In the case of the directional cell migration assays, we use 100 nM to create the chemokine gradient in the device. These concentrations have been optimized in previous work from our laboratory using these types of experiments. It should also be noted that in the experiments using lipid bilayers or TIRF-M experiments, CXCL12 is used to coat the plates and therefore it is difficult to determine the real concentration that is retained on the surface after the washing steps performed prior to adding the cells.

      (5) The authors state that they could not detect direct binding of the compounds and the CXCR14. It should be reported what approaches were tried and discussed why this was not possible. 

      We attempted a fluorescence spectroscopy strategy to formally prove the ability of AGR1.135 to bind CXCR4, but this strategy failed because the compound has a yellow color that interfered with the determinations. We also tried a FRET strategy (see Supplementary Fig. 7) and detected a significant increase in FRET efficiency of CXCR4 homodimers in cells treated with AGR1.135; this effect was due to the yellow color of this compound, which interferes with FRET determinations. In the same assays, AGR1.137 did not modify FRET efficiency for CXCR4 homodimers and therefore we cannot assume that AGR1.137 binds to CXCR4. All these data have been considered in the revised discussion.

      (6) The proliferation data in Supplementary Figure 1 lacks controls that affect proliferation and indication of different cell cycle stages. What is the conclusion of this data? More information on the effects of the drug to cell viability would be important.

      Toxicity in Jurkat cells was first determined by propidium iodide incorporation. Some compounds (i.e., AGR1.103 and VSP3.1) were discarded from further analysis as they were toxic for cells. In a deeper analysis of cell toxicity, even if these compounds did not kill the cells, we checked whether they could alter the cell cycle of the cells. New Supplementary Fig. 2 includes a table (panel B) with the percentage of cells in each cell cycle phase, and no differences between any of the treatments tested were detected. 

      Nevertheless, to clarify this issue the revised version of the figure also includes H2O2 and staurosporine stimuli to induce cell death and cell cycle alterations as controls of these assays.

      (7) The flow data in Supplementary Figure 2 should be statistically analysed. 

      Bar graphs corresponding to the old Supplementary Fig. 2 (new Supplementary Fig. 3) are shown in Fig. 3B. We have also incorporated the corresponding statistical analysis to this figure. 

      (8) In general, the authors should revise the figure legends to ensure that critical details are added. 

      We have carefully revised all the figure legends in the new version of the manuscript.

      (9) Bar plots are very poor in showing the heterogeneity of the data. Individual data points should be shown whenever feasible. Superplot-type of representation is strongly advised (https://doi.org/10.1083/jcb.202001064).

      We have carefully read the paper published by Lord et al. (Lord S. J. et al., J. Cell Biol. 2020) and we have applied their recommendations to our TIRF-M data (see revised Fig. 2).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations For The Authors): 

      - The title may not reflect the key finding of the paper. It is well established in the field that the disaggregation process is sensitive to perturbations of the levels of the disaggregating factors.

      We have changed the title to better reflect the major finding of the work, the importance of the NEF during the initiation of disaggregation. The new title is: Early Steps of Protein Disaggregation by Hsp70 Chaperone and Class B J-Domain Proteins are Shaped by Hsp110.

      - Abstract:

      Please note that the phrases "stimulation is much limited with class A JDPs", "limited destabilization of the chaperone complex improves disaggregation", and "tuned proportion between the co-chaperones" are hard to understand. Only after having read the manuscript are the meanings of these phrases accessible.

      The phrases in the abstract were changed (page 1, lines 10-14).

      - The subheading "Sse1 improves aggregate modification by Hsp70" on p. 7 is unclear. What is measured is a decrease in aggregate size dependent on Hsp70-JDP as well as Sse1.

      The subheading was changed to include more precise information, to “Sse1 leads to Hsp70-dependent reduction of aggregate size”.

      - The subheading "Biphasic effects of Sse1 on the Hsp70 disaggregation activity" does not describe the finding clearly; "Biphasic effects" is a term that is hard to understand.

      To avoid phrases that can be understood in many ways, we have changed the subheading to “Hormetic effects of Sse1 in Hsp70 disaggregation activity”.

      - p.5, last line. Hsp110 typo

      The typos have been corrected.

      Reviewer #2 (Recommendations For The Authors):

      (1) The article emphasises multiple times the importance of stoichiometry between the (co-)chaperones. Most figures would benefit from an indication of the used stoichiometry (or all absolute concentrations) to support the points made about the stoichiometry, especially the figures showing titrations of Sse1, Sse1-2, and Sis1 (Fig. 3D, 3E, 4A-C, S2B, S5F, S6A-E).

      Information on protein concentrations has been included in all figure captions.

      (2) The manuscript includes a summary model. While this model is a plausible hypothesis of the mechanism of disaggregation by Hsp70, in particular when viewed with previous data (Wyszkowski et al., 2021), it focuses rather heavily on the potential remodeling of clients by Hsp70, which is not the primary focus of the data presented in this manuscript. More emphasis could be put on the JDP class/ functional specificity observed.

      The model has been changed according to the Reviewer’s comments to better reflect the findings presented in the manuscript (Figure 5).

      (3) The methods section is very brief. I recommend including additional details about reaction conditions (temperature, buffer compositions, protein concentrations) even when previously reported elsewhere to improve the readability of the manuscript. Details regarding the DLS experiments performed are missing.

      More detailed information on the experimental conditions has been added to the Methods section, as well as to figure legends.

      (4) Many experiments incorporate BLI to assess the effect of NEFs on the binding of the Hsp70 and JDP to aggregates. Although appropriate controls are included (no ATP, Hsp70, and JDP only), a control with only Hsp70 and the NEF would be useful to determine to which extent the NEF itself alters the thickness of the (Hsp70-bound) aggregate biolayer.

      The suggested controls were added (Figure 1—figure supplement 1 G) and discussed in the manuscript (page 5, lines 23-24).

      Reviewer #3 (Recommendations For The Authors):

      - The refolding assay makes use of Luciferase denatured in 5 M GdnHCl. These conditions lead to a spontaneous refolding yield of 20% (Figure 3C), which is very high and limits conclusions on the effect of Hsp110 but also JDPs on the refolding process. Typically this assay uses 6 M GdnHCl for Luciferase denaturation and under these conditions, spontaneous refolding of Luciferase is hardly observed (e.g. Laufen et al. PNAS 1999). The authors are therefore asked to repeat key experiments using altered (6M) GdnHCl concentrations.

      We based our experiments assessing luciferase refolding on the publication by Imamoglu et al. (2020), in which the authors, using 5 M GdnHCl for luciferase denaturation, demonstrated that spontaneous and chaperone-assisted luciferase refolding strongly depends on luciferase concentration. In this work, a similar degree of luciferase refolding was reported for the same final luciferase concentration (100 nM) as we used in our experiments (Figure 1—figure supplement 1D). As an additional control, we compared the effects of 5 M and 6 M of GdnHCl during denaturation on luciferase refolding under the same conditions (100 nM, 25 °C, 2 h) and we observed no significant differences (Author response image 1).

      Author response image 1.

      Chaperone-assisted folding of luciferase after denaturation at 5 M or 6 M GdnHCl. Luciferase was denatured in 5 M or 6 M GdnHCl according to the protocol in the Materials and Methods section. Luminescence was monitored alone or after incubation with Ssa1-Sis1 or Ssa1-Ydj1. Chaperones were used at 1 µM concentration. Luciferase activity was measured after 2 hours and normalized to the activity of the native protein. Error bars indicate SD from three repeats.

      - Figure 1B: The authors are asked to provide binding curves for Ssa1/Sse1 (no Sis1) and Sis1/Sse1 (no Ssa1) as controls. Particularly the latter combination is required as direct cooperation between Hsp110 and JDPs has been suggested in the literature (Mattoo et al., JBC 2013).

      We performed the suggested BLI experiment, and the results are presented in the new Figure 1—figure supplement 1 G (page 5, lines 23-24).

      - Figure 1B (and other figure parts showing BLI data): it is unclear how often the BLI experiments have been performed. This should be stated in the figure legend. Can the authors add SDs to the respective curves?

      We added detailed information about the number of replicates to the figure legends. SD bars were added to the BLI results shown in Figures 1-4, apart from the titration results, for which, for the sake of clarity, the three replicates are represented in the plots on the right (Figure 3D). Where fewer than three repeats are presented in the Supplementary Figures, the remaining repeats are provided in the Source Data file, as indicated in the captions of the respective figures.

      - The observation that Hsp110 can interrupt Hsp70 interaction with JDPs is intriguing. Do the authors envision JDP displacement from the aggregate? If so this could be shown in BLI experiments by monitoring the release of fluorescently labeled Sis1 (similar to labeled Ssa1, Fig. S3C). Or will the released JDP immediately rebind to another binding site on the aggregate? The authors should at least discuss the diverse scenarios as they are relevant to the mechanism of protein disaggregation.

      The proposed experiment is challenging due to the transient nature of Sis1 binding to aggregates and the high background observed with the method using fluorescently labelled proteins. The possibility, raised by the reviewer, that chaperones rebind after their release by Hsp110 has been introduced into the Discussion section (pages 12/13, lines 25-4). We speculate that Hsp110 might release an Hsp70 molecule as well as a JDP molecule that had been bound to the aggregate through Hsp70 (Figure 5).

      - Figure 2B: Ssa1/Sis1/Sse1 strongly decreases the size of Luciferase-GFP aggregates. Yet this activity only allows for limited refolding of aggregated Luciferase and the reaction stays largely dependent on Hsp104. How do the authors envision the role of the hexameric disaggregase in this process? Does it act exclusively on small-sized aggregates after Hsp110-dependent fragmentation?

      The question of how Hsp104 acts on Hsp70-processed aggregates is indeed intriguing, and we agree that it should have been discussed more thoroughly. We added to the manuscript the results of the reactivation of luciferase-GFP with and without Hsp104 to emphasize the role of Hsp104 in the recovery of active protein (Figure 2—figure supplement 1A) (page 7, lines 24-27). We propose that aggregate fragmentation by Hsp70-JDPB-Hsp110 increases the effective aggregate surface at which Hsp104 might become engaged. We do not think that Hsp104 acts only on small aggregates; rather, it might simply be more effective when the number of exposed polypeptides is larger. In the cell, where Hsp104 binds to aggregates of various sizes, protein aggregates apparently also need to undergo such Hsp110-boosted pre-processing by Hsp70, based on the finding that Sse1 is not necessary for Hsp104 recruitment to aggregates but is required for Hsp104-dependent disaggregation (Kaimal et al., 2017). We have added a comment on this problem to the Discussion section (pages 11/12, lines 33-4).

      - Page 9: The authors state that the Sse1-2 variant is nearly as effective as Sse1 Wt in stimulating substrate dissociation and refer to published work (Polier et al., 2008). It is unclear how the variant should have Wt-like activity in triggering substrate release although its activity in catalyzing nucleotide exchange is reduced to 5% (both activities are coupled). The observation that high Sse1-2 concentrations do not inhibit protein disaggregation does not necessarily exclude the possibility that high Sse1 WT concentrations inhibit the reaction by overstimulating substrate release. The latter possibility should be considered by the authors and added to the discussion section.

      We agree with the Reviewer that the description of the Sse1-2 variant was misleading, as it lacked a key piece of information: according to the published data (Polier et al., 2008), the Sse1-2 variant reached a nucleotide-exchange activity similar to that of Sse1 WT only at a 10 times higher concentration. We have changed the text (page 9, lines 16-22, page 13, lines 26-28) to avoid confusion, as well as the model in Figure 5, to underline the importance of substrate release as the cause of the Hsp110-dependent inhibition.

      - While similar effects are observed for human class A and class B JDP co-chaperones, they are clearly less pronounced. A mechanistic explanation for the difference between yeast and human chaperones is currently missing and the authors are asked to elaborate on this aspect.

      There are indeed clear differences between the human and yeast systems, especially regarding the dependence on the NEF. Hsc70 has been reported to have a lower rate of ADP release (Dragovic et al., 2006) and thus might rely more on Hsp110 than its yeast ortholog. For the same reason, the strong Hsc70 stimulation by Hsp105 is also observed with class A JDP. We have added a comment on these effects in the Discussion section (page 12, lines 17-21).

      Minor points

      - Figure S1C (right): the disaggregation rate (%GFP/h) is somewhat misleading/confusing as a value of more than 150%/h is determined in the presence of the complete disaggregation system while only approx. 60% GFP is indeed refolded by the system (Figure S1C, left). Showing the rate as %GFP/min seems more rational.

      We changed the units according to the Reviewer’s comment (Figure 1—figure supplement 1A, C).

      - Figure S5B: Only a single data point is shown for Ssa1/Sis1/Sse1.

      We changed the figure to include datapoints from all three repeats (Figure 3—figure supplement 1 B).

      - There are several typos throughout the manuscript. A more careful proofreading is recommended.

      We have corrected the typos.

      Reviewer #1 (Public Review):

      The experiments differ somewhat in regard to the aggregated protein used. For example, in Figure 1A, FFL is used with only limited reactivation (10% reactivated at the last timepoint and the curve is flattening), while in Figure 2B FFL-EGFP is used to monitor microscopically what appears to be complete disaggregation. Does FFL-EGFP behave the same as FFL in assays such as the one in Figure 1A or are there major differences that may impact how the data should be interpreted?

      We added the results of Luc-GFP reactivation (Figure 2—figure supplement 1 B) (discussed on page 7, lines 24-27 of the manuscript), which agree with the results obtained with Luciferase as a substrate (Figure 1—figure supplement 1 B). They clearly show that the Ssa1-Sis1-Sse1-dependent decrease in aggregate size is not associated with the recovery of active protein.

      Reviewer #2 (Public Review):

      Experimental data concerning the class A JDPs should be interpreted with caution. These experiments show very small reactivation activities for luciferase in the range of 0-1% without the addition of Hsp104 and 0-15% with the addition of Hsp104. Moreover, since the assay is based on the recovery of luciferase activity, it conflates two chaperone activities, namely disaggregation and refolding. It is possible that the small degree of reactivation observed for the class A JDP reflects a minor subpopulation of the aggregated species that is particularly easy to disaggregate/refold and may thus not be representative of bulk behaviour.

      Disaggregation by the Hsp70 system can be enhanced by the addition of small heat shock proteins at the step of substrate aggregation (Rampelt et al., 2012). However, sHsps compete with Hsp70 for binding to the aggregate (Żwirowski et al., 2017), and for that reason we decided not to include sHsps in the experiments presented in the manuscript, as it would introduce another level of complexity. As a control, we performed the disaggregation assay with Hsp70 and Ydj1 using luciferase aggregates formed in the presence or absence of sHsp (Author response image 2). In 1 h, the Hsp70 system without Hsp104 yielded 5% of recovered luciferase activity, and the system with Hsp104 yielded 23%, relative to the native protein. The impact of Sse1 on Ssa1-Ydj1 and Ssa1-Ydj1-Hsp104 was similar to that for luciferase aggregates formed without sHsps (Figure 1A, Figure 1—figure supplement 1 B). Furthermore, according to the Reviewer’s comment, we have changed Figure 5 to underscore the more prominent role of class A JDPs in the final protein folding than in disaggregation.

      Author response image 2.

      Disaggregation of heat-aggregated luciferase – impact of sHsps. Luciferase (2 μM) was denatured with (blue) or without (red) Hsp26 (20 μM) at 45 °C for 15 min in buffer A (Materials and Methods). Upon 100-fold dilution with buffer A supplemented with 5 mM ATP, 2 mM DTT, 1.2 μM creatine kinase and 20 mM creatine phosphate, the chaperones indicated in the legend were added to a final concentration of 1 μM, except for Sse1, which was added at 0.1 μM. Shown is luciferase activity measured after 1 h of incubation at 25 °C, normalized to the activity of native luciferase.

      Reviewer #3 (Public Review):

      Enhanced recruitment of Hsp70 in the presence of Hsp110 was shown for amyloid fibrils before (Beton et al., EMBO J 2022) and should be acknowledged. 

      We have added the suggested citation with a respective comment (page 11, lines 20-21).

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      People can perform a wide variety of different tasks, and a long-standing question in cognitive neuroscience is how the properties of different tasks are represented in the brain. The authors develop an interesting task that mixes two different sources of difficulty, and find that the brain appears to represent this mixture on a continuum, in the prefrontal areas involved in resolving task difficulty. While these results are interesting and in several ways compelling, they overlap with previous findings and rely on novel statistical analyses that may require further validation.

      Strengths

      1) The authors present an interesting and novel task for combining the contributions of stimulus-stimulus and stimulus-response conflict. While this mixture has been measured in the multi-source interference task (MSIT), this task provides a more graded mixture between these two sources of difficulty.

      2) The authors do a good job triangulating regions that encode conflict similarity, looking for the conjunction across several different measures of conflict encoding.

      3) The authors quantify several salient alternative hypotheses and systematically distinguish their core results from these alternatives.

      4) The question that the authors tackle is of central theoretical importance to cognitive control, and they make an interesting contribution to this question.

      We would like to thank the reviewer for the positive evaluation of our manuscript and the constructive comments and suggestions. Your feedback has been invaluable in our efforts to enhance the accessibility of our manuscript and strengthen our findings. In response to your suggestion, we reanalyzed our data using the approach proposed by Chen et al. (2017, NeuroImage) and applied stricter multiple comparison correction thresholds in our reporting. This reanalysis largely replicated our previous results, thereby reinforcing the robustness of our findings. We also examined several alternative models, and the results supported the integration of the spatial Stroop and Simon conflicts within the cognitive space. In addition, we enriched the theoretical framework of our manuscript by connecting the cognitive space with other important theories such as the “Expected Value of Control” theory. We have incorporated your feedback, revisions and additional analyses into the manuscript. As a result, we firmly believe that these changes have significantly improved the quality of our work. We have provided detailed responses to your comments below.

      1) It's not entirely clear what the current task can measure that is not known from the MSIT, such as the additive influence of conflict sources in Fu et al. (2022), Science. More could be done to distinguish the benefits of this task from MSIT.

      We agree that the MSIT task incorporates Simon and Eriksen Flanker conflict tasks and can efficiently detect the additivity of conflict effects across orthogonal tasks. Like the MSIT, our task incorporates Simon with spatial Stroop conflicts and can test the same idea. For example, a previous study from our lab (Li et al., 2014) used the combined spatial Stroop-Simon condition with the arrows displayed on diagonal corners and found evidence for the additive hypothesis. However, the MSIT cannot be used to test whether/how different conflicts are parametrically represented in a low-dimensional space, a question that is important to address the debate of domain-general and domain-specific cognitive control.

      To this end, our current study adopted the spatial Stroop-Simon task for the unique purpose of parametrically modulating conflict similarity. As far as we know, there is no way to define the similarity between the combined Simon_Flanker conflict condition and the Simon/Flanker conditions in the MSIT. In contrast, with the spatial Stroop-Simon paradigm, we can define the similarity with the cosine of the angle difference across the two conditions in question.
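
As an illustration of the similarity definition described here, the following is a minimal sketch that computes the cosine of the angular difference between every pair of conflict conditions. The five polar angles and the function name `conflict_similarity` are assumptions made for the example; the response itself does not list the angles used in the task.

```python
import numpy as np

# Hypothetical polar angles (degrees) for five conflict conditions, running
# from one pure conflict type (0) to the other (90). These values are an
# assumption for illustration; the response does not list them.
angles_deg = np.array([0.0, 22.5, 45.0, 67.5, 90.0])


def conflict_similarity(angles_deg):
    """Return the matrix of cos(angle_i - angle_j) for all condition pairs."""
    theta = np.deg2rad(np.asarray(angles_deg, dtype=float))
    # Broadcasting builds every pairwise angular difference in one step.
    return np.cos(theta[:, None] - theta[None, :])


similarity = conflict_similarity(angles_deg)
```

Under this assumed layout, identical conditions have similarity 1 and the two pure conditions, which lie 90 degrees apart, have similarity 0, so the measure varies continuously rather than categorically.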

      We have added the following text to the Discussion to emphasize the difference between our paradigm and those of other studies.

      "The use of an experimental paradigm that permits parametric manipulation of conflict similarity provides a way to systematically investigate the organization of cognitive control, as well as its influence on adaptive behaviors. This approach extends traditional paradigms, such as the multi-source interference task (Fu et al., 2022), color Stroop-Simon task (Liu et al., 2010) and similar paradigms that do not afford a quantifiable metric of conflict source similarity."

      References:

      Li, Q., Nan, W., Wang, K., & Liu, X. (2014). Independent processing of stimulus-stimulus and stimulus-response conflicts. PloS One, 9(2), e89249.

      2) The evidence from this previous work for mixtures between different conflict sources makes the framing of 'infinite possible types of conflict' feel like a strawman. The authors cite classic work (e.g., Kornblum et al., 1990) that develops a typology for conflict which is far from infinite, and I think few people would argue that every possible source of difficulty will have to be learned separately. Such an issue is addressed in theories like 'Expected Value of Control', where optimization of control policies can address unique combinations of task demands.

      The notion that there might be infinite conflicts arises when we consider the quantitative nature of cognitive control. If each Stroop-Simon combination is regarded as a conflict condition, there would be infinitely many combinations, and our major goal is to investigate how these infinite conflict conditions are represented effectively in a space with finite dimensions. We agree that it is unnecessary to dissociate each of these conflict conditions into a unique conflict type, since they may not differ substantially. However, we argue that understanding variant conflicts within a purely categorical framework (e.g., Simon and Flanker conflict in MSIT) is insufficient, especially because it leads to dichotomous conclusions that do not capture how combinations of conflicts are organized in the brain, which is the question our study addresses.

      There could be different perspectives on how our cognitive control system flexibly encodes and resolves multiple conflicts. The cognitive space assumption we hold provides a principle by which we can represent multiple conflicts efficiently in a lower-dimensional space. While the “Expected Value of Control” theory addresses when and how much cognitive control to apply based on control demand, the “cognitive space” view seeks to explain how the conflict, which defines cognitive control demand, is encoded in the brain. Thus, we argue that these two lines of work are different yet complementary. The geometry of the cognitive space of conflict can benefit the adjustment of cognitive control for upcoming conflicts. For example, our brain may evaluate the similarity/distance (and thus cost) between consecutive conflict conditions, and select the path with the best cost-benefit tradeoff to switch from one state to another. This idea is conceptually similar to a recent study by Grahek et al. (2022) demonstrating that more frequently switching states were encoded as closer together than less frequently switching states in a “drift-threshold” space.

      Nevertheless, Grahek et al. (2022) investigated how cognitive control changes based on the expected value of control theory within the same conflict, whereas our study aims to examine the organization of different conflicts.

      We have added the implications of cognitive space view in the discussion to indicate the potential values of our finding to understand the EVC account and the difference between the two theories.

      “Previous researchers have proposed an “expected value of control (EVC)” theory, which posits that the brain can evaluate the cost and benefit associated with executing control for a demanding task, such as the conflict task, and specify the optimal control strength (Shenhav et al., 2013). For instance, Grahek et al. (2022) found that more frequently switching goals when doing a Stroop task were achieved by adjusting smaller control intensity. Our work complements the EVC theory by further investigating the neural representation of different conflict conditions and how these representations can be evaluated to facilitate conflict resolution. We found that different conflict conditions can be efficiently represented in a cognitive space encoded by the right dlPFC, and participants with stronger cognitive space representation have also adjusted their conflict control to a greater extent based on the conflict similarity (Fig 4C). The finding suggests that the cognitive space organization of conflicts guides cognitive control to adjust behavior. Previous studies have shown that participants may adopt different strategies to represent a task, with the model-based strategies benefitting goal-related behaviors more than the model-free strategies (Rmus et al., 2022). Similarly, we propose that cognitive space could serve as a mental model to assist fast learning and efficient organization of cognitive control settings. Specifically, the cognitive space representation may provide a principle for how our brain evaluates the expected cost of switching and the benefit of generalization between states and selects the path with the best cost-benefit tradeoff (Abrahamse et al., 2016; Shenhav et al., 2013). The proximity between two states in cognitive space could reflect both the expected cognitive demand required to transition and the useful mechanisms to adapt from. 
The closer the two conditions are in cognitive space, the lower the expected switching cost and the higher the generalizability when transitioning between them. With the organization of a cognitive space, a new conflict can be quickly assigned a location in the cognitive space, which will facilitate the development of cognitive control settings for this conflict by interpolating nearby conflicts and/or projecting the location to axes representing different cognitive control processes, thus leading to a stronger CSE when following a more similar conflict condition. On the other hand, without a cognitive space, there would be no measure of similarity between conflicts on different trials, hence limiting the ability of fast learning of cognitive control setting from similar trials.”

      Reference:

      Grahek, I., Leng, X., Fahey, M. P., Yee, D., & Shenhav, A. (2022). Empirical and Computational Evidence for Reconfiguration Costs During Within-Task Adjustments in Cognitive Control. CogSci.

      3) Wouldn't a region that represented each conflict source separately still show the same pattern of results? The degree of Stroop vs Simon conflict is perfectly negatively correlated across conditions, so wouldn't a region that just tracks Stroop conflict show these RSA patterns? The authors show that overall congruency is not represented in DLPFC (which is surprising), but they don't break it down by whether this is due to Stroop or Simon congruency (I'm not sure their task allows for this).

      To estimate the unique contributions of the spatial Stroop and Simon conflicts, we performed a model-comparison analysis. We constructed a Stroop-Only model and a Simon-Only model, with each conflict type projected onto the Stroop (vertical) axis or Simon (horizontal) axis, respectively. The similarity between any two conflict types was defined using the Jaccard similarity index (Jaccard, 1901), that is, their intersection divided by their union. By replacing the cognitive space-based conflict similarity regressor with the Stroop-Only and Simon-Only regressors, we calculated their BICs. Results showed that the BIC was larger for Stroop-Only (5377122) and Simon-Only (5377096) than for the Cognitive-Space model (5377094). An additional Stroop+Simon model, including both Stroop-Only and Simon-Only regressors, also showed a poorer model fit (BIC = 5377118) than the Cognitive-Space model. Considering that the pattern of conflict representations is more manifest when the conflict is present (i.e., on incongruent trials) than when it is not (i.e., on congruent trials), we also conducted the model comparison using the incongruent trials only. Results showed that the Stroop-Only (1344128), Simon-Only (1344120), and Stroop+Simon (1344157) models all showed higher BIC values than the Cognitive-Space model (1344104). These results indicate that the right 8C encodes an integrated cognitive space for resolving Stroop and Simon conflicts. Therefore, we believe the cognitive space has incorporated both dimensions. We added these additional analyses and results to the revised manuscript.
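
The model comparison reported in this response can be tabulated with a short sketch that ranks the models by their BIC difference from the best-fitting one. The BIC values are copied from the all-trials analysis above; the helper name `rank_by_bic` is hypothetical, and the sketch only orders the reported numbers rather than refitting any model.

```python
# BIC values for the all-trials analysis, copied from the response above.
bic_all_trials = {
    "Cognitive-Space": 5377094,
    "Stroop-Only": 5377122,
    "Simon-Only": 5377096,
    "Stroop+Simon": 5377118,
}


def rank_by_bic(bic):
    """Return (model, delta_bic) pairs sorted from best (lowest BIC) to worst."""
    best = min(bic.values())
    return sorted(
        ((name, value - best) for name, value in bic.items()),
        key=lambda item: item[1],
    )


ranking = rank_by_bic(bic_all_trials)
for name, delta in ranking:
    print(f"{name}: delta BIC = {delta}")
```

Reporting the differences rather than the raw values makes the size of each gap easier to read, since the absolute BICs share their leading digits.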

      “To examine if the right 8C specifically encodes the cognitive space rather than the domain-general or domain-specific organizations, we tested several additional models (see Methods). Model comparison showed a lower BIC in the Cognitive-Space model (BIC = 5377094) than the Domain-General (BIC = 537127) or Domain-Specific (BIC = 537127) models. Further analysis showed the dimensionality of the representation in the right 8C was 1.19, suggesting the cognitive space was close to 1D. We also tested if the observed conflict similarity effect was driven solely by spatial Stroop or Simon conflicts, and found larger BICs for the models only including the Stroop similarity (i.e., the Stroop-Only model, BIC = 5377122) or Simon similarity (i.e., the Simon-Only model, BIC = 5377096). An additional Stroop+Simon model, including both Stroop-Only and Simon-Only regressors, also showed a worse model fitting (BIC = 5377118). Moreover, we replicated the results with only incongruent trials, considering that the pattern of conflict representations is more manifested when the conflict is present (i.e., on incongruent trials) than not (i.e., on congruent trials). We found a poorer fitting in Domain-General (BIC = 1344129), Domain-Specific (BIC = 1344129), Stroop-Only (BIC = 1344128), Simon-Only (BIC = 1344120), and Stroop+Simon (BIC = 1344157) models than the Cognitive-Space model (BIC = 1344104). These results indicate that the right 8C encodes an integrated cognitive space for resolving Stroop and Simon conflicts. The more detailed model comparison results are listed in Table 2.”

      We reason that we did not observe an overall congruency effect in the RSA results because our definition of congruency here differed from traditional definitions (i.e., contrast between incongruent and congruent conditions). In the congruency regressor of our RSA model, we defined representational similarity as 1 if calculated between two incongruent or two congruent trials, and 0 if between an incongruent and a congruent trial. Thus, our definition of the congruency regressor reflects whether multivariate patterns differ between incongruent and congruent trials, rather than whether activity strengths differ. Indeed, we did observe the latter form of congruency effect, with stronger univariate activities in pre-SMA for incongruent versus congruent conditions. We have added this in Note S6 (“The multivariate representations of conflict type and orientation are different from the congruency effect”):

      “Neither did we observe a multivariate congruency effect (i.e., the pattern difference between incongruent and congruent conditions compared to that within each condition) in the right 8C or any other regions. Note the definition of congruency here differed from traditional definitions (i.e., contrast between activity strength of incongruent and congruent conditions), with which we found stronger univariate activities in pre-SMA for incongruent versus congruent conditions.”
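
The congruency regressor described in this response, coded 1 when two conditions share the same congruency status and 0 otherwise, can be sketched as below. The function name `congruency_regressor` and the four example labels are hypothetical and serve only to show the coding scheme.

```python
import numpy as np


def congruency_regressor(is_incongruent):
    """Return a matrix that is 1 where two conditions share congruency status, else 0."""
    labels = np.asarray(is_incongruent, dtype=bool)
    # Compare every condition's label with every other condition's label.
    return (labels[:, None] == labels[None, :]).astype(int)


# Hypothetical labels for four conditions: two incongruent, then two congruent.
example = congruency_regressor([True, True, False, False])
```

Because the matrix codes only whether two patterns belong to the same congruency class, it tests for a difference in multivariate pattern and carries no information about which class has the stronger mean activity.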

      We could not determine whether the null effect of the congruency regressor was due to Stroop or Simon congruency alone, because congruency levels of the two types always covary. On all trials of the compound conditions (Conf 2-4), whenever the Stroop dimension was incongruent, the Simon dimension was also incongruent, and vice versa for the congruent condition. Thus, the contribution of spatial Stroop or Simon alone to the congruency effect could not be tested using compound conditions. Although we have pure spatial Stroop or Simon conditions, within-Stroop and within-Simon trial pairs constituted only 8% of cells in the representational similarity matrix. This was insufficient to determine whether the null congruency effect was due to solely Stroop or Simon.

      Overall, with the added analysis we found that the data in the right 8C area support conflict representations that are organized based on both Simon and spatial Stroop conflict. Although the current experimental design does not allow us to identify whether the null effect of the congruency regressor was driven by either conflict or both, we clarified that the congruency regressor did not test the conventional congruency effect and the null finding does not contradict previous research.

      Reference:

      Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaudoise Sci Nat(37), 547-579.

      4) The authors use a novel form of RSA that concatenates patterns across conditions, runs and subjects into a giant RSA matrix, which is then used for linear mixed effects analysis. This appears to be necessary because conflict type and visual orientation are perfectly confounded within the subject (although, if I understand, the conflict type x congruence interaction wouldn't have the same concern about visual confounds, which shouldn't depend on congruence). This is an interesting approach but should be better justified, preferably with simulations validating the sensitivity and specificity of this method and comparing it to more standard methods.

      The confound exists for both the conflict type and the conflict type × congruence interaction in our design, since both incongruent and congruent conditions include stimuli from the full orientation space. For example, for the spatial Stroop type, the congruent condition could be either an up arrow at the top or a down arrow at the bottom. Similarly, the incongruent condition could be either an up arrow at the bottom or a down arrow at the top. Therefore, both the congruent and incongruent conditions are perfectly confounded with the orientation.

      We reanalyzed the data using the well-documented approach by Chen et al. (2017, Neuroimage), as suggested by the reviewer. The new analysis replicated our previously reported results (Fig. 4-5, S4-S7). As Chen et al. (2017) have provided abundant simulations to validate this approach, we did not run any further simulations.

      5) A chief concern is that the same pattern contributes to many entries in the DV, which has been addressed in previous work using row-wise and column-wise random effects (Chen et al., 2017, Neuroimage). It would also be informative to know whether the results hold up to removing within-run similarity, which can bias similarity measures (Walther et al., 2016, Neuroimage).

      Thank you for the comment. In our revised manuscript, we followed your suggestion and adopted the approach proposed by Chen et al. (2017). Specifically, we included both the upper and lower triangle of the representational similarity matrix (excluding the diagonal). Moreover, we also removed all the within-subject similarity (thus also excluding the within-run similarity as suggested by Walther et al. (2016)) to minimize the bias of the potentially strong within-subject similarity. In addition, we added both the row-wise and column-wise random effects to capture the dependence of cells within each column and each row, respectively (Chen et al., 2017).

      Results from this approach largely replicated our previous results. The right 8C again showed significant conflict similarity representation, with greater representational strength in incongruent than congruent condition, and positively correlated to behavioral performance. The orientation effect was also identified in the visual (e.g., right V1) and oculomotor (e.g., left FEF) regions.

      We have revised the methodology and the results in the revised manuscript:

      "Representational similarity analysis (RSA).

      For each cortical region, we calculated the Pearson’s correlations between fMRI activity patterns for each run and each subject, yielding a 1400 (20 conditions × 2 runs × 35 participants) × 1400 RSM. The correlations were calculated in a cross-voxel manner using the fMRI activation maps obtained from GLM3 described in the previous section. We excluded within-subject cells from the RSM (thus also excluding the within-run similarity as suggested by Walther et al. (2016)), and the remaining cells were converted into a vector, which was then z-transformed and submitted to a linear mixed effect model as the dependent variable. The linear mixed effect model also included regressors of conflict similarity and orientation similarity. Importantly, conflict similarity was based on how Simon and spatial Stroop conflict are combined and hence was calculated by first rotating all subjects’ stimulus locations to the top-right and bottom-left quadrants, whereas orientation was calculated using original stimulus locations. As a result, the regressors representing conflict similarity and orientation similarity were de-correlated. Similarity between two conditions was measured as the cosine value of the angular difference. Other regressors included a target similarity regressor (i.e., whether the arrow directions were identical), a response similarity regressor (i.e., whether the correct responses were identical), a spatial Stroop distractor regressor (i.e., vertical distance between two stimulus locations), and a Simon distractor regressor (i.e., horizontal distance between two stimulus locations). Additionally, we also included a regressor denoting the similarity of Group (i.e., whether two conditions are within the same subject group, according to the stimulus-response mapping). We also added two regressors including ROI-mean fMRI activations for each condition of the pair to remove the possible uni-voxel influence on the RSM. A last term was the intercept. 
To control the artefact due to dependence of the correlation pairs sharing the same subject, we included crossed random effects (i.e., row-wise and column-wise random effects) for the intercept, conflict similarity, orientation and the group factors (G. Chen et al., 2017)."
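
The cell selection described in the Methods text above can be sketched with a boolean mask over the 1400 × 1400 matrix. The counts of conditions, runs, and participants are taken from the quoted text; the ordering of patterns by subject and all variable names are assumptions made for the example, and the sketch stops before fitting any mixed-effects model.

```python
import numpy as np

n_conditions, n_runs, n_subjects = 20, 2, 35    # values stated in the Methods text
patterns_per_subject = n_conditions * n_runs     # 40 patterns per participant
n_patterns = patterns_per_subject * n_subjects   # 1400 rows and columns in the RSM

# Assumed ordering: all patterns of subject 0 first, then subject 1, and so on.
subject_of_pattern = np.repeat(np.arange(n_subjects), patterns_per_subject)

# Keep a cell only when its row and column come from different subjects.
# This drops the diagonal and every within-run cell as well, while keeping
# both the upper and lower triangles of the cross-subject blocks.
cross_subject = subject_of_pattern[:, None] != subject_of_pattern[None, :]

# Row and column positions of the retained cells; these could serve as the
# grouping labels for row-wise and column-wise (crossed) random effects.
row_index, col_index = np.nonzero(cross_subject)
n_cells = int(cross_subject.sum())
```

Under these assumptions the mask retains 1,904,000 of the 1,960,000 cells, since each of the 35 participants contributes a 40 × 40 within-subject block that is removed.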

      Reference:

      Walther, A., Nili, H., Ejaz, N., Alink, A., Kriegeskorte, N., & Diedrichsen, J. (2016). Reliability of dissimilarity measures for multi-voxel pattern analysis. Neuroimage, 137, 188-200. doi:10.1016/j.neuroimage.2015.12.012

      6) Another concern is the extent to which across-subject similarity will only capture consistent patterns across people, making this analysis very similar to a traditional univariate analysis (and unlike the traditional use of RSA to capture subject-specific patterns).

      With proper normalization, we assume voxels across different subjects should show some consistent localizations, although individual differences can be high. J. Chen et al. (2017) have demonstrated that consistent multi-voxel activation patterns exist across individuals. Previous studies have also successfully applied cross-subject RSA (see review by Freund et al., 2021) and cross-subject decoding approaches (e.g., Jiang et al., 2016; Tusche et al., 2016), so we believe cross-subject RSA should be able to capture distributed activation patterns shared at the group level. We added this argument in the revised manuscript:

      "Previous studies (e.g., J. Chen et al., 2017) have demonstrated that consistent multivoxel activation patterns exist across individuals, and successful applications of cross-subject RSA (see review by Freund, Etzel, et al., 2021) and cross-subject decoding approaches (Jiang et al., 2016; Tusche et al., 2016) have also been reported."

      In the revised manuscript, we also tested whether the representation in right 8C held for within-subject data. We reasoned that the conflict similarity effects identified by cross-subject RSA should be replicable in within-subject data, although the latter is not able to dissociate the conflict similarity effect from the orientation effect. We performed similar RSA for within-subject RSMs, excluding the within-run cells. We replaced the perfectly confounded factors of conflict similarity and orientation with a common factor called similarity_orientation. Other confounding factor pairs were addressed similarly. Results showed a significant effect of similarity_orientation, t(13993) = 3.270, p = .0005, 1-tailed. Given the specific representation of conflict similarity identified by the cross-subject RSA, we believe that the within-subject data of right 8C probably showed similar conflict similarity modulation effects as the cross-subject data, although future research that orthogonalizes conflict type and orientation is needed to fully answer this question. We added this result to the revised manuscript as Note S7.

      "Note S7. The cross-subject RSA captures similar effects with the within-subject RSA

      Considering the variability in voxel-level functional localizations among individuals, one may question whether the cross-subject RSA results were biased by the consistent multi-voxel patterns across subjects, distinct from the more commonly utilized within-subject RSA. We reasoned that the cross-subject RSA should have captured similar effects as the within-subject RSA if we observe the conflict similarity effect in right 8C with the latter analysis. Therefore, we tested whether the representation in right 8C held for within-subject data. Specifically, we performed similar RSA for within-subject RSMs, excluding the within-run cells. We replaced the perfectly confounded factors of conflict similarity and orientation with a common factor called similarity_orientation. Other confounding factor pairs (i.e., target versus response, and Stroop distractor versus Simon distractor) were addressed similarly. Results showed a significant effect of similarity_orientation, t(13993) = 3.270, p = .0005, 1-tailed. Given the specific representation of conflict similarity identified by the cross-subject RSA, the within-subject data of right 8C may show similar conflict similarity modulation effects as the cross-subject data. Further research is needed to fully dissociate the representation of conflict and the representation of visual features such as orientation."

      Reference:

      Chen, J., Leong, Y. C., Honey, C. J., Yong, C. H., Norman, K. A., & Hasson, U. (2017). Shared memories reveal shared structure in neural activity across individuals. Nature Neuroscience, 20(1), 115-125.

      Freund, M. C., Etzel, J. A., & Braver, T. S. (2021). Neural Coding of Cognitive Control: The Representational Similarity Analysis Approach. Trends in Cognitive Sciences, 25(7), 622-638.

      Jiang, J., Summerfield, C., & Egner, T. (2016). Visual Prediction Error Spreads Across Object Features in Human Visual Cortex. J Neurosci, 36(50), 12746-12763.

      Tusche, A., Bockler, A., Kanske, P., Trautwein, F. M., & Singer, T. (2016). Decoding the Charitable Brain: Empathy, Perspective Taking, and Attention Shifts Differentially Predict Altruistic Giving. Journal of Neuroscience, 36(17), 4719-4732.

      7) Finally, the authors should confirm all their results are robust to less liberal methods of multiplicity correction. For univariate analysis, they should report the effects from the standard p < .001 cluster forming threshold for univariate analysis (or TFCE). For multivariate analyses, FDR can be quite liberal. The authors should consider whether their mixed-effects analyses allow for group-level randomization, and consider (relatively powerful) Max-Stat randomization tests (Nichols & Holmes, 2002, Hum Brain Mapp).

      In our revised manuscript, we have corrected the univariate results using the probabilistic TFCE (pTFCE) approach by Spisák et al. (2019). This approach estimates the conditional probability of cluster extent based on Bayes’ rule. Specifically, we applied pTFCE to our univariate results (i.e., the z-maps of our contrasts). This returned enhanced Z-score maps, which were then thresholded based on simulated cluster size thresholds using 3dClustSim. A cluster-forming threshold of p < .001 was employed. Results showed that only the pre-SMA was activated in the incongruent > congruent contrast, and that the right IPS and right dmPFC were activated in the linear Simon modulation effect. Further tests also showed that activation in these regions was not correlated with behavioral performance, uncorrected ps > .28. These results largely replicated our previous results. We have revised the Methods and Results accordingly.

      Methods:

      "Results were corrected with the probabilistic threshold-free cluster enhancement (pTFCE) and then thresholded by the 3dClustSim function in AFNI (Cox & Hyde, 1997) with voxel-wise p < .001 and cluster-wise p < .05, both 1-tailed."

      Results:

      "In the fMRI analysis, we first replicated the classic congruency effect by searching for brain regions showing higher univariate activation in incongruent than congruent conditions (GLM1, see Methods). Consistent with the literature (Botvinick et al., 2004; Fu et al., 2022), this effect was observed in the pre-supplementary motor area (preSMA) (Fig. 3, Table S1). We then tested the encoding of conflict type as a cognitive space by identifying brain regions with activation levels parametrically covarying with the coordinates (i.e., axial angle relative to the horizontal axis) in the hypothesized cognitive space. As shown in Fig. 1B, change in the angle corresponds to change in spatial Stroop and Simon conflicts in opposite directions. Accordingly, we found the right inferior parietal sulcus (IPS) and the right dorsomedial prefrontal cortex (dmPFC) displayed positive correlation between fMRI activation and the Simon conflict (Fig. 3, Fig. S3, Table S1)."

      We appreciate the reviewer’s suggestion to apply the Max-Stat randomization tests (Nichols & Holmes, 2002) for the multivariate analyses. However, the representational similarity matrix was too large (1400×1400) to be tested with a balanced randomization approach (i.e., the Max-Stat), because (1) running even 1,000 randomizations for all ROIs would take a very long time; and (2) the distribution generated from a typical number of randomizations (e.g., 5,000 iterations) would probably be unbalanced, since the full range of possible samples that could be generated by a complete randomization would not be adequately represented. Instead, we adopted a very strict Bonferroni correction of p < 0.0001/360 when reporting the regression results from the RSA. Notably, Chen et al. (2017) have shown that their approach can control the FDR at an acceptable level.
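
      For illustration only, the corrected threshold quoted above can be worked out as follows. This is a minimal sketch rather than the analysis code: the function names and the example p-values are hypothetical, and reading 360 as the number of tests (for example, the number of parcels examined) is an assumption.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def surviving_rois(p_values, alpha=0.0001, n_tests=360):
    """Names of ROIs whose p-value falls below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, n_tests)
    return [name for name, p in p_values.items() if p < threshold]


# Hypothetical p-values, made up purely to exercise the functions.
example = {"roi_a": 1e-9, "roi_b": 5e-5, "roi_c": 2e-7}
print(bonferroni_threshold(0.0001, 360))  # roughly 2.8e-07
print(surviving_rois(example))
```

      With these made-up inputs, an ROI at p = 5e-5 would pass an uncorrected p < .0001 criterion but not the corrected one.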

      Reference:

      Spisák, T., Spisák, Z., Zunhammer, M., Bingel, U., Smith, S., Nichols, T., & Kincses, T. (2019). Probabilistic TFCE: A generalized combination of cluster size and voxel intensity to increase statistical power. NeuroImage, 185, 12-26.

      Chen, G., Taylor, P. A., Shin, Y.-W., Reynolds, R. C., & Cox, R. W. (2017). Untangling the relatedness among correlations, Part II: Inter-subject correlation group analysis through linear mixed-effects modeling. NeuroImage, 147, 825-840.

      Minor concerns:

      8) I appreciate the authors wanting to present the conditions in a theory-agnostic way, but the framing of 5 conflict types was confusing. I think framing the conditions as a mixture of 2 conflict types (Stroop and Simon) makes more sense, especially given the previous work on MSIT.

      We have renamed Types 1-5 as the spatial Stroop, StHSmL, StMSmM, StLSmH, and Simon conditions, respectively. H, L, and M indicate high, low, and medium similarity with the corresponding conflict, respectively. This is also consistent with the naming in our previous work (Yang et al., 2021).

      Reference:

      Yang, G., Xu, H., Li, Z., Nan, W., Wu, H., Li, Q., & Liu, X. (2021). The congruency sequence effect is modulated by the similarity of conflicts. Journal of Experimental Psychology: Learning, Memory, and Cognition, 47(10), 1705-1719.

      9) It would be helpful to have more scaffolding for the key conflict & orientation analyses. A schematic in the main text that outlines these contrasts would be very helpful (e.g. similar to S4).

      We have inserted Figure 7 in the revised manuscript. In this figure, we plotted the schematic of the difference between the conflict similarity and orientation regressors according to their cross-group representational similarity matrices.

      10) Figure 4D could be clearer, both in labeling and figure caption. 'Modeled similarity' could be relabelled to something more informative, like 'conflict type (or mixture) similarity'. Alternatively, it would be helpful to show a summary RDM for region r-8C. For example, breaking it down by just conflict type and congruence.

      We have relabeled the x-axis to “Conflict type similarity” and y-axis to “Neural similarity” for Figure 4D in the revised manuscript.

      We have also added a summary RSM figure in Fig. S5 to show the different similarity patterns between incongruent and congruent conditions.

      11) It may be helpful to connect your work to how people have discussed multiple forms of conflict monitoring and control with respect to target and distractor features e.g., Lindsay & Jacoby, 1994, JEP:HPP; Mante, Sussillo et al., 2013, Nature; Soutschek et al., 2015, JoCN; Jackson et al., 2021, Comm Bio; Ritz & Shenhav, 2022, bioRxiv

      We have added an analysis to examine how cognitive control modulates target and distractor representation. To this end, we selected the left V4, a visual region showing joint representation of target, Stroop distractor and Simon distractor, as the region of interest. We tested whether these representation strengths differed between incongruent and congruent conditions, finding the representation of target was stronger and representations of both distractors were weaker in the incongruent condition. This suggests that cognitive control modulates the stimuli in both directions. We added the results in Note S10 and Fig. S8, and also added discussion of it in “Methodological implications”.

      “Note S10. Cognitive control enhances target representation and suppresses distractor representation

      Using the separability of confounding factors afforded by the cross-subject RSA, we examined how representations of targets and distractors are modulated by cognitive control. The key assumption is that exerting cognitive control may enhance target representation and suppress distractor representation. We hypothesized that stimuli are represented in visual areas, so we chose a visual ROI from the main RSA results showing joint representation of target, spatial Stroop distractor and Simon distractor (p < .005, 1-tailed, uncorrected). Only the left V4 met this criterion. We then tested representations with models similar to the main text for incongruent only trials, congruent only trials, and the incongruent – congruent contrast. The contrast model additionally used interaction between the congruency and target, Stroop distractor and Simon distractor terms. Results showed that in the incongruent condition, when we employ more cognitive control, the target representation was enhanced (t(237990) = 2.59, p = .029, Bonferroni corrected) and both spatial Stroop (t(237990) = –4.18, p < .001, Bonferroni corrected) and Simon (t(237990) = –3.14, p = .005, Bonferroni corrected) distractor representations were suppressed (Fig. S8). These are consistent with the idea that the top-down control modulates the stimuli in both directions (Polk et al., 2008; Ritz & Shenhav, 2022).”

      Discussion:

      “Moreover, the cross-subject RSA provides high sensitivity to the variables of interest and the ability to separate confounding factors. For instance, in addition to dissociating conflict type from orientation, we dissociated target from response, and spatial Stroop distractor from Simon distractor. We further showed cognitive control can both enhance the target representation and suppress the distractor representation (Note S10, Fig. S8), which is in line with previous studies (Polk et al., 2008; Ritz & Shenhav, 2022).”

      12) For future work, I would recommend placing stimuli along the whole circumference, to orthogonalize Stroop and Simon conflict within-subject.

      We thank the reviewer for this highly helpful suggestion. Expanding the conflict conditions to a full conflict space and replicating our current results could provide stronger evidence for the cognitive space view.

      In the revised manuscript, we added this as a possible future design:

      “A possible improvement to our current design would be to include left, right, up, and down arrows presented in a grid formation across four spatially separate quadrants, with each arrow mapped to its own response button. However, one potential confounding factor would be that these conditions have different levels of difficulty (i.e., different magnitude of conflict), which may affect the CSE results and their representational similarity.”

      Reviewer #2:

      Summary, general appraisal

      This study examines the construct of "cognitive spaces" as they relate to neural coding schemes present in response conflict tasks. The authors utilize a novel paradigm, in which subjects must map the direction of a vertically oriented arrow to either a left or right response. Different types of conflict (spatial Stroop, Simon) are parametrically manipulated by varying the spatial location of the arrow (a task-irrelevant feature). The vertical eccentricity of the arrow either agrees or conflicts with the arrow's direction (spatial Stroop), while the horizontal eccentricity of the arrow agrees or conflicts with the side of the response (Simon). A neural coding model is postulated in which the stimuli are embedded in a cognitive space, organized by distances that depend only on the similarity of congruency types (i.e., where conditions with similar relative proportions of spatial-Stroop versus Simon congruency are represented with similar activity patterns). The authors conduct a behavioral and fMRI study to provide evidence for such a representational coding scheme. The behavioral findings replicate the authors' prior work in demonstrating that conflict-related cognitive control adjustments (the congruency sequence effect) show strong modulation as a function of the similarity between conflict types. With the fMRI neural activity data, the authors report univariate analyses that identified activation in left prefrontal and dorsomedial frontal cortex modulated by the amount of Stroop or Simon conflict present, and multivariate representational similarity analyses (RSA) that identified right lateral prefrontal activity encoding conflict similarity and correlated with the behavioral effects of conflict similarity.

      This study tackles an important question regarding how distinct types of conflict, which have been previously shown to elicit independent forms of cognitive control adjustments, might be encoded in the brain within a computationally efficient representational format. The ideas postulated by the authors are interesting ones and the utilized methods are rigorous.

      We would like to express our sincere appreciation for the reviewer’s positive evaluation of our manuscript and the constructive comments and suggestions. Through careful consideration of your feedback, we have endeavored to make our manuscript more accessible to readers and to further strengthen our findings. In response to your suggestion, we reanalyzed our data with the approach proposed by Chen et al. (2017, NeuroImage). This reanalysis largely replicated our previous results, reinforcing the validity of our findings. Additionally, we conducted tests with several alternative models and found that the cognitive space hypothesis best aligns with our observed data. We have incorporated these revisions and additional analyses into the manuscript based on your valuable feedback, and we believe that they have significantly enhanced the quality of our manuscript. We have provided detailed responses to your comments below.

      However, the study has critical limitations that are due to a lack of clarity regarding theoretical hypotheses, serious confounds in the experimental design, and a highly non-standard (and problematic) approach to RSA. Without addressing these issues it is hard to evaluate the contribution of the authors' findings to the computational cognitive neuroscience literature.

      1) The primary theoretical question and its implications are unclear. The paper would greatly benefit from more clearly specifying potential alternative hypotheses and discussing their implications. Consider, for example, the case of parallel conflict monitors. Say that these conflict monitors are separately tuned for Stroop and Simon conflict, and are located within adjacent patches of cortex that are both contained within a single cortical parcel (e.g., as defined by the Glasser atlas used by the authors for analyses). If RSA was conducted on the responses of such a parcel to this task, it seems highly likely that an activation similarity matrix would be observed that is quite similar (if not identical) to the hypothesized one displayed in Figure 1. Yet it would seem like the authors are arguing that the "cognitive space" representation is qualitatively and conceptually distinct from the "parallel monitor" coding scheme. Thus, it seems that the task and analytic approach is not sufficient to disambiguate these different types of coding schemes or neural architectures.

      The authors also discuss a fully domain-general conflict monitor, in which different forms of conflict are encoded within a single dimension. Yet this alternative hypothesis is also not explicitly tested nor discussed in detail. It seems that the experiment was designed to orthogonalize the "domain-general" model from the "cognitive space" model, by attempting to keep the overall conflict uniform across the different stimuli (i.e., in the design, the level of Stroop congruency parametrically trades off with the level of Simon congruency). But in the behavioral results (Fig. S1), the interference effects were found to peak when both Stroop and Simon congruency are present (i.e., Conf 3 and 4), suggesting that the "domain-general" model may not be orthogonal to the "cognitive space" model. One of the key advantages of RSA is that it provides the ability to explicitly formulate, test and compare different coding models to determine which best accounts for the pattern of data. Thus, it would seem critical for the authors to set up the design and analyses so that an explicit model comparison analysis could be conducted, contrasting the domain-general, domain-specific, and cognitive space accounts.

      We appreciate the reviewer pointing out the need to formally test alternative models. In the revised manuscript, we have added and compared a few alternative models, finding the Cognitive-Space model (the one with graded conflict similarity levels as we reported) provided the best fit to our data. Specifically, we tested the following five models against the Cognitive-Space model:

      (1) Domain-General model. This model treats each conflict type as equivalent, so each two conflict types only differ in the magnitude of their conflict. Therefore, we defined the domain-general matrix as the difference in their effects indexed by the group-averaged RT in Experiment 2. Then the z-scored model vector was sign-flipped to reflect similarity instead of distance. This model showed non-significant conflict type effects (t(951989) = 0.92, p = .179) and poorer fit (BIC = 5377126) than the Cognitive-Space model (BIC = 5377094).

      (2) Domain-Specific model. This model treats each conflict type differently, so we used a diagonal matrix, with within-conflict type similarities being 1 and all cross-conflict type similarities being 0. This model also showed non-significant effects (t(951989) = 0.84, p = .201) and poorer fit (BIC = 5377127) than the Cognitive-Space model.

      (3) Stroop-Only model. This model assumes that the right 8C only encodes the spatial Stroop conflict. We projected each conflict type to the Stroop (vertical) axis and calculated the similarity between any two conflict types as the Jaccard similarity index (Jaccard, 1901), that is, their intersection divided by their union. This model also showed non-significant effects (t(951989) = 0.20, p = .423) and poorer fit (BIC = 5377122) than the Cognitive-Space model.

      (4) Simon-Only model. This model assumes that the right 8C only encodes the Simon conflict. We projected each conflict type to the Simon (horizontal) axis and calculated the similarity like the Stroop-Only model. This model showed significant effects (t(951989) = 4.19, p < .001) but still quantitatively poorer fit (BIC = 5377096) than the Cognitive-Space model.

      (5) Stroop+Simon model. This model assumes the spatial Stroop and Simon conflicts are encoded in parallel in the brain, similar to the "parallel monitor" hypothesis suggested by the reviewer. It includes both Stroop-Only and Simon-Only regressors. This model showed a non-significant effect for the Stroop regressor (t(951988) = 0.06, p = .478) and a significant effect for the Simon regressor (t(951988) = 3.30, p < .001), but poorer fit (BIC = 5377118) than the Cognitive-Space model.
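
      The model matrices described in items (2) to (4), together with a graded Cognitive-Space matrix, can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the analysis code: the polar angles assigned to the five conditions, the use of the cosine of the angular difference for the Cognitive-Space model, and the handling of two zero-length projections in `jaccard_1d` are all assumptions made here. The Domain-General matrix is omitted because it depends on the group-averaged RT data.

```python
import math

# Assumed polar angles (degrees from the horizontal, i.e. Simon, axis).
# These coordinates are an illustrative assumption, not the study's values.
ANGLES = {
    "Simon": 0.0,
    "StLSmH": 22.5,
    "StMSmM": 45.0,
    "StHSmL": 67.5,
    "Stroop": 90.0,
}


def jaccard_1d(a, b, eps=1e-12):
    """Jaccard index of the segments [0, a] and [0, b] (overlap / union)."""
    longest = max(a, b)
    if longest < eps:
        # Both projections have zero length; treating them as identical
        # is a convention chosen here, not taken from the source.
        return 1.0
    return min(a, b) / longest


def model_matrices(angles=ANGLES):
    """Return condition names and a dict of model similarity matrices."""
    names = list(angles)
    rad = [math.radians(angles[name]) for name in names]
    idx = range(len(names))
    return names, {
        # Graded similarity: cosine of the angular difference (assumption).
        "cognitive_space": [
            [math.cos(rad[i] - rad[j]) for j in idx] for i in idx
        ],
        # Identity matrix: 1 within a conflict type, 0 across types.
        "domain_specific": [
            [1.0 if i == j else 0.0 for j in idx] for i in idx
        ],
        # Projection onto the vertical (Stroop) axis, compared by Jaccard.
        "stroop_only": [
            [jaccard_1d(math.sin(rad[i]), math.sin(rad[j])) for j in idx]
            for i in idx
        ],
        # Projection onto the horizontal (Simon) axis, compared by Jaccard.
        "simon_only": [
            [jaccard_1d(math.cos(rad[i]), math.cos(rad[j])) for j in idx]
            for i in idx
        ],
    }


names, models = model_matrices()
for label, matrix in models.items():
    print(label)
    for row in matrix:
        print("  " + " ".join(f"{value:5.2f}" for value in row))
```

      In an RSA regression, each matrix would be vectorized and entered as the conflict similarity regressor in turn, with the Stroop+Simon model entering the last two together.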

      “Moreover, we replicated these results with only incongruent trials (i.e., when conflict is present), considering that the pattern of conflict representations is more manifested when the conflict is present (i.e., on incongruent trials) than not (i.e., on congruent trials). We found a poorer fitting in Domain-general (BIC = 1344129), Domain-Specific (BIC = 1344129), Stroop-Only (BIC = 1344128), Simon-Only (BIC = 1344120), and Stroop+Simon (BIC = 1344157) models than the Cognitive-Space model (BIC = 1344104).”

      In summary, these results indicate that the right 8C encodes an integrated cognitive space for resolving Stroop and Simon conflicts. We added the above results to the revised manuscript.

      The above analysis approach was added to the method “Model comparison and representational dimensionality”, and the results were added to the “Multivariate patterns of the right dlPFC encodes the conflict similarity” in the revised manuscript.

      Methods:

      “Model comparison and representational dimensionality

      To estimate if the right 8C specifically encodes the cognitive space, rather than the domain-general or domain-specific structures, we conducted two more RSAs. We replaced the cognitive space-based conflict similarity matrix in the RSA we reported above (hereafter referred to as the Cognitive-Space model) with one of the alternative model matrices, with all other regressors equal. The domain-general model treats each conflict type as equivalent, so each two conflict types only differ in the magnitude of their conflict. Therefore, we defined the domain-general matrix as the difference in their congruency effects indexed by the group-averaged RT in Experiment 2. Then the z-scored model vector was sign-flipped to reflect similarity instead of distance. The domain-specific model treats each conflict type differently, so we used a diagonal matrix, with within-conflict type similarities being 1 and all cross-conflict type similarities being 0.

      Moreover, to examine if the cognitive space is driven solely by the Stroop or Simon conflicts, we tested a spatial Stroop-Only (hereafter referred to as “Stroop-Only”) and a Simon-Only model, with each conflict type projected onto the spatial Stroop (vertical) axis or Simon (horizontal) axis, respectively. The similarity between any two conflict types was defined using the Jaccard similarity index (Jaccard, 1901), that is, their intersection divided by their union. We also included a model assuming the Stroop and Simon dimensions are independently represented in the brain, adding up the Stroop-Only and Simon-Only regressors (hereafter referred to as the Stroop+Simon model). We conducted similar RSAs as reported above, replacing the original conflict similarity regressor with the Stroop-Only, Simon-Only, or both regressors (for the Stroop+Simon model), and then calculated their Bayesian information criteria (BICs).”

      Results:

      “To examine if the right 8C specifically encodes the cognitive space rather than the domain-general or domain-specific organizations, we tested several additional models (see Methods). Model comparison showed a lower BIC in the Cognitive-Space model (BIC = 5377094) than the Domain-General (BIC = 5377126) or Domain-Specific (BIC = 5377127) models. Further analysis showed the dimensionality of the representation in the right 8C was 1.19, suggesting the cognitive space was close to 1D. We also tested if the observed conflict similarity effect was driven solely by spatial Stroop or Simon conflicts, and found larger BICs for the models only including the Stroop similarity (i.e., the Stroop-Only model, BIC = 5377122) or Simon similarity (i.e., the Simon-Only model, BIC = 5377096). An additional Stroop+Simon model, including both Stroop-Only and Simon-Only regressors, also showed a worse model fitting (BIC = 5377118). Moreover, we replicated the results with only incongruent trials, considering that the pattern of conflict representations is more manifested when the conflict is present (i.e., on incongruent trials) than not (i.e., on congruent trials). We found a poorer fitting in Domain-general (BIC = 1344129), Domain-Specific (BIC = 1344129), Stroop-Only (BIC = 1344128), Simon-Only (BIC = 1344120), and Stroop+Simon (BIC = 1344157) models than the Cognitive-Space model (BIC = 1344104). These results indicate that the right 8C encodes an integrated cognitive space for resolving Stroop and Simon conflicts. The more detailed model comparison results are listed in Table 2.”
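
      To make the size of these differences easier to read, the short sketch below subtracts the Cognitive-Space BIC from each alternative, using the BIC values reported for the itemized models (1) to (5) in this response. The dictionary and function names are hypothetical; only the numbers come from the text.

```python
# BIC values as reported for the itemized models in this response.
BIC_ALL_TRIALS = {
    "Cognitive-Space": 5377094,
    "Domain-General": 5377126,
    "Domain-Specific": 5377127,
    "Stroop-Only": 5377122,
    "Simon-Only": 5377096,
    "Stroop+Simon": 5377118,
}
BIC_INCONGRUENT_ONLY = {
    "Cognitive-Space": 1344104,
    "Domain-General": 1344129,
    "Domain-Specific": 1344129,
    "Stroop-Only": 1344128,
    "Simon-Only": 1344120,
    "Stroop+Simon": 1344157,
}


def delta_bic(bics, reference="Cognitive-Space"):
    """BIC of each alternative minus the BIC of the reference model."""
    ref = bics[reference]
    return {
        name: value - ref for name, value in bics.items() if name != reference
    }


for label, table in (
    ("all trials", BIC_ALL_TRIALS),
    ("incongruent only", BIC_INCONGRUENT_ONLY),
):
    print(label)
    ranked = sorted(delta_bic(table).items(), key=lambda item: item[1])
    for name, diff in ranked:
        print(f"  {name}: +{diff}")
```

      Positive differences favor the Cognitive-Space model. In both trial sets the Simon-Only model is the closest competitor, with differences of 2 (all trials) and 16 (incongruent trials only).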

      Reference:

      Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaudoise Sci Nat(37), 547-579.

      2a) Relatedly, the reasoning for the use of the term "cognitive space" is unclear. The mere presence of graded coding for two types of conflict seems to be a low bar for referring to neural activity patterns as encoding a "cognitive space". It is discussed that cognitive spaces/maps allow for flexibility through inference and generalization. But no links were made between these cognitive abilities and the observed representational structure.

      In the revised manuscript, we have clarified that we tested a specific prediction of the cognitive space hypothesis: the geometry of the cognitive space predicts that more similar conflict types will have more similar neural representations, leading to the CSE and RSA patterns tested in this study. These results add to the literature by providing empirical evidence on how different conflict types are encoded in the brain. We agree that this study is not a comprehensive test of the cognitive space hypothesis. Thus, in the revised manuscript we explicitly clarified that this study is a test of the geometry of the cognitive space hypothesis.

      Critically, the cognitive space view holds that the representations of different abstract information are organized continuously and the representational geometry in the cognitive space are determined by the similarity among the represented information (Bellmund et al., 2018).

      "The present study aimed to test the geometry of cognitive space in conflict representation. Specifically, we hypothesize that different types of conflict are represented as points in a cognitive space. Importantly, the distance between the points, which reflects the geometry of the cognitive space, scales with the difference in the sources of the conflicts being represented by the points."

      We have also discussed the limitation of the results and stressed the need for more research to fully test the cognitive space hypothesis.

      “Additionally, our study is not a comprehensive test of the cognitive space hypothesis but aimed primarily to provide original evidence for the geometry of cognitive space in representing conflict information in cognitive control. Future research should examine other aspects of the cognitive space such as its dimensionality, its applicability to other conflict tasks such as the Eriksen flanker task, and its relevance to other cognitive abilities, such as cognitive flexibility and learning.”

      2b) Additionally, no explicit tests of generality (e.g., via cross-condition generalization) were provided.

      To examine the generality of cognitive space across conditions, we conducted a leave-one-out prediction analysis. We used the behavioral data from Experiment 1 for this test because it contains more data than Experiment 2. Specifically, we removed data from one of the five similarity levels (as illustrated by the θs in Fig. 1C) and used the remaining data to perform the same mixed-effect model as reported in the main text (i.e., the two-stage analysis). This yielded one pair of beta coefficients including the similarity regressor and the intercept for each subject, with which we predicted the CSE for the removed similarity level for each subject. We repeated this process for each similarity level once. The predicted results were highly correlated with the original data, with r = .87 for the RT and r = .84 for the ER, ps < .001. We have added this analysis and result to the “Conflict type similarity modulated behavioral congruency sequence effect (CSE)” section.
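
      The logic of this leave-one-level-out procedure can be sketched as follows. This is a simplified illustration, not the reported analysis: it swaps the two-stage mixed-effect model for a per-subject ordinary least-squares line, the function names are hypothetical, the similarity values assume the cosine of the θs, and the data are synthetic and generated only to make the example run.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_one_level_out(similarity, cse_by_subject):
    """Predict each subject's CSE at a held-out similarity level.

    cse_by_subject holds one list per subject, with one CSE value per
    similarity level. Returns (observed, predicted) as flat lists.
    """
    observed, predicted = [], []
    for cse in cse_by_subject:
        for held in range(len(similarity)):
            xs = [s for k, s in enumerate(similarity) if k != held]
            ys = [c for k, c in enumerate(cse) if k != held]
            intercept, slope = fit_line(xs, ys)
            observed.append(cse[held])
            predicted.append(intercept + slope * similarity[held])
    return observed, predicted


def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic data only: 20 simulated subjects with noisy linear CSEs.
random.seed(0)
similarity = [math.cos(math.radians(t)) for t in (90, 67.5, 45, 22.5, 0)]
subjects = []
for _ in range(20):
    b0 = random.gauss(10, 5)
    b1 = random.gauss(30, 10)
    subjects.append([b0 + b1 * s + random.gauss(0, 5) for s in similarity])

observed, predicted = leave_one_level_out(similarity, subjects)
print(round(pearson_r(observed, predicted), 2))
```

      Because each prediction uses only the four remaining levels, a high correlation between predicted and observed values indicates that the similarity modulation generalizes to a level the model has not seen.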

      “Moreover, to test the continuity and generalizability of the similarity modulation, we conducted a leave-one-out prediction analysis. Specifically, we removed data from one of the five similarity levels (as illustrated by the θs in Fig. 1C) and used the remaining data to perform the same mixed-effect model (i.e., the two-stage analysis). This yielded one pair of beta coefficients including the similarity regressor and the intercept for each subject, with which we predicted the CSE for the removed similarity level for each subject. We repeated this process for each similarity level once. The predicted results were highly correlated with the original data, with r = .87 for the RT and r = .84 for the ER, ps < .001."

      2c) Finally, although the design elicits strong CSE effects, it seems somewhat awkward to consider CSE behavioral patterns as a reflection of the kind of abilities supported by a cognitive map (if this is indeed the implication that was intended). In fact, CSE effects are well-modeled by simpler "model-free" associative learning processes, that do not require elaborate representations of abstract structures.

      We argue that the conflict similarity modulation of CSEs we observed cannot be explained by the “model-free” stimulus-driven associative learning process. This mainly refers to the feature integration account proposed by Hommel et al. (2004), which explains poorer performance in CI and IC trials (compared with CC and II trials) with the partial repetition cost caused by the breaking of stimulus-response binding. Although we cannot remove its influence on the within-type trials (similarity level 5, θ = 0), it should not affect the cross-type trials (similarity levels 1-4, θ = 90°, 67.5°, 45° and 22.5°, respectively), because the CC, CI, IC, and II trials had equal probabilities of partially repeated and fully switched trials (see Author response image 1 for an example of trials across Conf 1 and Conf 3 conditions). Thus, feature integration cannot explain the gradual CSE decrease from similarity level 1 to 4, which is sufficient to reproduce the full effect, as suggested by the leave-one-out prediction analysis mentioned above. We thus conclude that the similarity modulation of CSE cannot be explained by stimulus-driven associative learning.

      Author response image 1.

      Notably, however, our findings are aligned with an associative learning account of cognitive control (Abrahamse et al., 2016), which extends associative learning from the stimulus/response level to cognitive control. In other words, abstract cognitive control states can be learned and generalized like other sensorimotor features. This view explicitly proposes that “transfer occurs to the extent that two tasks overlap”, a hypothesis directly supported by our CSE results (see also Yang et al., 2021). Extending this, our fMRI results provide the neural basis of how cognitive control can generalize through a representation of cognitive space. The cognitive space view complements the associative learning account by providing a fundamental principle for the learning and generalization of control states. Given the widespread application of the CSE as an indicator of cognitive control generalization (Braem et al., 2014), we believe that it can be recognized as a kind of ability supported by the cognitive space. This was further supported by the brain-behavioral correlation: stronger encoding of cognitive space was associated with greater bias of trial-wise behavioral adjustment by the consecutive conflict similarity.

      We have incorporated these ideas into the discussion:

      “Similarly, we propose that cognitive space could serve as a mental model to assist fast learning and efficient organization of cognitive control settings. Specifically, the cognitive space representation may provide a principle for how our brain evaluates the expected cost of switching and the benefit of generalization between states and selects the path with the best cost-benefit tradeoff (Abrahamse et al., 2016; Shenhav et al., 2013). The proximity between two states in cognitive space could reflect both the expected cognitive demand required to transition and the useful mechanisms to adapt from. The closer the two conditions are in cognitive space, the lower the expected switching cost and the higher the generalizability when transitioning between them. With the organization of a cognitive space, a new conflict can be quickly assigned a location in the cognitive space, which will facilitate the development of cognitive control settings for this conflict by interpolating nearby conflicts and/or projecting the location to axes representing different cognitive control processes, thus leading to a stronger CSE when following a more similar conflict condition.”

      References:

      Hommel, B., Proctor, R. W., & Vu, K. P. (2004). A feature-integration account of sequential effects in the Simon task. Psychological Research, 68(1), 1-17.

      Abrahamse, E., Braem, S., Notebaert, W., & Verguts, T. (2016). Grounding cognitive control in associative learning. Psychological Bulletin, 142(7), 693-728.

      Yang, G., Xu, H., Li, Z., Nan, W., Wu, H., Li, Q., & Liu, X. (2021). The congruency sequence effect is modulated by the similarity of conflicts. Journal of Experimental Psychology: Learning, Memory, and Cognition, 47(10), 1705-1719.

      Braem, S., Abrahamse, E. L., Duthoo, W., & Notebaert, W. (2014). What determines the specificity of conflict adaptation? A review, critical analysis, and proposed synthesis. Frontiers in Psychology, 5, 1134.

      3) More generally, it seems problematic that Stroop and Simon conflict in the paradigm parametrically trade-off against each other. A more powerful design would have de-confounded Stroop and Simon conflict so that each could be separately estimated via (potentially orthogonal) conflict axes. Additionally, incorporating more varied stimulus sets, locations, or responses might have enabled various tests of generality, as implied by a cognitive space account.

      We thank the reviewer for these valuable suggestions. We argue that the current design is adequate to test the prediction that more similar conflict types have more similar neural representations. That said, we agree that further examination using more powerful experimental designs is needed to fully test the cognitive space account of cognitive control. We also agree that employing more varied stimulus sets, locations and responses would further extend our findings. We have included this as a future research direction in the revised manuscript.

      We have revised our discussion about the limitation as:

      “A few limitations of this study need to be noted. To parametrically manipulate the conflict similarity levels, we adopted the spatial Stroop-Simon paradigm that enables parametrical combinations of spatial Stroop and Simon conflicts. However, since this paradigm is a two-alternative forced choice design, the behavioral CSE is not a pure measure of adjusted control but could be partly confounded by bottom-up factors such as feature integration (Hommel et al., 2004). Future studies may replicate our findings with a multiple-choice design (including more varied stimulus sets, locations and responses) with confound-free trial sequences (Braem et al., 2019). Another limitation is that in our design, the spatial Stroop and Simon effects are highly anticorrelated. This constraint may make the five conflict types represented in a unidimensional space (e.g., a circle) embedded in a 2D space. Future studies may test the 2D cognitive space with fully independent conditions. A possible improvement to our current design would be to include left, right, up, and down arrows presented in a grid formation across four spatially separate quadrants, with each arrow mapped to its own response button. However, one potential confounding factor would be that these conditions have different levels of difficulty (i.e., different magnitude of conflict), which may affect the CSE results and their representational similarity.”

      4) Serious confounds in the design render the results difficult to interpret. As much prior neuroimaging and behavioral work has established, "conflict" per se is perniciously correlated with many conceptually different variables. Consequently, it is very difficult to distinguish these confounding variables within aggregate measures of neural activity like fMRI. For example, conflict is confounded with increased time-on-task with longer RT, as well as conflict-driven increases in coding of other task variables (e.g., task-set related coding; e.g., Ebitz et al. 2020 bioRxiv). Even when using much higher resolution invasive measures than fMRI (i.e., eCoG), researchers have rightly been wary of making strong conclusions about explicit encoding of conflict (Tang et al, 2019; eLife). As such, the researchers would do well to be quite cautious and conservative in their analytic approach and interpretation of results.

      We acknowledge the findings showing that encoding of conflicts may not be easily detected in the brain. However, recent studies have shown that the representational similarity analysis can effectively detect representations of conflict tasks (e.g., the color Stroop) using factorial designs (Freund et al., 2021a; 2021b).

      In our analysis, we are aware of the potential impact of time-on-task (e.g., RT) on univariate activation levels and subsequent RSA patterns. To address this issue, we added univariate fMRI activation levels as nuisance regressors to the RSA. To deconfound conflict from other factors such as the orientation of stimuli relative to the center of the screen, we also applied the cross-subject RSA approach. Furthermore, we were cautious about determining regions that encoded conflict control. We set three strict criteria: (1) regions must show a conflict similarity modulation effect; (2) regions must show higher representational strength in the incongruent condition compared with the congruent condition; and (3) regions must correlate with behavioral performance. With these criteria, we believe that the results we reported are already conservative. We would be happy to implement any additional criteria the reviewer recommends.

      Reference:

      Freund, M. C., Etzel, J. A., & Braver, T. S. (2021a). Neural Coding of Cognitive Control: The Representational Similarity Analysis Approach. Trends in Cognitive Sciences, 25(7), 622-638.

      Freund, M. C., Bugg, J. M., & Braver, T. S. (2021b). A Representational Similarity Analysis of Cognitive Control during Color-Word Stroop. Journal of Neuroscience, 41(35), 7388-7402.

      5) This issue is most critical in the interpretation of the fMRI results as reflecting encoding of conflict types. A key limitation of the design, that is acknowledged by the authors is that conflict is fully confounded within-subject by spatial orientation. Indeed, the limited set of stimulus-response mappings also cast doubt on the underlying factors that give rise to the CSE modulations observed by the authors in their behavioral results. The CSE modulations are so strong - going from a complete absence of current x previous trial-type interaction in the cos(90) case all the way to a complete elimination of any current trial conflict when the prior trial was incongruent in the cos(0) case - that they cause suspicion that they are actually driven by conflict-related control adjustments rather than sequential dependencies in the stimulus-response mappings that can be associatively learned.

      Unlike for the fMRI data, we cannot tease apart the effects of conflict similarity and orientation on the behavioral CSEs in the way the cross-subject RSA allows. However, we have several reasons to believe that orientation and other bottom-up factors do not drive the similarity modulation effect.

      First, we did not find any correlation between the regions showing orientation effects and behavioral CSEs. This suggests that orientation does not directly contribute to the CSE modulation.

      Second, if the CSE modulation is purely driven by the association learning of the stimulus-response mapping, we should observe a stronger modulation effect after more extensive training. However, our results do not support this prediction. Using data from Experiment 1, we found that the modulation effect remained constant across the three sessions (see Note S3).

      “Note S3. Modulation of conflict similarity on behavioral CSEs does not change across time

      We tested if the conflict similarity modulation on the CSE is susceptible to training. We collected the data of Experiment 1 across three sessions, thus it is possible to examine if the conflict similarity modulation effect changes across time. To this end, we added conflict similarity, session and their interaction into a mixed-effect linear model, in which the session was set as a categorical variable. With a post-hoc analysis of variance (ANOVA), we calculated the statistical significance of the interaction term. This approach was applied to both the RT and ER. Results showed no interaction effect in either RT, F(2,1479) = 1.025, p = .359, or ER, F(2,1479) = 0.789, p = .455. This result suggests that the modulation effect does not change across time.”

      Third, the observed similarity modulation on the CSE, particularly for similarity levels 1-4, should not be attributed to stimulus-response associations such as feature integration, as has been addressed in response to comment 2.c.

      Finally, other bottom-up factors, such as the spatial location proximity did not drive the CSE modulation results, which we have addressed in the original manuscript in Note S2.

      "Note S2. Modulation of conflict similarity on behavioral CSEs cannot be explained by the physical proximity

      In our design, the conflict similarity might be confounded by the physical proximity between stimulus (i.e., the arrow) of two consecutive trials. That is, when arrows of the two trials appear at the same quadrant, a higher conflict similarity also indicates a higher physical proximity (Fig. 1A). Although the opposite is true if arrows of the two trials appear at different quadrants, it is possible the behavioral effects can be biased by the within quadrant trials. To examine if the physical distance has confounded the conflict similarity modulation effect, we conducted an additional analysis.

      We defined the physical angular difference across two trials as the difference of their polar angles relative to the origin. Therefore, the physical angular difference could vary from 0 to 180°. For each CSE condition (i.e., CC, CI, IC and II), we grouped the trials based on their physical angular distances, and then averaged trials with the same previous by current conflict type transition but different orders (e.g., StHSmL−StLSmH and StLSmH−StHSmL) within each subject. The data were submitted to a mixed-effect model with the conflict similarity, physical proximity (i.e., the opposite of the physical angular difference) as fixed-effect predictors, and subject and CSE condition as random effects. Results showed significant conflict similarity modulation effects in both Experiment 1 (RT: β = 0.09 ± 0.01, t(7812) = 13.74, p < .001, ηp2 = .025; ER: β = 0.09 ± 0.01, t(7812) = 7.66, p < .001, ηp2 = .018) and Experiment 2 (RT: β = 0.21 ± 0.02, t(3956) = 9.88, p < .001, ηp2 = .043; ER: β = 0.20 ± 0.03, t(4201) = 6.11, p < .001, ηp2 = .038). Thus, the observed modulation of conflict similarity on behavioral CSEs cannot be explained by physical proximity."
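
      The angular measure defined in Note S2 can be illustrated with a minimal sketch. It assumes stimulus locations are given as (x, y) coordinates relative to the screen center and reads "the opposite of the physical angular difference" as its negative; the function names are hypothetical and this is not the analysis code used in the study.

```python
import math


def polar_angle_deg(x, y):
    """Polar angle of a location relative to the origin, in [0, 360)."""
    return math.degrees(math.atan2(y, x)) % 360.0


def angular_difference_deg(loc_a, loc_b):
    """Smallest absolute difference between two polar angles, in [0, 180]."""
    diff = abs(polar_angle_deg(*loc_a) - polar_angle_deg(*loc_b)) % 360.0
    return min(diff, 360.0 - diff)


def physical_proximity(loc_a, loc_b):
    """Proximity coded as the negative of the angular difference (assumption)."""
    return -angular_difference_deg(loc_a, loc_b)
```

      Wrapping the difference with `min(diff, 360 - diff)` is what keeps the measure within the 0 to 180° range stated in the note.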

      6) To their credit, the authors recognize this confound, and attempt to address it analytically through the use of a between-subject RSA approach. Yet the solution is itself problematic, because it doesn't actually deconfound conflict from orientation. In particular, the RSA model assumes that whatever components of neural activity encode orientation produce this encoding within the same voxel-level patterns of activity in each subject. If they are not (which is of course likely), then orthogonalization of these variables will be incomplete. Similar issues underlie the interpretation of target/response and distractor coding. Given these issues, perhaps zooming out to a larger spatial scale for the between-subject RSA might be warranted. Perhaps whole-brain at the voxel level with a high degree of smoothing, or even whole-brain at the parcel level (averaging per parcel). For this purpose, Schaefer atlas parcels might be more useful than Glasser, as they more strongly reflect functional divisions (e.g., motor strip is split into mouth/hand divisions; visual cortex is split into central/peripheral visual field divisions). Similarly, given the lateralization of stimuli, if a within-parcel RSA is going to be used, it seems quite sensible to pool voxels across hemispheres (so effectively using 180 parcels instead of 360).

      Doing RSA at the whole-brain level is an interesting idea. However, it does not allow the identification of specific brain regions representing the cognitive space. Additionally, increasing the spatial scale would include more voxels that are not involved in representing the information of interest and may increase the noise level of data. Given these concerns, we did not conduct the whole-brain level RSA.

      We agree that smoothing data can decrease cross-subject variance in voxel distribution and may increase the signal-noise ratio. We reanalyzed the results for the right 8C region using RSA on smoothed beta maps (6-mm FWHM Gaussian kernel). This yielded a significant conflict similarity effect, t(951989) = 5.55, p < .0001, replicating the results on unsmoothed data (t(951989) = 5.60, p < .0001). Therefore, we retained the results from unsmoothed data in the main text, and added the results based on smoothed data to the supplementary material (Note S9).

      “Note S9. The cross-subject pattern similarity is robust against individual differences

      Due to individual differences, the multivoxel patterns extracted from the same brain mask may not reflect exactly the same brain region for each subject. To reduce the influence of individual difference, we conducted the same cross-subject RSA using data smoothed with a 6-mm FWHM Gaussian kernel. Results showed a significant conflict similarity effect, t(951989) = 5.55, p < .0001, replicating the results on unsmoothed data (t(951989) = 5.60, p < .0001).”

      We also used the bilateral 8C area as a single mask and conducted the same RSA. We found a significant conflict type similarity effect, t(951989) = 4.36, p < .0001. However, the left 8C alone showed no such representation, t(951989) = 0.38, p = .351, consistent with the right lateralized representation of cognitive space we reported in Note S8. Therefore, we used ROIs from each hemisphere separately.

      “Note S8. The lateralization of conflict type representation

      We observed the right 8C but not the left 8C represented the conflict type similarity. A further test is to show if there is a lateralization. We tested several regions of the left dlPFC, including the i6-8, 8Av, 8C, p9-46v, 46, 9-46d, a9-46v (Freund, Bugg, et al., 2021). We found that none of these regions show the representation of conflict type, all uncorrected ps > .35. These results indicate that the conflict type is specifically represented in the right dlPFC.”

      We have also discussed the lateralization in the manuscript:

      “In addition, we found no such representation in the left dlPFC (Note S8), indicating a possible lateralization. Previous studies showed that the left dlPFC was related to the expectancy-related attentional set up-regulation, while the right dlPFC was related to the online adjustment of control (Friehs et al., 2020; Vanderhasselt et al., 2009), which is consistent with our findings. Moreover, the right PFC also represents a composition of single rules (Reverberi et al., 2012), which may explain how the spatial Stroop and Simon types can be jointly encoded in a single space.”

      7) The strength of the results is difficult to interpret due to the non-standard analysis method. The use of a mixed-level modeling approach to summarize the empirical similarity matrix is an interesting idea, but nevertheless is highly non-standard within RSA neuroimaging methods. More importantly, the way in which it was implemented makes it potentially vulnerable to a high degree of inaccuracy or bias. In this case, this bias is likely to be overly optimistic (high false positive rate). No numerical or formal defense was provided for this mixed-level model approach. As a result, the use of this method seems quite problematic, as it renders the strength of the observed results difficult to interpret. Instead, the authors are encouraged to use a previously published method of conducting inference with between-subject RSA, such as the bootstrapping methods illustrated in Kragel et al. (2018; Nat Neurosci), or to adopt one of the Chen et al. methods mentioned above, which have been extensively explored in terms of statistical properties.

      In our revised manuscript, we have adopted the approach proposed by Chen et al. (2017). Specifically, we included both the upper and lower triangle of the representational similarity matrix (excluding the diagonal). Moreover, we also removed all the within-subject similarity (thus also excluding the within-run similarity) to minimize the bias of the potentially strong within-subject similarity (note we also analyzed the within-subject data and found significant effects for the similarity modulation, though this effect cannot be attributed to the conflict similarity or orientation alone. We added this part in Note S7, see below). In addition, we added both the row-wise and column-wise random effects to capture the dependence of cells within each column/row (Chen et al., 2017). We have revised the method part as:

      “We excluded within-subject cells from the RSM (thus also excluding the within-run similarity as suggested by Walther et al. (2016)), and the remaining cells were converted into a vector, which was then z-transformed and submitted to a linear mixed effect model as the dependent variable. The linear mixed effect model also included regressors of conflict similarity and orientation similarity. Importantly, conflict similarity was based on how Simon and spatial Stroop conflicts are combined and hence was calculated by first rotating all subjects’ stimulus locations to the top-right and bottom-left quadrants, whereas orientation was calculated using original stimulus locations. As a result, the regressors representing conflict similarity and orientation similarity were de-correlated. Similarity between two conditions was measured as the cosine value of the angular difference. Other regressors included a target similarity regressor (i.e., whether the arrow directions were identical), a response similarity regressor (i.e., whether the correct responses were identical), a spatial Stroop distractor regressor (i.e., vertical distance between two stimulus locations), and a Simon distractor regressor (i.e., horizontal distance between two stimulus locations). Additionally, we also included a regressor denoting the similarity of Group (i.e., whether two conditions are within the same subject group, according to the stimulus-response mapping). We also added two regressors including ROI-mean fMRI activations for each condition of the pair to remove the possible uni-voxel influence on the RSM. A last term was the intercept. To control the artefact due to dependence of the correlation pairs sharing the same subject, we included crossed random effects (i.e., row-wise and column-wise random effects) for the intercept, conflict similarity, orientation and the group factors (G. Chen et al., 2017).”
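
      Two steps of this procedure, keeping only cross-subject cells of the RSM and coding similarity as the cosine of the angular difference, can be sketched as follows. This is an illustration under assumptions, not the study's code: the RSM is a plain nested list, each row carries a subject label and a condition angle in degrees, and all names are hypothetical. The z-transformation and the mixed-effect model itself are left out.

```python
import math


def cosine_similarity_deg(theta_a, theta_b):
    """Similarity of two conditions as the cosine of their angular difference."""
    return math.cos(math.radians(theta_a - theta_b))


def cross_subject_cells(rsm, subject_of_row):
    """Return (row, col, value) for every cell pairing two different subjects.

    Both the upper and lower triangle are kept; the diagonal and all other
    within-subject cells are dropped.
    """
    n = len(rsm)
    return [
        (i, j, rsm[i][j])
        for i in range(n)
        for j in range(n)
        if subject_of_row[i] != subject_of_row[j]
    ]


def similarity_regressor(cells, angle_of_row):
    """Cosine-similarity regressor aligned with the retained cells."""
    return [cosine_similarity_deg(angle_of_row[i], angle_of_row[j])
            for i, j, _ in cells]
```

      With two subjects and two conditions each, a 4 × 4 RSM keeps 8 of its 16 cells, and each retained pair appears twice (once per triangle), which is the duplication that the degrees-of-freedom correction described in the response accounts for.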

      Results from this approach highly replicated our original results. Specifically, we found the right 8C again showed a strong conflict similarity effect, a higher representational strength in the incongruent condition compared to the congruent condition, and a significant correlation with the behavioral CSE. The orientation effect was also identified in the visual (e.g., right V1) and oculomotor (e.g., left FEF) regions.

      We revised the results accordingly:

      For the conflict type effect:

      “The first criterion revealed several cortical regions encoding the conflict similarity, including the Brodmann 8C area (a subregion of dlPFC (Glasser et al., 2016)) and a47r in the right hemisphere, and the superior frontal language (SFL) area, 6r, 7Am, 24dd, and ventromedial visual area 1 (VMV1) areas in the left hemisphere (Bonferroni corrected ps < 0.0001, one-tailed, Fig. 4A). We next tested whether these regions were related to cognitive control by comparing the strength of conflict similarity effect between incongruent and congruent conditions (criterion 2). Results revealed that the left SFL, left VMV1, and right 8C met this criterion, Bonferroni corrected ps < .05, one-tailed, suggesting that the representation of conflict type was strengthened when conflict was present (e.g., Fig. 4D). The intersubject brain-behavioral correlation analysis (criterion 3) showed that the strength of conflict similarity effect on RSM scaled with the modulation of conflict similarity on the CSE (slope in Fig. S2C) in right 8C (r = .52, Bonferroni corrected p = .002, one-tailed, Fig. 4C, Table 1) but not in the left SFL and VMV1 (all Bonferroni corrected ps > .05, one-tailed).”

      For the orientation effect:

      “We observed increasing fMRI representational similarity between trials with more similar orientations of stimulus location in the occipital cortex, such as right V1, right V2, right V4, and right lateral occipital 2 (LO2) areas (Bonferroni corrected ps < 0.0001). We also found the same effect in the oculomotor related region, i.e., the left frontal eye field (FEF), and other regions including the right 5m, left 31pv and right parietal area F (PF) (Fig. 5A). Then we tested if any of these brain regions were related to the conflict representation by comparing their encoding strength between incongruent and congruent conditions. Results showed that the right V1, right V2, left FEF, and right PF encoded stronger orientation effect in the incongruent than the congruent condition, Bonferroni corrected ps < .05, one-tailed (Table 1, Fig. 5B). We then tested if any of these regions was related to the behavioral performance, and results showed that none of them positively correlated with the behavioral conflict similarity modulation effect, all uncorrected ps > .45, one-tailed. Thus all regions are consistent with the criterion 3.”

      “Note S7. The cross-subject RSA captures similar effects with the within-subject RSA

      Considering the variability in voxel-level functional localizations among individuals, one may question whether the cross-subject RSA results were biased by the consistent multi-voxel patterns across subjects, distinct from the more commonly utilized within-subject RSA. We reasoned that the cross-subject RSA should have captured similar effects as the within-subject RSA if we observe the conflict similarity effect in right 8C with the latter analysis. Therefore, we tested whether the representation in right 8C held for within-subject data. Specifically, we performed similar RSA for within-subject RSMs, excluding the within-run cells. We replaced the perfectly confounded factors of conflict similarity and orientation with a common factor called similarity_orientation. Other confounding factor pairs (i.e., target versus response, and Stroop distractor versus Simon distractor) were addressed similarly. Results showed a significant effect of similarity_orientation, t(13993) = 3.270, p = .0005, one-tailed. Given the specific representation of conflict similarity identified by the cross-subject RSA, the within-subject data of right 8C may show similar conflict similarity modulation effects as the cross-subject data. Further research is needed to fully dissociate the representation of conflict and the representation of visual features such as orientation.”

      8) Another potential source of bias is in treating the subject-level random effect coefficients (as predicted by the mixed-level model) as independent samples from a random variable (in the t-tests). The more standard method for inference would be to use test statistics derived from the mixed-model fixed effects, as those have degrees of freedom calculations that are calibrated based on statistical theory.

      In our revised manuscript, we reported the statistical p values calculated from the mixed-effect models. Note that because we used the Chen et al. (2017) method, which includes data from the symmetric matrix, we corrected the degrees of freedom and estimated the true p values based on the t statistics of model results. For the I versus C comparison results, we calculated the p values by combining I and C RSMs into a larger model and then adding the condition type, as well as the interaction between the regressors of interest (conflict similarity and orientation) and the condition type. We made the statistical inference based on the interaction effect.

      We have revised the corresponding methods as:

      “The statistical significance of these beta estimates was based on the outputs of the mixed-effect model estimated with the “fitlme” function in Matlab 2022a. Since symmetric cells from the RSM matrix were included in the mixed-effect model, we adjusted the t and p values with the true degree of freedom, which is half of the cells included minus the number of fixed regressors. Multiple comparison correction was applied with the Bonferroni approach across all cortical regions at the p < 0.0001 level. To test if the representation strengths are different between congruent and incongruent conditions, we also conducted the RSA using only congruent (RDM_C) and incongruent (RDM_I) trials separately. The contrast analysis was achieved by an additional model with both RDM_C and RDM_I included, adding the congruency and the interaction between conflict type (and orientation) and congruency as both fixed and random factors. The difference between incongruent and congruent representations was indicated by a significant interaction effect.”
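
      The degrees-of-freedom rule stated above (half of the included cells minus the number of fixed regressors) can be written out as a small sketch. The function names are hypothetical, and the p-value helper swaps in a normal approximation to the t distribution, which is reasonable only because the degrees of freedom here are very large; it is not the computation performed by `fitlme`.

```python
import math


def true_degrees_of_freedom(n_cells_included, n_fixed_regressors):
    """Half of the (symmetric, duplicated) cells minus the fixed regressors."""
    if n_cells_included % 2:
        raise ValueError("symmetric cells come in pairs; expected an even count")
    return n_cells_included // 2 - n_fixed_regressors


def one_tailed_p_large_df(t_value):
    """Upper-tail p value using a normal approximation (assumes very large df)."""
    return 0.5 * math.erfc(t_value / math.sqrt(2.0))
```

      For example, a model fitted to 2000 duplicated cells with 10 fixed regressors (toy numbers) would be evaluated with 990 degrees of freedom rather than 1990.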

      Reviewer #3:

      Yang and colleagues investigated whether information on two task-irrelevant features that induce response conflict is represented in a common cognitive space. To test this, the authors used a task that combines the spatial Stroop conflict and the Simon effect. This task reliably produces a beautiful graded congruency sequence effect (CSE), where the cost of congruency is reduced after incongruent trials. The authors measured fMRI to identify brain regions that represent the graded similarity of conflict types, the congruency of responses, and the visual features that induce conflicts.

      Using several theory-driven exclusion criteria, the authors identified the right dlPFC (right 8C), which shows 1) stronger encoding of graded similarity of conflicts in incongruent trials and 2) a positive correlation between the strength of conflict similarity type and the CSE on behavior. The dlPFC has been shown to be important for cognitive control tasks. As the dlPFC did not show a univariate parametric modulation based on the higher or lower component of one type of conflict (e.g., having more spatial Stroop conflict or less Simon conflict), it implies that dissimilarity of conflicts is represented by a linear increase or decrease of neural responses. Therefore, the similarity of conflict is represented in multivariate neural responses that combine two sources of conflict.

      The strength of the current approach lies in the clear effect of parametric modulation of conflict similarity across different conflict types. The authors employed a clever cross-subject RSA that counterbalanced and isolated the targeted effect of conflict similarity, decorrelating orientation similarity of stimulus positions that would otherwise be correlated with conflict similarity. A pattern of neural response seems to exist that maps different types of conflict, where each type is defined by the parametric gradation of the yoked spatial Stroop conflict and the Simon conflict on a similarity scale. The similarity of patterns increases in incongruent trials and is correlated with CSE modulation of behavior.

      We would like to thank the reviewer for the positive evaluation of our manuscript and for providing constructive comments. By addressing these comments, we believe that we have made our manuscript more accessible for the readers while also strengthening our findings. In particular, we have tested a few alternative models and confirmed that the cognitive space hypothesis best fits the data. We have also demonstrated the geometric properties of the cognitive space by examining the continuity and dimensionality of the space, further supporting our main arguments. We have incorporated revisions and additional analyses to the manuscript based on your feedback. Overall, we believe that these changes and additional analyses have significantly improved the manuscript. Please find our detailed responses below.

      However, several potential caveats need to be considered.

      1) One caveat to consider is that the main claim of recruitment of an organized "cognitive space" for conflict representation is solely supported by the exclusion criteria mentioned earlier. To further support the involvement of organized space in conflict representation, other pieces of evidence need to be considered. One approach could be to test the accuracy of out-of-sample predictions to examine the continuity of the space, as commonly done in studies on representational spaces of sensory information. Another possible approach could involve rigorously testing the geometric properties of space, rather than fitting RSM to all conflict types. For instance, in Fig 6, both the organized and domain-specific cognitive maps would similarly represent the similarity of conflict types expressed in Fig1c (as evident from the preserved order of conflict types). The RSM suggests a low-dimensional embedding of conflict similarity, but the underlying dimension remains unclear.

      Following the reviewer’s first suggestion, we conducted a leave-one-out prediction approach to examine the continuity of the cognitive space. We used the behavioral data from Experiment 1 for this test, due to its larger amount of data than Experiment 2. Specifically, we removed data from one of the five similarity levels (as illustrated by the θs in Fig. 1C) and used the remaining data to perform the same mixed-effect model as reported in the main text (i.e., the two-stage analysis). This yielded one pair of beta coefficients including the similarity regressor and the intercept for each subject, with which we predicted the CSE for the removed similarity level at subject level. We repeated this process for each similarity level once. The predicted results were highly correlated with the original data, with r = .87 for the RT and r = .84 for the ER, ps < .001. We have added this analysis and result to the “Conflict type similarity modulated behavioral congruency sequence effect (CSE)” section:

      “Moreover, to test the continuity and generalizability of the similarity modulation, we conducted a leave-one-out prediction analysis. We used the behavioral data from Experiment 1 for this test, due to its larger amount of data than Experiment 2. Specifically, we removed data from one of the five similarity levels (as illustrated by the θs in Fig. 1C) and used the remaining data to perform the same mixed-effect model (i.e., the two-stage analysis). This yielded one pair of beta coefficients including the similarity regressor and the intercept for each subject, with which we predicted the CSE for the removed similarity level for each subject. We repeated this process for each similarity level once. The predicted results were highly correlated with the original data, with r = .87 for the RT and r = .84 for the ER, ps < .001.”
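
      The leave-one-out logic described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a plain per-subject least-squares line instead of the two-stage mixed-effect model, and the function names and example numbers are hypothetical, not taken from the study.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_one_level_out(similarity, cse):
    """Predict the CSE at each similarity level from a fit to the other levels."""
    predictions = []
    for held_out in range(len(similarity)):
        train_x = [x for i, x in enumerate(similarity) if i != held_out]
        train_y = [y for i, y in enumerate(cse) if i != held_out]
        intercept, slope = fit_line(train_x, train_y)
        predictions.append(intercept + slope * similarity[held_out])
    return predictions


def pearson_r(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical data for one subject: five similarity levels and made-up CSE values.
levels = [0.0, 0.25, 0.5, 0.75, 1.0]
observed_cse = [4.0, 9.0, 15.0, 18.0, 26.0]
predicted_cse = leave_one_level_out(levels, observed_cse)
print(pearson_r(predicted_cse, observed_cse))
```

      In the analysis reported above, the predicted and observed values would be pooled across subjects before computing the correlation; the sketch shows a single subject only.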

      To estimate whether the domain-specific model could explain the results we observed in the right 8C, we conducted a model-comparison analysis. The domain-specific model treats each conflict type differently, so we used a diagonal matrix, with within-conflict type similarities being 1 and all cross-conflict type similarities being 0. This model showed non-significant effects (t(951989) = 0.84, p = .201) and poorer fit (BIC = 5377127) than the cognitive space model (t(951989) = 5.60, p = 1.1 × 10⁻⁸, BIC = 5377094). We also compared other alternative models and found the cognitive space model best fitted the data. We have included these results in the revised manuscript:

      “To examine if the right 8C specifically encodes the cognitive space rather than the domain-general or domain-specific organizations, we tested several additional models (see Methods). Model comparison showed a lower BIC in the Cognitive-Space model (BIC = 5377094) than the Domain-General (BIC = 5377127) or Domain-Specific (BIC = 5377127) models. Further analysis showed the dimensionality of the representation in the right 8C was 1.19, suggesting the cognitive space was close to 1D. We also tested if the observed conflict similarity effect was driven solely by spatial Stroop or Simon conflicts, and found larger BICs for the models only including the Stroop similarity (i.e., the Stroop-Only model, BIC = 5377122) or Simon similarity (i.e., the Simon-Only model, BIC = 5377096). An additional Stroop+Simon model, including both Stroop-Only and Simon-Only regressors, also showed a worse model fit (BIC = 5377118). Moreover, we replicated the results with only incongruent trials, considering that the pattern of conflict representations is more manifested when the conflict is present (i.e., on incongruent trials) than not (i.e., on congruent trials). We found a poorer fit in the Domain-General (BIC = 1344129), Domain-Specific (BIC = 1344129), Stroop-Only (BIC = 1344128), Simon-Only (BIC = 1344120), and Stroop+Simon (BIC = 1344157) models than in the Cognitive-Space model (BIC = 1344104). These results indicate that the right 8C encodes an integrated cognitive space for resolving Stroop and Simon conflicts. The more detailed model comparison results are listed in Table 2.”

      We also estimated the dimensionality of the right 8C with the averaged RSM and found the dimensionality of the cognitive space was ~ 1.19, very close to a 1D space. This result is consistent with our experimental design, as the only manipulated variable is the angular distance between conflict types. We have added these results and the methods to the revised manuscript.

      Results:

      “Further analysis showed the dimensionality of the representation in the right 8C was 1.19, suggesting the cognitive space was close to 1D.”

      Methods:

      “To better capture the dimensionality of the representational space, we estimated its dimensionality using the participation ratio (Ito & Murray, 2023). Since we excluded the within-subject cells from the whole RSM, the whole RSM is an incomplete matrix and could not be used. To resolve this issue, we averaged the cells corresponding to each pair of conflict types to obtain an averaged 5×5 RSM, similar to the matrix shown in Fig. 1C. We then estimated the participation ratio using the formula:

      $$\mathrm{dim} = \frac{\left(\sum_{i=1}^{m} \lambda_i\right)^2}{\sum_{i=1}^{m} \lambda_i^2}$$

      where λi is the i-th eigenvalue of the RSM and m is the number of eigenvalues.”
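
      As a rough illustration, the participation ratio, (Σλi)² / Σλi², can be computed for a small symmetric matrix without an explicit eigendecomposition, because for a symmetric matrix the trace equals the sum of the eigenvalues and the sum of squared entries equals the sum of squared eigenvalues. The function name and the toy matrices below are assumptions for illustration, not the study's averaged RSM.

```python
def participation_ratio(rsm):
    """Participation ratio of a symmetric matrix given as a list of rows.

    Uses two identities that hold for symmetric matrices:
      sum of eigenvalues          = trace
      sum of squared eigenvalues  = sum of squared entries
    """
    n = len(rsm)
    eig_sum = sum(rsm[i][i] for i in range(n))
    eig_sq_sum = sum(rsm[i][j] ** 2 for i in range(n) for j in range(n))
    return eig_sum ** 2 / eig_sq_sum


# Toy 5x5 examples (hypothetical):
identity = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
all_ones = [[1.0] * 5 for _ in range(5)]

print(participation_ratio(identity))  # five equal eigenvalues: maximal dimensionality
print(participation_ratio(all_ones))  # one non-zero eigenvalue: one-dimensional
```

      A value near 1, as in the second toy case, corresponds to variance concentrated on a single dimension.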

      2) Another important factor to consider is how learning within the confined task space, which always negatively correlates the two types of conflicts within each subject, may have influenced the current results. Is statistical dependence of conflict information necessary to use the organized cognitive space to represent conflicts from multiple sources? Answering this question would require a paradigm that can adjust multiple sources of conflicts parametrically and independently. Investigating such dependencies is crucial in order to better understand the adaptive utility of the observed cognitive space of conflict similarity.

      As the central goal of our design was to test the geometry of neural representations of conflict, we manipulated the conflict similarity. The anticorrelated Simon and spatial Stroop conflict aimed to make the overall magnitude of conflict similar among different conflict types. We agree that with the current design the likely cognitive space is not a full 2D space with Simon and spatial Stroop being two dimensions. Instead, the likely cognitive space is a subspace (e.g., a circle) embedded in the 2D space, due to the constraint of anticorrelated Simon and spatial Stroop conflict across conflict types. Nevertheless, the subspace can also be used to test the geometry that similar conflict types share similar neural representations.

      To test the full 2D cognitive space, a possible revision of our current design is to have multiple hybrid conditions (like Types 2-4) that cover the whole space. For instance, imagine arrow locations in the first quadrant. We could have a 3×3 design with 9 conflict conditions, where the horizontal/vertical coordinates could be any combination of 0, 0.5 and 1. This way, the spatial Stroop and Simon conditions would be independent of each other. Notably, however, one potential confounding factor would be that these conditions have different levels of difficulty (i.e., different magnitudes of conflict), which may affect the CSE results and their representational similarity.

      We have added the above limitations and future designs to the revised manuscript.

      “Another limitation is that in our design, the spatial Stroop and Simon effects are highly anticorrelated. This constraint may make the five conflict types represented in a unidimensional space (e.g., a circle) embedded in a 2D space. Future studies may test the 2D cognitive space with fully independent conditions. A possible improvement to our current design would be to include left, right, up, and down arrows presented in a grid formation across four spatially separate quadrants, with each arrow mapped to its own response button. However, one potential confounding factor would be that these conditions have different levels of difficulty (i.e., different magnitude of conflict), which may affect the CSE results and their representational similarity.”

      Major comments:

      3) The RSM result (and the absence of univariate effect) seem to be a good first step to claim the use of cognitive space of conflict. Yet, the presence of an organized (unidimensional; Fig. 6) and continuous cognitive space should be further tested and backed up.

      We thank the reviewer for recognizing the methods and results of our current work. Indeed, the utilization of a parametric design and RSA to examine organization of neural representations is a widely embraced methodology in the field of cognitive neuroscience (e.g., Freund et al., 2021; Ritz et al., 2022). Our current study aimed primarily to provide original evidence for whether similar conflicts are represented similarly in the brain, which reflects the geometry of conflict representations (i.e., the structure of differences between conflict representations). We have used multiple criteria to back up the findings by showing the representation is sensitive to the presence of conflict and has behavioral relevance.

      We agree that the cognitive space account of cognitive control requires further validation. Therefore, in the revised manuscript, we have added several additional tests to strengthen the evidence supporting the organized cognitive space representation. Firstly, we tested five alternative models (Domain-General, Domain-Specific, Stroop-Only, Simon-Only and Stroop+Simon models), and found that the Cognitive-Space model best fitted our data. Secondly, we explicitly calculated the dimensionality of the representation and observed a low dimensionality (1.19D). We have added these results to the “Multivariate patterns of the right dlPFC encodes the conflict similarity” section in the revised manuscript (see also the response to Comment 1).

      Furthermore, we utilized data from Experiment 1 to demonstrate the continuity of the cognitive space by showing its ability to predict out-of-sample data. We have included this result in the “Conflict type similarity modulated behavioral congruency sequence effect (CSE)” section in the revised manuscript:

      “Moreover, to test the continuity and generalizability of the similarity modulation, we conducted a leave-one-out prediction analysis. We used the behavioral data from Experiment 1 for this test, due to its larger amount of data than Experiment 2. Specifically, we removed data from one of the five similarity levels (as illustrated by the θs in Fig. 1C) and used the remaining data to perform the same mixed-effect model (i.e., the two-stage analysis). This yielded one pair of beta coefficients including the similarity regressor and the intercept for each subject, with which we predicted the CSE for the removed similarity level for each subject. We repeated this process for each similarity level once. The predicted results were highly correlated with the original data, with r = .87 for the RT and r = .84 for the ER, ps < .001.”

      References:

      Freund, M. C., Bugg, J. M., & Braver, T. S. (2021). A Representational Similarity Analysis of Cognitive Control during Color-Word Stroop. Journal of Neuroscience, 41(35), 7388-7402.

      Ritz, H., & Shenhav, A. (2022). Humans reconfigure target and distractor processing to address distinct task demands. bioRxiv. doi:10.1101/2021.09.08.459546

      4) Is the conflict similarity effect not driven by coding of the weak-to-strong gradient of either the spatial Stroop conflict or the Simon conflict? For example, would brain regions that are selectively and continuously tuned to the Simon conflict be enough to create the graded similarity in Fig. C?

      We recognize that our current design and analysis approach cannot fully exclude the possibility that the current results are driven solely by either Stroop or Simon conflicts, since their gradients are correlated with the conflict similarity gradient we defined. To estimate their unique contributions, we performed a model-comparison analysis. We constructed a Stroop-Only model and a Simon-Only model, with each conflict type projected onto the Stroop (vertical) axis or Simon (horizontal) axis, respectively. The similarity between any two conflict types was defined using the Jaccard similarity index (Jaccard, 1901), that is, their intersection divided by their union. By replacing the cognitive space-based conflict similarity regressor with the Stroop-Only and Simon-Only regressors, we calculated their BICs. Results showed that the BIC was larger for Stroop-Only (5377122) and Simon-Only (5377096) than for the cognitive space model (5377094). An additional Stroop+Simon model, including both Stroop-Only and Simon-Only regressors, also showed a poorer model fit (BIC = 5377118) than the cognitive space model.

      Moreover, we replicated the results with only incongruent trials. We found a poorer fitting in Stroop-Only (BIC = 1344128), Simon-Only (BIC = 1344120), and Stroop+Simon (BIC = 1344157) models than the Cognitive-Space model (BIC = 1344104). These results indicate that the right 8C encodes an integrated cognitive space for resolving Stroop and Simon conflicts. Therefore, we believe the cognitive space has incorporated both dimensions. We added these additional analyses and results to the revised manuscript (see also the response to the above Comment 1).

      5) Is encoding of conflict similarity in the unidimensional organized space driven by specific requirements of the task or is this a general control strategy? Specifically, is the recruitment of organized space something specific to the task that people are trained to work with stimuli that negatively correlate the spatial Stroop conflict and the Simon conflict?

      We argue that this encoding is a general control strategy. In our task design, we asked the participants to respond to the target arrow and ignore its location, which varied randomly. So, they were not trained to deal with the stimuli in any particular way. We also found the conflict similarity modulation on CSE did not change with more training (we added this result in Note S3), indicating that the cognitive space did not depend on strategies that could be learned through training.

      “Note S3. Modulation of conflict similarity on behavioral CSEs does not change across time. We tested if the conflict similarity modulation on the CSE is susceptible to training. We collected the data of Experiment 1 across three sessions, thus it is possible to examine if the conflict similarity modulation effect changes across time. To this end, we added conflict similarity, session and their interaction into a mixed-effect linear model, in which the session was set as a categorical variable. With a post-hoc analysis of variance (ANOVA), we calculated the statistical significance of the interaction term.

      This approach was applied to both the RT and ER. Results showed no interaction effect in either RT, F(2,1479) = 1.025, p = .359, or ER, F(2,1479) = 0.789, p = .455. This result suggests that the modulation effect does not change across time."

      Instead, the cognitive space should be determined by the intrinsic similarity structure of the task design. A previous study (Freitas et al., 2015) has found that the CSE across different versions of spatial Stroop and flanker tasks was stronger than that across either of the two conflicts and Simon. In their designs, the stimulus similarity was controlled at the same level, so the difference in CSE was only attributable to the similar dimensional overlap between Stroop and flanker tasks, in contrast to the Simon task. Furthermore, recent studies showed that the cognitive space generally exists to represent structured latent states (e.g., Vaidya et al., 2022), mental strategy cost (Grahek et al., 2022), and social hierarchies (Park et al., 2020). Therefore, we argue that cognitive space is likely a universal strategy that can be applied to different scenarios.

      We added this argument in the discussion:

      “Although the spatial orientation information in our design could be helpful to the construction of cognitive space, the cognitive space itself was independent of the stimulus-level representation of the task. We found the conflict similarity modulation on CSE did not change with more training (see Note S3), indicating that the cognitive space did not depend on strategies that could be learned through training. Instead, the cognitive space should be determined by the intrinsic similarity structure of the task design. For example, a previous study (Freitas et al, 2015) has found that the CSE across different versions of spatial Stroop and flanker tasks was stronger than that across either of the two conflicts and Simon. In their designs, the stimulus similarity was controlled at the same level, so the difference in CSE was only attributable to the similar dimensional overlap between Stroop and flanker tasks, in contrast to the Simon task. Furthermore, recent studies showed that the cognitive space generally exists to represent structured latent states (e.g., Vaidya et al., 2022), mental strategy cost (Grahek et al., 2022), and social hierarchies (Park et al., 2020). Therefore, cognitive space is likely a universal strategy that can be applied to different scenarios."

      Reference:

      Freitas, A. L., & Clark, S. L. (2015). Generality and specificity in cognitive control: conflict adaptation within and across selective-attention tasks but not across selective-attention and Simon tasks. Psychological Research, 79(1), 143-162.

      Vaidya, A. R., Jones, H. M., Castillo, J., & Badre, D. (2021). Neural representation of abstract task structure during generalization. Elife, 10, 1-26.

      Grahek, I., Leng, X., Fahey, M. P., Yee, D., & Shenhav, A. Empirical and Computational Evidence for Reconfiguration Costs During Within-Task Adjustments in Cognitive Control. CogSci.

      Park, S. A., Miller, D. S., Nili, H., Ranganath, C., & Boorman, E. D. (2020). Map Making: Constructing, Combining, and Inferring on Abstract Cognitive Maps. Neuron, 107(6), 1226-1238 e1228. doi:10.1016/j.neuron.2020.06.030

      6) The observed pattern seems to suggest that there is conflict similarity space that is defined by the combination of the conflict similarity (i.e., the strength of conflicts) and the sources of conflict (i.e., the Simon vs the spatial Stroop). What are the rational reasons to separate conflicts of different sources (beyond detecting incongruence)? And how are they used for better conflict resolutions?

      Conflicts from different sources need to be separated because the spatial Stroop and Simon effects are resolved by different mechanisms. The behavioral congruency effect of a combined conflict from two different sources was shown to be the sum of the effects of the two conflict sources (Liu et al., 2010), suggesting that the conflicts are resolved independently. Moreover, previous studies have shown that different sources of conflict are resolved by different brain regions (Egner, 2008; Li et al., 2017) and at different processing stages (Wang et al., 2014). Therefore, when multiple sources of conflict occur simultaneously or sequentially, it should be more efficient to resolve the conflict by identifying the sources.

      We have added this argument to the revised manuscript:

      “The rationale behind defining conflict similarity based on combinations of different conflict sources, such as spatial-Stroop and Simon, stems from the evidence that these sources undergo independent processing (Egner, 2008; Li et al., 2014; Liu et al., 2010; Wang et al., 2014). Identifying these distinct sources is critical in efficiently resolving potentially infinite conflicts."

      Reference:

      Egner, T. (2008). Multiple conflict-driven control mechanisms in the human brain. Trends in Cognitive Sciences, 12(10), 374-380.

      Li, Q., Yang, G., Li, Z., Qi, Y., Cole, M. W., & Liu, X. (2017). Conflict detection and resolution rely on a combination of common and distinct cognitive control networks. Neuroscience and Biobehavioral Reviews, 83, 123-131.

      Wang, K., Li, Q., Zheng, Y., Wang, H., & Liu, X. (2014). Temporal and spectral profiles of stimulus-stimulus and stimulus-response conflict processing. NeuroImage, 89, 280-288.

      Liu, X., Park, Y., Gu, X., & Fan, J. (2010). Dimensional overlap accounts for independence and integration of stimulus-response compatibility effects. Attention, Perception, & Psychophysics, 72(6), 1710-1720.

      7) The congruency effect is larger in conflict type 2, 3, 4 consistently compared to conflict 1 and 5. Are these expected under the hypothesis of unified cognitive space of conflict similarity? Is the pattern of similarity modeled in RSA?

      Yes, this is expected. The spatial Stroop and Simon effects have been shown to be additive and independent (Li et al., 2014). Therefore, the congruency effects of conflict type 2, 3 and 4 would be the weighted sum of the spatial Stroop and Simon effects. The weights can be defined by the sine and cosine of the polar angle.

      For instance, in Type 2, wy = sin(67.5°) and wx = cos(67.5°). The sum of the two weight values (i.e., 1.31) is larger than 1, leading to a larger congruency effect than in the pure spatial Stroop (Conf 1) and Simon (Conf 5) conditions.

      Note that this hypothesis underlies the Stroop+Simon model, which assumes the Stroop and Simon dimensions are independently represented in the brain and drive the behavior in an additive fashion. Moreover, the observed difference in behavioral congruency effects may have reflected the variance in the Domain-General model, which treats all conflict types as equivalent, with the only difference between any two conflict types being the magnitude of their conflict. Therefore, we did not model the behavioral congruency effects as a covariance regressor in the major RSA. Instead, we conducted a model comparison analysis by comparing these models and the Cognitive-Space model. Results showed worse model fit for both the Domain-General and Stroop+Simon models. Specifically, the regressor of congruency effect difference in the Domain-General model was not significant (p = .575), which also suggests that the higher congruency effect in conflict types 2, 3 and 4 should not influence the Cognitive-Space model results. We have added these methods and results to the revised manuscript (see also our response to Comment 1):
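
      The weighting above can be checked numerically. In this small sketch only the 67.5° angle of Type 2 comes from the text; the other angles assume that the five conflict types are evenly spaced between the vertical (pure spatial Stroop) and horizontal (pure Simon) axes, and the names are hypothetical.

```python
import math

# Assumed polar angles in degrees for conflict Types 1-5.
# Only Type 2 (67.5 degrees) is stated in the text; the rest is an even-spacing assumption.
ANGLES_DEG = {1: 90.0, 2: 67.5, 3: 45.0, 4: 22.5, 5: 0.0}


def conflict_weights(angle_deg):
    """Return (w_y, w_x): the spatial Stroop and Simon weights for a polar angle."""
    theta = math.radians(angle_deg)
    return math.sin(theta), math.cos(theta)


for conflict_type, angle in ANGLES_DEG.items():
    w_y, w_x = conflict_weights(angle)
    print(f"Type {conflict_type}: w_y={w_y:.2f}, w_x={w_x:.2f}, sum={w_y + w_x:.2f}")
```

      Under these assumptions the weight sum exceeds 1 for the three hybrid types and equals 1 for the two pure types, which matches the direction of the behavioral pattern the reviewer describes.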

      Methods:

      “Model comparison and representational dimensionality

      To estimate if the right 8C specifically encodes the cognitive space, rather than the domain-general or domain-specific structures, we conducted two more RSAs. We replaced the cognitive space-based conflict similarity matrix in the RSA we reported above (hereafter referred to as the Cognitive-Space model) with one of the alternative model matrices, with all other regressors equal. The domain-general model treats each conflict type as equivalent, so any two conflict types only differ in the magnitude of their conflict. Therefore, we defined the domain-general matrix as the difference in their congruency effects indexed by the group-averaged RT in Experiment 2. Then the z-scored model vector was sign-flipped to reflect similarity instead of distance. The domain-specific model treats each conflict type differently, so we used a diagonal matrix, with within-conflict type similarities being 1 and all cross-conflict type similarities being 0.

      Moreover, to examine if the cognitive space is driven solely by the Stroop or Simon conflicts, we tested a spatial Stroop-Only (hereafter referred to as “Stroop-Only”) and a Simon-Only model, with each conflict type projected onto the spatial Stroop (vertical) axis or Simon (horizontal) axis, respectively. The similarity between any two conflict types was defined using the Jaccard similarity index (Jaccard, 1901), that is, their intersection divided by their union. We also included a model assuming the Stroop and Simon dimensions are independently represented in the brain, adding up the Stroop-Only and Simon-Only regressors. We conducted similar RSAs as reported above, replacing the original conflict similarity regressor with the Stroop-Only, Simon-Only, or both regressors, and then calculated their Bayesian information criteria (BICs)."
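
      One way to picture the alternative model matrices is the following sketch. It rests on several assumptions that are not confirmed by the text: the polar angles of the five conflict types are evenly spaced, each projection is treated as a segment from the origin so that the Jaccard index reduces to the shorter length divided by the longer one, and two zero-length projections are treated as identical. All names are hypothetical.

```python
import math

ANGLES_DEG = [90.0, 67.5, 45.0, 22.5, 0.0]  # assumed polar angles for Types 1-5


def jaccard(a, b):
    """Jaccard overlap of the segments [0, a] and [0, b]: intersection / union."""
    longest = max(a, b)
    if longest == 0.0:
        return 1.0  # assumption: two empty projections count as identical
    return min(a, b) / longest


def projection_model(angles_deg, axis):
    """Similarity matrix after projecting each type onto one axis.

    axis="stroop" uses the vertical (sine) projection,
    axis="simon" uses the horizontal (cosine) projection.
    """
    project = math.sin if axis == "stroop" else math.cos
    # Rounding removes floating-point residue such as cos(90 degrees) = 6e-17.
    lengths = [round(abs(project(math.radians(a))), 12) for a in angles_deg]
    return [[jaccard(p, q) for q in lengths] for p in lengths]


def domain_specific_model(n_types):
    """Diagonal matrix: similarity 1 within a conflict type, 0 across types."""
    return [[1.0 if i == j else 0.0 for j in range(n_types)] for i in range(n_types)]


stroop_only = projection_model(ANGLES_DEG, "stroop")
simon_only = projection_model(ANGLES_DEG, "simon")
domain_specific = domain_specific_model(len(ANGLES_DEG))
```

      Each matrix would then replace the conflict similarity regressor in the RSA, and the resulting BICs would be compared as described in the Methods text above.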

      Reference:

      Li, Q., Nan, W., Wang, K., & Liu, X. (2014). Independent processing of stimulus-stimulus and stimulus-response conflicts. PloS One, 9(2), e89249.

      8) Please clarify the observed patterns of CSE effects in relation to the hypothesis of common cognitive space of conflict. In particular, right 8C shows that the patterns become dissimilar in incongruent trials compared to congruent trials. How does this direction of the effect fit the common unidimensional cognitive space account? And how does such a representation contribute to the CSE effects?

      The behavioral CSE patterns provide initial evidence for the cognitive space hypothesis. Previous studies have debated whether cognitive control relies on domain-general or domain-specific representations, with much evidence gathered from behavioral CSE patterns. A significant CSE across two conflict conditions typically suggests domain-general representations of cognitive control, while an absence of CSE suggests domain-specific representations. The cognitive space view proposes that conflict representations are neither purely domain-general nor purely domain-specific, but rather exist on a continuum. This view predicts that the CSE across two conflict conditions should depend on the representational distance between them within this cognitive space. Our finding that CSE values systematically vary with conflict similarity level supports this hypothesis. We have added this point in the discussion of the revised manuscript:

      “Previous research on this topic often adopts a binary manipulation of conflict (Braem et al., 2014) (i.e., each domain only has one conflict type) and gathered evidence for the domain-general/specific view with presence/absence of CSE, respectively. Here, we parametrically manipulated the similarity of conflict types and found that the CSE systematically varied with conflict similarity level, demonstrating that cognitive control is neither purely domain-general nor purely domain-specific, but can be reconciled as a cognitive space (Bellmund et al., 2018) (Fig. 6, middle).”

      Fig. 4D was plotted to show the steeper slope of the conflict similarity effect for incongruent versus congruent conditions. Note that the y-axis displays z-scored Pearson correlation values, so the grand mean of each condition was 0. The values for the first two similarity levels (levels 1 and 2) were lower for incongruent than congruent conditions, seemingly indicating lower average similarity. However, this was not the case. The five similarity levels contained different numbers of data points (see Fig. 1C), so levels 4 and 5 should be weighted more heavily than levels 1 and 2. When comparing the grand mean of raw Pearson correlation values, the incongruent condition (0.0053) showed a tendency toward higher similarity than the congruent condition (0.0040), t(475998) = 1.41, p = .079. We have also plotted another version of Fig. 4D in Fig. S5, in which the raw Pearson correlation values were used.

      The greater representation of conflict type in incongruent condition compared to congruent condition (as evidenced by a steeper slope) suggests that the conflict representation was driven by the incongruent condition. This is probably due to the stronger involvement of cognitive control in incongruent condition (than congruent condition), which in turn leads to more distinct patterns across different conflict types. This is consistent with the fact that the congruent condition is typically a baseline, where any conflict related effects should be weaker.

      The representation of cognitive space may contribute to the CSE as a mental model. This model allows our brain to evaluate the cost and benefit associated with transitioning between different conflict conditions. When two consecutive trials are characterized by more similar conflict types, their representations in the cognitive space will be closer, resulting in a less costly transition. As a consequence, stronger CSEs are observed. We revised the corresponding discussion part as:

      “Similarly, we propose that cognitive space could serve as a mental model to assist fast learning and efficient organization of cognitive control settings. Specifically, the cognitive space representation may provide a principle for how our brain evaluates the expected cost of switching and the benefit of generalization between states and selects the path with the best cost-benefit tradeoff (Abrahamse et al., 2016; Shenhav et al., 2013). The proximity between two states in cognitive space could reflect both the expected cognitive demand required to transition and the useful mechanisms to adapt from. The closer the two conditions are in cognitive space, the lower the expected switching cost and the higher the generalizability when transitioning between them. With the organization of a cognitive space, a new conflict can be quickly assigned a location in the cognitive space, which will facilitate the development of cognitive control settings for this conflict by interpolating nearby conflicts and/or projecting the location to axes representing different cognitive control processes, thus leading to a stronger CSE when following a more similar conflict condition.”

      Minor comments:

      9) Some of the labels of figure axes are unclear (e.g., Fig4C) about what they represent.

      In Fig. 4C, the x-axis label is “neural representational strength”, which refers to the beta coefficient of the conflict type effect computed from the main RSA, denoting the strength of the conflict type representation in neural patterns. The y-axis label is “behavioral representational strength”, which refers to the beta coefficient obtained from the behavioral linear model using conflict similarity to predict the CSE in Experiment 2; it reflects how strongly the conflict similarity modulates the behavioral CSE. We apologize for any confusion from the brief axis labels. We have added expanded descriptions to the figure caption of Fig. 4C.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer 1:

      (1) In general, the representation of target and distractor processing is a bit of a reach. Target processing is represented by SSVEP amplitude, which is most likely going to be related to the contrast of the dots, as opposed to representing coherent motion energy, which is the actual target. These may well be linked (e.g., greater attention to the coherent motion task might increase SSVEP amplitude), but I would call it a limitation of the interpretation. Decoding accuracy of emotional content makes sense as a measure of distractor processing, and the supplementary analysis comparing target SSVEP amplitude to distractor decoding accuracy is duly noted.

      We agree with the reviewer. The SSVEP amplitude of the target at the whole trial level indeed reflected the combined effect of the stimulus parameters (e.g., contrast of the moving dots) as well as attention. However, the time course of the target SSVEP amplitude within a trial, derived from the moving window analysis, reflected the temporal fluctuations of target processing, since the stimulus parameters remained the same during the trial. We now make this clearer in the revised manuscript.

      (2) Comparing SSVEP amplitude to emotional category decoding accuracy feels a bit like comparing apples with oranges. They have different units and scales and probably reflect different neural processes. Is the result the authors find not a little surprising in this context? This relationship does predict performance and is thus intriguing, but I think this methodological aspect needs to be discussed further. For example, is the phase relationship with behaviour a result of a complex interaction between different levels of processing (fundamental contrast vs higher order emotional processing)?

      Traditionally, the SSVEP amplitude at the distractor frequency is used to quantify distractor processing. Because the target SSVEP amplitude is stronger than that of the distractor, the distractor SSVEP amplitude may be contaminated by the target SSVEP amplitude through spectral power leakage; see Figure S4 for a demonstration. For this reason, we introduced decoding accuracy as an index of distractor processing. The lack of correlation between the distractor SSVEP amplitude and the distractor decoding accuracy, although somewhat like comparing apples with oranges as the reviewer points out, shows that the two measures do not co-vary and that decoding accuracy is free from the influence of the distractor SSVEP amplitude, which is itself influenced by the target SSVEP amplitude. Also, to address the apples-vs-oranges issue, the correlation was computed on normalized time series, in which a z-scored time series replaced the original time series so that the correlated variables are dimensionless. Regarding the relation between behavior and different levels of processing, we do not have the means to address it, given that we cannot empirically separate the effects of stimulus parameters from those of attention.
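
      The normalization step can be illustrated with a minimal sketch. The function names and the toy values are hypothetical; the code only shows that z-scoring puts two measures with different units on a common, unitless scale before they are correlated.

```python
import math


def zscore(series):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(series)
    mean = sum(series) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    return [(v - mean) / sd for v in series]


def correlation_of_zscores(a, b):
    """Mean product of two z-scored series, which equals their Pearson correlation."""
    za, zb = zscore(a), zscore(b)
    return sum(x * y for x, y in zip(za, zb)) / len(za)


# Toy example with different units: an amplitude-like series and an accuracy-like series.
amplitude = [1.0, 2.0, 4.0, 3.0]
accuracy = [0.50, 0.55, 0.65, 0.60]
print(correlation_of_zscores(amplitude, accuracy))
```

      Because the Pearson correlation is itself unchanged by shifting or rescaling either variable, the z-scoring affects the units of the time series rather than the value of the correlation.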

      Reviewer 2:

      (1) Incomplete Evidence for Rhythmicity at 1 Hz: The central claim of 1 Hz rhythmic sampling is insufficiently validated. The windowing procedure (0.5s windows with 0.25s step) inherently restricts frequency resolution, potentially biasing toward low-frequency components like 1 Hz. Testing different window durations or providing controls would significantly strengthen this claim.

      We appreciate the reviewer’s insightful suggestion. In response, we tested different windowing parameters, e.g., a 0.1s sliding window with a 0.05s step size. Figure S5 demonstrates that the strength of both target and distractor processing fluctuates at ~1 Hz, both at the individual and group levels. Additionally, Figures S6(A) and S6(B) show that the relative phase between target and distractor processing time series exhibits a uniform distribution across subjects. In terms of the relation between relative phase and behavior, Figure S6(C) illustrates two representative cases: a high-performing subject with 84.34% task accuracy exhibited a relative phase of 0.9483π (closer to π), while a low-performing subject with 30.95% accuracy showed a phase of 0.29π (closer to 0). At the group level, a significant positive correlation between relative phase and task performance was found (r = 0.6343, p = 0.0004), as shown in Figure S6(D). All these results, aligning closely with our original findings (0.5s window length and 0.25s step size), suggest that the conclusions are not dependent on windowing parameters. We discuss these results in the revised manuscript.
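
      As an illustration only (this is not the authors' pipeline), the sliding-window logic can be sketched on a synthetic signal. The sketch assumes a 200 Hz sampling rate, a 4.29 Hz tagged component, and a 1 Hz amplitude modulation; the function names `sliding_amplitude` and `peak_frequency` are hypothetical. It estimates the amplitude at the tagged frequency in each window and then looks for the dominant frequency of the resulting amplitude time series under both window settings mentioned above.

```python
import cmath
import math


def sliding_amplitude(signal, fs, freq, win_s, step_s):
    """Amplitude at `freq` in each sliding window, via a single-frequency DFT."""
    win = int(round(win_s * fs))
    step = int(round(step_s * fs))
    amps = []
    for start in range(0, len(signal) - win + 1, step):
        acc = 0j
        for n in range(start, start + win):
            acc += signal[n] * cmath.exp(-2j * math.pi * freq * n / fs)
        amps.append(2 * abs(acc) / win)
    return amps


def peak_frequency(series, series_fs):
    """Frequency (Hz) of the largest non-DC DFT component of `series`."""
    n = len(series)
    mean = sum(series) / n
    best_k, best_mag = 1, -1.0
    for k in range(1, n // 2 + 1):
        coef = sum((x - mean) * cmath.exp(-2j * math.pi * k * i / n)
                   for i, x in enumerate(series))
        if abs(coef) > best_mag:
            best_k, best_mag = k, abs(coef)
    return best_k * series_fs / n


# Synthetic signal: a tagged carrier whose amplitude waxes and wanes at 1 Hz.
FS = 200.0      # sampling rate in Hz (assumed)
TAG_HZ = 4.29   # tagged frequency in Hz (assumed for illustration)
signal = [(1 + 0.5 * math.sin(2 * math.pi * 1.0 * i / FS))
          * math.sin(2 * math.pi * TAG_HZ * i / FS)
          for i in range(int(10 * FS))]

peaks = {}
for win_s, step_s in [(0.5, 0.25), (0.1, 0.05)]:
    amps = sliding_amplitude(signal, FS, TAG_HZ, win_s, step_s)
    # One amplitude value per step, so the envelope is sampled at 1/step_s Hz.
    peaks[(win_s, step_s)] = peak_frequency(amps, 1.0 / step_s)
```

      Note that the step size sets the sampling rate of the amplitude time series: a 0.25s step can only resolve fluctuations up to 2 Hz, whereas a 0.05s step resolves fluctuations up to 10 Hz, which is why the shorter window is the more informative control for a 1 Hz effect.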

      To further validate our findings, we also employed the Hilbert transform to extract amplitude envelopes of the target and distractor signals on a time-point-by-time-point basis, providing a window-free estimate of signal strength (Figures R3 and R4). The results remain consistent with both the original findings and the new sliding window analyses (Figure S6). Specifically, Figure S7 reveals ~1 Hz fluctuations in target and distractor processing at both individual and group levels. Figures S8(A) and S8(B) confirm a uniform distribution of the relative phase across subjects. In Figure S8(C), the relative phase was 0.9567π for a high-performing subject (84.34% accuracy) and 0.2247π for a low-performing subject (28.57% accuracy). At the group level, a significant positive correlation was again observed between relative phase and task performance (r = 0.4020, p = 0.0376), as shown in Figure S8(D).

      (2) No-Distractor Control Condition: The study lacks a baseline or control condition without distractors. This makes it difficult to determine whether the distractor-related decoding signals or the 1 Hz effect reflect genuine distractor processing or more general task dynamics.

      The lack of a no-distractor control condition is certainly a limitation and will be acknowledged as such in the revised manuscript. However, given that our decoding results are between two different classes of distractors, we are confident that they reflect distractor processing.

      (3) Decoding Near Chance Levels: The pairwise decoding accuracies for distractor categories hover close to chance (~55%), raising concerns about robustness. While statistically above chance, the small effect sizes need careful interpretation, particularly when linked to behavior.

      This is an important point. To test robustness, we have implemented a random permutation procedure in which trial labels were randomly shuffled to construct a null-hypothesis distribution for decoding accuracy. We then compared the decoding accuracy from the actual data to this distribution. Figure S9 shows the results based on 1,000 permutations. For each of the three pairwise classifications (pleasant vs. neutral, unpleasant vs. neutral, and pleasant vs. unpleasant), as well as the three-way classification, the actual decoding accuracies fall far outside the null-hypothesis distribution (p < 0.001), and the effect size in all four cases is extremely large. These findings indicate that the observed decoding accuracies are statistically significant and robust in terms of both statistical inference and effect size.
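
      A label-shuffling test of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the classifier or data used in the study: the features are one-dimensional synthetic values with deliberately well-separated classes, the classifier is a simple nearest-class-mean rule with leave-one-out cross-validation, and the names `loo_accuracy` and `permutation_test` are hypothetical.

```python
import random


def loo_accuracy(features, labels):
    """Leave-one-out accuracy of a nearest-class-mean classifier (1-D features)."""
    correct = 0
    for i, (x_test, y_test) in enumerate(zip(features, labels)):
        sums, counts = {}, {}
        for j, (x, y) in enumerate(zip(features, labels)):
            if j != i:  # hold out trial i
                sums[y] = sums.get(y, 0.0) + x
                counts[y] = counts.get(y, 0) + 1
        means = {y: sums[y] / counts[y] for y in sums}
        predicted = min(means, key=lambda y: abs(x_test - means[y]))
        if predicted == y_test:
            correct += 1
    return correct / len(features)


def permutation_test(features, labels, n_perm, rng):
    """Return (observed accuracy, null accuracies, one-sided p-value)."""
    observed = loo_accuracy(features, labels)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between features and labels
        null.append(loo_accuracy(features, shuffled))
    # Add-one correction keeps the p-value strictly above zero.
    p_value = (1 + sum(acc >= observed for acc in null)) / (n_perm + 1)
    return observed, null, p_value


rng = random.Random(0)
# Synthetic data: 30 trials per class, class means 3 SD apart (illustrative only).
labels = ["neutral"] * 30 + ["pleasant"] * 30
features = ([rng.gauss(0.0, 1.0) for _ in range(30)]
            + [rng.gauss(3.0, 1.0) for _ in range(30)])
observed, null, p_value = permutation_test(features, labels, n_perm=200, rng=rng)
```

      With the add-one correction used in this sketch, 1,000 permutations give a smallest attainable p-value of 1/1001, so a result reported as p < 0.001 corresponds to no shuffled accuracy reaching the observed one.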

      (4) No Clear Correlation Between SSVEP and Behavior: Neither target nor distractor signal strength (SSVEP amplitude) correlates with behavioral accuracy. The study instead relies heavily on relative phase, which - while interesting - may benefit from additional converging evidence.

      We felt that what the reviewer pointed out is actually the main point of our study, namely, it is not the target or distractor strength over the whole trial that matters for behavior, it is their temporal relationship within the trial that matters for behavior. This reveals a novel neuroscience principle that has not been reported in the past. We have stressed this point further in the revised manuscript.

      (5) Phase-analysis: phase analysis is performed between different types of signals hindering their interpretability (time-resolved SSVEP amplitude and time-resolved decoding accuracy).

      The time-resolved SSVEP amplitude is used to index the temporal dynamics of target processing whereas the time-resolved decoding accuracy is used to index the temporal dynamics of distractor processing. As such, they can be compared, using relative phase for example, to examine how temporal relations between the two types of processes impact behavior. That said, we do recognize the reviewer’s concern that these two processes are indexed by two different types of signals. We thus normalized each time course using z-scoring, making them dimensionless, and then computed the temporal relations between them.
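
      To make the normalization and phase comparison concrete, here is a minimal sketch. It assumes that relative phase is read off the 1 Hz Fourier component of each z-scored time course and folded into the range 0 to π; the response does not spell out the exact estimator, so this choice, the 4 Hz sampling of the time courses, and the function names are assumptions made for illustration.

```python
import cmath
import math


def zscore(series):
    """Subtract the mean and divide by the standard deviation (dimensionless)."""
    n = len(series)
    mean = sum(series) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    return [(x - mean) / sd for x in series]


def phase_at(series, fs, freq):
    """Phase (radians) of the Fourier component of `series` at `freq` Hz."""
    coef = sum(x * cmath.exp(-2j * math.pi * freq * i / fs)
               for i, x in enumerate(series))
    return cmath.phase(coef)


def relative_phase(a, b, fs, freq):
    """Absolute phase difference between two time courses, folded into [0, pi]."""
    diff = phase_at(zscore(a), fs, freq) - phase_at(zscore(b), fs, freq)
    return abs(math.atan2(math.sin(diff), math.cos(diff)))


FS = 4.0  # one value every 0.25 s (assumed)
t = [i / FS for i in range(40)]
# Two signals with very different units and scales, both fluctuating at 1 Hz.
target_amp = [5.0 + 2.0 * math.cos(2 * math.pi * x) for x in t]
decoding_anti = [0.55 + 0.03 * math.cos(2 * math.pi * x + math.pi) for x in t]
decoding_in = [0.55 + 0.03 * math.cos(2 * math.pi * x) for x in t]

anti_phase = relative_phase(target_amp, decoding_anti, FS, 1.0)
in_phase = relative_phase(target_amp, decoding_in, FS, 1.0)
```

      In this toy case the anti-phase pair yields a relative phase near π and the in-phase pair a value near 0, regardless of the original units of the two signals.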

      Appraisal of Aims and Conclusions:

      The authors largely achieved their stated goal of assessing rhythmic sampling of distractors. However, the conclusions drawn - particularly regarding the presence of 1 Hz rhythmicity - rest on analytical choices that should be scrutinized further. While the observed phase-performance relationship is interesting and potentially impactful, the lack of stronger and convergent evidence on the frequency component itself reduces confidence in the broader conclusions.

      Impact and Utility to the Field:

      If validated, the findings will advance our understanding of attentional dynamics and competition in complex visual environments. Demonstrating that ignored distractors can be rhythmically sampled at similar frequencies to targets has implications for models of attention and cognitive control. However, the methodological limitations currently constrain the paper's impact.

      Thanks for these comments and positive assessment of our work’s potential implications and impact. As indicated above, in the revision process, we have carried out a number of additional analyses, some suggested by the reviewers, and the results of the additional analyses, now included in the Supplementary Materials, served to further validate the main findings and strengthen our conclusions.

      Additional Context and Considerations:

      (1) The use of EEG-fMRI is mentioned but not leveraged. If BOLD data were collected, even exploratory fMRI analyses (e.g., distractor modulation in visual cortex) could provide valuable converging evidence.

      Indeed, leveraging fMRI data in EEG studies would be very beneficial, as has been demonstrated in our previous work. However, given that this study concerns the temporal relationship between target and distractor processing, it is felt that fMRI data, which is known to possess low temporal resolution, has limited potential to contribute. We will be exploring this rich dataset in other ways in the future, where we will be integrating the two modalities for more insights that are not possible with either modality used alone.

      Author response image 1.

      Applying moving window analysis (0.02s window duration and 0.01s step size) to a different EEG-fMRI dataset. (A) The amplitude time series of the 4.29 Hz component and the Fourier spectrum. (B) The group level Fourier spectrum. At both individual and group level, no 1 Hz modulation is observed, suggesting that the 1 Hz modulation observed in our data is not introduced by the artifact removal procedure.

      (2) In turn, removal of fMRI artifacts might introduce biases or alter the data. For instance, the authors might consider investigating potential fMRI artifact harmonics around 1 Hz to address concerns regarding induced spectral components.

      We have done extensive work in the area of simultaneous EEG-fMRI and have not encountered artifacts with a 1 Hz rhythmicity. Our scanner artifact removal procedure is highly standardized. As such, it stands to reason that if the 1 Hz rhythmicity observed here resulted from the artifact removal process, it should also be present in other datasets where the same preprocessing steps were implemented. We tested this using another EEG-fMRI dataset (Rajan et al., 2019). Author response image 1 shows that the EEG power time series of the new dataset does not have 1 Hz rhythmicity, whether at the individual level or at the group level, suggesting that the 1 Hz rhythmicity reported in the manuscript does not come from the removal of the scanner artifacts, but instead reflects true rhythmic sampling of stimulus information. Also, the fact that the temporal relations between target processing and distractor processing at 1 Hz impact behavior is another indication that the 1 Hz rhythmicity is a neuroscientific effect, not an artifact.

      References

      Rajan, A., Siegel, S. N., Liu, Y., Bengson, J., Mangun, G. R., & Ding, M. (2019). Theta Oscillations Index Frontal Decision-Making and Mediate Reciprocal Frontal–Parietal Interactions in Willed Attention. Cerebral Cortex, 29(7), 2832–2843. https://doi.org/10.1093/cercor/bhy149

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      Rossi et al. asked whether gait adaptation is solely a matter of slow perceptual realignment or if it also involves fast/flexible stimulus-response mapping mechanisms. To test this, they conducted a series of split-belt treadmill experiments with ramped perturbations, revealing behavior indicative of a flexible, automatic stimulus-response mapping mechanism.

      Strengths:

      (1) The study includes a perceptual test of leg speed, which correlates with the perceptual realignment component of motor aftereffects. This indicates that there are motor performances that are not accounted for by perceptual re-alignment.

      (2) The study incorporates qualitatively distinct, hypothesis-driven models of adaptation and proposes a new framework that integrates these various mechanisms.

      Weaknesses:

      (1) The study could benefit from considering other alternative models. As the authors noted in their discussion, while the descriptive models explain some patterns of behaviour/aftereffects, they don't currently account for how these mechanisms influence the initial learning process itself.

      (1a) For example, the pattern of gait asymmetry might differ for perceptual realignment (a smooth, gradual process), structural learning (more erratic, involving hypothesis testing/reasoning to understand the perturbation, see (Tsay et al. 2024) for a recent review on Reasoning), and stimulus-response mapping (possibly through a reinforcement based trial-and-error approach). If not formally doing a model comparison, the manuscript might benefit from clearly laying out the behavioural predictions for how these different processes shape initial learning.

      (1b) Related to the above, the authors noted that the absence of difference during initial learning suggests that the differences in Experiment 2 in the ramp-up phase are driven by two distinct processes: structural learning and memory-based processes. If the assumptions about initial learning are not clear, the logic of this conclusion is hard to follow.

      Thank you for this insightful comment. We agree that considering alternative models and clarifying their potential contributions to the initial learning process would enhance the manuscript. We performed additional analyses and revised the text to outline how the mechanisms of adaptation in our study align with the framework described by Tsay et al. (2024) regarding the initial learning process and other features of adaptation.

      First, we referenced the Tsay et al. framework in the Introduction and Discussion to highlight parallels between their description of implicit adaptation and our forward model recalibration mechanism (producing motor changes and perceptual realignment). Specifically, the features defining recalibration in our study – gradual, trial-by-trial adjustments, rigid learning that leads to aftereffects, and limited contribution to generalization – align with those described by Tsay et al.

      Second, we used the description provided by Tsay et al. to test the presence of explicit strategies in our study. We specifically test for the criteria of reportability and intentionality, corroborating the finding that our stimulus response mapping mechanism differs from explicit strategies.

      “A recent framework for motor learning by Tsay et al. defines explicit strategies as motor plans that are both intentional and reportable (Tsay et al., 2024). Within this framework, Tsay et al. clarify that "intentional" means participants deliberately perform the motor plan, while "reportable" means they are able to clearly articulate it.” (Experiment 2 Results, lines 515-518).

      “…the motor adjustments reported by participants consistently fail to meet the criteria for explicit strategies as outlined by Tsay et al.: reportability and intentionality (Tsay et al., 2024).” (Discussion, lines 657-660).

      Third, we interpreted the operation of stimulus-response mapping within the Tsay theoretical framework for the three stages of motor learning: 1) “reasoning” to acquire new action–outcome relationships, 2) “refinement” of the motor action parameters, and 3) “retrieval” of learnt motor actions based on contextual cues. We note that the definition of these stages closely aligns with our definition for stimulus response mapping mechanisms. Moreover, according to Tsay’s definition, both implicit and explicit learning mechanisms can involve similar reasoning and retrieval processes. This shared operational basis may explain why our stimulus-response mapping mechanism exhibits some characteristics associated with explicit strategies, such as flexibility and generalizability.

      We performed a new analysis to evaluate the prediction of Tsay’s framework that, if walking adaptation includes a stimulus-response mapping mechanism following these three stages of motor learning, the learning process would initially be erratic and would then stabilize as learning progresses. We assessed within-participant residual variance in step length asymmetry around a double exponential model fit during adaptation, testing the prediction that this variability would decrease between the start and end of adaptation. Experiment 1 results confirmed this prediction, showing a significant reduction in variability as adaptation progressed.

      “We finally tested whether the pattern of motor variability during adaptation aligns with predictions for learning new stimulus response maps. In contrast to recalibration, mapping mechanisms are predicted to be highly variable and erratic during early learning, and stabilize as learning progresses (Tsay et al., 2024). Consistent with these predictions, the step length asymmetry residual variance (around a double exponential fit) decreased significantly between the start and end of adaptation (residual variance at start minus end of adaptation = 0.005 [0.004, 0.007], mean [CI]; SI Appendix, Fig. S3). These control analyses corroborate the hypothesis that the “no aftereffects” region of the Ramp Down reflects the operation of a mapping mechanism.”

      (Experiment 1 Results, lines 187-194; Methods, lines 1040-1050).

      Moreover, Experiment 2 results demonstrated that the pattern of variability (its magnitude and decay in adaptation) did not differ between participants using memory-based versus structure-based stimulus-response mapping mechanisms. These findings suggest that both types of mapping operate according to Tsay’s stages of motor learning.

      “Furthermore, the pattern of step length asymmetry variability was similar between the subgroups (structure – memory difference in residual variance relative to double exponential during initial adaptation = -0.0052 [0.0161, 0.0044], adaptation plateau = -0.0007 [-0.0021, 0.0003], difference in variance decay = -0.0045 [-0.0155, 0.0052], mean [CI]; SI Appendix, Fig. S16). This confirms that the distinct performance clusters in the Ramp Up & Down task are not driven by natural variations in learning ability, such as differences in learning speed or variability. Rather, these findings indicate that the subgroups employ different types of mapping mechanisms, which perform similarly during initial learning but differ fundamentally in how they encode, retrieve, and generalize relationships between perturbations and Δ motor outputs.” (Experiment 2 Results, lines 503-511).

      “Both memory- and structure-based operations of mapping align with Tsay et al.’s framework for motor learning: first, action–outcome relationships are learned through exploration; second, motor control policies are refined to optimize rewards or costs, such as reducing error; and finally, learned mappings or policies are retrieved based on contextual cues (Tsay et al., 2024). Consistent with the proposed stages of exploration followed by refinement, we found that motor behavior during adaptation was initially erratic but became less variable at later stages of learning. Similarly, consistent with the retrieval stage, the generalization observed in the ramp tasks indicates that learned motor outputs are flexibly retrieved based on belt speed cues.” (Discussion, lines 701-708).

      Finally, we addressed the prediction outlined by Tsay et al. that repeated exposure to perturbations attenuates the magnitude of forward model recalibration, with savings being driven by stimulus-response mapping mechanisms. While we could not directly test savings for the primary perturbation used during adaptation, we were able to indirectly evaluate savings for a different perturbation through analyses of our control experiments combined with previous results from Leech et al. (Leech et al., 2018). Specifically, we examined how motor aftereffects and perceptual realignment evolved across repeated iterations of the speed-matching task post-adaptation in Ascending groups. Each task began with the right leg stationary and the left leg moving at 0.5 m/s – a configuration corresponding to a perturbation of -0.5 m/s, which is opposite in direction to the adaptation perturbation. By analyzing repeated exposures to this -0.5 m/s perturbation across iterations, we gained insights into the learning dynamics associated with this perturbation and the effect of repeated exposures on motor aftereffects and perceptual realignment. Consistent with predictions from Tsay et al., our results combined with Leech et al. demonstrate that, with repeated exposures to the same perturbation, perceptual realignment decays while the contribution of stimulus-response mapping to aftereffect savings is enhanced. We present this analysis and interpretation in Control Experiments Results, lines 429-442; Figure 8B; Table S7; and Discussion lines 709-753.

      (1c) The authors could also test a variant of the dual-rate state-space model with two perceptual realignment processes where the constraints on retention and learning rate are relaxed. This model would be a stronger test for two perceptual re-alignment processes: one that is flexible and another that is rigid, without mandating that one be fast learning and fast forgetting, and the other be slow learning and slow forgetting.

      We tested multiple variants of the suggested models, and confirmed that they cannot capture the motor behavior observed in our Ramp Down task. We include Author response image 1 with the model fits, Author response table 1 with the BIC statistics, and the model equations below. Only the recalibration + mapping model captures the matching-then-divergent behavior of the Δ motor output, corroborating our interpretation that state-space based models cannot capture the mapping mechanism (see Discussion, “Implications for models of adaptation”). Furthermore, all models fit the data significantly worse than the recalibration+mapping model according to the BIC statistic.
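
      For readers less familiar with the statistic, a BIC comparison can be sketched as below. The response does not state which form of BIC was computed; this sketch assumes the common Gaussian least-squares form, BIC = n·ln(RSS/n) + k·ln(n), where n is the number of data points, RSS the residual sum of squares, and k the number of free parameters. The function name and the residual values are hypothetical.

```python
import math


def bic_least_squares(residuals, n_params):
    """BIC for a least-squares fit, assuming Gaussian residuals (lower is better)."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + n_params * math.log(n)


# Hypothetical residuals from two fits to the same 50 data points.
tight_fit = [0.1] * 50   # smaller errors
loose_fit = [0.2] * 50   # larger errors

bic_tight = bic_least_squares(tight_fit, n_params=4)
bic_loose = bic_least_squares(loose_fit, n_params=4)
# Positive difference: the looser fit is penalized relative to the tighter one.
bic_difference = bic_loose - bic_tight
```

      Under this form, each additional free parameter adds ln(n) to the score, so a more flexible model is preferred only if it reduces the residual error enough to offset that penalty.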

      Model fits:

      Author response image 1.

      Statistical results:

      Author response table 1.

      Model definitions:

      • DualStateRelaxed: same equations as the original Dual State, but no constraints dictating the relative relationship between the parameters

      • DualStateRelaxedV2: same equations as the original Dual State, but no constraints dictating the relative relationship between the parameters, and “loose” parameter bounds (parameters can take values between -10 and 10).

      • PremoOriginalRelaxed: PReMo with two states (see below), no constraints dictating the relative relationship between the parameters

      • PremoOriginalRelaxedV2: PReMo with two states (see below), no constraints dictating the relative relationship between the parameters, and “loose” parameter bounds (parameters can take values between -10 and 10).
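
      The equations of the “original Dual State” model are not reproduced in this response. As a rough sketch only, assuming the widely used two-state formulation in which each state retains a fraction of its previous value and learns from a shared error, the “relaxed” variants amount to simulating the same update rule without requiring one state to be the fast learner and fast forgetter. The parameter names `a1`, `b1`, `a2`, `b2` and the values used are hypothetical.

```python
def simulate_two_state(perturbation, a1, b1, a2, b2):
    """Two-state model: output is the sum of two states driven by a shared error.

    a1, a2 are retention factors; b1, b2 are learning rates. No ordering
    between the two states is enforced (the "relaxed" case).
    """
    x1 = x2 = 0.0
    output = []
    for p in perturbation:
        out = x1 + x2
        output.append(out)
        error = p - out            # shared error signal
        x1 = a1 * x1 + b1 * error  # state 1 update
        x2 = a2 * x2 + b2 * error  # state 2 update
    return output


perturbation = [1.0] * 500
# Conventional ordering: state 1 fast (low retention, high learning rate).
output = simulate_two_state(perturbation, a1=0.6, b1=0.3, a2=0.995, b2=0.02)
# Swapping the two states' parameters yields the identical output.
swapped = simulate_two_state(perturbation, a1=0.995, b1=0.02, a2=0.6, b2=0.3)
```

      Because the output depends only on the sum of the two states, the model is symmetric in its state labels; the ordering constraints of the original formulation mainly fix which state is called “fast”, so removing them broadens the parameter search without changing the update equations.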

      PReMo with two states – the remaining equations are the same as the original PReMo (see Methods):

      (2) The authors claim that stimulus-response mapping operates outside of explicit/deliberate control. While this could be true, the survey questions may have limitations that could be more clearly acknowledged.

      (2a) Specifically, asking participants at the end of the experiments to recall their strategies may suffer from memory biases (e.g., participants may be biased by recent events, and forget about the explicit strategies early in the experiment), be susceptible to the framing of the questions (e.g., participants not being sure what the experimenter is asking and how to verbalize their own strategy), and, moreover, it is not clear which category of explicit strategies one might enact here, which dictates what might be considered "relevant" and "accurate".

      (2b) The concept of perceptual realignment also suggests that participants are somewhat aware of the treadmill's changing conditions; therefore, as a thought experiment, if the authors have asked participants throughout/during the experiment whether they are trying different strategies, would they predict that some behaviour is under deliberate control?

      We have expanded the discussion to explicitly acknowledge that our testing methodology for assessing explicit strategies may have limitations, recognizing the factors mentioned by the reviewer. Moreover, as mentioned in response to comment (1), we leveraged the framework from Tsay et al., 2024 and its definition of explicit strategies to ensure a robust and consistent approach in interpreting the survey responses.

      We revised the Experiment 2 Results section, lines 515-518, to specify that we are evaluating the presence of explicit strategies according to the criteria of intentionality and reportability:

      “A recent framework for motor learning by Tsay et al. defines explicit strategies as motor plans that are both intentional and reportable (Tsay et al., 2024). Within this framework, Tsay et al. clarify that "intentional" means participants deliberately perform the motor plan, while "reportable" means they are able to clearly articulate it.”

      We then reorganized the Discussion to include a separate section “Mapping operates independently of explicit control”, lines 646-661, where we discuss limitations of the survey methodology and interpretation of the results according to Tsay et al., 2024:

      “Here, we show that explicit strategies are not systematically used to adapt step length asymmetry and Δ motor output: the participants in our study either did not know what they did, reported changes that did not actually occur or would not lead to symmetry. Only one person reported “leaning” on the left (slow) leg for as much time as possible, which is a relevant but incomplete description for how to walk with symmetry. Four reports mentioned pressure or weight, which may indirectly influence symmetry (Hirata et al., 2019; Lauzière et al., 2014), but they were vague and conflicting (e.g., “making heavy steps on the right foot” or “put more weight on my left foot”). All other responses were null, explicitly wrong or irrelevant, or overly generic, like wanting to “stay upright” and “not fall down”. We acknowledge that our testing methodology has limitations. First, it may introduce biases related to memory recall or framing of the questionnaire. Second, while it focuses on participants' intentional use of explicit strategies to control walking, it does not rule out the possibility of passive awareness of motor adjustments or treadmill configurations. Despite these limitations, the motor adjustments reported by participants consistently fail to meet the criteria for explicit strategies as outlined by Tsay et al.: reportability and intentionality (Tsay et al., 2024). Together with existing literature, this supports the interpretation that stimulus response mapping operates automatically.”

      We also made the following addition to the “Limitations” section of the Discussion (lines 917-919):

      “While mapping differs from explicit strategies as they are currently defined, we still lack a comprehensive framework to capture the varying levels and nuanced characteristics of intentionality and awareness of different mechanisms (Tsay et al., 2024).”

      We finally note that “Unlike explicit strategies, which are rapidly acquired and diminish over time, this mapping mechanism exhibits prolonged learning beyond 15 minutes, with a rate comparable to recalibration” (Discussion, lines 632-634).

      (3) The distinction between structural and memory-based differences in the two subgroups was based on the notion that memory-based strategies increase asymmetry. However, an alternative explanation could be that unfamiliar perturbations, due to the ramping up, trigger a surprise signal that leads to greater asymmetry due to reactive corrections to prevent one's fall - not because participants are generalizing from previously learned representations (e.g., (Iturralde & Torres-Oviedo, 2019)).

      We agree that reactive corrections could contribute to the walking pattern in response to split-belt perturbations, as detailed by Iturralde & Torres-Oviedo, 2019. We also acknowledge that reactive corrections are rapid, flexible, feedback-driven, and automatic – characteristics that make them appear similar to stimulus-response mapping. However, a detailed evaluation of our results suggests that the behaviors observed in the ramp tasks cannot be fully explained by reactive corrections. Reactive corrections occur almost immediately, quickly adjusting the walking pattern to reduce error and improve stability. This excludes the possibility that what we identified as stimulus-response mapping could instead be reactive corrections, because the stimulus-response mapping observed in our study is acquired slowly at a rate comparable to recalibration. It also excludes the possibility that the increased asymmetry in the Ramp Up & Down could be due to reactive corrections, because these would operate alongside mapping to help reduce asymmetry rather than exacerbate it.

      We made substantial revisions to the Discussion and included the section “Stimulus-response mapping is flexible but requires learning” to explain this interpretation (lines 595-622):

      “The mapping mechanism observed in our study aligns with the corrective responses described by Iturralde and Torres-Oviedo, which operate relative to a recalibrated "new normal" rather than relying solely on environmental cues (Iturralde and Torres-Oviedo, 2019). Accordingly, our findings suggest a tandem architecture: forward model recalibration adjusts the nervous system's "normal state," while stimulus-response mapping computes motor adjustments relative to this "new normal." This architecture explains the sharp transition from flexible to rigid motor adjustments observed in our Ramp Down task. The transition occurs at the configuration perceived as "equal speeds" (~0.5 m/s speed difference) because this corresponds to the recalibrated “new normal”.

      In the first half of the Ramp Down, participants adequately modulated their walking pattern to accommodate the gradually diminishing perturbation, achieving symmetric step lengths. Due to the recalibrated “new normal”, perturbations within this range are perceived as congruent with the direction of adaptation but reduced in magnitude. This allows the mapping mechanism to flexibly modulate the walking pattern by using motor adjustments previously learned during adaptation. Importantly, the rapid duration of the Ramp Down task rules out the possibility that the observed modulation may instead reflect washout, as confirmed by the fact the aftereffects measured post-Ramp-Down were comparable to previous work (Kambic et al., 2023; Reisman et al., 2005).

      In the second half of the Ramp Down, aftereffects emerged as participants failed to accommodate perturbations smaller than the recalibrated “new normal”. These perturbations were perceived as opposite to the adaptation perturbation and, therefore, novel. Accordingly, the mapping mechanism responded as it would to a newly introduced perturbation, rather than leveraging previously learned adjustments (Iturralde and Torres-Oviedo, 2019). Due to the rapid nature of the Ramp Down, the mapping mechanism lacked sufficient time to learn the novel motor adjustments required for these perturbations – a process that typically takes several minutes, as shown by our baseline ramp tasks and control experiments. As mapping-related learning was negligible, the rigid recalibration adjustments dominated during this phase. Consequently, the walking pattern did not change to accommodate the gradually diminishing perturbation, leading to the emergence of aftereffects.”

      (4) Further contextualization: Recognizing the differences in dependent variables (reaching position vs. leg speed/symmetry in walking), could the Proprioceptive/Perceptual Re-alignment model also apply to gait adaptation (Tsay et al., 2022; Zhang et al., 2024)? Recent reaching studies show a similar link between perception and action during motor adaptation (Tsay et al., 2021) and have proposed a model aligning with the authors' correlations between perception and action. The core signal driving implicit adaptation is the discrepancy between perceived and desired limb position, integrating forward model predictions with proprioceptive/visual feedback.

      We appreciate the reviewer’s suggestion and agree that the Proprioceptive Re-alignment model (PReMo) and the Perceptual Error Adaptation model (PEA) offer valuable insights into the relationship between perception and motor adaptation. To explore whether these frameworks apply to gait adaptation, we conducted an extensive modeling analysis. This is shown in Figure 5 and Supplementary Figures S7-S8, and is detailed in the text of Experiment 1 Results section “Modelling analysis for perceptual realignment” (lines 327–375), Methods section “Proprioceptive re-alignment model (PReMo)” (lines 1181-1221), Methods section “Perceptual Error Adaptation model (PEA)” (lines 1222-1247), Methods section “Perceptuomotor recalibration + mapping (PM-ReMap)” (lines 1248-1286), and SI Appendix section “Evaluation and development of perceptual models.” (lines 99-237).

      First, we evaluated how PReMo and PEA models fitted our Ramp Down data. We translated the original variables to walking adaptation variables using a conceptual equivalence explained by one of the features explored by Tsay et al. (2022). Specifically, the manuscript provides guidance on extending the PReMo model from visuomotor adaptation in response to visual-proprioceptive discrepancies, to force-field adaptation in response to mechanical perturbations – which share conceptual similarities with split-belt treadmill perturbations. The manuscript also discusses that, if vision is removed, the proprioceptive shift decays back to zero according to a decay parameter. This description entails that proprioceptive shift cannot increase or develop in the absence of vision. We applied the models to split-belt adaptation in accordance with this information, as described in the SI Appendix: “PReMo variables equivalents for walking adaptation”. As reported in Experiment 1 Results “Modelling analysis for perceptual realignment” (lines 327–375) and Figure 5, neither PReMo nor PEA adequately captured the key features of our Ramp Down data: “The models could not capture the matching-then-divergent behavior of Δ motor output, performing significantly worse than the recalibration + mapping model (PReMo minus recalibration+mapping BIC difference = 24.591 [16.483, 32.037], PEA minus recalibration+mapping BIC difference = 6.834 [1.779, 12.130], mean [CI]). Furthermore, they could not capture the perceptual realignment and instead predicted that the right leg would feel faster than the left throughout the entire Ramp Down”.
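      As a rough illustration of the kind of model comparison reported above, the sketch below computes a BIC for a least-squares fit and the mean per-participant BIC difference between two models. The Gaussian-residual form of BIC, the function names, and the numbers are assumptions made for illustration; the manuscript's actual fitting procedure and its confidence-interval method are not reproduced here.

```python
import math
from statistics import mean


def bic_least_squares(rss, n_obs, n_params):
    """BIC for a least-squares fit assuming Gaussian residuals (constants dropped)."""
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)


def mean_bic_difference(bic_model_a, bic_model_b):
    """Mean of per-participant (model A minus model B) BIC differences.

    Positive values favor model B, because a lower BIC indicates a better fit.
    """
    diffs = [a - b for a, b in zip(bic_model_a, bic_model_b)]
    return mean(diffs)


# Hypothetical per-participant BIC values, not data from the study.
bic_alternative = [-180.0, -175.5, -190.2]
bic_reference = [-200.0, -199.5, -210.2]
print(mean_bic_difference(bic_alternative, bic_reference))
```

      Read this way, a positive "PReMo minus recalibration+mapping" difference means the recalibration + mapping model is preferred.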

      Second, we used simulations to confirm that PReMo and PEA cannot account for the perceptual realignment observed in our study, and to understand why. At adaptation plateau, PReMo predicts that perceived and actual step length asymmetry converge, as shown in Fig. S7A, top, and as detailed in the SI Appendix “Original PReMo simulations”. We found that this is because PReMo assumes that perceptual realignment arises specifically from mismatches between different sensory modalities. This assumption works for paradigms that introduce an actual mismatch between sensory modalities, such as visuomotor adaptation paradigms with a mismatch between vision and proprioception. This assumption also works for paradigms that indirectly introduce a mismatch between integrated sensory information from different sensory modalities. In force-field adaptation, both proprioceptive and visual inputs are present and realistic, but when these inputs are integrated with sensory predictions, the resulting integrated visual estimate is mismatched compared to the integrated proprioceptive estimate. In contrast, the assumption that perceptual realignment arises from mismatches between sensory modalities does not work for paradigms that involve a single sensory modality. Split-belt adaptation only involves proprioception as no visual feedback is given, and perceptual realignment arises from discrepancies between predicted and actual motor outcomes, rather than between integrated sensory modalities.

      To overcome this limitation, we reinterpreted the variables of the PReMo model, while keeping the original equations, to account for realignment driven by mismatches of the same nature as the perturbation driving adaptation. As reported in the SI Appendix “Iterative simulations for the development of PM-ReMap”, the simulation (Fig. S7A, middle row) “showed perceptual realignment at adaptation plateau, addressing a limitation of the original model. However, it failed to account for the Ramp Down perceptual results, inaccurately predicting that belt speeds feel equal when they are actually equal (Fig. S7A, middle row, perceived perturbation decays alongside actual perturbation and converge to zero at the end of the Ramp Down). […] This occurs because, under the retained PReMo equations, β<sub>p</sub> and β<sub>v</sub> change immediately and are proportional to the difference between and on each trial, so that they ramp down to zero in parallel with the perturbation”.

      We also noted that the simulations of the original and reinterpreted PReMo models could not support the operation of the mapping mechanism observed in the Ramp Down (Fig. S7B). We describe that “This occurs because the overall motor output x<sub>p</sub>, which includes both recalibration and mapping mechanisms, changes gradually according to the learning rate 𝐾. Consequently, changes in 𝐺 take many trials to be fully reflected in x<sub>p</sub>. Hence, we found complementary limitations where PReMo assumes perceptual realignment changes immediately while mapping adjustments develop gradually – but the opposite is true in our data”.

      We therefore modified the PReMo equations and developed a new model, called perceptuomotor recalibration + mapping (PM-ReMap) that addresses these limitations and is able to capture our Ramp Down motor and perceptual results. As described in the SI Appendix “Iterative simulations for the development of PM-ReMap”, “we introduced an update equation for β<sub>p</sub> so that it changes gradually trial-by-trial according to the learning rate 𝐾. We then removed the learning rate from the update equation for x<sub>p</sub> so that it integrates two distinct types of changes: 1) the gradual changes in driven by β<sub>p</sub> and representing the recalibration mechanism, and 2) the immediate changes in 𝐺 – representing the mapping mechanism”. The final equations of the PM-ReMap model are as follows:

      As reported in Experiment 1 Results, “Modelling analysis for perceptual realignment”, and as shown in Fig. 5C, “the PM-ReMap model captured the Δ motor output in the Ramp Down with performance comparable to that of the recalibration + mapping model (BIC difference = 2.381 [-0.739, 5.147], mean [CI]). It also captured perceptual realignment, predicting that some intermediate belt speed difference in the Ramp Down is perceived as “equal speeds” (, Fig. 5C)”. We also found that the estimated aligned with the empirical measurement of the PSE in the Ramp Down both at group and individual level: “At group level, was comparable to the upper bound of compensation<sub>perceptual</sub> (difference = -7 [-15, 1]%, mean [CI]), but significantly larger than the lower bound (difference = 19 [8, 31]%, mean [CI]). Furthermore, we found a significant correlation between individual participants’ and their upper bound of compensation<sub>perceptual</sub> (r=0.63, p=0.003), but not their lower bound (r=0.30, p=0.203). Both sets of results are consistent with those observed for the recalibration + mapping model”.

      Based on these findings, we summarize that PM-ReMap “extends the recalibration + mapping model by incorporating the ability to account for forgetting – typical of state space models – while still effectively capturing both recalibration and mapping mechanisms. However, performance of the PM-ReMap model does not exceed that of the simpler recalibration + mapping model, suggesting that forgetting and unlearning do not have a substantial impact on the Ramp Down”.

      Reviewer #2 (Public review):

      Recent findings in the field of motor learning have pointed to the combined action of multiple mechanisms that potentially contribute to changes in motor output during adaptation. A nearly ubiquitous motor learning process occurs via the trial-by-trial compensation of motor errors, often attributed to cerebellar-dependent updating. This error-based learning process is slow and largely unconscious. Additional learning processes that are rapid (e.g., explicit strategy-based compensation) have been described in discrete movements like goal-directed reaching adaptation. However, the role of rapid motor updating during continuous movements such as walking has been either under-explored or inconsistent with those found during the adaptation of discrete movements. Indeed, previous results have largely discounted the role of explicit strategy-based mechanisms for locomotor learning. In the current manuscript, Rossi et al. provide convincing evidence for a previously unknown rapid updating mechanism for locomotor adaptation. Unlike the now well-studied explicit strategies employed during reaching movements, the authors demonstrate that this stimulus-response mapping process is largely unconscious. The authors show that in approximately half of subjects, the mapping process appears to be memory-based while the remainder of subjects appear to perform structural learning of the task design. The participants that learned using a structural approach had the capability to rapidly generalize to previously unexplored regions of the perturbation space.

      One result that will likely be particularly important to the field of motor learning is the authors' quite convincing correlation between the magnitude of proprioceptive recalibration and the magnitude of error-based updating. This result beautifully parallels results in other motor learning tasks and appears to provide a robust marker for the magnitude of the mapping process (by means of subtracting off the contribution of error-based motor learning). This is a fascinating result with implications for the motor learning field well beyond the current study.

      A major strength of this manuscript is the large sample size across experiments and the extent of replication performed by the authors in multiple control experiments.

      Finally, I commend the authors on extending their original observations via Experiment 2. While it seems that participants use a range of mapping mechanisms (or indeed a combination of multiple mapping mechanisms), future experiments may be able to tease apart why some subjects use memory versus structural mapping. A future ability to push subjects to learn structurally-based mapping rules has the potential to inform rehabilitation strategies.

      Overall, the manuscript is well written, the results are clear, and the data and analyses are convincing. The manuscript's weaknesses are minor, mostly related to the presentation of the results and modeling.

      Weaknesses:

      The overall weaknesses in the manuscript are minor and can likely be addressed with textual changes.

      (1) A key aspect of the experimental design is the speed of the "ramp down" following the adaptation period. If the ramp-down is too slow, then no after-effects would be expected even in the alternative recalibration-only/error-based-only hypothesis. How did the authors determine the appropriate rate of ramp-down? Do alternative choices of ramp-down rates result in step length asymmetry measures that are consistent with the mapping hypothesis?

      We thank the reviewer for their insightful comment regarding the rate of the Ramp Down following the adaptation period and its potential impact on aftereffects under different hypotheses. We added a detailed explanation for how we determined the Ramp Down design, including analyses of previous work, to the SI Appendix, “Ramp Down design”, lines 22-98. We also describe the primary points in the main Methods section, “Ramp Tasks”, lines 978-991:

      As described in SI Appendix, “Ramp Down design”, the Ramp Down task was specifically designed to measure the pattern of aftereffects in a way that ensured reliable and robust measurements with sufficient resolution across speeds, and that minimized washout to prevent confounding the results. To balance time constraints with a measurement resolution adequate for capturing perceptual realignment, we used 0.05 m/s speed decrements, matching the perceptual sensitivity estimated from our re-analysis of the baseline data from Leech et al. (Leech et al., 2018a). To obtain robust motor aftereffect measurements, we collected three strides at each speed condition, as averaging over three strides represents the minimum standard for consistent and reliable aftereffect estimates in split-belt adaptation (typically used in catch trials) (Leech et al., 2018a; Rossi et al., 2019; Vazquez et al., 2015). To minimize unwanted washout by forgetting and/or unlearning, we did not pause the treadmill between adaptation and the post-adaptation ramp tasks, and ensured the Ramp Down was relatively quick, lasting approximately 80 seconds on average. Of note, the Ramp Down design ensures that even in cases of partial forgetting, the emergence pattern of aftereffects remains consistent with the underlying hypotheses.

      In the SI Appendix, we explain that, while we did not test longer ramp-down durations directly, previous data suggest that durations of up to at least 4.5 minutes would yield step length asymmetry measures consistent with our results and the mapping hypothesis. Additionally, our control experiments replicated the behavior observed in the Ramp Down using speed match tasks lasting only 30 seconds, further supporting the robustness of our findings across varying durations.
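      The Ramp Down schedule described above (0.05 m/s decrements, three strides per speed condition) can be sketched as follows. The starting speed difference, the end point of zero, and the function name are assumptions for illustration only; they are not taken from the Methods.

```python
def ramp_down_schedule(start_diff_cm_s, step_cm_s=5, strides_per_speed=3):
    """Return the belt speed difference (m/s) for each stride of a ramp down.

    Speeds are handled in integer cm/s so that repeated 0.05 m/s decrements
    do not accumulate floating-point error.
    """
    if start_diff_cm_s % step_cm_s != 0:
        raise ValueError("start difference must be a multiple of the step")
    schedule = []
    for diff in range(start_diff_cm_s, -1, -step_cm_s):
        schedule.extend([diff / 100] * strides_per_speed)
    return schedule


# Hypothetical starting difference of 1.00 m/s (an assumption for this sketch).
strides = ramp_down_schedule(100)
print(len(strides), strides[0], strides[-1])
```

      Under these assumptions, the number of strides grows linearly with the starting difference, which is the trade-off between measurement resolution and task duration that the response describes.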

      (2) Overall, the modeling as presented in Figure 3 (Equations 1-3) is a bit convoluted. To my mind, it would be far more useful if the authors reworked Equations 1-3 and Figure 3 (with potential changes to Figure 2) so that the motor output (u) is related to the stride rather than the magnitude of the perturbation. There should be an equation relating the forward model recalibration (i.e., Equation 1) to the fraction of the motor error on a given stride, something akin to u(k+1) = r * (u(k) - p(k)). This formulation is easier to understand and commonplace in other motor learning tasks (and likely what the authors actually fit given the Smith & Shadmehr citation and the derivations in the Supplemental Materials). Such a change would require that Figure 3's independent axes be changed to "stride," but this has the benefit of complementing the presentation that is already in Figure 5.

      We reworked these equations (now numbered 4-6, lines 207-209) so that the motor output u is related to stride k as suggested by the reviewer:

      We changed Figure 2 and Figure 3 accordingly, adding a “stride” x-axis to the Ramp Down data figure.

      Reviewer #2 (Recommendations for the authors):

      I think that some changes to the text/ordering could improve the manuscript's readability. In particular:

      (1) My feeling is that many of the equations presented in the Methods section should be moved to the Results section, particularly Equations 9-11. The introduction of these motor measures should likely precede Figure 1, as their definitions form the crux of Figure 1 and the subsequent analyses.

      (2) It is unclear to me why many of the analyses and discussion points have been relegated to Supplemental Material. I would significantly revise the manuscript to move much of the content from Supplemental Material to the Methods and Discussion (where appropriate). Even the Todorov and Herzfeld models can likely simply be referenced in the text without a need for their full description in the Supplemental material - as their implementations appear to this reviewer as consistent with those presented in the respective papers. Beyond the Supplementary Tables, my feeling is that nearly all of the content in Supplemental can either be simply cited (e.g. alternative model implementations) or directly incorporated into the main manuscript without compromising the readability of the manuscript.

      We reorganized the manuscript and SI Appendix substantially, moving content to the Results or other main text section. The changes included those recommended by the reviewer:

      • We moved the equations describing step length asymmetry, perturbation, and Δ motor output (originally numbered Eq. 9-11) to the Results section (Experiment 1, “Motor paradigm and hypothesis”, lines 131-133, now numbered Eq. 1-3).

      • We moved Supplementary Methods to the main Methods section.

      • We moved the most relevant content of the Supplementary Discussion to the main Discussion, and removed the less relevant content altogether.

      • We moved the methods describing walking-adaptation specific implementation of the Todorov and Herzfeld models to the main Methods section and removed the portions that were identical to the original implementation.

      • We moved the control experiments to the main text (main Results and Methods sections).

      • We removed the SI Appendix section “Experiment 1 mechanisms characteristics”.

      Reviewer #3 (Public review):

      Summary:

      In this work, Rossi et al. use a novel split-belt treadmill learning task to reveal distinct sub-components of gait adaptation. The task involved following a standard adaptation phase with a "ramp-down" phase that helped them dissociate implicit recalibration and more deliberate SR map learning. Combined with modeling and re-analysis of previous studies, the authors show multiple lines of evidence that both processes run simultaneously, with implicit learning saturating based on intrinsic learning constraints and SR learning showing sensitivity to a "perceptual" error. These results offer a parallel with work in reaching adaptation showing both explicit and implicit processes contributing to behavior; however, in the case of gait adaptation the deliberate learning component does not appear to be strategic but is instead a more implicit SR learning process.

      Strengths:

      (1) The task design is very clever and the "ramp down" phase offers a novel way to attempt to dissociate competing models of multiple processes in gait adaptation.

      (2) The analyses are thorough, as is the re-analysis of multiple previous data sets.

      (3) The querying of perception of the different relative belt speeds is a very nice addition, allowing the authors to connect different learning components with error perception.

      (4) The conceptual framework is compelling, highlighting parallels with work in reaching but also emphasizing differences, especially w/r/t SR learning versus strategic behaviors. Thus the discovery of an SR learning process in gait adaptation would be both novel and also help conjoin different siloed subfields of motor learning research.

      Weaknesses:

      (1) The behavior in the ramp-down phase does indeed appear to support multiple learning processes. However, I may have missed something, but I have a fundamental worry about the specific modeling and framing of the "SR" learning process. If I correctly understand, the SR process learns by adjusting to perceived L/R belt speed differences (Figure 7). What is bugging me is why that process would not cause the SR system to still learn something in the later parts of the ramp-down phase when the perceived speed differences flip (Figure 4). I do believe this "blunted learning" is what the SR component is actually modeled with, given this quote in the caption to Figure 7: "When the perturbation is perceived to be opposite than adaptation, even if it is not, mapping is zero and the Δ motor output is constant, reflecting recalibration adjustments only." It seems a priori odd and perhaps a little arbitrary to me that a SR learning system would just stop working (go to zero) just because the perception flipped sign. Or for that matter "generalize" to a ramp-up (i.e., just learn a new SR mapping just like the system did at the beginning of the first perturbation). What am I missing that justifies this key assumption? Or is the model doing something else? (if so that should be more clearly described).

      We concur that this point was confusing, and we performed additional analyses and revised the text to improve clarity. Specifically, we clarify that the stimulus-response mapping does indeed still learn in the second portion of the Ramp Down, when the perceived speed differences flip. However, learning by the mapping mechanism proceeds slowly – at a rate comparable to that of forward model recalibration, taking several minutes. The duration of the task is relatively short, so that learning by the mapping mechanism is limited. We schematize the learning to be zero as an approximation. We have now included an additional modelling analysis (as part of our expanded perceptual modelling analyses), which shows there is no significant improvement in modelling performance when accounting for forgetting of recalibration or learning in the opposite direction by mapping in the second half of the ramp down, supporting this approximation. We explain this and other revisions in detail below.

      We include a Discussion section “Stimulus-response mapping is flexible but requires learning” where we improve our explanation of the operation of the mapping mechanism in the Ramp Down by leveraging the framework proposed by Iturralde and Torres-Oviedo, 2019. The section first explains that mapping operates relative to a new equilibrium corresponding to the current forward model calibration (lines 595-603):

      “The mapping mechanism observed in our study aligns with the corrective responses described by Iturralde and Torres-Oviedo, which operate relative to a recalibrated "new normal" rather than relying solely on environmental cues (Iturralde and Torres-Oviedo, 2019). Accordingly, our findings suggest a tandem architecture: forward model recalibration adjusts the nervous system's "normal state," while stimulus-response mapping computes motor adjustments relative to this "new normal." This architecture explains the sharp transition from flexible to rigid motor adjustments observed in our Ramp Down task. The transition occurs at the configuration perceived as "equal speeds" (~0.5 m/s speed difference) because this corresponds to the recalibrated “new normal”.”

      The following paragraph (lines 604-611) explains how this concept is reflected in the first half of the Ramp Down:

      “In the first half of the Ramp Down, participants adequately modulated their walking pattern to accommodate the gradually diminishing perturbation, achieving symmetric step lengths. Due to the recalibrated “new normal”, perturbations within this range are perceived as congruent with the direction of adaptation but reduced in magnitude. This allows the mapping mechanism to flexibly modulate the walking pattern by using motor adjustments previously learned during adaptation. Importantly, the rapid duration of the Ramp Down task rules out the possibility that the observed modulation may instead reflect washout, as confirmed by the fact the aftereffects measured post-Ramp-Down were comparable to previous work (Kambic et al., 2023; Reisman et al., 2005).”

      The last paragraph (lines 612–622) explains the second half of the Ramp Down in light of the equilibrium concept and of the slow learning rate of mapping:

      “In the second half of the Ramp Down, aftereffects emerged as participants failed to accommodate perturbations smaller than the recalibrated “new normal”. These perturbations were perceived as opposite to the adaptation perturbation and, therefore, novel. Accordingly, the mapping mechanism responded as it would to a newly introduced perturbation, rather than leveraging previously learned adjustments (Iturralde and Torres-Oviedo, 2019). Due to the rapid nature of the Ramp Down, the mapping mechanism lacked sufficient time to learn the novel motor adjustments required for these perturbations – a process that typically takes several minutes, as shown by our baseline ramp tasks and control experiments. As mapping-related learning was negligible, the rigid recalibration adjustments dominated during this phase. Consequently, the walking pattern did not change to accommodate the gradually diminishing perturbation, leading to the emergence of aftereffects.”
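      The tandem architecture described in the three quoted paragraphs can be sketched schematically as follows. This is a simplified illustration, not the fitted model: the normalized units, the fixed recalibration value, and the function names are assumptions, and mapping is approximated as exactly zero once the perturbation falls below the recalibrated "new normal".

```python
def delta_motor_output(perturbation, recalibration):
    """Schematic ramp-down output: rigid recalibration plus flexible mapping.

    Mapping is approximated as zero whenever the perturbation falls below
    the recalibrated "new normal", so the output plateaus at `recalibration`.
    """
    mapping = max(perturbation - recalibration, 0.0)
    return recalibration + mapping


def aftereffect(perturbation, recalibration):
    """Mismatch between the motor output and the perturbation actually present."""
    return delta_motor_output(perturbation, recalibration) - perturbation


# Hypothetical normalized values: the perturbation ramps from 1.0 to 0.0 and
# recalibration is fixed at 0.5 (an illustrative number, not a fitted one).
ramp = [step / 10 for step in range(10, -1, -1)]
for p in ramp:
    print(p, delta_motor_output(p, 0.5), aftereffect(p, 0.5))
```

      In this sketch the output tracks the perturbation in the first half of the ramp and stays constant in the second half, so the mismatch grows only after the perturbation drops below the recalibration level.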

      We also revised the Discussion section “Mapping operates as memory-based in some people, structure-based in others”, to clarify the processes of interpolation and extrapolation (lines 689-700). This revision helps explain why mapping may generalize to a ramp-up faster than learning a perturbation perceived in the opposite direction (when considered together with the explanation that mapping operates relative to the new recalibrated equilibrium). In the former case (generalize to a ramp-up), a structure-based mapping can use the extrapolation computation: it leverages previous knowledge of which gait parameters should be modified and how – e.g., modulating the positioning of our right foot to be more forward on the treadmill – but must extrapolate the specific parameter values – e.g., how much farther forward. In the latter case (learning a perturbation perceived in the opposite direction), even a structure-based mapping would need to figure out what gait parameters to change completely anew – e.g., modulating the positioning of the foot in the opposite way, to be less forward, requires a different set of control policies.

      We mentioned above that this illustration of the mapping mechanism relies on the assumption that the additional learning of the mapping mechanism in the second half of the Ramp Down is negligible. As part of our revisions for the “Modelling analysis for perceptual realignment”, we developed a new model – the perceptuomotor recalibration + mapping model (PM-ReMap) – that extends the recalibration + mapping model by accounting for the possibility that Δ motor output is not constant in the second half of the Ramp Down (main points are at lines 355-375, and Figure 5; see response to Reviewer #1 (Public review), Comment 4, for a detailed explanation). We find that performance of the PM-ReMap model does not exceed that of the simpler recalibration + mapping model, suggesting that the Δ motor output does not change substantially in the second half of the Ramp Down. Note that, if the Δ motor output decayed in this phase, it could be due to forgetting or unlearning of the recalibration mechanism, or it could also be due to the mapping mechanism learning in the opposite direction than it did in adaptation. In the Results section, we focused on describing recalibration forgetting/unlearning for simplicity. However, in the Discussion section “Mapping may underly savings upon re-exposure to the same or different perturbation”, we explain in detail how the motor aftereffects also depend on the mapping mechanism learning in the opposite direction, as corroborated by our Control experiments and previous work. Therefore, the finding that the PM-ReMap model performance does not exceed that of the simpler recalibration + mapping model suggests that neither effect – recalibration forgetting/unlearning or opposite-direction learning of mapping – is significant, nor is their combined effect on the Δ motor output.

      (2) A more minor point, but given the sample size it is hard to be convinced about the individual difference analysis for structure learning (Figure 5). How clear is it that these two groups of subjects are fully separable and not on a continuum? The lack of clusters in another data set seems like a somewhat less than convincing control here.

      We performed an additional analysis – a silhouette analysis – to confirm the presence of these clusters in our data (Methods, lines 1070-1072). The results, reported in Experiment 2 Results, lines 487-490, confirmed that there is strong evidence for the presence of these clusters:

      “A silhouette analysis confirmed strong evidence for these clusters: the average silhouette score was 0.90, with 19 of 20 participants scoring above 0.7 – considered strong evidence – and one scoring between 0.5 and 0.7 – considered reasonable evidence (Dalmaijer et al., 2022; Kaufman and Rousseeuw, 1990; Rousseeuw, 1987).”
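      For readers unfamiliar with silhouette analysis, the sketch below computes per-point silhouette scores for one-dimensional data with known cluster labels and applies the thresholds quoted above. It is a minimal illustration under stated assumptions: the data, the 1-D distance, the function names, and the "weak" label for scores below 0.5 are hypothetical, and the manuscript's actual clustering features are not reproduced.

```python
from statistics import mean


def silhouette_scores(values, labels):
    """Per-point silhouette scores for 1-D data with given cluster labels."""
    clusters = {}
    for value, label in zip(values, labels):
        clusters.setdefault(label, []).append(value)
    if len(clusters) < 2:
        raise ValueError("silhouette requires at least two clusters")

    scores = []
    for value, label in zip(values, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(abs(value - other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean([abs(value - other) for other in members])
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


def evidence_label(score):
    """Thresholds quoted in the response; the 'weak' label is an assumption."""
    if score > 0.7:
        return "strong"
    if score >= 0.5:
        return "reasonable"
    return "weak"


# Hypothetical values forming two well-separated groups.
example_scores = silhouette_scores([0.0, 0.1, 1.0, 1.1], [0, 0, 1, 1])
print([evidence_label(s) for s in example_scores])
```

      Scores near 1 indicate that a participant sits much closer to their own cluster than to the other one, which is why a high average score argues against a single continuum.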

      Reviewer #3 (Recommendations for the authors):

      (1) I think there is far too much content pushed into the supplement. The other models and full model comparison should be in the main text, as should the re-analysis of previous data sets. Also, key discussion points should not be in the supplement either.

      We reorganized the manuscript and SI Appendix substantially, including the changes recommended by the reviewer. Please refer to our response to “Reviewer #2 - Recommendations for the authors” for a detailed explanation.

      (2) Line 649: in reaching the calibration system does respond to different error sizes; why not here?

      We apologize for the confusion. Similar to reaching adaptation, the recalibration in walking adaptation also scales based on the error size experienced in adaptation. What we meant to convey is that, once a calibration has been acquired in adaptation, the recalibration process is rigid in that it can only change gradually. So if we jump the perturbation to a different value, the original calibration is transiently used until the system has the time to recalibrate again. For example, if we jump abruptly from the adaptation perturbation to a perturbation of zero in post-adaptation, the adaptation calibration persists, resulting in aftereffects.

      We revised the manuscript to clarify these points. First, we explicitly report that forward model recalibration scales based on the error size experienced in adaptation:

      “We next compared Medium Descend and Small Abrupt (1m/s or 0.4m/s perturbation), and found that recalibration contributed significantly more for the smaller perturbation (larger compensation<sub>perceptual</sub> / compensation<sub>motor-total</sub> in Small Abrupt than Medium Descend, Fig. 8A middle and Table S6).” (Control experiments Results, lines 422-425)

      “the mapping described here shares some characteristics with explicit mechanisms, such as flexibility and modulation by error size” (Discussion, lines 630-631)

      Additionally, we leverage the framework proposed by Tsay et al., 2024, to improve our explanation of the characteristics of the different learning mechanisms. Please refer to our response to “Reviewer #1 (Public review)”, Comment (1).

      (3) It would be nice to see bar graphs showing model comparison results for each individual subject in the main text, and to see how many subjects are best fit by the SR+calibration model.

      We added the recommended bar graphs to Figure 3 and Figure 5.

      (4) Why exactly does the "perturbation" in Figure 3 have error bars?

      In walking adaptation, the perturbation that participants experienced is closely dictated by the treadmill belt speeds, but not exactly, because participants are free to move their feet as they like, so that their ankle movement may not always match the treadmill belts exactly. Therefore, we record the perturbation that is actually experienced by each participant’s feet using markers. We then display the mean and standard error of this perturbation.

      We moved the equation describing the perturbation measure from the Methods to the Experiment 1 Results (lines 131-133, Eq. 1-3). We believe this change will help the reader understand the measures depicted.
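      The mean and standard error shown for the perturbation can be computed along these lines. This is a minimal sketch: the per-participant values are hypothetical, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of participants.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_standard_error(samples):
    """Mean and standard error (sample SD / sqrt(n)) of per-participant values."""
    if len(samples) < 2:
        raise ValueError("need at least two participants")
    return mean(samples), stdev(samples) / sqrt(len(samples))


# Hypothetical measured perturbations (arbitrary units) at one belt-speed
# setting, one value per participant; not data from the study.
measured = [0.98, 1.03, 0.95, 1.01]
print(mean_and_standard_error(measured))
```

      Because each participant's feet follow the belts slightly differently, the measured values scatter around the nominal setting, and that scatter is what the error bars summarize.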

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This valuable work provides a near-complete description of the mechanosensory bristles on the Drosophila melanogaster head and the anatomy and projection patterns of the bristle mechanosensory neurons that innervate them. The data presented are solid. The study has generated numerous invaluable resources for the community that will be of interest to neuroscientists in the field of circuits and behaviour, particularly those interested in mechanosensation and behavioural sequence generation.

      We express our gratitude to the Reviewers for their valuable suggestions, which significantly enhanced the manuscript. The revisions were undertaken, not with the expectation of acceptance, but rather driven by our sincere belief that these revisions would enhance the manuscript's impact for future readers.

      Public Reviews:

      Reviewer #1 (Public Review):

      Sensory neurons of the mechanosensory bristles on the head of the fly project to the subesophageal zone (SEZ). In this manuscript, the authors have built on a large body of previous work to comprehensively classify and quantify the head bristles. They broadly identify the nerves that various bristles use to project to the SEZ and describe their region-specific innervation in the SEZ. They use dye-fills, clonal labelling, and electron microscopic reconstructions to describe in detail the phenomenon of somatotopy - conserved peripheral representations within the central brain - within the innervation of these neurons. In the process they develop novel tools to access subsets of these neurons. They use these to demonstrate that groups of bristles in different parts of the head control different aspects of the grooming sequence.

      Reviewer #2 (Public Review):

      The authors combine genetic tools, dye fills and connectome analysis techniques to generate a "first-of-its-kind", near complete, synaptic resolution map of the head bristle neurons of Drosophila. While some of the BMN anatomy was already known based on previous work by the authors and other researchers, this is the first time a near complete map has been created for the head BMNs at electron microscopy resolution.

      Strengths:

      (1) The authors cleverly use techniques that allow moving back and forth between periphery (head bristle location) and brain, as well as moving between light microscopy and electron microscopy data. This allows them to first characterize the pathways taken by different head BMNs to project to the brain and also characterize anatomical differences among individual neurons at the level of morphology and connectivity.

      (2) The work is very comprehensive and results in a near complete map of all head BMNs.

      (3) Authors also complement this anatomical characterization with a first-level functional analysis using optogenetic activation of BMNs that results in expected directed grooming behavior.

      Weaknesses:

      (1) The clustering analysis is compelling but cluster numbers seem to be arbitrarily chosen instead of by using some informed metrics.

      We made revisions to the manuscript that address this concern. Please see our response to “recommendations for authors” for a description of these revisions.

      (2) It could help provide context if authors revealed some of the important downstream pathways that could explain optogenetics behavioral phenotypes and previously shown hierarchical organization of grooming sequences.

      We made revisions to the manuscript that address this recommendation. Please see our response to “recommendations for authors” for a description of these revisions.

      (3) In contrast to the rigorous quantitative analysis of the anatomical data, the behavioral data is analyzed using much more subjective methods. While I do not think it is necessary to perform a rigorous analysis of behaviors in this anatomy focused manuscript, the conclusions based on behavioral analysis should be treated as speculative in the current form e.g. calling "nodding + backward walking" as an avoidance response is not justified as it currently stands. Strong optogenetic activation could lead to sudden postural changes that due to purely biomechanical constraints could lead to a couple of backward steps as seen in the example videos. Moreover since the quantification is manual, it is not clear what the analyst interprets as backward walking or nodding. Interpretation is also concerning because controls show backward walking (although in fewer instances based on subjective quantification).

      While unbiased machine vision-based methods would nicely complement the present work, this type of analysis cannot yet distinguish between different head grooming movements. Therefore, we are currently limited to manual annotation for our behavioral analysis. That said, we do not believe that our manual annotation is subjective. The grooming movements that we examine in this work are distinguishable from each other through frame-by-frame manual annotation of video at 30 fps. Our annotations of the grooming and backward motions performed by flies are based on previous publications that established a controlled vocabulary defining each movement (Hampel et al., 2020a, 2017, 2015; Seeds et al., 2014). In this work, we added head nodding to this controlled vocabulary, as described in the Materials and methods. We have added text to the third paragraph of the Materials and methods section entitled “Behavioral analysis procedures” that we hope better describes our behavioral analysis. This description now reads:

      Head nodding was annotated when the fly tilted its head downward by any amount until it returned its head back in its original position. This movement often occurred in repeated cycles. Therefore, the “start” was scored at the onset of the first forward movement and the “stop” when the head returned to its original position on the last nod.

      We do not make any firm conclusions about the head movements (nodding) and backwards motions. We refer to nodding as a descriptive term that would allow the reader to better understand what the behavior looks like. We make no firm conclusions about any behavioral functional role that either the nodding or the backward motions might have, with the exception of nodding in the context of grooming. We only suggest that the behaviors appear to be avoidance responses. Furthermore, backward walking was not mentioned. Instead we refer to backward motions. We are only reporting our annotations of these movements that do occur, and are significantly different from controls. We speculate that these could be avoidance responses based on support from the literature. Future studies will be required to understand whether these movements serve real behavioral roles.

      Summary:

      The authors end up generating a near-complete map of head BMNs that will serve as a long-standing resource to the Drosophila research community. This will directly shape future experiments aimed at modeling or functionally analyzing the head grooming circuit to understand how somatotopy guides behaviors.

      Reviewer #3 (Public Review):

      Eichler et al. set out to map the locations of the mechanosensory bristles on the fly head, examine the axonal morphology of the bristle mechanosensory neurons (BMNs) that innervate them, and match these to electron microscopy reconstructions of the same BMNs in a previously published EM volume of the female adult fly brain. They used BMN synaptic connectivity information to create clusters of BMNs that they show occupy different regions of the subesophageal zone brain region and use optogenetic activation of subsets of BMNs to support the claim that the morphological projections and connectivity of defined groups of BMNs are consistent with the parallel model for behavioral sequence generation.

      The authors have beautifully cataloged the mechanosensory bristles and the projection paths and patterns of the corresponding BMN axons in the brain using detailed and painstaking methods. The result is a neuroanatomy resource that will be an important community resource. To match BMNs reconstructed in an electron microscopy volume of the adult fly brain, the authors matched clustered reconstructed BMNs with light-level BMN classes using a variety of methods, but evidence for matching is only summarized and not demonstrated in a way that allows the reader to evaluate the strength of the evidence. The authors then switch from morphology-based categorization to non-BMN connectivity as a clustering method, which they claim demonstrates that BMNs form a somatotopic map in the brain. This map is not easily appreciated, and although contralateral projections in some populations are clear, the distinct projection zones that are mentioned by the authors are not readily apparent. Because of the extensive morphological overlap between connectivity-based clusters, it is not clear that small projection differences at the projection level are what determines the post-synaptic connectivity of a given BMN cluster or their functional role during behavior. The claim that the somatotopic organization of BMN projections is preserved among their postsynaptic partners to form parallel sensory pathways is not supported by the result that different connectivity clusters still have high cosine similarity in a number of cases (i.e. Clusters 1 and 3, or Clusters 1 and 2). Finally, the authors use tools that were generated during the light-level characterization of BMN projections to show that specifically activating BMNs that innervate different areas of the head triggers different grooming behaviors. In one case, activation of a single population of sensory bristles (InOm) triggers two different behaviors, both eye and dorsal head grooming. This result does not seem consistent with the parallel model, which suggests that these behaviors should be mutually exclusive and rely on parallel downstream circuitry.

      We made revisions to the manuscript that address this recommendation. Please see our response to “recommendations for authors” for a description of these revisions.

      This work will have a positive impact on the field by contributing a complete accounting of the mechanosensory bristles of the fruit fly head, describing the brain projection patterns of the BMNs that innervate them, and linking them to BMN sensory projections in an electron microscopy volume of the adult fly brain. It will also have a positive impact on the field by providing genetic tools to help functionally subdivide the contributions of different BMN populations to circuit computations and behavior. This contribution will pave the way for further mechanistic study of central circuits that subserve grooming circuits.

      Recommendations for the authors:

      All three reviewers appreciated the work presented in this manuscript. There were also a few overlapping concerns that were raised that are summarised below, should the authors wish to address them:

      Somatotopy: We recommend that the authors describe the extent of prior knowledge in more detail to highlight their contribution better.

      We made revisions that better highlight the extent of prior knowledge about somatotopy. We describe how previous studies showed bristle mechanosensory neurons in insects are somatotopically organized, but these studies were not comprehensive descriptions of complete somatotopic maps for the head or body. To our knowledge, our study provides the first comprehensive and synaptic resolution somatotopic map of a head for any animal. This sets the stage for the complete definition of the interface between somatotopically-organized mechanosensory neurons and postsynaptic circuits, which has broad implications for future studies on aimed grooming, and mechanosensation in general. Below we itemize revisions to the Introduction, Discussion, and Figures to provide a clearer statement of the significance of our study as it relates to somatotopy.

      (1) Newly added Figure 1 – figure supplement 1 more explicitly grounds the study in somatotopy, providing a working model of the organization of the circuit pathways that produce the grooming sequence. This model features somatotopy as shown in Figure 1 – figure supplement 1C.

      (2) Figure 1 – figure supplement 1 is incorporated into the Introduction in the second, third, and fourth paragraphs, the first paragraph of the Results section titled “Somatotopically-organized parallel BMN pathways”, and the second and third paragraphs of the last Discussion section titled “Parallel circuit architecture underlying the grooming sequence”.

      (3) We added text to the end of the fourth paragraph of the Introduction that now reads: “In this model, parallel-projecting mechanosensory neurons that respond to stimuli at specific locations on the head or body could connect with somatotopically-organized parallel circuits that elicit grooming of those locations (Figure 1 – figure supplement 1A-C). The previous discovery of a mechanosensory-connected circuit that elicits aimed grooming of the antennae provides evidence of this organization (Hampel 2015). However, the extent to which distinct circuits elicit grooming of other locations is unknown, in part, because the somatotopic projections of the mechanosensory neurons have not been comprehensively defined for the head or body.”

      (4) There is a Discussion section that further explains the extent of prior knowledge and our contributions on somatotopy that is titled “A synaptic resolution somatotopic map of the head BMNs”. Additionally, the previous version of this section had a paragraph on the broader implications of our work as it relates to somatotopy across species. In light of the reviewer comments, we decided to make this paragraph into its own Discussion section to better highlight the broader significance of our work. This section is titled “First synaptic resolution somatotopic map of the head”.

      The somatotopy isn't overtly obvious - perhaps they could try mapping presynaptic sites and provide landmarks to improve visualisation.

      We made the following revisions to better highlight the head BMN somatotopy. One point of confusion from the previous manuscript version stemmed from us not explicitly defining the somatotopic organization that we observed. There seemed to be confusion that we were defining the head somatotopy based only on the small projection differences among BMNs from neighboring head locations. While we believe that these small differences indeed correspond to somatotopy, we failed to highlight that there are overt differences in the brain projections of BMNs from distant locations on the head. For example, Figure 5B (right panel) shows the distinct projections between the LabNv (brown) and AntNv (blue) BMNs that innervate bristles on the ventral and dorsal head, respectively. Thus, BMN types innervating neighboring bristles show overlapping projections with small projection differences, whereas those innervating distant bristles show non-overlapping projections into distinct zones.

      Our analysis of postsynaptic connectivity similarity also shows somatotopic organization among the BMN postsynaptic partners, as BMN types innervating the same or neighboring bristle populations show high connectivity similarity (Figure 8, old Figure 7). Below we highlight major revisions to the text and Figures that hopefully better reveal the head somatotopy.

      (1) In the last paragraph of the Introduction we added text that explicitly frames the experiments in terms of somatotopic organization: “This reveals somatotopic organization, where BMNs innervating neighboring bristles project to the same zones in the CNS while those innervating distant bristles project to distinct zones. Analysis of the BMN postsynaptic connectome reveals that neighboring BMNs show higher connectivity similarity than distant BMNs, providing evidence of somatotopically organized postsynaptic circuit pathways.”

      (2) We mention an example of overt somatotopy from Figure 5 in the Results section titled “EM-based reconstruction of the head BMN projections in a full adult brain”. The text reads “For example, BMNs from the Eye- and LabNv have distinct ventral and anterior projections, respectively. This shows how the BMNs are somatotopically organized, as their distinct projections correspond to different bristle locations on the head (Figure 5B,C).”

      (3) In new Figure 8 (part of old Figure 7), we modified panels that correspond to the cosine similarity analysis of postsynaptic connectivity. The major revision was to plot the cosine similarity clusters onto the head bristles so that the bristles are now colored based on their clusters (C). This shows how neighboring BMNs cluster together, and therefore show similar postsynaptic connectivity. We believe that this provides a nice visualization of somatotopic organization in BMN postsynaptic connectivity. We also added the clustering dendrogram as recommended by Reviewer #2 (Figure 8A).

      (4) In new Figure 8, we added new panels (D-F) that summarize our anatomical and connectomic analysis showing different somatotopic features of the head BMNs. Different BMN types innervate bristles at neighboring and distant proximities (D). BMNs that innervate neighboring bristles project into overlapping zones (E, example of reconstructed BM-Fr and -Ant neurons with non-overlapping BM-MaPa neurons) and show postsynaptic connectivity similarity (F, example connectivity map of three BM types on cosine similarity data).

      (5) To accompany the new Figure 8D-F panels, we added a paragraph to summarize the different somatotopic features of the head BMNs that were identified based on our anatomical and connectomic analysis. This is the last paragraph in the Results section titled “Somatotopically-organized parallel BMN pathways”:

      Our results reveal head bristle proximity-based organization among the BMN projections and their postsynaptic partners to form parallel mechanosensory pathways. BMNs innervating neighboring bristles project into overlapping zones in the SEZ, whereas those innervating distant bristles project to distinct zones (example of BM-Fr, -Ant, and -MaPa neurons shown in Figure 8D,E). Cosine similarity analysis of BMN postsynaptic connectivity revealed that BMNs innervating the same bristle populations (same types) have the highest connectivity similarity. Figure 8F shows example parallel connections for BM-Fr, -Ant, and -MaPa neurons (vertical arrows), where the edge width indicates the number of synapses from each BMN type to their major postsynaptic partners. Additionally, BMNs innervating neighboring bristle populations showed postsynaptic connectivity similarity, while BMNs innervating distant bristles show little or none. For example, BM-Fr and -Ant neurons have connections to common postsynaptic partners, whereas BM-MaPa neurons show only weak connections with the main postsynaptic partners of BM-Fr or -Ant neurons (Figure 8F, connections under 5% of total BMN output omitted). These results suggest that BMN somatotopy could have different possible levels of head spatial resolution, from specific bristle populations (e.g. Ant bristles), to general head areas (e.g. dorsal head bristles).
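      The 5% cutoff mentioned for Figure 8F can be illustrated with a minimal sketch. It assumes that the cutoff is applied to the fraction of one BMN type's total output synapses that goes to each partner; the function name, the partner names, and the synapse counts are invented for illustration and are not taken from the dataset.

```python
def major_partners(synapse_counts, min_fraction=0.05):
    """Return each partner's share of total output, dropping shares below min_fraction.

    synapse_counts maps a (hypothetical) postsynaptic partner name to the number
    of synapses it receives from one BMN type. The counts are assumed to cover
    all output synapses of that BMN type.
    """
    total = sum(synapse_counts.values())
    if total == 0:
        return {}
    return {
        partner: count / total
        for partner, count in synapse_counts.items()
        if count / total >= min_fraction
    }


# Invented counts: 100 output synapses split across three partners.
example_counts = {"partner_A": 60, "partner_B": 36, "partner_C": 4}
kept = major_partners(example_counts)
print(sorted(kept))
```

      With these invented counts, partner_C receives 4% of the output and is omitted, while the other two partners are kept as edges.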

      We also refer to Figure 8D-F to illustrate the different somatotopic features in the Discussion. These references can be found in the following Discussion sections titled “A synaptic resolution somatotopic map of the head BMNs (fourth paragraph)”, and “Parallel circuit architecture underlying the grooming sequence (second paragraph)”.

      (6) In addition to improving the Figures, we provide additional tools that enable readers to explore the BMN somatotopy in a more interactive way. That is, we provide 5 different FlyWire.ai links in the manuscript Results section that enable 3D visualization of the different reconstructed BMNs (e.g. FlyWire.ai link 1).

      Note: In working on old Figure 7 to address this Reviewer suggestion, we also reordered panels A-E. We believe that this was a more logical ordering than in the previous draft. These panels are now the only data shown in Figure 7, as the cosine similarity analysis is now in Figure 8. We hope that splitting these panels into two Figures will improve manuscript readability.

      Light EM Mapping: A better description of methods by which this mapping was done would be helpful. Perhaps the authors could provide a few example parallel representations of the EM and light images in the main figure would help the reader better appreciate the strength of their approach.

      We have done as the Reviewers suggested and added panels to Figure 6 that show examples of the LM and EM image matching (Figure 6A,B). We added two examples that used different methods for labeling the LM imaged BMNs, including MCFO labeling of an individual BM-InOc neuron and driver line labeling of a major portion of BM-InOm neurons using InOmBMN-LexA. These panels are referred to in the first paragraph of the Results section titled “Matching the reconstructed head BMNs with their bristles”. Note that examples for all LM/EM matched BMN types are shown in Figure 6 – figure supplement 2.

      We had provided Figure 6 – figure supplement 2 in the reviewed manuscript that shows all the above requested “parallel representations of the EM and light images”. However, the Reviewer critiques made us realize that the purpose of this figure supplement was not clearly indicated. Therefore, we have revised Figure 6 – figure supplement 2 and its legend to make its purpose clearer. First, we changed the legend title to better highlight its purpose. The legend is now titled: “Matching EM reconstructed BMN projections with light microscopy (LM) imaged BMNs that innervate specific bristles”. Second, we added label designations to the figure panel rows that highlight the LM and EM comparisons. That is, the rows for light microscopy images of BMNs are indicated with LM and the rows for EM reconstructed BMN images are labeled with EM. Reviewer #3 had indicated that it was not clear what labeling methods were used to visualize the LM imaged BM-InOm neurons in Figure 6 – figure supplement 2N. Therefore, we added text to the figure and the legend to better highlight the different methods used. Panels A and B were also cropped to accommodate the above mentioned revisions.

      The manuscript also provides an extensive Materials and methods section that describes the different lines of evidence that were used to assign the reconstructed BMNs as specific types. We changed the title to better highlight the purpose of this methods section to “Matching EM reconstructed BMN projections with light microscopy imaged BMNs that innervate specific bristles”. The evidence used to support the assignment of the different BMN types is also summarized in Figure 6 – figure supplement 3.

      Parallel circuit model: The authors motivate their study with this. We're recommending that they define expectations of such circuitry, its alternatives (including implications for downstream pathways), and behavior before they present their results. We're also recommending that they interpret their behavioural results in the context of these circuits.

      Our primary motivation for doing the experiments described in this manuscript was to help define the neural circuit architecture underlying the parallel model that drives the Drosophila grooming sequence. This manuscript provides a comprehensive assessment of the first layer of this circuit architecture. A byproduct of this work is a contribution that offers immediate utility and significance to the Drosophila connectomics community, namely the description of the majority of mechanosensory neurons on the head, with their annotation in the recently released whole brain connectome dataset (FlyWire.ai). In writing this manuscript, we tried to balance both of these aims, which proved difficult. We very much appreciate the Reviewers' comments that have highlighted points of confusion in our original draft. We hope that the revised draft is now clearer and more logically presented. We have made revisions to the text and provided a new figure supplement (Figure 1 - figure supplement 1) and new panels in Figure 8. Below we highlight the major revisions.

      (1) The Introduction was revised to more explicitly ground the study in the parallel model, while also removing details that were not pertinent to the experiments presented in the manuscript.

      The first paragraph introduces different features of the parallel model. To better focus the reader on the parts of the model that were being assessed in the manuscript, we removed the following sentences: “Performance order is established by an activity gradient among parallel circuits where earlier actions have the highest activity and later actions have the lowest. A winner-take-all network selects the action with the highest activity and suppresses the others. The selected action is performed and then terminated to allow a new round of competition and selection of the next action.” Note that these sentences are included in the third and fourth paragraphs of the last Discussion section titled “Parallel circuit architecture underlying the grooming sequence”.

      The first paragraph of the Introduction now introduces a bigger picture view of the model that emphasizes the two main features: 1) a parallel circuit architecture that ensures all mutually exclusive actions to be performed in sequence are simultaneously readied and competing for output, and 2) hierarchical suppression among the parallel circuits, where earlier actions suppress later actions.

      (2) Newly added Figure 1 – figure supplement 1 provides a working model of grooming (Reviewer # 1 suggestion). We now more strongly emphasize that the study aimed to define the parallel neural circuit architecture underlying the grooming sequence, focusing on the mechanosensory layer of this architecture. In particular, we refer to the new Figure 1 – figure supplement 1 that has been added to better convey the hypothesized grooming neural circuit architecture. Figure 1 – figure supplement 1 is incorporated into the Introduction (paragraphs two, three, and four), Results section titled “Somatotopically-organized parallel BMN pathways (first paragraph)”, and last Discussion section titled “Parallel circuit architecture underlying the grooming sequence (second and third paragraphs)”.

      (3) New panels in Figure 8 update the model of parallel circuit organization as it relates to somatotopy (D-F). These panels show the parallel circuits hypothesized by the model, but also indicate convergence, with different possible levels of head resolution for these circuits. We describe above where these panels are referenced in the text.

      (4) We added a new paragraph in the last Discussion section titled “Parallel circuit architecture underlying the grooming sequence” that better incorporates the results from this manuscript into the working model of grooming. This paragraph is shown below.

      Here we define the parallel architecture of BMN types that elicit the head grooming sequence that starts with the eyes and proceeds to other locations, such as the antennae and ventral head. The different BMN types are hypothesized to connect with parallel circuits that elicit grooming of specific locations (described above and shown in Figure 1 – figure supplement 1A,C). Indeed, we identify distinct projections and connectivity among BMNs innervating distant bristles on the head, providing evidence supporting this parallel architecture (Figure 8D-F). However, we also find partially overlapping projections and connectivity among BMNs innervating neighboring bristles. Further, optogenetic activation of BMNs at specific head locations elicits grooming of both those locations and neighboring locations (Figure 9). These findings raise questions about the resolution of the parallel architecture underlying grooming. Are BMN types connected with distinct postsynaptic circuits that elicit aimed grooming of their corresponding bristle populations (e.g. Ant bristles)? Or are neighboring BMN types that innervate bristles in particular head areas connected with circuits that elicit grooming of those areas (e.g. dorsal or ventral head)? Future studies of the BMN postsynaptic circuits will be required to define the resolution of the parallel pathways that elicit aimed grooming.

      Aside from this summary of major concerns, the detailed recommendations are attached below.

      Reviewer #1 (Recommendations For The Authors):

      I appreciate the quality and exhaustive body of work presented in this manuscript. I have a few comments that the authors may want to consider:

      (1) The authors motivate this study by posing that it would allow them to uncover whether the complex grooming behaviour of flies followed a parallel model of circuit function. It would have been nice to have been introduced to what the alternative model might be and what each would mean for organisation of the circuit architecture. Some guiding schematics would go a long way in illustrating this point. Modifying the discussion along these lines would also be helpful.

      We made several revisions to the manuscript that address this recommendation. Among these revisions, we added Figure 1 – figure supplement 1 that includes a working model for grooming. Please see above for a description of these revisions.

      (2) The authors mention the body of work that has mapped head bristles and described somatotopy. It would be useful to discuss in more detail what these studies have shown and highlight where the gaps are that their study fills.

      We made several revisions to the manuscript that address this recommendation. Please see above for a description of these revisions.

      (3) The dye-fills and reconstructions that are single colour could use a boundary to demarcate the SEZ. This would help in orienting the reader.

      We agree with Reviewer #1 that Figure 4 and its supplements could use some indicator that would orient the reader with respect to the dye filled or stochastically labeled neurons. The images are of the entire SEZ in the ventral brain, and in the case of some panels, the background staining enables visualization of the brain (e.g. Figure 4H,M,N). To help orient the reader in this region, we added a dotted line to indicate the approximate SEZ midline. This also enables the reader to more clearly see which of the BMN types cross the midline.

      Midline visual guides were added for Figure 4, Figure 4 – figure supplement 2, Figure 4 – figure supplement 3, Figure 4 – figure supplement 4, Figure 4 – figure supplement 5, Figure 4 – figure supplement 6, Figure 4 – figure supplement 7, Figure 4 – figure supplement 8, Figure 6 – figure supplement 2.

      (4) The comparison between the EM and the fills/clones are not obvious. And particularly because they are not directly determined, it would be nice to have the EM reconstruction alongside the dye-fills. This would work very nicely in the supplementary figure with the multiple fills of the same bristles. I think this would really drive home the point.

      We made several revisions to the manuscript that address this recommendation. Please see above for a description of these revisions.

      (5) Are there unnoticed black error-bars floating around in many of the gray-scale images?

      The black bars were masking white scale bars in the images. We have removed the black bars and remade the images without scale bars. This was done for the following Figures: Figure 4, Figure 4 – figure supplement 2, Figure 4 – figure supplement 3, Figure 4 – figure supplement 4, Figure 4 – figure supplement 5, Figure 4 – figure supplement 6, Figure 4 – figure supplement 7, Figure 4 – figure supplement 8, Figure 6 – figure supplement 2.

      Reviewer #2 (Recommendations For The Authors):

      (1) The only point in the paper where I found myself going back and forth between methods/supp and text was when the authors discuss the clustering. I think it would help the reader if a few sentences about cosine clustering used for connectivity based clustering were included in the main text. Also, for NBLAST hierarchical clustering, it would help if some informed metrics could be used for defining cluster numbers (e.g. Braun et al, 2010 PLOS ONE shows how Ward linkage cost could be used for hierarchical clustering).

      Depending on where the cut height is placed on the dendrogram for cosine similarity of BMNs, different features of the BMN type postsynaptic connectivity are captured. As the number of clusters is increased (lower cut height), clustering is mainly among BMNs of the same type, showing that these BMNs have the highest connectivity similarity. As the number of clusters is reduced (higher cut height), BMNs innervating neighboring bristles on the head are clustered, revealing three general clusters corresponding to the dorsal, ventral, and posterior head. This reveals somatotopy based clustering among same and neighboring BMN types. The cut height shown in Figure 8 and Figure 8 – figure supplement 2 was chosen because it highlighted both of these features.
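      The effect of cut height described above can be illustrated with a small, self-contained sketch. It is not the analysis code used for the manuscript: the linkage method (average linkage on cosine distance), the neuron names, and the synapse counts are all assumptions made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two connectivity vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_at_height(vectors, cut_height):
    """Average-linkage agglomerative clustering on cosine distance (1 - similarity).

    Clusters are merged until the closest pair is farther apart than cut_height,
    which mimics cutting a dendrogram at that height.
    """
    names = list(vectors)
    dist = {
        (a, b): 1.0 - cosine_similarity(vectors[a], vectors[b])
        for a in names
        for b in names
    }
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        return sum(dist[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > cut_height:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Invented synapse counts onto four hypothetical postsynaptic partners.
toy_connectivity = {
    "Fr_1": [10, 2, 0, 0],
    "Fr_2": [9, 3, 0, 0],
    "Ant_1": [3, 10, 0, 0],
    "Ant_2": [2, 9, 0, 0],
    "MaPa_1": [0, 0, 8, 1],
    "MaPa_2": [0, 0, 7, 2],
}

print(cluster_at_height(toy_connectivity, cut_height=0.1))
print(cluster_at_height(toy_connectivity, cut_height=0.7))
```

      In this toy example, the low cut height separates the three invented types, whereas the higher cut height merges the two "neighboring" types (Fr and Ant) into one cluster and leaves the "distant" type (MaPa) on its own.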

      The NBLAST clustering shows similar results to the connectivity-based clustering with respect to neighboring and distant BMN types. As the number of clusters increases, BMNs of the same type are clustered, and these types can be further subdivided into morphologically distinct subtypes. As the number of clusters is reduced, the clustering captures neighboring BMNs. Thus, neighboring BMN types showed high morphological similarity (and proximity) with each other, and low similarity with distant BMN types.

      Please see our responses to a Reviewer #3 critique below for further description of the clustering results.
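
      The relationship between cut height and cluster number described above can be illustrated with a minimal sketch. It uses a toy connectivity matrix rather than the BMN data, and the neuron labels, synapse counts, average-linkage method, and cut heights are all assumptions chosen for illustration; they are not the values or settings used in the manuscript.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical connectivity matrix (NOT the BMN data):
# rows are neurons, columns are synapse counts onto postsynaptic partners.
connectivity = np.array([
    [10, 9, 0, 0, 0, 0],   # type A
    [9, 10, 0, 0, 0, 0],   # type A
    [6, 4, 5, 0, 0, 0],    # type B (neighbor of A, shares partners)
    [5, 4, 6, 0, 0, 0],    # type B
    [0, 0, 0, 0, 8, 9],    # type C (distant, no shared partners)
    [0, 0, 0, 0, 9, 8],    # type C
], dtype=float)

# Pairwise cosine distance (1 - cosine similarity) between connectivity profiles.
distances = pdist(connectivity, metric="cosine")

# Average linkage is an assumption made for this sketch.
tree = linkage(distances, method="average")

def clusters_at(cut_height):
    """Return flat cluster labels obtained by cutting the dendrogram at cut_height."""
    return fcluster(tree, t=cut_height, criterion="distance")

low_cut = clusters_at(0.05)   # low cut: neurons of the same type group together
high_cut = clusters_at(0.5)   # higher cut: neighboring types A and B merge

print("low cut :", low_cut, "->", len(set(low_cut)), "clusters")
print("high cut:", high_cut, "->", len(set(high_cut)), "clusters")
```

      In this toy case the lower cut separates the three types, whereas the higher cut merges the two types that share postsynaptic partners and keeps the distant type apart, mirroring the same-type and neighboring-type clustering described above.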

      On the same lines it would help if the clustering dendrograms were included in the main figure.

      We thank Reviewer #2 for this comment. We have added the dendrogram to Figure 8A, a change that we feel makes this Figure much easier to understand.

      (2) It could help provide intuition if the authors revealed some of the downstream targets and their implication in explaining the behavioral phenotypes.

      While this will be the subject of at least two forthcoming manuscripts, we have added text to the present manuscript that provides insight into BMN postsynaptic targets. Our previous work (Hampel et al. 2015) described a mechanosensory connected neural circuit that elicits grooming of the antennae. While this previous study demonstrated that the Johnston’s organ mechanosensory neurons are synaptically and functionally connected with this circuit, our preliminary analysis indicates that it is also connected with BM-Ant neurons. We hypothesize that there are additional such circuits that are responsible for eliciting grooming of other head locations.

      To better highlight potential downstream targets in the manuscript, we now mention the antennal circuit in the Introduction. This text reads: In this model, parallel-projecting mechanosensory neurons that respond to stimuli at specific locations on the head or body could connect with somatotopically-organized parallel circuits that elicit grooming of those locations (Figure 1 – figure supplement 1A-C). The previous discovery of a mechanosensory-connected circuit that elicits aimed grooming of the antennae provides evidence of this organization (Hampel 2015). However, the extent to which distinct circuits elicit grooming of other locations is unknown, in part, because the somatotopic projections of the mechanosensory neurons have not been comprehensively defined for the head or body.

      There is also text in the Discussion that addresses this Reviewer comment. It describes the antennal circuit and mentions the possibility that other similar circuits may exist. This can be found in the third paragraph of the section titled “Circuits that elicit aimed grooming of specific head locations”.

      (3) Authors find that opto activation of BMNs leads to grooming of targeted as well as neighboring areas. Is there any sequence observed here? i.e. first clean targeted area and then clean neighboring area? I wonder if the answer to this is something as simple as common post-synaptic targets which is essentially reducing the resolution of the BMN sensory map. Some more speculation on this interesting result could be helpful.

      We appreciate and agree with this point from Reviewer #2, and have tried to better emphasize the possible implications for grooming that the overlapping projections and connectivity among BMNs innervating neighboring bristles may have. This is now better addressed in the Results and Discussion sections. Below we highlight where this is addressed:

      (1) In the second paragraph of the Results section titled “Activation of subsets of head BMNs elicits aimed grooming of specific locations” we added text that suggests the possibility that grooming of the stimulated and neighboring locations could be due to the overlapping projections and connectivity. This text reads: This suggested that head BMNs elicit aimed grooming of their corresponding bristle locations, but also neighboring locations. This result is consistent with our anatomical and connectomic data indicating that BMNs innervating neighboring bristles show overlapping projections and postsynaptic connectivity similarity (see Discussion).

      (2) In the fourth paragraph of the Discussion section titled “A synaptic resolution somatotopic map of the head BMNs”, we added a sentence to the end of the fourth paragraph that alludes to further discussion of this topic. This sentence reads: This overlap may have implications for aimed grooming behavior. For example, neighboring BMNs could connect with common neural circuits to elicit grooming of overlapping locations (discussed more below).

      (3) In the fourth paragraph of the Discussion section titled “Circuits that elicit aimed grooming of specific head locations” there is a paragraph that mentions the possibility of mechanosensory convergence onto common postsynaptic circuits to promote grooming of the stimulated area, along with neighboring areas. This paragraph is below.

      We find that activation of specific BMN types elicits both aimed grooming of their corresponding bristle locations and neighboring locations. This suggests overlap in the locations that are groomed with the activation of different BMN types. Such overlap provides a means of cleaning the area surrounding the stimulus location. Interestingly, our NBLAST and cosine similarity analysis indicates that neighboring BMNs project into overlapping zones in the SEZ and show common postsynaptic connectivity. Thus, we hypothesize that neighboring BMNs connect with common neural circuits (e.g. antennal grooming circuit) to elicit overlapping aimed grooming of common head locations.

      (4) In the new second paragraph of the Discussion section titled “Parallel circuit architecture underlying the grooming sequence” we further discuss the issue of the BMN “sensory map”. This paragraph is below.

      Here we define the parallel architecture of BMN types that elicit the head grooming sequence that starts with the eyes and proceeds to other locations, such as the antennae and ventral head. The different BMN types are hypothesized to connect with parallel circuits that elicit grooming of specific locations (described above and shown in Figure 1 – figure supplement 1A,C). Indeed, we identify distinct projections and connectivity among BMNs innervating distant bristles on the head, providing evidence supporting this parallel architecture (Figure 8D-F). However, we also find partially overlapping projections and connectivity among BMNs innervating neighboring bristles. Further, optogenetic activation of BMNs at specific head locations elicits grooming of both those locations and neighboring locations (Figure 9). These findings raise questions about the resolution of the parallel architecture underlying grooming. Are BMN types connected with distinct postsynaptic circuits that elicit aimed grooming of their corresponding bristle populations (e.g. Ant bristles)? Or are neighboring BMN types that innervate bristles in particular head areas connected with circuits that elicit grooming of those areas (e.g. dorsal or ventral head)? Future studies of the BMN postsynaptic circuits will be required to define the resolution of the parallel pathways that elicit aimed grooming.

      (4) If the authors were to include a summary table that shows all known attributes of each BMN type as columns, it could be very useful as a resource to the community. Table columns could include attributes like "bristle name", "nerve tract", "FlyWire IDs of all segments corresponding to the bristle class", "split-Gal4 line or known enhancer", etc.

      We provided a table that includes much of this information after the manuscript had already gone out for review. We regret that this was not available. This is now provided as Supplementary file 3. This table provides the following information for each reconstructed BMN: BMN name, bristle type, nerve, flywire ID, flywire coordinates, NBLAST cluster (cut height 1), NBLAST cluster (cut height 5), and cosine cluster (cut height 4.5). Note that the driver line enhancers for targeting specific BMN types are shown in Figure 3I.

      Specific Points:

      Figure 4C-V:

      • I find it a bit difficult to distinguish ipsi- from contra-lateral projections. Maybe indicate the midline as a thin, stippled line?

      We thank Reviewer #2 for this suggestion. We have now added lines in the panels in Figure 4C-V to indicate the approximate location of the midline. We also added lines to the Figure 4 – figure supplements as described above.

      I think this Fig reference is wrong "the red-light stimulus also elicited backward motions with control flies (Figure 6B,C, control, black trace, Video 5)." should be Fig 8B,C

      We have fixed this error.

      Reviewer #3 (Recommendations For The Authors):

      Introduction:

      Motivating this study in terms of understanding the neural mechanisms that execute the parallel model seems to overstate what you will achieve with the current study. If you want to motivate it this way, I suggest focusing on the grooming sequence of the head alone (eyes, antennae, proboscis).

      We made several revisions to the manuscript that address this recommendation. Please see above for a description of these revisions. Please note that many of the revisions focus on the head grooming sequence. We also made minor revisions to the Introduction that further emphasize the focus on head grooming.

      Results:

      Figure 1. Please indicate that this is a male fly in either the figure title or in the figure itself.

      We added a male symbol to Figure 1A.

      Figure 3. Panel J is referenced in the main body text and in the figure caption, but there is no Fig 3J.

      Panel J is shown in the upper right corner of Figure 3. We realize that the placement of this panel is not ideal, but this was the only place that we could fit it. Additionally, the panel works nicely at that location to better enable comparison with panel C. We have revised the text in the Figure 3 legend to better highlight the location of this Figure panel: “Shown in the upper right corner of the figure are the aligned expression patterns of InOmBMN-LexA (red), dBMN-spGAL4 (green), and TasteBMN-spGAL4 (brown).”

      We also added text to a sentence in the results section entitled “Head BMNs project into discrete zones in the ventral brain” that indicates the panel location. This text reads: To further visualize the spatial relationships between these projections, we computationally aligned the expression patterns of the different driver lines into the same brain space (Figure 3J, upper right corner).

      Matching the BMNs to EM reconstructions: why cut the dendrogram at H=5? Would be better to determine cluster number using an unbiased method.

      To match the morphologically distinct EM reconstructed BMNs to their specific bristles, we relied on different lines of evidence, including NBLAST results (discussed more below), dye fill/stochastic labeling/driver line labeling matches, published morphology, nerve projection, bristle number, proximity to other BMNs, and postsynaptic connectivity (summarized in Figure 6 – figure supplement 3). The following Materials and methods section provides a detailed description of the evidence used to assign each BMN type in “Matching EM reconstructed BMN projections with light microscopy imaged BMNs that innervate specific bristles”. In many cases, BMN type could be assigned with confidence solely based on morphological comparisons with our light level data (e.g. dye fills), in conjunction with bristle counts to indicate an expected number of BMNs showing similar morphology. Thus, the LM/EM matches and NBLAST clustering were largely complementary.

      The EM reconstructed BMNs were matched as particular BMN types, in part based on examination of the NBLAST data at different cut heights. NBLAST clustering of the BMNs revealed general trends at higher and lower cut heights (Figure 6 – figure supplement 1A, Supplementary file 3). The lowest cut heights included mostly BMNs of the same type innervating the same bristle populations, and smaller clusters that subdivided into morphologically distinct subtypes (see Supplementary file 3 for clusters produced at cut height 1). This revealed that BMNs of the same type tended to show the highest morphological similarity with each other, but they also showed intratype morphological diversity. Higher cut heights produced clusters of BMNs innervating neighboring bristle populations (e.g. ventral head BMNs), showing high morphological similarity among neighboring BMN types.

      We selected the cut height 5 shown in Figure 6 – figure supplement 1A,B because it captures examples of both same and neighboring type clustering. For example, it captures a cluster of mostly BM-Taste neurons (Cluster 16), and neighboring BMN types, including those from the dorsal head (Cluster 14) or ventral head (Cluster 15).

      Based on reviewer comments, we realized that the way we wrote the BMN matching section in the Results implied more reliance on the NBLAST clustering than was actually necessary, misrepresenting how we matched the BMNs. Therefore, we softened the first couple of sentences to place less emphasis on the importance of the NBLAST clustering. We also indicated that readers can find the resulting clusters at different cut heights, referring to Figure 6 – figure supplement 1A and Supplementary file 3. The first two sentences of the first paragraph in the Results section titled “Matching the reconstructed head BMNs with their bristles” now read:

      The reconstructed BMN projections were next matched with their specific bristle populations. The projections were clustered based on morphological similarity using the NBLAST algorithm (example clustering at cut height 5 shown in Figure 6 – figure supplement 1A,B, Supplementary file 3, FlyWire.ai link 2) (Costa et al., 2016). Clusters could be assigned as BMN types based on their similarity to light microscopy images of BMNs known to innervate specific bristles.

      The number of reconstructed BMNs is remarkably similar to what is expected based on bristle counts for each group except for InOm. Why do you think there is such a large discrepancy there?

      We believe that there is a discrepancy between the number of reconstructed BM-InOm neurons and the number expected based on InOm bristle counts because these bristle counts were based on few flies and these numbers appear to be variable. We did not further investigate the numbers of InOm bristles in this manuscript because we only needed an estimate of their numbers, given that there is over an order of magnitude difference in the eye bristles versus any other head bristle population. Therefore, we could relatively easily conclude that the head BMNs were related to the InOm bristles, based on their sheer numbers and their morphology.

      Figure 6 - figure supplement 2N, please describe these panels better. Main text says the upper image is from InOmBMN-LexA, but the figure legend doesn't agree.

      We have added text to the figure legend that now makes the contents of panel 2N clear to the reader. Further, we now indicate in the figure legend for each panel, the method used to obtain the labeled neurons (i.e. fill, MCFO, driver), to avoid similar confusion for the other panels.

      Figure 6 - figure supplement 4D. How frequently is there a mismatch between the number of BMNs for a given type across hemispheres?

      Although the full reconstruction of the BMNs on both sides of the brain was beyond the scope of this work, the BMNs on both sides have since been reconstructed and annotated (Schlegel et al. 2023). We plan to provide more analysis of BMNs on both sides of the brain in a forthcoming manuscript. However, the BMN numbers tend to show agreement on both sides of the brain. The table below shows a comparison between the two sides:

      Author response table 1.

      Figures 6 and 7. It would be helpful to include a reference brain in all panels that show cluster morphology. Without landmarks there is nothing to anchor the eye to allow the reader to see the described differences in BMN projection zones and patterns.

      While we apologize for not making this specific change, we have made revisions to other parts of the manuscript to better highlight the somatotopic organization among the BMNs (revisions described above). Please note that we now provide FlyWire.ai publicly available links that enable readers to view the BMN projections in 3D. They can also toggle a brain mesh on and off to provide spatial reference.

      "BMN somatotopic map": It would be helpful to show or describe in more detail what the unique branch morphology for each zone is. It is quite difficult to appreciate, as the groups also have a lot of overlap. Would the unique regions that the BMN groups innervate be easier to see if you plotted presynaptic sites by group? I am left unsure about whether there is a somatotopic map here.

      We made several revisions to the manuscript that address this recommendation. Please see above for a description of these revisions. Please note that we did not examine the fine branch morphological differences between BMN types having overlapping projections. Showing these differences would require more extensive anatomical analysis that is beyond the scope of this work. For showing definitive somatotopy, we focused on the overt differences between BMNs innervating bristles at distant locations on the head.

      Overall the strict adherence to the parallel model impacts the interpretation of the data. It would be helpful for the authors to discuss which aspects of the current study are consistent with the parallel model and which results are not consistent.

      We made several revisions to the manuscript that address this recommendation. Please see above for a description of these revisions.

      Discussion:

      "Circuits that elicit aimed grooming of specific head locations": In the previous paragraph you mention "BMN types innervating neighboring bristle populations have overlapping projections into zones that correspond roughly to the dorsal, ventral, and posterior head. The overlap is likely functionally significant, as cosine similarity analysis revealed that neighboring head BMN types have common postsynaptic partners. However, overlap between neighboring BMN types is only partial, as they show differing projections and postsynaptic connectivity." Then in this paragraph, you say, "How do the parallel-projecting head BMNs interface with postsynaptic neural circuits to elicit aimed grooming of specific head locations? Different evidence supports the hypothesis that the BMNs connect with parallel circuits that each elicit a different aimed grooming movement (Seeds et al., 2014)." The overlapping postsynaptic BMN connectivity seems in conflict with the claim that the circuits are parallel.

      We apologize for this confusion. We now better describe this apparent discrepancy between our results and the parallel model of grooming behavior. We made several revisions to the manuscript that address this recommendation. Please see above for a description of these revisions.

      We have made additional changes to the manuscript:

      (1) We added Supplementary file 2 that includes links for downloading the image stacks used to generate panels in Figure 1, Figure 2, Figure 3, Figure 4, and figure supplements for these figures. These image stacks are stored in the Brain Image Library (BIL). Rows in the spreadsheet correspond to each image stack. Columns provide information about each stack including: figure panels that each image stack contributed to, image stack title, DOI for each stack (link provides metadata for each stack and file download link), image stack file name, genotype of imaged fly, and information about image stack. References to this file have been made at different locations throughout the text and Figure legends. We also added a section on the BIL data in the Materials and methods entitled “Light microscopy image stack storage and availability”. Old Supplementary file 2 has been renamed Supplementary file 3.

      (2) We added a new reference for FlyWire.ai (Dorkenwald et al. 2023) that was posted as a preprint during the revision of this manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      In the manuscript titled "Vangl2 suppresses NF-κB signaling and ameliorates sepsis by targeting p65 for NDP52-mediated autophagic degradation" by Lu et al, the authors show that Vangl2, a planar cell polarity component, plays a direct role in autophagic degradation of NF-κB p65 by facilitating its ubiquitination via PDLIM2 and subsequent recognition and autophagic targeting via the autophagy adaptor protein NDP52. Conceptually it is a wonderful study with excellent execution of experiments and controls. The concerns with the manuscript are mainly on two counts. The first issue is the kinetics of p65 regulation reported here, which does not fit the kinetics of the proposed mechanism, i.e., Vangl2-mediated ubiquitination followed by autophagic degradation of p65. The second issue is more technical: an absolute lack of quantitative analyses. The authors rely mostly on visual qualitative interpretation to assess an increase or decrease in associations between partner molecules throughout the study. While the overall mechanism is interesting, the authors should address these concerns as highlighted below:

      Major points:

      (1) Kinetics of p65 regulation by Vangl2: As mentioned above, the authors report that LPS stimulation leads to higher IKK and p65 activation in the absence of Vangl2. The mechanism of action that the authors subsequently work out is that Vangl2 helps recruit the E3 ligase PDLIM2 to p65, which causes K63 ubiquitination, which is recognised by NDP52 for autophagic targeting. Curiously, peak p65 activation is achieved within 30 minutes of LPS stimulation. The time scale of all other assays is much longer. It is not clear that in WT cells, p65 could be targeted for autophagic degradation in a Vangl2-dependent manner within 30 minutes. The HA-Myc-Flag-based overexpression and Co-IP studies do confirm the interactions as proposed. However, they do not prove that this mechanism was responsible for the Vangl2-mediated modulation of p65 activation upon LPS stimulation. Moreover, the Vangl2 KO line also shows increased IKK activation. The authors do not show the cause behind increased IKK activation, which in itself can trigger increased p65 phosphorylation.

      We thank the reviewer for this valuable suggestion.

      Indeed, we agree with the reviewer that peak p65 activation is achieved within 30 minutes of LPS stimulation in vitro, and that p65 could not be targeted for autophagic degradation in a Vangl2-dependent manner within 30 minutes. Given that the protein and mRNA levels of Vangl2 were elevated at 3-6 h of LPS stimulation (Fig. S1 C-E), we extended the stimulation time scale in the revised manuscript. The data (Fig. 2A-D in the revised manuscript) demonstrated that IKK phosphorylation was enhanced in Vangl2 KO myeloid cells during the early phase (within 3 h) of LPS stimulation, but not during prolonged LPS stimulation. The underlying mechanism may be complex. Only p65 phosphorylation was continuously enhanced after long-term LPS stimulation in Vangl2 KO cells, compared to WT cells. Furthermore, overexpression of Vangl2 in A549 cells also reduced phosphorylated and total endogenous p65 (Fig. 2 I, J in the revised manuscript). These findings were corroborated by overexpression and Co-IP experiments, which collectively indicated that Vangl2 regulates the stability of p65 by promoting its interaction with NDP52 and autophagic degradation. (Page 7; Lines 183-185).

      (2) The other major concern is regarding the lack of quantitative assessments. For Co-IP experiments, I can understand it is qualitative observation. However, when the authors infer that there is an increase or decrease in the association through co-IP immunoblots, it should also be quantified, especially since the differences are quite marginal and could be easily misinterpreted.

      We are grateful to the reviewer for this suggestion. The quantitative analysis has been updated in the revised version.

      (3) Figure 4E and F: It is evident that inhibiting Autolysosome (CQ or BafA1) or autophagy (3MA) led to the recovery of p65 levels and inducing autophagy by Rapamycin led to faster decay in p65 levels. Did the authors also note/explore the possibility that Vangl2 itself may be degraded via the autophagy pathway? IB of WCL upon CQ/BAF/3MA or upon Rapa treatment does indicate the same. If true, how would that impact the dynamics of p65 activation?

      We thank the reviewer for this question. Previous studies have shown that Vangl2 is primarily degraded by the proteasome pathway, rather than by the autolysosomal pathway (doi: 10.1126/sciadv.abg2099; doi: 10.1038/s41598-019-39642-z). In our experiments, Vangl2 recruits E3 ligase PDLIM2 to enhance K63-linked ubiquitination on p65, which serves as a recognition signal for cargo receptor NDP52-mediated selective autophagic degradation. Vangl2 facilitated the interaction between p65 and NDP52, yet itself did not undergo significant autophagic degradation.

      (4) Autophagic targeting of p65 should also be shown through alternate evidence, like microscopy etc., in the LPS-stimulated WT cells.

      We thank the reviewer for this suggestion. We have added the data (co-localization of p65 and LC3 was detected by immunofluorescence) in the revised version (Fig. S4 H in the revised manuscript). (Page 9, lines 267-268)

      Reviewer #2 (Public Review):

      Vangl2, a core planar cell polarity protein involved in Wnt/PCP signaling, mediates cell proliferation, differentiation, homeostasis, and cell migration. Vangl2 malfunctioning has been linked to various human ailments, including autoimmune and neoplastic disorders. Interestingly, Vangl2 was shown to interact with the autophagy regulator p62, and indeed, autophagic degradation limits the activity of inflammatory mediators such as p65/NF-κB. However, whether Vangl2, per se, contributes to restraining aberrant p65/NF-κB activity remains unclear.

      In this manuscript, Lu et al. describe that Vangl2 expression is upregulated in human sepsis-associated PBMCs and that Vangl2 mitigates experimental sepsis in mice by negatively regulating p65/NF-κB signaling in myeloid cells. Vangl2 recruits the E3 ubiquitin ligase PDLIM2 to promote K63-linked poly-ubiquitination of p65. Vangl2 also facilitates the recognition of ubiquitinated p65 by the cargo receptor NDP52. These molecular processes cause selective autophagic degradation of p65. Indeed, abrogation of PDLIM2 or NDP52 functions rescued p65 from autophagic degradation, leading to extended p65/NF-κB activity.

      As such, the manuscript presents a substantial body of interesting work and a novel mechanism of NF-κB control. If found true, the proposed mechanism may expand therapeutic opportunities for inflammatory diseases. However, the current draft has significant weaknesses that need to be addressed.

      We appreciate the reviewer’s comments on our manuscript, and we have further improved the manuscript as suggested.

      Specific comments

      (1) Vangl2 deficiency did not cause a discernible increase in the cellular level of total endogenous p65 (Fig 2A and Fig 2B) but also led to accumulation of phosphorylated IKK.

      Even Fig 4D reveals that Vangl2 exerts a rather modest effect on the total p65 level and the figure does not provide any standard error for the quantified data. Therefore, these results do not fully support the proposed model (Figure 7), which is a significant drawback. Instead, these data provoke an alternate hypothesis that Vangl2 could be specifically mediating autophagic removal of phosphorylated p65 and phosphorylated IKK, leading to exacerbated inflammatory NF-κB response in Vangl2-deficient cells. One may need to use phosphorylation-defective mutants of p65, at least in the over-expression experiments, to distinguish between these possibilities.

      We appreciate the reviewer’s comments on our manuscript, and we have further improved the manuscript as suggested.

      (1) Indeed, we agree with the reviewer that Vangl2 deficiency did not cause a discernible increase in the cellular level of total p65 after a short period of LPS stimulation in vitro, and that p65 could not be targeted for autophagic degradation in a Vangl2-dependent manner within 30 minutes. Given that the protein and mRNA levels of Vangl2 were elevated at 3-6 h of LPS stimulation (Fig. S1 C-E), we extended the stimulation time scale in the revised manuscript. The data (Fig. 2A-D in the revised manuscript) demonstrated that IKK phosphorylation was enhanced in Vangl2 KO myeloid cells during the early phase (within 3 h) of LPS stimulation, but not during prolonged LPS stimulation. The underlying mechanism may be complex. Only phosphorylated p65 and total endogenous p65 were continuously enhanced after long-term LPS stimulation in Vangl2 KO cells, compared to WT cells. Furthermore, overexpression of Vangl2 in A549 cells also reduced phosphorylated and total endogenous p65 (Fig. 2 I, J in the revised manuscript). These findings were corroborated by overexpression and Co-IP experiments, which collectively indicated that Vangl2 regulates the stability of p65 by promoting its interaction with NDP52 and autophagic degradation. (Page 7; Lines 183-185).

      (2) Similarly, the stimulation time scale in Fig 4D was extended, and it was demonstrated that p65 was more stable in Vangl2-deficient cells.

      (3) Moreover, we constructed a phosphorylation-defective mutant of p65 (S536A), and found that Vangl2 could also promote the degradation of this p65 phosphorylation mutant (Fig. S4 A, B in the revised manuscript). Thus, Vangl2 promotes the degradation of basal/unphosphorylated p65. (Page 8, lines 237-240)

      (2) Fig 1A: The data indicates the presence of two subgroups within the sepsis cohort - one with high Vangl2 expressions and the other with relatively normal Vangl2 expression. Was there any difference with respect to NF-κB target inflammatory gene expressions between these subgroups?

      As suggested, we analyzed the expression of NF-κB target inflammatory genes in the high and relatively low Vangl2 expression groups of sepsis patients. The results showed that the high Vangl2 expression group exhibited lower serum levels of IL-6, WBC, and CRP than the low Vangl2 expression group, which suggested an inverse correlation between Vangl2 and the inflammatory response (Fig. S1 A in the revised manuscript) (Page 5, lines 126-128).

      (3) The effect of Vangl2 deficiency was rather modest in the neutrophil. Could it be that Vangl2 mediates its effect mostly in macrophages?

      As shown in Fig. S1C-E, the induction of Vangl2 by LPS stimulation is more rapid in macrophages than in neutrophils. This may contribute to its dominant effect in macrophages. Consequently, we primarily focused our investigation on the role of Vangl2 in macrophages.

      (4) Fig 1D and Figure 1E: Data for unstimulated Vangl2 cells should be provided. Also, the source of the IL-1β primary antibody has not been mentioned.

      Thank you for the suggestion. We have updated the data for unstimulated cells in the revised manuscript (Fig. 1 D, E in the revised manuscript). Also, IL-1β primary antibody was purchased from Cell Signaling Technology and the information has been included in the Materials and Methods section (Table S1).

      (5) The relevance and the requirement of RNA-seq analysis are not clear in the present draft. Figure 1E already reveals upregulation of the signature NF-κB target inflammatory genes upon Vangl2 deficiency.

      We agree with the reviewer that the data presented in Figure 1E demonstrate the upregulation of the signature NF-kB target inflammatory genes upon Vangl2 deficiency in a murine model of LPS-induced sepsis. We then investigated the mechanism by which Vangl2 regulates NF-kB target inflammatory genes at the cellular level in Figure 2. To this end, we performed RNA-seq analysis to screen signaling pathways involved in LPS-induced septic shock by comparing LPS-stimulated BMDMs from Vangl2ΔM and WT mice, and found that the TNF signaling pathway and cytokine-cytokine receptor interaction were significantly enriched in Vangl2ΔM BMDMs upon LPS stimulation. This analysis provides further evidence that Vangl2 plays a role in regulating NF-kB signaling pathways and the release of related inflammatory cytokines.

      (6) Fig 2A reveals an increased accumulation of phosphorylated p65 and IKK in Vangl2-deficient macrophages upon LPS stimulation within 30 minutes. However, Vangl2 accumulates at around 60 minutes post-stimulation in WT cells. Similar results were obtained for neutrophils (Fig 2B). There appears to be a temporal disconnect between Vangl2 and phosphorylated p65 accumulation - this must be clarified.

      This concern has been addressed above (see the response to question 1 from Reviewer #2).

      (7) Figure 2E and 2F do not have untreated controls. Presentations in Fig 2E may be improved to more clearly depict IL6 and TNF data, preferably with separate Y-axes.

      Thank you for the suggestion. We have added untreated controls and separated Y-axes for IL-6 and TNF data in the revised manuscript (Fig. 2 E, F in the revised manuscript).

      (8) Line 219: "strongly with IKKα, p65 and MyD88, and weak" - should be revised.

      We have revised this sentence as suggested (Page 7; Line 203).

      (9) It is not clear why IKKβ was excluded from interaction studies in Fig S3G.

      We added a Co-IP experiment and showed that HA-tagged Vangl2 interacted only with Flag-tagged p65, and not with Flag-tagged IKKβ, in 293T cells (Fig. S3H). Furthermore, endogenous co-IP immunoblot analyses showed that Vangl2 did not associate with IKKβ (Fig. S3I).

      (10) Fig 3F- In the text, authors mentioned that Vangl2 strongly associates with p65 upon LPS stimulation in BMDM. However, no controls, including input or another p65-interacting protein, were used.

      As the reviewer suggested, we have added the input and a positive control (IkBa) in this experiment (Fig. 3F in the revised manuscript). The results demonstrated that the interaction between p65 and IkBa was attenuated, although total IkBa did not undergo significant degradation over the long-term course of LPS stimulation.

      (11) Figure 4D - Authors claim that Vangl2-deficient BMDMs stabilized the expression of endogenous p65 after LPS treatment. However, p65 levels were particularly constitutively elevated in knockout cells, and LPS signaling did not cause any further upregulation. This again indicates the role of Vangl2 in the basal state. The authors need to explain this and revise the test accordingly.

      Thank you for the reviewer's comments. We repeated the experiment to ascertain whether Vangl2 could stabilize the expression of endogenous p65 before and after LPS treatment. We found that, because Vangl2 expression is extremely low in WT cells in the absence of stimulation, there was no observable difference in the basal level of p65 between WT and Vangl2ΔM cells. However, upon prolonged LPS stimulation, Vangl2 expression was induced, resulting in p65 degradation in WT cells. In contrast, p65 protein was more stable in Vangl2-deficient cells after LPS stimulation (Fig. 4D in the revised manuscript).

      Reviewer #3 (Public Review):

      Lu et al. describe Vangl2 as a negative regulator of inflammation in myeloid cells. The primary mechanism appears to be through binding p65 and promoting its degradation, albeit in an unusual autolysosome/autophagy dependent manner. Overall, the findings are novel and the crosstalk of PCP pathway protein Vangl2 with NF-kappaB is of interest. […] Regardless, Vangl2 as a negative regulator of NF-kappaB is an important finding. There are, however, some concerns about methodology and statistics that need to be addressed.

      Thank you for your comments on our manuscript, and we have further improved the manuscript as suggested.

      (1) Whether PCP is in any way relevant or if this is a PCP-independent function of Vangl2 is not directly explored (the latter appears more likely from the manuscript/discussion). PCP pathways intersect often with developmentally important pathways such as WNT, HH/GLI, Fat-Dachsous and even mechanical tension. It might be of importance to investigate whether Vangl2-dependent NF-kappaB is influenced by developmental pathways.

      Thank you for the reviewer's insightful comments. Our study revealed that Vangl2 recruits the E3 ubiquitin ligase PDLIM2 to facilitate K63-linked ubiquitination of p65, which is subsequently recognized by the autophagy receptor NDP52, promoting the autophagic degradation of p65. Our findings using autophagy inhibitors and autophagy-deficient cells indicate that Vangl2 regulates NF-kB signaling through a selective autophagic pathway, rather than by affecting the PCP pathway, WNT, HH/GLI, Fat-Dachsous or even mechanical tension. Moreover, a discussion section has been added to the revised version. (Page 12, lines 377-393)

      (2) Are Vangl2 phosphorylations (S5, S82 and S84) in any way necessary for the observed effects on NF-kappaB or would a phospho-mutant (alanine substitution mutant) Vangl2 phenocopy WT Vangl2 for regulation of NF-kappaB?

      As suggested, we generated phospho-mutants of Vangl2 (S82/84A) and observed that Vangl2 (S82/84A) could still facilitate the degradation of p65 (Fig. S4 B in the revised manuscript), suggesting that Vangl2 regulates the NF-kB pathway independently of its phosphorylation.

      (3) Another area to strengthen might be with regards to specificity of cell types where this phenomenon may be observed. LPS treatment in mice resulted in Vangl2 upregulation in spleen and lymph nodes, but not in lung and liver. What explains the specificity of organ/cell-type Vangl2 upregulation and its consequences observed here? Why is NF-kappaB signaling not more broadly or even ubiquitously affected in all cell types in a Vangl2-dependent manner, rather than being restricted to macrophages, neutrophils and peritoneal macrophages, or, for that matter, in spleen and LN and not liver and lung? After all, one may think that the PCP proteins, as well as NF-kappaB, are ubiquitous.

      Thank you for the reviewer's comments.

      (1) LPS is an important mediator to trigger sepsis with excessive immune activation. As is well known, the spleen and lymph nodes are important peripheral immune organs, where immune cells (e.g., macrophages) are abundant and respond sensitively to LPS stimulation. Nevertheless, immune cells represent a minor fraction of the lungs and liver. Consequently, Vangl2 represents a pivotal regulator of immune function, exhibiting a more pronounced increase in the immune organs and cells.

      (2) Induction of Vangl2 expression by LPS stimulation is cell specific. Given that different cells exhibit varying protein abundances, the molecular events involved may also differ. Moreover, we observed high Vangl2 expression in the liver at the basal state (Author response image 1), whereas it was not induced after 12 h of LPS stimulation. Therefore, loss of Vangl2 produces a significant phenotype in macrophages and neutrophils and in the spleen and LN, rather than in liver or lung cells.

      Author response image 1.

      Vangl2 showed no significant changes in the liver after LPS treatment. Mice (n≥3) were treated with LPS (30 mg/kg, i.p.). Livers were collected at 12 h after LPS treatment. Immunoblot analysis of Vangl2.

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors):

      General points:

      Figure 4G - panels appear mislabeled. Please correct.

      We have corrected this mislabeling as you suggested.

      The dynamics of Vangl2 interaction with p65 and autophagy adaptors is not clear/apparent. For example, Vangl2 expression destabilises p65 levels (as in Fig. 4), but in Fig. 5, it seems there is no decline in the p65 protein level, and a large fraction of it coprecipitates with NDP52.

      We appreciate the reviewer’s comments. In the co-IP assay, we used the lysosomal inhibitor CQ to inhibit p65 degradation to observe the interaction between p65 and NDP52 or Vangl2.

      Fig 5E- I would expect p65 levels to be lower in WT cells than Vangl2 KO cells. But as such, there is no difference between the two.

      We appreciate the reviewer’s comments. We repeated the experiments and updated the data. Firstly, Vangl2 was not induced in WT cells in the absence of LPS stimulation, thus there was no difference in p65 expression between the two groups at the basal level. Secondly, we used CQ/Baf-A1 to inhibit the degradation of Vangl2 in the co-IP assay to observe the interaction between p65 and other molecules.

      Reviewer #2 (Recommendations For The Authors):

      A few points that can be looked at and revised.

      (1) Quantification of the presented data is needed for Fig 4D and Fig 4E.

      We added the quantification analysis as suggested.  

      (2) The labeling of Fig 4G should be scrutinized.

      We have corrected this mislabeling as you suggested.

      (3) Fig 6B and Fig 6C should be explained in the result section more elaborately.

      We thank the reviewer for the suggestion, and we have rephrased this sentence to better describe the results. (Page 10, lines 306-313)

      (4) Line 85: "Vangl2 mediated downstream of Toll-like or interleukin (IL)-1" - unclear.

      We appreciate the reviewer’s comment and have revised this sentence accordingly. (Page 3, line 68)

      (5) Line 181: "mice. Differentially expression analysis" - this should be revised.

      We appreciate the reviewer’s comment and have revised this sentence accordingly. (Page 11, line 323)

      (6) Line 261-264- CHX-chase assay showed the degradation rate of p65 in Vangl2-deficient BMDM was slower compared with WT cells. However, Vangl2 is not induced in WT BMDMs upon CHX treatment (Fig. S4B).

      We appreciate the reviewer’s comment and have revised the manuscript accordingly (Fig. S4D).

      (7) Finally, some editing to provide data only critical for the conclusions could improve the ease of reading.

      We have further improved the manuscript as suggested in the revised manuscript.

      Reviewer #3 (Recommendations For The Authors):

      Comments (general, please address at least in Discussion. Some experimental data, for example the role, if any, of Vangl2 phosphorylations will be very useful):

      (1) It might be interesting to explore whether there are any potential effects of developmental pathways on the observed effect mediated by Vangl2 or if the effects are entirely a PCP-independent function of Vangl2. Please see above public review.

      Thank you for the reviewer's insightful comments. Our study revealed that Vangl2 recruits the E3 ubiquitin ligase PDLIM2 to facilitate K63-linked ubiquitination of p65, which is subsequently recognized by the autophagy receptor NDP52, promoting the autophagic degradation of p65. Our findings using autophagy inhibitors and autophagy-deficient cells indicate that Vangl2 regulates NF-kB signaling through a selective autophagic pathway, rather than by affecting the PCP pathway, WNT, HH/GLI, Fat-Dachsous or even mechanical tension. Furthermore, we generated phospho-mutants of Vangl2 (S82/84A) and observed that Vangl2 (S82/84A) could still facilitate the degradation of p65 (Fig. S4 B), suggesting that Vangl2 regulates the NF-kB pathway independently of its phosphorylation. In addition, a discussion section has been added to the revised version. (Page 12, lines 377-393)

      (2) What explains the specificity of organ/cell-type Vangl2 upregulation and its consequences observed here? Why is NF-kappaB signaling not more broadly or even ubiquitously affected in all cell types in a Vangl2-dependent manner, rather than being restricted to macrophages, neutrophils and peritoneal macrophages, or, for that matter, in spleen and LN and not liver and lung? After all, one may think that the PCP proteins, as well as NF-kappaB, are ubiquitous.

      Thank you for the reviewer's comments. A similar question has been addressed above (refer to the response to question 3 of reviewer 3).

      (3) Another specificity-related question that comes to mind is whether the Vangl2 function in autolysosomal/autophagic degradation is restricted to p65 as the exclusive substrate? The cytosolic targeting of p65 as opposed to the more well-known nuclear-targeting is interesting.

      Our previous finding demonstrated that Vangl2 inhibits antiviral IFN-I signaling by targeting TBK1 for autophagic degradation (doi: 10.1126/sciadv.adg2339), thereby indicating that p65 is not the sole substrate for Vangl2. However, in the NF-kB pathway, p65 is a specific substrate for Vangl2. Moreover, our findings indicate that the interaction between Vangl2 and p65 occurs predominantly in the cytoplasm, rather than in the nucleus (Fig. S4 C).

      (4) Pharmacological approach is used to tease apart autolysosome versus proteasome pathway. What is the physiological importance of autophagic degradation? It is interesting to note that Vangl2 was already previously implicated in degrading LAMP-2A and increasing chaperon-mediated autophagy (CMA)-lysosome numbers (PMID: 34214490).

      Previous literature has demonstrated that Vangl2 can inhibit CMA degradation (PMID: 34214490). In our study, however, we found that Vangl2 can promote the selective autophagic degradation of p65. It is important to note that CMA degradation and selective autophagic degradation are two distinct degradation modes, so these findings are not contradictory.

      (5) Are these phenotypes discernable in heterozygotes or only when ablated in homozygosity? Any phenotypes recapitulated in the looptail heterozygote mice?

      We found that these phenotypes are discernible only in homozygotes.

      (6) What is the conservation of the Vangl2 p65-interaction site between Vangl2 and Vangl1? PDLIM2 recruitment between Vangl2 and Vangl1?

      We appreciate the reviewer’s comments on our manuscript. Previous studies have shown that human Vangl1 and Vangl2 share only 72% identity and have distinct functional properties (doi: 10.1530/ERC-14-0141). Thus, the interaction of Vangl2 with p65 and PDLIM2 recruitment may not necessarily occur in Vangl1.

      Comments (specific to experiments and data analyses. Please address the following):

      (7) The patient population used in Fig 1 is not described in the Methods. This is a critical omission. Were age, sex etc. controlled for between healthy and disease? How was the diagnosis made? What times during sepsis were the samples collected? As presented, this data is impossible to evaluate and interpret.

      We appreciate the reviewer’s comments on our manuscript, and we have revised the supplementary materials as suggested. (Supplementary information, Page 12, lines 146-147)

      (8) In general, the statistical method should be described for each experiment presented in the figures. Comparisons should not be made only at the time point with maximal difference (such as in Fig 1F or Fig 2C), but at all time points using appropriate statistical methods. The sample size should also be included to allow determination of the appropriateness of parametric or non-parametric tests.

      We appreciate the reviewer’s comments on our manuscript, and we have further improved the manuscript as suggested in the revised manuscript (Figures 1F and 2C).

      (9) PCP pathways can activate p62/SQSTM1 or JNK via RhoA. JNK activation should be tested experimentally.

      According to the reviewer's comments, we further examined the effect of Vangl2 on the JNK pathway. The results showed that Vangl2 did not affect the JNK pathway (Author response image 2). This suggests that Vangl2 functions independently of the PCP pathway.

      Author response image 2.

      Vangl2 did not affect the JNK pathway. WT and Vangl2-deficient (n≥3) BMDMs were stimulated with LPS (100 ng/ml) for the indicated times. Immunoblot analysis of total and phosphorylated JNK.

      (10) Why are different cells such as A549, HEK293, CHO, 293T, THP-1 used during the studies for different experiments? Consistency would improve rigor. At least, a logical explanation driving the cell type of choice for each experiment should be included in the manuscript. Nonetheless, one aspect of using a panel of cell lines indicates that the effect of Vangl2 on NF-kappa B is pleiotropic.

      We are grateful to the reviewer for their comments on our manuscript. A549, HEK293, CHO, and 293T cells are commonly utilized in protein-protein interaction studies. The selection of cell lines for overexpression (exogenous) experiments depends on their transfection efficiency and their ability to express TLR4 (the receptor for LPS). Additionally, we conducted endogenous experiments using THP-1 cells and BMDMs, which are a human macrophage cell line and murine primary macrophages, respectively. Moreover, we generated Vangl2f/f lyz-cre mice by specifically knocking out Vangl2 in myeloid cells, and investigated the effect of Vangl2 on NF-kB signaling in vivo.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The manuscript describes the crystal structures of Streptococcus pneumoniae NOXs. Crystals were obtained for the wild-type and mutant dehydrogenase domain, as well as for the full-length protein comprising the membrane domain. The manuscript further carefully studies the enzyme's kinetics and substrate-specificity properties. Streptococcus pneumoniae NOX is a non-regulated enzyme, and therefore, its structure should provide a view of the NOX active conformation. The structural and biochemical data are discussed on this ground.

      Strengths:

      This is very solid work. The protein chemistry and biochemical analysis are well executed and carefully described. Similarly, the crystallography must be appreciated given the difficulty of obtaining good enzyme preparations and the flexibility of the protein. Even if solved at medium resolution, the crystal structure of the full-length protein conveys relevant information. The manuscript nicely shows that the domain rotations are unlikely to be the main mechanistic element of NOX regulation. It rather appears that the NADPH-binding conformation is pivotal to enzyme activation. The paper extensively refers to the previous literature and analyses the structures comprehensively with a comparison to previously reported structures of eukaryotic and prokaryotic NOXs.

      We thank the referee for these very nice comments about our work.

      Weaknesses:

      The manuscript is not always very clear with regard to the analysis of NADPH binding. The last section describes a "crevice" featured by the NADPH-binding sites in NOXs. It remains unclear whether this element corresponds to the different conformations of the protein C-terminal residues or more extensive structural differences. This point must be clarified.

      We agree with the referee that our terminology was not very clear. Responding to your comment helped us to improve our explanation: we have changed the text to emphasize the differences we observe in the distances between the FAD binding groove and the entire NADPH binding groove, which includes conserved NADPH-contacting motifs as well as the critical aromatic residue.

      A second less convincing point concerns the nature of the electron acceptor. The manuscript states that this NOX might not physiologically act as a ROS producer. A question then immediately arises: Is this protein an iron reductase?

      Can the authors better discuss or provide more data about this point?

      The referee has a legitimate point, which was also our first idea. In the initial work on SpNOX, where we discovered bacterial NOX enzymes (see Hajjar et al 2017 in mBio), we evaluated its possible role as an iron reductase. There we showed that SpNOX can reduce Cyt c directly; however, while some reduction of the Fe3+-NTA complex (classically used in ferric reductase activity assays) occurred, this reduction was inhibitable by SOD and occurred indirectly via the superoxide produced, and is therefore not a true iron reductase activity. This represents a mixed situation of direct and indirect reduction of an iron-containing acceptor that appears to preclude physiological iron reductase activity, since it appears that the protein component of Cyt c allows it to interact with SpNOX. As these questions had already been addressed in a previous paper, we did not add anything here; we prefer to underline the possibility of another acceptor and to leave this question open for future work.

      Reviewer #2 (Public Review):

      The authors describe the structure of the S. pneumoniae Nox protein (SpNOX). This is a first. The relevance of it to the structure and function of eukaryotic Noxes is discussed in depth.

      Strengths and Weaknesses

      One of the strengths of this work is the effort put into preparing a pure and functionally active SpNOX preparation. The protein was expressed in E. coli and the purification and optimization of its thermostability and activity are described in detail, involving salt concentration, glycerol concentration, and pH.

      This reviewer was surprised by the fact that the purification protocol in the eLife paper differs from those in the mBio and Biophys. J. papers by the absence of the detergent lauryl maltose neopentyl glycol (LMNG). LMNG is only present in the activity assay at a low concentration (0.003%; molar data should be given; by my calculation, this corresponds to 30 μM).

      We regret this misunderstanding: our description was not clear enough. As the referee points out, in previous papers we purified the full-length SpNOX with the detergent LMNG. In the current paper, we described only the protocol for the SpNOX DH domain variant, a soluble cytoplasmic domain. We have now modified the text to clarify the difference between the purification of full-length SpNOX variants, which was performed with detergent as cited in Vermot et al 2020, and the purification of DH domains, which are soluble and thus did not require detergent in the purification.
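
      The percent-to-molar conversion requested by the referee can be checked with a short calculation. The sketch below is only illustrative: it assumes that the 0.003% figure is weight per volume (g per 100 mL) and that the molecular weight of LMNG is about 1005.19 g/mol; neither value is stated in the manuscript, and the function name is hypothetical.

```python
# Assumption: molecular weight of LMNG (lauryl maltose neopentyl glycol), in g/mol.
LMNG_MW_G_PER_MOL = 1005.19


def percent_wv_to_micromolar(percent_wv: float, mw_g_per_mol: float) -> float:
    """Convert a % (w/v) concentration, i.e. g per 100 mL, to micromolar."""
    grams_per_litre = percent_wv * 10.0          # g/100 mL -> g/L
    molar = grams_per_litre / mw_g_per_mol       # g/L -> mol/L
    return molar * 1e6                           # mol/L -> micromol/L


# Assumption: the 0.003% quoted for the activity assay is w/v.
lmng_micromolar = percent_wv_to_micromolar(0.003, LMNG_MW_G_PER_MOL)
print(f"{lmng_micromolar:.1f} uM")
```

      Under these assumptions the result is close to the referee's estimate of roughly 30 μM.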

      In light of the presence of lipids in cryo-EM-solved structures of DUOX and NOX2, it is surprising that the authors did not use reconstitution of the purified SpNOX in phospholipid (nanodisk?). The issue is made more complicated by the statement on p. 18 of "structures solved in detergent like ours" when no use of detergent in the solubilization and purification of SpNOX is mentioned in the Methods section (p. 21-22).

      As stated above, detergent was used to purify the full-length version of SpNOX. We did in fact perform some preliminary tests of reconstitution in nanodiscs. Different trials of negative staining studies showed heterogeneous size of SpNOX in nanodiscs and the initial images were not promising. Furthermore, in parallel, we had positive results in crystallography relatively quickly with protein in detergent. We thus focused on refining the crystals, which was a fairly long and mobilizing task; we decided to allocate time and resources to the promising avenue and did not further pursue nanodiscs.

      We did not go in the cryo-EM direction because the small size of the protein was initially believed to be a significant barrier to successful cryo-EM. Perhaps we could have pursued this avenue: while our manuscript was under submission at eLife, another group deposited a preprint on bioRxiv using cryo-EM to solve the structure of SpNOX (see comment below). This structure was solved in detergent, so even this cryo-EM structure provides no information on the potential roles of lipids raised by the referee.

      In this revised version, we have added a comment, in the last paragraph, in reference to the additional data available today thanks to the other structures generated by this other group (Murphy's group).

      Can the authors provide information on whether E. coli BL21 is sufficiently equipped for the heme synthesis required for the expression of the TM domain of SpNOX? Was supplementation with δ-aminolevulinic acid used?

      His-SpNOX was produced in E. coli C41(DE3) without any δ-aminolevulinic acid supplementation. Supplementation was tested but no change was observed in the heme content (UV/Visible spectra), so we settled on the purification described by Vermot et al 2020. Initially, for the mBio paper (Hajjar et al 2017), we performed heme titrations, which gave a stoichiometry between 1.35 and 1.5 heme/protein, indicating 2 hemes (these data were not shown). Finally, in this work we observed two hemes in the crystal structure, confirming that E. coli, at least for this protein, did not need supplementation with δ-aminolevulinic acid.

      The 3 papers on SpNOX present more than convincing evidence that SpNOX is a legitimate Nox that can serve as a legitimate model for eukaryotic Noxes (cyanide resistance, inhibition by DPI, absolute FAD dependence, and NADPH/NADH as the donor or electrons to FAD). It is also understood that the physiological role of SpNOX in S. pneumoniae is unknown and that the fact that it can reduce molecular oxygen may be an experimental situation that does not occur in vivo.

      I am, however, linguistically confused by the statement that "SpNOX requires "supplemental" FAD". Noxes have FAD bound non-covalently and this is the reason that, starting from the key finding of Babior on NOX2 back in 1977 to the present, FAD has to be added to in vitro systems to compensate for the loss of FAD in the course of the purification of the enzyme from natural sources or expression in a bacterial host. I wonder whether this makes FAD more of a cosubstrate than a prosthetic group unless what the authors intend to state is that SpNOX is not a genuine flavoprotein.

      We believe there is some confusion between SpNOX (the full-length transmembrane protein) and SpNOXDH (the cytosolic domain only). The sentence pinpointed by the referee was in fact “The strict requirement of FAD addition for SpNOXDH activity suggests that the flavin behaves as a cosubstrate”. This statement was about the isolated cytosolic domain, which does not contain the TM part of the protein.

      We agree that in WT NOX enzymes (including SpNOX) FAD is held within the enzyme structure and thus can be considered, by definition, as a prosthetic group. This is supported by the nanomolar affinity for FAD of SpNOX. We did not intend to say that NOX and SpNOX are not genuine flavoproteins.

      On the other hand, when isolated, the affinity of DH domain for flavins drops to the µM level. This µM level of affinity does not allow stable maintenance of the flavin in the active site as illustrated by the spectra of Figure 3. This is instead the typical affinity of a substrate or a co-substrate (similar to that of substrate NADPH) that can be exchangeable and diffuse in and out of the active site. The DH domain recognizes and reduces flavins but, as a consequence of its lower affinity, will release to its environment free reduced flavins. Thus the isolated DH behaves as a flavin reductase that uses flavin as substrate. Such enzymes have already been well described (and some of them are of the FNR family). Such enzymes, using flavin as substrate, typically have affinity for flavin in the µM range and share with the SpNOX DH binding properties centered on the isoalloxazine ring only.

      We understand that, in the text, to switch from the SpNOX to the SpNOX DH and for FAD from a prosthetic group to a diffusible co-substrate can be confusing. So, to make it clearer, we modified the following sentences and added references to “some flavin reductases characterization” that could provide support for the reader.

      “The strict requirement of FAD addition for SpNOXDH activity and its µM level of affinity suggests that the flavin behaves as a co-substrate rather than a prosthetic group. As an isolated domain, SpNOXDH may work as a flavin reductase enzyme (Gaudu et al, 1994; Fieschi et al 1995; Nivière et al 1996), ..”

      We hope that it will help.

      I am also puzzled by the statement that SpNOX "does not require the addition of Cyt c to sustain superoxide production". Researchers with a Cartesian background should differentiate between cause and effect. Cyt c serves merely as an electron acceptor from superoxide made by SpNOX but superoxide production and NADPH oxidation occur independently of the presence of added Cyt c.

      Thanks to the referee for pointing out this poor wording. We agree and have amended the text to clarify what we originally meant. It is now:

      “SpNOXDH requires supplemental FAD to sustain both superoxide production, which can be observed in the presence of Cyt c (Figure 2A), and NADPH oxidation, which can be observed in the absence of Cyt c (Figure 2B).”

      The ability of the DH domain of SpNOX (SpNOXDH) to produce superoxide is surprising to this reviewer. The result is based on the inhibition of Cyt c reduction by added superoxide dismutase (SOD) by 40%. In all eukaryotic Noxes superoxide is produced by the one-electron reduction of molecular oxygen by electrons originating from the distal heme, having passed from reduced FAD via two hemes. The proposal that superoxide is generated by direct transfer of electrons from FAD to oxygen deserves a more in-depth discussion and relies too heavily on the inhibitory effect of SOD. A control experiment with inactivated SOD should have been done (SOD is notoriously heat resistant and inactivation might require autoclaving).

      The initial reports of a NOX DH-domain-only construct (that of human Nox4) producing superoxide are cited in the text. Moreover, natural flavin reductases are known to produce superoxide due to the release of free reduced flavin in the medium.

      As explained above, FAD in full-length SpNOX is a relay for the electrons from NADPH to heme; it is internal to the protein and thus devoted to this specific task.

      In the case of SpNOX DH, its flavin reductase behavior leads to the release in the medium of free reduced flavin as a nonspecific diffusible electron carrier. It has been already demonstrated that such free reduced flavin can efficiently reduce soluble O2 and be a source of superoxide.

      This has been particularly well documented in (Gaudu et al, 1994. J.Biol.Chem). We have added this reference to the text (see the modified sentence in a reply, 2 comments above).

      Furthermore, we want to point out to the referee that the link between flavin and superoxide production here is not based only on the inhibition by SOD. When we added the flavin inhibitor DPI, we no longer observed superoxide production from the DH domain (Figure 2C). This supports the role of free reduced flavin both in the production of superoxide and in the direct Cyt c reduction observed.

      An unasked and unanswered question is that, since under aerobic conditions, both direct Cyt c reduction (60%) and superoxide production (40%) occur, what are the electron paths responsible for the two phenomena occurring simultaneously?

      We thank the referee for their dedication to a clear understanding of the mechanism used by the SpNOXDH construct. It pushes us to develop a clear description of the mechanism at work here for the readers. Please find below a proposed mechanism describing the electron transfer from NAD(P)H to free flavin, which can then, as a diffusible species, non-specifically reduce either the O2 or the Cyt c it encounters.

      Author response image 1.

      However, it is important to remember that this is not physiological, and rather the result of using a DH domain isolated from the TM of SpNOX. Nonetheless, it shows that the DH domain is fully functional for NAD(P)H as well as the hydride transfer.

      This reviewer had difficulty in following the argument that the fact that the kcat of SpNOX and SpNOXDH are similar supports the thesis that the rate of enzyme activation is dependent on hydride transfer from nicotinamide to FAD.

      We have amended the text to clarify this point. If the reaction rate is not affected by the presence or absence of the hemes in the TM domain, this inevitably implies that the rate is NOT limited by the electron transfer to the heme, and ultimately to O2, from the FAD, and thus the hydride transfer step that oxidizes the FAD must be the rate limiting step.

      The section dealing with mutating F397 is a key part of the paper. There is a proper reference to the work of the Karplus group on plant FNRs (Deng et al). However, later work, addressing comparison with NOX2, should be cited (Kean et al., FEBS J., 284, 3302-3319, 2017). Also, work from the Dinauer group on the minimal effect of mutating or deleting the C-terminal F570 in NOX2 on superoxide production should be cited (Zhen et al., J. Biol. Chem. 273, 6575-6581, 1998).

      We thank the reviewer for pointing out our unintended omission of these important works; we have amended the text and added the citations.

      It is not clear why mutating F397 to W (both residues having aromatic side chains) would stabilize FAD binding.

      In a few words, Trp’s double ring can establish larger and stronger van der Waals contacts with the isoalloxazine ring than the Phe side chain. Our discussion regarding this point is extensive in the structural section where we compare the structures with F and W in this position. At this time we do not think it is necessary to add anything to the text.

      Also, what is meant by "locking the two subdomains of the DH domain"? What subdomains are meant?

      The two subdomains are the NADPH-binding domain and the FAD-binding domain, which we define on p 11 (“SpNOXDH presents a typical fold of the FNR superfamily of reductase domain containing two sub-domains, the FAD-binding domain (FBD) and an NADPH-binding domain (NBD) “) and which are labeled in Fig. 4. By “locking” we meant to convey immobilizing them into a specific conformation; we have amended the text to clarify this point.

      Methodological details on crystallization (p. 11) should be delegated to the Methodology section. How many readers are aware that SAD means "Single Wavelength Anomalous Diffraction" or know what is the role of sodium bromide?

      We have amended the text to emphasize the intended point, which is the different origins of the two DH structures: the de novo structure was possible through co-crystallization with bromide, and the molecular replacement structure used the de novo structure as a model.

      The data on the structure of SpNOX are supportive of a model of Nox activation that is "dissident" relative to the models offered for DUOX and NOX2 activation. These latter models suggested that the movement of the DH domain versus the TM domain was related to conversion from the resting to the activated state. The findings reported in this paper show that, unexpectedly, the domain orientation in SpNOX (constitutively active!) is much closer to that of resting NOX2. One of the criteria associated with the activated state in Noxes was the reduction of the distance between FAD and the proximal heme. The authors report that, paradoxically, this distance is larger in the constitutively active SpNOX (9.2 Å) than that in resting state NOX2 (7.6 Å) and the distance in Ca2+-activated DUOX is even larger (10.2 Å).

      A point made by the authors is the questioning of the paradigm that activation of Noxes requires DH domain motion.

      Instead, the authors introduce the term "tensing", within the DH domain, from a "relaxed" to a more rigid conformation. I believe that this proposal requires a somewhat clearer elaboration.

      It is clear that the distance between the FAD and NADPH shown in the Duox and Nox2 structures is too large for the chemical reaction of hydride transfer. Wu et al used the terms ‘tense’ and ‘relaxed’ to describe conformations of the DH domain corresponding to ‘short distance’ and ‘longer distance’, respectively, between the two ligand binding sites. We quoted this terminology and have amended the text to clarify that we envision a motion of the NBD relative to the FBD, as distinct from a larger motion of the whole DH domain relative to the TM domain.

      The statement on p. 18, in connection to the phospholipid environment of Noxes, that the structure of SpNOX was "solved in detergent" is puzzling since the method of SpNOX preparation and purification does not mention the use of a detergent. As mentioned before, this absence of detergent in the present report was surprising because LMNG was used in the methods described in the mBio and Biophys. J. papers. The only mention of LMNG in the present paper was as an addition at a concentration of 0.003% in the activity assay buffers.

      Please see our response to similar points above. Detergent was present for the solubilization of the full-length SpNOX.

      The Conclusions section contains a proposal for the mechanism of conversion of NOX2 from the resting to the activated state. The inclusion of this discussion is welcome but the structural information on the constitutively active SpNOX can, unfortunately, contribute little to solving this important problem. The work of the Lambeth group, back in 1999 (cited as Nisimoto et al.), on the role of p67-phox in regulating hydride transfer from NADPH to FAD in NOX2 may indeed turn out to have been prophetic. However, only solving the structure of the assembled NOX2 complex will provide the much-awaited answer. The heterodimerization of NOX2 with p22-phox, the regulation of NOX2 by four cytosolic components, and the still present uncertainty about whether p67-phox is indeed the final distal component that converts NOX2 to the activated state make this a formidable task.

      The work of the Fieschi group on SpNOX is important and relevant but the absence of external regulation, the absence of p22-phox, and the uncertainty about the target molecule make it a rather questionable model for eukaryotic Noxes. The information on the role of the C-terminal Phe is of special value although its extension to the mechanism of eukaryotic Nox activation proved, so far, to be elusive.

      We really thank the referee for the positive comments on our work and the deep interest shown by this careful evaluation.

      We understand the arguments of the referee regarding the relevance of our work here to eukaryotic NOX, but we do not share the reservations expressed. While human NOXes need interactions with other proteins or have EF-hand or other domains that control them, SpNOX corresponds exactly to the minimal core common to any NOX isoform. In fact, because SpNOX has only this conserved core, it is unique in that it can work as a constitutively active NOX without protein-protein interactions or regulatory domains. Thus the fundamentals of electron transfer mechanisms of NOX enzyme are present in SpNOX.

      There might be some differences in the internal organization from isoform to isoform (as regarding the relative DH domain vs TM domain orientation) but considering the similarity between NOX2 and SpNOX topology we are rather confident that the SpNOX structure will turn out to be a reasonable model of the activated NOX2 structure. History will tell.

      In any case, this work on SpNOX allowed us to highlight hydride transfer as the limiting step and also to highlight some structural differences that could be at the source of the regulation in eukaryotic NOX. In itself, we think this is a significant contribution to the field.

      We warmly thank both referees for their constructive remarks and their help in the improvement of this manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      • The manuscript states that the flavin "behaves" like a co-substrate and thereby reports on the Km for the flavins. I feel that this terminology might be confusing. The flavin is unchanged after the reaction, and what matters is the enzyme's affinity for the flavin and the flavin concentration needed to saturate the enzyme (to have it in the fully holo form).

      See above: in answering many questions from referee 2, we have extensively commented on that point (substrate, cofactor, affinity, etc.) and made some adjustments in the text to clarify. We hope it is now satisfactory.

      • I could not find the methodological description of the experiments performed to measure the Km for the flavins, and the legend of Figure S4 does not help in this regard. I think that the data (left panels of S4) should be interpreted as binding curves with associated Kd values.

      We have changed the text to clarify the method used to measure Km for flavins.

      • A related point is that the manuscript refers to Km as an "affinity". This is inappropriate and should be avoided, as the Km is not the Kd.

      We agree with the referee that the Km is not the Kd. However, under the appropriate conditions, to which our experiments conform, Km is accepted as a relevant approximation of affinity (Srinivasan, FEBS Journal, v 289 pp 6086-6098 2022). We have added a sentence to clarify this point and cite this reference in the text.
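      The relation between Km and Kd behind this point can be written out with the standard Michaelis-Menten scheme. The rate constants $k_1$, $k_{-1}$ and $k_{\mathrm{cat}}$ below are generic textbook symbols used as an illustration; they are not values measured for SpNOX, and treating the flavin as the species $S$ is an assumption made only for this sketch.

```latex
\[
E + S \underset{k_{-1}}{\overset{k_{1}}{\rightleftharpoons}} ES
\xrightarrow{\;k_{\mathrm{cat}}\;} E + P,
\qquad
K_m = \frac{k_{-1} + k_{\mathrm{cat}}}{k_{1}},
\qquad
K_d = \frac{k_{-1}}{k_{1}}
\]
\[
K_m = K_d + \frac{k_{\mathrm{cat}}}{k_{1}} \;\approx\; K_d
\quad \text{when } k_{\mathrm{cat}} \ll k_{-1}.
\]
```

      In this form, Km always overestimates Kd by the term $k_{\mathrm{cat}}/k_{1}$, so Km approximates affinity only when dissociation of the complex is much faster than turnover.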

      • The environment around the putative oxygen site should be shown. The text indicates that "the residues characteristic of the O2 reducing center in eukaryotic FRD domains of NOX and DUOX enzymes are not conserved in SpNOX." How does the site look? This point relates to the more general comment above on the oxidizing substrate used by this bacterial NOX.

      This is a really interesting point that contains many potential biological developments for future studies of this prokaryotic family of NOX enzymes. While we were submitting this work to eLife for evaluation, another group (Murphy's lab) posted a preprint on bioRxiv, in which they also solved the structure of SpNOX, but this time by cryo-EM, with an unexpected level of resolution for such a small protein (their paper is not yet published but is probably under peer review somewhere). In their work, they made a special effort to identify the O2 reducing center (bacterial NOX sequence alignment, mutation studies, …). They were not able to localize such a site with accuracy. There are also other complementary data between their work and ours. So, we will add a paragraph at the end of the discussion to comment on this parallel work and to emphasize the complementarity of their studies and what it brings to the final understanding of this enzyme.

      • The section "A Close-up View of NOX's NAD(P)H Binding Domains vs the FNR Gold Standard" should be clarified.

      I found it difficult to understand. Is the different conformation of Phe397 creating the crevice? Could NADPH be modeled in NOX2 and DUOX in the same conformation observed in FNR and modeled in the bacterial NOX? Or would there be clashes, implying the necessity of larger conformational changes to bring the nicotinamide closer to the FAD?

      Please see responses above on this point; we have amended the text to clarify. In a few words, we propose that activation in the eukaryotic enzymes would entail movement of the NBD subdomain (containing the NADPH site) towards the FBD subdomain (containing FAD) through an internal motion within the DH domain. In doing so, they would approach the DH domain topology of SpNOX, which models an active state.

      Reviewer #2 (Recommendations For The Authors):

      On p. 6, second line, it should be (Figure 1C and 1D). Space is missing between C and "and".

      On p. 9, in Figure 3, the labeling A and B are missing. Also, the legend of part B does not correspond to the actual graph colors. Thus, the tracing of F397W is red and not grey as indicated in the legend.

      Corrected. Thank you

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Summary: 

      In this work, the authors examine the activity and function of D1 and D2 MSNs in dorsomedial striatum (DMS) during an interval timing task. In this task, animals must first nose poke into a cued port on the left or right; if not rewarded after 6 seconds, they must switch to the other port. Critically, this task thus requires animals to estimate if at least 6 seconds have passed after the first nose poke - this is the key aspect of the task focused on here. After verifying that animals reliably estimate the passage of 6 seconds by leaving on average after 9 seconds, the authors examine striatal activity during this interval. They report that D1-MSNs tend to decrease activity, while D2-MSNs increase activity, throughout this interval. They suggest that this activity follows a drift-diffusion model, in which activity increases (or decreases) to a threshold after which a decision (to leave) is made. The authors next report that optogenetically inhibiting D1 or D2 MSNs, or pharmacologically blocking D1 and D2 receptors, increased the average wait time of the animals to 10 seconds. This suggests that both D1 and D2 neurons contribute to the estimate of time, with a decrease in their activity corresponding to a decrease in the rate of 'drift' in their drift-diffusion model. Lastly, the authors examine MSN activity while pharmacologically inhibiting D1 or D2 receptors. The authors observe that most recorded MSNs decrease their activity over the interval, with the rate decreasing with D1/D2 receptor inhibition.

      Major strengths: 

      The study employs a wide range of techniques - including animal behavioral training, electrophysiology, optogenetic manipulation, pharmacological manipulations, and computational modeling. The behavioral task used by the authors is quite interesting and a nice way to probe interval timing in rodents. The question posed by the authors - how striatal activity contributes to interval timing - is of importance to the field and has been the focus of many studies and labs; thus, this paper can meaningfully contribute to that conversation. The data within the paper is presented very clearly, and the authors have done a nice job presenting the data in a transparent manner (e.g., showing individual cells and animals). Overall, the manuscript is relatively easy to read and clear, with sufficient detail given in most places regarding the experimental paradigm or analyses used. 

      We are glad our main points came through to the reviewer.  

      Major weaknesses: 

      I perceive two major weaknesses. The first is the impact or contextualization of their results in terms of the results of the field more broadly. More specifically, it was not clear to me how the authors are interpreting the striatal activity in the context of what others have observed during interval timing tasks. In other words - what was the hypothesis going into this experiment? Does observing increasing/decreasing activity in D2 versus D1 support one model of interval timing over another, or does it further support a more specific idea of how DMS contributes to interval timing? Or was the main question that we didn't know if D2 or D1 neurons had differential activity during interval timing? 

      This is a helpful comment. Our hypothesis was that D1 and D2 MSNs had similar patterns of activity. Our rationale is prior behavioral work from our group describing that blocking striatal D1 and D2 dopamine receptors had similar behavioral effects on interval timing (De Corte et al., 2019; Stutt et al., 2023). We rewrote our introduction with this idea in mind (Line 89):

      “We and others have found that striatal MSNs encode time across multiple intervals by time-dependent ramping activity or monotonic changes in firing rate across a temporal interval (Emmons et al., 2017; Gouvea et al., 2015; Mello et al., 2015; Wang et al., 2018). However, the respective roles of D2-MSNs and D1-MSNs are unknown. Past work has shown that disrupting either D2-dopamine receptors (D2) or D1-dopamine receptors (D1) powerfully impairs interval timing by increasing estimates of elapsed time (Drew et al., 2007; Meck, 2006). Similar behavioral effects were found with systemic (Stutt et al., 2024) or local dorsomedial striatal D2 or D1 disruption (De Corte et al., 2019a). These data lead to the hypothesis that D2 MSNs and D1 MSNs have similar patterns of ramping activity across a temporal interval. 

      We tested this hypothesis with a combination of optogenetics, neuronal ensemble recording, computational modeling, and behavioral pharmacology. We use a well-described mouse-optimized interval timing task (Balci et al., 2008; Bruce et al., 2021; Larson et al., 2022; Stutt et al., 2024; Tosun et al., 2016; Weber et al., 2023). Strikingly, optogenetic tagging of D2-MSNs and D1-MSNs revealed distinct neuronal dynamics, with D2-MSNs tending to increase firing over an interval and D1-MSNs tending to decrease firing over the same interval, similar to opposing movement dynamics (Cruz et al., 2022; Kravitz et al., 2010; Tecuapetla et al., 2016). MSN dynamics helped construct and constrain a four-parameter drift-diffusion computational model of interval timing, which predicted that disrupting either D2-MSNs or D1-MSNs would increase interval timing response times. Accordingly, we found that optogenetic inhibition of either D2-MSNs or D1-MSNs increased interval timing response times. Furthermore, pharmacological blockade of either D2- or D1-receptors also increased response times and degraded trial-by-trial temporal decoding from MSN ensembles. Thus, D2-MSNs and D1-MSNs have opposing temporal dynamics yet disrupting either MSN type produced similar effects on behavior. These data demonstrate how striatal pathways play complementary roles in elementary cognitive operations and are highly relevant for understanding the pathophysiology of human diseases and therapies targeting the striatum.”

      In the second, I felt that some of the conclusions suggested by the authors don't seem entirely supported by the data they present, or the data presented suggests a slightly more complicated story. Below I provide additional detail on some of these instances. 

      Regarding the results presented in Figures 2 and 3: 

      I am not sure the PC analysis adds much to the interpretation, and potentially unnecessarily complicates things. In particular, running PCA on a matrix of noisy data that is smoothed with a Gaussian will often return PCs similar to what is observed by the authors, with the first PC being a line up/down, the 2nd PC being a parabola that is up/down, etc. Thus, I'm not sure that there is much to be interpreted by the specific shape of the PCs here. 

      We are glad the reviewer raised this point. First, regarding the components in noisy data, what the reviewer says is correct, but usually, the variance explained by PC1 is small. This is the reason we include scree plots in our PC analysis (Fig 3B and Fig 6G). When we compare our PC1s to variance explained in random data, our PC1 variance is always stronger. We have now included this in our manuscript:

      First, we generated random data and examined how much variance PC1 might generate. 

      We added this to the methods (Line 634)

      “The variance of PC1 was empirically compared against data generated from 1000 iterations of data from random timestamps with identical bins and kernel density estimates. Average plots were shown with Gaussian smoothing for plotting purposes only.”

      These data suggested that our PC1 was stronger than that observed in random data (Line 183):

      “PCA identified time-dependent ramping activity as PC1 (Fig 3A), a key temporal signal that explained 54% of variance among tagged MSNs (Fig 3B; variance for PC1 p = 0.009 vs 46 (44-49)% variance for PC1 derived from random data; Narayanan, 2016).”

      And in the pharmacology data (Line 367):

      “The first component (PC1), which explained 54% of neuronal variance, exhibited “time-dependent ramping”, or monotonic changes over the 6 second interval immediately after trial start (Fig 6F-G; variance for PC1 p = 0.001 vs 46 (45-47)% variance in random data; Narayanan, 2016).”
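      The comparison of PC1 variance against random data quoted above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the analysis code used for the manuscript: the function names, the uniformly distributed random timestamps, the Gaussian kernel bandwidth, the bin count, and the empirical p-value formula are all hypothetical choices made for this example.

```python
import numpy as np

def pc1_variance_fraction(rates):
    """Fraction of variance explained by PC1 for a (neurons x time bins) matrix."""
    rates = np.asarray(rates, dtype=float)
    # z-score each neuron's time course
    centered = rates - rates.mean(axis=1, keepdims=True)
    sd = centered.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0
    z = centered / sd
    # PCA via SVD, treating time bins as variables (columns are mean-centered)
    z = z - z.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(z, compute_uv=False)
    variances = singular_values ** 2
    return variances[0] / variances.sum()

def smoothed_density(timestamps, bin_centers, bandwidth):
    """Gaussian kernel density estimate of spike timestamps at the bin centers."""
    diffs = (bin_centers[:, None] - timestamps[None, :]) / bandwidth
    return np.exp(-0.5 * diffs ** 2).sum(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

def null_pc1_distribution(n_neurons, n_spikes, n_iter=1000, duration=6.0,
                          n_bins=60, bandwidth=0.3, seed=0):
    """PC1 variance fractions for ensembles built from random timestamps."""
    rng = np.random.default_rng(seed)
    centers = (np.arange(n_bins) + 0.5) * duration / n_bins
    out = np.empty(n_iter)
    for i in range(n_iter):
        rates = np.vstack([
            smoothed_density(rng.uniform(0, duration, n_spikes), centers, bandwidth)
            for _ in range(n_neurons)
        ])
        out[i] = pc1_variance_fraction(rates)
    return out

def empirical_p(observed, null):
    """One-sided empirical p-value: how often the null matches or exceeds the observation."""
    return (np.sum(null >= observed) + 1) / (null.size + 1)
```

      With an observed PC1 variance fraction in hand, `empirical_p(observed, null_pc1_distribution(...))` gives the kind of comparison reported in the quoted passages, under the assumptions listed above.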

      Second, we note that we have used this analysis extensively in the past, and PC1 has always been identified as a linear ramping in our work and in work by others (Line 179):

      “Work by our group and others has uniformly identified PC1 as a linear component among corticostriatal neuronal ensembles during interval timing (Bruce et al., 2021; Emmons et al., 2020, 2019, 2017; Kim et al., 2017a; Narayanan et al., 2013; Narayanan and Laubach, 2009; Parker et al., 2014; Wang et al., 2018).”

      Third, we find that PC1 is highly correlated to the GLM slope (Line 205):

      “Trial-by-trial GLM slope was correlated with PC1 scores in Fig 3A-C (PC1 scores vs. GLM slope r = -0.60, p = 10⁻⁸).”

      Fourth, our goal was not to heavily interpret PC1 – but to compare D1 vs. D2 MSNs, or compare population responses to D2/D1 pharmacology. We have now made this clear in introducing PCA analyses in the results (Line 177):

      “To quantify differences in D2-MSNs vs D1-MSNs, we turned to principal component analysis (PCA), a data-driven tool to capture the diversity of neuronal activity (Kim et al., 2017a).”

      Finally, despite these arguments the reviewer’s point is well taken. Accordingly, we have removed all analyses of PC2 from the manuscript which may have been overly interpretative. 

      We have now removed language that interpreted the components, and we now find the discussion of PC1 much more data-driven. We have also removed much of the advanced PC analysis in Figure S9. Given our extensive past work using this exact analysis of PC1, we think PCA adds a considerable amount to our manuscript, and it is now justified as the reviewer suggested.

      I think an alternative analysis that might be both easier and more informative is to compute the slope of the activity of each neuron across the 6 seconds. This would allow the authors to quantify how many neurons increase or decrease their activity much like what is shown in Figure 2.  

      We agree – we now do exactly this analysis in Figure 3D. We now clarify this in detail, adding the reviewer’s language to the methods (Line 648):

      “To measure time-related ramping over the first 6 seconds of the interval, we used trial-by-trial generalized linear models (GLMs) at the individual neuron level in which the response variable was firing rate and the predictor variable was time in the interval or nosepoke rate (Shimazaki and Shinomoto, 2007). For each neuron, its time-related “ramping” slope was derived from the GLM fit of firing rate vs time in the interval, for all trials per neuron. All GLMs were run at a trial-by-trial level to avoid effects of trial averaging (Latimer et al., 2015) as in our past work (Bruce et al., 2021; Emmons et al., 2017; Kim et al., 2017b).”

      And to the results (Line 194):

      “To interrogate these dynamics at a trial-by-trial level, we calculated the linear slope of D2-MSN and D1-MSN activity over the first 6 seconds of each trial using generalized linear modeling (GLM) of effects of time in the interval vs trial-by-trial firing rate (Latimer et al., 2015).”
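      As a rough illustration of the per-neuron slope analysis described in these passages, the sketch below fits firing rate against time in the interval with nosepoke rate as a movement regressor. It assumes a Gaussian GLM with identity link, which is equivalent to ordinary least squares; the GLM family used in the manuscript is not stated here, and the function name and input layout are hypothetical.

```python
import numpy as np

def ramping_slope(firing_rate, time_in_interval, nosepoke_rate):
    """Slope of firing rate vs time for one neuron.

    Inputs are 1-D arrays with one entry per (trial, time bin) observation,
    so trials are never averaged before fitting. Assumes a Gaussian GLM with
    identity link, i.e. ordinary least squares.
    """
    time_in_interval = np.asarray(time_in_interval, dtype=float)
    nosepoke_rate = np.asarray(nosepoke_rate, dtype=float)
    # Design matrix: intercept, time in interval, nosepoke (movement) regressor
    X = np.column_stack([np.ones_like(time_in_interval),
                         time_in_interval,
                         nosepoke_rate])
    coef, *_ = np.linalg.lstsq(X, np.asarray(firing_rate, dtype=float), rcond=None)
    return coef[1]  # coefficient on time = ramping slope (spikes/second per second)
```

      A negative slope corresponds to firing that decreases across the interval and a positive slope to firing that increases; comparing these slopes between cell types would then require a separate model that accounts for variance between mice.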

      Relatedly, it seems that the data shown in Figure 2D *doesn't* support the authors' main claim regarding D2/D1 MSNs increasing/decreasing their activity, as the trial-by-trial slope is near 0 for both cell types. 

      This likely refers to Figure 3D. The reviewer is correct that the changes in slope are small and near 0. Our goal was to show that D2-MSN and D1-MSN slopes were distinct – rather than increasing and decreasing. We have added this to the abstract (Line 46)

      “We found that D2-MSNs and D1-MSNs exhibited distinct dynamics over temporal intervals as quantified by principal component analyses and trial-by-trial generalized linear models.”

      We have clarified this idea in our hypothesis (Line 96):

      “These data led to the hypothesis that D2 MSNs and D1 MSNs have similar patterns of ramping activity across a temporal interval.”

      We have added this idea to the results (Line 194)

      “To interrogate these dynamics at a trial-by-trial level, we calculated the linear slope of D2-MSN and D1-MSN activity over the first 6 seconds of each trial using generalized linear modeling (GLM) of effects of time in the interval vs trial-by-trial firing rate (Latimer et al., 2015). Nosepokes were included as a regressor for movement. GLM analysis also demonstrated that D2-MSNs had significantly different slopes (-0.01 spikes/second (-0.10 – 0.10)), which were distinct from D1-MSNs (-0.20 (-0.47 – -0.06); Fig 3D; F = 8.9, p = 0.004 accounting for variance between mice (Fig S3B); Cohen’s d = 0.8; power = 0.98; no reliable effect of sex (F = 0.02, p = 0.88) or switching direction (F = 1.72, p = 0.19)). We found that D2-MSNs and D1-MSNs had significantly different slopes even when excluding outliers (4 outliers excluded outside of 95% confidence intervals; F = 7.51, p = 0.008 accounting for variance between mice) and when the interval was defined as the time between trial start and the switch response on a trial-by-trial basis for each neuron (F = 4.3, p = 0.04 accounting for variance between mice). Trial-by-trial GLM slope was correlated with PC1 scores in Fig 3A-C (PC1 scores vs. GLM slope r = -0.60, p = 10⁻⁸). These data demonstrate that D2-MSNs and D1-MSNs had distinct slopes of firing rate across the interval and were consistent with analyses of average activity and PC1, which exhibited time-related ramping.”

      And Line 215:

      “In summary, we used optogenetic tagging to record from D2-MSNs and D1-MSNs during interval timing. Analyses of average activity, PC1, and trial-by-trial firing-rate slopes over the interval provide convergent evidence that D2-MSNs and D1-MSNs had distinct and opposing dynamics during interval timing. These data provide insight into temporal processing by striatal MSNs.”

      And in the discussion (Line 415):

      “We describe how striatal MSNs work together in complementary ways to encode an elementary cognitive process, interval timing. Strikingly, optogenetic tagging showed that D2-MSNs and D1-MSNs had distinct dynamics during interval timing.”

      We have now included a new plot with box plots to make the differences in Figure 3D clear.

      Other reviewers requested additional qualitative descriptions of our data, and we have referred to increases / decreases in this context. 

      Regarding the results in Figure 4: 

      The authors suggest that their data is consistent with a drift-diffusion model. However, it is unclear how well the output from the model fits the activity from neurons the authors recorded. Relatedly, it is unclear how the parameters were chosen for the D1/D2 versions of this model. I think that an alternate approach that would answer these questions is to fit the model to each cell, and then examine the best-fit parameters, as well as the ability of the model to predict activity on trials held out from the fitting process. This would provide a more rigorous method to identify the best parameters and would directly quantify how well the model captures the data. 

      We are glad the reviewer raised these points. Our goal was to use neuronal activity to fit behavioral activity, not the reverse. While we understand the reviewer’s point, we note that one behavioral output (switch time) can be encoded by many patterns of neuronal activity; thus, we are not sure we can use the model developed for behavior to fit diverse neuronal activity, or an ensemble of neurons. We have made this clear in the manuscript (Line 251):

      “Our model aimed to fit statistical properties of mouse behavioral responses while incorporating MSN network dynamics. However, the model does not attempt to fit individual neurons’ activity, because our model predicts a single behavioral parameter – switch time – that can be caused by the aggregation of diverse neuronal activity.”

      To do something close to what the reviewer suggested, we attempted to predict behavior directly from neuronal ensembles. We have now made this clear in the methods (Line 682):

      “Analysis and modeling of mouse MSN-ensemble recordings. Our preliminary analysis found that, for a sufficiently large number of neurons ($N > 11$), each recorded ensemble of MSNs on a trial-by-trial basis could predict when mice would respond. We took the following approach: First, for each MSN, we convolved its trial-by-trial spike train $Spk(t)$ with a 1-second exponential kernel $K(t) = w\,e^{-t/w}$ if $t > 0$ and $K(t) = 0$ if $t \le 0$ (Zhou et al., 2018; here $w = 1$ s). Therefore, the smoothed, convolved spiking activity $x_j(t)$ of neuron $j$ ($j = 1, 2, \dots, N$) tracks and accumulates the most recent (one second, on average) firing-rate history of the $j$-th MSN, up to moment $t$. We hypothesized that the ensemble activity $(x_1(t), x_2(t), \dots, x_N(t))$, weighted with some weights $\beta_j$, could predict the trial switch time $t^*$ by considering the sum $x(t)$ and the sigmoid that approximates the firing rate of an output unit. Here parameter $k$ indicates how fast $x(t)$ crosses the threshold 0.5 coming from below (if $k > 0$) or coming from above (if $k < 0$) and relates the weights $\beta_j$ to the unknowns $\hat{\beta}_j = \beta_j/k$ and $\hat{\beta}_0 = -0.5/k$. Next, we ran a logistic fit for every trial for a given mouse over the spike count predictor matrix $(x_1(t), x_2(t), \dots, x_N(t))$ from the mouse MSN recorded ensemble, and observed value $t^*$, estimating the coefficients $\hat{\beta}_0$ and $\hat{\beta}_j$, and so, implicitly, the weights $\beta_j$. From there, we compute the predicted switch time $t^*_{pred}$ by condition $x(t) = 0.5$. Accuracy was quantified comparing the predicted accuracy within a 1 second window to switch time on a trial-by-trial basis (Fig S4).”
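      For readers following the quoted Methods passage, the quantities it names can be written out explicitly. The expressions below are a reconstruction inferred from the definitions in the text, not the authors' published formulas: the integral form of the convolution, the symbol $\hat{y}(t)$ for the output unit, and the exact parameterization of the sigmoid are assumptions, chosen so that the stated relations $\hat{\beta}_j = \beta_j/k$ and $\hat{\beta}_0 = -0.5/k$ hold.

```latex
\[
x_j(t) = (Spk_j * K)(t) = \int_{0}^{t} Spk_j(s)\, K(t-s)\, ds,
\qquad j = 1, \dots, N
\]
\[
x(t) = \sum_{j=1}^{N} \beta_j\, x_j(t)
\]
\[
\hat{y}(t)
= \frac{1}{1 + \exp\!\left(-\dfrac{x(t) - 0.5}{k}\right)}
= \frac{1}{1 + \exp\!\left(-\Big(\hat{\beta}_0 + \sum_{j=1}^{N} \hat{\beta}_j\, x_j(t)\Big)\right)},
\qquad
\hat{\beta}_j = \frac{\beta_j}{k},
\quad
\hat{\beta}_0 = -\frac{0.5}{k}
\]
```

      Under this form, $\hat{y}(t) = 0.5$ exactly when $x(t) = 0.5$, which matches the condition used in the passage to define the predicted switch time.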

      And in the results (Line 254): 

      “We first analyzed trial-based aggregated activity of MSN recordings from each mouse ($x_j(t)$) where $j = 1, \dots, N$ neurons. For D2-MSN or D1-MSN ensembles of $N > 11$, we found linear combinations of their neuronal activities, with some $\beta_j$ coefficients, that could predict the trial-by-trial switch response times (accuracy > 90%, Fig S4; compared with < 20% accuracy for Poisson-generated spikes of same trial-average firing rate). The predicted switch time $t^*_{pred}$ was defined by the time when the weighted ensemble activity $x(t)$ first reached the value $x(t) = 0.5$. Finally, we built DDMs to account for this opposing trend (increasing vs decreasing) of MSN dynamics and for ensemble threshold behavior defining $t^*_{pred}$; see the resulting model (Equations 1-3) and its simulations (Figure 4A-B).”
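      The threshold-crossing step of this ensemble prediction can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes binned spike counts, takes the weights as given rather than estimating them with a logistic fit, and handles only a crossing from below; the function names and array layout are hypothetical.

```python
import numpy as np

def smooth_spikes(spikes, dt, w=1.0):
    """Convolve each neuron's binned spike train with a causal exponential kernel.

    spikes: array of shape (n_neurons, n_bins) holding spike counts per bin.
    The kernel is K(t) = w * exp(-t / w) for t > 0 and 0 for t <= 0.
    """
    n_bins = spikes.shape[1]
    lags = np.arange(n_bins) * dt
    kernel = w * np.exp(-lags / w)
    kernel[0] = 0.0  # enforce K(t) = 0 at t <= 0 so a spike affects only later bins
    return np.array([np.convolve(row, kernel)[:n_bins] for row in spikes])

def predicted_switch_time(spikes, weights, dt, w=1.0, threshold=0.5):
    """First time the weighted ensemble activity x(t) reaches the threshold.

    Returns None if x(t) never reaches the threshold within the trial.
    """
    x = weights @ smooth_spikes(spikes, dt, w)  # weighted sum over neurons
    crossed = np.nonzero(x >= threshold)[0]
    return crossed[0] * dt if crossed.size else None
```

      Comparing such predicted times with the observed switch times, within a tolerance window, would give a trial-by-trial accuracy of the kind reported in Fig S4.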

      And we have added a new figure, Figure S4, that demonstrates these trial-by-trial predictions of switch response times.  

      Note that we have included predictions from shuffled data, similar to what the reviewer suggested. Predictions are derived from neuronal ensembles on that trial; thus we could not apply a leave-one-out approach to trial-by-trial predictions.

      These models are highly predictive for larger ensembles and poorly predictive for smaller ensembles.  We think this model adds to the manuscript and we are glad the reviewer suggested it. 

      Relatedly, looking at the raw data in Figure 2, it seems that many neurons either fire at the beginning or end of the interval, with more neurons firing at the end, and more firing at the beginning, for D2/D1 neurons respectively. Thus, it's not clear to me whether the drift-diffusion model is a good model of activity. Or, perhaps the model is supposed to be related to the aggregate activity of all D1/D2 neurons? (If so, this should be made more explicit. The comment about fitting the model directly to the data also still stands).  

      Our model was inspired by the aggregate activity.  We have now made this clear in the results (Line 227): 

      “Our data demonstrate that D2-MSNs and D1-MSNs have opposite activity patterns. However, past computational models of interval timing have relied on drift-diffusion dynamics with a positive slope that accumulates evidence over time (Nguyen et al., 2020; Simen et al., 2011). To reconcile how these MSNs might complement each other to effect temporal control of action, we constructed a four-parameter drift-diffusion model (DDM). Our goal was to construct a DDM inspired by average differences in D2-MSNs and D1-MSNs that predicted switch-response time behavior.”

      Further, it's unclear to me how, or why, the authors changed the specific parameters they used to model the optogenetic manipulation. Were these parameters chosen because they fit the manipulation data? This I don't think is in itself an issue, but perhaps should be clearly stated, because otherwise it sounds a bit odd given the parameter changes are so specific. It is also not clear to me why the noise in the diffusion process would be expected to change with increased inhibition. 

      We have clarified that our parameters were chosen to best fit behavior (Line 266):

      “The model’s parameters were chosen to fit the distribution of switch-response times:

      𝑭 = 𝟏, 𝒃 = 𝟎.𝟓𝟐 (so 𝑻 = 𝟎.𝟖𝟕), 𝑫 = 𝟎.𝟏𝟑𝟓, 𝝈 = 𝟎.𝟎𝟓𝟐 for intact D2-MSNs (Fig 4A, in black); and 𝑭 = 𝟎, 𝒃 = 𝟎.𝟒𝟖 (so 𝑻 = 𝟎.𝟏𝟑), 𝑫 = 𝟎.𝟏𝟒𝟏, 𝝈 = 𝟎.𝟎𝟓𝟐 for intact D1-MSNs (Fig 4B, in black).”

      Furthermore, we have clarified the approach to noise in the results (Line 247):  

      “The drift, together with noise 𝝃(𝒕) (of zero mean and strength 𝝈), leads to fluctuating accumulation which eventually crosses a threshold 𝑻 (see Equation 3; Fig 4A-B).”

      And Line 279: 

      “The results were obtained by simultaneously decreasing the drift rate 𝑫 (equivalent to lengthening the neurons’ integration time constant) and lowering the level of network noise 𝝈: 𝑫 = 𝟎.𝟏𝟐𝟗, 𝝈 = 𝟎.𝟎𝟒𝟑 for D2-MSNs in Fig 4A (in red; changes in noise had to accompany changes in drift rate to preserve switch response time variance); and 𝑫 = 𝟎.𝟏𝟐𝟐, 𝝈 = 𝟎.𝟎𝟒𝟑 for D1-MSNs in Fig 4B (in blue). The model predicted that disrupting either D2-MSNs or D1-MSNs would increase switch response times (Fig 4C and Fig 4D) and would shift MSN dynamics.”
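To illustrate why a lower drift rate lengthens switch response times, the following sketch simulates a bounded accumulation process with the parameter values quoted above. Equations 1-3 are not reproduced in this response, so the update rule used here, dx = (F − x)·D·dt + σ·dξ with x(0) = b and a crossing at T, is an assumption made for illustration, as are the function name, the step size, and the Euler-Maruyama integration scheme.

```python
import math
import random


def simulate_switch_time(F, b, T, D, sigma, dt=0.001, t_max=60.0, rng=None):
    """Simulate one trial of the assumed process and return the crossing time.

    Assumed dynamics: dx = (F - x) * D * dt + sigma * dW, with x(0) = b.
    The trial ends when x crosses T while moving from b toward F.
    Returns None if no crossing occurs before t_max seconds.
    """
    rng = rng or random.Random()
    x, t = b, 0.0
    rising = F > b  # ramps up toward F = 1, or down toward F = 0
    sqrt_dt = math.sqrt(dt)
    while t < t_max:
        x += (F - x) * D * dt + sigma * sqrt_dt * rng.gauss(0.0, 1.0)
        t += dt
        if (rising and x >= T) or (not rising and x <= T):
            return t
    return None


# Noise-free runs isolate the effect of the drift rate D.
d2_intact = simulate_switch_time(F=1, b=0.52, T=0.87, D=0.135, sigma=0.0)
d2_disrupted = simulate_switch_time(F=1, b=0.52, T=0.87, D=0.129, sigma=0.0)
d1_intact = simulate_switch_time(F=0, b=0.48, T=0.13, D=0.141, sigma=0.0)
d1_disrupted = simulate_switch_time(F=0, b=0.48, T=0.13, D=0.122, sigma=0.0)
```

Under this assumed form, the noise-free crossing time is ln((b − F)/(T − F))/D, which gives roughly 9.7 seconds for the intact D2-MSN parameters and roughly 10.1 seconds after lowering D. Passing a seeded `random.Random` and a nonzero `sigma` yields a distribution of crossing times rather than a single value.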

      Regarding the results in Figure 6: 

      My comments regarding the interpretation of PCs in Figure 2 apply here as well. In addition, I am not sure that examining PC2 adds much here, given that the authors didn't examine such nonlinear changes earlier in the paper. 

      We agree – we removed PC2 for these reasons. We have also noted that the primary reason for PC1 was to compare results of D2/D1 blockade (Line 362):

      “We noticed differences in MSN activity across the interval with D2 blockade and D1 blockade at the individual MSN level (Fig 6B-D) as well as at the population level (Fig 6E). We used PCA to quantify effects of D2 blockade or D1 blockade (Bruce et al., 2021; Emmons et al., 2017; Kim et al., 2017a). We constructed principal components (PC) from z-scored peri-event time histograms of firing rate from saline, D2 blockade, and D1 blockade sessions for all mice together. The first component (PC1), which explained 54% of neuronal variance, exhibited “time-dependent ramping”, or monotonic changes over the 6 second interval immediately after trial start (Fig 6F-G; variance for PC1 p = 0.001 vs 46 (45-47)% variance in random data; Narayanan, 2016).”

      As noted above, PC1 does not explain this level of variance in noisy data.

      We also reworked Figure 6 to make the effects of D2 and D1 blockade more apparent by moving the matched sorting to the main figure: 

      A larger concern though that seems potentially at odds with the authors' interpretation is that there seems to be very little change in the firing pattern after D1 or D2 blockade. I see that in Figure 6F the authors suggest that many cells slope down (and thus, presumably, they are recording more D1 cells), and that this change in slope is decreased, but this effect is not apparent in Figure 6C, and Figure 6B shows an example of a cell that seems to fire in the opposite direction (increase activity). I think it would help to show some (more) individual examples that demonstrate the summary effect shown by the authors, and perhaps the authors can comment on the robustness (or the variability) of this result. 

      These are important suggestions. We changed our analysis to better capture the variability and main effects in the data, exactly as the reviewer suggested. First, we now include 3 individual raster examples.

      As the reviewer suggested, we wanted to compare variability for *all* MSNs. We sorted the same MSNs across saline, D2 blockade, and D1 blockade sessions. We describe the sorting procedure in the methods (Line 618):

      “Single-unit recordings were made using a multi-electrode recording system (Open Ephys, Atlanta, GA). After the experiments, Plexon Offline Sorter (Plexon, Dallas, TX) was used to remove artifacts. Principal component analysis (PCA) and waveform shape were used for spike sorting. Single units were defined as those 1) having a consistent waveform shape, 2) being a separable cluster in PCA space, and 3) having a consistent refractory period of at least 2 milliseconds in interspike interval histograms. The same MSNs were sorted across saline, D2 blockade, and D1 blockade sessions by loading all sessions simultaneously in Offline Sorter and sorting using the preceding criteria. MSNs had to have consistent firing in all sessions to be included. Sorting integrity across sessions was quantified by comparing waveform similarity via correlation coefficients between sessions.”

      To confirm that we were able to track neurons across sessions, we quantified waveform similarity (Line 353):

      “We analyzed 99 MSNs in sessions with saline, D2 blockade, and D1 blockade. We matched MSNs across sessions based on waveform and interspike intervals; waveforms were highly similar across sessions (correlation coefficient between matched MSN waveforms: saline vs D2 blockade r = 1.00 (0.99 – 1.00), rank sum vs correlations in unmatched waveforms p = 3×10⁻⁴⁴; saline vs D1 blockade r = 1.00 (1.00 – 1.00), rank sum vs correlations in unmatched waveforms p = 4×10⁻⁵⁰). There were no consistent changes in MSN average firing rate with D2 blockade or D1 blockade (F = 1.1, p = 0.30 accounting for variance between MSNs; saline: 5.2 (3.3 – 8.6) Hz; D2 blockade 5.1 (2.7 – 8.0) Hz; F = 2.2, p = 0.14; D1 blockade 4.9 (2.4 – 7.8) Hz).”
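The waveform-similarity check described above amounts to a correlation coefficient between the mean waveform of a unit in one session and its putative match in another. A minimal sketch, assuming waveforms are stored as equal-length lists of voltage samples (the function name and toy waveforms are hypothetical, and this is not the authors' MATLAB routine):

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length waveforms."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("waveforms must have the same length (>= 2 samples)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Toy mean waveforms (arbitrary units); the values are made up.
saline_waveform = [0.0, -1.0, -4.0, 2.0, 1.0, 0.0]
matched_waveform = [0.9 * v for v in saline_waveform]    # same shape, smaller amplitude
unmatched_waveform = [0.0, 2.0, 1.0, -3.0, -1.0, 0.0]    # different shape

r_matched = pearson_r(saline_waveform, matched_waveform)
r_unmatched = pearson_r(saline_waveform, unmatched_waveform)
```

Because the correlation is insensitive to overall amplitude, a matched unit whose spike shrinks slightly between sessions still scores near 1, whereas a unit with a different shape does not.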

      As noted above, this enabled us to compare activity for the same MSNs across sessions in a new Figure 6 (previously, this analysis had been in Figure S9), and to use PCA to quantify this variability.

      By tracking neurons across saline, D2 blockade, and D1 blockade, readers can see all the variability in MSNs. We added these data to the results (Line 362):  

      “We noticed differences in MSN activity across the interval with D2 blockade and D1 blockade at the individual MSN level (Fig 6B-D) as well as at the population level (Fig 6E). We used PCA to quantify effects of D2 blockade or D1 blockade (Bruce et al., 2021; Emmons et al., 2017; Kim et al., 2017a). We constructed principal components (PC) from z-scored peri-event time histograms of firing rate from saline, D2 blockade, and D1 blockade sessions for all mice together. The first component (PC1), which explained 54% of neuronal variance, exhibited “time-dependent ramping”, or monotonic changes over the 6 second interval immediately after trial start (Fig 6F-G; variance for PC1 p = 0.001 vs 46 (45-47)% variance in random data; Narayanan, 2016). Interestingly, PC1 scores shifted with D2 blockade (Fig 6F; PC1 scores for D2 blockade: -0.6 (-3.8 – 4.7) vs saline: -2.3 (-4.2 – 3.2), F = 5.1, p = 0.03 accounting for variance between MSNs; no reliable effect of sex (F = 0.2, p = 0.63) or switching direction (F = 2.8, p = 0.10)). PC1 scores also shifted with D1 blockade (Fig 6F; PC1 scores for D1 blockade: -0.0 (-3.9 – 4.5), F = 5.8, p = 0.02 accounting for variance between MSNs; no reliable effect of sex (F = 0.0, p = 0.93) or switching direction (F = 0.9, p = 0.34)). There were no reliable differences in PC1 scores between D2 and D1 blockade. Furthermore, PC1 was distinct even when sessions were sorted independently and assumed to be fully statistically independent (Figure S10; D2 blockade vs saline: F = 5.8, p = 0.02; D1 blockade vs saline: F = 4.9, p = 0.03; all analyses accounting for variance between mice). Higher components explained less variance and were not reliably different between saline and D2 blockade or D1 blockade. Taken together, this data-driven analysis shows that D2 and D1 blockade produced similar shifts in MSN population dynamics represented by PC1. When combined with the major contributions of D1/D2 MSNs to PC1 (Fig 3C), these findings indicate that pharmacological D2 blockade and D1 blockade disrupt ramping-related activity in the striatum.”

      Finally, we included the data in which sessions were sorted independently and assumed to be fully statistically independent in a new Figure S10.

      And in the results (Line 376): 

      “Furthermore, PC1 was distinct even when sessions were sorted independently and assumed to be fully statistically independent (Figure S10; D2 blockade vs saline: F = 5.8, p = 0.02; D1 blockade vs saline: F = 4.9, p = 0.03; all analyses accounting for variance between mice). Higher components explained less variance and were not reliably different between saline and D2 blockade or D1 blockade.”

      These changes strengthen the manuscript and better show the main effects and variability of the data. 

      Regarding the results in Figure 7: 

      I am overall a bit confused about what the authors are trying to claim here. In Figure 7, they present data suggesting that D1 or D2 blockade disrupts their ability to decode time in the interval of interest (0-6 seconds). However, in the final paragraph of the results, the authors seem to say that by using another technique, they didn't see any significant change in decoding accuracy after D1 or D2 blockade. What do the authors make of this? 

      This was very unclear. The second classifier was predicting response time, but it was confusing, and we removed it. 

      Impact: 

      The task and data presented by the authors are very intriguing, and there are many groups interested in how striatal activity contributes to the neural perception of time. The authors perform a wide variety of experiments and analysis to examine how DMS activity influences time perception during an interval-timing task, allowing for insight into this process. However, the significance of the key finding - that D2/D1 activity increases/ decreases with time - remains somewhat ambiguous to me. This arises from a lack of clarity regarding the initial hypothesis and the implications of this finding for advancing our understanding of striatal functions. 

      As noted above, we clarified our hypothesis and implications, and strengthened several aspects of the data as suggested by this reviewer.  

      Reviewer #2 (Public Review): 

      Summary: 

      In the present study, the authors investigated the neural coding mechanisms for D1- and D2-expressing striatal direct and indirect pathway MSNs in interval timing by using multiple strategies. They concluded that D2-MSNs and D1-MSNs have opposing temporal dynamics yet disrupting either type produced similar effects on behavior, indicating the complementary roles of D1- and D2-MSNs in cognitive processing. However, the data was incomplete to fully support this major finding. One major reason is the heterogeneous responses within the D1- or D2-MSN populations. In addition, there are additional concerns about the statistical methods used. For example, the majority of the statistical tests are based on the number of neurons, but not the number of mice. It appears that the statistical difference was due to the large sample size they used (n=32 D2-MSNs and n=41 D1-MSNs), but different neurons recorded in the same mouse cannot be treated as independent samples; they should use independent mouse-based statistical analysis. 

      Strengths: 

      The authors used multiple approaches including awake mice behavior training, optogenetic-assisted cell-type specific recording, optogenetic or pharmacological manipulation, neural computation, and modeling to study neuronal coding for interval timing. 

      We appreciate the reviewer’s careful read recognizing the breadth of our approach.  

      Weaknesses: 

      (1) More detailed behavior results should be shown, including the rate of the success switches, and how long it takes to wait in the second nose poke to get a reward. For line 512 and the Figure 1 legend, the reviewer is not clear about the reward delivery. The methods appear to state that the mouse had to wait for 18s, then make nose pokes at the second port to get the reward. What happens if the mouse made the second nose poke before 18 seconds, but then exited? Would the mouse still get the reward at 18 seconds? Similarly, what happens if the mice made the third or more nosepokes within 18 seconds? It is important to clarify because, according to the method described, if the mice made a second nose poke before 18 seconds, this already counted as the mouse making the "switch." Lastly, what if the mice exited before 6s in the first nosepoke? 

      We completely agree. We have now completely revised Figure 1 to include many of these task details.

      We have clarified remaining details in the methods (Line 548):

      “Interval timing switch task. We used a mouse-optimized operant interval timing task described in detail previously (Balci et al., 2008; Bruce et al., 2021; Tosun et al., 2016; Weber et al., 2023). Briefly, mice were trained in sound-attenuating operant chambers, with two front nosepokes flanking either side of a food hopper on the front wall, and a third nosepoke located at the center of the back wall. The chamber was positioned below an 8-kHz, 72-dB speaker (Fig 1A; MedAssociates, St. Albans, VT). Mice were 85% food restricted and motivated with 20 mg sucrose pellets (BioServ, Flemington, NJ). Mice were initially trained to receive rewards during fixed ratio nosepoke response trials. Nosepoke entry and exit were captured by infrared beams. After shaping, mice were trained in the “switch” interval timing task. Mice self-initiated trials at the back nosepoke, after which the tone and nosepoke lights were turned on simultaneously. Cues were identical on all trial types and lasted the entire duration of the trial (6 or 18 seconds). On 50% of trials, mice were rewarded for a nosepoke after 6 seconds at the designated first ‘front’ nosepoke; these trials were not analyzed. On the remaining 50% of trials, mice were rewarded for nosepoking first at the ‘first’ nosepoke location and then switching to the ‘second’ nosepoke location; the reward was delivered for initial nosepokes at the second nosepoke location after 18 seconds when preceded by a nosepoke at the first nosepoke location. Multiple nosepokes at each nosepoke were allowed. Early responses at the first or second nosepoke were not reinforced. Initial responses at the second nosepoke rather than the first nosepoke, alternating between nosepokes, and going back to the first nosepoke after the second nosepoke were rare after initial training. Error trials included trials where animals responded only at the first or second nosepoke and were also not reinforced. We did not analyze error trials as they were often too few to analyze; these were analyzed at length in our prior work (Bruce et al., 2021).

      Switch response time was defined as the moment animals departed the first nosepoke before arriving at the second nosepoke. Critically, switch responses are a time-based decision guided by temporal control of action because mice switch nosepokes only if nosepokes at the first location did not receive a reward after 6 seconds. That is, mice estimate if more than 6 seconds have elapsed without receiving a reward to decide to switch responses. Mice learn this task quickly (3-4 weeks), and error trials in which an animal nosepokes in the wrong order or does not nosepoke are relatively rare and discarded. Consequently, we focused on these switch response times as the key metric for temporal control of action. Traversal time was defined as the duration between first nosepoke exit and second nosepoke entry and is distinct from switch response time when animals departed the first nosepoke. Nosepoke duration was defined as the time between first nosepoke entry and exit for the switch response times only. Trials were self-initiated, but there was an intertrial interval with a geometric mean of 30 seconds between trials.”
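The three behavioral measures defined above can be made concrete with a small data structure. This is a sketch only: the class and field names are hypothetical, and it assumes each switch trial has already been reduced to three timestamps measured in seconds from trial start.

```python
from dataclasses import dataclass


@dataclass
class SwitchTrial:
    """Event times for one switch trial, in seconds from trial start."""
    first_entry: float    # entry into the first nosepoke (the poke preceding the switch)
    first_exit: float     # departure from the first nosepoke before the switch
    second_entry: float   # arrival at the second nosepoke

    @property
    def switch_response_time(self) -> float:
        """Moment the animal departs the first nosepoke to switch."""
        return self.first_exit

    @property
    def traversal_time(self) -> float:
        """Duration between first-nosepoke exit and second-nosepoke entry."""
        return self.second_entry - self.first_exit

    @property
    def nosepoke_duration(self) -> float:
        """Time between first-nosepoke entry and exit for the switch response."""
        return self.first_exit - self.first_entry


# Toy trial with made-up timestamps.
example_trial = SwitchTrial(first_entry=2.0, first_exit=9.3, second_entry=10.1)
```

Keeping switch response time and traversal time as separate properties mirrors the distinction drawn in the methods: the former marks when the animal leaves the first nosepoke, the latter how long the move to the second nosepoke takes.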

      And in the results on Line 131: 

      “We investigated cognitive processing in the striatum using a well-described mouse-optimized interval timing task which requires mice to respond by switching between two nosepokes after a 6-second interval (Fig 1A; see Methods; (Balci et al., 2008; Bruce et al., 2021; Larson et al., 2022; Tosun et al., 2016; Weber et al., 2023)). In this task, mice initiate trials by responding at a back nosepoke, which triggers auditory and visual cues for the duration of the trial. On 50% of trials, mice were rewarded for nosepoking after 6 seconds at the designated ‘first’ front nosepoke; these trials were not analyzed. On the remaining 50% of trials, mice were rewarded for nosepoking at the ‘first’ nosepoke and then switching to the ‘second’ nosepoke; initial nosepokes at the second nosepoke after 18 seconds triggered reward when preceded by a first nosepoke. The first nosepokes occurred before switching responses and the second nosepokes occurred much later in the interval in anticipation of reward delivery at 18 seconds (Fig 1B-D). During the task, movement velocity peaked before 6 seconds as mice traveled to the front nosepoke (Fig 1E).

      We focused on the switch response time, defined as the moment mice exited the first nosepoke before entering the second nosepoke. Switch responses are a time-based decision guided by temporal control of action because mice switch nosepokes only if nosepoking at the first nosepoke does not lead to a reward after 6 seconds (Fig 1B-E). Switch responses are guided by internal estimates of time because no external cue indicates when to switch from the first to the second nosepoke (Balci et al., 2008; Bruce et al., 2021; Tosun et al., 2016; Weber et al., 2023). We defined the first 6 seconds after trial start as the ‘interval’, because during this epoch mice are estimating whether 6 seconds have elapsed and if they need to switch responses. In 30 mice, switch response times were 9.3 seconds (8.4 – 9.7; median (IQR); see Table 1 for a summary of mice, experiments, trials, and sessions). We studied dorsomedial striatal D2-MSNs and D1-MSNs using a combination of optogenetics and neuronal ensemble recordings in 9 transgenic mice (4 D2-Cre mice switch response time 9.7 (7.0 – 10.3) seconds; 5 D1-Cre mice switch response time 8.2 (7.7 – 8.7) seconds; rank sum p = 0.73; Table 1).”

      (2) There are a lot of time parameters in this behavior task, the description of those time parameters is mentioned in several parts, in the figure legend, supplementary figure legend, and methods, but was not defined clearly in the main text. It is inconvenient, sometimes, confusing for the readers. The authors should make a schematic diagram to illustrate the major parameters and describe them clearly in the main text. 

      We agree. We have clarified this in a new schematic, shading the interval in gray:   

      And in the results on line 131:

      “We focused on the switch response time, defined as the moment mice exited the first nosepoke before entering the second nosepoke. Switch responses are a time-based decision guided by temporal control of action because mice switch nosepokes only if nosepoking at the first nosepoke does not lead to a reward after 6 seconds (Fig 1B-E). Switch responses are guided by internal estimates of time because no external cue indicates when to switch from the first to the second nosepoke (Balci et al., 2008; Bruce et al., 2021; Tosun et al., 2016; Weber et al., 2023). We defined the first 6 seconds after trial start as the ‘interval’, because during this epoch mice are estimating whether 6 seconds have elapsed and if they need to switch responses. In 30 mice, switch response times were 9.3 seconds (8.4 – 9.7; median (IQR); see Table 1 for a summary of mice, experiments, trials, and sessions). We studied dorsomedial striatal D2-MSNs and D1-MSNs using a combination of optogenetics and neuronal ensemble recordings in 9 transgenic mice (4 D2-Cre mice switch response time 9.7 (7.0 – 10.3) seconds; 5 D1-Cre mice switch response time 8.2 (7.7 – 8.7) seconds; rank sum p = 0.73; Table 1).”

      (3) In Line 508, the reviewer suggests the authors pay attention to those trials without "switch". It would be valuable to compare the MSN activity between those trials with or without a "switch". 

      This is a great suggestion. We analyzed such error trials and MSN activity in Figure 6 of Bruce et al., 2021. However, this manuscript was not designed to analyze errors, as they are rare beyond initial training (Bruce et al., 2021 focused on early training), and too inconsistent to permit robust analysis. This was added to the methods on Line 567:

      “Early responses at the first or second nosepoke were not reinforced. Initial responses at the second nosepoke rather than the first nosepoke, alternating between nosepokes, and going back to the first nosepoke after the second nosepoke were rare after initial training. Error trials included trials where animals responded only at the first or second nosepoke and were also not reinforced. We did not analyze error trials as they were often too few to analyze; these were analyzed at length in our prior work (Bruce et al., 2021).”

      (4) The definition of interval is not very clear. It appears that the authors used a 6-second interval in analyzing the data in Figure 2 and Figure 3. But from my understanding, the interval should be the time from time "0" to the "switch", when the mice start to exit from the first nose poke. 

      We have now defined it explicitly in the schematic: 

      Incidentally, this reviewer asked us to analyze a longer epoch – this analysis beautifully justifies our focus on the first 6 seconds (now in Figure S2).

      We focus on the first six seconds as there are few nosepokes and switch responses during this epoch; however, we consider the reviewer’s definition and analyze the epoch the reviewer suggests from 0 to the switch in analyses below. 

      (5) For Figure 2 C-F, the authors only recorded 32 D2-MSNs in 4 mice, and 41 D1-MSNs in 5 mice. The sample size is too small compared to the sample size usually used in the field. In addition to the small sample size, the single-cell activity exhibited heterogeneity, which created potential issues. 

      We are glad the reviewer raised these points. First, our tagging dataset is relatively standard for optogenetic tagging. Second, we now include Cohen’s d for both PC and slope results for all optogenetic tagging analysis, which demonstrate that we have adequate statistical power and medium-to-large effect sizes (Line 186): 

      “In line with population averages from Fig 2G&H, D2-MSNs and D1-MSNs had opposite patterns of activity with negative PC1 scores for D2-MSNs and positive PC1 scores for D1-MSNs (Fig 3C; PC1 for D2-MSNs: -3.4 (-4.6 – 2.5); PC1 for D1-MSNs: 2.8 (-2.8 – 4.9); F = 8.8, p = 0.004 accounting for variance between mice (Fig S3A); Cohen’s d = 0.7; power = 0.80; no reliable effect of sex (F = 0.44, p = 0.51) or switching direction (F = 1.73, p = 0.19)).”

      And Line 197:

      “GLM analysis also demonstrated that D2-MSNs had significantly different slopes (0.01 spikes/second (-0.10 – 0.10)), which were distinct from D1-MSNs (-0.20 (-0.47 – 0.06); Fig 3D; F = 8.9, p = 0.004 accounting for variance between mice (Fig S3B); Cohen’s d = 0.8; power = 0.98; no reliable effect of sex (F = 0.02, p = 0.88) or switching direction (F = 1.72, p = 0.19)).”

      We added boxplots to Figure 3, which better highlight differences in these distributions.

      However, the reviewer’s point is well-taken, and we have added a caveat to the discussion exactly as the reviewer suggested (Line 496):

      “Second, although we had adequate statistical power and medium-to-large effect sizes, optogenetic tagging is low-yield, and it is possible that recording more of these neurons would afford greater opportunity to identify more robust results and alternative coding schemes, such as neuronal synchrony.”

      For both D1 and D2 MSNs, the authors tried to make conclusions on the "trend" of increasing in D2-MSNs and decreasing in D1-MSNs populations, respectively, during the interval. However, such a conclusion is not sufficiently supported by the data presented. It looks like the single-cell activity patterns can be separated into groups: one is a decreasing activity group, one is an increasing activity group and a small group for on and off response. Because of the small sample size, the author should pay attention to the variance across different mice (which needs to be clearly presented in the manuscript), instead of pooling data together and analyzing the mean activity. 

      We were not clear; we now do exactly as the reviewer suggested. We are not pooling any data. Instead, as we state on line 620, we are using linear mixed-effects models to account for mouse-specific and neuron-specific variance. This approach was developed with our statistics core for exactly the reasons the reviewer suggested (see letter). We state this explicitly in the methods (Line 704):

      “Statistics. All data and statistical approaches were reviewed by the Biostatistics, Epidemiology, and Research Design Core (BERD) at the Institute for Clinical and Translational Sciences (ICTS) at the University of Iowa. All code and data are made available at http://narayanan.lab.uiowa.edu/article/datasets. We used the median to measure central tendency and the interquartile range to measure spread. We used Wilcoxon nonparametric tests to compare behavior between experimental conditions and Cohen’s d to calculate effect size. Analyses of putative single-unit activity and basic physiological properties were carried out using custom routines for MATLAB.

      For all neuronal analyses, variability between animals was accounted for using generalized linear-mixed effects models and incorporating a random effect for each mouse into the model, which allows us to account for inherent between-mouse variability. We used fitglme in MATLAB and verified main effects using lmer in R. We accounted for variability between MSNs in pharmacological datasets in which we could match MSNs between saline, D2 blockade, and D1 blockade. P values < 0.05 were interpreted as significant.”
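For readers who want to reproduce the descriptive statistics named above, the following sketch computes a median with interquartile range and a Cohen's d. It is an illustration under stated assumptions rather than the authors' MATLAB code: quartile conventions differ between packages (the inclusive method of Python's `statistics.quantiles` is used here), Cohen's d is computed with the common pooled-standard-deviation formula, and the function names are hypothetical. The mixed-effects models themselves (`fitglme` in MATLAB, `lmer` in R) are not sketched.

```python
import math
import statistics


def median_iqr(values):
    """Return (median, (q1, q3)) using inclusive quartiles."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return med, (q1, q3)


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


# Toy data (made-up numbers, not values from the study).
toy_values = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_summary = median_iqr(toy_values)
toy_effect = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

Note that a d computed this way treats every observation as independent; it complements, but does not replace, the mouse-level random effects described in the methods.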

      We have now stated in the results that we are explicitly accounting for variance between mice (Line 186): 

      “In line with population averages from Fig 2G&H, D2-MSNs and D1-MSNs had opposite patterns of activity with negative PC1 scores for D2-MSNs and positive PC1 scores for D1-MSNs (Fig 3C; PC1 for D2-MSNs: -3.4 (-4.6 – 2.5); PC1 for D1-MSNs: 2.8 (-2.8 – 4.9); F = 8.8, p = 0.004 accounting for variance between mice (Fig S3A); Cohen’s d = 0.7; power = 0.80; no reliable effect of sex (F = 0.44, p = 0.51) or switching direction (F = 1.73, p = 0.19)).”

      And on Line 197:

      “GLM analysis also demonstrated that D2-MSNs had significantly different slopes (0.01 spikes/second (-0.10 – 0.10)), which were distinct from D1-MSNs (-0.20 (-0.47 – 0.06); Fig 3D; F = 8.9, p = 0.004 accounting for variance between mice (Fig S3B); Cohen’s d = 0.8; power = 0.98; no reliable effect of sex (F = 0.02, p = 0.88) or switching direction (F = 1.72, p = 0.19)).”

      All statistics in the manuscript now explicitly account for variance between mice. 

      This is the approach that was recommended by the Biostatistics, Epidemiology, and Research Design Core (BERD) at the Institute for Clinical and Translational Sciences (ICTS) at the University of Iowa, which reviews all of our work.

      We note that these Cohen’s d values are usually interpreted as medium or large. 

      We performed statistical power calculations and include these to aid readers’ interpretation. These are all ≥ 0.8. 

      Finally, the reviewer uses the word ‘trend’. We define p values <0.05 as significant in the methods, and do not interpret trends (on line 717): 

      “P values < 0.05 were interpreted as significant.”

      And, we have now plotted values for each mouse in a new Figure S3.

      As noted in the figure legend, mouse-specific effects were analyzed using linear models that account for between-mouse variability, as discussed with our statisticians. However, the reviewer’s point is well taken, and we have added this idea to the discussion as suggested (Line 496):

      “Second, although we had adequate statistical power and medium-to-large effect sizes, optogenetic tagging is low-yield, and it is possible that recording more of these neurons would afford greater opportunity to identify more robust results and alternative coding schemes, such as neuronal synchrony.”

      (6) For Figure 2, from the activity in E and F, it seems that the activity already rose before the trial started, the authors should add some longer baseline data before time zero for clarification and comparison and show the timing of the actual start of the activity with the corresponding behavior. What behavior states are the mice in when initiating the activity? 

      This is a key point. First, we are not certain what state the animal is in until they initiate trials at the back nosepoke (“Start”). Therefore, we cannot analyze this epoch.  

      However, we can show neuronal activity during a longer epoch exactly as the reviewer suggested. Although there are modulations, the biggest difference between D2 and D1 MSNs is during the 0-6 second interval. This analysis supports our focus on the 0-6 second interval. We have included this as a new Figure S2.

      (7) The authors were focused on the "switch " behavior in the task, but they used an arbitrary 6s time window to analyze the activity, and tried to correlate the decreasing or increasing activities of MSNs to the neural coding for time. A better way to analyze is to sort the activity according to the "switch" time, from short to long intervals. This way, the authors could see and analyze whether the activity of D1 or D2 MSNs really codes for the different length of interval, instead of finding a correlation between average activity trends and the arbitrary 6s time window. 

      This is a great suggestion. We did exactly this and adjusted our linear models on a trial-by-trial basis to account for time between the start of the interval and the switch. This is now added to the methods (Line 656): 

      “We performed additional sensitivity analysis excluding outliers and measuring firing rate from the start of the interval to the time of the switch response on a trial-by-trial level for each neuron.”

      And to the results (Line 201):

      “We found that D2-MSNs and D1-MSNs had a significantly different slope even when excluding outliers (4 outliers excluded outside of 95% confidence intervals; F=7.51, p=0.008 accounting for variance between mice) and when the interval was defined as the time between trial start and the switch response on a trial-by-trial basis for each neuron (F=4.3, p=0.04 accounting for variance between mice).”
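
      The trial-by-trial analysis described above can be illustrated with a minimal sketch: for a single trial, spikes between trial start and the switch response are binned, and an ordinary least-squares line is fit to firing rate versus time. This is a simplified stand-in, not the GLM used in the manuscript; the function name and the 1-second bin width are assumptions made for illustration.

```python
def trial_slope(spike_times, trial_start, switch_time, bin_width=1.0):
    """Least-squares slope of firing rate (spikes/s) versus time (s) for one trial.

    Only spikes between trial start and the switch response are counted.
    bin_width is a hypothetical choice, not a value taken from the manuscript.
    """
    n_bins = int((switch_time - trial_start) // bin_width)
    if n_bins < 2:
        return None  # too few bins to fit a line
    counts = [0] * n_bins
    for t in spike_times:
        idx = int((t - trial_start) // bin_width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    rates = [c / bin_width for c in counts]
    centers = [(i + 0.5) * bin_width for i in range(n_bins)]
    mean_x = sum(centers) / n_bins
    mean_y = sum(rates) / n_bins
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(centers, rates))
    sxx = sum((x - mean_x) ** 2 for x in centers)
    return sxy / sxx
```

      A positive slope corresponds to ramping up over the interval and a negative slope to ramping down; in a full analysis the per-trial slopes would then enter a model with a random effect for each mouse.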

      We now state our justification for focusing on the first 6 seconds of the interval (Line 134):

      “Switch responses are guided by internal estimates of time and temporal control of action because no external cue indicates when to switch from the first to the second nosepoke (Balci et al., 2008; Bruce et al., 2021; Tosun et al., 2016; Weber et al., 2023). We defined the first 6 seconds after trial start as the ‘interval’, because during this epoch mice are estimating whether 6 seconds have elapsed and if they need to switch responses.”

      As noted previously, this epoch is now justified by Figure S2E.

      And we note that this focus minimizes motor confounds (Line 511):

      “Four lines of evidence argue that our findings cannot be directly explained by motor confounds: 1) D2-MSNs and D1-MSNs diverge between 0-6 seconds after trial start well before the first nosepoke (Fig S2), 2) our GLM accounted for nosepokes and nosepoke-related βs were similar between D2-MSNs and D1-MSNs, 3) optogenetic disruption of dorsomedial D2-MSNs and D1-MSNs did not change task-specific movements despite reliable changes in switch response time, and 4) ramping dynamics were quite distinct from movement dynamics. Furthermore, disrupting D2-MSNs and D1-MSNs did not change the number of rewards animals received, implying that these disruptions did not grossly affect motivation. Still, future work combining motion tracking with neuronal ensemble recording and optogenetics and including bisection tasks may further unravel timing vs. movement in MSN dynamics (Robbe, 2023).”

      We are glad the reviewer suggested this analysis as it strengthens our manuscript.  

      Reviewer #3 (Public Review): 

      Summary: 

      The cognitive striatum, also known as the dorsomedial striatum, receives input from brain regions involved in high-level cognition and plays a crucial role in processing cognitive information. However, despite its importance, the extent to which different projection pathways of the striatum contribute to this information processing remains unclear. In this paper, Bruce et al. conducted a study using a range of causal and correlational techniques to investigate how these pathways collectively contribute to interval timing in mice. Their results were consistent with previous research, showing that the direct and indirect striatal pathways perform opposing roles in processing elapsed time. Based on their findings, the authors proposed a revised computational model in which two separate accumulators track evidence for elapsed time in opposing directions. These results have significant implications for understanding the neural mechanisms underlying cognitive impairment in neurological and psychiatric disorders, as disruptions in the balance between direct and indirect pathway activity are commonly observed in such conditions. 

      Strengths: 

      The authors employed a well-established approach to study interval timing and employed optogenetic tagging to observe the behavior of specific cell types in the striatum. Additionally, the authors utilized two complementary techniques to assess the impact of manipulating the activity of these pathways on behavior. Finally, the authors utilized their experimental findings to enhance the theoretical comprehension of interval timing using a computational model. 

      We are grateful for the reviewer’s consideration of our work and for recognizing the strengths of our approach.  

      Weaknesses: 

      The behavioral task used in this study is best suited for investigating elapsed time perception, rather than interval timing. Timing bisection tasks are often employed to study interval timing in humans and animals.

      This is a key point, and the reviewer is correct. We use our task because of its translational validity; as far as we know, temporal bisection tasks have been used less often in human disease and in rodent models. We have included a new paragraph describing this in the discussion (Line 472):

      “Because interval timing is reliably disrupted in human diseases of the striatum such as Huntington’s disease, Parkinson’s disease, and schizophrenia (Hinton et al., 2007; Singh et al., 2021; Ward et al., 2011), these results have relevance to human disease. Our task version has been used extensively to study interval timing in mice and humans (Balci et al., 2008; Bruce et al., 2021; Stutt et al., 2024; Tosun et al., 2016; Weber et al., 2023). However, temporal bisection tasks, in which animals hold during a temporal cue and respond at different locations depending on cue length, have advantages in studying how animals time an interval because animals are not moving while estimating cue duration (Paton and Buonomano, 2018; Robbe, 2023; Soares et al., 2016). Our interval timing task version – in which mice switch between two response nosepokes to indicate their interval estimate has elapsed – has been used extensively in rodent models of neurodegenerative disease (Larson et al., 2022; Weber et al., 2024, 2023; Zhang et al., 2021), as well as in humans (Stutt et al., 2024). Furthermore, because many therapeutics targeting dopamine receptors are used clinically, these findings help describe how dopaminergic drugs might affect cognitive function and dysfunction. Future studies of D2-MSNs and D1-MSNs in temporal bisection and other timing tasks may further clarify the relative roles of D2- and D1-MSNs in interval timing and time estimation.”

      Furthermore, we have modified the use of the definition of interval timing in the abstract, introduction, and results to reflect the reviewer’s comment. For instance, in the abstract (Line 43):

      “We studied dorsomedial striatal cognitive processing during interval timing, an elementary cognitive task that requires mice to estimate intervals of several seconds and involves working memory for temporal rules as well as attention to the passage of time.”

      However, we think it is important to use the term ‘interval timing’ as it links to past work by our group and others.   

      The main results from unit recording (opposing slopes of D1/D2 cell firing rate, as shown in Figure 3D) appear to be very sensitive to a couple of outlier cells, and the predictive power of ensemble recording seems to be only slightly above chance levels. 

      This is a key point raised by other reviewers as well. We have now included measures of statistical power (as we interpret the reviewer’s comment of predictive power), effect size, and perform additional sensitivity analyses (Line 187): 

      “PC1 scores for D1-MSNs (Fig 3C; PC1 for D2-MSNs: -3.4 (-4.6 – 2.5); PC1 for D1-MSNs: 2.8 (-4.9 – -2.8); F=8.8, p = 0.004 accounting for variance between mice (Fig S3A); Cohen’s d = 0.7; power = 0.80; no reliable effect of sex (F=1.9, p=0.17) or switching direction (F=0.1, p=0.75)).”

      And on Line 197:

      “GLM analysis also demonstrated that D2-MSNs had significantly different slopes (0.01 spikes/second (-0.10 – 0.10)), which were distinct from D1-MSNs (-0.20 (-0.45 – 0.06); Fig 3D; F=8.9, p = 0.004 accounting for variance between mice (Fig S3B); Cohen’s d = 0.8; power = 0.98). We found that D2-MSNs and D1-MSNs had a significantly different slope even when excluding outliers (4 outliers excluded outside of 95% confidence intervals; F=7.51, p=0.008 accounting for variance between mice) and when the interval was defined as the time between trial start and the switch response on a trial-by-trial basis for each neuron (F=4.3, p=0.04 accounting for variance between mice).”

      These are medium-to-large Cohen’s d results, and we have adequate statistical power. These results are not easily explained by chance. 
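
      For readers who want to check the effect sizes quoted above, a minimal sketch of Cohen’s d follows. The manuscript does not state which variant was computed; the pooled-standard-deviation form shown here is an assumption, and the function name is hypothetical.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled sample SD.

    Assumes the pooled-SD form; the manuscript does not specify the variant.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5
```

      By Cohen’s conventional benchmarks, |d| near 0.5 is a medium effect and |d| near 0.8 is a large effect, which is the sense in which values of 0.7 and 0.8 are described as medium-to-large.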

      We also added boxplots, which highlight the differences in distribution.

      Finally, we note that our conclusions are drawn from many convergent analyses (on Line 216): 

      “Analyses of average activity, PC1, and trial-by-trial firing-rate slopes over the interval provide convergent evidence that D2-MSNs and D1-MSNs had distinct and opposing dynamics during interval timing.”

      In the optogenetic experiment, the laser was kept on for too long (18 seconds) at high power (12 mW). This has been shown to cause adverse effects on population activity (for example, through heating the tissue) that are not necessarily related to their function during the task epochs. 

      This is an important point. We are well aware of heating effects with optogenetics and other potential confounds. For the exact reasons noted by the reviewer, we had opsin-negative controls – where the laser was on for the exact same amount of time (18 seconds) and at the same power (12 mW) – in Figure S5. We have now better highlighted these controls in the methods (Line 598):

      “In animals injected with optogenetic viruses, optical inhibition was delivered via bilateral patch cables for the entire trial duration of 18 seconds via 589-nm laser light at 12 mW power on 50% of randomly assigned trials. We performed control experiments in mice without opsins using identical laser parameters in D2-cre or D1-cre mice (Fig S6).”

      And in results (Line 298):

      “Importantly, we found no reliable effects for D2-MSNs with opsin-negative controls (Fig S6).”

      And (Line 306): 

      “As with D2-MSNs, we found no reliable effects with opsin-negative controls in D1-MSNs (Fig S6).”

      We have highlighted these data in Figure S6: 

      Furthermore, the effect of optogenetic inhibition is similar to pharmacological effects in this manuscript and in our prior work (De Corte et al., 2019; Stutt et al., 2024) (Line 459): 

      “Past pharmacological work from our group and others has shown that disrupting D2- or D1-MSNs slows timing (De Corte et al., 2019b; Drew et al., 2007, 2003; Stutt et al., 2024), in line with pharmacological and optogenetic results in this manuscript.”

      And in the discussion section on Line 488: 

      “Our approach has several limitations. First, systemic drug injections block D2- and D1-receptors in many different brain regions, including the frontal cortex, which is involved in interval timing (Kim et al., 2017a). D2 blockade or D1 blockade may have complex effects, including corticostriatal or network effects that contribute to changes in D2-MSN or D1-MSN ensemble activity. We note that optogenetic inhibition of D2-MSNs and D1-MSNs produces similar effects to pharmacology in Figure 5.”

      Given the systemic delivery of pharmacological interventions, it is difficult to conclude that the effects are specific to the dorsomedial striatum. Future studies should use the local infusion of drugs into the dorsomedial striatum. 

      This is a great point - we did this experiment in De Corte et al, 2019 with local drug infusions. This earlier study was the departure point for this experiment. We now point this out in the introduction (Line 92): 

      “Past work has shown that disrupting either D2-dopamine receptors (D2) or D1-dopamine receptors (D1) powerfully impairs interval timing by increasing estimates of elapsed time (Drew et al., 2007; Meck, 2006). Similar behavioral effects were found with systemic (Stutt et al., 2024) or local dorsomedial striatal D2 or D1 disruption (De Corte et al., 2019a). These data lead to the hypothesis that D2 MSNs and D1 MSNs have similar patterns of ramping activity across a temporal interval.”

      However, the reviewer makes a great point - and we will develop this in our future work (Line 485): 

      “Future studies might extend our work combining local pharmacology with neuronal ensemble recording.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      Just a few minor notes: 

      (1) Figures 2C and D should have error bars. 

      We agree.  We added error bars to these figures and other rasters as recommended.  

      (2) Figures 2G and H seem to be smoothed - how was this done? 

      We added these details.

      (3) It is unclear what the 'neural network machine learning classifier' mentioned in lines 193-199 adds if the data relevant to this analysis isn't presented. I would potentially include this. 

      We agree. This analysis was confusing and not relevant to our main points; consequently, we removed it.  

      Reviewer #2 (Recommendations For The Authors): 

      Major: 

      (1)  For Figure 2, the description of the main results in (C-F) in the main text is too brief and is not clear. 

      We have added to and clarified this text (Line 147):

      “Striatal neuronal populations are largely composed of MSNs expressing D2-dopamine or D1-dopamine receptors. We optogenetically tagged D2-MSNs and D1-MSNs by implanting optrodes in the dorsomedial striatum and conditionally expressing channelrhodopsin (ChR2; Fig S1) in 4 D2-Cre (2 female) and 5 D1-Cre transgenic mice (2 female). This approach expressed ChR2 in D2-MSNs or D1-MSNs, respectively (Fig 2A-B; Kim et al., 2017a). We identified D2-MSNs or D1-MSNs by their response to brief pulses of 473 nm light; neurons that fired within 5 milliseconds were considered optically tagged putative D2-MSNs (Fig S1B-C). We tagged 32 putative D2-MSNs and 41 putative D1-MSNs in a single recording session during interval timing. There were no consistent differences in overall firing rate between D2-MSNs and D1-MSNs (D2-MSNs: 3.4 (1.4 – 7.2) Hz; D1-MSNs 5.2 (3.1 – 8.6) Hz; F = 2.7, p = 0.11 accounting for variance between mice). Peri-event rasters and histograms from a tagged putative D2-MSN (Fig 2C) and from a tagged putative D1-MSN (Fig 2D) demonstrate prominent modulations for the first 6 seconds of the interval after trial start. Z-scores of average peri-event time histograms (PETHs) from 0 to 6 seconds after trial start for each putative D2-MSN are shown in Fig 2E and for each putative D1-MSN in Fig 2F. These PETHs revealed that for the 6-second interval immediately after trial start, many putative D2-MSN neurons appeared to ramp up while many putative D1-MSNs appeared to ramp down. For 32 putative D2-MSNs average PETH activity increased over the 6-second interval immediately after trial start, whereas for 41 putative D1-MSNs, average PETH activity decreased. These differences resulted in distinct activity early in the interval (0-1 seconds; F = 6.0, p = 0.02 accounting for variance between mice), but not late in the interval (5-6 seconds; F = 1.9, p = 0.17 accounting for variance between mice) between D2-MSNs and D1-MSNs. 
      Examination of a longer interval of 10 seconds before to 18 seconds after trial start revealed the greatest separation in D2-MSN and D1-MSN dynamics during the 6-second interval after trial start (Fig S2). Strikingly, these data suggest that D2-MSNs and D1-MSNs might display opposite dynamics during interval timing.”
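
      The z-scored PETHs referred to in this passage can be sketched as follows. This is an illustrative reconstruction, not the authors’ MATLAB routine: the function name, the 0.2-second bin width, and z-scoring across bins with the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def zscored_peth(spike_times, trial_starts, window=6.0, bin_width=0.2):
    """Trial-averaged firing rate in bins after trial start, z-scored across bins.

    window and bin_width are in seconds; bin_width is a hypothetical choice.
    """
    n_bins = round(window / bin_width)
    counts = [0] * n_bins
    for start in trial_starts:
        for t in spike_times:
            idx = int((t - start) // bin_width)
            if 0 <= idx < n_bins:
                counts[idx] += 1
    # Convert summed counts to mean firing rate (spikes/s) per bin.
    rates = [c / (len(trial_starts) * bin_width) for c in counts]
    mu, sigma = mean(rates), pstdev(rates)
    if sigma == 0:
        return [0.0] * n_bins  # flat PETH: no modulation to normalise
    return [(r - mu) / sigma for r in rates]
```

      Under this normalisation, a neuron that ramps up over the interval has z-scores that rise from negative to positive across bins, and a neuron that ramps down shows the reverse.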

      (2)  For Figure 3 

      (A)  Is the PC1 calculated from all MSNs of all mice (4 D2, 5 D1 mice)? 

      We clarified this (Line 182):

      “We analyzed PCA calculated from all D2-MSNs and D1-MSNs PETHs over the 6-second interval immediately after trial start.”

      And for pharmacology (Line 362): 

      “We noticed differences in MSN activity across the interval with D2 blockade and D1 blockade at the individual MSN level (Fig 6B-D) as well as at the population level (Fig 6E). We used PCA to quantify effects of D2 blockade or D1 blockade (Bruce et al., 2021; Emmons et al., 2017; Kim et al., 2017a). We constructed principal components (PC) from z-scored peri-event time histograms of firing rate from saline, D2 blockade, and D1 blockade sessions for all mice together.”

      (B)  The authors should perform PCA on single mouse data, and add the plot and error bar. 

      This is a great idea. We have now included this as a new Figure S3:   

      (C)  As mentioned before, both D2-or D1- MSNs can be divided into three groups, it is not appropriate to put them together as each MSN is not an independent variable, the authors should do the statistics based on the individual mouse, and do the parametric or non-parametric comparison, and plot N (number of mice) based error bars. 

      We have done exactly this using a linear mixed effects model, as recommended by our statistics core. They have explicitly suggested that this is the best approach to these data (see letter). We have also included measures of statistical power and effect size (Line 704):  

      “All data and statistical approaches were reviewed by the Biostatistics, Epidemiology, and Research Design Core (BERD) at the Institute for Clinical and Translational Sciences (ICTS) at the University of Iowa. All code and data are made available at http://narayanan.lab.uiowa.edu/article/datasets. We used the median to measure central tendency and the interquartile range to measure spread. We used Wilcoxon nonparametric tests to compare behavior between experimental conditions and Cohen’s d to calculate effect size. Analyses of putative single-unit activity and basic physiological properties were carried out using custom routines for MATLAB.

      For all neuronal analyses, variability between animals was accounted for using generalized linear-mixed effects models and incorporating a random effect for each mouse into the model, which allows to account for inherent between-mouse variability. We used fitglme in MATLAB and verified main effects using lmer in R. We accounted for variability between MSNs in pharmacological datasets in which we could match MSNs between saline, D2 blockade, and D1 blockade. P values < 0.05 were interpreted as significant.”

      We have now included measures of ‘power’ (which we interpret to be statistical), effect size, and perform additional sensitivity analyses (Line 187): 

      “PC1 scores for D1-MSNs (Fig 3C; PC1 for D2-MSNs: -3.4 (-4.6 – 2.5); PC1 for D1-MSNs: 2.8 (-4.9 – -2.8); F=8.8, p = 0.004 accounting for variance between mice (Fig S3A); Cohen’s d = 0.7; power = 0.80; no reliable effect of sex (F=1.9, p=0.17) or switching direction (F=0.1, p=0.75)).”

      And Line 197:

      “GLM analysis also demonstrated that D2-MSNs had significantly different slopes (0.01 spikes/second (-0.10 – 0.10)), which were distinct from D1-MSNs (-0.20 (-0.45 – 0.06); Fig 3D; F=8.9, p = 0.004 accounting for variance between mice (Fig S3B); Cohen’s d = 0.8; power = 0.98). We found that D2-MSNs and D1-MSNs had a significantly different slope even when excluding outliers (4 outliers excluded outside of 95% confidence intervals; F=7.51, p=0.008 accounting for variance between mice) and when the interval was defined as the time between trial start and the switch response on a trial-by-trial basis for each neuron (F=4.3, p=0.04 accounting for variance between mice).”

      These are medium-to-large Cohen’s d results, and we have adequate statistical power. These results are not easily explained by chance. 

      We also added boxplots, which highlight the differences in distributions.

      (3) For results in Figure 5 and Figure S7, according to Figure 1 legend, lines 4 to 5, the response times were defined as the moment mice exit the first nose poke (on the left) to respond at the second nose poke; and according to method session (line 522), "switch" traversal time was defined as the duration between first nose poke exit and second nose poke entry. It seems that response time is the switch traversal time, they should be the same, but in Figures B and D, the response time showed a clear difference between the laser off and on groups, while in Figures S7 C, and G, there were no differences between laser off and on group for switch traversal time. Please reconcile these inconsistencies. 

      We were not clear. We now clarify – switch responses are the moment when mice depart the first nosepoke, whereas traversal time is the time between departing the first nosepoke and arriving at the second nosepoke. We have reworked our figures to make this clear.

      And in the methods (Line 570):

      “Switch response time was defined as the moment animals departed the first nosepoke before arriving at the second nosepoke. Critically, switch responses are a time-based decision guided by temporal control of action because mice switch nosepokes only if nosepokes at the first location did not receive a reward after 6 seconds. That is, mice estimate if more than 6 seconds have elapsed without receiving a reward to decide to switch responses. Mice learn this task quickly (3-4 weeks), and error trials in which an animal nosepokes in the wrong order or does not nosepoke are relatively rare and discarded. Consequently, we focused on these switch response times as the key metric for temporal control of action. Traversal time was defined as the duration between first nosepoke exit and second nosepoke entry and is distinct from switch response time when animals departed the first nosepoke. Nosepoke duration was defined as the time between first nosepoke entry and exit for the switch response times only. Trials were self-initiated, but there was an intertrial interval with a geometric mean of 30 seconds between trials.”
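
      The distinction between these measures can be made concrete with a short sketch. Event names and the convention that switch response time is measured from trial start are assumptions for illustration; the manuscript defines the measures only in prose.

```python
def switch_metrics(trial_start, first_poke_entry, first_poke_exit, second_poke_entry):
    """Return the three behavioural measures for one switch trial, in seconds.

    All arguments are event timestamps on a common clock (hypothetical names).
    """
    return {
        # Moment the animal departs the first nosepoke, relative to trial start.
        "switch_response_time": first_poke_exit - trial_start,
        # Time spent moving from the first nosepoke to the second nosepoke.
        "traversal_time": second_poke_entry - first_poke_exit,
        # Time spent in the first nosepoke on the visit that ends in the switch.
        "nosepoke_duration": first_poke_exit - first_poke_entry,
    }
```

      For example, a trial starting at 100.0 s with first-nosepoke entry at 107.5 s, exit at 109.0 s, and second-nosepoke entry at 112.0 s gives a switch response time of 9.0 s, a traversal time of 3.0 s, and a nosepoke duration of 1.5 s. A manipulation can therefore shift switch response time while leaving traversal time unchanged.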

      And in Figure S8, we have added graphics and clarified the legend.

      (4) The first nose poke and second nose poke are very close, why did it take so long to move from the first nose poke to the second nose poke, even though the mouse already made the decision to switch? Please see Figure S1A, it took less than 6s from the back nose poke to the first nose poke, but it took more than 6s (up to 12s) from the first nose poke to the second nose poke, what were the mice's behavior during this period? 

      This is a key detail. There is no temporal urgency as only the initial nosepoke after 18 seconds leads to reward. In other words, making a second nosepoke prior to 18 seconds is not rewarded and, in well-trained animals, is wasted effort. We have added these details to the methods (Line 124):

      “On the remaining 50% of trials, mice were rewarded for nosepoking at the ‘first’ nosepoke and then switching to the ‘second’ nosepoke; initial nosepokes at the second nosepoke after 18 seconds triggered reward when preceded by a first nosepoke. The first nosepokes occurred before switching responses and the second nosepokes occurred much later in the interval in anticipation of reward delivery at 18 seconds (Fig 1B-D). During the task, movement velocity peaked before 6 seconds as mice traveled to the front nosepoke (Fig 1E).”

      And in Figure 1, as described in detail above. 

      (5) How many trials did mice perform in one day? How many recordings/day for how many days were performed? 

      These are key details that we have now added to Table 1.

      We have added the number of recording sessions to the methods (Line 603): 

      “For optogenetic tagging, putative D1- and D2-MSNs were optically identified via 473-nm photostimulation. Units with mean post-stimulation spike latencies of ≤5 milliseconds and a stimulated-to-unstimulated waveform correlation ratio of >0.9 were classified as putative D2-MSNs or D1-MSNs (Ryan et al., 2018; Shin et al., 2018). Only one recording session was performed for each animal per day, and one recording session was included from each animal.”
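
      The two tagging criteria stated here can be expressed as a small check. This is a sketch under stated assumptions: the “waveform correlation ratio” is read as a Pearson correlation between the mean laser-evoked and mean spontaneous waveforms, and all function and parameter names are hypothetical.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def is_optically_tagged(latencies_ms, laser_waveform, spontaneous_waveform,
                        max_latency_ms=5.0, min_waveform_r=0.9):
    """Apply both criteria: mean spike latency <= 5 ms and waveform r > 0.9.

    Interpreting the correlation criterion as Pearson r is an assumption.
    """
    mean_latency = sum(latencies_ms) / len(latencies_ms)
    r = pearson_r(laser_waveform, spontaneous_waveform)
    return mean_latency <= max_latency_ms and r > min_waveform_r
```

      Requiring both a short latency and a preserved waveform guards against counting units that respond to light only indirectly or whose laser-evoked spikes come from a different neuron.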

      And Line 606: 

      “Only one recording session was performed for each animal per day, and one recording session was included from saline, D2 blockade, and D1 blockade sessions.”

      (6) For results in Figure 5, the authors should analyze the speed for the laser on and off group, since the dorsomedial striatum was reported to be related to control of speed (Yttri, Eric A., and Joshua T. Dudman. "Opponent and bidirectional control of movement velocity in the basal ganglia." Nature 533.7603 (2016): 402-406.). 

      We have some initial DeepLabCut data and have included it in a new Figure 1E.

      B) DeepLabCut tracking of position during the interval timing revealed that mice moved quickly after trial start and then velocity was relatively constant throughout the trial

      We measure movement speed using nosepoke duration and traversal time, which can give some measure of movement velocity.

      In Yttri and Dudman, the mice are head-fixed and moving a joystick, whereas our mice are freely moving. However, we have now included the lack of motor control as a major limitation (Line 510): 

      “Finally, movement and motivation contribute to MSN dynamics (Robbe, 2023). Four lines of evidence argue that our findings cannot be directly explained by motor confounds: 1) D2-MSNs and D1-MSNs diverge between 0-6 seconds after trial start well before the first nosepoke (Fig S2), 2) our GLM accounted for nosepokes and nosepoke-related βs were similar between D2-MSNs and D1-MSNs, 3) optogenetic disruption of dorsomedial D2-MSNs and D1-MSNs did not change task-specific movements despite reliable changes in switch response time, and 4) ramping dynamics were quite distinct from movement dynamics. Furthermore, disrupting D2-MSNs and D1-MSNs did not change the number of rewards animals received, implying that these disruptions did not grossly affect motivation. Still, future work combining motion tracking with neuronal ensemble recording and optogenetics and including bisection tasks may further unravel timing vs. movement in MSN dynamics (Robbe, 2023).”

      (7)  Figure S3 (C, E, and F), statistics should be done based on N (number of mice), not on the number of recorded neurons.  

      We have removed this section, and all other statistics in the paper properly account for mouse-specific variance, as noted above.

      (8)  Figure S1 

      (A) Are these the results from all mice superposed together, or from one mouse on one given day? How many of the trials' data were superposed?

      We included these details in a new Figure 1.

      (B, C) How many trials were included? 

      (D) How many days did these data cover? 

      We have included a new Table 1 with these important details.

      We have noted that only 1 recording session / mouse was included in analysis (Line 606):

      “Only one recording session was performed for each animal per day, and one recording session was included from each animal.”

      And Line 614: 

      “Only one recording session was performed for each animal per day, and one recording session was included from saline, D2 blockade, and D1 blockade sessions.”

      (9) Figure S2 

      (A) Can the authors add coordinates of the brain according to the mouse brain atlas or, alternatively, show it using a coronal section? 

      Great idea – added to Figure S2 legend: 

      “Figure S1: A) Recording locations in the dorsomedial striatum (targeting AP +0.4, ML -1.4, DV -2.7). Electrode reconstructions for D2-Cre (red), D1-Cre (blue), and wild-type mice (green). Only the left striatum was implanted with electrodes in all animals.”

      We have also added it to Figure S5 legend: 

      “Figure S5: Fiber optic locations from A) an opsin-expressing mouse with mCherry-tagged halorhodopsin and bilateral fiber optics, and B) across 10 D2-Cre mice (red) and 6 D1-cre mice (blue) with fiber optics (targeting AP +0.9, ML +/-1.3, DV –2.5).”

      (C) Why did the waveform of laser and no laser seem the same? 

      The optogenetically tagged spike waveforms are highly similar, indicating that optogenetically-triggered spikes are like other spikes. That is the main point – optogenetically stimulating the neuron does not change the waveform. We have added this detail to the legend of S1: 

      “Inset on bottom right – waveforms from laser trials (red) and trials without laser (blue).  Across 73 tagged neurons, waveform correlation coefficients for laser trials vs. trials without laser was r = 0.97 (0.92-0.99). These data demonstrate that optogenetically triggered spikes are similar to non-optogenetically triggered spikes.”

      (10)  Figure S7, what was the laser power used in this experiment? Have the authors tried different laser powers? 

      We have now clarified the laser power on line 598: 

      “In animals injected with optogenetic viruses, optical inhibition was delivered via bilateral patch cables for the entire trial duration of 18 seconds via 589-nm laser light at 12 mW power on 50% of randomly assigned trials.”

      And for Figure S6 (was S7 previously): 

      We did not try other laser powers; our parameters were chosen a priori based on our past work.  

      (11)  In Figure S9, what method was used to sort the neurons? 

      We now clarify in the methods (Line 617): 

      “Electrophysiology. Single-unit recordings were made using a multi-electrode recording system (Open Ephys, Atlanta, GA). After the experiments, Plexon Offline Sorter (Plexon, Dallas, TX) was used to remove artifacts. Principal component analysis (PCA) and waveform shape were used for spike sorting. Single units were defined as those 1) having a consistent waveform shape, 2) being a separable cluster in PCA space, and 3) having a consistent refractory period of at least 2 milliseconds in interspike interval histograms. The same MSNs were sorted across saline, D2 blockade, and D1 blockade sessions by loading all sessions simultaneously in Offline Sorter and sorted using the preceding criteria. MSNs had to have consistent firing in all sessions to be included. Sorting integrity across sessions was quantified by comparing waveform similarity via R² between sessions.”

      And in the results (Line 353):

      “We analyzed 99 MSNs in sessions with saline, D2 blockade, and D1 blockade. We matched MSNs across sessions based on waveform and interspike intervals; waveforms were highly similar across sessions (correlation coefficient between matched MSN waveforms: saline vs D2 blockade r = 1.00 (0.99 – 1.00), rank sum vs correlations in unmatched waveforms p = 3×10⁻⁴⁴; saline vs D1 blockade r = 1.00 (1.00 – 1.00), rank sum vs correlations in unmatched waveforms p = 4×10⁻⁵⁰). There were no consistent changes in MSN average firing rate with D2 blockade or D1 blockade (F = 1.1, p = 0.30 accounting for variance between MSNs; saline: 5.2 (3.3 – 8.6) Hz; D2 blockade 5.1 (2.7 – 8.0) Hz; F = 2.2, p = 0.14; D1 blockade 4.9 (2.4 – 7.8) Hz).”

      (C-F) statistics should be done based on the number of mice, not on the number of recorded neurons. 

      We agree; all experiments are now quantified using linear mixed effects models, which formally account for variance contributed across animals, as discussed at length earlier in the review and with statistical experts at the University of Iowa.

      (12) For results in Figure 6, did the authors do cell-type specific recording on D1 or D2 MSNs using optogenetic tagging? As the D1- or D2- MSNs account for ~50% of all MSNs, the inhibition of a considerable amount of neurons was not observed. The authors should discuss the relation between the results from optogenetic inhibition of D1- or D2- MSNs and pharmacological disruption of D1 or D2 dopamine receptors. 

      This is a great point. First, we did not combine cell-type specific recordings with tagging as it was difficult to get enough trials for analysis in a single session in the tagging experiments, and pharmacological interventions can further decrease performance.  However, we have made our results in Figure 6 much more focused.

      We have discussed the relationship between these data in the results (Line 380): 

      “This data-driven analysis shows that D2 and D1 blockade produced similar shifts in MSN population dynamics represented by PC1.  When combined with major contributions of D1/D2 MSNs to PC1 (Fig 3C) these findings show that pharmacologically disrupting D2 or D1 MSNs can disrupt ramping-related activity in the striatum.”

      And in the discussion (Line 417): 

      “Strikingly, optogenetic tagging showed that D2-MSNs and D1-MSNs had distinct dynamics during interval timing. MSN dynamics helped construct and constrain a four-parameter drift-diffusion model in which D2- and D1-MSN spiking accumulated temporal evidence. This model predicted that disrupting either D2-MSNs or D1-MSNs would increase response times. Accordingly, we found that optogenetically or pharmacologically disrupting striatal D2-MSNs or D1-MSNs increased response times without affecting task-specific movements. Disrupting D2-MSNs or D1-MSNs shifted MSN temporal dynamics and degraded MSN temporal encoding. These data, when combined with our model predictions, demonstrate that D2-MSNs and D1-MSNs contribute temporal evidence to controlling actions in time.”

      And: 

      “D2-MSNs and D1-MSNs play complementary roles in movement. For instance, stimulating D1-MSNs facilitates movement, whereas stimulating D2-MSNs impairs movement (Kravitz et al., 2010). Both populations have been shown to have complementary patterns of activity during movements (Tecuapetla et al., 2016), with MSNs firing at different phases of action initiation and selection. Further dissection of action selection programs reveals that opposing patterns of activation among D2-MSNs and D1-MSNs suppress and guide actions, respectively, in the dorsolateral striatum (Cruz et al., 2022). A particular advantage of interval timing is that it captures a cognitive behavior within a single dimension: time. When projected along the temporal dimension, it was surprising that D2-MSNs and D1-MSNs had opposing patterns of activity. Past pharmacological work from our group and others has shown that disrupting D2 or D1 MSNs slows timing (De Corte et al., 2019; Drew et al., 2007, 2003; Stutt et al., 2023), in line with pharmacological and optogenetic results in this manuscript. Computational modeling predicted that disrupting either D2-MSNs or D1-MSNs increased self-reported estimates of time, which was supported by both optogenetic and pharmacological experiments. Notably, these disruptions are distinct from increased timing variability reported with administrations of amphetamine, ventral tegmental area dopamine neuron lesions, and rodent models of neurodegenerative disease (Balci et al., 2008; Gür et al., 2020, 2019; Larson et al., 2022; Weber et al., 2023). Furthermore, our current data demonstrate that disrupting either D2-MSN or D1-MSN activity shifted MSN dynamics and degraded temporal encoding, supporting prior work (De Corte et al., 2019; Drew et al., 2007, 2003; Stutt et al., 2023).
Our recording experiments do not identify where a possible response threshold T is instantiated, but downstream basal ganglia structures may have a key role in setting response thresholds (Toda et al., 2017).”

      (13) For Figure 2, what is the error region for G and H? Is there a statistically significant difference between the start (e.g., 0-1 s) and the end (e.g., 5-6 s) time? 

      The error regions in G and H show the standard error, which we have now clarified.

      And on Line 166: 

      “These differences resulted in distinct activity early in the interval (0-1 seconds; F = 6.0, p = 0.02 accounting for variance between mice), but not late in the interval (5-6 seconds; F = 1.9, p = 0.17 accounting for variance between mice) between D2-MSNs and D1-MSNs.”

      Minor: 

      (1)  Figure 2 legend showed the wrong label: "Peri-event raster C) from a D2-MSN (red) and E) from a D1-MSN (blue)". It should be (D).

      Fixed, thank you.  

      (2)  Figure 2. Missing legend for (E) and (F).  

      Fixed, thank you.  

      (3)  Line 423: mistyped "\" 

      Fixed, thank you.  

      Reviewer #3 (Recommendations For The Authors): 

      -  To clarify that complementary means opposing in this context, I suggest changing the title. 

      This is a helpful suggestion. We have changed it exactly as the reviewer suggested: 

      “Complementary opposing D2-MSNs and D1-MSNs dynamics during interval timing”

      -  I recommend adding a supplementary figure to demonstrate all the nose pokes in all trials in a given session. The current figures make it hard to assess the specifics of the behavior. For example, what happens if, in a long-interval trial, the mouse pokes in the second nose poke before 6 seconds? Is that behavior punished? Do they keep alternating between the nose poke or do they stick to one nose poke? 

      We agree. We think this is a main point, and we have now redesigned Figure 1 to describe these details: 

      And added these details to the methods (Line 548): 

      “Interval timing switch task. We used a mouse-optimized operant interval timing task described in detail previously (Balci et al., 2008; Bruce et al., 2021; Tosun et al., 2016; Weber et al., 2023). Briefly, mice were trained in sound-attenuating operant chambers, with two front nosepokes flanking either side of a food hopper on the front wall, and a third nosepoke located at the center of the back wall. The chamber was positioned below an 8-kHz, 72-dB speaker (Fig 1A; MedAssociates, St. Albans, VT). Mice were 85% food restricted and motivated with 20 mg sucrose pellets (BioServ, Flemington, NJ). Mice were initially trained to receive rewards during fixed ratio nosepoke response trials. Nosepoke entry and exit were captured by infrared beams. After shaping, mice were trained in the “switch” interval timing task. Mice self-initiated trials at the back nosepoke, after which tone and nosepoke lights were illuminated simultaneously. Cues were identical on all trial types and lasted the entire duration of the trial (6 or 18 seconds). On 50% of trials, mice were rewarded for a nosepoke after 6 seconds at the designated first ‘front’ nosepoke; these trials were not analyzed. On the remaining 50% of trials, mice were rewarded for nosepoking first at the ‘first’ nosepoke location and then switching to the ‘second’ nosepoke location; the reward was delivered for initial nosepokes at the second nosepoke location after 18 seconds when preceded by a nosepoke at the first nosepoke location. Multiple nosepokes at each nosepoke were allowed. Early responses at the first or second nosepoke were not reinforced. Initial responses at the second nosepoke rather than the first nosepoke, alternating between nosepokes, and going back to the first nosepoke after the second nosepoke were rare after initial training. Error trials included trials where animals responded only at the first or second nosepoke and were also not reinforced.
We did not analyze error trials as they were often too few to analyze; these were analyzed at length in our prior work (Bruce et al., 2021).”

      -  Figures 2E and 2F suggest that some D1 cells ramp up during the first 6 seconds, while others ramp down. The same is more or less true for D2s. I wonder if the analysis will lose its significance if the two outlier D1s are excluded from Figure 3D. 

      This is a great idea suggested by multiple reviewers. We repeated this analysis with outliers removed. We used a data-driven approach to remove outliers (Line 656): 

      “We performed additional sensitivity analysis excluding outliers outside of 95% confidence intervals and measuring firing rate from the start of the interval to the time of the switch response on a trial-by-trial level for each neuron.”

      And described these data in the results (Line 201): 

      “We found that D2-MSNs and D1-MSNs had a significantly different slope even when excluding outliers (4 outliers excluded outside of 95% confidence intervals; F=7.51, p=0.008 accounting for variance between mice) and when the interval was defined as the time between trial start and the switch response on a trial-by-trial basis for each neuron (F=4.3, p=0.04 accounting for variance between mice).”

      Finally, we removed the outliers the reviewers alluded to (two D1 MSNs) and found similar results (F=6.59, p=0.01 for main effect of D2 vs. D1 MSNs controlling for between-mouse variability). We elected to include the more data-driven approach based on 95% confidence intervals.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This useful study examined the associations of a healthy lifestyle with comprehensive and organ-specific biological ages defined using common blood biomarkers and body measures. Its large sample size, longitudinal design, and robust statistical analysis provide solid support for the findings, which will be of interest to epidemiologists and clinicians.

      Thank you very much for your thoughtful review of our manuscript. Your valuable comments have greatly helped us improve our manuscript. We have carefully considered all the comments and suggestions made by the reviewers and have revised them to address each point. Below, we provide detailed responses to each of the reviewers' comments. Please note that the line numbers mentioned in the following responses correspond to the line numbers in the clean version of the manuscript.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This study was to examine the associations of a healthy lifestyle with comprehensive and organ-specific biological ages. It emphasized the importance of lifestyle factors in biological ages, which were defined using common blood biomarkers and body measures.

      Strengths:

      The data were from a large cohort study and defined comprehensive and six-specified biological ages.

      Weaknesses:

      (1) Since only 8.5% of participants from the CMEC (China Multi-Ethnic Cohort Study) were included in the study, has any selection bias happened?

      Thank you for your valuable question. We understand the concern regarding potential selection bias given that only 8.5% of participants were included in the study. The baseline survey of the China Multi-Ethnic Cohort Study (CMEC) employed a rigorous multi-stage stratified cluster sampling method, and the repeat survey reevaluated approximately 10% of baseline participants through community-based cluster random sampling. Therefore, the sample of the repeat survey is representative. The second reason for the loss of sample size was the availability of biomarkers for BA calculation. We compared the characteristics of the overall population, the population included in this study, and the population excluded from it. Most characteristics were similar, but participants included in this study fared better on some health-related variables; one potential reason is that healthier individuals were more likely to complete the follow-up survey. In conclusion, we believe that the impact of selection bias is limited.

      Author response table 1.

      Baseline characteristics of participants included and not included in the study

      BA, biological age; BMI, body mass index; CVD, cardiovascular disease; HLI, healthy lifestyle indicator.

      1 Data are presented as median (25th, 75th percentile) for continuous variables and count (percentage) for categorical variables.

      2 For HLI, "healthy" corresponds to a score of 4-5.

      3 Information on each validated BA has been reported. BA acceleration is the difference between each BA and CA in the same survey.

      (2) The authors should specify the efficiency of FFQ. How can FFQ genuinely reflect the actual intake? Moreover, how was the aMED calculated?

      Thank you for the comments and questions. We appreciate the opportunity to clarify these aspects of our study. For the first question, we evaluated the FFQ's reproducibility and validity by conducting repeated FFQs and 24-hour dietary recalls at the baseline survey. Intraclass correlation coefficients (ICC) for reproducibility ranged from 0.15 for fresh vegetables to 0.67 for alcohol, while deattenuated Spearman rank correlations for validity ranged from 0.10 for soybean products to 0.66 for rice. More details are provided in our previous study (Lancet Reg Health West Pac, 2021). We have added the corresponding content in both the main text and the supplementary materials.

      Methods, Page 8, lines 145-146: “The FFQ's reproducibility and validity were evaluated by conducting repeated FFQs and 24-hour dietary recalls.”

      Supplementary methods, Dietary assessment: “We evaluated the FFQ's reproducibility and validity by conducting repeated FFQs and 24-hour dietary recalls. Intraclass correlation coefficients for reproducibility ranged from 0.15 for fresh vegetables to 0.67 for alcohol, while deattenuated Spearman rank correlations for validity ranged from 0.10 for soybean products to 0.66 for rice.”

      For the second question, we apologize for any confusion. To avoid taking up too much space in the main text, we decided not to include the detailed aMED calculation (as described in Circulation, 2009) there and instead placed it in the supplementary materials:

      “Our calculated aMED score incorporates eight components: vegetables, legumes, fruits, whole grains, fish, the ratio of monounsaturated fatty acids (MUFA) to saturated fatty acids (SFA), red and processed meats, and alcohol. Each component's consumption was divided into sex-specific quintiles. Scores ranging from 1 to 5 were assigned based on quintile rankings to each component, except for red and processed meats and alcohol, for which the scoring was inverted. The alcohol criteria for the aMED was defined as moderate consumption. Since the healthy lifestyle index (HLI) already contained a drinking component, we removed the drinking item in the aMED, which had a score range of 7-35 with a higher score reflecting better adherence to the overall Mediterranean dietary pattern. We defined individuals with aMED scores ≥ population median as healthy diets.”

      Reference:

      (1) Xiao X, Qin Z, Lv X, Dai Y, Ciren Z, Yangla Y, et al. Dietary patterns and cardiometabolic risks in diverse less-developed ethnic minority regions: results from the China Multi-Ethnic Cohort (CMEC) Study. Lancet Reg Health West Pac. 2021;15:100252. doi: 10.1016/j.lanwpc.2021.100252.

      (2) Fung TT, Rexrode KM, Mantzoros CS, Manson JE, Willett WC, Hu FB. Mediterranean diet and incidence of and mortality from coronary heart disease and stroke in women. Circulation. 2009;119(8):1093-100. doi: 10.1161/circulationaha.108.816736.
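
      The aMED scoring rule quoted above (seven components after dropping alcohol, quintile scores of 1 to 5, inverted scoring for red and processed meats, and a median cut-off) can be sketched as follows. This is a minimal illustration under stated assumptions: the component keys, the rank-based quintile assignment, and the arbitrary handling of ties are hypothetical choices, and the function would need to be applied separately within each sex to mirror the sex-specific quintiles.

```python
from statistics import median

# Hypothetical component keys; the seven components follow the quoted description.
POSITIVE = ["vegetables", "legumes", "fruits", "whole_grains", "fish", "mufa_sfa_ratio"]
INVERTED = ["red_processed_meat"]

def quintile_scores(values):
    """Assign 1-5 by rank-based quintile (1 = lowest fifth); ties are broken by input order."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0] * n
    for rank, i in enumerate(order):
        scores[i] = rank * 5 // n + 1
    return scores

def amed_scores(intakes):
    """intakes: dict mapping component -> list of intakes for one sex stratum."""
    n = len(next(iter(intakes.values())))
    totals = [0] * n
    for comp in POSITIVE + INVERTED:
        for i, s in enumerate(quintile_scores(intakes[comp])):
            # Higher intake of red/processed meat earns a lower score.
            totals[i] += (6 - s) if comp in INVERTED else s
    return totals  # each total lies between 7 and 35

def healthy_diet(totals):
    """Flag scores at or above the population median as a healthy diet."""
    cut = median(totals)
    return [t >= cut for t in totals]
```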

      (3) HLI (range) and HLI (category) should be clearly defined.

      Thank you for the comment. We have added the definition of HLI (range) and HLI (category) in the methods section:

      Methods P9 lines 165-170: “The HLI was calculated by directly adding up the five lifestyle scores, ranging from 0-5, with a higher score representing an overall healthier lifestyle, denoted as HLI (range) in the following text. We then transformed HLI into a dichotomous variable in this study, denoted as HLI (category), where a score of 4-5 for HLI was considered a healthy lifestyle, and a score of 0-3 was considered an unfavorable lifestyle that could be improved.”
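
      The two HLI variables defined above can be expressed in a few lines. The sketch assumes each of the five lifestyle factors has already been coded as 0 or 1; the function name and the input format are illustrative assumptions.

```python
def hli(scores):
    """Return (HLI range, HLI category) from five 0/1 lifestyle scores."""
    if len(scores) != 5 or any(s not in (0, 1) for s in scores):
        raise ValueError("expected five lifestyle scores coded 0 or 1")
    total = sum(scores)  # HLI (range): 0-5, higher is healthier
    category = "healthy" if total >= 4 else "unfavorable"  # HLI (category)
    return total, category
```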

      (4) The comprehensive rationale and each specific BA construction should be clearly defined and discussed. For example, can cardiopulmonary BA be reflected only by using cardiopulmonary status? I do not think so.

      Thank you for the opportunity to clarify. We constructed the comprehensive BA based on all the available biochemical data from the CMEC study, selecting aging-related markers (J Gerontol A Biol Sci Med Sci, 2021), and further constructed organ-specific BAs based on these selected biomarkers. The KDM algorithm does not specify biomarker types but requires them to be correlated with chronological age (CA) (Mech Ageing Dev, 2006). Existing studies typically construct BA based on available biomarkers; we included 15 biomarkers in this study, which could be considered comprehensive and extensive compared to previous research (J Transl Med. 2023; J Am Heart Assoc. 2024; Nat Cardiovasc Res. 2024). To select the biomarkers for each organ-specific BA, we categorized biomarkers primarily based on their relevance to the structure and function of each organ system according to the classification in previous studies (Nat Med, 2023; Cell Rep, 2022). Since the biomarkers we used came from clinical-lab data sets, they were categorized based on the clinical interpretation of blood chemistry tests following the methods outlined in the two referenced papers (Nat Med, 2023; Cell Rep, 2022). We only used biomarkers directly related to each specific system to minimize overlap between the indicators used for different BAs, thereby preserving the distinctiveness of organ-specific BAs. We acknowledge the limitation of this approach: a few biomarkers may not fully capture the complete aging process of a system, and certain indicators may be missing due to data constraints. However, the multi-organ BAs we constructed are cost-effective, easy to implement, and have been validated, making them valuable despite the limitations.

      Reference:

      (1) Verschoor CP, Belsky DW, Ma J, Cohen AA, Griffith LE, Raina P. Comparing Biological Age Estimates Using Domain-Specific Measures From the Canadian Longitudinal Study on Aging. J Gerontol A Biol Sci Med Sci. 2021;76(2):187-94. doi: 10.1093/gerona/glaa151.

      (2) Klemera P, Doubal S. A new approach to the concept and computation of biological age. Mech Ageing Dev. 2006;127(3):240-8. doi: 10.1016/j.mad.2005.10.004

      (3) Zhang R, Wu M, Zhang W, Liu X, Pu J, Wei T, et al. Association between life's essential 8 and biological ageing among US adults. J Transl Med. 2023;21(1):622. doi: 10.1186/s12967-023-04495-8.

      (4) Forrester SN, Baek J, Hou L, Roger V, Kiefe CI. A Comparison of 5 Measures of Accelerated Biological Aging and Their Association With Incident Cardiovascular Disease: The CARDIA Study. J Am Heart Assoc. 2024;13(8):e032847. doi: 10.1161/jaha.123.032847.

      (5) Jiang M, Tian S, Liu S, Wang Y, Guo X, Huang T, Lin X, Belsky DW, Baccarelli AA, Gao X. Accelerated biological aging elevates the risk of cardiometabolic multimorbidity and mortality. Nat Cardiovasc Res. 2024;3(3):332-42. doi: 10.1038/s44161-024-00438-8.

      (6) Tian YE, Cropley V, Maier AB, Lautenschlager NT, Breakspear M, Zalesky A. Heterogeneous aging across multiple organ systems and prediction of chronic disease and mortality. Nat Med. 2023;29(5):1221-31. doi: 10.1038/s41591-023-02296-6.

      (7) Nie C, Li Y, Li R, Yan Y, Zhang D, Li T, et al. Distinct biological ages of organs and systems identified from a multi-omics study. Cell Rep. 2022;38(10):110459. doi: 10.1016/j.celrep.2022.110459.
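
      For readers unfamiliar with the KDM (Klemera-Doubal method) referred to above, the following sketch shows the general form of the estimator. It assumes each biomarker has already been regressed on CA, giving a slope k, an intercept q, and a residual standard deviation s per biomarker; all parameter values passed in are hypothetical, and this is not the study's own code or fitted model.

```python
def kdm_biological_age(x, k, q, s, ca=None, s_ba=None):
    """Klemera-Doubal biological age estimate.

    x: biomarker values for one person
    k, q, s: per-biomarker slope, intercept, and residual SD from regressions on CA
    ca, s_ba: optional chronological age and its assumed SD term; when both are
              given, CA is included as an additional "biomarker".
    """
    num = sum((xj - qj) * kj / sj ** 2 for xj, kj, qj, sj in zip(x, k, q, s))
    den = sum((kj / sj) ** 2 for kj, sj in zip(k, s))
    if ca is not None and s_ba is not None:
        num += ca / s_ba ** 2
        den += 1 / s_ba ** 2
    return num / den

def ba_acceleration(ba, ca):
    """BA acceleration: the difference between BA and CA in the same survey."""
    return ba - ca
```

      A person whose biomarkers sit exactly on the population regression lines gets a BA equal to their CA, so the acceleration is zero. Organ-specific BAs would reuse the same function with only that system's biomarkers.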

      (5) The lifestyle index is defined based on an equal-weight approach, but this does not reflect reality and cannot fully answer the research questions it raises.

      Thank you very much for your valuable suggestion. We used an equal-weight healthy lifestyle index (HLI) partly to facilitate comparisons with other studies. The equal-weight approach to constructing the HLI is commonly used in current research (BMJ, 2021; Diabetes Care. 2022; Arch Gerontol Geriatr. 2022). The equal-weight HLI can demonstrate the average benefit of adopting each additional healthy lifestyle and avoid assumptions about the relative importance of different behaviors, which may vary depending on the population. To further clarify the importance of each lifestyle factor, we conducted quantile G-computation analysis, which can reflect the weight differences between lifestyle factors (PLoS Med, 2020; Clin Epigenetics, 2022).

      Reference:

      (1) Zhang YB, Chen C, Pan XF, Guo J, Li Y, Franco OH, Liu G, Pan A. Associations of healthy lifestyle and socioeconomic status with mortality and incident cardiovascular disease: two prospective cohort studies. BMJ. 2021;373:n604. doi: 10.1136/bmj.n604.

      (2) Han H, Cao Y, Feng C, Zheng Y, Dhana K, Zhu S, Shang C, Yuan C, Zong G. Association of a Healthy Lifestyle With All-Cause and Cause-Specific Mortality Among Individuals With Type 2 Diabetes: A Prospective Study in UK Biobank. Diabetes Care. 2022;45(2):319-29. doi: 10.2337/dc21-1512.

      (3) Jin S, Li C, Cao X, Chen C, Ye Z, Liu Z. Association of lifestyle with mortality and the mediating role of aging among older adults in China. Arch Gerontol Geriatr. 2022;98:104559. doi: 10.1016/j.archger.2021.104559.

      (4) Chudasama YV, Khunti K, Gillies CL, Dhalwani NN, Davies MJ, Yates T, Zaccardi F. Healthy lifestyle and life expectancy in people with multimorbidity in the UK Biobank: A longitudinal cohort study. PLoS Med. 2020;17(9):e1003332. doi: 10.1371/journal.pmed.1003332.

      (5) Kim K, Zheng Y, Joyce BT, Jiang H, Greenland P, Jacobs DR, Jr., et al. Relative contributions of six lifestyle- and health-related exposures to epigenetic aging: the Coronary Artery Risk Development in Young Adults (CARDIA) Study. Clin Epigenetics. 2022;14(1):85. doi: 10.1186/s13148-022-01304-9.

      Reviewer #2 (Public Review):

      This interesting study focuses on the association between lifestyle factors and comprehensive and organ-specific biological aging in a multi-ethnic cohort from Southwest China. It stands out for its large sample size, longitudinal design, and robust statistical analysis.

      Some issues deserve clarification to enhance this paper:

      (1) How were the biochemical indicators for organ-specific biological ages chosen, and are these indicators appropriate? Additionally, a more detailed description of the multi-organ biological ages should be provided to help understand the distribution and characteristics of BAs.

      We thank you for raising this point. As explained in our response to the fourth question from the first reviewer, we constructed the comprehensive BA based on all the available biochemical data from the CMEC study, selecting aging-related markers (J Gerontol A Biol Sci Med Sci, 2021), and further constructed organ-specific BAs based on these selected biomarkers. The KDM algorithm does not specify biomarker types but requires them to be correlated with chronological age (CA) (Mech Ageing Dev, 2006). Existing studies typically construct BA based on available biomarkers; we included 15 biomarkers in this study, which could be considered comprehensive and extensive compared to previous research (J Transl Med. 2023; J Am Heart Assoc. 2024; Nat Cardiovasc Res. 2024). To select the biomarkers for each organ-specific BA, we categorized biomarkers primarily based on their relevance to the structure and function of each organ system according to the classification in previous studies (Nat Med, 2023; Cell Rep, 2022). Since the biomarkers we used came from clinical-lab data sets, they were categorized based on the clinical interpretation of blood chemistry tests (Nat Med, 2023). We only used biomarkers directly related to each specific system to minimize overlap between the indicators used for different BAs, thereby preserving the distinctiveness of organ-specific BAs.

      We have added a descriptive table for the comprehensive and organ systems BAs in the supplementary materials to provide a more detailed understanding of the distribution and characteristics of BAs:

      Author response table 2.

      Description of BA and BA acceleration1

      BA, biological age

      1 Data are presented as mean (standard deviation).

      (2) The authors categorized the HLI score into a dichotomous variable, which may cause a loss of information. How did the authors address this potential issue?

      Thank you for raising this concern. We categorized each lifestyle factor into a binary variable based on relevant guidelines and studies, which recommend assigning a score of 1 if the guideline or study recommendations are met (BMJ, 2021; J Am Heart Assoc, 2023). While dichotomization may lead to some loss of information, it allows for a clearer interpretation and comparison of adherence to ideal healthy lifestyle behaviors. Another advantage of this treatment is that it allows for easy comparison with other studies. We categorized the HLI score into a dichotomous variable to enhance the practical relevance of the results (J Gerontol A Biol Sci Med Sci, 2021). Additionally, we conducted analyses using the continuous HLI score to ensure that our findings were robust, and the results were consistent with those obtained using the dichotomous HLI.

      Reference:

      (1) Verschoor CP, Belsky DW, Ma J, Cohen AA, Griffith LE, Raina P. Comparing Biological Age Estimates Using Domain-Specific Measures From the Canadian Longitudinal Study on Aging. J Gerontol A Biol Sci Med Sci. 2021;76(2):187-94. doi: 10.1093/gerona/glaa151.

      (2) Klemera P, Doubal S. A new approach to the concept and computation of biological age. Mech Ageing Dev. 2006;127(3):240-8. doi: 10.1016/j.mad.2005.10.004

      (3) Zhang R, Wu M, Zhang W, Liu X, Pu J, Wei T, et al. Association between life's essential 8 and biological ageing among US adults. J Transl Med. 2023;21(1):622. doi: 10.1186/s12967-023-04495-8.

      (4) Forrester SN, Baek J, Hou L, Roger V, Kiefe CI. A Comparison of 5 Measures of Accelerated Biological Aging and Their Association With Incident Cardiovascular Disease: The CARDIA Study. J Am Heart Assoc. 2024;13(8):e032847. doi: 10.1161/jaha.123.032847.

      (5) Jiang M, Tian S, Liu S, Wang Y, Guo X, Huang T, Lin X, Belsky DW, Baccarelli AA, Gao X. Accelerated biological aging elevates the risk of cardiometabolic multimorbidity and mortality. Nat Cardiovasc Res. 2024;3(3):332-42. doi: 10.1038/s44161-024-00438-8.

      (6) Tian YE, Cropley V, Maier AB, Lautenschlager NT, Breakspear M, Zalesky A. Heterogeneous aging across multiple organ systems and prediction of chronic disease and mortality. Nat Med. 2023;29(5):1221-31. doi: 10.1038/s41591-023-02296-6.

      (7) Nie C, Li Y, Li R, Yan Y, Zhang D, Li T, et al. Distinct biological ages of organs and systems identified from a multi-omics study. Cell Rep. 2022;38(10):110459. doi: 10.1016/j.celrep.2022.110459.

      (3) Because lifestyle data are self-reported, they may suffer from recall bias. This issue needs to be addressed in the limitations section.

      Thank you for your valuable suggestion. We acknowledge that the use of self-reported lifestyle data in our study may introduce recall bias, potentially affecting the accuracy of the information collected. We have added the following statement to the limitations section of our manuscript:

      Discussion, Page 22, lines 463-464: “Fifth, assessment of lifestyle factors was based on self-reported data collected through questionnaires, which may be subject to recall bias.”

      (4) It should be clarified whether the adjusted CA is the baseline value of CA. Additionally, why did the authors choose models with additional adjustments for time-invariant variables as their primary analysis? This approach does not align with standard FEM analysis (Lines 261-263).

      Thank you for the opportunity to clarify. We have changed the sentence to “baseline CA”. For the second question, in a standard fixed effects model (FEM), only time-varying variables are typically included. However, to enhance the flexibility of our models and account for potential variations in the association of time-invariant variables with CA, as has been commonly done in previous studies, we additionally adjusted for time-invariant variables and the baseline value of CA (BMC Med Res Methodol, 2024; Am J Clin Nutr, 2020). Moreover, sensitivity analyses using the standard FEM were conducted in this study, and robust results were obtained.

      Reference:

      (1) Tang D, Hu Y, Zhang N, Xiao X, Zhao X. Change analysis for intermediate disease markers in nutritional epidemiology: a causal inference perspective. BMC Med Res Methodol. 2024;24(1):49. doi: 10.1186/s12874-024-02167-9.

      (2) Trichia E, Luben R, Khaw KT, Wareham NJ, Imamura F, Forouhi NG. The associations of longitudinal changes in consumption of total and types of dairy products and markers of metabolic risk and adiposity: findings from the European Investigation into Cancer and Nutrition (EPIC)-Norfolk study, United Kingdom. Am J Clin Nutr. 2020;111(5):1018-26. doi: 10.1093/ajcn/nqz335.
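
      As background to the FEM discussion above, the standard fixed effects (within) estimator can be sketched for a single time-varying predictor. The sketch shows why time-invariant characteristics drop out of a standard FEM: subtracting each person's own mean removes anything that does not change between surveys. It illustrates only the standard model, not the extended primary model described in the response, and the data layout is a hypothetical choice.

```python
def within_estimator(panel):
    """Fixed effects slope for one predictor.

    panel: dict mapping person id -> list of (x, y) pairs, one pair per survey wave.
    """
    sxy = 0.0
    sxx = 0.0
    for obs in panel.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            # Person-level demeaning removes all time-invariant differences.
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    if sxx == 0:
        raise ValueError("predictor does not vary within any person")
    return sxy / sxx
```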

      (5) How is the relative contribution calculated in the QGC analysis? The relative contribution of some lifestyle factors is not shown in Figure 2 and the supplementary figures, such as Supplementary Figure 7. These omissions should be explained.

      Thanks for the questions. The QGC obtains causal relationships and estimates weights for each component, which has been widely used in epidemiological research. More details about QGC can be found in the supplementary methods. The reason some results are not displayed is that we assumed all healthy lifestyle changes would have a protective effect on BA acceleration. However, the effect size of some lifestyle factors did not align with this assumption and lacked statistical significance. Because positive and negative weights were calculated separately in QGC, with all positive weights summing to 1 and all negative weights summing to 1, these factors would have had large positive weights. To avoid potential misunderstandings, we chose not to include these results in the figures. We have added explanations to the figure legends where applicable:

      “The blue bars represent results that are statistically significant in the FEM analysis, while the gray bars represent results in the FEM analysis that were not found to be statistically significant and positive weights were not shown.”
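
      The weight normalization described above can be illustrated with a short sketch. It assumes the per-factor coefficients from the underlying model are already available; the factor labels and coefficient values in the usage note are hypothetical, and the function is not part of any QGC software package.

```python
def qgc_weights(coefs):
    """Split coefficients by sign and normalize each direction to sum to 1.

    coefs: dict mapping factor name -> estimated coefficient.
    Returns (positive_weights, negative_weights).
    """
    pos_sum = sum(b for b in coefs.values() if b > 0)
    neg_sum = sum(b for b in coefs.values() if b < 0)
    positive = {name: b / pos_sum for name, b in coefs.items() if b > 0}
    negative = {name: b / neg_sum for name, b in coefs.items() if b < 0}
    return positive, negative
```

      With hypothetical coefficients of -0.3 and -0.1 for two protective factors and 0.02 for a third, the negative weights are 0.75 and 0.25, while the lone positive factor receives a weight of 1 despite its small effect. This is the situation in which a displayed positive weight could be misread as important.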

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      To enhance this paper, some issues deserve clarification:

      (1) How were the biochemical indicators for organ-specific biological ages chosen, and are these indicators appropriate? Additionally, please provide a more detailed description of the multi-organ biological ages to help understand the distribution and characteristics of BAs.

      (2) The authors categorized the HLI score into a dichotomous variable, which may cause a loss of information. How did the authors address this potential issue?

      (3) Because lifestyle data are self-reported, they may suffer from recall bias. This issue needs to be addressed in the limitations section.

      (4) Lines 261-263: Please clarify if the adjusted CA is the baseline value of CA. Additionally, why did you choose models with additional adjustments for time-invariant variables as your primary analysis? This approach does not align with standard FEM analysis.

      (5) How is the relative contribution calculated in the QGC analysis? The relative contribution of some lifestyle factors is not shown in Figure 2 and the supplementary figures, such as Supplementary Figure 7. Please explain these omissions.

      The above five issues overlap with those raised by Reviewer #2 (Public Review). Please refer to the responses provided earlier.

      Minor revision:

      Line 50: The expression "which factors" should be changed to "which lifestyle factor."

      Thank you for the suggestion. As suggested, we have used “which lifestyle factor” instead.

      Lines 91-92: "Aging exhibits variations across and with individuals" appears to be a clerical error. According to the context, it should be "Aging exhibits variations across and within individuals."

      We thank the reviewer for the correction. We have updated the text to read:

      “Aging exhibits variations across and within individuals.”

      Line 154: The authors mentioned "Considering previous studies" but lacked references. Please add the appropriate citations.

      Thank you for pointing this out. We apologize for the oversight. We have now added the appropriate citations to support the statement "Considering previous studies" in the revised manuscript.

      Lines 170-171: "regular exercise ("12 times/week", "3-5 times/week," or "daily or almost every day")"; the first item in parentheses should be "1-2 times/week"? Please verify and correct if necessary. Additionally, check the entire text carefully to avoid confusion caused by clerical errors.

      Thank you for your careful review. We have changed the sentence to "1-2 times/week." We have thoroughly checked the entire manuscript to ensure that no other clerical errors remain.

      Clarifications for Table 1:

      i. The expression "HLI=0" is difficult to understand. Please provide a more straightforward explanation or rephrase it.

      Thank you for your feedback. We have removed the confusing expression and provided a clearer explanation in the table legend for better understanding:

      “For HLI (category), "healthy" corresponds to a score of 4-5, while "unfavorable" corresponds to a score of 0-3.”

      ii. The baseline age is presented as an integer, but the follow-up age is not. Please clarify this discrepancy.

      Thank you for pointing out this discrepancy. We calculated the precise chronological age based on participants' survey dates and birth dates for the biological age calculations. Initially, the table presented age as integers, but we have now updated it to show the precise ages.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      Despite the strengths, multiple analytical decisions have to be explained, justified, or clarified. Also, there is scope to enhance the clarity and coherence of the writing - as it stands, readers will have to go back and forth to search for information. Last, it would be helpful to add line numbers in the manuscript during the revision, as this will help all reviewers to locate the parts we are talking about.

      We thank the reviewer for the suggestions and have added line numbers to the revised manuscript.

      (1) Introduction:

      The introduction is somewhat unmotivated, with key terms/concepts left unexplained until relatively late in the manuscript. One of the main focuses in this work is "hyperaltruistic", but how is this defined? It seems that the authors take the meaning of "willing to pay more to reduce other's pain than their own pain", but is this what the task is measuring? Did participants ever need to PAY something to reduce the other's pain? Note that some previous studies indeed allow participants to pay something to reduce other's pain. And what makes it "HYPER-altruistic" rather than simply "altruistic"?

      As the reviewer noted, we adopted a well-established experimental paradigm to study the context-dependent effect on hyper-altruism. Altruism refers to the fact that people take others’ welfare into account when making decisions that concern both parties. Research paradigms investigating altruistic behavior typically use a social decision task that requires participants to choose between options where their own financial interests are pitted against the welfare of others (FeldmanHall et al., 2015; Hu et al., 2021; Hutcherson et al., 2015; Teoh et al., 2020; Xiong et al., 2020). The hyperaltruistic tendency, on the other hand, refers to subjects’ higher valuation of others’ pain than of their own pain (Crockett et al., 2014, 2015, 2017; Volz et al., 2017). One example of hyperaltruism would be the following scenario: the subject is willing to forgo $2 to reduce others’ pain by 1 unit (social-decision task) but only willing to forgo $1 to reduce his/her own pain by the same amount (self-decision task) (Crockett et al., 2014). Conversely, if subjects are willing to forgo less money to reduce others’ suffering in the social-decision task than in the self-decision task, then no hyperaltruism is observed. Therefore, hyperaltruistic preference can only be measured by collecting subjects’ choices in both the self- and social-decision tasks and comparing the choices across the two tasks.

      In our task, as in the studies before ours (Crockett et al., 2014, 2015, 2017; Volz et al., 2017), subjects in each trial faced two options that differed in the level of pain for others and the monetary payoff for themselves. Based on subjects’ choice data, we can infer through regression analysis how much monetary payoff subjects were willing to trade for a reduction in others’ pain (see Figure 1 and Methods for the experimental details). We have rewritten the introduction and methods sections to make this point clearer to the audience.

      Plus, in the intro, the authors mentioned that the "boundary conditions" remain unexplored, but this idea is never touched again. What do boundary conditions mean here in this task? How do the results/data help with finding out the boundary conditions? Can this be discussed within wider literature in the Discussion section?

      Boundary conditions here specifically refer to the variables or decision contexts that determine whether hyperaltruistic behavior can be elicited. Individual personality traits, motivation and social relationships may all be boundary conditions affecting the emergence of hyperaltruistic behavior. In our task, we specifically focused on the valence of the decision context (gain vs. loss), since previous studies only tested the hyperaltruistic preference in the gain context and the introduction of the loss context might bias subjects’ hyperaltruistic behavior through implicit moral framing.

      We have explained the boundary conditions in the revised introduction (Lines 45 ~ 49).

      “However, moral norm is also context dependent: vandalism is clearly against social and moral norms yet vandalism for self-defense is more likely to be ethically and legally justified (the Doctrine of necessity). Therefore, a crucial step is to understand the boundary conditions for hyperaltruism.”

      Last, what motivated the authors to examine the decision context? It comes somewhat out of the blue that the opening paragraph states that "We set out to [...] decision context", but why? Are there other important factors? Why decision context is more important than studying those others?

      We thank the reviewer for the comment. The hyperaltruistic preference was originally demonstrated between conditions where subjects’ personal monetary gain was pitted against others’ pain (social-condition) or against subjects’ own suffering (self-condition) (Crockett et al., 2014). Follow-up studies found that subjects also exhibited strong egoistic tendencies if instead subjects needed to harm themselves for others’ benefit in the social condition (by flipping the recipients of monetary gain and electric shocks) (Volz et al., 2017). However, these studies have primarily focused on the gain contexts, neglecting the fact that valence could also be an influential factor in biasing subjects’ behavior (difference between gain and loss processing in humans). It is possible that replacing monetary gains with losses in the money-pain trade-off task biases subjects’ hyperaltruistic preference due to heightened vigilance or negative emotions in the face of potential loss (such as loss aversion) (Kahneman & Tversky, 1979; Liu et al., 2020; Pachur et al., 2018; Tom et al., 2007; Usher & McClelland, 2004; Yechiam & Hochman, 2013). Another possibility is that gain and loss contexts may elicit different subjective moral perceptions (or internal moral framings) in participants, affecting their hyperaltruistic preferences (Liu et al., 2017; Losecaat Vermeer et al., 2020; Markiewicz & Czupryna, 2018; Wu et al., 2018). In our manuscript, we did not strive to compare which factors might be more important in eliciting hyperaltruistic behavior, but rather to demonstrate the crucial role played by the decision context and to show that the internal moral framing could be the mediating factor in driving subjects’ hyperaltruistic behavior. In fact, we speculate that the egoistic tendencies found in the Volz et al. (2017) study were partly driven by the subjects’ failure to engage the proper internal moral framing in the social condition (harm for self, see Volz et al., 2017 for details).

      (2) Experimental Design:

      (2a) The experiment per se is largely solid, as it followed a previously well-established protocol. But I am curious about how the participants got instructed? Did the experimenter ever mention the word "help" or "harm" to the participants? It would be helpful to include the exact instructions in the SI.

      In the instructions, we avoided words such as “harm”, “help”, or other terms reminding subjects about the moral judgement of the decisions they were about to make. Instead, we presented the options in a neutral and descriptive manner, focusing only on the relevant components (shocks and money). The instructions for all four conditions are shown in supplementary Fig. 9.

      (2b) Relatedly, the experimental details were not quite comprehensive in the main text. Indeed, the Methods come after the main text, but to be able to guide readers to understand what was going on, it would be very helpful if the authors could include some necessary experimental details at the beginning of the Results section.

      We thank the reviewer for the suggestion. We have now provided a brief introduction of the experimental details in the revised results section (Lines 125 ~ 132).

      “Prior to the money-pain trade-off task, we individually calibrated each subject’s pain threshold using a standard procedure[4–6]. This allowed us to tailor a moderate electric stimulus that corresponded to each subject’s subjective pain intensity. Subjects then engaged in 240 decision trials (60 trials per condition), acting as the “decider” and trading off between monetary gains or losses for themselves and the pain experienced by either themselves or an anonymous “pain receiver” (gain-self, gain-other, loss-self and loss-other, see Supplementary Fig. 8 for the instructions and also see methods for details).”

      (3) Statistical Analysis

      (3a) One of the main analyses uses the harm aversion model (Eq1) and the results section keeps referring to one of the key parameters of it (i.e., k). However, it is difficult to understand the text without going to the Methods section below. Hence it would be very helpful to repeat the equation also in the main text. A similar idea goes to the delta_m and delta_s terms - it will be very helpful to give a clear meaning of them, as nearly all analyses rely on knowing what they mean.

      We thank the reviewer for the suggestion. We have now added the equation of the harm aversion model and provided a more detailed description of the equations in the main text (Lines 150 ~ 155).

      “We also modeled subjects’ choices using an influential model where subjects’ behavior could be characterized by the harm (electric shock) aversion parameter κ, reflecting the relative weights subjects assigned to ∆m and ∆s, the objective difference in money and shocks between the more and less painful options, respectively (∆V=(1-κ)∆m - κ∆s Eq.1, See Methods for details)[4–6]. Higher κ indicates that higher sensitivity is assigned to ∆s than ∆m and vice versa.”
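      As an illustration only (not the authors' code), the harm aversion model in Eq. 1 can be sketched in a few lines. The logistic choice rule and the convention that the probability refers to the more painful option are assumptions made here for concreteness, since the choice function is not reproduced in this response; the numeric values are hypothetical.

```python
import math

def delta_v(kappa, delta_m, delta_s):
    """Eq. 1: Delta V = (1 - kappa) * Delta m - kappa * Delta s."""
    return (1.0 - kappa) * delta_m - kappa * delta_s

def p_choose_more_painful(kappa, gamma, delta_m, delta_s):
    """Assumed logistic choice rule; gamma is the choice consistency."""
    return 1.0 / (1.0 + math.exp(-gamma * delta_v(kappa, delta_m, delta_s)))

# Hypothetical trial: the more painful option offers 5 more money units
# and 5 more shocks than the less painful option.
for kappa in (0.2, 0.5, 0.8):
    p = p_choose_more_painful(kappa, gamma=1.0, delta_m=5.0, delta_s=5.0)
    print(f"kappa={kappa:.1f}  P(more painful option)={p:.3f}")
```

      In this sketch a higher κ lowers the probability of choosing the more painful option, and κ = 0.5 with equal ∆m and ∆s yields indifference.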

      (3b) There is one additional parameter gamma (choice consistency) in the model. Did the authors also examine the task-related difference of gamma? This might be important as some studies have shown that the other-oriented choice consistency may differ in different prosocial contexts.

      To examine the task-related difference of choice consistency (γ), we compared the performance of 4 candidate models:

      Model 1 (M1): The choice consistency parameter γ remains constant across shock recipients (self vs. other) and decision contexts (gain vs. loss).

      Model 2 (M2): γ differs between the self- and other-recipient conditions, with γ<sub>self</sub> and γ<sub>other</sub> representing the choice consistency when pain is inflicted on him/her-self or the other-recipient.

      Model 3 (M3): γ differs between the gain and loss conditions, with γ<sub>gain</sub> and γ<sub>loss</sub> representing the choice consistencies in the gain and loss contexts, respectively.

      Model 4 (M4): γ varies across four conditions, with γ<sub>self-gain</sub>, γ<sub>other-gain</sub>, γ<sub>self-loss</sub> and γ<sub>other-loss</sub> capturing the choice consistency in each condition.

      Supplementary Fig. 10 shows, after fitting all the models to subjects’ choice behavioral data, model 1 (M1) performed the best among all the four candidate models in both studies (1 & 2) with the lowest Bayesian Information Criterion (BIC). Therefore, we conclude that factors such as the shock recipients (self vs. other) and decision contexts (gain vs. loss) did not significantly influence subjects’ choice consistency and report model results using the single choice consistency parameter.
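      For readers unfamiliar with the criterion, the comparison above can be sketched as follows. The log-likelihoods and parameter counts are hypothetical placeholders, not values from our data, and the sketch scores a single subject with 240 trials; how BIC is aggregated across subjects is not shown here.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L); lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Hypothetical fits: richer models gain a little likelihood but pay for
# every additional gamma parameter.
n_trials = 240
candidates = {
    "M1": {"log_lik": -120.0, "n_params": 5},
    "M2": {"log_lik": -119.5, "n_params": 6},
    "M3": {"log_lik": -119.6, "n_params": 6},
    "M4": {"log_lik": -118.9, "n_params": 8},
}

scores = {
    name: bic(fit["log_lik"], fit["n_params"], n_trials)
    for name, fit in candidates.items()
}
best = min(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: BIC = {score:.2f}")
print("Preferred model:", best)
```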

      (3c) I am not fully convinced that the authors included two types of models: the harm aversion model and the logistic regression models. Indeed, the models look similar, and the authors have acknowledged that. But I wonder if there is a way to combine them? For example:

      Choice ~ delta_V * context * recipient (*Oxt_v._placebo)

      The calculation of delta_V follows Equation 1.

      Or the conceptual question is, if the authors were interested in the specific and independent contribution of delta_m and delta_s to behavior, as their logistic model did, why did the authors examine the harm aversion first, where a parameter k is controlling for the trade-off? One way to find it out is to properly run different models and run model comparisons. In the end, it would be beneficial to only focus on the "winning" model to draw inferences.

      The reviewer raised an excellent point here. According to the logistic regression model, we have:

      Where P is the probability of selecting the less harmful option. Similarly, if we combine Eq.1 (∆V=(1-κ)∆m-κ∆s) and Eq.2 of the harm aversion model, we have:

      If we ignore the constant term β<sub>0</sub> from the logistic regression model, the harm aversion model is simply a reparameterization of the logistic regression model. The harm aversion model was implemented first to derive the harm aversion parameter (κ), a parameter in the range [0, 1] that quantifies how subjects weigh the relative contributions of Δm and Δs between options in their decision processes. Since previous studies used the term κ<sub>other</sub>-κ<sub>self</sub> to define the magnitude of hyperaltruistic preference, we adopted a similar approach to compare our results with previous research under the same theoretical framework. However, in order to investigate the independent contributions of Δm and Δs, we have to take γ into account. The coefficients β<sub>∆m</sub> and β<sub>∆s</sub> in the logistic regression model are not necessarily correlated by nature, whereas in the harm aversion model the coefficients (1-κ) and κ are always perfectly negatively correlated (see Eq. 1). Only after multiplying by γ does the correlation between γ(1-κ) and γκ vary depending on the specific distribution of γ and κ. In summary, we followed the approach of previous research to estimate the harm aversion parameter κ, both to compare our results with previous studies and to capture the relative influence of Δm and Δs. When we studied the contextual effects (gain vs. loss or placebo vs. control) on subjects’ behavior, we further investigated the contextual effect on how subjects evaluated Δm and Δs, respectively. The two models (logistic regression model and harm aversion model) in our study are mathematically the same and are not competing candidate models. Instead, they represent different aspects from which our data can be examined.
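      The reparameterization can be made concrete with a minimal sketch. It assumes, for illustration, that the intercept is dropped and that the log-odds are coded for the more painful option, so that β<sub>∆m</sub> = γ(1-κ) and β<sub>∆s</sub> = -γκ; under the opposite coding (log-odds of the less painful option) both coefficients simply flip sign. The function names are hypothetical.

```python
def to_logistic(kappa, gamma):
    """Harm aversion parameters -> logistic slopes (no intercept)."""
    beta_dm = gamma * (1.0 - kappa)
    beta_ds = -gamma * kappa
    return beta_dm, beta_ds

def to_harm_aversion(beta_dm, beta_ds):
    """Logistic slopes -> (kappa, gamma) under the same sign convention."""
    gamma = beta_dm - beta_ds
    if gamma <= 0:
        raise ValueError("slopes do not map onto a positive choice consistency")
    kappa = -beta_ds / gamma
    return kappa, gamma

# Round trip with hypothetical parameter values.
beta_dm, beta_ds = to_logistic(kappa=0.3, gamma=2.0)
kappa, gamma = to_harm_aversion(beta_dm, beta_ds)
print(beta_dm, beta_ds, kappa, gamma)
```

      The round trip shows why the two slopes carry more information than κ alone: κ fixes only their ratio, while γ sets their overall scale.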

      We also compared the harm aversion model with and without the constant term β<sub>0</sub> in the choice function. Adding a constant term β<sub>0</sub>, the above Equation 2 becomes:

      As the following figure shows, the hyperaltruistic parameters (κ<sub>other</sub>-κ<sub>self</sub>) calculated from the harm aversion model with the constant term (panels A & B) have almost identical patterns as the model without the constant term (panels C & D, i.e. Figs. 2B & 4B in the original manuscript) in both studies.

      Author response image 1.


       

      (3d) The interpretation of the main OXT results needs to be more cautious. According to the operationalization, "hyperaltruistic" is the reduction of pain of others (higher % of choosing the less painful option) relative to the self. But relative to the placebo (as baseline), OXT did not increase the % of choosing the less painful option for others, rather, it decreased the % of choosing the less painful option for themselves. In other words, the degree of reducing other's pain is the same under OXT and placebo, but the degree of benefiting self-interest is reduced under OXT. I think this needs to be unpacked, and some of the wording needs to be changed. I am not very familiar with the OXT literature, but I believe it is very important to differentiate whether OXT is doing something on self-oriented actions vs other-oriented actions. Relatedly, for results such as that in Figure 5A, it would be helpful to not only look at the difference but also the actual magnitude of the sensitivity to the shocks, for self and others, under OXT and placebo.

      We thank the reviewer for this thoughtful comment. As the reviewer correctly pointed out, “hyperaltruism” can be defined as “higher % of choosing the less painful option to the others relative to the self”. Closer examination of the results showed that both the degree of reducing others’ pain and the degree of reducing subjects’ own pain decreased under OXT (Figure 4A). More specifically, our results do not support the claim that “In other words, the degree of reducing others’ pain is the same under OXT and placebo, but the degree of benefiting self-interest is reduced under OXT.” Instead, the results show a significant reduction in the choice of the less painful option under OXT treatment for both the self and other conditions (the interaction effect of OXT vs. placebo and self vs. other: F<sub>1,45</sub> = 16.812, P < 0.001, η<sup>2</sup> = 0.272; simple effect of OXT vs. placebo in the self-condition: F<sub>1,45</sub> = 59.332, P < 0.001, η<sup>2</sup> = 0.569; OXT vs. placebo in the other-condition: F<sub>1,45</sub> = 14.626, P < 0.001, η<sup>2</sup> = 0.245; repeated-measures ANOVA, see Figure 4A).
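      As a quick consistency check on the statistics above, and assuming that the reported η<sup>2</sup> values are partial eta squared with 1 and 45 degrees of freedom, the effect sizes can be recovered directly from the F values:

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# (F value, reported eta squared) for the three effects quoted above.
reported = [(16.812, 0.272), (59.332, 0.569), (14.626, 0.245)]
for f_value, eta_reported in reported:
    eta = partial_eta_squared(f_value, df_effect=1, df_error=45)
    print(f"F = {f_value:6.3f}  implied = {eta:.3f}  reported = {eta_reported:.3f}")
```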

      We also performed mixed-effect logistic regression analyses where subjects’ choices were regressed against ∆<sub>s</sub> and ∆<sub>m</sub> in different valence (gain vs. loss) and recipient (self vs. other) conditions in both studies 1 & 2 (Supplementary Figs. 1 & 6). As we replot supplementary Fig. 6 and panel B (included as Supplementary Fig. 8 in the supplementary materials) in the above figure, we found a significant treatment × ∆<sub>s</sub> (differences in shock magnitude between the more and less painful options) interaction effect (β=-0.136±0.029, P < 0.001, 95% CI=[-0.192, -0.079]), indicating that subjects’ sensitivities towards pain were indeed different between the placebo and OXT treatments for both self and other conditions. Furthermore, the significant four-way ∆<sub>s</sub> × treatment (OXT vs. Placebo) × context (gain vs. loss) × recipient (self vs. other) interaction effect (β=0.125±0.053, P=0.018, 95% CI=[0.022, 0.228]) in the regression analysis, followed by significant simple effects (in the OXT treatment: ∆<sub>s</sub> × recipient effect in the gain context: F<sub>1,45</sub> = 7.622, P = 0.008, η<sup>2</sup> = 0.145; ∆<sub>s</sub> × recipient effect in the loss context: F<sub>1,45</sub> = 7.966, P = 0.007, η<sup>2</sup> = 0.150), suggested that under OXT treatment, participants showed a greater sensitivity toward ∆<sub>s</sub> (see asterisks in the OXT condition in panel B) in the other-condition than in the self-condition, thus restoring the hyperaltruistic behavior in the loss context.

      As the reviewer suggested, OXT’s effect on hyperaltruism does manifest separately in subjects’ harm sensitivities for self- and other-oriented actions. We followed the reviewer’s suggestions and examined the actual magnitude of the sensitivities to shocks for both the self and other conditions (panel B in the figure above). The administration of OXT (compared to the Placebo treatment, panel B in the figure above) significantly reduced participants’ pain sensitivity (treatment × ∆<sub>s</sub>: β=-0.136±0.029, P < 0.001, 95% CI=[-0.192, -0.079]), yet also restored the harm sensitivity patterns in both the gain and loss conditions. These results are included in the supplementary figures (6 & 8) as well as in the main text.

      Recommendations:

      (1) For Figures 2A-B, it would be great to calculate the correlation separately for gain and loss, as in other figures.

      We assume that the reviewer is referring to Figures 3A & B. We did not present the correlations separately for the gain and loss contexts because the correlations between an individual’s IH (instrumental harm), IB (impartial beneficence) and hyperaltruistic preferences were not significantly modulated by the contextual factors. The interaction effects in both Figs. 3A & B and Supplementary Fig. 5 (also see Tables S1 & S2) are as follows: Study 1: valence × IH effect: β=0.016±0.022, t<sub>152</sub>=0.726, P=0.469; valence × IB effect: β=0.004±0.031, t<sub>152</sub>=0.115, P=0.908; Study 2, placebo condition: valence × IH effect: β=0.018±0.024, t<sub>84</sub>=0.030, P=0.463; valence × IB effect: β=0.051±0.030, t<sub>84</sub>=1.711, P=0.702. We have added these statistics to the main text following the reviewer’s suggestions.

      (2) "by randomly drawing a shock increment integer ∆s (from 1 to 19) such that [...] did not exceed 20 (𝑆+ ≤ 20)." I am not sure if a random drawing following a uniform distribution can guarantee S is smaller than 20. More details are needed. Same for the monetary magnitude.

      We are sorry for the lack of clarity in the method description. For the task design, we adopted the original design from previous literature (Crockett et al., 2014, 2017). More specifically:

      “Specifically, each trial was determined by a combination of the differences in shocks (Δs, ranging from 1 to 19, with an increment of 1) and money (Δm, ranging from ¥0.2 to ¥19.8, with an increment of ¥0.2) between the two options, resulting in a total of 19×99=1881 pairs of [Δs, Δm]. To ensure the trials were suitable for most subjects, we evenly distributed the desired ratio Δm / (Δs + Δm) between 0.01 and 0.99 across 60 trials for each condition. For each trial, we selected from the [Δs, Δm] pool the pair closest to the specific Δm / (Δs + Δm) ratio, which was then used to determine the actual money and shock amounts of the two options. The shock amount (S<sub>less</sub>) for the less painful option was an integer drawn from the discrete uniform distribution [1-19], constrained by S<sub>less</sub> + ∆s < 20. Similarly, the money amount (M<sub>less</sub>) for the less painful option was drawn from a discrete uniform distribution [¥0.2 - ¥19.8], with the constraint of M<sub>less</sub> + ∆m < 20. Once S<sub>less</sub> and M<sub>less</sub> were selected, the shock (S<sub>more</sub>) and money (M<sub>more</sub>) magnitudes for the more painful option were calculated as: S<sub>more</sub> = S<sub>less</sub> + ∆s, M<sub>more</sub> = M<sub>less</sub> + ∆m”

      We have added these details to the methods section (Lines 520-533).
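      The procedure can be sketched as follows. This is an illustrative reconstruction, not the original task code: the function names and the seed are hypothetical, money is handled in integer units of ¥0.2 to avoid floating-point drift, the ratio is computed on raw shock counts and yuan, and the upper bound is taken to be inclusive (at most 20), because a strict bound would leave no valid draw when ∆s = 19.

```python
import random

MAX_SHOCK = 20          # assumed inclusive upper bound on shocks per option
MAX_MONEY_UNITS = 100   # 20 yuan expressed in units of 0.2 yuan

def build_pool():
    """All [delta_s, delta_m] pairs: 19 shock steps x 99 money steps."""
    return [(ds, dm) for ds in range(1, 20) for dm in range(1, 100)]

def closest_pair(pool, target_ratio):
    """Pair whose delta_m / (delta_s + delta_m) is nearest to target_ratio."""
    def distance(pair):
        ds, dm_units = pair
        dm = 0.2 * dm_units
        return abs(dm / (ds + dm) - target_ratio)
    return min(pool, key=distance)

def make_trials(n_trials=60, seed=0):
    rng = random.Random(seed)
    pool = build_pool()
    trials = []
    for i in range(n_trials):
        # Target ratios evenly spaced between 0.01 and 0.99.
        target = 0.01 + i * (0.99 - 0.01) / (n_trials - 1)
        ds, dm_units = closest_pair(pool, target)
        # Draw the less painful option so the more painful one stays in range.
        s_less = rng.randint(1, MAX_SHOCK - ds)
        m_less_units = rng.randint(1, MAX_MONEY_UNITS - dm_units)
        trials.append({
            "delta_s": ds,
            "delta_m": round(0.2 * dm_units, 1),
            "s_less": s_less,
            "s_more": s_less + ds,
            "m_less": round(0.2 * m_less_units, 1),
            "m_more": round(0.2 * (m_less_units + dm_units), 1),
        })
    return trials

trials = make_trials()
print(len(trials), trials[0])
```

      Because the range of the second draw depends on the selected increment, the bound on the more painful option holds by construction rather than by chance.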

      Reviewer #2:

      (1) The theoretical hypothesis needs to be better justified. There are studies addressing the neurobiological mechanism of hyperaltruistic tendency, which the authors unfortunately skipped entirely.

      Also in recommendation #1:

      (1) In the Introduction, the authors claim that "the mechanistic account of the hyperaltruistic phenomenon remains unknown". I think this is too broad of a criticism and does not do justice to prior work that does provide some mechanistic account of this phenomenon. In particular, I was surprised that the authors did not mention at all a relevant fMRI study that investigates the neural mechanism underlying hyperaltruistic tendency (Crockett et al., 2017, Nature Neuroscience). There, the researchers found that individual differences in hyperaltruistic tendency in the same type of moral decision-making task is better explained by reduced neural responses to ill-gotten money (Δm in the Other condition) in the brain reward system, rather than heightened neural responses to others' harm. Moreover, such neural response pattern is related to how an immoral choice would be judged (i.e., blamed) by the community. Since the brain reward system is consistently involved in Oxytocin's role in social cognition and decision-making (e.g., Dolen & Malenka, 2014, Biological Psychiatry), it is important to discuss the hypothesis and results of the present research in the context of this literature.

      We fully agree with the reviewer that the expression “mechanistic account of the hyperaltruistic phenomenon remains unknown” in our original manuscript can be misleading to the audience. Indeed, we were aware of the major findings in the field and cited the seminal work on hyperaltruism and its related neural mechanism (Crockett et al., 2014, 2015, 2017). We have revised the text in the introduction to better reflect this point and added further discussion as to how oxytocin might play a role:

      “For example, it was shown that the hyperaltruistic preference modulated neural representations of the profit gained from harming others via the functional connectivity between the lateral prefrontal cortex, a brain area involved in moral norm violation, and profit sensitive brain regions such as the dorsal striatum[6].” (Lines 41~45)

      “Oxytocin has been shown to play a critical role in social interactions such as maternal attachment, pair bonding, consociate attachment and aggression in a variety of animal models[42,43]. Humans are endowed with higher cognitive and affective capacities and exhibit far more complex social cognitive patterns[44]. ” (Lines 86~90)

      (2) There are some important inconsistencies between the preregistration and the actual data collection/analysis, which the authors did not justify.

      Also in recommendations:

      (4) It is laudable that the authors pre-registered the procedure and key analysis of the Oxytocin study and determined the sample size beforehand. However, in the preregistration, the authors claimed that they would recruit 30 participants for Experiment 1 and 60 for Experiment 2, without justification. In the paper, they described a "prior power analysis", which deviated from their preregistration. It is OK to deviate from preregistration, but this needs to be explicitly mentioned and addressed (why the deviation occurred, why the reported approach was justifiable, etc.).

      We sincerely appreciate the reviewer’s thorough assessment of our manuscript. In the more exploratory study 1, we found that the loss decision context effectively diminished subjects’ hyperaltruistic preference. Based on this finding, we pre-registered study 2 and hypothesized that: 1) The administration of OXT may salvage subjects’ hyperaltruistic preference in the loss context; 2) The administration of OXT may reduce subjects’ sensitivities towards electric shocks (but not necessarily their moral preference), due to the well-established results relating OXT to enhanced empathy for others (Barchi-Ferreira & Osório, 2021; Radke et al., 2013) and the processing of negative stimuli (Evans et al., 2010; Kirsch et al., 2005; Wu et al., 2020); and 3) The OXT effect might be context specific, depending on the particular combination of valence (gain vs. loss) and shock recipient (self vs. other) (Abu-Akel et al., 2015; Kapetaniou et al., 2021; Ma et al., 2015).

      As our results suggested, the administration of OXT indeed restored subjects’ hyperaltruistic preference (confirming hypothesis 1, Figure 4A). Also, OXT decreased subjects’ sensitivities towards electric shocks in both the gain and loss conditions (supplementary Fig. 6 and supplementary Fig. 8), consistent with our second hypothesis. We must admit that our hypothesis 3 was rather vague, since a seminal study clearly demonstrated the context-dependent effect of OXT in human cooperation and conflict depending on the group membership of the subjects (De Dreu et al., 2010, 2020). Although our results partially validated our hypothesis 3 (supplementary Fig. 6), we did not make specific predictions as to the direction and the magnitude of the OXT effect.

      The main inconsistency is related to the sample size. When we carried out study 1, we recruited both male and female subjects. After we identified the context effect on the hyperaltruistic preference, we decided to pre-register and perform study 2 (the OXT study). We originally made a rough estimate of 60 male subjects for study 2. While conducting study 2, we also went through the literature of OXT effect on social behavior and realized that the actual subject number around 45 might be enough to detect the main effect of OXT. Therefore, we settled on the number of 46 (study 2) reported in the manuscript. Correspondingly, we increased the subject number in study 1 to the final number of 80 (40 males) to make sure the subject number is enough to detect a small-to-medium effect, as well as to have a fair comparison between study 1 and 2 (roughly equal number of male subjects). It should be noted that although we only reported all the subjects (male & female) results of study 1 in the manuscript, the main results remain very similar if we only focus on the results of male subjects in study 1 (see the figure below). We believe that these results, together with the placebo treatment group results in study 2 (male only), confirmed the validity of our original finding.

      Author response image 2.

      Author response image 3.

      We have included additional texts (Lines 447 ~ 452) in the Methods section for the discrepancy between the preregistered and actual sample sizes in the revised manuscript:

      “It should be noted that in the preregistration we originally planned to recruit 60 male subjects for Study 2 but ended up recruiting 46 male subjects (mean age =  years) based on the sample size reported in previous oxytocin studies[57,69]. Additionally, a power analysis suggested that a sample size > 44 should be enough to detect a small-to-medium effect size of oxytocin (Cohen’s d=0.24, α=0.05, power=0.8) using a 2 × 2 × 2 within-subject design[76].”

      (3) Some of the exploratory analysis seems underpowered (e.g., large multiple regression models with only about 40 participants).

      We thank the reviewer for the comment and appreciate the concern that the sample size could affect the reliability of the results in the multiple regression analyses.

      In Fig. 2, the multiple regression analyses were conducted after we observed a valence-dependent effect on hyperaltruism (Fig. 2A) and the regression was constructed accordingly:

      Choice ~ ∆s*context*recipient + ∆m*context*recipient + (1 + ∆s*context*recipient + ∆m*context*recipient | subject)

      Where ∆s and ∆m denote the differences in shock level and monetary reward between the more and less painful options, context denotes the monetary valence (gain vs. loss) and recipient denotes the identity of the shock recipient (self vs. other).

      Since we have 240 trials for each subject and a total of 80 subjects in Study 1, we believe that this is a reasonable regression analysis to perform.

      In Fig. 3, the multiple regression analyses were indeed exploratory. More specifically, we ran 3 multiple linear regressions:

      hyperaltruism~EC*context+IH*context+IB*context

      Relative harm sensitivity~ EC*context+IH*context+IB*context

      Relative money sensitivity~ EC*context+IH*context+IB*context

      Where hyperaltruism is defined as κ<sub>other</sub> - κ<sub>self</sub>, relative harm sensitivity as other β<sub>∆s</sub> - self β<sub>∆s</sub>, and relative money sensitivity as other β<sub>∆m</sub> - self β<sub>∆m</sub>. EC (empathic concern), IH (instrumental harm) and IB (impartial beneficence) were subjects’ scores from the corresponding questionnaires.

      For the first regression, we tested whether EC, IH and IB scores were related to hyperaltruism and it should be noted that this was tested on 80 subjects (Study 1). After we identified the effect of IH on hyperaltruism, we ran the following two regressions. The reason we still included IB and EC as predictors in these two regression analyses was to remove potential confounds caused by EC and IB since previous research indicated that IB, IH and EC could be correlated (Kahane et al., 2018).

      In study 2, we performed the following regression analyses again to validate our results (Placebo treatment in study 2 should have similar results as found in study 1).

      Relative harm sensitivity~ EC*context+IH*context+IB*context

      Relative money sensitivity~ EC*context+IH*context+IB*context

      Again, we added IB and EC only to control for nuisance effects of these covariates. As indicated in Fig. 5 C-D, the placebo condition in study 2 replicated our previous findings in study 1, and OXT administration effectively removed the interaction effect between IH and valence (gain vs. loss) on subjects’ relative harm sensitivity.

      To present our data and results more objectively, we have changed the text in the results section and pointed out that the regression analysis:

      hyperaltruism~EC*context+IH*context+IB*context

      was exploratory (Lines 186-192).

      “We tested how hyperaltruism was related to both IH and IB across decision contexts using an exploratory multiple regression analysis. Moral preference, defined as κ<sub>other</sub> - κ<sub>self</sub>, was negatively associated with IH (β=-0.031±0.011, t<sub>156</sub>=-2.784, P =0.006) but not with IB (β=0.008±0.016, t<sub>156</sub>=0.475, P=0.636) across gain and loss contexts, reflecting a general connection between moral preference and IH (Fig. 3A & B).”

      (4) Inaccurate conceptualization of utilitarian psychology and the questionnaire used to measure it.

      Also in recommendations:

      (2) Throughout the paper, the authors placed lots of weight on individual differences in utilitarian psychology and the Oxford Utilitarianism Scale (OUS). I am not sure this is the best individual difference measure in this context. I don't see a conceptual fit between the psychological construct that OUS reflects, and the key psychological processes underlying the behaviors in the present study. As far as I understand it, the conceptual core of utilitarian psychology that OUS captures is the maximization of greater goods. Neither the Instrumental Harm (IH) component nor the Impartial Beneficence (IB) component reflects a tradeoff between the personal interests of the decision-making agent and a moral principle. The IH component is about the endorsement of harming a smaller number of individuals for the benefit of a larger number of individuals. The IB component is about treating self, close others, and distant others equally. However, the behavioral task used in this study is neither about distributing harm between a smaller number of others and a larger number of others nor about benefiting close or distant others. The fact that IH showed some statistical association with the behavioral tendency in the present data set could be due to the conceptual overlap between IH and an individual's tendency to inflict harm (e.g., psychopathy; Table 7 in Kahane et al., 2018, which the authors cited). I urge the authors to justify more why they believe that conceptually OUS is an appropriate individual difference measure in the present study, and if so, interpret their results in a clearer and justifiable manner (taking into account the potential confound of harm tendency/psychopathy).

      We thank the reviewer for the thoughtful comment and agree that “IH component is about the endorsement of harming a smaller number of individuals for the benefit of a larger number of individuals. The IB component is about treating self, close others, and distant others equally”. As we mentioned in the previous response to the reviewer, we first ran an exploratory multiple linear regression analysis of hyperaltruistic preference (κ<sub>other</sub> - κ<sub>self</sub>) against IB and IH in study 1, based on the hypothesis that the reduction of hyperaltruistic preference in the loss condition might be due to 1) subjects’ altered attitudes between IB and hyperaltruistic preference across the gain and loss conditions, and/or 2) a change in how the moral norm was perceived in the loss condition, which in turn affected the correlation between IH and hyperaltruistic preference. As Fig. 3 shows, we did not find a significant IB effect on hyperaltruistic preference (κ<sub>other</sub> - κ<sub>self</sub>), nor on the relative harm or money sensitivity (supplementary Fig. 3). These results excluded the possibility that subjects with higher IB might treat self and others more equally and therefore show less hyperaltruistic preference. On the other hand, we found a strong correlation between hyperaltruistic preference and IH (Fig. 3A): subjects with higher IH scores showed less hyperaltruistic preference. Since the hyperaltruistic preference (κ<sub>other</sub> - κ<sub>self</sub>) is a compound variable, we further broke it down into subjects’ relative sensitivity to harm and money (other β<sub>∆s</sub> - self β<sub>∆s</sub> and other β<sub>∆m</sub> - self β<sub>∆m</sub>, respectively). The follow-up regression analyses revealed that the correlation between subjects’ relative harm sensitivity and IH was altered by the decision context (gain vs. loss, Fig. 3C-D). These results are consistent with our hypothesis that for subjects to engage in the utilitarian calculation, they should first realize that there is a moral dilemma (harming others to make monetary gain in the gain condition). When there is less perceived moral conflict (due to the framing of the decision context as avoiding loss in the loss condition), the correlation between subjects’ relative harm sensitivity and IH became non-significant (Fig. 3C). It is worth noting that these results were replicated in the placebo condition of study 2, further indicating that the role of OXT is to affect how the decision context is morally framed.

      The reviewer also raised the interesting possibility that the correlation between subjects’ behavioral tendency and IH may be confounded by the fact that IH is also correlated with other traits such as psychopathy. Indeed, in the Kahane et al., 2018 paper, the authors showed that IH was associated with subclinical psychopathy in a lay population. Although we only collected and included IB and empathic concern (EC) scores as control variables and in principle cannot rule out the influence of psychopathy, we argue that this is unlikely to be the case. First, psychopaths by definition “only care about their own good” (Kahane et al., 2018). However, subjects in our studies, as well as in previous research, showed greater aversion to harming others (compared to harming themselves) in the gain conditions. This is opposite to the prediction of psychopathy. Even in the loss condition, subjects showed similar levels of aversion to harming others (vs. harming themselves), indicating that our subjects valued their own and others’ well-being similarly. Second, although there appears to be an association between utilitarian judgement and psychopathy (Glenn et al., 2010; Kahane et al., 2015), the fact that people also possess a form of universal or impartial beneficence in their utilitarian judgements suggests that psychopathy alone is not sufficient to explain subjects’ hyperaltruistic behavior.

      We have thus rewritten part of the results to clarify our rationale for using the Oxford Utilitarianism Scale (especially the IH and IB) to establish the relationship between moral traits and subjects’ decision preference (Lines 212-215):

      “Furthermore, our results are consistent with the claim that profiting from inflicting pains on another person (IH) is inherently deemed immoral[1]. Hyperaltruistic preference, therefore, is likely to be associated with subjects’ IH dispositions.”

      (3) Relatedly, in the Discussion, the authors mentioned "the money-pain trade-off task, similar to the well-known trolley dilemma". I am not sure if this statement is factually accurate because the "well-known trolley dilemma" is about a disinterested third-party weighing between two moral requirements - "greatest good for the greatest number" (utilitarianism) and "do no harm" (Kantian/deontology), not between a moral requirement and one's own monetary interest (which is the focus of the present study). The analogy would be more appropriate if the task required the participants to trade off between, for example, harming one person in exchange for a charitable donation, as a recent study employed (Siegel et al., 2022, A computational account of how individuals resolve the dilemma of dirty money. Scientific reports). I urge the authors to go through their use of "utilitarian/utilitarianism” in the paper and make sure their usage aligns with the definition of the concept and the philosophical implications.

      We thank the reviewer for prompting us to think over the difference between our task and the trolley dilemma. Indeed, the trolley dilemma refers to a disinterested third party’s decision between two moral requirements, namely utilitarianism and deontology. In our study, when the shock recipient was “other”, our task could be interpreted either as a decision between the “moral norm of no harm (deontology) and one’s self-interest maximization (utilitarian)”, or as a decision between the “greatest good for both parties (utilitarian) vs. do no harm (deontology)”, though the latter interpretation typically requires differential weighing of one’s own benefits versus the benefits of others (Fehr & Schmidt, 1999; Saez et al., 2015). In fact, it could be argued that the utilitarian account applies not only to the third party’s well-being, but also to our own well-being, or to “that of those near or dear to us” (Kahane et al., 2018).

      We acknowledge that a direct analogy between our task and the trolley dilemma may be lacking and have therefore deleted the trolley example from the discussion.

      (5) Related to the above point, the sample size of Study 2 was calculated based on the main effect of oxytocin. However, the authors also reported several regression models that seem to me more like exploratory analyses. Their sample size may not be sufficient for these analyses. The authors should: a) explicitly distinguish between their hypothesis-driven analysis and exploratory analysis; b) report achieved power of their analysis.

      We appreciate the reviewer’s thorough reading of our manuscript. Following the reviewer’s suggestions, we have explicitly stated in the revised manuscript which analyses were exploratory and which were hypothesis-driven. As requested, we also added the achieved power to the main text (Lines 274-279):

      “The effect size (Cohen’s f<sup>2</sup>) for this exploratory analysis was calculated to be 0.491 and 0.379 for the placebo and oxytocin conditions, respectively. The post hoc power analysis with a significance level of α = 0.05, 7 regressors (IH, IB, EC, decision context, IH×context, IB×context, and EC×context), and sample size of N = 46 yielded achieved power of 0.910 (placebo treatment) and 0.808 (oxytocin treatment).”
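
      To put these effect sizes on a more familiar scale, Cohen’s f² for a regression model can be converted to the proportion of variance explained, since f² = R²/(1 − R²). The sketch below assumes that the reported values are whole-model f² values and that the noncentrality parameter follows the G*Power convention λ = f² × N. Neither assumption is stated in the response, and the function names are illustrative.

```python
def r_squared_from_f2(f2):
    """Invert Cohen's f2 = R2 / (1 - R2) to recover R2."""
    return f2 / (1.0 + f2)


def noncentrality(f2, n):
    """Noncentrality parameter under the assumed convention lambda = f2 * N."""
    return f2 * n


# f2 values and N = 46 are taken from the quoted passage above.
for label, f2 in [("placebo", 0.491), ("oxytocin", 0.379)]:
    r2 = r_squared_from_f2(f2)
    lam = noncentrality(f2, 46)
    print(f"{label}: R2 = {r2:.3f}, lambda = {lam:.3f}")
```

      Under these assumptions, f² = 0.491 and 0.379 correspond to R² of roughly 0.33 and 0.27, respectively.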

      (6) Do the authors collect reaction times (RT) information? Did the decision context and oxytocin modulate RT? Based on their procedure, it seems that the authors adopted a speeded response task, therefore the RT may reflect some psychological processes independent of choice. It is also possible (and recommended) that the authors use the drift-diffusion model to quantify latent psychological processes underlying moral decision-making. It would be interesting to see if their manipulations have any impact on those latent psychological processes, in addition to explicit choice, which is the endpoint product of the latent psychological processes. There are some examples of applying DDM to this task, which the authors could refer to if they decide to go down this route (Yu et al, 2021, How peer influence shapes value computation in moral decision-making. Cognition.)

      We did collect the RT information for this experiment. As demonstrated in the figure below, participants exhibited significantly longer RT in the loss context compared to the gain context (Study 1: main effect of decision context: F<sub>1,79</sub>=20.043, P < 0.001, η<sup>2</sup> =0.202; Study 2-placebo: F<sub>1,45</sub>=17.177, P < 0.001, η<sup>2</sup> =0.276). In addition to this effect of context, decisions were significantly slower in the other-condition compared to the self-condition (Study 1: main effect of recipient: F<sub>1,79</sub>=4.352, P < 0.040, η<sup>2</sup> =0.052; Study 2-placebo: F<sub>1,45</sub>=5.601, P < 0.022, η<sup>2</sup> =0.111), which replicates previous research findings (Crockett et al., 2014). However, the difference in response time between recipients was not modulated by decision context (Study 1: context × recipient interaction: F<sub>1,79</sub>=1.538, P < 0.219, η<sup>2</sup> =0.019; Study 2-placebo: F<sub>1,45</sub>=2.631, P < 0.112, η<sup>2</sup> =0.055). Additionally, the results of the oxytocin study (study 2) revealed no evidence supporting any effect of oxytocin on reaction time. Neither the main effect (treatment: placebo vs. oxytocin) nor the interaction effects of oxytocin on response time were statistically significant (main effect of OXT treatment: F<sub>1,45</sub>=2.380, P < 0.230, η<sup>2</sup> =0.050; treatment × context: F<sub>1,45</sub>=2.075, P < 0.157, η<sup>2</sup> =0.044; treatment × recipient: F<sub>1,45</sub>=0.266, P < 0.609, η<sup>2</sup> =0.006; treatment × context × recipient: F<sub>1,45</sub>=2.909, P < 0.095, η<sup>2</sup> =0.061).
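
      The effect sizes reported above can be cross-checked from the F statistics alone. The sketch below assumes that the η² values are partial eta squared, computed as F × df_effect / (F × df_effect + df_error). The response does not name the variant, and `partial_eta_squared` is an illustrative helper.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (label, F, df_effect, df_error) taken from the reaction-time results above.
reported = [
    ("Study 1, context", 20.043, 1, 79),
    ("Study 2 placebo, context", 17.177, 1, 45),
    ("Study 1, recipient", 4.352, 1, 79),
    ("Study 2 placebo, recipient", 5.601, 1, 45),
]

for label, f_value, df1, df2 in reported:
    print(f"{label}: {partial_eta_squared(f_value, df1, df2):.3f}")
```

      Under this assumption, the four values come out as 0.202, 0.276, 0.052 and 0.111, matching the reported effect sizes.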

      Author response image 4.

      We also agree that it would be interesting to investigate how OXT might affect the dynamics of the decision process using a drift-diffusion model (DDM). However, we have already shown in the original manuscript that OXT increased subjects’ relative harm sensitivities. If a canonical DDM were adopted here, such an OXT effect would most likely correspond to an increased drift rate for the relative harm sensitivity, which we feel still aligns with the current framework in general. In future studies, including further manipulations such as time pressure might be a more comprehensive approach to investigating the effect of OXT on DDM-related decision variables such as attribute drift rate, initial bias, decision threshold and attribute synchrony.

      (7) This is just a personal preference, but I would avoid metaphoric language in a scientific paper (e.g., rescue, salvage, obliterate). Plain, neutral English terms can express the same meaning clearly (e.g., restore, vanish, eliminate).

      Again, we thank the reviewer for the suggestion and have since modified the terms.

      Reviewer #3:

      The primary weakness of the paper concerns its framing. Although it purports to be measuring "hyper-altruism" it does not provide evidence to support why any of the behavior being measured is extreme enough to warrant the modifier "hyper" (and indeed throughout I believe the writing tends toward hyperbole, using, e.g., verbs like "obliterate" rather than "reduce"). More seriously, I do not believe that the task constitutes altruism, but rather the decision to engage, or not engage, in instrumental aggression.

      We agree with the reviewer (and reviewer # 2) that plain and clear English should be used to describe our results and have since modified those terms.

      However, the term “hyperaltruism”, which is the main theme of our study, was originally proposed in a seminal paper (Crockett et al., 2014) and has since been widely adopted in related studies (Crockett et al., 2014, 2015, 2017; Volz et al., 2017; Zhan et al., 2020). The term “hyperaltruism” was introduced to emphasize the difference from altruism (Chen et al., 2024; FeldmanHall et al., 2015; Hu et al., 2021; Hutcherson et al., 2015; Lockwood et al., 2017; Xiong et al., 2020). Hyperaltruism does not indicate extreme altruism. Instead, it simply reflects the fact that “we are more willing to sacrifice gains to spare others from harm than to spare ourselves from harm” (Volz et al., 2017). In other words, altruism refers to people’s unselfish regard for or devotion to the welfare of others, whereas hyperaltruism takes the subject’s own cost-benefit preference as the reference point and highlights the “additional” altruistic preference when considering others’ welfare. For example, in the altruistic experimental design, altruism is characterized by the degree to which subjects take other people’s welfare into account (left panel). However, in a typical hyperaltruism task design (right panel), hyperaltruistic preference is operationally defined as the difference (κ<sub>other</sub> - κ<sub>self</sub>) between the degrees to which subjects value others’ harm (κ<sub>other</sub>) and their own harm (κ<sub>self</sub>).

      Author response image 5.

      I found it surprising that a paradigm that entails deciding to hurt or not hurt someone else for personal benefit (whether acquiring a financial gain or avoiding a loss) would be described as measuring "altruism." Deciding to hurt someone for personal benefit is the definition of instrumental aggression. I did not see that in any of the studies was there a possibility of acting to benefit the other participant in any condition. Altruism is not equivalent to refraining from engaging in instrumental aggression. True altruism would be to accept shocks to the self for the other's benefit (e.g., money). The interpretation of this task as assessing instrumental aggression is supported by the fact that only the Instrumental Harm subscale of the OUS was associated with outcomes in the task, but not the Impartial Benevolence subscale. By contrast, the IB subscale is the one more consistently associated with altruism (e.g., Kahane et al., 2018; Amormino et al., 2022). I believe it is important for scientific accuracy for the paper, including the title, to be re-written to reflect what it is testing.

      Again, as we mentioned in the previous response, hyperaltruism is a term coined almost a decade ago and has since been widely adopted in the research field. We are concerned that replacing the term would be more likely to cause confusion than clarity among readers.

      Also, from the utilitarian perspective, gains and losses (or harm) that occur to someone else lie on the same dimension, and there is no discontinuity between gains and losses. Therefore, taking actions to avoid someone else’s loss can also be viewed as altruistic behavior, similar to choices that increase others’ welfare (Liu et al., 2020).

      Relatedly: in the introduction I believe it would be important to discuss the non-symmetry of moral obligations related to help/harm--we have obligations not to harm strangers but no obligation to help strangers. This is another reason I do not think the term "hyper altruism" is a good description for this task--given it is typically viewed as morally obligatory not to harm strangers, choosing not to harm them is not "hyper" altruistic (and again, I do not view it as obviously altruism at all).

      We agree with the reviewer’s point that we have a moral obligation not to harm others but no obligation to help strangers (Liu et al., 2020). In fact, this is exactly what we argued in our manuscript: by switching the decision context from gains to losses, subjects were less likely to perceive the decisions as “harming others”. Furthermore, after the administration of OXT, subjects were more likely to perceive decisions in both the gain and loss contexts as harming others (Fig. 6A).

      The framing of the role of OT also felt incomplete. In introducing the potential relevance of OT to behavior in this task, it is important to pull in evidence from non-human animals on origins of OT as a hormone selected for its role in maternal care and defense (including defensive aggression). The non-human animal literature regarding the effects of OT is on the whole much more robust and definitive than the human literature. The evidence is abundant that OT motivates the defensive care of offspring of all kinds. My read of the present OT findings is that they increase participants' willingness to refrain from shocking strangers even when incurring a loss (that is, in a context where the participant is weighing harm to themselves versus harm to the other). It will be important to explain why OT would be relevant to refraining from instrumental aggression, again, drawing on the non-human animal literature.

      We thank the reviewer for the comments and agree that the link between our OT results and the animal literature can at best be described as vague and intriguing. Current literature on OT in animal research suggests that nucleus accumbens (NAc) oxytocin might play a critical role in social cognition and in reinforcing social interactions (Dölen et al., 2013; Dölen & Malenka, 2014; Insel, 2010). Though much insight has already been gained from animal studies, in humans, social interactions can take a variety of different forms, and consociate recognition can also be rather dynamic. For example, male human participants with self-administered OT showed higher trust and cooperation towards in-group members but more defensive aggression towards out-group members (De Dreu et al., 2010). In another human study, participants administered OT showed more coordinated out-group attack behavior, suggesting that OT might increase in-group efficiency at the cost of harming out-group members (Zhang et al., 2019). It is worth pointing out that in both experiments, the participants’ group membership was artificially assigned, thus highlighting the context-dependent nature of the OT effect in humans.

      In our experiment, more complex and higher-level social cognitive processes such as moral framing and moral perception are involved, and OT seems to play an important role in affecting these processes. Therefore, we admit that for this study, like the ones mentioned above, it is unfortunately rather hard to find a non-human animal counterpart. Instead of relating OT to instrumental aggression, we aimed to provide a parsimonious framework to explain why “hyperaltruism” disappeared in the loss condition and, with OT administration, reappeared in both the gain and loss conditions, while also considering the effects of other relevant variables.

      We concur with the reviewer’s comments about the importance of animal research and have since added the following paragraph into the revised manuscript (Line 86~90) as well as in the discussion:

      “Oxytocin has been shown to play a critical role in social interactions such as maternal attachment, pair bonding, consociate attachment and aggression in a variety of animal models[42,43]. Humans are endowed with higher cognitive and affective capacities and exhibit far more complex social cognitive patterns[44].”

      Another important limitation is the use of only male participants in Study 2. This was not an essential exclusion. It should be clear throughout sections of the manuscript that this study's effects can be generalized only to male participants.

      We thank the reviewer for the comments. Prior research has shown sex differences in oxytocin’s effects (Fischer-Shofty et al., 2013; Hoge et al., 2014; Lynn et al., 2014; Ma et al., 2016; MacDonald, 2013). Furthermore, given the potential confounds of the OT effect due to menstrual cycles and potential pregnancy in female subjects, most human OT studies have recruited only male subjects (Berends et al., 2019; De Dreu et al., 2010; Fischer-Shofty et al., 2010; Ma et al., 2016; Zhang et al., 2019). We have modified our manuscript to emphasize that study 2 recruited only male subjects.

      Recommendations:

      I believe the authors have provided an interesting and valuable dataset related to the willingness to engage in instrumental aggression - this is not the authors' aim, although also an important aim. Future researchers aiming to build on this paper would benefit from it being framed more accurately.

      Thus, I believe the paper must be reframed to accurately describe the nature of the task as assessing instrumental aggression. This is also an important goal, as well-designed laboratory models of instrumental aggression are somewhat lacking.

      Please see our response above: to maintain continuity with previous research, we believe that the term hyperaltruism aligns better with the main theme of this study.

      The research literature on other aggression tasks should also be brought in, as I believe these are more relevant to the present study than research studies on altruism that are primarily donation-type tasks. It should be added to the limitations of how different aggression in a laboratory task such as this one is from real-world immoral forms of aggression. Arguably, aggression in a laboratory task in which all participants are taking part voluntarily under a defined set of rules, and in which aggression constrained by rules is mutual, is similar to aggression in sports, which is not considered immoral. Whether responses in this task would generalize to immoral forms of aggression cannot be determined without linking responses in the task to some real-world outcome.

      We agree with the reviewer that “aggression in a lab task …. is similar to aggression in sports”. Our starting point was to investigate the boundary conditions for hyperaltruism (though we do not deny that there is an aggression component in hyperaltruism, given the experimental design we used). In other words, the dependent variable we were interested in was the difference between “other” and “self” aggression, not the aggression itself. Our results showed that when the decision context was switched from the monetary gain condition to the loss condition, human participants were willing to bear similar amounts of monetary loss to spare others and themselves from harm. That is, hyperaltruism disappeared in the loss condition. We interpreted this result as indicating that the loss condition prompted subjects to adopt a different moral framework (help vs. harm, Fig. 6A) and that subjects were less influenced by their instrumental harm personality trait due to this change of moral framework (Fig. 3C). In the following study (study 2), we further tested this hypothesis and verified that the administration of OT indeed increased subjects’ perception of the task as harming others in both the gain and loss conditions (Fig. 6A), and that such moral perception mediated the relationship between subjects’ personality traits (instrumental harm) and their relative harm sensitivities (the difference in aggression between the other- and self-conditions). We believe that the moral perception framework, in which OT directly modulates moral perception, accounts for subjects’ context-dependent choices better than hypothesizing context-dependent modulation effects of OT on aggression.

      The language should also be toned down--the use of phrases like "hyper altruism" (without independent evidence to support that designation) and "obliterate" rather than "reduce" or "eliminate" are overly hyperbolic.

      We have changed terms such as “obliterate” and “eliminate” to plain English, as the reviewer suggested.

      References

      Abu-Akel, A., Palgi, S., Klein, E., Decety, J., & Shamay-Tsoory, S. (2015). Oxytocin increases empathy to pain when adopting the other- but not the self-perspective. Social Neuroscience, 10(1), 7–15.

      Barchi-Ferreira, A., & Osório, F. (2021). Associations between oxytocin and empathy in humans: A systematic literature review. Psychoneuroendocrinology, 129, 105268.

      Berends, Y. R., Tulen, J. H. M., Wierdsma, A. I., van Pelt, J., Feldman, R., Zagoory-Sharon, O., de Rijke, Y. B., Kushner, S. A., & van Marle, H. J. C. (2019). Intranasal administration of oxytocin decreases task-related aggressive responses in healthy young males. Psychoneuroendocrinology, 106, 147–154.

      Chen, J., Putkinen, V., Seppälä, K., Hirvonen, J., Ioumpa, K., Gazzola, V., Keysers, C., & Nummenmaa, L. (2024). Endogenous opioid receptor system mediates costly altruism in the human brain. Communications Biology, 7(1), 1–11.

      Crockett, M. J., Kurth-Nelson, Z., Siegel, J. Z., Dayan, P., & Dolan, R. J. (2014). Harm to others outweighs harm to self in moral decision making. Proceedings of the National Academy of Sciences of the United States of America, 111(48), 17320–17325.

      Crockett, M. J., Siegel, J. Z., Kurth-Nelson, Z., Dayan, P., & Dolan, R. J. (2017). Moral transgressions corrupt neural representations of value. Nature Neuroscience, 20(6), 879–885.

      Crockett, M. J., Siegel, J. Z., Kurth-Nelson, Z., Ousdal, O. T., Story, G., Frieband, C., Grosse-Rueskamp, J. M., Dayan, P., & Dolan, R. J. (2015). Dissociable Effects of Serotonin and Dopamine on the Valuation of Harm in Moral Decision Making. Current Biology, 25(14), 1852–1859.

      De Dreu, C. K. W., Greer, L. L., Handgraaf, M. J. J., Shalvi, S., Van Kleef, G. A., Baas, M., Ten Velden, F. S., Van Dijk, E., & Feith, S. W. W. (2010). The Neuropeptide Oxytocin Regulates Parochial Altruism in Intergroup Conflict Among Humans. Science, 328(5984), 1408–1411.

      De Dreu, C. K. W., Gross, J., Fariña, A., & Ma, Y. (2020). Group Cooperation, Carrying-Capacity Stress, and Intergroup Conflict. Trends in Cognitive Sciences, 24(9), 760–776.

      Dölen, G., Darvishzadeh, A., Huang, K. W., & Malenka, R. C. (2013). Social reward requires coordinated activity of nucleus accumbens oxytocin and serotonin. Nature, 501(7466), 179–184.

      Dölen, G., & Malenka, R. C. (2014). The Emerging Role of Nucleus Accumbens Oxytocin in Social Cognition. Biological Psychiatry, 76(5), 354–355.

      Evans, S., Shergill, S. S., & Averbeck, B. B. (2010). Oxytocin Decreases Aversion to Angry Faces in an Associative Learning Task. Neuropsychopharmacology, 35(13), 2502–2509.

      Fehr, E., & Schmidt, K. M. (1999). A Theory of Fairness, Competition, and Cooperation. The Quarterly Journal of Economics, 114(3), 817–868.

      FeldmanHall, O., Dalgleish, T., Evans, D., & Mobbs, D. (2015). Empathic concern drives costly altruism. Neuroimage, 105, 347–356.

      Fischer-Shofty, M., Levkovitz, Y., & Shamay-Tsoory, S. G. (2013). Oxytocin facilitates accurate perception of competition in men and kinship in women. Social Cognitive and Affective Neuroscience, 8(3), 313–317.

      Fischer-Shofty, M., Shamay-Tsoory, S. G., Harari, H., & Levkovitz, Y. (2010). The effect of intranasal administration of oxytocin on fear recognition. Neuropsychologia, 48(1), 179–184.

      Glenn, A. L., Koleva, S., Iyer, R., Graham, J., & Ditto, P. H. (2010). Moral identity in psychopathy. Judgment and Decision Making, 5(7), 497–505.

      Hoge, E. A., Anderson, E., Lawson, E. A., Bui, E., Fischer, L. E., Khadge, S. D., Barrett, L. F., & Simon, N. M. (2014). Gender moderates the effect of oxytocin on social judgments. Human Psychopharmacology: Clinical and Experimental, 29(3), 299–304.

      Hu, J., Hu, Y., Li, Y., & Zhou, X. (2021). Computational and Neurobiological Substrates of Cost-Benefit Integration in Altruistic Helping Decision. Journal of Neuroscience, 41(15), 3545–3561.

      Hutcherson, C. A., Bushong, B., & Rangel, A. (2015). A Neurocomputational Model of Altruistic Choice and Its Implications. Neuron, 87(2), 451–462.

      Insel, T. R. (2010). The Challenge of Translation in Social Neuroscience: A Review of Oxytocin, Vasopressin, and Affiliative Behavior. Neuron, 65(6), 768–779.

      Kahane, G., Everett, J. A. C., Earp, B. D., Caviola, L., Faber, N. S., Crockett, M. J., & Savulescu, J. (2018). Beyond sacrificial harm: A two-dimensional model of utilitarian psychology. Psychological Review, 125(2), 131–164.

      Kahane, G., Everett, J. A. C., Earp, B. D., Farias, M., & Savulescu, J. (2015). ‘Utilitarian’ judgments in sacrificial moral dilemmas do not reflect impartial concern for the greater good. Cognition, 134, 193–209.

      Kahneman, D., & Tversky, A. (1979). Prospect Theory: An Analysis of Decision under Risk. Econometrica, 47(2), 263.

      Kapetaniou, G. E., Reinhard, M. A., Christian, P., Jobst, A., Tobler, P. N., Padberg, F., & Soutschek, A. (2021). The role of oxytocin in delay of gratification and flexibility in non-social decision making. eLife, 10, e61844.

      Kirsch, P., Esslinger, C., Chen, Q., Mier, D., Lis, S., Siddhanti, S., Gruppe, H., Mattay, V. S., Gallhofer, B., & Meyer-Lindenberg, A. (2005). Oxytocin Modulates Neural Circuitry for Social Cognition and Fear in Humans. The Journal of Neuroscience, 25(49), 11489–11493.

      Liu, J., Gu, R., Liao, C., Lu, J., Fang, Y., Xu, P., Luo, Y., & Cui, F. (2020). The Neural Mechanism of the Social Framing Effect: Evidence from fMRI and tDCS Studies. The Journal of Neuroscience, 40(18), 3646–3656.

      Liu, Y., Li, L., Zheng, L., & Guo, X. (2017). Punish the Perpetrator or Compensate the Victim? Gain vs. Loss Context Modulate Third-Party Altruistic Behaviors. Frontiers in Psychology, 8, 2066.

      Lockwood, P. L., Hamonet, M., Zhang, S. H., Ratnavel, A., Salmony, F. U., Husain, M., & Apps, M. A. J. (2017). Prosocial apathy for helping others when effort is required. Nature Human Behaviour, 1(7), 0131.

      Losecaat Vermeer, A. B., Boksem, M. A. S., & Sanfey, A. G. (2020). Third-party decision-making under risk as a function of prior gains and losses. Journal of Economic Psychology, 77, 102206.

      Lynn, S. K., Hoge, E. A., Fischer, L. E., Barrett, L. F., & Simon, N. M. (2014). Gender differences in oxytocin-associated disruption of decision bias during emotion perception. Psychiatry Research, 219(1), 198–203.

      Ma, Y., Liu, Y., Rand, D. G., Heatherton, T. F., & Han, S. (2015). Opposing Oxytocin Effects on Intergroup Cooperative Behavior in Intuitive and Reflective Minds. Neuropsychopharmacology, 40(10), 2379–2387.

      Ma, Y., Shamay-Tsoory, S., Han, S., & Zink, C. F. (2016). Oxytocin and Social Adaptation: Insights from Neuroimaging Studies of Healthy and Clinical Populations. Trends in Cognitive Sciences, 20(2), 133–145.

      MacDonald, K. S. (2013). Sex, Receptors, and Attachment: A Review of Individual Factors Influencing Response to Oxytocin. Frontiers in Neuroscience, 6, 194.

      Markiewicz, Ł., & Czupryna, M. (2018). Cheating: One Common Morality for Gain and Losses, but Two Components of Morality Itself. Journal of Behavioral Decision Making, 33(2), 166–179.

      Pachur, T., Schulte-Mecklenbeck, M., Murphy, R. O., & Hertwig, R. (2018). Prospect theory reflects selective allocation of attention. Journal of Experimental Psychology: General, 147(2), 147–169.

      Radke, S., Roelofs, K., & De Bruijn, E. R. A. (2013). Acting on Anger: Social Anxiety Modulates Approach-Avoidance Tendencies After Oxytocin Administration. Psychological Science, 24(8), 1573–1578.

      Saez, I., Zhu, L., Set, E., Kayser, A., & Hsu, M. (2015). Dopamine modulates egalitarian behavior in humans. Current Biology, 25(7), 912–919.

      Teoh, Y. Y., Yao, Z., Cunningham, W. A., & Hutcherson, C. A. (2020). Attentional priorities drive effects of time pressure on altruistic choice. Nature Communications, 11(1), 3534.

      Tom, S. M., Fox, C. R., Trepel, C., & Poldrack, R. A. (2007). The neural basis of loss aversion in decision-making under risk. Science, 315(5811), 515–518.

      Usher, M., & McClelland, J. L. (2004). Loss Aversion and Inhibition in Dynamical Models of Multialternative Choice. Psychological Review, 111(3), 757–769.

      Volz, L. J., Welborn, B. L., Gobel, M. S., Gazzaniga, M. S., & Grafton, S. T. (2017). Harm to self outweighs benefit to others in moral decision making. Proceedings of the National Academy of Sciences of the United States of America, 114(30), 7963–7968.

      Wu, Q., Mao, J., & Li, J. (2020). Oxytocin alters the effect of payoff but not base rate in emotion perception. Psychoneuroendocrinology, 114, 104608.

      Wu, S., Cai, W., & Jin, S. (2018). Gain or non-loss: The message matching effect of regulatory focus on moral judgements of other-orientation lies. International Journal of Psychology, 53(3), 223–227.

      Xiong, W., Gao, X., He, Z., Yu, H., Liu, H., & Zhou, X. (2020). Affective evaluation of others’ altruistic decisions under risk and ambiguity. Neuroimage, 218, 116996.

      Yechiam, E., & Hochman, G. (2013). Losses as modulators of attention: Review and analysis of the unique effects of losses over gains. Psychological Bulletin, 139(2), 497–518.

      Zhan, Y., Xiao, X., Tan, Q., Li, J., Fan, W., Chen, J., & Zhong, Y. (2020). Neural correlations of the influence of self-relevance on moral decision-making involving a trade-off between harm and reward. Psychophysiology, 57(9), e13590.

      Zhang, H., Gross, J., De Dreu, C., & Ma, Y. (2019). Oxytocin promotes coordinated out-group attack during intergroup conflict in humans. eLife, 8, e40698.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This important work identifies a previously uncharacterized capacity for songbirds to recover vocal targets even without sensory experience. While the evidence supporting this claim is solid, with innovative experiments exploring vocal plasticity in deafened birds, additional behavioral controls and analyses are necessary to shore up the main claims. If improved, this work has the potential for broad relevance to the fields of vocal and motor learning.

      We were able to address the requests for additional behavioral controls about the balancing of the groups (reviewer 1) and the few individual birds that showed a different behavior (reviewer 2) without collecting any further data. See our detailed replies below.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Zai et al. test if songbirds can recover the capacity to sing auditory targets without singing experience or sensory feedback. Past work showed that after the pitch of targeted song syllables is driven outside of birds' preferred target range with external reinforcement, birds revert to baseline (i.e. restore their song to their target). Here the authors tested the extent to which this restoration occurs in muted or deafened birds. If these birds can restore, this would suggest an internal model that allows for sensory-to-motor mapping. If they cannot, this would suggest that learning relies entirely on feedback-dependent mechanisms, e.g. reinforcement learning (RL). The authors find that deafened birds exhibit moderate but significant restoration, consistent with the existence of a previously under-appreciated internal model in songbirds.

      Strengths:

      The experimental approach of studying vocal plasticity in deafened or muted birds is innovative, technically difficult, and perfectly suited for the question of feedback-independent learning. The finding in Figure 4 that deafened birds exhibit subtle but significant plasticity toward restoration of their pre-deafening target is surprising and important for the songbird and vocal learning fields, in general.

      Weaknesses:

      The evidence and analyses related to the directed plasticity in deafened birds are confusing, and the magnitude of the plasticity is far less than the plasticity observed in control birds with intact feedback. The authors acknowledge this difference in a two-system model of vocal plasticity, but one wonders why the feedback-independent model, which could powerfully enhance learning speed, is weak in this songbird system.

      We fully agree with the reviewer. This surprising weakness reflects a limitation of the birds rather than of our approach for characterizing it.

      There remains some confusion about the precise pitch-change methods used to study the deafened birds, including the possibility that a critical cohort of birds was not suitably balanced in a way where deafened birds were tested on their ability to implement both pitch increases and decreases toward target restoration.

      Both deaf groups (dLO and WNd) were balanced in that half of the birds (5/10 WNd and 4/8 dLO) shifted their pitch up (thus target restoration corresponded to decreasing pitch) and half of the birds (5/10 WNd and 4/8 dLO) shifted their pitch down (thus target restoration corresponded to increasing pitch), see Methods.

      To clarify the precise pitch-change method used, we added to the methods an explanation about why we used the sensitivity index 𝒅′ in Fig. 4:

      We used sensitivity 𝒅′ relative to the last 2 h of WN/LO instead of NRP because we wanted to detect a pitch change, which is the realm of detection theory, i.e. 𝒅′. Furthermore, by measuring local changes in pitch relative to the last 2 h of WN/LO reinforcement, our measurements are only minimally affected by the amount of reinforcement learning that might have occurred during this 2 h time window; choosing an earlier or longer window would have blended reinforced pitch changes into our estimates. Last but not least, changing the way in which we normalized 𝒅′ values (dividing by 𝑺𝑩) or using the NRP relative to the last 2 h of WN/LO did not qualitatively change the results shown in Fig. 4D.
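
To make the measure concrete, here is a minimal Python sketch of a sensitivity index of this kind. It assumes the common detection-theory form (difference of means divided by the root-mean-square of the two standard deviations); the exact normalization used in the manuscript is described in its Methods and may differ. The function name, the `shift_sign` convention, and the pitch values are hypothetical and serve only as illustration.

```python
from math import sqrt
from statistics import mean, stdev


def d_prime(test_pitches, reference_pitches, shift_sign=1):
    """Sensitivity index between two sets of pitch values (in Hz).

    Assumed form: (mean_test - mean_ref) / sqrt((sd_test^2 + sd_ref^2) / 2).
    shift_sign is +1 if reinforcement drove pitch up and -1 if it drove pitch
    down, so that negative values always mean a change toward baseline.
    This sign convention is an assumption made for this sketch.
    """
    rms_sd = sqrt((stdev(test_pitches) ** 2 + stdev(reference_pitches) ** 2) / 2)
    return shift_sign * (mean(test_pitches) - mean(reference_pitches)) / rms_sd


# Hypothetical data: last 2 h of reinforcement vs. first 2 h after deafening.
reference = [700.0, 710.0, 690.0, 705.0, 695.0]
after_deafening = [680.0, 690.0, 670.0, 685.0, 675.0]
print(d_prime(after_deafening, reference, shift_sign=1))
```

With the sign convention above, birds shifted up and birds shifted down can be pooled, because reversion is negative in both cases.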

      Reviewer #2 (Public Review):

      Summary:

      This paper investigates the role of motor practice and sensory feedback when a motor action returns to a learned or established baseline. Adult male zebra finches perform a stereotyped, learned vocalization (song). It is possible to shift the pitch of particular syllables away from the learned baseline pitch using contingent white noise reinforcement. When the reinforcement is stopped, birds will return to their baseline over time. During the return, they often sing hundreds of renditions of the song. However, whether motor action, sensory feedback, or both during singing is necessary to return to baseline is unknown.

      Previous work has shown that there is covert learning of the pitch shift. If the output of a song plasticity pathway is blocked during learning, there is no change in pitch during the training. However, as soon as the pathway is unblocked, the pitch immediately shifts to the target location, implying that there is learning of the shift even without performance. Here, they ask whether the return to baseline from such a pitch shift also involves covert or overt learning processes. They perform a series of studies to address these questions, using muting and deafening of birds at different time points.

      Strengths:

      The overall premise is interesting and the use of muting and deafening to manipulate different aspects of motor practice vs. sensory feedback is a solid approach.

      Weaknesses:

      One of the main conclusions, which stems primarily from comparing birds deafened after being pitch-shifted using white noise (WNd) with birds deafened before being pitch-shifted with light as a reinforcer (dLO), is that recent auditory experience can drive motor plasticity even when an individual is deprived of such experience. While the lack of shift back to baseline pitch in the dLO birds is convincing, the main conclusion hinges on the responses of just a few WNd individuals who are closer to baseline in the early period. Moreover, only 2 WNd individuals reached baseline in the late period, though neither of these were individuals who were closer to baseline in the early phase. Most individuals remain or return toward the reinforced pitch. These data highlight that while it may be possible for previous auditory experience during reinforcement to drive motor plasticity, the effect is very limited. Importantly, it's not clear if there are other explanations for the changes in these birds, for example, whether there are differences in the number of renditions performed or changes to other aspects of syllable structure that could influence measurements of pitch.

      We thank the reviewer for these detailed observations. We looked into the reviewer’s claim that our main conclusion of revertive pitch changes in deaf birds with target mismatch experience hinges on only a few WNd birds in the early period.

      When we remove the three birds that were close to baseline (NRP=0) in the early period, we still get the same trend that WNd birds show revertive changes towards baseline: Early 𝒅′ = −0.13, p = 0.24, tstat = −0.74, df = 6, N = 7 birds, one-sided t-test of H0: 𝒅′ = 0; Late 𝒅′ = −1.26, p = 0.08, tstat = −1.63, df = 6, N = 7 birds, one-sided t-test of H0: 𝒅′ = 0. Furthermore, even without these three birds, bootstrapping the difference between WNd and dC birds shows the same trend in the early period (p=0.22) and a significant reversion in the late period (p<0.001). Thus, the effect of reversion towards baseline in the late period is robustly observed on a population level, even when discounting the three individual birds that the reviewer suspected would be responsible for the effect.
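
The two kinds of test named above can be sketched in Python as follows. This is an illustration only: the response does not spell out the exact bootstrap procedure, so the resampling scheme here (resampling birds with replacement within each group) is an assumption, as are the function names and the example values. The t-test part returns only the statistic and degrees of freedom, since a p-value would require a t-distribution table or a statistics library.

```python
import random
from math import sqrt
from statistics import mean, stdev


def one_sample_t(values, mu0=0.0):
    """t statistic and degrees of freedom for H0: mean == mu0.

    df = N - 1, so N = 7 birds gives df = 6, as in the numbers reported above.
    """
    n = len(values)
    t_stat = (mean(values) - mu0) / (stdev(values) / sqrt(n))
    return t_stat, n - 1


def bootstrap_p(group_a, group_b, n_boot=10000, seed=0):
    """One-sided bootstrap p-value for mean(group_a) < mean(group_b).

    Returns the fraction of resamples in which the difference of means
    mean(a) - mean(b) is NOT negative. Resampling birds within each group
    is an assumed scheme, chosen for this sketch.
    """
    rng = random.Random(seed)
    not_negative = 0
    for _ in range(n_boot):
        a = rng.choices(group_a, k=len(group_a))
        b = rng.choices(group_b, k=len(group_b))
        if mean(a) - mean(b) >= 0:
            not_negative += 1
    return not_negative / n_boot


# Hypothetical per-bird d' values for two groups (not data from the study).
group_with_reversion = [-1.9, -0.4, -2.2, -0.8, -1.5, 0.3, -2.3]
group_without = [0.4, -0.1, 0.6, 0.2, -0.3, 0.5, 0.1]
print(one_sample_t(group_with_reversion))
print(bootstrap_p(group_with_reversion, group_without))
```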

      Moreover, note that there are not two but three WNd individuals that reached baseline in the late period (see Figure 2C, D). One of them was already close to baseline in the early period and another one was already relatively close, too.

      Also, the considerable variability among birds is not surprising: variability across deaf birds is expected to be large because their ongoing song degradation might lead to a drift of pitch over time since deafening.

      Last but not least, see also our multivariate model (below).

      With regards to the “differences in the number of renditions” that could explain pitch changes: Deaf birds sing less after deafening than hearing birds: they sing less during the first 2 hours (early): 87±59 renditions (WNd) and 410±330 renditions (dLO) compared to 616±272 renditions (control birds). Also, WNd birds sing only 4300±2300 motif renditions between the early and late period compared to the average of 11000±3400 renditions that hearing control birds produce in the same time period. However, despite these differences, when we provide WNd birds more time to recover, namely 9 days after the early period, they sang on average 12000±6000 renditions, yet their NRP was still significantly different from zero (NRP = 0.37, p=0.007, tstat=3.47, df=9). Thus, even after producing more practice songs, deaf birds do not recover baseline pitch and so the number of songs alone cannot explain why deaf birds do not fully recover pitch. We conclude that auditory experience seems to be necessary to recover song.

      We added this information to the Results.

      In this context, note that the interesting part of our work is not that deaf birds do not fully recover, but that they recover anything at all (“main conclusion”, Fig. 4). The number of songs does not explain why deaf birds with mismatch experience (WNd, singing the least and singing significantly less than control birds, p=2.3×10⁻⁶, two-tailed t-test) partially revert song towards baseline, unlike deaf birds without mismatch experience (dLO, singing significantly more than WNd birds, p=0.008, and indistinguishable from control birds, p=0.1). We added this information to the Results section.

      With regards to ‘other aspects of syllable structure’: We did not look into this. Regardless of the outcome of such a hypothetical analysis, whether other syllable features change is irrelevant for our finding that deaf birds do not recover their target song. Nevertheless, note that in Zai et al. 2020 (supplementary Figure 1), we analyzed features other than pitch change in deaf birds. Absolute change in entropy variance was larger in deaf birds than in hearing birds, consistent with the literature on song degradation after deafening (Lombardino and Nottebohm, 2000, Nordeen and Nordeen 2010 and many others). In that paper, we found that only pitch changes consistently along the LO direction. All other features that we looked at (duration, AM, FM and entropy) did not change consistently with the LO contingency. We expect that a similar result would apply for the changes across the recovery period in WNd and dLO birds, i.e., that song degradation can be seen in many features and that pitch is the sole feature that changes consistently with reinforcement (LO/WN) direction.

      While there are examples where the authors perform direct comparisons between particular manipulations and the controls, many of the statistical analyses test whether each group is above or below a threshold (e.g. baseline) separately and then make qualitative comparisons between those groups. Given the variation within the manipulated groups, it seems especially important to determine not just whether these are different from the threshold, but how they compare to the controls. In particular, a full model with time (early, late), treatment (deafened, muted, etc), and individual ID (random variable) would substantially strengthen the analysis.

      We fitted a full model of the NRP as the reviewer suggests, and it supports our conclusions: Neither muting, deafening, nor time without practice between R and E windows has a significant effect on pitch in the E window, but the interaction between deafening and time (late, L) results in a significant pitch change (fixed effect 0.67, p=2×10⁻⁶), demonstrating that deaf birds are significantly further away from baseline (NRP=0) than hearing birds in late windows, thereby confirming that birds require auditory feedback to recover a distant pitch target. Importantly, we find a significant fixed effect on pitch in the direction of the target with mismatch experience (fixed effect -0.37, p=0.006), supporting our finding that limited vocal plasticity towards a target is possible even without auditory feedback.

      We included this model as an additional analysis in our manuscript.
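
One way to write down a model of the kind described, purely as a sketch: the predictors are those named in the reply (muting, deafening, time without practice, late window, the deafening-by-late interaction, and mismatch experience), with a random intercept per bird as the reviewer suggested. The coding of the predictors, the symbols, and the Gaussian assumptions are ours for illustration; the actual specification is the one given in the manuscript.

```latex
\mathrm{NRP}_{ij} = \beta_0
  + \beta_M\,\mathrm{muted}_i
  + \beta_D\,\mathrm{deaf}_i
  + \beta_G\,\mathrm{gap}_i
  + \beta_L\,\mathrm{late}_j
  + \beta_{DL}\,(\mathrm{deaf}_i \times \mathrm{late}_j)
  + \beta_X\,\mathrm{mismatch}_i
  + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2),\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here $i$ indexes birds and $j$ the analysis window, $\mathrm{gap}_i$ stands for the time without practice, and $u_i$ is the per-bird random intercept. Under this reading, the reported fixed effect of 0.67 would correspond to $\beta_{DL}$ and the reported fixed effect of $-0.37$ to $\beta_X$.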

      The muted birds seem to take longer to return to baseline than controls even after they are unmuted. Presumably, there is some time required to recover from surgery, however, it's unclear whether muting has longer-term effects on syrinx function or the ability to pass air. In particular, it's possible that the birds still haven't recovered by 4 days after unmuting as a consequence of the muting and unmuting procedure or that the lack of recovery is indicative of an additional effect that muting has on pitch recovery. For example, the methods state that muted birds perform some quiet vocalizations. However, if birds also attempt to sing, but just do so silently, perhaps the aberrant somatosensory or other input from singing while muted has additional effects on the ability to regain pitch. It would also be useful to know if there is a relationship between how long they are muted and how quickly they return to baseline.

      We agree, it might be the case that muting has some longer-term effects that could explain why WNm birds did not recover pitch 4 days after unmuting. However, if such an effect exists, it is only weak. Arguing against the idea that a longer muting requires longer recovery, we did not find a correlation between the difference in NRP between early and late and (1) the duration the birds were muted (correlation coefficient = -0.50, p=0.20), (2) the number of renditions the birds sang between early and late (correlation coefficient = 0.03, p=0.95), or (3) the time since they last sang the target song (last rendition of baseline, correlation coefficient = -0.43, p=0.29). Neither did we find a correlation between the early NRP and the time since the muting surgery (correlation coefficient = 0.26, p=0.53), suggesting that the lack of pitch recovery while muted was not due to a lingering burden of the muting surgery. We added these results to the results section.
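
For readers who want to see what such a correlation check amounts to, here is a minimal Python sketch of a Pearson correlation coefficient. The reply does not state which correlation measure was used, so Pearson's r is an assumption, and the function name and example values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical values: days muted vs. change in NRP from early to late window.
days_muted = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 4.0, 6.0]
nrp_change = [0.30, 0.10, 0.25, 0.20, 0.05, 0.15, 0.35, 0.00]
print(pearson_r(days_muted, nrp_change))
```

With only a handful of birds per group, a coefficient of moderate size can easily fail to reach significance, which is why the reply reports the p-value alongside each coefficient.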

      In summary, we used the WNm group to assess whether birds can recover their target pitch in the absence of practice, i.e. whether they recovered pitch in the early time period. Whether or not some long-term effect of the muting/unmuting procedure affects recovery does not impair the main finding we obtained from WNm birds in Figure 1 (that birds do not recover without practice).

      Reviewer #3 (Public Review):

      Summary:

      Zai et al. test whether birds can modify their vocal behavior in a manner consistent with planning. They point out that while some animals are known to be capable of volitional control of vocalizations, it has been unclear if animals are capable of planning vocalizations, that is, modifying vocalizations towards a desired target without the need to learn this modification by practicing and comparing sensory feedback of practiced behavior to the behavioral target. They study zebra finches that have been trained to shift the pitch of song syllables away from their baseline values. It is known that once this training ends, zebra finches have a drive to modify pitch so that it is restored back to its baseline value. They take advantage of this drive to ask whether birds can implement this targeted pitch modification in a manner that looks like planning, by comparing the time course and magnitude of pitch modification in separate groups of birds who have undergone different manipulations of sensory and motor capabilities. A key finding is that birds who are deafened immediately before the onset of this pitch restoration paradigm, but after they have been shifted away from baseline, are able to shift pitch partially back towards their baseline target. In other words, this targeted pitch shift occurs even when birds don't have access to auditory feedback, which argues that this shift is not due to reinforcement-learning-guided practice, but is instead planned based on the difference between an internal representation of the target (baseline pitch) and current behavior (pitch the bird was singing immediately before deafening).

      The authors present additional behavioral studies arguing that this pitch shift requires auditory experience of the song in its state after it has been shifted away from baseline (birds deafened early on, before the initial pitch shift away from baseline, do not exhibit any shift back towards baseline), and that a full shift back to baseline requires auditory feedback. The authors synthesize these results to argue that different mechanisms operate for small shifts (planning, does not need auditory feedback) and large shifts (reinforcement learning, requires auditory feedback).

      We thank the reviewer for this concise summary of our paper. To clarify, we want to point out that we do not make any statement about the learning mechanism birds use to make large shifts to recover their target pitch, i.e. we do not say that large shifts are learned by reinforcement learning requiring auditory feedback. We only show that large shifts require auditory feedback.

      The authors also make a distinction between two kinds of planning: covert (not requiring any motor practice) and overt (requiring motor practice but without access to auditory experience from which target mismatch could be computed). They argue that birds plan overtly, based on these deafening experiments as well as an analogous experiment involving temporary muting, which suggests that indeed motor practice is required for pitch shifts.

      Strengths:

      The primary finding (that partially restorative pitch shift occurs even after deafening) rests on strong behavioral evidence. It is less clear to what extent this shift requires practice, since their analysis of pitch after deafening takes the average over the first two hours of singing. If this shift is already evident in the first few renditions then this would be evidence for covert planning. This analysis might not be feasible without a larger dataset. Similarly, the authors could test whether the first few renditions after recovery from muting already exhibit a shift back toward baseline.

      This work will be a valuable addition to others studying birdsong learning and its neural mechanisms. It documents features of birdsong plasticity that are unexpected in standard models of birdsong learning based on reinforcement and are consistent with an additional, perhaps more cognitive, mechanism involving planning. As the authors point out, perhaps this framework offers a reinterpretation of the neural mechanisms underlying a prior finding of covert pitch learning in songbirds (Charlesworth et al., 2012).

      A strength of this work is the variety and detail in its behavioral studies, combined with sensory and motor manipulations, which on their own form a rich set of observations that are useful behavioral constraints on future studies.

      Weaknesses:

      The argument that pitch modification in deafened birds requires some experience hearing their song in its shifted state prior to deafening (Fig. 4) is solid but has an important caveat. Their argument rests on comparing two experimental conditions: one with and one without auditory experience of shifted pitch. However, these conditions also differ in the pitch training paradigm: the "with experience" condition was performed using white noise training, while the "without experience" condition used "lights off" training (Fig. 4A). It is possible that the differences in the ability of these two groups to restore pitch to baseline reflect the training paradigm, not whether subjects had auditory experience of the pitch shift. Ideally, a control study would use one of the training paradigms for both conditions, which would be "lights off" or electrical stimulation (McGregor et al. 2022), since WN training cannot be performed in deafened birds. This is difficult, in part because the authors previously showed that "lights off" training has different valences for deafened vs. hearing birds (Zai et al. 2020). Realistically, this would be a point to add to the discussion rather than a new experiment.

      We added the following statement to our manuscript:

      It is unlikely that dLO birds’ inability to recover baseline pitch is somehow due to our use of a reinforcer of a non-auditory (visual) modality, since somatosensory stimuli do not prevent reliable target pitch recovery in hearing birds (McGregor et al 2022).

      A minor caveat, perhaps worth noting in the discussion, is that this partial pitch shift after deafening could potentially be attributed to the birds "gaining access to some pitch information via somatosensory stretch and vibration receptors and/or air pressure sensing", as the authors acknowledge earlier in the paper. This does not strongly detract from their findings as it does not explain why they found a difference between the "mismatch experience" and "no mismatch experience groups" (Fig. 4).

      We added the following statement: Our insights were gained in deaf birds and we cannot rule out that deaf birds could gain access to pitch information via somatosensory-proprioceptive sensory modalities. However, such information, even if available, cannot explain the difference between the "mismatch experience" (WNd) and the "no mismatch experience" (dLO) groups, which strengthens our claim that the pitch reversion we observe is a planned change and not merely a rigid motor response (as in simple use-dependent forgetting).

      More broadly, it is not clear to me what kind of planning these birds are doing, or even whether the "overt planning" here is consistent with "planning" as usually implied in the literature, which in many cases really means covert planning. The idea of using internal models to compute motor output indeed is planning, but why would this not occur immediately (or in a few renditions), instead of taking tens to hundreds of renditions?

      Indeed, what we call ‘covert planning’ refers to what usually is called ‘planning’ in the literature. Also, there currently seems to be no evidence for spontaneous overt planning in songbirds (which we elicited with deafening). Replay of song-like syringeal muscle activity can be induced by auditory stimuli during sleep (Bush, A., Doppler, J. F., Goller, F., and Mindlin, G. B., 2018), but to our knowledge there are no reports of similar replay in awake, non-singing birds, which would constitute evidence for overt planning.

      We cannot ascertain how fast birds can plan their song changes, but our findings are not in disagreement with fast planning. The smallest time window of analysis we chose is 2h, which sets a lower bound of the time frame within which we can measure pitch changes. Our approach is probably not ideally suited for determining the minimal planning time, because the deafening and muting procedures cause an increase in song variability, which calls for larger pitch sample sizes for statistical testing, and the surgeries themselves cause a prolonged period without singing during which we have no access to the birds’ planned motor output. Note that fast planning is demonstrated by the recent finding of instant imitation in nightingales (Costalunga, Giacomo, et al. 2023) and is evidenced by fast re-pitching upon context changes in Bengalese finches (Veit, L., Tian, L. Y., Monroy Hernandez, C. J., & Brainard, M. S., 2021).

      To resolve confusion, it would be useful to discuss and add references relating "overt" planning to the broader literature on planning, including in the introduction when the concept is introduced.

      Overt and covert planning are terms used in the literature on child development and on adult learning, see (Zajic, Matthew Carl, et al., Overt planning behaviors during writing in school-age children with autism spectrum disorder and attention-deficit/hyperactivity disorder, 2020) and (Abbas zare-ee, Researching Aptitude in a Process-Based Approach to Foreign Language Writing Instruction. Advances in Language and Literary Studies, 2014), and references therein.

      Indeed, muddying the interpretation of this behavior as planning is that there are other explanations for the findings, such as use-dependent forgetting, which the authors acknowledge in the introduction, but don't clearly revisit as a possible explanation of their results. Perhaps this is because the authors equate use-dependent forgetting and overt planning, in which case this could be stated more clearly in the introduction or discussion.

      We do not mean to strictly equate use-dependent forgetting and overt planning, although they can be related, namely when ‘use’ refers to ‘altered use’ as is the case when something about the behavior is missing (e.g. auditory feedback in our study), and the dependence is not just on ‘use’ but also on ‘experience’.

      We added the following sentence to the discussion: We cannot distinguish the overt planning we find from more complex use-and-experience dependent forgetting, since we only probed for recovery of pitch and did not attempt to push birds into planning pitch shifts further away from baseline.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) The single main issue with this paper is in the section related to Figure 4, and the Figure itself - this is the most important part of the paper essential to buttress the claim of covert learning. However, there are several sources of confusion in the text, analyses, and figures. The key result is in Figure 4B, C - and, in the context of Figs 1-3, the data are significant but subtle. That is, as the authors state, the birds are mostly dependent on slow sensory feedback-dependent (possibly RL) mechanisms but there is a small component of target matching that evidences an internal model. One wonders why this capacity is so small - if they had a good internal model they'd be much faster and better at recovering target pitches after distortion-driven deviations even without sensory feedback.

      (1a) The analysis of the WNd and DLO reversions of pitch (related to Fig. 4) uses a d' analysis which is a pivot from the NRP analysis used in the rest of the paper. It is not clear why different analyses are being used here to compute essentially the same measure, i.e. how much did the pitch revert. It's also odd that different results are now obtained - Fig. 4 has a small but significant reversion of pitch in WNd birds but Fig. 2 shows no significant return to baseline.

      We did not test for reversion towards baseline in Fig. 2 and made no statement about whether there is a significant reversion or not. But when we do such a test, we find a significant reversion for WNd birds in the ‘late’ window (NRP=0.5, p=0.02, N=10, tstat=-1.77, two-tailed t-test), which agrees with Figure 4. In the ‘early’ window in Fig. 2, we find only a trend but no reversion (NRP = 0.76, p=0.11, n=10, tstat=-1.76), which contrasts with our findings in Figure 4. However, the discrepancy can be simply explained by the difference in time alignment that we detail in the Materials and Methods. Namely, in Figure 2, we measure pitch relative to the pitch in the morning on the day before, which is not a good measure of ‘reversion’ (since pitch had been reinforced further away during the day), which is why we do not present this analysis in the paper and dedicate a separate analysis in Figure 4 to reversion.

      (1b) Also in Fig. 4 is it the case that, as in the schematic of 4a, ALL birds in these experiments had their pitch pushed up - so that the return to baseline was all down? If this is the case the analysis may be contaminated by a pitch-down bias in deafened birds. This would ideally be tested with a balance of pitch-up and pitch-down birds in the pre-deafening period, and/or analysis of non-targeted harmonic stacks to examine their pitch changes. If non-targeted stacks exhibit pitch-down changes after deafening, then the reversion that forms the key discovery of this paper will be undermined. Please address.

      Both groups in Figure 4 were balanced (the same number of birds had their pitch shifted up as down); see our response to the public review and the Methods.

      (1c) After multiple re-reads and consultations with the Methods section I still do not understand the motivation or result for Figure 4E. Please provide clarification of the hypothesis/control being assessed and the outcome.

      Figure 4E does not add a new result but strengthens our previous findings because we obtain the same result with a different method. The pitch of deaf birds tends to drift after deafening. To discount this drift and the effect of time elapsed since deafening, we bootstrapped the magnitude of the pitch change in WNd and dLO birds by comparing them to dC birds in matched time windows. We modified the sentence in the results section to clarify this point:

      To discount for the effect of time elapsed since deafening and quantify the change in pitch specifically due to reinforcement, we bootstrapped the difference in 𝒅′ between dLO/WNd birds and a new group of dC birds that were deafened but experienced no prior reinforcement (see methods).
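The comparison described above can be illustrated with a minimal sketch. It is not the authors' code: the definition of 𝒅′ used here (difference of means divided by the root-mean-square of the two standard deviations), the percentile interval, and all function names and numbers are assumptions made for illustration only.

```python
import random
import statistics


def d_prime(a, b):
    """Sensitivity index between two samples (assumed definition):
    difference of means over the RMS of the two sample standard deviations."""
    pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


def bootstrap_mean_difference(treated, control, n_boot=10000, seed=0):
    """Bootstrap the difference in mean d' between two groups of birds.

    `treated` and `control` hold one d' value per bird (hypothetical inputs,
    e.g. WNd birds versus dC birds in matched time windows).
    Returns the observed difference and a 95% percentile interval.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample birds with replacement within each group.
        t = [rng.choice(treated) for _ in treated]
        c = [rng.choice(control) for _ in control]
        diffs.append(statistics.mean(t) - statistics.mean(c))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    observed = statistics.mean(treated) - statistics.mean(control)
    return observed, (lower, upper)
```

If the upper bound of the interval lies below zero, the reinforced group reverted more than would be expected from post-deafening drift alone, under the assumptions stated above.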

      (1d) Line 215. It's not clear in the text here how the WNd birds experience a pitch mismatch. Please clarify the text that this mismatch was experienced before deafening. This is a critical paragraph to set up the main claims of the paper. Also, it's not clear what is meant by 'fuel their plan'? I can imagine this would simply be a DA-dependent plasticity process in Area X that does not fuel a plan but rather re-wires an HVC timestep to medium spiny neurons whose outputs drive pitch changes - i.e. not a fueled plan but simply an RL-dependent re-mapping in the motor system. Alternatively, a change could result in plasticity in pallial circuits (e.g. auditory to HVC mappings) that are RL independent and invoke an inverse model along the lines of the authors' past work (e.g. Ganguli and Hahnloser). This issue is taken up in the discussion but the setup here in the results is very confusing about the possible outcomes. This paragraph is vague with respect to the key hypotheses. It's possible that the WNd and DLO groups enable dissection of the two hypotheses above - because the DLO groups would presumably have RL signals but without recovery - but there remains a real lack of clarity over exactly how the authors are interpreting Fig 4 at the mechanistic level.

      WNd birds experience a pitch mismatch because while singing they hear that their pitch differs from baseline pitch, but the same is not true for dLO birds. We simply tested whether this experience makes a difference for reversion, and it does. We added ‘before deafening’ to the paragraph and changed the wording of our hypothesis to make it clearer (we reworded ‘fuel their plan’). We left mechanistic interpretations to the discussion. Without going into details, all we are saying is that birds can only plan to revert motor changes they are aware of in the first place.

      Minor issues

      The songs of deafened birds degrade, at a rate that depends on the bird's age. Younger crystalized birds degrade much faster, presumably because of lower testosterone levels that are associated with increased plasticity and LMAN function. Some background is needed on deafened birds to set up the WNd experiments.

      Despite deafening leading to the degradation of song (Lombardino and Nottebohm, 2000), syllable detection and pitch calculation were still possible in all deaf birds (up to 13-50 days after deafening surgery, age range 90-300 dph, n=44 birds).

      Since pitch shifting was balanced in both deaf bird groups (the same number of birds were up- and down-shifted), systematic changes in pitch post deafening (Lombardino and Nottebohm, 2000) will average out and so would not affect our findings.

      Lines 97-103. The paragraph is unclear and perhaps a call to a SupFig to show the lack of recovery would help. If I understand correctly, the first two birds did not exhibit the normal recovery to baseline if they did not have an opportunity to hear themselves sing without the WN. I am failing to understand this.

      In the early window (first 2 hours after unmuting) birds have not changed their pitch compared to their pitch in the corresponding window at the end of reinforcement (with matching time-of-day). We added ‘immediately after unmuting (early)’ to clarify this statement.

      Lines 68-69. What is the difference between (2) and (3)? Both require sensory representation/target to be mapped to vocal motor output. Please clarify or fuse these concepts.

      We fused the concept and changed the figure and explanation accordingly.

      Line 100. Please name the figure to support the claim.

      We marked the two birds in the Fig. 1H and added a reference in the text.

      Line 109. Is there a way to confirm / test if muted birds attempted to sing?

      Unfortunately, we do not have video recordings to check if there are any signs of singing attempts in muted birds.

      Line 296: Why 'hierarchically 'lower'?

      Lower because without it there is nothing to consolidate, i.e. the higher process can only be effective after the lower but not before. We clarified this point in the text.

      Past work on temporal CAF (tCAF) by the Olveczky group showed that syllable durations and gaps could be reinforced in a way that does not depend on Area X and, therefore, related to the authors' discussion on the possible mechanisms of sensory-feedback independent recovery, may rely on the same neural substrates that the Fig. 4 WNd group uses to recover. Yet the authors find in this paper that tCAF birds did not recover. There seems to be an oddity here - if covert recovery relies on circuits outside the basal ganglia and RL mechanisms, wouldn't tCAF birds be more likely to recover? This is not a major issue but is a source of confusion related to the authors' interpretations that could be fleshed out.

      This is a good point. We reinvestigated the tCAF birds in the context of Fig. 4, where we looked for pitch reversions towards baseline. tCAF birds also revert towards baseline. We added this information to the supplement. We cannot say anything about the mechanistic reasons for the lack of recovery, especially given that we did not look at brain-level mechanisms.

      Reviewer #2 (Recommendations For The Authors):

      The data presentation could be improved. It is difficult to distinguish between the early and late symbols and to distinguish between the colors for the individual lines on the plots or to match them with the points on the group data plots. In addition, because presumably, the points in plots like 2D are for the same individuals, lines connecting those points would be useful rather than trying to figure out which points are the same color.

      We added lines in Fig. 2D connecting the birds in early and late.

      The model illustrations (Fig 1A, Fig 5) are not intuitive and do not help to clarify the different hypotheses or ideas. I think these need to be reworked.

      We revised the model illustrations and hope they now better clarify the different hypotheses.

      Some of the phrasing is confusing. Especially lines 157-158 and 256-257.

      Lines 157-158: we removed an instance of ‘WNd’, which was out of place.

      Lines 256-257: we rephrased to ‘showing that prior experience of a target mismatch is necessary for pitch reversion independently of auditory feedback’

      Reviewer #3 (Recommendations For The Authors):

      For Fig. 1, the conclusion in the text "Overall, these findings suggest that either motor practice, sensory feedback, or both, are necessary for the recovery of baseline song" is not aligned with the figure header "Recovery of pitch target requires practice".

      We rephrased the conclusion to: Overall, these findings rule out covert planning in muted birds and suggest that motor practice is necessary for recovery of baseline song.

      The use of the term "song experience" can be confusing as to whether it means motor or auditory experience. Perhaps replace it with "singing experience" or "auditory experience" where appropriate.

      We made the requested changes.

      Fig. 1A, and related text, reads as three hypotheses that the authors will test in the paper, but I don't think this turns out to be the main goal (and if it is, it is not clear their results differentiate between hypotheses 1, 2, and 3). Perhaps reframe as discussion points and have this panel not be so prominent at the start, just to avoid this confusion.

      We modified the illustration in Fig 1A and simplified it. We now only show the 2 hypotheses that we test in the paper.

      Line 275-276, "preceding few hours necessitates auditory feedback, which sets a limit to zebra finches' covert planning ability". Did the authors mean "overt", not covert? Since their study focuses on overt planning.

      Our study focuses on covert planning in figure 1 and overt planning in subsequent figures.

      The purpose of the paragraph starting on line 278 could be more clear. Is the goal to say that overt planning and what has previously been described as use-dependent forgetting are actually the same thing? If not, what is the relationship between overt planning and forgetting? In other words, why should I care about prior work on use-dependent forgetting?

      We moved the paragraph further down where it does not interrupt the narrative. See also our reply to reviewer 3 on use-dependent forgetting.

      Line 294, "...a dependent process enabled by experience of the former...", was not clear what "former" is referring to. In general, this paragraph was difficult to understand. Line 296: Which is the "lower" process?

      We added explanatory parentheses in the text to clarify. We rephrased the sentence to ‘the hierarchically lower process of acquisition or planning as we find is independent of immediate sensory experience.’

      Line 295, the reference to "acquisition" vs. "retention". It is not clear how these two concepts relate to the behavior in this study, and/or the hierarchical processes referenced in the previous sentence. Overall, it is not clear how consolidation is related to the paper's findings.

      We added explanatory parentheses in the text and changed figure 5 to better explain the links.

      Line 305, add a reference to Warren et al. 2011, which I believe was the first study (or one of them) that showed that AFP bias is required for restoring pitch to baseline.

      We are citing Warren et al. 2011 in the sentence:

      Such separation also applies to songbirds. Both reinforcement learning of pitch and recovery of the original pitch baseline depend on the anterior forebrain pathway and its output, the lateral magnocellular nucleus of the anterior nidopallium (LMAN)(1).

      Line 310, "Because LMAN seems capable of executing a motor plan without sensory feedback", is this inferred from this paper (in which case this is an overreach) or is this referencing prior work (if so, which one, and please cite)?

      We changed the wording to ‘It remains to be seen whether LMAN is capable of executing a motor plan without sensory feedback’.

      Line 326, "which makes them well suited for planning song in a manner congruent with experience." I don't fully understand the logic. Can this sentence be clarified?

      We rephrased the sentence and added an explanation as follows: …which makes them well suited for executing song plans within the range of recent experience (i.e., if the song is outside recent experience, it elicits no LMAN response and so does not gain access to planning circuits).

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      In this manuscript, the authors report a molecular mechanism for recruiting syntaxin 17 (Syn17) to the closed autophagosomes through the charge interaction between enriched PI4P and the C-terminal region of Syn17. How to precisely control the location and conformation of proteins is critical for maintaining autophagic flux. In particular, how Syn17 is recruited to autophagosomes remains unclear. In this paper, the authors describe a simple lipid-protein interaction model that goes beyond previous studies focusing on protein-protein interactions. This represents a conceptual advance.

      We would like to thank Reviewer #1 for the positive evaluation of our study.

      Reviewer #2 (Public Review):

      Summary:

      Syntaxin17 (STX17) is a SNARE protein that is recruited to mature (i.e., closed) autophagosomes, but not to immature (i.e., unclosed) ones, and mediates the autophagosome-lysosome fusion. How STX17 recognizes the mature autophagosome is an unresolved interesting question in the autophagy field. Shinoda and colleagues set out to answer this question by focusing on the C-terminal domain of STX17 and found that PI4P is a strong candidate that causes the STX17 recruitment to the autophagosome.

      Strengths:

      The main findings are: 1) Rich positive charges in the C-terminal domain of STX17 are sufficient for the recruitment to the mature autophagosome; 2) Fluorescence charge sensors of different strengths suggest that autophagic membranes have negative charges and the charge increases as they mature; 3) Among a battery of fluorescence biosensors, only PI4P-binding biosensors distribute to the mature autophagosome; 4) STX17 bound to isolated autophagosomes is released by treatment with Sac1 phosphatase; 5) By molecular dynamics simulation, STX17 TM is shown to be inserted into a membrane containing PI4P but not into a membrane without it. These results indicate that PI4P is a strong candidate that STX17 binds to in the autophagosome.

      We would like to thank Reviewer #2 for pointing out these strengths.

      Weaknesses:

      • It was not answered whether PI4P is crucial for the STX17 recruitment in cells because manipulation of the PI4P content in autophagic membranes was not successful for unknown reasons.

      As we explained in the initial submission, we tried to deplete PI4P in autophagosomes by multiple methods but did not succeed. In this revised manuscript, we added the result of an experiment using the PI 4-kinase inhibitor NC03 (Figure 4―figure supplement 1), which shows no significant effect on the autophagosomal PI4P level and STX17 recruitment.

      Author response image 1.

      The PI 4-kinase inhibitor NC03 failed to suppress autophagosomal PI4P accumulation and STX17 recruitment. HEK293T cells stably expressing mRuby3–STX17TM (A) or mRuby3–CERT(PHD) (B) and Halotag-LC3 were cultured in starvation medium for 1 h and then treated with and without 10 μM NC03 for 10 min. Representative confocal images are shown. STX17TM- or CERT(PHD)-positive rates of LC3 structures per cell (n > 30 cells) are shown in the graphs. Solid horizontal lines indicate medians, boxes indicate the interquartile ranges (25th to 75th percentiles), and whiskers indicate the 5th to 95th percentiles. Differences were statistically analyzed by Welch’s t-test. Scale bars, 10 μm (main), 1 μm (inset).
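The legend above names Welch's t-test for the comparison between treated and untreated cells. As a reminder of what that test computes, here is a minimal sketch of the Welch statistic and its Welch-Satterthwaite degrees of freedom. The function name and any sample values are hypothetical, and the p-value step (which requires the t distribution) is deliberately left out.

```python
import statistics


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two
    samples with possibly unequal variances and sizes."""
    na, nb = len(a), len(b)
    # Per-group squared standard error of the mean.
    va = statistics.variance(a) / na
    vb = statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

Unlike Student's t-test, this statistic does not pool the variances, which makes it the safer choice when the two groups of per-cell positive rates may differ in spread.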

      • The molecular simulation study did not show whether PI4P is necessary for the STX17 TM insertion or whether other negatively charged lipids can play a similar role.

      As the reviewer suggested, we performed the molecular dynamics simulation using membranes with phosphatidylinositol, a negatively charged lipid. STX17 TM approached the PI-containing membrane but was not inserted into the membrane within a time scale of 100 ns in simulations of all five structures. These data suggest that PI4P, which is more negatively charged than PI, is required for STX17 insertion. Thus, we have included these data in Figure 5E and F and added the following text to Lines 242–244. “Moreover, if the membrane contained phosphatidylinositol (PI) instead of PI4P, STX17 approached the PI-containing membrane but was not inserted into the membrane (Figure 5E, F, Video 3).”

      Author response image 2.

      (E) An example of a time series of simulated results of STX17TM insertion into a membrane consisting of 70% phosphatidylcholine (PC), 20% phosphatidylethanolamine (PE), and 10% phosphatidylinositol (PI). STX17TM is shown in blue. Phosphorus in PC, PE and PI are indicated by yellow, cyan, and orange, respectively. Short-tailed lipids are represented as green sticks. The time evolution series are shown in Video 3. (F) Time evolution of the z-coordinate of the center of mass (z_cm) of the transmembrane helices of STX17TM in the case of membranes with PI. Five independent simulation results are represented by solid lines of different colors. The gray dashed lines indicate the locations of the lipid heads. A scale bar indicates 5 nm.
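For readers unfamiliar with the quantity plotted in panel F, the following sketch shows how a center-of-mass z-coordinate and a simple insertion criterion could be computed from per-frame coordinates. It is an illustration only: the function names, the insertion criterion (z_cm lying between the two planes of lipid heads), and the example numbers are assumptions and do not reproduce the authors' analysis pipeline.

```python
def center_of_mass_z(z_coords, masses):
    """Mass-weighted mean z-coordinate of a group of atoms in one frame."""
    total_mass = sum(masses)
    if total_mass <= 0:
        raise ValueError("total mass must be positive")
    return sum(z * m for z, m in zip(z_coords, masses)) / total_mass


def first_insertion_frame(z_cm_series, head_z_upper, head_z_lower):
    """Index of the first frame in which z_cm lies between the two planes
    of lipid heads (assumed criterion), or None if that never happens."""
    for frame, z_cm in enumerate(z_cm_series):
        if head_z_lower < z_cm < head_z_upper:
            return frame
    return None
```

With hypothetical lipid-head planes at z = +2 nm and z = -2 nm, a trajectory whose z_cm stays above +2 nm throughout would return None, which corresponds to the "approached but not inserted" outcome described for PI-containing membranes.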

      • The question that the authors posed in the beginning, i.e., why is STX17 recruited to the mature (closed) autophagosome but not to immature autophagic membranes, was not answered. The authors speculate that the seemingly gradual increase of negative charges in autophagic membranes is caused by an increase in PI4P. However, this was not supported by the PI4P fluorescence biosensor experiment that showed their distribution to the mature autophagosome only. Here, there are at least two possibilities: 1) The increase of negative charges in immature autophagic membranes is derived from PI4P. However the fluorescence biosensors do not bind there for some reason; for example, they are not sensitive enough to recognize PI4P until it reaches a certain level, or simply, their binding does not occur in a quantitative manner. 2) The negative charge in immature membranes is not derived from PI4P, and PI4P is generated abundantly only after autophagosomes are closed. In either case, it is not easy to explain why STX17 is recruited to the mature autophagosome only. For the first scenario, it is not clear how the PI4P synthesis is regulated so that it reaches a sufficient level only after the membrane closure. In the second case, the mechanism that produces PI4P only after the autophagosome closure needs to be elucidated (so, in this case, the question of the temporal regulation issue remains the same).

      We thank the reviewers for pointing this out. While the probe for weakly negative charges (1K8Q) labeled both immature and mature autophagosomes, the probes for intermediate charges (5K4Q and 3K6Q) and PI4P labeled only mature autophagosomes (Figure 2F, Figure 2–figure supplement 1B). Thus, we think that the autophagosomal membrane rapidly and drastically becomes negatively charged, and at the same time, PI4P is enriched. Although immature membranes may have weak negative charges, we did not examine which lipids contribute to the negative charges. Thus, we have added the following sentences to the Discussion part.

      “Our data of the 1K8Q probe suggest that immature autophagosomal membranes may also have slight negative charges (Figure 2E). Although the source of the negative charge of immature autophagosomes is currently unknown, it may be derived from low levels of PI4P, which is undetectable by the PI4P probes and/or other negatively charged lipids such as PI and PS (Schmitt et al., EMBO Rep, 2022).” (Lines 279–283) “In any case, it would be important to elucidate how PI 4-kinase activity or PI4P synthesis is upregulated during autophagosome maturation.” (Lines 302–303)

      Reviewer #3 (Public Review):

      Summary:

      In this study, the authors set out to address the question of how the SNARE protein Syntaxin 17 senses autophagosome maturation by being recruited to autophagosomal membranes only once autophagosome formation and sealing is complete. The authors discover that the C-terminal region of Syntaxin 17 is essential for its sensing mechanism that involves two transmembrane domains and a positively charged region. The authors discover that the lipid PI4P is highly enriched in mature autophagosomes and that electrostatic interaction of Syntaxin 17's positively charged region with PI4P drives recruitment specifically to mature autophagosomes. The temporal basis for PI4P enrichment and Syntaxin 17 recruitment to ensure that unsealed autophagosomes do not fuse with lysosomes is a very interesting and important discovery. Overall, the data are clear and convincing, with the study providing important mechanistic insights that will be of broad interest to the autophagy field, and also to cell biologists interested in phosphoinositide lipid biology. The authors' discovery also provides an opportunity for future research in which Syntaxin 17's C-terminal region could be used to target factors of interest to mature autophagosomes.

      Strengths:

      The study combines clear and convincing cell biology data with in vitro approaches to show how Syntaxin 17 is recruited to mature autophagosomes. The authors take a methodical approach to narrow down the critical regions within Syntaxin 17 required for recruitment and use a variety of biosensors to show that PI4P is enriched on mature autophagosomes.

      We would like to thank Reviewer #3 for the positive comments.

      Weaknesses:

      There are no major weaknesses, overall the work is highly convincing. It would have been beneficial if the authors could have shown whether altering PI4P levels would affect Syntaxin 17 recruitment. However, this is understandably a challenging experiment to undertake and the authors outlined their various attempts to tackle this question.

      We thank Reviewer #3 for pointing this out. Please see our above response to Reviewer #2 (Public Review).

      In addition, clear statements within the figure legends on the number of independent experimental repeats that were conducted for experiments that were quantitated are not currently present in the manuscript.

      As pointed out by Reviewer #3, we have added the number of independent experimental repeats in the figure legends.

      Reviewer #1 (Recommendations For The Authors):

      This paper is well written and all experiments were conducted with a high standard. Several minor issues should be addressed before final publication.

      (1) To further confirm the charge interaction, a charge screening experiment should be performed for Fig. 2A.

      We asked Reviewer #1 through the editor what this experiment meant and understood that it was to see the effects of high salt concentrations. We monitored the association of GFP-STX17TM with liposomes in the presence or absence of 1 M NaCl and found that it was blocked in a high-ionic-strength buffer. These data support the electrostatic interaction of STX17 with membranes. We have included these data in Figure 2B and added the following sentences to Lines 124–126.

      “The association of STX17TM with PI4P-containing membranes was abolished in the presence of 1 M NaCl (Figure 2B). These data suggest that STX17 can be recruited to negatively charged membranes via electrostatic interaction independent of the specific lipid species.”

      Author response image 3.

      GFP–STX17TM translated in vitro was incubated with rhodamine-labeled liposomes containing 70% PC, 20% PE and 10% PI4P in the presence of 1 M NaCl or 1.2 M sucrose. GFP intensities of liposomes were quantified and shown as in Figure 1C (n > 30).

      (2) The authors claim that "Autophagosomes become negatively charged during maturation", based on experiments using membrane charge probes. Since it's mainly about the membrane, it's better to refine the claim to "The membrane of autophagosomes becomes...", which would be more precise and closer to the topic of this paper.

      We would like to thank the reviewer for pointing this out. This point is valid. As recommended, we have corrected the phrase “Autophagosomes become negatively charged during maturation” to “The membrane of autophagosomes becomes negatively charged during maturation” (Lines 72, 118, 262, 969 (title of Figure 2), and 1068 (title of Figure 2–figure supplement 1)).

      (3) The authors should add more discussion regarding the "specificity" for recruiting Syn17 through the charge interaction. Particularly, how Syn17 could be maintained before the closure of autophagosomes? For the MD simulations in Fig. 5, the current results don't add much to the manuscript. The cell biology experiments have demonstrated the conclusion. The authors could try to find more details about the insertion by analyzing the simulation movies. Do membrane packing defects play a role during the insertion process? A similar analysis was conducted for alpha-synuclein (https://pubmed.ncbi.nlm.nih.gov/33437978/).

      Regarding the mechanism of STX17 maintenance in the cytosol, we do not think that other molecules, such as chaperones, are essential because purified recombinant mGFP-STX17TM used in this study is soluble. However, this does not rule out such a mechanism, which remains to be addressed in future studies.

      In the paper by Liu et al. (PMID: 33437978), small liposomes with diameters of 25–50 nm are used. Therefore, there are packing defects in the highly curved membranes, into which alpha-synuclein helices are inserted in a curvature-dependent manner. On the other hand, autophagosomes are much larger (~1 μm in diameter) and almost flat for STX17 molecules, so we think it is unlikely that STX17 recognizes the packing defect.

      Reviewer #2 (Recommendations For The Authors):

      • The two (and other) possibilities with regards to the interpretation of the negative charge/PI4P result in autophagic membranes are hoped to be discussed.

      As mentioned above, we have added the following sentences to the Discussion section. “Our data of the 1K8Q probe suggest that immature autophagosomal membranes may also have slight negative charges (Figure 2E). Although the source of the negative charge of immature autophagosomes is currently unknown, it may be derived from low levels of PI4P, which is undetectable by the PI4P probes and/or other negatively charged lipids such as PI and PS (Schmitt et al., EMBO Rep, 2022).” (Lines 279–283)

      “In any case, it would be important to elucidate how PI 4-kinase activity or PI4P synthesis is upregulated during autophagosome maturation.” (Lines 302–303)

      • Fluorescence biosensors are convenient to give an overview of the intracellular distribution of various lipids, but some of them show false-negative results. For example, evectin-2-PH for PS binds to endosomes but not to the plasma membrane, even though the latter contains abundant PS. With regards to PI4P, some biosensors illuminate both the Golgi and autophagosome, while others do not appear to bind the Golgi. Moreover, fluorescence biosensors for PI(3,5)P2 and PI(3,4)P2, which are also candidates for the STX17 insertion issue, are less reliable than others (e.g., those for PI3P and PI(4,5)P2). These problems need to be considered.

      We agree with Reviewer #2 that fluorescence biosensors are not perfect for detecting specific lipids. Based on the Reviewer’s suggestion, we have included a comment on this in the Discussion section as follows (Lines 265–268).

      “Given the possibility that fluorescence lipid probes may give false-negative results, a more comprehensive biochemical analysis, such as lipidomics analysis of mature autophagosomes, would be imperative to elucidate the potential involvement of other negatively charged lipids.”

      • A negative control for the PI4P biosensor, i.e., a mutant lacking the PI4P binding ability, should be tested to confirm the presence of PI4P in autophagosomes.

      We would like to thank the Reviewer for this comment. We conducted the suggested experiment and confirmed that the CERT(PHD)(W33A) mutant, which is deficient for PI4P binding (Sugiki et al., JBC. 2012), was diffusely present in the cytosol and did not localize to STX17-positive autophagosomes. This data supports our conclusion that PI4P is indeed present in autophagosomes. We have included this data in Figure 3–figure supplement 2A and explained it in the text (Lines 164–166).

      Author response image 4.

      Mouse embryonic fibroblasts (MEFs) stably expressing GFP–CERT(PHD)(W33A) and mRuby3–STX17TM were cultured in starvation medium for 1 h. Bars indicate 10 μm (main images) and 1 μm (insets).

      • As a control for the molecular dynamics simulation study, STX17 TM insertion into a membrane containing other negatively charged lipids, especially PI, needs to be tested. PI is a negatively charged lipid that is likely to exist in autophagic membranes (as suggested by the authors' past study).

      We thank the reviewers for this suggestion. As mentioned above (Reviewer #2, Public Review), we performed the molecular dynamics simulation using membranes containing PI and added the results in Figure 5E and F and Video 3.

      • If the putative role of PI4P could be shown in the cellular context, the authors' conclusion would be much strengthened. I wonder if overexpression of PI4P fluorescence biosensors, especially those that appear to bind to the autophagosome almost exclusively, may suppress the recruitment of STX17 there.

      We would like to thank the Reviewer for asking this question. In MEFs stably overexpressing PI4P probes driven by the CMV promoter, STX17 recruitment was not affected. Thus, simple overexpression of PI4P probes does not appear to be effective in masking PI4P in autophagosomes.

      Another idea is to use an appropriate molecule (e.g., WIPI2, ATG5) and to recruit Sac1 to autophagic membranes by using the FRB-FKBP system or the like. I hope these and other possibilities will be tested to confirm the importance of PI4P in the temporal regulation of STX17 recruitment.

      We tried the FRB-FKBP system using the phosphatase domain of yeast Sac1 fused to FKBP and LC3 fused to FRB, but unfortunately, this system failed to deplete PI4P from the autophagosomal membrane.

      Reviewer #3 (Recommendations For The Authors):

      A few areas for suggested improvement are:

      (1) It would be helpful if the authors could clarify for all figures how many independent experiments were conducted for all experiments, particularly those that have quantitation and statistical analyses.

      As pointed out by Reviewer #3, we have added the number of independent experimental repeats in the figure legends.

      The authors made several attempts to modulate PI4P levels on autophagosomes although understandably this proved to be challenging. A couple of suggestions are provided to address this area:

      (2) Given the reported role of GABARAPs in PI4K2a recruitment and PI4P production on autophagosomes, as well as autophagosome-lysosome fusion (Nguyen et al (2016) J Cell Biol) it would be worthwhile to assess whether GABARAP TKO cells have reduced PI4P and reduced Stx17 recruitment

      According to the Reviewer’s suggestion, we examined the localization of STX17 TM and the PI4P probe CERT(PHD) in ATG8 family (LC3/GABARAP) hexa KO HeLa cells that were established by the Lazarou lab (Nguyen et al., JCB 2016). As in WT cells, STX17 TM and CERT(PHD) were still colocalized with each other in hexa KO cells, suggesting that neither STX17 recruitment nor PI4P enrichment depends on ATG8 family proteins (note: the size of autophagosomes in HeLa cells is smaller than in MEFs, making it difficult to observe autophagosomes as ring-shaped structures). We have included this result in Figure 3–figure supplement 2(F) and explained it in the text (Lines 194–196, 198).

      Author response image 5.

      (F) WT and ATG8 hexa KO HeLa cells stably expressing GFP–STX17TM and transiently expressing mRuby3–CERT(PHD) were cultured in starvation medium. Bars indicate 10 μm (main images) and 1 μm (insets).

      (3) Can the authors try fusing Sac1 to one of the PI4P probes (CERT(PHD)) that were used, or alternatively to the c-terminus of Syntaxin 17? This approach would help to recruit Sac1 only to mature autophagosomes and could therefore prevent the autophagosome formation defect observed when fused to LC3B that targeted Sac1 to autophagosomes as they were forming. Understandably, this approach might seem a bit counterintuitive since the phosphatase is removing PI4P which is what is recruiting it but it could be a viable approach to keep PI4P levels low enough on mature autophagosomes so that Syntaxin 17 is no longer recruited. A Sac1 phosphatase mutant might be needed as a control.

      We would like to thank the Reviewer for these suggestions. We tried the phosphatase domain of yeast Sac1 or human SAC1 fused with STX17TM, but unfortunately, these fusion proteins did not deplete PI4P from autophagosomes.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Review:

      Reviewer #1:

      (1) To support the finding that texture is not represented in a modular fashion, additional possibilities must be considered. These include (a) the effectiveness and specificity of the texture stimulus and control stimuli, (b) further analysis of possible structure in images that may have been missed, and (c) limitations of imaging resolution.

      Thank you for your comments. To address your concerns, we have conducted a new 3T fMRI experiment to demonstrate the effectiveness and specificity of our stimuli, performed further analyses to investigate possible structure of texture-selective activation, and discussed the limitations of imaging resolution.

      (a) To demonstrate the effectiveness and specificity of our stimuli, we conducted a new 3T fMRI experiment in five participants using an experimental design and texture families similar to those in Freeman (2013). Six texture stimuli in the 7T experiment were also included. To assess the effectiveness of each stimulus type, different texture families and their corresponding noise patterns were presented in separate blocks for 24 seconds, at a high presentation rate of 5 frames per second. In Figure S7, all texture families showed significantly stronger activation in V2 compared to their corresponding noise patterns, even for those that ‘appeared’ to have residual texture (e.g., the third texture family). These results demonstrate that our texture vs. noise stimuli were effective in producing texture-selective activations in area V2. Compared to the 7T results, the 3T data showed a notable increase in texture-selective activations in V2, likely due to increased stimulus presentation speed (1.25 vs. 5 frames/second). Future studies should use stimuli with faster presentation speed to validate our results in the 7T experiment.

      (b) Thank you for pointing out the possible structures of texture-selective activations in the peripheral visual field (Figure S1). In further analyses, we also found stronger texture selectivity in more peripheral visual fields (Figure 2D), and there were weak but significant correlations in the texture-noise activation patterns during split-half analysis (Author response image 2). Although this is not strong evidence for columnar organization of naturalistic textures, it suggests a possibility for modular organizations in the peripheral visual field.

      (c) Although our fMRI result at 1-mm isotropic resolution did not show strong evidence for modular processing of naturalistic texture in V2 stripe columns, this does not exclude the possibility that smaller modules exist beyond the current fMRI resolution. We have discussed this possibility in the revised manuscript.

      We hope this response clarifies our findings, and we have revised the conclusions in the manuscript accordingly.

      (2) More in-depth analysis of subject data is needed. The apparent structure in the texture images in peripheral fields of some subjects calls for more detailed analysis. e.g Relationship to eccentricity and the need for a 'modularity index' to quantify the degree of modularity. A possible relationship to eccentricity should also be considered.

      Based on your recommendations, we have performed further analysis and found interesting results regarding the modularity index in relation to eccentricity. As shown in Figure 2D, the texture-selectivity index increased with eccentricity. This may suggest a higher possibility of modular organization for texture representation in the peripheral compared to central visual fields. We have updated our results in Figure 2C, and discussed this possibility in the revised manuscript.

      (3) Given what is known as a modular organization in V4 and V3 (e.g. for color, orientation, curvature), did images reveal these organizations? If so, connectivity analysis would be improved based on such ROIs. This would further strengthen the hierarchical scheme.

      Following your recommendations, we have conducted further analysis to investigate the potential modular organizations in V4 and V3ab. Figure S9 shows the vertices most responsive to color, disparity, and texture in a representative subject. Indeed, texture-selective patches can be found in both V4 and V3ab, along with the color- and disparity-selective patches. We agree with you that there should be pathway-specific connectivity among the same type of functional modules. In the informational connectivity analyses, we already used highly informative voxels by feature selection, which should mainly represent information from the modular organizations in these higher visual areas.

      Reviewer #2:

      (1) In lines 162-163, it is stated that no clear columnar organization exists for naturalistic texture processing in V2. In my opinion, this should be rephrased. As far as I understand, Figure 2B refers to the analysis used to support the conclusion. The left and middle bar plots only show a circular analysis since ROIs were based on the color and disparity contrast used to define thin and thick stripes. The interesting graph is the right plot, which shows no statistically significant overlap of texture processing with thin, thick, and pale stripe ROIs. It should be pointed out that this analysis does not dismiss a columnar organization per se but instead only supports the conclusion of no coincidence with the CO-stripe architecture.

      Thank you for your suggestions. Reviewer #1 also raised a similar concern. We agree that there may be a smaller functional module of textures in area V2 at a finer spatial scale than our fMRI resolution. We have rephrased our conclusions to be more precise.

      (2) In Figure 3, cortical depth-dependent analyses are presented for color, disparity, and texture processing. I acknowledge that the authors took care of venous effects by excluding outlier voxels. However, the GE-BOLD signal at high magnetic fields is still biased to extravascular contributions from around larger veins. Therefore, the highest color selectivity in superficial layers might also result from the bias to draining veins and might not be of neuronal origin. Furthermore, it is interesting that cortical profiles with the highest selectivity in superficial layers show overall higher selectivity across cortical depth. Could the missing increase toward the pial surface in other profiles result from the ROI definition or overall smaller signal changes (effect size) of selected voxels? At least, a more careful interpretation and discussion would be helpful for the reader.

      We agree with you that there will be residual venous effects even after removing voxels containing large veins. However, calculating the selectivity index largely removed the superficial bias (Figure 3). In the revised manuscript, we discussed the limitations of cortical depth-dependent analysis using GE-BOLD fMRI.

      In Line 397-403: “Due to the limitations of the T2*w GE-BOLD signal in its sensitivity to large draining veins (Fracasso et al., 2021; Parkes et al., 2005; Uludag & Havlicek, 2021), the original BOLD responses were strongly biased towards the superficial depth in our data (Figure S8). Compared to GE-BOLD, VASO-CBV and SE-BOLD fMRI techniques have higher spatial specificity but much lower sensitivity (Huber et al., 2019). As shown in a recent study (Qian et al., 2024), using differential BOLD responses in a continuous stimulus design can significantly enhance the laminar specificity of the feature selectivity measures in our results (Figure 3).”

      It is unlikely that the strongest color selectivity index in the superficial depth is a result of stronger signal change or larger effect size in this condition. As shown by the original BOLD responses in Figure S8, all stimulus conditions produced robust activations that were strongly biased toward the superficial depth. High texture selectivity was also found in V4 and V3ab across cortical depth, which showed a flat laminar profile.

      (3) I was slightly surprised that no retinotopy data was acquired. The ROI definition in the manuscript was based on a retinotopy atlas plus manual stripe segmentation of single columns. Both steps have disadvantages because they neglect individual differences and are based on subjective assessment. A few points might be worth discussing: (1) In lines 467-468, the authors state that V2 was defined based on the extent of stripes. This classical definition of area V2 was questioned by a recent publication (Nasr et al., 2016, J Neurosci, 36, 1841-1857), which showed that stripes might extend into V3. Could this have been a problem in the present analysis, e.g., in the connectivity analysis? (2) The manual segmentation depends on the chosen threshold value, which is inevitably arbitrary. Which value was used?

      A previous study showed that the retinotopic atlas of early visual areas (V1-V3) aligned very well across participants on the standard surface after surface-based registration by the anatomical landmarks (Benson 2018). Thus, the group-averaged atlas should be accurate in defining the boundaries of early visual areas. To directly demonstrate the accuracy of this method, retinotopic data were acquired in five participants in a 3T fMRI experiment. A phase-encoded method was used to define the boundaries of early visual areas (black lines in Author response image 1), which were highly consistent with the Benson atlas.

      Although a few feature-selective stripes may extend into V3, these stripe patterns were mainly represented in V2. Thus, the signal contribution from V3 is likely to be small and should not affect the pattern of results. The activation map threshold for manual segmentation was abs(T)>2. We have clarified this in the revised methods.

      Author response image 1.

      Retinotopic ROIs defined by the Benson atlas (left) and the polar angle map (right) of the representative subject. Black lines denote the boundaries of early visual areas based on the retinotopic map from the subject.

      Benson, N. C., Jamison, K. W., Arcaro, M. J., Vu, A. T., Glasser, M. F., Coalson, T. S., Van Essen, D. C., Yacoub, E., Ugurbil, K., Winawer, J., & Kay, K. (2018). The Human Connectome Project 7 Tesla retinotopy dataset: Description and population receptive field analysis. J Vis, 18(13), 23. https://doi.org/10.1167/18.13.23
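
      The activation-map threshold mentioned above, abs(T) > 2, amounts to a simple mask over the T map. The following is a minimal sketch of that step only; the function name and the example T values are hypothetical and are not taken from the study, and the manual stripe segmentation that follows thresholding is not represented.

```python
import numpy as np


def suprathreshold_mask(t_map, threshold=2.0):
    """Return a boolean mask of vertices whose |T| exceeds the threshold."""
    t_map = np.asarray(t_map, dtype=float)
    return np.abs(t_map) > threshold


# Hypothetical T values for six vertices (illustration only, not real data).
example_t = np.array([-3.1, -1.5, 0.0, 1.9, 2.0, 4.2])
mask = suprathreshold_mask(example_t)
```

      Because the comparison is strict, a vertex with abs(T) exactly equal to 2 is excluded in this sketch; whether the boundary value was included in the actual analysis is not stated.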

      (4) The use of 1-mm isotropic voxels is relatively coarse for cortical depth-dependent analyses, especially in the early visual cortex, which is highly convoluted and has a small cortical thickness. For example, most layer-fMRI studies use a voxel size of around isotropic 0.8 mm, which has half the voxel volume of 1 mm isotropic voxels. With increasing voxel volume, partial volume effects become more pronounced. For example, partial volume with CSF might confound the analysis by introducing pulsatility effects.

      We agree that a 1-mm isotropic voxel is much larger in volume than a 0.8-mm isotropic voxel, but the difference in resolution along the cortical depth is small. In addition to our study, previous studies have shown that fMRI at 1-mm isotropic resolution is capable of resolving cortical depth-dependent signals (Roefs et al., 2024; Shao et al., 2021). We have discussed these fMRI resolution issues in the revised manuscript.

      In Line 403-408: “Compared to the submillimeter voxels, as used in most laminar fMRI studies, our fMRI resolution at 1-mm isotropic voxel may have a stronger partial volume effect in the cortical depth-dependent analysis. However, consistent with our results, previous studies have also shown that 7T fMRI at 1-mm isotropic resolution can resolve cortical depth-dependent signals in human visual cortex (Roefs et al., 2024; Shao et al., 2021).”

      Shao, X., Guo, F., Shou, Q., Wang, K., Jann, K., Yan, L., Toga, A. W., Zhang, P., & Wang, D. J. J. (2021). Laminar perfusion imaging with zoomed arterial spin labeling at 7 Tesla. NeuroImage, 245, 118724. https://doi.org/10.1016/j.neuroimage.2021.118724

      Roefs, E. C., Schellekens, W., Báez-Yáñez, M. G., Bhogal, A. A., Groen, I. I., van Osch, M. J., ... & Petridou, N. (2024). The Contribution of the Vascular Architecture and Cerebrovascular Reactivity to the BOLD signal Formation across Cortical Depth. Imaging Neuroscience, 2, 1–19.
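
      The voxel-size comparison in point (4) can be checked with simple arithmetic: volume scales with the cube of the edge length, whereas the sampling distance along any single axis, including cortical depth, scales linearly. The short sketch below illustrates this; the function name is hypothetical.

```python
def voxel_volume_mm3(edge_mm):
    """Volume of an isotropic voxel with the given edge length in mm."""
    return edge_mm ** 3


vol_small = voxel_volume_mm3(0.8)   # 0.8-mm isotropic voxel
vol_large = voxel_volume_mm3(1.0)   # 1-mm isotropic voxel

# Ratio of volumes: roughly one half, as stated by the reviewer.
volume_ratio = vol_small / vol_large

# Ratio of edge lengths: the one-dimensional sampling distance differs by 20%.
edge_ratio = 0.8 / 1.0
```

      The volume ratio is about 0.51 while the edge ratio is 0.8, which is consistent with both statements: the voxel volume roughly halves, but the resolution along the cortical depth changes much less.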

      (5) The SVM analysis included a feature selection step stated in lines 531-533. Although this step is reasonable for the training of a machine learning classifier, it would be interesting to know if the authors think this step could have reintroduced some bias to draining vein contributions.

      We excluded vertices with extremely large signal change and their corresponding voxels in the gray matter when defining ROIs. The same number of voxels were selected from each cortical depth for the SVM analysis, thus there was no bias in the number of voxels from the superficial layers susceptible to large draining veins.
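
      The idea of selecting the same number of voxels from each cortical depth can be sketched as follows. This is only an illustration under stated assumptions: the ranking score, the depth labels, and the function name are hypothetical, and the actual feature-selection criterion used for the SVM analysis is not specified in this response.

```python
import numpy as np


def select_equal_per_depth(scores, depth_labels, n_per_depth):
    """Pick the n_per_depth highest-scoring voxels within each depth bin.

    scores       : 1-D array, one (hypothetical) informativeness score per voxel
    depth_labels : 1-D array of the same length, depth bin of each voxel
    Returns the sorted indices of the selected voxels.
    """
    scores = np.asarray(scores, dtype=float)
    depth_labels = np.asarray(depth_labels)
    selected = []
    for depth in np.unique(depth_labels):
        idx = np.flatnonzero(depth_labels == depth)
        if idx.size < n_per_depth:
            raise ValueError(f"depth bin {depth!r} has fewer than {n_per_depth} voxels")
        # Rank voxels within this depth bin from highest to lowest score.
        order = np.argsort(scores[idx])[::-1]
        selected.append(idx[order[:n_per_depth]])
    return np.sort(np.concatenate(selected))


# Synthetic example: 30 voxels, 10 per depth bin (not real data).
rng = np.random.default_rng(0)
scores = rng.random(30)
depth_labels = np.repeat(["deep", "middle", "superficial"], 10)
chosen = select_equal_per_depth(scores, depth_labels, n_per_depth=4)
```

      Ranking within each depth bin, rather than across the whole ROI, is what keeps the voxel count balanced across depths even if the scores themselves are larger near the pial surface.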

      Reviewer #3:

      The authors tend to overclaim their results.

      Re: Thank you for your comments. We added more control analyses to strengthen our findings, and provided a more measured discussion of the results.

      Recommendations for the authors:

      Reviewer #1:

      (1) Controls: There is a bit more complexity than is expressed in the introduction. The authors hypothesize that the emergence of computational features such as texture may be reflected in specialized columns. That is, if texture is generated in V2, there may be texture columns (perhaps in the pale stripes of V2); but if generated at a higher level, then no texture columns would be needed. This is a very interesting and fundamental hypothesis. While there may be merit to this hypothesis, the demonstration that color and disparity are modular but not texture falls short of making a compelling argument. At a minimum, the finding that texture is not organized in V2 requires additional controls. (a) To boost the texture signal, additional texture stimuli or a sequence of multiple texture stimuli per trial could be considered. (b) Unfortunately, the comparison noise pattern also seems to contain texture; perhaps a less textured control could be designed. (c) It also appears that some of the texture images in Supplementary Figure S1 contain possible structure, e.g. in more peripheral visual fields. (d) Is it possible that the current imaging resolution is not sufficient for revealing texture domains? (e) Note that 'texture' may be a property that defines surfaces and not contours. Thus, while texture may have orientation content, its function may be associated with the surface processing pathways. A control stimulus might contain oriented elements of a texture stimulus that do not elicit texture percept; such a control might activate pale and/or thick stripes (both of which contain orientation domains), while the texture percept stimulus may activate surface-related bands in V4.

      Thank you for your suggestions. They are extremely helpful in improving our manuscript. For the controls you mentioned in (a-d), we discussed them in the public review that we also attached below.

      (a) and (b): To demonstrate the effectiveness and specificity of our stimuli, we conducted a new 3T fMRI experiment in five participants using an experimental design and texture families similar to those in Freeman (2013). All texture stimuli in the 7T experiment were also included. To assess the effectiveness of each stimulus type, different texture families and their corresponding noise patterns were presented in separate blocks for 24 seconds, at a high presentation rate of 5 frames per second. In Figure S7, all texture families showed significantly stronger activation in V2 compared to their corresponding noise patterns, even for those that ‘appeared’ to have residual texture (e.g., the third texture family). These results suggest that our texture stimuli were effective in producing texture-selective activations in area V2 compared to the noise control. Compared to the 7T results, the 3T data showed a notable increase in texture-selective activations in V2, likely due to the increased stimulus presentation speed (1.25 vs. 5 frames/second). Weak texture activations might preclude the detection of columnar representations in the 7T experiment.

      (c) Thank you for pointing out the possible structures of texture-selective activations in the peripheral visual field (Figure S1). In further analyses, we also found stronger texture selectivity in more peripheral visual fields (Figure 2D), and there were weak but significant correlations in the texture-noise activation patterns during split-half analysis (Author response image 2). Although these findings are not strong evidence for columnar organization of naturalistic textures, they suggest a possibility for such organizations in the peripheral visual field.

      (d) Although our fMRI result at 1-mm isotropic resolution did not show strong evidence for modular processing of naturalistic texture in V2 stripe columns, this does not exclude the possibility that smaller modules exist beyond the current fMRI resolution. We have discussed these limitations in the revised manuscript.

      We fully agree with your explanation in (e). It fits our data very well. Both texture and control stimuli strongly activated the CO-stripes (Figure 2 and Figure 2D), while modular organizations for texture were found in V4 and V3ab (Figure S9). We have discussed this explanation in the revised manuscript.

      In Line 371-374: “Consistently, our pilot results also revealed modular organizations for textures in V4 and V3ab (Figure S9). These texture-selective organizations may be related to surface representations in these higher order visual areas (Wang et al., 2024).”

      (2) Overly simple description of FF, FB circuitry. The classic anatomical definition of feedforward is output from a 'lower' area, in most cases predominantly arising from superficial layers and projecting to middle layers of a 'higher area' (Felleman and Van Essen 1991). This description holds for V1-to-V2, V2-to-V3, and V2-to-V4. [Note there are also feedforward projections from central 5 degrees of V1-to-V4 (cf. Ungerleider) as well as V3-to-V4.] The definition of feedback can be more varied but is generally considered from cells in superficial and deep layers of 'higher' areas projecting to superficial and deep layers of 'lower' areas. Feedback inputs to V1 heavily innervate Layer 1 and superficial Layer 2, as well as the deep layers. Note that feedback connections from V2 to V1, similar to that from V1 to V2, are functionally specific, i.e. thin-to-blob and pale/thick-to interblob (Federer...Angelucci 2021, Hu...Roe 2022). Thus, current views are moving away from the dogma that feedback is diffuse. Recognition that feedback may be modular introduces new ideas about analysis.

      Thank you for your detailed recommendations. We have expanded the discussion of circuit models of functional connectivity in the introduction. Our model and experiments primarily aim to investigate how higher-level areas provide feedback to area V2. While we acknowledge that feedback may indeed be functionally specific, our methodology has certain advantages: it ensures signal stability and avoids the double-dipping issue. In the functional connectivity analysis, we performed feature selection to use the most informative voxels. These voxels, which have high feature selectivity, should already be included in the modular organizations of early visual areas. Identifying functionally specific feedback connections between modular areas will be important work for future research. We have added a discussion of this topic in the revised manuscript.

      In Line 136-138: “Only major connections were shown here. There are also other connections, such as V1 interblobs projecting to thick stripes (Federer et al., 2021; Hu & Roe, 2022; Sincich and Horton, 2005).”

      (3) Imaging superficial layers: Although removal of the top layer of cortical voxels (top 5% of voxels) is a common method for dealing with surface vascular artifact contribution to BOLD signal, it likely removes a portion of the Layer 1&2 feedback signals. Is this why the authors define feedback and deep layer to deep layer? If so, both superficial and deep-layer data in Figure 4 should be explicitly explained and discussed.

      Thank you for pointing this out. We would like to clarify the surface-based method for removing vascular artifacts. The vertices influenced by large pial veins were first defined on the cortical surface, and then voxels were removed from the entire columns corresponding to these vertices to avoid sampling bias along the cortical depth. Thus, there should be complete data from all cortical depths for the remaining columns. We defined the feedback connectivity from deep layers to deep layers because it represents strong feedback connections according to the literature (Markov et al., 2013; Ullman, 1995) and also avoids confounds from feedforward signals in the superficial layers.

      Markov, N. T., Vezoli, J., Chameau, P., Falchier, A., Quilodran, R., Huissoud, C., Lamy, C., Misery, P., Giroud, P., Ullman, S., Barone, P., Dehay, C., Knoblauch, K., & Kennedy, H. (2014). Anatomy of hierarchy: feedforward and feedback pathways in macaque visual cortex. The Journal of comparative neurology, 522(1), 225–259. https://doi.org/10.1002/cne.23458

      Ullman S. (1995). Sequence seeking and counter streams: a computational model for bidirectional information flow in the visual cortex. Cerebral cortex, 5(1), 1–11. https://doi.org/10.1093/cercor/5.1.1

      (4) More detail on other subjects in Figure S1. Ten subjects conducted visual fixation and used a bite bar. Imaging data are illustrated in detail from one subject and the remaining subjects are depicted in graphs and in Supplemental Figure S1. Please provide arrowheads in each image to help guide the reader. Some kind of summary or index of modularity would also be helpful.

      Thanks for your suggestions. There are arrowheads in each image in our original manuscript and we have revised Figure S1 for better illustration. Additionally, we have added a table summarizing the number of stripes to provide a clearer overview.

      (5) How are ROIs in V3ab and V4 defined? V2 ROIs were defined (thin, thick, and pale stripe), but V3ab and V4 averaged across the whole area. Why not use the most activated "domains" from V3ab and V4? How does this influence connectivity analysis?

      Thank you for your question. We defined V4 and V3ab on the cortical surface using a retinotopic atlas (Benson 2018), which has been shown to be quite accurate in defining ROIs for the early visual areas. Since all ‘domains’ showed robust BOLD activation to our stimuli, we used voxels from the entire ROI in the depth-dependent analysis. In the functional connectivity analysis, we used the most informative voxels by feature selection, which should already be included in the feature domains.

      Minor:

      English language editing is needed.

      Thank you for your feedback. We have carefully revised the manuscript for clarity and readability.

      Line 31 "its" should be "their".

      Thank you. We have corrected "its" to "their".

      Replace 'representative subject' with 'subject'.

      We have replaced "representative subject" with "subject" in the manuscript.

      Replace 'naturalistic texture' with 'texture'.

      Thank you for your suggestion. The textures used in our experiment were generated based on the algorithm by Portilla and Simoncelli (2000), and the term "naturalistic texture" was used to be consistent with literature. The textures used in our study are different from traditional artificial textures, as they contain higher-order statistical dependencies. Following your recommendations, we have replaced ‘naturalistic texture’ with ‘texture’ in some places in the main text to improve readability.

      Typo: Line 126, Fig 2B should be 1B.

      Thank you. We have corrected "Fig 2B" to "Fig 1B" in Line 128.

      Fig. 2A: point out where are texture domains in anterior V2.

      The texture-selective activations in anterior V2 (corresponds to peripheral visual field) have been highlighted by arrowheads.

      Fig 2B, 3 legend: Round symbols are for each subject?

      Yes, the round symbols in Figure 2B represent data for individual participants. We have revised the legend for clarity.

      Fig. 3: Disparity and texture values do not look different across depth (except may the V2 texture values).

      While the differences in feature selectivity across cortical depths are small, they are highly consistent across participants. We have provided a figure showing the original BOLD responses in the revised manuscript (Figure S8). Data from individual subjects are also available at Open Science Framework (OSF, https://doi.org/10.17605/OSF.IO/KSXT8 (‘rawBetaValues.mat’ in the data directory)).

      Line 57-59 The statement is not strictly accurate. V1 also has color, orientation, and motion representations.

      Thank you for your feedback. Our statement was intended to convey that M and P information from the geniculate input are transformed into representations of color, orientation, disparity, and motion in the primary visual cortex. We have clarified this point in the revised manuscript.

      In Line 58-60: “In the primary visual cortex (V1), the M and P information from the geniculate input are transformed into higher-level visual representations, such as motion, disparity, color, orientation, etc. (Tootell & Nasr, 2017).”

      Fig. 1B V1 interblobs also project to thick stripes (Sincich and Horton).

      Thank you for the additional information. We appreciate your input. Our figure is intended as a simplified schematic and does not fully represent all the connections. We have discussed this reference in the revised manuscript.

      In Line 136-138: “Only major connections were shown here. There are also other connections, such as V1 interblobs projecting to thick stripes (Federer et al., 2021; Hu & Roe, 2022; Sincich and Horton, 2005).”

      Line 207 "suggesting that both local and feedforward connections are involved in processing color information in area V2." Logic? English?

      Thank you for pointing this out. The superficial layers are involved in local intracortical processing by lateral connections and also send output to higher order visual areas along the feedforward pathway. Thus, the strongest color selectivity in the superficial depth of V2 supports that color information was processed in local neural circuits in area V2 and transmitted to higher order areas along the feedforward pathway. We have revised the manuscript for clarity.

      In Line 241-245: “According to the hierarchical model, the strongest color selectivity in the superficial cortical depth is consistent with the fact that color blobs locate in the superficial layers of V1 (Figure 1B, Felleman & Van Essen, 1991; Hubel & Livingstone, 1987; Nassi & Callaway, 2009). The strongest color selectivity in superficial V2 suggests that both local and feedforward connections are involved in processing color information (Figure 1C).”

      Line 254 "Laminar". Please use "cortical depth" or explicitly state that 'laminar' refers to superficial, middle, and deep as defined by cortical depth.

      Thank you for your suggestion. We have clarified the term "laminar" in the manuscript as referring to superficial, middle, and deep layers as defined by cortical depth.

      In Line 96-99: “To better understand the mesoscale functional organizations and neural circuits of information processing in area V2, the present study investigated laminar (or cortical depth-dependent) and columnar response profiles for color, disparity, and naturalistic texture in human V2 using 7T fMRI at 1-mm isotropic resolution.”

      Fig. S5 Please add a unit of isoluminance.

      Thank you for your suggestion. Supplementary Figure S10A and S10B illustrate the blue-matched luminance levels in RGB index. In our isoluminance experiment, blue was set as the reference color (RGB [0 0 255]) to measure the red and gray isoluminance.

      Line 448-449 To make this rationale clearer, refer to:

      Wang J, Nasr S, Roe AW, Polimeni JR. 2022. Critical factors in achieving fine‐scale functional MRI: Removing sources of inadvertent spatial smoothing. Human Brain Mapping. 43:3311-3331.

      Thank you for your suggestion. We have added this reference to better support the rationale of data analysis.

      Reviewer #2:

      (1) Line 126 should refer to Figure 1B.

      Thank you. We have corrected the reference in the revised manuscript as Figure 1B.

      (2) Even if only one naturalistic texture session was acquired per participant, it might be interesting to see the within-session repeatability by, e.g., splitting the texture runs into two halves.

      Thank you for your suggestion. We performed a split-half correlation analysis for participants who completed 10 runs in the naturalistic texture session. The result from one representative subject was shown in the figure below (for other participants, r = 0.38, 0.38, 0.24, and 0.23, respectively).

      Author response image 2.

      Split-half correlations for the texture-selective activation maps in a representative subject (S01) in V2.
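
      A split-half reliability check of this kind can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names, the first-half versus second-half split, and the toy values are assumptions, since the response does not state how the runs were divided.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def mean_map(runs):
    """Average a list of per-run maps voxel by voxel."""
    return [sum(values) / len(runs) for values in zip(*runs)]


def split_half_correlation(run_maps):
    """Correlate the mean map of the first half of the runs with the second half.

    run_maps: list of per-run selectivity maps, one value per voxel.
    Assumption: halves are taken in acquisition order.
    """
    half = len(run_maps) // 2
    return pearson_r(mean_map(run_maps[:half]), mean_map(run_maps[half:]))


# Toy data (hypothetical): 4 runs, 5 voxels each.
toy_runs = [
    [0.1, 0.5, 0.9, 0.2, 0.4],
    [0.2, 0.4, 1.0, 0.1, 0.5],
    [0.0, 0.6, 0.8, 0.3, 0.3],
    [0.1, 0.5, 1.1, 0.2, 0.6],
]
r = split_half_correlation(toy_runs)
```

      An odd/even split of runs is a common alternative that reduces the influence of slow drifts across the session.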

      (3) Unfortunately, Figure S2 only shows the stripe ROIs but not V3ab or V4 ROIs. Including another figure that shows all ROIs in more detail would be interesting.

      Thank you for your suggestion. We have included a figure showing the ROIs for V4 and V3ab (the black dotted lines in Figure S9).

      (4) It would be helpful for the reader to have a more detailed discussion about methodological limitations, including the unspecificity of the GE-BOLD signal (Engel et al., 1997, Cereb Cortex, 7, 181-192; Parkes et al., 2005, MRM, 54, 1465-1472; Fracasso et al., 2021, Prog Neurobiol, 202, 102187) and the used voxel sizes.

      Thank you for your suggestion. We have added a more detailed discussion about the methodological limitations, including the unspecificity of the GE-BOLD signal and the voxel sizes used.

      In Line 397-408: “Due to the limitations of the T2*w GE-BOLD signal in its sensitivity to large draining veins (Fracasso et al., 2021; Parkes et al., 2005; Uludag & Havlicek, 2021), the original BOLD responses were strongly biased towards the superficial depth in our data (Figure S8). Compared to GE-BOLD, VASO-CBV and SE-BOLD fMRI techniques have higher spatial specificity but much lower sensitivity (Huber et al., 2019). As shown in a recent study (Qian et al., 2024), using differential BOLD responses in a continuous stimulus design can significantly enhance the laminar specificity of the feature selectivity measures in our results (Figure 3). Compared to the submillimeter voxels, as used in most laminar fMRI studies, our fMRI resolution at 1-mm isotropic voxel may have a stronger partial volume effect in the cortical depth-dependent analysis. However, consistent with our results, previous studies have also shown that 7T fMRI at 1-mm isotropic resolution can resolve cortical depth-dependent signals in human visual cortex (Roefs et al., 2024; Shao et al., 2021).”

      (5) If I understand correctly, different numbers of runs/sessions were acquired for different subjects. It would be good to discuss if this could have impacted the results, e.g., different effect sizes could have biased the manual ROI definition.

      Thank you for your suggestion. Although there were differences in the number of runs/sessions acquired for different subjects, there were at least four runs of data for each experiment, which should be enough to examine the within-subject effect. We have discussed this point in the revised manuscript.

      In Line 481-484: “Although the number of runs was not equal across participants, there were at least four runs (twenty blocks for each stimulus condition) of data in each experiment, which should be sufficient to investigate within-subject effects.”

      (6) It would be good to add the software used for layer definition. Was it Laynii?

      We have provided more details in the revised methods.

      In Line 523-526: “An equi-volume method was used to calculate the relative cortical depth of each voxel to the white matter and pial surface (0: white matter surface, 1: pial surface, Supplementary Figure S11A), using mripy (https://github.com/herrlich10/mripy).”

      (7) It would be interesting to see (at least for one subject) the contrasts of color-selective thin stripes and disparity-selective thick stripes from single sessions to demonstrate the repeatability of measurements.

      Thank you for your suggestion. We have shown the test-retest reliability of the response pattern of color-selective thin stripes and disparity-selective thick stripes in a representative subject in Figure S5.

      (8) By any chance, do the authors also have resting-state data from the same subjects? It would be interesting to see the connectivity analysis between stripes and V3ab, V4 with resting-state data.

      Thank you for your suggestion. Unfortunately, we do not have resting-state data from the same subjects at this time. We agree with you that layer-specific connectivity analysis with resting-state data is very interesting and worth investigating in future studies.

      Reviewer #3:

      (1) For investigating information flow across areas, the authors rely on layer-specific informational connectivity analyses, which is an exciting approach. Covariation in decoding accuracy for a specific dependent variable between the superficial layers of a lower area and the middle layer of a higher area is taken as evidence for feedforward connectivity, whereas FB was defined as the connection between the two deep layers. Yet this method is not assumption-free. For example, the canonical idea (Figure 1C) of FF terminals exclusively arriving in layer 4 and FB terminals exclusively terminating in supra- or infragranular layers is not entirely correct. This is not even the case for area V1 - see for example Kathy Rockland's exquisite tractography studies, showing that even single axons have branches terminating in different layers. Also, feedback signals not only arrive in the deep layers of a lower area. Although these informational connectivity analyses can be suggestive of information flow, this reviewer doubts it can be considered as conclusive evidence. Therefore, the authors should drastically tone down their language in this respect, throughout the text. They present suggestive, not conclusive evidence. To obtain truly conclusive evidence, one likely has to perform laminar electrophysiological recordings simultaneously across multiple areas and infer the directionality of information flow using, for example, Granger causality.

      Thank you for pointing out this important issue. In our response to a previous question (Reviewer #1, the 2nd comment), we have discussed other possible connections in addition to the canonical feedforward and feedback pathways. In the revised manuscript, the conclusion has been toned down to properly reflect our findings. However, we would also like to emphasize that our conclusion about laminar circuits was supported by converging lines of evidence. For example, in addition to the depth-dependent connectivity results, the role of feedback circuit in processing texture information was also supported by greater selectivity in V4 than V2, and the strongest deep layer selectivity in V2 (Figure 3C).

      (2) In the same realm, how reproducible are the information connectivity results? In the first part of the study, the authors performed a split-half analysis. This should also be done for Figure 4.

      Thank you for your suggestion. We have performed a split-half analysis for the informational connectivity results. As shown in Author response image 3, the results for the color experiment were robust and reproducible, while the disparity and texture connectivity results were less consistent between the two halves. The results from the second half (Author response image 3, below) are more consistent with the original findings (Figure 4). Overall, the pattern of results was qualitatively similar between the two halves. The inconsistency may be due to the fact that some participants had only four runs of data, which could make the split-half analysis less reliable.

      Author response image 3.

      Split-half analysis of informational connectivity.

      (3) Most of the other layer-specific claims (not the ones about the flow of information) are based on indices. It is unclear which ROIs contributed to these indices. Was it the entire extent of V1, V2, ...? Or only the visually-driven voxels within these areas? How exactly were the voxels selected? For V2, it would make sense to calculate the selectivity indices independently for the disparity and color-selective (putative) thick and (putative) thin stripe compartments, respectively. Adding voxels of non-selective compartments (e.g. putative thick stripe voxels for calculating the color-index; or adding putative thin-stripe voxels for calculating the disparity index), will only add noise.

      In the revised manuscript, we have clarified that we selected the entire ROI in the depth-dependent analysis. Since our study does not have an independent functional localizer, using the entire ROI avoids the problem of double dipping. The processing of visual features is not confined solely to specific stripes. We have also provided a more comprehensive explanation of this issue in the discussion section.

      In Line 541-544: “For the cortical depth-dependent analyses in Figure 3, we used all voxels in the retinotopic ROI. Pooling all voxels in the ROI avoids the problem of double-dipping and also increases the signal-to-noise ratio of ROI-averaged BOLD responses.”

      (4) It is apparent from Figure 3, that the indices are largely (though not exclusively) driven by 2 subjects. Therefore, this reviewer wishes to see the raw data in addition to a table for calculating the color, disparity, and texture selectivity indices -along with the number of voxels that contributed to it.

      Thank you for your suggestion. We have provided a figure showing the original BOLD responses (Figure S8 and Figure S8). Data from individual subjects were also available at Open Science Framework (OSF, https://doi.org/10.17605/OSF.IO/KSXT8 (‘rawBetaValues.mat’ in the data directory)).

      Minor:

      (1) I typically find inferences about 'layer fMRI' vastly overstated. We all know that fMRI does not (yet) provide laminar-specific resolution, i.e., whereby meaningful differences in fMRI signals can be extracted from all 6 individual layers of neocortex, without partial volume effects, or without taking into account pre- and postsynaptic contributions of neurons to the fMRI signal (the cell bodies may very well lie in different layers than the dendritic trees etc.), or without taking into account the vascular anatomy, etc. The authors should use the term cortical depth-dependent fMRI throughout the text, as they do in the abstract and intro.

      Thank you for pointing out this important issue. We have now defined the meaning of layer or laminar as “cortical depth-dependent” in the introduction, to be consistent with the terminology in most published papers on this topic.

      (2) 1st sentence abstract: I disagree with this statement. The parallel streams in intermediate-level areas are probably equally well studied as the geniculostriate pathway -already starting with the seminal work of Hubel, Livingstone, and more recently by Angelucci and co-workers who looked in detail at the anatomical and functional interactions across sub-compartments of V1 and V2.

      Thank you for your feedback. In the revised manuscript, we have removed the term "much" from the first sentence of the abstract. Although there have been seminal studies of V2 sub-compartments in monkeys, only a few fMRI studies investigated this issue in humans.

      (3) The authors show inter-session correlations for color and disparity. This reviewer would like to see test-retest images since the explained variance is not terribly good. Also, show the correlation values for the inter-session texture beta values.

      Thank you for your suggestion. We have performed the test-retest reliability analysis of texture-selective patterns in the response to a previous question (Reviewer #2, the 2nd comment, Author response image 2).

      (4) The stripe definitions are threshold dependent. Please clarify whether the reported results are threshold-independent.

      Thank you for your question. To address your concern, we defined the stripe ROIs using different thresholds, and the results remained consistent. Specifically, we ranked the voxels in manually defined stripe ROIs by the color-disparity response. We then defined the lowest 10% as the thick stripe voxels, the highest 10% as thin stripe voxels, and the middle 10% as pale stripe voxels. Additionally, we adjusted the thresholds to 20% and 30% to define the three stripes (with 30% being the least strict threshold). Feature selectivities at different thresholds were shown in Figure S6 (from left to right: 10%, 20%, 30%). Notably, in all threshold conditions, there was no significant difference in texture selectivity across different stripes.
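
      The percentile-based stripe assignment described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name is hypothetical, each voxel is assumed to carry a single color-minus-disparity response value, and the "middle" band is assumed to be centered on the median rank.

```python
def assign_stripes(color_minus_disparity, fraction=0.10):
    """Label voxels as thick, pale, or thin stripe by rank of their response.

    Lowest `fraction` of voxels  -> "thick" (disparity-preferring)
    Middle `fraction` of voxels  -> "pale"
    Highest `fraction` of voxels -> "thin" (color-preferring)
    All remaining voxels         -> "unassigned"
    """
    n = len(color_minus_disparity)
    k = max(1, round(n * fraction))
    if 3 * k > n:
        raise ValueError("fraction too large: the three bands would overlap")

    # Voxel indices sorted from the lowest to the highest response.
    order = sorted(range(n), key=lambda i: color_minus_disparity[i])
    labels = ["unassigned"] * n

    mid_start = (n - k) // 2  # assumption: middle band centered on the median
    for i in order[:k]:
        labels[i] = "thick"
    for i in order[mid_start:mid_start + k]:
        labels[i] = "pale"
    for i in order[n - k:]:
        labels[i] = "thin"
    return labels


# Toy example (hypothetical values): 20 voxels, tested at all three thresholds.
toy_values = list(range(20))
labels_by_threshold = {f: assign_stripes(toy_values, f) for f in (0.10, 0.20, 0.30)}
```

      Repeating the downstream selectivity analysis for each entry in `labels_by_threshold` is one way to check that a result does not depend on the chosen threshold.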

      (5) How were the visual areas defined?

      In the revised manuscript, we have provided a detailed description about methods.

      In Line 531-535: “ROIs were defined on the inflated cortical surface. Surface ROIs for V1, V2, V3ab, and V4 were defined based on the polar angle atlas from the 7T retinotopic dataset of Human Connectome Project (Benson et al., 2014, 2018). Moreover, the boundary of V2 was edited manually based on columnar patterns. All ROIs were constrained to regions where mean activation across all stimulus conditions exceeded 0.”

      (6) "According to the hierarchical model in Figure 1B and 1C, the strongest color selectivity in the superficial cortical depth is consistent with the fact that color blobs mainly locate in the superficial layers of V1, suggesting that both local and feedforward connections are involved in processing color information in area V2." But color-selective activation within V2 could be also consistent with feedback from other areas (some of which were not covered in the present experiments) -the more since most parts of the brain were not covered (i.e. a slab of 4 cm was covered)?

      Thank you for reminding us about this issue. We have discussed the possibility of feedback influence in explanation of the superficial bias of color selectivity in area V2.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews: 

      Reviewer #1 (Public review):

      Summary: 

      Authors benchmarked 5 IBD detection methods (hmmIBD, isoRelate, hap-IBD, phasedIBD, and Refined IBD) in Plasmodium falciparum using simulated and empirical data. Plasmodium falciparum has a mutation rate similar to humans but a much higher recombination rate and lower SNP density. Thus, the authors evaluated how recombination rate and marker density affect IBD segment detection. Next, they performed parameter optimization for Plasmodium falciparum and benchmarked the robustness of downstream analyses (selection detection and Ne inference) using IBD detected by each of the methods. They also tracked the computational efficiency of these methods. The authors' work is valuable for the tested species and the analyses presented appear to support their claim that users should be cautious calling IBD when SNP density is low and recombination rate is high.

      Strengths: 

      The study design was solid. The authors set up their reasoning for using P. falciparum very well. The high recombination rate and similar mutation rate to humans is indeed an interesting case. Further, they chose methods that were developed explicitly for each species. This was a strength of the work, as well as incorporating both simulated and empirical data to support their goal that IBD detection should be benchmarked in P. falciparum.

      Weaknesses: 

      The scope of the optimization and application of results from the work are narrow, in that everything is fine-tuned for Plasmodium. Some of the results were not entirely unexpected for users of any of the tested software that was developed for humans. For example, it is known that Refined IBD is not going to do well with the combination of short IBD segments and low SNP density. Lastly, it appears the authors only did one large-scale simulation (there are no reported SDs).

      We thank the reviewer for highlighting the strengths and weaknesses of the study. 

      First, we would like to highlight that: (1) while we use Plasmodium as a model to investigate the impact of high recombination and low marker density on IBD detection and downstream analyses, our IBD benchmarking framework and strategies are widely applicable to IBD methods development for many sexually recombining species including both Plasmodium and non-Plasmodium species. (2) Although some results are not completely unexpected, such as the impact of low marker density on IBD detection, IBD-based methods have been increasingly used in malaria genomic surveillance research without comprehensive benchmarking for malaria parasites despite the high recombination rate. Due to the lack of benchmarking, researchers use a variety of different IBD callers for malaria research including those that are only benchmarked in human genomes, such as refined-ibd. Our work not only confirmed that low marker density (related to high recombination rate) can affect the accuracy of IBD detection, but also demonstrated the importance of proper parameter optimization and tool prioritization for specific downstream analyses in malaria research. We believe our work significantly contributes to the robustness of IBD segment detection and the enhancement of IBD-based malaria genomic surveillance.

      Second, we agree that there is a lack of clarity regarding simulation replicates and the uncertainty of reported estimates. We have made the following improvements, including (1) running n = 3 full sets of simulations for each analysis purpose, which is in addition to the large sample sizes and chromosomal-level replications already presented in our initial submission, and (2) updating data and figures to reflect the uncertainty at relevant levels (segment level, genome-pair level or simulation set level).   

      Reviewer #2 (Public review):

      Summary: 

      Guo et al. benchmarked and optimized methods for detecting Identity-By-Descent (IBD) segments in Plasmodium falciparum (Pf) genomes, which are characterized by high recombination rates and low marker density. Their goal was to address the limitations of existing IBD detection tools, which were primarily developed for human genomes and do not perform well in the genomic context of highly recombinant genomes. They first analysed various existing IBD callers, such as hmmIBD, isoRelate, hap-IBD, phased-IBD, refinedIBD. They focused on the impact of recombination on the accuracy, which was calculated based on two metrics, the false negative rate and the false positive rate. The results suggest that high recombination rates significantly reduce marker density, leading to higher false negative rates for short IBD segments. This effect compromises the reliability of IBD-based downstream analyses, such as effective population size (Ne) estimation. They showed that the best tool for IBD detection in Pf is hmmIBD, because it has relatively low FN/FP error rates and is less biased for relatedness estimates. However, this method is less computationally efficient. Their suggestion is to optimize human-oriented IBD methods and use hmmIBD only for the estimation of Ne. 

      Strengths: 

      Although I am not an expert on Plasmodium falciparum genetics, I believe the authors have developed a valuable benchmarking framework tailored to the unique genomic characteristics of this species. Their framework enables a thorough evaluation of various IBD detection tools for non-human data, such as high recombination rates and low marker density, addressing a key gap in the field. This study provides a comparison of multiple IBD detection methods, including probabilistic approaches (hmmIBD, isoRelate) and IBS-based methods (hap-IBD, Refined IBD, phased IBD). This comprehensive analysis offers researchers valuable guidance on the strengths and limitations of each tool, allowing them to make informed choices based on specific use cases. I think this is important beyond the study of Pf. The authors highlight how optimized IBD detection can help identify signals of positive selection, infer effective population size (Ne), and uncover population structure. They demonstrate the critical importance of tailoring analytical tools to suit the unique characteristics of a species. Moreover, the authors provide practical recommendations, such as employing hmmIBD for quality-sensitive analyses and fine-tuning parameters for tools originally designed for non-P. falciparum datasets before applying them to malaria research.

      Overall, this study represents a meaningful contribution to both computational biology and malaria genomics, with its findings and recommendations likely to have an impact on the field. 

      Weaknesses: 

      One weakness of the study is the lack of emphasis on the broader importance of studying Plasmodium falciparum as a critical malaria-causing organism. Malaria remains a significant global health challenge, causing hundreds of thousands of deaths annually. The authors could have introduced better the topic, even though I understand this is a methodological paper. While the study provides a thorough technical evaluation of IBD detection methods and their application to Pf, it does not adequately connect these findings to the broader implications for malaria research and control efforts. Additionally, the discussion on malaria and its global impact could have framed the study in a more accessible and compelling way, making the importance of these technical advances clearer to a broader audience, including researchers and policymakers in the fight against malaria. 

      We thank the reviewer for highlighting the need to better contextualize the work and emphasize its relevance to malaria control and elimination efforts. We have edited the introduction and discussion sections to highlight the importance of studying Plasmodium as malaria-causing organisms and why IBD-based analysis is important to malaria researchers and policymakers. We believe the changes will better emphasize the public health relevance of the work and improve clarity for a general audience.  

      We would like to clarify that we are not recommending that researchers “optimize human-oriented IBD methods and use hmmIBD only for the estimation of Ne.” We recommended hmmIBD for Ne analysis; however, hmmIBD can be utilized for other applications, including population structure and selection detection. Thus, we generally recommend using hmmIBD for Plasmodium when phased genotypes are available. To avoid potential misunderstandings, we have revised relevant sentences in the abstract, introduction, and discussion. One reason to consider human-oriented IBD detection methods in Plasmodium research is that hmmIBD currently has limitations in handling large genomic datasets. Our ongoing research focuses on improving hmmIBD to reduce its computational runtime, making it scalable for large Plasmodium whole-genome sequence datasets.

      Recommendations for the authors

      Reviewer #1:

      (1) Additional experiments 

      (i) More simulation replicates would be valuable here. The way that results are presented, it appears as though there are no replicates. Apologies if I am incorrect, but when looking through the authors' code the --num_reps defaults to one simulation and there are no SDs reported for any figure. Perhaps the authors are bypassing replicates by taking a random sample of lineages? Some clarification here would be great.

      We agree with the reviewer’s constructive suggestions. We have increased the number of simulation sets to n = 3 in addition to the existing replicates at the chromosomal level. We did not use a larger n for full sets of simulation replicates for two reasons: (1) full replication is quite computationally intensive (n = 3 simulation sets already require a week to run on our computer cluster with hundreds of CPU cores); (2) the results from different simulation sets are highly consistent with each other, likely due to our large sample size (n = 1000 haploid genomes for each parameter combination). The consistency across simulation sets can be exemplified by the following figures (Author response image 1 and 2) based on simulation sets different from Figures and Supplementary Figures included in the manuscript.

      Author response image 1.

      Additional simulation sets repeating experiments shown in Fig 2.

      Author response image 2.

      Post-optimization Ne estimates based on three independent simulation sets (Fig 5 shows data from simulation set 1).

      In our updated figures, we address the uncertainty of measurements as follows:

      (1) For IBD accuracy based on overlapping IBD segments, we present the mean ± standard deviation (SD) at the segment level (IBD segment false positives and false negatives for each length bin) or genome-pair level (IBD error rates at the genome-wide level). Figures in the revised manuscript show results from one of the three simulation set replicates. The SD of IBD segment accuracy is included in all relevant figures. In the S2 Data file, we chose not to show SDs to avoid text overcrowding in the heatmaps; however, a detailed version, including SD plotting on the heatmap and across three simulation set replicates, is available on our GitHub repository at https://github.com/bguo068/bmibdcaller_simulations/tree/main/simulations/ext_data

      (2) For IBD-based genetic relatedness, the uncertainty is depicted in scatterplots.

      (3) For IBD-based selection signal scans, we provide the mean ± SD of the number of true selection signals and false selection signals. The SD is calculated at the simulation set level (n=3). 

      (4) For IBD network community detection, the mean ± SD of the adjusted Rand index is reported at the simulation set level (n=3). A representative simulation set is randomly chosen for visualization purposes.

      (5) For IBD-based Ne estimates, each simulation set provides confidence intervals via bootstrapping. We found Ne estimates across n=3 simulation sets to be highly consistent and decided to display Ne from one of the simulation sets.

      (6) For the measurement of computational efficiency and memory usage, the mean ± SD was calculated across chromosomes from the same simulation sets.
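
      The segment-level accuracy summary in item (1) can be sketched as follows. This is an illustration only: it assumes the false-negative rate of a true segment is the fraction of its length not covered by any inferred segment, and the false-positive rate of an inferred segment is the fraction of its length not covered by any true segment. The authoritative definitions are those in the manuscript's methods section, which this response does not reproduce; the function names and coordinates below are hypothetical.

```python
from statistics import mean, stdev


def covered_length(segment, others):
    """Length of `segment` overlapped by the union of the intervals in `others`."""
    start, end = segment
    # Clip overlapping intervals to the segment and process them left to right.
    clipped = sorted(
        (max(s, start), min(e, end)) for s, e in others if s < end and e > start
    )
    total, cursor = 0, start
    for s, e in clipped:
        s = max(s, cursor)  # skip the part already counted
        if e > s:
            total += e - s
            cursor = e
    return total


def uncovered_fraction(segment, others):
    """Fraction of `segment` not overlapped by any interval in `others`."""
    length = segment[1] - segment[0]
    return 1 - covered_length(segment, others) / length


# Hypothetical segments (start, end) for one genome pair on one chromosome.
true_segs = [(0, 100), (200, 260)]
inferred_segs = [(10, 90), (250, 300)]

fn_rates = [uncovered_fraction(t, inferred_segs) for t in true_segs]
fp_rates = [uncovered_fraction(i, true_segs) for i in inferred_segs]

# Segment-level summary as mean and sample standard deviation.
fn_mean, fn_sd = mean(fn_rates), stdev(fn_rates)
fp_mean, fp_sd = mean(fp_rates), stdev(fp_rates)
```

      In a length-binned analysis, the same summary would be computed separately for the segments falling in each length bin.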

      We have included a paragraph titled "Replications and Uncertainty of Measures" in the methods section to clarify simulation replications. Additionally, a table of simulation replicates is provided in the new S1 Data file under the sheet named “02_simulation_replicates.”

      (ii) I might also recommend a table or illustrative figure with all the simulation parameters for the readers rather than them having to go to and through a previous paper to get a sense of the tested parameters. 

      We have now generated tables containing full lists of simulation/IBD calling parameters. We have organized the tables into two sections: simulation parameters and IBD calling parameters. For the simulations, we are using three demographic models: the single-population (SP) model, the multiple-population (MP) model, and the human population demography in the UK (UK) model, each with different sets of parameters. Parameters and their values are listed separately for each demographic model (SP, MP and UK). For the IBD calling, we have five different IBD callers, each with different parameters. We have provided lists of the parameters and their values separately for each caller. In total, there are 15 different combinations of 3 demographic models in simulation and five callers in IBD detection (Author response image 3). We provide a table for each of the 15 combinations. We also provide a single large table by concatenating all 15 tables. In the combined table, demographic model-specific or IBD caller-specific parameters are displayed in their own columns, with NA values (empty cells) appearing in rows where these parameters are not applied (see S2 Data file).

      Author response image 3.

      Schematic of combined parameters from simulations and IBD detection (also included in the S2 Data file)
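
      The layout of the combined table can be sketched as follows: one row per model and caller combination, with NA wherever a model-specific or caller-specific parameter does not apply. The model and caller names come from the response; every parameter name and value below is a placeholder invented for illustration and does not correspond to any real option of these tools.

```python
from itertools import product

# Placeholder parameters (hypothetical): one model-specific column per model.
model_params = {
    "SP": {"sp_placeholder": 1},
    "MP": {"mp_placeholder": 2},
    "UK": {"uk_placeholder": 3},
}
# Placeholder parameters (hypothetical): one caller-specific column per caller.
callers = ["hmmIBD", "isoRelate", "hap-IBD", "phasedIBD", "Refined IBD"]
caller_params = {name: {f"{name}_placeholder": 0} for name in callers}

# One row per (demographic model, IBD caller) combination: 3 x 5 = 15 rows.
rows = [
    {"model": model, "caller": caller, **m_params, **c_params}
    for (model, m_params), (caller, c_params) in product(
        model_params.items(), caller_params.items()
    )
]

# Union of all column names, in order of first appearance.
columns = []
for row in rows:
    for key in row:
        if key not in columns:
            columns.append(key)

# Concatenated table: parameters that do not apply to a row are filled with "NA".
table = [{col: row.get(col, "NA") for col in columns} for row in rows]
```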

      (2) Recommendations for improving the writing and presentation 

      Overall, the writing was great, especially the introduction. 

      Three thoughts: 

      (i) It would be great if the authors included a few sentences with guidance on the approach one would take if their organism was not human or P. falciparum.

      We have updated our discussion with the following statement: “Beyond Plasmodium parasites, there are many other high-recombining organisms such as Apicomplexan species like Theileria, insects like Apis mellifera (honeybee), and fungi like Saccharomyces cerevisiae (Baker's yeast). For these species, our optimized parameters may not be directly applicable, but the benchmarking framework established in this study can be utilized to prioritize and optimize IBD detection methods in a context-specific manner.”

      (ii) I think there was a lot of confusion about the simulations as they were presented between the co-reviewer and me. Clarification on whether there were replicates and how sampling of lineages occurred would be helpful for a reader.

      We have added a paragraph with heading “Replications and uncertainty of measures” under the method section to clarify simulation replicates.  Please also refer to our response above for more details (Reviewer #1 (1) Additional experiments).

      (iii) Maybe we missed it, but could the authors add a sentence or two about why isoRelate performed so poorly (e.g. lines 206-207) considering it was developed for Plasmodium? This result seems important. 

      IsoRelate assumes non-phased genotypes as input; therefore, even if phased genotypes are provided, the HMM model used in isoRelate (distinct from the hmmIBD model) may not utilize them. Below, we present examples of IBD segments between true sets and inferred sets from both isoRelate and hmmIBD, where many small IBD segments identified by tskibd (ground truth) and hmmIBD (inferred) are not detected by isoRelate (inferred), although isoRelate still captures very long IBD segments. These patterns are also illustrated in Fig. 3 and S3 Fig. We acknowledge that isoRelate may outperform other methods in the context of unphased genotypes. However, we chose not to benchmark IBD calling methods using unphased genotypes in simulations, as the results may be significantly influenced by the quality of genotype phasing for all other IBD detection methods. The characterization of deconvolution methods is beyond the scope of this paper. We have added a paragraph in the discussion to reflect the above explanation.

      Author response image 4.

      Example IBD segments inferred by isoRelate and hmmIBD compared to true IBD segments calculated by tskibd.

      (3) Minor corrections to the text and figures 

      Lines 105-110 feel like introduction because the authors are defining IBD and goals of work 

      We have shortened these sentences and retained only relevant information for transition purposes. 

      Line 121-122 The definition of false positive is incorrect, it appears to be the exact text from false negative 

      We apologize for the typo and have corrected the definition so that it is consistent with that in the Methods section.

      Lines 177-180 feels more like discussion than results 

      We have removed this sentence for brevity. 

      Figure 1: 

      Remove plot titles from the figure 

      Write out number in a 

      The legend in b overlaps the data so moving that inset to the right would be helpful 

      We have removed the titles from Figure 1. In Figure 1a, we have changed the format of the y-axis tick labels from scientific notation to integers. In Figure 1b, we have adjusted the size and location of the legend so that it does not overlap with the data points.

      Figure 2-3 & S4-5: 

      It was hard to tell the difference between [3-4) and [10-18) because the colors and shapes are similar. It might be worth using a different color or shape for one of them? 

      We have changed the color for the [10-18) group so that the two groups are easier to distinguish.

      Figure 3 & S3-5: 

      Biggest suggestion is that when an axis is logged it should not only be mentioned in the caption but also should be shown in the figure as well. 

      We have updated all relevant figures so that the log scale is noted in the figure captions (legends) as well as in the figures (in the x and/or y axis labels).

      Supplementary Figure S2 

      (i) It would be nice to either combine it with the main text Figure 1 (I don't believe it would be overwhelming) or add in the other two methods for comparison 

      We have now plotted data for all five IBD callers in S1 Fig for better comparison. 

      (ii) the legend overlaps the data so relocating it to the top or bottom would be helpful 

      We have moved the legend to the bottom of the figure to avoid overlap with the data.

      Reviewer #2:

      I don't have any major comments on the paper. It is well-written, although perhaps a bit long and repetitive in some sections. Make sure not to repeat the same concepts too many times. 

      We have consolidated and removed several paragraphs to reduce repetition of the same concepts.

      I am not a methodological developer, but it seems you have addressed several challenges regarding IBD detection in P. falciparum. You have also acknowledged the study's caveats, which I agree with. 

      Thank you for the positive comments.

      Minor comments: 

      -In my opinion, the paper would benefit from including the workflow figure in the main text rather than keeping it in the supplementary materials. This would make it more accessible and useful for readers. 

      We have moved the original S1 Fig to be Fig 1 in the main text.

      -Some of the figures (e.g. Fig. 2, 4) should be larger for better clarity and interpretation. 

      We have updated Fig 2 and Fig 4 (now labeled as Figure 3 and 5) to make them larger for improved clarity and interpretation.

      -While the focus on P. falciparum is understandable, it would have been valuable to include examples of other species and discuss the broader implications of the findings for a broader field. 

      We have updated the third-to-last paragraph to discuss implications for other species, such as Apicomplexan species like Theileria, insects like Apis mellifera (honeybee), and fungi like Saccharomyces cerevisiae (Baker's Yeast). We acknowledge that optimal parameters and tool choices may vary among species due to differences in demographic history and evolutionary parameters. However, we emphasize that the methods outlined are adaptable for prioritizing and optimizing IBD detection methods in a context-specific manner across different species.

      -Figure 6 is somewhat confusing and could use clearer labeling or additional explanation to improve comprehension. 

      We have updated the labels and titles in the figure to improve clarity. We also edited the figure caption for better clarity.

      -Although hmmIBD outperformed other tools in accuracy, its computational inefficiency due to single-threaded execution poses a significant challenge for scaling to large datasets. The trade-off between accuracy and computational cost could be discussed in more detail. 

      We have added a paragraph in the discussion section to highlight the trade-off between accuracy and computational cost. We noted that we are developing an adapted tool to enhance the hmmIBD model and significantly reduce the runtime by parallelizing the IBD inference process.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public Review):

      The authors of this study use electron microscopy and 3D reconstruction techniques to study the morphology of distinct classes of Drosophila sensory neurons *across many neurons of the same class.* This is a comprehensive study attempting to look at nearly all the sensory neurons across multiple sensilla to a) determine how much morphological variability exists between and within neurons of different and similar sensory classes, and b) identify dendritic features that may have evolved to support particular sensory functions. This study builds upon the authors' previous work, which allowed them to identify and distinguish sensory neuron subtypes in the EM volumes without additional staining so that reconstructed neurons could reliably be placed in the appropriate class. This work is unique in looking at a large number of individual neurons of the same class to determine what is consistent and what is variable about their class-specific morphologies.

      This means that in addition to providing specific structural information about these particular cells, the authors explore broader questions of how much morphological diversity exists between sensory neurons of the same class and how different dendritic morphologies might affect sensory and physiological properties of neurons.

      The authors found that CO2-sensing neurons have an unusual, sheet-like morphology in contrast to the thin branches of odor-sensing neurons. They show that this morphology greatly increases the surface area to volume ratio above what could be achieved by modest branching of thin dendrites, and posit that this might be important for their sensory function, though this was not directly tested in their study. The study is mainly descriptive in nature, but thorough, and provides a nice jumping-off point for future functional studies. One interesting future analysis could be to examine all four cell types within a single sensillum together to see if there are any general correlations that could reveal insights about how morphology is determined and the relative contributions of intrinsic mechanisms vs interactions with neighboring cells. For example, if higher-than-average branching in one cell type correlated with higher-than-average branching in another type in the same sensillum, this might suggest higher extracellular growth or branching cues within that sensillum. Conversely, if higher branching in one cell type consistently leads to reduced length or branching in another, this might point to dendrite-dendrite interactions between cells undergoing competitive or repulsive interactions to define territories within each sensillum as a major determinant of the variability.

      We thank the reviewer for the insightful comments and appreciation for our study.
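
      The geometric point raised above, that a flattened sheet has a higher surface area to volume ratio than a thin cylindrical branch of the same cross-sectional area, can be illustrated with a minimal sketch. The shapes are idealized (end faces ignored) and all dimensions below are hypothetical values chosen for illustration, not measurements from the study.

```python
import math


def cylinder_sa_to_v(radius_um: float, length_um: float) -> float:
    """Lateral surface area / volume of an idealized cylinder (end caps ignored)."""
    area = 2 * math.pi * radius_um * length_um
    volume = math.pi * radius_um ** 2 * length_um
    return area / volume  # algebraically equal to 2 / radius


def sheet_sa_to_v(thickness_um: float, width_um: float, length_um: float) -> float:
    """Surface area / volume of an idealized flat sheet (end faces ignored)."""
    # Two broad faces plus the two long, thin edges.
    area = 2 * width_um * length_um + 2 * thickness_um * length_um
    volume = thickness_um * width_um * length_um
    return area / volume  # algebraically equal to 2 / thickness + 2 / width


# Hypothetical dimensions: both shapes share the same cross-sectional area,
# and therefore the same volume per unit length.
radius = 0.1                                # um, assumed
thickness = 0.02                            # um, assumed
width = math.pi * radius ** 2 / thickness   # um, chosen to match cross-sections
length = 5.0                                # um, assumed

cyl_ratio = cylinder_sa_to_v(radius, length)
sheet_ratio = sheet_sa_to_v(thickness, width, length)
print(f"cylinder SA:V = {cyl_ratio:.1f} per um, sheet SA:V = {sheet_ratio:.1f} per um")
```

      Under these assumptions the ratio for the sheet is set mainly by its thickness, so flattening raises the ratio without adding volume; real dendrites are of course far less regular.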

      Reviewer #2 (Public Review):

      The manuscript employs serial block‐face electron microscopy (SBEM) and cryofixation to obtain high‐resolution, three‐dimensional reconstructions of Drosophila antennal sensilla containing olfactory receptor neurons (ORNs) that detect CO2. This method has been used previously by the same lab in Gonzales et al., 2021 (https://elifesciences.org/articles/69896), which had provided an exemplary model by integrating high-resolution EM with electrophysiology and cell-type-specific labeling.

      We thank the reviewer for expressing appreciation for our published study.

      The previous study ended up correlating morphology with activity for multiple olfactory sensillar types. Compared to the 2021 study, this current manuscript appears somewhat incomplete and lacks integration with activity.

      We thank the reviewer for their feedback. However, we would like to clarify that our previous study did not correlate morphology with activity to a greater extent than the current study. Both employed the same cryofixation, SBEM-based approach without recording odor-induced activity, but the focus of the current work is fundamentally different. While the previous study examined multiple sensillum types, the current study concentrates on a single sensillum type to address a distinct biological question regarding morphological heterogeneity. We appreciate the opportunity to clarify this distinction, and we hope that the revised manuscript more clearly conveys the unique scope and contributions of this study.

      In fact older studies have also reported two-dimensional TEM images of the putative CO2 neuron in Drosophila (Shanbhag et al., 1999) and in mosquitoes (McIver and Siemicki, 1975; Lu et al, 2007), and in these instances reported that the dendritic architecture of the CO2 neuron was somewhat different (circular and flattened, lamellated) from other olfactory neurons.

      We thank the reviewer for pointing this out. As noted in both the Introduction and Discussion sections, previous studies—including those cited by the reviewer—suggested that CO2-sensing neurons may have a distinct dendritic morphology. However, those earlier studies lacked the means to definitively link the observed morphology to CO2 neuron identity.

      In contrast, our study assigns neuronal identity based on quantitative morphometric measurements, allowing us to confidently associate the unique dendritic architecture with CO2 neurons. Furthermore, we extend previous observations by providing full 3D reconstructions and nanoscale morphometric analyses, offering a much more comprehensive and definitive characterization of these neurons. We believe this represents a significant advancement over earlier work.

      The authors claim that this approach offers an artifact‐minimized ultrastructural dataset compared to earlier. In this study, not only do they confirm this different morphology but also classify it into distinct subtypes (loosely curled, fully curled, split, and mixed). This detailed morphological categorization was not provided in prior studies (e.g., Shanbhag et al., 1999).

      We thank the reviewer for acknowledging the significance of our study.

      The authors would benefit from providing quantitative thresholds or objective metrics to improve reproducibility and to clarify whether these structural distinctions correlate with distinct functional roles.

      We thank the reviewer for raising this point. However, we would like to clarify that assigning neurons to strict morphological subtypes was not the primary aim of our study. In practice, dendritic architectures can be highly complex, with individual neurons often displaying features characteristic of multiple subtypes. This is precisely why we included a “mixed” subtype category—to acknowledge and capture this morphological heterogeneity rather than impose rigid classification boundaries.

      Our intent in defining subtypes was not to imply discrete functional classes, but rather to highlight the range of morphological variation observed across ab1C neurons. While we agree that exploring potential correlations between structure and function is an important future direction, the current study focuses on characterizing this diversity using 3D reconstruction and morphometric analysis. We hope this clarifies the purpose and scope of our morphological categorization.

      Strengths:

      The study makes a convincing case that ab1C neurons exhibit a unique, flattened dendritic morphology unlike the cylindrical dendrites found in ab1D neurons. This observation extends previous qualitative TEM findings by not only confirming the presence of flattened lamellae in CO₂ neurons but also quantifying key morphometrics such as dendritic length, surface area, and volume, and calculating surface area-to-volume ratios. The enhanced ratios observed in the flattened segments are speculated to be linked to potential advantages in receptor distribution (e.g., Gr21a/Gr63a) and efficient signal propagation.

      We thank the reviewer for appreciating the significance of our current study.

      Weaknesses:

      While the manuscript offers valuable ultrastructural insights and reveals previously unappreciated heterogeneity among CO₂-sensing neurons, several issues warrant further investigation in addition to the points made above.

      (1) Although this quantitative approach is robust compared to earlier descriptive reports, its impact is somewhat limited by the absence of direct electrophysiological data to confirm that ultrastructural differences translate into altered neuronal function. A direct comparison or discussion of how the present findings align with the functional data obtained from electrophysiology would strengthen the overall argument.

      We thank the reviewer for this comment. We would like to clarify, however, that our study does not claim that the observed morphological heterogeneity necessarily leads to functional diversity. Rather, we consider this a possible implication and discuss it as a potential question for future research. This idea is raised only in the Discussion section, and we are careful not to present functional diversity as a conclusion of our study. Nonetheless, we have reviewed the relevant paragraph to ensure the language remains cautious and does not overstate our interpretation.

      We also acknowledge the significance of directly linking ultrastructural features to neuronal function through electrophysiological recordings. However, at present, it is technically challenging to correlate the nanoscale morphology of individual ORNs with their functional activity, as this would require volume EM imaging of the very same neurons that were recorded via electrophysiology. Currently, there is no dye-labeling method compatible with single-sensillum recording and SBEM sample preparation that allows for unambiguous identification and segmentation of recorded ORNs at the necessary ultrastructural resolution.

      To acknowledge this important limitation, we have added a paragraph in the Discussion section, as suggested, to clarify the current technical barriers and to highlight this as a promising direction for future methodological advances.

      (2) Clarifying the criteria for dendritic subtype classification with quantitative parameters would enhance reproducibility and interpretability. Moreover, incorporating electrophysiological recordings from ab1C neurons would provide compelling evidence linking structure and function, and mapping key receptor proteins through immunolabeling could directly correlate receptor distribution with the observed morphological diversity.

      Please see our response to the comment regarding the technical limitations of directly correlating ultrastructure with electrophysiological data.

      In addition, we would like to address the suggestion of using immunolabeling to map receptor distribution in relation to the 3D EM models. Currently, antibodies against Gr21a or Gr63a (the receptors expressed in ab1C neurons) are not available. Even if such antibodies were available, immunogold labeling for electron microscopy requires harsh detergent treatment to increase antibody permeability, damaging morphological integrity. These treatments would compromise the very morphological detail that our study aims to capture and quantify.

      (3) Even though cryofixation is claimed to be superior to chemical fixation for generating fewer artifacts, the authors need to confirm independently the variation observed in the CO2 neuron morphologies across populations. All types of fixation in TEMs cause some artifacts, as does serial sectioning. Without understanding the error rates or without independent validation with another method, it is hard to have confidence in the conclusions drawn by the authors of the paper.

      We thank the reviewer for raising concerns regarding potential artifacts in morphological analyses. However, we would like to clarify that cryofixation is widely regarded as a gold standard for ultrastructural preservation and minimizing fixation-induced artifacts, as supported by extensive literature. This is why we adopted high-pressure freezing and freeze substitution in our study.

      We have also published a separate methods paper (Tsang et al., eLife, 2018) directly comparing our cryofixation-based protocol with conventional chemical fixation, demonstrating substantial improvements in morphological preservation. This provides strong empirical support for the reliability of our approach.

      Regarding the suggestion to validate observed morphological variation across populations: we note that determining the presence of artifacts requires a known ground truth, which is inherently unavailable as we could not measure the morphometrics of fly olfactory receptor neurons in their native state. In the absence of such a benchmark, we have instead prioritized using the best-available preparation methods and high-resolution imaging to ensure structural integrity.

      Addressing these concerns and integrating additional experiments would significantly bolster the manuscript's completeness and advancement.

      We appreciate the reviewer’s feedback. As discussed in our responses to the specific comments above, certain suggested experiments are currently limited by technical constraints, particularly in the context of high-resolution volume EM for insect tissues enclosed in cuticles.

      Nevertheless, we have carefully addressed the reviewer’s concerns to the fullest extent possible within the scope of this study. We have revised the manuscript to clarify methodological limitations, added new explanatory content where appropriate, and ensured that our interpretations remain well grounded in the data. We hope these revisions strengthen the clarity and completeness of the manuscript.

      Reviewer #3 (Public Review):

      In the current manuscript entitled "Population-level morphological analysis of paired CO2- and odor-sensing olfactory neurons in D. melanogaster via volume electron microscopy", Choy, Charara et al. use volume electron microscopy and sensillum. They aim to investigate the degree of dendritic heterogeneity within a functional class of neurons using ab1C and ab1D, which they can identify due to the unique feature of ab1 sensilla to house four neurons and the stereotypic location on the third antennal segment. This is a great use of volumetric electron imaging and neuron reconstruction to sample a population of neurons of the same type. Their data convincingly shows that there is dendritic heterogeneity in both investigated populations, and their sample size is sufficient to strongly support this observation. This data proposes that the phenomenon of dendritic heterogeneity is common in the Drosophila olfactory system and will stimulate future investigations into the developmental origin, functional implications, and potential adaptive advantage of this feature.

      Moreover, the authors discovered that there is a difference between CO2- and odour-sensing neurons of which the first show a characteristic flattened and sheet-like structure not observed in other sensory neurons sampled in this and previous studies. They hypothesize that this unique dendritic organization, which increases the surface area to volume ratio, might allow more efficient CO2 sensing by housing higher numbers of CO2 receptors. This is supported by previous attempts to express CO2 sensors in olfactory sensory neurons, which lack this dendritic morphology, resulting in lower CO2 sensitivity compared to endogenous neurons.

      Overall, this detailed morphological description of olfactory sensory neurons' dendrites convincingly shows heterogeneity in two neuron classes with potential functional impacts for odour sensing.

      Strength:

      The volumetric EM imaging and reconstruction approach offers unprecedented details in single cell morphology and compares dendrite heterogeneity across a great fraction of ab1 sensilla. The authors identify specific shapes for ab1C sensilla potentially linked to their unique function in CO2 sensing.

      We thank the reviewer for the insightful comments and appreciation for our study.

      Weaknesses:

      While the morphological description is highly detailed, no attempts are made to link this to odour sensitivity or other properties of the neurons. It would have been exciting to see how altered morphology impacts physiology in these olfactory sensory cells.

      We agree that linking morphological variation to physiological properties, such as odor sensitivity, would be a highly valuable direction for future research. However, the aim of the current study is to provide an in-depth nanoscale characterization based on a substantial proportion of ab1 sensilla, highlighting morphological heterogeneity among homotypic ORNs.

      At present, it is technically challenging to correlate the nanoscale morphology of individual ORNs with their physiological responses, as this would require volume EM imaging of the exact neurons recorded via single-sensillum electrophysiology. Currently, no dye-labeling method exists that is compatible with both single-sensillum recording and the stringent requirements of SBEM sample preparation to allow for unambiguous identification and segmentation of recorded ORNs.

      To acknowledge this important limitation, we have added a paragraph in the Discussion section clarifying the current technical barriers and highlighting this as a promising area for future methodological development. Please also see our responses to the reviewer’s 4th comment below, where we present preliminary experiments examining whether odor sensitivity varies among homotypic ORNs.

      (Please see the following pages for additional responses to the reviewers’ specific comments. These responses are not intended for publication.)

      Reviewer #1 (Recommendations for the authors):

      As this is mainly a descriptive paper I have no suggestions for additional experiments. Minor Text Suggestions:

      (1) The authors might want to include a better description/definition of the fly antennae, olfactory sensilla and their basic structure/makeup, position of the sensory neurons and dendrites within, etc, in the introduction perhaps in cartoon form to help readers that are not familiar (i.e. non-Drosophila readers) with the terminology and basic organization can follow the paper more easily from the start.

      We thank the reviewer for the helpful suggestion to broaden the appeal of our study to a wider readership. In response, we added a new introductory paragraph at the beginning of the Results section, along with illustrations in a new supplementary figure (Figure 1—figure supplement 1). The new paragraph reads as follows.

      “The primary olfactory organ in Drosophila is the antenna, which contains hundreds of olfactory sensilla on the surface of its third segment (Figure 1—figure supplement 1A). Each sensillum typically encapsulates the outer dendrites of two to four ORNs. The outer dendrites are the sites where odorant receptors are expressed, enabling the detection of volatile chemicals. A small portion of the outer dendrites lies beneath the base of the sensillum cuticle. At the ciliary constriction, the outer dendrites connect to the inner dendritic segment, which then links to the soma of each ORN (Figure 1—figure supplement 1B).”

      (2) In Figure 4D, the letter annotations above the graphs are not clearly defined anywhere that I could easily find. Please clarify with different symbols and/or in the figure legend so readers can easily comprehend the stats that are presented.

      We thank the reviewer for raising this point. As suggested, in the revised Figure 4D legend, following the original sentence “Statistical significance is determined by Kruskal-Wallis one-way ANOVA on ranks and denoted by different letters”, we added “For example, labels “a” and “b” indicate a significant difference between groups (P < 0.05), whereas labels with identical or shared letters (e.g., “a” and “a”, “a,b” and “a”, or “a,b” and “b”) indicate no significant difference.”
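
      The letter convention described in the revised legend can be expressed as a simple rule: two groups differ significantly only if their labels share no letter. A minimal sketch follows; the function name and the comma-separated label format are assumptions made for illustration.

```python
def differ_significantly(label_a: str, label_b: str) -> bool:
    """Return True if two letter labels share no letter.

    Labels are assumed to be comma-separated letters such as "a" or "a,b".
    Groups whose labels share at least one letter are read as not
    significantly different; groups with disjoint labels are read as
    significantly different.
    """
    letters_a = {part.strip() for part in label_a.split(",") if part.strip()}
    letters_b = {part.strip() for part in label_b.split(",") if part.strip()}
    return letters_a.isdisjoint(letters_b)


# The cases listed in the legend text:
for pair in [("a", "b"), ("a", "a"), ("a,b", "a"), ("a,b", "b")]:
    print(pair, "significant" if differ_significantly(*pair) else "not significant")
```

      This only decodes the labels; it does not perform the underlying statistical test or the pairwise comparisons that produce them.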

      Reviewer #3 (Recommendations for the authors):

      There are several aspects that I would like the authors to consider to improve the current manuscript:

      (1) Line 331: "Our analysis highlights how structural scaling in ab1D neurons achieves enhanced sensory capacity while maintaining the biophysical properties of dendrites". This is a strong statement, and not shown by the authors. They speculate about this in the discussion, but I would like them to soften the language here.

      We thank the reviewer for raising this point. As suggested, we have softened the language in the sentence in question. The revised version is as follows.

      “Our analysis suggests that structural scaling in ab1D neurons may enhance sensory capacity while preserving the biophysical properties of dendrites.”

      (2) The Supplementary material is not well presented and is not cited in the manuscript. It is not clear what the individual data files show, where they refer to, etc. Please provide clear labels of all data, cite them at the appropriate location in the manuscript, and make them more accessible to the reader. Also, there are two Videos mentioned in the manuscript that are not included in the submission.

      We thank the reviewer for bringing this to our attention and apologize for the oversight. We appreciate the reviewer’s careful attention to the supplementary materials. We have addressed these issues accordingly: 1) all source data have been consolidated into a single, clearly labeled Excel file to improve accessibility for readers; this file is now cited at the appropriate locations in the manuscript. 2) The supplementary videos mentioned in the manuscript have also been included in the re-submission.

      (3) In Figure 1B, it is hard to recapitulate the increase in dendritic density in the presented pictures. Could the authors please highlight dendrites in the raw imaging files (e.g. by colour coding as done later in the manuscript). Also, it might be helpful to indicate the measured parameters visually in this Figure (e.g. volume, length, etc.).

      We thank the reviewer for the helpful suggestion. As suggested, we have pseudocolored the dendrites in Figure 1B to enhance visual clarity.

      As noted, the original legend stated that “the sensilla were arranged from left to right in order of increasing dendritic branch counts”. To improve clarity, we have now added the number of dendritic branches above each sensillum to make this information more explicit.

      We hope these changes make the figure more accessible and informative for readers.

      (4) Given the strength of the authors in in vivo physiology and single sensilla recordings, I would be very curious about how the described morphological heterogeneity is reflected in the response properties of ab1Cs and ab1Ds. Can the authors provide data (already existing from their lab) of these two neurons on response heterogeneity? I acknowledge that spike sorting can be very challenging in ab1s, but maybe it is possible to show the range of response sensitivities upon CO2 stimulation in ab1Cs? The authors speculate in the discussion and presented data will only be correlative - however I think it would strengthen the manuscript to have some link to physiology included.

      We thank the reviewer for this insightful comment. We share the same curiosity about response variability among homotypic ORNs, including ab1C and ab1D. Ideally, this question could be addressed by recording from a large proportion of neurons of a given ORN type to assess the response variability within a single antenna. However, due to technical limitations, we are only able to reliably record from 3–4 ab1 sensilla per antennal preparation, representing approximately 8% of the total ab1 population.

      Moreover, our recordings are typically limited to ab1 sensilla located on the posterior-medial side of the antenna, as this region provides the best accessibility for our recording electrode. This spatial constraint may limit our ability to sample the full morphological diversity of ab1C and ab1D neurons.

      Given these limitations, it is technically challenging to rigorously assess physiological variability in ab1C and ab1D responses across the entire ab1 population. Nonetheless, we attempted to address this question using a different sensillum type where a larger proportion of the population is accessible to single-sensillum recording per antennal preparation. Specifically, we focused on ab2 sensilla in the following analysis because we can reliably record from 6 sensilla per antenna, representing approximately 25% of the total ab2 population.

      In the preliminary data presented below, we recorded from 6 ab2A ORNs per antenna across a total of 6 flies. Spike analysis revealed that odor-evoked responses were consistent across individual ab2A neurons (Author response image 1A). When analyzing the dose-response curve for each ORN, we found no statistically significant differences in odor sensitivity, either among ORNs within the same antenna or across different flies (Author response image 1B; two-way ANOVA: P > 0.99 within antennae, P > 0.99 across flies). This is further supported by the closely clustered EC50 values (Author response image 1C). This result suggests that odor sensitivity is largely uniform among homotypic ab2A ORNs.

      Author response image 1.

      Homotypic ab2A ORNs display similar odorant sensitivity. (A) Single-sensillum recording. Raster plots of ab2A/Or59b ORN spike responses. Six ab2A ORNs from the same antenna were recorded per fly. Odor stimulus: methyl acetate (10⁻⁶). (B) Dose-response relationships of peak spike responses, normalized to the maximum response of the ORN to facilitate comparison of odor sensitivity. Each curve represents responses from a single ab2A ORN fitted with the Hill equation (n=36 ab2 sensilla from 6 flies). Responses recorded from the same antenna are indicated by the same color. Statistical comparisons between different ab2A ORNs from the same antenna (P > 0.99) or across flies (P > 0.99) were performed by two-way ANOVA. (C) Quantification of individual pEC50 values from (B), defined as -logEC50.
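
      For readers less familiar with the quantities in this legend, the following sketch shows a common form of the Hill equation and the pEC50 transformation. The parameterization, the base-10 logarithm, and all numerical values are assumptions for illustration; the legend does not specify the exact form used for fitting.

```python
import math


def hill_response(conc: float, ec50: float, hill_n: float, r_max: float = 1.0) -> float:
    """Response at a given concentration under a standard Hill equation.

    r_max is the maximum response (1.0 for normalized data), ec50 the
    concentration giving a half-maximal response, and hill_n the Hill
    coefficient that sets the steepness of the curve.
    """
    return r_max * conc ** hill_n / (ec50 ** hill_n + conc ** hill_n)


def pec50(ec50: float) -> float:
    """pEC50 = -log10(EC50); a larger value means higher sensitivity."""
    return -math.log10(ec50)


# Hypothetical parameters, not fitted values from the recordings.
example_ec50 = 1e-5
example_n = 1.5
for conc in (1e-7, 1e-6, 1e-5, 1e-4):
    print(f"{conc:.0e}: {hill_response(conc, example_ec50, example_n):.3f}")
print("pEC50:", pec50(example_ec50))
```

      Because pEC50 is a logarithmic transform of EC50, closely clustered pEC50 values correspond to EC50 values that differ by small multiplicative factors.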

      However, we are hesitant to include this result in the main manuscript for several reasons. First, it does not directly relate to the morphometric analysis of ab1C and ab1D neurons, which is the primary focus of our study. Second, while we were able to record from approximately 25% of the ab2 population, this level of coverage is still limited and potentially subject to sampling bias due to the spatial constraints of the antennal region accessible to the recording electrode.

      At best, our data suggest limited variability in odor sensitivity among the recorded ab2A ORNs. However, we are cautious about generalizing this finding to the entire ab2 population. In light of these considerations, we hope the reviewer can appreciate the technical challenges inherent in addressing what may appear to be a straightforward question.

      For these reasons, we have chosen to include this preliminary result in the response only, rather than in the main manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment 

      This study investigates associations between retrotransposon element expression and methylation with age and inflammation, using multiple public datasets. The study is valuable because a systematic analysis of retrotransposon element expression during human aging has been lacking. However, the data provided are incomplete due to the sole reliance on microarray expression data for the core analysis of the paper. 

      Both reviewers found this study to be important. We selected the human blood microarray datasets used in a comprehensive study of ageing published in Nature Communications (DOI: 10.1038/ncomms9570). We included only the datasets specifically collected for ageing studies. Therefore, the large RNA-seq cohorts for cancer, cardiovascular, and neurological diseases were not relevant to this study and could not be included.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary: 

      Tsai and Seymen et al. investigate associations between RTE expression and methylation and age and inflammation, using multiple public datasets. The concept of the study is in principle interesting, as a systematic analysis of RTE expression during human aging is lacking. 

      We thank the reviewer for the positive comment. 

      Unfortunately, the reliance on expression microarray data, used to perform the core analysis of the paper places much of the study on shaky ground. The findings of the study would not be sufficiently supported until the authors validate them with more suitable methods. 

      In the discussion section of the manuscript, we have clarified that “we are aware of the limitations imposed by using microarray in this study, particularly the low number of intergenic probes in the expression microarray data. Our study can be enriched with the advent of large RNA-seq cohorts for aging studies in the future.” However, the use of microarrays for RTE expression analysis was introduced previously (DOI: 10.1371/journal.pcbi.1002486) and has been applied in highly cited and important publications (DOI: 10.1038/ncomms1180, DOI: 10.1093/jnci/djr540). In fact, in a manuscript published by Reichmann et al. (DOI: 10.1371/journal.pcbi.1002486), which has been cited 76 times, the authors showed and experimentally verified that cryptic repetitive element probes present in Illumina and Affymetrix gene expression microarray platforms can accurately and sensitively monitor repetitive element expression. Inspired by this methodological manuscript and its reasonable acceptance by other researchers, we trusted that the RTE microarray probes could accurately quantify RTE expression at the class and family levels.

      Strengths: 

      This is a very important biological problem. 

      Weaknesses: 

      RNA microarray probes are obviously biased to genes, and thus quantifying transposon analysis based on them seems dubious. Based on how arrays are designed there should at least be partial (perhaps outdated evidence) that the probe sites overlap a protein-coding or non-coding RNA. 

      We disagree with the reviewer that quantifying transposon expression from microarray data is dubious. As previously shown by Reichmann et al., the quantification is reliable as long as the probes do not overlap with annotated genes and are in the correct orientation to detect sense repetitive element transcripts. Reichmann et al. identified 1,400 repetitive element probes in version 1.0, version 1.1 and version 2.0 of the Illumina Mouse WG-6 Beadchips by comparing the genomic locations of the probes with the RepeatMasked regions of the mouse genome. We applied the same criteria to Illumina Human HT-12 V3 (29431 probes) and V4 (33963 probes) to identify the RTE-specific probes.
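
      The selection criteria described above can be sketched as a simple interval filter. This is a minimal illustration, not the pipeline used in the study: the `Interval` type, the `select_rte_probes` function, the toy coordinates, and the convention that a probe's strand denotes the strand of the transcript it detects are all assumptions made for the example.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str  # "+" or "-"


def overlaps(a, b):
    """True if two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def select_rte_probes(probes, repeats, exons):
    """Keep probes that lie in a repeat in sense orientation and hit no exon."""
    selected = []
    for name, probe in probes.items():
        in_sense_repeat = any(
            overlaps(probe, r) and probe.strand == r.strand for r in repeats
        )
        hits_exon = any(overlaps(probe, e) for e in exons)
        if in_sense_repeat and not hits_exon:
            selected.append(name)
    return selected


# Hypothetical toy annotation (illustration only).
toy_probes = {
    "p1": Interval("chr1", 100, 150, "+"),  # sense repeat, no exon: kept
    "p2": Interval("chr1", 500, 550, "+"),  # sense repeat but overlaps an exon
    "p3": Interval("chr1", 900, 950, "-"),  # repeat overlap, antisense
    "p4": Interval("chr2", 100, 150, "+"),  # no repeat overlap
}
toy_repeats = [
    Interval("chr1", 50, 300, "+"),
    Interval("chr1", 480, 600, "+"),
    Interval("chr1", 880, 1000, "+"),
]
toy_exons = [Interval("chr1", 520, 700, "+")]
toy_selected = select_rte_probes(toy_probes, toy_repeats, toy_exons)
```

      A real implementation would typically use an interval index rather than the quadratic scan shown here, but the filtering logic is the same.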

      The authors state they only used intergenic probes, but based on supplementary files, almost half of RTE probes are not intergenic but intronic (n=106 out of 264). 

      All our identified RTE probes overlap with intergenic regions. However, due to their repetitive nature, some probes also overlap with intronic regions. We have replaced "intergenic" with "non-coding" in our resubmission to show that they do not overlap with the exons of protein-coding genes. However, we do not rule out the possibility that some of our detected RTE probes might overlap non-coding RNAs. In fact, the border between the coding and non-coding genome has recently become very fuzzy with new annotations of the genome. RTE RNAs can easily be considered non-coding RNAs if we challenge our traditional junk DNA view.

      This is further complicated by the fact that not all this small subset of probes is available in all analyzed datasets. For example, 232 probes were used for the MESA dataset but only 80 for the GTP dataset. Thus, RTE expression is quantified with a set of probes which is extremely likely to be highly affected by non-RTE transcripts and that is also different across the studied datasets. Differences in the subsets of probes could very well explain the large differences between datasets in multiple of the analyses performed by the authors, such as in Figure 2a, or 3a. It is nonetheless possible that the quantification of RTE expression performed by the authors is truly interpretable as RTE expression, but this must be validated with more data from RNA-seq. Above all, microarray data should not be the main type of data used in the type of analysis performed by the authors. 

      In this study, we did not compare MESA with GTP or the other datasets. We analysed each dataset separately based on the data available for it. Therefore, sacrificing one analysis because information is lacking in another does not make sense; we would do so only if we were comparing datasets. Moreover, the datasets are not comparable because they were collected from different types of blood samples.

      Reviewer #2 (Public Review): 

      Summary: 

      Yi-Ting Tsai and colleagues conducted a systematic analysis of the correlation between the expression of retrotransposable elements (RTEs) and aging, using publicly available transcriptional and methylome microarray datasets of blood cells from large human cohorts, as well as single-cell transcriptomics. Although DNA hypomethylation was associated with chronological age across all RTE biotypes, the authors did not find a correlation between the levels of RTE expression and chronological age. However, expression levels of LINEs and LTRs positively correlated with DNA demethylation, and inflammatory and senescence gene signatures, indicative of "biological age". Gene set variation analysis showed that the inflammatory response is enriched in the samples expressing high levels of LINEs and LTRs. In summary, the study demonstrates that RTE expression correlates with "biological" rather than "chronological" aging. 

      Strengths: 

      The question the authors address is both relevant and important to the fields of aging and transposon biology. 

      We thank the reviewer for finding this study relevant and important.

      Weaknesses: 

      The choice of methodology does not fully support the primary claims. Although microarrays can detect certain intergenic transposon sequences, the authors themselves acknowledge in the Discussion section that this method's resolution is limited. More critical considerations, however, should be addressed when interpreting the results. The coverage of transposon sequences by microarrays is not only very limited (232 unique probes) but also predetermined. This implies that any potential age-related overexpression of RTEs located outside of the microarray-associated regions, or of polymorphic intact transposons, may go undetected. Therefore, the authors should be more careful while generalising their conclusions. 

      This is a bioinformatics study, and we have already acknowledged and discussed these limitations in the discussion section of the manuscript. All technologies have limitations, and incomplete information should not stop us from shedding light on scientific facts. In the manuscript, we have discussed that all large and proper ageing studies were performed using microarray technology. Peters et al. (DOI: 10.1038/ncomms9570) adopted all these datasets in their transcriptional landscape of ageing manuscript, which were used in previous studies of ageing as well. Our study essentially applies the Reichmann et al. method to the peripheral blood-related data from the Peters et al. manuscript. Since hypomethylation due to ageing is a well-established and broad epigenetic reprogramming, it is unlikely that only a fraction of RTEs is affected by this phenomenon. Therefore, the subsampling of RTEs should not substantially affect the result. Indeed, this is supported in our study by the inverse correlation between DNA methylation and RTE expression for LINE and SINE classes despite having limited numbers of probes for LINE and SINE expression.

      Additionally, for some analyses, the authors pool signals from RTEs by class or family, despite the fact that these groups include subfamilies and members with very different properties and harmful potentials. For example, while sequences of older subfamilies might be passively expressed through readthrough transcription, intact members of younger groups could be autonomously reactivated and cause inflammation. The aggregation of signals by the largest group may obscure the potential reactivation of smaller subgroups. I recommend grouping by subfamily or, if not possible due to the low expression scores, by subgroup. For example, all HERV subfamilies are from the ERVL family. 

      We agree with the reviewer that different subfamilies of RTEs play different roles through their activation. However, we will lose our statistical power if we study RTE subfamilies with a few probes. Global epigenetic alteration and derepression of RTEs by ageing have been observed to be genome-wide. While our systematic analysis across RTE classes and families cannot capture alterations in subfamilies due to statistical power, it is still relevant to the research question we are addressing.

      Next, Illumina arrays might not accurately represent the true abundance of TEs due to nonspecific hybridization of genomic transposons. Standard RNA preparations always contain traces of abundant genomic SINEs unless DNA elimination is specifically thorough. The problem of such noise should be addressed. 

      We have checked the RNA isolation step in the MESA, GTP, and GARP manuscripts. The total RNA was isolated using the Qiagen mini kit following the manufacturer’s recommendations. The authors of these manuscripts did not mention whether they eliminated genomic DNA, but we assumed they were aware of DNA contamination and eliminated it based on the manufacturer’s recommendations. We searched the literature on nonspecific hybridization of RTEs but could not find any evidence to support this observation. We would appreciate it if the reviewers could provide more evidence about such RTE contamination.

      Lastly, scRNAseq was conducted using 10x Genomics technology. However, quantifying transposons in 10x sequencing datasets presents major challenges due to sparse signals. 

      Applying the scTE pipeline (https://www.nature.com/articles/s41467-021-21808-x), we found that the statistical power for quantifying RTE classes (LINE, SINE, and LTR) or RTE families (L1, L2, Alu, ERVK, etc.) is as good as that for individual genes. However, our approach cannot analyse RTE subfamilies, and we did not attempt to do so.

      Smart-seq single-cell technology is better suited to this particular purpose. 

      We agree with the reviewer that Smart-seq provides higher yield than 10x, but no Smart-seq data are available for ageing studies.

      Anyway, it would be more convincing if the authors demonstrated TE expression across different clusters of immune cells using standard scRNAseq UMAP plots instead of boxplots. 

      Since the number of RTE reads per cell is low, showing per-cell RTE expression on a UMAP may not be the best statistical approach to show the difference between the aged and young groups. This is why we chose a pseudobulk analysis and displayed differential expression using boxplots rather than UMAPs for each immune cell type.
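
      The pseudobulk idea mentioned above can be illustrated with a minimal sketch: sparse per-cell counts are summed within each donor and cell type before any group comparison. The data layout, the `pseudobulk` function name, and the toy counts are assumptions made for the example and do not reproduce the authors' pipeline.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum per-cell counts within each (donor, cell_type) group."""
    totals = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        key = (cell["donor"], cell["cell_type"])
        for feature, count in cell["counts"].items():
            totals[key][feature] += count
    # Convert nested defaultdicts to plain dicts for easier downstream use.
    return {key: dict(counts) for key, counts in totals.items()}


# Hypothetical toy cells (illustration only).
toy_cells = [
    {"donor": "young_1", "cell_type": "T cell", "counts": {"L1": 2, "Alu": 1}},
    {"donor": "young_1", "cell_type": "T cell", "counts": {"L1": 1}},
    {"donor": "old_1", "cell_type": "T cell", "counts": {"L1": 4, "Alu": 3}},
    {"donor": "young_1", "cell_type": "B cell", "counts": {"Alu": 5}},
]
toy_bulk = pseudobulk(toy_cells)
```

      Aggregating in this way trades single-cell resolution for less sparse counts per sample, which is the motivation given for preferring boxplots of pseudobulk values.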

      I recommend validating the data by RNAseq, even on small cohorts. Given that the connection between RTE overexpression and inflammation has been previously established, the authors should consider better integrating their observations into the existing knowledge. 

      Please see below. We have analysed RNA-seq data suggested by Reviewer 1 in the Recommendations for the Authors section.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      I can recommend two sizeable human PBMC RNA-seq datasets that the authors could use:

      Marquez et al. 2020 (phs001934.v1.p1, controlled access) and Morandini et al. 2023 (GSE193141, public access). There are likely other suitable datasets that I am not aware of. I would also recommend using identical sets of probes to quantify RTE expression across studies. If certain datasets have too few probes and would thus limit the number of probes available across all studies it might be a good idea to exclude the dataset, especially if the analysis has been supplemented by the additional RNA-seq datasets. 

      Until recently, there was no publicly available, non-cancerous, large cohort of RNA-seq data for ageing studies. We tried to gain access to the two RNA-seq datasets suggested by Reviewer 1: Marquez et al. 2020 (phs001934.v1.p1, controlled access) and Morandini et al. 2023 (GSE193141, public access).

      Unfortunately, the Marquez et al. 2020 data are not accessible to us because the authors only provide the data for projects related to cardiovascular diseases. However, we did analyse the Morandini et al. 2023 data, and we can confirm that no association was observed between any RTE class or family and chronological age (Author response image 1), which is a second strong piece of evidence supporting the statement in the manuscript. However, as expected, we found a positive correlation between RTE expression and IFN-I signature score (Author response image 2).

      Author response image 1.

      Linear analysis of RTE expression and chronological age.

      Author response image 2.

      Linear analysis of RTE expression and IFN gene signature expression.

      The authors use "biological age" and inflammation as interchangeable concepts, including in the title. Please correct this wording. 

      We have now added a new term to the manuscript, “biological age-related (BAR)”, which clearly addresses this distinction. We do not think it is necessary to change the title.

      The authors find correlations between RTE expression and age-associated gene signatures but not chronological age itself. This is puzzling because, as the wording suggests, the expression of these inflammatory pathways is age-associated. If RTE expression correlates with inflammation which itself correlates with age, one might expect RTE expression to also correlate with age. Do the authors see a correlation between various inflammatory gene signatures and chronological age, in the analyzed datasets? If yes, then how would you explain that discrepancy? Moreover, in this case, I would recommend using a linear model, rather than correlation, to separate the effects of chronological age and RTE expression on inflammation (Inflammation et al ~ Age + RTE expression), or equivalent designs.

      As described above, we have now introduced the BAR terminology, which resolves this confusion. We did not find a correlation between RTE expression and chronological age. However, we did identify the correlation between BAR gene signatures and RTE expression.

      To separate the effects of chronological age and RTE expression on BAR gene signature scores, we performed a generalized linear model (GLM) analysis using BAR gene signature scores as response variables and RTE expression and chronological age as predictors (BAR gene signature scores ~ RTE expression + chronological age). A significant association was observed between BAR gene signature scores and RTE expression in the GARP cohort (Author response image 3). However, when chronological age was considered as a predictor, we did not identify a correlation between chronological age and BAR gene signatures, indicating that BAR events are not correlated with chronological age (Author response image 3).

      Author response image 3.

      Generalized linear models (GLM) analysis (BAR gene signature scores ~ RTE expression + chronological age). For each RTE family, we separately performed GLM. Age (RTE family) indicates the chronological age when used in the design formula for that specific RTE family. 
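
      The design formula above can be illustrated with a minimal sketch. It assumes a Gaussian GLM with identity link, which reduces to ordinary least squares; the family actually used is not stated here. The function name `fit_linear_model` and the toy data are hypothetical and are constructed so that the score depends on RTE expression but not on age.

```python
def fit_linear_model(y, x1, x2):
    """Least-squares fit of y ~ b0 + b1*x1 + b2*x2 via the normal equations.

    Returns [b0, b1, b2]. Equivalent to a Gaussian GLM with identity link.
    """
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]
    # Build X^T X (3x3) and X^T y (length 3).
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    aug = [xtx[i] + [xty[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][3] / aug[i][i] for i in range(3)]


# Hypothetical toy data: the score is driven by RTE expression only.
toy_rte = [0.1, 0.4, 0.2, 0.8, 0.5, 0.9]
toy_age = [30.0, 70.0, 45.0, 60.0, 35.0, 50.0]
toy_bar = [0.5 + 2.0 * r for r in toy_rte]
intercept, beta_rte, beta_age = fit_linear_model(toy_bar, toy_rte, toy_age)
```

      Including both predictors in one model is what allows the RTE coefficient to be interpreted after accounting for age, rather than relying on two separate pairwise correlations.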

      Some of the gene sets used by the authors have considerable overlap with others and are also not particularly comprehensive. I can recommend this very comprehensive gene set: https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/SAUL_SEN_MAYO.  

      We chose not to use large gene lists such as the suggested SEN_MAYO list, as we found that Singscore struggles to generate reliable scores with sufficient variance when the number of genes increases to more than twenty. Although there is some overlap between inflammation-related genes and cellular senescence genes (e.g., IL6, IL1A, IL1B), it is important to note that each gene list focuses on different aspects of biological aging and should not be dismissed as redundant.

      Minor comments: 

      Overall, several sentences in the manuscript feel somewhat unnatural. I would recommend further proofreading. I will mention some examples:  

      Thank you for your feedback. We have fixed all these issues in the new submission.  

      • On line 34, "like the retroviruses" should be "like retroviruses". There are several other places in the text where "the" is not required. 

      Fixed.

      • On line 86, "to generate the RTE expression". "the" is again not necessary and I would replace "generate" with "quantify". 

      Fixed.

      • On line 86, "we mapped the probe locations to RepeatMasker". RepeatMasker is not a genome. Do you mean you mapped the probe location to a genome annotated by RepeatMasker? The same applies to line 99.  

      Fixed. We changed the sentence to: “To quantify RTE expression, we mapped the microarray probe locations to RTE locations in RepeatMasker to extract the list of noncoding (intergenic or intronic) probes that cover the RTE regions.”

      • Figure 1 contains a typo in the aims section: "evetns" instead of "events".  

      Fixed.

      • On line 495 "filtered out" seems to imply you removed intergenic probes. I assume you mean that you specifically selected intergenic probes. 

      Fixed.

      • Figure 1 nicely summarizes your datasets. Could you add a Figure 1b panel showing how you used RNA arrays to quantify RTE expression? This should include the number of probes for each RTE family, so I suggest merging this with Figure S1.  

      We disagree with the reviewer's suggestion to merge Figure 1 and Figure S1 because they address two different concepts.

      Reviewer #2 (Recommendations For The Authors): 

      In Figure 2c, it is unclear what colour scale has been used for age. 

      Thank you for the comment. We have added a legend for age in this figure.

      There are no figure legends for Supplementary Figures 1 to 5 and all figures after Supplementary Figure 8. 

      A new version with legends has been submitted.

      For different datasets used, the choice of "healthy" patients should be more clear and explicit.

      Are asymptomatic patients with autoimmune inflammatory disorders considered as "healthy"? If the blood analysed is not only from healthy patients (such as PBMCs from primary osteoarthrosis), how can the authors exclude that the inflammatory signature enrichment discovered in this study is associated not just with "biological age" but with the disease itself? 

      In our analysis, we did not exclusively study "healthy" individuals, as none of our datasets were initially collected from strictly healthy populations. While the microarray datasets were not specifically collected from people with particular diseases, they were also not screened for asymptomatic conditions. To demonstrate the same pattern in healthier cohorts, we added scRNA-seq analysis of confirmed healthy individuals to our study. However, the focus of this study is not on healthy aging. Instead, it is on biological ageing that includes both healthy and non-healthy ageing.

      We included the GARP (primary osteoarthritis) dataset as it is a cohort of age-related diseases (ARD). While we cannot definitively attribute inflammatory signatures enrichment to biological aging or disease, the observation of such enrichment in a cohort of ARD is worth considering. To make this clearer, we have replaced the term “healthy” with “non-cancerous” for microarray analysis throughout the paper.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Response to reviewers

      We would like to thank the reviewers for their feedback. Below we address their comments and have indicated the associated changes in our point-by-point response (blue: answers, red: changes in manuscript).

      Reviewer #1:

      Overall, the hypotheses and results are clearly presented and supported by high quality figures. The study is presented in a didactic way, making it easy for a broad audience to understand the significance of the results. The study does present some weaknesses that could easily be addressed by the authors.

      We thank the reviewer for appreciating our work and providing useful suggestions for improvement.

      1) First, there are some anatomical inaccuracies: line 129 and fig1C, the authors omit medial septum projections to area CA1 (in addition to the entorhinal cortex). Moreover, in addition to CA1, CA3 also provides monosynaptic feedback projections to the medial septum. Finally, an indirect projection from CA1/3 excitatory neurons to the lateral septum, which in turn sends inhibitory projections to the medial septum could be included or mentioned by the authors. This could be of particular relevance to support claims related to effects of neurostimulations, whereby meticulous implementation of anatomical data could be key.

      If not updating their model, the authors could add this point to their limitation section, where they already do a good job of mentioning some limitations of using the EC as a sole oscillatory input to CA1.

      We acknowledge that our current model strongly simplifies the interconnections between the medial septum and the hippocampal formation, but including more anatomical details is beyond the scope of this manuscript and would be a topic for future work. Nevertheless, we followed the reviewer’s advice to stress this point in our manuscript. First, we moved a paragraph that was initially in the “methods” section to the “results” section (L.141-150 of the revised manuscript):

      “Biologically, GABAergic neurons from the medial septum project to the EC, CA3, and CA1 fields of the hippocampus (Toth et al., 1993; Hajós et al., 2004; Manseau et al., 2008; Hangya et al., 2009; Unal et al., 2015; Müller and Remy, 2018). Although the respective roles of these different projections are not fully understood, previous computational studies have suggested that the direct projection from the medial septum to CA1 is not essential for the production of theta in CA1 microcircuits (Mysin et al., 2019). Since our modeling of the medial septum is only used to generate a dynamic theta rhythm, we opted for a simplified representation where the medial septum projects only to the EC, which in turn drives the different fields of the hippocampus. In our model, Kuramoto oscillators are therefore connected to the EC neurons and they receive projections from CA1 neurons (see methods for more details).”

      Second, we expanded the corresponding paragraph in the limitation section to discuss this point further (L.398-415 of the revised manuscript):

      “We decided to model septal pacemaker neurons projecting to the EC as the main source of hippocampal theta as reported in multiple experimental studies (Buzsáki, 2002; Buzsáki et al., 2003; Hangya et al., 2009). However, experimental findings and previous models have also proposed that direct septal inputs are not essential for theta generation (Wang, 2002; Colgin et al., 2013; Mysin et al., 2019), but play an important role in phase synchronization of hippocampal neurons. Furthermore, the model does not account for the connections between the lateral and medial septum and the hippocampus (Takeuchi et al., 2021). These connections include the inhibitory projections from the lateral to the medial septum and the monosynaptic projections from the hippocampal CA3 field to the lateral septum. An experimental study has highlighted the importance of the lateral septum in regulating the hippocampal theta rhythm (Bender et al., 2015), an area that has not been included in the model. Specifically, theta-rhythmic optogenetic stimulation of the axonal projections from the lateral septum to the hippocampus was shown to entrain theta oscillations and lead to behavioral changes during exploration in transgenic mice. To account for these discrepancies, our model could be extended by considering more realistic connectivity patterns between the medial / lateral septum and the hippocampal formation, including glutamatergic, cholinergic, and GABAergic reciprocal connections (Müller and Remy, 2018), or by considering multiple sets of oscillators each representing one theta generator.”

      2) The authors test conditions of low theta inputs, which they liken to pathological states (line 112). It is not clear what pathology the authors are referring to, especially since a large number of 'oscillopathies' in the septohippocampal system are associated with decreased gamma/PAC, but not theta oscillations (e.g. Alzheimer's disease conditions).

      In the manuscript, we referred to “oscillopathies” in a broad sense, as we did not want to overstate the biological implications of the model or the way we modeled pathological states. To our knowledge, several studies have yielded inconsistent results regarding the specific changes in theta or gamma power in Alzheimer’s disease, and the most convincing alteration seems to be in theta-gamma phase-amplitude coupling (PAC) (for review see e.g., Kitchigina, V. F. Alterations of Coherent Theta and Gamma Network Oscillations as an Early Biomarker of Temporal Lobe Epilepsy and Alzheimer’s Disease. Front Integr Neurosci 12, 36 (2018)), as also mentioned by the reviewer.

      In this study, the most straightforward way to reduce theta-gamma PAC was to reduce the amplitude of the oscillators’ gain, which affected theta power, gamma power, and theta-gamma PAC (Figure 5 of the revised manuscript). Affecting their synchronization level (i.e., the order parameter) did not affect any of these variables (Figure 5 – Figure Supplement 4).
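
      For readers unfamiliar with the order parameter mentioned above, the following minimal sketch simulates a population of Kuramoto phase oscillators and measures their synchronization. It is a generic textbook Kuramoto model in mean-field form, not the oscillator implementation of the manuscript; the function names and all parameter values (population size, coupling, a centre frequency of 6 Hz, time step) are assumptions chosen for illustration.

```python
import cmath
import math
import random


def order_parameter(phases):
    """Kuramoto order parameter r in [0, 1]; r = 1 means full phase synchrony."""
    return abs(sum(cmath.exp(1j * p) for p in phases)) / len(phases)


def simulate_kuramoto(n=50, coupling=20.0, f0=6.0, f_sd=0.5,
                      dt=1e-3, steps=3000, seed=0):
    """Euler integration of dθ_i/dt = ω_i + K r sin(ψ - θ_i).

    This mean-field form is equivalent to the pairwise sum
    (K/N) Σ_j sin(θ_j - θ_i). Natural frequencies are drawn around f0 (Hz).
    """
    rng = random.Random(seed)
    omegas = [2 * math.pi * rng.gauss(f0, f_sd) for _ in range(n)]
    phases = [rng.uniform(0.0, 2 * math.pi) for _ in range(n)]
    for _ in range(steps):
        z = sum(cmath.exp(1j * p) for p in phases) / n
        r, psi = abs(z), cmath.phase(z)
        phases = [p + dt * (w + coupling * r * math.sin(psi - p))
                  for p, w in zip(phases, omegas)]
    return phases


final_phases = simulate_kuramoto()
final_sync = order_parameter(final_phases)
```

      In this sketch the coupling strength controls how close the order parameter gets to 1, which is a separate knob from any gain applied to the oscillators' output.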

      In order to alter theta-gamma PAC without affecting theta or gamma power, we believe that more complex changes should be performed in the model, likely at the level of individual neurons in the hippocampal formation. For example, cholinergic deprivation has been previously used in a multi-compartment model of the hippocampal CA3 to mimic Alzheimer’s disease and to draw functional implications on the slowing of theta oscillations and the storage of new information (Menschik, E. D. & Finkel, L. H. Neuromodulatory control of hippocampal function: towards a model of Alzheimer’s disease. Artif Intell Med 13, 99–121 (1998)).

      This has now been added to the limitations section (L.458-465 of the revised manuscript):

      “Finally, we likened conditions of low theta input to pathological states characteristic of oscillopathies such as Alzheimer’s disease, as these conditions disrupted all aspects of theta-gamma oscillations in our model: theta power, gamma power, and theta-gamma PAC (Figure 5). However, it should be noted that changes in theta or gamma power in these pathologies are often unclear, and that the most consistent alteration that has been reported in Alzheimer’s disease is a reduction of theta-gamma PAC (for review, see Kitchigina, 2018). Future work should explore the effects of cellular alterations intrinsic to the hippocampal formation and their impact on theta-gamma oscillations.”
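
      Theta-gamma PAC, as discussed above, can be quantified in several ways. The sketch below uses the mean vector length, one commonly used PAC measure; it is not necessarily the metric used in the manuscript. It assumes that the theta phase and the gamma amplitude envelope have already been extracted from the signal, and the toy series are synthetic.

```python
import cmath
import math


def mean_vector_length(theta_phase, gamma_amp):
    """Mean vector length PAC: |mean(A_gamma * exp(i * phi_theta))|."""
    n = len(theta_phase)
    total = sum(a * cmath.exp(1j * p) for p, a in zip(theta_phase, gamma_amp))
    return abs(total) / n


# Synthetic toy series: phases sampled uniformly over one theta cycle.
n_samples = 1000
toy_phase = [2 * math.pi * k / n_samples for k in range(n_samples)]
# Gamma amplitude locked to theta phase (coupled) versus constant (uncoupled).
toy_amp_coupled = [1.0 + 0.5 * math.cos(p) for p in toy_phase]
toy_amp_flat = [1.0 for _ in toy_phase]

pac_coupled = mean_vector_length(toy_phase, toy_amp_coupled)
pac_flat = mean_vector_length(toy_phase, toy_amp_flat)
```

      Because this measure scales with the gamma amplitude itself, it is usually compared against surrogate data or normalized in practice, which is one reason changes in power and changes in PAC can be difficult to separate.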

      3) While relevant for the clinical field, there is overall a missed opportunity to explain many experimental accounts with this novel model. Although to this day, clinical use of DBS is mostly restricted to electrical (and thus cell-type agnostic) stimulation, recent studies focusing on mechanisms of neurostimulations have manipulated specific subtypes in the medial septum and observed effects on hippocampal oscillations (e.g. see Muller & Remy, 2017 for review). Focusing stimulations in CA1 is of course relevant for clinical studies but testing mechanistic hypotheses by focusing stimulation on specific cell types could be highly informative. For instance, could the author reproduce recent optogenetic studies (e.g. Bender et al. 2015 for stimulation of fornix fibers; Etter et al., 2019 & Zutshi et al. 2018 for stimulation of septal inhibitory neurons)? Cell specific manipulations should at least be discussed by the authors.

      We acknowledge the importance of cell-type-specific manipulation in the septo-hippocampal circuitry. However, our model was designed to study neurostimulation protocols that affect the hippocampal formation, not the medial septum, which is why only the hippocampal formation is composed of biophysically realistic (i.e., conductance-based) neuronal models. To replicate the various studies mentioned by the reviewer (which are all very relevant), we would need to implement a biophysical model of the medial septum, which would be an entirely new project.

      Nevertheless, we can use the existing model to replicate optogenetic studies that induced gamma oscillations in excitatory-inhibitory circuits, using either ramped photostimulation targeting excitatory neurons (Adesnik et al., 2010; Akam et al., 2012; Lu et al., 2015), or pulsed stimulation driving inhibitory cells in the gamma range (Cardin et al., 2009; Iaccarino et al., 2016). In fact, such approaches have been demonstrated not just in the hippocampus but also in the neocortex, and represent a hallmark of local excitatory-inhibitory circuits. To account for these experimental results and replicate them, we have added 4 new figures (Figure 2 and its 3 figure supplements) and an extensive section in the results part (L.151-217 of the revised manuscript):

      “From a conceptual point of view, our model is thus composed of excitatory-inhibitory (E-I) circuits connected in series, with a feedback loop going through a population of coupled phase oscillators. In the next sections, we first describe the generation of gamma oscillations by individual E-I circuits (Figure 2), and illustrate their behavior when driven by an oscillatory input such as theta oscillations (Figure 3). We then present a thorough characterization of the effects of theta input and stimulation amplitude on theta-nested gamma oscillations (Figure 4 and Figure 5). Finally, we present some results on the effects of neurostimulation protocols for restoring theta-nested gamma oscillations in pathological states (Figure 6 and Figure 7).

      Generation of gamma oscillations by E-I circuits

      It is well-established that a network of interconnected pyramidal neurons and interneurons can give rise to oscillations in the gamma range, a mechanism termed pyramidal-interneuronal network gamma (PING) (Traub et al., 2004; Onslow et al., 2014; Segneri et al., 2020). This mechanism has been observed in several optogenetic studies with gradually increasing light intensity (i.e., under a ramp input) affecting multiple different circuits, such as layer 2-3 pyramidal neurons of the mouse somatosensory cortex (Adesnik et al., 2010), the CA3 field of the hippocampus in rat in vitro slices (Akam et al., 2012), and the non-human primate motor cortex (Lu et al., 2015). In all cases, gamma oscillations emerged above a certain threshold in terms of photostimulation intensity, and the frequency of these oscillations was either stable or slightly increased when increasing the intensity further. We sought to replicate these findings with our elementary E-I circuits composed of single-compartment conductance-based neurons driven by a ramping input current (Figure 2 and Figure S2). As an example, all the results in this section will be shown for an E-I circuit with connectivity parameters similar to those of the CA1 field of the hippocampus in our complete model (see section “Hippocampal formation: inputs and connectivity” in the methods).

      For low input currents provided to both neuronal populations, only the highly-excitable interneurons were activated (Figure 2A). For a sufficiently high input current (i.e., a strong input that could overcome the inhibition from the fast-spiking interneurons), the pyramidal neurons started spiking as well. As the amplitude of the input increased, the activity of both neuronal populations became synchronized in the gamma range, asymptotically reaching a frequency of about 60 Hz (Figure 2A bottom panel). Decoupling the populations led to the abolition of gamma oscillations (Figure 2B), as neuronal activity was determined solely by the intrinsic properties of each cell. Interestingly, when the ramp input was provided solely to the excitatory population, we observed that the activity of the pyramidal neurons preceded the activity of the inhibitory neurons, while still preserving the emergence of gamma oscillations (Figure S2A). As expected, decoupling the populations also abolished gamma oscillations, with the excitatory neurons spiking at a frequency determined by their intrinsic properties and the inhibitory population remaining silent (Figure S2B).

      To further characterize the intrinsic properties of individual inhibitory and excitatory neurons, we derived their input-frequency (I-F) curves, which represent the firing rate of individual neurons in response to a tonic input (Figure S3A). We observed that for certain input amplitudes, the firing rates of both types of neurons were within the gamma range. Interestingly, in the absence of noise, each population could by itself generate gamma oscillations that were purely driven by the input and determined by the intrinsic properties of the neurons (Figure S3B). Adding stochastic Gaussian noise in the membrane potential disrupted these artificial oscillations in decoupled populations (Figure S3C). All subsequent simulations were run with similar noise levels to prevent the emergence of artificial gamma oscillations.

      Another potent way to induce gamma oscillations is to drive fast-spiking inhibitory neurons using pulsed optogenetic stimulation at gamma frequencies, a strategy that has been used both in the neocortex (Cardin et al., 2009) and hippocampal CA1 (Iaccarino et al., 2016). In particular, Cardin and colleagues systematically investigated the effect of driving either excitatory or fast-spiking inhibitory neocortical neurons at frequencies between 10 and 200 Hz (Cardin et al., 2009). They showed that fast-spiking interneurons are preferentially entrained around 40-50 Hz, while excitatory neurons respond better to lower frequencies. To verify the behavior of our model against these experimental data, we simulated pulsed optogenetic stimulation as an intracellular current provided to our reduced model of a single E-I circuit. Stimulation was applied at frequencies between 10 and 200 Hz to excitatory cells only, to inhibitory cells only, or to both at the same time (Figure S4). The population firing rates were used as a proxy for the local field potentials (LFP), and we computed the relative power in a 10-Hz band centered around the stimulation frequency, similarly to the method proposed in (Cardin et al., 2009). When presented with continuous stimulation across a range of frequencies in the gamma range, interneurons showed the greatest degree of gamma power modulation (Figure S4). Furthermore, when the stimulation was delivered to the excitatory population, the relative power around the stimulation frequency dropped significantly at frequencies above 10 Hz, similar to the reported experimental data (Cardin et al., 2009). The main difference between our simulation results and these experimental data is the specific frequencies at which fast-spiking interneurons showed resonance, which was slow gamma around 40 Hz in the mouse barrel cortex and fast gamma around 90 Hz in our model. This could be attributed to several factors, such as differences in the cellular properties between cortical and hippocampal fast-spiking interneurons, or the differences between the size of the populations and their relevant connectivity in the cortex and the hippocampus.”
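The relative-power measure mentioned in the excerpt above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the function names `power_spectrum` and `relative_band_power`, the naive discrete Fourier transform, and the synthetic firing-rate signal are all hypothetical choices made for the example. Only the idea of a 10-Hz band centered on the stimulation frequency is taken from the text.

```python
import math


def power_spectrum(signal, fs):
    """Return (freqs, power) for the non-negative frequencies of a real signal.

    A naive O(n^2) DFT is used so that the sketch needs only the standard library.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC offset
    freqs, power = [], []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(centered))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(centered))
        freqs.append(k * fs / n)
        power.append((re * re + im * im) / n)
    return freqs, power


def relative_band_power(signal, fs, f_stim, bandwidth=10.0):
    """Fraction of total power that falls in a band centered on f_stim."""
    freqs, power = power_spectrum(signal, fs)
    total = sum(power)
    if total == 0.0:
        return float("nan")  # a flat signal has no defined relative power
    lo, hi = f_stim - bandwidth / 2, f_stim + bandwidth / 2
    in_band = sum(p for f, p in zip(freqs, power) if lo <= f <= hi)
    return in_band / total


# Hypothetical population firing rate: a 40 Hz rhythm plus a weaker 8 Hz rhythm.
fs = 500.0  # sampling rate in Hz (assumed)
times = [i / fs for i in range(500)]  # 1 s of signal
rate = [10 + 5 * math.sin(2 * math.pi * 40 * t) + 2 * math.sin(2 * math.pi * 8 * t)
        for t in times]

power_at_40 = relative_band_power(rate, fs, 40.0)
power_at_90 = relative_band_power(rate, fs, 90.0)
```

In this toy signal most of the power sits in the band around 40 Hz and essentially none around 90 Hz, which is the kind of contrast the relative-power measure is meant to expose when the stimulation frequency is swept.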

      Author response image 1.

      Figure 2. Emergence of gamma oscillations in coupled excitatory-inhibitory populations under ramping input to both populations. A. Two coupled populations of excitatory pyramidal neurons (NE = 1000) and inhibitory interneurons (NI = 100) are driven by a ramping current input (0 nA to 1 nA) for 5 s. As the input becomes stronger, oscillations start to emerge (shaded green area), driven by the interactions between excitatory and inhibitory populations. The green inset shows the raster plot (neuronal spikes across time) of the two populations during the green shaded period (red for inhibitory; blue for excitatory). When the input becomes sufficiently strong (shaded magenta area), the populations become highly synchronized and produce oscillations in the gamma range (at approximately 50 Hz). The spectrogram (bottom panel) shows the power of the instantaneous firing rate of the pyramidal population as a function of time and frequency. It reveals the presence of gamma oscillations that emerge around 2 s and increase in frequency until 4 s, when they settle at approximately 60 Hz. B. Similar depiction as in panel A, with the pyramidal-interneuronal populations decoupled. The absence of coupling leads to the abolition of gamma oscillations, with the spiking activity of each cell being driven by its own inputs and intrinsic properties.

      Author response image 2.

      Figure S2 (Figure 2 – Figure Supplement 1). Emergence of gamma oscillations in coupled excitatory-inhibitory populations under ramping input to the excitatory population. Similar representation as in Figure 2, but with the input provided only to the excitatory population. All conclusions remain the same. In addition, the inhibitory population does not show any spiking activity in the decoupled case.

      Author response image 3.

      Figure S3 (Figure 2 – Figure Supplement 2). Cell-intrinsic spiking activity in decoupled excitatory and inhibitory populations under ramping input. A. Input-Frequency (I-F) curves for excitatory cells (left panel; pyramidal neurons with ICAN) and inhibitory cells (right panel; interneurons, fast-spiking) used in the model. Above a certain tonic input (around 0.35 nA for excitatory and 0.1 nA for inhibitory neurons), neurons can spike in the gamma range. B. Raster plot showing the spiking activity of excitatory (blue, NE = 1000) and inhibitory (red, NI = 100) neurons in decoupled populations under ramping input (top trace) and in the absence of noise in the membrane potential. Despite random initial conditions across neurons, oscillations emerge in both populations due to the intrinsic properties of the cells, with a frequency that is predicted by the respective I-F curves (panel A.). C. Similar representation as panel B. but with the addition of stochastic noise in the membrane potential of each neuron. The presence of noise disrupts the emergence of oscillations in these decoupled populations.

      Author response image 4.

      Beyond these weaknesses, this study has a strong utility for researchers wanting to explore hypotheses in the field of neurostimulation. In particular, I see value in such models for exploring more intricate, phase-specific effects of continuous, as well as closed-loop, stimulation, which is on the rise in systems neuroscience.

      We thank the reviewer for this appreciation of our work and its future perspectives.

      Recommendations For The Authors:

      Line 144, the authors mention that their MI values are erroneous in absence of additive noise - could this be due to the non-sinusoidal nature of the phase signal recorded, and be fixed by upscaling model size?

      We thank the reviewer for this question and suggestion. The main reason behind the errors in the computation of the MI lies in the complete absence of oscillations at specific frequencies. Filtered signals within specific bands produced a power of 0 (or extremely low values), as seen in the power spectral densities. In such cases, the phase signal was not mathematically defined, but the toolbox we used to compute it still returned a numerical result that was inaccurate (for more details on the computation of the MI see Tort et al., 2010). To mitigate this numerical artefact, we decided to add uniform noise to the computed firing rates. This strategy is illustrated in Figure S6 (Figure 3 – Figure Supplement 2), which we have copied below for reference. Alternative approaches could probably have been used, such as increasing the noise in the membrane potential so that neurons would start spiking with firing rates that show more realistic power spectra, even in the absence of external inputs.
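To make the nature of this artefact concrete, the following is a minimal sketch of a modulation index computed as a normalized Kullback-Leibler divergence between the phase-binned amplitude distribution and a uniform distribution, which is our reading of the definition in Tort et al. (2010). It is not the toolbox used in the study: the function name `modulation_index`, the choice of 18 phase bins, and the synthetic signals are assumptions made for illustration. The sketch shows a closely related degenerate case: when the amplitude in a band is zero everywhere, the normalized distribution becomes 0/0 and the index is undefined, so the function returns NaN rather than a misleading number.

```python
import math


def modulation_index(phases, amplitudes, n_bins=18):
    """Normalized KL divergence of the phase-binned amplitude from uniform.

    phases: slow-rhythm phases in radians; amplitudes: fast-rhythm envelope (>= 0).
    Returns a value in [0, 1], or NaN when the amplitude carries no power.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    width = 2 * math.pi / n_bins
    for ph, amp in zip(phases, amplitudes):
        # Map the phase to a bin index; the final modulo guards against rounding.
        idx = int(((ph + math.pi) % (2 * math.pi)) / width) % n_bins
        sums[idx] += amp
        counts[idx] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    total = sum(means)
    if total <= 0.0:
        return float("nan")  # distribution undefined: no oscillation in this band
    p = [m / total for m in means]
    entropy = -sum(q * math.log(q) for q in p if q > 0)
    return (math.log(n_bins) - entropy) / math.log(n_bins)


# Synthetic example: phases sweep one full cycle uniformly.
n_samples = 3600
phases = [-math.pi + 2 * math.pi * i / n_samples for i in range(n_samples)]

mi_flat = modulation_index(phases, [1.0] * n_samples)                  # no coupling
mi_coupled = modulation_index(phases, [1.0 + math.cos(ph) for ph in phases])
mi_silent = modulation_index(phases, [0.0] * n_samples)                # undefined
```

A guard of this kind, or a small amount of added noise as described above, prevents the degenerate case from being reported as a genuine coupling value.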

      Author response image 5.

      Figure S6 (Figure 3 – Figure Supplement 2). Quantification of PAC with and without noise. A. Quantifying PAC in the absence of noise produced inaccurate identification of the coupled frequency bands, due to the complete absence of oscillations at some frequencies. All analyses are based on the CA1 firing rates (top traces) during a representative simulation. Power spectral densities of these firing rates (left) indicate that some frequencies have a power of 0. PAC of the excitatory population was assessed using two graphical representations, the polar plot (middle) and comodulogram (right), and quantified using the MI. The comodulogram was calculated by computing the MI across 80% overlapping 1-Hz frequency bands in the theta range and across 90% overlapping 10-Hz frequency bands in the gamma range and subsequently plotted as a heat map. In the absence of noise, a slow theta frequency centered around 5 Hz is found to modulate a broad range of gamma frequencies between 40 and 100 Hz. The value shown on the comodulogram is the average MI in the 3-9 Hz theta range and 40-80 Hz gamma range. As in Figure 2, the polar plot represents the amplitude of gamma oscillations (averaged across all theta cycles) at each phase of theta (theta range: 3-9 Hz, phase indicated as angular coordinate) and for different gamma frequencies (radial coordinate, binned in 1-Hz ranges). B. Adding uniform noise to the firing rate (with an amplitude ranging between 15 and 25% of the maximum firing rate) improved the identification of the coupled frequency bands. In this case, the slower theta frequency centered around 5 Hz modulates a gamma band located between 45 and 75 Hz.

      Reviewer #2:

      The main strength of this model is its use of a fairly physiologically detailed model of the hippocampus. The cells are single-compartment models but do include multiple ion channels and are spatially arranged in accordance with the hippocampal structure. This allows the understanding of how ion channels (possibly modifiable by pharmacological agents) interact with system-level oscillations and neurostimulation. The model also includes all the main hippocampal subfields. The other strength is its attention to an important topic, which may be relevant for dementia treatment or prevention, which few modeling studies have addressed. The work has several weaknesses.

      We thank the reviewer for appreciating our detailed description of the hippocampal formation and the focus on neurostimulation applications that aim at treating oscillopathies, especially dementia.

      1. First, while investigations of hippocampal neurostimulation are important, there are few experimental studies from which one could judge the validity of the model findings. All its findings are therefore predictions. It would be much more convincing to first show the model is able to reproduce some measured empirical neurostimulation effect before proceeding to make predictions.

      We acknowledge that the results presented in Figures 4-7 of the revised manuscript cannot be compared to existing experimental data, and are therefore purely predictive. Future experimental work is needed to verify these predictions.

      Yet, we would also like to stress that the motivation behind this project was the inadequacy of previous models of theta-nested gamma oscillations (Onslow et al., 2014; Aussel et al., 2018; Segneri et al., 2020) to account for the mechanism of theta phase reset that occurs during electrical stimulation of the fornix or perforant path (Williams and Givens, 2003). Since we could not use these previous models to study the effects of neurostimulation on theta-nested gamma oscillations, we had to modify them to account for a dynamical theta input, which is the main methodological novelty that is reported in our manuscript (Figures 1 and 3 of the revised manuscript).

      Despite the scarcity of experimental studies that could confirm the full model, we sought to replicate a few experimental findings that employed optogenetic stimulation to induce gamma oscillations in individual excitatory-inhibitory circuits. Although not specific to the hippocampus, these studies have shown that gamma oscillations can be induced using either ramped photostimulation targeting excitatory neurons (Adesnik et al., 2010; Akam et al., 2012; Lu et al., 2015), or pulsed stimulation driving inhibitory cells in the gamma range (Cardin et al., 2009; Iaccarino et al., 2016). To account for these experimental results and replicate them, we have added 4 new figures (Figure 2 and its 3 figure supplements) and an extensive section in the results part (L.141-217 of the revised manuscript). The added section and related figures are indicated in our response to reviewer 1, comment 3 (p 2-7).

      2.1. Second, the model is very specific. Or if its behavior is to be considered general it has not been explained why.

      Although the spatial organization and cellular details of the model are indeed very specific, its general behavior, i.e., the production of theta-nested gamma oscillations and theta phase reset, is common to any excitatory-inhibitory circuit interconnected with Kuramoto oscillators. To illustrate this point, we have generalized our approach to the neural mass model developed by Onslow and colleagues (Onslow ACE, Jones MW, Bogacz R. A Canonical Circuit for Generating Phase-Amplitude Coupling. PLoS ONE. 2014 Aug; 9(8):e102591). These results are represented in a new supplementary figure (Figure 3 – Figure Supplement 4), and briefly described in a new paragraph of the results section (L.262-268 of the revised manuscript):

      “Importantly, our approach is generalizable and can be applied to other models producing theta-nested gamma oscillations. For instance, we adapted the neural mass model by Onslow and colleagues (Onslow et al., 2014), replaced the fixed theta input by a set of Kuramoto oscillators, and demonstrated that it could also generate theta phase reset in response to single-pulse stimulation (Figure S8). These results illustrate that the general behavior of our model is not specific to the tuning of individual parameters in the conductance-based neurons, but follows general rules that are captured by the level of abstraction of the Kuramoto formalism.”

      Author response image 6.

      Figure S8 (Figure 3 – Figure Supplement 4). A neural mass model of coupled excitatory and inhibitory neurons driven by Kuramoto oscillators generates theta-nested gamma oscillations and theta phase reset. A. Two coupled neural masses (one excitatory and one inhibitory) driven by Kuramoto oscillators, which represent a dynamical oscillatory drive in the theta range, were used to implement a neural mass equivalent to our conductance-based model represented in Figure 1. Neural masses were modeled using the Wilson-Cowan formalism, with parameters adapted from Onslow et al. (2014) (𝑊𝐸𝐸 = 4.8, 𝑊𝐸𝐼 = 𝑊𝐼𝐸 = 4, 𝑊𝐼𝐼 = 0). B. The normalized population firing rates exhibit theta-nested gamma oscillations (middle and bottom panels) in response to the dynamic theta rhythm (top panel). A stimulation pulse delivered at the descending phase of the rhythm to both populations (marked by the inverted red triangle) produces a robust theta phase reset, similarly to Figure 3A.

      This simplified model is described in more detail in the methods (L.694-710 of the revised manuscript). Additionally, the generation of gamma oscillations by individual excitatory-inhibitory circuits is now described in detail in the added section “Generation of gamma oscillations by E-I circuits” (L.159-217 of the revised manuscript), which has already been discussed in our response to reviewer 1, comment 3 (p 2-7).
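As a rough illustration of the kind of oscillator population that supplies the theta drive in both versions of the model, the sketch below simulates a standard mean-field Kuramoto network and reads out a collective rhythm from its order parameter. It is a generic textbook form under stated assumptions, not the equations of the manuscript: the phase response term that implements theta phase reset is deliberately omitted, the function name `simulate_kuramoto` is hypothetical, and all parameter values other than the 6 Hz center frequency (which appears in the caption of Figure S7) are illustrative.

```python
import math
import random


def simulate_kuramoto(n=50, f_center=6.0, f_sd=0.5, coupling=40.0,
                      duration=2.0, dt=0.001, seed=0):
    """Euler integration of d(theta_i)/dt = w_i + K * r * sin(psi - theta_i).

    r and psi are the magnitude and phase of the population order parameter.
    Returns the final synchrony r and the collective rhythm r * cos(psi).
    """
    rng = random.Random(seed)
    # Natural frequencies (rad/s) scattered around the center frequency (assumed spread).
    omega = [2 * math.pi * rng.gauss(f_center, f_sd) for _ in range(n)]
    theta = [rng.uniform(-math.pi, math.pi) for _ in range(n)]
    rhythm = []
    r = 0.0
    for _ in range(round(duration / dt)):
        cx = sum(math.cos(th) for th in theta) / n
        sx = sum(math.sin(th) for th in theta) / n
        r = math.hypot(cx, sx)    # synchrony, between 0 and 1
        psi = math.atan2(sx, cx)  # collective phase
        rhythm.append(r * math.cos(psi))
        theta = [th + dt * (w + coupling * r * math.sin(psi - th))
                 for th, w in zip(theta, omega)]
    return r, rhythm


r_coupled, theta_rhythm = simulate_kuramoto(coupling=40.0)
r_uncoupled, _ = simulate_kuramoto(coupling=0.0)
```

With sufficiently strong coupling the oscillators lock and the read-out behaves like a coherent theta rhythm, whereas without coupling the phases disperse and the read-out has a much smaller amplitude. In the model described above, the output of the hippocampal circuit feeds back onto such oscillators through the phase response function, which is what allows a stimulation pulse to reset the rhythm.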

      2.2. For example, the model shows bistability between quiescence and TNGO, however what aspect of the model underlies this, be it some particular network structure or particular ion channel, for example, is not addressed.

      We thank the reviewer for mentioning this point, which we have now addressed. The “bistable” behavior that we reported occurs for values of the theta input that are just below the threshold to induce self-sustained theta-gamma oscillations (Figure 5 of the revised manuscript, point B). Moreover, the presence of the Calcium-Activated-Nonspecific (CAN) cationic channel, which is expressed by pyramidal neurons in the entorhinal cortex, CA3, and CA1 fields of the hippocampus, is necessary for this behavior to occur. Indeed, abolishing CAN channels in all areas of the model suppresses this behavior. We have now addressed this point in a new supplementary figure (Figure 5 – Figure Supplement 4) and a short description in the text (L.287-303 of the revised manuscript).

      “In the presence of dynamic theta input, the effects of single-pulse stimulation depended both on theta input amplitude and stimulation amplitude, highlighting different regimes of network activity (Figure 5 and Figure S9, Figure S10, Figure S11). For low theta input, theta-nested gamma oscillations were initially absent and could not be induced by stimulation (Figure 5A). At most, the stimulation could only elicit a few bursts of spiking activity that faded away after approximately 250 ms, similar to the rebound of activity seen in the absence of theta drive. For increasing theta input, the network switched to an intermediate regime: upon initialization at a state with no spiking activity, it could be kicked to a state with self-sustained theta-nested gamma oscillations by a single stimulation pulse of sufficiently high amplitude (Figure 5B). This regime existed for a range of septal theta inputs located just below the threshold to induce self-sustained theta-gamma oscillations without additional stimulation, as characterized by the post-stimulation theta power, gamma power, and theta-gamma PAC (Figure 5D). Removing CAN currents from all areas of the model abolished this behavior (Figure S12), which is interesting given the role of this current in the multistability of EC neurons (Egorov et al., 2002; Fransen et al., 2006) and in the intrinsic ability of the hippocampus to generate theta-nested gamma oscillations (Giovannini et al., 2017). For the highest theta input, the network became able to spontaneously generate theta-nested gamma oscillations, even when initialized at a state with no spiking activity and without additional neurostimulation (Figure 5C).”

      Author response image 7.

      Figure S12 (Figure 5 – Figure Supplement 4). CAN currents are necessary for the production of self-sustained theta-gamma oscillations in response to single-pulse stimulation. A. Same as Figure 5B. B. Similar simulation as panel A, but without the presence of CAN currents in the EC, CA3 and CA1 fields of the hippocampus. Removing CAN currents from the model abolishes self-sustained theta-nested gamma oscillations in response to a single stimulation pulse (for the parameters represented in Figure 5, point B).

      Furthermore, we realized that the terminology “bistable” may not be justified as we could not perform a systematic bifurcation analysis, which is typically carried out in simpler neural mass models (e.g., Onslow et al., 2014; Segneri et al., 2020). Therefore, we decided to rephrase the sentences about “bistability” to keep a more general terminology. The following sentences were revised:

      L.20-23: “We showed that, for theta inputs just below the threshold to induce self-sustained theta-nested gamma oscillations, a single stimulation pulse could switch the network behavior from non-oscillatory to a state producing sustained oscillations.”

      L.305-309: “Based on the above analyses, we considered two pathological states: one with a moderate theta input (i.e., moderately weak projections from the medial septum to the EC) that allowed the initiation of self-sustained oscillations by single stimulation pulses (Figure 5, point B), and one with a weaker theta input characterized by the complete absence of self-sustained oscillations even following transient stimulation (Figure 5, point A).”

      L.316-317: “In the case of a moderate theta input and in the presence of phase reset, delivering a pulse at either the peak or trough of theta could induce theta-nested gamma oscillations (Figure 6A and 6C).”

      L.353-357: “A very interesting finding concerns the behavior of the model in response to single-pulse stimulation for certain values of the theta amplitude (Figure 5). For low theta amplitudes, a single stimulation pulse was capable of switching the network behavior from a state with no spiking activity to one with prominent theta-nested gamma oscillations. Whether such an effect can be induced in vivo in the context of memory processes remains an open question.”

      2.3. Similarly for the various phase reset behaviors that are found.

      We would like to clarify that the observed phase reset curves (reported in Figure 3D) are a direct consequence of the choice of an appropriate phase response function for the Kuramoto oscillators representing the medial septum. This choice is inspired by experimentally measured phase response curves from CA3 neurons. These aspects are described briefly in the introduction and in more detail in the methods, as indicated below:

      L.101: “This new hybrid dynamical model could generate both theta-nested gamma oscillations and theta phase reset, following a particular phase response curve (PRC) inspired by experimental literature (Lengyel et al., 2005; Akam et al., 2012; Torben-Nielsen et al., 2010).”

      L.528-537: “Hereafter, we call the term 𝑍(𝜃) the phase response function, to distinguish it from the PRC obtained from experimental data or simulations (see section below "Data Analysis", "Phase Response Curve"). Briefly, the PRC of an oscillatory system indicates the phase delay or advancement that follows a single pulse, as a function of the phase at which this input is delivered. The phase response function 𝑍(𝜃) was chosen to mimic as well as possible experimental PRCs reported in the literature (Lengyel et al., 2005; Kwag and Paulsen, 2009; Akam et al., 2012). These PRCs appear biphasic and show a phase advancement (respectively delay) for stimuli delivered in the ascending (respectively descending) slope of theta. To accurately model this behavior, we used the following equation for the phase response function, where 𝜃𝑝𝑒𝑎𝑘 represents the phase at which the theta rhythm reaches its maximum and the parameter 𝜙𝑜𝑓𝑓𝑠𝑒𝑡 controls the desired phase offset from the peak:

      Author response image 8.

      On the figure below, we illustrate the phase response curves of CA3 neurons measured by Lengyel et al., 2005 (panel A.), and compare it with our simulated phase response curves (panel B.). Note that the conventions for phase advance and phase delay are reversed between the two panels.

      Finally, we would like to acknowledge that the model “is not derived from experimental phase response curves of septal neurons of which there is no direct measurement”, as mentioned by the reviewer in their comment 4 below. Despite the lack of experimental data specific to medial septum neurons, we argue that this phase response function is the only one that mathematically supports the generation of self-sustained theta-nested gamma oscillations in our current model. This statement is illustrated by Figure S7 (Figure 3 – Figure Supplement 3) and is mentioned in the results (L.249-261 of the revised manuscript):

      “We modeled this behavior by a specific term (which we called the phase response function) in the general equation of the Kuramoto oscillators (see methods, Equation 1). Importantly, introducing a phase offset in the phase response function disrupted theta-nested gamma oscillations (Figure S7), which suggests that the septohippocampal circuitry must be critically tuned to be able to generate such oscillations. The strength of phase reset could also be adjusted by a gain that was manually tuned. In the presence of the physiological phase response function and of a sufficiently high reset gain, a single stimulation pulse delivered to all excitatory and inhibitory CA1 neurons could reset the phase of theta to a value close to its peaks (Figure 3A). We computed the PRC of our simulated data for different stimulation amplitudes and validated that our neuronal network behaved according to the phase response function set in our Kuramoto oscillators (Figure 3D). It should be noted that including this phase reset mechanism affected the generated theta rhythm even in the absence of stimulation, extending the duration of the theta peak and thereby slowing down the frequency of the generated theta rhythm.”

      Author response image 9.

      Figure S7 (Figure 3 – Figure Supplement 3). Network behavior generated by Kuramoto oscillators with non-physiological phase response functions. Each panel is similar to Figure 3A, but with a different offset added to the phase response function of the Kuramoto oscillators (see methods, Equation 4). The center frequency was set to 6 Hz in all of these simulations. Overall, theta oscillations in these cases are less sinusoidal and show more abrupt phase changes than in the physiological case. A. A phase offset of −𝜋∕2 leads to an overall theta oscillation of 4 Hz, with a second peak following the main theta peak. B. A phase offset of +𝜋∕2 reduces the peak of theta, resetting the rhythm to the middle of the ascending phase. C. A phase offset of 𝜋 or −𝜋 leads to the CA1 output resetting the theta rhythm to the trough of theta.

      2.4. We may wonder whether a different hippocampal model of TNGO, of which there are many published (for example [1-6]) would show the same effect under neurostimulation. This seems very unlikely […]

      [1] Hyafil A, Giraud AL, Fontolan L, Gutkin B. Neural cross-frequency coupling: connecting architectures, mechanisms, and functions. Trends in neurosciences. 2015 Nov 1;38(11):725-40.

      [2] Tort AB, Rotstein HG, Dugladze T, Gloveli T, Kopell NJ. On the formation of gamma-coherent cell assemblies by oriens lacunosum-moleculare interneurons in the hippocampus. Proceedings of the National Academy of Sciences. 2007 Aug 14;104(33):13490-5.

      [3] Neymotin SA, Lazarewicz MT, Sherif M, Contreras D, Finkel LH, Lytton WW. Ketamine disrupts theta modulation of gamma in a computer model of hippocampus. Journal of Neuroscience. 2011 Aug 10;31(32):11733-43.

      [4] Ponzi A, Dura-Bernal S, Migliore M. Theta-gamma phase-amplitude coupling in a hippocampal CA1 microcircuit. PLOS Computational Biology. 2023 Mar 23;19(3):e1010942.

      [5] Bezaire MJ, Raikov I, Burk K, Vyas D, Soltesz I. Interneuronal mechanisms of hippocampal theta oscillations in a full-scale model of the rodent CA1 circuit. Elife. 2016 Dec 23;5:e18566.

      [6] Chatzikalymniou AP, Gumus M, Skinner FK. Linking minimal and detailed models of CA1 microcircuits reveals how theta rhythms emerge and their frequencies controlled. Hippocampus. 2021 Sep;31(9):982-1002.

      The highlighted publications, while very important in their findings regarding theta-gamma phase-amplitude coupling, focused on specific subfields of the hippocampus. In our work, we aimed to develop a model that includes the different anatomical divisions of the hippocampal formation, while still exhibiting theta-nested gamma oscillations, which is why we decided to expand the model by Aussel et al. (2018). Exploring the behavior of all these different hippocampal models under neurostimulation is beyond the scope of the current manuscript.

      Nevertheless, we have added a new figure (Figure 3 – Figure Supplement 4) showing an adaptation of our modeling approach to a generic neural mass model of theta-nested gamma oscillations (Onslow et al., 2014), which illustrates the generalizability of our findings and is described in detail in our response to comment 2.1. Moreover, we have further addressed the comments of the reviewers regarding bistability and phase response curves in our responses to comments 2.2 and 2.3.

      Furthermore, we have added references to all 6 of these publications in the revised version of the manuscript:

      L.43-50: “Moreover, the modulation of gamma oscillations by the phase of theta oscillations in hippocampal circuits, a phenomenon termed theta-gamma phase-amplitude coupling (PAC), correlates with the efficacy of memory encoding and retrieval (Jensen and Colgin, 2007; Tort et al., 2009; Canolty and Knight, 2010; Axmacher et al., 2010; Fell and Axmacher, 2011; Lisman and Jensen, 2013; Lega et al., 2016). Experimental and computational work on the coupling between oscillatory rhythms has indicated that it originates from different neural architectures and correlates with a range of behavioral and cognitive functions, enabling the long-range synchronization of cortical areas and facilitating multi-item encoding in the context of memory (Hyafil et al., 2015).”

      L.415-426: “In terms of neuronal cell types, we also made an important simplification by considering only basket cells as the main class of inhibitory interneuron in the whole hippocampal formation. However, it should be noted that many other types of interneurons exist in the hippocampus and have been modeled in various works with higher computational complexity (e.g., Bezaire et al., 2016; Chatzikalymniou et al., 2021). Among these various interneurons, oriens-lacunosum moleculare (OLM) neurons in the CA1 field have been shown to play a crucial role in synchronizing the activity of pyramidal neurons at gamma frequencies (Tort et al., 2007), and in generating theta-gamma PAC (e.g., Neymotin et al., 2011; Ponzi et al., 2023). Additionally, these cells may contribute to the formation of specific phase relationships within CA1 neuronal populations, through the integration between inputs from the medial septum, the EC, and CA3 (Mysin et al., 2019). Future work is needed to include more diverse cell types and detailed morphologies modeled through multiple compartments.”

      2.5. […] and indeed the quiescent state itself shown by this model seems quite artificial.

      We would like to clarify that the “quiescent state” mentioned by the reviewer is simply a state where the theta input is too low to induce theta-nested gamma oscillations. In this regime, neurons are active only due to the noise term in the membrane potential, which was adjusted based on Figure S3 (Figure 2 – Figure Supplement 2, shown below), at the minimal level needed to disrupt artificial synchronization in decoupled populations. For an input of 0 nA, we acknowledge that this network is indeed fully quiescent (i.e., does not show any spiking activity). However, as soon as the input increases, spontaneous spiking activity starts to appear with an average firing rate that depends on the input amplitude and is characterized by the input-frequency curves (panel A.). Please note that adding more noise could eliminate the observed quiescence in the absence of any input, but that it would not qualitatively affect the reported results.

      Author response image 10.

      Figure S3 (Figure 2 – Supplement 2). Cell-intrinsic spiking activity in decoupled excitatory and inhibitory populations under ramping input. A. Input-Frequency (I-F) curves for excitatory cells (left panel; pyramidal neurons with ICAN) and inhibitory cells (right panel; interneurons, fast-spiking) used in the model. Above a certain tonic input (around 0.35 nA for excitatory and 0.1 nA for inhibitory neurons), neurons can spike in the gamma range. B. Raster plot showing the spiking activity of excitatory (blue, NE = 1000) and inhibitory (red, NI = 100) neurons in decoupled populations under ramping input (top trace) and in the absence of noise in the membrane potential. Despite random initial conditions across neurons, oscillations emerge in both populations due to the intrinsic properties of the cells, with a frequency that is predicted by the respective IF curves (panel A.). C. Similar representation as panel B. but with the addition of stochastic noise in the membrane potential of each neuron. The presence of noise disrupts the emergence of oscillations in these decoupled populations.

      2.6. Some indication that particular ion channels, CAN and M are relevant is briefly provided and the work would be much improved by examining this aspect in more detail.

      We thank the reviewer for acknowledging the importance of these ion channels. We have now added a new supplementary figure (Figure 5 – Figure Supplement 4), which is described in more detail in our response to comment 2.2 and illustrates the role of the CAN current in the generation of theta-nested gamma oscillations following a single stimulation pulse. Moreover, we would like to stress that the impact of CAN currents on the ability of the hippocampus to generate theta-nested gamma oscillations intrinsically, i.e., in the absence of persistent external input, has already been investigated in detail by a previous computational study cited in our manuscript (Giovannini F, Knauer B, Yoshida M, Buhry L. The CAN-In network: A biologically inspired model for self-sustained theta oscillations and memory maintenance in the hippocampus. Hippocampus. 2017 Apr;27(4):450–463).

      2.7. In summary, the work would benefit from an intuitive analysis of the basic model ingredients underlying its neurostimulation response properties.

      We thank the reviewer for this suggestion. By addressing the reviewer’s previous comments (reviewer 2, comments 2.1 and 2.2), which overlap partly with the first reviewer (reviewer 1, comment 3), we believe we have improved the manuscript and have provided key information related to the way the model responds to neurostimulation.

      3..) Third, while the model is fairly realistic, considerable important factors are not included and in fact, there are much more detailed hippocampal models out there (for example [5,6]). In particular, it includes only excitatory cells and a single type of inhibitory cell. This is particularly important since there are many models and experimental studies where specific cell types, for example, OLM and VIP cells, are strongly implicated in TNGO.

      [5] Bezaire MJ, Raikov I, Burk K, Vyas D, Soltesz I. Interneuronal mechanisms of hippocampal theta oscillations in a full-scale model of the rodent CA1 circuit. Elife. 2016 Dec 23;5:e18566.

      [6] Chatzikalymniou AP, Gumus M, Skinner FK. Linking minimal and detailed models of CA1 microcircuits reveals how theta rhythms emerge and their frequencies controlled. Hippocampus. 2021 Sep;31(9):982-1002.

      We thank the reviewer for pointing out these interesting avenues for future studies. As indicated in previous responses (reviewer 1, comment 1; reviewer 2, comment 2.4), we have added several paragraphs to discuss these limitations, the rationale behind our simplifications, and potential improvements. In particular, we have added the following paragraphs to discuss our simplifications in terms of connectivity and cell types:

      Anatomical connectivity:

      L.141-150: “Biologically, GABAergic neurons from the medial septum project to the EC, CA3, and CA1 fields of the hippocampus (Toth et al., 1993; Hajós et al., 2004; Manseau et al., 2008; Hangya et al., 2009; Unal et al., 2015; Müller and Remy, 2018). Although the respective roles of these different projections are not fully understood, previous computational studies have suggested that the direct projection from the medial septum to CA1 is not essential for the production of theta in CA1 microcircuits (Mysin et al., 2019). Since our modeling of the medial septum is only used to generate a dynamic theta rhythm, we opted for a simplified representation where the medial septum projects only to the EC, which in turn drives the different subfields of the hippocampus. In our model, Kuramoto oscillators are therefore connected to the EC neurons and they receive projections from CA1 neurons (see methods for more details).”

      Cell types:

      L.415-426: “In terms of neuronal cell types, we also made an important simplification by considering only basket cells as the main class of inhibitory interneuron in the whole hippocampal formation. However, it should be noted that many other types of interneurons exist in the hippocampus and have been modeled in various works with higher computational complexity (e.g., Bezaire et al., 2016; Chatzikalymniou et al., 2021). Among these various interneurons, oriens-lacunosum moleculare (OLM) neurons in the CA1 field have been shown to play a crucial role in synchronizing the activity of pyramidal neurons at gamma frequencies (Tort et al., 2007), and in generating theta-gamma PAC (e.g., Neymotin et al., 2011; Ponzi et al., 2023). Additionally, these cells may contribute to the formation of specific phase relationships within CA1 neuronal populations, through the integration between inputs from the medial septum, the EC, and CA3 (Mysin et al., 2019). Future work is needed to include more diverse cell types and detailed morphologies modeled through multiple compartments.”

      3.2. Other missing ingredients one may think might have a strong impact on model response to neurostimulation (in particular stimulation trains) include the well-known short-term plasticity between different hippocampal cell types and active dendritic properties.

      We agree with the reviewer that plasticity mechanisms are important to include in future work, which we had already mentioned in the limitations section of the manuscript:

      L.436-443: “Importantly, we did not consider learning through synaptic plasticity, even though such mechanisms could drastically modify synaptic conduction for the whole network (Borges et al., 2017). Even more interestingly, the inclusion of spike-timing-dependent plasticity would enable the investigation of stimulation protocols aimed at promoting LTP, such as theta-burst stimulation (Larson et al., 2015). This aspect would be of utmost importance to make a link with memory encoding and retrieval processes (Axmacher et al., 2006; Tsanov et al., 2009; Jutras et al., 2013) and with neurostimulation studies for memory improvement (Titiz et al., 2017; Solomon et al., 2021).”

      1. Fourth, the MS model seems somewhat unsupported. It is modeled as a set of coupled oscillators that synchronize. However, there is also a phase reset mechanism included. This mechanism is important because it underlies several of the phase reset behaviors shown by the full model. However, it is not derived from experimental phase response curves of septal neurons of which there is no direct measurement. The work would benefit from the use of a more biologically validated MS model.

      We would like to confirm that the phase reset mechanism is indeed at the core of using Kuramoto oscillators to model a particular system. For more details about our choice of a phase response function and the obtained results in terms of phase response curves, we refer the reader to our response to comment 2.3.

      Generally speaking, we chose to use Kuramoto oscillators as it is the simplest model that can provide an oscillatory input to another system while including a phase reset mechanism. This set of oscillators was used to replace the fixed sinusoidal wave that represented theta inputs in previous models (Onslow et al., 2014; Aussel et al., 2018; Segneri et al., 2020). Kuramoto oscillators are a well-established model of synchronization in various fields of physics. They have also been used in neuroscience to model the phase reset of collective rhythms (Levnajić et al. 2010), and the effects of DBS on the basal ganglia network in Parkinson’s disease (Tass et al. 2003, Ebert et al. 2014, Weerasinghe et al. 2019).
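      As a minimal sketch of the kind of oscillator set described above (not the implementation used in the model), the following integrates mean-field-coupled Kuramoto oscillators and tracks their phase coherence. The function name, the parameter values (a 6 Hz centre frequency, the coupling strength, the Euler step), and the omission of the stimulation-driven phase reset term are all illustrative assumptions.

```python
import cmath
import math
import random


def simulate_kuramoto(n=50, coupling=20.0, f0=6.0, f_sd=0.5,
                      dt=1e-3, steps=5000, seed=0):
    """Euler-integrate n mean-field-coupled Kuramoto oscillators.

    d(theta_i)/dt = omega_i + K * r * sin(psi - theta_i),
    where r * exp(i * psi) is the mean of exp(i * theta_j) over all j.
    All default values are hypothetical and chosen for illustration only.
    Returns (coherence, phases): r at every step and the final phases.
    """
    rng = random.Random(seed)
    # Natural frequencies in rad/s, scattered around f0 (in Hz).
    omegas = [2 * math.pi * rng.gauss(f0, f_sd) for _ in range(n)]
    # Random initial phases, so any synchrony has to emerge from coupling.
    phases = [rng.uniform(0.0, 2 * math.pi) for _ in range(n)]
    coherence = []
    for _ in range(steps):
        # Complex order parameter: magnitude r, mean phase psi.
        z = sum(cmath.exp(1j * p) for p in phases) / n
        r, psi = abs(z), cmath.phase(z)
        coherence.append(r)
        phases = [
            (p + dt * (w + coupling * r * math.sin(psi - p))) % (2 * math.pi)
            for p, w in zip(phases, omegas)
        ]
    return coherence, phases


if __name__ == "__main__":
    coupled, _ = simulate_kuramoto(coupling=20.0)
    uncoupled, _ = simulate_kuramoto(coupling=0.0)
    print("final coherence, coupled:  ", round(coupled[-1], 3))
    print("final coherence, uncoupled:", round(uncoupled[-1], 3))
```

      In this sketch the mean phase psi plays the role of the collective rhythm that would be passed on as an oscillatory input, and the coherence r indicates how well the set is synchronized. With the coupling set to zero the oscillators simply drift at their own frequencies.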

      More detailed models of the medial septum exist in the literature (e.g., Wang et al. 2002, Hajós et al. 2004) and model the GABAergic effects of the septal projections onto the hippocampal formation. However, it is not trivial to infer the connectivity parameters and the degree of innervation between the hippocampus and the medial septum. Furthermore, the claims made in our study do not necessarily depend on the nature of the projections between the two areas. Therefore, we decided to represent the medial septum in a conceptual way and focus mostly on the effects of these projections rather than replicating them in detail.

      Aussel, Amélie, Laure Buhry, Louise Tyvaert, and Radu Ranta. “A Detailed Anatomical and Mathematical Model of the Hippocampal Formation for the Generation of Sharp-Wave Ripples and Theta-Nested Gamma Oscillations.” Journal of Computational Neuroscience 45, no. 3 (December 2018): 207–21. https://doi.org/10.1007/s10827-018-0704-x.

      Ebert, Martin, Christian Hauptmann, and Peter A. Tass. “Coordinated Reset Stimulation in a Large-Scale Model of the STN-GPe Circuit.” Frontiers in Computational Neuroscience 8 (2014): 154. https://doi.org/10.3389/fncom.2014.00154.

      Hajós, M., W.E. Hoffmann, G. Orbán, T. Kiss, and P. Érdi. “Modulation of Septo-Hippocampal θ Activity by GABAA Receptors: An Experimental and Computational Approach.” Neuroscience 126, no. 3 (January 2004): 599–610. https://doi.org/10.1016/j.neuroscience.2004.03.043.

      Levnajić, Zoran, and Arkady Pikovsky. “Phase Resetting of Collective Rhythm in Ensembles of Oscillators.” Physical Review E 82, no. 5 (November 3, 2010): 056202. https://doi.org/10.1103/PhysRevE.82.056202.

      Onslow, Angela C. E., Matthew W. Jones, and Rafal Bogacz. “A Canonical Circuit for Generating Phase-Amplitude Coupling.” Edited by Adriano B. L. Tort. PLoS ONE 9, no. 8 (August 19, 2014): e102591. https://doi.org/10.1371/journal.pone.0102591.

      Segneri, Marco, Hongjie Bi, Simona Olmi, and Alessandro Torcini. “Theta-Nested Gamma Oscillations in Next Generation Neural Mass Models.” Frontiers in Computational Neuroscience 14 (2020). https://doi.org/10.3389/fncom.2020.00047.

      Tass, Peter A. “A Model of Desynchronizing Deep Brain Stimulation with a Demand-Controlled Coordinated Reset of Neural Subpopulations.” Biological Cybernetics 89, no. 2 (August 1, 2003): 81–88. https://doi.org/10.1007/s00422-003-0425-7.

      Wang, Xiao-Jing. “Pacemaker Neurons for the Theta Rhythm and Their Synchronization in the Septohippocampal Reciprocal Loop.” Journal of Neurophysiology 87, no. 2 (February 1, 2002): 889–900. https://doi.org/10.1152/jn.00135.2001.

      Weerasinghe, Gihan, Benoit Duchet, Hayriye Cagnan, Peter Brown, Christian Bick, and Rafal Bogacz. “Predicting the Effects of Deep Brain Stimulation Using a Reduced Coupled Oscillator Model.” PLoS Computational Biology 15, no. 8 (August 8, 2019): e1006575. https://doi.org/10.1371/journal.pcbi.1006575.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      Weaknesses:

      (1) The authors themselves propose in their Introduction that the "ECM-associated changes are increasingly perceived as causative, rather than consequential"; however, they have not conducted mechanistic (gain of function/loss of function) studies either in vitro or in vivo from any of their identified targets to truly prove causality. This remains one of the limitations of this study. Thus, future studies should investigate this point in detail. For instance, it would have been intriguing to dissect if knocking out specific genes involved in one specific model or genes common to both would yield distinct phenotypic outcomes.

      We agree with the reviewer that our study does not provide mechanistic verification of the function of the identified targets with a suggested role in the development and/or resolution of fibrosis. The current study was primarily conducted to identify these possible targets, with a focus on differences in the extracellular matrix deposited in two selected models of liver fibrosis with different modes of action. Conducting further studies using knock-out/in models to verify the causality of the proposed targets was, at this point, well beyond our intention. However, we are fully aware of the potential of the identified molecules, and further studies to dissect their roles in liver diseases are part of our future plans.

      (2) The majority of the conclusions are derived primarily from the proteomic analyses. Although well conducted, it would strengthen the study to corroborate some of the major findings by other means such as IHC/IF with the corresponding quantifications and not only representative images.

      In accordance with the Reviewer’s suggestions, we have now provided additional IF images and their quantifications for our major MS findings to strengthen the significance of the MS data (see detailed answer below).

      Reviewer #2:

      Weaknesses:

      (1) As it currently stands, the data, whilst extensive, is primarily focussed on the proteomic data which is fairly descriptive and I am not clear on the additional insight gained in their approach that is not already detailed from the extensive transcriptomic studies. The manuscript overall would benefit from some mechanistic functional insight to provide new additional modes of action relevant to fibrosis progression.  

      We agree with the reviewer that our study could initially appear descriptive. However, this characteristic is inherent to most omics studies, which tend to provide hypothesis-free testing of a large number of analytes in order to find a multitude of candidate biomarkers(1). Importantly, we believe our study provides insights that go beyond the scope of previously published transcriptomic analyses.

      Specifically, our work focuses on compartment-specific changes in the liver proteome, with an emphasis on the extracellular matrix (ECM) composition and alterations in protein solubility—features that cannot be captured by transcriptomic studies. The matrisome is more than a structural scaffold; it functions as a reservoir for secreted factors, including growth factors and cytokines, which modulate the local cellular microenvironment. Transition dynamics between the insoluble matrisome and soluble protein pools influence the signaling capabilities and bioavailability of these factors. Moreover, fibrous ECM assemblies directly impact tissue mechanics, providing cells embedded within the matrix with spatially distinct biochemical and biomechanical contexts. The current understanding of matrisome composition in the context of specific liver disease etiologies is limited. Dr. Friedman, in his 2022 review on hepatic fibrosis, highlights the unmet need to elucidate etiology-specific protein signatures of the cirrhotic liver matrisome, which could serve as disease staging or prognostic biomarkers(2). Our study addresses this gap by characterizing the distinct matrisome profiles associated with hepatotoxic- versus cholestasis-driven liver injury. We believe our findings lay the groundwork for identifying etiology-specific biomarkers and potential therapeutic targets for antifibrotic interventions, offering a novel layer of insight beyond what transcriptomic data alone can provide.

      (2) Whilst there is some human data presented it is a minimal analysis without quantification that would imply relevance to disease state. Although studying disease progression in animals is a fundamental aspect of understanding the full physiological response of fibrotic disease, without more human insight makes any analysis difficult to fulfil their suggestion that these targets identified will be of use to treat human disease.

      We thank the reviewer for this comment. Our study primarily focuses on utilizing animal models to explore the fundamental physiological processes underlying the development and resolution of fibrotic liver disease. To address the translational relevance of our findings, we concentrated on clusterin, one of the key target proteins identified during our analysis of the insoluble proteome. Specifically, we investigated its localization in human liver samples, focusing on its association with collagen deposits (Figure 6F). To this end, we analyzed human liver samples of diverse etiologies and varying degrees of fibrotic damage, including samples representing four distinct stages of HCV-induced fibrosis (Figure 6F, lower panel). While this analysis highlights the presence and localization of clusterin in fibrotic deposits, we acknowledge that our study does not include extensive quantification or mechanistic insight into clusterin's role in human liver fibrosis. We believe that the data presented in this manuscript provide a valuable foundation for future investigations into clusterin’s involvement in liver fibrosis across different etiologies. Recognizing the translational importance of this work, we have already initiated a prospective study involving human patients, which aims to conduct a more comprehensive analysis of clusterin's function and its potential as a therapeutic target.

      To further support our findings on clusterin's role in fibrosis development and resolution and to address the reviewer's concern, we quantified clusterin deposits in the available human samples representing four distinct stages of HCV-induced fibrotic disease. Using immunofluorescence (IF) images at a 20x field of view, we measured both clusterin and collagen deposits to illustrate changes in clusterin abundance during fibrosis progression (stages F1–F4) in relation to collagen deposition dynamics. The quantified data have been included for the reviewer's consideration (Figure 1). However, it is important to emphasize that this quantification was conducted on a single human sample per fibrotic stage, which limits the statistical robustness of the analysis. A more comprehensive evaluation involving additional patient samples would be necessary for a more definitive conclusion. For this reason, we propose to include these results solely in our rebuttal letter and to incorporate a more extensive analysis in our intended follow-up study, where larger cohorts will allow for a thorough investigation of clusterin's role in human liver fibrosis.

      Author response image 1.

      Dynamics of clusterin abundance with the development of HCV-induced fibrotic disease in comparison to the changes in collagen deposits. IF images of human liver sections from different stages of chronic HCV infection were immunolabeled for clusterin and collagen 1. Clusterin- and collagen-positive (+) areas (as %) from three to eight fields of view (20x objective) were evaluated for each fibrosis stage (F1-F4).
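      The positive-area percentages mentioned in this legend can be illustrated with a short sketch. It assumes a simple fixed intensity threshold on a single-channel image, which is only one possible way to define a "positive" pixel; the threshold value, function names, and toy pixel data are hypothetical and do not describe the procedure actually used for the figure.

```python
def positive_area_percent(image, threshold):
    """Percentage of pixels whose intensity exceeds `threshold`.

    `image` is a list of rows of pixel intensities (one channel, one field
    of view). The thresholding rule is an illustrative assumption.
    """
    total = 0
    positive = 0
    for row in image:
        for value in row:
            total += 1
            if value > threshold:
                positive += 1
    if total == 0:
        raise ValueError("image contains no pixels")
    return 100.0 * positive / total


def summarize_fields(fields, threshold):
    """Mean positive-area percentage over several fields of view."""
    percents = [positive_area_percent(field, threshold) for field in fields]
    return sum(percents) / len(percents), percents


if __name__ == "__main__":
    # Three tiny made-up "fields of view" standing in for one fibrosis stage.
    fields = [
        [[0, 10], [20, 30]],
        [[5, 5], [5, 40]],
        [[50, 60], [70, 0]],
    ]
    mean_percent, per_field = summarize_fields(fields, threshold=15)
    print(per_field, mean_percent)
```

      Reporting the per-field values alongside the mean makes the spread between fields of view visible, which matters when only a few fields from a single sample are available per stage.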

      (3) Some of the terminology is incorrect while discussing these models of injury used and care should be taken. For example - both models are toxin-induced and I do not think these data have any support that the DDC model has a higher carcinogenic risk. An investigation into the tumour-induced risk would require significant additional models. These types of statements are incorrect and not supported by this study.

      We are grateful to the reviewer for drawing our attention to the incorrect use of the term "toxin-induced". In two instances, where the wording was incorrect, we have corrected the term to hepatotoxin-induced as it was originally intended. While we believe that our proteomic signature data and identified signaling pathways suggest a potential carcinogenic risk associated with the cholestatic, but not the hepatotoxic model, we have toned down the statements on this issue in the article to respect the reviewer's perspective. These changes, which are highlighted in the track changes mode of the article, aim to make the conclusions of the study more precise and thus improve the clarity of our conclusions.

      Reviewer #1 (Recommendations for the authors): 

      (1) In the Discussion, the authors could consider pointing out that one limitation of the study is a lack of mechanistic (gain of function/loss of function) studies either in vitro or in vivo from any of their identified targets to truly prove causality. 

      As noted earlier, we fully agree with both reviewers that a limitation of this study is its descriptive nature, which is an inherent characteristic of omics-based research. In our manuscript, we aimed to "determine compartment-specific proteomic landscapes of liver fibrosis and delineate etiology-specific ECM components," with the overarching goal of providing a foundation for future antifibrotic therapies.

      The insights gained from our study will indeed serve as a critical basis for subsequent research, where we will prioritize mechanistic investigations to elucidate the roles of the identified targets. While we acknowledge the importance of gain- or loss-of-function studies to establish causality, we believe this falls outside the primary scope of the current manuscript. Instead, we envision these mechanistic approaches as key elements of our future research efforts. For this reason, we feel it is not necessary to further expand on this limitation in the current discussion.

      (2) The majority of the conclusions are derived primarily from the proteomic analyses. Although well conducted, it would strengthen the study to corroborate some of the major findings by other means such as IHC/IF with the corresponding quantifications and not only representative images. For example, the IF stainings for ECM1 should also be quantified - ECM1. 

      To strengthen our MS findings on ECM1 expression and to address the reviewer's concern, we have now included quantification of ECM1 using IF staining at selected time points in Figure S7E and we refer to these data in the Results section (p. 12 of the current manuscript). The IF quantification data correspond well to the MS data showing increase in ECM1 expression with fibrosis development and decline with partial fibrosis resolution.

      (3) S1 - it would be important to show Sirius Red images over the time course, especially for CCl4 T4 where fibrosis resolution is occurring. Proteomics data also show this group clusters more closely with control mice and seeing a representative image would add further credibility to this point. 

      The requested Sirius Red images are now part of Figure S1B, documenting partial fibrosis resolution and overall parenchyma healing at T4 in both models.

      (4) How comparable are the periods of the two models? 2 weeks in one model may not be the same as 2 weeks in the other depending on the severity of the pathogenesis. 

      We appreciate the reviewer’s comment regarding the comparability of time points between the two models. Indeed, the temporal dynamics of fibrosis development differ between the models employed in our study, and we have carefully considered this aspect to ensure the validity of our comparative analysis. To address this, we started our comparisons at a stage corresponding to the onset of fibrosis in each model. Specifically, quantification of Sirius Red-positive areas, indicative of collagen deposition (Figure S1B), revealed that 2 weeks of DDC treatment produced a comparable extent of fibrosis to that observed after 3 weeks of CCl₄ treatment. This point was designated as the initial fibrosis time point (T1, Figure S1B), from which further treatment was applied to induce more advanced fibrosis. This approach allowed us to standardize the comparison of fibrosis progression between the two models.

      (5) Figure 4A-D - cell-type-specific signatures should be corroborated by actual IHC or IF stainings if possible. HNF4a (hepatocytes), CK19 (cholangiocytes), aSMA (activated fibrogenic HSCs), immune cells (B220, F4/80, Cd11b, CD11c etc).

      We thank the reviewer for this valuable suggestion. To strengthen our analysis, we have now complemented the box plots of cell type-specific signatures derived from the MS data (Figure 4A-D) with immunofluorescence (IF) staining, which has been included in the Supplemental Data (Figure S6). Specifically, we provide representative IF images from control and T1-T4 time points for each model, documenting the changes in abundance with treatment in:

      A) Hepatocytes (HNF4α), activated hepatic stellate cells (αSMA), and cholangiocytes (CK19).

      B) Immune cell populations, including B cells (B220) and macrophages/monocytes/Kupffer cells (F4/80), as these immune cell groups were not only identified in our MS analysis but also have established roles in the selected models(3, 4, 5). 

      The representative images shown in Figure S6 show the dynamics of the cellular populations in each of the models, which correspond well with the MS data (compare Figures 4A-D and S5). These additional data further validate our findings and enhance the robustness of our conclusions.

      References:

      (1) Thiele M, Villesen IF, Niu L, et al. Opportunities and barriers in omics-based biomarker discovery for steatotic liver diseases. J Hepatol 2024;81:345-359.

      (2) Friedman SL, Pinzani M. Hepatic fibrosis 2022: Unmet needs and a blueprint for the future. Hepatology 2022;75:473-488.

      (3) Best J, Verhulst S, Syn WK, et al. Macrophage Depletion Attenuates Extracellular Matrix Deposition and Ductular Reaction in a Mouse Model of Chronic Cholangiopathies. PLoS One 2016;11:e0162286.

      (4) Aoyama T, Inokuchi S, Brenner DA, et al. CX3CL1-CX3CR1 interaction prevents carbon tetrachloride-induced liver inflammation and fibrosis in mice. Hepatology 2010;52:1390-400.

      (5) Yang W, Chen L, Zhang J, et al. In-Depth Proteomic Analysis Reveals Phenotypic Diversity of Macrophages in Liver Fibrosis. J Proteome Res 2024;23:5166-5176.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The current manuscript focuses on the adenine phosphoribosyltransferase (Aprt) and how the lack of its function affects nervous system function. It puts it into the context of Lesch-Nyhan disease, a rare hereditary disease linked to hypoxanthine-guanine phosphoribosyltransferase (HGPRT). Since HGPRT appears absent in Drosophila, the study focuses initially on Aprt and shows that aprt mutants have a decreased life-span and altered uric acid levels (the latter can be attenuated by allopurinol treatment). Moreover, aprt mutants show defects in locomotor reactivity behaviors. A comparable phenotype can be observed when specifically knocking down aprt in dopaminergic cells. Interestingly, also glia-specific knock-down caused a similar behavioral defect, which could not be restored when re-expressing UAS-aprt, while neuronal re-expression did restore the mutant phenotype. Moreover, mutants, pan-neuronal and pan-neuronal plus glia RNAi for aprt caused sleep-defects. Based on immunostainings, dopamine levels are increased; UPLC shows that adenosine levels are reduced, and PCR showed that Ent2 levels are increased (but not AdoR). Moreover, aprt mutants display seizure-like behaviors, which can be partly restored by purine feeding (adenosine and N6-methyladenosine). Finally, expression of the human HGPRT also causes locomotor defects.

      The authors provide a wide range of genetic experimental data to assess behavior and some molecular assessment on how the defects may emerge. It is clearly written, and the arguments follow the experimental evidence that is provided. The findings provide a new example of how manipulating specific genes in the fruit fly allows the study of fundamental molecular processes that are linked to a human disease.

      We thank the reviewer for his clear understanding and positive assessment of our work.

      Reviewer #2 (Public Review):

      The manuscript by Petitgas et al demonstrates that loss of function for the only enzyme responsible for the purine salvage pathway in fruit-flies reproduces the metabolic and neurologic phenotypes of human patients with Lesch-Nyhan disease (LND). LND is caused by mutations in the enzyme HGPRT, but this enzyme does not exist in fruit-flies, which instead only have Aprt for purine recycling. They demonstrate that mutants lacking the Aprt enzyme accumulate uric acid, which like in humans can be rescued by feeding flies allopurinol, and have decreased longevity, locomotion and sleep impairments and seizures, with striking resemblance to HGPRT loss of function in humans. They demonstrate that both loss of function throughout development or specifically in the adult ubiquitously or in all neurons, or dopaminergic neurons, mushroom body neurons or glia, can reproduce the phenotypes (although knock-down in glia does not affect sleep). They show that the phenotypes can be rescued by over-expressing a wild-type form of the Aprt gene in neurons. They identify a decrease in adenosine levels as the cause underlying these phenotypes, as adenosine is a neurotransmitter functioning via the purinergic adenosine receptor in neurons. In fact, feeding flies throughout development and in the adult with either adenosine or m6A could prevent seizures. They also demonstrate that loss of adenosine caused a secondary up-regulation of ENT nucleoside transporters and of dopamine levels, that could explain the phenotypes of decreased sleep and hyperactivity at night. Finally, they provide the remarkable finding that over-expression of the human mutant HGPRT gene but not its wild-type form in neurons impaired locomotion and induced seizures. This means that the human mutant enzyme does not simply lack enzymatic activity, but it is toxic to neurons in some gain-of-function form.
      Altogether, these are very important and fundamental findings that convincingly demonstrate the establishment of a Drosophila model for the scientific community to investigate LND, to carry out drug testing screens and find cures.

      We thank the reviewer for his clear understanding and positive assessment of our work.

      The experiments are conducted with great rigour, using appropriate and exhaustive controls, and on the whole the evidence does convincingly or compellingly support the claims. The exception is an instance when authors mention 'data not shown' and here data should either be provided, or claims removed: "feeding flies with adenosine or m6A did not rescue the SING phenotype of Aprt mutants (data not shown)". It is important to show these data (see below).

      As recommended by the reviewer, these results are now shown in the new Figure S15.

      Sleep is used to refer to lack of movement of flies to cross a beam for more than 5 minutes. However, lack of movement does not necessarily mean the flies are asleep, as they could be un-motivated to move (which could reflect abnormal dopamine levels) or engaged in incessant grooming instead. These differences are important for future investigation into the neural circuits affected by LND.

      We agree that the method we used could overestimate sleep duration because flies that do not move are not necessarily asleep, as is the case with brain-dopamine deficient flies (Riemensperger et al., PNAS 2011). To address this issue, we have recorded video data showing that after 5 min of inactivity, wild-type and Aprt5 mutant flies are less sensitive to stimulation, indicating that they were indeed asleep. This is now shown in the new Figure S10 and mentioned on page 17, lines 338-339 in the main text. In addition, in this work we report that Aprt mutant flies have a nocturnal insomnia phenotype. Sleep overestimation is not, therefore, an issue that could challenge these results.

      The authors claim that based on BLAST genome searches, there are no HPRT1 (encoding HGPRT) homologues in Drosophila. However, such a claim would require instead structure-based searches that take into account structural conservation despite high sequence divergence, as this may not be detected by regular BLAST.

      To reinforce our conclusions about the lack of a homologue of the human HPRT1 gene in Drosophila, we have now added a Results section about the evolution of HGPRT proteins on pages 6-7, lines 122-150, and two phylogenetic analyses as new Figures S2 and S3 with more details in legends. We have also carried out structural similarity searches against the RCSB PDB repository. The structural analysis did not identify any relevant similarity with HGPRT 3D structures in Insecta (mentioned on lines 146-150). We hope these new analyses address the Reviewer's concerns. Furthermore, as shown in Table S2, no enzymatic HGPRT activity could be detected in extracts of wild-type Drosophila. A protein that would be structurally similar to human HGPRT but with a divergent sequence could not be involved in purine recycling without expressing HGPRT-like activity. In contrast, enzymatic Aprt activity could be easily detected in this organism (Figure S4 and Table S1).

      This work raises important questions that still need resolving. For example, the link between uric acid accumulation, reduced adenosine levels, increased dopamine and behavioural neurologic consequences remains unresolved. It is important that they show that restoring uric acid levels does not rescue locomotion nor seizure phenotypes, as this means that this is not the cause of the neurologic phenotypes.

      We agree with the reviewer about the potential importance of our results and the need to resolve the exact origin of the neurological phenotypes. This would need to be addressed in further studies in our opinion. The fact that allopurinol treatment did not improve the locomotor ability of Aprt5 mutant flies is now shown in Figure 1D, E to emphasize this result. Results showing that allopurinol does not rescue the bang-sensitivity phenotype of Aprt-deficient mutants are shown in Figure S14.

      Instead, their data indicate adenosine deficiency is the cause. However, one weakness is that for the manipulations they test some behaviours but not all. The authors could attempt to improve the link between mechanism and behaviour by testing whether over-expression of Aprt in neurons or glia, throughout development or in the adult, and feeding with adenosine and m6A can rescue each of the behavioural phenotypes handled: lifespan, SING, sleep and seizures. The authors could also attempt to knock-down dopamine levels concomitantly with feeding with adenosine or m6A to see if this rescues the phenotypes of SING and sleep.

      The reviewer is right. However, carrying out all these experiments properly with enough repeats will require about two more years of work. Because of that, they could not be included in the revision of the present article. Here we show that Aprt overexpression in neurons, but not in glia, rescues the SING phenotype of Aprt5 mutants (Figure 2B and 2E). We have also added in the revised article the new result that Aprt overexpression reduces transcript levels of DTH1, which codes for the neural form of the dopamine-synthesizing enzyme tyrosine hydroxylase (new Figure 5F).

      Visualising the neural circuits that express the adenosine receptor could reveal why the deficit in adenosine can affect distinct behaviours differentially, and which neurologic phenotypes are primary and which secondary consequences of the mutations. This would allow them to carry out epistasis analysis by knocking-down AdoR in specific circuits, whilst at the same time feeding Aprt mutants with Adenosine.

      Deciphering the specific circuits involved in the various effects of adenosine would indeed be extremely interesting. Unfortunately, very little is currently known about the neural circuits that express AdoR in flies. No antibody is available to detect this receptor in situ, and, to our knowledge, a mutated AdoR gene coding for a tagged form of the receptor has not been engineered yet.

      The revelation that the mutant form of human HGPRT has toxic effects is very intriguing and important and it invites the community to investigate this further into the future.

      To conclude, this is a fundamental piece of work that opens the opportunity for the broader scientific community to use Drosophila to investigate LND.

      We sincerely thank the reviewer for his thoughtful and positive comments on our work.

      Reviewer #3 (Public Review):

      The study attempts to develop a Drosophila model for the human disease of LND. The issue here, and the main weakness of this study, is that Drosophila does not express the enzyme, HGPRT, which when mutated causes LND. The authors, instead, mutate the functionally-related Drosophila Aprt enzyme. However, it is unknown whether Aprt is also a structural homologue. Because of this, it will likely not be possible to identify pharmacological compounds that rescue HGPRT activity via a direct interaction (unless modelling predicts high conservation of substrate binding pocket between the two enzymes, etc).

      As stated in our Provisional Responses prior to revision of the Reviewed Preprint, the enzymes APRT and HGPRT are actually known to be functionally and structurally related. We apologize for not providing this information in the original submission. This point is now made clearer in the revised article on page 39, lines 785-792. Indeed, both human APRT and HGPRT belong to the type I PRTases family identified by a conserved phosphoribosyl pyrophosphate (PRPP) binding motif, which is used as a substrate to transfer phosphoribosyl to purines. This binding motif is only found in PRTases from the nucleotide synthesis and salvage pathways (see: Sinha and Smith (2001) Curr Opin Struct Biol 11(6):733-9, doi: 10.1016/s0959-440x(01)00274-3). The purine substrates adenine, hypoxanthine and guanine share the same chemical skeleton and APRT can bind hypoxanthine, indicating that APRT and HGPRT also share similarities in their substrate binding sites (Ozeir et al. (2019) J Biol Chem. 294(32):11980-11991, doi: 10.1074/jbc.RA119.009087). Moreover, Drosophila Aprt and Human APRT are closely related as the amino acid sequences of APRT proteins have been highly conserved throughout evolution (see Figure S5B in our paper).

      An additional weakness is that the study does not identify a molecule that may act as a lead compound for further development for treating LND. Rather, the various rescues reported are selective for only a subset of the disease-associated phenotypes. Thus, whilst informative, this first section of the study does not meet the study ambitions.

      In this study, we identify adenosine and N6-methyladenosine as rescuers of the epileptic behavior in Aprt mutant flies (shown in Figure 7E, F). Interestingly, the same molecules have been found to rescue the viability of fibroblasts and neural stem cells derived from iPSCs of LND patients, in which de novo purine synthesis was prevented (discussed on page 38, lines 747-753). This suggests that the Drosophila model reported here could help to identify new genetic targets and pharmacological compounds capable of rescuing HGPRT mutations in humans.

      The second approach adopted is to express a 'humanised mutated' form of HGPRT in Drosophila, which holds more promise for the development of a pharmacological screen. In particular, the locomotor defect is recapitulated but the seizure-like activity, whilst reported as being recapitulated, is debatable. A recovery time of 2.3 seconds is very much less than timings for typical seizure mutants. Nevertheless, the SING behaviour could be sufficient to screen against. However, this is not explored.

      We agree with the reviewer that it would be very interesting to do a pharmacological screen in this second LND model. However, we did not have the possibility to carry out such a screen yet.

      In summary, this is a largely descriptive study reporting the behavioural effects of an Aprt loss-of-function mutation. RNAi KD and rescue expression studies suggest that a mix of neuronal (particularly dopaminergic and possibly adenosinergic signalling pathways) and glia are involved in the behavioural phenotypes affecting locomotion, sleep and seizure. There is insufficient evidence to have confidence that the Aprt fly model will prove valuable for understanding / treating LND.

      Here we report many common phenotypes between the Aprt fly model and the symptoms of LND patients (reduced longevity, locomotor problems, sleep defects, overproduction of uric acid that is rescued by allopurinol treatment…). Moreover, APRT and HGPRT enzymes are both functional and structural homologues, as explained in our answers. We also found that the same drugs can rescue the seizure-like phenotype in Aprt-deficient flies and the viability of LND fibroblasts and neural stem cells, derived from iPSCs of LND patients, in which de novo purine synthesis is prevented (Figure 7E, F). In many respects, our results therefore suggest that Aprt mutant flies could be useful to better understand LND, and potentially to screen for new therapeutic compounds.

      From the Reviewing Editor:

      (1) How are the pathways of purine catabolism different between flies and mammals? How does the absence of HGPRT and presence of only APRT affect purine catabolism? When did HGPRT appear in evolution?

      Purine catabolism is quite similar in flies and mammals, except for the lack of urate oxidase in primates, as described in Figure S1. We added text about the purine anabolism/catabolism pathways in the revised article on lines 123-126 (see below our detailed response to Reviewer 1’s Recommendations). HGPRT is present in Bacteria, Archaea and Eukaryota, and nearly all animal phyla. However, BLAST search indicates that HGPRT homologues cannot be found in most insect species, such as Drosophila. To reinforce our conclusions about the lack of a homologue of the human HPRT1 gene in Drosophila melanogaster, we have now added a Results section about the evolution of HGPRT proteins on pages 6-7, lines 122-150, and two phylogenetic analyses as new Figures S2 and S3 with details in legends.

      In addition to BLAST a structural based modelling method should be used to establish the loss of HGPRT in Drosophila.

      In agreement with the phylogenetic analyses, we have confirmed that no HGPRT enzymatic activity can be detected in wild-type Drosophila extract (Table S2). To complete these observations, as recommended by reviewer #2, we have carried out 3D structure-based searches in the RCSB Protein Data Bank. This enabled us to compare human HGPRT with all currently available protein structures. We found no Drosophila protein with a divergent sequence showing relevant structural similarity to human HGPRT. In contrast, this search identified proteins similar to human HGPRT in many other species of Eukaryota, Archaea and Bacteria. This is now mentioned on page 7, lines 146-150 in the revised article.

      (2) Of the three biochemical changes reported the change in dopamine levels should be validated by other methods given the unreliable nature of IHC.

      As recommended by Reviewer #1, we have added the results of new experiments carried out by RT-qPCR and Western blotting, which confirm the effect of Aprt mutation on brain dopamine levels. In addition, we added the consistent result that Aprt overexpression reduces transcript levels of DTH1. The results are shown in the new panels E to H of Figure 5 and mentioned in the text on page 20, lines 385-389.

      (3) As suggested by reviewer 2 it would be helpful to clearly identify which of the three biochemical changes (DA, uric acid, adenosine) are responsible for the numerous behaviours tested. This is important because it is relevant for developing any therapeutic strategy arising from this study.

      We agree that it would be very interesting to decipher the relationship between the different behaviors observed in mutant flies and the biochemical changes (dopamine, uric acid or adenosine). However, this would require a large number of new experiments and would probably double the size of our paper, which already includes many original data. In our opinion, such a detailed study should logically be the purpose of another article.

      (4) There is concern regarding the robustness of the seizure data. Reviewer 3 has suggestions on how to address this.

      See our answers to Reviewer 3’s recommendations below.

      (5) Editorial corrections and changes suggested by reviewers 2 and 3 need to be addressed.

      As indicated in our answers, we have taken into account and when possible addressed the corrections and changes suggested by the reviewers.

      (6) It is recommended that the authors tone down the relevance of this model for LND, particularly in the abstract. The focus should be on stating what is actually delivered.

      As recommended by the reviewing editor, and to take into account the reserved comments of reviewer #3, we have toned down our affirmation that our new fly models are relevant for LND in the last sentences of the Abstract and Discussion, and also added a question mark in the subtitle of the Discussion on line 777. As mentioned in our provisional responses to the Public Reviews, we would like to emphasize, however, that reviewers #1 and #2 expressed more confidence than reviewer #3 in the potential usefulness of our work. Reviewer #1 indeed stated that: “The findings provide a new example of how manipulating specific genes in the fruit fly allows the study of fundamental molecular processes that are linked to a human disease”, and reviewer #2 further wrote: “Altogether, these are very important and fundamental findings that convincingly demonstrate the establishment of a Drosophila model for the scientific community to investigate LND, to carry out drug testing screens and find cures”, and added: “To conclude, this is a fundamental piece of work that opens the opportunity for the broader scientific community to use Drosophila to investigate LND”.

      Reviewer #1 (Recommendations For The Authors):

      • An important prerequisite for the current study is that there appears to be no HGPRT "activity" in Drosophila. It is initially stated that there was previously no "HGPRT activity observed" in two papers from the 1970s. It would be important to corroborate this notion and provide some background on the purine anabolism/catabolism pathways. How shared or divergent are these pathways between Drosophila and mammals?

      In agreement with the pioneering studies of Becker (1974a, b), we have confirmed in this work that no HGPRT enzymatic activity can be detected in wild-type Drosophila extracts, as mentioned in Results on page 6, lines 127-130 and reported in Table S2. Purine catabolism is quite similar in flies and mammals, except for the lack of urate oxidase in primates, as shown in Figure S1. All the enzymes involved in purine anabolism/catabolism or recycling in humans are conserved in Drosophila, with the notable exception of HPRT1.

      If there is no HGPRT gene, but only the APRT ortholog, what would this mean for the metabolites?

      Our enzymatic assays on Drosophila extracts indicated that hypoxanthine and guanine cannot be recycled into IMP and GMP, respectively, contrary to adenine which can be converted into AMP in flies. In the absence of HGPRT activity, GMP and IMP could be produced by de novo purine synthesis, or, alternatively, synthesized from AMP, which can be converted into IMP by the enzyme AMPD, and then IMP can be converted into GMP by the enzymes IMPDH and GMPS. These metabolic pathways are depicted in Figure S1A.
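
      The salvage routes described above can be sketched as a small directed graph and searched for reachable metabolites. This is only an illustrative simplification, not part of the reported analysis: de novo synthesis is left out, XMP is included as the standard intermediate between the IMPDH and GMPS steps, and the names `REACTIONS`, `FLY_ENZYMES` and `reachable` are hypothetical.

```python
from collections import deque

# (substrate, product, enzyme) for the salvage and interconversion steps
# named in the text. XMP is the usual intermediate between IMPDH and GMPS.
# De novo purine synthesis is deliberately omitted from this sketch.
REACTIONS = [
    ("adenine", "AMP", "APRT"),
    ("hypoxanthine", "IMP", "HGPRT"),
    ("guanine", "GMP", "HGPRT"),
    ("AMP", "IMP", "AMPD"),
    ("IMP", "XMP", "IMPDH"),
    ("XMP", "GMP", "GMPS"),
]

# Assumption for illustration: enzyme activities present in the fly,
# i.e. everything in REACTIONS except HGPRT.
FLY_ENZYMES = {"APRT", "AMPD", "IMPDH", "GMPS"}


def reachable(start, enzymes):
    """Return all metabolites reachable from `start` using only `enzymes`."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for substrate, product, enzyme in REACTIONS:
            if substrate == current and enzyme in enzymes and product not in seen:
                seen.add(product)
                queue.append(product)
    return seen


if __name__ == "__main__":
    for base in ("adenine", "hypoxanthine", "guanine"):
        print(base, "->", sorted(reachable(base, FLY_ENZYMES)))
```

      Under these assumptions, adenine can still feed AMP, IMP and GMP, whereas hypoxanthine and guanine lead nowhere once the HGPRT step is removed, which matches the enzymatic assays summarized above.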

      Is the lack of HGPRT specific for Drosophila, insects (generally in invertebrates)? I feel clarifying this would provide more insight into the motivation of the experimental approach.

      As suggested by the Reviewer and the Reviewing Editor, we have addressed the evolution of HGPRT proteins more precisely in the revision. We have added a section on this subject in Results on pages 6-7, lines 122-150, and two phylogenetic analyses as Figures S2 and S3 with details in legends. A phylogenetic analysis was carried out a few years ago by Giorgio Matassi, who is now a co-author of this paper. The most striking result was the great impact of horizontal gene transfer in the evolution of HGPRT in Insects (Figures S2 and S3). Our analysis of the phyletic distribution of HGPRT proteins revealed their striking rareness in Insecta, and in particular, their absence in Drosophilidae. The PSI-BLAST search detected however a significant hit in Drosophila immigrans (accession KAH8256851.1). Yet, this sequence is 100% identical to the HGPRT of the Gammaproteobacterium Serratia marcescens. Indeed, a phylogenetic analysis showed that D. immigrans HGPRT clusters with the Serratia genus (see Figure S3). This can be interpreted either as a contamination of the sequenced sample, or as a very recent horizontal gene transfer event. The second scenario is more likely, because the corresponding nucleotide sequences differ by 5 synonymous substitutions (out of 534 positions). A powerful approach to try to understand the "origin" of the D. immigrans protein would be to analyze whether horizontal gene transfer has affected its chromosomal neighbours. This approach, proposed previously by G. Matassi (BMC Evol Biol, 2017, 17:2, doi: 10.1186/s12862-016-0850-6), is highly demanding in terms of computing time and would require an ad hoc study. We hope that these new analyses address the Reviewer's concerns.

      • On the mechanistic side on how the behavioral defects may arise, the authors show that dopaminergic neurons (and glia cells) are involved. One interesting finding is that dopamine immunostainings suggest increased dopamine levels. However, immunostainings are notorious for artifacts and do not provide a strong quantitative assessment. I feel it would be helpful to have an alternative technique to corroborate this finding.

      We agree with the reviewer and we added the results of further confirmatory experiments in the four new panels E-H of Figure 5, showing that: 1) the transcript levels of DTH1 (encoding the neuronal isoform of the dopamine-synthesizing enzyme tyrosine hydroxylase in Drosophila) are increased in Aprt5 mutants compared to wild-type flies (new Figure 5E), 2) consistent with this, DTH1 transcript levels were conversely found to be decreased when Aprt was overexpressed ubiquitously in flies (new Figure 5F), 3) Western blot experiments showed that DTH1 protein levels are also increased in Aprt5 mutant flies compared to controls (new Figure 5G-H).

      Reviewer #2 (Recommendations For The Authors):

      As mentioned in the public review, the behavioural phenotypes of decreased lifespan, SING, sleep and seizures could be tested for all manipulations: feeding with allopurinol, adenosine and m6A, and combining this with knock-down dopamine levels in PAMs or MBs. This could help dissect the relationship between mutations in Aprt and behaviour.

      We thank the reviewer for these suggestions, and, indeed, we would have liked to do all these experiments. However, as mentioned in our responses to the Public Reviews, carrying out these experiments properly with sufficient repeats would require about two more years of work. We have already accumulated a large amount of data, so we have decided to publish our results at this stage in order to make our new fly models available to the scientific community. We are giving careful and due consideration to these experimental proposals and we hope to continue our investigation on this topic in the future.

      It would also be helpful to find out which neurons and glia express AdoR. Perhaps there are already tools available the authors could test or at least check with the scRNAseq Fly Atlas (public Scope database).

      Following the reviewer’s recommendation, we have checked the scRNAseq Fly Atlas for AdoR expression in the brain, compared to that of ple (encoding tyrosine hydroxylase) and Eaat1 (encoding the astrocytic glutamate transporter). As shown in the image below, the results are not very informative. AdoR appears to be expressed in rather widespread subsets of neurons and glial cells, that partly overlap with ple and Eaat1 expression. Further work would be required to identify more precisely the neurons and glial cells expressing AdoR in the brain.

      Author response image 1.

      Page 7, line 161: use of the word 'normalize'. "We tried to normalise uric acid content in flies..." It would be best to use 'rescue' instead, as normalisation in science has a different meaning.

      We modified this word as suggested.

      Page 9 line 203: 'genomic deficiencies that cover': the genetic term is 'uncover', as a deficiency for a locus reveals a phenotype, thus it is said 'a gene uncovered by xx deficiency'.

      Thank you for this helpful remark. We corrected this in line 221.

      Page 10, lines 206-208: 'allopurinol treatment did not improve the locomotor activity...'. These are important observations that would be best presented within the main manuscript Figure 1.

      As recommended, we have transferred the graphs of Figure S5 to new panels D and E of Figure 1.

      Figure 4: please indicate genotypes in the figure, where no information is given that these are UAS-Aprt-RNAi experiments.

      We added the complete genotype in Figure 4G, and also in Figure S12C and D. Thank you for noting that.

      Page 25 line 491: "None of these drugs was able to rescue the SING defects (data not shown)". Either provide the data or remove this claim.

      We have added these data in the new Figure S15.

      Statistical analyses: details are provided in the methods, but the name of test and multiple comparisons corrections should be also provided in the legends.

      Thank you very much for the careful proofreading. This was an oversight and we have added the information in all legends of the revised article.

      Reviewer #3 (Recommendations For The Authors):

      This is a difficult manuscript to appreciate. The abstract and introduction suggest that the study is to identify novel treatments for a human disease (LND) by development of a Drosophila model. Much of the results, however, are focussed to describing the consequences to purine metabolism of the Aprt mutation. To my mind, a rewrite to focus on the latter would be beneficial. The potential applicability to LND would be best restricted to the discussion.

      We apologize for not making our goals clearer. Our purpose was to find out if purine recycling deficiency could lead to metabolic and neurobehavioral disturbances in Drosophila, as is the case in human LND patients when HGPRT is mutated. Interestingly, we observed that mutation of the only purine recycling enzyme in flies, Aprt, did induce defects in part comparable to those of LND in humans, including overproduction of uric acid that is rescued by allopurinol treatment, reduced longevity, and various neurobehavioral phenotypes including bang-sensitive seizure, sleep defects and locomotor impairments. We also identified adenosine and N6-methyladenosine as rescuers of the epileptic behavior in these mutants. These drugs were also identified as therapeutic candidates in screens based on iPSCs from LND patients. This suggests that Aprt deficiency in Drosophila could be used as a model to better understand this disease and find new therapeutic targets.

      Regardless of the above comment, the concluding sentence of the abstract is inappropriate. This study does not show that Drosophila can be used to identify a cure for LND.

      We agree with the Reviewer that the last sentence of the abstract was too affirmative. As also recommended by the reviewing editor, we have modified this sentence in the abstract and other sentences in the text in order to tone down the affirmation that our new fly models are relevant for LND. See our answers to the Reviewing Editor above for details.

      Indeed, I would challenge the premise that screening against a functional, but unknown if structural, homologue (Aprt) will ever provide an exploitable opportunity. To meet this statement, this study needs to identify a treatment that rescues all of the behavioural phenotypes associated with the Aprt mutation, in addition to rescuing the influences of the mis-expression of mutated HGPRT.

      APRT and HGPRT are both functionally and structurally related. Both human APRT and HGPRT belong to the type I PRTases family identified by a conserved phosphoribosyl pyrophosphate (PRPP) binding motif, which is used as a substrate to transfer phosphoribosyl to purines. This binding motif is only found in PRTases from the nucleotide synthesis and salvage pathways (see: Sinha and Smith (2001) Curr Opin Struct Biol 11(6):733-9, doi: 10.1016/s0959-440x(01)00274-3). The purine substrates adenine, hypoxanthine and guanine share the same chemical skeleton and APRT can bind hypoxanthine, indicating that APRT and HGPRT also share similarities in their substrate binding sites (Ozeir et al. (2019) J Biol Chem. 294(32):11980-11991, doi: 10.1074/jbc.RA119.009087). This point has been made clearer in the Discussion on page 39, lines 785-792. Finally, Drosophila Aprt and Human APRT are closely related as the amino acid sequences of APRTs have been highly conserved throughout evolution (shown in Figure S5B).

      With respect to expression of the mutated HGPRT: the short seizure recovery time of 2.3 seconds is not very convincing evidence of a seizure phenotype. This is far below the timings reported for typical BS mutations. Because of this, the authors should run a positive control (e.g. one of the well-established BS mutations: parabss, eas or jus) to validate their assay. Moreover, was the seizure induced by the Aprt mutation (17.3 secs - again a low value) rescued by prior exposure to an antiepileptic? Could this behaviour be, instead, related to the SING locomotor phenotype?

      The assay we used to test for bang-sensitivity has been validated in previous articles from different laboratories. We agree that the recovery times we observed were shorter than those of the BS mutations mentioned by the reviewer. However, we could cite another Drosophila BS mutant, porin, that shows similarly short recovery times (2.5 and 6 sec, depending on the porin allele tested; Graham et al. J Biol Chem. 2010, doi: 10.1074/jbc.M109.080317). This is now mentioned on page 36, lines 717-720. In addition, the BS phenotype we observed with Aprt mutants was robust and highly significant compared to control flies (Figure 7). We did not try to rescue this phenotype by exposing the flies to an antiepileptic, but we do not think that it can be related to the SING phenotype. Indeed, providing adenosine or N6-methyladenosine to Aprt5 mutant flies was able to rescue the BS phenotype (Figure 7E, F), but did not rescue the locomotor defects (new Figure S15). Moreover, SING performances of Aprt5 mutant flies at 8 or 30 d a. E. are decreased in an almost identical way (Figure 1C), while we observed an effect on BS behavior at 30 d a. E., which implies that the SING and BS behaviors are most likely unrelated.

      Line 731 states that 'Aprt mutants show a typical BS phenotype' - whilst accurate to some extent (e.g. the behaviour depicted in the supp videos), it should be made clear that the recovery time is uncharacteristically short and thus differs from typical BS mutations.

      We have corrected the sentence in the revised article to mention that (page 36, lines 717-718).

      Line 732 stating that BS phenotype is often linked to neuronal activity - what other links would there be? Even if via glia or other tissues the final effect is via neurons.

      We have modified this sentence (page 36, line 720).

      The introduction and, particularly, the discussion are overly long and, in the case of the latter, repetitive of the results text. Pruning to make the paper more concise would be very beneficial. Removal of the extensive speculation about how DA and adenosine may interact would help in this regard (line 688 onwards). Indeed, in many places the discussion morphs into a review.

      We agree with the reviewer on this point, and have therefore done our best to shorten the Introduction and Discussion, which are now 24% and 21% shorter, respectively, in the revised article compared to the original submission.

      The applicability of using Drosophila Aprt mutations to screen for compounds that may treat LND is predicated on some degree of similarity in either enzyme structure or metabolic pathways. A discussion of how relevant studying Aprt therefore is needs to be included. Given the authors' insights - where should potential new drugs be targeted to?

      As stated above, we now mention in the article that APRT and HGPRT share similarities in their structure. In addition, the metabolic pathways between humans and Drosophila have been largely conserved (shown in Figure S1B).

    1. Author Response

      The following is the authors’ response to the original reviews.

      Response to review.

      We thank the editors and reviewers for their time in assessing our manuscript. We changed the title to remove the word “all” because we realized that was hyperbolic. Corrections in response to review are in blue text throughout the manuscript document (other minor corrections are not highlighted).

      eLife assessment

      This study presents valuable insights into the evolution of the gasdermin family, making a strong case that a GSDMA-like gasdermin was already present in early land vertebrates and was activated by caspase-1 cleavage. Convincing biochemical evidence is provided that extant avian, reptile, and amphibian GSDMA proteins can still be activated by caspase-1 and upon cleavage induce pyroptosis-like cell death - at least in human cell lines. The caspase-1 cleavage site is only lost in mammals, which use the more recently evolved GSDMD as a caspase-1 cleavable pyroptosis inducer. The presented work will be of considerable interest to scientists working on the evolution of cell death pathways, or on cell death regulation in non-mammalian vertebrates.

      We thank the editor for their time in evaluating our manuscript. We agree with the eLife assessment and with the comments of the reviewers.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors start out by doing a time-calibrated gene/species tree analysis of the animal gasdermin family, resulting in a dendrogram showing the relationship of the individual gasdermin subfamilies and suggesting a series of gene duplication events (and gene losses) that lead to the gasdermin distribution in extant species. They observe that the GSDMA proteins from birds, reptiles, and amphibians do not form a clade with the mammalian GSDMAs and notice that the non-mammalian GSDMA proteins share a conserved caspase-1 cleavage motif at the predicted activation site. The authors provide several series of experiments showing that the non-mammalian GSDMA proteins can indeed be activated by caspase-1 and that this activation leads to cell death (in human cells). They also investigate the role of the caspase-1 recognition tetrapeptide for cleavage by caspase-1 and for the pathogen-derived protease SpeB.

      We thank the reviewer for their time in evaluating our manuscript.

      Strengths:

      The evolutionary analysis performed in this manuscript appears to use a broader data basis than what has been used in other published work. An interesting result of this analysis is the suggestion that GSDMA is evolutionarily older than the main mammalian pyroptotic GSDMD, and that birds, reptiles, and amphibians lack GSDMD but use GSDMA for the same purpose. The consequence that bird GSDMA should be activated by an inflammatory caspase (=caspase1) is convincingly supported by the experiments provided in the manuscript.

      We thank the reviewer for their assessment of the manuscript.

      Weaknesses:

      1. As a non-expert in phylogenetic tree reconstruction, I find the tree resulting from the authors' analysis surprising (in particular the polyphyly of GSDMA) and at odds with several other published trees of this family. The differences might be due to differences in the data being used or due to the tree construction method, but no explanation for this discrepancy is provided.

      We agree, and we have modified the text to add more context to explain why our analysis generated a different topology: “In comparison to previously published studies, we used different methods to construct our gasdermin phylogenetic tree, with the result that our tree has a different topology. The topology of our tree is likely to be affected by our increased sampling of gasdermin sequences; we included 1,256 gasdermin sequences in comparison to 300 or 97 sequences used in prior studies. Prior studies used maximum likelihood tree building techniques, whereas we used a more computationally intensive Bayesian method using BEAST with strict molecular clocks that allows us to provide divergence time estimates, which we calibrated using mammal fossil estimated ages. We think that this substantially increased sampling paired with time calibration allow us to produce a more accurate phylogeny of the gasdermin protein family.”

      To explain and further support our method in a more technical manner, in our phylogenetic tree, non-mammal GSDMAs are paralogous to mammal GSDMAs, whereas others have found that non-mammal GSDMAs are orthologous to mammal GSDMAs. We obtained moderate support for the non-mammal GSDMA placement, with a Bayesian posterior of 0.42 and maximum likelihood bootstrap support of 0.96. Angosto-Bazarra et al. report for their placement a Bayesian posterior of 0.66 and maximum likelihood bootstrap support of 0.98. These are good results, but they arise from significantly fewer sequences than are included in our tree. However, in Fig S2 of Angosto-Bazarra et al. the support drops to 0.08. That the posteriors in both are not 1 indicates the presence of phylogenetic conflicts (i.e., a significant fraction of alternative trees), which means that the tree of our study or that of Angosto-Bazarra et al. could be incorrect. That said, our tree is supported by biological evidence, and our dataset is substantially larger. To better characterize this node, further sampling with even more species would be required. We exhausted the sequences available at the time our tree was generated.

      Differences between our study and previous studies:

      Author response table 1.

      1. While the cleavability of bird/reptile GSDMA by caspase-1 is well-supported by several experiments, the role of this cleavage for pyroptotic cell killing is addressed more superficially. One cell viability assay upon overexpression of GSDMA-NTD in human HEK293 cells is shown and one micrograph shows pyroptotic morphology upon expression in HeLa cells. It is not clear why these experiments were limited to human cells…

      We did include one more experiment in human cells, Figure 4B, in which we express full-length chicken GSDMA with dimerizable caspase-1 and show that LDH release requires the cleavage site aspartate, D244. That said, we agree that our use of only human cell lines is a weakness of the paper. We thought that the best way to definitively show the interaction of caspase-1 and GSDMA was to perform experiments in chicken macrophages. Therefore, we generated a custom-raised anti-chicken-GSDMA antibody. Unfortunately, the quality of the antibody was insufficient to detect endogenous GSDMA in chicken bone marrow-derived macrophages. Off-target binding prevented the observation of chicken GSDMA bands. We added a section to the discussion to acknowledge the need for further studies: “In future studies, the association of bird/amphibian/reptile GSDMA and caspase-1 should be confirmed in native cells from each of these animals.”

      …and why two different cell types were used for the two complementary results.

      In the paper we used 293T cells and HeLa cells as generic cell types that have distinct benefits. In general, we used 293T/17 cells for experiments where high transfection efficiency was most critical, as it is simple to achieve 90% or higher transfection efficiency in this line. However, 293T/17s have poor spreading in culture and thus are not as useful for morphologic studies. 293T/17 cells do display pyroptotic ballooning upon gasdermin activation; however, the images are less pronounced in comparison to other cell types that have more distinct morphology. Therefore, we used HeLa cells for the microscopy experiments because they are more adherent and larger than 293T/17s, which makes for easier visualization of pyroptotic ballooning. We have added the following statement to the text to make our rationale for the use of different cell lines more apparent: “In these experiments, 293T/17s were used for their high transfection efficiency, and HeLas were used for microscopy studies for their larger size and improved adherence.”

      1. The introduction mentions as a motivation for this work our lack of knowledge of how human GSDMA is activated. This is indeed an interesting and pressing question, but it is not really addressed in the manuscript. This is particularly true when believing the authors' dendrogram results that the bird and mammalian GSDMA families do not form a clade.

      As a consequence, the significance of this finding is mostly limited to birds and reptiles.

      Our aspiration was to discover hidden facets of mammal GSDMA by using a molecular evolutionary analysis of bird/amphibian/reptile GSDMA. Although we did not learn the identity of a host protease that activates mammalian GSDMA, we serendipitously discovered the evolutionary history of the association of caspase-1 with the gasdermin family. We think this manuscript provides an important and interesting advance in the field by revealing the process of evolution at work in the gasdermin family, and by showing that the association of caspase-1 with a gasdermin to cause pyroptosis is an unbroken pairing throughout evolution. It is surprising to us that the specific gasdermin partner has changed over time.

      Reviewer #2 (Public Review):

      Summary:

      The authors investigated the molecular evolution of members of the gasdermin (GSDM) family. By adding the evolutionary time axis of animals, they created a new molecular phylogenetic tree different from previous ones. The analyzed result verified that non-mammalian GSDMAs and mammalian GSDMAs have diverged into completely different and separate clades. Furthermore, by biochemical analyses, the authors demonstrated non-mammalian GSDMA proteins are cleaved by the host-encoded caspase-1. They also showed mammalian GSDMAs have lost the cleavage site recognized by caspase-1. Instead, the authors proposed that the newly appeared GSDMD is now cleaved by caspase-1.

      We thank the reviewer for their time in evaluating our manuscript.

      Through this study, we have been able to understand the changes in the molecular evolution of GSDMs, and by presenting the cleavage of GSDMAs through biochemical experiments, we have become able to grasp the comprehensive picture of this family of molecules. However, there are some parts where explanations are insufficient, so supplementary explanations and experiments seem to be necessary.

      Strengths:

      It has a strong impact in advancing ideas into the study of pyroptotic cell death and even inflammatory responses involving caspase-1.

      We thank the reviewer for the critical consideration of the phylogeny presented.

      Weaknesses:

      Based on the position of mammalian GSDMA shown in the molecular phylogenetic tree (Figure 1), it may be difficult to completely agree with the authors' explanation of the evolution of GSDMA.

      1. Focusing on mammalian GSDMA, this group, and mammalian GSDMD diverged into two clades, and before that, GSDMA/D groups and mammalian GSDMC separated into two, more before that, GSDMB, and further before that, non-mammalian GSDMA, when we checked Figure 1. In the molecular phylogenetic tree, it is impossible that GSDMA appears during evolution again. Mammalian GSDMAs are clearly paralogous molecules to non-mammalian GSDMAs in the figure. If they are bona fide orthologous, the mammalian GSDMA group should show a sub-clade in the non-mammalian GSDMA clade. It is better to describe the plausibility of the divergence in the molecular evolution of mammalian GSDMA in the Discussion section.

      We appreciate the reviewer’s careful consideration of our phylogeny. We agree that we did not make this clear enough in the discussion. Indeed, this is a confusing point, and is a critical concept in the paper. This is among our most important findings, so we have added a line addressing this finding to the abstract. We think about these concepts starting from the oldest common ancestor of a group, and then think about how genes duplicate over time. The discussion now begins with the following:

      We discovered that GSDMA genes in amphibians, birds, and reptiles are paralogs of mammal GSDMA. Surprisingly, the GSDMA genes in both the amphibians/reptiles/birds and mammal groups appear in the exact same locus. Therefore, this GSDMA gene was present in the common ancestor of all these animals. In mammals, this GSDMA duplicated to form GSDMB and GSDMC. Finally, a new gene duplicate, GSDMD, arose in a different chromosomal location. Then this GSDMD gene became a superior target for caspase-1 after developing the exosite. Once GSDMD had evolved, we speculate that the mammalian GSDMA became a pseudogene that was available to evolve a new function. This new function included a new promoter to express mammalian GSDMA primarily in the skin, and perhaps acquisition of a new host protease that has yet to be discovered.

      In further support of the topology of our Bayesian tree in Figure 1, we also performed a maximum likelihood analysis, which also placed the GSDMA genes into similarly distinct clades (Figure 1-S3). Finally, we have biological evidence to support this reasoning, where caspase-1 cleaves non-mammal GSDMAs and also mammal GSDMD (and no longer can cleave mammal GSDMA).

      1. Regarding (1), it is recommended that the authors reconsider the validity of estimates of divergence dates by focusing on mammalian species divergence, because the validity of this estimation requires a recheck of the molecular phylogenetic tree, including alignment.

      Our reconstructed evolution of gasdermins is consistent with the mammal tree of life. We constrained Bayesian estimation of divergences using soft calibrations from mammal fossil estimated ages. We have added the fossil calibration of mammalian gasdermins to the results section and to our methods.

      1. If GSDMB and/or GSDMC between non-mammalian GSDMA and mammalian GSDMD as shown in the molecular phylogenetic tree would be cleaved by caspase-1, the story of this study becomes clearer. The authors should try that possibility.

      It is known that mammal GSDMB and GSDMC cannot be activated by caspase-1. We propose that GSDMA was cleaved by caspase-1 only in extinct mammals that had not yet associated GSDMD with caspase-1. Such an extinct mammal could have encoded a GSDMA cleaved by caspase-1, a GSDMB cleaved by granzyme A, and a GSDMC cleaved by caspase-8. Later, the GSDMA gene was again duplicated to form GSDMD. After GSDMD was targeted by caspase-1, GSDMA was free to gain its current function in barrier tissues.

      Reviewer #1 (Recommendations For The Authors):

      As a non-expert on phylogenetic tree construction, I found the "time-calibrated maximum clade credibility coalescent tree" hard to digest. I would have liked to see an explanation of how this method is different from what has been used before and why the authors consider it to be better. This is particularly important when considering that the resulting tree shown in Figure 1 is quite different from other published trees of the same family (e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8742441 where the GSDMA family appears monophyletic).

      Please see response to Reviewer 1 weaknesses above. Also, we have moved the text “time-calibrated maximum clade credibility coalescent tree” to the figure legend.

      In the bioinformatical analysis of the conserved caspase-1 cleavage motif in bird GSDMA sequences, I would recommend also addressing the residue behind the cleavage site Asp, as this position has an unusually high conservation (mostly Gly) in bird GSDMA.

      This is a great observation. We suspect that this may reflect a need for flexibility in the secondary structure to allow the cleavage site to enter the enzymatic pocket of the caspase. This residue is also similarly enriched in mammal GSDMD, which is also cleaved by caspase-1. We also note high conservation of a P2' proline residue in birds with the FASD tetrapeptide, which could also be important for displaying the tetrapeptide to the caspase.

      This comment prompted us to search the literature for evidence of these residues in caspase-1 substrate preference studies. Remarkably, a P1' glycine and P2' proline are among the most enriched residues in human caspase-1 targets. This supports our hypothesis that caspase-1 cleaves GSDMA in non-mammals. We added the following to the results section: “Additionally, the P1' residue in amphibian, bird and reptile GSDMA was often a glycine, and the P2' residue was often a proline, especially in birds with FASD/FVSD tetrapeptides (Fig. 2B). A small P1' residue is preferred by all caspases. By using a peptide library, glycine has been determined to be the optimal P1' residue for caspase-1 and caspase-4. Further, in a review of the natural substrates of caspase-1, glycine was the second most common P1' residue, and proline was the most common P2' residue. These preferences were not observed for caspase-9.”
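      The P1'/P2' positions discussed above can be read off a sequence programmatically. The following is a minimal sketch, assuming the usual convention that the tetrapeptide spans P4 to P1 and that the two residues immediately after the cleavage-site aspartate are P1' and P2'. The function name and the toy sequence are hypothetical illustrations and are not taken from the study.

```python
def primed_residues(sequence, tetrapeptide):
    """Return (P1', P2') following the first occurrence of the tetrapeptide.

    The tetrapeptide is treated as P4-P3-P2-P1, so the residue right after
    it is P1' and the next one is P2'. Returns None if the motif is absent
    or the sequence ends before P2'.
    """
    start = sequence.find(tetrapeptide)
    if start == -1:
        return None
    after = start + len(tetrapeptide)
    primed = sequence[after:after + 2]
    if len(primed) < 2:
        return None
    return primed[0], primed[1]


# Toy sequence (made up for illustration): FASD followed by Gly-Pro.
toy_sequence = "MKLFASDGPTT"
result = primed_residues(toy_sequence, "FASD")
print(result)
```

      Applied to a real alignment, the same lookup could be tallied per position to reproduce the kind of enrichment summary described in the response.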

      Finally, I would like the authors to at least explain why the cell viability assays were done in 293T cells while the micrographs were done in HeLa cells. Why not show both experiments for both cell types?

      In the paper we used 293T cells and HeLa cells as generic cell types that have distinct benefits. In general, we used 293T/17 cells for experiments where high transfection efficiency was most critical, as it is simple to achieve 90% or higher transfection efficiency in this line. However, 293T cells have poor spreading in culture and thus are not as useful for morphologic studies. 293T/17 cells do display pyroptotic ballooning upon gasdermin activation; however, the images are less pronounced in comparison to other cell types that have more distinct morphology. Therefore, we used HeLa cells for the microscopy experiments because they are more adherent and larger than 293T/17s, which makes for easier visualization of pyroptotic ballooning. We have added the following statement to the text to make our rationale for the use of different cell lines more apparent: “In these experiments, 293T/17s were used for their high transfection efficiency, and HeLas were used for microscopy studies for their larger size and improved adherence.”

      There are a number of minor points related to language and presentation:

      • the expressions "pathogens contaminate the cytosol", "mammals can encode..", "an outsized effect" are unusual and might be rephrased.

      We changed these to:

      “manipulate the host cell, sometimes contaminating the cytosol with pathogen associated molecular patterns, or disrupting aspects of normal cell physiology”,

      “Only mammals encode GSDMC and GSDMD alongside the other four gasdermins.”,

      and

      “greater effect”

      • in line 87 the abbreviation "GSDMEc" is first used without explanation (of the "c").

      This is an important distinction, as GSDMEc proteins were only recently uncovered. To remedy this, we have added the following text following line 87: “This gasdermin was recently identified as an ortholog of GSDMA. It was called GSDMEc, following the nomenclature of other duplications of GSDME in bony fish that have been named GSDMEa and GSDMEb.”

      • line 89 grammar problem.

      Corrected

      • line 186ff the sentence "We believe..." does not appear to make sense.

      We revised the text to make this clear, changing the text to now read “We hypothesized that activating pyroptosis using separate gasdermins for caspase-1 and caspase-3 is a useful adaptation and allows for fine-tuning of these separate pathways. In mammals, this separation depends on the activation of GSDMD by caspase-1 and the activation of GSDME by caspase-3.”

      • many figures use pictures rather than text to represent species groups. These pictures are not always intuitive. As an example, in Figure 6 the 'snake' represents amphibians. After reading the text, I understand that these should probably be the caecilian amphibians, but not every reader might know what these critters look like. In Figure 7, I have no idea what the black blob (2nd image from top) is supposed to be.

      In crafting the manuscript, we found the use of text to denote the various species to be cumbersome. The species silhouettes are a standard graphical depiction used in evolutionary biology, which we think aids the readability of the figures. For example, in a paper cited in our manuscript, these same silhouettes were used to depict the evolution of GSDMs (https://doi.org/10.3389/fcell.2022.952015 Figure 1A, Figure 3D, Figure 4G). However, we agree that many readers will not know that caecilians are legless amphibians that resemble snakes in their body morphology, but are not close to snakes by phylogeny. We think it is important to use an image of a caecilian amphibian because the more iconic amphibians (frogs, salamanders) do not encode GSDMA. To increase clarity, we have mentioned the morphology of caecilians in the legends of Figure 2, Figure 6, and Figure 7, where caecilian amphibians are first introduced.

      In Figure 2: “Note, that caecilians morphologically are similar to snakes in their lack of legs and elongated body, however, this is an example of convergent evolution as caecilians are amphibians and are thus more closely related to frogs and salamanders than snakes.”

      In Figure 6: “M. unicolor is an amphibian despite sharing morphological similarity to a snake.”

      In Figure 7: “In caecilian amphibians, which are morphologically similar to snakes, birds, and reptiles, GSDMA is cleaved by caspase-1.”

      The black blob is the mollusk Lingula anatina, which unfortunately has an indistinct silhouette. To clarify this, we have added text to label the images in Figure 7.

      Reviewer #2 (Recommendations For The Authors):

      1. Line 214, in "(Fig. 3-S2) Human and mouse ..", it is necessary to type a period.

      2. Line 238, in the subtitle, GSMA should be amended to GSDMA.

      These have both been corrected.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We thank the three reviewers for their positive comments and helpful suggestions. We have addressed the issues raised which have helped to improve the manuscript. Below, we address the specific points with detailed responses.

      Reviewer #1 (Recommendations For The Authors):

      Minor comments

      1) Figure 2 - figure supplement 1. The figure states minimal medium while the legend states rich medium.

      We have corrected the legend as the experiment was done in minimal medium.

      2) Figure 3B - the statements in the text do not seem to match what is in the figure. "Cluster 1 (293 genes, 12 priority unstudied) is enriched for genes showing high expression variability across different conditions (71) and for genes induced during meiotic differentiation (72) and in response to TORC1 inhibitors (29). Cluster 2 (570 genes, 20 priority unstudied) is enriched for phenotypes related to cell mating and sporulation, e.g. 'incomplete cell-wall disassembly at cell fusion site' or 'abnormal shmoo morphology'". These terms (high expression variability, meiotic differentiation, TORC1 inhibitors, cell mating and sporulation/abnormal shmoo morphology) are not seen in the figure.

      As stated in the Results, we have carried out analyses with both Metascape and AnGeLi for functional enrichments in different GO and KEGG pathway terms (Figure 3B; Metascape) and/or among genes from published expression or phenotyping studies (AnGeLi). The enrichments for expression variability, meiotic differentiation, TORC1 inhibitors, and cell mating/sporulation/abnormal shmoo morphology are not based on GO terms but on lists from published expression and phenotyping experiments. We have slightly edited the sentence in the Results to make this clearer.

      3) The authors could consider citing a systematic screen for sporulation in the introduction (PMID: 292590

      We have cited 17 papers for growth screens under different conditions using approaches similar to ours. Given that we already cite 100 papers, we chose not to cite numerous other papers reporting screens for more complex phenotypes (cell morphology, mating, meiosis, recombination, etc.), which are not directly relevant to our study here.

      Reference PMID: 292590 refers to a 1979 paper in the German Dentist Journal.

      Reviewer #2 (Recommendations For The Authors):

      General comments

      1) The authors use their NET-FF approach to predict GO Biological Process and Molecular Function terms (Figure 4). Why was the Cellular Component ontology not included? In general, gene and protein functional characterization is best described by the Biological Process and Cellular Component ontologies, whereas Molecular Function describes the biochemical activity of a protein. In other words, proteins which share Biological Process and/or Cellular Component annotations often function in the same module, which may not be the case for shared Molecular Function annotations.

      We did not include Cellular Component because in previous benchmarking of our method using CAFA datasets our approach did not perform well at predicting Cellular Component. This aspect is harder to pick up from homology data and protein network data and is generally the toughest challenge in CAFA. In contrast, our predictions of Biological Process and Molecular Function are competitive with other methods. We have now made the reason for omitting Cellular Component clearer in the Methods.

      2) The authors use protein embeddings produced by integrating 6 STRING networks using the deepNF method. One of these networks is the "database" network. According to STRING (https://academic.oup.com/nar/article/47/D1/D607/5198476): "The database channel is based on manually curated interaction records assembled by expert curators, at KEGG, Reactome, BioCyc and Gene Ontology, as well as legacy datasets from PID and BioCarta". If one of the input networks contains information from GO, and then embeddings containing this information are used to predict GO annotations, are the authors not then leaking annotations which could improve downstream GO annotation predictions? It would be valuable to demonstrate to what extent the "database" network is contributing by repeating the GO prediction analyses with this network removed.

      We agree and also pointed out this circularity in the manuscript. We used an independent dataset – phenotype data – to benchmark our method, which showed good performance. Note that this study did not aim to develop a completely new method or improve on deepNF and CATH-FunFams but to integrate and exploit their combined power. For that reason, we wanted to keep as many high-quality curated edges in the STRING network as possible. Combining these independent methods brings synergies from their complementary approaches to facilitate interpretation of gene function.

      Minor comments

      1) Ternary encoding was used as a preprocessing step on the phenotype data before clustering was performed. An explanation of why this encoding was necessary (as opposed to a normalization/standardization approach) would be helpful.

      Ternary encoding was not strictly necessary but provided more nuanced and coherent clusters. Some conditions and mutants were associated with much larger phenotypic responses, which disproportionately influenced the clustering. After trying different approaches, we followed the recommendations from the R package microbialPhenotypes (https://github.com/peterwu19881230/microbialPhenotypes), which is now specified in the legend of Fig. 3A. Discretizing the data also helped to compare phenotypes across different types of mutants, and we have applied this approach previously in our phenomics study of non-coding RNA mutants (Rodriguez-Lopez et al. eLife 2022). Moreover, this approach allowed us to generate vectors of phenotypes for calculating phenotypic distances between mutants (including Hamming distance or Pearson correlations), which supported the subsequent cluster analysis using Cytoscape.
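      The discretization and distance calculation described above can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be a colony-size ratio relative to wild type, the 5% cutoff is borrowed from the thresholds mentioned elsewhere in these reviews, and the function names and example values are hypothetical rather than the authors' actual pipeline.

```python
def ternary_encode(ratio, threshold=0.05):
    """Map a colony-size ratio (mutant / wild type) to -1, 0 or 1.

    -1: smaller than wild type by at least `threshold` (sensitive)
     1: larger than wild type by at least `threshold` (resistant)
     0: within the threshold of wild type (no phenotype)
    The threshold value is an assumption made for this sketch.
    """
    if ratio <= 1 - threshold:
        return -1
    if ratio >= 1 + threshold:
        return 1
    return 0


def hamming_distance(vec_a, vec_b):
    """Count the conditions in which two ternary phenotype vectors differ."""
    if len(vec_a) != len(vec_b):
        raise ValueError("phenotype vectors must cover the same conditions")
    return sum(a != b for a, b in zip(vec_a, vec_b))


# Hypothetical ratios for two mutants across four conditions.
mutant_a = [ternary_encode(r) for r in (0.80, 1.02, 1.20, 0.97)]
mutant_b = [ternary_encode(r) for r in (0.90, 1.00, 0.98, 0.60)]
print(mutant_a, mutant_b, hamming_distance(mutant_a, mutant_b))
```

      Because every condition contributes at most one unit to the distance, a single condition with a very large effect size cannot dominate the comparison, which is the motivation given above for discretizing before clustering.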

      2) The authors use a validation set to perform early-stopping on the deepNF model. However, it appears that the validation set proteins are then used in downstream analyses anyway: "After training, weights from the epoch with the lowest validation loss were used to generate embeddings for all proteins" (my emphasis). In the case where the model was being used to generalize to new proteins (such as classification), this analysis would not be a valid way to perform hyperparameter tuning (e.g. early-stopping) since the validation set is then used in downstream analyses. However, deepNF is performing an unsupervised, multi-network encoding on all the available datapoints (proteins). In the case where only deepNF loss is being used to tune the hyperparameters, it's not necessary to use a held-out validation set - it is appropriate to use the full set of proteins to do this.

      Our Random Forest consisted of 500 trees with default values for the number of sub-features as √n and partial sampling of 0.7. GO terms were predicted using 5-fold cross-validation. Changing parameters showed that our model was robust to the values of the hyperparameters, so we settled on our initial model.
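      To make the stated settings concrete, the sketch below works out what they imply for the data each tree sees. It is not the authors' implementation: it assumes that √n is rounded down, that "partial sampling of 0.7" means each tree is trained on 70% of the training rows, and that rows are drawn with replacement (the response does not say). All function names are hypothetical.

```python
import math
import random

N_TREES = 500            # number of trees stated in the response
PARTIAL_SAMPLING = 0.7   # fraction of training rows given to each tree
N_FOLDS = 5              # folds used for cross-validation


def n_subfeatures(n_features):
    """Number of features considered per split, taken here as floor(sqrt(n))."""
    return max(1, math.isqrt(n_features))


def k_fold_indices(n_samples, k=N_FOLDS, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def tree_sample(train_indices, rng):
    """Draw the rows for one tree (with replacement; an assumption)."""
    size = round(PARTIAL_SAMPLING * len(train_indices))
    return [rng.choice(train_indices) for _ in range(size)]


rng = random.Random(1)
for train, test in k_fold_indices(100):
    per_tree_rows = [len(tree_sample(train, rng)) for _ in range(3)]
    print(len(train), len(test), per_tree_rows)
```

      With 100 proteins, each fold trains on 80 and tests on 20, and every one of the 500 trees would see 56 sampled rows under these assumptions.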

      3) The NET-FF hyperparameter tuning results should be made available in the supplement.

      We do not think this would be useful for the reason described in the reply above.

      Reviewer #3 (Recommendations For The Authors):

      Major points

      1) Why were the quantitive colony size data converted to -1, 0, and 1?

      It is unclear to me why the authors decided to convert the colony size data to ternary encoding of -1, 0, and 1. The original colony size data seem to be of fairly high precision so that the authors can detect a 5% difference from the wild type. I guess the authors must have tried using the quantitive colony size data for clustering analysis and found the results unsatisfactory. If that is the case, can the authors provide some possible explanations?

      A similar query has been raised by Reviewer 2. Ternary encoding provided more nuanced and coherent clusters. Some conditions and mutants were associated with much larger phenotypic responses, which disproportionately influenced the clustering. After trying different approaches, we followed the recommendations from the R package microbialPhenotypes, as now specified in the legend of Fig. 3A. Discretizing the data also helped to compare phenotypes across different types of mutants, and we have applied this approach previously in our phenomics study of non-coding RNA mutants (Rodriguez-Lopez et al. eLife 2022). Moreover, this approach allowed us to generate vectors of phenotypes for calculating phenotypic distances between mutants (including Hamming distance or Pearson correlations), which supported the subsequent cluster analysis using Cytoscape.

      2) What do 5% difference and 10% difference look like?

      The authors used 5% difference and 10% difference as cutoffs. I am curious whether a 5% difference in colony size is obvious to human eyes. Can the authors show some plate images and label colonies that differ from the wild type by about 5% and 10%? It will help readers understand the thresholds used for determining whether a mutant has a phenotype.

      Showing the original ‘raw’ colonies would not be meaningful because all colony sizes have been grid-corrected as described (Kamrad et al. eLife 2020). The grid correction takes care of three issues: (1) it converts colony size into an easily interpretable value by reporting a ratio relative to wild type; (2) it makes results comparable across different plates/batches; and (3) it corrects for within-plate positional effects which become apparent due to the same wild-type grid strain showing different fitness in different plate positions. But in principle, detecting a 5% difference in colony size by eye would be hard, and multiple measurements are required (>10 repeats) to obtain statistically reliable results. Author response image 1 shows the grid colonies in red frames and numbers at bottom right of colonies indicate the corrected effect sizes. Colony 17-8 (top right) is an example of a colony differing by 5% compared to neighbouring colonies 16-8 and 17-9.

      Author response image 1.
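      The idea behind the grid correction described above, reporting each colony as a ratio to nearby wild-type grid colonies so that positional effects cancel, can be illustrated with a deliberately simplified sketch. The published procedure (Kamrad et al. eLife 2020) is more elaborate; the nearest-neighbour reference, the function name, and the plate values below are assumptions made purely for illustration.

```python
from statistics import mean


def grid_correct(sizes, grid_positions, n_neighbours=1):
    """Divide each colony size by the mean size of its nearest grid colonies.

    sizes: dict mapping (row, col) -> raw colony size
    grid_positions: list of (row, col) positions holding the wild-type grid strain
    Returns a dict of corrected sizes, where 1.0 means wild-type-like growth.
    """
    corrected = {}
    for pos, size in sizes.items():
        nearest = sorted(
            grid_positions,
            key=lambda g: (g[0] - pos[0]) ** 2 + (g[1] - pos[1]) ** 2,
        )[:n_neighbours]
        reference = mean(sizes[g] for g in nearest)
        corrected[pos] = size / reference
    return corrected


# Hypothetical plate row: the right-hand side grows twice as well overall.
raw_sizes = {
    (0, 0): 100.0,  # wild-type grid colony
    (0, 1): 95.0,   # mutant next to it
    (0, 3): 200.0,  # wild-type grid colony in a better position
    (0, 4): 190.0,  # same mutant phenotype, better position
}
corrected = grid_correct(raw_sizes, grid_positions=[(0, 0), (0, 3)])
print(corrected)
```

      In this toy plate the two mutant colonies differ twofold in raw size, yet both come out 5% smaller than their local wild-type reference, which is why corrected values rather than raw images are the meaningful quantity.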

      3) How were the phenotyping conditions chosen?

      I am sure that the authors have put a lot of thought into designing the 131 phenotyping conditions. It will benefit the readers if the authors can explain how these conditions were chosen. For example, what literature precedents were considered and which conditions have never been examined before in S. pombe research? For drug treatment conditions, were pilot tests done to choose drug doses based on the growth inhibition effects on the wild type?

      We have used a wide range of different types of conditions that affect diverse processes (see colour legend on top of Fig. 3A). This was based on our previous experience and selection of conditions in large-scale phenotyping of wild strains (Jeffares et al. Nature Genetics 2015) and non-coding RNA mutants (Rodriguez-Lopez et al. eLife 2022). For previously applied conditions (e.g. oxidants), we used literature precedents for the doses, while for other conditions, we used trial and error to adjust the dose such that wild-type cell growth is barely inhibited. For some drugs and stresses, we assayed both low and high doses, in which wild-type cell growth is normal or inhibited, respectively, to uncover both sensitive and resistant mutants.

      Minor points

      1) One of the growth conditions is "YES_ethanol_1percent_no_glucose". I am curious how this is possible, as S. pombe cannot use ethanol as a carbon source.

      We assume that the cells contain sufficient internal glucose to fuel growth and division for a few cycles before running out of glucose. Thus, cells showed some residual growth on this medium, but growth is indeed very limited. Nevertheless, we could identify both sensitive and resistant mutants in this condition.

      2) Abstract "over 900 new proteins affected the resistance to oxidative stress". This sentence should be rephrased. Perhaps it is better to say "over 900 proteins were newly implicated in the resistance to oxidative stress".

      Yes, we have edited the sentence as suggested.

      3) Page 4 "S. pombe encodes 641 'unknown' genes (PomBase, status March 2023). " "Among these 643 unknown proteins, many are apparently found only in the fission yeast clade, but 380 are more widely conserved. " Which number is correct, 641 or 643?

      These numbers keep changing slightly. We now consistently use 641, the number from March 2023.

      4) Page 4 "These priority unstudied proteins have not been directly studied in any organism but can be assumed to have pertinent biological roles conserved over 500 million years of evolution. " According to http://timetree.org/, S. pombe and H. sapiens diverged about 1275 million years ago.

      We have now changed ‘over 500 million’ to ‘over 1000 million’, although there are of course different estimates for these times.

      5) "Using these potent wet and dry methods, we obtained 103,520 quantitative phenotype datapoints for 3,492 non-essential genes across 131 diverse conditions."

      I think "quantitative phenotype datapoints" are generated using wet methods, not dry methods.

      Yes, we have now deleted ‘Using these potent wet and dry methods,’ and start the sentence with ‘We obtained…’

      6) Abstract "We assayed colony-growth phenotypes to measure the fitness of deletion mutants for all 3509 non-essential genes"

      Page 6 "We performed colony-based phenotyping of the deletion mutants for all non-essential S. pombe genes"

      It is not clear to me how the authors can claim that the 3509 non-essential genes correspond to "all non-essential S. pombe genes". The authors should explain how they classify S. pombe genes into essential genes and non-essential genes. The deletion project papers (Kim et al. 2010 and Hayles et al. 2013) provided binary classification for most but not all genes, as there are genes whose deletion mutants were not generated by the deletion project. PomBase does not use a binary classification and there are a number of genes deemed "Gene Deletion Viability: Depends on conditions" by PomBase.

      We used the latest deletion library (Bioneer Version 5) as well as additional deletion mutants published by Kathy Gould and colleagues, which together should capture all non-essential genes. But we agree that non-essentiality is not that clear-cut and is context-dependent. So we have deleted ‘all’ in the two sentences highlighted above.

      7) Page 20 "Other clusters contained mostly genes involved in vacuolar/endosomal transport and peroxisome function, along with poorly characterized genes (Figure 6B)."

      This sentence needs rephrasing. Perhaps it is better to say "Cluster 31 and cluster 22 contained respectively mostly genes involved in vacuolar/endosomal transport and peroxisome function, along with poorly characterized genes (Figure 6B)."

      We have edited this sentence to ‘Cluster 31 and Cluster 22 contained mostly genes involved in vacuolar/endosomal transport and peroxisome function, respectively, along with poorly characterized genes (Figure 6B).’

      8) Legend of Figure 2-figure supplement 1A

      "Left: Volcano plot of mutant colony sizes for priority unstudied genes (green) and all other genes (grey) growing in rich medium. " I think "rich medium" should be "minimal medium".

      Yes, we have now corrected this.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, the role of orexin receptors in dopamine neurons is studied. Considering the importance of both orexin and dopamine signalling in the brain, with critical roles in arousal and drug seeking, this study is important to understand the anatomical and functional interaction between these two neuromodulators. This work suggests that such interaction is direct and occurs at the level of SN and VTA, via the expression of OX1R-type orexin receptors by dopaminergic neurons.

      Strengths:

      The use of a transgenic line that lacks OX1R in dopamine-transporter-expressing neurons is a strong approach to dissecting the direct role of orexin in modulating dopamine signalling in the brain. The battery of behavioural assays to study this line provides a valuable source of information for researchers interested in the role of orexin-A in animal physiology.

      We thank the reviewer for summarizing the importance and significance of our study. 

      Weaknesses:

      The choice of methods to demonstrate the role of orexin in the activation of dopamine neurons is not justified and the quantification methods are not described with enough detail. The representation of results can be dramatically improved and the data can be statistically analysed with more appropriate methods.

      We have further improved our description of the methods in the revised reviewed preprint, and here in the response letter, we respond point-by-point to ‘Reviewer #1 (Recommendations For The Authors)’ below. 

      Reviewer #2 (Public Review):

      Summary:

      This manuscript examines the expression of orexin receptors in the midbrain - with a focus on dopamine neurons - and uses several fairly sophisticated manipulation techniques to explore the role of this peptide neurotransmitter in reward-related behaviors. Specifically, in situ hybridization is used to show that dopamine neurons predominantly express the orexin receptor 1 subtype and then go on to delete this receptor in dopamine neurons using a transgenic strategy. Ex vivo calcium imaging of midbrain neurons is used to show that in the absence of this receptor orexin is no longer able to excite dopamine neurons of the substantia nigra.

      The authors proceed to use this same model to study the effect of orexin receptor 1 deletion on a series of behavioral tests, namely, novelty-induced locomotion and exploration, anxiety-related behavior, preference for sweet solutions, cocaine-induced conditioned place preference, and energy metabolism. Of these, the most consistent effects are seen in the tests of novelty-induced locomotion and exploration in which the mice with orexin 1 receptor deletion are observed to show greater levels of exploration, relative to wild-type, when placed in a novel environment, an effect that is augmented after icv administration of orexin.

      In the final part of the paper, the authors use PET imaging to compare brain-wide activity patterns in the mutant mice compared to wildtype. They find differences in several areas both under control conditions (i.e., after injection of saline) as well as after injection of orexin. They focus on changes in the dorsal bed nucleus of stria terminalis (dBNST) and the lateral paragigantocellular nucleus (LPGi) and perform analysis of the dopaminergic projections to these areas. They provide anatomical evidence that these regions are innervated by dopamine fibers from the midbrain, are activated by orexin in control, but not mutant mice, and that dopamine receptors are present. Thus, they argue these anatomical data support the hypothesis that behavioral effects of orexin receptor 1 deletion in dopamine neurons are due to changes in dopamine signaling in these areas.

      Strengths:

      Understanding how orexin interacts with the dopamine system is an important question and this paper contains several novel findings along these lines. Specifically:

      (1) The distribution of orexin receptor subtypes in VTA and SN is explored thoroughly.

      (2) Use of the genetic model that knocks out a specific orexin receptor subtype from only dopamine neurons is a useful model and helps to narrow down the behavioral significance of this interaction.

      (3) PET studies showing how central administration of orexin evokes dopamine release across the brain is intriguing, especially since two key areas are pursued - BNST and LPGi - where the dopamine projection is not as well described/understood.

      We thank the reviewer for the careful summary and highlighting the novelty of our study.

      Weaknesses:

      The role of the orexin-dopamine interaction is not explored in enough detail. The manuscript presents several related findings, but the combination of anatomy and manipulation studies does not quite tell a cogent story. Ideally, one would like to see the authors focus on a specific behavioral parameter and show that one of their final target areas (dBNST or LPGi) was responsible or at least correlated with this behavioral readout. In addition, some more discussion on what the results tell us about orexin signaling to dopamine neurons under normal physiological conditions would be very useful. For example, what is the relevance of the orexin-dopamine interaction blunting novelty-induced locomotion under wildtype conditions?

      We agree that focusing on specific orexin-dopamine target areas, such as dBNST or LPGi, is important to further reveal the anatomy-behavior links and underlying mechanisms. While we are very interested in further investigations, in the present manuscript we mainly aim to give an overview of the behavioral roles of orexin-dopamine interaction and to propose some promising downstream pathways in a relatively broad and systematic way.

      We have explained the physiological meaning of our results in more detail in the discussion in the revised reviewed preprint (lines 282-293, 318-332). Novelty-induced behavioral responses should be at appropriate levels under normal physiological conditions. The orexin-dopamine interaction blunting novelty-induced locomotion could be important for keeping attention on the main task without being distracted too much by other random stimuli in the environment. When this balance is disrupted, behavioral deficits may occur, such as attention deficit hyperactivity disorder (ADHD).

      In some places in the Results, insufficient explanation and reporting is provided. For example, when reporting the behavioral effects of the Ox1 deletion in two bottle preference, it is stated that "[mutant] mice showed significant changes..." without stating the direction in which preference was affected.

      For the reward-related behaviors described in this study, we did not find significant changes between [mutant] and control mice. We agree that describing the behavioral tests in more detail will help readers. In the revised reviewed preprint, we have described in more detail in the Results and Materials and Methods sections how the control and [mutant] mice respond to the reward (lines 162-165, 171-181, 526-528).

      The cocaine CPP results are difficult to interpret because it is unclear whether any of the control mice developed a CPP preference. Therefore, it is difficult to conclude that the knockout animals were unaffected by drug reward learning. Similarly, the sucrose/sucralose preference scores are also difficult to interpret because no test of preference vs. water is performed (although the data appear to show that there is a preference at least at higher concentrations, it has not been tested).

      We described the CPP analysis in the Materials and Methods section (lines 523-528) as follows: ‘The percentage of time spent in the reward-paired compartment was calculated: 100 x time spent in the compartment / (total time - time spent in the middle area). The CPP score was then analyzed using the calculated percentage of time: 100 x (time on the test day – time on pre-test days)/ time on pre-test days. The pre-test and test days were before and after the conditioning, respectively. Thus, the CPP score above zero indicates that the CPP preference has developed.’ Figure 2—figure supplement 4 C and F shows that most control and knockout mice had a CPP score above zero. Both the control and knockout groups developed a preference, and there was no significant difference between the groups.
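
      The two formulas quoted above can be written out as a short calculation. This is a minimal sketch in Python, not the authors' analysis code: the function names are hypothetical, and it assumes that the pre-test value is supplied as a single percentage (for example an average over the pre-test days, which the quoted text does not specify).

```python
def percent_time_in_paired(time_in_paired: float,
                           total_time: float,
                           time_in_middle: float) -> float:
    """Percentage of time in the reward-paired compartment.

    Implements: 100 x time in compartment / (total time - time in middle area).
    """
    usable_time = total_time - time_in_middle
    if usable_time <= 0:
        raise ValueError("total time must exceed time spent in the middle area")
    return 100.0 * time_in_paired / usable_time


def cpp_score(pretest_percent: float, test_percent: float) -> float:
    """CPP score: 100 x (test - pre-test) / pre-test, using percentages of time.

    A score above zero means more time in the paired compartment after
    conditioning than before.
    """
    if pretest_percent <= 0:
        raise ValueError("pre-test percentage must be positive")
    return 100.0 * (test_percent - pretest_percent) / pretest_percent


# Hypothetical session: 900 s in total, 100 s of it in the middle area.
pre = percent_time_in_paired(360.0, 900.0, 100.0)
post = percent_time_in_paired(480.0, 900.0, 100.0)
score = cpp_score(pre, post)
```

      With these illustrative numbers the animal spends a larger share of its time in the paired compartment on the test day, so the score is positive, which is the criterion the authors use for a developed preference.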

      For the sucrose/sucralose preference tests, in Figure 2—figure supplement 4 A and D, we present values as the percentage of sucrose/sucralose consumption in the total daily drinking amount (sucrose/sucralose solution + water). Thus, percentages above 50% indicate that mice prefer sucrose/sucralose to water. As shown in the figure, male mice showed only a weak preference for 0.5% sucrose compared to water, and under all other tested conditions, the mice showed a strong preference for the sweet solution. There was no significant difference between control and knockout mice.
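
      The preference measure described above is a simple ratio, sketched here in Python. The function name and the intake values are hypothetical and serve only to illustrate the 50% threshold.

```python
def sweet_preference_percent(sweet_intake: float, water_intake: float) -> float:
    """Sweet-solution intake as a percentage of total daily drinking.

    Values above 50 mean that more sweet solution than water was consumed.
    """
    total = sweet_intake + water_intake
    if total <= 0:
        raise ValueError("total intake must be positive")
    return 100.0 * sweet_intake / total


# Hypothetical daily intakes in millilitres.
strong = sweet_preference_percent(6.0, 2.0)
neutral = sweet_preference_percent(3.0, 3.0)
```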

      We have described this in more detail in the Results and Materials and Methods sections in the revised reviewed preprint.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Figure 1, A-I. It is difficult to depict the anatomical subdivision of VTA in Figure 1, panels A and B. It is recommended to add a panel showing a schematic illustration of the SNc and subregions of VTA: PN, PIF, PBP, IF (providing more detail than in Figure 1, panel J). It is also recommended to show lower magnification images (as in Figure 1 - supplement 1), including both hemispheres, and to delineate the outline of the different subregions using curved lines, based on reference atlases (similar to Figure 1, panel I, please include distance from bregma). It would be helpful to indicate in Figure 1 that panel A is a control mouse and panel B is a Ox1RΔDAT mouse and include C-F letters to show corresponding insets. Anatomically, the paraintrafasicular nucleus (PIF) is positioned between the paranigral nucleus (PN) and the parabrachial pigmented nucleus (PBP). The authors have depicted the PIF ventral to the PN in Figure 1 panels A, B, and I. These panels and the quantification of Ox1R/2R positive cells within the different subdivisions need to be corrected accordingly. The image analysis method used to quantify RNAscope fluorescent images is not described in sufficient detail. Please expand this section.

      Following the reviewer’s suggestions, we have refined Figure 1 in the revised reviewed preprint. We now show the schematic illustration of the SN and subregions of VTA in panel I, with blue squares labeling the regions shown in panels A and B, and the distance from bregma is included. The outlines delineating SN and the subregions of VTA have been changed from straight to curved lines based on reference atlases. As suggested, we have also indicated that panel A is a control and panel B is an Ox1RΔDAT mouse, and included C-F letters to show the corresponding insets. We apologize for the mistake in labeling the PIF and PN positions in Figure 1. We have corrected the labeling of their positions and double-checked the quantification accordingly. This does not change our discussion or conclusion since both PIF and PN are in the medial part of VTA, where both Ox1R and Ox2R are observed. The description of the image analysis in the Materials and Methods section has been improved (lines 378-385). In the interests of clarity and reader-friendliness, we decided not to show images at lower magnification than in Figure 1—supplement 1 to include both hemispheres.

      (2) Figure 1, J-L. The claim that orexin activates dopaminergic SN and VTA neurons is weakly supported by the data provided. Calcium imaging of SN dopaminergic neurons in control mice suggests a discrete effect of 100 nM orexin-A application compared to baseline. Application of 300 nM shows a slightly bigger effect, but none of these results are statistically analysed. 

      We are surprised by this comment and thank the reviewer for pointing out our apparent lack of clarity in the previous version (lines 96-106 and legend of Figure 1K, L). In the new version, we explain the data analysis in more detail (lines 119-133, 451-465, and the legends of Figure 1K, L and Figure 1—figure supplement 3).

      The main goal of this part of the project was to functionally validate the Ox1R knockout in dopaminergic (DAT-expressing) neurons. This was a prerequisite for the behavioral and PET imaging experiments. We used GCaMP-mediated Ca2+ imaging in acute brain slices to reach this goal. This analysis was performed on the dopaminergic SN neurons, which we used as an "indicator population" because a large number of these neurons express Ox1R, but only a few express Ox2R. 

      The analysis consisted of two parts:

      a) For each neuron, we tested whether it responded to orexin A. At the single-cell level, a neuron was considered orexin A-responsive if the change in fluorescence induced by orexin A was three times larger than the standard deviation (3 σ criterion) of the baseline fluorescence, corresponding to a Z-score of 3. We found that 56% of the neurons tested responded to orexin A, while 44% of the neurons did not respond to orexin A (Figure 1L, top). These data agree with the number of Ox1R-expressing neurons (Figure 1J).

      b) We also determined the orexin A-induced GCaMP fluorescence for each neuron, expressed as a percentage of GCaMP fluorescence induced upon application of high K+ saline. Accordingly, the "population response" of all analyzed neurons was expressed as the mean ± SEM of these responses. The significance of this mean response was tested for each group (control and Ox1R KO) using a one-sample t-test. We found a marked and highly significant (p < 0.0001, n = 71) response of control neurons to 100 nM orexin A, while the Ox1R KO neurons did not respond (p = 0.5, n = 86). Note that, as described in a), 44% of the neurons contributing to the mean do not respond to orexin. Thus, the orexin responses of most responders are significantly higher than the mean. This is also evident in the example recordings in Figure 1K and Figure 1—figure supplement 3. The orexin A-induced change in fluorescence was increased by increasing the orexin A concentration to 300 nM.

      Note: As mentioned above, the orexin A response was expressed for each neuron individually as a percentage of its high K+ saline-induced GCaMP fluorescence. This value is a solid reference point, reflecting the GCaMP fluorescence at maximal voltage-activated Ca2+ influx. Obviously, the Ca2+ concentration at this point is extremely high and not typically reached under physiological conditions. Therefore, as shown in Figure 1—figure supplement 3 for completeness, the physiologically relevant responses may appear relatively minor at first glance when presented together in one figure (compare Figure 1—figure supplement 3 A and B).
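
      The two-part analysis described in a) and b) can be illustrated with a small Python sketch. It is not the authors' pipeline: the function names are hypothetical, each fluorescence window is assumed to be a plain list of samples, and the high K+ response is assumed to be measured as a change from the same baseline. Only the t statistic is computed; obtaining the p-value additionally requires the t distribution with n - 1 degrees of freedom from a statistics library.

```python
import math
import statistics


def orexin_delta(baseline, treated):
    """Change in mean fluorescence from the baseline window to the orexin A window."""
    return statistics.fmean(treated) - statistics.fmean(baseline)


def is_responder(baseline, treated, n_sd=3.0):
    """3-sigma criterion: the change must exceed n_sd baseline standard deviations."""
    return orexin_delta(baseline, treated) > n_sd * statistics.stdev(baseline)


def percent_of_high_k(baseline, treated, high_k):
    """Orexin A response as a percentage of the same neuron's high K+ response."""
    max_delta = statistics.fmean(high_k) - statistics.fmean(baseline)
    if max_delta <= 0:
        raise ValueError("high K+ response must exceed baseline")
    return 100.0 * orexin_delta(baseline, treated) / max_delta


def population_summary(responses):
    """Mean, SEM, and one-sample t statistic against a null mean of zero."""
    n = len(responses)
    mean = statistics.fmean(responses)
    sem = statistics.stdev(responses) / math.sqrt(n)
    return mean, sem, mean / sem
```

      Because non-responders are included in the population mean, the mean computed by `population_summary` is lower than the typical response of a neuron that passes `is_responder`, which is the point the authors make about the 44% of non-responding neurons.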

      The authors should provide more evidence of the orexin-induced activation of dopaminergic neurons in the SN to support this claim and investigate whether a similar activation is observed in VTA neurons. 

      Following the reviewer's suggestion, we confirmed orexin A-induced activation of dopaminergic neurons in the mouse SN by using perforated patch clamp recordings (Figure 1—figure supplement 2).

      This finding is consistent with previous extracellular in vivo recordings in rats (Liu et al., 2018).

      The activation of dopaminergic neurons in the mouse VTA by orexin A has been shown repeatedly in earlier studies (e.g., Baimel et al., 2017; Korotkova et al., 2003; Tung et al., 2016).

      In addition, Figure 3-Figure Supplement 2 shows that injection of orexin does not induce c-Fos expression in SN and VTA dopaminergic neurons of control and Ox1RΔDAT mice, which further weakens the claim made by the authors.

      Figure 3—Figure Supplement 2 in the original submission is now Figure 3—Figure Supplement 3 in the revised reviewed preprint. It shows low c-Fos expression in SN and VTA dopaminergic neurons, and orexin-induced c-Fos expression was observed in Th-negative cells in SN and VTA. 

      Technically relatively straightforward, Fos analysis is widely (and successfully) used in studies to reveal neuronal activation. However, this approach has limitations, e.g., regarding sensitivity and temporal resolution. Electrophysiological or optical imaging techniques can circumvent these shortcomings. The electrophysiological and Ca2+ imaging studies presented here, along with previous electrophysiological studies by others, clearly show that orexin A acutely and directly stimulates SN and VTA dopaminergic neurons.

      In vivo, the injection of orexin A induced a pronounced c-Fos activity in non-dopaminergic cells of the VTA and SN but not in dopaminergic neurons. This result shows that the detection of c-Fos has worked in principle. Whether the absent c-Fos staining in dopaminergic neurons is due to lack of sensitivity, whether other IEGs would have worked better here, or whether there are other, e.g., cell type-specific reasons for the absence of staining, cannot be determined from the current data.

      (3) Figure 2, I-L. The fact that ICV injection of both saline and orexin causes a sustained increase of locomotion (around 20 minutes in males, and over 30 minutes in females) is problematic and could mask the effects of orexin, particularly in females. It is unclear what panels J and L are showing. To be appropriately analysed, the authors should plot the pre- and post-injection AUC data for all groups and analyse it as a two-way mixed ANOVA, with the within-subjects factor "pre/post injection activity" and between-subjects factor "group". The authors can only warrant a statistically meaningful hyperlocomotor effect in Ox1RΔDAT mice if a significant interaction is found.

      Although the mice were habituated to the injection, some injection-induced increase in locomotion is still to be expected. As described in the figure legend, the AUC was calculated for the period after orexin injection, i.e., 5 – 90 min in Figure 2 I, K. We clearly observed significant differences between genotypes and between saline and orexin application, which means that the genotype and orexin effects are strong enough to be detected despite the injection effect.

      As the reviewer suggests, we have now plotted the pre- and post-injection AUC data for all groups and analyzed them with a two-way mixed ANOVA, with the within-subjects factor "pre/post injection activity" and the between-subjects factor "group". To match the pre- and post-injection durations, we now compare the AUC for around 60 min before and after the injection. A significant interaction was found. Panels I-L have been updated, and the differences induced by Ox1R knockout and orexin confirm the results shown in the initially submitted manuscript.
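
      The matched pre- and post-injection AUC values that feed into such an ANOVA can be obtained with the trapezoidal rule. The Python sketch below is an illustration under stated assumptions, not the authors' code: the function name, the time axis (minutes, with the injection at time 0), and the activity values are hypothetical, and the ANOVA itself is not implemented here.

```python
def auc_trapezoid(times, values, start, end):
    """Trapezoidal area under the curve for samples lying within [start, end].

    Only intervals whose two endpoints both fall inside the window are summed.
    """
    total = 0.0
    points = list(zip(times, values))
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 >= start and t1 <= end:
            total += 0.5 * (v0 + v1) * (t1 - t0)
    return total


# Hypothetical locomotor activity sampled every 30 min; injection at t = 0.
times = [-60, -30, 0, 30, 60]
activity = [10.0, 10.0, 20.0, 30.0, 30.0]

pre_auc = auc_trapezoid(times, activity, -60, 0)
post_auc = auc_trapezoid(times, activity, 0, 60)
```

      Using windows of equal length before and after the injection keeps the two AUC values directly comparable, which is why the authors matched the durations at around 60 min.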

      (4) Figure 3. The literature has robustly shown that one of the main projection areas of VTA and SN dopaminergic neurons is the striatum, in particular its ventral part. It is surprising to see that this region is not affected by the lack of OX1R or by the injection of orexin. How can the authors explain that identified regions with significantly different activity include neighbouring brain structures with heterogenous composition? See for example, in panel A, section bregma 0.62mm, a significant region is seen expanding across the cortex, corpus callosum, and striatum. While the data from PET studies is potentially interesting, it may not be adequate to provide enough resolution to allow examination of the anatomical distribution of orexin-mediated neuronal activation.

      While the striatum is a major projection area of dopaminergic neurons in VTA and SN, the projection and function of Ox1R-positive dopaminergic neurons is not clear. We have improved the description of dopamine function diversity in the revised reviewed preprint (lines 46-58), and it was reported before that the projection-defined dopaminergic populations in the VTA exhibited different responses to orexin A (Baimel et al., 2017). Moreover, the striatum activity is modulated by the indirect effect via other brain regions affected by Ox1R-positive dopaminergic neurons. It is unknown how the striatum activity should change after Ox1R deletion in dopaminergic neurons. We could not rule out the possibility that the striatum is indeed modulated by the Ox1R-positive dopaminergic neurons, though there was only a trend of genotype difference (Ox1RΔDAT vs. ctrl) in the ventral striatum in the section bregma 1.42 mm in Figure 3A. The ICV injection of orexin is potentially acting on Ox1R and Ox2R in the whole brain, so projections from other brain regions to the striatum also affect striatum activity and could have masked the effect of Ox1R-positive dopaminergic neurons. 

      The spatial resolution of the PET data is on the order of ~1 mm^3. As we also explain in the Materials and Methods section, the size of a voxel in the original PET data is 0.4 mm x 0.4 mm x 0.8 mm. All calculations were performed on this grid. The higher-resolution images shown in Figure 3 are for presentation purposes only, inspired by a request of the reviewer who asked us to show this in the Jais et al. 2016 manuscript. To make this clearer, we have now added the p-map images with the original voxel size to the supplement (Figure 3—figure supplement 1). For specific brain areas of interest, more precise identification of anatomical sub-regions requires methods with higher spatial resolution, such as staining of brain slices for c-Fos-positive cells, as we do in Figure 4.

      PET is a powerful tool to identify global regions of activation/inhibition. In the manuscript, we have described in the Results and Discussion sections that the activity in brain regions with related functions was changed. In panel A, Ox1RΔDAT mice showed an activity increase in MPA, Pir and endopiriform claustrum, which are important for olfactory sensation; spinal trigeminal nucleus, sp5, and IRt, which regulate mastication and sensation of the oral cavity and the surface of the face; and SubCV and Gi, which regulate sleeping and motion-related arousal and motivation. In panel B, changes in HDB, MCPO, Pir, DEn, S1, V2L and V1 are related to sensation, and changes in BNST, LPGi and M2 are important for emotion, exploration, and action selection.

      (5) Figure 4. As in Figure 1, the authors should consider including a schematic illustration of the brain areas that are being analysed using a reference atlas. It is also recommended to provide more details describing the quantification of the images. Without such information, the data is not convincing, in particular, the claim that Ox1R depletion causes a decrease in DRD1 in BNST is unclear. Additional unbiased quantitative approaches could be used to strengthen this point.

      We have added Figure 4—figure supplement 1 as a schematic illustration of the brain areas that were being analyzed using a reference atlas. More details describing the unbiased quantification of the images have been added to Materials and Methods. We have added Figure 4—figure supplement 3, to show DRD1, DRD2 and the merged signal separately.  

      (6) The discussion starts by stating that the main findings of this study are based on RNAscope and optophysiological experiments, however, the latter are not presented anywhere in the manuscript. This sentence (line 192) should be revised. The authors state in line 193 that OX1R is the only orexin receptor in the SN, but they show in Figure 1 that in the SN, 3% of neurons express OX2R and 2% co-express both receptors. 

      We thank the reviewer for the input. We have rephrased the beginning of the discussion to clarify the objectives (lines 238-246). In doing so, we changed "optophysiological experiments" and "single orexin receptor" (lines 192 and 193 in the original manuscript) to "Ca2+ imaging experiments" and "main subtype of orexin receptors", respectively. In this context, it should be noted that Ca2+ imaging is considered an optophysiological method - optophysiology generally refers to techniques that combine optical methods with physiological measurements.

      The results of LPGi and BNST dopamine receptors in control and Ox1RΔDAT mice are poorly discussed. The authors should justify why these two regions were selected for further validation and how these may be related to the behavioural effects found in Ox1RΔDAT regarding exposure to a novel context.

      Ox1RΔDAT mice exhibited increased novelty- and orexin-induced locomotion compared to control mice. After orexin injection, PET imaging shows that the neural activity of BNST and LPGi was lower or higher than in control mice, respectively. We selected BNST and LPGi for further validation because we think their key functional roles in regulating emotion, exploratory behaviors and locomotor speed are related to novelty-induced locomotion. We confirmed the changes in neural activity by c-Fos staining and investigated the expression patterns of dopamine receptors in BNST and LPGi. Our findings suggested that Ox1R deletion in dopaminergic neurons results in the disinhibition of neural activity in LPGi via dopaminergic pathways and the decrease of dopamine-mediated neural activity in BNST. Emotion perception affects the decision of how to respond to novelty. It is possible that novelty activates the orexin system and that Ox1R signaling in dopaminergic neurons promotes emotion perception and inhibits exploration. Of course, further careful investigation is necessary to test this hypothesis in future experiments. We have improved the description of the rationale and the discussion in the ‘Results’ and ‘Discussion’ sections in the revised reviewed preprint (lines 210-213, 259-270, 293-308).

      Reviewer #2 (Recommendations For The Authors):

      A major recommendation - if possible - would be to directly show that one or both of the two target areas - dBNST and LPGi - are associated with the behavioral effects caused by the deletion of the orexin receptor 1 in dopamine neurons.

      We completely agree that it would be very valuable to directly show dBNST and LPGi are associated with the behavioral effects caused by the deletion of Ox1R in dopaminergic neurons. While we are very interested in carefully investigating specific orexin-dopamine targeting areas and related neural circuits in the future, in the present manuscript, we mainly aim to give an overview of the behavioral roles of orexin-dopamine interaction and propose some promising downstream pathways. 

      The authors should state if data are corrected for multiple comparisons, e.g., in the PET study of different regions.

      We have included information about the post-hoc tests for all 2-way ANOVA analyses in the submitted manuscript. For the PET study, the p-values in the p-maps were not corrected for multiple comparisons; Figure 3—figure supplement 2 shows the raw data of each mouse and the analysis method (t-test). In the revised reviewed preprint, we include the information on the analysis method in the legend of Figure 3.

      We consider that saline and orexin injections mimic the resting and active state of mice, respectively, and would like to study the genotype effect under each condition. A 2-way ANOVA takes into account the difference between orexin and saline injection, which could mask the genotype effect under a certain condition. Therefore, we decided to perform t-tests for each condition in Figure 3. While we provide readers with full information in Figure 3—figure supplement 2 with the raw data of each individual mouse, below we present the p-maps after multiple comparisons (Sidak’s post hoc test). After multiple comparisons, we could see changes in similar brain regions as in Figure 3, though significant values are reduced by the correction for multiple comparisons. Under the orexin-injection condition, we fail to see significantly higher activity around the lateral paragigantocellular nucleus (LPGi), nucleus of the horizontal limb of the diagonal band (HDB) and magnocellular preoptic nucleus (MCPO) in Ox1RΔDAT mice. In order to identify the anatomical locations more precisely, we performed additional experiments to confirm the changes revealed by PET. For example, LPGi is a relatively small region that was confirmed and identified more precisely by c-Fos immunostaining (Figure 4A, C).

      Author response image 1.

      PET imaging studies comparing Ox1RΔDAT and control mice, with post-hoc t-test to correct for multiple comparisons. 3D maps of p-values in PET imaging studies comparing Ox1RΔDAT and control mice, after intracerebroventricular (ICV) injection of (A) saline (NS) and (B) orexin A. Control-NS, n = 8; control-orexin, n = 6; Ox1RΔDAT, n = 8. M2, secondary motor cortex; MPA, medial preoptic area; Pir, piriform cortex; IEn, intermediate endopiriform claustrum; DEn, dorsal endopiriform claustrum; VEn, ventral endopiriform claustrum; LSS, lateral stripe of the striatum; BNST, the dorsal bed nucleus of the stria terminalis; S1Sh, primary somatosensory cortex, shoulder region; S1HL, primary somatosensory cortex, hindlimb region; S1BF, primary somatosensory cortex, barrel field; S1Tr, primary somatosensory cortex, trunk region; V1, primary visual cortex; V2L, secondary visual cortex, lateral area; SubCV, subcoeruleus nucleus, ventral part; Gi, gigantocellular reticular nucleus; IRt, intermediate reticular nucleus; sp5, spinal trigeminal tract.
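      As a rough illustration of what a Šidák-type correction does to a set of p-values, the sketch below applies the generic formula p_adj = 1 - (1 - p)^m to m tests. The region names and raw p-values are hypothetical and are not taken from the study; how the correction was applied in the authors' imaging software is not stated here, so this is only a minimal sketch of the general idea.

```python
def sidak_adjust(p_values):
    """Return Sidak-adjusted p-values for m tests: 1 - (1 - p)^m."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


# Hypothetical raw p-values for illustration only (not data from the study).
raw = {"region_A": 0.004, "region_B": 0.030, "region_C": 0.200}

adjusted = dict(zip(raw, sidak_adjust(list(raw.values()))))

for region, p_adj in adjusted.items():
    # Adjusted values are never smaller than the raw ones, so some regions
    # that pass an uncorrected threshold may no longer pass after correction.
    print(f"{region}: raw p = {raw[region]:.3f}, adjusted p = {p_adj:.3f}")
```

      With more tests, the same raw p-value is adjusted upward more strongly, which is consistent with fewer regions reaching significance after correction.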

      Provide a rationale for following up on BNST and LPGi and not any of the regions identified in the PET study.

      We thank the reviewer for the careful reading and important input. Ox1RΔDAT mice exhibited increased novelty- and orexin-induced locomotion compared to control mice. After orexin injection, PET imaging shows that the neural activity of BNST and LPGi was lower and higher, respectively, than in control mice.

      We selected BNST and LPGi for further validation because we think their key functional roles in regulating emotion, exploratory behaviors and locomotor speed are related to novelty-induced locomotion. We confirmed the neural activity change by c-Fos staining and investigated the expression patterns of dopamine receptors in BNST and LPGi. Our findings suggested that Ox1R deletion in dopaminergic neurons results in the disinhibition of neural activity in LPGi via dopaminergic pathways and the decrease of dopamine-mediated neural activity in BNST. Emotion perception affects the decision of how to respond to novelty. It is possible that novelty activates the orexin system and that Ox1R signaling in dopaminergic neurons promotes emotion perception and inhibits exploration. Of course, further investigation is necessary to test this hypothesis in the future. We have improved the description of the rationale and the discussion in the ‘Results’ and ‘Discussion’ sections in the revised reviewed preprint (lines 210-213, 259-270, 293-308).

      Heatmap in Fig. 1K should not have smoothing across the y-axis, individual cells should be discrete.

      We thank the reviewer for bringing this issue to our attention. The data had not been intentionally smoothed (neither across the x-axis nor the y-axis), but it was probably a formatting issue. We have corrected this and separated individual cell traces with lines (Figure 1K, Figure 1—figure supplement 3).

      Dopamine cells are well known to lack Fos expression in most cases. Did the authors consider using another IEG to show neural activation, e.g., pERK?

      We did not use another IEG. The electrophysiological and Ca2+ imaging studies presented here, along with previous electrophysiological studies by others, clearly show that orexin A acutely and directly stimulates SN and VTA dopaminergic neurons. Please see also the response to a related comment of Reviewer 1.

      Consider adding a lower magnification section to anatomical figures to aid the reader in orienting and identifying the location.

      We have added the schematic illustration of SN, VTA, BNST and LPGi in Figure 1I and Figure 4—figure supplement 1. We hope this helps the reader in orienting and identifying the location.

      Data availability should be stated.

      There are no restrictions on data availability. We have added this section to the revised reviewed preprint.

      Line 50. Some more references both historical and recent could be given to support this statement about the function of dopamine.

      We have improved the description and added references to support the statement about dopamine function, citing recent studies and several reviews in the revised reviewed preprint (lines 46-58).

      The PET data (Fig. 3) might be easier to visualize and interpret if a white background was used. In addition, is there a more refined way of presenting the data in Fig 3, S1?

      It is common to present imaging data such as PET and MRI on a black background. We also have already applied this color scheme in multiple publications and would therefore prefer to stick to this color scheme. 

      While Figure 3 is the concise way to present PET data, we aim to show the original individual results of mice in Figure 3—figure supplement 2 and to demonstrate how we performed the statistical analysis. Therefore, we take an example voxel of the respective brain area, perform the t-test, and present the data as bars with individual dots. 

      Line 97. State what type of Ca imaging here, e.g., "we performed Ca imaging in ex vivo slices of VTA and SN".

      As the reviewer suggested, we have specified the type of Ca2+ imaging (line 112).

      Line 165. State which groups this post-mortem analysis was performed on and if any differences were to be found (not expected to find differences in this anatomical tracing experiment but good to report this as both groups were used).

      Postmortem analysis of c-Fos staining revealed low c-Fos expression in dopaminergic neurons in the VTA and SN of Ox1RΔDAT and control mice after ICV injection of saline or orexin A (1 nmol). No obvious changes were observed among the groups. We have improved the description in the revised reviewed preprint (lines 202-208).

      Line 192. What do you mean by optophysiological here? The Ca imaging (which is a fairly small, confirmatory element of the manuscript).

      We have changed ‘optophysiological experiments’ (line 192 in the initial submitted manuscript) to ‘calcium imaging experiments’ and rephrased the beginning of the discussion to clarify the objectives (lines 238-246).

      The protein level in the diet is substantially higher than in most rodent diets (34% here vs 14-20% in most commercial rodent chows). Please comment on this.

      This diet is for rat and mouse maintenance, purchased from ssniff Spezialdiäten GmbH (product V1554).

      The percentage of calories supplied by protein depends on the calculation method. The company previously calculated it with a pig equation, which gave 34% in the old data sheet. The new data sheet reports 23%, calculated with Atwater factors. We thank the reviewer for pointing this out and have updated the values in the revised reviewed preprint (lines 314-316).
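      To make the dependence on the calculation method concrete, the sketch below computes the share of energy from protein using the general Atwater factors (4 kcal/g for protein, 9 kcal/g for fat, 4 kcal/g for carbohydrate). The diet composition used here is hypothetical and is not the composition of the product named above; the manufacturer's exact calculation may differ.

```python
# General Atwater factors in kcal per gram of macronutrient.
ATWATER_KCAL_PER_G = {"protein": 4.0, "fat": 9.0, "carbohydrate": 4.0}


def percent_energy_from_protein(grams_per_100g):
    """Percent of total Atwater energy contributed by protein."""
    kcal = {name: grams_per_100g[name] * factor
            for name, factor in ATWATER_KCAL_PER_G.items()}
    return 100.0 * kcal["protein"] / sum(kcal.values())


# Hypothetical composition in g per 100 g of diet (illustration only).
example_diet = {"protein": 20.0, "fat": 5.0, "carbohydrate": 50.0}

protein_share = percent_energy_from_protein(example_diet)
print(f"Protein supplies {protein_share:.1f}% of energy")
```

      A different energy equation assigns different weights to the same macronutrients, so the reported protein percentage can change even when the diet itself does not.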

      Editor's note:

      Should you choose to revise your manuscript, please include full statistical reporting including exact p-values wherever possible alongside the summary statistics (test statistic and df) and 95% confidence intervals. These should be reported for all key questions and not only when the p-value is less than 0.05.

      We have provided the source data and the statistical reporting for each figure with the revision.

      References

      Baimel, C., Lau, B. K., Qiao, M., & Borgland, S. L. (2017). Projection-target-defined effects of orexin and dynorphin on VTA dopamine neurons. Cell Rep, 18(6), 1346-1355.  https://doi.org/10.1016/j.celrep.2017.01.030

      Korotkova, T. M., Eriksson, K. S., Haas, H. L., & Brown, R. E. (2002). Selective excitation of GABAergic neurons in the substantia nigra of the rat by orexin/hypocretin in vitro. Regul Pept, 104(1-3), 83-89. https://doi.org/10.1016/s0167-0115(01)00323-8 

      Korotkova, T. M., Sergeeva, O. A., Eriksson, K. S., Haas, H. L., & Brown, R. E. (2003). Excitation of ventral tegmental area dopaminergic and nondopaminergic neurons by orexins/hypocretins. J Neurosci, 23(1), 7-11. https://www.ncbi.nlm.nih.gov/pubmed/12514194

      Liu, C., Xue, Y., Liu, M. F., Wang, Y., Liu, Z. R., Diao, H. L., & Chen, L. (2018). Orexins increase the firing activity of nigral dopaminergic neurons and participate in motor control in rats. J Neurochem, 147(3), 380-394. https://doi.org/10.1111/jnc.14568 

      Tung, L. W., Lu, G. L., Lee, Y. H., Yu, L., Lee, H. J., Leishman, E., Bradshaw, H., Hwang, L. L., Hung, M. S., Mackie, K., Zimmer, A., & Chiou, L. C. (2016). Orexins contribute to restraint stress-induced cocaine relapse by endocannabinoid-mediated disinhibition of dopaminergic neurons. Nat Commun, 7, 12199. https://doi.org/10.1038/ncomms12199

    1. Author response:

      The following is the authors’ response to the original reviews.

      We would like to thank all of the reviewers for their helpful comments and for the effort they made in reading and evaluating our manuscript. In response, we have made major changes to the text and figures and performed substantial new experiments. These new data and changes have substantially strengthened the manuscript. We believe that the manuscript is now very strong in both its impact and scope, and we hope that the reviewers will find it suitable for publication in eLife.

      A point-by-point response to the reviewers' specific comments is provided below.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this report, Yu et al ascribe potential tumor suppressive functions to the non-core regions of RAG1/2 recombinases. Using a well-established BCR-ABL oncogene-driven system, the authors model the development of B cell acute lymphoblastic leukemia in mice and found that RAG mutants lacking non-core regions show accelerated leukemogenesis. They further report that the loss of non-core regions of RAG1/2 increases genomic instability, possibly caused by increased off-target recombination of aberrant RAG-induced breaks. The authors conclude that the non-core regions of RAG1 in particular not only increase the fidelity of VDJ recombination, but may also influence the recombination "range" of off-target joints, and that in the absence of the non-core regions, mutant RAG1/2 (termed cRAGs) catalyze high levels of off-target recombination leading to the development of aggressive leukemia.

      Strengths:

      The authors used a genetically defined oncogene-driven model to study the effect of RAG non-core regions on leukemogenesis. The animal studies were well performed and generally included a good number of mice. Therefore, the finding that cRAG expression led to the development of more aggressive BCR-ABL+ leukemia compared to fRAG is solid.

      Weaknesses:

      In general, I find the mechanistic explanation offered by the authors to explain how the non-core regions of RAG1/2 suppress leukemogenesis to be less convincing. My main concern is that cRAG1 and cRAG2 are overexpressed relative to fRAG1/2. This raises the possibility that the observed increased aggressiveness of cRAG tumors compared to fRAG tumors could be solely due to cRAG1/2 overexpression, rather than any intrinsic differences in the activity of cRAG1/2 vs fRAG1/2; and indeed, the authors allude to this possibility in Fig S8, where it was shown that elevated expression of RAG (i.e. fRAG) correlated with decreased survival in pediatric ALL. Although it doesn't mean the authors' assertions are incorrect, this potential caveat should nevertheless be discussed.

      We appreciate the valuable suggestions from the reviewer. BCR-ABL1+ B-ALL is characterized by halted early B-lineage differentiation. In BCR-ABL1+ B cells, RAG recombinases are highly expressed, leading to the inactivation of genes that encode essential transcription factors for B-lineage differentiation. This results in cells being trapped within the precursor compartment, thereby elevating RAG gene expression. Our interpretation of the data suggests that, in BCR-ABL1+ B-ALL mouse models, the high expression of both cRAG and fRAG and the deletion of the non-core regions influence the precision of RAG targeting within the genome. This causes more genomic damage in cRAG tumors than in fRAG tumors, consequently leading to the observed increased aggressiveness of cRAG tumors compared to fRAG tumors. We discussed the issues on Page 12, lines 295-307 in the revised manuscript.

      Some of the conclusions drawn were not supported by the data.

      (1) I'm not sure that the authors can conclude based on μHC expression that there is a loss of pre-BCR checkpoint in cRAG tumors. In fact, Fig. 2B showed that the differences are not statistically significant overall, and more importantly, μHC expression should be detectable in small pre-B cells (CD43-). This is also corroborated by the authors' analysis of VDJ rearrangements, showing that it has occurred at the H chain locus in cRAG cells.

      We appreciate the insightful comment from the reviewer. Upon reevaluation of the data presented in Fig. 2B, we identified and rectified certain errors. The revised analysis now shows that the differences in μHC expression are statistically significant. This significant expression of μHC in fRAG leukemic cells implies that these cells may progress further in differentiation, potentially acquiring an immune phenotype. These modifications have been incorporated into the manuscript on page 7, lines 153-156 in the revised manuscript.

      (2) The authors found a high degree of polyclonal VDJ rearrangements in fRAG tumor cells but a much more limited oligoclonal VDJ repertoire in cRAG tumors. They concluded that this explains why cRAG tumors are more aggressive because BCR-ABL induced leukemia requires secondary oncogenic hits, resulting in the outgrowth of a few dominant clones (Page 19, lines 381-398). I'm not sure this is necessarily a causal relationship since we don't know if the oligoclonality of cRAG tumors is due to selection based on oncogenic potential or if it may actually reflect a more restricted usage of different VDJ gene segments during rearrangement.

      Thank you for your insightful comments and questions regarding the relationship between the oligoclonality of V(D)J rearrangements and the aggressiveness of cRAG tumors. You raise an important point regarding whether the observed oligoclonality is a result of selective pressure favoring clones with specific oncogenic potential, or if it reflects inherent limitations in V(D)J segment usage during rearrangement in cRAG models. In our study, we observed a marked difference in the V(D)J rearrangement patterns between fRAG and cRAG tumor cells, with cRAG tumors exhibiting a more limited, oligoclonal repertoire. This observation led us to speculate that the aggressive nature of cRAG tumors might be linked to a selective advantage conferred by specific V(D)J rearrangements that cooperate with the BCR-ABL1 oncogene to drive leukemogenesis. However, we acknowledge that our current data do not definitively establish a causal relationship between oligoclonality and tumor aggressiveness. The restricted V(D)J repertoire in cRAG tumors could indeed be due to a more constrained rearrangement process, possibly influenced by the altered expression or function of RAG1/2 in the absence of non-core regions. This could limit the diversity of V(D)J rearrangements, leading to the emergence of a few dominant clones not necessarily because they have greater oncogenic potential, but because of a narrowed field of rearrangement possibilities.

      To address this question more thoroughly, future studies could examine the functional consequences of specific V(D)J rearrangements found in dominant cRAG tumor clones. This could include assessing the oncogenic potential of these rearrangements in isolation and in cooperation with BCR-ABL1, as well as exploring the mechanistic basis for the restricted V(D)J repertoire. Such studies would provide deeper insight into the interplay between RAG-mediated recombination, clonal selection, and leukemogenesis in BCR-ABL1+ B-ALL.

      We appreciate your feedback on this matter and agree that further investigation is required to unravel the precise relationship between V(D)J rearrangement diversity and leukemic progression in cRAG models. We have revised our discussion to reflect these considerations and to clarify the speculative nature of our conclusions regarding the link between oligoclonality and tumor aggressiveness. We added more discussion on this issue on Page 7, lines 166-170 in the revised manuscript.

      (3) What constitutes a cancer gene can be highly context- and tissue-dependent. Given that there is no additional information on how any putative cancer gene was disrupted (e.g., truncation of regulatory or coding regions), it is not possible to infer whether increased off-target cRAG activity really directly contributed to the increased aggressiveness of leukemia.

      We fully agree with the issues you raised. In Supplementary Table 3, we have presented data on off-target gene disruptions, specifically in introns, exons, downstream regions, promoters, 3' UTRs, and 5' UTRs. However, this dataset alone does not suffice to conclusively determine whether the increased off-target activity of cRAG directly influences the heightened aggressiveness of leukemia. To bridge this knowledge gap, our future research will extend to include both knockout and overexpression experiments targeting these off-target genes.

      (4) Fig. 6A, it seems that it is really the first four nucleotide (CACA) that determines fRAG binding and the first three (CAC) that determine cRAG binding, as opposed to five for fRAG and four for cRAG, as the author wrote (page 24, lines 493-497).

      We thank the reviewer for the insightful comment. In response, we have revised the text to accurately reflect the nucleotide sequences responsible for RAG binding and cleavage. Specifically, we now clarify that the first four nucleotides (CACA) are crucial for fRAG binding and cleavage, while the initial three nucleotides (CAC) are essential for cRAG binding and cleavage. These updates have been made on page 10, lines 242-245 of the revised manuscript.

      (5) Fig S3B, I don't really see why "significant variations in NHEJ" would necessarily equate "aberrant expression of DNA repair pathways in cRAG leukemic cells". This is purely speculative. Since it has been reported previously that alt-EJ/MMEJ can join off target RAG breaks, do the authors detect high levels of microhomology usage at break points in cRAG tumors?

      We appreciate the reviewer's comment. Currently, we have not observed microhomology usage at breakpoints in cRAG tumors. We plan to address this aspect in a future, more detailed study. Regarding the 'aberrant expression of DNA repair pathways in cRAG leukemic cells', we acknowledge that this is speculative. Therefore, we have carefully rephrased this to 'suggesting a potential aberrant expression of DNA repair pathways in cRAG leukemic cells.' This modification is reflected on page 12, lines 290-291 of the revised manuscript.

      (6) Fig. S7, CDKN2B inhibits CDK4/6 activation by cyclin D, but I don't think it has been shown to regulate CDK6 mRNA expression. The increase in CDK6 mRNA likely just reflects a more proliferative tumor but may have nothing to do with CDKN2B deletion in cRAG1 tumors.

      We fully concur with the reviewer's comment. We have deleted this inappropriate part from the text.

      Insufficient details in some figures. For instance, Fig. 1A, please include statistics in the plot showing a comparison of fRAG vs cRAG1, fRAG vs cRAG2, cRAG1 vs cRAG2. As of now, there's a single p-value (0.0425) stated in the main text and the legend but why is there only one p-value when fRAG is compared to cRAG1 or cRAG2? Similarly, the authors wrote "median survival days 11-26, 10-16, 11-21 days, P < 0.0023-0.0299, Fig. S2B." However, it is difficult for me to figure out what are the numbers referring to. For instance, is 11-26 referring to median survival of fRAG inoculated with three different concentrations of GFP+ leukemic cells or is 11-26 referring to median survival of fRAG, cRAG1, cRAG2 inoculated with 10^5 cells? It would be much clearer if the authors can provide the numbers for each pair-wise comparison, if not in the main text, then at least in the figure legend. In Fig. 5A-B, do the plots depict SVs in cRAG tumors or both cRAG and fRAG cells? Also in Fig. 5, why did 24 SVs give rise to 42 breakpoints, and not 48? Doesn't it take 2 breaks to accomplish rearrangement? In Fig. 6B-C, it is not clear how the recombination sizes were calculated. In the examples shown in Fig. 4, only cRAG1 tumors show intra-chromosomal joins (chr 12), while fRAG and cRAG2 tumors show exclusively inter-chromosomal joins.

      We appreciate the reviewer's feedback and have made the following revisions:

      (1) The text has been adjusted to rectify the previously mentioned error in the figure legends (page 1, lines 5-6).

      (2) We have clarified the intended message in the revised text (page 6, lines 129-130) and the figure legend (page 4-5, lines 107-113) for greater precision.

      (3) Figure 5A-B now presents an overview of all structural variants (SVs) identified in both cRAG and fRAG cells, offering a comprehensive comparison.

      (4) Among the analyzed SVs, 24 generated a total of 48 breakpoints, with 41 occurring within gene bodies and the remaining 7 in adjacent flanking sequences. This informs our exon-intron distribution profile analysis.

      (5) We have defined recombination sizes as ‘the DNA fragment size spanning the two breakpoints’ for clarity (page 10, lines 251-252).

      (6) All off-target recombinations identified in the genome-wide analyses of fRAG, cRAG1, and cRAG2 leukemic cells were determined to be intra-chromosomal joins, highlighting their specific nature within the genomic context.
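      The breakpoint count in point (4) and the size definition in point (5) can be sketched as follows. The class name, field names and coordinates are hypothetical and serve only to illustrate the stated definitions (two breakpoints per structural variant, and recombination size as the span between the two breakpoints of an intra-chromosomal join); this is not the authors' analysis pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StructuralVariant:
    """One rearrangement described by its two breakpoints (hypothetical model)."""
    chrom_a: str
    pos_a: int
    chrom_b: str
    pos_b: int

    def is_intrachromosomal(self) -> bool:
        return self.chrom_a == self.chrom_b

    def recombination_size(self) -> Optional[int]:
        """DNA fragment size spanning the two breakpoints, if on one chromosome."""
        if not self.is_intrachromosomal():
            return None  # a span is not defined across different chromosomes
        return abs(self.pos_b - self.pos_a)


def count_breakpoints(svs) -> int:
    """Each structural variant contributes exactly two breakpoints."""
    return 2 * len(svs)


# Hypothetical coordinates for illustration only.
sv = StructuralVariant("chr12", 1_000, "chr12", 51_000)
print(sv.recombination_size())       # span between the two breakpoints
print(count_breakpoints([sv] * 24))  # 24 variants give 48 breakpoints
```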

      Insufficient details on certain reagents/methods. For instance, are the cRAG1/2 mice of the same genetic background as fRAG mice (C57BL/6 WT)? On Page 23, line 481, what is a cancer gene? How are they defined? In Fig. 3C, are the FACS plots gated on intact cells? Since apoptotic cells show high levels of gH2AX, I'm surprised that the fraction of gH2AX+ cells is so much lower in fRAG tumors compared to cRAG tumors. The in vitro VDJ assay shown in Fig 3B is not described in the Method section (although it is described in Fig S5b). Fig. 5A-B, do the plots depict SVs in cRAG tumors or both cRAG and fRAG cells?

      We are grateful for the reviewer's feedback and have incorporated their insights as follows:

      (1) We clarify that both cRAG1/2 and fRAG mice share the same genetic background, specifically the C57BL/6 WT strain, ensuring consistency across experimental models.

      (2) We define a 'cancer gene' as one harboring somatic mutations implicated in cancer. To support our analysis, we refer to the Catalogue Of Somatic Mutations In Cancer (COSMIC) at http://cancer.sanger.ac.uk/cosmic. COSMIC serves as the most extensive repository for understanding the role of somatic mutations in human cancers.

      (3) Upon thorough review of the raw data for γ-H2AX and the fluorescence-activated cell sorting (FACS) plots gated on intact cells, we propose that the observed discrepancies might stem from the limited sensitivity of the γ-H2AX flow cytometry detection method. This insight prompts our commitment to employing more efficient detection methodologies in forthcoming studies.

      (4) Detailed procedures for the in vitro V(D)J recombination assay have been included in the Methods section (page 15, lines 384-388) to enhance the manuscript's comprehensiveness and reproducibility.

      (5) The presented plots offer a comprehensive overview of structural variants (SVs) identified in both cRAG and fRAG cells, providing a holistic view of the genomic landscape across different models.

      Reviewer #3 (Public Review):

      Summary:

      In the manuscript, the authors summarize and introduce the correlation between the non-core regions of RAG1 and RAG2 in BCR-ABL1+ acute B lymphoblastic leukemia and off-target recombination, which has a certain degree of novelty and clinical significance.

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors):

      I would suggest that the authors tone down some of their conclusions, which are not necessarily supported by their own data. In addition, there are some minor mistakes in figure assembly/presentation. For instance, I believe that the axes labels in Fig. 1E were flipped. BrdU should be on the y-axis and 7-AAD on the x-axis. Fig. 3B, the y-axis contains a typo, it should be "CD90.1..." and not "D90.1...". In Fig. 5C, the numbers seem to be flipped, with 93% corresponding to cRAG1 and 100% to cRAG2 (compare with the description on page 23, lines 474-475). Fig. 5C, y-axis, "hybrid" is a typo. Page 3, line 59: The abbreviation of RSS has already been described earlier (p4, line 53).

      We thank the reviewer for these suggestions. We carefully checked the raw data and corrected these mistakes in the revised manuscript.

      Page 3, line 63: "signal" segment (commonly referred to as signal ends), not "signaling" segment.

      We have changed “signaling segment” to “signal ends” in the revised manuscript (page 3, lines 54-55).

      Page 3, lines 64-65: VDJ recombination promotes the development of both B and T cells, and aberrant recombination can cause both B and T cell lymphomas.

      The statement about the role of V(D)J recombination in B and T cell development and its link to lymphomagenesis is grounded in a substantial body of research. Theoretical frameworks and empirical studies delineate how aberrations in the recombination process can lead to genomic instability, potentially triggering oncogenic events. This connection is extensively documented in immunology and oncology literature, illustrating the critical balance between necessary genetic rearrangements for immune diversity and the risk of malignancy when these processes are dysregulated (Thomson, et al.,2020; Mendes, et al.,2014; Onozawa and Aplan,2012).

      Page 4, line 72: "recombinant dispensability" is not a commonly used phrase. Do the authors mean to say that the non-core regions of RAG1/2 are not strictly required for VDJ recombination?

      We thank the reviewers for their insightful suggestion. We have revised the sentence to read, 'Although the non-core regions of RAG1/2 are not essential for V(D)J recombination, the evolutionary conservation of these regions suggests their potential significance in vivo, possibly affecting RAG activity and expression in both quantitative and qualitative manners.' This revision appears on page 3, lines 61-62, in the revised manuscript.

      Fig. 4. It would have been nice to show at least one more cRAG1 tumor Circos plot.

      We appreciate the reviewer's comment and concur with the suggestion. In future sequencing experiments, we will consider including additional replicates. However, due to time and financial constraints, the current sequencing effort was limited to a maximum of three replicates.

      Reviewer #3 (Recommendations For The Authors):

      In the manuscript, the authors summarize and introduce the correlation between the non-core regions of RAG1 and RAG2 in BCR-ABL1+ acute B lymphoblastic leukemia and off-target recombination, which has a certain degree of novelty and clinical significance. The following issues need to be addressed by the authors.

      (1) Authors should check and review extensively for improvements to the use of English.

      We thank the reviewer for their comment. With assistance from a native English speaker, we have carefully revised the manuscript to enhance its readability.

      (2) Authors should revise the conclusion so that the above can be clearly reviewed and summarized.

      The conclusion has been partially revised in the revised manuscript.

      (3) The article should state that the experiment was independently repeated three times.

      The experiment was repeated three times under the same conditions, and this information has been described in the Statistics section on page 19, lines 473-475 in the revised manuscript.

      (4) The article will be more convincing if it uses references in the last 5 years.

      We are grateful to the reviewer for their guidance in enhancing our manuscript. We have incorporated additional references from the past five years in the revised version.

      (5) Additional experiments are suggested to elucidate the molecular mechanisms related to off-target recombination.

      We thank the reviewer for this suggestion. In future experiments, we plan to perform ChIP-seq analysis to investigate the relationship between chromatin accessibility and off-target effects, as well as to examine the impact of knocking out and overexpressing off-target genes on cancer development and progression.

      (6) It is suggested to further analyze the effect of the absence of non-core RAG region on the differentiation and development of peripheral B cells in mice by flow analysis and expression of B1 and B2.

      Thank you very much for highlighting this crucial issue. FACS analysis was performed, revealing that leukemia cells in peripheral B cells in mice did not express CD5. The data are presented as follows:

      Author response image 1.

      (7) Fig3A should have three biological replicates and the molecular weight should be labeled on the right side of the strip.

      Thank you for this suggestion. The experiment was independently repeated three times, and the molecular weights have been labeled on the right side of the bands in the revised version.

      References:

      Mendes RD, Sarmento LM, Canté-Barrett K, Zuurbier L, Buijs-Gladdines JG, Póvoa V, Smits WK, Abecasis M, Yunes JA, Sonneveld E, Horstmann MA, Pieters R, Barata JT, Meijerink JP. 2014. PTEN microdeletions in T-cell acute lymphoblastic leukemia are caused by illegitimate RAG-mediated recombination events. Blood 124:567-578. doi:10.1182/blood-2014-03-562751

      Onozawa M, Aplan PD. 2012. Illegitimate V(D)J recombination involving nonantigen receptor loci in lymphoid malignancy. Genes Chromosomes Cancer 51:525-535. doi:10.1002/gcc.21942

      Thomson DW, Shahrin NH, Wang P, Wadham C, Shanmuganathan N, Scott HS, Dinger ME, Hughes TP, Schreiber AW, Branford S. 2020. Aberrant RAG-mediated recombination contributes to multiple structural rearrangements in lymphoid blast crisis of chronic myeloid leukemia. Leukemia 34:2051-2063. doi:10.1038/s41375-020-0751-y

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      In the current manuscript, the authors use theoretical and analytical tools to examine the possibility of neural projections to engage ensembles of synaptic clusters in active dendrites. The analysis is divided into multiple models that differ in the connectivity parameters, speed of interactions, and identity of the signal (electric vs. second messenger). They first show that random connectivity almost ensures the representation of presynaptic ensembles. As expected, this convergence is much more likely for small group sizes and slow processes, such as calcium dynamics. Conversely, fast signals (spikes and postsynaptic potentials) and large groups are much less likely to recruit spatially clustered inputs. Dendritic nonlinearity in the postsynaptic cells was found to play a highly important role in distinguishing these clustered activation patterns, both when activated simultaneously and in sequence. The authors tackled the difficult issue of noise, showing a beneficial effect when noise 'happens' to fill in gaps in a sequential pattern but degraded performance at higher background activity levels. Last, the authors simulated selectivity to chemical and electrical signals. While they find that longer sequences are less perturbed by noise, in more realistic activation conditions, the signals are not well resolved in the soma.

      While I think the premise of the manuscript is worth exploring, I have a number of reservations regarding the results.

      (1) In the analysis, the authors made a simplifying assumption that the chemical and electrical processes are independent. However, this is not the case; excitatory inputs to spines often trigger depolarization combined with pronounced calcium influx. This mixed signaling could have dramatic implications for the analysis, particularly if the dendrites are nonlinear (see below).

      We thank the reviewer for pointing out that we were not entirely clear about the strong basis upon which we had built our analyses of nonlinearity. In the previous version we had relied on published work, notably (Bhalla 2017), which does include these nonlinearities. However, we agree it is preferable to unambiguously demonstrate all the reported selectivity properties in a single model with all the nonlinearities discussed. We have now done so. This is now reported in the paper:

      “A single model exhibits multiple forms of nonlinear dendritic selectivity

      We implemented all three forms of selectivity described above in a single model that included six voltage- and calcium-gated ion channels, NMDA, AMPA and GABA receptors, and chemical signaling processes in spines and dendrites. The goal was threefold: to show how these nonlinear operations emerge in a mechanistically detailed model, to show that they can coexist, and to show that they are separated in time-scales. We implemented a Y-branched neuron model with additional electrical compartments for the dendritic spines (Methods). This model was closely based on a published detailed chemical-electrical model (Bhalla 2017). We stimulated this model with synaptic input corresponding to the three kinds of spatiotemporal patterns described in Figure 8 - Supplement 1 (sequential synaptic activity triggering electrical sequence selectivity), Figure 8 - Supplement 2 (spatially grouped synaptic stimuli leading to local Ca4_CaM activation), and Figure 8 - Supplement 3 (sequential bursts of synaptic activity triggering chemical sequence selectivity). We found that each of these mechanisms shows nonlinear selectivity with respect to both synaptic spacing and synaptic weights. Further, these forms of selectivity coexist in the composite model (Figure 8 Supplements 1, 2, 3), separated by the time-scales of the stimulus patterns (~100 ms, ~1 s and ~10 s respectively). Thus mixed signaling in active nonlinear dendrites yields selectivity of the same form as we explored in simpler individual models. A more complete analysis of the effect of morphology, branching and channel distributions deserves a separate in-depth study and is outside the scope of the current work.”

      (2) Sequence detection in active dendrites is often simplified to investigating activation in a part of or the entirety of individual branches. However, the authors did not do that for most of their analysis. Instead, they treat the entire dendritic tree as one long branch and count how many inputs form clusters. I fail to see why simplification is required and suspect it can lead to wrong results. For example, two inputs that are mapped to different dendrites in the 'original' morphology but then happen to fall next to each other when the branches are staggered to form the long dendrites would be counted as neighbors.

      We have added the below section within the main text in the section titled “Grouped Convergence of Inputs” to address the effect of branching.

      “End-effects limit convergence zones for highly branched neurons

      Neurons exhibit considerable diversity with respect to their morphologies. How synapses extending across dendritic branch points interact in the context of a synaptic cluster/group, is a topic that needs detailed examination via experimental and modeling approaches. However for the sake of analysis, we present calculations under the assumption that selectivity for grouped inputs might be degraded across branch points.

      Zones beginning close to a branch point might get interrupted. Consider a neuron with B branches. The length of the typical branch would be L/B. As a conservative estimate if we exclude a region of length Z for every branch, the expected number of zones that begin too close to a branch point is

                                                                          [Equation 3]

      For typical pyramidal neurons B~50, so Eend ~0.05 for values of Z of ~10 µm. Thus pyramidal neurons will not be much affected by branching effects. Profusely branching neurons like Purkinje cells have B~900 for a total L of ~7800 µm (McConnell and Berry, 1978), hence Eend ~1 for values of Z of ~10 µm. Thus almost all groups in Purkinje neurons would run into a branch point or terminal. For the case of electrical groups, this estimate would be scaled by a factor of 5 if we consider a zone length of 50 µm. However, it is important to note that these are very conservative estimates: for clusters of 4-5 inputs, the number of synapses available within a zone is far greater (~100 synapses within 50 µm).”
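      The arithmetic behind these end-effect estimates can be checked with a few lines of Python. This is a minimal sketch, not the authors' code: Equation 3 is not reproduced in this response, so the form Eend = B·Z/L used below (zone length divided by the typical branch length L/B) is an assumption inferred from the quoted numbers, and the pyramidal-cell arbor length of 10000 µm is likewise an assumed value chosen to match the stated Eend ~0.05.

```python
def expected_end_zones(num_branches: float, zone_length_um: float,
                       total_length_um: float) -> float:
    """Assumed form of the end-effect estimate: E_end = B * Z / L.

    Equivalent to Z divided by the typical branch length (L / B).
    This form is inferred from the numbers quoted in the text; it is
    not copied from the manuscript's Equation 3.
    """
    if num_branches <= 0 or zone_length_um <= 0 or total_length_um <= 0:
        raise ValueError("all arguments must be positive")
    return num_branches * zone_length_um / total_length_um


# Purkinje cell: B, L and Z as quoted in the response.
purkinje = expected_end_zones(num_branches=900, zone_length_um=10.0,
                              total_length_um=7800.0)

# Pyramidal cell: B and Z as quoted; L = 10000 um is an ASSUMED arbor length.
pyramidal = expected_end_zones(num_branches=50, zone_length_um=10.0,
                               total_length_um=10000.0)

# Electrical groups: a 50 um zone scales the estimate five-fold.
pyramidal_electrical = expected_end_zones(num_branches=50, zone_length_um=50.0,
                                          total_length_um=10000.0)

print(f"Purkinje E_end  ~ {purkinje:.2f}")
print(f"Pyramidal E_end ~ {pyramidal:.2f}")
print(f"Pyramidal E_end (50 um zone) ~ {pyramidal_electrical:.2f}")
```

      Under these assumptions the Purkinje value comes out slightly above 1, in line with the statement that almost all groups in Purkinje neurons would encounter a branch point or terminal.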

      (3) The simulations were poorly executed. Figures 5 and 6 show examples but no summary statistics.

      We have included the summary statistics in Figure 5F and Figure 6E. The statistics for both these panels were generated by simulating multiple spatiotemporal combinations of ectopic input in the presence of different stimulus patterns for each sequence length.

      The authors emphasize the importance of nonlinear dendritic interactions, but they do not include them in their analysis of the ectopic signals! I find it to be wholly expected that the effects of dendritic ensembles are not pronounced when the dendrites are linear.

      We would like to clarify that both Figures 5 and 6 already included nonlinearities. In Figure 5, the chemical mechanism involving the bistable switch motif is strongly selective for ordered inputs in a nonlinear manner. A separate panel highlighting this (Panel C) has now been included in Figure 5. This result had been previously shown in Figure 3I of (Bhalla 2017). We have reproduced it in Figure 5C.

      The published electrical model used in Figure 6 also has a nonlinearity, which predominantly stems from the interaction of the impedance gradient along the dendrite with the voltage dependence of NMDARs (see Figure 4C,D of Branco, Clark, and Häusser 2010).

      To provide a comprehensive analysis of dendritic integration, the authors could simulate more realistic synaptic conductances and voltage-gated channels. They would find much more complicated interactions between inputs on a single site, a sliding temporal and spatial window of nonlinear integration that depends on dendritic morphology, active and passive parameters, and synaptic properties. At different activation levels, the rules of synaptic integration shift to cooperativity between different dendrites and cellular compartments, further complicated by nonlinear interactions between somatic spikes and dendritic events.

      We would like to clarify two points. First, the key goal of our study was to understand the role played by random connectivity in giving rise to clustered computation. In this revision we provide simulations to show the mechanistic basis for the nonlinearities, and then abstract these out in order to scale the analysis to networks. These nonlinearities were taken as a given, though we elaborated previous work slightly in order to address the question of ectopic inputs. Second, in our original submission we relied on published work for the estimates of dendritic nonlinearities. Previous studies (Poirazi, Brannon, and Mel 2003; Branco, Clark, and Häusser 2010; Bhalla 2017) have already carried out highly detailed realistic simulations, in some cases including both chemical and electrical nonlinearities as the reviewer mentions (Bhalla 2017). Hence we did not feel that this needed to be redone.

      In this resubmission we have addressed the above and two additional concerns, namely whether the different forms of selectivity can coexist in a single model including all these nonlinearities, and whether there is separation of time-scales. The answer is yes to both. The outcome is presented in Figure 8 and the associated supplementary figures, and all simulation details are provided in the GitHub repository associated with this paper. A more complete analysis of the interaction of multiple nonlinearities in a detailed model is material for further study.

      While it is tempting to extend back-of-the-napkin calculations of how many inputs can recruit nonlinear integration in active dendrites, the biological implementation is very different from this hypothetical. It is important to consider these questions, but I am not convinced that this manuscript adequately addressed the questions it set out to probe, nor does it provide information that was unknown beforehand.

      We developed our analysis systematically, and perhaps the reviewer refers to the first few calculations as back-of-the-napkin. However, the derivation rapidly becomes more complex when we factor in combinatorics and the effect of noise. This derivation is in the supplementary material. Furthermore, the exact form of the combinatorial and noise equations was non-trivial to derive and we worked closely with the connectivity simulations (Figures 2 and 4) to obtain equations which scale across a large parameter space by sampling connectivity for over 100000 neurons and activity over 100 trials for each of these neurons for each network configuration we have tested.

      the biological implementation is very different from this hypothetical.

      We do not quite understand in what respect the reviewer feels that this calculation is very different from the biological implementation. The calculation is about projection patterns. In the discussion we consider at length how our findings of selectivity from random projections may be an effective starting point for more elaborate biological connection rules. We have added the following sentence:

      “We present a first-order analysis of the simplest kind of connectivity rule (random), upon which more elaborate rules such as spatial gradients and activity-dependent wiring may be developed.”

      In case the reviewer was referring to the biological implementation of nonlinear integration, we treat the nonlinear integration in the dendrites as a separate set of simulations, most of which are closely based on published work (Bhalla 2017). We use these in the later sections of the paper to estimate selectivity terms, which inform our final analysis.

      In the revision we have worked to clarify this progression of the analysis. As indicated above, we have also made a composite model of all of the nonlinear dendritic mechanisms, chemical and electrical, which underlie our analysis.

      nor does it provide information that was unknown beforehand.

      We conducted a broad literature survey and, to the best of our knowledge, these calculations and findings have not been obtained previously. If the reviewer has specific examples in mind, we would be pleased to cite them.

      Reviewer #2 (Public Review):

      Summary:

      If synaptic input is functionally clustered on dendrites, nonlinear integration could increase the computational power of neural networks. But this requires the right synapses to be located in the right places. This paper aims to address the question of whether such synaptic arrangements could arise by chance (i.e. without special rules for axon guidance or structural plasticity), and could therefore be exploited even in randomly connected networks. This is important, particularly for the dendrites and biological computation communities, where there is a pressing need to integrate decades of work at the single-neuron level with contemporary ideas about network function.

      Using an abstract model where ensembles of neurons project randomly to a postsynaptic population, back-of-envelope calculations are presented that predict the probability of finding clustered synapses and spatiotemporal sequences. Using data-constrained parameters, the authors conclude that clustering and sequences are indeed likely to occur by chance (for large enough ensembles), but require strong dendritic nonlinearities and low background noise to be useful.

      Strengths:

      (1) The back-of-envelope reasoning presented can provide fast and valuable intuition. The authors have also made the effort to connect the model parameters with measured values. Even an approximate understanding of cluster probability can direct theory and experiments towards promising directions, or away from lost causes.

      (2) I found the general approach to be refreshingly transparent and objective. Assumptions are stated clearly about the model and statistics of different circuits. Along with some positive results, many of the computed cluster probabilities are vanishingly small, and noise is found to be quite detrimental in several cases. This is important to know, and I was happy to see the authors take a balanced look at conditions that help/hinder clustering, rather than to just focus on a particular regime that works.

      (3) This paper is also a timely reminder that synaptic clusters and sequences can exist on multiple spatial and temporal scales. The authors present results pertaining to the standard 'electrical' regime (~50-100 µm, <50 ms), as well as two modes of chemical signaling (~10 µm, 100-1000 ms). The senior author is indeed an authority on the latter, and the simulations in Figure 5, extending those from Bhalla (2017), are unique in this area. In my view, the role of chemical signaling in neural computation is understudied theoretically, but research will be increasingly important as experimental technologies continue to develop.

      Weaknesses:

      (1) The paper is mostly let down by the presentation. In the current form, some patience is needed to grasp the main questions and results, and it is hard to keep track of the many abbreviations and definitions. A paper like this can be impactful, but the writing needs to be crisp, and the logic of the derivation accessible to non-experts. See, for instance, Stepanyants, Hof & Chklovskii (2002) for a relevant example.

      It would be good to see a restructure that communicates the main points clearly and concisely, perhaps leaving other observations to an optional appendix. For the interested but time-pressed reader, I recommend starting with the last paragraph of the introduction, working through the main derivation on page 7, and writing out the full expression with key parameters exposed. Next, look at Table 1 and Figure 2J to see where different circuits and mechanisms fit in this scheme. Beyond this, the sequence derivation on page 15 and biophysical simulations in Figures 5 and 6 are also highlights.

      We appreciate the reviewer's suggestions. We have tightened the flow of the introduction. We understand that the abbreviations and definitions are challenging, and have therefore provided intuitions and summaries of the equations discussed in the main text.

      Clusters calculations

      “Our approach is to ask how likely it is that a given set of inputs lands on a short segment of dendrite, and then scale it up to all segments on the entire dendritic length of the cell.

      Thus, the probability of occurrence of groups that receive connections from each of the M ensembles (PcFMG) is a function of the connection probability (p) between the two layers, the number of neurons in an ensemble (N), the relative zone-length with respect to the total dendritic arbor (Z/L) and the number of ensembles (M).”
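      The dependence on p, N, Z/L and M summarized above can be made concrete with a short calculation. The sketch below is an illustration under stated assumptions, not the manuscript's derivation: it assumes that the number of inputs from one ensemble landing in a zone is Poisson with mean p·N·Z/L, that ensembles are independent, and that the dendrite is tiled by L/Z non-overlapping zones. The function names and the parameter values are hypothetical.

```python
import math


def p_zone_hit(p: float, n: int, zone_frac: float) -> float:
    """P(at least one input from a single ensemble lands in a given zone).

    Assumes a Poisson count with mean p * n * zone_frac, where
    zone_frac = Z / L is the zone length relative to the total arbor.
    """
    return 1.0 - math.exp(-p * n * zone_frac)


def p_grouped_convergence(p: float, n: int, zone_frac: float, m: int) -> float:
    """P(a neuron has at least one zone receiving input from all m ensembles).

    Assumes independent ensembles and 1 / zone_frac non-overlapping zones.
    """
    per_zone = p_zone_hit(p, n, zone_frac) ** m   # all m ensembles hit one zone
    num_zones = 1.0 / zone_frac                    # L / Z zones on the arbor
    return 1.0 - (1.0 - per_zone) ** num_zones


# Illustrative (assumed) values: p = 0.2, N = 100, Z = 10 um, L = 10000 um.
for m in (2, 3, 4, 5):
    prob = p_grouped_convergence(p=0.2, n=100, zone_frac=10.0 / 10000.0, m=m)
    print(f"M = {m}: P(grouped convergence) = {prob:.3g}")
```

      Running the loop shows how steeply the probability falls as the number of ensembles M grows, which is the qualitative behavior discussed in the response.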

      Sequence calculations

      “Here we estimate the likelihood of the first ensemble input arriving anywhere on the dendrite, and ask how likely it is that succeeding inputs of the sequence would arrive within a set spacing.

      Thus, the probability of occurrence of sequences that receive sequential connections (PcPOSS) from each of the M ensembles is a function of the connection probability (p) between the two layers, the number of neurons in an ensemble (N), the relative window size with respect to the total dendritic arbor (Δ/L) and the number of ensembles (M).”
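      The sequence estimate described above can be sketched in the same spirit. This is a hedged illustration rather than the authors' equation: it assumes that the first ensemble contributes on average p·N inputs anywhere on the dendrite, that each subsequent ensemble places at least one input in the next window of length Δ with probability 1 − exp(−p·N·Δ/L), and that the number of complete sequences per neuron is approximately Poisson. Function names and parameter values are hypothetical.

```python
import math


def p_sequence_convergence(p: float, n: int, window_frac: float, m: int) -> float:
    """P(a neuron receives at least one ordered sequence from m ensembles).

    window_frac = Delta / L is the spacing window relative to the arbor.
    Assumed model: each of the ~p*n inputs from the first ensemble seeds a
    sequence, which is completed if each of the remaining (m - 1) ensembles
    has at least one input in its successive window.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    p_next = 1.0 - math.exp(-p * n * window_frac)      # one follow-on ensemble
    expected_sequences = p * n * p_next ** (m - 1)      # seeds x completion
    return 1.0 - math.exp(-expected_sequences)          # Poisson: at least one


# Illustrative (assumed) values: p = 0.2, N = 1000, Delta = 10 um, L = 10000 um.
for m in (3, 4, 5):
    prob = p_sequence_convergence(p=0.2, n=1000, window_frac=10.0 / 10000.0, m=m)
    print(f"M = {m}: P(sequential convergence) = {prob:.3g}")
```

      As with grouped convergence, the probability depends only on p, N, Δ/L and M, and it decreases as the sequence length M increases.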

      (2) I wonder if the authors are being overly conservative at times. The result highlighted in the abstract is that 10/100000 postsynaptic neurons are expected to exhibit synaptic clustering. This seems like a very small number, especially if circuits are to rely on such a mechanism. However, this figure assumes the convergence of 3-5 distinct ensembles. Convergence of inputs from just 2 ensembles would be much more prevalent, but still advantageous computationally. There has been excitement in the field about experiments showing the clustering of synapses encoding even a single feature.

      We agree that short clusters of two inputs would be far more likely. We focused our analysis on clusters with three or more ensembles for the following reasons:

      (1) The signal-to-noise ratio in two-ensemble clusters is very poor, because the likelihood of noise clusters is high.

      (2) It is difficult to trigger nonlinearities with very few synaptic inputs.

      (3) At the ensemble sizes we considered (100 for clusters, 1000 for sequences), clusters arising from just two ensembles would occur with high probability on all neurons in a network (~50% in cortex; see PcFMG in the figures below). Such dense neural representations are difficult for downstream networks to decode (Foldiak 2003).

      However, in the presence of ensembles containing fewer neurons or when the connection probability between the layers is low, short clusters can result in sparse representations (Figure 2 - Supplement 2). Arguments 1 and 2 hold for short sequences as well.

      (3) The analysis supporting the claim that strong nonlinearities are needed for cluster/sequence detection is unconvincing. In the analysis, different synapse distributions on a single long dendrite are convolved with a sigmoid function and then the sum is taken to reflect the somatic response. In reality, dendritic nonlinearities influence the soma in a complex and dynamic manner. It may be that the abstract approach the authors use captures some of this, but it needs to be validated with simulations to be trusted (in line with previous work, e.g. Poirazi, Brannon & Mel, (2003)).

      We agree that multiple factors might affect the influence of nonlinearities on the soma. The key goal of our study was to understand the role played by random connectivity in giving rise to clustered computation. Since simulating a wide range of connectivity and activity patterns in a detailed biophysical model was computationally expensive, we analyzed the exemplar detailed models for nonlinearity separately (Figures 5, 6, and new Figure 8), and then used our abstract models as a proxy for understanding population dynamics. A complete analysis of the role played by morphology, channel kinetics and the effect of branching requires an in-depth study of its own, and some of these questions have already been tackled (Poirazi, Brannon, and Mel 2003; Branco, Clark, and Häusser 2010; Bhalla 2017). However, in the revision, we have implemented a single model which incorporates the range of ion-channel, synaptic and biochemical signaling nonlinearities that we discuss in the paper (Figure 8, and Figure 8 Supplements 1, 2, 3). We use this to demonstrate all three forms of sequence and grouped computation we use in the study, where the only difference is in the stimulus pattern and the separation of time-scales inherent in the stimuli.
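      The kind of abstract proxy at issue in this exchange (local synaptic drive passed through a sigmoid and summed to stand in for the somatic response) can be illustrated with a toy calculation. The sketch below is not the authors' model: the fixed non-overlapping zones, the threshold, the steepness values and the synapse positions are all assumptions chosen only to show why a strong nonlinearity is needed to separate clustered from dispersed input.

```python
import math
from collections import Counter


def sigmoid(x: float, threshold: float, steepness: float) -> float:
    """Logistic nonlinearity applied to the input count in one zone."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - threshold)))


def somatic_proxy(positions_um, dendrite_length_um=10000.0, zone_um=10.0,
                  threshold=3.0, steepness=4.0) -> float:
    """Sum of per-zone sigmoid outputs along one long dendrite (toy proxy)."""
    num_zones = int(dendrite_length_um // zone_um)
    # Count active synapses per zone; clamp the index at the dendrite end.
    counts = Counter(min(int(pos // zone_um), num_zones - 1)
                     for pos in positions_um)
    return sum(sigmoid(counts.get(zone, 0), threshold, steepness)
               for zone in range(num_zones))


clustered = [500.0, 502.0, 504.0, 506.0]      # four inputs within one zone
dispersed = [500.0, 2500.0, 5000.0, 7500.0]   # four inputs in separate zones

# Strong nonlinearity: clustered input stands out from dispersed input.
strong_gap = somatic_proxy(clustered) - somatic_proxy(dispersed)

# Weak (nearly linear) nonlinearity: the two arrangements are barely separable.
weak_gap = (somatic_proxy(clustered, steepness=0.01)
            - somatic_proxy(dispersed, steepness=0.01))

print(f"gap with strong nonlinearity: {strong_gap:.3f}")
print(f"gap with weak nonlinearity:   {weak_gap:.6f}")
```

      In this toy setting the same four inputs produce a clearly larger summed output when clustered only if the sigmoid is steep; with a shallow sigmoid the response depends almost entirely on the total input count.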

      (4) It is unclear whether some of the conclusions would hold in the presence of learning. In the signal-to-noise analysis, all synaptic strengths are assumed equal. But if synapses involved in salient clusters or sequences were potentiated, presumably detection would become easier? Similarly, if presynaptic tuning and/or timing were reorganized through learning, the conditions for synaptic arrangements to be useful could be relaxed. Answering these questions is beyond the scope of the study, but there is a caveat there nonetheless.

      We agree with the reviewer. If synapses receiving connectivity from ensembles had stronger weights, this would make detection easier. Dendritic spikes arising from clustered inputs have been implicated in local cooperative plasticity (Golding, Staff, and Spruston 2002; Losonczy, Makara, and Magee 2008). Further, plasticity-related proteins synthesized at a synapse undergoing L-LTP can diffuse to neighboring weakly co-active synapses, and thereby mediate cooperative plasticity (Harvey et al. 2008; Govindarajan, Kelleher, and Tonegawa 2006; Govindarajan et al. 2011). Thus, if clusters of synapses were likely to be co-active, they could further engage these local plasticity mechanisms, which could potentiate them while not potentiating synapses that are activated by background activity. This would depend on the activity correlation between synapses receiving ensemble inputs within a cluster and those activated by background activity. We have mentioned some of these ideas in a published opinion paper (Pulikkottil, Somashekar, and Bhalla 2021). In the current study, we wanted to understand whether interesting computations could still emerge even in the absence of specialized connection rules. Thus, we focused on asking whether clustered or sequential convergence could arise even in a purely randomly connected network, with the most basic set of assumptions. We agree that an analysis of how selectivity evolves with learning would be an interesting topic for further work.

      References

      Bhalla, Upinder S. 2017. “Synaptic Input Sequence Discrimination on Behavioral Timescales Mediated by Reaction-Diffusion Chemistry in Dendrites.” Edited by Frances K Skinner. eLife 6 (April):e25827. https://doi.org/10.7554/eLife.25827.

      Branco, Tiago, Beverley A. Clark, and Michael Häusser. 2010. “Dendritic Discrimination of Temporal Input Sequences in Cortical Neurons.” Science (New York, N.Y.) 329 (5999): 1671–75. https://doi.org/10.1126/science.1189664.

      Foldiak, Peter. 2003. “Sparse Coding in the Primate Cortex.” The Handbook of Brain Theory and Neural Networks. https://research-repository.st-andrews.ac.uk/bitstream/handle/10023/2994/FoldiakSparseHBTNN2e02.pdf?sequence=1.

      Golding, Nace L., Nathan P. Staff, and Nelson Spruston. 2002. “Dendritic Spikes as a Mechanism for Cooperative Long-Term Potentiation.” Nature 418 (6895): 326–31. https://doi.org/10.1038/nature00854.

      Govindarajan, Arvind, Inbal Israely, Shu-Ying Huang, and Susumu Tonegawa. 2011. “The Dendritic Branch Is the Preferred Integrative Unit for Protein Synthesis-Dependent LTP.” Neuron 69 (1): 132–46. https://doi.org/10.1016/j.neuron.2010.12.008.

      Govindarajan, Arvind, Raymond J. Kelleher, and Susumu Tonegawa. 2006. “A Clustered Plasticity Model of Long-Term Memory Engrams.” Nature Reviews Neuroscience 7 (7): 575–83. https://doi.org/10.1038/nrn1937.

      Harvey, Christopher D., Ryohei Yasuda, Haining Zhong, and Karel Svoboda. 2008. “The Spread of Ras Activity Triggered by Activation of a Single Dendritic Spine.” Science (New York, N.Y.) 321 (5885): 136–40. https://doi.org/10.1126/science.1159675.

      Losonczy, Attila, Judit K. Makara, and Jeffrey C. Magee. 2008. “Compartmentalized Dendritic Plasticity and Input Feature Storage in Neurons.” Nature 452 (7186): 436–41. https://doi.org/10.1038/nature06725.

      Poirazi, Panayiota, Terrence Brannon, and Bartlett W. Mel. 2003. “Pyramidal Neuron as Two-Layer Neural Network.” Neuron 37 (6): 989–99. https://doi.org/10.1016/S0896-6273(03)00149-1.

      Pulikkottil, Vinu Varghese, Bhanu Priya Somashekar, and Upinder S. Bhalla. 2021. “Computation, Wiring, and Plasticity in Synaptic Clusters.” Current Opinion in Neurobiology, Computational Neuroscience, 70 (October):101–12. https://doi.org/10.1016/j.conb.2021.08.001.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This useful manuscript reports mechanisms behind the increase in fecundity in response to sub-lethal doses of pesticides in the crop pest, the brown plant hopper. The authors hypothesize that the pesticide works by inducing the JH titer, which through the JH signaling pathway induces egg development. Evidence for this is, however, inadequate.

      We greatly appreciate your valuable comments and constructive suggestions for our work. The manuscript has been carefully edited and improved following your suggestions. We also provide more evidence to support our statements from new experiments. First, we found that EB treatment of adult females also stimulates egg-laying. Second, EB treatment of female adults increases the number of mature eggs in the ovary and ovarioles. Third, EB treatment of females enhances the expression of the kr-h1 gene in the whole body of BPH. Finally, EB treatment of female adults increases the JHIII titer, but has no impact on the 20E titer.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Gao et al. have demonstrated that the pesticide emamectin benzoate (EB) treatment of brown planthopper (BPH) leads to increased egg-laying in the insect, which is a common agricultural pest. The authors hypothesize that EB upregulates JH titer resulting in increased fecundity.

      Strengths:

      The finding that a class of pesticide increases the fecundity of brown planthopper is interesting.

      We greatly appreciate your positive comments on our work.

      Weaknesses:

      (1) EB is an allosteric modulator of GluCl. That means EB physically interacts with GluCl, initiating a structural change in the channel protein. Yet the authors' central hypothesis here is about how EB can upregulate the mRNA of GluCl. I do not know whether there is any evidence that an allosteric modulator can function as a transcriptional activator for the same receptor protein. The basic premise of the paper sounds counterintuitive. This is a structural problem and should be addressed by the authors by giving sufficient evidence about such demonstrated mechanisms before.

      Thank you for your question. As the reviewer points out, EB physically interacts with its target protein GluCl and thus affects its downstream signaling pathway. In the manuscript, we reported that EB-treated brown planthoppers display increased expression of GluCl in the adult stage (Fig. 5A). Indeed, many studies show that insecticide treatment can increase the expression of the insecticide's target gene. For example, the relative expression level of the ryanodine receptor gene of the rice stem borer, Chilo suppressalis, increased 10-fold after treatment with chlorantraniliprole, an insecticide which targets the ryanodine receptor (Peng et al., 2017). In addition, in Drosophila, starvation (and low insulin) elevates the transcription level of the sNPF and tachykinin receptors (Ko et al., 2015; Root et al., 2011). In brown planthoppers, reduction in mRNA and protein expression of a nicotinic acetylcholine receptor α8 subunit is associated with resistance to imidacloprid (Zhang et al., 2015). RNA interference knockdown of the α8 gene decreased the sensitivity of N. lugens to imidacloprid (Zhang et al., 2015). Hence, expression of receptor genes can be regulated by diverse factors, including insecticide treatment. In our case, we found that EB can upregulate its target gene GluCl. However, we did not claim that EB functions as a transcriptional activator for GluCl, and we still do not know why EB treatment changes the expression of GluCl in the brown planthopper. Considering that our experiments last several days, the change might be an indirect (or secondary) effect caused by other factors that alter GluCl expression after EB acts on the channel. One possibility is that the allosteric interaction of EB with GluCl makes the channel dysfunctional, and the cellular response is to upregulate the channel/receptor to compensate. We have inserted text on lines 738-757 to explain these possibilities.

      (2) I am surprised to see a 4th instar larval application or treatment with EB results in the upregulation of JH in the adult stages. Complicating the results further is the observation that a 4th instar EB application results in an immediate decrease in JH titer. There is a high possibility that this late JH titer increase is an indirect effect.

      Thank you for your question. Treatment with low or sublethal doses of insecticides can have a strong and complex impact on insects (Gandara et al., 2024; Gong et al., 2022; Li et al., 2023; Martelli et al., 2022). We kept 4th instar brown planthoppers feeding on EB for four days. After four days of treatment they develop to the 5th instar, which is the final nymphal stage of BPH. Since the brown planthopper is a hemimetabolous insect, we cannot rule out the possibility that the upregulation of JH in the adult stage is an indirect effect of EB treatment. In the revised manuscript, we therefore investigated the impact of EB treatment in the adult stage. We found that female adults treated with EB also laid more eggs than controls (Figure 1-figure supplement 1A). The following experiments were performed in adults to address how EB treatment stimulates egg-laying in adult brown planthoppers.

      (1) We found that EB treatment in adults increases the number of mature eggs in the ovary (new Figure 2-figure supplement 1). We have added these results on lines 234-238 and 281-285.

      (2) We measured the JH titer after the female adults had been treated with EB. We found that EB also increases the JH titer but has no impact on the 20E titer in the female adult (Figure 3-S3A and B). We have added these results on lines 351-356 and 281-285.

      (3) EB treatment in adults increases the gene expression of JHAMT and Kr-h1 (Figure 3-S3C and D). We have added these results on lines 378-379, lines 387-390 and lines 457-462.

      (3) The writing quality of the paper needs improvement. Particularly with respect to describing processes and abbreviations. In several instances the authors have not adequately described the processes they have introduced, thus confusing readers.

      Thank you for your suggestion. We have thoroughly revised the paper to improve clarity.

      (4) In the section 'EB promotes ovarian development' the authors have shown that EB treatment results in increased detention of eggs which contradicts their own results which show that EB promotes egg laying. Again, this is a serious contradiction that nullifies their hypothesis.

      Thank you for pointing this out. We revised Figure 2B to show the number of mature eggs in the ovary. The number of mature eggs in the ovaries of females that fed on EB was higher than in control females. We also show that BPH fed with EB laid more eggs than controls (Figure 1 and Table S1). Thus, our results suggest that EB treatment promotes ovary maturation (and egg production) and also increases egg laying. We have added these results on lines 234-238.

      (5) Furthermore, the results suggest that oogenesis is not affected by EB application. The authors should devote a section to discussing how they are observing increased egg numbers in EB-treated insects while not impacting Oogenesis.

      Thank you for your suggestions, and apologies for the lack of clarity in our initial explanation. First, we found that EB treatment led to an increase in the number of eggs laid by female brown planthoppers (Figure 1). Through dissection experiments, we observed that EB-treated females had more mature eggs in their ovaries (Figure 2A and B), indicating that the increased egg-laying was due to a larger production of mature eggs in the ovaries after EB treatment. This is now explained on lines 229-238.

      Additionally, since there is no systematic description of oogenesis in the brown planthopper, we were the first to observe the oogenesis process in this species using immunohistochemistry and laser confocal microscopy. Based on the developmental characteristics, we defined the different stages of oogenesis (Figure 2C, Figure 2-figure supplement 2). We did not observe any significant effect of EB treatment on the various stages of oogenesis, indicating that EB treatment does not impair normal egg development (Figure 2D). Instead, the increase in vitellogenin accelerates the production of mature eggs. This is now explained on lines 243-262.

      During the maturation process, eggs require uptake of vitellogenin, and an increase in vitellogenin (Vg) content can accelerate egg maturation, producing more mature eggs. Our molecular data suggest that EB treatment leads to an upregulation of vg expression. Based on these findings, we conclude that the increase in egg-laying caused by EB treatment is due to the upregulation of vg (Figure 3I), which raises vitellogenin content, promoting the uptake of vitellogenin by maturing eggs and resulting in the production of more mature eggs. We have revised the text on lines 389-395 to clarify this point.

      (6) Met is the receptor of JH and to my understanding, remains mostly constant in terms of its mRNA or protein levels throughout various developmental periods in many different insects. Therefore, the presence of JH becomes the major driving factor for physiological events and not the presence of the receptor Met. Here the authors have demonstrated an increase in Met mRNA as a result of EB treatment. Their central hypothesis is that EB increases JH titer to result in enhanced fecundity. JH action will not result in the activation of Met. Although not contradictory to the hypothesis, the increase in mRNA content of Met is contrary to the findings of the JH field thus far.

      Thank you for your comment. Our results showed that EB treatment mildly increases (about 2-fold) expression of the Met gene in brown planthoppers (Figure 3G). Our data also indicated that Met and FAMeT expression levels were less strongly affected by EB than those of kr-h1 and vg (Figure 3H and I). We agree that JH action will not result in an increase of Met. However, we cannot rule out the possibility that other factors induced by EB treatment (indirect effects) increase the mRNA expression level of Met. One recent paper reported that downregulation of the transcription factor CncC increases met expression in beetles (see Figure 6A in Jiang et al., 2023). Many studies have reported that insecticide treatment activates the CncC signaling pathway, which regulates detoxification gene expression (Amezian et al., 2023; Fu et al., 2024; Hu et al., 2021). Hence, it is possible that EB influences the CncC pathway, which then induces met expression. This EB effect on met upregulation may be similar to the upregulation of GluCl and some other secondary effects. We have discussed this on lines 725-738.

      (7) As pointed out before, it is hard to rationalize how a 4th instar exposure to EB can result in the upregulation of key genes involved in JH synthesis at the adult stage. The authors must consider providing a plausible explanation and discussion in this regard.

      Thank you for your comments. Although we first exposed the BPH to EB at the 4th instar, we let the insects feed on the EB-treated rice plants for four days. After that, the insects develop into the 5th instar, the final nymphal stage of the brown planthopper. Since brown planthoppers do not have a pupal stage, the EB taken up by the nymphs may persist for a longer time, even into the adult stage. In addition, we found that EB treatment increases the weight of adult females (Figure 1-figure supplement 3E and F), which indicates that EB might increase food intake in BPHs, which in turn might lead to the production of more insulin peptide. Insulin might then increase JH synthesis at the adult stage. In our revised study we also investigated the effects of EB in adult BPHs. We found that, similar to the nymphal stage, EB treatment in adult BPHs also increases egg laying. Furthermore, the JH titer was increased after treatment of adult BPHs with EB, and the GluCl and kr-h1 genes were also up-regulated. We have discussed this on lines 739-746.

      (8) I have strong reservations against such an irrational hypothesis that Met (the receptor for JH) and JH-Met target gene Kr-h1 regulate JH titer (Line 311, Fig 3 supplemental 2D). This would be the first report of such an event on the JH field and therefore must be analysed in depth. I strongly suggest the authors remove such claims from the manuscript without substantiating it.

      Thank you for your suggestions and comments. We have changed our claims in this revised MS. We found that EB treatment can enhance Kr-h1 expression. We have no evidence to support that JH can induce met expression. We have rewritten the manuscript to avoid confusion (see text on lines 725-735).

      (9) Kr-h1 is JH/Met target gene. The authors demonstrate that silencing of Kr-h1 results in inhibition of FAMeT, which is a gene involved in JH synthesis. A feedback loop in JH synthesis is unreported. It is the view of this reviewer that the authors must go ahead with a mechanistic detail of Kr-h1 mediated JH upregulation before this can be concluded. Mere qPCR experiments are not sufficient to substantiate a claim that is completely contrary to the current understanding of the JH signalling pathway.

      Thank you for your suggestions and comments. We agree that qPCR experiments alone are not enough to support this kind of claim, and that more evidence is needed. We have revised the MS to avoid confusion (see text on lines 725-735).

      (10) The authors have performed knockdowns of JHAMT, Met, and Kr-h1 to demonstrate the effect of these factors on fecundity in BPH. Additionally, they have performed rescue experiments with EB application on these knockdown insects (Figure 3K-M). This, I believe, is a very flawed experiment. The authors demonstrate EB works through JHAMT in upregulating JH titer. In the absence of JHAMT, EB application is not expected to rescue the phenotype. But the authors have reported a complete rescue here. In the absence of Met, the receptor of JH, either EB or JH is not expected to rescue the phenotype. But a complete rescue has been reported. These two experimental results contradict their own hypothesis.

      Thank you for your comments. We think this rescue is possible because gene knockdown by dsRNA injection is incomplete, and residual gene expression allows for EB action. It is not a total knockout; these genes still have a low level of expression in the dsRNA-injected insects. Since EB can upregulate the expression of JHAMT, Met, and Kr-h1, it is reasonable that EB treatment can compensate for the knockdown of these three genes and fully rescue fecundity. We have clarified this on lines 411-413.

      (11) A significant section of the paper deals with how EB upregulates JH titer. JH is a hormone synthesized in the Corpora Allata. Yet the authors have chosen to use the whole body for all of their experiment. Changes in the whole body for mRNA of those enzymes involved in JH synthesis may not reflect the situation in Corpora Allata. Although working with Corpora Allata is challenging, discarding the abdomen and thorax region and working with the head and neck region of the insect is easily doable. Results from such sampling are always more convincing when it comes to JH synthesis studies.

      Thank you for your suggestions. The head is very difficult to separate from the thorax region in brown planthoppers, as you can see in Author response image 1. We are now trying to determine how EB regulates JH synthesis using Drosophila as a model.

      Author response image 1.

      The brown planthopper

      (12) The phenomenon reported was specific to BPH and not found in other insects. This limits the implications of the study.

      Thank you for your comments. The brown planthopper is a serious insect pest of rice in Asia. Our findings can guide the use of this insecticide in the field. In addition, our findings indicate that EB, which targets GluCl, can alter the JH titer. This has new implications for how a neuronal system influences the JH signaling pathway. We will further investigate how EB influences JH in the future and will use Drosophila as a model to study the molecular mechanisms.

      (13) Overall, the molecular experiments are very poorly designed and can at best be termed superficial. There are several contradictions within the paper and no discussion or explanation has been provided for that.

      Thank you for your comments. We have revised the paper according to your suggestions and added further explanation of our results in the discussion, and we hope the conclusions are better supported in the new version. We have discussed this on lines 725-746 and 778-799.

      Reviewer #2 (Public Review):

      The brown plant hopper (BPH) is a notorious crop pest and pesticides are the most widespread means of controlling its population. This manuscript shows that in response to sublethal doses of the pesticide (EB), BPH females show enhanced fecundity. This is in keeping with field reports of population resurgence post-pesticide treatment. The authors work out the mechanism behind this increase in fecundity. They show that in response to EB exposure, the expression of its target receptor, GluCl, increases. This, they show, results in an increase in the expression of genes that regulate the synthesis of juvenile hormone (JH) and JH itself, which, in turn, results in enhanced egg-production and egg-laying. Interestingly, these effects of EB exposure are species-specific, as the authors report that other species of plant hoppers either don't show enhanced fecundity or show reduced fecundity. As the authors point out, it is unclear how an increase in GluCl levels could result in increased JH regulatory genes.

      We greatly appreciate your valuable comments and constructive suggestions on our work. In future work, we will try to determine how EB interacts with its molecular target GluCl and then increases the expression of JH regulatory genes, using Drosophila as a model.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Overall, the molecular experiments are very poorly designed and can at best be termed superficial. There are several contradictions within the paper and no discussion or explanation has been provided for that.

      The authors should consider a thorough revision.

      Thank you for your comments. We have thoroughly revised the paper according to your suggestions and added further experiments and explanations of our results in the discussion parts.

      Reviewer #2 (Recommendations For The Authors):

      It would help the reader to have more schematics along with the figures. The final figure is helpful, but knowing the JH pathway, and where it acts would help with the interpretations as one reads the manuscript and the figures. The pathways represented in 4N or 5J are helpful but could be improved upon for better presentation.

      It would be nice to have some discussion on how the authors think EB exposure results in an increase in GluCl expression, and how that in turn affects the expression of so many genes.

      Thank you for your comments. We have thoroughly revised the paper according to your suggestions and added further experiments, as well as explanations in the discussion of how we think EB exposure results in an increase in JH titer and in the expression of other genes. We have added the text on lines 753-761.

      References

      Amezian, D., Fricaux, T., de Sousa, G., Maiwald, F., Huditz, H.-I., Nauen, R., Le Goff, G., 2023. Investigating the role of the ROS/CncC signaling pathway in the response to xenobiotics in Spodoptera frugiperda using Sf9 cells. Pesticide Biochemistry and Physiology 195, 105563.

      Fu, B., Liang, J., Hu, J., Du, T., Tan, Q., He, C., Wei, X., Gong, P., Yang, J., Liu, S., Huang, M., Gui, L., Liu, K., Zhou, X., Nauen, R., Bass, C., Yang, X., Zhang, Y., 2024. GPCR–MAPK signaling pathways underpin fitness trade-offs in whitefly. Proceedings of the National Academy of Sciences 121, e2402407121.

      Gandara, L., Jacoby, R., Laurent, F., Spatuzzi, M., Vlachopoulos, N., Borst, N.O., Ekmen, G., Potel, C.M., Garrido-Rodriguez, M., Böhmert, A.L., Misunou, N., Bartmanski, B.J., Li, X.C., Kutra, D., Hériché, J.-K., Tischer, C., Zimmermann-Kogadeeva, M., Ingham, V.A., Savitski, M.M., Masson, J.-B., Zimmermann, M., Crocker, J., 2024. Pervasive sublethal effects of agrochemicals on insects at environmentally relevant concentrations. Science 386, 446-453.

      Gong, Y., Cheng, S., Desneux, N., Gao, X., Xiu, X., Wang, F., Hou, M., 2022. Transgenerational hormesis effects of nitenpyram on fitness and insecticide tolerance/resistance of Nilaparvata lugens. Journal of Pest Science.

      Hu, B., Huang, H., Hu, S., Ren, M., Wei, Q., Tian, X., Esmail Abdalla Elzaki, M., Bass, C., Su, J., Reddy Palli, S., 2021. Changes in both trans- and cis-regulatory elements mediate insecticide resistance in a lepidopteron pest, Spodoptera exigua. PLOS Genetics 17, e1009403.

      Jiang, H., Meng, X., Zhang, N., Ge, H., Wei, J., Qian, K., Zheng, Y., Park, Y., Reddy Palli, S., Wang, J., 2023. The pleiotropic AMPK–CncC signaling pathway regulates the trade-off between detoxification and reproduction. Proceedings of the National Academy of Sciences 120, e2214038120.

      Ko, K.I., Root, C.M., Lindsay, S.A., Zaninovich, O.A., Shepherd, A.K., Wasserman, S.A., Kim, S.M., Wang, J.W., 2015. Starvation promotes concerted modulation of appetitive olfactory behavior via parallel neuromodulatory circuits. eLife 4, e08298.

      Li, Z., Wang, Y., Qin, Q., Chen, L., Dang, X., Ma, Z., Zhou, Z., 2023. Imidacloprid disrupts larval molting regulation and nutrient energy metabolism, causing developmental delay in honey bee Apis mellifera. eLife

      Martelli, F., Hernandes, N.H., Zuo, Z., Wang, J., Wong, C.-O., Karagas, N.E., Roessner, U., Rupasinghe, T., Robin, C., Venkatachalam, K., Perry, T., Batterham, P., Bellen, H.J., 2022. Low doses of the organic insecticide spinosad trigger lysosomal defects, elevated ROS, lipid dysregulation, and neurodegeneration in flies. eLife 11, e73812.

      Peng, Y.C., Sheng, C.W., Casida, J.E., Zhao, C.Q., Han, Z.J., 2017. Ryanodine receptor genes of the rice stem borer, Chilo suppressalis: Molecular cloning, alternative splicing and expression profiling. Pestic. Biochem. Physiol. 135, 69-77.

      Root, Cory M., Ko, Kang I., Jafari, A., Wang, Jing W., 2011. Presynaptic facilitation by neuropeptide signaling mediates odor-driven food search. Cell 145, 133-144.

      Zhang, Y., Wang, X., Yang, B., Hu, Y., Huang, L., Bass, C., Liu, Z., 2015. Reduction in mRNA and protein expression of a nicotinic acetylcholine receptor α8 subunit is associated with resistance to imidacloprid in the brown planthopper, Nilaparvata lugens. Journal of Neurochemistry 135, 686-694.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This is an important study that leverages a human-chimpanzee tetraploid iPSC model to test whether cis-regulatory divergence between species tends to be cell type-specific. The evidence supporting the study's primary conclusion--that species differences in gene regulation are enriched in cell type-specific genes and regulatory elements--is compelling, although attention to biases introduced by sequence conservation is merited, and the case that is made for cell type-specific changes reflecting adaptive evolution is incomplete. This work will be of broad interest in evolutionary and functional genomics.

      Public Reviews:

      Reviewer #1 (Public Review):

      This study aims to identify gene expression differences exclusively caused by cis-regulatory genetic changes by utilizing hybrid cell lines derived from human and chimpanzee. While previous attempts have focused on specific tissues, this study expands the comparison to six different tissues to investigate tissue specificity and derive insights into the evolution of gene expression.

      One notable strength of this work lies in the use of composite cell lines, enabling a comparison of gene expression between human and chimpanzee within the same nucleus and shared trans factors environment. However, a potential weakness of the methodology is the use of bulk RNA-seq in diverse tissues, which limits the ability to determine cell-type-specific gene expression and chromatin accessibility regions.

      We agree that profiling single cells could lead to additional exciting discoveries. Although heterogeneity in cell types within samples will indeed reduce our power to detect cell-type-specific divergence, thankfully any heterogeneity will not introduce false positives, since our use of interspecies hybrids controls for differences in cell-type abundance. As a result, we think that the molecular differences we identified in this study represent a subset of the true cell-type specific cis-regulatory differences that would be identified with deep single-cell profiling. We have included a new paragraph in the discussion on future directions, highlighting the utility of single-cell profiling as an exciting future direction (lines 482-490): “In addition to following up on our findings on GAD1 and FABP7, there are other exciting future directions for this work. First, additional bulk assays such as those that measure methylation, chromatin conformation, and translation rate could lead to a better understanding of what molecular features ultimately lead to cell type-specific changes in gene expression. Furthermore, the use of deep single cell profiling of hybrid lines derived from iPSCs from multiple individuals of each species during differentiation could enable the identification of many more highly context-specific changes in gene expression and chromatin accessibility such as the differences in GAD1 we highlighted here. Finally, integration with data from massively parallel reporter assays and deep learning models will help us link specific variants to the molecular differences we identified in this study.”

      Another concern is the use of two replicates derived from the same pair of individuals. While the authors produced cell lines from two pairs of individuals in a previous study (Agoglia et al., 2021), I wonder why only one pair was used in this study. Incorporating interindividual variation would enhance the robustness of the species differences identified here.

      We agree that additional replicates, especially from lines from other individuals, would have improved the robustness of the species differences we identified. In our experience with these hybrid cells (as well as related work from many other labs), inter-species differences typically have much larger magnitudes than intra-species differences, so we expect that the vast majority of differences we identified would be validated with data from additional individuals. Unfortunately, differentiating additional cells and generating these data for this study would be cost-prohibitive. We now mention the use of additional replicates in lines 485-488 of the discussion: “Furthermore, the use of deep single cell profiling of hybrid lines derived from iPSCs from multiple individuals of each species during differentiation could enable the identification of many more highly context-specific changes in gene expression and chromatin accessibility such as the differences in GAD1 we highlighted here.”

      Furthermore, the study offers the opportunity to relate inter-species differences to trends in molecular evolution. The authors discovered that expression variance and haploinsufficiency score do not fully account for the enrichment of divergence in cell-type-specific genes. The reviewer suggests exploring this further by incorporating external datasets that bin genes based on interindividual transcriptomics variation as a measure of extant transcriptomics constraint (e.g., GTEx reanalysis by Garcia-Perez et al., 2023 - PMID: 36777183). Additionally, stratifying sequence conservation on ASCA regions, which exhibit similar enrichment of cell-type-specific features, using the Zoonomia data mentioned also in the text (Andrews et al., 2023 -- PMID: 37104580) could provide valuable insights.

      To address this, we used PhastCons scores computed from a 470-way alignment of mammals, as we could not find publicly available PhastCons data from Zoonomia. When stratifying by the median PhastCons score of all sites in a peak, we observe very similar results to those obtained when stratifying by the constraint metric from the gnomAD consortium (see below). The one potential difference is that peaks in the top two bins have slightly weaker enrichment relative to the other bins when using PhastCons, but this is not the case when using gnomAD’s metric. We have elected to include this in the public review but not the manuscript as we are reluctant to add to the complexity of what is already a complex analysis.

      Author response image 1.

      Finally, we think that comparisons of the properties of gene expression variance computed from ASE (as done by Starr et al.) and total expression (as done by Garcia-Perez et al.) is a very interesting, potentially complex question that is beyond the scope of this paper but an exciting direction for future work.

      Another potential strength of this study is the identification of specific cases of paired allele-specific expression (ASE) and allele-specific chromatin accessibility (ASCA) with biological significance.

      Prioritizing specific variants remains a challenge, and the authors apply a machine-learning approach to identify potential causative variants that disrupt binding sites in two examples (FABP7 and GAD1 in motor neurons). However, additional work is needed to convincingly demonstrate the functionality of these selected variants. Strengthening this section with additional validation of ASE, ASCA, and the specific putative causal variants identified would enhance the overall robustness of the paper.

      We strongly agree with the reviewer that additional work validating our results would be of considerable interest. We hope to perform follow-up experiments in the future. For now, we have been careful to present these variants only as candidate causal variants.

      Additionally, the authors support the selected ASE-ASCA pairs by examining external datasets of adult brain comparative genomics (Ma et al., 2022) and organoids (Kanton et al., 2019). While these resources are valuable for comparing observed species biases, the analysis is not systematic, even for the two selected genes. For example, it would be beneficial to investigate if FABP7 exhibits species bias in any cell type in Kanton et al.'s organoids or if GAD1 is species-biased in adult primate brains from Ma et al. Comparing these datasets with the present study, along with the Agoglia et al. reference, would provide a more comprehensive perspective.

      We agree with the reviewer’s suggestion that investigating GAD1 and FABP7 expression in other datasets is worthwhile. Unfortunately, the difference in human vs. chimpanzee organoid maturation rates and effects of culture conditions in Kanton et al. makes it unsuitable for plotting the expression of FABP7 as its expression is highly dependent on neuronal maturation. We therefore plotted bulk RNAseq data from multiple cortical regions from Sousa et al. 2017 (see below). This corroborates our claim that FABP7 has human-biased expression in adult humans compared to chimpanzees and rhesus macaques. We also investigated expression of GAD1 in the Ma et al. data as the reviewer suggested.

      Author response image 2.

      While there are differences in GAD1 expression between adult humans and chimpanzees, they are unlikely to be linked to the HAR we highlight as it is likely a transiently active cis-regulatory element (see below). In addition, some cell types seem to have chimpanzee-derived changes in GAD1 expression (e.g. SST positive neurons) whereas others seem to have human-derived changes in GAD1 expression (e.g. LAMP5 positive neurons).

      Author response image 3.

      While these are potentially interesting observations, we think that their inclusion in the manuscript might distract from our emphasis on the cell type-specific and developmental stage-specific nature of the changes in FABP7 and GAD1 expression we observe, so we have not included them in the manuscript.

      The use of the term "human-derived" in ASE and ASCA should be avoided since there is no outgroup in the analysis to provide a reference for the observed changes.

      We agree with the reviewer that the term human-derived should be used with care and have changed the phrasing of line 230 to “human-chimpanzee differences in expression”. With regard to FABP7 we think that our analysis of the Ma et al. data—which includes data from rhesus macaques as an outgroup—justifies our use of “human-derived” in lines 360 and 457. As chimpanzee and macaque expression of FABP7 are similar but human expression is quite different, the most parsimonious explanation for our observations is that FABP7 upregulation occurred in the human lineage.

      Finally, throughout the paper, the authors refer to "hybrid cell lines." It has been suggested to use the term "composite cell lines" instead to address potential societal concerns associated with the term "hybrid," which some may associate with reproductive relationships (Pavlovic et al., 2022 -- PMID: 35082442). It would be interesting to know the authors' perspective on these concerns and recommendations presented in Pavlovic et al., given their position as pioneers in this field.

      We appreciate this question. Whether to refer to our fused cells as “hybrids” or not was indeed a question we considered at great length, starting from the very beginning of this project in 2015. From consultations with multiple bioethicists-- both formal and informal-- we have long been aware of the possibility of misunderstanding based on the word “hybrid”. However, we felt this possibility was outweighed by the long and well-established history of other scientists referring to interspecies fused cells as hybrids. This convention-- which is based on hundreds of papers about heterokaryons, somatic cell hybrids, and radiation hybrids-- goes back over 50 years (e.g. Bolund et al, Exp Cell Res 1969). Soon after the establishment of this nomenclature, cell fusion became widespread and ever since then it has become commonplace to generate interspecies hybrid cells from animals, plants and fungi.

      It is also important to note that in over two years since we published the first two papers on human-chimpanzee fused cells, we have been unable to find any misunderstanding of our use of the term “hybrid”. We have searched blogs, media articles, and social media, all with no evidence of misunderstanding. Therefore, in the current manuscript, rather than creating confusion by renaming a well-established approach, we have opted to clearly and prominently define hybrid cells: in the abstract of our paper we introduce the hybrid cells as “the product of fusing induced pluripotent stem (iPS) cells of each species in vitro.”

      Reviewer #2 (Public Review):

      In this paper, Wang and colleagues build on previous technical and analytical achievements in establishing tetraploid human-chimpanzee hybrid iPSCs to investigate the cell type-specificity of allele-specific expression and allele-specific chromatin accessibility across six differentiated cell types (here, "allele-specific" indicates species differences with a cis-regulatory basis). The combined body of work is remarkable in its creativity and ambition and has real potential for overcoming major challenges in understanding the evolutionary genetics of between-species differences. The present paper contributes to these efforts by showing how differentiated cells can be used to test a long-standing hypothesis in evolutionary genetics: that cis-regulatory changes may be particularly important in divergence because of their potential for modularity.

      In my view, the paper succeeds in making this case: allele (species)-specific expression (ASE) and allele-specific chromatin accessibility (ASCA) are enriched in genes asymmetrically expressed in one cell type, and many cases of ASE/ASCA are cell type-specific. The authors do an excellent job showing that these results are robust across a set of possible analysis decisions. It is somewhat less clear whether these enrichments are primarily a product of relaxed constraint on cell type-specific genes or primarily result from positive selection in the human or chimp lineage. While the authors attempt to control for constraint using several variables (variance in ASE in humans and the sequence-based probability of haploinsufficiency score, pHI), these are imperfect proxies for constraint. For the pHI scores, enrichments for ASE also appear to be strongest in the least constrained genes. Overall, the relative role of relaxation of constraint versus positive selection is unresolved, although the manuscript's language leans in favor of an important role for selection.

      We agree with the reviewer and apologize for the wording that indeed focused more on positive selection than relaxed constraint. We have added language clarifying that our stance is that our analyses suggest some role for positive selection, but that we do not claim that positive selection plays a larger role than reduced constraint (lines 432-437): “Overall, this suggests that broad changes in expression in cell type-specifically expressed genes may be an important substrate for evolution but it remains unclear whether positive selection or lower constraint plays a larger role in driving the faster evolution of more cell type-specifically expressed genes. Future work will be required to more precisely quantify the relative roles of positive selection and evolutionary constraint in driving changes in gene expression.”

      The remainder of the manuscript draws on the cell type-specific ASE/ASCA data to nominate candidate genes and pathways that may have been important in differentiating humans and chimpanzees. Several approaches are used here, including comparing human-chimp ASE to the distribution of ASE observed in humans and investigating biases in the direction of ASE for genes in the same pathway. The authors also identify interesting candidate genes based on their role in development or their proximity to human accelerated regions (where many changes have arisen on the human lineage in otherwise deeply conserved sequence) and use a deep neural network to identify sequence changes that might be causally responsible for ASE/ASCA. These analyses have value and highlight potential strategies for using ASE/ASCA and hybrid cell line data as a hypothesis-generating tool. Of course, the functional follow-up that experimentally tested these hypotheses or linked sequence/expression changes in the candidate pathways to organismal phenotype would have strengthened the paper further- but this is a lot to ask in an already technically and analytically challenging piece of work.

      We thank the reviewer for the kind words and strongly agree that follow-up experiments and orthogonal analyses will be key in validating our results and establishing links to human-specific phenotypes.

      As a minor critique, the present paper is very closely integrated with other manuscripts that have used the hybrid human-chimp cell lines for biological insight or methods development. Although its contributions make it a strong stand-alone contribution, some aspects of the methods are not described in sufficient detail for readers to understand (even on a general conceptual level) without referencing that work, which may somewhat limit reader understanding.

      We agree with the points the reviewer raises regarding the clarity of our methods. We have amended several sections to provide more conceptual information while pointing the reader to other publications for the technical details. For convenience, we include the text here as well as in the new draft.

      Lines 207-214 now provide more intuition for the method used to detect lineage-specific selection: “Next, we sought to use our RNA-seq data to identify instances of lineage-specific selection. In the absence of positive selection, one would expect that an approximately equal number of genes in a pathway would have human-biased vs. chimpanzee-biased ASE. Significant deviation from this expectation (as determined by the binomial test) rejects the null hypothesis of neutral evolution, instead providing evidence of lineage-specific selection on this pathway. Using our previously published modification of this test that incorporates a tissue-specific measure of constraint on gene expression, we detected several signals of lineage-specific selection, some of which were cell type-specific (Starr et al., 2023, Additional file 2).” This is also reflected in the Methods in lines 729-731: “Positive selection on a gene set is only inferred if there is statistically significant human- or chimpanzee-biased ASE in that gene set (using an FDR-corrected p-value from the binomial test).”
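      The sign-bias logic quoted above can be illustrated with a minimal sketch. It uses only the plain two-sided binomial test followed by Benjamini-Hochberg correction; it does not reproduce the authors' published modification that incorporates a tissue-specific measure of constraint. The pathway names and counts are hypothetical.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: sum the probabilities of all
    outcomes that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts per pathway:
# (genes with human-biased ASE, genes with chimpanzee-biased ASE)
pathways = {"pathway_A": (18, 4), "pathway_B": (10, 11), "pathway_C": (3, 15)}

names = list(pathways)
pvals = [binomial_two_sided_p(h, h + c) for h, c in pathways.values()]
fdrs = benjamini_hochberg(pvals)
for name, p, q in zip(names, pvals, fdrs):
    print(f"{name}: p={p:.4g}, FDR={q:.4g}")
```

      Under the null of no lineage-specific selection, the human-biased count is expected to be close to half of the genes with ASE in the pathway, so a small FDR-corrected p-value flags a pathway with a directional excess.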

      Reviewer #3 (Public Review):

      The authors utilize chimpanzee-human hybrid cell lines to assess cis-regulatory evolution. These hybrid cell lines offer a well-controlled environment, enabling clear differentiation between cis-regulatory effects and environmental or other trans effects.

      In their research, Wang et al. expand the range of chimpanzee-human hybrid cell lines to encompass six new developmental cell types derived from all three germ layers. This expansion allows them to discern cell type-specific cis-regulatory changes between species from more pleiotropic ones. Although the study investigates only two iPSC clones, the RNA- and ATAC-seq data produced for this paper is a valuable resource.

      The authors begin their analysis by examining the relationship between allele-specific expression (ASE) as a measure of species divergence and cell type specificity. They find that cell-type-specific genes exhibit more divergent expression. By integrating this data with measures of constraint within human populations, the authors conclude that the increased divergence of tissue-specific genes is, at least in part, attributable to positive selection. A similar pattern emerges when assessing allele-specific chromatin accessibility (ASCA) as a measure of divergence of cis-regulatory elements (CREs) in the same cell lines.

      By correlating these two measures, the authors identify 95 CRE-gene pairs where tissue-specific ASE aligns with tissue-specific ASCA. Among these pairs, the authors select two genes of interest for further investigation. Notably, the authors employ an intriguing machine-learning approach in which they compare the inferred chromatin state of the human sequence with that of the chimpanzee sequence to pinpoint putatively causal variants.

      Overall, this study delves into the examination of gene expression and chromatin accessibility within hybrid cell lines, showcasing how this data can be leveraged to identify potential causal sequence differences underlying between-species expression changes.

      We appreciate this assessment.

      I have three major concerns regarding this study:

      1. The only evidence that the cells are indeed differentiated in the right direction is the expression of one prominent marker gene per cell type. Especially for the comparison of conservation between the differentiated cell types, it would be beneficial to describe the cell type diversity and the differentiation success in more detail.

      We appreciate this assessment. We agree that evidence beyond a single marker gene is necessary to demonstrate that the differentiations were successful and that a discussion of the limitations of these differentiations in the manuscript is worthwhile. We included figures showing additional marker genes and a thorough discussion of the differentiations in the supplement. For convenience, we have copied the supplemental figure and text here:

      “Before continuing with the analysis, we tested whether the differentiations were successful and contained primarily our target cell types. The very low expression of NANOG, a marker for pluripotency, across all differentiations indicates that the samples contain very few iPSCs (Agoglia et al., 2021). For cardiomyocytes (CM), NKX2-5, MYBPC3, and TNNT2 definitively distinguish CM from other heart cell types and their high expression indicates successful differentiations (Burridge et al., 2014). For motor neurons, the high expression of ELAVL2, a pan-neuronal marker, indicates a high abundance of neurons in the sample (Mickelsen et al., 2019). The expression of ISL1 and OLIG2 further demonstrates that these are motor neurons and not other types of neurons (Maury et al., 2015). For retinal pigment epithelium (RPE), the combined expression of MITF, PAX6, and TYRP1 provides strong evidence that the differentiations were successful in producing RPE cells (Sharma et al., 2019). For skeletal muscle, the very high expression of MYL1, MYLPF, and MYOG indicates that these samples contain a high proportion of skeletal muscle cells (Chal et al., 2016). In general, all these populations of cells contain some proportion of progenitors as there is detectable expression of MKI67 in all samples.

      The low expression of ALB (a marker for mature hepatocytes) and the high expression of TTR and GPC3 (markers for hepatocyte progenitors) combined with the high expression of HNF1B indicate that the bulk of the cells in the HP samples are hepatocyte progenitors rather than mature hepatocytes or endoderm cells, although there are likely some endoderm cells and immature hepatocytes in the sample (Hay et al., 2008; Mallanna & Duncan, 2013). Similarly, the combined expression of PDX1 and NKX6-1 and the low expression of NEUROG3 (a marker of endocrine progenitors which differentiate from pancreatic progenitors) in the PP samples indicates that these primarily contain pancreatic progenitors but likely contain some endocrine progenitors and endoderm cells (Cogger et al., 2017; Korytnikov & Nostro, 2016).

      Notably, HP and PP are closely related cell types that are derived from the same lineage. Indeed, heterogeneous multipotent progenitors can contribute to both the adult liver and adult pancreas in mice (Willnow et al., 2021). Progenitors that express PDX1 (often used as a marker for the pancreatic lineage) can differentiate into hepatocytes (Willnow et al., 2021). As a result, some overlap in the transcriptomic signature of both cell types is expected and we cannot rule out that the HP samples contain cells that could differentiate into pancreatic cells or that the PP samples contain cells that could differentiate into hepatocytes. However, the expression of NKX6-1 and GP2, markers for pancreatic progenitors, in the PP samples but not the HP samples indicates that these two populations of cells are distinct. Overall, the similarity of PP and HP likely explains the lower number of cell type-specific genes and genes showing cell type-specific ASE for these cell types. This similarity does not alter the conclusions presented in the main text.”

      Author response image 4.

      Author response image 5.

      Marker gene expression in different cell types. In order, the panels show: a marker for pluripotency, a marker gene for dividing cells, marker genes for cardiomyocytes, marker genes for hepatocytes and hepatocyte progenitors, marker genes for motor neurons, marker genes for pancreatic progenitors and more mature pancreatic cell types, marker genes for retinal pigment epithelial cells, and marker genes for skeletal myocytes. Hepatocyte progenitors and pancreatic progenitors generally show similar gene expression profiles. TPM: transcript per million.

      2. Check for a potential confounding effect of sequence similarity on the power to detect ASE or ASCA.

      We agree that checking for confounding by power to detect ASE or ASCA would increase confidence in our results. We have added supplementary figures 29-33 to show the results as well as a discussion of these figures in the text (lines 318-326):

      “Finally, it is possible that CREs and genes that are less conserved will have more SNPs, and therefore more power to call ASCA and ASE, leading to systematically biased estimates. There is a weak positive correlation between the number of SNPs and the -log10(FDR) for ASE and a weak negative or no correlation for ASCA (Supp Fig. 29). Similarly, we observe a weak relationship between the number of SNPs in CREs or genes and absolute log fold-change estimates (Supp Fig. 30). Although the relationship between the number of SNPs and ASE/ASCA is weak, we confirmed that cell type-specific genes and peaks are still strongly enriched for ASE and ASCA when stratifying by number of SNPs (Supp Fig. 31-32). Overall, our analysis suggests that the result that more cell type-specific genes and CREs are more evolutionarily diverged is robust to a variety of possible confounders.”

      Author response image 6.

      Relationship between the number of SNPs and a) -log10(FDR) for ASE and b) -log10(p-value) for ASCA. These scatter plots show the relationship between the number of SNPs in a gene or peak and the -log10(FDR) for ASE or the -log10(p-value) for ASCA. Genes with significant ASE (FDR < 0.05) and peaks with significant ASCA (binomial p-value < 0.05) were annotated as blue dots, and all other genes and peaks were annotated as gray dots. All genes in each cell type in RNA-seq are shown. For clarity, the few outlier peaks with more than 200 SNPs are excluded from these plots.

      Author response image 7.

      Relationship between number of SNPs and absolute log2 fold-change in a) ASE and b) ASCA. These scatter plots show the relationship between the number of SNPs in a gene or peak and the estimated absolute log2 fold-change for ASE or ASCA. Genes with significant ASE (FDR < 0.05) and peaks with significant ASCA (binomial p-value < 0.05) were annotated as blue dots, and all other genes and peaks were annotated as gray dots. All genes in each cell type in RNA-seq are shown. For clarity, the few outlier peaks with more than 200 SNPs are excluded from these plots.

      Author response image 8.

      Cell type-specifically expressed genes are enriched for genes with ASE when stratifying by the number of SNPs per gene. a) Results when SKM is included. Genes were put into five bins with an equal number of genes in each bin. Genes with the fewest SNPs are in the 0-20% bin and genes with the most SNPs are in the 80-100% bin. Significance (using the Wald test) is indicated by asterisks where *** indicates p < 0.005, ** indicates p < 0.01, and * indicates p < 0.05. b) The same as in (a) but excluding SKM.
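      The stratification described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions: genes are split into equal-sized bins by SNP count, and within each bin the fraction of genes with ASE is compared between cell type-specific genes and all other genes. It does not include the Wald test used for significance, and the function and variable names are hypothetical.

```python
def quantile_bins(values, n_bins=5):
    """Assign each item a bin index 0..n_bins-1 so that bins hold (nearly)
    equal numbers of items, ordered from fewest to most SNPs.
    Ties are split by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def fraction(flags):
    """Fraction of True values, or NaN for an empty group."""
    return sum(flags) / len(flags) if flags else float("nan")


def ase_fraction_by_bin(snp_counts, is_specific, has_ase, n_bins=5):
    """For each SNP bin, return a pair: (fraction with ASE among cell
    type-specific genes, fraction with ASE among the remaining genes)."""
    bins = quantile_bins(snp_counts, n_bins)
    result = []
    for b in range(n_bins):
        idx = [i for i, bi in enumerate(bins) if bi == b]
        specific = [has_ase[i] for i in idx if is_specific[i]]
        other = [has_ase[i] for i in idx if not is_specific[i]]
        result.append((fraction(specific), fraction(other)))
    return result


# Hypothetical toy data: four genes split into two bins.
print(ase_fraction_by_bin(
    snp_counts=[1, 2, 3, 4],
    is_specific=[True, False, True, False],
    has_ase=[True, False, True, True],
    n_bins=2,
))
```

      If the enrichment of ASE among cell type-specific genes persists within every bin, differences in SNP count (and hence power) are unlikely to explain it on their own.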

      Author response image 9.

      Cell type-specific peaks are enriched for ASCA when stratifying by the number of SNPs per peak. a) Peaks with an absolute log2 fold-change greater than or equal to 0.5 were called as having ASCA. Peaks were put into five bins with an equal number of peaks in each bin. Peaks with the fewest SNPs are in the 0-20% bin and peaks with the most SNPs are in the 80-100% bin. Significance (using the Wald test) is indicated by asterisks where *** indicates p < 0.005, ** indicates p < 0.01, and * indicates p < 0.05. b) The same as in (a) but peaks with a binomial p-value less than or equal to 0.05 were called as having ASCA.

      3. In the last part the authors showcase 2 examples for which the log2 fold changes in chromatin state scores as inferred by the machine learning model Sei are used. This is an interesting and creative approach, however, more sanity checks on this application are necessary.

      We agree with the reviewer about the importance of sanity checks and apologize for omitting these from the manuscript. Below we highlight several such checks from previous publications:

      In the original Sei paper (Chen et al. 2022), the authors included several tests of their model’s ability to predict the effects of individual genetic variants. Using eQTL data from GTEx, they found that variants predicted to increase enhancer activity were more likely to be up-regulating eQTLs, and those predicted to increase polycomb repression had the expected repressive effect. These relationships became stronger when restricting the analysis only to fine-mapped eQTLs with >95% posterior probabilities of causality. Chen et al. also found that previously known disease-causing noncoding variants from the Human Gene Mutation Database were far more likely to reduce predicted enhancer/promoter activity than matched variants not linked to any disease.

      In addition, we note that a similar approach to ours was recently used to analyze all HARs and included considerable efforts to validate the utility of the Sei predictions in identifying causal variants (Whalen et al. 2023 in Neuron). For example, Whalen et al. found that the Sei output correlated with the effects of genetic variants on expression in a massively parallel reporter assay. They also found that the effect sizes predicted by Sei were much higher for variants in HARs than polymorphic variants in the human population, which is consistent with the idea that variants in HARs lie in highly conserved bases that are more likely to disrupt cis-regulatory elements. Finally, Whalen et al. found that effects on chromatin state predicted by Sei were generally highly correlated across tissues, supporting our approach that leverages all Sei outputs regardless of which cell type or tissue they correspond to. Overall, we think that Sei is a potentially powerful way to prioritize causal variants and that improved machine learning models trained on more extensive and context-specific data will be even more powerful.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank the reviewers for their comments and provide answers, clarifications, and new data. There were three important recurrent points, which we address here:

      (a) The reviewers were concerned about whether the observed motor defects (measured by startle-induced negative geotaxis, “SING”) were a reasonable behavioral measure of DAN function.

      Previously, Riemensperger et al., 2013 (PMID: 24239353) already linked synaptic loss of the dopaminergic PAM neurons to SING impairments. Furthermore, in a separate paper that we recently posted on BioRxiv, we show that the SING defects in PD mutants are rescued when the flies are fed L-DOPA (Kaempf et al 2024; BioRxiv). In this same paper we also show a very strong correlation between SING defects and defects in dopaminergic synaptic innervation of PAM DAN onto Mushroom body neurons. Both experiments suggest that the motor defects are the result of defects in dopamine release. Altogether, these data suggest that the combination of the SING assay and a quantification of the synaptic region of PAM DAN onto Mushroom body neurons is a suitable measure for DAN function.

      (b) The reviewers asked if the OPN dysfunction in young animals is connected to dopaminergic neuron (DAN) dysfunction in later life.

      We have conducted additional experiments and have included the results (new Figure 6): Our young PD mutants (we included Aux<sup>R927G</sup>, Synj<sup>R258Q</sup> and LRRK2<sup>G2019S</sup>) show olfactory defects, but normal DAN function (measured by assessing the TH-labeled synaptic area onto the Mushroom body neurons and by SING). Aged PD mutants show both olfactory defects and DAN dysfunction. When we express the wild-type PD gene in OPN (among other cells) of PD mutants using GH146-Gal4 (which does not drive expression in DAN), we are able to rescue the DAN defects (synaptic area and SING) that occur later in life. This indeed suggests there is a cell non-autonomous positive effect on DAN dysfunction that occurs at later stages in the life of our PD mutants (new Figure 6a).

      In a set of independent experiments, we also fed one of our mutants (LRRK2<sup>G2019S</sup>) nicotine, activating Nicotinic acetylcholine receptors (that are also activated by the release of acetylcholine from cholinergic neurons such as OPN). While nicotine does not rescue the olfactory preference defect, the OPN synapse morphology defect or the OPN-associated defects in Ca<sup>2+</sup>-imaging in LRRK2<sup>G2019S</sup> mutants (Figure 6b), it does rescue the DAN-associated defects, including SING, synapse loss and defects in Ca<sup>2+</sup>-imaging (Figure 6c).

      Finally, we generated human induced dopaminergic neurons derived from iPSC with a LRRK2<sup>G2019S</sup> mutation and incubated these neurons with nicotine. Again, this induced a rescue of a LRRK2-mutant-induced defect in neuronal activity measured by Ca<sup>2+</sup>-imaging. This is specific to nicotine since the rescue was absent when cells were also incubated with mecamylamine, a non-competitive antagonist of nicotinic acetylcholine receptors, trumping the effects of nicotine (Figure 6d-e").

      (c) The reviewers indicated that the GH146-Gal4 driver is expressed in cells other than OPN and thus noted that the defects we observe may not only be the result of OPN dysfunction.

      It is correct that GH146-dependent Gal4 expression includes OPNs (that are cholinergic) and one pair of inhibitory APL neurons (that are GABAergic) (Li et al., 2017 (PMID: 29149607), Lui et al., 2009 (PMID: 19043409)). We have adapted the text to explicitly state this. There are only 2 APL per fly brain and our single cell sequencing experiment does not have the resolution to allow us to test if these neurons had a significant number of DEG. However, as indicated above (in (b)), we are able to rescue DAN dysfunction by mimicking cholinergic output (application of nicotine). These data do not exclude that APL-neuron problems contribute to the defects we observe in our PD mutants, but they do suggest that cholinergic output is critical to maintain normal DAN function.

      Public Reviews:  

      Reviewer #1 (Public Review):  

      This is a fantastic, comprehensive, timely, and landmark pan-species work that demonstrates the convergence of multiple familial PD mutations onto a synaptic program. It is extremely well written and I have only a few comments that do not require additional data collection. 

      Thank you for this enthusiastic endorsement.

      Major Comments:  

      (1) In the functional experiments performing calcium imaging on projection neurons I could not find a count of cell bodies across conditions. Since the loss of OPNs could explain the reduced calcium signal, this is a critical control to perform. A differential abundance test on the single-cell data would also suffice here and be easy for the authors to perform with their existing data.

      This is indeed an important number, and we had included it in Supplemental Figure 2a.

      Also, the numbers of DAN and visual projection neurons were not significantly different between the genotypes (Supplemental Figure 2a in the manuscript).

      (2) One of the authors' conclusions is that cholinergic neurons and the olfactory system are acutely impacted by these PD mutations. However, I wonder if this is the case:

      a. Most Drosophila excitatory neurons are cholinergic and only a subpopulation appear to be dysregulated by these mutations. The authors point out that visual neurons also have many DEGs, couldn't the visual system also be dysregulated in these flies? Is there something special about these cholinergic neurons versus other cholinergic neurons in the fly brain? I wonder if they can leverage their nice dataset to say something about vulnerability.

      Yes, the reviewer is right, and we have changed our wording to be more specific. The reviewer also noted correctly that neurons in the visual system rank high in terms of number of DEGs, but we did not conduct elaborate experiments to assess if these visual system neurons are functional. Of note, several of our mutants show (subtle) electroretinogram defects, that are a measure of visual system integrity, but further work is needed to determine the origin of these defects. 

      The question about the nature of the underlying vulnerability pathways is interesting. In preliminary work we have selected a number of DEGs common to vulnerable cells in several PD mutants, and conducted a screen where we manipulated the expression of these DEGs and looked for rescue of the olfactory preference defects in our PD mutants. The strongest genetic interaction was with genes encoding proteins involved in proteostasis (Atg8/LC3, Lamp1 and Hsc70-4) (Reviewer Figure 3). While interesting, these results require further work to understand the underlying molecular mechanisms. We present these preliminary data here but have not included them in the main manuscript. 

      b. As far as I can tell, the cross-species analysis of DEGs (Figure 3) is agnostic to neuronal cell type, although the conclusion seems to suggest only cholinergic neurons were contrasted. Is this correct? Could you please clarify this in the text as it's an important detail. If not, Have the authors tried comparing only cholinergic neuron DEGs across species? That would lend strength to their specificity argument. The results for the NBM are impressive. Could the authors add more detail to the main text here about other regions to the main text? 

      The reviewer is correct that we compiled the DEG of all affected cells, the majority of which are cholinergic neurons. 

      For the human data we focused on the NBM samples, because it contained the highest fraction of cholinergic neurons (as compared to the other 2 regions), but even so, it was not possible to analyze the cholinergic neurons alone because the fraction of cholinergic neurons in the human material was too low to be statistically analyzed independently. Note that both wildtype and PD samples contained a low number of cholinergic neurons (i.e. the DEG differences we detected were not the result of sequencing different types of cells - see also Supplemental Figure 3b and d). We have indicated this more clearly in the text.

      c. Uniquely within the human data, are cholinergic neurons more dysregulated than others? I understand this is not an early timepoint but would still be useful to discuss. 

      As indicated in the previous point, unfortunately the fraction of cholinergic neurons in the human material was low and we were not able to analyze these cells on their own. 

      Author response image 1.

      Upregulation of protein homeostasis rescues hyposmia across familial models of PD. Results of a behavioral screen for cell-specific rescue of olfactory preference defects of young PD fly models using up- and downregulation of deregulated genes in affected cell types. Genes implicated in the indicated pathways are overexpressed or knocked down using GH146-Gal4 (OPN>) and UAS-constructs (overexpression or RNAi). UAS-only (-) and OPN>UAS (+) were scored in parallel and are compared to each other. n.d. not determined; Bars represent mean ± s.e.m.; grey zone indicates the variance of controls; n≥5 independent experiments per genotype, with ~50 flies each; red bars: p<0.05 in ANOVA and Bonferroni-corrected comparison to UAS-only control.

      d. In the discussion, the authors say that olfactory neurons are uniquely poised to be dysregulated as they are large and have high activity. Is this really true compared to other circuits? I didn't find the references convincing and I am not sure this has been borne out in electron microscopy reconstructions for anatomy.  

      We agree and have toned down this statement.

      Reviewer #2 (Public Review):  

      Summary:  

      Pech et al selected 5 Parkinson's disease-causing genes, and generated multiple Drosophila lines by replacing the Drosophila lrrk, rab39, auxilin (aux), synaptojanin (synj), and Pink1 genes with wild-type and pathogenic mutant human or Drosophila cDNA sequences. First, the authors performed a panel of assays to characterize the phenotypes of the models mentioned above. Next, by using single-cell RNA-seq and comparing fly data with human postmortem tissue data, the authors identified multiple cell clusters being commonly dysregulated in these models, highlighting the olfactory projection neurons. Next, by using selective expression of Ca<sup>2+</sup>-sensor GCaMP3 in the OPN, the authors confirmed the synaptic impairment in these models, which was further strengthened by olfactory performance defects.

      Strengths:  

      The authors investigated the functionality of PD-related mutations at endogenous levels and found a very interesting shared pathway through single-cell analysis. More importantly, they performed nice follow-up work using multiple assays.

      Weaknesses:  

      While the authors state this is a new collection of five familial PD knock-in models, the Aux<sup>R927G</sup> model has been published and carefully characterized in Jacquemyn et al., 2023. ERG has been performed for Aux R927G in Jacquemyn et al., 2023, but the findings are different from what's shown in Figure 1b and Supplementary Figure 1d, which the authors should try to explain. 

      We should have explained this better: the ERG assay in Jacquemyn et al., and here, in Pech et al., are different. While the ERGs in our previous publication were recorded under normal endogenous conditions, the flies in our current study were exposed to constant light for 7 days. This is often done to accelerate the degeneration phenotype. We have now indicated this in the text (and also refer to the different experimental set up compared to Jacquemyn et al).

      Moreover, according to the authors, the hPINK1 control was the expression of human PINK1 with UAS-hPINK1 and nsyb-Gal4 due to technical obstacles. Because the PINK1 WT control is an overexpression model, it is difficult to interpret the PINK1 mutant phenotypes. The work would be strengthened if the authors used UAS-hPINK1 and nsyb-Gal4 (or maybe a ubiquitous Gal4) to rescue the hPink1L347P and hPink1P399L phenotypes.

      The UAS-hPink1 was originally created by the Lu lab (Yang et al., 2003, PMID: 12670421) and has been amply used before in Pink1 loss-of-function backgrounds (e.g. in Yang et al., 2006, PMID: 16818890). In our work, the control we refer to was UAS-hPink1 expression (driven by nSyb-gal4) in a Pink1 knock-out background. For unknown reasons we were unable to replace the fly Pink1 with a human pink1 cDNA; we explained this in the methods section and added a remark in the new manuscript.

      In addition, although the authors picked these models to target different biology/pathways, Aux and Synj both act in related steps of clathrin-mediated endocytosis, with LRRK2 being an accessory regulatory protein. Therefore, is the data set more favorable in identifying synaptic-related defects?

      We picked these particular mutants, as they were the first we created in the context of a much larger collection of “PD flies” (see also Kaempf et al 2024, BioRxiv). We have made adaptations to the text to tone down the statement on the broad selection of mutants. 

      GH146-GAL4+ PNs are derived from three neuroblast lineages, producing both cholinergic and GABAergic inhibitory PNs (Li et al, 2017). Therefore, OPN neurons comprise more than "cholinergic projection neurons". How do we know from single-cell data that cholinergic neurons were more vulnerable across the 5 models?

      The reviewer is correct that GH146 drives expression in other cells than OPN and we now clearly state this in the text. We do present additional arguments that substantiate our conclusion that cholinergic neurons are affected: (1) our single cell sequencing identifies the most DEGs in cholinergic neurons. (2) nicotine (a compound activating cholinergic receptors) rescues dopamine-related problems in old PD-mutant flies. (3) Likewise, nicotine also alleviates problems we observed in LRRK2 mutant human induced dopaminergic neurons and this is blocked by mecamylamine, a non-competitive antagonist of nicotinic acetylcholine receptors.

      In Figure 1b, the authors assumed that locomotion defects were caused by dopaminergic neuron dysfunction. However, to better support it, the author should perform rescue experiments using dopaminergic neuron-specific Gal4 drivers. Otherwise, the authors may consider staining DA neurons and performing cell counting. Furthermore, the authors stated in the discussion, that "We now place cholinergic failure firmly ahead of dopaminergic system failure in flies", which feels rushed and insufficient to draw such a conclusion, especially given no experimental evidence was provided, particularly related to DA neuron dysfunction, in this manuscript. 

      Previously, Riemensperger et al., 2013 (PMID: 24239353) already linked synaptic loss of the dopaminergic PAM neurons to locomotion impairments (measured by SING). Furthermore, in a separate paper we show that the motor defects (SING) observed in PD mutants are rescued when the flies are fed L-DOPA, but not D-DOPA (Kaempf et al 2024; BioRxiv). In this same paper, we also show a significant correlation between SING defects and defects in dopaminergic synaptic innervation of PAM DAN onto Mushroom body neurons. We have referred to both articles in the revised manuscript.

      The statement on cholinergic failure ahead of dopaminergic failure was made in the context of the sequence of events: young flies did not show DAN defects, but they did display olfactory defects. The statement was indeed not meant to imply causality. However, we have now conducted new experiments where we express wild type PD genes using GH146-Gal4 (that does not express in DAN) in the PD mutants and assess dopaminergic-relevant phenotypes later in life (see also new Figure 6 in the manuscript). This shows that GH146-Gal4-specific rescue is sufficient to alleviate the DAN-dependent SING defects in old flies. Likewise, as indicated above, application of nicotine is also sufficient to rescue the DAN-associated defects (in PD mutant flies and human induced mutant dopaminergic neurons).

      It is interesting to see that different familial PD mutations converge onto synapses. The authors have suggested that different mechanisms may be involved directly through regulating synaptic functions, or indirectly through mitochondria or transport. It will be improved if the authors extend their analysis on Figure 3, and better utilize their single-cell data to dissect the mechanisms. For example, for all the candidates listed in Figure 3C, are they all altered in the same direction across 5 models?  

      This is indeed the case: the criteria for "commonly deregulated" included that the DEGs are changed in the same direction across several mutants. We ranked genes according to their mean gene expression across the mutants as compared to the wildtype control: i.e. only if the DEGs are all up- or all down-regulated do they end up at the top or bottom of our list. We added a remark in the revised manuscript. In preliminary work we also selected a number of the DEGs and conducted a screen where we manipulated the expression of these genes looking for rescue of the olfactory preference defects in our PD mutants. The strongest genetic interaction was with genes encoding proteins involved in proteostasis (Atg8/LC3, Lamp1 and Hsc70-4; and we also show a genetic interaction between EndoA and Lrrk in this work and in Matta et al., 2012) (Author response image 1 above). While interesting, these results require further work to understand the underlying molecular mechanisms. We present these preliminary data here, but have not included them in the main manuscript.
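      The ranking idea in this response can be illustrated with a minimal sketch. It assumes a per-gene table of log fold-changes relative to wild type for each mutant, and it applies an explicit same-direction filter before ranking by the mean; the explicit filter, the `min_mutants` threshold, and the gene and mutant names are illustrative assumptions rather than the procedure used in the manuscript.

```python
def rank_common_degs(log_fc, min_mutants=2):
    """Rank genes that change in the same direction across mutants.

    log_fc maps gene -> {mutant: log fold-change vs. wild type}, listing
    only the mutants in which the gene is differentially expressed.
    Genes are kept if they are changed in at least `min_mutants` mutants
    and all changes share a sign; they are then sorted by mean log
    fold-change, most up-regulated first."""
    ranked = []
    for gene, per_mutant in log_fc.items():
        values = list(per_mutant.values())
        if len(values) < min_mutants:
            continue
        all_up = all(v > 0 for v in values)
        all_down = all(v < 0 for v in values)
        if all_up or all_down:
            ranked.append((gene, sum(values) / len(values)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical toy input.
example = {
    "geneA": {"mutant1": 1.0, "mutant2": 2.0},    # up in both: kept
    "geneB": {"mutant1": -1.0, "mutant2": 1.5},   # mixed direction: dropped
    "geneC": {"mutant1": -0.5, "mutant2": -1.5},  # down in both: kept
    "geneD": {"mutant1": 3.0},                    # only one mutant: dropped
}
print(rank_common_degs(example))
```

      Genes with mixed directions tend toward a mean near zero, which is why consistently up- or down-regulated genes end up at the extremes of such a list even without the explicit filter.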

      While this approach is carefully performed, the authors should state in the discussions the strengths and the caveats of the current strategy. For example, what kind of knowledge have we gained by introducing these mutations at an endogenous locus? Are there any caveats of having scRNAseq at day 5 only but being compared with postmortem human disease tissue?  

      We have included a “strengths and caveats section” in the discussion addressing these points.

      Reviewer #3 (Public Review):  

      Summary:  

      This study investigates the cellular and molecular events leading to hyposmia, an early dysfunction in Parkinson's disease (PD), which develops up to 10 years prior to motor symptoms. The authors use five Drosophila knock-in models of familial PD genes (LRRK2, RAB39B, PINK1, DNAJC6 (Aux), and SYNJ1 (Synj)), three expressing human genes and two Drosophila genes with equivalent mutations.  

      The authors carry out single-cell RNA sequencing of young fly brains and single-nucleus RNA sequencing of human brain samples. The authors found that cholinergic olfactory projection neurons (OPN) were consistently affected across the fly models, showing synaptic dysfunction before the onset of motor deficits, known to be associated with dopaminergic neuron (DAN) dysfunction.

      Single-cell RNA sequencing revealed significant transcriptional deregulation of synaptic genes in OPNs across all five fly PD models. This synaptic dysfunction was confirmed by impaired calcium signalling and morphological changes in synaptic OPN terminals. Furthermore, these young PD flies exhibited olfactory behavioural deficits that were rescued by selective expression of wild-type genes in OPNs.  

      Single-nucleus RNA sequencing of post-mortem brain samples from PD patients with LRRK2 risk mutations revealed similar synaptic gene deregulation in cholinergic neurons, particularly in the nucleus basalis of Meynert (NBM). Gene ontology analysis highlighted enrichment for processes related to presynaptic function, protein homeostasis, RNA regulation, and mitochondrial function.  

      This study provides compelling evidence for the early and primary involvement of cholinergic dysfunction in PD pathogenesis, preceding the canonical DAN degeneration. The convergence of familial PD mutations on synaptic dysfunction in cholinergic projection neurons suggests a common mechanism contributing to early non-motor symptoms like hyposmia. The authors also emphasise the potential of targeting cholinergic neurons for early diagnosis and intervention in PD.  

      Strengths:  

      This study presents a novel approach, combining multiple mutants to identify salient disease mechanisms. The quality of the data and analysis is of a high standard, providing compelling evidence for the role of OPN neurons in olfactory dysfunction in PD. The comprehensive single-cell RNA sequencing data from both flies and humans is a valuable resource for the research community. The identification of consistent impairments in cholinergic olfactory neurons, at early disease stages, is a powerful finding that highlights the convergent nature of PD progression. The comparison between fly models and human patients' brains provides strong evidence of the conservation of molecular mechanisms of disease, which can be built upon in further studies using flies to prove causal relationships between the defects described here and neurodegeneration.  

      The identification of specific neurons involved in olfactory dysfunction opens up potential avenues for diagnostic and therapeutic interventions.  

      Weaknesses:  

      The causal relationship between early olfactory dysfunction and later motor symptoms in PD remains unclear. It is also uncertain whether this early defect contributes to neurodegeneration or is simply a reflection of the sensitivity of olfactory neurons to cellular impairments. The study does not investigate whether the observed early olfactory impairment in flies leads to later DAN deficits. Additionally, the single-cell RNA sequencing analysis reveals several affected neuronal populations that are not further explored. The main weakness of the paper is the lack of conclusive evidence linking early olfactory dysfunction to later disease progression.

      We agree that this is an interesting avenue to pursue and, as indicated above in Figure 6 and in the reworked manuscript, we have now included data that strengthen the connection between early OPN defects and the later DAN-dependent problems. Additional future work will be needed to elucidate the mechanisms of this non-cell-autonomous effect.

      The rationale behind the selection of specific mutants and neuronal populations for further analysis could be better qualified. 

      We have added further explanation in the reworked text.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):  

      Minor Comments:  

      (1) Questions about the sequencing methods and analysis approaches. From reading the methods and main text, I was confused about aspects of the Drosophila single-cell profiling. Firstly, did the authors multiplex their fly samples? 

      No, we did not. Genotypes were separately prepared and sequenced, but they were all processed in parallel to avoid batch effects. 

      Secondly, it seems like there are two rounds of dataset integration performed, Harmony and Seurat's CCA-based method. This seems unorthodox. Could the authors comment on why they perform two integrations? 

      Thanks for pointing this out; this was a mistake in the methods section (copied from a much older version of the manuscript). In this manuscript, we only used Harmony for dataset integration, and we have removed the methods on Seurat-CCA.

      Finally, for all dataset integrations please state in the main text how datasets were integrated (by age, genotype, etc). 

      Datasets were integrated by sample id, corresponding to individual libraries.

      (2) The authors focus on OPNs with a really nice set of experiments. I noticed however that Kenyon cells were also dysregulated. What about Olfactory sensory neurons? Could the authors provide comments on this? 

      Olfactory sensory neurons are located in the fly's antennae and were not captured by our analysis. However, the GH146-Gal4-specific rescue experiments indicate that these sensory neurons are likely not severely functionally impaired. Kenyon cells are an interesting affected cell type to look at in future experiments, as they are directly connected to DANs.

      (3) There are several citations of Jenett et al 2012 that seem wrong (related to single-cell datasets).

      We are sorry for this and have corrected this in the text.  

      Reviewer #2 (Recommendations For The Authors):  

      (1) In the key resources table, a line called CG5010k.o. (chchd2k.o.) was mentioned, but was not used in the paper. The authors should remove it. 

      Sorry, this was from a previous older version of the manuscript. We fixed this.

      (2) Why did the authors use human CDS for LRRK2, Rab39B, and PINK1, but fly CDS for Aux and Synj1? Is it based on the conservation of amino acid residues? Although the authors cited a review (Kalia & Lang, 2015) to justify the selection of the mutations, for the interest of a broad audience, it is recommended that the authors expand their introduction for the rationale of their selection, including the pathogenicity of each selected mutation, original human genetics evidence, conservation between fly and human. 

      (a) We used Drosophila cDNA for rescue experiments with aux and synj since knock-in of the human homologues at the locus of these genes did not rescue their loss-of-function phenotype (lethality).

      (b) We expanded the introduction to provide further explanation of the selection of the mutants we analyzed in this work. We picked these particular mutants because they were the first we created in the context of a much larger collection of “PD flies” (see also Kaempf et al 2024, BioRxiv). We have adapted the text to tone down the statement on the broad selection of mutants.

      (3) Supplemental Figure 1a, is mRNA level normalized to an internal control? If not, it is not appropriate to compare the results directly from two primer sets, since each primer set may have different amplification efficiency. 

      We are sorry for the lack of information. Indeed, mRNA levels were determined using the ΔΔCt method: Ct values were first normalized to the housekeeping gene Rp49 and then expressed as a percentage of endogenous Drosophila gene expression. We expanded the methods section and now also list the primers for Rp49 along with the other qPCR primers in Supplemental File 1.
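
      As an illustration of the normalization described in this reply, the following minimal sketch computes expression as a percentage of a control sample with the ΔΔCt approach. The function and argument names are hypothetical, the Ct values are invented, and the calculation assumes perfect doubling per PCR cycle for both primer sets, which the response does not state.

```python
def percent_of_control(ct_target_sample, ct_ref_sample,
                       ct_target_control, ct_ref_control):
    """Relative expression (in %) by the delta-delta-Ct approach.

    Assumes 100 % amplification efficiency (a factor of 2 per cycle)
    for the target and the reference (housekeeping) primer sets.
    """
    # Step 1: normalize each target Ct to the housekeeping gene (e.g. Rp49).
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    # Step 2: compare the normalized sample to the normalized control.
    delta_delta = delta_sample - delta_control
    # Step 3: convert the cycle difference to a fold change, then to percent.
    return 100.0 * 2.0 ** (-delta_delta)


# Invented Ct values: the sample needs one more cycle than the control
# after normalization, i.e. half the expression.
print(percent_of_control(24.0, 18.0, 23.0, 18.0))
```

      If the two primer sets amplify with different efficiencies, the factor of 2 no longer holds for both and an efficiency-corrected formula would be needed, which is the concern raised in the reviewer's question.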

      (4) For Figure 2, it may be helpful to have a supplemental table or figure showcasing the clusters with significant changes (based on cell number-adjusted DEGs) for each model, i.e., what are those black cell clusters in Figure 2? "Thus, cellular identity and cellular composition are preserved in young PD fly models." In Figure S2A, the authors only show cell composition percentages for 3 cell clusters, are the bars 95% standard error? 

      The error bars in Supplemental Figure 2a represent the 95 % CI. We have included a new supplemental table with the number of cells per cell cluster for each mutant (Supplemental File 3).

      What about the remaining 183 cell clusters? Are there any KI-model cell clusters that are statistically different than controls? What about the annotated cell types (e.g., the 81 with cell identities)? Please consider at least providing or pointing to a table to state how many have significant differences, or if there are truly none. 

      As mentioned above, we have included a new supplemental table with the number of cells per cell cluster for each mutant (Supplemental File 3).

      (5) What are the rows in the sunburst plot in Figure 3a? Please be more descriptive in the figure legend or label the figure. 

      We have expanded on this in the figure legend and now also include a summary of the SynGO analysis in Supplemental File 7. In Figure 3a, a summary sunburst plot is presented, reflecting the GO terms (inner rings, indicated in a) with their subdivided levels (the complete list is provided in Supplemental File 7). In Figure 3a’ and a” the DEG data acquired from the different datasets (human vs fly) are applied to the sunburst plot where rings are color-coded according to enrichment Q-value.

      (6) In Table S4, which clusters (in the table) have normalized residuals that are outside of the 95% confidence interval of the regression model displayed in Figure S2e? They use this analysis to adjust for cell number bias and point out the "most significant cell clusters" affected in each model. This may be helpful for readers who want to grab a full list of responsive clusters. 

      We have included this information in Supplemental File 5 (Tab “Cell types outside of CIs”) in the supplemental data of the manuscript.
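
      For readers who want to reproduce this kind of filter, a minimal sketch is given below. It is an assumption about the general idea, not the authors' implementation: it fits an ordinary least-squares line of DEG count against cell number per cluster, standardizes the residuals, and flags clusters whose standardized residual exceeds 1.96 in absolute value (an approximate 95 % band). All names and numbers are hypothetical.

```python
import math


def flag_outlier_clusters(n_cells, n_degs, z_cut=1.96):
    """Return indices of clusters with unusually large regression residuals.

    n_cells and n_degs are equal-length lists (one entry per cluster).
    The cut-off of 1.96 approximates a two-sided 95 % band under a
    normal-residual assumption.
    """
    n = len(n_cells)
    mean_x = sum(n_cells) / n
    mean_y = sum(n_degs) / n
    sxx = sum((x - mean_x) ** 2 for x in n_cells)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(n_cells, n_degs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(n_cells, n_degs)]
    # Residual standard deviation with n - 2 degrees of freedom.
    resid_sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    if resid_sd == 0:
        return []  # perfect fit: nothing deviates from the trend
    return [i for i, r in enumerate(residuals) if abs(r) / resid_sd > z_cut]


# Toy data: DEG count tracks cell number except for the cluster at index 5.
cells = list(range(11))
degs = [x + (30 if x == 5 else 0) for x in cells]
print(flag_outlier_clusters(cells, degs))
```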

      (7) The human samples used all have different LRRK2 variants: for the cross-species comparisons, do Lrrk flies have greater similarity to the human PD cases compared to the other fly models?

      No, comparing the vulnerable gene signatures from each of the fly mutants to the DEGs from the human samples does not show any greater similarity between the LRRK mutants compared to the other mutants.

      Reviewer #3 (Recommendations For The Authors):  

      Clarifications required:  

      Some of the mutations used are not common PD-associated genes, the authors should explain the rationale behind using these particular mutants, and not using well-established fly models of PD (like for example GBA flies) or SNCA overexpression.

      We opted to use knock-ins of mutations that are causal to Parkinsonism. Given flies do not express an alpha-synuclein homologue we were not able to add this ‘as such’ to our collection. Future work can indeed also include expression models or risk factor models (like GBA). As also requested by another reviewer, we did add further rationale and explanation to the genes we chose to analyze in this work.

      Why starvation rather than lifespan for PD models? For the lifespan data shown there are no error bars, if the stats test is a log-rank or Cox proportional hazards (usually used in survival analysis, this should be stated), it would also be good to have the survival plots for all the survival during starvation, not just PINK1. 

      While starvation assays can provide valuable insights into acute metabolic and physiological stress responses, we acknowledge that lifespan is a critical parameter and would provide a more comprehensive understanding of the PD models in our study. Based on this consideration and the reviewer’s feedback we have removed the starvation data from the manuscript. Unfortunately, we did not perform lifespan experiments, which is why these data were not included in the manuscript. However, based on our observations (though not detailed analysis), all genotypes tested—except for the PINK1 mutants—appeared to have a normal lifespan. For PINK1 mutants, most flies died by 25 days of age. Therefore, we conducted our assays using 15-day-old PINK1 mutant flies.

      Do the fly models used have different lifespans, and how close to death was the SING assay performed? Different mutations show different effects, most phenotypes are really mild (hRab39BG192R has no phenotype), and PINK1 has the strongest, are these simply reflections of how strong the model is?  

      The ages of the flies we analyzed are indicated in the legend. As mentioned before, all mutants except PINK1 had a normal lifespan, i.e., we did not detect an abnormally low number of flies or premature death at 50 days of age. Most PINK1 mutant flies tested in this manuscript died by 25 days of age; therefore, we conducted our assays using 15-day-old PINK1 mutant flies.

      Rab39G192R has no phenotype in the tests presented, suggesting no degeneration, why use RabG192R for scRNA seq? Seems an odd choice, the authors should explain. 

      Single-cell sequencing was initiated before the full phenotypic characterization of all mutants was completed. Although basic characterization of the Rab39<sup>G192R</sup> mutant PD flies revealed either no significant phenotypes or only mild effects in the assays performed (Figure 1), the sequencing data provided additional insights into potential cellular and molecular alterations. Furthermore, all PD-mutant knock-ins, including Rab39<sup>G192R</sup> mutant PD flies, show dysfunctional synaptic terminals of their OPN neurons as they had significantly weaker Ca<sup>2+</sup>-responses, even though their synaptic area was increased (Figure 4 g-h). Furthermore, all mutants also had olfactory behavior defects (Figure 5 a). 

      When the authors state that “For example, in the NBM, an area associated with PD (Arendt et al., 1983), 20% of the DEG that has an orthologous gene in the fly are also found among the most deregulated genes across PD fly models" a test should be performed to confirm this is a significant overlap (such as a hypergeometric test). 

      We have performed this test. Of the 2486 significantly differentially expressed human genes, 1149 have a fly orthologue, and of these, 28.46 % overlap with the deregulated fly genes (top and bottom 5 % of genes, as shown in Supplemental Table 7). A hypergeometric test confirms that this overlap is significant, with a p-value of 9.06e<sup>-76</sup>. We have included this in the text.
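
      The overlap test mentioned here can be sketched as an upper-tail hypergeometric probability. The example below is a generic illustration using only the Python standard library; the function name is hypothetical and the numbers are toy values, because the response does not report the background gene count or the size of the deregulated fly gene set needed to reproduce the published p-value.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """Exact P(X >= observed) for a hypergeometric variable X.

    population: number of background genes (e.g. genes with an orthologue)
    successes:  background genes that belong to the reference set
    draws:      size of the query gene set
    observed:   number of query genes found in the reference set
    """
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return Fraction(tail, total)


# Toy example: 10 background genes, 4 in the reference set,
# 3 query genes of which 2 overlap.
p_value = hypergeom_upper_tail(10, 4, 3, 2)
print(float(p_value))
```

      Exact integer arithmetic avoids underflow for very small p-values; for large gene lists a library routine with a log-scale survival function would be faster.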

      The authors speak of deregulation when speaking of the overlap between human and fly DE genes, but do the over-expressed genes in flies overlap with overexpressed genes in humans, or is the direction of transcription deregulation not concordant? If it is mostly not concordant, can the authors please comment as to why they might think that is the case? 

      In our fly experiments, we identified DEGs in affected cell types and then defined common DEGs by looking at the average change across the fly mutants. Genes that show a consistent change (all or mostly up, or all or mostly down) in the different mutants will end up at the top of our list, while genes that are up in some mutants and downregulated in others will average out and not end up in our commonly deregulated gene list. For comparison to the human data, we only looked for the presence of the human homologue, but did not assess whether the change occurred in the same direction. More work will be needed to define the most relevant changes, but in a mini-screen we did select a number of DEGs present in both fly and human datasets from different functional categories and tested whether they genetically interact with our PD mutants. As shown in Reviewer Figure 3, we find that modulating genes encoding proteostasis pathway components rescues the olfactory preference defect across many PD mutants.
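
      The averaging logic described in this reply can be illustrated with a short sketch: genes are ranked by their mean change across mutants, so changes of mixed sign cancel out. The gene names and log fold-change values below are invented, and the function is a hypothetical illustration rather than the authors' pipeline.

```python
from statistics import mean


def rank_by_mean_change(log_fold_changes):
    """Rank genes from most up- to most down-regulated on average.

    log_fold_changes maps a gene name to its log2 fold changes
    (mutant vs. control), one value per mutant.
    """
    mean_change = {gene: mean(values) for gene, values in log_fold_changes.items()}
    return sorted(mean_change, key=mean_change.get, reverse=True)


# Invented values for three hypothetical genes across three mutants.
example = {
    "geneA": [1.0, 1.2, 0.8],     # consistently up: ranks at the top
    "geneB": [1.5, -1.5, 0.0],    # mixed direction: averages out
    "geneC": [-1.0, -1.1, -0.9],  # consistently down: ranks at the bottom
}
print(rank_by_mean_change(example))
```

      Taking the top and bottom slices of such a ranking yields genes that change in a largely consistent direction, which is why mixed-direction genes do not appear in a commonly deregulated list built this way.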

      Can the authors explain why only the NBM region was used for comparison with the fly data?  

      We used the NBM because this region has the highest number of cholinergic neurons, which allowed us to compare the deregulation in those neurons to the deregulation in the cholinergic OPN of mutant PD flies.

      In Figure 4, can the genotypes please be stated in full and why is the hPINK1 fly giving no detectable signal? 

      Despite several attempts, we failed to knock in wild-type hPink1 at the fly pink1 locus. Therefore, the hPink1 control used throughout the manuscript was nSyb-Gal4>UAS-hPink1 in a Pink1 knock-out background, except in Figure 4. For the experiments in this figure, we could not use UAS-hPink1 with nSyb-Gal4, since we needed OPN-specific expression of Gal4 to drive UAS-GCaMP expression. This genotype was therefore labeled as “not determined” (“n.d.”), as indicated in the figure and the legend. We explained this better in the methods section, added a remark in the new manuscript and expanded the legend of Figure 4.

      The paper states that "These findings imply that factors affecting the function of cholinergic neurons might, by the absence of insufficient innervation, lead to DAN problems and degeneration, warranting further exploration of the underlying molecular mechanisms", this should be less strong, the paper never looks at DAN, only at OPN neurons. Fly neurons are mostly cholinergic, and human neurons are mostly glutamatergic, so jumping from one system to the other might not be as straightforward, the authors should comment on this. 

      We have now included a new experiment in which we assessed DAN function in aged PD mutants where the wildtype gene was expressed in OPN using GH146-Gal4. We find that this manipulation rescued DAN defects (measured by SING) in older flies. We further corroborated our observation by “replacing” cholinergic innervation with nicotine feeding in PD mutants. This also rescues the SING defect as well as the defects in neuronal activity in PAM DAN (based on live synaptic calcium imaging). Finally, we also show that incubating LRRK2<sup>G2019S</sup> mutant human induced dopaminergic neurons with nicotine is sufficient to rescue functional defects in these neurons (measured using calcium imaging). We included these data in the new manuscript and also show them in Figure 6 above (new Figure 6 in the revised manuscript).

      Experiments that would improve the manuscript:  

      Does rescue of OPN function also rescue later progressive symptoms (geotaxis response)?  

      It does, as indicated in the previous point and shown in Figure 6.

      Do the fly PD models used show DAN degeneration? This could be assessed by stains with anti-TH stains. 

      We quantified DAN cell bodies using anti-TH, but see very little or no loss. There is, however, loss of synaptic innervation of the PAM onto the mushroom bodies. We included these data in the new Figure 6. Furthermore, we have quantified this across the genetic space of familial Parkinsonism in Kaempf et al., 2024, BioRxiv. Note that this phenotype is also rescued by expressing wildtype CDS in their OPN using GH146-Gal4.

      Minor issues: 

      The final sentence on page 5 is repetitive with the introduction. 

      Indeed, we removed the redundant sentence.

      First line of the new section on page 6, the authors probably mean cholinergic olfactory projection neurons, not just cholinergic neurons. 

      Yes, and corrected.

      At the top of page 7 the authors state: "Additionally, we also found enrichment of genes involved in RNA regulation and mitochondrial function that are also important for the functioning of synaptic terminals", where is the data showing this? The authors should point to the supplemental file showing this.  

      We have now included a reference to Supplemental File 7, which includes a summary of those data. Additionally, we included references to back this claim.

      Just before the discussion, Rab39BG193R should be Rab39BG192R.  

      Sorry for this, it is now corrected.

      Stating "fifth row" in Fig 5c and d is confusing, can the figure be labelled more clearly?  

      We modified the figure (including extra marks and colors) and expanded the legend and the main text to differentiate better between expression of the rescues in OPN versus T1 neurons revealing that only expression in OPN neurons rescues the olfactory defects while expression in T1 neurons does not.

      In the methods, the authors describe clustering done both in Scanpy and Seurat; why were both run? Which clustering was used for further analysis?

      We only used Scanpy with Harmony and removed the methods on Seurat-CCA. Thanks for pointing this out; this was a mistake in the methods section (copied from a previous version of the manuscript).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer 1 (Public Comments):

      (1) The central concern for this manuscript is the apparent lack of reproducibility. The way the authors discuss the issue (lines 523-554) it sounds as though they are unable to reproduce their initial results (which are reported in the main text), even when previous versions of AlphaFold2 are used. If this is the case, it does not seem that AlphaFold can be a reliable tool for predicting antibody-peptide interactions.

      The driving point behind the multiple sequence alignment (MSA) discussion was indeed to point out that AlphaFold2 (AF2) performance when predicting scFv:peptide complexes is highly dependent upon the MSA, but this dependence is a function of the MSA generation algorithm (MMseqs2, HHblits, jackhmmer, hhsearch, kalign, etc.) and sequence databases, and less an intrinsic property of AF2. It is important to report MSA-dependent performance precisely because this results in changing capabilities with respect to peptide prediction.

      Performance also varies significantly with changes to the target peptide and scFv framework. By reporting the varying success rates (as a function of MSA, peptide target, and framework changes) we aim to help future researchers craft modified algorithms that can achieve increased reliability at protein-peptide binding predictions. Ultimately, tracking down how the details of MSA generation change the results (especially when the MSAs are hundreds long) is significantly outside the scope of this paper. Our goal for this paper was to show a general method for identification of linear antibody epitopes using only sequence information, and future work by us or others should focus on optimization of the process.

      (2) Aside from the fundamental issue of reproducibility, the number of validating tests is insufficient to assess the ability of AlphaFold to predict antibody-peptide interactions. Given the authors' use of AlphaFold to identify antibody binding to a linear epitope within a whole protein (in the mBG17:SARS-Cov-2 nucleocapsid protein interaction), they should expand their test set well beyond Myc- and HA-tags using antibody-antigen interactions from existing large structural databases.

      Performing the calculations at the scale that the reviewer is requesting is not feasible at this time. We showed in this manuscript that we were able to predict 3 of 3 epitopes, including one for an antigen-antibody pair that has not been deposited in the PDB and has no homologs. While we feel that N=3 is acceptable to introduce this method to the scientific community, we will consider adding more examples of success and failure in the future to optimize and refine the method as computational resources become available. Notably, future efforts that attempt high-throughput predictions of this class using existing databases should take particular care to avoid contamination.

      (3) As discussed in lines 358-361, the authors are unsure if their primary control tests (antibody binding to Myc-tag and HA-tag) are included in the training data. Lines 324-330 suggest that even if the peptides are not included in the AlphaFold training data because they contain fewer than 10 amino acids, the antibody structures may very well be included, with an obvious "void" that would be best filled by a peptide. The authors must confirm that their tests are not included in the AlphaFold training data, or re-run the analysis with these templates removed.

      First, we address the simpler question of templates.

      The reruns of AF2 with the local 2022 rebuild, the most reproducible method used and the one whose results were most on par with the MMseqs2 server in the fall of 2022, were run without templates. This is because the MSA was generated locally; no templates were matched or generated locally. The only information passed was therefore the locally generated MSA and the FASTA sequences of the unchanging scFv and the dynamic epitope. Because of how well this performed despite the absence of templates, we can confidently say that the template flag does not significantly affect how accurately PAbFold identifies the correct epitope.

      Second, we can partially address the question of whether the AlphaFold models had access to models suitable, in theory, for “memorization” of pertinent structural details. 

      With respect to tracking the exact role and inclusion of specific PDB entries, the AF2 paper provides the following:

      “Structures from the PDB were used for training and as templates (https://www.wwpdb.org/ftp/pdb-ftp-sites; for the associated sequence data and 40% sequence clustering see also https://ftp.wwpdb.org/pub/pdb/derived_data/ and https://cdn.rcsb.org/resources/sequence/clusters/bc-40.out). Training used a version of the PDB downloaded 28 August 2019, while the CASP14 template search used a version downloaded 14 May 2020. The template search also used the PDB70 database, downloaded 13 May 2020 (https://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/).”

      Three of these links are dead. As such, it is difficult to definitively assess the role of any particular PDB entry with respect to AF2 training/testing, or the impact of homologous training structures, given the very large number of immunoglobulin structures in the training set. That said, we can summarize information for the potentially relevant PDB entries (2or9, which is shown in Fig. 1, and 1frg), and believe it is most conservative to assume that each such entry was within the training set.

      PDB entry 2or9 (released 2008): the anti-c-myc antibody 9E10 Fab fragment in complex with an 11-amino acid synthetic epitope: EQKLISEEDLN. This crystal structure is also noteworthy for featuring a binding mode where the peptide is pinned between two Fabs. The apo structure (2orb) is also in the database but lacks the peptide and a resolved structure for CDR H3.

      PDB entry 1a93 (released 1998): a c-Myc-Max leucine zipper structure, where the c-Myc epitope (in a 34-amino acid protein) adopts an alpha helical conformation completely different from the epitope captured in entry 2or9.

      PDB entries 5xcs and 5xcu (released 2017): engineered Fv-clasps (scFv alternatives) in complex with the 9-amino acid synthetic HA epitope: YPYDVPDYA.

      PDB entry 1frg (released 1994): anti-HA peptide Fab in complex with HA epitope subset Ace-DVPDYASL-NH2.

      Since the 2or9 entry has our target epitope (10 aa) embedded within an 11 aa sequence, we have revised this line in the manuscript:

      “The AlphaFold2 training set was reported to exclude chains of less than 10, which would eliminate the myc and HA epitope peptides.” => “The AlphaFold2 training set was reported to exclude chains of less than 10, which would eliminate the HA epitope peptide from potential training PDB entries such as 5xcs or 5xcu.”

      It is important to note that we obtained the best prediction performance for the scFv:peptide pair that had no pertinent PDB entries (mBG17). Specifically, a Protein BLAST search against the PDB using the mBG17 scFv revealed diverse homologs, but a maximum sequence identity of 89.8% for the heavy chain (to an unrelated antibody) and 93.8% for the light chain (to an unrelated antibody). Additionally, while it is possible that the AF2 models might have learned from the complex in PDB entry 2or9, Supplemental Figure 3 shows how often the peptide is “misplaced”, and the performance does not exceed the performance for mBG17.

      (4) The ability of AlphaFold to refine the linear epitope of antibody mBG17 is quite impressive and robust to the reproducibility issues the authors have run into. However, Figure 4 seems to suggest that the target epitope adopts an alpha-helical structure. This may be why the score is so high and the prediction is so robust. It would be very useful to see along with the pLDDT by residue plots a structure prediction by residue plot. This would help to see if the high confidence pLDDT is coming more from confidence in the docking of the peptide or confidence in the structure of the peptide.

      The reviewer is correct that the target mBG17 epitope adopts an alpha-helical conformation, and we concur that this likely contributes to the more reliable structure prediction performance. When we predict the structure of the epitope alone without the mBG17 scFv, AF2 confidently predicts an alpha helix with an average pLDDT of 88.2 (ranging from 74.6 to 94.4).

      Author response image 1.

      The AF2 prediction for the mBG17 epitope by itself.
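
      An average pLDDT such as the one quoted for the isolated epitope can be read directly from an AlphaFold2 PDB output, which stores the per-residue pLDDT in the B-factor column. The sketch below is a minimal, standard-library illustration; the function name and the choice to average over C-alpha atoms of one chain are assumptions, not the authors' script.

```python
def mean_plddt(pdb_text, chain_id):
    """Average pLDDT over the C-alpha atoms of one chain.

    Relies on the fixed-column PDB ATOM record layout: atom name in
    columns 13-16, chain ID in column 22, B-factor in columns 61-66.
    AlphaFold2 writes pLDDT into the B-factor field.
    """
    scores = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 66:
            continue
        if line[12:16].strip() == "CA" and line[21] == chain_id:
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError(f"no C-alpha atoms found for chain {chain_id!r}")
    return sum(scores) / len(scores)
```

      Applied to a predicted complex, calling the function with the peptide's chain ID gives a peptide-only confidence value; the chain ID of the peptide depends on how the prediction was set up.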

      However, as one interesting point of comparison, a 10 a.a. poly-alanine peptide is also consistently folded into an alpha-helical coil by AF2. The A<sub>10</sub> peptide is also predicted to bind among the traditional scFv CDR loops, but the pLDDT scores are very poor (Supplemental Figure 5J). We also observed the opposite case, in which a peptide has a very unstructured region in the binding domain but is nonetheless placed confidently, as seen in Supplemental Figure 3 C&D. Therefore, while we suspect peptides with strong alpha-helical propensity are more likely to be accurately predicted, the data suggest that alpha helix adoption is neither necessary nor sufficient to reach a confident prediction.

      (5) Related to the above comment, pLDDT is insufficient as a metric for assessing antibody-antigen interactions. There is a chance (as is nicely shown in Figure S3C) that AlphaFold can be confident and wrong. Here we see two orange-yellow dots (fairly high confidence) that place the peptide COM far from the true binding region. While running the recommended larger validation above, the authors should also include a peptide RMSD or COM distance metric, to show that the peptide identity is confident, and the peptide placement is roughly correct. These predictions are not nearly as valuable if AlphaFold is getting the right answer for the wrong reasons (i.e. high pLDDT but peptide binding to a non-CDR loop region). Eventual users of the software will likely want to make point mutations or perturb the binding regions identified by the structural predictions (as the authors do in Figure 4).

      We agree with the reviewer that pLDDT is not a perfect metric, and we are following with great interest the evolving community discussion as to what metrics are most predictive of binding affinity (e.g. pAE, or pITM as a decent predictor for binding, but not affinity ranking). To our knowledge, there is not yet a consensus for the most predictive metrics for protein:protein binding nor protein:peptide binding. Intriguingly, since the antigen peptides are so small in our case, the pLDDT of the peptide residues should be mostly reporting on the confidence of the distances to neighboring protein residues.

      As to the suggestion for an RMSD or COM distance metric, we agree that these are useful, with the caveat that they require a reference structure. The goal of our method is to quickly narrow down candidate linear epitopes and thereby guide experimentalists to more efficiently determine the actual binding sequence of an antibody-antigen pair. Presumably this would not be necessary if a reference structure were known.

      It may also be possible to invent a method to filter unlikely binding modes that is specific to antibodies and peptide epitopes that does not require a known reference structure, but this would be an interesting problem for subsequent study.

      Reviewer 1 (Recommendations for the Authors):

      (1) "Linear epitope" should be more precisely defined in the text. It isn't clear whether the authors hope that they can use AlphaFold to predict where on a given protein antigen an antibody will bind, or which antigenic peptide the antibody will bind to. The authors discuss both problems, and there is an important distinction between the two. If the authors are only concerned with isolated antigenic peptides, rather than linear epitopes in their full length structural contexts, they should be more precise in the introduction and discussion.

      We thank the reviewer for the prompt towards higher precision. We are using the short contiguous antigen definition of “linear epitope” that depends on secondary rather than tertiary structure. The linear epitopes this paper considers are short “peptides” that form secondary structure independent of their structure in the complete folded antigen protein. We have clarified our definition of “linear epitope” in the text (lines 64-66). 

      (2) Line 101: "Not all portions of the antibody are critical". First, this is not consistent with the literature, particularly where computational biology is concerned.

      See https://pubs.acs.org/doi/10.1021/acs.jctc.7b00080 . Second, while I largely agree with what I think the authors are trying to say (that we can largely reduce the problem to the CDR loops), this is inconsistent with what the authors later find, which is that inexplicably the VH/VL scaffold used alters results strongly.

      We have adopted verbiage that should be less provocative: “Fortunately, with respect to epitope specificity, antibody constant domains are less critical than the CDR loops and the remainder of the variable domain framework regions.”

      (3) Related to the above comment, do the authors have any idea why epitope prediction performance improved for the chimeric scFvs? Is this due to some stochasticity in AlphaFold? Or is there something systematic? Expanding the test dataset would again help answer this question.

      We agree that future study with a larger test set could help address this intriguing result, for which we currently lack a conclusive explanation. Part of our motivation for this publication was to bring to light this unexpected result. Notably, these framework differences are not only implicated as a factor in driving AF2 performance, but also changing experimental intracellular performance as reported by our group (DOI: 10.1038/s41467-019-10846-1 ). We can generate a variety of hypotheses for this phenomenon. Just as MSA sub-sampling has been a popular approach to drive AF2 to sample alternative conformations, sequence recombination may be a generically effective way to generate usefully different binding predictions. However, it is difficult to discriminate between recombination inducing subtle structural tweaks that increase protein intracellular fitness and binding, from recombination causing changes to the MSA that affect the likelihood of sampling a good epitope binding conformation. It is also possible that the chimeras are more deftly predicted by AF2 due to differences in sequence representation during the training of the AF2 models (e.g. more exposure to models containing 15F11 or 2E2 structures). We attempted to deconvolute MSA differences by using single-sequence mode (Supplementary Figure 13) but this ablated performance.

      (4) Figure 2: The reported consensus pLDDT scores are actually quite low here, suggesting low confidence in the result. This is in strong contrast to the reported consensus scores for mBG17. Again, a larger test dataset would help set a quantitative cutoff for where to draw the line for "trustworthy" AlphaFold predictions in antibody-peptide binding applications.

      We agree that a larger dataset will be useful to begin to establish metrics and thresholds and will contribute to the aforementioned community discussion about reliable predictors of binding. Our current focus is not structure prediction per se. In the current work we are more focused on relative binding likelihood and increasing the efficiency of experimental epitope verification by flagging the most likely linear epitopes. Thus, while the pLDDT scores are low for Myc in Figure 2, it is remarkable (and worth reporting) that there is still useful signal in the relative variation in pLDDT. The utility of the signal variation is evident in the ability to short-list correct lead peptides via the two methods we demonstrate (consensus and per-residue max).

      (5) Figure 4: if the authors are going to draw conclusions from the actual structure predictions of AlphaFold (not just the pLDDT scores), the side-chain accuracy placement should be assessed in the test dataset (RMSD or COM distance).

      We agree with the reviewer that side-chain placement accuracy is important when evaluating the accuracy of AF2 structure predictions. However, here our focus was relative binding likelihood rather than structure prediction. The one case where we attempted to draw conclusions from the structure prediction was in the context of mBG17, where there is not yet an experimental reference structure. Absolutely, if we were to obtain a crystal structure for that complex, we would assess side-chain placement accuracy. 

      (6) Lines 493-508: I am not sure that this assessment for why AlphaFold has difficulty with antibody-antigen interactions is correct. If the authors' interpretation is correct (larger complicated structures are more challenging to move) then AlphaFold-Multimer (https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2.full) wouldn't perform as well as it does. Instead, the issue is likely due to the incredibly high diversity in antibody CDR loops, which reduces the ability of the AlphaFold MSA step (which the authors show is quite critical to predictions: Figure S13) to inform structure prediction. This, coupled with the importance of side chain placement in antibody and TCR interactions, which is notoriously difficult (https://elifesciences.org/articles/90681), is likely the largest source of uncertainty in antibody-antigen interaction prediction.

      We agree with the reviewer that CDR loop diversity (and the associated side chain placement challenges) is a major barrier to successfully predicting antibody-antigen complexes. Presumably this is true for both peptide antigens and protein antigens. Indeed, the authors of AlphaFold-multimer admit that the updated model struggles with antibody-antigen complexes, saying “As a limitation, we observe anecdotally that AlphaFold-Multimer is generally not able to predict binding of antibodies and this remains an area for future work.” The point about how loop diversity could reduce MSA quality is well taken. Thanks to the reviewer's guidance, we have included the following where MSA sensitivity is discussed later on, in lines 570-572:

      “These challenges are presumably compounded by the incredible diversity of the CDR loops in antibodies which could decrease the useful signal from the MSA as well as drive inconsistent MSA-dependent performance”.

      With respect to lines 493-508, we have also rephrased a key sentence to better explain that we are comparing the often-good recognition performance for short epitopes to the never-good performance when those epitopes are embedded within larger sequences. Instead of saying, “In contrast, a larger and complicated structure may be more challenging to move during the AlphaFold2 structure prediction or recycle steps,” we now say in lines 520-522, “In contrast, embedding the epitope within a larger and more complicated structure appears to degrade the ability of AlphaFold2 to sample a comparable bound structure within the allotted recycle steps.”

      (7) Related to major comment 1: Are AlphaFold predictions deterministic? That is, if you run the same peptide through the PAbFold pipeline 20 times, will you get the same pLDDT score 20 times? The lack of reproducibility may be in part due to stochasticity in AlphaFold, which the authors could actually leverage to provide more consistent results.

      This is a good question that we addressed while dissecting the variable performance. When the random seed is fixed, AF2 returns the same prediction every time. After running this 10 times with a fixed seed, the mBG17 epitope was predicted with an average pLDDT of 88.94, with a standard deviation of 1.4 x 10<sup>-14</sup>. In contrast, when no seed is specified, AF2 did not return an *identical* result. However, the results were still remarkably consistent. Running the mBG17 epitope prediction 10 times with a different seed gave an average pLDDT of 89.24, with a standard deviation of 0.49. 
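
      The comparison above reduces to summarizing per-run pLDDT values by their mean and standard deviation. The following is a minimal sketch of that summary, assuming the per-run average pLDDT values have already been extracted from the prediction outputs. The function name `summarize_runs` and all numbers are hypothetical placeholders for illustration, not the reported data.

```python
from statistics import mean, pstdev


def summarize_runs(plddt_per_run):
    """Return (mean, population standard deviation) of per-run pLDDT values."""
    return mean(plddt_per_run), pstdev(plddt_per_run)


# Hypothetical values for illustration only (not the reported data).
fixed_seed_runs = [88.9] * 10  # identical predictions when the seed is fixed
varied_seed_runs = [89.1, 88.7, 89.9, 89.4, 88.8, 89.6, 89.0, 89.3, 88.5, 89.7]

fixed_mean, fixed_sd = summarize_runs(fixed_seed_runs)
varied_mean, varied_sd = summarize_runs(varied_seed_runs)

print(f"fixed seed:  mean={fixed_mean:.2f}, sd={fixed_sd:.2e}")
print(f"varied seed: mean={varied_mean:.2f}, sd={varied_sd:.2f}")
```

      A standard deviation at the level of floating-point noise indicates identical predictions, whereas a small but clearly non-zero value reflects seed-dependent variation.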

      (8) Related to major comment 2: Using, for example, this previous survey of 1833 antibody-antigen interactions (https://www.sciencedirect.com/science/article/pii/S2001037023004725), the authors could likely pull out multiple linear epitopes to test AlphaFold's performance on antibody-peptide interactions. A large number of tests are necessary for validation.

      We thank the reviewer for this report of antibody-antigen interactions and will use it as a source of complexes in a future expanded study. Given the quantity and complexity of the data that we are already providing, as well as the compute and personnel that the requested analysis would demand, we must defer this expansion to future work.

      (9) Related to major comment 3: Apologies if this is too informal for a review, but this Issue on the AlphaFold GitHub may be useful: https://github.com/googledeepmind/alphafold/issues/416 .

      We thank the reviewer for the suggestion – per our response above we have indeed run predictions with no templates. Since we are using local AlphaFold2 calculations with localcolabfold, the use or non-use of templates is fairly simple: including a “--templates” flag or not.

      (10) Related to major comment 4: I am not sure if AlphaFold outputs by-residue secondary structure prediction by default, but I know that Phyre2 does http://www.sbg.bio.ic.ac.uk/~phyre2/html/page.cgi?id=index .

      To our knowledge, AF2 does not predict secondary structure independent of the predicted tertiary structure. When we need to analyze the secondary structure, we typically run the program DSSP on the tertiary structure.

      (11) The documentation for this software is incomplete. The GitHub ReadMe should include complete guidelines for users with details of expected outputs, along with a thorough step-by-step walkthrough for use.

      We thank the reviewer for pointing this out, but we feel that the level of detail we provide in the GitHub is sufficient for users to utilize the method described.

      Stylistic comments:

      (1) I do not think that the heatmaps (as in 1C, top) add much information for the reader. They are largely uniform across the y-axis (to my eyes), and the information is better conveyed by the bar and line graphs (as in 1C, middle and bottom panels).

      We thank the reviewer for this feedback but elect to leave the heatmaps in, on the premise that more data presented is (usually) better. Including the y-axis reveals common patterns such as the lower confidence of the peptide termini, as well as the lack of some patterns that might have occurred. For example, if a subset of five contiguous residues were necessary and sufficient for local high confidence, this could be visually apparent as a “staircase” in the heat map.

      (2) A discussion of some of the shortcomings of other prediction-based software (lines 71-77) might be useful. Why are these tools less well-equipped than AlphaFold for this problem? And if they have tried to predict antibody-antigen interactions, why have they failed?

      We agree with the reviewer that a broader review of multiple methods would be interesting and useful. One challenge is that the suite of available methods is evolving rapidly, though only a subset work for multimeric systems. Some detail on deficiencies of other approaches was provided in lines 71-77 originally, although we did not go into exhaustive detail since we wanted to focus on AF2. We view the use of AF2 in this manner as novel, and we believe that providing additional options to predict antibody epitopes will be of interest to the scientific community. We also chose AF2 because we have ample experience with it and because it is software that many in the scientific community are already using and comfortable with. Additionally, AF2 provided us with a quantification parameter (pLDDT) to assess the peptides’ binding abilities. We think a future study that compares the ability of multiple emerging tools for scFv:peptide prediction will be quite interesting.

      (3) Similar to the above comment, more discussion focused on why AlphaFold2 fails for antibodies (lines 126-128) might be useful for readers.  

      We thank the reviewer for the suggestion. The following line has been added shortly after lines 135-137:

      “Another reason for selecting AF2 is to attempt to quantify its abilities the compare simple linear epitopes, since the team behind AF-multimer reported that conformational antibody complexes were difficult to predict accurately (14).”

      Per earlier responses, we also added text that flags one particular possible reason for the general difficulty of predicting antibody-antigen complexes (the diversity of the CDR loops and associated MSA challenges).

      (4) The first two paragraphs of the results section (lines 226-254) could likely be moved to the Methods. Additionally, details of how the scores are calculated, not just how the commands are run in python, would be useful.

      Per the reviewer suggestion, we moved this section to the end of the Methods section. Also, to aid in the reader’s digestion of the analysis, the following text has been added to the Results section (lines 256-264):

      “Both the ‘Simple Max’ and ‘Consensus’ methods were calculated first by parsing every pLDDT score received by every residue in the antigen sequence sliding window output structures. From the resulting data structure, the Simple Max method simply finds the maximum pLDDT value ever seen for a single residue (across all sliding windows and AF2 models). For the Consensus method, per-residue pLDDT was first averaged across the 5 AF2 models. These averages are reported in the heatmap view, and further averaged per sliding window for the bar chart below.

      In principle, the strategy behind the Consensus method is to take into account agreement across the 5 AF2 models and provide insight into the confidence of entire epitopes (whole sliding windows of n=10 default) instead of disconnected, per-residue pLDDT maxima.” 
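
      The two scoring schemes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PAbFold implementation: the function names (`simple_max`, `consensus`), the input layout (a list of `(start, per_model_scores)` tuples), and the toy values are all hypothetical, and the sketch stops at the per-residue and per-window scores without ranking peptides.

```python
from statistics import mean


def simple_max(windows, antigen_len):
    """Per-residue maximum pLDDT across all sliding windows and all models.

    `windows` is a list of (start, per_model_scores) tuples, where `start` is
    the 0-based antigen position of the window's first residue and
    `per_model_scores` holds one per-residue pLDDT list per model.
    """
    best = [0.0] * antigen_len
    for start, per_model_scores in windows:
        for model_scores in per_model_scores:
            for offset, score in enumerate(model_scores):
                pos = start + offset
                best[pos] = max(best[pos], score)
    return best


def consensus(windows):
    """Per-window score: average each residue over models, then over residues."""
    scores = {}
    for start, per_model_scores in windows:
        # Model-averaged per-residue pLDDT (one heatmap row in this sketch).
        per_residue = [mean(column) for column in zip(*per_model_scores)]
        # Window-level average (one bar in the bar chart in this sketch).
        scores[start] = mean(per_residue)
    return scores


# Toy input: 4-residue antigen, window length 3, 2 models (hypothetical values).
toy = [
    (0, [[50.0, 60.0, 70.0], [70.0, 60.0, 50.0]]),
    (1, [[80.0, 90.0, 40.0], [60.0, 70.0, 50.0]]),
]

print(simple_max(toy, antigen_len=4))
print(consensus(toy))
```

      In this toy input, the per-residue maxima for positions 1 and 2 come from a single model in the second window, while the Consensus score for that window is pulled down by the second model and by the low-confidence final residue, which illustrates why the two methods can rank peptides differently.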

      (5) Figure 1 would be more useful if you could differentiate specifically how the Consensus and Simple Max scoring is different. Providing examples for how and why the top 5 peptide hits can change (quite significantly) using both methods would greatly help readers understand what is going on.

      Per the reviewer suggestion, we have added text to discuss the variable hit selection that results from the two scoring metrics. The new text (lines 264-271) adds onto the added text block immediately above:

      “Having two scoring metrics is useful because the selection of predicted hits can differ. As shown in Figure 2, part of the Myc epitope makes it into the top 5 peptides when selection is based on summing per-residue maximum pLDDT (despite there being no requirement that these values originate in the same physical prediction). In contrast, a Consensus method score more directly reports on a specific sliding window, and the strength of the highest confidence peptides is more directly revealed with superior signal to noise as shown in Figure 3. Variability in the ranking of top hits between the two methods arises from the fundamental difference in strategy (peptide-centric or residue-centric scoring) as well as close competition between the raw AF2 confidence in the known peptide and competing decoy sequences.”

      (6) Hopefully the reproducibility issue is alleviated, but if not the discussion of it (lines 523-554) should be moved to the supplement or an appendix.

      The ability of the original AF2 model to predict protein-protein complexes was an emergent behavior, and then an explicit training goal for AF2.multimer. In this vein, the ability to predict scFv:peptide complexes is also an emergent capability of these models. It is our hope that by highlighting this capacity, as well as its high level of sensitivity, this capability will be enhanced and not degraded in future models/algorithms (both general and specialized). In this regard, with an eye towards progress, we think it is actually important to put this issue in the scientific foreground rather than the background. When it comes to improving machine learning methods, negative results are also exceedingly important.

      Reviewer 2 (Recommendations for the Author):

      - Line 113, page 3 - the structures of the novel scFv chimeras can be rapidly and confidently be predicted by AlphaFold2 to the structures of the novel scFv chimeras can be rapidly and confidently predicted by AlphaFold2.

      The superfluous “be” was removed from the text.

      - Line 276 and 278 page 9 - peptide sequences QKLSEEDLL and EQKLSEEDL in the text are different from the sequences reported in Figures 1 and 2 (QKLISEEDLL and EQKLISEEDL). Please check throughout the manuscript and also in the Figure caption (as in Figure 2).

      These changes were made throughout the text. 

      - I would include how you calculate the pLDDT score for both Simple Max approach and Consensus analysis.

      Good suggestion, this should be covered via the additions noted above.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank the reviewers for the constructive comments, which have improved the manuscript. In response to these comments, we have made the following major changes to the main text and reviewer response:

      (1) Added experimental and computational evidence to support the use of Cut&Tag to determine speckle location.

      (2) Performed new Transmission Electron Microscopy (TEM) experiments to visualize interchromatin granule clusters +/- speckle degradation.

      (3) Altered the text of the manuscript to remove qualitative statements and clarify effect sizes.

      (4) Performed new analyses of published whole genome bisulfite data from LIMe-Hi-C following DNMT1 inhibition to demonstrate that CpG methylation is lost at DNMT1i-specific gained CTCF sites.

      (5) Included citations for relevant literature throughout the text.

      These revisions in addition to others are described in the point-by-point response below.

      Reviewer #1 (Public review):

      Summary

      Roseman et al. use a new inhibitor of the maintenance DNA methyltransferase DNMT1 to probe the role of methylation on binding of the CTCF protein, which is known to be involved in chromatin loop formation. As previously reported, and as expected based on our knowledge that CTCF binding is methylation-sensitive, the authors find that loss of methylation leads to additional CTCF binding sites and increased loop formation. By comparing novel loops with the binding of the pre-mRNA splicing factor SON, which localizes to the nuclear speckle compartment, they propose that these reactivated loops localize near speckles. This behavior is dependent on CTCF whereas degradation of two speckle proteins does not affect CTCF binding or loop formation. The authors propose a model in which DNA methylation controls the association of genome regions with speckles via CTCF-mediated insulation.

      Strengths

      The strengths of the study are 1) the use of a new, specific DNMT1 inhibitor and 2) the observation that genes whose expression is sensitive to DNMT1 inhibition and dependent on CTCF (cluster 2) show higher association with SON than genes that are sensitive to DNMT1 inhibition but CTCF insensitive, which is in line with the authors' general model.

      Weaknesses

      There are a number of significant weaknesses that as a whole undermine many of the key conclusions, including the overall mechanistic model of a direct regulatory role of DNA methylation on CTCF-mediated speckle association of chromatin loops.

      We appreciate the reviewer’s constructive comments and address them point-by-point below.

      (1) The authors frequently make quasi-quantitative statements but do not actually provide the quantitative data, which they actually all have in hand. To give a few examples: "reactivated CTCF sites were largely methylated (p. 4/5), "many CTCF binding motifs enriched..." (p.5), "a large subset of reactivated peaks..."(p.5), "increase in strength upon DNMT1 inhibition" (p.5); "a greater total number....." (p.7). These statements are all made based on actual numbers and the authors should mention the numbers in the text to give an impression of the extent of these changes (see below) and to clarify what the qualitative terms like "largely", "many", "large", and "increase" mean. This is an issue throughout the manuscript and not limited to the above examples.

      Related to this issue, many of the comparisons which the authors interpret to show differences in behavior seem quite minor. For example, visual inspection suggests that the difference in loop strength shown in figure 1E is something like from 0 to 0.1 for K562 cells and a little less for HCT116 cells. What is a positive control here to give a sense of whether these minor changes are relevant? Another example is on p. 7, where the authors claim that CTCF partners of reactivated peaks tend to engage in a "greater number" of looping partners, but inspection of Figure 2A shows a very minor difference from maybe 7 to 7.5 partners. While a Mann-Whitney test may call this difference significant and give a significant P value, likely due to high sample number, it is questionable that this is a biologically relevant difference.

      We have amended the text to include actual values, instead of just qualitative statements. We have also moderated our claims in the text to note where effect sizes are more modest.

      The following literature examples can serve as positive controls for the effect sizes that we might expect when perturbing CTCF. Our observed effect sizes are largely in line with these expected magnitudes.

      https://pmc.ncbi.nlm.nih.gov/articles/PMC8386078/ Fig. 2E

      https://www.cell.com/cell-reports/pdf/S2211-1247(23)01674-1.pdf Fig. 3J,K

      https://academic.oup.com/nar/article/52/18/10934/7740592 Fig. S5D (CTCF binding only).

      (2) The data to support the central claim of localization of reactivated loops to speckles is not overly convincing. The overlap with SON Cut&Tag (figure 2F) is partial at best and although it is better with the publicly available TSA-seq data, the latter is less sensitive than Cut&Tag and more difficult to interpret. It would be helpful to validate these data with FISH experiments to directly demonstrate and measure the association of loops with speckles (see below).

      A recent publication we co-authored validated the use of speckle (SON) Cut&Run using FISH (Yu et al., NSMB 2025, doi: 10.1038/s41594-024-01465-6). This paper also supports a role of CTCF in positioning DNA near speckles. Unfortunately, the resolution of these FISH probes is in the realm of hundreds of kilobases. This was not an issue for Yu et al., as they were looking at large-scale effects of CTCF degradation on positioning near speckles. However, FISH does not provide the resolution we need to look at more localized changes over methylation-specific peak sites.

      Instead, we use Cut&Tag to look at these high-resolution changes. In Figure 3C, we show that SON localizes to DNMT1i-specific peaks only upon DNMT1 inhibition. We further demonstrate that this interaction is dependent on CTCF. In response to reviewer comments, we have now also performed spike-in normalized Cut&Tag upon acute (6 hr) SON degradation to validate that our signal is also directly dependent on SON and not merely due to a bias toward open chromatin.

      Author response image 1.

      TSA-seq has been validated with FISH (Chen et al., doi: 10.1083/jcb.201807108; Alexander et al., doi: 10.1016/j.molcel.2021.03.006, Fig. 6). We include TSA-seq data where possible in our manuscript to support our claims.

      We also note that Fig 2F shows all CTCF peaks and loops, not just methylation-sensitive peaks and loops, to give a sense of the data. We apologize for any confusion and have clarified this in the figure legend.

      (3) It is not clear that the authors have indeed disrupted speckles from cells by degrading SON and SRRM2. Speckles contain a large number of proteins and considering their phase separated nature stronger evidence for their complete removal is needed. Note that the data published in ref 58 suffers from the same caveat.

      Based upon the reviewers’ feedback, we generated transmission electron microscopy (TEM) data to visualize nuclear speckles +/- degradation of SON and SRRM2 (DMSO and dTAG). We were able to detect interchromatin granule clusters (IGCs) that are representative of nuclear speckles in the DMSO condition. However, even at baseline, we observed a large degree of cell-to-cell variability in these structures. In addition, we observed potential structural changes in the distribution of heterochromatin upon speckle degradation. Consequently, we hesitate to make quantitative conclusions regarding loss of these nuclear bodies. In the interest of transparency, we have included representative raw images from both conditions for the reviewers’ consideration.

      We also note that in Ref 58 (Ilik et al., https://doi.org/10.7554/eLife.60579), the authors show diffusion of speckle client proteins RBM25, SRRM1, and PNN upon SON and SRRM2 depletion, further supporting speckle dissociation in these conditions.

      Author response image 2.

      Author response image 3.

      (4) The authors ascribe a direct regulatory role to DNA methylation in controlling the association of some CTCF-mediated loops to speckles (p. 20). However, an active regulatory role of speckle association has not been demonstrated and the observed data are equally explainable by a more parsimonious model in which DNA methylation regulates gene expression via looping and that the association with speckles is merely an indirect bystander effect of the activated genes because we know that active genes are generally associated with speckles. The proposed mechanism of a regulatory role of DNA methylation in controlling speckle association is not convincingly demonstrated by the data. As a consequence, the title of the paper is also misleading.

      While it is difficult to completely rule out indirect effects, we do not believe that the relationship between methylation-sensitive CTCF sites and speckles relies only on gene activity.

      We can partially decouple SON Cut&Tag signal from gene activation if we break down Figure 4D to look only at methylation-sensitive CTCF peaks on genes whose expression is unchanged upon DNMT1 inhibition (using thresholds from manuscript, P-adj > 0.05 and/or |log2(fold-change)| < 0.5). This analysis shows that many methylation-sensitive CTCF peaks on genes with unchanged expression still change speckle association upon DNMT1 inhibition. This result refutes the necessity of transcriptional activation to recruit speckles to CTCF.
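
      The "unchanged expression" filter described above can be sketched as a simple predicate. This is a minimal sketch, assuming that "and/or" is read as a logical OR (a gene counts as unchanged if it fails either criterion). The function name, gene identifiers, and statistics below are hypothetical and for illustration only.

```python
def is_expression_unchanged(padj, log2_fold_change, padj_cutoff=0.05, lfc_cutoff=0.5):
    """Return True when a gene fails either differential-expression criterion.

    Assumption: "and/or" is interpreted as OR, so a gene is unchanged if it is
    not significant (padj > cutoff) or its effect size is small (|log2FC| < cutoff).
    """
    return padj > padj_cutoff or abs(log2_fold_change) < lfc_cutoff


# Hypothetical genes: (adjusted P value, log2 fold-change upon DNMT1 inhibition).
genes = {
    "geneA": (0.20, 1.2),    # large change but not significant -> unchanged
    "geneB": (0.001, 0.3),   # significant but small change -> unchanged
    "geneC": (0.001, 1.5),   # significant and large -> changed (up)
    "geneD": (0.01, -2.0),   # significant and large -> changed (down)
}

unchanged = sorted(
    name for name, (padj, lfc) in genes.items()
    if is_expression_unchanged(padj, lfc)
)
print(unchanged)
```

      Methylation-sensitive CTCF peaks would then be restricted to those overlapping genes in the `unchanged` set before comparing SON signal between conditions.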

      Author response image 4.

      We note the comparator upregulated gene set here is small (~20 genes with our stringent threshold for methylation-sensitive CTCF after 1 day DNMT1i treatment).

      However, we acknowledge that these effects cannot be completely disentangled. We previously included the statement “other features enriched near speckles, such as open chromatin, high GC content, and active gene expression, could instead contribute to increased CTCF binding and looping near speckles” in the discussion. In response to the reviewer’s comment, we have further tempered our statements on page 20/21 and also added a statement noting that DNA demethylation and gene activation cannot be fully disentangled. While we are also open to a title change, we are unsure which part of the title is problematic. 

      (5) As a minor point, the authors imply on p. 15 that ablation of speckles leads to misregulation of genes by altering transcription. This is not shown as the authors only measure RNA abundance, which may be affected by depletion of constitutive splicing factors, but not transcription. The authors would need to show direct effects on transcription.

      We agree, and we have changed this wording to say RNA abundance.

      Reviewer #2 (Public review):

      Summary:

      CTCF is one of the most well-characterized regulators of chromatin architecture in mammals. Given that CTCF is an essential protein, understanding how its binding is regulated is a very active area of research. It has been known for decades that CTCF is sensitive to 5-cytosine DNA methylation (5meC) in certain contexts. Moreover, at genomic imprints and in certain oncogenes, 5meC-mediated CTCF antagonism has very important gene regulatory implications. A number of labs (eg, Schubeler and Stamatoyannopoulos) have assessed the impact of DNA methylation on CTCF binding, but it is important to also interrogate the effect on chromatin organization (ie, looping). Here, Roseman and colleagues used a DNMT1 inhibitor in two established human cancer lines (HCT116 [colon] and K562 [leukemia]), and performed CTCF ChIPseq and HiChIP. They showed that "reactivated" CTCF sites-that is, bound in the absence of 5meC-are enriched in gene bodies, participate in many looping events, and intriguingly, appear associated with nuclear speckles. This last aspect suggests that these reactivated loops might play an important role in increased gene transcription. They showed that a number of genes that are upregulated in the DNA hypomethylated state actually require CTCF binding, which is an important result.

      Strengths:

      Overall, I found the paper to be succinctly written and the data presented clearly. The relationship between CTCF binding in gene bodies and association with nuclear speckles is an interesting result. Another strong point of the paper was combining DNMT1 inhibition with CTCF degradation.

      Weaknesses:

      The most problematic aspect of this paper, in my view, is the insufficient evidence for the association of "reactivated" CTCF binding sites with nuclear speckles, which needs to be more diligently demonstrated (see Major Comment). One unfortunate aspect was that this paper neglected to discuss findings from our recent paper, wherein we also performed CTCF HiChIP in a DNA methylation mutant (Monteagudo-Sanchez et al., 2024 PMID: 39180406). It is true that this is a relatively recent publication, although the BioRxiv version has been available since fall 2023. I do not wish to accuse the authors of actively disregarding our study, but I do insist that they refer to it in a revised version. Moreover, there are a number of differences between the studies such that I find them complementary rather than overlapping. To wit, the species (mouse vs human), the cell type (pluripotent vs human cancer), the use of a CTCF degron, and the conclusions of the paper (we did not make a link with nuclear speckles). Furthermore, we used a constitutive DNMT knockout, which is not viable in most cell types (HCT116 cells being an exception), and in the discussion mentioned the advantage of using degron technology:

      "With high-resolution techniques, such as HiChIP or Micro-C (119-121), a degron system can be coupled with an assessment of the cis-regulatory interactome (118). Such techniques could be adapted for DNA methylation degrons (eg, DNMT1) in differentiated cell types in order to gauge the impact of 5meC on the 3D genome."

      The authors here used a DNMT1 inhibitor, which, for all intents and purposes, is akin to a DNMT1 degron; thus, I was happy to see a study employ such a technique. A comparison between the findings from the two studies would strengthen the current manuscript, in addition to being more ethically responsible.

      We thank the reviewer for the helpful comments, which we address in the point-by-point response below. We sincerely apologize for this oversight in our references. We have included references to your paper in our revised manuscript. It is exciting to see these complementary results! We now include discussion of this work to contextualize the importance of methylation-sensitive CTCF sites and motivate our study.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      To address the above points, the authors should:

      (1) Provide quantitative information in the text on all comparisons and justify that the small differences observed, albeit statistically significant, are biologically relevant. Inclusion of positive controls to give an indication of what types of changes can be expected would be helpful.

      We have added quantitative information to the text, as discussed in the response to public comments above. We also provide literature evidence of expected effect sizes in that response.

      (2) Provide FISH data to a) validate the analysis of comparing looping patterns with SON Cut&Tag data as an indicator of physical association of loops with speckles and b) demonstrate by FISH increased association of some of the CTCF-dependent loops/genes (cluster 2) with speckles upon DNMT1 inhibition.

      Please see response to Reviewer 1 comment #2 above. Unfortunately, FISH will not provide the resolution we need for point a). We have confidence in our use of TSA-seq and Cut&Tag to study SON association with CTCF sites on a genome-wide scale, which would not be possible with individual FISH probes. Specifically, since the submission of our manuscript, several other researchers (Yu et al, Nat. Struct. and Mol. Biol. 2025, Gholamalamdari et al eLife 2025) have leveraged CUT&RUN/CUT&TAG and TSA-seq to map speckle-associated chromatin and have validated these methods with orthogonal imaging-based approaches.

      (3) Demonstrate loss of speckles upon SON or SRRM2 degradation by probing for other speckle components and ideally analysis by electron microscopy, which should show loss of interchromatin granules.

      We have performed TEM in K562 cells +/- SON/SRRM2 degradation. Please see response to Reviewer 1 comment #3. Specifically, interchromatin granule clusters are visible in the TEM images of the DMSO sample (see highlighted example above); however, given the heterogeneity of these structures and the potential global alterations in heterochromatin that may occur following speckle loss, we refrained from drawing quantitative conclusions from these data. We instead include the raw images above.

      (4) The authors should either perform experiments to clearly show whether loop association is transcription dependent or whether association is merely a consequence of gene activation. Alternatively, they should tone down their model ascribing a direct regulatory role of methylation in control of loop association with speckles and also discuss other models. Unless the model is more clearly demonstrated, the title of the paper should be changed to reflect the uncertainty of the central conclusion.

      Please see response to Reviewer 1 comment #4 above.

      (5) The authors should either probe directly for the effect of speckle ablation on transcription or change their wording.

      We have changed our wording to RNA abundance.

      Reviewer #2 (Recommendations for the authors):

      Major:

      ⁃ There was no DNA methylation analysis after inhibitor treatment. Ideally, genome bisulfite sequencing should be performed to show that the DNMT1i-specific CTCF binding sites are indeed unmethylated. But at the very least, a quantitative method should be employed to show the extent to which 5meC levels decrease in the presence of the DNMT1 inhibitor

      Response: We have now included analysis of genome-wide bisulfite information from LIMe-Hi-C (bisulfite Hi-C) in K562 following DNMT1 inhibition. Specifically, we leverage the CpG methylation readout and find that DNMT1i-specific CTCF sites are more methylated than non-responsive CTCF peaks at baseline. In addition, these sites show the greatest decrease in CpG methylation upon 3 days of DNMT1 inhibition. We include a figure detailing these analyses in the supplement (Fig S1E). In addition, we have added CpG methylation genome browser tracks to Fig S1D. In terms of global change, we have found that 3 days of DNMT1 inhibitor treatment reduces methylation to ~1/4 of the baseline level.

      I am not convinced that CUT&Tag is the proper technique to assess SON binding. CUT&Tag only works under stringent conditions (high salt), and can be a problematic assay for non-histone proteins, which bind less well to chromatin. In our experience, even strong binders such as CTCF exhibit a depleted binding profile when compared to ChIP seq data. I would need to be strongly convinced that the analysis presented in figures 2F-J and S2 D-I simply do not represent ATAC signal (ie, default Tn5 activity). For example, SON ChIP Seq, CUT&Tag in the SON degron and/or ATAC seq could be performed. What worries me is that increased chromatin accessibility would also be associated with increased looping, so they have generated artifactual results that are consistent with their model.

      As the reviewer suggested, we have now performed spike-in normalized SON Cut&Tag with DNMT1 inhibition and 6 hours of SON/SRRM2 degradation in our speckle dTAG knockin cell line. These experiments confirm that the SON Cut&Tag signal we see is SON-dependent. If the signal were truly due to artifactual binding, gained peaks would be open irrespective of speckle binding. Instead, we see a clear speckle dependence, as this signal is much lower if SON is degraded.

      Author response image 5.

      Moreover, in our original Cut&Tag experiments, we did not enrich detectable DNA without using the SON antibody (see last 4 samples-IgG controls). This further suggests that our signal is SON-dependent.

      Author response image 6.

      Finally, we see good agreement between Cut&Tag and TSA-seq (Spearman R=0.82). The agreement is particularly strong in the top quadrant, which is most relevant since this is where the non-zero signal is.

      Author response image 7.

      Minor points

      ⁃ Why are HCT116 cells more responsive to treatment than K562 cells? This is something that could be addressed with DNA methylation analysis, for example

      K562 is a broadly hypomethylated cell line (Siegenfeld et al., 2022 https://doi.org/10.1038/s41467-022-31857-5 Fig S2A-C). Thus, there may be less dynamic range to lose methylation compared to HCT116.

      Our results are also consistent with previous results comparing DKO HCT116 and aza-treated K562 cells (Maurano 2015, http://dx.doi.org/10.1016/j.celrep.2015.07.024). They state “In K562 cells, 5-aza-CdR treatment resulted in weaker reactivation than in DKO cells…” In addition, cell-type-specific responsiveness to DNA methyltransferase KO that depends upon global CpG methylation levels has also been observed in ES and EpiLC cells (Monteagudo-Sanchez et al., 2024), which we now comment on in the manuscript.

      ⁃ How many significant CTCF loops in DNMTi, compared to DMSO? It was unclear what the difference in raw totals is.

      We now include a supplemental table with the HiChIP loop information. We call similar numbers of raw loops comparing DNMT1i and DMSO, as only a small subset of loops is changing.

      ⁃ For the architectural stripes, it would be nice to see a representative example in the form of a contact plot. Is that possible to do with the hiChIP data?

      As described in our methods, we called architectural stripes using Stripenn (Yoon et al 2022) from LIMe-Hi-C data under DNMT1i conditions (Siegenfeld et al, 2022). Shown below is a representative example of a stripe in the form of a Hi-C contact map.

      Author response image 8.

      ⁃ Here 4-10x more DNMT1i-specific CTCF binding sites were observed than we saw in our study. What are the thresholds? Could the thresholds for DNMT1i-specific peaks be defined more clearly? For what it's worth, we defined our DNMT KO-specific peaks as fold-change ≥ 2, adjusted P < 0.05. The scatterplots (1B) indicate a lot of "small" peaks being called "reactivated."

      We called DNMT1i-specific peaks using HOMER's getDifferentialPeaksReplicates function, requiring fold change > 2 and padj < 0.05. We further restricted these peaks to those that were not called in the DMSO condition.
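
      The three criteria stated above can be sketched as a simple filter. This is a minimal illustration only, not the HOMER implementation and not our analysis code; the `Peak` record, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    """Hypothetical record for one differential-binding result."""
    peak_id: str
    fold_change: float    # DNMT1i signal over DMSO signal
    padj: float           # adjusted p-value
    called_in_dmso: bool  # True if a peak was also called in DMSO


def is_dnmt1i_specific(peak, min_fold_change=2.0, max_padj=0.05):
    """Apply the stated thresholds: fold change > 2, padj < 0.05, absent in DMSO."""
    return (
        peak.fold_change > min_fold_change
        and peak.padj < max_padj
        and not peak.called_in_dmso
    )


# Made-up example values, chosen so that each criterion excludes one peak.
peaks = [
    Peak("peak_a", 3.1, 0.001, False),  # passes all three criteria
    Peak("peak_b", 3.1, 0.001, True),   # excluded: also called in DMSO
    Peak("peak_c", 1.5, 0.001, False),  # excluded: fold change too small
    Peak("peak_d", 4.0, 0.20, False),   # excluded: not significant
]
specific_ids = [p.peak_id for p in peaks if is_dnmt1i_specific(p)]
print(specific_ids)
```

      Note that the fold-change comparison here is strict (> 2), as stated in the response, whereas the reviewer's own definition uses ≥ 2.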

      ⁃ On this note, is "reactivated" the proper term? Reactivated with regards to what? A prior cell state? I think DNMT1i-specific is a safer descriptor.

      We chose this term based on prior literature (Maurano 2015 http://dx.doi.org/10.1016/j.celrep.2015.07.024, Spracklin 2023 https://doi.org/10.1038/s41594-022-00892-7). However, we agree it is not very clear, so we’ve altered the text to say “DNMT1i-specific”. We thank the reviewer for suggesting this improved terminology.

      ⁃ It appears there is a relatively small enrichment for CTCF peaks (of any class) in intergenic regions. How were intergenic regions defined? For us, it is virtually half of the genome. We did some enrichment of DNMT KO-specific peaks in gene bodies (our Supplemental Figure 1C), but a substantial proportion were still intergenic.

      We defined intergenic peaks using HOMER’s annotatepeaks function, with the -gtf option using Ensembl gene annotations (v104). We used the standard annotatepeaks priority order, which is TSS > TTS > CDS exons > 5’ UTR exons > 3’ UTR exons > introns > intergenic.

      Maurano et al. 2015 (http://dx.doi.org/10.1016/j.celrep.2015.07.024) also found reduced representation of intergenic sites among demethylation-reactivated CTCF sites in their Fig S5A. We note this is not a perfect comparison because their data are displayed as a fraction of all intergenic peaks.
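
      To make the consequence of a priority order concrete, the sketch below resolves a peak that overlaps several feature classes to the single highest-priority class. It only illustrates the ordering listed above and is not HOMER's code; the category labels and the function name are hypothetical.

```python
# Priority order as listed in the response, highest priority first.
PRIORITY = [
    "TSS",
    "TTS",
    "CDS exon",
    "5' UTR exon",
    "3' UTR exon",
    "intron",
    "intergenic",
]
RANK = {name: index for index, name in enumerate(PRIORITY)}


def assign_annotation(overlapping_features):
    """Return the highest-priority feature class among those a peak overlaps.

    A peak that overlaps no annotated feature is labelled intergenic.
    """
    candidates = [f for f in overlapping_features if f in RANK]
    if not candidates:
        return "intergenic"
    return min(candidates, key=RANK.__getitem__)


print(assign_annotation(["intron", "TSS"]))
print(assign_annotation([]))
```

      Under such a scheme, "intergenic" is assigned only when no other class applies, so the intergenic fraction depends on the gene annotation used and on how the other classes are defined.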

      ⁃ We also recently published a review on this subject: The impact of DNA methylation on CTCF-mediated 3D genome organization NSMB 2024 (PMID: 38499830) which could be cited if the authors choose.

      We have cited this relevant review.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Prior research indicates that NaV1.2 and NaV1.6 have different compartmental distributions, expression timelines in development, and roles in neuron function. The lack of subtype-specific tools to control Nav1.2 and Nav1.6 activity however has hampered efforts to define the role of each channel in neuronal behavior. The authors attempt to address the problem of subtype specificity here by using aryl sulfonamides (ASCs) to stabilize channels in the inactivated state in combination with mice carrying a mutation that renders NaV1.2 and/or NaV1.6 genetically resistant to the drug. Using this innovative approach, the authors find that action potential initiation is controlled by NaV1.6 while both NaV1.2 and NaV1.6 are involved in backpropagation of the action potential to the soma, corroborating previous findings. Additionally, NaV1.2 inhibition paradoxically increases the firing rate, as has also been observed in genetic knockout models. Finally, the potential anticonvulsant properties of ASCs were tested. NaV1.6 inhibition but not NaV1.2 inhibition was found to decrease action potential firing in prefrontal cortex layer 5b pyramidal neurons in response to current injections designed to mimic inputs during seizure. This result is consistent with studies of loss-of-function Nav1.6 models and knockdown studies showing that these animals are resistant to certain seizure types. These results lend further support for the therapeutic promise of activity-dependent, NaV1.6-selective, inhibitors for epilepsy.

      Strengths:

      (1) The chemogenetic approaches used to achieve selective inhibition of NaV1.2 and NaV1.6 are innovative and help resolve long-standing questions regarding the role of Nav1.2 and Nav1.6 in neuronal electrogenesis.

      (2) The experimental design is overall rigorous, with appropriate controls included.

      (3) The assays to elucidate the effects of channel inactivation on typical and seizure-like activity were well selected.

      Weaknesses:

      (1) The potential impact of the YW->SR mutation in the voltage sensor does not appear to have been sufficiently assessed. The activation/inactivation curves in Figure 1E show differences in both activation and inactivation at physiologically relevant membrane voltages, which may be significant even though the V1/2 and slope factors are roughly similar.

      We have performed new experiments testing how YW->SR mutations affect spiking on their own. The reviewer’s intuition was correct; the small changes in voltage-dependence in NaV1.6 identified in heterologous expression systems translated into a ~2 mV hyperpolarization in threshold in neurons.

      (2) Additional discussion of the fact that channels are only partially blocked by the ASC and that ASCs act in a use-dependent manner would improve the manuscript and help readers interpret these results.

      We have updated text extensively to address this concern. Details are found in the author suggestions below.

      (3) NaV1.6 was described as being exclusively responsible for the change in action potential threshold, but when NaV1.6 alone was inactivated, the effect was significantly reduced from the condition in which both channels were inactivated (Figure 4E). Similarly, Figure 6C shows that blockade of both channels causes threshold depolarization prior to the seizure-like event, but selective inactivation of NaV1.6 does not. As NaV1.2 does not appear to be involved in action potential initiation and threshold change, what is the mechanism of this dissimilarity between the NaV1.6 inactivation and combined NaV1.6/ NaV1.2 inactivation?

      We believe the dissimilarity is due to interactions between NaV1.2 and other channel classes (e.g., potassium channels) throughout the cell, including the somatodendritic domain. The NaV1.6 channels that initiate APs, localized to the AIS, do not operate in isolation, and AP threshold can be affected by the recent membrane potential history. Loss of NaV1.2-mediated depolarization in the dendrites begets less potassium channel-mediated repolarization, as described in Figure 4.

      (4) The idea that use-dependent VGSC-acting drugs may be effective antiseizure medications is well established. Additional discussion or at least acknowledgement of the existing, widely used, use-dependent VGSC drugs should be included (e.g. Carbamazepine, Lamotrigine, Phenytoin). Also, the idea that targeting NaV1.6 may be effective for seizures is established by studies using genetic models, knockdown, and partially selective pharmacology (e.g. NBI-921352). Additional discussion of how the results reported here are consistent with or differ from studies using these alternative approaches would improve the discussion.

      We agree; the concept of use-dependent block as a means to treat seizure is not new, and we have updated the discussion to include commentary on other medications currently in use. What is new here is our ability to explore the role of NaV1.2 and NaV1.6 in electrogenesis with a level of drug selectivity that could not be achieved without the addition of the YW->SR mutations. This approach in itself will not be useful in the clinic, but it may help guide drug design in the future. One major interpretation of this work is that NaV1.6 block is more effective than NaV1.2 block in general, and may even be effective for non-SCN8A genetic conditions. This is indeed one of the reasons, we believe, that drugs like NBI-921352, itself an aryl sulfonamide, are being tested in seizure models.

      Reviewer #2 (Public review):

      The authors used a clever and powerful approach to explore how Nav1.2 and Nav1.6 channels, which are both present in neocortical pyramidal neurons, differentially control firing properties of the neurons. Overall, the approach worked very well, and the results show very interesting differences when one or the other channel is partially inhibited. The experimental data are solid and are very nicely complemented by a computational model incorporating the different localization of the two types of sodium channels.

      In my opinion the presentation and interpretation of the results could be improved by a more thorough discussion of the fact that only incomplete inhibition of the channels can be achieved by the inhibitor under physiological recording conditions and I thought the paper could be easier to digest if the figures were re-organized. However, the key results are well-documented.

      This is a concern raised by multiple reviewers, and we thank you all for your help in improving the way in which we discuss the results. We have revised the manuscript extensively, moving figures around per your advice and the advice of R1 in their comments to authors.

      Reviewer #3 (Public review):

      Summary:

      The authors used powerful and novel reagents to carefully assess the roles of the voltage gated sodium channel (NaV) isoforms in regulating the neural excitability of principal neurons of the cerebral cortex. Using this approach, they were able to confirm that two different isoforms, NaV1.2 and NaV1.6 have distinct roles in electrogenesis of neocortical pyramidal neurons.

      Strengths:

      Development of very powerful transgenic mice in which NaV1.2 and/or NaV1.6 were modified to be insensitive to ASCs, a particular class of NaV blocker. This allowed them to test for roles of the two isoforms in an acute setting, without concerns of genetic or functional compensation that might result from a NaV channel knockout.

      Careful biophysical analysis of ASC effects on different NaV isoforms.

      Extensive and rigorous analysis of electrogenesis - action potential production - under conditions of blockade of either NaV1.2 or NaV1.6 or both.

      Weaknesses:

      Some results are overstated in that the representative example records provided do not directly support the conclusions.

      We have swapped out example records to better capture the median effect observed and to better capture our discussion of these results. Please see below, in recommendations for authors, for details.

      Results from a computational model are provided to make predictions of outcomes, but the computational approach is highly underdeveloped.

      Modeling has been elaborated upon extensively, with more detail in methods, a new sensitivity analysis supplemental figure, and a deposition into ModelDB. Please see below, in recommendations for authors, for details.

      Reviewer #1 (Recommendations for the authors):

      Regarding the concern about the potential impact of the YW->SR mutation: All results in Figures 2-6 report only within-subject changes before and after drug-activating protocols. These results show that the drug has no effect on the mutant channel, but whether the mutant channel itself has any effect on neuronal properties is not clear. This deficiency could be rectified by reporting raw values for AP threshold, spike rate, etc. in the pre-drug condition and statistically analyzing the apparent differences in the activation/inactivation curves.

      Our original submission only included data recorded in the presence of GNE-4076. We now present new data showing how the YW->SR mutation affects baseline activity of neurons. These data are in Supplemental Figure 1. Compared to wildtype (no drug control) neurons, we observe no change in peak dV/dt. However, threshold is hyperpolarized by approximately 2 mV in dual knockin neurons (median values: -57.4 mV for dual knockin and -55 mV for wildtype). This is consistent with measures from heterologously expressed channels, where we observed somewhat subtle shifts in voltage-dependence of inactivation and activation in NaV1.6 as a result of YW->SR incorporation.

      In addition to these data, we also include the baseline dataset from Figure 3, where GNE-4076 is present throughout recording, and report that neither threshold nor peak dV/dt are influenced by the presence of GNE at baseline. This suggests that any drug binding at baseline (i.e., before firing APs via somatic current injection) is negligible, consistent with the concept that GNE-4076 has low affinity for the closed channel state.

      Minor Comments:

      While the single-cell response to "seizure-like" input aptly demonstrates the change in action potential threshold and firing rate induced by NaV1.6 inhibition, this component of the paper could be enhanced by a network-level assay that assesses the impact of this drug on an actual seizure-like event in acute slices or on seizure susceptibility in vivo.

      This is an excellent thought, and the work near the end of this manuscript is an effort to mimic network-like activity in a controlled way in single cells. To expand this to bona fide seizure-like activity in acute slices or in vivo is something that we are considering for future studies. To do this properly requires extensive validation of dosing and seizure induction that will require several years’ effort.

      Fig 1e caption says "circles" but the markers are squares

      This has been corrected, thank you for catching it.

      Color scheme in S2B is not intuitive to me

      We’ve now updated the caption to better describe the color scheme used within.

      Fig S2: graph or show change in threshold

      Empirical threshold data are in main figure 3D. Changes in threshold related to modeling are now included in a new sensitivity analysis that is in a new Supplemental Figure 2.

      Fig 3A example of NaV1.6 inhibition does not show change in AP threshold apparent in the aggregate data

      We have updated the representative example to better illustrate the change in AP threshold for NaV1.6 inhibition.

      "AP initiation is mediated exclusively by NaV1.6" not corroborated by data; APs still occur when NaV1.6 is inhibited

      This was an over-interpretation of our data, indeed. We have updated the language to be more accurate to the following: “AP threshold and AP initiation appears to be initiated in an NaV1.6-rich region in control conditions; when NaV1.6 is inhibited, APs can occur at more depolarized potentials, likely mediated predominately by NaV1.2.”

      Fig S3C missing WT/Scn8aSR/SR significance marking. Chosen example makes it look like there is a small decrease.

      Please note that there is no difference between these two conditions in delta dV/dt for the AIS inflection point (p = 0.4344).

      Reviewer #2 (Recommendations for the authors):

      This manuscript presents a clever and powerful approach to examining differential roles of Nav1.2 and Nav1.6 channels in pyramidal cell excitability, by engineering mice in which a sulfonamide inhibitor of both channels has reduced affinity for one or the other. Overall, the results in the manuscript are interesting and give important information about differential roles of Nav1.6 and Nav1.2 channels.

      The paper makes an important contribution to better understanding distinct roles of Nav1.2 and Nav1.6 channels. This improved understanding could help guide design of anti-seizure drugs targeted to sodium channels.

      Having made it clear that I think this is an important and impressive piece of work for which the authors should be congratulated, I found reading and interpreting the manuscript a frustrating experience. I will be blunt about the ways in which I found the presentation and discussion to be frustrating and even annoying, in the spirit of frank feedback by one interested and appreciative reader that the authors can consider or reject as they wish.

      From the start, I had the feeling that the authors were presenting and discussing the results in a sanitized "never-mind-about the details" fashion such as might be appropriate for a seminar to a general audience not interested in details, but not appropriate for a research paper.

      Our intent certainly was not to frustrate or annoy readers. We are very grateful that you have provided these comments, which have certainly improved the manuscript, hopefully mitigating some of the frustration for future readers. We appreciate that there are complex drug and voltage effects occurring within these studies, and in an effort to distill these effects into digestible prose, we appear to have been too earnest. We have expanded on the requested topics below and please note that, for the aficionados, every figure displays individual data. Further, we have made a special effort to ensure that features of excitability are presented throughout the drug and manipulation timecourse, including time-points before and after periods subject to statistical comparison, so that the reader may draw their own conclusions.

      General:

      There were two major ways in which I found the presentation and discussion frustrating and even annoying: First, not clearly discussing early in the presentation the fact that it is impossible to achieve complete inhibition with this agent during measurements of physiological firing and second, presenting so much of the effects as deltas of various parameters rather than showing effects on absolute values of the parameters.

      Our response to the first issue follows the next comment, as it relates to this statement. Regarding the use of deltas rather than absolute values for changes in threshold and dV/dt across figures: every cell has a unique AP threshold and peak dV/dt, and we found that displaying data zeroed to baseline values best illustrated the effects of GNE-4076. Without this, GNE-based effects could be buried within the cell-to-cell variability. This helped most when trying to make the case that threshold was unaffected in 2a/8a YW->SR knockin animals. We continue to believe that this is the best way to display the data in the primary figures, but to provide a more complete account, we now present absolute values in supplemental tables and supplemental figures.
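
      The reasoning behind baseline-zeroed deltas can be illustrated with a small sketch. The numbers below are invented for illustration and are not measurements from this study; the function name is hypothetical.

```python
from statistics import mean, pstdev


def baseline_zeroed(baseline_values, post_values):
    """Return per-cell changes (post minus baseline) for paired measurements."""
    if len(baseline_values) != len(post_values):
        raise ValueError("Each cell needs one baseline and one post value.")
    return [post - base for base, post in zip(baseline_values, post_values)]


# Hypothetical AP thresholds (mV) for four cells, before and after a manipulation.
baseline_mv = [-57.0, -52.0, -60.0, -55.0]
post_mv = [-55.0, -50.5, -58.0, -53.5]

delta_mv = baseline_zeroed(baseline_mv, post_mv)

# Spread between cells at baseline versus spread of the within-cell change.
print("baseline spread (mV):", pstdev(baseline_mv))
print("mean delta (mV):", mean(delta_mv))
print("delta spread (mV):", pstdev(delta_mv))
```

      In this toy example the between-cell spread at baseline is larger than the mean shift, so the shift would be hard to see in absolute values, whereas the per-cell deltas cluster tightly around it.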

      The first issue, the incomplete inhibition by the agent, was the most annoying because the authors obviously thought a lot about this and even closed the paper by proposing this as a positive feature of this class of inhibitors, yet discussed it only piecemeal - and with most of the key experimental data in the Supplement. There are two fundamental characteristics of this (and other) sulfonamide inhibitors that complicate interpretation of experiments, especially when applied in a slice experiment: they only bind to the channel when the channel is depolarized, and even when the channel is depolarized for many seconds, bind very slowly to the channel.

      That makes it almost impossible to know exactly what fraction of channels is being inhibited during measurements of firing. Obviously, the authors are well-aware of this issue and they allude to it and even make use of it in some of the protocols, but they never really discuss it in a very clear manner.

      We agree that it is impossible to know the precise fraction of channels inhibited in acute slice preparations. But the reason for this is likely different than what has been interpreted by this reviewer. To state that ASMs “only bind to the channel when the channel is depolarized, and even when the channel is depolarized for many seconds, bind very slowly to the channel” is not consistent with prior data on ASM–channel interactions. Clarification on these points may help the reviewer and a broader audience better understand the effects occurring here, and we appreciate being able to address this concept both here and by revising the manuscript.

      First, ASMs bind activated channels and stabilize the inactivated state. It is correct that channels are more likely to enter these states when subject to voltage depolarization, but channel state is stochastic, and channels can enter activated states near resting membrane potentials. The on-rate is fast enough that channels are blocked immediately in recordings in heterologous systems (Figure 1C). It is more likely that channel biophysical state stochasticity, along with the drug concentration used herein, dictates the rate at which channels accumulate block during repetitive spiking.

      To address this in text, we have revised the 3rd paragraph of the introduction to better incorporate these ideas. This also helps with comments in the reviewer paragraph below.

      The key experimental data on this is relegated to the Supplemental Figures. When the reader is first shown results of the effects of the inhibitor on firing in Fig 2, the presentation has been set up as if everything is perfect, and the inhibitor will be completely inhibiting either both or only one channel according to the mouse. With this presentation, it is then exceptionally striking that the cell in the middle panel of Fig 2A, labeled "Nav1.2/1.6 Inhibited" is firing action potentials very nicely even with both channels "inhibited". For a reader not already aware that there is likely only partial inhibition of each channel, the reaction will be "Huh? Shouldn't blocking both channels simply completely block excitability?". The authors do preface Fig 2 by a very brief allusion to the incomplete inhibition: "In spiking neurons, ASCs would therefore be predicted to exhibit use-dependence, progressively blocking channels in proportion to a neuron's activity rate" but this comes out of nowhere after the over-simplified picture of complete inhibition up to that point, and without any estimation of how much inhibition there is likely to be before activity, or how much induction of inhibition there is likely to be during the activity. Without this, interpreting the data in Fig 2 is basically impossible.

      The key experimental data on this issue is really in Supplemental Figures 1-2 and Fig 4, and I found myself immediately ping-ponging back and forth between the Supplemental figures and the main text trying to understand what is going on with the partial inhibition. This was frustrating.

      Thank you for these suggestions; they help with readability appreciably. We have re-organized the figures presented in the manuscript and emphasized details about ASCs to ensure readers can discern between near-complete blockade of channels (Figures 1-4) and activity-dependent ASC onboarding (Figures 5-7). We now present near-complete block experiments first, detailing the current clamp-> voltage clamp (-12 mV)-> current clamp experiments. We incorporated Supp. Fig. 1 into main Figure 1 and moved Supp. Fig. 2 into main Fig. 2.

      As the reviewer notes, there are clear time-dependent effects on channel function when stepping to -12 mV, independent of GNE-4076 block. As stated previously, “We therefore focused on the 12-20 sec after voltage-clamp offset for subsequent analysis, as it is a period in which most channel-intrinsic recovery has occurred, but also a period in which we would still expect significant block from GNE-4076.” We hope that reordering the manuscript as suggested and placing these results near the beginning will help readers distinguish between near-complete block and activity-dependent onboarding. By beginning with these experiments, which underscore that 100% block cannot be studied without “contamination” from native slow inactivation, we hope that readers can better understand why the experiments were done as presented.

      In my opinion, the paper would be greatly improved by a detailed discussion of the voltage- and time-dependence of the inhibitor at the very beginning of the paper. For me, reading and digesting the paper would have been far easier if Fig 1 included a discussion of the voltage- and time-dependence of inhibition, and next Figs were then Supplemental Figs 1-2, and main Fig 4. The key questions are: how much inhibition is there before a 10-s current injection from the resting potential, and how much additional inhibition is there produced during either the 10-s bout of firing or the "on-boarding" depolarization protocol, and how long does that additional inhibition last? The most direct information on that is in the plots in Fig. 4D and Fig 4F in combination with Supplemental Fig 1, which shows that the on-boarding depolarization reduces current to about 30% of current before on-boarding. This is so central to the interpretation of all the results that I think Supp Fig 1 should be in the main paper as the first piece of data in neurons.

      We originally had the nucleated patch data in supplement due to space constraints in an already large figure 1. Based on your recommendation we have moved it to the main figure. We have also changed the ordering of the paper and related figures to present data as suggested. Hopefully this better guides readers through the questions you are raising above, which are addressed in the (now reordered) figures mentioned above.

      Specific:

      (1) Fig.1 I can find no information on the voltage protocol used to generate the dose-response curves. In the literature characterizing sulfonamide blockers, most protocols use very unphysiological strong, long depolarization to induce inhibition, usually with equally unphysiological short hyperpolarizations to produce recovery from inactivation. One assumes something like that was used here. Obviously, the protocol needs to be explained.

      We updated the methods section to better describe the voltage protocol used to generate the dose response curves. In contrast to the literature characterizing sulfonamide blockers, we used pulses that closely mimic physiological activation from -80 mV (rest) to 0 mV (depolarized) for 20 msec. GNE-4076 was perfused onto cells at increasing concentrations throughout the experiment. At each successive dose, cells were held at 0 mV to allow adequate GNE-4076 onboarding.

      (2) Supp Fig1. This shows the effect of depolarization to enhance inhibition, but not how much inhibition there was before the depolarization. Presumably, there were measurements during the application of drug? How much inhibition is there before the depolarization? Why does the time only go to 20-s, when the times in Figs 4 go to 10 minutes?

      Nucleated patch recordings are notoriously difficult to maintain for long durations, especially when subjecting the patch to large voltage deflections. These recordings extend to 20s recovery periods because that is the duration for which we maintained all recordings, though some exhibited rather impressive longevity and allowed for several minutes of recording thereafter. Regardless, the goal here was to assess block within the 12-20 sec recovery window we utilized in current clamp recordings from intact neurons. This was achieved.

      Please note that GNE-4076 was present throughout all recordings. This was in part due to time constraints, as we could not maintain patches long enough to also perform wash-in. The degree of inhibition can be inferred by comparing peak dV/dt and threshold of cells in the absence and presence of GNE-4076. These data are presented in a new Supplemental figure 1, showing no difference in threshold or peak dV/dt.

      (3) Fig. 4. Similar question here - this is a very nice and informative figure, but we see only the delta in threshold and dv/dt, but how were the initial absolute values different in the drug compared to control?

      These data are presented in a new Supplemental Figure 1, showing no difference in threshold or peak dV/dt.

      (4) Fig 2. As far as I can tell, we have no idea how much inhibition there is at rest, before the current injection -what is the dv/dt in the drug compared to in the control? Were there experiments in which the current injections were delivered before and after applying drug? If not, at least it would be useful to see population data on dv/dt of the first spike in control and with drug.

      These data are presented in a new Supplemental Figure 1, showing no difference in threshold or peak dV/dt.

      (5). Fig. 2. Do the authors have any quantitative information on how much extra inhibition would be produced at 200 nM drug using physiological waveforms of firing?

      These types of analyses are part of later figures using EPSC-like waveforms to evoke spiking.

      I was unconvinced that the changes in threshold and dv/dt during the firing in the drug necessarily represent time-dependent use-dependent effects of drug. Partial inhibition by TTX would probably produce greater progressive changes in spike shape and reduced ability to fire robustly.

      TTX is not use-dependent, so it is a good contrast to GNE-4076. We experimented with a few cells at 2 and 10 nM TTX and found that the concentrations required to mimic the block of spiking that occurs with 200 nM GNE-4076 in WT cells were associated with a marked use-independent elevation in AP threshold, with an inability to maintain ~10 Hz spiking rates with the baseline EPSC-like stimulation pattern. These effects are very different from those produced by GNE-4076, but were expected given the use-independence of TTX. We did not pursue this line of inquiry fully, so we present these data only as individual examples in the reviewer figure below:

      Author response image 1.

      Data from Figure 6B, D, E are replicated here with individual lines of 2 nM and 10 nM TTX shown in dashed lines. Note marked changes in threshold not observed with GNE-4076. TTX sourced from Alomone Labs.

      Minor:

      p. 5 and elsewhere: it seems unnecessary to give values of threshold and dv/dt to three decimal places, especially when the precision is not better than a single decimal place.

      We have reduced unnecessary precision throughout.

      Reviewer #3 (Recommendations for the authors):

      The computational model is highly underdeveloped. Without more rigorous development, the results of the computational model appear to provide little additional insight beyond that expected from the known axodendritic localizations of NaV 1.2 and 1.6. If the authors wish to use the computational results to make rigorous predictions, then this section needs either to be expanded to be more complete and promoted to a regular figure, with full details of the model and how it was evaluated for accuracy. Alternatively, this point regarding computational insight could be de-emphasized and/or removed from the paper.

      Modeling:

      (1) I don't see any methods describing the precise model parameters that were used.

      Apologies, this is a model that we have built and tested extensively over the years (PMID: 38290518, 35417922, 34348157, 31995133, 31230762, 28256214), though there have been some small updates over these works. We have deposited this model at ModelDB and provide data there regarding model construction (access #2019342).

      (2) There appears to be no robustness test to assess whether the particular results/conclusions were unduly dependent on particular model construction decisions.

      We have now generated a new supplemental figure 2 that explores the robustness of these observations to changes in NaV1.2 and NaV1.6 position within the AIS and changes in relative density of NaV1.2 and NaV1.6. As shown there, the model is tolerant to all but extreme, non-physiological manipulations to these parameters.

      (3) Figure S2 does not really provide convincing evidence of a biologically relevant model. Probably the model itself needs to be redesigned to better replicate the biological response and be validated by testing parameter sensitivity.

      a) All of the results in S2C show that there is a huge reduction in the first action potential (black?) followed by relatively little change in subsequent spikes. This is not seen in any of the models. The progressive changes in threshold as predicted by the model for dual and NaV1.6 block are not at all evident in the results of C, except perhaps for the very first and the very last spikes.

      b) The baseline action potential in B is different than the recorded action potentials. In particular, the somatic depolarization occurs much later and over a more extended time frame than the real neuron, and the phase plot shows an actual dip in depolarization at the transition to the somatic spike, which is not representative of naturally occurring action potentials.

      To address both (a) and (b), please note that in empirical experiments there are two parallel processes occurring: block by GNE-4076 and channel recovery from inactivation. In the model we can isolate the effects of block to test that parameter fully and in isolation. This is something that we could never achieve biologically. The important take home here in both cases is to observe that with NaV1.6 block there is a change in threshold, whereas with NaV1.2 block there is none.

      (4) The one finding that seems to be robust is that the changes in NaV1.2 have little effect on threshold.

      Yes! This is a major take-home message from both the model and the use of these knockin mice in combination with GNE-4076. In mature pyramidal cells, NaV1.6 is the major determinant of AP threshold. To editorialize on this observation, changes in threshold are a useful metric for testing whether other pharmacological agents are truly selective for NaV1.2 over NaV1.6. We note that phrixotoxin-3, which is described as NaV1.2 specific in multiple papers, was never tested for specificity over NaV1.6 in its original description, and we find that it fails this test in our hands.

      Data presentation:

      (1) The phase plots in Figure 3B (left and right) appear to be visually identical, and as such don't strongly support any particular conclusion.

      We changed the representative example record (specifically for Fig. 3A-B) to more directly support the conclusions.

      (2) It is unclear to me what is meant by AP speed (title of Figure 3 legend). Do the authors mean propagation speed along the axon, or perhaps the rate of action potential firing?

      Apologies, we are referencing dV/dt when we mention AP speed. We updated AP speed to AP velocity throughout the manuscript.
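
As a rough illustration of the quantity referred to here, peak dV/dt can be estimated from a sampled voltage trace by finite differences. The sketch below is not the analysis code used in the manuscript: the function names, the example trace, and the 15 V/s threshold criterion are assumptions chosen only for illustration.

```python
def dvdt(trace_mv, dt_ms):
    """Forward-difference derivative of a voltage trace, in mV/ms (equal to V/s)."""
    return [(b - a) / dt_ms for a, b in zip(trace_mv, trace_mv[1:])]


def peak_dvdt(trace_mv, dt_ms):
    """Largest rate of rise in the trace (the 'AP velocity' metric)."""
    return max(dvdt(trace_mv, dt_ms))


def threshold_mv(trace_mv, dt_ms, criterion_v_per_s=15.0):
    """Voltage at the first sample whose forward slope meets the criterion.

    The 15 V/s default is an assumed, illustrative criterion, not a value
    taken from the manuscript. Returns None if the criterion is never met.
    """
    for v, slope in zip(trace_mv, dvdt(trace_mv, dt_ms)):
        if slope >= criterion_v_per_s:
            return v
    return None


# Made-up samples (mV) at an assumed 0.5 ms sampling interval.
trace = [-70.0, -70.0, -60.0, -30.0, 10.0, 20.0]
print(peak_dvdt(trace, 0.5), threshold_mv(trace, 0.5))
```

A phase plot, as discussed for Figure 3B, is simply the derivative from `dvdt` plotted against the corresponding voltage samples.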

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank the reviewers and editor for their positive view of the manuscript and for their constructive and valuable comments. Below, we address the suggestions of the reviewers.

      Reviewer #1 (Public Review):

      (1) It will be interesting to monitor the levels of another MIM insertase namely, OXA1. This will help to understand whether some of the observed changes in levels of OXPHOS subunits are related to alterations in the amounts of this insertase.

      OXA1 was not detected in the untargeted mass spectrometry analysis, most likely due to the fact that it is a polytopic membrane protein, spanning the membrane five times (1,2). Consequently, we measured OXA1 levels with immunoblotting, comparing patient fibroblast cells to the HC. No significant change in OXA1 steady state levels was observed.

      These results are now displayed (Fig. S3B and C) and discussed in the revised manuscript.

      Figure 3: How do the authors explain that although TIMM17 and TIMM23 were found to be significantly reduced by Western analysis they were not detected as such by the Mass Spec. method?

      The untargeted mass spectrometry in the current study failed to detect TIMM17 in both patient fibroblasts and mouse neurons, while TIMM23 was detected only in mouse neurons, where a decrease was observed that did not reach significance. This is most likely due to the fact that TIMM17 and TIMM23 are both polytopic membrane proteins, spanning the membrane four times, which makes it difficult to extract them in quantities suitable for MS detection (2,3).

      (2) How do the authors explain the higher levels of some proteins in the TIMM50 mutated cells?

      The levels of fully functional TIM23 complex are decreased in patients' fibroblasts. Therefore, the mechanism by which the steady state level of some TIM23 substrate proteins is increased can only be explained by events that occur outside the mitochondria. These could include increases in transcription, translation or post-translational modifications, all of which may increase their steady state level, albeit with a decrease in the steady state level of the import complex.

      (3) Can the authors elaborate on why mutated cells are impaired in their ability to switch their energetic emphasis to glycolysis when needed?

      Cellular regulation of the metabolic switch to glycolysis occurs via two known pathways: 1) Activation of AMP-activated protein kinase (AMPK) by increased levels of AMP/ADP (4). 2) Inhibition of pyruvate dehydrogenase (PDH) complexes by pyruvate dehydrogenase kinases (PDK) (5). Therefore, changes in the steady state levels of any of these regulators could push the cells towards anaerobic energy production, when needed. In our model systems, we did not observe changes in any of the AMPK, PDH or PDK subunits that were detected in our untargeted mass spectrometry analysis (see volcano plots below, no PDK subunits were detected in patient fibroblasts). Although this doesn’t directly explain why the cells have an impaired ability to switch their energetic emphasis, it does possibly explain why the switch did not occur de facto.

      Author response image 1.

      Reviewer #2 (Public Review):

      (1) The authors claim in the abstract, the introduction, and the discussion that TIMM50 and the TIM23 translocase might not be relevant for mitochondrial protein import in mammals. This is misleading and certainly wrong!!!

      Indeed, it was not our intention to claim that the TIM23 complex might not be relevant. We have now rewritten the relevant parts to convey the correct message:

      Abstract –

      Line 25 - “Strikingly, TIMM50 deficiency had no impact on the steady state levels of most of its putative substrates, suggesting that even low levels of a functional TIM23 complex are sufficient to maintain the majority of complex-dependent mitochondrial proteome.”

      Introduction –

      Line 87 - Surprisingly, functional and physiological analysis points to the possibility that low levels of TIM23 complex core subunits (TIMM50, TIMM17 and TIMM23) are sufficient for maintaining steady-state levels of most presequence-containing proteins. However, the reduced TIM23CORE component levels do affect some critical mitochondrial properties and neuronal activity.

      Discussion –

      Line 339 – “…surprising, as normal TIM23 complex levels are suggested to be indispensable for the translocation of presequence-containing mitochondrial proteins…”

      Line 344 – “…it is possible that unlike what occurs in yeast, normal levels of mammalian TIMM50 and TIM23 complex are mainly essential for maintaining the steady state levels of intricate complexes/assemblies.”

      Line 396 – “In summary, our results suggest that even low levels of TIMM50 and TIM23CORE components suffice in maintaining the majority of mitochondrial matrix and inner membrane proteome. Nevertheless, reductions in TIMM50 levels led to a decrease of many OXPHOS and MRP complex subunits, which indicates that normal TIMM50 levels might be mainly essential for maintaining the steady state levels and assembly of intricate complex proteins.”

      Reviewer #1 (Recommendations For The Authors):

      (1) Lines 25-26: The authors write "Strikingly, TIMM50 deficiency had no impact on the steady state levels of most of its substrates". Since the current data challenges the definition of some proteins as substrates of TIMM50, I suggest using the term "putative substrates".

      Changed as suggested

      (2) Line 27: It is not clear whether the wording "general import role of TIM23" refers to the TIM23 protein or the TIM23 complex. This should be clarified.

      Clarified. It now states "TIM23 complex".

      (3) Line 72: should be "and plays".

      Changed as suggested.

      (4) It will be helpful to include in Figure 1 a small scheme of TIMM50 and to indicate in which domain the T252M mutation is located.

      We predicted the AlphaFold human TIMM50 structure and indicated the mutation site and the different TIMM50 domains. The structure is included in Fig. 1A.

      (5) I suggest labelling the "Y" axis in Fig. 1B as "Protein level (% of control)".

      Changed as suggested in Fig. 1C (previously Fig. 1B) and in Fig. 2C.

      (6) Line 179: since the authors tested here only about 10 mitochondrial proteins (out of 1500), I think that the word "many" should be replaced by "several representative" resulting in "steady state levels of several representative mitochondrial proteins".

      Changed as requested.

      (7) Line 208: correct typo.

      Typo was corrected.

      (8) Figure 4 is partially redundant as its data is part of Figure 3. The authors can consider combining these two figures. Accordingly, large parts of the legend of Figure 4 are repeating information in the legend to Figure 3 and can refer to it.

      We revamped Figures 3 and 4. Figure 3 now shows the analysis of fibroblasts proteomics while Figure 4 focuses on neurons proteomics. We also modified the legend of Figure 4.

      Reviewer #2 (Recommendations For The Authors):

      (1) Abstract: 'Strikingly, TIMM50 deficiency had no impact on the steady state levels of most of its substrates, challenging the currently accepted import dogma of the essential general import role of TIM23 and suggesting that fully functioning TIM23 complex is not essential for maintaining the steady state level of the majority of mitochondrial proteins'. This sentence needs to be rephrased. The data do not challenge any dogma! The authors only show that lower levels of functional TIM23 are sufficient.

      We have rewritten all the relevant sentences as suggested (details are also mentioned in response to reviewer 2 public review point 1)

      (2) Introduction: 'Surprisingly, functional and physiological analysis points to the possibility that TIMM50 and a fully functional TIM23 complex are not essential for maintaining steady-state levels of most presequence-containing proteins'. This again needs to be rephrased.

      Rewritten as suggested (details mentioned in response to reviewer 2 public review point 1)

      (3) Discussion: 'In summary, our results challenge the main dogma that TIMM50 is essential for maintaining the mitochondrial matrix and inner membrane proteome, as steady state level of most mitochondrial matrix and inner membrane proteins did not change in either patient fibroblasts or mouse neurons following a significant decrease in TIMM50 levels.' This again needs to be rephrased.

      Rewritten as suggested (details mentioned in response to reviewer 2 public review point 1)

      (4) The analysis of the proteomics experiment should be improved. The authors show in Figures 3 and 4 several times the same volcano plots in which different groups of proteins are indicated. It would be good to add (a) a principal component analysis to show that the replicates from the mutant samples are consistently different from the controls, (b) a correlation plot that compares the log-fold-change of P1 to that of P2 to show which of the proteins are consistently changed in P1 and P2 and (c) a GO term analysis to show in an unbiased way whether mitochondrial proteins are particularly affected upon TIMM50 depletion.

      Figures 3 and 4 have been changed to avoid redundancy. Figure 3 now focuses on fibroblast proteomics (with additional analysis), while Figure 4 focuses on neuron proteomics. PCA was added in Fig. S1, showing that the proteomics replicates of both patients (P1 and P2) are consistently different from the healthy control (HC) replicates. Correlation plots were added in Figure 3C and D, showing high correlation of the downregulated and upregulated mitochondrial proteins between P1 and P2. These plots further highlight that MIM proteins are more affected than matrix proteins and that the OXPHOS and MRP systems comprise the majority of significantly downregulated proteins in both patients. GO term analysis was performed for all the detected proteins that were significantly downregulated in both patients. The GO term analysis is displayed in Figure S3A, and shows that mitochondrial proteins, mainly of the OXPHOS and MRP machineries, are particularly affected.
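
The kind of P1-versus-P2 comparison requested by the reviewer can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for Figure 3C and D: the function names are hypothetical, the abundance values are invented placeholders rather than data from the study, and Pearson correlation is assumed as the correlation measure.

```python
import math


def log2_fold_change(patient, control):
    """log2 ratio of patient to control abundance; negative means downregulated."""
    return math.log2(patient / control)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Invented placeholder abundances for four proteins (arbitrary units).
control = [100.0, 100.0, 100.0, 100.0]
p1 = [50.0, 25.0, 100.0, 200.0]
p2 = [50.0, 25.0, 100.0, 200.0]

lfc_p1 = [log2_fold_change(p, c) for p, c in zip(p1, control)]
lfc_p2 = [log2_fold_change(p, c) for p, c in zip(p2, control)]
print(lfc_p1, pearson(lfc_p1, lfc_p2))
```

Proteins that change consistently in both patients fall along the diagonal of such a plot, which is what a high correlation coefficient summarizes.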

      (5) Figure 1. The figure shows the levels of TIM and TOM subunits in two mutant samples. The quantifications suggest that the levels of TIMM21, TOMM40, and mtHsp60 are not affected. However, from the figure, it seems that there are increased levels of TIMM21 and reduced levels of TOMM40 and mtHsp60. Unfortunately, in the figure most of the signals are overexposed. Since this is a central element of the study, it would be good to load dilutions of the samples to make sure that the signals are indeed in the linear range and do scale with the amounts of samples loaded.

      The representative WB panels display the Actin loading control of the representative TIMM50 repeat (the top panel). However, each protein was tested separately, at least three times, and was normalized to its own Actin loading control.

      (6) Figure 2B. All panels are shown in color except the panel for TIMM17B which is grayscale. This should be changed to make them look equal.

      All the western blot panels were changed to grayscale.

      (7) Discussion: 'Despite being involved in the import of the majority of the mitochondrial proteome, no study thus far characterized the effects of TIMM50 deficiency on the entire mitochondrial proteome.' This sentence is not correct as proteomic data were published previously, for example for Trypanosomes (PMID: 34517757) and human cells (PMID: 38828998).

      We have corrected the statement to “Despite being involved in the import of the majority of the mitochondrial proteome, little is known about the effects of TIMM50 deficiency on the entire mitochondrial proteome.”

      (8) A recent study on a very similar topic was published by Diana Stojanovski's group that needs to be cited: PMID: 38828998. The results of this comprehensive study also need to be discussed!!!

      We have added the following in the discussion:

      Line 362 – “These observations are similar to the recent analysis of patient-derived fibroblasts which demonstrated that TIMM50 mutations lead to severe deficiency in the level of TIMM50 protein (6,7). Notably, this decrease in TIMM50 was accompanied by a decrease in the levels of the other two core subunits, TIMM23 and TIMM17. However, unexpectedly, proteomics analysis in our study and that conducted by Crameri et al., 2024 indicate that steady state levels of most TIM23-dependent proteins are not affected despite a drastic decrease in the levels of the TIM23CORE complex (7). The most affected proteins are constituents of intricate complexes, such as the OXPHOS and MRP machineries. Thus, both these studies indicate a surprising possibility that even reduced levels of the TIM23CORE components are sufficient for maintaining the steady state levels of most presequence-containing substrates.”

      (1) Homberg B, Rehling P, Cruz-Zaragoza LD. The multifaceted mitochondrial OXA insertase. Trends Cell Biol. 2023;33(9):765–72.

      (2) Carroll J, Altman MC, Fearnley IM, Walker JE. Identification of membrane proteins by tandem mass spectrometry of protein ions. Proc Natl Acad Sci U S A. 2007;104(36):14330–5.

      (3) Ting SY, Schilke BA, Hayashi M, Craig EA. Architecture of the TIM23 inner mitochondrial translocon and interactions with the matrix import motor. J Biol Chem [Internet]. 2014;289(41):28689–96. Available from: http://dx.doi.org/10.1074/jbc.M114.588152

      (4) Trefts E, Shaw RJ. AMPK: restoring metabolic homeostasis over space and time. Mol Cell [Internet]. 2021;81(18):3677–90. Available from: https://doi.org/10.1016/j.molcel.2021.08.015

      (5) Zhang S, Hulver MW, McMillan RP, Cline MA, Gilbert ER. The pivotal role of pyruvate dehydrogenase kinases in metabolic flexibility. Nutr Metab. 2014;11(1):1–9.

      (6) Reyes A, Melchionda L, Burlina A, Robinson AJ, Ghezzi D, Zeviani M. Mutations in TIMM50 compromise cell survival in OxPhos‐dependent metabolic conditions. EMBO Mol Med. 2018;

      (7) Crameri JJ, Palmer CS, Stait T, Jackson TD, Lynch M, Sinclair A, et al. Reduced Protein Import via TIM23 SORT Drives Disease Pathology in TIMM50-Associated Mitochondrial Disease. Mol Cell Biol [Internet]. 2024;0(0):1–19. Available from: https://doi.org/10.1080/10985549.2024.2353652

    1. Author response:

      The following is the authors’ response to the original reviews.

      Recommendations for the authors

      Reviewer #1 (Recommendations For The Authors):

      Below I summarize points that should be addressed in a revised version of the manuscript.

      • Page 6, first paragraph: I don't understand why the signals average out to a single state. If the distribution is indeed randomly distributed, a broad signal with low intensity should be present.

      We agree that this statement may cause confusion. We changed the text (marked in bold) to clarify the statement: The mobility of the undocked SBDs will be higher than the diffusion of the whole complex, allowing the sampling of varying interdomain distances within a single burst. However, these dynamic variations are subsequently averaged to a singular FRET value during FRET calculations for each burst, and may appear as a single low FRET state in the histograms.

      • Page 6, third paragraph: how can the donor only be detected in the acceptor channel? Is this tailing out?

      Donor-only signal is not detected in the acceptor channel. As described on page 5 and in the Materials & Methods section, the dye stoichiometry value is defined for each burst/dwell using three types of photon counts: donor-based donor emission (FDD), donor-based acceptor emission (FDA) and acceptor-based acceptor emission (FAA).

      When no acceptor fluorophore is present FAA=0 and S=1.

      Some donor photons bleed through into the acceptor channel, but we correct for this by calculating the leakage and crosstalk factors as described in the Materials and Methods (page 20).

      We changed the text (marked in bold) in the manuscript to address the question: The FRET data of both OpuA variants is best explained by a four-state model (Figure 2A,B; fourth and fifth panel) (Supplementary File 3). Two of the four states represent donor-only (S≈1) or acceptor-only (S≈0) dwells. The full bursts belonging to donor-only and acceptor-only molecules were excluded prior to mpH2MM. This means that some molecules transit to a donor-only or acceptor-only state within the burst period, which most likely reflects blinking or bleaching of one of the fluorophores. These donor-only and acceptor-only states were also excluded during further analysis. The other two states reflect genuine FRET dwells that were analyzed by mpH2MM. They represent different conformations of the SBDs.
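
The relationship between the three photon counts and the stoichiometry value can be sketched as below. The formula S = (FDD + FDA) / (FDD + FDA + FAA) is the definition commonly used in alternating-excitation FRET and is assumed here; it is consistent with the limits stated above (S = 1 when FAA = 0, S ≈ 0 for acceptor-only dwells) but is not quoted from the manuscript. The function names, the correction form and the example factors are likewise illustrative assumptions.

```python
def stoichiometry(f_dd, f_da, f_aa):
    """Assumed definition: S = (F_DD + F_DA) / (F_DD + F_DA + F_AA)."""
    total = f_dd + f_da + f_aa
    if total == 0:
        raise ValueError("burst contains no photons")
    return (f_dd + f_da) / total


def corrected_f_da(f_dd, f_da, f_aa, leakage, direct_excitation):
    """Subtract donor leakage and acceptor direct excitation from F_DA.

    Both factors are hypothetical parameters here; in practice they are
    estimated from donor-only and acceptor-only populations.
    """
    return f_da - leakage * f_dd - direct_excitation * f_aa


# Donor-only dwell: no acceptor-excited acceptor photons, so S = 1.
print(stoichiometry(80, 20, 0))
# Acceptor-only dwell: no donor-excited photons, so S = 0.
print(stoichiometry(0, 0, 50))
# Doubly labelled molecule with balanced counts: S = 0.5.
print(stoichiometry(30, 20, 50))
```

With this definition, a dwell in which the acceptor has bleached or blinked moves towards S = 1 within the burst, which is how such dwells are recognized and excluded.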

      • Page 7, "SBD dynamics ..": why was the V149Q mutant only analyzed in the K521C background and not also in the N414C background?

      The two FRET states were best distinguished in OpuA-K521C. Therefore, we decided to focus on OpuA-K521C and not OpuA-N414C. OpuA-V149Q was used to show that reduced docking efficiency does not affect the transition rate constants and relative abundances of the two FRET states, and we regarded it sufficient to test the SBD dynamics in OpuA-K521C only.

      • Page 8, second paragraph: why was the N414C mutant analyzed only from 0 - 600 mM and not also up to 1000 mM?

      In line with the previous answer, our main focus was on OpuA-K521C, since the two FRET states were best distinguished in OpuA-K521C. OpuA-N414C was used to prove that similar states are observed when measuring with fluorophores on the opposite side of the SBD. We studied how the FRET states change in response to different conditions that correspond to different stages of the transport cycle and how they change in response to different ionic strengths. Initially, 600 mM KCl was used to study the dynamics of the SBD at high ionic strength. Later in this study, we tested a very wide range of salt concentrations for OpuA-K521C to get detailed insights into the dynamics of the SBDs over a wide ionic strength range. Note that 1 M KCl is a very high, non-physiological ionic strength for the typical habitat of L. lactis and was only used to show that the high FRET state occurs even under very extreme conditions.

      • Page 8, third paragraph: why was the dimer (if it is the source of the FRET signal) only partially disrupted?

      We acknowledge that this is a very good point. However, we purposely did not speculate on this point in the manuscript, because we have limited information on the molecular details of the interaction. As we highlight on page 8, the SBDs experience each other in a very high apparent concentration (millimolar range). This means that the interactions are most likely very weak (low affinity) and not very specific. Such interactions are in the literature referred to as the quinary structure of proteins and they occur at the high macromolecular crowding in the cell and in proteins with tethered domains, and thus at high local concentrations. Such interactions can be screened by high ionic strength. In the revised manuscript, we now present the partially disrupted dimer structure in the context of the quinary structure of a protein (page 11):

      In other words, the high FRET state may comprise an ensemble of weakly interacting states rather than a singular stable conformation, resembling the quinary structure of proteins. The quinary structure of proteins is typically revealed in highly crowded cellular environments and describes the weak interactions between protein surfaces that contribute to their stability, function, and spatial organization (Guin & Gruebele, 2019). Despite the current study being conducted under dilute conditions, the local concentration of SBDs (~4 mM) mimics a densely populated environment and reveals quinary structure.
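
      To give a feel for how tethering translates into a millimolar apparent concentration, the following back-of-envelope sketch computes the concentration of a single molecule confined to a sphere. It is not the calculation behind the ~4 mM figure quoted above: the spherical geometry and the 4.6 nm radius are illustrative assumptions, and the function name is hypothetical.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole


def apparent_concentration_mM(radius_nm: float) -> float:
    """Concentration (mM) of one molecule confined to a sphere of the given radius."""
    volume_nm3 = (4.0 / 3.0) * math.pi * radius_nm ** 3
    volume_litres = volume_nm3 * 1e-24  # 1 nm^3 = 1e-27 m^3 = 1e-24 L
    molar = 1.0 / (AVOGADRO * volume_litres)
    return molar * 1e3


# Assumed confinement radius, chosen only for illustration.
print(round(apparent_concentration_mM(4.6), 1))
```

      Because the volume scales with the cube of the radius, the result is very sensitive to the assumed geometry: doubling the radius lowers the apparent concentration eightfold.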

      • Page 9, second paragraph: according to the EM data processing, only 20% of the particles were used for 3D reconstruction. Why? Does it mean that the remaining 80% were physiologically not relevant? If so, why were the 20% used relevant?

      We note that it is a fundamental part of image processing of single particle cryo-EM data to remove false positives or low-resolution particles throughout the processing workflow. In particular when using a very low and therefore generous threshold during automated particle picking, as we did (t=0.01 and t=0.05 for the 50 mM KCl and 100 mM KCl datasets, respectively), the initial set of particles includes a significant amount of false positives – a tradeoff to avoid excluding particles belonging to low populated classes/orientations. It is thus common that more than 50% of ‘particles’ are excluded in the first rounds of 2D classification. In our case, only 30% and 52% of particles were retained after such first clean-up steps. Subsequently, the particle set is further refined, and additional false positives and low-resolution particles are excluded during extensive rounds of 3D classification. We also note that during the final steps, most of the data excluded represents particles of lower quality that do not contribute to a high-resolution reconstruction, or that belong to low-population protein conformations. This does not mean that such a population is not physiologically relevant. In conclusion, having only 5-20% of the initial automatically picked particles contributing to the reconstruction of the final cryo-EM map is common, with the vast majority of excluded particles being false positives.

      • Page 11, third paragraph: the way the proposed model is selected is also my main criticism. All alternative models do not fit the data. Therefore, the proposed model is suggested. However, I do not grasp any direct support for this model. Either I missed it or it is not presented.

      Concerning the specific model in Figure 5, the reviewer is correct. We do not provide direct evidence for a side-ways interaction. However, we have evidence of transient interactions and our data rule out several scenarios of interaction, leaving 5C as the most likely model. This is also the main conclusion of this paper: In conclusion, the SBDs of OpuA transiently interact in a docking competent conformation, explaining the cooperativity between the SBDs during transport. The conformation of this interaction is not fixed but differs substantially between different conditions.

      Because the interaction is very short-lived, it was not possible to visualize molecular details of this interaction. We present Figure 5 to hypothesize the most likely type of interaction, since many possibilities can be excluded with the vast amount of presented data. To make clearer that we discuss models and rule out several possibilities, but do not demonstrate a specific interaction between the SBDs, we now write on page 10 (changes marked in bold): We have shown that the SBDs of OpuA come close together in a short-lived state, which is responsive to the addition of glycine betaine (Figure 4A). Although the occurrence of the state varies between different conditions, it was not possible to negate the high-FRET state completely, not even under very high or low KCl concentrations, or in the presence of 50 mM arginine plus 50 mM glutamate (Figure 4A,B). To evaluate possible interdomain interaction scenarios, we consider the following: (1) The SBDs of OpuA are connected to the TMDs with very short linkers of approximately 4 nm, which limit their movement and allow the receptor to sample a relatively small volume near its docking site. (2) In low ionic strength conditions, OpuA-K521C displays a high FRET state with mean FRET values of 0.7-0.8, which correspond to inter-dye distances of approximately 4 nm. (3) The high FRET state is responsive to glycine betaine, which points toward direct communication between the two SBDs. (4) The distance between the density centers of the SBDs in the cryo-EM reconstructions (based on particles with a low and high FRET state) is 6 nm, which aligns with the dimensions of an SBD (length: ~6 nm, maximal width: ~4 nm). These findings collectively indicate that two SBDs interact, not necessarily in a singular conformation but possibly as an ensemble of weakly interacting states. Hence, we discuss three possible SBD-SBD interaction models to explain the high FRET state:

      Reviewer #2 (Recommendations For The Authors):

      In the abstract and elsewhere the authors suggest that the SBDs physically interact with one another, and that this interaction is important for the transport mechanism, specifically for its cooperativity.

      I feel that this main claim is not well established. The authors convincingly demonstrate that the SBDs largely occupy two states relative to one another and that in one of these states, they are closer than in the other. Unless I have missed (or failed to understand) some major details of the results, I did not find any evidence of a physical interaction. Have the authors established that the high FRET state indeed corresponds to the physical engagement of the SBDs? I feel that a direct demonstration of an interaction is much missing.

      Along the same lines, in the low-salt cryo-EM structure, where the SBDs are relatively closer together, the SBDs are still separated and do not interact.

      See also our response to the final comment of reviewer 1. Furthermore, please carefully consider the following: (1) FRET values of 0.7-0.8 correspond to inter-dye distances of approximately 4 nm. (2) The high FRET state is responsive to glycine betaine, which points toward direct communication between the two SBDs. (3) The cryo-EM reconstruction is the average of all the particles in the final dataset, including both the particles with a low and high FRET state. Further, the local resolution of the SBDs in the cryo-EM map is low, indicative of a high degree of flexibility. Thus, a potential interaction is possible within the observed range of flexibility. (4) The distance between the density centers is 6 nm, aligning with the dimensions of an SBD (length: 6 nm, maximal width: 4 nm). These factors collectively indicate SBD interactions, and we present these points now more explicitly in Figure 4 and the last part of the results section (page 9).
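
      The link between FRET efficiency and inter-dye distance in point (1) can be illustrated with the standard Förster relation, E = 1 / (1 + (r / R0)^6). The sketch below inverts it for r. The Förster radius of 5.0 nm is an assumed, typical value for common dye pairs and is not stated in this response; the function name is hypothetical.

```python
def distance_from_fret(efficiency: float, r0_nm: float) -> float:
    """Invert E = 1 / (1 + (r / R0)^6) to obtain the inter-dye distance r (nm)."""
    if not 0.0 < efficiency < 1.0:
        raise ValueError("efficiency must lie strictly between 0 and 1")
    return r0_nm * (1.0 / efficiency - 1.0) ** (1.0 / 6.0)


R0_ASSUMED_NM = 5.0  # assumed Förster radius; not taken from the source

for e in (0.7, 0.8):
    print(e, round(distance_from_fret(e, R0_ASSUMED_NM), 2))
```

      Under this assumption, efficiencies of 0.7-0.8 map to distances slightly above and slightly below 4 nm, in line with the "approximately 4 nm" quoted above. A different R0 scales the result proportionally.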

      Once the authors successfully demonstrate that direct physical interaction indeed occurs, they will need to provide data that places it in the context of the transport cycle. Do the SBDs swap ligand molecules between them? Do they bind the ligand and/or the transporter cooperatively? What is the role of this interaction?

      We acknowledge the intriguing nature of the posed questions, but they extend beyond the scope of this study. It is extremely challenging to obtain high-resolution structures of highly dynamic multidomain proteins, like OpuA, and to probe transient interactions as we do here for the SBDs of OpuA. We therefore combined cryo-TEM with smFRET studies and applied the most advanced, state-of-the-art analysis tools, as acknowledged by reviewer 1. We link our observations on the structural dynamics and interactions of the SBDs to a previous study, where we showed that the two SBDs of OpuA interact cooperatively. We do not have further evidence that connects the physical interactions to the transport cycle. In our view, the collective datasets indicate that the physical interactions between the SBDs reported here increase the transport efficiency.

      As far as I understand, the smFRET data have been interpreted on the basis of a negative observation, i.e., that it is "likely" that none of the FRET states corresponds to a docked SBD. To convincingly show this, a positive observation is required, i.e., observation of a docked state.

      The aim of this study was to study interdomain dynamics and not specifically docking. We have previously shown that docking can be visualized via cryo-EM (Sikkema et al., 2020), however the SBDs of OpuA appear to only dock in specific turnover conditions. We now show that the high FRET state of OpuA cannot represent a docked state, but that the SBDs transiently interact (see our response to the first comment). Importantly, a docked state was also not found in the cryo-EM reconstructions at low ionic strength, representing the smFRET conditions where we observe the interactions between the SBDs. The high FRET state occupies 30% of the dwells in this condition, and such a high percentage of molecules would have become apparent during cryo-EM 3D classification in case they would form a docked state. Therefore, we conclude that docking does not occur in low ionic strength apo condition. We discuss this point and our reasoning on page 11 of the revised manuscript.

      In this respect, I find it troubling that in none of the tested conditions, the authors observed a FRET state which corresponds to the docked state. Such a state, which must exist for transport to occur (as mentioned in the authors' previous publications), needs to be demonstrated. This brings me to my next question: why have the authors not measured FRET between the SBDs and the transporter? Isn't this a very important piece that is missing from their puzzle?

      We agree that investigating docking behavior under varied turnover conditions requires focused experiments on FRET dynamics between the SBDs and the transporter. As noted on page 5, OpuA exists as a homodimer, implying that a single cysteine mutation introduces two cysteines in a single functional transporter. To specifically implement a cysteine mutation in only one SBD and one transmembrane domain, it is necessary to artificially construct a heterodimer. We recently published initial attempts in this direction, and this will be a subject for future research but still requires years of work.

      Additionally, I feel that important controls are missing. For example, how will the data presented in Fig1 look if the transporter is labeled with acceptor or donor only? How do soluble SBDs behave?

      In the employed labeling method, donor and acceptor dyes are mixed in a 1:1 ratio and randomly attached to the two cysteines in the transporter. This automatically yields significant fractions of donor only and acceptor only transporters which are always present during the smFRET recordings. We can visualize those molecules on the basis of the dye stoichiometry, which we calculate by using three types of photon counts: donor-based donor emission (FDD), donor-based acceptor emission (FDA) and acceptor-based acceptor emission (FAA).

      Unfiltered plots look as follows (a dataset of OpuA-K521C at 600 mM KCl):

      Author response image 1.

      Donor only and acceptor only molecules have a very well discernible stoichiometry of 1 and 0, respectively. The filtering procedure is described in the materials and methods section, and these plots can be found in the supplementary database. We did not add them to the main text or supplementary materials of the original manuscript, as this is a very common procedure in the field of smFRET. We now include such a dataset in the revised manuscript.
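
      How the three photon counts separate donor only, acceptor only and doubly labeled molecules can be sketched as follows. This is a minimal sketch assuming the standard uncorrected definitions of apparent FRET efficiency and stoichiometry; the function names and the filter thresholds are hypothetical and do not reproduce the filtering procedure of the materials and methods section.

```python
def fret_efficiency(f_dd: float, f_da: float) -> float:
    """Apparent (uncorrected) FRET efficiency from donor-excitation photon counts."""
    return f_da / (f_dd + f_da)


def stoichiometry(f_dd: float, f_da: float, f_aa: float) -> float:
    """Apparent dye stoichiometry: near 1 for donor only, near 0 for acceptor only."""
    return (f_dd + f_da) / (f_dd + f_da + f_aa)


def is_doubly_labeled(s: float, low: float = 0.25, high: float = 0.75) -> bool:
    """Keep bursts with intermediate stoichiometry; thresholds are illustrative."""
    return low < s < high


# Made-up photon counts (FDD, FDA, FAA) for three kinds of molecules.
bursts = {
    "donor only": (100, 2, 0),
    "acceptor only": (0, 1, 80),
    "donor + acceptor": (40, 60, 100),
}
for name, (f_dd, f_da, f_aa) in bursts.items():
    s = stoichiometry(f_dd, f_da, f_aa)
    print(name, round(s, 2), is_doubly_labeled(s))
```

      Only bursts that pass the stoichiometry filter would then contribute to the FRET efficiency histogram.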

      Soluble SBDs of OpuA have been studied previously (e.g. Wolters et al., 2010 & De Boer et al. 2019). For example, we have shown by SEC-MALLS that soluble SBDs do not form dimers, which is consistent with our notion that the SBDs interact with low affinity. It is not possible to study interdomain dynamics between soluble SBDs by smFRET, because the measurements are carried out at picomolar concentrations (monomeric conditions). We emphasize that smFRET measurements with native complexes, with SBDs near each other at apparent millimolar concentrations, are physiologically more relevant.

      Additional comments:

      (1) "It could well be that cooperativity and transient interactions between SBDs is more common than previously anticipated" and a similar statement in the abstract. What evidence is there to suggest that the transient interactions between SBDs are a common phenomenon?

      On page 11, we write: Dimer formation of SBPs has been described for a variety of proteins from different structural clusters of substrate-binding proteins [33–38,51–53]. We cite 9 papers that report SBD/SBP dimers. This suggests to us that the phenomenon of interacting substrate-binding proteins could be more common. Moreover, the concentration of maltose-binding protein and other SBPs in the periplasm of Gram-negative bacteria can reach (sub)millimolar concentrations, and low-affinity interactions may play a role not only in membrane protein-tethered SBDs (like in OpuA) but also be important in soluble substrate-receptors. Such low-affinity interactions are rarely studied in biochemical experiments.

      (2) I think that the data presented in 1B-C better suits the supplementary information.

      Figure 1B-D is already a summary of the supplementary information that describes the optimization of OpuA purification. We think it is valuable to show this part of the figure in the main text. A very clean and highly pure OpuA sample is essential for smFRET experiments. Quality of protein preparations and data analysis are key for the type of measurements we report in this paper.

      (3) "the first peak in the SEC profile corresponds...." The peaks should be numbered in the figure to facilitate their identification.

      We have changed the figure as suggested.

      (4) "smFRET is a powerful tool for studying protein dynamics, but it has only been used for a handful of membrane proteins". With the growing list of membrane proteins studied by smFRET I find this an overstatement.

      We removed this sentence in the new version of the manuscript.

      (5) "We rationalized that docking of one SBD could induce a distance shift between the two SBDs in the FRET range of 3-10 nm (Figure 1E)" How and why was this assumed?

      We realize that this is one of the sentences that caused confusion about the aim of this study. In this part of the manuscript, we should not have used docking as an example and we apologize for that. We replaced the sentence by: These variants are used to study inter-SBD dynamics in the FRET range of 3-10 nm (Figure 1E).

      Also Figure 1E was adjusted to prevent confusion:

      Author response image 2.

      In addition, to avoid any confusion we changed the following sentence on page 4 (changes marked in bold): We designed cysteine mutations in the SBD of OpuA to study interdomain dynamics in the full length transporter.

      (6) "However, the FRET distributions are broader than would be expected from a single FRET state, especially for OpuA-K521C" Have the authors established how a single state FRET of OpuA looks? Is there a control that supports this claim?

      Below we compare two datasets from OpuA-K521C in 600 mM KCl with a typical smFRET dataset from the well-studied substrate-binding protein MBP from E. coli, which resides in a single state. Left: OpuA-K521C; Right: MBP

      Author response image 3.

      We agree that this cannot be assumed from the presented data. Therefore we rewrote this sentence: However, the FRET distributions tail towards higher FRET values, especially OpuA-K521C.

      (7) "V149Q was designed as a mild mutation that would reduce docking efficiency and thereby substrate loading, but leave the intrinsic transport and ATP hydrolysis efficiency intact." I find this statement confusing: How can a mutation reduce docking efficiency yet leave the transport activity unchanged?

      We rewrote the sentences (changes marked in bold): V149Q was designed as a mild mutation that would reduce docking efficiency and thereby substrate loading, but leave the ionic strength sensing in the NBD and the binding of glycine betaine and ATP intact. Accordingly, a reduced docking efficiency should result in a lower absolute glycine betaine-dependent ATPase activity. At the same time the responsiveness of the system to varying KCl, glycine betaine, or Mg-ATP concentrations should not change.

      (8) Along the same lines: "whereas the glycine betaine-, Mg-ATP-, or KCl-dependent activity profiles remain unchanged" vs. "OpuA-V149Q-K521C exhibited a 2- to 3-fold reduction in glycine betaine-dependent ATPase activity".

      See comment at point 7.

      (9) In general, I find the writing wanting at places, not on par with the high standards set by previous publications of this group.

      We recognize the potential ambiguity in our phrasing. We hope that after incorporating the feedback provided by the reviewers our manuscript will convey our findings in a clearer manner.

      Extra changes to the text:

      (1) Title changed: The substrate-binding domains of the osmoregulatory ABC importer OpuA physically transiently interact

      (2) Second part of the abstract changed: We now show, by means of solution-based single-molecule FRET and analysis with multi-parameter photon-by-photon hidden Markov modeling, that the SBDs transiently interact in an ionic strength-dependent manner. The smFRET data are in accordance with the apparent cooperativity in transport and supported by new cryo-EM data of OpuA. We propose that the physical interactions between SBDs and cooperativity in substrate delivery are part of the transport mechanism.

      (3) Page 6, third paragraph and Figure 2B: the wrong rate number was extracted from table 1. Changed this in the text and figure: 112 s-1 → 173 s-1. It did not affect any of the interpretations or conclusions.

      (4) Page 8, last paragraph, changed: smFRET was also performed in the absence of KCl and with a saturating concentration of glycine betaine (100 µM). The mean FRET efficiency of the high FRET state of OpuA-K521C increased to 0.78, which corresponds to an inter-dye distance of about 4 nm. This indicates that the dyes at the two SBDs move very close towards each other (Figure 4A) (Table 1) (Supplementary File 34).

      (5) Page 9, second paragraph changed: Due to the inherent flexibility of the SBDs, with respect to both the MSP protein of the nanodisc and the TMDs of OpuA, their resolution is limited. Furthermore, the cryo-EM reconstructions average all the particles in the final dataset, including those with a low and high FRET state. Nevertheless, in both conditions, the densities that correspond to the SBDs can be observed in close proximity (Figure 4D). The distance between the density centers is 6 nm and aligns with the dimensions of an SBD, providing further evidence for physical interactions between the SBDs.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The authors aim to address a critical challenge in the field of bioinformatics: the accurate and efficient identification of protein binding sites from sequences. Their work seeks to overcome the limitations of current methods, which largely depend on multiple sequence alignments or experimental protein structures, by introducing GPSite, a multi-task network designed to predict binding residues of various molecules on proteins using ESMFold.

      Strengths:

      • Benchmarking. The authors provide a comprehensive benchmark against multiple methods, showcasing the performances of a large number of methods in various scenarios.

      • Accessibility and Ease of Use. GPSite is highlighted as a freely accessible tool with user-friendly features on its website, enhancing its potential for widespread adoption in the research community.

      RE: We thank the reviewer for acknowledging the contributions and strengths of our work!

      Weaknesses:

      • Lack of Novelty. The method primarily combines existing approaches and lacks significant technical innovation. This raises concerns about the original contribution of the work in terms of methodological development. Moreover, the paper reproduces results and analyses already presented in previous literature, without providing novel analysis or interpretation. This further diminishes the contribution of this paper to advancing knowledge in the field.

      RE: The novelty of this work is primarily manifested in four key aspects. Firstly, although we have employed several existing tools such as ProtTrans and ESMFold to extract sequence features and predict protein conformations, these techniques were hardly explored in the field of binding site prediction. We have successfully demonstrated the feasibility of substituting multiple sequence alignments with language model embeddings and training with predicted structures, providing a new solution to overcome the limitations of current methods for genome-wide applications. Secondly, though a few methods tend to capture geometric information based on protein surfaces or atom graphs, surface calculation and property mapping are usually time-consuming, while message passing on full atom graphs is memory-consuming and thus makes long sequences challenging to process. Besides, these methods are sensitive towards details and errors in the predicted structures. To facilitate large-scale annotations, we have innovatively applied geometric deep learning to protein residue graphs for comprehensively capturing backbone and sidechain geometric contexts in an efficient and effective manner (Figure 1). Thirdly, we have not only exploited multi-task learning to integrate diverse ligands and enhance performance, but also shown its capability to easily extend to the binding site prediction of other unseen ligands (Figure 4 D-E). Last but not least, as a “Tools and Resources” article, we have provided a fast, accurate and user-friendly webserver, as well as constructed a large annotation database for the sequences in Swiss-Prot. Leveraging this database, we have conducted extensive analyses on the associations between binding sites and molecular functions, biological processes, and disease-causing mutations (Figure 5), indicating the potential of our tool to unveil unexplored biology underlying genomic data.

      We have now revised the descriptions in the “The geometry-aware protein binding site predictor (GPSite)” section to highlight the novelty of our work in a clearer manner:

      “In conclusion, GPSite is distinguished from the previous approaches in four key aspects. First, profiting from the effectiveness and low computational cost of ProtTrans and ESMFold, GPSite is liberated from the reliance on MSA and native structures, thus enabling genome-wide binding site prediction. Second, unlike methods that only explore the Cα models of proteins 25,40, GPSite exploits a comprehensive geometric featurizer to fully refine knowledge in the backbone and sidechain atoms. Third, the employed message propagation on residue graphs is global structure-aware and time-efficient compared to the methods based on surface point clouds 21,22, and memory-efficient unlike methods based on full atom graphs 23,24. Residue-based message passing is also less sensitive towards errors in the predicted structures. Last but not least, instead of predicting binding sites for a single molecule type or learning binding patterns separately for different molecules, GPSite applies multi-task learning to better model the latent relationships among different binding partners.”

      • Benchmark Discrepancies. The benchmark results vary, especially between the initial comparisons and those with PeSTo: GPSite achieves a PR AUC of 0.484 on the global benchmark but a PR AUC of 0.61 on the benchmark against PeSTo. This suggests potential issues with the benchmark set or the stability of the method. For consistency, PeSTo should be included in the benchmark against all other methods. This inconsistency needs to be addressed to validate the reliability of the results.

      RE: We thank the reviewer for the constructive comments. Since our performance comparison experiments involved numerous competitive methods whose training sets are disparate, it was difficult to compare or rank all these methods fairly using a single test set. Given the substantial overlap between our protein-binding site test set and the training set of PeSTo, we meticulously re-split our entire protein-protein binding site dataset to generate a new test set that avoids any overlap with the training sets of both GPSite and PeSTo and performed a separate evaluation, where GPSite achieves a higher AUPR than PeSTo (0.610 against 0.433). This is quite common in this field. For instance, in the study of PeSTo (Nat Commun 2023), the comparisons of PeSTo with MaSIF-site, SPPIDER, and PSIVER were conducted using one test set, while the comparison with ScanNet was performed on a separate test set.

      Based on the reviewer’s suggestion, we have now replaced this experiment with a direct comparison with PeSTo using the datasets from PeSTo, in order to enhance the completeness and convincingness of our results. The corresponding descriptions are now added in Appendix 1-note 2, and the results are added in Appendix 2-table 4. For convenience, we also attach the note and table here:

      “Since 340 out of 375 proteins in our protein-protein binding site test set share > 30% identity with the training sequences of PeSTo, we performed a separate comparison between GPSite and PeSTo using the training and test datasets from PeSTo. By re-training with simply the same hyperparameters, GPSite achieves better performance than PeSTo (AUPR of 0.824 against 0.797) as shown in Appendix 2-table 4. Furthermore, when using ESMFold-predicted structures as input, the performance of PeSTo decreases substantially (AUPR of 0.691), and the superiority of our method will be further reflected. As in 24, the performance of ScanNet is also included (AUPR of 0.720), which is also largely outperformed by GPSite.”

      Author response table 1.

      Performance comparison of GPSite with ScanNet and PeSTo on the protein-protein binding site test set from PeSTo 24

      Note: The performance of ScanNet and PeSTo are directly obtained from 24. PeSTo* denotes evaluation using the ESMFold-predicted structures as input. The metrics provided are the median AUPR, median AUC and median MCC. The best/second-best results are indicated by bold/underlined fonts.
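
      The table note refers to median per-protein metrics. As an illustration of how such a number can be obtained, the sketch below computes the area under the precision-recall curve for each protein as average precision and then takes the median. Average precision is one common AUPR estimator; which estimator was used in the benchmark is not stated here, and the function names and toy data are hypothetical.

```python
from statistics import median


def average_precision(labels, scores):
    """Average precision: mean of the precision values at each true binding residue."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


def median_aupr(proteins):
    """Median of the per-protein average precision over (labels, scores) pairs."""
    return median([average_precision(labels, scores) for labels, scores in proteins])


# Toy per-residue labels (1 = binding) and predicted scores for two proteins.
toy = [
    ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
    ([1, 1, 0], [0.9, 0.8, 0.1]),
]
print(median_aupr(toy))
```

      Tied scores are ordered arbitrarily in this sketch; a production evaluation would normally handle ties explicitly.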

      • Interface Definition Ambiguity. There is a lack of clarity in defining the interface for the binding site predictions. Different methods are trained using varying criteria (surfaces in MaSIF-site, distance thresholds in ScanNet). The authors do not adequately address how GPSite's definition aligns with or differs from these standards and how this issue was addressed. It could indicate that the comparison of those methods is unreliable and unfair.

      RE: We thank the reviewer for the comments. The precise definition of ligand-binding sites is elucidated in the “Benchmark datasets” section. Specifically, the datasets of DNA, RNA, peptide, ATP, HEM and metal ions used to train GPSite were collected from the widely acknowledged BioLiP database [PMID: 23087378]. In BioLiP, a binding residue is defined if the smallest atomic distance between the target residue and the ligand is <0.5 Å plus the sum of the van der Waals radii of the two nearest atoms. Meanwhile, most comparative methods regarding these ligands were also trained on data from BioLiP, thereby ensuring fair comparisons.
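
      The distance criterion described above can be written down compactly. The following is a minimal sketch of that rule, not BioLiP's actual implementation: atoms are represented as hypothetical (element, coordinates) tuples, and the van der Waals radii are approximate Bondi values for a small illustrative subset of elements.

```python
import math

# Approximate van der Waals radii in Å (Bondi values); illustrative subset only.
VDW_RADIUS = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80, "P": 1.80}


def is_binding_residue(residue_atoms, ligand_atoms, margin=0.5):
    """Apply the rule: nearest atom pair closer than (sum of vdW radii + margin) Å."""
    # Find the closest residue-ligand atom pair.
    dist, elem_res, elem_lig = min(
        (
            (math.dist(xyz_res, xyz_lig), e_res, e_lig)
            for e_res, xyz_res in residue_atoms
            for e_lig, xyz_lig in ligand_atoms
        ),
        key=lambda item: item[0],
    )
    cutoff = VDW_RADIUS[elem_res] + VDW_RADIUS[elem_lig] + margin
    return dist < cutoff


residue = [("N", (0.0, 0.0, 0.0)), ("C", (1.5, 0.0, 0.0))]
ligand_near = [("O", (4.5, 0.0, 0.0))]  # 3.0 Å from the carbon atom
ligand_far = [("O", (6.0, 0.0, 0.0))]   # 4.5 Å from the carbon atom
print(is_binding_residue(residue, ligand_near), is_binding_residue(residue, ligand_far))
```

      Because the cutoff depends on the elements of the nearest atom pair, it varies between contacts rather than being a single fixed distance.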

      However, since BioLiP does not include data on protein-protein binding sites, studies for protein-protein binding site prediction may adopt slightly distinct label definitions, as the reviewer suggested. Here, we employed the protein-protein binding site data from our previous study [PMID: 34498061], where a protein-binding residue was defined as a surface residue (relative solvent accessibility > 5%) that lost more than 1 Å² absolute solvent accessibility after protein-protein complex formation. This definition was initially introduced in PSIVER [PMID: 20529890] and widely applied in various studies (e.g., PMID: 31593229, PMID: 32840562). SPPIDER [PMID: 17152079] and MaSIF-site [PMID: 31819266] have also adopted similar surface-based definitions as PSIVER. On the other hand, ScanNet [PMID: 35637310] employed an atom distance threshold of 4 Å to define contacts while PeSTo [PMID: 37072397] used a threshold of 5 Å. However, it is noteworthy that current methods in this field including ScanNet (Nat Methods 2022) and PeSTo (Nat Commun 2023) directly compared methods using different label definitions without any alignment in their benchmark studies, likely due to the subtle distinctions among these definitions. For instance, the study of PeSTo directly performed comparisons with ScanNet, MaSIF-site, SPPIDER, and PSIVER. Therefore, we followed these previous works, directly comparing GPSite with other protein-protein binding site predictors.
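
      The difference between the surface-based and the distance-based label definitions mentioned above can be made concrete with a short sketch. The function names and the example residue are hypothetical; only the thresholds (5% relative accessibility, 1 Å² loss, 4 Å and 5 Å cutoffs) come from the text, and the accessibility values are assumed to be precomputed by an external tool.

```python
def is_interface_by_asa(rsa_unbound, asa_unbound, asa_bound,
                        rsa_min=0.05, asa_loss_min=1.0):
    """Surface-based label: exposed residue (RSA > 5%) losing > 1 Å^2 ASA in the complex."""
    return rsa_unbound > rsa_min and (asa_unbound - asa_bound) > asa_loss_min


def is_interface_by_distance(min_atom_distance, cutoff):
    """Distance-based label: closest inter-chain atom distance below a cutoff (Å)."""
    return min_atom_distance < cutoff


# A made-up borderline residue: exposed, loses 3 Å^2 of ASA, 4.5 Å from the partner.
print(is_interface_by_asa(rsa_unbound=0.30, asa_unbound=45.0, asa_bound=42.0))
print(is_interface_by_distance(4.5, cutoff=4.0))  # 4 Å threshold
print(is_interface_by_distance(4.5, cutoff=5.0))  # 5 Å threshold
```

      For residues at the rim of an interface, the three definitions can disagree, which is the kind of subtle distinction referred to above.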

      In the revised “Benchmark datasets” section, we have now provided more details for the binding site definitions in different datasets to avoid any potential ambiguity:

      “The benchmark datasets for evaluating binding site predictions of DNA, RNA, peptide, ATP, and HEM are constructed from BioLiP”; “A binding residue is defined if the smallest atomic distance between the target residue and the ligand is < 0.5 Å plus the sum of the van der Waals radii of the two nearest atoms”; “Besides, the benchmark dataset of protein-protein binding sites is directly from 26, which contains non-redundant transient heterodimeric protein complexes dated up to May 2021. Surface regions that become solvent inaccessible on complex formation are defined as the ground truth protein-binding sites. The benchmark datasets of metal ion (Zn2+, Ca2+, Mg2+ and Mn2+) binding sites are directly from 18, which contain non-redundant proteins dated up to December 2021 from BioLiP.”

      While GPSite demonstrates the potential to surpass state-of-the-art methods in protein binding site prediction, the evidence supporting these claims seems incomplete. The lack of methodological novelty and the unresolved questions in benchmark consistency and interface definition somewhat undermine the confidence in the results. Therefore, it's not entirely clear if the authors have fully achieved their aims as outlined.

      The work is useful for the field, especially in disease mechanism elucidation and novel drug design. The availability of genome-scale binding residue annotations GPSite offers is a significant advancement. However, the utility of this tool could be hampered by the aforementioned weaknesses unless they are adequately addressed.

      RE: We thank the reviewer for acknowledging the advancement and value of our work, as well as pointing out areas where improvements can be made. As discussed above, we have now carried out the corresponding revisions in the revised manuscript to enhance the completeness and clarity of our work.

      Reviewer #2 (Public Review):

      Summary:

      This work provides a new framework, "GPSite", to predict DNA, RNA, peptide, protein, ATP, HEM, and metal ion binding sites on proteins. This framework comes with a webserver and a database of annotations. The core of the model is a Geometric featurizer neural network that predicts the binding sites of a protein. One major contribution of the authors is the fact that they feed this neural network with predicted structure from ESMFold for training and prediction (instead of native structure in similar works) and a high-quality protein Language Model representation. The other major contribution is that it provides the public with a new light framework to predict protein-ligand interactions for a broad range of ligands.

      The authors have demonstrated the interest of their framework with mostly two techniques: ablation and benchmark.

      Strengths:

      • The performance of this framework as well as the provided dataset and web server make it useful to conduct studies.

      • The ablations of some core elements of the method, such as the protein Language Model part, or the input structure are very insightful and can help convince the reader that every part of the framework is necessary. This could also guide further developments in the field. As such, the presentation of this part of the work can hold a more critical place in this work.

      RE: We thank the reviewer for recognizing the contributions of our work and for noting that our experiments are thorough.

      Weaknesses:

      • Overall, we can acknowledge the important effort of the authors to compare their work to other similar frameworks. Yet, the lack of homogeneity of training methods and data from one work to the other makes the comparison slightly unconvincing, as the authors pointed out. Overall, the paper puts significant effort into convincing the reader that the method is beating the state of the art. Maybe, there are other aspects that could be more interesting to insist on (usability, interest in protein engineering, and theoretical works).

      RE: We sincerely appreciate the reviewer for the constructive and insightful comments. As to the concern of training data heterogeneity raised by the reviewer, it is noteworthy that current studies in this field, such as ScanNet (Nat Methods 2022) and PeSTo (Nat Commun 2023), directly compare methods trained on different datasets in their benchmark experiments. Therefore, we have adhered to the paradigm in these previous works. According to the detailed recommendations by the reviewer, we have now improved our manuscript by incorporating additional ablation studies regarding the effects of training procedure and language model representations, as well as case studies regarding the predicted structure’s quality and GPSite-based function annotations. We have also refined the Discussion section to focus more on the achievements of this work. A comprehensive point-by-point response to the reviewer’s recommendations is provided below.

      Reviewer #2 (Recommendations For The Authors):

      Major comments:

      Overall, I think the work is slightly disserved by its presentation. Some improvements could be made to the paper to better highlight the significance of your contribution.

      RE: We thank the reviewer for recognizing the significance of our work!

      • Line 188: "As expected, the performance of these methods mostly decreases substantially utilizing predicted structures for testing because they were trained with high-quality native structures."

      This is a major ablation that was not performed in this case. You used predicted structures for training, while the others did not. A better way to assess the value of this approach would be to train a network with only native structures and compare the performance with and without predicted structures, as you did later to assess other aspects of your method, such as single-task versus multi-task training.

      RE: We thank the reviewer for the valuable recommendation. We have now assessed the benefit of training with predicted instead of native structures, which brings an average AUPR increase of 4.2% as detailed in Appendix 1-note 5 and Appendix 2-table 9. For convenience, we also attach the note and table here:

      “We examined the performance under different training and evaluation settings as shown in Appendix 2-table 9. As expected, the model yields exceptional performance (average AUPR of 0.656) when trained and evaluated using native structures. However, if this model is fed with predicted structures of the test proteins, the performance substantially declines to an average AUPR of 0.573. This trend aligns with the observations for other structure-based methods as illustrated in Figure 2. More importantly, in the practical scenario where only predicted structures are available for the target proteins, training the model with predicted structures (i.e., GPSite) results in superior performance than training the model with native structures (average AUPR of 0.594 against 0.573), probably owing to the consistency between the training and testing data. For completeness, the results in Appendix 3-figure 2 are also included where GPSite is tested with native structures (average AUPR of 0.637).”

      Author response table 2.

      Performance comparison on the ten binding site test sets under different training and evaluation settings

      Note: The numbers in this table are AUPR values. “Pep” and “Pro” denote peptide and protein, respectively. “Avg” means the average AUPR values among the ten test sets. “native” and “predicted” denote applying native and predicted structures as input, respectively.
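      For readers unfamiliar with the metric, AUPR can be approximated by average precision over the ranked residues. The sketch below is illustrative only: the response does not say which estimator was used, and the function name is hypothetical. Tied scores are ranked in input order here, which a production implementation would handle more carefully.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: iterable of 0/1 ground-truth binding labels, one per residue.
    scores: iterable of predicted binding scores, same length as labels.
    """
    labels = list(labels)
    scores = list(scores)
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank residues from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos
```

      The "Avg" column described in the note would then be the arithmetic mean of this value over the ten test sets.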

      • Line 263: "ProtTrans consistently obtains competitive or superior performance compared to the MSA profiles, particularly for the target proteins with few homologous sequences (Neff < 2)."

      This seems a bit far-fetched. The figure clearly shows that performance is far superior for Neff < 2, but the performances seem rather similar for higher Neff. Could the authors evaluate the significance of the improvement numerically? MSA profiles outperform GPSite on 4 intervals, and I don't know the distribution of the data.

      RE: We thank the reviewer for the valuable suggestion. We have now revised this sentence to avoid any potential ambiguity:

      “As evidenced in Figure 4B and Appendix 2-table 8, ProtTrans consistently obtains competitive or superior performance compared to the MSA profile. Notably, for the target proteins with few homologous sequences (Neff < 2), ProtTrans surpasses MSA profile significantly with an improvement of 3.9% on AUC (P-value = 4.3×10⁻⁸).”

      The detailed significance tests and data distribution are now added in Appendix 2-table 8 and attached below as Author response-table 3 for convenience:

      Author response table 3.

      Performance comparison between GPSite and the baseline model using MSA profile for proteins with different Neff values in the combined test set of the ten ligands

      Note: Significance tests are performed following the procedure in 12,25. If P-value < 0.05, the difference between the performance is considered statistically significant.

      • Line 285: "We first visualized the distributions of residues in this dataset using t-SNE, where the residues are encoded by raw feature vectors encompassing ProtTrans embeddings and DSSP structural properties, or latent embedding vectors from the shared network of GPSite. "

      Wouldn't embedding from single-task be more relevant to show the interest of multi-task training here? Is the difference that big when comparing embeddings from single-task training to embeddings from multi-task training? Otherwise, I think the evidence from Figure 4e is sufficient, the interest of multitasking could be well-shown by single-task vs. multi-task AUPR and a few examples or predictions that are improved.

      RE: We thank the reviewer for the comment. In the second paragraph of the “The effects of protein features and model designs” section, we have compared the performance of multi-task and single-task learning. However, the visualization results in Figure 4D are related to the third paragraph, where we conducted a downstream exploration of the possibility to extend GPSite to other unseen ligands. This is based on the hypothesis that the shared network in GPSite may have captured certain common ligand-binding mechanisms during the preceding multi-task training process. We visualized the distributions of residues in an unseen carbohydrate-binding site dataset using t-SNE, where the residues are encoded by raw feature vectors (ProtTrans and DSSP), or latent embedding vectors from the shared network trained before. Although the shared network has not been specifically trained on the carbohydrate dataset, the latent representations from GPSite effectively improve the discriminability between the binding and non-binding residues as shown in Figure 4D. This finding indicates that the shared network trained on the initial set of ten molecule types has captured common binding mechanisms and may be applied to other unseen ligands.

      We have now added more descriptions in this paragraph to avoid potential ambiguity:

      “Residues that are conserved during evolution, exposed to solvent, or inside a pocket-shaped domain are inclined to participate in ligand binding. During the preceding multi-task training process, the shared network in GPSite should have learned to capture such common binding mechanisms. Here we show how GPSite can be easily extended to the binding site prediction for other unseen ligands by adopting the pre-trained shared network as a feature extractor. We considered a carbohydrate-binding site dataset from 54 which contains 100 proteins for training and 49 for testing. We first visualized the distributions of residues in this dataset using t-SNE 55, where the residues are encoded by raw feature vectors encompassing ProtTrans embeddings and DSSP structural properties, or latent embedding vectors from the shared network of GPSite trained on the ten molecule types previously.”

      • Line 291: "Employing these informative hidden embeddings as input features to train a simple MLP exhibits remarkable performance with an AUC of 0.881 (Figure 4E), higher than that of training a single-task version of GPSite from scratch (AUC of 0.853) or other state-of-the-art methods such as MTDsite and SPRINT-CBH."

      Is it necessary to introduce other methods here? The single-task vs multi-task seems enough for what you want to show?

      RE: We thank the reviewer for the comment. As discussed above, here we aim to show the potential of GPSite for the binding site prediction of unseen ligand (i.e., carbohydrate) by adopting the pre-trained shared network as a feature extractor. Thus, we think it’s reasonable to also include the performance of other state-of-the-art methods in this carbohydrate benchmark dataset as baselines.

      • Line 321: "Specifically, a protein-level binding score can be generated for each ligand by averaging the top k predicted scores among all residues. Empirically, we set k to 5 for metal ions and 10 for other ligands, considering that the binding interfaces of metal ions are usually smaller."

      Since binding sites are usually not localized on one single amino-acid, we can expect that most of the top k residues are localized around the same area of the protein both spatially and along the sequence. Is it something you observe and could consider in your method?

      RE: We thank the reviewer for the comment. We employed a straightforward method (top-k average) to convert GPSite’s residue-level annotations into protein-level annotations, where k was set empirically based on the distributions of the numbers of binding residues per sequence observed in the training set. We have not put much effort in optimizing this strategy since it mainly serves as a proof-of-concept experiment (Figure 5 A-C) to show the potential of GPSite in discriminating ligand-binding proteins. We have now revised this sentence to better explain how we selected k:

      “Specifically, a protein-level binding score indicating the overall binding propensity to a specific ligand can be generated by averaging the top k predicted scores among all residues. Empirically, we set k to 5 for metal ions and 10 for other ligands, considering the distributions of the numbers of binding residues per sequence observed in the training set.”
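      The top-k averaging described above can be sketched as follows. The values k = 5 for metal ions and k = 10 for other ligands come from the text; the function name, the dictionary, and the handling of proteins with fewer than k residues are assumptions made for illustration.

```python
# k per ligand class, as stated in the response; the key names are hypothetical.
TOP_K = {"metal_ion": 5, "other": 10}


def protein_level_score(residue_scores, k):
    """Average the k highest residue-level binding scores of one protein.

    If the protein has fewer than k residues, all of them are averaged
    (an assumption; the response does not cover this case).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not residue_scores:
        raise ValueError("residue_scores must not be empty")
    top = sorted(residue_scores, reverse=True)[:k]
    return sum(top) / len(top)
```

      For example, `protein_level_score(scores, TOP_K["metal_ion"])` would give the protein-level Zn2+ binding propensity from the per-residue Zn2+ scores of a protein.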

      As for the question raised by the reviewer, we can indeed expect that most of the top k predicted binding residues tend to cluster into several but not necessarily one area. For instance, certain macromolecules like DNA may interact with several protein surface patches due to their elongated structures (e.g., Author response-figure 1A). Another case may be a protein binding to multiple molecules of the same ligand type (e.g., Author response-figure 1B).

      Author response image 1.

      The structures of 4XQK (A) and 4KYW (B) in PDB.

      • Line 327: The accuracy of the GPSite protein-level binding scores is further validated by the ROC curves in Figure 5B, where GPSite achieves satisfactory AUC values for all ligands except protein (AUC of 0.608).

      Here may be a good place to compare yourself with others: do other frameworks experience the same problem? If so, AUC and AUPR are not relevant here; can you report some recall scores, for example?

      RE: We thank the reviewer for the valuable recommendation. We have conducted comprehensive method comparisons in the preceding “GPSite outperforms state-of-the-art methods” section, where GPSite surpasses all existing frameworks across various ligands. Here, the genome-wide analyses of Swiss-Prot in Figure 5 serve as a downstream demonstration of GPSite’s capacity for large-scale annotations. We didn’t compare with other methods since most of them are time-consuming or memory-consuming, and thus unable to process sequences of substantial quantity or length. For example, it takes about 8 min for the MSA-based method GraphBind to annotate a protein with 500 residues, while it just takes about 20 s for GPSite (see Appendix 3-figure 1 for detailed runtime comparison). It is also challenging for the atom-graph-based method PeSTo to process structures of more than 100 kDa (~1000 residues) on a 32 GB GPU as the authors suggested, while GPSite can easily process structures containing up to 2500 residues on a 16 GB GPU.

      Regarding the recall score mentioned by the reviewer, GPSite achieves a recall of 0.95 (threshold = 0.5) for identifying protein-binding proteins. This indicates that GPSite can accurately identify positive samples, but it also tends to misclassify negative samples as positive. In our original manuscript, we claimed that “This may be ascribed to the fact that protein-protein interactions are ubiquitous in living organisms while the Swiss-Prot function annotations are incomplete”. To better support this claim, we have now added two examples in Appendix 1-note 7, where GPSite confidently predicted the presence of the “protein binding” function (GO:0005515). Notably, this function was absent in these two proteins in the Swiss-Prot database at the time of manuscript preparation (release: 2023-05-03), but has been included in the latest release of Swiss-Prot (release: 2023-11-08). For convenience, we also attach the note here:

      “As depicted in Figure 5A, GPSite assigns relatively high prediction scores to the proteins without “protein binding” function in the Swiss-Prot annotations, leading to a modest AUC value of 0.608 (Figure 5B). This may be ascribed to the fact that protein-protein interactions are ubiquitous in living organisms while the Swiss-Prot function annotations are incomplete. To support this hypothesis, we present two proteins as case studies, both sharing < 20% sequence identity with the protein-binding training set of GPSite. The first case is Aminodeoxychorismate synthase component 2 from Escherichia coli (UniProt ID: P00903). GPSite confidently predicted this protein as a protein-binding protein with a high prediction score of 0.936. Notably, this protein was not annotated with the “protein binding” function (GO:0005515) or any of its GO child terms in the Swiss-Prot database at the time of manuscript preparation (https://rest.uniprot.org/unisave/P00903?format=txt&versions=171, release: 2023-05-03). However, in the latest release of Swiss-Prot (https://rest.uniprot.org/unisave/P00903?format=txt&versions=174, release: 2023-11-08) during manuscript revision, this protein is annotated with the “protein heterodimerization activity” function (GO:0046982), which is a child term of “protein binding”. In fact, the heterodimerization activity of this protein has been validated through experiments in the year of 1996 (PMID: 8679677), indicating the potential incompleteness of the Swiss-Prot annotations. The other case is Hydrogenase-2 operon protein HybE from Escherichia coli (UniProt ID: P0AAN1), which was also predicted as a protein-binding protein by GPSite (score = 0.909). Similarly, this protein was not annotated with the “protein binding” function in the Swiss-Prot database at the time of manuscript preparation (https://rest.uniprot.org/unisave/P0AAN1?format=txt&versions=108). However, in the latest release of Swiss-Prot (https://rest.uniprot.org/unisave/P0AAN1?format=txt&versions=111), this protein is annotated with the “preprotein binding” function (GO:0070678), which is a child term of “protein binding”. In fact, the preprotein binding function of this protein has been validated through experiments in the year of 2003 (PMID: 12914940). These cases demonstrate the effectiveness of GPSite for completing the missing function annotations in Swiss-Prot.”
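      The combination of high recall with a modest AUC discussed in this response can be made concrete with a small sketch. The function name is hypothetical, and treating a score equal to the threshold as a positive prediction is an assumption; the data in the usage note are invented for illustration and are not results from the study.

```python
def recall_and_precision(labels, scores, threshold=0.5):
    """Recall and precision of thresholded protein-level scores.

    labels: 0/1 annotations (1 = annotated as binding).
    scores: predicted protein-level binding scores.
    """
    tp = fp = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label:
            tp += 1
        elif predicted_positive and not label:
            fp += 1
        elif label:
            fn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision
```

      On toy labels `[1, 1, 0, 0, 0]` with scores `[0.9, 0.8, 0.7, 0.6, 0.2]`, every annotated protein is recovered, yet half of the positive calls fall on proteins without the annotation. If annotations are incomplete, some of those apparent false positives may be true binders, which is the argument the case studies are meant to support.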

      • Line 381: 'Despite the noteworthy advancements achieved by GPSite, there remains scope for further improvements. Given that the ESM Metagenomic Atlas 34 provides 772 million predicted protein structures along with pre-computed language model embeddings, self-supervised learning can be employed to train a GPSite model for predicting masked sequence and structure attributes, or maximizing the similarity between the learned representations of substructures from identical proteins while minimizing the similarity between those from different proteins using a contrastive loss function training from scratch. Additional opportunities for upgrade exist within the network architecture. For example, a variational Expectation-Maximization (EM) framework 58 can be adopted to handle the hierarchical graph structure inherent in proteins, which contains the top view of the residue graph and the bottom view of the atom graph inside a residue. Such an EM procedure enables training two separate graph neural networks for the two views while simultaneously allowing interaction and mutual enhancement between the two modules. Meta-learning could also be explored in this multi-task scenario, which allows fast adaptation to unseen tasks with limited labels.'

      I think this does not belong here. It feels like half of your discussion is not talking about the achievements of this paper but future very specific directions. Focus on the take-home arguments (performances of the model, ability to predict a large range of tasks, interest in key components of your model, easy use) of the paper and possible future direction but without being so specific.

      RE: We thank the reviewer for the valuable suggestion. We have now simplified the discussions on the future directions notably:

      “Despite the noteworthy advancements achieved by GPSite, there remains scope for further improvements. GPSite may be improved by pre-training on the abundant predicted structures in ESM Metagenomic Atlas, and then fine-tuning on binding site datasets. Besides, the hidden embeddings from ESMFold may also serve as informative protein representations. Additional opportunities for upgrade exist within the network architecture. For example, a variational Expectation-Maximization framework can be adopted to handle the hierarchical atom-to-residue graph structure inherent in proteins. Meta-learning could also be explored in this multi-task scenario, which allows fast adaptation to unseen tasks with limited labels.”

      • Overall there is also a lack of displayed structure. You should try to select a few examples of binding sites that were identified correctly by your method and not by others, if possible get some insights on why. Also, some negative examples could be interesting so as to have a better idea of the interest.

      RE: We thank the reviewer for the valuable recommendation. We have performed a case study for the structure of the glucocorticoid receptor in Figure 3 D-H to illustrate a potential reason for the robustness of GPSite. Moreover, we have now added a case study in Appendix 1-note 3 and Appendix 3-figure 5 to explain why GPSite sometimes is not as accurate as the state-of-the-art structure-based method. For convenience, we also attach the note and figure here:

      “Here we present an example of an RNA-binding protein, i.e., the ribosome biogenesis protein ERB1 (PDB: 7R6Q, chain m), to illustrate the impact of predicted structure’s quality. As shown in Appendix 3-figure 5, ERB1 is an integral component of a large multimer structure comprising protein and RNA chains (i.e., the state E2 nucleolar 60S ribosome biogenesis intermediate). Likely due to the neglect of interactions from other protein chains, ESMFold fails to predict the correct conformation of the ERB1 chain (TM-score = 0.24). Using this incorrect predicted structure, GPSite achieves an AUPR of 0.580, lower than GraphBind input with the native structure (AUPR = 0.636). However, the performance of GraphBind substantially declines to an AUPR of 0.468 when employing the predicted structure as input. Moreover, if GPSite adopts the native structure for prediction, a notable performance boost can be obtained (AUPR = 0.681).”

      Author response image 2.

      The prediction results of GPSite and GraphBind for the ribosome biogenesis protein ERB1. (A) The state E2 nucleolar 60S ribosome biogenesis intermediate (PDB: 7R6Q). The ribosome biogenesis protein ERB1 (chain m) is highlighted in blue, while other protein chains are colored in gray. The RNA chains are shown in orange. (B) The RNA-binding sites on ERB1 (colored in red). (C) The ESMFold-predicted structure of ERB1 (TM-score = 0.24). The RNA-binding sites are also mapped onto this predicted structure (colored in red). (D-G) The prediction results of GPSite and GraphBind for the predicted and native ERB1 structures. The confidence of the predictions is represented with a gradient of color from blue for non-binding to red for binding.

      Minor comments:

      • Line 169: "Note that since our test sets may partly overlap with the training sets of these methods, the results reported here should be the upper limits for the existing methods."

      Yes, but they were potentially not trained on the most recent structures in that case. These methods could also see improved performance with an updated training set.

      RE: We thank the reviewer for the comment. We have now deleted this sentence.

      • Line 176: "Since 358 of the 375 proteins in our protein-binding site test set share > 30% identity with the training sequences of PeSTo, we re-split our protein-binding dataset to generate a test set of 65 proteins sharing < 30% identity with the training set of PeSTo for a fair evaluation."

      Too specific to be here in my opinion.

      RE: We thank the reviewer for the comment. We have now moved these details to Appendix 1-note 2. The description in the main text here is now more concise:

      “Given the substantial overlap between our protein-binding site test set and the training set of PeSTo, we conducted separate training and comparison using the datasets of PeSTo, where GPSite still demonstrates a remarkable improvement over PeSTo (Appendix 1-note 2).”

      • Figure 2. The authors should try to either increase Fig A's size or increase the font size. This could probably be done by compressing the size of Figure C into a single figure.

      RE: We thank the reviewer for the suggestion. We have now increased the font size in Figure 2A. Besides, the figures in the final version of the manuscript should be clearer, as we will be able to upload SVG files.

      • Have you tried using embeddings from more structure-aware pLM such as ESM Fold embeddings (fine-tuned) or ProstTrans (that may be more recent than this study)?

      RE: We thank the reviewer for the insightful comment. We have not yet explored the embeddings from structure-aware pLM, but we acknowledge its potential as a promising avenue for future investigation. We have now added this point in our Discussion section:

      “Besides, the hidden embeddings from ESMFold may also serve as informative protein representations.”

      Reviewer #3 (Public Review):

      Summary

      The authors of this work aim to address the challenge of accurately and efficiently identifying protein binding sites from sequences. They recognize the limitations of current methods, including reliance on multiple sequence alignments or experimental protein structures and the under-explored geometry of the structure, which limit performance and genome-scale applications. The authors have developed a multi-task network called GPSite that predicts binding residues for a range of biologically relevant molecules, including DNA, RNA, peptides, proteins, ATP, HEM, and metal ions, using a combination of sequence embeddings from protein language models and ESMFold-predicted structures. Their approach attempts to extract residual and relational geometric contexts in an end-to-end manner, surpassing current sequence-based and structure-based methods.

      Strengths

      • The GPSite model's ability to predict binding sites for a wide variety of molecules, including DNA, RNA, peptides, and various metal ions.

      • Based on the presented results, GPSite outperforms state-of-the-art methods in several benchmark datasets.

      • GPSite adopts predicted structures instead of native structures as input, enabling the model to be applied to a wider range of scenarios where native structures are rare.

      • The authors emphasize the low computational cost of GPSite, which enables rapid genome-scale binding residue annotations, indicating the model's potential for large-scale applications.

      RE: We thank the reviewer for recognizing the significance and value of our work!

      Weaknesses

      • One major advantage of GPSite, as claimed by the authors, is its efficiency. Although the manuscript mentioned that the inference takes about 5 hours for all datasets, it remains unclear how much improvement GPSite can offer compared with existing methods. A more detailed benchmark comparison of running time against other methods is recommended (including the running time of different components, since some methods like GPSite use predicted structures while some use native structures).

      RE: We thank the reviewer for the valuable suggestion. Empirically, it takes about 5-20 min for existing MSA-based methods to make predictions for a protein with 500 residues, while it only takes about 1 min for GPSite (including structure prediction). However, it is worth noting that some predictors in our benchmark study are solely available as webservers, and it is challenging to compare the runtime between a standalone program and a webserver due to the disparity in hardware configurations. Therefore, we have now included comprehensive runtime comparisons between the GPSite webserver and other top-performing servers in Appendix 3-figure 1 to illustrate the practicality and efficiency of our method. For convenience, we also attach the figure here as Author response-figure 3. The corresponding description is now added in the “GPSite outperforms state-of-the-art methods” section:

      “Moreover, GPSite is computationally efficient, achieving comparable or faster prediction speed compared to other top-performing methods (Appendix 3-figure 1).”

      Author response image 3.

      Runtime comparison of the GPSite webserver with other top-performing servers. Five protein chains (i.e., 8HN4_B, 8USJ_A, 8C1U_A, 8K3V_A and 8EXO_A) comprising 100, 300, 500, 700, and 900 residues, respectively, were selected for testing, and the average runtime is reported for each method. Note that a significant portion of GPSite’s runtime (75 s, indicated in orange) is allocated to structure prediction using ESMFold.

      • Since the model uses predicted protein structure, the authors have conducted some studies on the effect of the predicted structure's quality. However, only the 0.7 threshold was used. A more comprehensive analysis with several different thresholds is recommended.

      RE: We thank the reviewer for the comment. We assessed the effect of the predicted structure's quality by evaluating GPSite’s performance on high-quality (TM-score > 0.7) and low-quality (TM-score ≤ 0.7) predicted structures. We did not employ multiple thresholds (e.g., 0.3, 0.5, and 0.7), as the majority of proteins in the test sets were accurately predicted by ESMFold. Specifically, as shown in Figure 3B, Appendix 3-figure 3 and Appendix 2-table 5, the numbers of proteins with TM-score ≤ 0.7 are small in most datasets (e.g., 42 for DNA and 17 for ATP). Consequently, there is insufficient data available for analysis with lower thresholds, except for the RNA test set. Notably, Figure 3C presents a detailed inspection of the 104 proteins with TM-score < 0.5 in the RNA test set. Within this subset, GPSite consistently outperforms the state-of-the-art structure-based method GraphBind with predicted structures as input, regardless of the prediction quality of ESMFold. Only in cases where structures are predicted with extremely low quality (TM-score < 0.3) does GPSite fall behind GraphBind input with native structures. This result further demonstrates the robustness of GPSite. We have now added clearer explanations in the “GPSite is robust for low-quality predicted structures” section:

      “Figure 3B and Appendix 3-figure 3 show the distributions of TM-scores between native and predicted structures calculated by US-align in the ten benchmark datasets, where most proteins are accurately predicted with TM-score > 0.7 (see also Appendix 2-table 5)”; “Given the infrequency of low-quality predicted structures except for the RNA test set, we took a closer inspection of the 104 proteins with predicted structures of TM-score < 0.5 in the RNA test set.”
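      The quality stratification described in this response amounts to partitioning test proteins by the TM-score of their predicted structure. A minimal sketch, assuming TM-scores are available as a mapping from protein identifier to score (the function name and the identifiers in the usage note are hypothetical):

```python
def split_by_tm_score(tm_scores, threshold=0.7):
    """Partition proteins into high- and low-quality predicted structures.

    tm_scores: dict mapping a protein identifier to the TM-score between
    its native and predicted structure.
    Returns (high, low): ids with TM-score > threshold, and ids with
    TM-score <= threshold, each sorted for reproducibility.
    """
    high = sorted(pid for pid, tm in tm_scores.items() if tm > threshold)
    low = sorted(pid for pid, tm in tm_scores.items() if tm <= threshold)
    return high, low
```

      Calling `split_by_tm_score({"protA": 0.92, "protB": 0.70, "protC": 0.24})` places `protA` in the high-quality group and the other two in the low-quality group; a metric such as AUPR can then be computed separately on each group.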

      • To demonstrate the robustness of GPSite, the authors performed a case study on human GR containing two zinc fingers, where the predicted structure is not perfect. The analysis could benefit from a more detailed explanation of why the model can still infer the binding site correctly even though the input structural information is slightly off.

      RE: We thank the reviewer for the comment. We have actually explained the potential reason for the robustness of GPSite in the second paragraph of the “GPSite is robust for low-quality predicted structures” section. In summary, although the whole structure of this protein is not perfectly predicted, the local structures of the binding domains of peptide, DNA and Zn2+ are actually predicted accurately as evidenced by the superpositions of the native and predicted structures in Figure 3D and 3E. Therefore, GPSite can still make reliable predictions. We have now revised this paragraph to explain these more clearly:

      “Figure 3D shows the structure of the human glucocorticoid receptor (GR), a transcription factor that binds DNA and assembles a coactivator peptide to regulate gene transcription (PDB: 7PRW, chain A). The DNA-binding domain of GR also consists of two C4-type zinc fingers to bind Zn2+ ions. Although the structure of this protein is not perfectly predicted (TM-score = 0.72), the local structures of the binding domains of peptide and DNA are actually predicted accurately as viewed by the superpositions of the native and predicted structures in Figure 3D and 3E. Therefore, GPSite can correctly predict all Zn2+ binding sites and precisely identify the binding sites of DNA and peptide with AUPR values of 0.949 and 0.924, respectively (Figure 3F, G and H).”

      • To analyze the relatively low AUC value for protein-protein interactions, the authors claimed that it is "due to the fact that protein-protein interactions are ubiquitous in living organisms while the Swiss-Prot function annotations are incomplete", which is unjustified. It is highly recommended to support this claim by showing at least one example where GPSite's prediction is a valid binding site that is not present in the current Swiss-Prot database or via other approaches.

      RE: We thank the reviewer for the valuable recommendation. To support this claim, we have now added two examples in Appendix 1-note 7, where GPSite confidently predicted the presence of the “protein binding” function (GO:0005515). Notably, this function was absent in these two proteins in the Swiss-Prot database at the time of manuscript preparation (release: 2023-05-03), but has been included in the latest release of Swiss-Prot (release: 2023-11-08). For convenience, we also attach the note below:

      “As depicted in Figure 5A, GPSite assigns relatively high prediction scores to the proteins without “protein binding” function in the Swiss-Prot annotations, leading to a modest AUC value of 0.608 (Figure 5B). This may be ascribed to the fact that protein-protein interactions are ubiquitous in living organisms while the Swiss-Prot function annotations are incomplete. To support this hypothesis, we present two proteins as case studies, both sharing < 20% sequence identity with the protein-binding training set of GPSite. The first case is Aminodeoxychorismate synthase component 2 from Escherichia coli (UniProt ID: P00903). GPSite confidently predicted this protein as a protein-binding protein with a high prediction score of 0.936. Notably, this protein was not annotated with the “protein binding” function (GO:0005515) or any of its GO child terms in the Swiss-Prot database at the time of manuscript preparation (https://rest.uniprot.org/unisave/P00903?format=txt&versions=171, release: 2023-05-03). However, in the latest release of Swiss-Prot (https://rest.uniprot.org/unisave/P00903?format=txt&versions=174, release: 2023-11-08) during manuscript revision, this protein is annotated with the “protein heterodimerization activity” function (GO:0046982), which is a child term of “protein binding”. In fact, the heterodimerization activity of this protein has been validated through experiments in the year of 1996 (PMID: 8679677), indicating the potential incompleteness of the Swiss-Prot annotations. The other case is Hydrogenase-2 operon protein HybE from Escherichia coli (UniProt ID: P0AAN1), which was also predicted as a protein-binding protein by GPSite (score = 0.909). Similarly, this protein was not annotated with the “protein binding” function in the Swiss-Prot database at the time of manuscript preparation (https://rest.uniprot.org/unisave/P0AAN1?format=txt&versions=108). However, in the latest release of Swiss-Prot (https://rest.uniprot.org/unisave/P0AAN1?format=txt&versions=111), this protein is annotated with the “preprotein binding” function (GO:0070678), which is a child term of “protein binding”. In fact, the preprotein binding function of this protein has been validated through experiments in the year of 2003 (PMID: 12914940). These cases demonstrate the effectiveness of GPSite for completing the missing function annotations in Swiss-Prot.”

      • The authors reported that many GPSite-predicted binding sites are associated with known biological functions. Notably, for RNA-binding sites, there is a significantly higher proportion of translation-related binding sites. The analysis could benefit from a further investigation into this observation, such as analyzing the percentage of such interactions in the training set. In addition, if there is sufficient data, it would also be interesting to see the cross-interaction-type performance of the proposed model, e.g., train the model on a dataset excluding specific binding sites and test its performance on that class of interactions.

      RE: We thank the reviewer for the suggestion. We would like to clarify that the analysis in Figure 5C was conducted at “protein-level” instead of “residue-level”. As described in the second paragraph of the “Large-scale binding site annotation for Swiss-Prot” section, a protein-level ligand-binding score was assigned to a protein by averaging the top k residue-level predicted binding scores. This protein-level score indicates the overall binding propensity of the protein to a specific ligand. We gathered the top 20,000 proteins with the highest protein-level binding scores for each ligand and found that their biological process annotations from Swiss-Prot were consistent with existing knowledge. We have now revised the corresponding sentence to explain these more clearly:

      “Exploiting the residue-level binding site annotations, we could readily extend GPSite to discriminate between binding and non-binding proteins of various ligands. Specifically, a protein-level binding score indicating the overall binding propensity to a specific ligand can be generated by averaging the top k predicted scores among all residues.”
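
      The protein-level score described above can be illustrated with a minimal sketch. The function name `protein_level_score`, the default `k=10`, and the toy scores are assumptions made for illustration; the response does not state the value of k used in GPSite.

```python
def protein_level_score(residue_scores, k=10):
    """Average the k highest residue-level binding scores of one protein.

    `k` is a hypothetical default; the value used by GPSite is not given here.
    """
    if not residue_scores:
        raise ValueError("residue_scores must not be empty")
    # Proteins shorter than k residues simply contribute all their residues.
    top = sorted(residue_scores, reverse=True)[:k]
    return sum(top) / len(top)


# Toy residue-level scores for one protein and one ligand type.
scores = [0.05, 0.91, 0.12, 0.88, 0.40, 0.02]
print(protein_level_score(scores, k=2))
```

      Averaging only the top k residues keeps the score sensitive to a small, confident binding patch, which a mean over all residues would dilute in long proteins.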

      As for the cross-interaction-type performance raised by the reviewer, we have now conducted cross-type evaluations to investigate the specificity of the ligand-specific MLPs and the inherent similarities among different ligands in Appendix 1-note 6 and Appendix 2-table 10. For convenience, we also attach the note and table here:

      “We conducted cross-type evaluations by applying different ligand-specific MLPs in GPSite for the test sets of different ligands. As shown in Appendix 2-table 10, for each ligand-binding site test set, the corresponding ligand-specific network consistently achieves the best performance. This indicates that the ligand-specific MLPs have specifically learned the binding patterns of particular molecules. We also noticed that the cross-type performance is reasonable for the ligands sharing similar properties. For instance, the DNA-specific MLP exhibits a reasonable AUPR when predicting RNA-binding sites, and vice versa. Similar trends are also observed between peptide and protein, as well as among metal ions as expected. Interestingly, the cross-type performance between ATP and HEM is also acceptable, potentially attributed to their comparable molecular weights (507.2 and 616.5, respectively).”

      Author response table 4.

      Cross-type performance by applying different ligand-specific MLPs in GPSite for the test sets of different ligands

      Note: “Pep” and “Pro” denote peptide and protein, respectively. The numbers in this table are AUPR values. The best/second-best result in each test set is indicated by bold/underlined font.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We are pleased to send you a revised version of our manuscript entitled “voyAGEr: free web interface for the analysis of age-related gene expression alterations in human tissues” and the associated shiny web app, in which we incorporate the referees’ feedback. We would like to express our gratitude for their time and valuable insights, which have contributed to the improvement of our work. We appreciate the rigorous evaluation process that eLife maintains.

      In this letter, we address each of the reviewers' comments and concerns, point-by-point, offering detailed responses and clarifications. We have made several revisions to our manuscript following their recommendations.

      We must note that the revised version of the manuscript has two new joint first authors, Rita Martins-Silva and Alexandre Kaizeler, who performed all the requested reanalyses, given that the initial first author, Arthur Schneider, has already left our lab. We must also point to the following minor unsolicited improvements we took the opportunity to make:

      • Added a comprehensive tutorial to the GitHub repository on how to navigate through voyAGEr’s features.

      • Implemented sample randomisation in the scatter plots depicting gene expression across the age axis to ensure data privacy.

      • Implemented minor adjustments within the web app to enhance user comprehension and clarity when visualizing the data.

      • Improved clarity of the methodological sections.

      Reviewer 1

      (1.1) While this may be obvious to others for some reason that escaped me, I was unsure what was the basis for the authors' choice of 16 years as the very specific sliding window size. If I'm not alone in this, it might add clarity for other readers and users if this parameter choice were explained and justified more explicitly.

      We apologise for our omission in providing the rationale behind our choice in the previous version. We chose 16 years as our sliding window size because this was the minimum needed to guarantee the presence of more than one sample per window, across all the tissues considered in the study (Figure R1 below).

      We added the following sentence to the manuscript (v. Methods, ShARP-LM):

      “This was the minimum age span needed to guarantee the presence of more than one sample per window, across all considered tissues.”
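
      The window-size criterion can be sketched as a simple search. This is an illustration under stated assumptions, not the ShARP-LM code: windows are taken to cover `width` consecutive integer ages and to slide one year at a time, and the function names and toy ages are hypothetical.

```python
def min_samples_in_windows(ages, width, lo, hi):
    """Smallest sample count over all windows of `width` integer ages within [lo, hi]."""
    counts = []
    for start in range(lo, hi - width + 2):
        # A window covers the ages start, start + 1, ..., start + width - 1.
        counts.append(sum(start <= a < start + width for a in ages))
    return min(counts)


def smallest_valid_width(ages_by_tissue, lo, hi, min_count=2):
    """Smallest width for which every window in every tissue holds >= min_count samples."""
    for width in range(1, hi - lo + 2):
        if all(
            min_samples_in_windows(ages, width, lo, hi) >= min_count
            for ages in ages_by_tissue.values()
        ):
            return width
    return None  # no width satisfies the criterion


# Toy donor ages for two hypothetical tissues.
toy = {
    "tissue_a": [20, 22, 24, 26, 28, 30],
    "tissue_b": [20, 21, 23, 25, 27, 29, 30],
}
print(smallest_valid_width(toy, lo=20, hi=30))
```

      Setting `min_count=2` encodes the "more than one sample per window" requirement; the sparsest tissue determines the result.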

      (1.2) "In particular, tissue-specific periods of major transcriptional changes in the fifth and eighth decades of human lifespan have been revealed, reflecting the so-called digital aging and consistently with what is observed in mice" here I think that "consistently" should be "consistent".

      We thank the reviewer for the comment. Following the suggestion, we have changed 'consistently' to 'consistent', which is the correct usage in this sentence.

      (1.3) "On a different note, sex biases have been reported in for the expression of SALL1 and KAL1 in adipose tissue and lung, respectively." Here I think that "in for" should be "in".

      As recommended by the reviewer, we have replaced ‘in for’ with ‘in’. As we substituted KAL1, the current sentence now stands as “On a different note, sex biases have been reported in the expression of SALL1 and DDX43 in adipose tissue and lung, respectively”.

      (1.4) "We downloaded the matrix with the RNA-seq read counts for each gene in each GTEx v7 sample from the project's data portal (https://www.gtexportal.org/)." In my pdf manuscript this hyperlink appears to be broken.

      We appreciate the reviewer's attention to the broken link, and we have rectified the issue. The link should now be fully operational, effectively directing users to the GTEx Portal.

      (1.5) Under methods, I might suggest "Development platform" or "Development platforms" over "Development's platform" as a heading.

      We have modified the heading of this section in the methods to 'Development Platforms', as we believe it better reflects the information conveyed.

      Reviewer 2

      (2.1) In this tool/resource paper, it is crucial that the data used is up-to-date to provide the most comprehensive and relevant information to users. However, the authors utilized GTEx v7, which is an outdated (2016) version of the dataset. It is worth noting that GTEx v8 includes over 940 individuals, representing a 35% increase in individuals, and a 50% increase in the total number of samples. The authors should check the newer versions of GTEx and update the data.

      When the development of the voyAGEr web application began, GTEx version 7 was the most up to date. Nevertheless, we agree that version 8 offers a notably more extensive dataset, encompassing a larger number of individuals and samples and introducing new tissues. Consequently, we have updated our application to incorporate the data from GTEx version 8.

      (2.2) The authors did not address any correction for batch effects or RNA integrity numbers, which are known to affect transcriptome profiles. For instance, our analysis of GTEx v8 Cortex tissue revealed that after filtering out lowly expressed genes, in the same way authors did, PC1 (which accounts for 24% of the variation) had a Spearman's correlation value of 0.48 (p<6.1e-16) with RNA integrity number.

      We acknowledge the validity of the reviewer’s comment and appreciate the importance of such corrections in enhancing data interpretation. In response, we conducted a thorough unbiased investigation into potential batch effects, with the COHORT variable emerging as the primary driver of those observed across most tissues. Furthermore, SMRIN (as the reviewer pointed out), DTHHRDY, MHSMKYRS and the number of detected genes in each sample were consistently associated with the primary sources of variation. As a result, we implemented batch effect correction for those five conditions, in a tissue-specific manner.

      We provide a detailed explanation of the batch effect correction methodology and its importance in the biological interpretation of results in the Methods section, specifically under "Read count data pre-processing". Additionally, we have included two new supplementary figures, Sup. Figures 7 and 8, to illustrate a batch effect example in lung tissue and emphasise the critical role of this correction in data interpretation.
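
      The kind of check the reviewer describes, relating per-sample scores on a principal component to a technical covariate such as the RNA integrity number, can be sketched with a rank correlation. The code below is a generic Spearman implementation with average ranks for ties. The variable names and toy values are hypothetical, and it is not the authors' batch-effect pipeline.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-sample PC1 scores and RNA integrity numbers.
pc1_scores = [-2.1, -0.4, 0.3, 1.8, 2.5]
rin_values = [6.1, 7.0, 6.8, 8.2, 9.0]
print(spearman(pc1_scores, rin_values))
```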

      (2.3) The data analyzed in the GTEx dataset is not filtered or corrected for the cause of death, which can range from violent and sudden deaths to slow deaths or cases requiring a ventilator. As a result, the data may not accurately represent healthy aging profiles but rather reflect changes in the transcriptome specific to certain diseases due to the age-related increase in disease risk. While the authors do acknowledge this limitation in the discussion, stating that it is not a healthy cohort and disease-specific analysis is not feasible due to the limited number of samples, it would be useful for users to have the option to analyze only cases of fast death, excluding ventilator cases and deaths due to disease. This is typically how GTEx data is utilized in aging studies. Alternatively, the authors should consider including the "cause of death" variable in the model.

      This comment is closely related to the prior discussion (point 2.2). Notably, two of the covariates selected for batch effect correction, namely, DTHHRDY (Death classification based on the 4-point Hardy Scale1) and COHORT (indicating whether the participant was a postmortem, organ, or surgical donor1), have a direct relevance to this issue, i.e., both relate to the cause of death of the individual.

      1 According to the nomenclature of variables described in https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetListOfAllObjects.cgi?study_id=phs000424.v9.p2&object_type=variable

      We therefore effectively account for their influence on gene expression, mitigating these factors' impact.

      This approach represents a compromise, as it is practically infeasible to ascertain the absence of underlying health conditions in the remaining samples, even if only considering cases of “fast death”. Hence, we opted to keep all samples, independently of the cause of death of their donors, to dilute potential effects associated with individual causes of death.

      (2.4) The age distribution varies across tissues which may impact the results of the study. The authors' claim that age distribution does not affect the outcomes is inconclusive. Since the study aims to provide cross-tissue analysis, it is important to note that differing age distributions across tissues can influence the overall results. To address this, the authors should conduct downsampling to different age distributions across tissues and evaluate the level of tissue-specific or common changes that remain after the distributions are made similar.

      We acknowledge that variations in age distributions are evident across different tissues, with brain tissues displaying a notably pronounced disparity (green density lines in Figure R2 below).

      To address this issue comprehensively, we conducted tissue-specific downsampling, by reducing the number of samples in a given age window to the minimum available sample size within all age windows for a given tissue. The histograms (density plots) of the number of samples per age window of 16 years considered in the ShARP-LM model, as well as the minimum number of samples in each age window, per tissue are illustrated in Figure R1. After performing downsampling, we computed the logFC and p-value of differential expression for each gene, per age window, and compared them (for all genes in a given age window) with those involving all samples.
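
      The downsampling step can be sketched as follows. This is a minimal illustration, assuming that samples are drawn at random without replacement within each window; the window labels, sample identifiers, and function name are hypothetical.

```python
import random


def downsample_windows(samples_by_window, seed=0):
    """Keep the same number of randomly chosen samples in every age window of one tissue.

    The target size is the smallest window size observed for that tissue.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    target = min(len(samples) for samples in samples_by_window.values())
    return {
        window: sorted(rng.sample(samples, target))
        for window, samples in samples_by_window.items()
    }


# Toy example: three age windows of one tissue with unequal sample counts.
toy = {
    "20-35": ["s1", "s2", "s3", "s4"],
    "36-51": ["s5", "s6"],
    "52-67": ["s7", "s8", "s9"],
}
balanced = downsample_windows(toy, seed=1)
print({window: len(samples) for window, samples in balanced.items()})
```

      Because every window is reduced to the size of the sparsest one, the loss of statistical power reported below is an expected consequence of the procedure.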

      Despite changes in logFC with downsampling, a considerable positive correlation is maintained (Figure R3, top panel). This suggests that the overall trends in gene expression changes persist. However, the downsampling process expectedly results in a decrease of statistical power within each age window concomitant with the decreased sample size, evident from the shift of genes from the third to the first quadrant in Figure R3, bottom panel. Consequently, we have opted for maintaining results encompassing all samples and removing the paragraph in the Discussion that asserted the absence of age distribution impact on the overall outcomes (“Indeed, we found no confounding between the distribution of samples’ ages and the trend of gene expression progression over age in any tissue.”), as we deem it inaccurate, potentially leading to misinterpretation. We have added a supplementary figure (Supplementary Figure 8, identical to Figure R3) illustrating the effect of downsampling, and the following paragraph to the manuscript’s Discussion section:

      “When downsampling to ensure a balanced age distribution, a loss of statistical power is apparent but a considerable positive correlation with the original results is maintained and a substantial number of significant alterations remain so (Supplementary Figure 8).”

      We acknowledge that this limitation can be addressed with the growing accumulation of human tissue transcriptomes in publicly available databases, a trend we anticipate in the near future. We are committed to promptly updating voyAGEr with any new data releases that may offer a solution to this concern.

      Nonetheless, we want to underscore, as the reviewer has astutely pointed out, that while voyAGEr can facilitate cross-tissue comparisons, it must be done with caution. In this regard, we inserted the following paragraph into the Discussion:

      “Due to the tissue-specific nature of the pre-processing steps (v. Read count data preprocessing in the Methods section), and given that most of the plotted gene expression distributions are centred and scaled by tissue, it is important to note that voyAGEr may not always be suited for direct comparisons between different tissues. For instance, it does not allow one to directly ascertain whether a gene exhibits different expression levels in different tissues or whether the expression of a particular gene in one tissue changes more drastically with age than in another tissue.”

      (2.5) The GTEx resource is extremely valuable, however, it comes with challenges. GTEx contains tissue samples from the same individuals across different tissues, resulting in varying degrees of overlap in sample origin across tissues as not all tissues are collected for all individuals. This could affect the similar/different patterns observed across tissues. As this tool is meant for broader use by the community, it is crucial for the authors to either rule out this possibility by conducting a cross-tissue comparison using a non-parametric model that accounts for the dependency between samples from the same individual, or to provide information on the degree of similarity between samples so that the users can keep this possibility in mind when using the tool for hypothesis generation.

      We agree that the variable degrees of overlap between tissues (Figure R4) could lead to a confounding between trends in a population of common individuals and those associated with age. We therefore examined the contributions of variables 'donor,' 'tissue,' and 'age' to the overall variance in the data (Figure R5, panel A), having normalised the data collectively across all tissues. Tissue and donor contribute approximately 90% and 10% of the variance, respectively. Age exhibits minimal impact (around 1%), which may be attributed to the relative subtlety of its effects on gene expression and to the tissue specificity of ageing-associated changes. Notably, removing the 'donor' variable does not transfer this variance to 'age', suggesting a limited confounding between these variables (see Figure R5, panel B).

      We also specifically examined the pairs of tissues exhibiting the lowest (Brain Amygdala / Small Intestine), median (Pancreas / Heart Left Ventricle), and highest (Kidney Cortex / Muscle Skeletal) percentages of shared donors. We identified and selectively removed samples from shared donors while maintaining the original sample size imbalance between tissues. Subsequently, we calculated each gene’s mean expression within each age window from the ShARP-LM pipeline, followed by each gene’s Pearson’s correlation of expression between tissue pairs. The resulting coefficients, both with and without the removal of common donors, were compared in scatter plots (Figure R6, left plots). As this process inherently involves downsampling, which may impact results (v. comment 2.4), we performed additional downsampling by randomly removing samples from both tissues according to the proportions defined for the removal of common donors (Figure R6, right plots).

      In the chosen scenarios, we note a similar impact between the targeted removal of common donors and random downsampling. Nevertheless, the effects of removing samples may vary according to the absolute number of remaining samples. Consequently, singling out individual cases may not provide conclusive insights. To systematically address this, we represented all tissue pairs in a heatmap, colour-coded based on whether the removal of common donors is more impactful (red) or less impactful (blue) than random downsampling (Figure R7). The values depicted in the heatmap, denoted as the Impact of Common Donors (ICD), are computed for each tissue pair. This calculation involves several steps: first, we determined the absolute difference in Pearson’s correlation for each gene’s mean expression within each age window from the ShARP-LM pipeline, between the original data and the subset of data without common donors (DiffWoCD) or with random downsampling (DiffRD). Subsequently, the medians of DiffWoCD and DiffRD are computed, and the difference between these median values provides the ICD for each tissue pair. Because correlation is symmetric (i.e., the results for tissue 1 vs tissue 2 mirror those for tissue 2 vs tissue 1), the resulting matrix is triangular in form.
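
      The ICD calculation described above can be written out as a short sketch. It assumes the per-gene Pearson coefficients for one tissue pair have already been computed under the three conditions; the function and argument names and the toy values are hypothetical.

```python
from statistics import median


def impact_of_common_donors(r_original, r_without_common, r_random_down):
    """ICD for one tissue pair.

    Each argument holds one Pearson coefficient per gene, in the same gene order.
    ICD = median(|r_orig - r_woCD|) - median(|r_orig - r_RD|).
    """
    diff_wo_cd = [abs(o - w) for o, w in zip(r_original, r_without_common)]
    diff_rd = [abs(o - r) for o, r in zip(r_original, r_random_down)]
    return median(diff_wo_cd) - median(diff_rd)


# Toy coefficients for three genes.
r_orig = [0.9, 0.5, 0.1]
r_wo_cd = [0.7, 0.4, 0.1]   # after removing shared donors
r_rd = [0.85, 0.45, 0.1]    # after random downsampling of matching size
print(impact_of_common_donors(r_orig, r_wo_cd, r_rd))
```

      A positive value means that removing common donors changed the correlations more than random downsampling did (red in the heatmap); a negative value means it changed them less (blue).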

      We have added a supplementary figure (Supplementary Figure 4, a composition of Figures R4-R7, together with a scatterplot relating the values of heatmaps R4 and R7) that aims to provide guidance to users when interpreting specific tissue pairs, acknowledging inherent limitations (refer to comment 2.4). We have also inserted the following paragraph into the manuscript’s Discussion section:

      “Furthermore, we must emphasise that the majority of GTEx donors contributed samples to multiple tissues (Supplementary Figure 4A), potentially introducing biases and confounders when comparing gene expression patterns between tissues. Our analyses of variance (Supplementary Figure 4B) and downsampling to control for common donors (Supplementary Figures 4C-E) suggest very limited global confounding between the impacts of donor and age on gene expression, and that any potential cross-tissue bias does not depend much on the proportion of common donors (Supplementary Figure 4E). However, this effect must be taken into account when comparing specific pairs of tissues (e.g., Colon – Transverse and Whole Blood, Supplementary Figure 4D).”

      (2.6) The authors aimed to create an open-source and ever-evolving resource that could be adapted and improved with new functionality. However, this goal was only partially achieved. Although the code for the web app is open source, crucial components such as the statistical tests or the linear model are not included in the repository, limiting the tool's customizability and adaptability.

      We greatly appreciate the reviewer’s concern and share their commitment to maintaining the principles of openness, reproducibility, and adaptability for voyAGEr. voyAGEr was primarily designed as a visualisation tool, displaying pre-processed results, and indeed only the code for the Shiny app itself was accessible through the project's GitHub repository.

      To address this shortcoming, we have made the entire data preprocessing script publicly available in the GitHub repository of voyAGEr. This script encompasses, among others, filtration, normalisation, batch effect correction, the ShARP-LM pipeline and statistical tests employed, and module definition. Moreover, the web app itself offers functionality to export relevant plots and tables.

      (2.7) Furthermore, the authors' choice of visualization platform (R shiny) may not be the best fit for extensibility and open-source collaboration, as it lacks modularity. A more suitable alternative could be production-oriented platforms such as Flask or FastAPI.

      We appreciate this thoughtful concern. The decision to use Shiny was primarily driven by our data having already been prepared in the R environment during pre-processing steps. Consequently, and as the web app serves the purpose of visualisation only (and not data processing), Shiny is a natural and convenient extension of our scripts, enabling data visualisation seamlessly.

      We acknowledge that Shiny may lack the modularity required for optimal open-source collaboration. While we recognise the merits of alternative platforms like Flask or FastAPI, we decided to keep Shiny because the current iteration of voyAGEr offers significant value to the community. Transitioning to a different platform would be a time-consuming endeavour that would postpone the release of this resource.

      However, the reviewer’s feedback regarding modularity and open-source collaboration is duly noted and highly valuable. We will certainly take it into account when developing new web applications within our laboratory.

      (2.8) To facilitate collaboration and improve the tool's adaptability, data resulting from the preprocessing pipeline should be made publicly available. This would make it easier for others to contribute and extend the tool's functionality, ultimately enhancing its value for the scientific community.

      As outlined in point 2.6 of this rebuttal letter, certain metadata used in our analysis are subject to restricted access. To address this, we have taken several measures to foster transparency and reproducibility of our analyses. First, we have made the scripts for data pre-processing publicly available, along with a comprehensive explanation of our methodology within the main manuscript. This empowers users to replicate our analyses and provides a foundation for those interested in contributing to the tool's development. Furthermore, we have created new issues on voyAGEr’s GitHub repository, outlining novel features and improvements we envision for the application in the future. We actively encourage users to engage with this section.

      (2.9) It is unfortunate that the manuscript has no line numbers, which makes pointing out language issues or typos cumbersome. Below are some minor typos present in the current version mostly due to inconsistent usage of British vs US English, and the authors would be advised to do a thorough proofreading for the final submission.

      • Page 12: Inconsistent spelling of "analyzed" and "analysed". Should be "analyzed", since US English is used throughout the rest of the paper.

      • Page 14: "randomised"

      • Page 15: "emphasise"

      We apologise for this omission and have included line numbers in the revised version. We have opted for British English and corrected the manuscript accordingly.

      (2.10) Some figures in the supplemental material have a low resolution (e.g. S. Fig 5). Especially figures that are not based on screenshots would ideally be of a higher resolution.

      As voyAGEr is designed as a web application for visualisation, it is inherent that some screenshots of the final resource may have lower resolutions. In response to this concern, we re-generated the figures in this manuscript with a resolution that maintains clarity and readability. We also recreated figures not derived from screenshots, further improving their resolution.

      We saved all figures in PDF format and are sending them together with this letter and the revised manuscript, to address any potential issues related to low-resolution figures that may occur during the export of the Word document.

      (2.11) In Fig. 1 in the bottom row the sex labels are hard to see.

      We have adapted the figure to address this concern.

      (2.12) Math symbols and equations are not well formatted. For example, the GE equation on p. 13, or Oiij equation should be properly typeset. Also, the Oiij notation might be confusing, I believe the authors meant to use a capital "I", i.e. OI_ij.

      We have incorporated these recommendations into the revised manuscript.

      (2.13) The Readme file in the git repo is very short. It would be helpful to have build and run instructions.

      We have updated the README file in the GitHub repository, which now contains, among other features, instructions for launching the Shiny app and building the associated Docker image. Additionally, a simple tutorial has also been included to assist users in navigating through voyAGEr's functionalities.

      (2.14) "Module" tab's UI is inconsistent with other tabs (i.e. "Gene" and "Tissue"), since it contains an "About" page. Adding the "About" page in the actual "Module" page might make the UI clearer.

      We believed that the Modules section, due to its distinct methodology, would benefit from an additional tab explaining its underlying rationale. We relate to the reviewer’s concern regarding the use of tabs throughout the application and made changes to the app in order to ensure consistency.

      (2.15) I would suggest changing the type of the article to "Tools and Resources".

      We agree and followed the reviewer’s suggestion.

      Reviewer 3

      (3.1) In the gene-centric analyses section of the results, to improve this manuscript and database, linear regression tests accounting for the entire range of age should be added. The authors' algorithm, ShARP-LM, tests locally within a 16-year window, which gives it lower power than a linear regression test over the whole age range. I suspect that the power reduction is strongest in the younger age range, since GTEx donors are enriched in old age. By adding the results from the lm tests, readers would gain more insight and evidence into how significantly their genes of interest change with age.

      We are grateful for the reviewer's thoughtful and pertinent recommendation and have thus conducted linear regression tests covering the entire age range. The outcomes of these tests have been integrated into the web application, denoted by a dotted orange line on the 'Gene Expression Alterations Over Age' plots. Additionally, a summary of statistics of overall changes, encompassing p-values, t-statistics, and logFC per year, has been included below the plot title. We have also updated the manuscript to include such changes (v. Methods, Gene-centric visualisation of tissue-specific expression changes across age):

      “We also applied a linear model across the entire age range, thereby providing users with more insight and supporting evidence into how a specific gene changes with age. For visualisation purposes, we incorporated a dashed orange line, with the logFC per year for the Age effect as slope, in the respective scatter plots (Figure 3B c). We depict the Sex effect therein by prominent dots on the average samples, with pink and blue denoting females and males, respectively.”
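
      As a rough illustration of the "logFC per year" slope, the sketch below fits an ordinary least-squares line of log2 expression on age. It is a simplification under stated assumptions: age is the only predictor, whereas the model described here also includes a Sex effect, and the function name and toy data are hypothetical.

```python
def ols_slope_intercept(ages, expression):
    """Least-squares slope and intercept of expression (log2 scale) regressed on age."""
    n = len(ages)
    mean_x = sum(ages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in ages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, expression))
    slope = sxy / sxx  # change in log2 expression per year of age
    return slope, mean_y - slope * mean_x


# Toy data: log2 expression of one gene in four donors.
ages = [20, 40, 60, 80]
log2_expr = [5.0, 5.4, 5.8, 6.2]
slope, intercept = ols_slope_intercept(ages, log2_expr)
print(slope, intercept)
```

      Because expression is on a log2 scale, the slope reads directly as a log fold-change per year, which is the quantity drawn as the line in the scatter plot.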

      Concerning the observation about the potential reduction in statistical power due to the limited number of samples in younger ages, we acknowledge its validity. Indeed, we have addressed this issue in the manuscript's Discussion (v. Supplementary Figure 6).

      (3.2) In line with the ShARP-LM test results, it is not clear which criterion was used to define the significant genes and the following enrichment analyses. I assume that the criterion is P < 0.05, but it should be clearly noted. Additionally, the authors should apply adjusted p-values for multiple-test correction. The ideal criterion is an adjusted P < 0.05. However, if none or only a handful of genes were found to be significant, the authors could relax the criteria, such as using a regular P < 0.01 or 0.05.

      We apologise for any confusion regarding the terminology "significant genes." Our choice to use non-adjusted p-values for determining the significance of gene expression changes with Age, Sex, and their interaction was deliberate, and we would like to clarify our reasoning:

      (1) In the "Gene" tab of the application, individual genes are examined. When users inquire about a specific gene, multiple-testing correction of the p-value does not apply.

      (2) In the "Tissue" tab, using adjusted p-values and a threshold of 0.05 yielded very few differentially expressed genes, limiting the utility of Peaks. Our objective therein is not to assess the significance of alterations in individual genes but to provide a metric for global alterations within a tissue. We then determine significance based on the False Discovery Rate (FDR), using the p-values as a nominal metric of gene expression alterations.

      To avoid using the concept of “differential expression”, commonly linked to significance, we now refer to 'altered genes' in both the manuscript and the app. For clarity and to align with voyAGEr's role as a hypothesis-generation tool, we define 'altered genes' as those with non-adjusted p-values < 0.01 or < 0.05, as discriminated in the Methods section.
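
      The difference between nominal and adjusted thresholds can be illustrated with a short sketch of the Benjamini-Hochberg procedure. The p-values below are invented for illustration, and the function is a generic implementation, not the code used in voyAGEr.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical per-gene p-values within one age window.
p = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(p)
altered_nominal = sum(x < 0.05 for x in p)      # genes passing the nominal cut-off
altered_adjusted = sum(x < 0.05 for x in adj)   # genes passing after correction
```

      In this toy case four genes pass the nominal threshold but only two survive correction, which mirrors why an adjusted threshold can leave too few genes to summarise tissue-wide alterations.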

      (3.3) In the gene-centric analyses section, authors should provide a full list of donor conditions and a summary table of conditions as supplementary.

      We appreciate the suggestion and have now included a reference that directs readers to those data, rather than including this information as an additional supplementary table. We would like to emphasise that the web app includes information on the donor conditions we hypothesise to affect gene expression.

      (3.4) The tissue-specific assessment section has poor sub-titles. Every title has to contain information.

      We agree and revised the sub-titles to more accurately reflect the information conveyed in each corresponding section.

      (3.5) I have an issue understanding the meaning of NES from GSEA in the tissue-specific assessment section. The authors performed GSEA for the DEGs against the background genes ordered by t-statistics (from positive to negative) calculated from the linear model. I understand the p-value was two-tailed, which means that both positive and negative NES are meaningful as they represent up-regulated expression direction (positive coefficient) and down-regulated expression direction (negative coefficient) with age, respectively, within a window. However, in the GSEA section of Methods, the authors did not fully elaborate on this directionality but stated, "The NES for each pathway was used in subsequent analyses as a metric of its over- or down-representation in the Peak". The authors should clearly elaborate on how to interpret the NES from their results.

      We added the following paragraph to the manuscript’s Methods section, in order to clarify the NES’ directionality:

      “We extracted the GSEA normalised enrichment score (NES), which represents the degree to which a certain gene set is overrepresented at the extreme ends of the ranked list of genes. A positive NES corresponds to the gene set’s overrepresentation amongst up-regulated genes within the age window, whereas a negative NES signifies its overrepresentation amongst down-regulated genes. The NES for each pathway was used in subsequent analyses as a metric of its up- or down-regulation in the Peak.”

      (3.6) In the Modules of co-expressed genes section, the authors did not explain how or why they selected the four tissues: brain, skeletal muscle, heart (left ventricle), and whole blood. This should be elaborated on.

      We apologise for not providing a detailed explanation for this selection. As the ‘Modules of co-expressed genes’ section was primarily intended as a proof of concept, we opted to include tissues for which both a substantial number of samples and comprehensive cell type signatures were available; these four tissues met both criteria. Nonetheless, as the diversity of cell type signatures increases (e.g., through the increasing availability of scRNA-seq datasets), we plan to encompass a wider range of tissues in the near future. However, as this task is time-demanding and in order to avoid a substantial delay in the release of voyAGEr, we opted to approach this issue in the next version of the App and included a dedicated issue in the project’s GitHub repository so that users can share their preferences for the next tissues to include.

      We also added a brief sentence in this regard to the Methods section of the manuscript:

      “The four tissues (Brain - Cortex, Muscle - Skeletal, Heart - Left Ventricle, and Whole Blood) covered by the Module section of voyAGEr were selected due to their relatively high sample sizes and availability of comprehensive cell type signatures. The increasing availability of human tissue scRNA-seq datasets (e.g., through the Human Cell Atlas) will allow future updates of voyAGEr to encompass a wider range of tissues.”

      (3.7) In the modules of the co-expressed genes section, the authors did not provide an explanation of the "diseases-manual" sub-tab of the "Pathway" tab of the voyAGEr tool. It would be helpful for readers to understand how the candidate disease list was prepared and what the results represent.

      We greatly appreciate the reviewer's feedback, and in response, we have restructured the 'Modules of co-expressed genes' method section to provide a more comprehensive explanation of the 'diseases' sub-section. To clarify, we obtained a curated set of diseases and their associated genes from DisGeNET v.7.0. We assessed the enrichment of modules in relation to these diseases through two methods: a manual approach utilising Fisher’s tests (i.e. comparing the genes of a given module with the genes associated with a given disease) and another through use of the disgenet2r package, employing the function disease_enrichment. Significance of these enrichments was determined by adjusting p-values using the Benjamini-Hochberg correction.
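
      As an illustration of the manual approach, a module-versus-disease comparison can be written as a one-sided Fisher's exact test, which is the upper tail of a hypergeometric distribution. This is a sketch under assumptions: the choice of a one-sided test and of the background gene set is ours, and the counts in the example are hypothetical.

```python
from math import comb

def enrichment_p_value(n_background, n_disease, n_module, n_overlap):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Probability of observing at least n_overlap disease genes in a module
    of n_module genes drawn from n_background genes, of which n_disease
    are associated with the disease.
    """
    total = comb(n_background, n_module)
    upper = min(n_disease, n_module)
    tail = sum(
        comb(n_disease, k) * comb(n_background - n_disease, n_module - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20 background genes, 5 disease genes,
# a module of 5 genes, 3 of which are disease-associated.
p_enrich = enrichment_p_value(20, 5, 5, 3)
```

      The resulting p-values, one per module and disease pair, would then be the input to the Benjamini-Hochberg correction mentioned above.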

      (3.8) Most figures have low resolutions, and their fonts are too small to read.

      As already mentioned in issue 2.10, we have recreated all of the images with better resolution to enhance legibility. We also exported such figures in PDF, which we attach to this revision.

      (3.9) The authors used GTEx V7, which is not the latest version. Although researchers have developed a huge number of pipelines and tools for their research, most of them have been neglected without a single update. I am sure many users, including myself, would appreciate it if the authors kept updating the database with GTEx V8 for the future version of the database.

      We express our gratitude to the reviewer for their valuable suggestion, and, as already explained in issue 2.1, we have incorporated GTEx V8 into voyAGEr.

      (3.10) I would like to have an option for downloading the results as a whole for gene, tissue, and co-expressed genes. This would be a great option for secondary analysis by users.

      The implementation of such a feature would be a time-demanding endeavour that would delay the release of voyAGEr, and we therefore chose not to implement it for this version. However, we agree that it would be a good resource for secondary analyses and acknowledge the possibility of adding this feature in the future. For now, voyAGEr allows the user to download all plots and corresponding data.

      (3.11) How was the order of tissues in the heatmaps (in both the gene and tissue sections) determined? Did the authors apply hierarchical clustering? If not, I would recommend that the authors perform hierarchical clustering and add it to the heatmap display.

      We apologise for the oversight in explaining the process behind determining the order of tissues. To clarify, we employed hierarchical clustering to establish the tissue order for visualisation within the app. Although the reviewer suggested adding a dendrogram to illustrate this clustering, we decided against it because a dendrogram, while informative, is not essential for the app's primary purpose.
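
      For readers unfamiliar with how a clustering yields a display order, the sketch below derives a leaf order from average-linkage agglomerative clustering with Euclidean distances. The linkage method, the distance metric, and the tissue profiles are assumptions made for illustration; the response does not state which settings the app uses.

```python
def cluster_order(profiles):
    """Leaf order from average-linkage agglomerative clustering.

    profiles maps a label to a numeric vector; distance is Euclidean.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(profiles[a], profiles[b])) ** 0.5

    def linkage(c1, c2):
        # Average pairwise distance between members of two clusters.
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [[label] for label in profiles]
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]

# Hypothetical two-dimensional summaries for four tissues.
profiles = {
    "tissue_a": [0.0, 0.1],
    "tissue_b": [5.0, 5.2],
    "tissue_c": [0.2, 0.0],
    "tissue_d": [5.1, 5.0],
}
order = cluster_order(profiles)
```

      Tissues with similar profiles end up adjacent in the returned list, which is all a heatmap needs even when the dendrogram itself is not drawn.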

      (3.12) I understand that this is a vast amount of work, but I hope that the authors can expand the coexpressed module analysis to include other tissues in the future version of the database.

      Knowing which genes are co-expressed in line with aging, along with their pathway and disease enrichments across tissues, would be highly informative, and I'm sure many users, including myself, would greatly appreciate it.

      We express our gratitude to the reviewer for the valuable suggestion and for acknowledging the extensive effort required to incorporate new tissues into the module section. We completely agree that understanding co-expressed genes across the aging process is of significant value, and we are committed to the ongoing inclusion of additional tissues. As already stated in issue 3.6, a comprehensive list of tissues slated for integration in future voyAGEr versions is readily available on voyAGEr’s GitHub repository.

      Author response image 1.

      Density plots (“smoothed” histograms) of the distribution of numbers of samples per moving age window for the ShARP-LM pipeline, categorised by tissue. The numerical value within each rectangle represents the minimum number of samples observed across all age windows for that particular tissue.

      Author response image 2.

      Density lines (“smoothed” histograms) of the distribution of the age of donors per tissue. As depicted in the chart, there are more samples for older ages, particularly of brain tissues.

      Author response image 3.

      Effect of downsampling in ShARP-LM results. A – Per tissue violin plots of gene-wide distributions of Pearson’s correlation coefficients between original and downsampled logFC values for the Age variable across age windows, with tissues coloured by and ordered by increasing percentage of downsampling-associated reduction in the number of samples. B – Density scatter plots of comparison of associated original and downsampled p-values for each tissue, coloured by the downsampling percentage in each age window, highlighting the low range of p-values (from 0 to 0.1). Despite changes in logFC with downsampling, a considerable correlation in significance is maintained, although downsampling naturally results in a loss of statistical power, evident by the shift of points towards the first quadrant (dashed lines: p-value = 0.05).

      Author response image 4.

      Heatmap depicting the percentage of common donors between pairs of tissues. A given square illustrates the percentage of all samples of the tissue on the x-axis (Tissue 1) that is in common with the tissue on the y-axis (Tissue 2).

      Author response image 5.

      Assessment of the relative contributions of different sources to the dataset’s variance. A - tissue accounts for approximately 90% of the total variance, while donor contributes around 10%; age has a minimal impact (1%), likely due to the relative subtlety of its effects on gene expression and to the tissue specificity of ageing dynamics. B - Removal of the donor variable does not transfer variance to age, suggesting limited confounding between the two variables.

      Author response image 6.

      Impact of the relative proportion of common donors on gene expression correlation between tissue pairs. Panels A, B, and C showcase the tissue pairs with the highest (Muscle Skeletal / Kidney Cortex), median (Pancreas / Heart Left Ventricle), and lowest (Small Intestine / Brain Amygdala) percentages of common donors, respectively. The left panels illustrate gene-by-gene Pearson’s correlations of gene expression between the two tissues, comparing the scenarios with (x-axis) and without (y-axis) the removal of common donors. The right panels depict the same comparisons, but with random downsampling (y-axis) in both tissues based on the proportions defined for common donor removal. The depicted examples show that the outcomes are comparable when removing common donors or employing random downsampling.

      Author response image 7.

      Comparison of the impacts of removing common donor samples and random downsampling across tissue pairs. The heatmap is coloured based on whether the removal of common donors has a greater (red) or lesser impact (blue) than random downsampling. The values depicted in the heatmap, denoted as the Impact of Common Donors (ICD), are computed for each tissue pair. This calculation involves several steps: first, by determining the absolute difference in Pearson’s correlation for each gene’s mean expression within each age window from the ShARP-LM pipeline, between the original data and the subset of data without common donors (DiffWoCD) or with random downsampling (DiffRD). Subsequently, the medians of DiffWoCD and DiffRD are computed, and the difference between these median values provides the ICD for each tissue pair. Due to the unidirectional nature of correlation (i.e., the results for tissue 1 vs tissue 2 mirror those for tissue 2 vs tissue 1), the resulting matrix is triangular in form. Grey tiles denote NA values, i.e., where the tissue-tissue comparison does not have a meaning, namely self-self and between sex-specific tissues. Top right insert: density line (“smoothed” histogram) of all ICD values.
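
      The ICD computation described in this caption can be sketched for a single tissue pair as follows. The function name, the input layout (one correlation value per gene), and the sign convention (DiffWoCD median minus DiffRD median) are assumptions; the caption only states that the ICD is the difference between the two medians.

```python
from statistics import median

def impact_of_common_donors(corr_original, corr_without_common, corr_random_down):
    """ICD for one tissue pair from per-gene Pearson correlations."""
    # Absolute change in correlation after removing common donors (DiffWoCD).
    diff_wo_cd = [abs(o - w) for o, w in zip(corr_original, corr_without_common)]
    # Absolute change in correlation after random downsampling (DiffRD).
    diff_rd = [abs(o - r) for o, r in zip(corr_original, corr_random_down)]
    # Assumed sign convention: positive when removing common donors
    # perturbs the correlations more than random downsampling does.
    return median(diff_wo_cd) - median(diff_rd)

# Hypothetical per-gene correlations for three genes.
icd = impact_of_common_donors(
    corr_original=[0.8, 0.5, 0.2],
    corr_without_common=[0.7, 0.5, 0.4],
    corr_random_down=[0.75, 0.45, 0.25],
)
```

      Under this convention a positive value would correspond to a red tile in the heatmap and a negative value to a blue one.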

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The authors present a modelling study to test the hypothesis that horizontal gene transfer (HGT) can modulate the outcome of interspecies competition in microbiomes, and in particular promote bistability in systems across scales. The premise is a model developed by the same authors in a previous paper where bistability happens because of a balance between growth rates and competition for a mutual resource pool (common carrying capacity). They show that introducing a transferrable element that gives a "growth rate bonus" expands the region of parameter space where bistability happens. The authors then investigate how often (in terms of parameter space) this bistability occurs across different scales of complexity, and finally under selection for the mobile element (framed as ABR selection).

      Strengths:

      The authors tackle an important, yet complex, question: how do different evolutionary processes impact the ecology of microbial ecosystems? They do a nice job at increasing the scales of heterogeneity and asking how these impact their main observable: bistability.

      We thank the reviewer for recognizing the potential value of our analysis. We are also grateful for the constructive comments and suggestions on further analyzing the influence of the model structure and the associated assumptions. We have fully addressed the raised issues in the updated manuscript and below.

      Weaknesses:

      The authors' starting point is their interaction LV model and the manuscript then explores how this model behaves under different scenarios. Because the structure of the model and the underlying assumptions essentially dictate these outcomes, I would expect to see much more focus on how these two aspects relate to the specific scenarios that are discussed. For example:

      A key assumption is that the mobile element conveys a multiplicative growth rate benefit (1+lambda). However, the competition between the species is modelled as a factor gamma that modulates the competition for overall resource and thus appears in the saturation term (1+ S1/Nm + gamma2*S2/Nm). This means that gamma changes the perceived abundance of the other species (if gamma > 1, then from the point of view of S1 it looks like there are more S2 than there really are). Most importantly, the relationship between these parameters dictates whether or not there will be bistability (as the authors state).

      This decoupling between the transferred benefit and the competition can have different consequences. One of them is that - from the point of view of the mobile element - the mobile element competes at different strengths within the same population compared to between. To what degree introducing such a mobile element modifies the baseline bistability expectation thus strongly depends on how it modifies gamma and lambda.

      Thus, this structural aspect needs to be much more carefully presented to help the reader follow how much of the results are just trivial given the model assumptions and which have more of an emergent flavour. From my point of view, this has an important impact on helping the reader understand how the model that the authors present can contribute to the understanding of the question "how microbes competing for a limited number of resources stably coexist". I do appreciate that this changes the focus of the manuscript from a presentation of simulation results to more of a discussion of mathematical modelling.

      We thank the reviewer for the insightful suggestions. We agree with the reviewer that the model structure and the underlying assumptions need to be carefully discussed, in order to understand the generality of the theoretical predictions. In particular, the reviewer emphasized that how HGT affects bistability might depend on how mobile genetic elements modified growth rates and competition. In the main text, we have shown that when mobile genes only influence species growth rates, HGT is expected to promote multistability (Fig. 1 and 2). However, when mobile genes modify species interactions, the effect of HGT on multistability is dependent on how mobile genes change competition strength (Fig. 3a to f). When mobile genes increase competition, HGT promotes multistability (Fig. 3c and e). In contrast, when mobile genes relax competition, HGT is expected to reduce multistability (Fig. 3d and f).

      In light of the reviewer’s comments, we have further generalized the model structure, by accounting for the scenario where mobile genes simultaneously modify growth rates and competition. The effect of mobile genes on growth rates is represented by the magnitude of 𝜆’s, and the influence on competition is described by another parameter 𝛿. By varying these two parameters, we can evaluate how the model structure and the underlying assumptions affect the baseline expectation. We performed additional simulations with broad ranges of 𝜆 and 𝛿 values. In particular, we analyzed whether HGT would promote the likelihood of bistability in two-species communities compared with the scenario without gene transfer (Fig. 3g-i). Our results suggested that: (1) With or without HGT, reducing 𝜆 (increasing neutrality) promotes bistability; (2) With HGT, increasing 𝛿 promotes bistability; (3) Compared with the population without HGT, gene transfer promotes bistability when 𝛿 is zero or positive, whereas it reduces bistability when 𝛿 is strongly negative. These results agree with the reviewer’s comment that the baseline bistability expectation depends on how HGT modifies gamma and lambda. In the updated manuscript, we have thoroughly discussed how the model structure and the underlying assumptions can influence the predictions (line 238-253).

      We further expanded our analysis, by calculating how other parameters, including competition strength, growth rate ranges, and death/dilution rate, would affect the multistability of communities undergoing horizontal gene transfer (Fig. S2, S3, S9, S10, S11, S12, S13, S15). Together with the results presented in the first draft, these analyses enable a more comprehensive understanding of how different mechanisms, including but not limited to HGT, collectively shape community multistability. In the updated manuscript, the reviewer can see the change of focus from exploring the effects of HGT to a more thorough discussion of the mathematical model. The revised text highlighted in blue and the supplemented figures reflect such a change.
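
      The baseline bistability discussed in this exchange can be illustrated with a deliberately simplified simulation. The sketch below is not the authors' model: it assumes a classical two-species competitive Lotka-Volterra form with a shared dilution term and no gene transfer, with abundances scaled by the carrying capacity, and all parameter values are hypothetical.

```python
def simulate(s1, s2, mu1=0.5, mu2=0.5, gamma=1.5, death=0.1, dt=0.01, steps=20000):
    """Forward-Euler integration of a two-species competitive LV model.

    Assumed form (abundances scaled by the carrying capacity):
    dS_i/dt = mu_i * S_i * (1 - S_i - gamma * S_j) - death * S_i
    """
    for _ in range(steps):
        ds1 = mu1 * s1 * (1 - s1 - gamma * s2) - death * s1
        ds2 = mu2 * s2 * (1 - s2 - gamma * s1) - death * s2
        s1, s2 = s1 + dt * ds1, s2 + dt * ds2
    return s1, s2

# Same parameters, two different starting compositions.
final_a = simulate(0.6, 0.4)  # species 1 starts ahead
final_b = simulate(0.4, 0.6)  # species 2 starts ahead
```

      With interspecies competition stronger than self-inhibition (gamma > 1 in this sketch), whichever species starts ahead excludes the other, so the final state depends on the initial composition. That dependence on initial conditions is the bistability whose parameter region the reviewers and authors are discussing.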

      Reviewer #2 (Public review):

      Summary:

      In this work, the authors use a theoretical model to study the potential impact of Horizontal Gene Transfer on the number of alternative stable states of microbial communities. For this, they use a modified version of the competitive Lotka-Volterra model (which accounts for the effects of pairwise, competitive interactions on species growth) that incorporates terms for the effects of both an added death (dilution) rate acting on all species and the rates of horizontal transfer of mobile genetic elements, which can in turn affect species growth rates. The authors analyze the impact of horizontal gene transfer in different scenarios: bistability between pairs of species, multistability in communities, and a modular structure in the interaction matrix to simulate multiple niches. They also incorporate additional elements to the model, such as spatial structure to simulate metacommunities and modification of pairwise interactions by mobile genetic elements. In almost all these cases, the authors report an increase in either the number of alternative stable states or the parameter region (e.g. growth rate values) in which they occur.

      In my opinion, understanding the role of horizontal gene transfer in community multistability is a very important subject. This manuscript is a useful approach to the subject, but I'm afraid that a thorough analysis of the role of different parameters under different scenarios is missing in order to support the general claims of the authors. The authors have extended their analysis to increase its biological relevance, but I believe that the analysis still lacks comprehensiveness.

      Understanding the origin of alternative stable states in microbial communities and how often they may occur is an important challenge in microbial ecology and evolution. Shifts between these alternative stable states can drive transitions between e.g. a healthy microbiome and dysbiosis. A better understanding of how horizontal gene transfer can drive multistability could help predict alternative stable states in microbial communities, as well as inspire novel treatments to steer communities towards the most desired (e.g. healthy) stable states.

      Strengths:

      (1) Generality of the model: the work is based on a phenomenological model that has been extensively used to predict the dynamics of ecological communities in many different scenarios.

      (2) The question of how horizontal gene transfer can drive alternative stable states in microbial communities is important and there are very few studies addressing it.

      We thank the reviewer for the positive comments on the potential novelty and conceptual importance of our work. We are also grateful for the constructive suggestions on the generality and comprehensiveness of our analysis. In particular, we agree with the reviewer that a thorough analysis of the role of different parameters could further improve the rigor of this work. We have fully addressed the raised issues in the updated manuscript and below.

      Weaknesses:

      (1) There is a need for a more comprehensive analysis of the relative importance of the different model parameters in driving multistability. For example, there is no analysis of the effects of the added death rate on multistability. This parameter has been shown to determine whether a given pair of interacting species exhibits bistability or not (see e.g. Abreu et al 2019 Nature Communications 10:2120). Similarly, each scenario is analyzed for a single value of interspecies interaction strength, with the exception of the case for mobile genetic elements affecting interaction strength, which considers three specific values. Considering heterogeneous interaction strengths (e.g. sampling from a random distribution) could also lead to more realistic scenarios; the authors generally considered that all species pairs interact with the same strength. Analyzing a larger range of growth rate effects of mobile genetic elements would also help generalize the results. In order to achieve a more generic assessment of the impact of horizontal gene transfer in driving multistability, its role should be systematically compared to the effects of the rest of the parameters of the model.

      We appreciate the suggestions. For each of the parameters that the reviewer mentioned, we have performed additional simulations to evaluate its importance in driving multistability. 

      For the added death rate, we have calculated the bistability feasibility of two-species populations under different values of 𝐷. Our results suggested that (1) varying the death rate indeed changed the bistability probability of the system; (2) when the death rate was zero, mobile genetic elements that only modify growth rates would have no effect on the system’s bistability. These results highlighted the importance of the added death rate in driving multistability (Fig. S2, line 136-142).
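
      One way to see why the death rate matters is a mutual-invasion check in a simplified setting. The sketch below assumes a classical two-species competitive Lotka-Volterra form, dS_i/dt = mu_i * S_i * (1 - S_i - gamma * S_j) - D * S_i, with no gene transfer; it is not the authors' model, and the function names and parameter values are hypothetical.

```python
def can_invade(mu_invader, mu_resident, gamma, death):
    """True if a rare invader grows while the resident sits at equilibrium."""
    # Resident-only equilibrium of the assumed model (needs mu_resident > death).
    resident_eq = 1 - death / mu_resident
    # Per-capita growth rate of the invader when it is vanishingly rare.
    return mu_invader * (1 - gamma * resident_eq) - death > 0

def is_bistable(mu1, mu2, gamma, death):
    """Bistable when neither species can invade the other."""
    return (not can_invade(mu2, mu1, gamma, death)
            and not can_invade(mu1, mu2, gamma, death))

# Unequal growth rates, identical competition strength.
without_death = is_bistable(1.0, 0.5, 1.2, 0.0)
with_death = is_bistable(1.0, 0.5, 1.2, 0.2)
```

      In this sketch, with a zero death rate the invasion condition reduces to gamma < 1 and the growth rates drop out, so changing growth rates alone cannot alter bistability. With a positive death rate the growth rates enter the condition, and here the faster grower can invade, removing the bistability.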

      For the interspecies interaction strength, we first extended our analysis on two-species populations. By calculating the bistability probability under different values of 𝛾, we showed that when interspecies interaction strength was smaller than 1, the influence of HGT on population bistability became weak (Fig. S3, line 143-147). We also considered heterogeneous interaction strengths in multispecies communities, by randomly sampling 𝛾ᵢⱼ values from uniform distributions. While our results suggested that the heterogeneous distribution of 𝛾ᵢⱼ didn’t fundamentally change the main conclusion, the mean value and variance of 𝛾ᵢⱼ affected the influence of HGT on multistability. The effects of HGT on community multistability became stronger when the mean value of 𝛾ᵢⱼ was larger than 1 and the variance of 𝛾ᵢⱼ was small (Fig. S12, line 190-196).
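
      Sampling heterogeneous interaction strengths, as described above, might look like the following sketch. The function name, the parameterisation by mean and width, the fixed seed, and the convention that self-interaction equals 1 are assumptions made for illustration.

```python
import random

def sample_interactions(n_species, mean, width, seed=0):
    """Interaction matrix with off-diagonal entries drawn uniformly.

    Entry [i][j] is the competitive effect of species j on species i.
    Self-interaction is fixed at 1 (an assumed normalisation).
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    low, high = mean - width / 2, mean + width / 2
    return [
        [1.0 if i == j else rng.uniform(low, high) for j in range(n_species)]
        for i in range(n_species)
    ]

# Hypothetical five-species community with mean strength 1.2 and width 0.4.
gamma_matrix = sample_interactions(5, mean=1.2, width=0.4)
```

      Varying the mean shifts the overall competition strength, while varying the width changes the variance, which are the two properties the response reports as modulating the effect of HGT.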

      We also analyzed different ranges of growth rate effects of mobile genetic elements. In particular, we sampled 𝜆ᵢⱼ values from uniform distributions with given widths. Greater width led to a larger range of growth rate effects. We used five-species populations as an example and tested different ranges. Our results suggested that multistability was more feasible when the growth rate effects of MGEs were small. The qualitative relationship between HGT and community multistability was not dependent on the range of growth rate effects (Fig. S13, line 197-205).

      (2) The authors previously developed this theoretical model to study the impact of horizontal gene transfer on species coexistence. In this sense, it seems that the authors are exploring a different (stronger interspecies competition) range of parameter values of the same model, which could potentially limit novelty and generality.

      We appreciate the comment. In a previous work (PMID: 38280843), we developed a theoretical model that incorporated the horizontal gene transfer process into the classic LV framework. This model provides opportunities to investigate the role of HGT in different open questions of microbial ecology. In the previous work, we considered one fundamental question: how competing microbes coexist stably. In this work, however, we focused on a different problem: how alternative stable states emerge in complex communities. While the basic theoretical tool that we applied in the two works was similar, the scientific questions, application contexts and the implications of our analysis were largely different. The novelty of this work arose from the fact that it revealed the conceptual linkage between alternative stable states and a ubiquitous biological process, horizontal gene transfer. This linkage was largely unexplored in previous studies. Exploring such a linkage naturally required us to consider stronger interspecies competition, which in general would diminish coexistence but give rise to multistability. We believe that the analysis performed in this work provides novel and valuable insights for the field of microbial ecology.

      With all the supplemented simulations that we carried out in light of all the reviewer’s comments, we believe the updated manuscript also provides a unified framework to understand how different biological processes collectively shape the multistability landscape of complex microbiota undergoing horizontal gene transfer. The comprehensive analyses performed and the diverse scenarios considered in this study also contribute to the novelty and generality of this work.

      (3) The authors analyze several scenarios that, in my opinion, naturally follow from the results and parameter value choices in the first sections, making their analysis not very informative. For example, after showing that horizontal gene transfer can increase multistability both between pairs of species and in a community context, the way they model different niches does not bring significantly new results. Given that the authors showed previously in the manuscript that horizontal gene transfer can impact multistability in a community in which all species interact with each other, one might expect that it will also impact multistability in a larger community made of (sub)communities that are independent of (not interacting with) each other, which is the proposed way for modelling niches. A similar argument can be made regarding the analysis of (spatially structured) metacommunities. It is known that, for small enough dispersal rates, space can promote regional diversity by enabling each local community to remain in a different stable state. Therefore, in conditions in which horizontal gene transfer drives multistability, it will also drive regional diversity in a metacommunity.

      Thanks. Based on the reviewer’s comments, we have moved Fig. 3 and 4 to Supplementary Information. In the updated manuscript, we have focused more on analyzing the roles of different parameters in shaping community multistability.

      (4) In some cases, the authors consider that mobile genetic elements can lead to ~50% growth rate differences. In the presence of an added death rate, this can be a relatively strong advantage that makes the fastest grower easily take over their competitors. It would be important to discuss biologically relevant examples in which such growth advantages driven by mobile genetic elements could be expected, and how common such scenarios might be.

      We appreciate the suggestion. Mobile genetic elements can drive large growth rate differences when they encode adaptive traits like antibiotic resistance (line 197-198).

      We also analyzed different ranges of growth rate effects of mobile genetic elements, by sampling 𝜆ᵢⱼ values from uniform distributions with given widths. Our results suggested that multistability was more feasible when the fitness effects of MGEs were small (Fig. S13b). The qualitative relationship between HGT and community multistability was not dependent on the range of growth rate effects (Fig. S13a and b). We discussed these results in line 197-205 of the updated main text.

      Reviewer #3 (Public review):

      Hong et al. used a model they previously developed to study the impact of horizontal gene transfer (HGT) on microbial multispecies communities. They investigated the effect of HGT on the existence of alternative stable states in a community. The model most closely resembles HGT through the conjugation of incompatible plasmids, where the transferred genes confer independent growth-related fitness effects. For this type of HGT, the authors find that increasing the rate of HGT leads to an increasing number of stable states. This effect of HGT persists when the model is extended to include multiple competitive niches (under a shared carrying capacity) or spatially distinct patches (that interact in a grid-like fashion). Instead, if the mobile gene is assumed to reduce between-species competition, increasing HGT leads to a smaller region of multistability and fewer stable states. Similarly, if the mobile gene is deleterious an increase in HGT reduces the parameter region that supports multistability.

      This is an interesting and important topic, and I welcome the authors' efforts to explore these topics with mathematical modeling. The manuscript is well written and the analyses seem appropriate and well-carried out. However, I believe the model is not as general as the authors imply and more discussion of the assumptions would be helpful (both to readers + to promote future theoretical work on this topic). Also, given the model, it is not clear that the conclusions hold quite so generally as the authors claim and for biologically relevant parameters. To address this, I would recommend adding sensitivity analyses to the manuscript.

      We thank the reviewer for agreeing that our work addressed an important topic and was well conducted. We are also grateful for the suggestion on sensitivity analysis, which is very helpful to improve the rigor and generality of our conclusion. All the raised issues have been fully addressed in the updated manuscript and below.

      Specific points

      (1) The model makes strong assumptions about the biology of HGT, that are not adequately spelled out in the main text or methods, and will not generally prove true in all biological systems. These include:

      a) The process of HGT can be described by mass action kinetics. This is a common assumption for plasmid conjugation, but for phage transduction and natural transformation, people use other models (e.g. with free phage that adsorb to all populations and transfer in bursts).

      b) A subpopulation will not acquire more than one mobile gene, subpopulations can not transfer multiple genes at a time, and populations do not lose their own mobilizable genes. [this may introduce bias, see below].

      c) The species internal inhibition is independent of the acquired MGE (i.e. for p1 the self-inhibition is by s1).

      These points are in addition to the assumptions explored in the supplementary materials, regarding epistasis, the independence of interspecies competition from the mobile genes, etc. I would appreciate it if the authors could be more explicit in the main text about the range of applicability of their model, and in the methods about the assumptions that are made.

      We are grateful for the reviewer’s suggestions. In the main text and methods of the updated manuscript, we have made clear the assumptions underlying our analysis. For point (a), we have clarified that our model primarily focused on plasmid transfer dynamics (line 74, 101, 517). Therefore, the process of HGT can be described by mass action kinetics, which is commonly assumed for plasmid transfer (line 537-538). For point (b), our model allows a cell to acquire more than one mobile gene. Please see our response to point (3) for details. We have also made it clear that we assumed the populations would not lose their own mobile gene completely (line 526-527). For (c), we have also clarified it in the updated manuscript (line 111-112, 527-528).

      We have also performed a series of additional simulations to show the range of applicability of our model. In particular, we discuss the role of other mechanisms, including interspecies interaction strength, the growth rate effects of MGEs, MGE epistasis and microbial death rates in shaping the multistability of microbial communities undergoing HGT. These results were provided in Fig. S2, S3, S9, S10, S11, S12, S13 and S15.

      (2) I am not surprised that a mechanism that creates diversity will lead to more alternative stable states. Specifically, the null model for the absence of HGT is to set gamma to zero, resulting in pij=0 for all subpopulations (line 454). This means that a model with N^2 classes is effectively reduced to N classes. It seems intuitive that an LV-model with many more species would also allow for more alternative stable states. For a fair comparison, one would really want to initialize these subpopulations in the model (with the same growth rates - e.g. mu1(1+lambda2)) but without gene mobility.

      We appreciate the insightful comments. The reviewer was right that in our model HGT created additional subpopulations in the community. However, with or without HGT, we calculated the species diversity and multistability based on the abundances of the 𝑁 species (s<sub>i</sub> in our model), instead of all the p<sub>ij</sub> subpopulations. Therefore, although there exist more ‘classes’ in the model with HGT, the number of ‘classes’ considered when we calculated community diversity and multistability was equal. In light of the reviewer’s suggestion, we have also performed additional simulations, where we initialized the subpopulations in the model with nonzero abundances. Our results suggested that initializing the p<sub>ij</sub> subpopulations with non-zero abundances didn’t change the main conclusion (Fig. S11, line 188-189).

      (3) I am worried that the absence of double gene acquisitions from the model may unintentionally promote bistability. This assumption is equivalent to an implicit assumption of incompatibility between the genes transferred from different species. A highly abundant species with high HGT rates could fill up the "MGE niche" in a species before any other species have reached appreciable size. This would lead to greater importance of initial conditions and could thus lead to increased multistability.

      This concern also feels reminiscent of the "coexistence for free" literature (first described here http://dx.doi.org/10.1016/j.epidem.2008.07.001 ) which was recently discussed in the context of plasmid conjugation models in the supplementary material (section 3) of https://doi.org/10.1098/rstb.2020.0478 .

      We appreciate the comments. Our model didn’t assume the incompatibility between MGEs transferred from different species. Instead, it allows a cell to acquire more than one MGE. In our model, p<sub>ij</sub> described the subpopulation in the 𝑖-th species that acquired the MGE from the 𝑗-th species. Here, p<sub>ij</sub> can have overlaps with p<sub>ik</sub> (𝑗 ≠ 𝑘). In other words, a cell can belong to p<sub>ij</sub> and p<sub>ik</sub> at the same time. The p<sub>ij</sub> subpopulation is allowed to carry the MGEs from the other species. In the model, we used to describe the influence of the other MGEs on the growth of p<sub>ij</sub>.

      We also thank the reviewer for bringing two papers to our attention. We have cited and discussed these papers in the updated manuscript (line 355-362).

      (4) The parameter values tested seem to focus on very large effects, which are unlikely to occur commonly in nature. If I understand the parameters in Figure 1b correctly for instance, lambda2 leads to a 60% increase in growth rate. Such huge effects of mobile genes (here also assumed independent from genetic background) seem unlikely except for rare cases. To make this figure easier to interpret and relate to real-world systems, it could be worthwhile to plot the axes in terms of the assumed cost/benefit of the mobile genes of each species.

      Thanks for the comments. In the main text, we presented simulation results that assumed relatively large effects of MGEs on species fitness, as the reviewer pointed out. In the updated manuscript, we have supplemented numerical simulations that considered different ranges of fitness effects, including fitness effects as small as 10% (Fig. S13a). We have also plotted the relationship between community multistability and the assumed fitness effects of MGEs, as the reviewer suggested (Fig. S13b). Our results suggested that multistability was more feasible when the fitness effects of MGEs were small, and changing the range of MGE fitness effects didn’t fundamentally change our main conclusion. These results were discussed in line 197-205 of the updated main text.

      Something similar holds for the HGT rate (eta): given that the population of E. coli or Klebsiella in the gut is probably closer to 10^9 than 10^12 (they make up only a fraction of all cells in the gut), the assumed rates for eta are definitely at the high end of measured plasmid transfer rates (e.g. F plasmid transfers at a rate of 10^-9 mL/CFU h-1, but it is derepressed and considered among the fastest - https://doi.org/10.1016/j.plasmid.2020.102489 ). To adequately assess the impact of the HGT rate on microbial community stability it would need to be scanned on a log (rather than a linear) scale. Considering the meta-analysis by Sheppard et al. it would make sense to scan it from 10^-7 to 1 for a community with a carrying capacity around 10^9.

      We thank the reviewer for the constructive suggestion. We have carried out additional simulations by scanning the 𝜂 value from 10<sup>-7</sup> to 1. The results suggested that increasing HGT rates started to promote multistability when the 𝜂 value exceeded 10<sup>-2</sup> per hour (Fig. S9, line 337-346). This corresponds to a conjugation efficiency of 10<sup>-11</sup> cell<sup>-1</sup> ∙ hr<sup>-1</sup> ∙ mL when the maximum carrying capacity equals 10<sup>9</sup> cells ∙ mL<sup>-1</sup>, or a conjugation efficiency of 10<sup>-14</sup> cell<sup>-1</sup> ∙ hr<sup>-1</sup> ∙ mL when the maximum carrying capacity equals 10<sup>12</sup> cells ∙ mL<sup>-1</sup>.
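
      The unit conversion in this response can be checked with a short sketch. It assumes, as the numbers quoted above imply, that the normalized rate 𝜂 (per hour) equals the conjugation efficiency multiplied by the maximum carrying capacity. The function names and the one-point-per-decade grid are illustrative choices and are not taken from the manuscript.

```python
def log_scan(min_exp=-7, max_exp=0):
    """Return HGT rates (hr^-1) spaced one decade apart, 10^min_exp .. 10^max_exp."""
    return [10.0 ** e for e in range(min_exp, max_exp + 1)]


def conjugation_efficiency(eta_per_hr, capacity_cells_per_ml):
    """Convert a normalized HGT rate (hr^-1) into mL cell^-1 hr^-1.

    Assumption: eta = efficiency * maximum carrying capacity.
    """
    return eta_per_hr / capacity_cells_per_ml


# Log-scale grid of HGT rates, as suggested by the reviewer.
etas = log_scan()

# Threshold reported in the response, converted for two carrying capacities.
threshold = 1e-2
for capacity in (1e9, 1e12):
    print(capacity, conjugation_efficiency(threshold, capacity))
```

      Scanning on a logarithmic grid gives equal weight to each order of magnitude, which is why the threshold near 10<sup>-2</sup> per hour is visible in such a scan and could be missed on a linear one.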

      (5) It is not clear how sensitive the results (e.g. Figure 2a on the effect of HGT) are to the assumption of the fitness effect distribution of the mobile genes. This is related to the previous point that these fitness effects seem quite large. I think some sensitivity analysis of the results to the other parameters of the simulation (also the assumed interspecies competition varies from figure to figure) would be helpful to put the results into perspective and relate them to real biological systems.

      We appreciate the comments. In light of the reviewer’s suggestion, we have changed the range of the fitness effects and analyzed the sensitivity of our predictions to this range. As shown in Fig. S13, changing the range of MGE fitness effects didn’t alter the qualitative interplay between HGT and community multistability. We have also examined the sensitivity of the results to the strength of interspecies competition (Fig. S3, S10, S12). These results suggested that while the strength of interspecies interactions played an important role in shaping community multistability, the relationship between HGT rate and multistability was not fundamentally changed by varying interaction strength. In addition, we examined the role of death rates (Fig. S2). In the updated manuscript, we discussed the sensitivity of our prediction to these parameters in line 136-147, 190-205, 335-354.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Please find below a few suggestions that, in my opinion, could help improve the manuscript.

      TITLE

      It might not be clear what 'gene exchange communities' are. Perhaps it could be rewritten for more specificity (e.g. '...communities undergoing horizontal gene transfer').

      We have updated the title as the reviewer suggested.

      ABSTRACT

      The abstract could also be edited to improve clarity and specificity. Terms like 'complicating factors' are vague, and enumerating specific factors would be better. The results are largely based on simulations, no analytical results are plotted, so I find that the sentence starting with 'Combining theoretical derivation and numerical simulations' can be a bit misleading.

      We appreciate the suggestions. We have enumerated the specific factors and scenarios in the updated abstract (line 18-26). We have also replaced 'Combining theoretical derivation and numerical simulations' with ‘Combining mathematical modeling and numerical simulations’.

      INTRODUCTION

      -  Line 42, please revise this paragraph. The logical flow is not so clear, it seems a bit like a list of facts, but the main message might not be clear enough. Also, it would be good to define 'hidden' states or just rewrite this sentence.

      We appreciate the suggestion. In the updated manuscript, we have rewritten this paragraph to improve the logical flow and clarity (line 46-52).

      -  Line 54, there is little detail about both theoretical models and HGT in this paragraph, and mixing the two makes the paragraph less focused. I suggest to divide into two paragraphs and expand its content. For example, you could explain a bit some relevant implications of MGE.

      We appreciate the suggestion. In the updated manuscript, we have divided this paragraph into two paragraphs, focusing on theoretical models and HGT, respectively (line 55-71). In particular, we have added explanations on the implications of MGEs (line 66-69), as the reviewer suggested.

      -  Line 72, as mentioned in the abstract, it would be better to explicitly mention which confounding factors are going to be discussed.

      Thanks for the suggestion. We have rewritten this part as “We further extended our analysis to scenarios where HGT changed interspecies interactions, where microbial communities were subjected to strong environmental selections and where microbes lived in metacommunities consisting of multiple local habitats. We also analyzed the role of different mechanisms, including interspecies interaction strength, the growth rate effects of MGEs, MGE epistasis and microbial death rates in shaping the multistability of microbial communities. These results created a comprehensive framework to understand how different dynamic processes, including but not limited to HGT rates, collectively shaped community multistability and diversity” (line 75-82).

      RESULTS

      -  The basic concepts (line 77) should be explained with more detail, keeping the non-familiar reader in mind. The reader might not be familiar with the concept of bistability in terms of species abundance. Also, note that mutual inhibition does not necessarily lead to positive feedback, as an interaction strength between 0 and 1 might still be considered inhibition. In any case, in Figure 1 it is not obvious how the positive feedback is represented, the caption should explain it. Note that neither the main text nor the caption explains the metaphor of the landscape and the marble that you are using in Figure 1a.

      We have rewritten this paragraph to provide more details on the basic concepts (line 86-99). We have removed the statement about ‘mutual inhibition’ to avoid being misleading. We have also updated the caption of Fig. 1a to explain the metaphor of the landscape and the marble (line 389-396).

      -  In the classical LV model, bistability does not depend on growth rates, but only on interaction strength. Therefore, I think that much of the results are significantly influenced by the added death rate. I believe that if the death rate is set to zero, mobile genetic elements that only modify growth rates will have no effect on the system's bistability. Because of this, I think that a thorough analysis of the role of the added death (dilution) rate and the distribution of growth rates is especially needed.

      We are grateful for the reviewer’s insightful comments. In the updated manuscript, we have thoroughly analyzed the role of the added death (dilution) rate on the bistability of communities composed of two species (Fig. S2). Indeed, as the reviewer pointed out, if the death rate equals zero, mobile genetic elements that only modify growth rates will have no effect on the system's bistability. We have discussed the role of death rate in line 136-142 of the updated manuscript.
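
      The dependence on the death rate can be illustrated with a minimal invasion analysis. The sketch below assumes a generic two-species competitive Lotka-Volterra model with a shared death (dilution) rate D, dx<sub>i</sub>/dt = 𝜇<sub>i</sub>x<sub>i</sub>(1 − x<sub>i</sub> − 𝛾x<sub>j</sub>) − Dx<sub>i</sub>. This is a simplification for illustration and not necessarily the exact form used in the manuscript; the function name and parameter values are hypothetical.

```python
def is_bistable(mu1, mu2, gamma, death_rate):
    """Return True if neither species can invade the other's monoculture.

    Assumed model (illustrative only):
        dx_i/dt = mu_i * x_i * (1 - x_i - gamma * x_j) - death_rate * x_i
    """
    # Monoculture equilibrium abundance of each species.
    k1 = 1.0 - death_rate / mu1
    k2 = 1.0 - death_rate / mu2
    if k1 <= 0 or k2 <= 0:
        return False  # at least one species cannot persist on its own
    # Invader j fails against resident i when gamma * k_i > k_j.
    return gamma * k1 > k2 and gamma * k2 > k1


# Without death, the growth rates drop out and only gamma matters.
print(is_bistable(0.5, 0.8, 1.1, 0.0))
# With death, the same gamma no longer guarantees bistability.
print(is_bistable(0.5, 0.8, 1.1, 0.2))
```

      Under this assumed model, setting the death rate to zero makes both monoculture abundances equal to 1, so the bistability condition reduces to 𝛾 > 1 regardless of growth rates, in line with the reviewer's point.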

      We have also expanded our analysis on the distribution of growth rates. In particular, we considered different ranges of growth rate effects of mobile genetic elements, by sampling 𝜆<sub>ij</sub> values from uniform distributions with given widths (Fig. S13). Greater width led to a larger range of growth rate effects. We used five-species populations as an example and tested different ranges.

      Our results suggested that multistability was more feasible when the growth rate effects of MGEs were small (Fig. S13b). The qualitative relationship between HGT and community multistability was not dependent on the range of growth rate effects (Fig. S13a). These results are discussed in line 197-205 of the updated manuscript.

      -  The analysis uses gamma values that, in the absence of an added death rate, render a species pair bistable. Therefore, multistability would be quite expected for a 5 species community. Note that, multistability is possible in communities of more than 2 species even if all gamma values are smaller than 1. Analyzing a wide range of interaction strength distributions would really inform on the relative role of HGT in multistability across different community scenarios.

      We are grateful for the reviewer’s suggestion. In light of the reviewer’s comments, in the updated manuscript, we have performed additional analysis by focusing on a broader range of interaction strengths (Fig. S3, S10, S12), especially the gamma values below 1 (Fig. S10). Our results agreed with the reviewer’s notion that multistability was possible in communities of more than 2 species even if all gamma values were smaller than 1 (Fig. S10). 

      -  I would recommend the authors extend the analysis of the model used for Figures 1 and 2. Figures 3 and 4 could be moved to the supplement (see my point in the public review), unless the authors extend the analysis to explain some non-intuitive outcomes for niches and metacommunities.

      Thanks. In the updated manuscript we have performed additional simulations to extend the analysis in Figure 1 and 2. These results were presented in Fig. S2, S3, S9, S10, S11, S12, and S13. We have also moved Figure 3 and 4 to SI as the reviewer suggested.

      -  The authors seem to refer to fitness and growth rates as the same thing. This could lead to confusion - the strongest competitor in a species pair could also be interpreted as the fittest species despite being the slowest grower. I think there's no need to use fitness if they refer to growth rates. In any case, they should define fitness if they want to use this concept in the text.

      We are grateful for the insightful suggestion. To avoid confusion, we have used ‘growth rate’ throughout the updated manuscript.

      -  Across the text, the language needs some revision for clarity, specificity, and scientific style. In lines 105 - 109 there are some examples, like the use of 'in a lot of systems', and ' interspecies competitions' (I believe they mean interspecies interaction strengths).

      We appreciate the reviewer for pointing them out. We have thoroughly checked the text and made the revisions whenever applicable to improve the clarity and specificity.

      -  Many plots present the HGT rate on the horizontal axis. Could the authors explain why is it that the rate of HGT is relatively important for the number of alternative stable states? I understand how from zero to a small positive number there is a qualitative change. Beyond that, it shouldn't affect bistability too much, I think. If I am right, then other parameters could be more informative to plot in the horizontal axis. If I am wrong, I think that providing an explanation for this would be valuable.

      Thanks. To address the reviewer’s comment, we have systematically analyzed the effects of HGT on community multistability, by scanning the HGT rate from 10<sup>-7</sup> to 10<sup>0</sup> hr<sup>-1</sup>. In communities of two or multiple species, our simulation results showed that multistability gradually increased with HGT rate when HGT rate exceeded 10<sup>-2</sup> hr<sup>-1</sup>. These results, presented in Fig. S9 and discussed in line 337-346, provided a more quantitative relationship between multistability and HGT rate.

      While in this work we showed the potential role of HGT in modulating community multistability, our results didn’t exclude the role of the other parameters. Motivated by the comments raised by the reviewers, in the updated manuscript, we have performed additional simulations to analyze the influence of other parameters in shaping community multistability. These parameters include death or dilution rate (Fig. S2), interaction strength (Fig. S3, S9, S10, S11, S12, S14, S15), 𝜆 range (Fig. S13, S15) and 𝛿 value (Fig. 3g, h, i). In many of the supplemented results (Fig. S2b, S3b, S13b, Fig. 3g, 3h and 3i), we have also plotted the data by using these parameters as the x axis. We believe the updated work now provided a more comprehensive framework to understand how different mechanisms, including but not limited to HGT, might shape the multistability of complex microbiota. These points were discussed in line 136-147, 190-205, 238-253, 334-354 of the updated main text. 

      -  My overall thoughts on the case of antibiotic exposure are similar to those of previous sections. Very few of the different parameters of the model are analyzed and discussed. In this case, the authors increased the interaction strength to ~0.4 times higher compared to previous sections. Was this necessary, and why?

      Thanks for the comments. In the previous draft, the interaction strength 𝛾=1.5 was tested as an example. Motivated by the reviewer’s comments, in the updated manuscript, we have examined different interaction strengths, including the strength ( 𝛾 = 1.1 ) commonly tested in other scenarios. The prediction equally held for different 𝛾 values (Fig. S15). We have also analyzed different 𝜆 ranges (Fig. S15). These results, together with the analyses presented in the earlier version of the manuscript, suggested the potential role of HGT in promoting multistability for communities under strong selection. The supplemented results were presented in Fig. S15 and discussed in line 293-295 of the updated manuscript.

      -  Line 195, if a gene encodes for the production of a public good, why would its HGT reduce interaction strength? I can think of the opposite scenario: the gene is a public good, and without HGT there is only one species that can produce it. Let's imagine that the public good is an enzyme that deactivates an antibiotic that is present in the environment, and then the species that produces has a positive interaction with another species in a pairwise coculture. If HGT happens, the second species becomes a producer and does not need the other one to survive in the presence of antibiotics anymore. The interaction can then become more competitive, as e.g. competition for resources could become the dominant interaction.

      We are grateful for pointing it out. In the updated manuscript, we have removed this statement.

      DISCUSSION

      -  L 267 "by comparison with empirical estimates of plasmid conjugation rates from a previous study [42], the HGT rates in our analysis are biologically relevant in a variety of natural environments". The authors are using a normalized model and the relevance of other parameter values is not discussed. If the authors want to claim that they are using biologically relevant HGT, they should also discuss whether the rest of the parameter values are biologically relevant. I recommend relaxing this statement about HGT rates.

      We appreciate the suggestion. We agree with the reviewer that other parameters including the death/dilution rate, interactions strength and 𝜆 ranges are also important in shaping community multistability. We have performed additional analysis to show the effects of these parameters. In light of the reviewer’s suggestion, we have relaxed this statement and thoroughly discussed the context-dependent effect of HGT as well as the roles of different parameters (line 334-354).

      -  Last sentence: "Therefore, inhibiting the MGE spread using small molecules might offer new opportunities to reshape the stability landscape and narrow down the attraction domains of the disease states". It is not clear what procedure/technique the authors are suggesting. If they want to keep this statement, the authors should give more details on how small molecules can be/are used to inhibit MGE.

      We appreciate the comments. Previous studies have shown that some small molecules, like unsaturated fatty acids, can inhibit the conjugative transfer of plasmids. By binding the type IV secretion traffic ATPase TrwD, these compounds limit pilus biogenesis and DNA translocation. We have provided more details regarding this statement in the updated manuscript (line 376-379).

      METHODS

      -  Line 439, mu_i should be presented as the maximum 'per capita' growth rate.

      We have updated the definition of 𝜇<sub>i</sub> following the suggestion (line 529).

      -  Line 444, this explanation is hard to follow, please expand it to provide more details. You could provide an example, like explaining that all individuals from S1 have the MGE1 and therefore they have mu_1 = mu_01 ... After HGT, their fitness changes if they get the plasmid from S2, so a term lambda2 appears.

      Thanks. In the updated manuscript, we have expanded the explanation by providing an example as the reviewer suggested (line 534-537).

      -  The normalization assumes a common carrying capacity Nm (Eqs 1-4) and then it's normalized (Eqs. 5-8). It would be better to start from a more general scenario in which each species has a different carrying capacity and then proceed with the normalization.

      We appreciate the suggestion. In the updated manuscript, we have started our derivation from the scenario where each species has a different carrying capacity before proceeding with the normalization (section 1 of Methods, line 516-554). The same equations can be obtained after normalization.

      -  I think that the meaning of kappa (the plasmid loss rate) is not explained in the text.

      Thanks for pointing it out. We have explained the meaning of kappa in the updated text (line 108, 154, 539-541, 586-587, 607).

      SUPPLEMENT

      -  Figure S4, what are the different colors in panel b?

      In panel b of Fig. S4, the different colors represent the simulation results repeated with randomized growth rates. We have made it clear in the updated SI.

      Reviewer #3 (Recommendations for the authors):

      (1) Please extend your description of the model, so it is easier to understand for readers who have not read the first paper. Especially the choice to describe the model as species and subpopulations, as opposed to writing it as MGE-carrying and MGE-free populations of each species makes it quite complicated to understand which parameters influence each other.

      Thanks for the suggestion. We have extended the model description in the updated manuscript, which provides a more detailed introduction on model configurations and parameter definitions (line 86-99, 101-113, 151-159). We have also updated the Methods to extend the model description.

      (2) Please define gamma_ji in equation 13 and eta_jki in equation 14 (how to map the indices onto the assumed directionality of the interaction).

      We have defined these two parameters in the updated manuscript (line 584-586, 630-632).

      (3)  Line 511: please add at the beginning of this paragraph that you are assuming a grid-like arrangement of patches which will be captured by dispersal term H.

      We have updated this paragraph to make this assumption clear (line 636-637).

      (4)  Line 540: "used in our model" (missing a word).

      We have corrected it in the updated manuscript.

      (5)  Currently the analyses looking at the types of growth effects HGT brings (Figures 5-7) feel very "tacked on". These are not just "confounding factors", but rather scenarios that are much more biologically realistic than the assumption of independent effects. I would introduce them earlier in the text, as I think many readers may not trust your results until they know this was considered (+ how it changes the conclusions).

      We are grateful for the suggestion. We agree with the reviewer that these biologically realistic scenarios should be introduced earlier in the text. In the updated manuscript, we have moved these analyses forward, as sections 3, 4 and 5. We have also avoided the term “confounding factors”. Instead, in the updated manuscript, we have separated these analyses into different sections, and clearly described each scenario in the section title (line 217-218, 254, 275).

      (6)  In some places the manuscript refers to HGT, in others to MGE presence (e.g. caption of Figure 6). These are not generally the same thing, as HGT could also occur due to extracellular vesicles or natural transformation etc. Please standardize the nomenclature and make it clearer which type of processes the model describes.

      We appreciate the comment. The model in this work primarily focused on the process of plasmid transfer. We have made it clear throughout the main text. 

      (7)  In many figures the y-axis starts at a value other than 0. This is a bit misleading. In addition, I would recommend changing the title "Area of bistability region" to "Area of bistability" or perhaps even "Area of multistability" (since more than two species are considered).

      Thanks for the suggestion. We have updated all the relevant figures to make sure that their y-axes start at 0. We have also changed the title “Area of bistability region” to “Area of multistability”, whenever it is applicable.

      (8)  Figure 7: what are the assumed fitness effects of the mobile genes in the simulation? Which distribution were they drawn from? Please add this info to the figure caption here and elsewhere.

      In Figure 7, we explored an extreme scenario of the fitness effects of the mobile genes, where the population was subjected to strong environmental selection and only cells carrying the mobile gene could grow. Therefore, the carriage of the mobile gene changed the species growth rate from 0 to a positive value µ<sub>i</sub>. When calculating the number of stable states in the communities, we randomly drew the µ<sub>i</sub> values from a uniform distribution between 0.3 and 0.7 hr<sup>-1</sup>. We have added this information in the figure caption (line 505-508) and Methods (line 615-617) of the updated manuscript.
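
      The sampling step described here could be sketched as follows. Only the uniform range of 0.3 to 0.7 hr<sup>-1</sup> comes from the response; the function name, the seed argument and the choice of five species are assumptions made for illustration.

```python
import random


def draw_growth_rates(n_species, low=0.3, high=0.7, seed=None):
    """Draw one growth rate (hr^-1) per species from a uniform distribution."""
    rng = random.Random(seed)  # local generator keeps draws reproducible
    return [rng.uniform(low, high) for _ in range(n_species)]


# Example: growth rates for a hypothetical five-species community.
rates = draw_growth_rates(5, seed=0)
print(rates)
```

      Passing a fixed seed makes a given community reproducible, while repeating the draw with different seeds would give the randomized replicates over which the number of stable states is counted.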

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The model of phosphotransfer from Y169 IKK to S32 IkBa is compelling and an important new contribution to the field. In fact, this model will not be without controversy, and publishing the work will catalyze follow-up studies for this kinase and others as well. As such, I am supportive of this paper, though I do also suggest some shortening and modification.

      We appreciate the reviewer's candid response on the difficulty of this study and the requirement of follow-up studies to confirm a direct transfer of the phosphate. We have also edited the manuscript to make it shorter.

      Generally, the paper is well written, but several figures should be quantified, and experimental reproducibility is not always clear. The first 4 figures are slow-going and could be condensed to show the key points, so that the reader gets to Figures 6 and 7 which contain the "meat" of the paper.

      We have indicated the experimental reproducibility in the methodology section against each assay. We have shortened the manuscript corresponding to sections describing figures 1-4. However, when we talked to some of our colleagues whose expertise does not align with kinases and IKK, we realized that some descriptions were necessary to introduce them to the next figures. Additionally, we added Fig. S6 indicating that the radiolabelled phospho-IKK2 Y169F is unable to transfer its own phosphate group(s) to the substrate IkBa.

      Reviewer #2 (Public Review):

      Phosphorylation of IκBα is observed after ATP removal, although there are ambiguous requirements for ADP.

      We agree with the reviewer that this observation is puzzling. We hypothesize that ADP is simultaneously regulating the transfer process likely through binding to the active site.

      It seems that the analysis hinges on the fidelity of pan-specific phosphotyrosine antibodies.

      We agree with the reviewer. To bolster our conclusion, we used antibodies from two different sources: monoclonal mouse anti-phospho-tyrosine antibodies from BD Biosciences (catalogue no. 610000) and from EMD Millipore (catalogue no. 05-321X).

      The analysis often returns to the notion that tyrosine phosphorylation(s) (and critical active site Lys44) dictate IKK2 substrate specificity, but evidence for this seems diffuse and indirect. This is an especially difficult claim to make with in vitro assays, omitting the context of other cellular specificity determinants (e.g., localization, scaffolding, phosphatases).

      We agree with the concern that the specificity could depend on other cellular specificity determinants, and we have toned down our claims where necessary. However, we would like to point out that the specificity of IKK2 towards S32 and S36 of IkBa in cells in response to specific stimuli is well established. It is also well established that its non-catalytic scaffolding partner NEMO is critical in selectively bringing IkBa to IKK from a large pool of proteins. The exact mechanism by which IKK2 chooses the two serines among many others in the substrate is not clear.

      Multiple phosphorylated tyrosines in IKK2 were apparently identified by mass spectrometric analyses, but the data and methods are not described. It is common to find non-physiological post-translational modifications in over-expressed proteins from recombinant sources. Are these IKK2 phosphotyrosines evident by MS in IKK2 immunoprecipitated from TNFa-stimulated cells? Identifying IKK2 phosphotyrosine sites from cells would be especially helpful in supporting the proposed model.

      Mass spectrometric data for identification of phosphotyrosines from purified IKK2 is now incorporated (Figure S3A). Although we have not analyzed IKK2 from TNF-a treated cells in this study, a different study of phospho-status of cellular IKK2 indicated tyrosine phosphorylation (Meyer et al 2013).

      Reviewer #3 (Public Review):

      The identity and purity of the used proteins is not clear. Since the findings are so unexpected and potentially of wide-reaching interest - this is a weakness. Similar specific detection of phospho-Ser/Thr vs phospho-Tyr relies largely on antibodies which can have varying degrees of specificity.

      We followed a stringent purification protocol of several steps (optimized for the successful crystallization of the IKK2) that removed most impurities (PMID: 23776406, PMID: 39227404). The samples analysed with ESI MS did not show any significant contaminating kinase from the Sf9 cells.

      The sequence-specific phospho-antibodies used in this study are very well characterized and have been used in the field for years (Basak et al 2007, PMID: 17254973). We agree with the reviewer’s concerns about the pan-specific phospho-antibodies. Since phospho-tyrosine detection is the crucial aspect of this study, we minimized such bias by using pan-specific phosphotyrosine antibodies from two independent sources.

      Reviewer #1 (Recommendations For The Authors):

      I understand that Figure 3 shows that K44M abolishes both S32/36 phosphorylation and tyrosine phosphorylation, but not PEST region phosphorylation. This suggests that autophosphorylation is reflective of its known specific biological role in signal transduction. But I do not understand why "these results strongly suggest that IKK2-autophosphorylation is critical for its substrate specificity". That statement would be supported by a mutant that no longer autophosphorylates, and as a result shows a loss of substrate specificity, i.e. phosphorylates non-specific residues more strongly. Is that the case? Maybe Darwech et al 2010 or Meyer et al 2013 showed this.

      Later figures seem to address this point, so maybe this conclusion should be stated later in the paper.

      We have now clarified this in the manuscript and moved the comment to the next section. We have consolidated the results from Figures 3 and 4 of the previous version into a single figure. The text has also been modified accordingly.

      Page 10: mentions DFG+1 without a proper introduction. The Chen et al 2014 paper appears to inform the author's interest in Y169 phosphorylation, or is it just an additional interesting finding? Does this publication belong in the Introduction or the Discussion?

      The position of Y169 at DFG+1 was intriguing, and the 2014 article by Chen et al further bolstered our interest in investigating this residue. We think this publication is important in both sections.

      To understand the significance of Figure 4D, we need a WT IKK2 control: or is there prior literature to cite? This is relevant to the conclusion that Y169 phosphorylation is particularly important for S32 phosphorylation.

      We have now added a new supplementary figure in which the activities of WT and Y169F IKK2 towards WT and S32/S36 mutants are compared (Figure S3F). At a similar concentration, the activity of WT-IKK2 is many-fold higher than that of the Y-to-F mutants (Fig. 4C). The experiments were performed simultaneously; the samples were loaded on different gels but otherwise processed in a similar way.

      The cold ATP quenching experiment is nice for testing the model that Y169 functions as a phospho sink that allows for a transfer reaction. However, there is only a single timepoint and condition, which does not allow for a quantitative analysis. Furthermore, a positive control would make this experiment more compelling, and Y169F mutant should show that cold ATP quenching reduces the phosphorylation of IkBa.

      We thank the reviewer for appreciating our experimental design and for pointing out these concerns. We set the time point for the cold ATP experiment to the maximum used in the non-competition experiment. We also used 50 mM ATP so that its competition could be compared with the highest concentration of ADP used. The idea behind using the maximum time and an ATP concentration comparable to that of ADP was to capture any competitive effect of ATP, which would be maximal under the given assay conditions, relative to the phospho-transfer setup in the absence of cold ATP. We agree that finer ranges of ATP concentration and time points would have enabled more quantitative analyses. We have now included data in which different time intervals are tested (Figure S5D).

      Why is the EE mutant recognized by anti-phospho-serine antibodies? In Figure 2F.

      We anticipate that serine residues besides those in the activation loop are phosphorylated when IKK2 is overexpressed in and purified from Sf9 cells. Since Glu (E) mimics phospho-Ser, the antibody cross-reacts with IKK2-EE, which mimics IKK2 phosphorylated at Ser177 and Ser181.

      Figure 7B is clear, but 7C does not add much.

      We have removed Fig. 7C from the current version. Figure 7 is now renumbered as Figure 6 and no longer contains this cartoon.

      Reviewer #2 (Recommendations For The Authors):

      Regarding the specificity arguments (see above in public review), the authors note that NEMO is very important in IKK specificity, and - if I'm understanding correctly - most of these assays were performed without NEMO. Would the IKK2-NEMO complex change these conclusions?

      NEMO is a scaffolding protein whose action goes beyond the activation of the IKK-complex. In cells, NEMO brings IkBa from a pool of thousands of proteins to its bona fide kinase when the cells encounter specific signals. In other words, NEMO channels IKK activity towards its bona fide substrate IkBa at that moment. Though direct proof is lacking, it is likely that NEMO presents IkBa to IKK in the correct pose, such that the S32/S36 region of IkBa is poised for phosphorylation. The mechanism proposed in the current study further ensures the specificity and fidelity of that phosphorylation event. We believe this mechanism will be preserved in the IKK-NEMO complex unless proven otherwise. As shown below, IKK2 undergoes tyrosine autophosphorylation in the presence of NEMO.

      Author response image 1.

      The work primarily focuses on Y169 as a candidate target for IKK autophosphorylation. This seems reasonable given the proximity to the ATP gamma phosphate. However, Y188F more potently disrupted IκBα phosphorylation. The authors note that this could be due to folding perturbations, but this caveat would also apply to Y169F. A test for global fold perturbations for both Tyr mutants would be helpful.

      Y188 is conserved in S/T kinases, and the equivalent residue in PKA (Y204) has been studied extensively using structural, biochemical and biophysical tools. In PKA, Y204 was found to participate in packing of the hydrophobic core of the large lobe. Disruption of this core structure by mutation allosterically affects the activity of the kinase. We observed a similar engagement of Y188 in IKK2’s large lobe and speculated about folding perturbations by analogy with the experimental evidence from PKA. What we meant was that mutation of Y188 would allosterically affect the kinase activity. Y169, on the other hand, is unique at that position, and no experimental evidence on the effect of a phospho-ablative mutation of this residue exists in the literature. Hence, we refrained from speculating on its effect on folding or conformational allostery; however, such a possibility cannot be ruled out.

      I struggled to follow the rationalization of the results of Figure 4D, the series of phosphorylation tests of Y169F against IκBα with combinations of phosphoablative or phosphomimetic variants at Ser32 and Ser36. This experiment is hard to interpret without a direct comparison to WT IKK2.

      We agree with the reviewer’s concerns. Through this experiment we wanted to inform about the importance of Tyr-phosphorylation of IKK2 in phosphorylating S32 of IκBα which is of vital importance in NF-kB signaling. We have now provided a comparison with WT-IKK2 in the supplementary Figure S3F. We hope this will help bring more clarity to the issue.

      MD simulations were performed to compare structures of unphosphorylated vs. Ser-phosphorylated (p-IKK2) vs. Ser+Tyr-phosphorylated (P-IKK2) forms of IKK2. These simulations were performed without ATP bound, and then a representative pose was subject to ADP or ATP docking. The authors note distortions in the simulated P-IKK2 kinase fold and clashes with ATP docking. Given the high cellular concentration of ATP, it seems more logical to approach the MD with the assumption of nucleotide availability. Most kinase domains are highly dynamic in the absence of substrate. Is it possible that the P-IKK2 poses are a result of simulation in a non-physiological absence of bound ATP? Ultimately, this MD observation is linked to the proposed model where ADP-binding is required for efficient phospho-relay to IκBα. Therefore, this observation warrants scrutiny. Perhaps the authors could follow up with binding experiments to directly test whether P-IKK2 binds ADP and fails to bind ATP.

      We thank the reviewer for bringing up this issue. It is an important one, and we must admit that we do not fully understand it yet. We took a more rigorous approach this time and used three docking programs: ATP and ADP were docked to the kinase structures using LeDock and GOLD, followed by rescoring with AutoDock Vina. We found that ATP binding to P-IKK2 is highly unfavourable compared to ADP binding. To further address these issues, we performed detailed MM-PBSA (Molecular Mechanics Poisson-Boltzmann Surface Area) analyses after MD simulation to estimate the binding free energies and affinities of ADP and ATP for each of the three differently phosphorylated states of IKK2. These analyses (Figure S4 E and F) clearly indicate that phosphorylated IKK2 has a much higher preference for ADP over ATP. However, this does not rule out ATP binding by P-IKK2 in a different pose that may not support kinase activity.
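
      For readers unfamiliar with MM-PBSA, the standard form of the estimate is sketched below. This is the textbook decomposition, not a description of the specific settings used in the study. The symbol ΔΔG for the ADP versus ATP comparison is notation introduced here, and the entropy term is often omitted in practice.

```latex
\begin{align}
\Delta G_{\mathrm{bind}} &= \langle G_{\mathrm{complex}} \rangle
  - \langle G_{\mathrm{receptor}} \rangle
  - \langle G_{\mathrm{ligand}} \rangle, \\
G &= E_{\mathrm{MM}} + G_{\mathrm{PB}} + G_{\mathrm{SA}} - T S, \\
\Delta\Delta G_{\mathrm{ADP-ATP}} &= \Delta G_{\mathrm{bind}}^{\mathrm{ADP}}
  - \Delta G_{\mathrm{bind}}^{\mathrm{ATP}}.
\end{align}
```

      Here the angle brackets denote averages over MD snapshots, E<sub>MM</sub> is the molecular mechanics energy, G<sub>PB</sub> and G<sub>SA</sub> are the polar and non-polar solvation terms, and a negative ΔΔG corresponds to a preference for ADP over ATP.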

      We could not perform any binding experiment, for the following reason. We incubated FL IKK2 WT with or without cold ATP for 30 min, then incubated these samples with <sup>32</sup>P-ATP and analysed them by autoradiography after resolving them on 10% SDS-PAGE. We found that even after pre-incubation with excess cold ATP, the kinase still underwent autophosphorylation when radioactive ATP was added, as shown below. This prevented us from performing a direct binding experiment with ATP, as it would not represent a true binding event. We also noticed that after removal of bulk ATP following autophosphorylation, phosphorylated IKK2 is capable of further autophosphorylation when freshly incubated with ATP. We have not been able to find a condition that would account only for binding of ATP and not for its hydrolysis.

      Author response image 2.

      The authors could comment on whether robust phosphorylation of NEMO was expected (Figure 1D). On a related note, why is NEMO a single band in the 1D left panel and double bands on the right?

      No, we did not expect robust phosphorylation of NEMO. However, robust phosphorylation of NEMO is observed only in the absence of IκBα. In the presence of IκBα, phosphorylation of NEMO goes down drastically. Regarding the bands, these were two different preparations of NEMO. When TEV digestion to remove the His-tag is incomplete, two bands appear, because the tagged and untagged versions cannot be separated by size exclusion chromatography, which is the final purification step.

      Page 14, line 360. "...observed phosphorylation of tyrosine residue(s) only upon fresh ATP-treatment..." I'm not sure I understand the wording here (or the relevance of the citation). Is this a comment on unreported data demonstrating the rapid hydrolysis of the putative phosphotyrosine(s)? If so, that would be helpful to clarify and report in the supporting information.

      In our X-ray crystallographic studies with phosphorylated IKK2, we failed to observe any density for a phosphate moiety. Furthermore, this IKK2 showed further autophosphorylation when incubated with fresh ATP. These two observations led us to believe that some of the autophosphorylation events are transient in nature. However, quantitative kinetic analyses of this dephosphorylation have not been performed.

      Figure S3 middle panel: The PKA substrate overlaid on the IKK2 seems sterically implausible for protein substrate docking. Is that just a consequence of the viewing angle? On a related note, Figure S3 may be mislabeled as S4 in the main text.

      It is a consequence of the viewing angle. Also, we apologize for this inadvertent mislabelling. It has been corrected in the current version.

      Reviewer #3 (Recommendations For The Authors):

      The detection of phosphorylated amino acids relies largely on antibodies which can have a varying degree of specificity. An alternative detection mode of the phospho-amino acids for example by MS would strengthen the evidence.

      We agree with the concern about the specificity bias of antibodies. We tried to minimize such bias by using two different p-Tyr antibodies, as noted previously and in the methodology section. We were also able to detect phospho-tyrosine residues by MS/MS analyses; representative spectra are now added (Figure S3A).

      IKK2 purity - protocol states "desired purity". What was the actual purity and how was it checked? MS would be useful to check for the presence of other kinases.

      The purity of the recombinantly purified IKK2 proteins is routinely checked by silver staining. A representative silver-stained SDS-PAGE gel is shown (Figure S1C). It may be noted that expression level and solubility, and hence purification yield and quality, correlate directly with the activity of the kinase. Active IKK2 proteins express at a much higher level and yield cleaner preparations. In our experience, inactive IKKs such as K44M give poor yield and purity. We analysed K44M by LC MS/MS to identify other proteins present in the sample. We did not find any significant contaminant kinase in the sample (Figure S1D). The MS/MS result is attached.

      Figure 1C&D: where are the Mw markers? What is the size of the band? What is the MS evidence for tyrosine phosphorylation?

      We have now indicated MW marker positions on these figures.

      MS/MS scan data for the two peptides containing pTyr169 and pTyr188 are shown separately (Figure S3A).

      Figure 2F: Why is fresh ATP necessary? Why was Tyr not already phosphorylated? The kinetics of this process appear to be unusual when the reaction runs to completion within 5 minutes?

      As stated earlier, we believe some of the autophosphorylation events are transient in nature. We think the Tyr-phosphorylation is lost due to the action of cellular phosphatases. We agree with the reviewer’s concern that the reaction appears to reach completion within 5 minutes in Fig 2F. We believe this is probably because the amount of kinase used in this experiment exceeds the linear portion of the dynamic range of the antibody used. Lower concentrations of the kinase show that the reaction does not reach completion until 60 min, as shown in Fig. 2A.

      Figure 3: Can the authors exclude contamination with a Tyr kinase in the IKK2-K44M prep? The LC/MS/MS data should be included.

      We have reanalysed the sample on an Orbitrap to check for contamination by any Tyr-kinase or other kinase. We used the Spodoptera frugiperda proteome available on the UniProt website for this analysis. These analyses confirmed that no significant kinase contaminant is present in the fraction (Figure S1D).

      What is the specificity of IKK-2 Inhibitor VII? Could it inhibit a contaminant kinase?

      This inhibitor is highly potent against IKK2 and the IKK-complex, and to a lesser extent against IKK1. No literature is available on its activity against other kinases. In an unrelated study, this compound was used alongside the MAPK inhibitor SB202190, and the two inhibitors produced completely different outcomes (Matou-Nasri S, Najdi M, AlSaud NA, Alhaidan Y, Al-Eidi H, Alatar G, et al. (2022) Blockade of p38 MAPK overcomes AML stem cell line KG1a resistance to 5-Fluorouridine and the impact on miRNA profiling. PLoS ONE 17(5):e0267855. https://doi.org/10.1371/journal.pone.0267855). That study indirectly indicates that IKK inhibitor VII does not interfere with the MAPK pathways. We have not found any literature on non-specific activity of this inhibitor.

      Figure 6B: the band corresponding to "p-IkBa" appears to be similar in the presence of ADP (lanes 4-7) or in the absence of ADP but the presence of ATP (lane 8).

      The level of radioactive p-IκBα is higher when ADP is added than in the absence of ADP. In the presence of cold ATP, the level of radioactive p-IκBα remains unchanged. This result strongly indicates that the phosphate group added to IκBα comes directly from the radioactively labelled kinase and is not competed out by the cold ATP.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public review):

      Summary:

      In the manuscript "Intergenerational transport of double-stranded RNA limits heritable epigenetic changes," Shugarts and colleagues investigate intergenerational dsRNA transport in the nematode C. elegans. By inducing oxidative damage, they block dsRNA import into cells, which affects heritable gene regulation in the adult germline (Fig. 2). They identify a novel gene, sid-1-dependent gene-1 (sdg-1), upregulated upon SID-1 inhibition (Fig. 3). Both transient and genetic depletion of SID-1 lead to the upregulation of sdg-1 and a second gene, sdg-2 (Fig. 5). Interestingly, while sdg-1 expression suggests a potential role in dsRNA transport, neither its overexpression nor loss-of-function impacts dsRNA-mediated silencing in the germline (Fig. 7).

      Strengths:

      • The authors employ a robust neuronal stress model to systematically explore SID-1 dependent intergenerational dsRNA transport in C. elegans.

      • They discover two novel SID-1-dependent genes, sdg-1 and sdg-2.

      • The manuscript is well-written and addresses the compelling topic of dsRNA signaling in C. elegans.

      Weaknesses:

      • The molecular mechanism downstream of SDG-1 remains unclear. Testing whether sdg-2 functions redundantly with sdg-1 could provide further insights.

      • SDG-1 dependent genes in other nematodes remain unknown.

      We thank the reviewer for highlighting the strengths of the work along with a couple of the interesting future directions inspired by the reported discoveries. The restricted presence of genes encoding SDG-1 and its paralogs within retrotransposons suggests intriguing evolutionary roles for these proteins. Future work could examine whether such fast-evolving or newly evolved proteins with potential roles in RNA regulation are more broadly associated with retrotransposons. Multiple SID-1-dependent proteins (including SDG-1 and SDG-2) could act together to mediate downstream effects. This possibility can be tested using combinatorial knockouts and overexpression strains. Both future directions have the potential to illuminate the evolutionarily selected roles of dsRNA-mediated signaling through SID-1, which remain a mystery.

      Reviewer #2 (Public review):

      Summary:

      RNAs can function across cell borders and animal generations as sources of epigenetic information for development and immunity. The specific mechanistic pathways by which RNA travels between cells and progeny remain an open question. Here, Shugarts, et al. use molecular genetics, imaging, and genomics methods to dissect specific RNA transport and regulatory pathways in the C. elegans model system. Ingestion of double-stranded RNA by larvae is noted not to cause continuous gene silencing throughout adulthood. Damage of neuronal cells expressing double-stranded target RNA is observed to repress target gene expression in the germline. Exogenous short or long double-stranded RNA required different genes for entry into progeny. It was observed that the SID-1 double-stranded RNA transporter showed different expression over animal development. Removal of the sid-1 gene caused upregulation of two genes, the newly described sid-1-dependent genes sdg-1 and sdg-2. Both genes were observed to be negatively regulated by other small RNA regulatory pathways. Strikingly, loss then gain of sid-1 through breeding still caused variability of sdg-1 expression for many, many generations. SDG-2 protein co-localizes with germ granules, intracellular sites for heritable RNA silencing machinery. Collectively, sdg-1 presents a model to study how extracellular RNAs can buffer gene expression in germ cells and other tissues.

      Strengths:

      (1) Very clever molecular genetic methods and genomic analyses, paired with thorough genetics, were employed to discover insights into RNA transport, sdg-1 and sdg-2 as sid-1-dependent genes, and sdg-1's molecular phenotype.

      (2) The manuscript is well cited, and figures reasonably designed.

      (3) The discovery of the sdg genes being responsive to the extracellular RNA cell import machinery provides a model to study how exogenous somatic RNA is used to regulate gene expression in progeny. The discovery of genes within retrotransposons stimulates tantalizing models of how regulatory loops may actually permit the genetic survival of harmful elements.

      Weaknesses:

      (1) The manuscript is broad, making it challenging to read and consider the data presented. Of note, since the original submission, the authors have improved the clarity of the writing and presentation.

      Comments on revised version:

      This reviewer thanks the authors for their efforts in revising the manuscript. In their rebuttal, the authors acknowledged the broad scope of their manuscript. I concur. While I still think the manuscript is a challenge to read due to its expansive nature, the current draft is substantially improved when compared to the previous one. This work will contribute to our general knowledge of RNA biology, small RNA regulatory pathways, and RNA inheritance.

      We thank the reviewer for highlighting the strengths of the manuscript and for helping us improve the presentation of our results and discussion.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In the manuscript "Intergenerational transport of double-stranded RNA limits heritable epigenetic changes" Shugarts and colleagues investigate intergenerational dsRNA transport in the nematode C. elegans. They induce oxidative damage in worms, blocking dsRNA import into cells (and potentially affecting the worms in other ways). Oxidative stress inhibits dsRNA import and the associated heritable regulation of gene expression in the adult germline (Fig. 2). The authors identify a novel gene, sid-1-dependent gene-1 (sdg-1), which is induced upon inhibition of SID-1 (Fig. 3). Both transient inhibition and genetic depletion of SID-1 lead to the upregulation of sdg-1 and a second gene, sdg-2 (Fig. 5). The expression of SDG-1 is variable, potentially indicating buffering regulation. While the expression of Sdg-1 could be consistent with a role in intergenerational transport of dsRNA, neither its overexpression nor loss-of-function impacts dsRNA-mediated silencing (Fig. 7) in the germline. It would be interesting to test if sdg-2 functions redundantly.

      In summary, the authors have identified a novel worm-specific protein (sdg-1) that is induced upon loss of dsRNA import via SID-1, but is not required to mediate SID-1 RNA regulatory effects.

      We thank the reviewer for highlighting our findings on SDG-1. We found that oxidative damage in neurons enhanced dsRNA transport into the germline and/or subsequent silencing.

      Remaining Questions:

      • The authors use an experimental system that induces oxidative damage specifically in neurons to release dsRNAs into the circulation. Would the same effect be observed if oxidative damage were induced in other cell types?

      It is possible that oxidative damage of other tissues using miniSOG (as demonstrated in Xu and Chisholm, 2016) could also enhance the release of dsRNA into the circulation from those tissues. However, future experiments would be needed to test this empirically because it is also possible that the release of dsRNA depends on physiological properties (e.g., the molecular machinery promoting specific secretion) that are particularly active in neurons. We chose to use neurons as the source of dsRNA because, when dsRNA was expressed in a variety of tissues, neurons appeared to be the most efficient at exporting dsRNA, as measured using SID-1-dependent silencing in other tissues (Jose et al., PNAS, 2009).

      • Besides dsRNA, which other RNAs and cellular products (macromolecules and small signalling molecules) are released into the circulation that could affect the observed changes in germ cells?

      We do not yet know all the factors that could be released either in naive animals or upon oxidative damage of neurons that influence the uptake of dsRNA into other tissues. The dependence on SID-1 for the observed enhancement of silencing (Fig. 2) shows that dsRNA is necessary for silencing within the germline. Whether this import of dsRNA occurs in conjunction with other factors (e.g., the uptake of short dsRNA along with yolk into oocytes (Marré et al., PNAS, 2016)) before silencing within the germline will require further study. A possible approach could be the isolation of extracellular fluid (Banse and Hunter, J Vis Exp., 2012) followed by characterization of its contents. However, the limited material available using this approach and the difficulty in avoiding contamination from cellular damage by the needle used for isolating the material make it challenging.

      • SID-1 modifies RNA regulation within the germline (Fig. 7) and upregulates sdg-1 and sdg-2 (Fig. 5). However, SID-1's effects do not appear to be mediated via sdg-1. Testing the role of sdg-2 would be intriguing.

      We observe the accumulation of sdg-1 and sdg-2 RNA in two different mutants lacking SID-1, which led us to conservatively focus on the analysis of one of these proteins for this initial paper. We expect that more sensitive analyses of the RNA-seq data will likely reveal additional genes regulated by SID-1. With the ability to perform multiplexed genome-editing, we hope in future work to generate strains that have mutations in many SID-1-dependent genes to recapitulate the defects observed in sid-1(-) animals. Indeed, as surmised by the reviewer, we are focusing on sdg-2 as the first such SID-1-dependent gene to analyze using mutant combinations.

      • Are sdg-1 or sdg-2 conserved in other nematodes or potentially in other species? sdg-1 appears to be encoded or captured by a retro-element in the C. elegans genome and exhibits stochastic expression in different isolates. Is this a recent adaptation in the C. elegans genome, or is it present in other nematodes? Does loss-of-function of sdg-1 or sdg-2 have any observable effect?

      Clear homologs of SDG-1 and SDG-2 are not detectable outside of C. elegans. Consistent with the location of the sdg-1 gene within a Cer9 retrotransposon that appears to have integrated only within the C. elegans genome, sequence conservation between the genomes of related species is only observed outside the region of the retrotransposon (see Author response image 1, screenshot from UCSC browser). There were no obvious defects detected in animals lacking sdg-1 (Fig. 7) or in animals lacking sdg-2 (data not shown). It is possible that further exploration of both mutants and mutant combinations lacking additional SID-1-dependent genes would reveal defects. We also plan to examine these mutants in sensitized genetic backgrounds where one or more members of the RNA silencing pathway have been compromised.

      Author response image 1.

      Clarification for Readability:

      To enhance readability and avoid misunderstandings, it is crucial to specify the model organism and its specific dsRNA pathways that are not conserved in vertebrates:

      We agree with the reviewer and thank the reviewer for the specific suggestions provided below. To take the spirit of the suggestion to heart, we have instead changed the title of our paper to clearly signal that the entire study only uses C. elegans. We have titled the study ‘Intergenerational transport of double-stranded RNA in C. elegans can limit heritable epigenetic changes’.

      • In the first sentence of the paragraph "Here, we dissect the intergenerational transport of extracellular dsRNA ...", the authors should specify "in the nematode C. elegans". Unlike vertebrates, which recognise dsRNA as a foreign threat, worms and other invertebrates pervasively use dsRNA for signalling. Additionally, worms, unlike vertebrates and insects, encode RNA-dependent RNA polymerases that generate dsRNA from ssRNA substrates, enabling amplification of small RNA production. Especially in dsRNA biology, specifying the model organism is essential to avoid confusion about potential effects in humans.

      We agree with most statements made by the reviewer, although whether dsRNA is exclusively recognized as a foreign threat by all vertebrates of all stages remains controversial. Our changed title now eliminates all ambiguity regarding the organism used in the study.

      • Similarly, the authors should specify "in C. elegans" in the sentence "Therefore, we propose that the import of extracellular dsRNA into the germline tunes intracellular pathways that cause heritable RNA silencing." This is important because C. elegans small RNA pathways differ significantly from those in other organisms, particularly in the PIWI-interacting RNA (piRNA) pathways, which depend on dsRNA in C. elegans but use ssRNA in vertebrates. Specification is crucial to prevent misinterpretation by the reader. It is well understood that mechanisms of transgenerational inheritance that operate in nematodes or plants are not conserved in mammals.

      The piRNAs of C. elegans are single-stranded but are encoded by numerous independent genes throughout the genome. The molecules used for transgenerational inheritance of epigenetic changes that have been identified thus far are indeed different in different organisms. However, the regulatory principles required for transgenerational inheritance are general (Jose, eLife, 2024). Nevertheless, we have modified the title to clearly state that the entire study is using C. elegans.  

      • The first sentence of the discussion, "Our analyses suggest a model for ...", would also benefit from specifying "in C. elegans". The same applies to the figure captions. Clarification of the model organism should be added to the first sentence, especially in Figure 1.

      With the clarification of the organism used in the title, we expect that all readers will be able to unambiguously interpret our results and the contexts where they apply. 

      Reviewer #2 (Public review):

      Summary:

      RNAs can function across cell borders and animal generations as sources of epigenetic information for development and immunity. The specific mechanistic pathways by which RNA travels between cells and progeny remain an open question. Here, Shugarts, et al. use molecular genetics, imaging, and genomics methods to dissect specific RNA transport and regulatory pathways in the C. elegans model system. Larvae ingesting double stranded RNA is noted to not cause continuous gene silencing throughout adulthood. Damage of neuronal cells expressing double stranded target RNA is observed to repress target gene expression in the germline. Exogenous supply of short or long double stranded RNA required different genes for entry into progeny. It was observed that the SID-1 double-stranded RNA transporter showed different expression over animal development. Removal of the sid-1 gene caused upregulation of two genes, the newly described sid-1-dependent genes sdg-1 and sdg-2. Both genes were observed to also be negatively regulated by other small RNA regulatory pathways. Strikingly, loss then gain of sid-1 through breeding still caused variability of sdg-1 expression for many, many generations. SDG-2 protein co-localizes with a Z-granule marker, an intracellular site for heritable RNA silencing machinery. Collectively, sdg-1 presents a model to study how extracellular RNAs can buffer gene expression in germ cells and other tissues.

      We thank the reviewer for highlighting our findings and underscoring the striking nature of the discovery that mutating sid-1 using genome-editing resulted in a transgenerational change that could not be reversed by changing the sid-1 sequence back to wild-type.

      Strengths:

      (1) Very clever molecular genetic methods and genomic analyses, paired with thorough genetics, were employed to discover insights into RNA transport, sdg-1 and sdg-2 as sid-1-dependent genes, and sdg-1's molecular phenotype.

      (2) The manuscript is well cited, and figures reasonably designed.

      (3) The discovery of the sdg genes being responsive to the extracellular RNA cell import machinery provides a model to study how exogenous somatic RNA is used to regulate gene expression in progeny. The discovery of genes within retrotransposons stimulates tantalizing models how regulatory loops may actually permit the genetic survival of harmful elements.

      We thank the reviewer for the positive comments.

      Weaknesses:

      (1) As presented, the manuscript is incredibly broad, making it challenging to read and consider the data presented. This concern is exemplified in the model figure, that requires two diagrams to summarize the claims made by the manuscript.

      RNA interference (RNAi) by dsRNA is an organismal response where the delivery of dsRNA into the cytosol of some cell precedes the processing and ultimate silencing of the target gene within that cell. These two major steps are often not separately considered when explaining observations. Yet, the interpretation of every RNAi experiment is affected by both steps. To make the details that we have revealed in this work for both steps clearer, we presented the two models separated by scale - organismal vs. intracellular. We agree that this integrative manuscript appears very broad when the many different findings are each considered separately. The overall model revealed here forms the necessary foundation for the deep analysis of individual aspects in the future.

      (2) The large scope of the manuscript denies space to further probe some of the ideas proposed. The first part of the manuscript, particularly Figures 1 and 2, presents data that can be caused by multiple mechanisms, some of which the authors describe in the results but do not test further. Thus, portions of the results text come across as claims that are not supported by the data presented.

      We agree that one of the consequences of addressing the joint roles of transport and subsequent silencing during RNAi is that the scope of the manuscript appears large. We had suggested multiple interpretations for specific observations in keeping with the need for further work. To prevent our listing of possible interpretations from being taken as claims by the reader, we have followed the instructions of the reviewer (see below) and moved some of the potential explanations we raised to the discussion section.

      (3) The manuscript focuses on the genetics of SDGs but not the proteins themselves. Few descriptions of the SDGs functions are provided nor is it clarified why only SDG-1 was pursued in imaging and genetic experiments. Additionally, the SDG-1 imaging experiments could use additional localization controls.

      We agree that more work on the SDG proteins will likely be informative, but it is beyond the scope of this already expansive paper. We began with the analysis of SDG-1 because it had the most support as a regulator of RNA silencing (Fig. 5f). Indeed, in other work (Lalit and Jose, bioRxiv, 2024), we find that AlphaFold 2 predicts the SDG-1 protein to be a regulator of RNA silencing that directly interacts with the dsRNA-editing enzyme ADR-2 and the endonuclease RDE-8. Furthermore, we expect that more sensitive analyses of the RNA-seq data are likely to reveal additional genes regulated by SID-1. Using multiplexed genome editing, we hope to generate mutant combinations lacking multiple sdg genes to reveal their function(s).

      We agree that given the recent discovery of many components of germ granules, our imaging data does not have sufficient resolution to discriminate between them. We have modified our statements and our model regarding the colocalization of SDG-1 with Z-granules to indicate that the overlapping enrichment of SDG-1 and ZNFX-1 in the perinuclear region is consistent with interactions with other nearby granule components.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Major

      (1) As presented, the manuscript is almost two manuscripts combined into one. This point is highlighted in Figure 7h, which basically presents two separate models. The key questions addressed in the manuscript starts at Figure 3. Figures 1 and 2 are interesting observations but require more experiments to define further. For example, as the Results text describes for Figure 1, "These differences in the entry of ingested dsRNA into cells and/or subsequent silencing could be driven by a variety of changes during development. These include changes in the uptake of dsRNA into the intestine, distribution of dsRNA to other tissues from the intestine, import of dsRNA into the germline, and availability of RNA silencing factors within the germline." Presenting these (reasonable) mechanistic ideas detracted from the heritable RNA epigenetic mechanism explored in the later portion of the manuscript. There are many ways to address this issue, one being moving Figures 1 and 2 to the Supplement to focus on SID-1 related pathways.

      Since this manuscript addresses the interaction between intercellular transport of dsRNA and heritable epigenetic changes, it was necessary to establish the possible route(s) that dsRNA could take to the germline before any inference could be made regarding heritable epigenetic changes. As suggested below (pt. 2), we have now moved the alternatives we enumerated as possible explanations for some experimental results (e.g., for the differences quoted here) to the discussion section.

      (2) The manuscript includes detailed potential interpretations in the Results, making them seem like claims. Here is an example:

      "Thus, one possibility suggested by these observations is that reduction of sdg-1 RNA via SID-1 alters the amount of SDG-1 protein, which could interact with components of germ granules to mediate RNA regulation within the germline of wild-type animals."

      This mechanism is a possibility, but placing these ideas in the citable results makes it seem like an overinterpretation of imaging data. This text and others should be in the Discussion, where speculation is encouraged. Results sections like this example and others should be moved to the discussion.

      We have rephrased motivating connections between experiments like the one quoted above and also moved such text to the discussion section wherever possible.

      (3) A paragraph describing the SDG proteins will be helpful. Homologs? Conserved protein domains? mRNA and/or protein expression pattern across worm, not just the germline? Conservation across Caenorhabditis sp? These descriptions may help establish context why SDG-1 localizes to Z-granules.

      We have now added information about the conservation of the sdg-1 gene in the manuscript. AlphaFold predicts domains with low confidence for the SDG-1 protein, consistent with the lack of conservation of this protein (AlphaFold requires multiple sequence alignments to predict confidently). In the adult animal, the SDG-1 protein was only detectable in the germline. Future work focused on SDG-1, SDG-2 and other SDG proteins will further examine possible expression in other tissues and functional domains if any. Unfortunately, in multiple attempts of single-molecule FISH experiments using probes against the sdg-1 open reading frame, we were unable to detect a specific signal above background (data not shown). Additional experiments are needed for the sensitive detection of sdg-1 expression outside the germline, if any.  

      (4) Based on the images shown, SDG-1 could be in other nearby granules, such as P granules or mutator foci. Additional imaging controls to rule out these granules/condensates will greatly strengthen the argument that SDG-1 protein localizes to Z-granules specifically.

      We have modified the final model to indicate that the perinuclear colocalization is with germ granules broadly and we agree that we do not have the resolution to claim that the observed overlap of SDG-1::mCherry with GFP::ZNFX-1 that we detect using Airyscan microscopy is specifically with Z granules. Our initial emphasis of Z-granule was based on the prior report of SDG-1 being co-immunoprecipitated with the Z-granule surface protein PID-2/ZSP-1. However, through other work predicting possible direct interactions using AlphaFold (Lalit and Jose, bioRxiv, 2024), we were unable to detect any direct interactions between PID-2 and SDG-1. Indeed, many additional granules have been recently reported (Chen et al., Nat. Commun., 2024; Huang et al., bioRxiv 2024), making it possible that SDG-1 has specific interactions with a component of one of the other granules (P, Z, M, S, E, or D) or adjacent P bodies.

      Minor

      (1) "This entry into the cytosol is distinct from and can follow the uptake of dsRNA into cells, which can rely on other receptors." Awkward sentence. Please revise.

      We have now revised this sentence to read “This entry into the cytosol is distinct from the uptake of dsRNA into cells, which can rely on other receptors”

      (2) Presumably, the dsRNA percent of the in vitro transcribed RNA is different than the 50 bp oligos that can be reliably annealed by heating and cooling. Other RNA secondary structure possibilities warrant further discussion.

      We agree that in vitro transcribed RNA could include a variety of undefined secondary structures in addition to dsRNAs of mixed length. Such structures could recruit or titrate away RNA-binding proteins in addition to the dsRNA structures engaging the canonical RNAi pathway, resulting in mixed mechanisms of silencing. Future work identifying such structures and exploring their impact on the efficacy of RNAi could be informative. We have now added these considerations to the discussion and thank the reviewer for highlighting these possibilities.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, the authors engineer the endogenous left boundary of the Drosophila eve TAD, replacing the endogenous Nhomie boundary by either a neutral DNA, a wildtype Nhomie boundary, an inverted Nhomie boundary, or a second copy of the Homie boundary. They perform Micro-C on young embryos and conclude that endogenous Nhomie and Homie boundaries flanking eve pair with head-to-tail directionality to form a chromosomal stem loop. Abrogating the Nhomie boundary leads to ectopic activation of genes in the former neighboring TAD by eve embryonic stripe enhancers. Replacing Nhomie by an inverted version or by Homie (which pairs with itself head-to-head) transformed the stem loop into a circle loop. An important finding was that stem and circle loops differentially impact endogenous gene regulation both within the eve TAD and in the TADs bracketing eve. Intriguingly, an eve TAD with a circle loop configuration leads to ectopic activation of flanking genes by eve enhancers - indicating compromised regulatory boundary activity despite the presence of an eve TAD with intact left and right boundaries.

      Strengths:

      Overall, the results obtained are of high-quality and are meticulously discussed. This work advances our fundamental understanding of how 3D genome topologies affect enhancer-promoter communication.

      Weaknesses:

      Though convincingly demonstrated at eve, the generalizability of TAD formation by directional boundary pairing remains unclear, though the authors propose this mechanism could underlie the formation of all TADs in Drosophila and possibly even in mammals. Strong and ample evidence has been obtained to date that cohesin-mediated chromosomal loop extrusion explains the formation of a large fraction of TADs in mammals.

      (1.1) The difficulty with almost all of the studies on mammalian TADs, cohesin and CTCF roadblocks is that the sequencing depth is not sufficient, and large bin sizes (>1 kb) are needed to visualize chromosome architecture.  The resulting contact profiles show TAD neighborhoods, not actual TADs.

      The problem with these studies is illustrated by comparing the contact profiles of mammalian MicroC data sets at different bin sizes in Author response image 1.  In this figure, the darkness of the “pixels” in panels E, F, G and H was enhanced by reducing brightness in Photoshop.

      Author response image 1.

      Mammalian MicroC profiles at different bin sizes

      Panels A and C show “TADs” using bin sizes typical of most mammalian studies (see Krietenstein et al. (2023) (Krietenstein et al. 2020)).  At this level of resolution, TADs, the “trees” that are the building blocks of chromosomes, are not visible.  Instead, what is seen are TAD neighborhoods or “forests”.  Each neighborhood consists of several dozen individual TADs.  The large bins in these panels also artificially accentuate TAD:TAD interactions, generating a series of “stripes” and “dots” that correspond to TADs bumping into each other and sequences getting crosslinked.  For example, in panel A there is a prominent stripe on the edge of a “TAD” (blue arrow).  In panel C, this stripe resolves into a series of dots arranged as parallel, but interrupted “stripes” (green and blue arrows).  At the next level of resolution, it can be seen that the stripe marked by the blue arrow and magenta asterisk is generated by contacts between the left boundary of the TAD indicated by the magenta bar and sequences in a TAD (blue bar) ~180 kb away.  While dots and stripes are prominent features in contact profiles visualized with larger bin sizes (A and C), the actual TADs that are observed with a bin size of 200 bp (examples are underlined by black bars in panel G) are not bordered by stripes, nor are they topped by obvious dots.  The one possible exception is the dot that appears at the top of the volcano triangle underlined with magenta.

      The chromosome 1 DNA segment from the MicroC data of Hsieh et al. (2023) (Hsieh et al. 2020) shows a putative volcano triangle with a plume (indicated by a V in Author response image 1 panels D, F and H).  Sequences in the V TAD don’t crosslink with their immediate neighbors, and this gives a “plume” above the volcano triangle, as indicated by the light blue asterisk in panels D, F and H.  Interestingly, the V TAD does contact two distant TADs, U on the left and W on the right. The U TAD is ~550 kb from V, and the region of contact is indicated by the black arrow.  The W TAD is ~585 kb from V, and the region of contact is indicated by the magenta arrow.  While the plume still seems to be visible with a bin size of 400 bp (light blue asterisk), it is hard to discern when the bin size is 200 bp, as there are not enough reads.
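      The dependence of the contact profile on bin size can be illustrated with a toy calculation. The sketch below is only an illustration under stated assumptions: the positions are hypothetical (they are not taken from any of the data sets discussed here), and `bin_contacts` and `shared_cells` are names introduced for this example. Two adjacent 2 kb domains with no contacts between them occupy separate cells of the map at a 200 bp bin size, but collapse into a single cell at a 5 kb bin size.

```python
from collections import Counter

def bin_contacts(pairs, bin_size):
    """Count contacts per (bin_i, bin_j) cell of a contact map, with bin_i <= bin_j."""
    counts = Counter()
    for pos_a, pos_b in pairs:
        i, j = sorted((pos_a // bin_size, pos_b // bin_size))
        counts[(i, j)] += 1
    return counts

# Hypothetical toy data (not from any data set): two adjacent 2 kb domains
# that meet at 3,000 bp, with contacts only within each domain.
domain_a = [(1100, 2900), (1300, 2500), (1500, 2700)]
domain_b = [(3100, 4900), (3300, 4500)]

def shared_cells(bin_size):
    """Cells of the map that contain contacts from both domains."""
    cells_a = set(bin_contacts(domain_a, bin_size))
    cells_b = set(bin_contacts(domain_b, bin_size))
    return cells_a & cells_b

fine_shared = shared_cells(200)     # small bins: the domains occupy separate cells
coarse_shared = shared_cells(5000)  # large bins: both domains fall into one cell
```

      With the small bins, no cell mixes contacts from the two domains; with the large bins, all five contacts land in the same cell and the boundary between the domains can no longer be seen.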

      The evidence demonstrating that cohesin is required for TAD formation/maintenance is based on low resolution Hi-C data, and the effects that are observed are on TAD neighborhoods (forests) and not TADs (trees).  In fact, there is published evidence that cohesin is not required in mammals for TAD formation/maintenance.  In an experiment from Goel et al. 2023 the authors depleted the cohesin component Rad21 and then visualized the effects on TAD organization using the high resolution region capture MicroC (RCMC) protocol.  The MicroC contact map in this figure visualizes a ~250 kb DNA segment around the Ppm1g locus at 250 bp resolution.  On the right side of the diagonal is the untreated control, while the left side shows the MicroC profile of the same region after Rad21 depletion.  The authors indicated that there was a 97% depletion of Rad21 in their experiment.  However, as is evident from a comparison of the experimental and control profiles, loss of Rad21 has no apparent effect on the TAD organization of this mammalian DNA segment.

      Several other features are worth noting.  First, unlike the MicroC experiments shown in Author response image 1, there are dots at the apex of the TADs in this chromosomal segment.  In the MicroC protocol, fixed chromatin is digested to mononucleosomes by extensive MNase digestion.  The resulting DNA fragments are then ligated, and dinucleosome-length fragments are isolated and sequenced. 

      DNA sequences that are nucleosome free in chromatin (which would be promoters, enhancers, silencers and boundary elements) are typically digested to oligonucleotides in this procedure and won’t be recovered. This means that the dots shown here must correspond to mononucleosome-length elements that are MNase resistant.  This is also true for the dots in the MicroC contact profiles of the Drosophila Abd-B regulatory domain (see Fig. 2B in the paper).  Second, the TADs are connected to each other by 45° stripes (see blue and green arrowheads).  While it is not clear from this experiment whether the stripes are generated by an active mechanism (enzyme) or by some “passive” mechanism (e.g., sliding), the stripes in this chromosomal segment are not generated by cohesin, as they are unperturbed by Rad21 depletion.  Third, there are no volcano triangles with plumes in this chromosomal DNA segment.  Instead, the contact patterns (purple and green asterisks) between neighboring TADs closely resemble those seen for the Abd-B regulatory domains (compare Goel et al. 2023 with Fig. 2B in the paper).  This similarity suggests that the TADs in and around Ppm1g may be circle-loops, not stem-loops.  As volcano triangles with plumes also seem to be rare in the MicroC data sets of Krietenstein et al. (Krietenstein et al. 2020) and Hsieh et al. (Hsieh et al. 2020) (with the caveat that these data sets are low resolution: see Author response image 1), it is possible that much of the mammalian genome is assembled into circle-loop TADs, a topology that can’t be generated by the cohesin loop extrusion (bolo tie clip)/CTCF roadblock model.

      While Rad21 depletion has no apparent effect on TADs, it does appear to impact TAD neighborhoods.  This is in a supplemental figure in Goel et al. (Goel et al. 2023).  In this figure, TADs in the Ppm1g region of chromosome 5 are visualized with bin sizes of 5 kb and 1 kb.  A 1.2 Mb DNA segment is shown for the 5 kb bin size, while an 800 kb DNA segment is shown for the 1 kb bin size.  As can be seen from comparing the MicroC profiles in Author response image 2 with that in Goel et al. 2023, individual TADs are not visible.  Instead, the individual TADs are binned into large TAD “neighborhoods” that consist of several dozen or more TADs.

      Unlike the individual TADs shown in Goel et al. 2023, the TAD neighborhoods in Author response image 2 are sensitive to Rad21 depletion.  The effects of Rad21 depletion can be seen by comparing the relative pixel density inside the blue lines before (above the diagonal) and after (below the diagonal) auxin-induced Rad21 degradation.  The reduction in pixel density is greatest for more distant TAD:TAD contacts (farthest from the diagonal).  By contrast, the TADs themselves are unaffected (Goel et al. 2023), as are contacts between individual TADs and their immediate neighbors.  In addition, contacts between partially overlapping TAD neighborhoods are also lost.  At this point it isn’t clear why contacts between distant TADs in the same neighborhood are lost when Rad21 is depleted; however, a plausible speculation is that it is related to the functioning of cohesin in holding newly replicated DNAs together until mitosis and whatever other role it might have in chromosome condensation.

      Author response image 2.

      Ppm1g full locus chr5

      Moreover, given the unique specificity with which Nhomie and Homie are known to pair (and exhibit "homing" activity), it is conceivable that formation of the eve TAD by boundary pairing represents a phenomenon observed at exceptional loci rather than a universal rule of TAD formation. Indeed, characteristic Micro-C features of the eve TAD are only observed at a restricted number of loci in the fly genome…..

      (1.2) The available evidence does not support the claim that nhomie and homie are “exceptional.”  To begin with, nhomie and homie rely on precisely the same set of factors that have been implicated in the functioning of other boundaries in the fly genome.  For example, homie requires (among other factors) the generic boundary protein Su(Hw) for insulation and long-distance interactions (Fujioka et al. 2024).  (This is also true of nhomie: unpublished data.)  The Su(Hw) protein (like other fly polydactyl zinc finger proteins) can engage in distant interactions.  This was first shown by Sigrist and Pirrotta (Sigrist and Pirrotta 1997), who found that the su(Hw) element from the gypsy transposon can mediate long-distance regulatory interactions (PRE dependent silencing) between transgenes inserted at different sites on homologous chromosomes (trans interactions) and at sites on different chromosomes.

      The ability to mediate long-distance interactions is not unique to the su(Hw) element, or homie and nhomie.  Muller et al. (Muller et al. 1999) found that the Mcp boundary from the Drosophila BX-C is also able to engage in long-distance regulatory interactions, both PRE-dependent silencing of mini-white and enhancer activation of mini-white and yellow.  The functioning of the Mcp boundary depends upon two other generic insulator proteins, Pita and the fly CTCF homolog (Kyrchanova et al. 2017).  Like Su(Hw), both are polydactyl zinc finger proteins, and they resemble the mammalian CTCF protein in that their N-terminal domain mediates multimerization (Bonchuk et al. 2020; Zolotarev et al. 2016).  Figure 6 from Muller et al. 1999 shows PRE-dependent “pairing sensitive silencing” interactions between transgenes carrying a mini-white reporter, the Mcp and scs’ (Beaf dependent) (Hart et al. 1997) boundary elements, and a PRE closely linked to Mcp.  In this experiment flies homozygous for different transgene inserts were mated and the eye color was examined in their trans-heterozygous progeny.  As indicated in the figure, the strongest trans-silencing interactions were observed for inserts on the same chromosomal arm; however, transgenes inserted on the left arm of chromosome 3 can interact across the centromere with transgenes inserted on the right arm of chromosome 3.

      Figure 5C (left) from Muller et al. 1999 shows a trans-silencing interaction between w#11.102 at 84D and w#11.16 approximately 5.8 Mb away, at 87D.  Figure 5C (right) shows a trans-silencing interaction across the centromere between w#14.29 on the left arm of chromosome 3 at 78F and w#11.102 on the right arm of chromosome 3 at 84D. The eye color phenotype of mini-white-containing transgenes is usually additive: homozygous inserts have twice as dark an eye color as the corresponding hemizygous inserts.  Likewise, in flies trans-heterozygous for mini-white transgenes inserted at different sites, the eye color is equivalent to the sum of the two transgenes.  This is not true when mini-white transgenes are silenced by PREs.  In the combination shown in panel A, the trans-heterozygous fly has a lighter eye color than either of the parents.  In the combination in panel B, the trans-heterozygous fly is slightly lighter than either parent.

      As evident from the diagram in Figure 6 from Muller et al. 1999, all of the transgenes inserted on the 3rd chromosome that were tested were able to participate in long-distance (>Mbs) regulatory interactions.  On the other hand, not all possible pairwise interactions are observed.  This would suggest that potential interactions depend upon the large scale (Mb) 3D folding of the 3rd chromosome.

      When the scs boundary (Zw5 dependent) (Gaszner et al. 1999) was added to the transgene to give sMws’, it further enhanced the ability of distant transgenes to find each other and pair.  All eight of the sMws’ inserts that were tested were able to interact with at least one other sMws’ insert on a different chromosome and silence mini-white.  Vazquez et al. (Vazquez et al. 2006) subsequently tagged the sMws’ transgene with LacO sequences (ps0Mws’) and visualized pairing interactions in imaginal discs.  Trans-heterozygous combinations on the same chromosome were found paired in 94-99% of the disc nuclei, while a trans-heterozygous combination on different chromosomes was found paired in 96% of the nuclei (Table 3 from Vazquez et al. 2006).  Vazquez et al. also examined a combination of four transgenes inserted on the same chromosome (two at the same insertion site, and two at different insertion sites).  In this case, all four transgenes were clustered together in 94% of the nuclei (Table 3 from Vazquez et al. 2006).  Their studies also suggest that the distant transgenes remain paired for at least several hours.  A similar experiment was done by Li et al. (Li et al. 2011), except that the transgene contained only a single boundary, Mcp or Fab-7.  While pairing was still observed in trans-heterozygotes, the frequency was reduced without scs and scs’.

      It is worth pointing out that there is no plausible mechanism in which cohesin could extrude a loop through hundreds of intervening TADs, across the centromere (ff#13.101↔w#11.102: Figure 6 from Muller et al. 1999; w#14.29↔w#11.102: Figures 5 and 6 from Muller et al. 1999) and come to a halt when it “encounters” Mcp-containing transgenes on different homologs.  The same is true for Mcp-dependent pairing interactions in cis (Fig. 7 in Muller et al. (Muller et al. 1999)) or Mcp-dependent pairing interactions between transgenes inserted on different chromosomes (Fig. 8 in Muller et al. (Muller et al. 1999); Line 8 in Table 3 from Vazquez et al. 2006).

      These are not the only boundaries that can engage in long-distance pairing.  Mohana et al. (Mohana et al. 2023) identified nearly 60 meta-loops, many of which appear to be formed by the pairing of TAD boundary elements.  Two examples (at 200 bp resolution from 12-16 hr embryos) are shown in Author response image 3.

      Author response image 3.

      Metaloops on the 2nd and 3rd chromosomes: circle-loops and multiple stem-loops

      One of these meta-loops (panel A) is generated by the pairing of two TAD boundaries on the 2nd chromosome.  The first boundary, blue (indicated by blue arrow), is located at ~2,006,500 bp between a small TAD containing the Nplp4 and CG15353 genes and a larger TAD containing 3 genes, CG33543, Obp22a and Npc2a.  Nplp4 encodes a neuropeptide.  The functions of CG15354 and CG33543 are unknown.  Obp22a encodes an odorant binding protein, while Npc2a encodes the Niemann-Pick type C-2a protein, which is involved in sterol homeostasis.  The other boundary (purple: indicated by purple arrow) is located between two TADs 2.8 Mb away at 4,794,250 bp.  The upstream TAD contains the fipi gene (CG15630), which has neuronal functions in male courtship, while the downstream TAD contains CG3294, which is thought to be a spliceosome component, and schlaff (slf), which encodes a chitin binding protein.  As illustrated in the accompanying diagram, the blue boundary pairs with the purple boundary in a head-to-head orientation, generating a ~2.8 Mb loop with a circle-loop topology.  As a result of this pairing, the multi-gene (CG33543, Obp22a and Npc2a) TAD upstream of the blue boundary interacts with the CG15630 TAD upstream of the purple boundary.  Conversely, the small Nplp4:CG15353 TAD downstream of the blue boundary interacts with the CG3294:slf TAD downstream of the purple boundary.  Even if one imagined that the cohesin bolo tie clip was somehow able to extrude 2.8 Mb of chromatin and then know to stop when it encountered the blue and purple boundaries, it would’ve generated a stem-loop, not a circle-loop.

      The second meta-loop (panel B) is more complicated as it is generated by pairing interactions between four boundary elements.  The blue boundary (blue arrow) located ~4,801,800 bp (3L) separates a large TAD containing the RhoGEF64C gene from a small TAD containing CG7509, which encodes a predicted subunit of an extracellular carboxypeptidase.  As can be seen in the MicroC contact profile and the accompanying diagram, the blue boundary pairs with the purple boundary (purple arrow) which is located at ~7,013, 500 (3L) just upstream of the 2nd internal promoter (indicated by black arrowhead) of the Mp (Multiplexin) gene.  This pairing interaction is head-to-tail and generates a large stem-loop that spans ~2.2 Mb.  The stem-loop brings sequences upstream of the blue boundary and downstream of the purple boundary into contact (the strings below a bolo tie clip), just as was observed in the boundary bypass experiments of Muravyova et al. (Muravyova et al. 2001) and Kyrchanova et al. (Kyrchanova et al. 2008).  The physical interactions result in a box of contacts (right top) between sequences in the large RhoGEF64C TAD and sequences in a large TAD that contains an internal Mp promoter.  The second pairing interaction is between the brown boundary (brown arrow) and the green boundary (green arrow).  The brown boundary is located at ~4 805,600 bp (3L) and separates the TAD containing CG7590 from a large TAD containing CG1808 (predicted to encode an oxidoreductase) and the Dhc64C (Dynein heavy chain 64C) gene.  The green boundary is located at ~6,995,500 bp (3L), and it separates a TAD containing CG32388 and the biniou (bin) transcription factor from a TAD that contains the most distal promoter of the Mp (Multiplexin) gene (blue arrowhead).  As indicated in the diagram, the brown and green boundaries pair with each other head-to-tail, and this generates a small internal loop (and the final configuration would resemble a bolo tie with two tie clips).  
This small internal loop brings the CG7590 TAD into contact with the TAD that extends from the distal Mp promoter to the 2nd internal Mp promoter.  The resulting contact profile is a rectangular box with diagonal endpoints corresponding to the paired blue:purple and brown:green boundaries.  The pairing of the brown:green boundaries also brings the TADs immediately downstream of the brown boundary and upstream of the green boundary into contact with each other, and this gives a rectangular box of interactions between the Dhc64C TAD and sequences in the bin/CG32388 TAD.  This box is located on the lower left side of the contact map.
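
      The contact rules described for panels A and B can be summarized in a short sketch.  This is only an illustration of the pattern stated in the text (head-to-head pairing gives a circle-loop, head-to-tail pairing gives a stem-loop); the function name `predicted_contacts` and the generic boundary labels "A" and "B" are hypothetical and are not taken from any analysis pipeline.

```python
def predicted_contacts(orientation):
    """Return (topology, contacts) for two paired boundaries, A and B.

    A is the boundary with the smaller genomic coordinate. Each contact is a
    pair of flanking regions that the text describes as coming together.
    Illustrative summary only; labels are hypothetical.
    """
    if orientation == "head-to-head":
        # Circle-loop: like flanks meet (upstream with upstream, etc.).
        return "circle-loop", [
            ("upstream of A", "upstream of B"),
            ("downstream of A", "downstream of B"),
        ]
    if orientation == "head-to-tail":
        # Stem-loop: opposite flanks meet.
        return "stem-loop", [
            ("upstream of A", "downstream of B"),
            ("downstream of A", "upstream of B"),
        ]
    raise ValueError(f"unknown pairing orientation: {orientation!r}")


for orientation in ("head-to-head", "head-to-tail"):
    topology, contacts = predicted_contacts(orientation)
    print(orientation, "->", topology, contacts)
```

      With A = blue and B = purple in panel A, the head-to-head case reproduces the upstream:upstream and downstream:downstream boxes described above; with A = brown and B = green in panel B, the head-to-tail case reproduces the two boxes attributed to the internal stem-loop.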

      Since the bin and Mp meta-loops in Author response image 3B are stem-loops, they could have been generated by “sequential” cohesin loop extrusion events.  Besides the fact that cohesin extrusion of 2 Mb of chromatin and breaking through multiple intervening TAD boundaries challenges the imagination, there is no mechanism in the cohesin loop extrusion/CTCF roadblock model to explain why cohesin complex 1 would come to a halt at the purple boundary on one side and the blue boundary on the other, while cohesin complex 2 would instead stop when it hits the brown and green boundaries.  This highlights another problem with the cohesin loop extrusion/CTCF roadblock model, namely that the roadblocks are functionally autonomous: they have an intrinsic ability to block cohesin that is entirely independent of the intrinsic ability of other roadblocks in the neighborhood.  As a result, there is no mechanism for generating specificity in loop formation.  By contrast, boundary pairing interactions are by definition non-autonomous and depend on the ability of individual boundaries to pair with other boundaries: specificity is built into the model.  The mechanism for pairing, and accordingly the basis for partner preferences/specificity, are reasonably well understood.  Probably the most common mechanism in flies is based on shared binding sites for architectural proteins that can form dimers or multimers (Bonchuk et al. 2021; Fedotova et al. 2017).  Flies have a large family of polydactyl zinc finger DNA binding proteins, and as noted above, many of these form dimers or multimers and also function as TAD boundary proteins.  This pairing principle was first discovered by Kyrchanova et al. (Kyrchanova et al. 2008).  This paper also showed that orientation-dependent pairing interactions are a common feature of endogenous fly boundaries.  Another mechanism for pairing is specific protein:protein interactions between different DNA binding factors (Blanton et al. 2003).  Yet a third mechanism would be proteins that bridge different DNA binding proteins together.  The boundaries that use these different mechanisms (BX-C boundaries, scs, scs’) depend upon the same sorts of proteins that are used by homie and nhomie.  Likewise, this same set of factors reappears in one combination or another in most other TAD boundaries.  As for the orientation of pairing interactions, this is most likely determined by the order of binding sites for chromosome architectural proteins in the partner boundaries.

      …and many TADs lack focal 3D interactions between their boundaries.

      (1.3) The idea that flies differ from mammals in that they “lack” focal 3D interactions is simply mistaken.  One of the problems with drawing this distinction is that almost all of the “focal 3D interactions” seen in mammalian Hi-C experiments are a consequence of binning large DNA segments in low resolution restriction enzyme-dependent experiments.  This is even true in the two “high” resolution MicroC experiments that have been published (Hsieh et al. 2020; Krietenstein et al. 2020).  As illustrated above in Author response image 1, most of the “focal 3D interactions” (the dots at the apex of TAD triangles) seen with large bin sizes (1 kb and greater) disappear when the bin size is 200 bp and TADs rather than TAD neighborhoods are being visualized.
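
      As a purely illustrative sketch of the binning effect described in this point, the toy example below counts the same set of contacts at two bin sizes.  The read-pair coordinates are invented for illustration and are not taken from any data set; `bin_contacts` is a hypothetical helper, not part of any MicroC or Hi-C software.

```python
from collections import Counter


def bin_contacts(pairs, bin_size):
    """Count contacts per (bin_a, bin_b) cell for a given bin size in bp."""
    counts = Counter()
    for pos_a, pos_b in pairs:
        counts[(pos_a // bin_size, pos_b // bin_size)] += 1
    return counts


# Hypothetical read pairs (bp coordinates), invented for illustration only:
# five contacts spread evenly across two 1 kb windows.
toy_pairs = [(100, 5100), (300, 5300), (500, 5500), (700, 5700), (900, 5900)]

coarse = bin_contacts(toy_pairs, 1000)  # 1 kb bins
fine = bin_contacts(toy_pairs, 200)     # 200 bp bins

print("max count per cell at 1 kb:  ", max(coarse.values()))
print("max count per cell at 200 bp:", max(fine.values()))
```

      In this toy case all five contacts fall into a single 1 kb cell, which would read as one intense pixel, whereas at 200 bp they are spread over five separate cells.  Real contact maps are of course far more complex; the sketch only shows why an apparent focal signal can depend on the bin size chosen.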

      As described in point #1.1, in the MicroC protocol, fixed chromatin is first digested to mononucleosomes by extensive MNase digestion, processed/biotinylated, and ligated to give dinucleosome-length fragments, which are then sequenced.  Regions of chromatin that are nucleosome free (promoters, enhancers, silencers, boundary elements) will typically be reduced to oligonucleotides in this procedure and will not be recovered when dinucleosome-length fragments are sequenced.  The loss of sequences from typical paired boundary elements is illustrated by the Lar meta-loop shown in Author response image 4 (at 200 bp resolution).  Panels A and B show the contact profiles generated when the blue boundary (which separates two TADs that span the Lar (Leukocyte-antigen-related-like) transcription unit) interacts with the purple boundary (which separates two TADs in a gene poor region ~620 kb away).  The blue and purple boundaries pair with each other head-to-head, and this pairing orientation generates yet another circle-loop.  In the circle-loop topology, sequences in the TADs upstream of both boundaries come into contact with each other, and this gives the small dark rectangular box to the upper left of the paired boundaries (Author response image 4A).  (Note that this small box corresponds to the two small TADs upstream of the blue and purple boundaries, respectively. See panel B.)  Sequences in the TADs downstream of the two boundaries also come into contact with each other, and this gives the large box to the lower right of the paired boundaries.  While this meta-loop is clearly generated by pairing interactions between the blue and purple boundaries, the interacting sequences are degraded in the MicroC protocol, and sequences corresponding to the blue and purple boundaries aren’t recovered.  This can be seen in panel B (red arrow and red arrowheads).
When a different Hi-C procedure is used (dHS-C) that captures nucleosome-free regions of chromatin that are physically linked to each other (Author response image 4C & D), the sequences in the interacting blue and purple boundaries are recovered and generate a prominent “dot” at their physical intersection (blue arrow in panel D).

      Author response image 4.

      Lar metaloop. Panels A & B: MicroC. Panels C & D: dHS-C

      While sequences corresponding to the blue and purple boundaries are lost in the MicroC procedure, there is at least one class of elements that engage in physical pairing interactions whose sequences are (comparatively) resistant to MNase digestion.  This class of elements includes many PREs ((Kyrchanova et al. 2018); unpublished data), the boundary bypass elements in the Abd-B region of BX-C (Kyrchanova et al. 2023; Kyrchanova et al. 2019a; Kyrchanova et al. 2019b; Postika et al. 2018), and “tethering” elements (Batut et al. 2022; Li et al. 2023).  In all of the cases tested, these elements are bound in nuclear extracts by a large (>1000 kD) GAGA factor-containing multiprotein complex called LBC.  LBC also binds to the hsp70 and eve promoters (unpublished data).  Indirect end-labeling experiments (Galloni et al. 1993; Samal et al. 1981; Udvardy and Schedl 1984) indicate that the LBC protects a ~120-180 bp DNA segment from MNase digestion.  It is likely that this is the reason why LBC-bound sequences can be recovered in MicroC experiments as dots when they are physically linked to each other.  One such example (based on the ChIP signatures of the paired elements) is indicated by the green arrow in panels B and D of Author response image 4.  Note that there are no dots corresponding to these two LBC elements within either of the TADs immediately downstream of the blue and purple boundaries.  Instead, the sequences corresponding to the two LBC elements are only recovered when the two elements pair with each other over a distance of ~620 kb.  The fact that these two elements pair with each other is consistent with other findings which indicate that, like classical boundaries, LBC elements exhibit partner preferences.  In fact, LBC elements can sometimes function as TAD boundaries.  For example, the Fab-7 boundary has two LBC elements, and full Fab-7 boundary function can be reconstituted with just these two elements (Kyrchanova et al. 2018).

      Reviewer #2 (Public Review):

      "Chromatin Structure II: Stem-loops and circle-loops" by Ke*, Fujioka*, Schedl, and Jaynes reports a set of experiments and subsequent analyses focusing on the role of Drosophila boundary elements in shaping 3D genome structure and regulating gene expression. The authors primarily focus on the region of the fly genome containing the even skipped (eve) gene; eve is expressed in a canonical spatial pattern in fly embryos and its locus is flanked by the well-characterized neighbor of homie (nhomie) and homie boundary elements. The main focus of investigation is the orientation dependence of these boundary elements, which had been observed previously using reporter assays. In this study, the authors use CRISPR/Cas9 editing followed by recombination-mediated cassette exchange to create a series of recombinant fly lines in which the nhomie boundary element is either replaced with exogenous sequence from phage 𝝀, an inversion of nhomie, or a copy of homie that has the same orientation as the endogenous homie sequence. The nhomie sequence is also regenerated in its native orientation to control for effects introduced by the transgenesis process.

      The authors then perform high-resolution Micro-C to analyze 3D structure and couple this with fluorescent and colorimetric RNA in situ hybridization experiments to measure the expression of eve and nearby genes during different stages of fly development. The major findings of these experiments are that total loss of boundary sequence (replacement with 𝝀 DNA) results in major 3D structure changes and the most prominent observed gene changes, while inversion of the nhomie boundary or replacement with homie resulted in more modest effects in terms of 3D structure and gene expression changes and a distinct pattern of gene expression change from the 𝝀 DNA replacement. As the samples in which the nhomie boundary is inverted or replaced with homie have similar Micro-C profiles at the eve locus and show similar patterns of a spurious gene activation relative to the control, the observed effects appear to be driven by the relative orientation of the nhomie and homie boundary elements to one another.

      Collectively, the findings reported in the manuscript are of broad interest to the 3D genome field. Although extensive work has gone into characterizing the patterns of 3D genome organization in a whole host of species, the underlying mechanisms that structure genomes and their functional consequences are still poorly understood. The perhaps best understood system, mechanistically, is the coordinated action of CTCF with the cohesin complex, which in vertebrates appears to shape 3D contact maps through a loop extrusion-pausing mechanism that relies on orientation-dependent sequence elements found at the boundaries of interacting chromatin loops.

      (2.1) The notion that the mammalian genome is shaped in 3D by the coordinate action of cohesin and CTCF has achieved the status of dogma in the field of chromosome structure in vertebrates.  However, as we have pointed out in #1.1, the evidence supporting this dogma is far from convincing.  To begin with, it is based on low resolution Hi-C experiments that rely on large bin sizes to visualize so-called “TADs.”  In fact, the notion that cohesin/CTCF are responsible on their own for shaping the mammalian 3D genome appears to be a result of mistaking a series of forests for the actual trees that populate each of the forests.

      As illustrated in Author response image 1 above, the “TADs” that are visualized in these low resolution data sets are not TADs at all, but rather TAD neighborhoods consisting of several dozen or more individual TADs.  Moreover, the “interesting” features that are evident at low resolution (>1 kb)—the dots and stripes—largely disappear at resolutions appropriate for visualizing individual TADs (~200 bp).

      In Goel et al. 2023, we presented data from one of the key experiments in Goel et al. (Goel et al. 2023).  In this experiment,  the authors used RCMC to generate high resolution (~250 bp) MicroC contact maps before and after Rad21 depletion.  Contrary to dogma, Rad21 depletion has absolutely no effect on TADs in a ~250 kb DNA segment—and these TADs look very much like the TADs we observe in the Drosophila genome, in particular in the Abd-B region of BX-C that is thought to be assembled into a series of circle-loops (see Fig. 2B).

      While Goel et al. (Goel et al. 2023) observed no effect of Rad21 depletion on TADs, they found that loss of Rad21 disturbs long-distance (but not short-distance) contacts in large TAD neighborhoods when their RCMC data set is visualized using bin sizes of 5 kb and 1 kb.  This is shown in Author response image 2.  The significance of this finding is, however, uncertain.  It could mean that the 3D organization of large TAD neighborhoods has a special requirement for cohesin activity.  On the other hand, since cohesin functions to hold sister chromosomes together after replication until they separate during mitosis (and might also participate in mitotic condensation), it is also possible that the loss of long-range contacts in large TAD neighborhoods when Rad21 is depleted is simply a reflection of this particular activity.  Further studies will be required to address these possibilities.

      As for CTCF: a careful inspection of the ChIP data in Goel et al. 2023 indicates that CTCF is not found at each and every TAD boundary.  In fact, the notion that CTCF is the be-all and end-all of TAD boundaries in mammals is truly hard to fathom.  For one, the demands for specificity in TAD formation (and in regulatory interactions) are likely much greater than those in flies, and specificity can’t be generated by a single DNA binding protein.  For another, several dozen chromosomal architectural proteins have already been identified in flies.  This means that (unlike what is thought to be true in mammals) it is possible to use a combinatorial mechanism to generate specificity in, for example, the long distance interactions in RFig 6 and 7.  As noted in #2.1 above, many of the known chromosomal architectural proteins in flies are polydactyl zinc finger proteins (just like CTCF).  There are some 200 different polydactyl zinc finger proteins in flies, and the function of only a handful of these is known at present.  However, it seems likely that a reasonable fraction of this class of DNA binding proteins will ultimately turn out to have an architectural function of some type (Bonchuk et al. 2021; Fedotova et al. 2017).  The number of different polydactyl zinc finger protein genes in mammals is nearly 3 times that of flies.  Is it really possible that, of these, only CTCF is involved in shaping the 3D structure of the mammalian genome?

      Despite having a CTCF paralog and cohesin, the Drosophila genome does not appear to be structured by loop extrusion-pausing. The identification of orientation-dependent elements with pronounced structural effects on genome folding thus may shed light on alternative mechanisms used to regulate genome structure, which in turn may yield insights into the significance of particular folding patterns.

      (2.2) Here we would like to draw the reviewer’s and reader’s attention to Author response image 3, which shows that orientation-dependent pairing interactions have a significant impact on physical interactions between different sequences.  We would also refer the reader to two other publications.  One of these is Kyrchanova et al. (Kyrchanova et al. 2008), which was the first to demonstrate that orientation of pairing interactions matters.  The second is Fujioka et al. (Fujioka et al. 2016), which describes experiments indicating that nhomie and homie pair with each other head-to-tail and with themselves head-to-head.

      On the whole, this study is comprehensive and represents a useful contribution to the 3D genome field. The transgenic lines and Micro-C datasets generated in the course of the work will be valuable resources for the research community. Moreover, the manuscript, while dense in places, is generally clearly written and comprehensive in its description of the work. However, I have a number of comments and critiques of the manuscript, mainly centering on the framing of the experiments and presentation of the Micro-C results and on the manner in which the data are analyzed and reported. They are as follows:

      Major Points:

      (1) The authors motivate much of the introduction and results with hypothetical "stem loop" and "circle loop" models of chromosome conformation, which they argue are reflected in the Micro-C data and help to explain the observed ISH patterns. While such structures may possibly form, the support for these specific models vs. the many alternatives is not in any way justified. For instance, no consideration is given to important biophysical properties such as persistence length, packing/scaling, and conformational entropy. As the biophysical properties of chromatin are a very trafficked topic both in terms of experimentation and computational modeling and generally considered in the analysis of chromosome conformation data, the study would be strengthened by acknowledgement of this body of work and more direct integration of its findings.

      (2.3) The reviewer is not correct in claiming that “stem-loops” and “circle-loops” are “hypothetical.”  There is ample evidence that both types of loops are present in eukaryotic genomes, and that loop conformation has significant readouts in terms of not only the physical properties of TADs but also their functional properties.  Here we would draw the reviewer’s attention to Author response image 3 and Author response image 4 for examples of loops formed by the orientation-dependent pairing of yet other TAD boundary elements.  As evident from the MicroC data in these figures, circle-loops and stem-loops have readily distinguishable contact patterns.  The experiments in Fujioka et al. (Fujioka et al. 2016) demonstrate that homie and nhomie pair with each other head-to-tail, while they pair with themselves head-to-head.  The accompanying paper (Bing et al. 2024) also provides evidence that loop topology is reflected both in the pattern of activation of reporters and in the MicroC contact profiles.  We would also mention again Kyrchanova et al. (Kyrchanova et al. 2008), who were the first to report orientation-dependent pairing of endogenous fly boundaries.

      At this juncture it would be premature to try to incorporate computational modeling of chromosome conformation in our studies.  The reason is that the experimental foundations that would be essential for building accurate models are lacking.  As should be evident from RFigs. 1-3 above, studies on mammalian chromosomes are simply not of high enough resolution to draw firm conclusions about chromosome conformation: in most studies only the forests are visible.  While the situation is better in flies, there are still too many unknowns.  As just one example, it would be important to know the orientation of the boundary pairing interactions that generate each TAD.  While it is possible to infer loop topology from how TADs interact with their neighbors (a plume versus clouds), a conclusive identification of stem- and circle-loops will require a method to unambiguously determine whether a TAD boundary pairs with its neighbor head-to-head or head-to-tail.

      (2) Similar to Point 1, while there is a fair amount of discussion of how the observed results are or are not consistent with loop extrusion, there is no discussion of the biophysical forces that are thought to underlie compartmentalization such as block-polymer co-segregation and their potential influence. I found this absence surprising, as it is generally accepted that A/B compartmentalization essentially can explain the contact maps observed in Drosophila and other non-vertebrate eukaryotes (Rowley, ..., Corces 2017; PMID 28826674). The manuscript would be strengthened by consideration of this phenomenon.

      (2.4) Compartments in mammals have typically been identified and characterized using low-resolution data sets, and these studies have relied on visualizing compartments using quite large bin sizes (>>1 kb).  Our experiments have nothing to do with the large-scale compartments seen in these Hi-C experiments.  Instead, we are studying the properties of individual TADs: how TADs are formed, the relationship between TAD topology and boundary:boundary pairing, and the impact of TAD topology on interactions between TADs in the immediate neighborhood.  There is no evidence to date that these large compartments or “block polymer co-segregation” a) have any impact on the properties of individual boundary elements, b) have a role in determining which boundary elements actually come together to form a given TAD, c) impact the orientation of the interactions between boundaries that generate the TAD or d) determine how TADs tend to interact with their immediate neighbors.

      In more recent publications (c.f., Harris et al. 2023) compartments have shrunk in size and instead of being units of several hundred kb, the median length of the “compartmental” unit in mammalian cells is about 12 kb. This is not too much different from the size of fly TADs.  However, the available evidence does not support the idea that block polymer co-segregation/co-repulsion drive the TAD:TAD interactions seen in MicroC experiments.  For example, according to this “micro-compartment” model, the specific patterns of interaction between TADs in the CG3294 meta-loop in Author response image 3 would be driven by block polymer co-segregation and co-repulsion. In this model, the TAD upstream of the blue boundary (which contains CG33543, the odorant binding protein gene Obp22a and the Npc2a gene which encodes a protein involved in sterol homeostasis) would share the same chromatin state/biophysical properties as the TAD upstream of the purple boundary, which has the fipi gene. While it is true that CG33543, Obp22a and also the fipi gene are not expressed in embryos, Npc2a is expressed at high levels during embryogenesis, yet it is part of the TAD that interacts with the fipi TAD.  The TAD downstream of the blue boundary contains CG15353 and Nplp4 and it interacts with the TAD downstream of the purple boundary which contains CG3294 and slf.  CG15353 and Nplp4 are not expressed in the embryo and as such should share a compartment with a TAD that is also silent. However, slf is expressed at a high level in 12-16 hr embryos, while CG3294 is expressed at a low level.  In neither case would one conclude that the TADs upstream and downstream of the blue and purple boundaries, respectively, interact because of shared chromatin/biophysical states that drive block polymer co-segregation/co-repulsion.

      One might also consider several gedanken experiments involving the long-range interactions that generate the CG3294 meta-loop in Author response image 3.  According to the micro-compartment model the patchwork pattern of crosslinking evident in the CG3294 meta-loop arises because the interacting TADs share the same biochemical/biophysical properties, and this drives block polymer co-segregation and co-repulsion.  If this model is correct, then this patchwork pattern of TAD:TAD interactions would remain unchanged if we were to delete the blue or the purple boundary.  However, given what we know about how boundaries can find and pair with distant boundaries (c.f., Figure 6 from Muller et al. 1999 and the discussion in #1.2), the result of these gedanken experiments seems clear: the patchwork pattern shown in Author response image 3A will disappear.  What would happen if we inverted the blue or the purple boundary? Would the TAD containing CG33543, Obp22a and Npc2a still interact with fipi as would be expected from the compartment model?  Or would the pattern of interactions flip so that the CG33543, Obp22a and Npc2a TAD interacts with the TAD containing CG3294 and slf?  Again we can anticipate the results based on previous studies: the interacting TADs will switch when the CG3294 meta-loop is converted into a stem-loop.  If this happened, the only explanation possible in the compartment model is that the chromatin states change when the boundary is inverted so that the TAD upstream of the blue boundary now shares the same chromatin state as the TAD downstream of the purple boundary, while the TAD downstream of the blue boundary shares the same state as the TAD upstream of the purple boundary.  However, there is no evidence that boundary orientation per se can induce a complete switch in “chromatin states” as would be required in the compartment model.

      While we have not done these experimental manipulations with the CG3294 meta-loop, an equivalent experiment was done in Bing et al. (Bing et al. 2024).  However, instead of deleting a boundary element, we inserted a homie boundary element together with two reporters (gfp and LacZ) 142 kb away from the eve TAD.  The result of this gedanken “reverse boundary deletion” experiment is shown in Author response image 5.  Panel A shows the MicroC contact profile in the region spanning the transgene insertion site and the eve TAD in wild type (read “deletion”) NC14 embryos.  Panel B shows the MicroC contact profile from 12-16 hr embryos carrying the homie dual reporter transgene inserted at -142 kb.  Prior to the “deletion”, the homie element in the transgene pairs with nhomie and homie in the eve TAD and this generates a “mini-metaloop.”  In this particular insert, the homie boundary in the transgene (red arrow) is “pointing” in the opposite orientation from the homie boundary in the eve TAD (red arrow).  In this orientation, the pairing of the transgene homie with eve nhomie/homie brings the LacZ reporter into contact with sequences in the eve TAD.  Since a mini-metaloop is formed by homie → nhomie/homie pairing, sequences in TADs upstream and downstream of the transgene insert interact with sequences in TADs close to the eve TAD (Author response image 5B).  Taken together these interactions correspond to the interaction patchwork that is typically seen in “compartments” (see boxed region and inset).  If this patchwork is driven, as per the model, by block polymer co-segregation and co-repulsion, then it should still be present when the transgene is deleted.  However, panel A shows that the interactions linking the transgene and the sequences in TADs next to the transgene to eve and TADs next to eve disappear when the homie boundary (plus transgene) is “deleted” in wild type flies.

      Author response image 5.

      Boundary deletion and compartments

      A second experiment would be to invert the homie boundary so that instead of pointing away from eve it points towards eve.  Again, if the compartmental patchwork is driven by block polymer co-segregation and co-repulsion, inverting the homie boundary in the transgene should have no effect on the compartmental contact profile.  Inspection of Fig. 7 in Bing et al. (Bing et al. 2024) will show that this prediction doesn’t hold either.  When homie is inverted, sequences in the eve TAD interact with the gfp reporter not the LacZ reporter.  In addition, there are corresponding changes in how sequences in TADs to either side of eve interact with sequences to either side of the transgene insert.  

      Yet another “test” of compartments generated by block polymer co-segregation/co-repulsion is provided by the plume above the eve volcano triangle.  According to the compartment model, sequences in TADs flanking the eve locus form the plume above the eve volcano triangle because their chromatin shares properties that drive block polymer co-segregation.  These same properties result in repulsive interactions with chromatin in the eve TAD, and this would explain why the eve TAD doesn’t crosslink with its neighbors.  If the distinctive chromatin properties of eve and the neighboring TADs drive block polymer co-segregation and co-repulsion, then inverting the nhomie boundary or introducing homie in the forward orientation should have absolutely no effect on the physical interactions between chromatin in the eve TAD and chromatin in the neighboring TADs.  However, Figures 4 and 6 in this paper indicate that boundary pairing orientation, not block polymer co-segregation/co-repulsion, is responsible for forming the plume above the eve TAD. Other findings also appear to be inconsistent with the compartment model. (A) The plume topping the eve volcano triangle is present in NC14 embryos when eve is broadly expressed (and potentially active throughout the embryo).  It is also present in 12-16 hr embryos when eve is only expressed in a very small subset of cells and is subject to PcG silencing everywhere else in the embryo.  (B) According to the compartment model the precise patchwork pattern of physical interactions should depend upon the transcriptional program/chromatin state that is characteristic of a particular developmental stage or cell type.  As cell fate decisions are just being made during NC14 one might expect that most nuclei will share similar chromatin states throughout much of the genome.  This would not be true for 12-16 hr embryos.
At this stage the compartmental patchwork would be generated by a complex mixture of interactions in cells that have quite different transcriptional programs and chromatin states.  In this case, the patchwork pattern would be expected to become fuzzy as a given chromosomal segment would be in compartment A in one group of cells and in compartment B in another.  Unlike 12-16 hr embryos, larval wing discs would be much more homogeneous and likely give a distinct and relatively well resolved compartmental pattern. We’ve examined the compartment patchwork of the same chromosomal segments in NC14 embryos, 12-16 hr embryos and larval wing disc cells.  While there are some differences (e.g., changes in some of the BX-C TADs in the wing disc sample) the compartmental patchwork patterns are surprisingly similar in all three cases. Nor is there any “fuzziness” in the compartmental patterns evident in 12-16 hr embryos, despite the fact that there are many different cell types at this stage of development.  (C) TAD interactions with their neighbors and compartmental patchworks are substantially suppressed in salivary gland polytene chromosomes.  This would suggest that features of chromosome structure might be the driving force behind many of the “compartmental” interactions as opposed to distinct biochemical/biophysical properties of small chromosomal segments that drive polymer co-segregation/co-repulsion.

      (3) The contact maps presented in the study represent many cells and distinct cell types. It is clear from single-cell Hi-C and multiplexed FISH experiments that chromosome conformation is highly variable even within populations of the same cell, let alone between cell types, with structures such as TADs being entirely absent at the single cell level and only appearing upon pseudobulking. It is difficult to square these observations with the models of relatively static structures depicted here. The authors should provide commentary on this point.

      (2.5) As should be evident from Author response image 1, single-cell Hi-C experiments would not provide useful information about the physical organization of individual TADs, TAD boundaries or how individual TADs interact with their immediate neighbors.  In addition, since they capture only a very small fraction of the possible contacts within and between TADs, we suspect that these single-cell studies aren’t likely to be useful for making solid conclusions about TAD neighborhoods like those shown in Author response image 1 panels A, B, C and D, or Author response image 2.  While it might be possible to discern relatively stable contacts between pairs of insulators in single cells with the right experimental protocol, the stabilities/dynamics of these interactions may be better judged by the length of time that physical interactions are seen to persist in live imaging studies such as Chen et al. (2018), Vazquez et al. (2006) and Li et al. (2011).

      The in situ FISH data we’ve seen also seem problematic in that probe hybridization results in a significant decondensation of chromatin.  For two probe sets complementary to adjacent ~1.2 kb DNA sequences, the measured center-to-center distance that we’ve seen was ~110 nm.  This is about 1/3rd the length that is expected for a 1.2 kb naked DNA fragment, and about 1.7 times larger than that expected for a beads-on-a-string nucleosome array (~60 nm).  However, chromatin is thought to be compacted into a 30 nm fiber, which is estimated to reduce the length of DNA by at least another ~6 fold.  If this estimate is correct, FISH hybridization would appear to result in a ~10 fold decompaction of chromatin.  A decompaction of this magnitude would necessarily be accompanied by a significant distortion in the actual conformation of chromatin loops.
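
      The arithmetic behind this estimate can be sketched as follows.  This is only a back-of-the-envelope check, not part of the original analysis: the probe length, the measured distance, the ~60 nm beads-on-a-string length and the ~6 fold compaction are taken from the paragraph above, while the 0.34 nm rise per base pair of B-form DNA and the function name are assumptions introduced here for illustration.

```python
# Back-of-the-envelope check of the FISH decompaction estimate.
# Assumption (not stated in the text): B-form DNA rises ~0.34 nm per base pair.
BP_RISE_NM = 0.34


def fish_decompaction(probe_bp=1200, measured_nm=110.0,
                      beads_on_string_nm=60.0, fiber_fold=6.0):
    """Compare a measured probe-to-probe distance with expected lengths.

    probe_bp, measured_nm, beads_on_string_nm and fiber_fold are the
    figures quoted in the response; the function itself is hypothetical.
    """
    naked_dna_nm = probe_bp * BP_RISE_NM        # extended naked DNA
    fiber_nm = beads_on_string_nm / fiber_fold  # length if folded as a 30 nm fiber
    return {
        "naked_dna_nm": naked_dna_nm,
        "fiber_nm": fiber_nm,
        "measured_vs_naked": measured_nm / naked_dna_nm,
        "measured_vs_beads": measured_nm / beads_on_string_nm,
        "decompaction_fold": measured_nm / fiber_nm,
    }


if __name__ == "__main__":
    for name, value in fish_decompaction().items():
        print(f"{name}: {value:.2f}")
```

      With these inputs the expected length of the compacted fiber is 10 nm, so the measured ~110 nm corresponds to a decompaction of about 11 fold, in line with the ~10 fold figure given above.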

      (4) The analysis of the Micro-C data appears to be largely qualitative. Key information about the number of reads sequenced, reads mapped, and data quality is not presented. No quantitative framework for identifying features such as the "plumes" is described. The study and its findings would be strengthened by a more rigorous analysis of these rich datasets, including the use of systematic thresholds for calling patterns of organization in the data.

      Additional information on the number of reads and data quality has been included in the methods section.

      (5) Related to Point 4, the lack of quantitative details about the Micro-C data makes it difficult to evaluate if the changes observed are due to biological or technical factors. It is essential that the authors provide quantitative means of controlling for factors like sampling depth, normalization, and data quality between the samples.

      In our view the changes in the MicroC contact patterns for the eve locus and its neighbors when the nhomie boundary is manipulated are not only clear cut and unambiguous but are also readily evident in the Figs that are presented in the manuscript.  If the reviewer believes that there aren’t significant differences between the MicroC contact patterns for the four different nhomie replacements, it seems certain that they would also remain unconvinced by a quantitative analysis.

      The reviewer also suggests that biological and/or technical differences between the four samples could account for the observed changes in the MicroC patterns for the eve TAD and its neighbors.  If this were the case, then similar changes in MicroC patterns should be observed elsewhere in the genome.  Since much of the genome is analyzed in these MicroC experiments there is an abundance of internal controls for each experimental manipulation of the nhomie boundary.  For two of the nhomie replacements, nhomie reverse and homie forward, the plume above the eve volcano triangle is replaced by clouds surrounding the eve volcano triangle.  If these changes in the eve MicroC contact patterns are due to significant technical (or biological) factors, we should observe precisely the same sorts of changes in TADs elsewhere in the genome that are volcano triangles with plumes.  Author response image 6 shows the MicroC contact pattern for several genes in the Antennapedia complex.  The deformed gene is included in a TAD which, like eve, is a volcano triangle topped by a plume.  A comparison of the deformed MicroC contact patterns for nhomie forward (panel B) with the MicroC patterns for nhomie reverse (panel C) and homie forward (panel D) indicates that while there are clearly technical differences between the samples, these differences do not result in the conversion of the deformed plume into clouds as is observed for the eve TAD.  The MicroC patterns elsewhere in the Antennapedia complex are also very similar in all four samples.  Likewise, comparisons of regions elsewhere in the fly genome indicate that the basic contact patterns are similar in all four samples.  So while there are technical differences which are reflected in the relative pixel density in the TAD triangles and the LDC domains, these differences do not result in converting plumes into clouds, nor do they alter the basic patterns of TAD triangles and LDC domains.
As for biological differences— the embryos in each sample are at roughly the same developmental stage and were collected and processed using the same procedures. Thus, the biological factors that could reasonably be expected to impact the organization of specific TADs (e.g., cell type specific differences) are not going to impact the patterns we see in our experiments. 

      Author response image 6.

      (6) The ISH effects reported are modest, especially in the case of the HCR. The details provided for how the imaging data were acquired and analyzed are minimal, which makes evaluating them challenging. It would strengthen the study to provide much more detail about the acquisition and analysis and to include depiction of intermediates in the analysis process, e.g. the showing segmentation of stripes.

      The imaging analysis presented in Fig. 5 is just standard confocal microscopy.  Individual embryos were visualized and scored.  An embryo in which stripes could be readily detected was scored as ‘positive’ while an embryo in which stripes couldn’t be detected was scored as ‘negative.’

      Recommendations for the authors:

      Editor comments:

      It was noted that the Jaynes lab previously published extensive genetic evidence to support the stem loop and circle loop models of Homie-Nhomie interactions (Fujioka 2016 Plos Genetics) that were more convincing than the Micro-C data presented here in proof of their prior model. Maybe the authors could more clearly summarize their prior genetic results to further try to convince the reader about the validity of their model.

      Reviewer #1 (Recommendations For The Authors):

      Below, I list specific comments to further improve the manuscript for publication. Most importantly, I recommend the authors tone down their proposal that boundary pairing is a universal TAD forming mechanism.

      (1) The title is cryptic.

      (2) The second sentence in the abstract is an overstatement: "In flies, TADs are formed by physical interactions between neighboring boundaries". Hi-C and Micro-C studies have not provided evidence that most TADs in Drosophila show focal interactions between their bracketing boundaries. The authors rely too strongly on prior studies that used artificial reporter transgenes to show that multimerized insulator protein binding sites or some endogenous fly boundaries can mediate boundary bypass, as evidence that endogenous boundaries pair.

      Please see responses #1.1 and #1.3 and figures Author response image 1 and Author response image 3.  Note that using dHS-C, most TADs that we’ve looked at so far are topped by a “dot” at their apex.

      (3) Line 64: the references do not cite the stated "studies dating back to the '90's'".

      The papers cited for that sentence are reviews which discussed the earlier findings.  The relevant publications are cited at the appropriate places in the same paragraph.  

      (4) Line 93: "On the other hand, while boundaries have partner preferences, they are also promiscuous in their ability to establish functional interactions with other boundaries." It was unclear what is meant here.

      Boundaries that a) share binding sites for proteins that multimerize, b) have binding sites for proteins that interact with each other, or c) have binding sites for proteins that can be bridged by a third protein can potentially pair with each other.  However, while these mechanisms enable promiscuous pairing interactions, they will also generate partner preferences (through a greater number of a, b and/or c).

      (5) It could be interesting to discuss the fact that it remains unclear whether Nhomie and Homie pair in cis or in trans, given that homologous chromosomes are paired in Drosophila.

      The studies in Fujioka et al. (2016) show that nhomie and homie can pair both in cis and in trans.  Given the results described in #1.2, we imagine that they are paired in both cis and trans in our experiments.

      (6) Line 321: Could the authors further explain why they think that "the nhomie reverse circle-loop also differs from the nhomie deletion (λ DNA) in that there is not such an obvious preference for which eve enhancers activate expression"?

      The likely explanation is that the topology/folding of the altered TADs impacts the probability of interactions between the various eve enhancers and the promoters of the flanking genes.  

      (7) The manuscript would benefit from shortening the long Discussion by avoiding repeating points described previously in the Results.

      (8) Line 495: "If, as seems likely, a significant fraction of the TADs genome-wide are circle loops, this would effectively exclude cohesin-based loop extrusion as a general mechanism for TAD formation in flies". The evidence provided in this manuscript appears insufficient to discard ample evidence from multiple laboratories that TADs form by compartmentalization or loop extrusion. Multiple laboratories have, for example, demonstrated that cohesin depletion disrupts a large fraction of mammalian TADs. 

      Points made here and in #9 have been responded to in #1.1, #2.1 and #2.4 above.  We would suggest that the evidence for loop extrusion falls short of compelling (as it is based on the analysis of TAD neighborhoods, not TADs; that is, forests, not trees) and, given the results reported in Goel et al. (in particular Fig. 4 and Sup Fig. 8), is clearly suspect.  This is not to mention the fact that cohesin loop-extrusion can’t generate circle-loop TADs, yet circle-loops clearly exist.  Likewise, as discussed in #2.4, it is not clear to us that shared chromatin states, polymer co-segregation and co-repulsion account for the compartmental patchwork patterns of TAD:TAD interactions.  The results from the experimental manipulations in this paper and the accompanying paper, together with studies by others (e.g., Kyrchanova et al. 2008; Mohana et al. 2023), would also seem to be at odds with the model for compartments as currently formulated.

      The unique properties of Nhomie and Homie, namely the remarkable specificity with which they physically pair over large distances (Fujioka et al. 2016) may rather suggest that boundary pairing is a phenomenon restricted to special loci. Moreover, it has not yet been demonstrated that Nhomie or Homie are also able to pair with the TAD boundaries on their left or right, respectively.

      Points made here were discussed in detail in #1.2.  As described there, it is not the case that nhomie and homie are “unique” or “special.”  Other fly boundaries can do the same things.  As for whether nhomie and homie pair with their neighbors:  We haven’t done transgene experiments (e.g., testing by transvection or boundary bypass).  Likewise, in MicroC experiments there are no obvious dots at the apex of the neighboring TADs that would correspond to nhomie pairing with the neighboring boundary to the left and homie pairing with the neighboring boundary to the right.  However, this is to be expected.  As we discussed in #1.3 above, only MNase resistant elements will generate dots in standard MicroC experiments.  On the other hand, when boundary:boundary interactions are analyzed by dHS-C (c.f., Author response image 4), there are dots at the apex of both neighboring TADs.  This would be direct evidence that nhomie pairs with the neighboring boundary to the left and homie pairs with the neighboring boundary to the right.

      (9) The comment in point 8 also applies to the concluding 2 sentences (lines 519-524) of the Discussion.

      See response to 8 above. Otherwise, the concluding sentences are completely accurate. Validation of the cohesin loop extrusion/CTCF roadblock model will require demonstrating a) that all TADs are either stem-loops or unanchored loops and b) that TAD endpoints are always marked by CTCF.

      The likely presence of circle-loops and the evidence for TAD boundaries that don’t have CTCF (c.f., Goel et al. 2023) already suggest that this model can’t (either fully or at all) account for TAD formation in mammals.

      (10) Figs. 3 and 6: It would be helpful to add the WT screenshot in the same figure, for direct comparison.

      It is easy enough to scroll between Figs-especially since nhomie forward looks just like WT.

      (11) Fig. 6: It would be helpful to show a cartoon view of a circle loop to the right of the Micro-C screenshot, as was done in Fig. 3.

      Good idea.   Added to the Fig.

      (12) Fig. 5: It would be helpful to standardize the labelling of the different genotypes throughout the figures and panels ("inverted" versus "reverse" versus an arrow indicating the direction).

      Fixed.

      Reviewer #2 (Recommendations For The Authors):

      Minor Points:

      (1) The Micro-C data does not appear to be deposited in an appropriate repository. It would be beneficial to the community to make these data available in this way.

      This has been done.

      (2) Readers not familiar with Drosophila development would benefit from a gentle introduction to the stages analyzed and some brief discussion on how the phenomenon of somatic homolog pairing might influence the study, if at all.

      We included a rough description of the stages that were analyzed for both the in situs and MicroC.  We thought that an actual description of what is going on at each of the stages wasn’t necessary as the process of development is not a focus of this manuscript.  In other studies, we’ve found that there are only minor differences in MicroC patterns between the blastoderm stage and stage 12-16 embryos.  While these minor differences are clearly interesting, we didn’t discuss them in the text.  In all of our experiments chromosomes are likely to be paired.  In NC14 embryos (the stage for visualizing eve stripes and the MicroC contact profiles in Fig. 2) replication of euchromatic sequences is thought to be quite rapid.  While homolog pairing is incomplete at this stage, sister chromosomes are paired.  In stage 12-16 embryos, homologs will be paired and if the cells are arrested in G2, then sister chromosomes will also be paired.  So in all of our experiments, chromosomes (sisters and/or homologs) are paired.  However, since we don’t have examples of unpaired chromosomes, our experiments don’t provide any info on how chromosome pairing might impact MicroC/expression patterns.

      (3) "P > 0.01" appears several times. I believe the authors mean to report "P < 0.01".

      Fixed.  

      References for Response

      Batut PJ, Bing XY, Sisco Z, Raimundo J, Levo M, Levine MS. 2022. Genome organization controls transcriptional dynamics during development. Science. 375(6580):566-570.

      Bing X, Ke W, Fujioka M, Kurbidaeva A, Levitt S, Levine M, Schedl P, Jaynes JB. 2024. Chromosome structure i: Loop extrusion or boundary:Boundary pairing? eLife.

      Blanton J, Gaszner M, Schedl P. 2003. Protein:Protein interactions and the pairing of boundary elements in vivo. Genes Dev. 17(5):664-675.

      Bonchuk A, Boyko K, Fedotova A, Nikolaeva A, Lushchekina S, Khrustaleva A, Popov V, Georgiev P. 2021. Structural basis of diversity and homodimerization specificity of zinc finger-associated domains in drosophila. Nucleic Acids Res. 49(4):2375-2389.

      Bonchuk A, Kamalyan S, Mariasina S, Boyko K, Popov V, Maksimenko O, Georgiev P. 2020. N-terminal domain of the architectural protein ctcf has similar structural organization and ability to self-association in bilaterian organisms. Sci Rep. 10(1):2677.

      Chen H, Levo M, Barinov L, Fujioka M, Jaynes JB, Gregor T. 2018. Dynamic interplay between enhancer–promoter topology and gene activity. Nat Genet. 50(9):1296.

      Fedotova AA, Bonchuk AN, Mogila VA, Georgiev PG. 2017. C2h2 zinc finger proteins: The largest but poorly explored family of higher eukaryotic transcription factors. Acta Naturae. 9(2):47-58.

      Fujioka M, Ke W, Schedl P, Jaynes JB. 2024. The homie insulator has sub-elements with different insulating and long-range pairing properties. bioRxiv. 2024.02.01.578481.

      Fujioka M, Mistry H, Schedl P, Jaynes JB. 2016. Determinants of chromosome architecture: Insulator pairing in cis and in trans. PLoS Genet. 12(2):e1005889.

      Galloni M, Gyurkovics H, Schedl P, Karch F. 1993. The bluetail transposon: Evidence for independent cis‐regulatory domains and domain boundaries in the bithorax complex. The EMBO Journal. 12(3):1087-1097.

      Gaszner M, Vazquez J, Schedl P. 1999. The zw5 protein, a component of the scs chromatin domain boundary, is able to block enhancer-promoter interaction. Genes Dev. 13(16):2098-2107.

      Goel VY, Huseyin MK, Hansen AS. 2023. Region capture micro-c reveals coalescence of enhancers and promoters into nested microcompartments. Nat Genet. 55(6):1048-1056.

      Harris HL, Gu H, Olshansky M, Wang A, Farabella I, Eliaz Y, Kalluchi A, Krishna A, Jacobs M, Cauer G et al. 2023. Chromatin alternates between a and b compartments at kilobase scale for subgenic organization. Nat Commun. 14(1):3303.

      Hart CM, Zhao K, Laemmli UK. 1997. The scs' boundary element: Characterization of boundary element-associated factors. Mol Cell Biol. 17(2):999-1009.

      Hsieh TS, Cattoglio C, Slobodyanyuk E, Hansen AS, Rando OJ, Tjian R, Darzacq X. 2020. Resolving the 3d landscape of transcription-linked mammalian chromatin folding. Mol Cell. 78(3):539-553.e538.

      Krietenstein N, Abraham S, Venev SV, Abdennur N, Gibcus J, Hsieh TS, Parsi KM, Yang L, Maehr R, Mirny LA et al. 2020. Ultrastructural details of mammalian chromosome architecture. Mol Cell. 78(3):554-565.e557.

      Kyrchanova O, Chetverina D, Maksimenko O, Kullyev A, Georgiev P. 2008. Orientation-dependent interaction between drosophila insulators is a property of this class of regulatory elements. Nucleic Acids Res. 36(22):7019-7028.

      Kyrchanova O, Ibragimov A, Postika N, Georgiev P, Schedl P. 2023. Boundary bypass activity in the abdominal-b region of the drosophila bithorax complex is position dependent and regulated. Open Biol. 13(8):230035.

      Kyrchanova O, Kurbidaeva A, Sabirov M, Postika N, Wolle D, Aoki T, Maksimenko O, Mogila V, Schedl P, Georgiev P. 2018. The bithorax complex iab-7 polycomb response element has a novel role in the functioning of the fab-7 chromatin boundary. PLoS Genet. 14(8):e1007442.

      Kyrchanova O, Sabirov M, Mogila V, Kurbidaeva A, Postika N, Maksimenko O, Schedl P, Georgiev P. 2019a. Complete reconstitution of bypass and blocking functions in a minimal artificial fab-7 insulator from drosophila bithorax complex. Proceedings of the National Academy of Sciences.201907190.

      Kyrchanova O, Wolle D, Sabirov M, Kurbidaeva A, Aoki T, Maksimenko O, Kyrchanova M, Georgiev P, Schedl P. 2019b. Distinct elements confer the blocking and bypass functions of the bithorax fab-8 boundary. Genetics.genetics. 302694.302019.

      Kyrchanova O, Zolotarev N, Mogila V, Maksimenko O, Schedl P, Georgiev P. 2017. Architectural protein pita cooperates with dctcf in organization of functional boundaries in bithorax complex. Development. 144(14):2663-2672.

      Li H-B, Muller M, Bahechar IA, Kyrchanova O, Ohno K, Georgiev P, Pirrotta V. 2011. Insulators, not polycomb response elements, are required for long-range interactions between polycomb targets in drosophila melanogaster. Mol Cell Biol. 31(4):616-625.

      Li X, Tang X, Bing X, Catalano C, Li T, Dolsten G, Wu C, Levine M. 2023. Gaga-associated factor fosters loop formation in the drosophila genome. Mol Cell. 83(9):1519-1526.e1514.

      Mohana G, Dorier J, Li X, Mouginot M, Smith RC, Malek H, Leleu M, Rodriguez D, Khadka J, Rosa P et al. 2023. Chromosome-level organization of the regulatory genome in the drosophila nervous system. Cell. 186(18):3826-3844.e3826.

      Muller M, Hagstrom K, Gyurkovics H, Pirrotta V, Schedl P. 1999. The mcp element from the drosophila melanogaster bithorax complex mediates long-distance regulatory interactions. Genetics. 153(3):1333-1356.

      Muravyova E, Golovnin A, Gracheva E, Parshikov A, Belenkaya T, Pirrotta V, Georgiev P. 2001. Loss of insulator activity by paired su(hw) chromatin insulators. Science. 291(5503):495-498.

      Postika N, Metzler M, Affolter M, Müller M, Schedl P, Georgiev P, Kyrchanova O. 2018. Boundaries mediate long-distance interactions between enhancers and promoters in the drosophila bithorax complex. PLoS Genet. 14(12):e1007702.

      Samal B, Worcel A, Louis C, Schedl P. 1981. Chromatin structure of the histone genes of d. Melanogaster. Cell. 23(2):401-409.

      Sigrist CJ, Pirrotta V. 1997. Chromatin insulator elements block the silencing of a target gene by the drosophila polycomb response element (pre) but allow trans interactions between pres on different chromosomes. Genetics. 147(1):209-221.

      Udvardy A, Schedl P. 1984. Chromatin organization of the 87a7 heat shock locus of drosophila melanogaster. J Mol Biol. 172(4):385-403.

      Vazquez J, Muller M, Pirrotta V, Sedat JW. 2006. The mcp element mediates stable long-range chromosome-chromosome interactions in drosophila. Molecular Biology of the Cell. 17(5):2158-2165.

      Zolotarev N, Fedotova A, Kyrchanova O, Bonchuk A, Penin AA, Lando AS, Eliseeva IA, Kulakovskiy IV, Maksimenko O, Georgiev P. 2016. Architectural proteins pita, zw5, and zipic contain homodimerization domain and support specific long-range interactions in drosophila. Nucleic Acids Res. 44(15):7228-7241.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This valuable study examines the role of a host in conditions that shift pathogenicity of opportunistic microbes. The use of single-cell microbial transcriptomics and metabolomics to demonstrate the host's effects on pathogen dynamics is interesting and convincing. However, the connection to host antimicrobial peptides driving these effects is incomplete and would benefit from additional evidence and improved explanation in the text. This paper has the potential to be of broad interest to those working in host-microbe (microbiome and pathogen) interactions.

      We thank the editors for handling our manuscript and providing the eLife assessment. We went through each comment and carried out the necessary experiments. In response to the comments, we provide additional evidence in this revised manuscript that further supports our findings.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this work, Wang and colleagues used Drosophila-Serratia as a host-microbe model to investigate the impact of the host on gut bacteria. The authors showed that Drosophila larvae reduce S. marcescens abundance in the food likely due to a combination of mechanical force and secretion of antimicrobial peptides. S. marcescens exposed to Drosophila larvae lost virulence to flies and could promote larval growth similar to typical Drosophila gut commensals. These phenotypic changes were reflected in the transcriptome and metabolome of bacteria, suggesting that the host could drive the switch from pathogenicity to commensalism in bacteria. Further, the authors used single-cell bacterial RNA-seq to demonstrate the heterogeneity in gut bacterial populations.

      Strengths:

      This is a valuable work that addresses an important question of the effect of the host on its gut microbes. The authors could convincingly demonstrate that gut bacteria are strongly affected by the host with important consequences for both interacting partners. Moreover, the authors used state-of-the-art bacterial single-cell RNA-seq to reveal heterogeneity in host-associated commensal populations.

      Weaknesses:

      Some of the conclusions are not fully supported by the data.

      Specifically, in lines 142-143, the authors claim that larva antagonizes the pathogenicity of S. marcescens based on the survival data. I do not fully agree with this statement. An alternative possibility could be that, since there are fewer S. marcescens in larvae-processed food, flies receive a lower pathogen load and consequently survive. Can the authors rule this out?

      Also, the authors propose that Drosophila larvae induce a transition from pathogenicity to commensalism in S. marcescens and provide nice phenotypic and transcriptomic data supporting this claim. However, is it driven only by transcriptional changes? Considering high mutation rates in bacteria, it is possible that S. marcescens during growth in the presence of larvae acquired mutations causing all the observed phenotypic and transcriptional changes. To test this possibility, the authors could check how long S. marcescens maintains the traits it acquires during growth with Drosophila. If these traits persist after reculturing isolated bacteria, it is very likely they are caused by genome alterations, if not - likely it is a phenotypic switch driven by transcriptional changes.

      We thank the reviewer for providing a feasible method to distinguish the shift in transcriptional profile from genomic mutations. According to this valuable suggestion, we checked phenotypic and transcriptional changes after re-culturing the bacterium that had coexisted with larvae. We found that all phenotypes can be recovered after re-culturing. The new data supported our previous result that a phenotypic switch was driven by transcriptional changes rather than genome mutations. We now add these results to the text with figure supplement 3 (line 147-151, 192-194). Please see the following text.

      “To rule out the possibility that phenotypic alterations could stem from genomic mutations, we examined the prodigiosin yield and CFUs of re-culturing S. marcescens that had coexisted with larvae. Our results showed that neither prodigiosin yield nor CFUs of re-culturing S. marcescens differed from the original strain (Figure 2-figure supplement 3A-C), suggesting that a phenotypic switch was driven primarily by transcriptional reprogramming.” “Consistent with the previous result that this phenotypic switch was driven by transcriptional changes, the expression of virulent and growth genes was recovered after re-culturing (Figure 3-figure supplement 3D, E).”

      For the first question, we admit the possibility that the high mortality of flies could result from the acquisition of a higher pathogen load, because of an increase in the bacterial load of single S. marcescens. However, host pathogenesis is normally determined by the virulence of pathogens rather than the number of bacteria. For example, hosts constantly harbor an astonishing number of commensals in their guts, but remain healthy. This suggests that it is the property (virulence) of a pathogen that matters more for the health status of the host. Moreover, an increase in the virulence of single S. marcescens was verified by real-time PCR (Fig. 2F) and TE (Fig. 2G). Taken together, we conclude that the impaired survival of flies challenged with single S. marcescens mainly arose from an increase in the virulence of S. marcescens. Thanks for your understanding!

      Reviewer #2 (Public Review):

      Summary:

      While many studies have explored the impacts of pathogens on hosts, the effect of hosts on pathogens has received less attention. In this manuscript, Wang et al. utilize Drosophila melanogaster and an opportunistic pathogen, Serratia marcescens, to explore how the host impacts pathogenicity. Beginning with an observation that larval presence and density impacted microbial growth in fly vials (which they assess qualitatively as the amount of 'slick' and quantitatively as microbial load/CFUs), the authors focus on the impact of axenic/germ-free larvae on an opportunistic pathogen S. marcescens. Similar to their observations with general microbial load, they find that larvae reduce the presence of a pinkish slick of Sm, indicative of its secondary metabolite prodigiosin. The presence of larvae alters prodigiosin production, pathogen load, pathogen cellular morphology, and virulence, and this effect is through transcriptional and metabolic changes in the pathogen. Overall, they observe a loss of virulence factors/pathways and an increase in pathways contributing to growth. Given the important role the host plays in this lifestyle shift, the authors then examined host features that might influence these effects, focusing on the role of antimicrobial peptides (Amps). The authors combine the use of synthetic Amps and an Amp-deficient fly line and conclude much of the larval inhibitory effect is due to their production of AMPs.

      Strengths:

      This is a very interesting question and the use of Drosophila-Serratia marcescens is a great model to explore these interactions and effects.

      The authors have an interesting and compelling phenotype and are asking a unique question on the impact of the host on the pathogen. The use of microbial transcriptomics and metabolomics is a strength, especially in order to assess these impacts on the pathogen level and at the single-cell level to capture heterogeneity.

      Weaknesses:

      Overall, the writing style in the manuscript makes it difficult to fully understand and appreciate the data and its interpretation.

      The data on the role of AMPs would benefit from strengthening. Some of the arguments in the text of that section are also counterintuitive. The authors show that △AMP larvae have a reduced impact on Sm as compared to wt larvae, but it seems less mild of an effect than that observed with wt excreta (assuming the same as secreta in Figures 7, should be corrected or harmonized). Higher doses of AMPs give a phenotype similar to wt larvae, but a lower dose (40 ng/ul) gives phenotypes more similar to controls. The authors argue that this data suggests AMPs are the factor responsible for much of the inhibition, but their data seems more to support that it's synergistic- you seem to still need larvae (or some not yet defined feature larvae make, although secreta/excreta was not sufficient) + AMPs to see similar effects as wt. Based on positioning and color scheme guessing that AMP 40ng/ul was used in Figures 7D-H, but could not find this detail in the text, methods, or figure legend and it should be indicated. This section does not seem to be well supported by the provided data, and this inconsistency greatly dampened this reviewer's enthusiasm for the paper.

      We thank the reviewer for the valuable comments and suggestions. We admit that some photos of the pinkish slick (prodigiosin) are counterintuitive in Figure 7 as well as figure supplement 2B. The reason is as follows. Single S. marcescens produced prodigiosin that stayed only on the surface of the fly agar medium. Larvae, however, agitate the food and stratify the prodigiosin, so that even a higher prodigiosin yield inside the food can look lighter than a surface slick of prodigiosin. We mentioned this in the previous manuscript (lines 166-168). This is why some photos of samples treated with excreta or a lower dose of AMP seemed more intense than those with WT larvae. However, we precisely quantified the prodigiosin yield inside the food with the spectrophotometer, so we provided a prodigiosin yield following the photos of the slick. Therefore, we drew our conclusions mainly from the quantification of the prodigiosin yield. We actually used cecropin A for our experiments, so we added this information to the text. We hope that our replies can reignite your enthusiasm for our manuscript, and thanks for your great support!

      Reviewer #3 (Public Review):

      In this study, Wang and coworkers established a model of Drosophila-S. marcescens interactions and thoroughly examined host-microbe bidirectional interactions. They found that:

      (1) Drosophila larvae directly impact microbial aggregation and density;

      (2) Drosophila larvae affect microbial metabolism and cell wall morphology, as evidenced by reduced prodigiosin production and EPS production, respectively;

      (3) Drosophila larvae attenuate microbial virulence;

      (4) Drosophila larvae modulate the global transcription of microbes for adaptation to the host;

      (5) Microbial single-cell RNA sequencing (scRNA-seq) analysis revealed heterogeneity in microbial pathogenicity and growth;

      (6) AMPs are key factors controlling microbial virulence phenotypes.

      Taken together, they concluded that host immune factors such as AMPs are directly involved in the pathogen-to-commensal transition by altering microbial transcription.

      General comments:

      In general, this study is intriguing as it demonstrates that host immune effectors such as AMPs can serve as critical factors capable of modulating microbial transcription for host-microbe symbiosis. However, several important questions remain unanswered. One such question is: What is the mechanism by which AMPs modulate the pathogen-to-commensal transition? One hypothesis suggests that antimicrobial activity may influence microbial physiology, subsequently modulating transcription for the transition from pathogen to commensal. In this context, it is imperative to test various antibiotics with different modes of action (e.g., targeting the cell wall, transcription, or translation) at sub-lethal concentrations to determine whether sub-lethal doses of antimicrobial activity are sufficient to induce the pathogen-to-commensal transition.

      Thank you for the important comments on our manuscript. We tested the effect of antibiotics (5 μg/μl kanamycin and 10 μg/μl ampicillin) on the virulence switch of S. marcescens. We found that both antibiotics, at sub-lethal doses, similarly decreased the prodigiosin yield and virulence expression of S. marcescens. Intriguingly, the two antibiotics also caused a dramatic decline in the bacterial load and in the expression of genes involved in cell growth. These results suggest that the antibiotics reduced virulence primarily by suppressing most bacterial activities.

      We found that larvae and AMPs at 40 μg/μl modestly decreased the bacterial load and increased the relative expression of genes involved in cellular proliferation, suggesting that AMPs could maintain the exponential phase of bacterial growth. This result is consistent with the finding that Drosophila larvae can support the long-term persistence of commensals in the shared habitat (DOI: 10.1016/j.cmet.2017.11.011). It is likely that, by maintaining S. marcescens in the exponential phase of growth, AMPs prevent the bacteria from rapidly exhausting their nutritional resources and consequently help maintain symbiosis.

      Author response image 1.

      (A) Representative images of surface slick with S. marcescens alone, with kanamycin (5 μg/μl) and ampicillin (10 μg/μl). (B) The prodigiosin production of S. marcescens alone, with kanamycin (5 μg/μl) and ampicillin (10 μg/μl). n = 6 for each. (C) Bacterial loads of S. marcescens alone, with kanamycin (5 μg/μl) and ampicillin (10 μg/μl). n = 6 for each. (D, E) RT-qPCR analysis of the expression levels of downregulated and upregulated genes in S. marcescens alone, with kanamycin (5 μg/μl) and ampicillin (10 μg/μl). n = 3 for each. Means ± SEMs. Different letters indicate a significant difference (p < 0.05); variables that share a letter are not significantly different (p > 0.05). ns, no significance. Kruskal-Wallis test followed by Dunn’s multiple comparisons test.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Here are some specific points that need to be addressed:

      (1) Lack of statistical analysis for many figures. The authors should perform and report the statistical analysis for all figures where it is currently lacking, specifically, Figures 2C, D, E, F, H; Figures 3E, F; Figures 7G, H; Figure S2E, Figures S3D, E.

      Thanks for your valuable suggestions. We re-checked the manuscript and performed the statistical analysis for these figures.

      (2) For graphs showing dots, it should be specified what exactly individual dots show and how many animals were used per replicate. Also, time points at which specific analysis was performed should be specified.

      We provided the important information in the legends in the revised manuscript.

      (3) Figure 2. No letters illustrating statistical significance are shown, although this is claimed in the legend (line 848).

      We added statistical significance in the updated Figure 2.

      (4) In Figure 7, the authors used AMPs of defined concentration, but it is not specified what exactly these AMPs are. Please provide the full composition of the AMP mix used.

      We used the antimicrobial peptide cecropin A, which is produced by silkworms. We added this information to the Methods (lines 487-488) and the Figure 7 legend.

      (5) Figure S2B. To me, it looks like that medium with larvae is redder than after mechanical force. I find it hard to believe the quantification in panel C that the medium with larvae has 3 times less pigment as compared to the mechanical force.

      Larvae could only agitate the surface of the food (~0.4 cm), whereas sticks completely agitated the food to a depth of 3 cm. Thus, the layer of food containing pink pigment was much deeper after stick agitation than with larvae, which accounts for the counterintuitive appearance. We explained this in the previous manuscript (lines 166-168): “Of note, the surface of the slick with agitation appeared lighter than that of larvae, mainly due to a stratification of prodigiosin following agitation.”

      (6) The authors need to proofread the manuscript as there are missing words, terms that need definition, and wrong terms. For example, L86 - naked eye?, L117 - what do the authors mean by co-culture?, L309 - not resist but rather combat, L347 - Species? or competition?, Figure 2A - 2nd?

      We have corrected these errors in the new manuscript. We added an "eye" in L86. Co-culture means “S. marcescens in co-culture”. Interspecies competition for nearly the same or similar nutrients and space occurs in the habitat.

      (7) The authors should reorganize either the text or the figures' order in a way that the figures are described in a consecutive order (Figure 1A, B ... and not Figure 1D first and then 1A).

      Thanks for your valuable advice. We reorganized the order of the text.

      (8) Do the authors have an idea which bacteria they quantified in Figures 1E to 1G? I didn't find the medium that was used for culturing. Also, in Figure 1F, Is the control group comprised of females or males?

      Mixed bacteria (bacteria in the living environment of Drosophila) were quantified on NA medium, which supports the growth of Drosophila microbiota (Jia Y, et al. Nat Commun. 2021) (lines 474-475). The control group comprised both males and females at a 1:1 ratio. Similarly, the aged group contained 100 50-day-old flies (male:female = 1:1). We provided these details in the Figure 1 legend (lines 849-850 and 851-852).

      (9) L118-129. it is not possible to make all these statements without any statistical analysis. To me, at 96h both treatments have the same CFUs, while the authors claim they are different.

      We added statistical analysis in the current version. In fact, the population of S. marcescens alone collapsed after 72 h post-inoculation, and its CFU number declined step by step. The bacterial load of S. marcescens in co-culture was comparable to (at 96 h post-inoculation, p>0.05) or higher than (at 120 h post-inoculation, p<0.001) that of S. marcescens alone, possibly because the bacteria cultured alone rapidly exhausted their nutritional resources and collapsed through population suicide. We rewrote this sentence (lines 125-129) in the updated manuscript.

      (10) L136. term "symbionts" is not appropriate here.

      We changed “symbionts” to “S. marcescens”.

      (11) In Figure 1, the authors used flies of different fitness: weak, strong, and infertile. They should be specific and describe exactly what these terms mean, are these mutants or treatments that affect the fitness?

      We apologize for the missing information and have added it to the Methods and the legend: strong flies (wild-type fly CS), weak flies (yw; Sp/CyO; MKRS/TM6B), and infertile flies (dfmr150M null mutant); see the Figure 1 legend (lines 849-850).

      (12) Figure S2. The title of this figure is misleading, please modify it. Mechanical force did affect S. marcescens but to a lesser degree as compared to larvae.

      Thank you for your suggestion. We admit that mechanical force affected S. marcescens but to a lesser degree as compared to larvae, so we changed the title to "Biological factors mainly determine S. marcescens lifestyle."

      Reviewer #2 (Recommendations For The Authors):

      General improvement to writing and presentation (see below):

      Describing confluent growth would make more sense than 'slick' and then using descriptions of broken, etc. "colour intensity of the surface slick".

      We used “slick” to describe visible surface films of bacteria, following a previous study (DOI: 10.1038/s43705-023-00307-8). “Slick” is equivalent to confluent growth but is simpler. For clarity, we added this reference to the text.

      We reorganized the text of Figure 1.

      Suggest more specific language to describe observations. For example: Bacterial loading - S. marcescens growth (for example: the presence of dense fly populations reduced Sm growth).

      Thanks for the suggestions. We replaced some of these terms.

      Symbiont, microbiota, microbiome, etc were all used interchangeably throughout the manuscript, but I am not sure I would call Sm part of the indigenous microbiome. Suggest to ensure proper usage and then harmonize throughout the ms.

      We used microbes and microbiome to replace symbiont and microbiota, respectively.

      Details missing from the message and Figure legends that would be helpful (including and especially Figure 7 - what AMP concentration?)

      Thanks for the valuable comments. Accordingly, we provided concrete details about the AMPs, including their source and concentration, in the Materials and methods and the Figure 7 legend (lines 487-488 and 954-955). Please see the response below.

      L73: define 'these issues" maybe or lead better with the prior sentence, it is not evident as currently written.

      We changed “to address these issues” to “To investigate whether and/or how the host modulates bacterial lifestyles,” and merged the two paragraphs.

      L74: repetitive sentence with the above.

      Thanks for pointing out this detail. We deleted it.

      L86: naked 'eye'.

      Added.

      L87: what is meant by 'weak flies'?

      Genotypes were added in the updated manuscript. Weak fly stocks display weaker activity and generate fewer eggs than WT flies.

      L96: bacterial load, not loading.

      Corrected.

      L128: no evidence to support, could be reflective of increased numbers in dying/dead larvae that impact total numbers in the vial.

      The number of CFUs of S. marcescens alone had gradually decreased by 96 h post-inoculation. In addition, we observed a pale biofilm on the surface of the medium at the late stage. Because the number of CFUs of S. marcescens alone at the later stages was reduced relative to the peak load at 48 h post-inoculation, we inferred that the bacteria could undergo ecological suicide. Ecological suicide of a bacterial population was similarly examined by recording the number of CFUs in the medium over time (Ratzke C, et al. Nat Ecol Evol. 2018). Taken together, we conclude that the bacteria possibly underwent ecological suicide.

      L129: the prior sentence is in contradiction, reduced load only at early time points in the presence of larvae....

      Thanks for pointing out this detail. We added “before 72 h post-inoculation” to the sentence.

      L134: data is only focused on S marcescens, so inferring to 'symbionts' broadly is outside study.

      We changed “symbionts” to “S. marcescens”.

      L139: sentence poorly written and confusing.

      We re-organized this sentence.

      To this end, we sought to examine the S. marcescens lifestyle switch from pathogenicity to commensalism by assessing the respective survival of flies on the fly medium that had been processed by single or coexisting S. marcescens.

      L189: evidence for long-term symbiosis is not well established in this paper, suggest editing this language throughout to more specifically reflect what the data supports and leave such interpretations to discussion points and future work.

      Thanks for your valuable advice. We deleted long-term and “thereby promoting the fitness of symbionts in the long maintenance.”.

      L192; used metabolomics to assess the impacts of larvae on bacterial metabolism, as currently written does not make sense.

      We rewrote this sentence. “Next, we investigated whether larvae could further elicit changes in the metabolism of S. marcescens using untargeted metabolomics.”

      L331: the use of monitored here is not correct/odd.

      We changed “monitored” to “reshaping”.

      L340: While the authors initially see a cost to Sm in reduced load (CFUs) at 120 h populations associated with larvae become higher - there is also a cost to producing virulence factors, which their RNASeq and metabolomics data support - trade-offs between growth and virulence.

      Thanks for your suggestion. We added “before 72 hours post inoculation” to define the early stage of the bacterial growth in the sentence.

      Reviewer #3 (Recommendations For The Authors):

      (1) Figures 1 A-D: What defines weak and strong flies, and what criteria determine the robustness of flies? How was the experiment conducted? The manuscript lacks details on this matter.

      We thank you for your comments. We lack a formal criterion; the robustness of flies was judged from daily experience. Weak fly stocks display weak activity and generate fewer eggs than WT flies. Genotypes with different robustness were added to the legend in the updated manuscript.

      (2) The authors mentioned, "Noteworthily, the number of CFUs of S. marcescens alone was lower than S. marcescens in co-cultures at the late stage (at 96 h post inoculation), likely that bacteria rapidly exhausted their nutritional resources and underwent ecological suicide." How did they determine that the bacteria exhausted nutritional resources and underwent ecological suicide? One might speculate that larvae could have removed the bacteria simply by consuming them.

      Thanks for this comment. In fact, there were no larvae inside the vials with S. marcescens alone, so bacterial cells were not consumed. However, the number of CFUs of S. marcescens alone at the later stages was reduced relative to the peak load at 48 h post-inoculation, so we inferred that the bacteria could undergo ecological suicide. Ecological suicide of a bacterial population has been examined by recording the number of CFUs in the medium over time (Ratzke C, et al. Nat Ecol Evol. 2018), and we applied a similar method to the number of CFUs of S. marcescens. Taken together, we conclude that the bacteria possibly underwent ecological suicide.

      (3) Figure 2E: The experimental details should be provided in the text. What was the CFU of the bacteria used in this survival experiment?

      We provided further experimental details in the legend line 869-870. The same amount of inocula was used in both single and coculturing S. marcescens.

      (4) The experimental data in Figures 2G and 2H do not sufficiently prove the relationship between the width of the cell wall and virulence, as it lacks experimental validation.

      Previous studies (DOI: 10.1371/journal.ppat.1005946) reveal that glucosylating toxins on the surface are primary virulence determinants, so an increased surface-anchored polysaccharide and protein profile promotes the virulence of the pathogen. Alterations in cell surface (the width of the cell wall) can be examined by TE. Moreover, TE was used to observe changes in the virulence of S. marcescens (DOI: 10.1093/nar/gkab1186). We think that the width of the cell wall could be used to reflect virulence in S. marcescens.

      (5) While it's acknowledged that agitation decreases the color intensity of the bacteria, comparing mechanical agitation with larval crawling seems inappropriate, as the mechanical forces exerted by both methods are not of the same magnitude.

      Thanks for the suggestion. In fact, food was agitated more heavily by glass sticks than by larvae, because larvae merely agitated the surface of the food (to a depth of about 0.5 cm). If the decrease in bacterial load and color were related to the magnitude of agitation, larvae would produce a smaller decrease in bacterial load than the sticks. This comparison therefore further supports our result that biological factors contribute more to the inhibition of S. marcescens than mechanical force does.

      (6) Figure 4D: with this metabolome data, they mentioned, "host suppresses differentiation of S. marcescens into the population with pathogenicity." What evidence supports the claim that downregulation of amino acid metabolism, phosphotransferase system, and ABC transporter directly correlates with decreased pathogenicity?

      Thanks for the comment. Earlier studies showed that amino acid-derived quorum sensing molecules are closely related to bacterial pathogenicity (Defoirdt T. PLoS Pathog. 2019; Wen J, et al. Microbiol Spectr. 2022). Moreover, the phosphotransferase system and ABC transporters can transport and/or produce virulence factors. Therefore, we claimed that downregulation of amino acid metabolism, the phosphotransferase system, and ABC transporters was related to decreased pathogenicity. To support this claim, we added some references in the updated manuscript (lines 662-664 and 827-830).

      (7) Serotonin: Does serotonin also reduce the virulence of S. marcescens?

      Our preliminary result showed that serotonin could indeed reduce the virulence of S. marcescens (figure supplement 4): when S. marcescens alone was treated with serotonin, the survival rate of adult flies increased and the expression levels of virulence-related genes were reduced.

      (8) Figures 6D, E, H, I: The expression of key genes should be verified using quantitative real-time polymerase chain reaction (qRT-PCR), as scRNA-seq expression levels might not accurately reflect the true expression levels.

      Bacterial single-cell RNA-seq can evaluate alterations in gene expression at single-cell resolution. The expression of key genes identified by scRNA-seq changed only in subpopulations, so the average expression of these genes would appear comparable when measured across a large mixed population. We are therefore concerned that qRT-PCR would be unsuitable for verifying the expression of genes in subpopulations.

      (9) Figure 7: The authors mentioned. "AMPs were supplemented to fly food". However, I could not find information regarding which AMPs and their respective concentrations (i.e., concentration of each AMP) were used in this study. This is a critical aspect of the research; therefore, details should be provided.

      Thanks for your important suggestions. We used the antimicrobial peptide cecropin A, which is produced by silkworms. We provided this information in the methods line 487-488. The concentrations of cecropin A were added in Figure 7 legend.

      (10) Figure 7: Delta AMP + AMP exhibited a stronger effect on the bacteria compared to AMP alone, indicating that immune effectors other than AMP may be involved. Since the IMD pathway is necessary for most immune effectors, including AMP, it would be interesting to test IMD pathway mutant animals and compare them with Delta AMP.

      We appreciate this important question. Indeed, Delta AMP + AMP exhibited a stronger effect on the bacteria than AMP alone. We acknowledge that immune effectors other than AMPs may be involved. Alternatively, mechanical force may account, to a lesser extent, for the stronger effect on the bacteria (explained by larval agitation in figure supplement 2). To rule out the involvement of other immune effectors, we examined the effect of total immune effectors on the bacterial load and the prodigiosin yield of S. marcescens using the IMD pathway mutant (RelE20 larvae). Our result showed that the optical density and yield of prodigiosin in the Delta AMP group did not differ significantly from those in the RelE20 group. Moreover, the load of S. marcescens associated with the Delta AMP mutant was comparable to that of S. marcescens associated with the RelE20 mutant. These results suggest that AMPs play a major role in recapitulating the response of _S. marcescens_ to larvae.

      “To rule out the potential role of other immune effectors, we turned to the IMD pathway mutant RelE20 that is deficient in total immune effectors. Our result showed that the optical density and yield of prodigiosin in RelE20 group did not significantly differ from the ones in Delta AMP group (figure supplement 7A, B). Moreover, the load of S. marcescens associated with RelE20 mutant was comparable to that of S. marcescens associated with Delta AMP mutant (figure supplement 7C).”

      We have now added these results to the text (lines 326-331).

    1. Author response:

      The following is the authors’ response to the original reviews

      List of major changes

      (1) We have emphasized the assumptions underlying our modeling approach in the third paragraph of the Introduction section.

      (2) We have included a new paragraph in the Discussion section to compare our model with a molecular mechanism-oriented model.

      (3) We have included a new paragraph at the end of the Introduction section to outline the main content of each subsection in Results and the logical connections between them. Correspondingly, the chapter hierarchy and section titles have been adjusted.

      (4) The Supplementary Material includes an additional table (Table S2) that provides detailed explanations of the symbols used in the model.

      (5) We have included a new paragraph in the Introduction section to explicitly emphasize the phenomenological nature of our model and its broad applicability.

      (6) In the Osmoregulation subsection, we have added a discussion on how our model can be directly generalized to scenarios involving the environmental uptake of osmolytes.

      (7) We have included a more detailed examination of the limitations inherent in our modeling approach in the second last paragraph of the Discussion section.

      (8) In the third last paragraph of the Discussion section, we have explicitly demonstrated that our model does not conflict with the observation that, in E. coli, cell wall synthesis is not directly regulated by the turgor pressure.

      Reviewer #1 (Public review):

      Summary:

      A theoretical model for microbial osmoresponse was proposed. The model assumes simple phenomenological rules: (i) the change of free water volume in the cell due to osmotic imbalance, based on pressure balance; (ii) osmoregulation, which assumes a change in proteome partitioning that depends on the osmotic pressure and affects the production of the osmolyte-producing protein; (iii) cell-wall synthesis regulation, in which a change in turgor pressure alters the cell-wall synthesis efficiency so that the turgor pressure returns to its target value; and (iv) the effect of intracellular crowding, assuming that biochemical reactions slow down with more crowding and stop when the protein density (protein mass divided by free water volume) reaches a critical value. The parameter values were found in the literature or obtained by fitting to the experimental data. The authors compare the model behavior with various microorganisms (E. coli, B. subtilis, S. cerevisiae, S. pombe), and successfully reproduce the overall trend (steady-state behavior for many of them, dynamics for S. pombe). In addition, the model predicts non-trivial behavior such as the fast cell growth just after the hypoosmotic shock, which is consistent with experimental observation. The authors further make experimentally testable predictions regarding mutant behavior and transient dynamics.

      Strength:

      The theory assumes simple mechanistic dependence between core variables without going into specific molecular mechanisms of regulations. The simplicity allows the theory to apply to different organisms by adjusting the time scales with parameters, and the model successfully explains broad classes of observed behaviors. Mathematically, the model provides analytical expressions of the parameter dependences and an understanding of the dynamics through the phase space without being buried in the detail. This theory can serve as a base to discuss the universality and diversity of microbial osmoresponse.

      We would like to thank Reviewer 1 for thoroughly reading our work and appreciating our theoretical approach to investigating microbial osmotic response.

      Weakness:

      The core part of this model is that everything is coupled with growth physiology, and, as far as I understand, the assumption (iv) (Eq. 8) that imposes the global reaction rate dependence on crowding plays a crucial role. I would think this is a strong and interesting assumption. However, the abstract or discussion does not discuss the importance of this assumption. In addition, the paper does not discuss gene regulation explicitly, and some comparison with a molecular mechanism-oriented model may be beneficial to highlight the pros and cons of the current approach.

      We thank Reviewer 1 for their very helpful feedback. We have significantly revised the manuscript as suggested by Reviewer 1. See the detailed answers in the following.

      Reviewer #1 (Recommendations for the authors)

      (1) Explicitly stating the assumption (iv) in the abstract and discussing its role would help readers understand.

      In the revised manuscript, we have significantly rewritten the third paragraph of the Introduction section to emphasize our key assumptions as suggested by Reviewer 1, including the relationship between global reaction rate and crowding:

      “Our model assumes the following phenomenological rules: (1) the change in free water volume within the cell is driven by osmotic imbalance (Cadart et al., Nature Physics, 2019; Rollin et al., Elife, 2023), while the remaining volume changes in proportion to protein production; (2) osmoregulation influences the production of osmolyte-producing protein, governed by intracellular protein density (Scott et al., Science, 2010); (3) cell-wall synthesis is regulated through a feedback mechanism, wherein turgor pressure modulates the efficiency of cell-wall synthesis, enabling the cell to maintain a relatively stable turgor pressure; and (4) intracellular crowding slows down biochemical reactions as cytoplasmic density increases, with reactions ceasing entirely when protein density reaches a critical threshold.”

      We have also modified the abstract to mention the crowding effects explicitly. Additionally, we have added a few sentences in the first and second paragraphs of the Discussion section to emphasize the importance of crowding effects to our conclusions regarding the growth rate reduction in steady states and the non-monotonic dependence of the growth rate peak on the shock amplitude after a hyperosmotic shock.
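
      As a rough illustration of rule (4), the sketch below uses a linear slowdown factor that equals 1 in a dilute cytoplasm and falls to 0 at the critical protein density. The linear form and the names `crowding_factor`, `protein_density`, and `critical_density` are assumptions made for this example; the functional form of Eq. 8 in the manuscript is not reproduced here and may differ.

```python
def protein_density(protein_mass: float, free_water_volume: float) -> float:
    """Protein density defined as protein mass divided by free water volume."""
    if free_water_volume <= 0:
        raise ValueError("free_water_volume must be positive")
    return protein_mass / free_water_volume


def crowding_factor(density: float, critical_density: float) -> float:
    """Hypothetical linear crowding rule (an assumption, not the paper's Eq. 8).

    Returns a multiplier for the global reaction rate: 1.0 at zero density,
    decreasing linearly, and clamped to 0.0 at or above the critical density.
    """
    if critical_density <= 0:
        raise ValueError("critical_density must be positive")
    return max(0.0, 1.0 - density / critical_density)


# Example: a cell loses free water after a hyperosmotic shock, so the same
# protein mass sits in a smaller volume and reactions slow down.
before = crowding_factor(protein_density(1.0, 2.0), critical_density=2.0)
after = crowding_factor(protein_density(1.0, 1.0), critical_density=2.0)
```

      In this toy setting the factor drops when free water is lost, which is the qualitative behavior rule (4) describes; any quantitative use would need the actual functional form and fitted parameters from the manuscript.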

      (2) I found [Shen W , Gao Z, Chen K, Zhao A, Ouyang Q, Luo C. The regulatory mechanism of the yeast osmoresponse under different glucose concentrations. Iscience. 2023 Jan 20;26(1)], which discusses the medium glucose concentration dependence of the response, focused on the gene regulatory circuit and the metabolic flux. As far as I understood, this paper considers the effect of the reallocation of resources but not the mechanical part of the osmoresponse such as pressure explicitly. It will be interesting to discuss the pros and cons in comparison with such a model. In principle, I will not be surprised if the current model does not differentiate the different glucose concentrations much since it is a more coarse-grained model, and I don't think it is a problem, but it will be good to have an explicit discussion.

      We appreciate Reviewer 1's insightful comment regarding the work by Shen et al. (iScience, 2023), which elucidates the two distinct osmoresponse strategies in yeast. By quantifying Hog1 nuclear translocation dynamics and downstream protein expression, the study reveals that in a rich medium, cells can leverage surplus glycolytic products as defensive reserves, reallocating metabolic flux to facilitate rapid adaptation to osmotic changes. Conversely, limited glycolytic intermediates in low-glucose environments necessitate increased enzyme synthesis for osmotic adaptation. 

      The paper highlighted by Reviewer 1 studies yeast's adaptive strategies under two stresses, nutrient limitation and osmotic pressure, and provides an important complement to our study.

      In our simplified model, we did not include the interaction between cell growth and osmolyte production, assuming a constant fraction of ribosomes translating ribosomal proteins, supported by experiments in E. coli (Dai et al., mBio, 2018). We remark that competitive dynamics for translational resources can be incorporated into our framework by changing the proportion of ribosomes translating themselves (X<sub>r</sub>) from a constant to a function of the translation strategy of the osmolyte-producing enzyme (X<sub>a</sub>).

      In the revised manuscript, we have included a new discussion in the third paragraph of the Discussion section to compare our approach with the molecular mechanism-oriented model:

      “We remark that our model is intrinsically a coarse-grained model with many molecular details regarding gene expression regulation neglected, which allows us to gain more analytical insights. In [Shen et al., iScience, 2023], the authors studied the responses to osmotic stress in glucose-limited environments and found that cells exhibited stronger osmotic gene expression response under glucose-limited conditions than under glucose-rich conditions. Using a computational model based on molecular mechanisms combined with experimental measurements, the authors demonstrated that in a glucose-limited environment, glycolysis intermediates were limited, which required cells to express more glycerol-production enzymes for stress adaptation. In the current version of our model, we do not account for the interaction between cell growth and osmolyte production; instead, we assume a constant fraction of ribosomes dedicated to translating ribosomal proteins. Our model can be further generalized to include the more complex interactions, including the coupling between biomass and osmolyte production, e.g., by allowing the fraction of ribosomes translating (X<sub>r</sub>) to depend on the translation strategy of the osmolyte-producing enzyme (X<sub>a</sub>).”

      (3) A minor comment: The authors call assumption (iii) (eq. 7) "positive feedback from turgor pressure to the cell-wall synthesis efficiency" (line 204). I have a hard time seeing this as positive feedback. It regulates the cell wall synthesis so that turgor pressure returns to the desired value; hence, isn't it negative feedback?

      We apologize for this confusion. We have removed the term "positive feedback" in the revised manuscript.

      Reviewer #2 (Public review):

      Summary:

      In this study, Ye et al. have developed a theoretical model of osmotic pressure adaptation by osmolyte production and wall synthesis.

      Strengths:

      They validate their model predictions of a rapid increase in growth rate on osmotic shock experimentally using fission yeast. The study has several interesting insights which are of interest to the wider community of cell size and mechanics.

      Weaknesses:

      Multiple aspects of this manuscript require addressing, in terms of clarity and consistency with previous literature. The specifics are listed as major and minor comments.

      Major comments:

      (1) The motivation for the work is weak and needs more clarity.

      We thank Reviewer 2 for this very helpful comment, which we believe has significantly improved our manuscript. We would like to clarify the two major motivations of our study. 

      First, we aim to construct a systems-level and coarse-grained model capable of elucidating the complex processes underlying microbial osmoresponse. By leveraging the separation of timescales associated with mechanical equilibrium, cell-wall synthesis regulation, and osmoregulation, our model facilitates in-depth analytical and numerical analysis of how these various processes interact during cellular adaptation. In particular, we demonstrate the key physiological functions of osmoregulation and cell-wall synthesis regulation.

      Second, we seek to apply this model to interpret the phenomenon of supergrowth observed in fission yeast Schizosaccharomyces pombe (Knapp et al., Cell Systems, 2019). This application addresses an essential challenge in experimental studies: exclusive knockout experiments can be difficult, and mechanistic interpretations of experimental observations are often lacking. Our theoretical framework offers a valuable tool for understanding such phenomena, contributing to the fundamental knowledge of microbial physiology and developing predictive models for microbial behavior under osmotic stress.

      In the revised manuscript, we have included a new paragraph at the end of the Discussion section to emphasize our motivations better:

      “In this work, we construct a systems-level and coarse-grained model capable of elucidating the complex processes underlying microbial osmoresponse. By leveraging the separation of timescales associated with mechanical equilibrium, cell-wall synthesis regulation, and osmoregulation, our model facilitates in-depth analytical and numerical analysis of how these various processes interact during cellular adaptation. In particular, we demonstrate the key physiological functions of osmoregulation and cell-wall synthesis regulation. We then apply this model to interpret the unusual phenomenon of supergrowth observed in fission yeast. This application addresses an essential challenge in experimental studies: exclusive knockout experiments can be difficult, and mechanistic interpretations of experimental observations are often lacking. Our theoretical framework offers a valuable tool for understanding such phenomena, contributing to the fundamental knowledge of microbial physiology and developing predictive models for microbial behavior under osmotic stress.”

      (2) The link between sections is very frequently missing. The authors directly address the problem that they are trying to solve without any motivation in the results section.

      We are grateful to Reviewer 2 for their valuable feedback. In the revised manuscript, we have included a new paragraph at the end of the Introduction section to outline the main content of each subsection in Results and the logical connections between them:

      “In the following “Results” section, we begin by outlining the primary assumptions and equations of our model in the subsection "Model Description," which includes four parts, each addressing one of the four phenomenological rules. Additional details can be found in Methods. We then proceed to the subsection “Steady states in constant environments”, where we employ our theoretical framework to analyze steady-state growth and examine how the growth rate varies with external osmolarity. In the “Transient dynamics after a constant osmotic shock” subsection, we investigate the time-dependent osmoresponse after a constant hyperosmotic and hypoosmotic shock. Finally, in “Comparison with experiments: supergrowth phenomena after osmotic oscillation”, we address the supergrowth phenomena observed in S. pombe, utilizing our model to elucidate these experimental observations.”

      (3) The parameters used in the models (symbols) need to be explained better to make the paper more readable.

      We apologize for this confusion. In the revised Supplementary Material, we have included an additional table (Table S2) to explain the meanings of the symbols employed in the model to help the reader better understand.

      (4) Throughout the paper, the authors keep switching between organisms that they are modelling. There needs to be some consistency in this aspect where they mention what organism they are trying to model, since some assumptions that they make may not be valid for both yeast as well as bacteria.

      We thank Reviewer 2 for this very helpful comment. We would like to clarify that our model is coarse-grained without including detailed molecular mechanisms; therefore, it presumably applies to various species of microorganisms. Indeed, the predicted steady-state growth curves derived from our model and the experimental data obtained from various organisms agree reasonably well (Figure 2A of the main text). 

      In the revised manuscript, we have explicitly emphasized the nature of our phenomenological model and its broad applicability in the fourth paragraph of the Introduction section:

      “We remark that our model is coarse-grained, without including detailed molecular mechanisms, and is therefore applicable across diverse microbial species. Notably, the predicted steady-state growth rate as a function of internal osmotic pressure from our model aligns well with experimental data from diverse organisms. This alignment allows us to quantify the sensitivities of translation speed and regulation of osmolyte-producing protein in response to intracellular density. Additionally, we demonstrate that osmoregulation and cell-wall synthesis regulation enable cells to adapt to a wide range of external osmolarities and prevent plasmolysis. Our model also predicts a non-monotonic time dependence of growth rate and protein density as they approach steady-state values following a constant osmotic shock, in concert with experimental observations (Rojas et al., PNAS, 2014; Rojas et al., Cell systems, 2017). Moreover, we show that a supergrowth phase can arise following a sudden decrease in external osmolarity, driven by cell-wall synthesis regulation, either through the direct application of a hypoosmotic shock or the withdrawal of an oscillatory stimulus. Remarkably, the predicted amplitudes of supergrowth (i.e., growth rate peaks) quantitatively agree with multiple independent experimental measurements.”

      Furthermore, we have also included a comparison with a detailed molecular mechanism model in the third paragraph of the Discussion section:

      “We remark that our model is intrinsically a coarse-grained model with many molecular details regarding gene expression regulation neglected, which allows us to gain more analytical insights. In [Shen et al., iScience, 2023], the authors studied the responses to osmotic stress in glucose-limited environments and found that cells exhibited stronger osmotic gene expression response under glucose-limited conditions than under glucose-rich conditions. Using a computational model based on molecular mechanisms combined with experimental measurements, the authors demonstrated that in a glucose-limited environment, glycolysis intermediates were limited, which required cells to express more glycerol-production enzymes for stress adaptation. In the current version of our model, we do not account for the interaction between cell growth and osmolyte production; instead, we assume a constant fraction of ribosomes dedicated to translating ribosomal proteins. Our model can be further generalized to include the more complex interactions, including the coupling between biomass and osmolyte production, e.g., by allowing the fraction of ribosomes translating (X<sub>r</sub>) to depend on the translation strategy of the osmolyte-producing enzyme (X<sub>a</sub>).”

      (5) The extent of universality of osmoregulation i.e the limitations are not very well highlighted.

      The osmoregulation mechanism described in our model primarily addresses changes in cytoplasmic osmolarity through the de novo synthesis of compatible solutes, which is widely observed across bacteria, archaea, and eukaryotic microorganisms. A review article (Gunde-Cimerman et al., FEMS Microbiology Reviews, 2018) provides an extensive summary of the primary compatible solutes used by organisms from all three domains of life, underscoring the prevalence of this osmoregulatory strategy. Furthermore, our model can be directly generalized to scenarios involving the direct uptake of osmolytes from the environment. One only needs to change the interpretation of the parameter 𝑘<sub>𝑎</sub> in the production rate of osmolyte molecules from the synthesis rate to the uptake rate, and all the results remain applicable. In the revised manuscript, we have briefly discussed this point in the subsection “Osmoregulation.”

      We agree with Reviewer 2 that our model's coarse-grained nature makes it broadly applicable to diverse microbial taxa; however, more specialized adaptations are beyond our model. In the revised manuscript, we have included a more detailed examination of the limitations inherent in our modeling approach in the second last paragraph of the Discussion section:

      “We remark on several limitations of our current coarse-grained model. First, the high membrane tension that inhibits transmembrane flux of peptidoglycan precursors, leading to a growth inhibition before the supergrowth peak (Rojas et al., Cell systems 2017), is beyond our model. Second, in our current framework, the osmoregulation and cell-wall synthesis regulation rely on the instantaneous cellular states. However, microorganisms can exhibit memory effects to external stimuli by adapting to their temporal order of appearance (Mitchell et al., Nature 2009). Notably, in the osmoregulation of yeast, a short-term memory, facilitated by post-translational regulation of the trehalose metabolism pathway, and a long-term memory, orchestrated by transcription factors and mRNP granules, have been identified (Jiang et al., Science signaling 2020). Besides, our model does not account for the role of osmolyte export in osmoregulation (Tamas et al., Molecular microbiology, 1999) and the interaction between biomass and osmolyte production (Shen et al., iScience 2023). Extending our model to include more realistic biological processes will be interesting.”

      (6) Line 198-200: It is not clear in the text what organisms the authors are writing about here. "Experiments suggested that the turgor pressure induce cell-wall synthesis, e.g., through mechanosensors on cell membrane [45, 46], by increasing the pore size of the peptidoglycan network [5], and by accelerating the moving velocity of the cell-wall synthesis machinery [31]". This however is untrue for bacteria as shown by the study (reference 22 is this paper: E. Rojas, J. A. Theriot, and K. C. Huang, Response of escherichia coli growth rate to osmotic shock, Proceedings of the National Academy of Sciences 111, 7807 (2014)).

      We thank Reviewer 2 for pointing out this very important issue and apologize for the confusion. References 45 and 46 (Dupres et al., Nature Chemical Biology 2009; Neeli-Venkata et al., Developmental Cell 2021) discuss how Wsc1 acts as a mechanosensor in S. pombe, detecting turgor pressure and activating pathways that reinforce the cell wall. Reference 5 (Typas et al., Cell 2010) explains the role of LpoA and LpoB, the two outer membrane lipoprotein regulators in E. coli, which modulate peptidoglycan synthesis in an extracellular manner. Reference 31 (Amir and Nelson, PNAS 2012) is a theoretical paper showing that turgor pressure may accelerate the moving velocity of the cell wall synthesis machinery in E. coli. In the revised manuscript, we have been more explicit about the organisms we refer to in the subsection “Cell-wall synthesis regulation.”

      Meanwhile, we agree with Reviewer 2 that cell wall synthesis may not be directly regulated by turgor pressure in E. coli (Rojas et al., PNAS 2014). We would like to clarify that this scenario is also included in our model corresponding to H<sub>cw</sub> = 0 (Eq. (7) in the main text): the turgor pressure does not affect the cell-wall synthesis. Therefore, the supergrowth phenomenon observed in S. pombe does not manifest under hypotonic stimulation in E. coli.

      In the revised manuscript, we have emphasized this point more explicitly in the third last paragraph of the Discussion section:

      “Reference 22 (Rojas et al., PNAS, 2014) showed that the expansion of E. coli cell wall is not directly regulated by turgor pressure, and this scenario is also included in our model as the case of H<sub>cw</sub> = 0. According to our model, the supergrowth phase is absent if H<sub>cw</sub> = 0 (Figure S8), consistent with the absence of a growth rate peak after a hypoosmotic shock in the experiments of E. coli (Rojas et al., PNAS, 2014). Meanwhile, our predictions are consistent with the growth rate peak after a hypoosmotic shock observed for B. subtilis (Rojas et al., Cell systems, 2017).”

      (7) The time scale of reactions to hyperosmotic shocks does not agree with previous literature (reference 22). Therefore defining which organism you are looking at is important. Hence the statement " Because the timescale of the osmoresponse process, which is around hours (Figure 3B), is much longer than the timescale of the supergrowth phase, which is about 20 minutes, the turgor pressure at the growth rate peak can be well approximated by its immediate value after the shock." from line 447 does not seem to make sense. The authors need to address this.

      We apologize for this confusion. In the revised manuscript, we have clarified that the cited time scales are for the fission yeast S. pombe after Eq. (13) in the main text.

      Reviewer #2 (Recommendations for the authors):

      (1) Inconsistency in nomenclature: On line 117, the equation reads V<sub>b</sub> = αm<sub>p</sub> where V<sub>b</sub> is the bound volume. Whereas bound volume has been referred to as V<sub>bd</sub> previously and in Figure 1.

      We apologize for this confusion. In our model, the total bound volume V<sub>b</sub> comprises the volume of dry mass and bound water, V<sub>b</sub> = V<sub>bd</sub> + V<sub>bw</sub>, where V<sub>bd</sub> is the volume occupied by dry mass and V<sub>bw</sub> is the volume of bound water. In the revised manuscript, we have added a brief discussion of this point in the caption of Figure 1.

      (2) Line 180: Please define 𝜌 for equation 4.

      We apologize for this confusion. In the text, the symbol 𝜌 denotes the mass of a given substance per unit volume of free water, in units of g/ml. The specific substance under consideration is indicated by a subscript. For example, 𝜌<sub>p</sub> in Eq. (4) represents the protein density, and 𝜌<sub>c</sub> stands for the critical protein density, above which intracellular chemical reactions cease according to Eq. (8) of the main text. In the revised manuscript, we have clarified the meaning of 𝜌<sub>c</sub> after Eq. (4).

      (3) Line 187: Equation 5 also needs to be explained better. Hence there is a need to be more specific while stating the assumptions.

      The elastic modulus 𝐺 defined in Eq. (5) of the main text is a measure of the cell wall's resistance to volume expansion. We assume a constant 𝐺 for simplicity, which is reasonable when the cell wall deformation is mild. In the revised manuscript, we have been more explicit about our assumptions regarding the turgor pressure in the subsection “Cell-wall synthesis regulation.”

      (4) Line 225: For a biological audience some elaboration on "glass transition" may be required- either as a reference to a review or to a 1 sentence statement of relevance.

      We appreciate Reviewer 2’s helpful comment. In the revised manuscript, we have added a brief introduction to the glass transition and a citation to a review paper (Hunter and Weeks, Rep. Prog. Phys. 2012) at the beginning of the subsection “Intracellular crowding.”

      (5) Line 247: "All growth rates in steady states of cell growth are the same: 𝜇<sub>𝑓</sub> = 𝜇<sub>r</sub> = 𝜇<sub>cw</sub>". The authors need to explain in a line or two why this is true. Since the processes are independent, it is safe to assume that all 𝜇's are constant, but it is not obvious why they should all be equal.

      We apologize for the lack of a clear explanation regarding the equality of steady-state growth rates in our previous manuscript. In the revised manuscript, we have added a brief explanation of the equality of the three growth rates at the beginning of the subsection “Steady states in constant environments”:

      “When cell growth reaches a steady state, the proportions of all components, including free water volume, cell mass, and cell wall volume, must be constant relative to the total cell volume to ensure homeostasis. Therefore, all growth rates in steady states of cell growth must be the same: 𝜇<sub>𝑓</sub> = 𝜇<sub>r</sub> = 𝜇<sub>cw</sub>.”
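
      The step from constant proportions to equal growth rates can be written out in one line. As a sketch, assume that each growth rate is the relative rate of change of its own component, with x<sub>i</sub> standing for free water volume, protein mass, or cell-wall volume (this notation is introduced here for illustration only), and that each component stays a fixed fraction c<sub>i</sub> of the total volume V:

```latex
x_i(t) = c_i\,V(t)
\;\;\Longrightarrow\;\;
\mu_i \equiv \frac{1}{x_i}\frac{dx_i}{dt}
      = \frac{1}{c_i V}\,c_i\,\frac{dV}{dt}
      = \frac{1}{V}\frac{dV}{dt}
\quad \text{for every component } i .
```

      Under these assumptions the constant c<sub>i</sub> cancels, so every component shares the relative growth rate of the total volume, which gives the quoted equality of the three rates.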

      (6) Line 264: "Because the typical doubling times of microorganisms are around hours, we can estimate 𝜇<sub>𝑓</sub>/k<sub>w</sub> ∼ 10 Pa [51, 52] ..." since the authors are generalizing for yeast and bacteria, specifically E. coli, this is not a valid assumption to make. There is also a need to explain the basis of "𝜇<sub>𝑓</sub>/k<sub>w</sub> ∼ 10 Pa".  

      We appreciate the need for clarity in the estimation and its implications. The rough estimation of 𝜇<sub>𝑓</sub>/k<sub>w</sub> ~ 10 Pa in the main text is given by:

      Here, the typical value of 𝜇<sub>𝑓</sub> (which equals 𝜇<sub>r</sub> in steady state) is approximated by the inverse of the cell-cycle duration, which is on the order of hours. The estimate above is used to justify the assumption that 𝜇<sub>𝑓</sub>/k<sub>w</sub> is much smaller than the cytoplasmic osmotic and turgor pressures, which can be several atmospheres.

      For the case of E. coli, based on the experimental results from Boer et al. (Boer et al., Biochemistry 2011), an 800 mM hypoosmotic shock leads to a rapid expansion of cell volume completed within about 0.1 s, from which we obtain:

      .

      Therefore, our assumption that 𝜇<sub>𝑓</sub>/k<sub>w</sub> is much smaller than the cytoplasmic osmotic and turgor pressures is still valid. 

      In the revised manuscript, we have increased the estimation ranges to include the case of E. coli in the first paragraph of the subsection “Steady states in constant environments.”

      (7) Lines 279-283 need to be explained better.  

      We apologize for the confusion. In the revised manuscript, we have explained more explicitly the meaning of the growth curve in the second paragraph of the subsection “Steady states in constant environments”:

      “Intriguingly, the relationship between the normalized growth rate and the normalized cytoplasmic osmotic pressure, which we refer to as the growth curve in the following, has only one parameter, 𝐻<sub>r</sub>/(𝐻<sub>𝑎</sub> + 1). Therefore, the growth curves of different organisms can be unified by a single formula, Eq. (10b), and different organisms may have different values of 𝐻<sub>r</sub>/(𝐻<sub>𝑎</sub> + 1).”

      (8) In Figure 3, an arrow representing the onset of osmotic shock would make the figure more intuitive to understand.

      We appreciate Reviewer 2 for this helpful suggestion. We have modified Figure 3 as suggested.

      (9) It is unclear to me if the growth rate 𝜇<sub>𝑟</sub> is representative of the growth of total protein. This can be motivated better.

      We would like to clarify that the growth rate 𝜇<sub>𝑟</sub> is defined as the rate of change of total protein mass divided by the total protein mass:

      Here, 𝑚<sub>𝑝,𝑟</sub> is the total mass of ribosomal proteins and 𝑘<sub>𝑟</sub> is a constant proportional to the elongation speed of the ribosome. The expression of 𝜇<sub>𝑟</sub> is a direct consequence of ribosomes being responsible for producing all proteins. In the revised manuscript, we have added more details in the introduction of the variable 𝜇<sub>𝑟</sub> in the last paragraph of the subsection “Cell growth”:

      “In this work, we assume that the dry-mass growth rate is proportional to the fraction of ribosomal proteins within the total proteome for simplicity, 𝜇<sub>𝑟</sub> = 𝑘<sub>r</sub>𝑚<sub>𝑝,𝑟</sub>/𝑚<sub>𝑝</sub> = 𝑘<sub>r</sub>𝜙<sub>𝑟</sub>. This assumption leverages the fact that ribosomes are responsible for producing all proteins. The proportionality coefficient 𝑘<sub>𝑟</sub> encapsulates the efficiency of ribosomal activity, being proportional to the elongation speed of the ribosome. We remark that 𝑘<sub>𝑟</sub> is influenced by the crowding effect, which we address later.”
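
      For reference, the verbal definition of the growth rate and the quoted manuscript sentence can be combined into a single expression. The following is a sketch assembled from those two statements, assuming that 𝑚<sub>𝑝</sub> denotes the total protein mass and 𝜙<sub>𝑟</sub> the ribosomal mass fraction:

```latex
\mu_r \;\equiv\; \frac{1}{m_p}\frac{dm_p}{dt}
      \;=\; \frac{k_r\, m_{p,r}}{m_p}
      \;=\; k_r\,\phi_r ,
\qquad
\phi_r \equiv \frac{m_{p,r}}{m_p} .
```

      Read this way, 𝜇<sub>𝑟</sub> is the relative growth rate of total protein whenever protein synthesis scales with ribosomal protein mass, which is the point raised in the reviewer's comment.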

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #2:

      Line 295 – was the time post-infection, which varies considerably between groups and across samples, taken into consideration when comparison of response was between ChatCre mice (4-9 weeks post-infection) and WT mice (four to five weeks post-infection)?

      Thank you for your comment. We did not originally assess the effects of time post-injection on DREADD response. Generally, AAV transgene expression has been demonstrated to be long-term and stable in the CNS of mice.[1] However, there is some variation in the reporting time of peak transgene expression[2], and this may potentially impact our results.

      In investigating this issue further, we discovered an error in our reporting as we did have n = 1 wild-type mouse that underwent EMG recordings 62 days (~9 weeks) post-AAV injection. This has been corrected in the manuscript (lines 87-88).

      Addressing this question is challenging due to the uneven distribution of time points within the 4–9-week windows for each group. Essentially, there were two groups per cohort, one studied at 4-5 weeks and one at 8-9 weeks. More specifically:

      - Wild-type cohort: n = 10 animals were studied 28–33 days post-injection, and n = 1 at 62 days.

      - ChAT-Cre cohort: n = 4 animals were studied 28–30 days post-injection, and n = 5 at 56–59 days.

      We performed Pearson correlation analyses between time post-injection and diaphragm EMG response to DREADD activation (peak amplitude and area under the curve, AUC) for both cohorts (Author response image 1):

      - ChAT-Cre: No significant correlations were found (peak amplitude: r<sup>2</sup> = -0.117, r = -0.1492, p = 0.702, Figure 1a-b; AUC: r<sup>2</sup> = -0.0883, r = 0.2184, p = 0.572, Figure 1c-d).

      - Wild type: Initial analysis of all data showed significant correlations (peak amplitude: r<sup>2</sup> = 0.362, r = 0.6523, p = 0.0296, Figure 1a; AUC: r<sup>2</sup> = 0.347, r = 0.6424, p = 0.033, Figure 1c), suggesting a moderate positive correlation between time post-injection and EMG response. However, when the single 8–9-week wild-type mouse was excluded, these correlations were no longer significant (peak amplitude: r<sup>2</sup> = 0.172, r = 0.5142, p = 0.128, Figure 1b; AUC: r<sup>2</sup> = 0.23, r = 0.5614, p = 0.0913, Figure 1d).
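
      Some of the r<sup>2</sup> values listed above are negative, which a plain squared Pearson coefficient cannot be. A minimal Python sketch follows, assuming (the response does not state this) that the reported r<sup>2</sup> is the adjusted R<sup>2</sup> of a one-predictor linear fit and that the group sizes are those given above (n = 9 ChAT-Cre; n = 11 wild-type, or n = 10 with one animal excluded). The function and variable names are hypothetical.

```python
def adjusted_r2(r: float, n: int, predictors: int = 1) -> float:
    """Adjusted R^2 from a Pearson r, for a linear fit with `predictors` regressors.

    Uses the standard formula 1 - (1 - r^2) * (n - 1) / (n - predictors - 1).
    """
    r2 = r * r
    return 1.0 - (1.0 - r2) * (n - 1) / (n - predictors - 1)


# (label, reported r, assumed n, reported r^2) taken from the list above.
REPORTED = [
    ("ChAT-Cre peak amplitude", -0.1492, 9, -0.117),
    ("ChAT-Cre AUC", 0.2184, 9, -0.0883),
    ("Wild-type peak amplitude, all", 0.6523, 11, 0.362),
    ("Wild-type AUC, all", 0.6424, 11, 0.347),
    ("Wild-type peak amplitude, n = 1 excluded", 0.5142, 10, 0.172),
    ("Wild-type AUC, n = 1 excluded", 0.5614, 10, 0.23),
]

if __name__ == "__main__":
    for label, r, n, reported in REPORTED:
        print(f"{label}: adjusted R^2 = {adjusted_r2(r, n):.4f} (reported {reported})")
```

      Under this assumption each computed value agrees with the reported one to within rounding, so the negative entries would simply reflect a small r combined with a small sample rather than a calculation error.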

      Comparing wild-type and ChAT-Cre groups directly was unreliable due to the single wild-type mouse studied at the later time point. We attempted to model time post-injection as a continuous variable (i.e., exact days post-injection) using a restricted maximum likelihood mixed linear model in JMP; however, the analysis could not be performed because there were not sufficient overlapping time points between the two cohorts (i.e., not all days post-injection were represented in both groups). To mitigate this, we binned animals into two groups: 4–5 weeks and 8–9 weeks post-injection. This analysis returned a significant interaction between cohort and time post-injection (p = 0.0391); however, none of the pairwise comparisons were significant in the Tukey post hoc test (all p > 0.05).

      Based on these findings, we feel confident that time post-injection is unlikely to have a significant impact on diaphragm EMG response to DREADD activation in the ChAT-Cre cohort. However, in the wild-type cohort, it is difficult to draw definitive conclusions, as only one animal was studied at the 8–9-week time point. For similar reasons, it remains unclear whether the relationship between time post-AAV transduction and DREADD response differs between cohorts. Given the inconclusive nature of these results, we have elected not to include this analysis in the manuscript. Nevertheless, to ensure transparency, we have provided Author response image 1 below of peak amplitude and AUC plotted against time, allowing readers to evaluate the data independently.

      Author response image 1.

      Plots of diaphragm EMG peak amplitude (a-b) and area under the curve (c-d) vs. days post-AAV injection for wild-type (blue) and ChAT-Cre (orange) mice. Pearson correlation analyses were performed to assess the relationship between time post-AAV injection and diaphragm EMG DREADD response in wild-type and ChAT-Cre mouse cohorts. r<sup>2</sup>, r, and p-values are shown in each panel for both cohorts. Panels a and c display peak amplitude and AUC, respectively, including all animals. Panels b and d present the same variables with the n = 1 wild-type mouse at the 9-week time point excluded; ChAT-Cre data is unchanged between corresponding panels. Scatter points represent data from individual animals. Polynomial trendlines are displayed for each cohort with wild-type in blue and ChAT-Cre in orange.

      REFERENCES

      (1) Kim, J. Y., Grunke, S. D., Levites, Y., Golde, T. E. & Jankowsky, J. L. Intracerebroventricular viral injection of the neonatal mouse brain for persistent and widespread neuronal transduction. J Vis Exp, 51863 (2014). https://doi.org/10.3791/51863

      (2) Hollidge, B. S. et al. Kinetics and durability of transgene expression after intrastriatal injection of AAV9 vectors. Front Neurol 13, 1051559 (2022). https://doi.org/10.3389/fneur.2022.1051559


      The following is the authors’ response to the original reviews.

      Response to reviewer’s public reviews:

      We chose the dose of J60 based on a prior publication that established that off-target effects were possible at relatively high doses[1]. The dose that we used (0.1 mg/kg) was 30-fold less than the dose that was reported in that paper to potentially have off-target responses (3 mg/kg). Further, Author response image 1 shows the results of experiments in which J60 was given to animals that did not have the excitatory DREADD expressed in the spinal cord. This includes a sample of mice (n = 2) and rats (n = 3), recorded from using the same diaphragm EMG procedure described in the manuscript. The figure shows that there was no consistent response to the J60 at 0.1 mg/kg in the “control experiment” in which the DREADD was not expressed in the spinal cord.

      Author response image 1.

      Diaphragm EMG response to J60 administered to naïve rats and mice. Panels a-b show raw EMG values at baseline and following vehicle (saline) and J60 administration for the left and right hemidiaphragm. Panels c-d show EMG values normalized to baseline. Neither one-way RM ANOVA (panels a-b) nor paired t-test (panels c-d) returned significant p values (significance threshold p < 0.05).

      Response to specific reviewer comments:

      Reviewer #1:

      How old were the animals at the time of AAV injection, and in subsequent experiments?

      The wild-type mice were 7-9 weeks old at the time of AAV injection, and DREADD experiments took place 4-5 weeks after AAV injection. ChAT-Cre mice were 6-10 weeks old at the time of AAV injection, and DREADD experiments took place 4-9 weeks after AAV injection. ChAT-Cre rats were 2-5 months old at the time of AAV spinal injection. These animals underwent plethysmography recordings 3-4 months post-AAV injection and subsequently phrenic nerve recording 3-8 weeks later. These details have been added to the Methods section.

      How many mice were excluded from electrophysiology experiments due to deteriorating electrode contact?

      No mice were excluded from electrophysiology experiments due to deteriorating electrode contact. If you are referring to the n = 1 excluded ChAT-Cre mouse (line 368), this animal was excluded because it showed no histological evidence of DREADD expression (lines 200-206).

      What was the urethane dose?

      The urethane dose for phrenic nerve recordings was 2.1 g/kg. See methods section line 395.

      A graphical timeline of the experimental progression for plethysmography and electrophysiology studies would enhance clarity.

      A graphical timeline has been added. See Figure S6.

      Significance indicators in the figures would greatly enhance clarity. It is a little awkward to have to refer to supplemental tables to figure out statistical differences.

      Significance indicators have been added. See Figures 1, 2, 4, and 5.

      In Figures 1, 2, and 5, individual data points should be shown, as in Fig 4.

      Thank you for this suggestion. We agree that, in general, it is best practice to scatter individual data points. However, when we drafted the new figures, it was apparent that including individual scatter points, in this case, created very “cluttered” figures that were very difficult to interpret.

      More detail regarding the plethysmography studies is needed. Was saline/J60 infused via a tail vein catheter? Were animals handled during the infusion? How long is the "IV" period? What volume of fluid was delivered?

      All IV infusions were delivered via a tail vein catheter. Animals were not handled during infusion or at any other point during the recording. An IV catheter was externalized via a port in the plethysmograph, allowing IV infusion without handling the animal or opening the plethysmograph. The infusion period for both saline and J60 was standardized to 2 minutes, and the infused volume for both was standardized to 0.6 mL. This information has been added to the methods section (lines 408-410, 415-416, 419-420).

      Reviewer #2:

      The abstract could be improved by briefly highlighting the rationale, scope, and novelty of the study - the intro does a great job of highlighting the scope of the study and the research questions.

      A brief explanation of the rationale, scope, and novelty of the study has been added to the abstract. See lines 2-8.

      Line 18, specifies that this was done under urethane anesthesia.

      This detail has been added to the abstract (line 20).

      The methods section should be moved to the end of the manuscript according to Journal policy.

      The methods section has been moved to the end of the manuscript.

      The authors mention the use of both female and male rats but it is not indicated if they tested for and observed any differences between sexes across experiments.

      We included the use of both male and female animals in this study to improve the generalizability of the results. However, we were not adequately powered for sex comparisons and therefore did not perform any statistical analysis to assess differences between sexes across experiments. Text has been added to the methods section (lines 534-537) to clarify.

      Line 40, since delivery of J60 was performed in both IV and IP, this general statement should be updated.

      This detail has been revised to include both IV and IP. See line 43.

      Line 42. "First, we determined if effective diaphragm activation requires focal DREADD expression targeting phrenic motor neurons, or if non-specific expression in the immediate vicinity of the phrenic motor nucleus would be sufficient...." I don't think that in the experiments with wild-type mice the authors can claim that they selectively targeted the cervical propriospinal network (in isolation from the motoneurons). Given the fact that the histological analysis did not quantify interneurons or motoneurons in the spinal cord, authors should be cautious in proposing which neuronal population is activated in the non-specific approach.

      We agree, and this was a poorly worded statement in our original text. We agree that wild-type DREADD expression was not limited to the cervical propriospinal networks but likely a mix of interneurons and motoneurons. The text has been edited to reflect that (see lines 56-60).

      AAV virus source is not described.

      All AAVs were obtained from the UF Powell Gene Therapy Center. Details of virus source and production have been added to the methods section. See lines 336-347.

      Line 108-125. Because the diaphragm EMG recordings are only described for mice here, I would suggest editing this methods section to clearly state mice instead of vaguely describing "animals" in the procedure.

      “Animals” has been changed to “mice” to avoid ambiguity.

      Line 120, add parenthesis.

      Parenthesis has been added.

      Line 126. Whole body plethysmography protocol. Three hypercapnic hypoxic challenges are a lot for a rat within a 3-hour recording session in freely behaving rats. Did the authors verify with control/vehicle experiments that repeated challenges in the absence of J60 do not cause potentiation of the response? I understand that it is not possible to invert the order of the injections (due to likely long-term effects of J60) or it is too late to perform vehicle and J60 injections on different days, but controls for repeated challenges should be performed in this type of experiment, especially considering the great variability in the response observed in Figure 4 (in normoxic conditions).

      We did not conduct control experiments to assess the impact of repeated hypercapnic hypoxic challenges on the naïve response (i.e., in the absence of J60). However, our experimental protocol was designed such that each experimental period (i.e., post-vehicle or post-J60 infusion) was normalized to baseline recordings taken immediately prior to the vehicle or J60 infusion. While repeated exposure to hypercapnic hypoxic challenges may have altered respiratory output, we are confident that normalizing each experimental period to its respective baseline effectively captures the impact of DREADD activation on ventilation, independent of any potential potentiation that may have occurred due to gas challenge exposure. We have included raw values for all plethysmography outcomes (see Figure 4, panels a-c) to ensure full data transparency. Still, we believe that the baseline-normalized values more accurately reflect the impact of DREADD activation on the components of ventilation.
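
      The baseline normalization described above can be sketched as follows. This is a minimal illustration only: the function name, the percent-of-baseline convention, and the example numbers are assumptions made for clarity and are not taken from the study's analysis.

```python
def percent_of_baseline(baseline_values, period_values):
    """Express the mean of an experimental period as a percentage of the
    mean of the baseline recorded immediately before that period.

    Hypothetical helper: the study states only that each period was
    normalized to its own preceding baseline, not the exact formula.
    """
    if not baseline_values or not period_values:
        raise ValueError("both recordings must contain at least one value")
    baseline_mean = sum(baseline_values) / len(baseline_values)
    if baseline_mean == 0:
        raise ValueError("baseline mean is zero; cannot normalize")
    period_mean = sum(period_values) / len(period_values)
    return 100.0 * period_mean / baseline_mean


# Made-up tidal volume values (mL), for illustration only.
vehicle = percent_of_baseline([2.0, 2.0], [2.0, 2.0])  # post-vehicle period
ligand = percent_of_baseline([2.0, 2.0], [3.0, 3.0])   # post-ligand period
```

      Because each period is divided by its own preceding baseline, a slow drift across the session changes both the numerator and the denominator of a given period, which is the rationale given in the response above.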

      Furthermore, why are the responses to the hypercapnic hypoxic challenges not reported? These could be very interesting to determine the effects of DREADD stimulation on chemosensory responses and enhance the significance of the study.

      Response to the hypercapnic hypoxic challenges has been added to the manuscript. See Figure S3 and results section lines 162-167. Briefly, there were no statistically significant (p < 0.05) differences in tidal volume, respiratory rate, or minute ventilation between J60 vs sham condition during hypercapnic-hypoxic ventilatory challenges.

      Line 200 - what is the reason behind performing a qualitative analysis of mCherry in various quadrants? This limits the interpretation of the results. If the authors used Chat-cre rats, the virus should only be in Chat+ MN. Knowing how selective the virus is, and whether its expression was selective for Phrenic MN versus other MN pools, could address several technical questions.

      We agree that detailed quantification of expression by motoneuron pool would be of value in future work. However, for these initial proof-of-concept experiments, we performed the quadrant-based qualitative analysis of mCherry expression to provide a simple comparison of mCherry expression between groups (i.e., ChAT-Cre vs. wildtype mice). This analysis allowed us to: 1) show the reader that each animal included in the study showed evidence of mCherry expression and 2) give the reader an idea of patterns of mCherry expression throughout the mid-cervical spinal cord. Additionally, it is important to note that while ChAT is a marker of motoneurons, some populations of interneurons also express ChAT (2-4).

      Given the increased values of Dia EMG AUC and no changes in respiratory rate, did the authors determine if there was a change in the inspiratory time with J60 administration?

      We did not assess inspiratory time.

      High death rate in DREADD WT mice - was histological analysis performed on these mice? Could it be due to the large volume injected into the spinal cord that affects not only descending pathways but also ascending ones? Or caused by neuronal death due to the large volume of viral solution injected in mice?

      Histological analysis was performed on these animals to assess mCherry expression only (i.e., no staining for NeuN or other markers was performed). While the reviewer's speculations are reasonable, we feel these reasons are unlikely to explain the death rate in DREADD WT mice, as ChAT-Cre mice received the same volume injected into their spine and lived up until and during diaphragm EMG recordings. Additionally, WT mice lived for 4-5 weeks post-injection, which would be past the acute phase during which a large immune response to the viral dose would have occurred.

      Line 299-304. Can you please clarify whether these rats were tested under anesthesia?

      These rats were assessed under anesthesia. This detail has been added (line 146).

      Given some of the unexpected results on cardiovascular parameters in urethane anesthetized rats, did the authors test the effects of J60 in the absence of AAV construct infection?

      A small cohort (n = 2) of urethane anesthetized naïve wildtype rats were given the J60 ligand (IV, 0.1 mg/kg dose). We did observe a sudden drop in blood pressure after J60 administration that was sustained for the duration of the recording. One animal showed a 12% decrease in mean arterial blood pressure following J60 administration while the other showed a 35% decrease. Thus, it does appear that in this preparation the J60 ligand is producing a drop in arterial blood pressure.

      Line 393. I believe this comment refers to the intrapleural and diaphragmatic injection. Maybe this should be clarified in the sentence.

      This sentence has been revised for clarity (see lines 248-250).

      Figures 1 and 2. It would be informative to show raw traces of the Diaphragm EMG to demonstrate the increase in tonic EMG. It is not possible to determine that from the integrated traces in Figures 1A and B.

      Thank you for bringing up this concern. While the mean data in Figures 1F and 2F do indicate that, on average, animals had tonic diaphragm EMG responses to DREADD activation, the examples given in Figures 1A and 2A show minimal responses. This makes it difficult to fully appreciate the tonic response from those particular traces. However, clear tonic activity can be appreciated from Figures 5A and S2. In these figures, tonic activity is evident from the integrated EMG signals, presenting as a sustained increase in baseline activity between bursts—essentially an upward shift from the zero point.

      References

      (1) Van Savage, J. & Avegno, E. M. High dose administration of DREADD agonist JHU37160 produces increases in anxiety-like behavior in male rats. Behav Brain Res 452, 114553 (2023). https://doi.org/10.1016/j.bbr.2023.114553

      (2) Mesnage, B. et al. Morphological and functional characterization of cholinergic interneurons in the dorsal horn of the mouse spinal cord. J Comp Neurol 519, 3139-3158 (2011). https://doi.org/10.1002/cne.22668

      (3) Gotts, J., Atkinson, L., Yanagawa, Y., Deuchars, J. & Deuchars, S. A. Co-expression of GAD67 and choline acetyltransferase in neurons in the mouse spinal cord: A focus on lamina X. Brain Res 1646, 570-579 (2016). https://doi.org/10.1016/j.brainres.2016.07.001

      (4) Alkaslasi, M. R. et al. Single nucleus RNA-sequencing defines unexpected diversity of cholinergic neuron types in the adult mouse spinal cord. Nat Commun 12, 2471 (2021). https://doi.org/10.1038/s41467-021-22691-2

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, Eaton et al. examine the regulation of transcription directionality using a powerful genomic approach (more about the methodology below). Their data challenge the notion that the polyadenylation signal-reading Cleavage and Polyadenylation (CPA) complex is responsible for controlling promoter directionality by terminating antisense transcription. Namely, depletion of the required CPA factor RBBP6 has little effect on antisense transcription measured by POINT. They find instead that initiation is intrinsically preferential in the sense direction and additionally maintained by the activities of an alternative processing complex called Integrator, together with the kinase CDK9. In the presence of CDK9 activity, depletion of Integrator endoribonuclease INTS11 leads to globally increased transcription in the antisense direction, and minor effects in the sense direction. However, CDK9 inhibition reveals that sense transcription is also sensitive to INTS11 depletion. The authors suggest that CDK9 activity is stronger in the sense direction, preventing INTS11-mediated premature termination of sense transcripts.

      Strengths:

      The combination of acute depletion of the studied factors using degron approaches (important to limit possible secondary effects), together with novel and very sensitive nascent transcriptomics methods POINT and sPOINT is very powerful. The applied spike-in normalization means the analysis is more rigorous than most. Using this methodology allowed the authors to revisit the interesting question of how promoter/transcription directionality is determined.

      The data quality appears very good and the fact that both global analysis as well as numerous gene-specific examples are shown makes it convincing.

      The manuscript is well written and hence a pleasure to read.

      We appreciate this positive assessment.

      Weaknesses:

      I am slightly worried about the reproducibility of the data - it is unclear to me from the manuscript if and which experiments were performed in replicate (lack of table with genomic experiments and GEO access, mentioned in more detail in below recommendations to authors), and the methods could be more detailed.

      All sequencing data was deposited with GEO. Multiple biological replicates were performed for each sequencing experiment.  Bigwig files are presented as a table in the GEO submissions. This data has now been made public.

      A separate discussion section would be useful, particularly since the data provided challenge some concepts in the field. How do the authors interpret U1 data from the Dreyfuss lab in light of their results? How about the known PAS-density directionality bias (more PAS present in antisense direction than in sense) - could the differential PAS density be still relevant to transcription directionality?

      As suggested, we have expanded our discussion to relate our findings to existing data. We think the results from the Dreyfuss lab are very important and highlight the role of U1 snRNA in enforcing transcriptional elongation.  It does this in part by shielding PAS sequences.  Recent work from our lab also shows that U1 snRNA opposes the Restrictor complex and PNUTS, which otherwise suppress transcription (Estell et al., Mol Cell 2023).  Most recently, the Adelman lab has demonstrated that U1 snRNA generally enhances transcription elongation (Mimoso and Adelman., Mol Cell 2023).  Our work does not challenge and is not inconsistent with these studies.

      The role of U1 in opposing PAS-dependent termination inspired the idea that antisense transcriptional termination may utilise PASs.  This was because such regions are rich in AAUAAA and comparatively poor in U1 binding sites. However, our RBBP6 depletion and POINT-seq data suggest that PAS-dependent termination is uncommon in the antisense direction. As such, other mechanisms suppress antisense transcription and influence promoter directionality. In our paper, we propose a major role for the Integrator complex.

      We do not completely rule out antisense PAS activity and discuss the prior work that identified polyadenylated antisense transcripts. Nevertheless, this was detected by oligo-dT primed RT-PCR/Northern blotting, which cannot determine the fraction of non-polyadenylated RNA that could result from PAS-independent termination (e.g. by Integrator).  To do that requires an analysis of total nascent transcription as achieved by our POINT-seq.  Based on these experiments, Integrator depletion has a greater impact on antisense transcription than RBBP6 depletion. 

      I find that the provided evidence for promoter directionality to be for the most part due to preferential initiation in the sense direction should be stressed more. This is in my eyes the strongest effect and is somehow brushed under the rug.

      We agree that this is an important finding and incorporated it into the title and abstract.  As the reviewer recommends, we now highlight it further in the new discussion.

      References 12-17 report an effect of Integrator on 5' of protein-coding genes, while data in Figure 2 appears contradictory. Then, experiments in Figure 4 show a global effect of INTS11 depletion on promoter-proximal sense transcription. In my opinion, data from the 2.5h time-point of depletion should be shown alongside 1.5h in Figure 2 so that it is clear that the authors found an effect similar to the above references. I find the current presentation somehow misleading.

      We are grateful for this suggestion and present new analyses demonstrating that our experiment in Figure 2 concurs with previous findings (Supplemental Figures 2A and B). Our original heatmap (Figure 2E) shows a very strong and general antisense effect of INTS11 loss. On the same scale, the effects in the sense direction are not as apparent, which is also the case using metaplots.  New supplemental figure 2A now shows sense transcription from this experiment in isolation and on a lower scale, demonstrating that a subset of genes shows promoter-proximal increases in transcription following INTS11 depletion.  This is smaller and less general than the antisense effect but consistent with previous findings.  Indeed, our new analysis in supplemental figure 2B shows that affected protein-coding genes are lowly expressed, in line with Hu et al., Mol Cell 2023. This explains why a sense effect is not as apparent by metaplot, for which highly expressed genes contribute the most signal.

      As a result of our analyses, we are confident that the apparently larger effect at the 2.5hr timepoint (Figure 4) that we initially reported is due to experimental variability and not greater effects of extended INTS11 depletion. Overlaying the 1.5h and 2.5h datasets (Supplemental Figure 4B) revealed a similar number of affected protein-coding genes with a strong (83%) overlap between the affected genes.  To support this, we performed qPCR on four affected protein-coding transcripts which revealed no significant difference in the level of INTS11 effect after 2.5h vs 1.5h (Supplemental Figure 4C).
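
      For illustration, an overlap percentage between two sets of affected genes, such as the one quoted above, can be computed as in the sketch below. The gene identifiers are made up, and the choice of denominator (the smaller of the two sets) is an assumption: the response does not state which denominator was used.

```python
def overlap_percent(genes_a, genes_b):
    """Percentage of the smaller gene set that is also present in the other set.

    Hypothetical helper: using the smaller set as the denominator is one
    common convention, not necessarily the one used in the study.
    """
    set_a, set_b = set(genes_a), set(genes_b)
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        raise ValueError("both gene sets must be non-empty")
    shared = set_a & set_b
    return 100.0 * len(shared) / smaller


# Made-up gene identifiers standing in for genes affected at each timepoint.
affected_1_5h = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
affected_2_5h = {"GENE_B", "GENE_C", "GENE_D", "GENE_E", "GENE_F"}
shared_percent = overlap_percent(affected_1_5h, affected_2_5h)
```

      Other denominators (the larger set, or the union, which gives a Jaccard index) would yield different percentages from the same gene lists, so the convention should be stated alongside the number.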

      We now present data for merged replicates in Figures 2 and 4 which reveal very similar average profiles for -INTS11 vs +INTS11 at both timepoints. Overall, we believe that we have resolved this discrepancy by showing that it amounts to experimental variability and because the most acutely affected protein-coding genes are lowly expressed. As detailed above, we show this in multiple ways (and validate by qPCR). We have revised the text accordingly and removed our original speculation that differences reflected the timeframe of INTS11 loss.

      Conclusion/assessment:

      This important work substantially advances our understanding of the mechanisms governing the directionality of human promoters. The evidence supporting the claims of the authors is compelling, with among others the use of advanced nascent transcriptomics including spike-in normalization controls and acute protein depletion using degron approaches.

      In my opinion, the authors' conclusions are in general well supported.

      Not only the manuscript but also the data generated will be useful to the wide community of researchers studying transcriptional regulation. Also, the POINT-derived novel sPOINT method described here is very valuable and can positively impact work in the field.

      We are grateful for the reviewer's positive assessment of our study.

      Reviewer #2 (Public Review):

      Summary:

      Eaton and colleagues use targeted protein degradation coupled with nascent transcription mapping to highlight a role for the Integrator component INTS11 in terminating antisense transcription. They find that upon inhibition of CDK9, INTS11 can terminate both antisense and sense transcription - leading to a model whereby INTS11 can terminate antisense transcription and the activity of CDK9 protects sense transcription from INTS11-mediated termination. They further develop a new method called sPOINT which selectively amplifies nascent 5' capped RNAs and find that transcription initiation is more efficient in the sense direction than in the antisense direction. This is an excellent paper that uses elegant experimental design and innovative technologies to uncover a novel regulatory step in the control of transcriptional directionality.

      Strengths:

      One of the major strengths of this work is that the authors endogenously tag two of their proteins of interest - RBBP6 and INTS11. This tag allows them to rapidly degrade these proteins - increasing the likelihood that any effects they see are primary effects of protein depletion rather than secondary effects. Another strength of this work is that the authors immunoprecipitate RNAPII and sequence extracted full-length RNA (POINT-seq) allowing them to map nascent transcription. A technical advance from this work is the development of sPOINT which allows the selective amplification of 5' capped RNAs < 150 nucleotides, allowing the direction of transcription initiation to be resolved.

      We appreciate this positive assessment.

      Weaknesses:

      While the authors provide strong evidence that INTS11 and CDK9 play important roles in determining promoter directionality, their data suggests that when INTS11 is degraded and CDK9 is inhibited there remains a bias in favour of sense transcription (Figures 4B and C). This suggests that there are other unknown factors that promote sense transcription over antisense transcription and future work could look to identify these.

      We agree that other (so far, unknown) factors promote sense transcription over antisense, which was demonstrated by our short POINT.  We have provided an expanded discussion on this in the revision. In our opinion, demonstrating that sense transcription is driven by preferential initiation in that direction is a key finding and we agree that the identification of the underlying mechanism constitutes an interesting avenue for future study.

      Reviewer #3 (Public Review):

      Summary:

      Using a protein degradation approach, Eaton et al show that INTS11 can terminate both sense and anti-sense transcription, but higher activity of CDK9 in the sense direction protects it from INTS11-dependent termination. They developed sPOINT-seq that detects nascent 5'-capped RNA. The technique allowed them to reveal robust transcription initiation of sense-RNA as compared to anti-sense.

      Strengths:

      The strength of the paper is the acute degradation of proteins, eliminating the off-target effects. Further, the paper uses elegant approaches such as POINT and sPOINT-seq to measure nascent RNA and 5'-capped short RNA. Together, the combination of these three allowed the authors to make clean interpretations of data.

      We appreciate this positive assessment.

      Weaknesses:

      While the manuscript is well written, the details on the panel are not sufficient. The methods could be elaborated to aid understanding. Additional discussion on how the authors' findings contradict the existing model of anti-sense transcription termination should be added.

      We have added more detail to the figure panels, which we hope will help readers to navigate the paper more easily. Specifically, the assay employed for each experiment is indicated in each figure panel. As requested, we provide a new and separate discussion section in the revision.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Congratulations on this important piece of work!

      Some specific suggestions.

      MAJOR

      -The data are not available (Accession "GSE243266" is currently private and is scheduled to be released on Sep 01, 2026.) This should be corrected and as a minimum, the raw sequencing files as well as the spike-in scaled bigwig files should be provided in GEO.

      We have made the data public. Raw and bigwig files are provided as part of the GEO upload.

      MINOR

      - It would be useful for readers if you could include catalog numbers of the reagents used in the study.

      We have included this information in our revision.

      - A table in experimental procedures summarizing the genomic experiments performed in this study as well as published ones reanalyzed here would be helpful.

      This is now provided as part of the resources table.

      - It would be easier for reviewers to evaluate the manuscript if the figure legends were included together with the figures on one page. This is now allowed by most journals.

      We have used this formatting in the revision.

      - Providing some captions for the results sections would be helpful.

      We have included subheadings as suggested.

      Reviewer #2 (Recommendations For The Authors):

      Generally, I would suggest writing the experiment-type above panels where it is not immediately obvious what they are so a reader can appreciate the figures without referencing the legend. E.g. write POINT-seq on Figure 1B just to make it obvious to someone looking at the figures what methodology they are looking at. Likewise, you could write RNAPII ChIP-seq for Supplementary Figures 3D and 3E.

      We have carried out this recommendation.

      Can a y-axis be indicated on POINT-seq genome browser tracks? This could make them easier to interpret.

      Y-axis scales are provided as RPKM as stated in the figure legends.
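
      For readers unfamiliar with the unit, the standard RPKM definition (reads per kilobase per million mapped reads) can be sketched as below. The function and example numbers are hypothetical, and the sketch does not include the spike-in scaling that the study applies to its tracks.

```python
def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads.

    Standard definition: reads / (length in kb * total reads in millions),
    which is equivalent to reads * 1e9 / (length in bp * total reads).
    """
    if region_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("region length and total mapped reads must be positive")
    return read_count * 1_000_000_000 / (region_length_bp * total_mapped_reads)


# Made-up numbers: 100 reads over a 1 kb window in a library of 1 million reads.
example_value = rpkm(100, 1_000, 1_000_000)
```

      Normalizing by both window length and library size makes the y-axis of tracks from libraries of different depth directly comparable, provided no additional scaling factor is needed.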

      The authors could address/speculate in the text why there is less POINT-seq signal for the antisense transcript in the treatment condition in Figure 1B? Or could consider including a different example locus where this is not the case for clarity.

      Acute depletion of poly(A) factors (like RBBP6) results in a strong read-through beyond the poly(A) signal of protein-coding genes as Figure 1 shows. However, it also causes a reduction in transcription levels, which can be seen in the figure and is correctly noted by the reviewer in this comment. We see this with other poly(A) factor depletions (e.g., CPSF73 and CPSF30 – Eaton et al., 2020 and Estell et al., 2021) and other labs have observed this too (e.g., for CPSF73-dTAG depletion; Cugusi et al., Mol Cell 2022). Plausible reasons include a limited pool of free RNAPII due to impaired transcriptional termination or limited nucleotide availability due to their incorporation within long read-through transcripts. For these reasons, we have retained the example in Figure 1B as a typical representation of the effect. Moreover, the heatmap in Figure 1D fairly represents the spectrum of effects following RBBP6 loss – highlighting the strong read-through beyond poly(A) signals and the marginal antisense effects.

      "The established effect of INTS11 at snRNAs was detected in our POINT-seq data and demonstrates the efficacy of this approach (Figure 2B)." The authors could explain this point more clearly in the text and describe the data - e.g. As expected, depletion of INTS11 leads to increased POINT-seq signal at the 3' end of snRNAs, consistent with defects in transcriptional termination. This is highlighted by the RNU5A-1 and RNU5B-1 loci (Figure 2B).

      We agree and have added more context to clarify this.

      I would suggest adjusting the scale of the heatmap in Figure 2E - I think it would be easier to interpret if the value of 0 was white - with >0 a gradient of orange and <0 a gradient of blue (as is done in Figure 1C). I think making this change would make the point as written in the text clearer i.e. "heatmap analysis demonstrates the dominant impact of INTS11 on antisense versus sense transcription at most promoters (Figure 2E)." I'm assuming most of the sense transcription would be white (more clearly unchanging) when the scale is adjusted.

      We agree and have done this. The reviewer is correct that most sense transcription is unchanged by INTS11 loss.  However, as we alluded to in the original submission, a subset of transcripts shows a promoter-proximal increase after INTS11 depletion. We have expanded the analyses of this effect (see responses to other comments) but stress that it is neither as general nor as large as the antisense effect.

      The authors make the point that there is mildly increased transcription over the 5' end of some genes upon INTS11 depletion and show a track (Supplementary Fig 2A). It is not immediately obvious from the presentation of the meta-analysis in Figure 2D how generalisable this statement is. Perhaps the size of the panel or thickness of the lines in Figure 2D could be adjusted so that the peak of the control (in blue) could be seen. Perhaps an arrow indicating the peak could be added? I'm assuming the peak at the TSS is slightly lower in the control compared to INTS11 depletion based on the authors' statement.

      We have provided multiple new analyses of this data to highlight where there are promoter-proximal effects of INTS11 loss in the sense direction.  Please see our response to the public review of reviewer 1 and new supplemental figures 2A, 2B, 4A and 4B which highlight the sense transcription increased in the absence of INTS11.

      The authors label Figure 4 "Promoters lose their directionality when CDK9 is inhibited" - but in INTS11 depleted cells treated with CDK9i they find that there still is a bias towards sense transcription. Suggested edit "Some promoter directionality is lost when CDK9 is inhibited" or similar.

      We agree and have made this change.

      The authors conclude that INTS11-mediated effects are the result of perturbation of the catalytic activities of Integrator; to support this, the authors should perform rescue experiments with the catalytically dead E203Q-INTS11 mutant.

      This is a very good suggestion and something we had intended to pursue. However, as we describe below (and show in Supplemental Figure 4G), there were confounding issues with this experiment.

      The E203Q mutant of INTS11 is widely used in the literature to test for catalytic functions of INTS11.  However, we have found that this mutation impairs the ability of INTS11 to bind other Integrator modules in cells. Based on co-immunoprecipitation of flag-tagged WT and E203Q derivatives, INTS1 (backbone module), 10 (tail module), and 8 (phosphatase module) all show reduced binding to E203Q vs. WT. Because E203Q INTS11 is defective in forming Integrator complexes, rescue experiments might not fully distinguish the effects of INTS11 activity from those caused by defects in complex assembly. While this may at first seem unexpected, in the analogous 3’ end processing complex, catalytic mutants of CPSF73 (which is highly related to INTS11) negatively affect its interaction with other complex members (Kolev and Steitz, EMBO Reports 2005).

      We hypothesise that INTS11 activity is most likely involved in attenuating promoter-proximal transcription, but we cannot formally rule out other explanations and discuss this in our revision. Regardless of how INTS11 attenuates transcription, our main conclusion is on its requirement to terminate antisense transcription whether this involves its cleavage activity or not.

      The authors suggest that CDK9 modulates INTS11 activity/assembly and suggest this may be related to SPT5. Is there an effect of CDK9 inhibition on the snRNAs highlighted in Figure 2B?

      We believe that snRNAs are different from protein-coding genes concerning CDK9 function. Shona Murphy’s lab previously showed that, unlike protein-coding genes, snRNA transcription is insensitive to CDK9 inhibition, and that snRNA processing is impaired by CDK9 inhibition (Medlin et al., EMBO 2003 and EMBO 2005). We reproduce these findings by meta-analysis of 15 highly expressed and well-separated snRNAs and by qRT-PCR of unprocessed RNU1-1, RNU5A-1 and RNU7-1 snRNA following CDK9 inhibition. We observe snRNA read-through by POINT-seq following INTS11 loss whether CDK9 is inhibited or not (left panel, below). Note the higher TES proximal signal in CDK9i conditions, which likely reflects the accumulation of unprocessed snRNA as validated by qPCR for three example snRNAs (right panel, below).

      Author response image 1.

      For Figure 4, would similar results be observed using inhibitors targeting other transcriptional CDKs such as CDK7,12/13?

      In response to this suggestion, we analysed four selected protein-coding transcripts (the same 4 that we used to validate the CDK9i results) by qRT-PCR in a background of CDK7 inhibition using the THZ2 compound (new Supplemental Figure 4E). THZ2 suppresses transcription from these genes as expected. Interestingly, expression is restored by co-depleting Integrator, recapitulating our findings with CDK9 inhibition. There are two possible explanations. First, as CDK7 is the CDK-activating kinase for CDK9, its inhibition will also inhibit CDK9, so THZ2 may simply hit this pathway upstream of where CDK9 inhibitors act. Second, CDK7 may independently shield transcription from INTS11. We allude to both interesting possibilities.

      What happens to the phosphorylation state of anti-sense engaged RNAPII when INTS11 is acutely depleted and/or CDK9 is inhibited? This could be measured by including Ser5 and Ser2 antibodies in the sPOINT-seq assay and complemented with Western Blot analysis.

      We have performed the western blot for Ser5 and Ser2 phosphorylation as suggested.  Both signals are mildly enhanced by INTS11 loss, which is consistent with generally increased transcription.  Ser2p is strongly reduced by CDK9 inhibition, which is consistent with the loss of nascent transcription in this condition.  Interestingly, both modifications are partly recovered when INTS11 is depleted in conjunction with CDK9 inhibition. This is consistent with the effects that we see on POINT-seq and shows that the recovered transcription is associated with some phosphorylation of RNAPII CTD.  This presumably reflects the action(s) of kinases that can act redundantly with CDK9.

      We have not performed POINT-seq with Ser5p and Ser2p antibodies under these various conditions.  Our rationale is that our existing data uses an antibody that captures all RNAPII (regardless of its phosphorylation status), which we feel most comprehensively assays transcription in either direction. Moreover, the lab of Fei Chen (Hu et al., Mol Cell 2023) recently published Ser5p and Ser2p ChIP-seq following INTS11 loss. By ChIP-seq, they observe a bigger increase in antisense RNAPII occupancy vs. sense providing independent and orthogonal support for our POINT-seq data.  Interestingly, this antisense increase is not paralleled by proportional increases in Ser5p or Ser2p signals.  This suggests that the unattenuated antisense transcription resulting from INTS11 loss does not have high Ser5p or Ser2p.  Since CDK7 and 9 are major Ser5 and 2 kinases, this supports our model that their activity is less prevalent for antisense transcription.  We now discuss these data in our revision.   

      The HIV reporter RNA experiments should be performed with the CDK9 inhibitor added to the experimental conditions. Presumably CDK9 inhibition would result in no upregulation of the reporter upon addition of TAT and/or dTAG. Perhaps the amount of TAT should be reduced to still have a dynamic window in which changes can be detected. It is possible that reporter activation is simply at a maximum. Can anti-sense transcription be measured from the reporter?

      We have performed the requested CDK9 inhibitor experiment to confirm that TAT-activated transcription from the HIV promoter is CDK9-dependent (new supplemental figure 4F).  Consistent with previous literature on HIV transcription, CDK9 inhibition attenuates TAT-activated transcription.  Importantly, and in line with our other experiments, depletion of INTS11 results in significant restoration of transcription from the HIV promoter when CDK9 is inhibited. Thus, TAT-activated transcription is CDK9-dependent and, as for endogenous genes, CDK9 prevents attenuation by INTS11.

      While TAT-activated transcription is high, we do not think that the plasmid is saturated. When considering this question, we revisited previous experiments using this system to study RNA processing (Dye et al., Mol Cell 1999, Cell 2001, Mol Cell 2006). In these cases, mutations in splice sites or polyadenylation sites have a strong effect on RNA processing and transcription around HIV reporter plasmids. Effects on transcription and RNA processing are, therefore, apparent in the appropriate context. In contrast, we find that the complete elimination of INTS11 has no impact on RNA output from the HIV reporter. Our original experiment assessing the impact of INTS11 loss in +TAT conditions used total RNA. One possibility is that this allows non-nascent RNA to accumulate, which might confound our interpretation of INTS11 effects on ongoing transcription. However, the new experiment described in the paragraph above was performed on chromatin-associated (nascent) RNA to rule this out. This again shows no impact of INTS11 loss on HIV promoter-derived transcription in the presence of TAT.

      To our knowledge, antisense transcription is not routinely assayed from plasmids. They generally employ very strong promoters (e.g. CMV, HIV) to drive sense transcription.  Crucially, their circular nature means that RNAPII going around the plasmid could interfere with antisense transcription coming the other way which does not happen in a linear genomic context. This is why we restricted our use of plasmids to looking at the effects of stimulated CDK9 recruitment (via TAT) on transcription rather than promoter directionality.   

      The authors should clearly state how many replicates were performed for the genomics experiments. Ideally, a signal should be quantified and compared statistically rather than relying on average profiles only.

      We have stated the replicate numbers for sequencing experiments in the relevant figure legends. All sequencing experiments were performed in at least two biological replicates, but often three. In addition, we validated their key conclusions by qPCR or with orthogonal sequencing approaches.

      Reviewer #3 (Recommendations For The Authors):

      The authors provide strong evidence in support of their claims.

      ChIP-seq of pol2S5 and S2 upon INTS11 and CDK9 inhibition will strengthen the observation that transcription in the sense direction is more efficient.

      We view the analysis of total RNAPII as the most unbiased way of establishing how much RNAPII is going one way or the other. Importantly, ChIP-seq was very recently performed for Ser2p and Ser5p RNAPII derivatives in the lab of Fei Chen (Hu et al., Mol Cell 2023). Their data shows that loss of INTS11 increases the occupancy of total RNAPII in the antisense direction more than in the sense direction, which is consistent with our finding. Interestingly, the increased antisense RNAPII was not paralleled with an increase in Ser2p or Ser5p. This suggests that, following INTS11 loss, the unattenuated antisense transcription is not associated with full/normal Ser2p or Ser5p. These modifications are normally established by CDK7 and 9; therefore, this published ChIP-seq suggests that they are not fully active on antisense transcription when INTS11 is lost. This supports our overall model that CDK9 (and potentially CDK7 as suggested for a small number of genes in new Supplemental Figure 4E) is more active in the sense direction to prevent INTS11-dependent attenuation. We now discuss these data in our revision.

      In Supplementary Figure 2, the eRNA expression increases upon INTS11 degradation. I wonder if the effects of this will be appreciated on cognate promoters? Can the authors test some enhancer:promoter pairs?

      We noticed that some genes (e.g. MYC) that are regulated by enhancers show reduced transcription in the absence of INTS11. Whilst this could suggest a correlation, the transcription of other genes (e.g. ACTB and GAPDH) is also reduced by INTS11 loss although they are not regulated by enhancers.  A detailed and extensive analysis would be required to establish any link between INTS11-regulated enhancer transcription and the transcription of genes from their cognate promoters.  We agree that this would be interesting, but it seems beyond the scope of our short report on promoter directionality.

      Line 111, meta plot was done of 1316 genes. Details on this number should be provided. Overall, the details of methods and analysis need improvement. The layout of panels and labelling on graphs can be improved.

      We have now explained the 1316 gene set. In essence, these are the genes separated from an expressed neighbour by at least 10kb. This distance was selected because depletion of RBBP6 induces extensive read-through transcription beyond the polyadenylation site of protein-coding genes. To avoid including genes affected by transcriptional read-through from nearby transcription units, we selected those with a 10kb gap between them. This was the only selection criterion, so it is unlikely to introduce any unintended biases. Finally, we have added more information to the figure panels and their legends, which we hope will make our manuscript more accessible.
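
      The isolation filter described above can be sketched as follows. This is a minimal illustration, not the pipeline used for the manuscript: the function name `isolated_genes`, the tuple layout, and the choice to ignore strand are all assumptions, and the input is assumed to contain only expressed genes.

```python
from collections import defaultdict

def isolated_genes(genes, min_gap=10_000):
    """Keep genes whose distance to every other expressed gene is >= min_gap.

    genes: list of (name, chrom, start, end) tuples, all assumed expressed.
    Strand is ignored here, which is an assumption of this sketch.
    """
    by_chrom = defaultdict(list)
    for gene in genes:
        by_chrom[gene[1]].append(gene)

    kept = []
    for chrom_genes in by_chrom.values():
        for i, (name, _, start, end) in enumerate(chrom_genes):
            isolated = True
            for j, (_, _, other_start, other_end) in enumerate(chrom_genes):
                if i == j:
                    continue
                # Positive when the two genes are disjoint, negative when they overlap
                gap = max(other_start - end, start - other_end)
                if gap < min_gap:
                    isolated = False
                    break
            if isolated:
                kept.append(name)
    return kept
```

      The pairwise comparison is quadratic per chromosome, which is adequate for a sketch; an interval-based tool would be the usual choice at genome scale.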

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary of the work: In this work, Fruchard et al. study the enzyme Tgt and how it modifies guanine in tRNAs to queuosine (Q), essential for Vibrio cholerae's growth under aminoglycoside stress. Q's role in codon decoding efficiency and its proteomic effects during antibiotic exposure is examined, revealing Q modification impacts tyrosine codon decoding and influences RsxA translation, affecting the SoxR oxidative stress response. The research proposes Q modification's regulation under environmental cues reprograms the translation of genes with tyrosine codon bias, including DNA repair factors, crucial for bacterial antibiotic response.

      The experiments are well-designed and conducted and the conclusions, for the most part, are well supported by the data. However, a few clarifications will significantly strengthen the manuscript.

      Thank you.

      Major:

      Figure S4 A-D. These growth curves are important data and should be presented in the main figures. Moreover, given that it is not possible to make a rsxA mutant, I wonder if it would be possible to connect rsx and tgt using the following experiment: expression of tgt results in resistance to TOB (in B), while expression of only rsx lowers resistance to TOB (in D). Then simultaneous overexpression of both tgt/rsx in the WT strain should have either no effect on TOB resistance or increased resistance, relative to the WT. Perhaps the authors have done this, and if so, the data should be included as it will significantly strengthen their model.

      We thank the reviewer for this suggestion. We have tried to overexpress both tgt and rsxA simultaneously. However, this appears to be toxic, as cells form small colonies and cannot grow well in liquid. We think that the presence of 2 plasmids and the corresponding selection antibiotics amplifies the toxicity of overexpressing rsxA, and even tgt. In fact, it can be seen that tgt overexpression in WT is already slightly deleterious in the absence of tobramycin (figure 1B).

      Figure S4 - Is there a rationale for why it is possible to make rsx mutants in E. coli, but not in V. cholerae? For example, does E. coli have a second gene/protein that is redundant in function to rsxA, while V. cholerae does not? I think your data hint at this, since in the right panel growth data, your double mutant does not fully rescue back to rsx single mutant levels, suggesting another factor in tgt mutant also acts to lower resistance to TOB. If so, perhaps a line or two in text will be helpful for readers.

      This point raised by the referee is an interesting one that we have also asked ourselves on multiple occasions. In fact, the Rsx operon is linked with oxidative stress and respiration. Vibrio cholerae and E. coli show differences in genes involved in these pathways. V. cholerae lacks the cyo/nuo respiratory complex genes, and does not encode a Suf operon. Moreover, deletion of the anaerobic respiration Frd pathway leads to a strong decrease of V. cholerae growth even in aerobic conditions (10.1128/spectrum.01730-23). We have previously also generally seen differences between the 2 species in response to stress (10.1128/AAC.01549-10) and the way they deal with ROS (10.1371/journal.pgen.1003421). Therefore, we think that the fact that rsx is essential in V. cholerae and not E. coli could either be due to the presence of an additional redundant pathway in E. coli, as suggested by the referee, or to more general differences in respiration and treatment of ROS. We thank the referee for highlighting this and we have now included a comment about this in the manuscript.

      - For growth curves in Figure 2 and relative comparisons like in Figure 5D and Figure S4 (and others in the paper), statistics and error bars, along with replicate information should be provided.

      We had mentioned this in the methods section; we have now also added the specific information to the figure legends.

      - Figure 6A - Is the transcript fold change in linear or log? If linear, then tgt expression should not be classified as being upregulated in TOB. It is barely up by ~2-fold with TOB- 0.6....which is a mild phenotype, at best.

      We think that a 2-fold change of tgt expression can be sufficient to lead to changes in tRNA modification levels. We agree that this is a mild induction; we have thus changed “increase” to “mildly increase” in the results.

      - Line 779- 780: "This indicates that sub-MIC TOB possibly induces tgt expression through the stringent response activation." To me, the data presented in this figure, do not support this statement. The experiment is indirect.

      We agree. We rephrased: “Tobramycin may induce tgt expression through stringent response activation or through an independent pathway.”

      - Figure 3B and D. - These samples only have tobramycin, correct? The legend says both carbenicillin and tobramycin.

      The legend is correct, samples also have carbenicillin because we are testing here the growth with 2 synonymous beta-lactamase genes in presence of beta-lactams.

      - Figure 5. The color schemes in bars do not match up with the color scheme in cartoons below panels B and C. That makes it confusing to read. Please fix.

      Fixed.

      - A lot of abbreviations have been used. This makes reading a bit cumbersome. Ideally, less abbreviations will be used.

      Fixed

      Reviewer #2 (Public Review):

      Fruchard et al. investigate the role of the queuosine (Q) modification of the tRNA (Q-tRNA) in the human pathogen Vibrio cholerae. First, the authors state that the absence of Q-modified tRNAs (tgt mutant) increases the translation of TAT codons and proteins with a high TAT codon bias. Second, the absence of Q increases rsxA translation, because rsxA gene has a high TAT codon bias. Third, increased RsxA in the absence of Q inhibits SoxR response, reducing resistance towards the antibiotic tobramycin (TOB). Authors also predict in silico which genes harbor a higher TAT bias and found that among them are some involved in DNA repair, experimentally observing that a tgt mutant is more resistant to UV than the wt strain. It is worth noting that authors employ a wide variety of techniques, both experimental and bioinformatic. However, some aspects of the work need to be clarified or reevaluated.

      (1) The statement that the absence of Q increases the translation of TAT codons and proteins encoded by TAT-enriched genes presents the following problems that should be addressed:

      (1.1) The increase in TAT codon translation in the absence of Q is not supported by proteomics, since there was no detected statistical difference for TAT codon usage in proteins differentially expressed. Furthermore, there are some problems regarding the statistics of proteomics. Some proteins shown in Table S1 have adjusted p-values higher than their p-values, which makes no sense. Maybe there is a mistake in the adjusted p-value calculation.

      We appreciate the reviewer’s thorough examination of our findings. In our study, we employed an adaptive Benjamini-Hochberg (BH) procedure to control the false discovery rate in our list of selected proteins, as explained in the Data Analysis part of the Proteomics MS and analysis section of our Materials and Methods. The classical BH procedure (10.1111/j.2517-6161.1995.tb02031.x) calculates the adjusted p-value for the i-th ranked p-value as $\min_{j \ge i} \left( m \times p_{(j)} / j \right)$, where $p_{(j)}$ is the j-th ranked p-value and $m$ is the number of tests (e.g. number of proteins) (see 10.1021/acs.jproteome.7b00170 for details). Since $m/j \ge 1$ and $p_{(j)} \ge p_{(i)}$ for $j \ge i$, it follows that $m \times p_{(j)} / j \ge p_{(i)}$ for $j \ge i$, resulting in adjusted p-values being higher than or equal to the original p-values. Therefore, contrary to the reviewer's comment, it is a mathematical property that the adjusted p-value is greater than or equal to the original p-value when using the classical Benjamini-Hochberg procedure.

      However, we want to underline that we used an “adaptive” BH procedure, which calculates the adjusted p-value for the i-th ranked p-value as $\min_{j \ge i} \left( \pi_0 \times m \times p_{(j)} / j \right)$, where $\pi_0$ is an estimate of the proportion of true null hypotheses (see 10.1021/acs.jproteome.7b00170 for details). Indeed, the classical BH procedure makes the assumption that $\pi_0 = 1$, which is a strong assumption in an MS-based proteomics context. Consequently, the mathematical property that the adjusted p-value is greater than the original p-value does not always hold true in our approach (it also depends on the $\pi_0$ parameter).
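
      The difference between the two procedures can be illustrated with a short sketch. The function name `bh_adjust` and the example values are hypothetical, and the value of `pi0` is simply supplied by hand here; how the proportion of true nulls is estimated in the actual analysis is not reproduced.

```python
def bh_adjust(pvalues, pi0=1.0):
    """Return (adaptive) Benjamini-Hochberg adjusted p-values in input order.

    pi0=1.0 gives the classical procedure; pi0 < 1 gives the adaptive variant,
    where pi0 is an externally estimated proportion of true null hypotheses.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted p-values are capped at 1
    # Walk from the largest to the smallest p-value so that running_min
    # holds the minimum of pi0 * m * p_(j) / j over all ranks j >= i.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = pi0 * m * pvalues[idx] / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.02, 0.03, 0.04, 0.05]
classical = bh_adjust(raw)          # never below the raw p-values
adaptive = bh_adjust(raw, pi0=0.5)  # can fall below the raw p-values
```

      With these illustrative inputs, every classical adjusted value is at least as large as its raw p-value, whereas with `pi0=0.5` the three largest raw p-values receive adjusted values below the raw ones.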

      In addition, it is not common to assume that proteins that are quantitatively present in one condition and absent in another are differentially abundant proteins. Proteomics data software typically addresses this issue and applies some corrections. It would be advisable to review that.

      We thank the reviewer for highlighting this point. Indeed, some software imputes a random small value to replace missing values and then produces statistics based on these imputed data (10.1038/nmeth.3901). However, the validity and relevance of generating statistics in the absence of actual data are questionable.

      There are no universally accepted guidelines for handling this situation, and we believe it is more logical to set these values aside as potential interesting proteins. It is well-established that intensity values are often missing due to the detection limits of the spectrometer, suggesting that the missing values observed in several replicates of a condition are actually due to low values (see 10.1093/bioinformatics/btp362 and 10.1093/bioinformatics/bts193 for instance). It is thus logical to consider the associated proteins as potentially differentially abundant when comparing their complete absence in all replicates of one condition to their presence in several replicates of another condition.

      (1.2) Problems with the interpretation of Ribo-seq data (Figure 4D). On the one hand, the Ribo-seq data should be corrected (normalized) with the RNA-seq data in each of the conditions to obtain ribosome profiling data, since some genes could have more transcription in some of the conditions studied. In other articles in which this technique is used (such as in Tuorto et al., EMBO J. 2018; doi: 10.15252/embj.201899777), it is interpreted that those positions in which the ribosome moves most slowly (and which are therefore less efficiently translated) are the most abundant. Assuming this interpretation, according to the hypothesis proposed in this work, the fragments enriched in TAT codons should have been less abundant in the absence of Q-tRNA (tgt mutant) in the Ribo-seq experiment. However, what is observed is that TAT-enriched fragments are more abundant in the tgt mutant, and yet the Ribo-seq results are interpreted as RNA-seq, stating that this is because the genes corresponding to those sequences have greater expression in the absence of Q.

      As recommended by the reviewer, we normalized the RiboSeq data with the RNAseq data to account for potential RNA variations. The updated Figure 4 demonstrates that this normalization does not alter our findings, confirming that variations at the RNAseq level do not contradict changes at the translational level. 
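
      One common way to express this kind of normalization is as a change in translation efficiency, that is, ribosome footprint signal divided by mRNA signal, compared between strains. The sketch below is only an illustration under stated assumptions: the function name `log2_te_change`, the pseudocount, and the premise that counts are already library-size normalized are not taken from the manuscript, and the exact normalization used for the updated Figure 4 may differ.

```python
import math

def log2_te_change(rpf_mut, rna_mut, rpf_wt, rna_wt, pseudo=1.0):
    """log2 change in translation efficiency (footprints / mRNA), mutant vs WT.

    All inputs are assumed to be library-size-normalized counts for one gene.
    The pseudocount guards against division by zero for lowly expressed genes.
    """
    te_mut = (rpf_mut + pseudo) / (rna_mut + pseudo)
    te_wt = (rpf_wt + pseudo) / (rna_wt + pseudo)
    return math.log2(te_mut / te_wt)
```

      A gene whose footprints and mRNA both double in the mutant gives a value near 0 (a transcriptional change only), whereas a gene whose footprints double at constant mRNA gives a value near 1.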

      The reviewer's observation that pauses at TAT codons would lead to ribosome accumulation and subsequent categorization as "up" genes is accurate. We must emphasize, however, that this category of “up genes” is probably quite diverse. The effect of ribosome stalling at TAT codons on total mRNA ribosome occupancy is likely highly variable, depending on the location of the TAT codon(s) within the CDS and the gene's expression level. We therefore think that genes in the "Up" category mainly correspond to genes that are more translated because the impact of pausing at TAT codons is probably not strong enough. Note that unlike what is usually done in bacterial riboseq experiments, we did not use any antibiotics to artificially freeze the ribosomes.

      On the other hand, it would be interesting to calculate the mean of the protein levels encoded by the transcripts with high and low ribosome profiling data.

      While this is a common request, we believe that comparing RiboSeq and proteomics data is not particularly informative. RiboSeq data directly measures translation, while proteomics provides information about protein abundance at steady state, reflecting the balance between protein synthesis and degradation. Furthermore, the number of proteins detectable by mass spectrometry is significantly smaller than the number of genes quantified by RiboSeq. Given these factors, there is often a low correlation between translation and protein abundance, making a direct comparison less relevant.

      (1.3) This statement is contrary to most previously reported studies on this topic in eukaryotes and bacteria, in which ribosome profiling experiments, among others, indicate that translation of TAT codons is slower (or unaffected) than translation of the TAC codons, and the same phenomenon is observed for the rest of the NAC/T codons. This is completely opposed to the results shown in Figure 4. However, the results of these studies are either not mentioned or not discussed in this work. Some examples of articles that should be discussed in this work:

      - "Queuosine-modified tRNAs confer nutritional control of protein translation" (Tuorto et al., 2018; 10.15252/embj.201899777)

      - "Preferential import of queuosine-modified tRNAs into Trypanosoma brucei mitochondrion is critical for organellar protein synthesis" (Kulkarni et al., 2021; doi:10.1093/nar/gkab567)

      - "Queuosine-tRNA promotes sex-dependent learning and memory formation by maintaining codon-biased translation elongation speed" (Cirzi et al., 2023; 10.15252/embj.2022112507)

      - "Glycosylated queuosines in tRNAs optimize translational rate and post-embryonic growth" (Zhao et al., 2023; 10.1016/j.cell.2023.10.026)

      - "tRNA queuosine modification is involved in biofilm formation and virulence in bacteria" (Diaz-Rullo and Gonzalez-Pastor, 2023; doi: 10.1093/nar/gkad667). In this work, the authors indicate that Q-tRNA increases NAT codon translation in most bacterial species. Could the regulation of TAT codon-enriched proteins by Q-tRNAs in V. cholerae be an exception? In addition, authors use a bioinformatic method to identify genes enriched in NAT codons similar to the one used in this work, and to find in which biological processes the genes whose expression is affected by Q-tRNAs are involved (as discussed for the phenotype of UV resistance). It will be worth discussing all of this.

      Thank you for detailed suggestions, we agree that this discussion was missing and this comment gives us a chance to address that in the revised version of the manuscript.

      Of the references suggested above by the referee, 4 were not mentioned in our manuscript; they were published while our manuscript was in review, and we realize that we had not cited them in the latest version. We thank the referee for highlighting this. We have now included a discussion of these studies.

      We included the following in the discussion:

      “However, the opposite codon preference was shown in E. coli {Diaz-Rullo, 2023 #1888}. In eukaryotes also, several recent studies indicate slower translation of U-ending codons in the absence of Q34 {Cirzi, 2023 #1887;Kulkarni, 2021 #1886;Tuorto, 2018 #1268}. It’s important to note here, that in V. cholerae ∆tgt, increased decoding of U-ending codons is observed only with tyrosine, and not with the other three NAC/U codons (Histidine, Aspartate, Asparagine). This is interesting because it suggests that what we observe with tyrosine may not adhere to a general rule about the decoding efficiency of U- or C-ending codons, but instead seems to be specific to Tyr tRNAs, at least in the context of V. cholerae. Exceptions may also exist in other organisms. For example, in human cells, queuosine increases efficiency of decoding for U- ending codons and slows decoding of C- ending codons except for AAC {Zhao, 2023 #1889}. In this case, the exception is for tRNA Asparagine. Moreover, in mammalian cells {Tuorto, 2018 #1268}, ribosome pausing at U-ending codons is strongly seen for Asp, His and Asn, but less with Tyr. In Trypanosoma {Kulkarni, 2021 #1886}, reporters with a combination of the 4 NAC/NAU codons for Asp, Asn, Tyr, His have been tested, showing slow translation at U- ending version of the reporter in the absence of Q, but the effect on individual codons (e.g. Tyr only) is not tested. In mice {Cirzi, 2023 #1887}, ribosome slowdown is seen for the Asn, Asp, His U-ending codons but not for the Tyr U-ending codon. In summary, Q generally increases efficiency of U- ending codons in multiple organisms, but there appears to be additional unknown parameters which affect tyrosine UAU decoding, at least in V. cholerae. Additional factors such as mRNA secondary structures or mistranslation may also contribute to the better translation of UAU versions of tested genes. Mistranslation could be an important factor. 
      If codon decoding fidelity impacts decoding speed, then mistranslation could also contribute to decoding efficiency of Tyr UAU/UAC codons and proteome composition.”

      (1.4) It is proposed that the stress produced by the TOB antibiotic causes greater translation of genes enriched in TAT codons. 

      Actually, it’s the opposite because in presence of TOB, in the wt, tgt would be induced leading to more Q on tRNA-Tyr and less translation of TAT.

      On the one hand, it is shown that the GFP-TAT version (gene enriched in TAT codons) and the RsxATAT-GFP protein (native gene naturally enriched in TAT) are expressed more, compared to their versions enriched in TAC in a tgt mutant than in a wt, in the presence of TOB (Fig. 5C).

      Figure 5C shows relative fluorescence, i.e. changes of fluorescence in delta-tgt compared to WT. So it is not necessarily more expressed, but “more increased”.

      However, in the absence of TOB, and in a wt context, although the two versions of GFP have a similar expression level (Fig. S3D), the same does not occur with RsxA, whose RsxA-TAT form (the native one) is expressed significantly more than the RsxA-TAC version (Fig. S3A). How can it be explained that in a wt context, in which there is also tRNA Q-modification, a gene naturally enriched in TAT is translated better than the same gene enriched in TAC?

      We thank the referee for this question based on careful assessment of our data. We agree, there appears to be significantly more RsxA-TAT in WT than RsxA-TAC. This could be due to other effects such as secondary structure formation on mRNA when the wt RsxA is recoded with TAC codons. This does not hinder the conclusion that the translation of the TAT version is increased in delta-tgt compared to WT.  

      It would be expected that in the presence of Q-tRNAs the two versions would be translated equally (as happens with GFP) or even the TAT version would be less translated. On the other hand, in the presence of TOB the fluorescence of WT GFP(TAT) is higher than the fluorescence of WT GFP(TAC) (Figure S3E) (mean fluorescence data for RsxA-GFP version in the presence of TOB is not shown). These results may indicate that the apparent better translation of TAT versions could be due to indirect effects rather than from TAT codon translation.

      This is now mentioned in the manuscript:

      “We cannot exclude, however, that additional factors such as mRNA secondary structures also contribute to the better translation of UAU versions of tested genes.”

      (2) Another problem is related to the already known role of Q in prevention of stop codon readthrough, which is not discussed at all in the work. In the absence of Q, stop codon readthrough is increased. In addition, it is known that aminoglycosides (such as tobramycin) also increase stop codon readthrough ("Stop codon context influences genome-wide stimulation of termination codon readthrough by aminoglycosides"; Wanger and Green, 2023; 10.7554/eLife.52611). Absence of Q and presence of aminoglycosides can be synergistic, producing devastating increases in stop codon readthrough and a large alteration of global gene expression. All of this needs to be discussed in the work. Moreover, it is known that stop codon readthrough can alter gene expression, and that mRNA sequence context influences the likelihood of stop codon readthrough. Thus, this process could also affect the expression of recoded GFP and RsxA versions.

      We included the following in the revised version of the manuscript (results):

      “Q modification impacts decoding fidelity in V. cholerae.

      To test whether a defect in Q34 modification influences the fidelity of translation in the presence and absence of tobramycin, previously developed reporter tools were used (Fabret & Namy, 2021) to measure stop codon readthrough in V. cholerae ∆tgt and wild-type strains. The system consists of vectors containing readthrough-promoting signals inserted between the lacZ and luc sequences, encoding β-galactosidase and luciferase, respectively. Luciferase activity reflects the readthrough efficiency, while β-galactosidase activity serves as an internal control of expression level, integrating a number of possible sources of variability (plasmid copy number, transcriptional activity, mRNA stability, and translation rate). We found increased readthrough at stop codons UAA and to a lesser extent at UAG for ∆tgt, and this increase was amplified for UAG in presence of tobramycin (Fig. S2, stop readthrough). In the case of UAA, tobramycin appears to decrease readthrough; this may be artefactual, due to the toxic effect of tobramycin on ∆tgt.

      Mistranslation at specific codons can also impact protein synthesis. To further investigate mistranslation levels by tRNATyr in WT and ∆tgt, we designed a set of gfp mutants where the codon for the catalytic tyrosine required for fluorescence (TAT at position 66) was substituted by near-cognate codons (Fig. S2). Results suggest that in this sequence context, particularly in the presence of tobramycin, non-modified tRNATyr mistakenly decodes Asp GAC, His CAC and also Ser UCC, Ala GCU, Gly GGU, Leu CUU and Val GUC codons, suggesting that Q34 increases the fidelity of tRNATyr.

      In parallel, we replaced Tyr103 of the β-lactamase described above with Asp codons GAT or GAC. The expression of the resulting mutant β-lactamase is expected to yield a carbenicillin-sensitive phenotype. In this system, increased tyrosine misincorporation (more mistakes) by tRNATyr at the mutated Asp codon will lead to increased synthesis of active β-lactamase, which can be evaluated by carbenicillin tolerance tests. As such, amino-acid misincorporation leads here to phenotypic (transient) tolerance, while genetic reversion mutations result in resistance (growth on carbenicillin). The rationale is summarized in Fig. 3C. When the Tyr103 codon was replaced with either Asp codon, we observe increased β-lactamase tolerance (Fig. 3D, left), suggesting increased misincorporation of tyrosine by tRNATyr at Asp codons in the absence of Q, again suggesting that Q34 prevents misdecoding of Asp codons by tRNATyr.

      In order to test any effect on an additional tRNA modified by Tgt, namely tRNAAsp, we mutated the Asp129 (GAT) codon of the β-lactamase. When Asp129 was mutated to Tyr TAT (Fig. 3D, right), we observe reduced tolerance in ∆tgt, but not when it was mutated to Tyr TAC, suggesting less misincorporation of aspartate by tRNAAsp at the Tyr UAU codon in the absence of Q. In summary, absence of Q34 increases misdecoding by tRNATyr at Asp codons, but decreases misdecoding by tRNAAsp at Tyr UAU. 

      This supports the fact that tRNA Q34 modification is involved in translation fidelity during antibiotic stress, and that the effects can be different on different tRNAs, e.g. tRNATyr and tRNAAsp tested here.”

      Added figures: Figure S2, Figure 3CD
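
      The dual-reporter readout described in the quoted passage above can be sketched as a simple ratio calculation. This is an illustration only: the function name `readthrough_percent` is hypothetical, and normalizing to an in-frame control construct (one without a stop codon between lacZ and luc) is an assumption of this sketch rather than a detail stated in the response.

```python
def readthrough_percent(luc, bgal, luc_ctrl, bgal_ctrl):
    """Readthrough efficiency (%) from a lacZ-luc dual reporter.

    luc, bgal: luciferase and beta-galactosidase activities of the construct
        carrying the stop codon between lacZ and luc.
    luc_ctrl, bgal_ctrl: the same activities for an assumed in-frame control
        construct without a stop codon, measured under the same condition.
    Dividing luciferase by beta-galactosidase corrects for differences in
    expression level before the two constructs are compared.
    """
    test_ratio = luc / bgal
    control_ratio = luc_ctrl / bgal_ctrl
    return 100.0 * test_ratio / control_ratio
```

      Because each strain and condition has its own β-galactosidase reference, a change in overall expression caused by tobramycin does not by itself change the computed efficiency.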

      (3) The statement about that the TOB resistance depends on RsxA translation, which is related to the presence of Q, also presents some problems:

      (3.1) It is observed that the absence of tgt produces a growth defect in V. cholerae when exposed to TOB (Figure 1A), and it is stated that this is mediated by an increase in the translation of RsxA, because its gene is TAT enriched. However, in Figure S4F, it is shown that the same phenotype is observed in E. coli, but its rsxA gene is not enriched in TAT codons. Therefore, the growth defect observed in the tgt mutant in the presence of TOB may not be due to the increase in the translation of TAT codons of the rsxA gene in the absence of Q. This phenotype is very interesting, but it may be related to another molecular process regulated by Q. Maybe the role of Q in preventing stop codon readthrough is important in this process, reducing cellular stress in the presence of TOB and growing better.

      Fig. S4F (now figure 5D) shows that rsxA can be toxic during growth in the presence of tobramycin, but it does not show that rsxA translation is increased in E. coli in delta-tgt. However, we agree with the referee that there are probably additional processes regulated by Q which are also involved in the response to TOB stress. We had already mentioned this briefly in the discussion (“Note that our results do not exclude the involvement of additional Q-regulated MoTTs in the response to sub-MIC TOB, since Q modification leads to reprogramming of the whole proteome.”), and we have now discussed it further as follows:

      “As a consequence, transcripts with tyrosine codon usage bias are differentially translated. One such transcript codes for RsxA, an anti-SoxR factor. SoxR controls a regulon involved in oxidative stress response and sub-MIC aminoglycosides trigger oxidative stress in V. cholerae{Baharoglu, 2013 #720}, pointing to an involvement of oxidative stress response in the response to sub-MIC tobramycin stress.

      A link between Q34 and oxidative stress has also been previously found in eukaryotic organisms {Nagaraja, 2021 #1466}. Note that our results do not exclude the involvement of additional Q-regulated translation of other transcripts in the response to tobramycin. Q34 modification leads to reprogramming of the whole proteome, not only for other transcripts with codon usage bias, but also through an impact on the levels of stop codon readthrough and mistranslation at specific codons, as supported by our data.”

      (3.2) All experiments related to the effect of Q on the translation of TAT codons have been performed with the tgt mutant strain. Considering that the authors have a pSEVA-tgt plasmid to overexpress this gene, they would have to show whether tgt overexpression in a wt strain produces a decrease in the translation of proteins encoded by TAT-enriched genes such as RsxA. This experiment would allow them to conclude that Q reduces RsxA levels, increasing resistance to TOB.

      We agree that this would be interesting to test. However, as can be seen in figure 1B, delta-tgt pSEVA-tgt (complemented strain) grows better than WT pSEVA-tgt (tgt overexpression). In fact, overexpression of tgt negatively impacts cell growth and yields smaller colonies, especially when cells carry a second plasmid (e.g., with gfp constructs). We have also seen this with other RNA modification gene overexpressions in the lab (unpublished). We believe that the expression of tgt is tuned, and since overexpression affects fitness, it is generally difficult to conduct experiments with overexpression plasmids for RNA modifications. Nevertheless, we have done the experiment (with slow growing bacteria), and when we normalize expression of gfp in the presence of the tgt-overexpressing plasmid to the condition with no plasmid, we see little (1.5 fold) or no effect of tgt overexpression on fluorescence (see graph below). This is probably due to a toxic effect of overexpression, and we do not believe these results are biologically relevant.

      Author response image 1.

      (3.3) On the other hand, Fig. 1B shows that when the wt and tgt strains compete, both overexpressing tgt, the tgt mutant strain grows better in the presence of TOB. This result is not very well understood, since according to the hypothesis proposed, the absence of modification by Q of the tRNA would increase the translation of genes enriched in TAT, therefore, a strain with a higher proportion of Q-modified tRNAs as in the case of the wt strain overexpressing tgt would express the rsxA gene less than the tgt strain overexpressing tgt and would therefore grow better in the presence of TOB. For all these reasons, it would be necessary to evaluate the effect of tgt overexpression on the translation of RsxA.

      See our answer above about negative effect of tgt overexpression.

      (3.4) According to Figure 1I, the overexpression of tRNA-Tyr(GUA) caused better growth of the tgt mutant in comparison to WT. If the growth defect observed in the tgt mutant in the presence of TOB is due to better translation of the TAT codons of the rsxA gene, the overexpression of tRNA-Tyr(GUA) in the tgt mutant should have resulted in even better RsxA translation and worse growth, not the opposite result.

      We agree; we think that rsxA is not the only factor responsible for the growth defect of tgt in the presence of TOB (as now further discussed in the discussion). Overexpression of tRNAtyr possibly changes the equilibrium between the decoding of TAC vs TAT and may restore translation of TAC enriched genes. As also suggested by Reviewer 3, we have measured decoding reporters for TAT/TAC while overexpressing tRNA-Tyr. This is now added to the results in fig S2C and the following:

      “We also tested decoding reporters for TAT/TAC in WT and ∆tgt overexpressing tRNATyr in trans (Fig. S1C). The presence of the plasmid (empty p0) amplified differences between the two strains with decreased decoding of TAC (and increased TAT, as expected) in ∆tgt compared to WT. Overexpression of tRNATyrGUA did not significantly impact decoding of TAT and increased decoding of TAC, as expected. Since overexpression of tRNATyrGUA rescues ∆tgt in tobramycin (Fig. 1I) and facilitates TAC decoding, this suggests that issues with TAC codon decoding contribute to the fitness defect observed in ∆tgt upon growth with tobramycin. Overexpression of tRNATyrAUA increased decoding of TAT in WT but did not change it in ∆tgt where it is already high. Unexpectedly, overexpression of tRNATyrAUA also increased decoding of TAC in WT. Thus, overexpression of tRNATyrAUA possibly changes the equilibrium between the decoding of TAC vs TAT and may restore translation of TAC enriched transcripts.” 

      Added figure: figure S1C

      (4) It cannot be stated that DNA repair is more efficient in the tgt mutant of V. cholerae, as indicated in the text of the article and in Fig 7. The authors only observe that the tgt mutant is more resistant to UV radiation and it is suggested that the reason may be TAT bias of DNA repair genes. To validate the hypothesis that UV resistance is increased because DNA repair genes are TAT biased, it would be necessary to check if DNA repair is affected by Q. UV not only produces DNA damage, but also oxidative stress. Therefore, maybe this phenotype is due to the increase in proteins related to oxidative stress controlled by RsxA, such as the superoxide dismutase encoded by sodA. It is also stated that these repair genes were found up for the tgt mutant in the Ribo-seq data, with unchanged transcription levels. Again, it is necessary to clarify this interpretation of the Ribo-seq data, since the fact that they are more represented in a tgt mutant perhaps means that translation is slower in those transcripts. Has it been observed in proteomics (wt vs tgt in the absence of TOB) whether these proteins involved in repair are more expressed in a tgt mutant?

      We agree that our results do not directly show that DNA repair is more efficient, but rather that delta-tgt responds better to UV. This has been modified in the manuscript. Regarding oxidative stress, we did not see a better or worse response of delta-tgt to H2O2. Moreover, since we see a better response of delta-tgt to UV only in V. cholerae and not in E. coli, we did not favor the hypothesis of a response to oxidative stress. In proteomics, we do not detect changes for DNA repair genes except for RuvA, which is more abundant in delta-tgt. We have toned down the statement about DNA repair in the paper.

      (5) The authors demonstrate that in E. coli the tgt mutant does not show greater resistance to UV radiation (Fig. 7D), unlike what happens in V. cholerae. It should be discussed that in previous works it has been observed that overexpression in E. coli of the tgt gene or the queF gene (Q biosynthesis) is involved in greater resistance to UV radiation (Morgante et al., Environ Microbiol, 2015 doi: 10.1111/1462-2920.12505; and Díaz-Rullo et al., Front Microbiol. 2021 doi: 10.3389/fmicb.2021.723874). As an explanation, it was proposed (Diaz-Rullo and Gonzalez-Pastor, NAR 2023 doi: 10.1093/nar/gkad667) that the observed increase in the capacity to form biofilms in strains that overexpress genes related to Q modification of tRNA would be related to this greater resistance to UV radiation.

      We now mention the previous observations suggesting a link between tgt and UV. We thank the referee for the references, which we had overlooked. Note that in the case of our experiments, all cultures are in planktonic form and are not allowed to form biofilms. We thus prefer not to discuss biofilm-linked processes in this study.

      Reviewer #3 (Public Review):

      Summary:

      In this manuscript the authors begin with the interesting phenotype of sub-inhibitory concentrations of the aminoglycoside tobramycin proving toxic to a knockout of the tRNA-guanine transglycosylase (Tgt) of the important human pathogen, Vibrio cholerae. Tgt is important for incorporating queuosine (Q) in place of guanosine at the wobble position of GUN anticodons. The authors go on to define a mechanism of action where environmental stressors control expression of tgt to control translational decoding of particularly tyrosine codons, skewing the balance from TAC towards TAT decoding in the absence of the enzyme. The authors use advanced proteomics and ribosome profiling to reveal that the loss of tgt results in increased translation of proteins like RsxA and a cohort of DNA repair factors, whose genes harbor an excess of TAT codons in many cases. These findings are bolstered by a series of molecular reporters, mass spectrometry, and tRNA overexpression strains to provide support for a model where Tgt serves as a molecular pivot point to reprogram translational output in response to stress.

      Strengths:

      The manuscript has many strengths. The authors use a variety of strains, assays, and advanced techniques to discover a mechanism of action for Tgt in mediating tolerance to sub-inhibitory concentrations of tobramycin. They observe a clear phenotype for a tRNA modification in facilitating reprogramming of the translational response, and the manuscript certainly has value in defining how microbes tolerate antibiotics.

      We thank the referee for their time and comments. 

      Weaknesses:

      The conclusions of the manuscript are mostly very well-supported by the data, but in some places control experiments or peripheral findings cloud precise conclusions. Some additional clarification, discussion, or even experimental extension could be useful in strengthening these areas.

      (1) The authors have created and used a variety of relevant molecular tools. In some cases, using these tools in additional assays as controls would be helpful. For example, testing for compensation of the observed phenotypes by overexpression of the Tyrosine tRNA(GUA) in Figure 2A with the 6xTAT strain, Figure 5C with the rsxA-GFP fusion, and/or Figure 7B with UV stress would provide additional information on the ability of tRNA overexpression to compensate for the defect in these situations.

      Thank you for the suggestions. Since overexpression of tRNA tyr is not expected to decrease decoding of TAT, we do not necessarily expect any effect for UV and rsxA expression. Overexpression of tRNA_GUA restores fitness of delta-tgt in TOB, but this is probably independent of RsxA. As ref2 also suggested above, we included in the discussion that the effect seen in delta-tgt with TOB is not only due to RsxA expression but also additional processes. However, these suggestions are interesting and we performed the following experiments in order to have an answer for these questions: 

      - “testing for compensation of the observed phenotypes by overexpression of the Tyrosine tRNA(GUA) in Figure 2A with the 6xTAT strain”: 

      This is now included in figure S2C and results as follows: 

      “We also tested decoding reporters for TAT/TAC in WT and ∆tgt overexpressing tRNA-Tyr in trans (Fig. S1C). The presence of the plasmid amplified differences between the two strains with decreased decoding of TAC (and increased TAT, as expected) in ∆tgt with empty plasmid compared to WT. Overexpression of tRNA_TyrGUA did not significantly impact decoding of TAT and increased decoding of TAC as expected. Since overexpression of tRNA_TyrGUA rescues ∆tgt in tobramycin (Fig. 1I) and facilitates TAC decoding, this suggests that issues with TAC codon decoding contribute to the fitness defect observed in ∆tgt upon growth with tobramycin. Overexpression of tRNA_TyrAUA increased decoding of TAT in WT but did not change it in ∆tgt where it is already high. Interestingly, overexpression of TyrAUA also increased decoding of TAC in WT. Thus, overexpression of tRNA_TyrAUA possibly changes the equilibrium between the decoding of TAC vs TAT and may restore translation of TAC enriched transcripts.”

      - Figure 5C with the rsxA-GFP fusion:

      When we overexpress tRNA_GUA, rsxA fluorescence is 2-fold higher in delta-tgt compared to wt. However, the fluorescence is highly decreased compared to the condition with no tRNA overexpression. While we are not sure whether this apparent decrease is a technical issue or not (e.g. due to the presence of additional plasmid), we prefer not to further explore this in this manuscript. Note that we could not obtain delta-tgt strain carrying both plasmids expressing tRNA_GUA and rsxA, suggesting toxic overproduction of rsxA in this context.

      Author response image 2.

      - Figure 7B with UV stress: 

      Here again, delta-tgt overexpressing tRNA_GUA is still more UV resistant than WT overexpressing tRNA_GUA.

      Author response image 3.

      (2) The authors present a clear story with a reprogramming towards TAT codons in the knockout strain, particularly regarding tobramycin treatment. The control experiments often hint at other codons also contributing to the observed phenotypes (e.g., His or Asp), yet these effects are mostly ignored in the discussion. It would be helpful to discuss these findings at a minimum in the discussion section, or possibly experimentally address the role of His or Asp by overexpression of these tRNAs together with Tyrosine tRNA(GUA) in an experiment like that of Figure 1I to see if a more "wild type" phenotype would present. In fact, the synergy of Tyr, His, and/or Asp codons likely helps to explain the effects observed with the DNA repair genes in later experiments.

      We thank the referee for the suggestion. We agree that there could be synergies between these codons, which is probably why the proteomics data do not clearly reflect tyrosine codon usage bias. This is now further discussed in the ideas and speculation section.

      Moreover, we have added Figure S3G and the following result:

      “Since not all TAT biased proteins are found to be enriched in ∆tgt proteomics data, the sequence context surrounding TAT codons could affect their decoding. To illustrate this, we inserted after the gfp start codon, various tyrosine containing sequences displayed by rsxA (Fig. S3G). The native tyrosines were all TAT codons; our synthetic constructs were either TAT or TAC, while keeping the remaining sequence unchanged. We observe that the production of GFP carrying the TEYTATLLL sequence from RsxA is increased in Δtgt compared to WT, while it is unchanged with TEYTACLLL. However, production of the GFP with the sequences LYTATRLL/LYTACRLL and EYTATLR/EYTACLR was unaffected (or even decreased for the latter) by the absence of tgt. Overall, our results demonstrate that RsxA is upregulated in the ∆tgt strain at the translational level, and that proteins with a codon usage bias towards tyrosine TAT are prone to be more efficiently translated in the absence of Q modification, but this is also dependent on the sequence context.”
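      The notion of a tyrosine codon usage bias used in this exchange can be made concrete with a small sketch. The function name, the example sequence, and the definition of bias as the simple in-frame fraction of TAT among all tyrosine codons are assumptions made for illustration only; the statistic the authors actually used is not specified in this response.

```python
def tyrosine_codon_usage(cds):
    """Count in-frame TAT and TAC codons in a coding sequence.

    Hypothetical helper: 'bias' here is just TAT / (TAT + TAC),
    which may differ from the measure used in the manuscript.
    """
    seq = cds.upper().replace("U", "T")  # accept RNA or DNA input
    usable = len(seq) - len(seq) % 3     # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    tat = codons.count("TAT")
    tac = codons.count("TAC")
    total = tat + tac
    fraction = tat / total if total else None
    return {"TAT": tat, "TAC": tac, "TAT_fraction": fraction}


# Made-up toy sequence: ATG TAT TAC TAT GAA TAA
example = tyrosine_codon_usage("ATGTATTACTATGAATAA")
```

      Counting in frame matters here: a TAT that spans two adjacent codons is not a tyrosine codon and is not counted.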

      (3) Regarding Figure 6D, the APB northern blot feels like an afterthought. It was loaded with different amounts of RNA as input and some samples are repeated three times, but Δcrp only once. Collectively, it makes this experiment very difficult to assess.

      A different amount of RNA was used only for ∆tgt in which we have only one band because of the absence of modification. For all the other conditions, the same amount of RNA was used (0.9 µg). Additional replicates of crp were in an additional gel but only a representative gel was shown in the manuscript. This is now specified in the legend.

      We also attach below the picture of the gel with total RNA (SYBR Gold labelling of total RNA), where it can be seen that the lanes contain an equivalent quantity of RNA, except for ∆tgt.

      Author response image 4.

      Minor Points:

      (3) Fig S2B, do the authors have a hypothesis why the Asp and Phe tRNAs lead to a growth decrease in the untreated samples? It appears like Phe(GAA) partially compensates for the defect.

      Yes, we agree. Unfortunately, at this stage we do not have a satisfactory answer. This would be interesting to study further, but it is beyond the scope of the present study.

      (5) Lines 655 to 660 seem more appropriate as speculation in the discussion rather than as a conclusion in the results, where no direct experiments are performed. The authors might take advantage of the "Ideas and Speculation" section that eLife allows.

      Thank you very much for this suggestion, we added this section to the manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Minor.

      - Figure 6 - Fonts on several mutants is different size/type. fixed

      - What is the Pm promoter. Please expand and give enough details so reader can follow. Especially as it is less used in V. cholerae (typical being pBAD or pTAC promoters). done

      - Spacing where references are inserted should be checked. done

      - Line 860-863 - "V. cholerae's response to sub-MIC antibiotic stress is transposable to other Gram-negative pathogens". This reads awkward. Consider rephrasing. done

      - Figure 7 - Text in A and C is very small and is very hard to read. Font for tgt is different.

      Fixed. Tgt is in italics.

      Reviewer #2 (Recommendations For The Authors):

      As specified in the public review, more evidence would be necessary to affirm that tRNAs not modified by Q have a greater preference for translating TAT codons, since there are several previous studies in which it is shown that Q-tRNAs have a greater preference for NAT codons (including TAT). For example, it is suggested to explore what happens with other recoded genes (enriched in TAT or TAC) if there is a high level of Q-tRNAs (overexpression of tgt in a wt context). It is also necessary to clarify how to interpret the Ribo-seq results, which apparently is different from how they have been interpreted in other studies.

      Please see above our responses and changes made to the manuscript.

      Minor corrections

      In Figure 8, replace "Epitranscriptomic adapation to stress" with "Epitranscriptomic adaptation to stress".

      Fixed, thank you for noticing!

      Reviewer #3 (Recommendations For The Authors):

      (1) Lines 48-50, and 110 to 112, the authors have a nice mechanism and story, yet the lines mentioned feel very qualified (e.g., "possibly", "plausibly") and lead to the abstract hiding the value and major conclusions of the study. The authors could consider to revise or even remove these lines to focus on the take-home message in the abstract and end of introduction/discussion. 

      Thank you for this comment, we modified the text.  

      (2) Additional description for the samples in the results section for Figure 1 would be helpful to the reader.

      Done

      (3) Figure S1, the line of experiments with rluF is interesting, but in the end the choice seems a little random. Have the authors assessed knockouts of other modifications on the ASL for effects? Since the modification is not well characterized in V. cholerae according to the authors, it might make sense to save this for a future paper.

      We removed S1, as we agree that this experiment does not really add something to the paper.

      (4) Line 334 and 353 are redundant.

      Fixed

      (5) It is likely beyond the scope of the study, but it would strengthen the paper to repeat Figure 3 with His and/or Asp based on the findings of 2C and 4E to better understand the contribution of His and Asp to Q biology.

      We repeated figure 3 with Asp. Based on Fig 2C (less efficient decoding of GAC in delta-tgt in TOB) and 4E (positive GAT codon bias in proteins up in riboseq in delta-tgt TOB), we would expect that beta-lactamase with Asp GAC would be less efficiently decoded than GAT in delta-tgt.

      This was added to the manuscript

      “Like Tyr103, Asp129 was shown to be important for resistance to β-lactams (Doucet et al., 2004; Escobar et al., 1994; Jacob et al., 1990). When we replaced the native Asp129 GAT with the synonymous codon Asp129 GAC, the GAC version did not appear to produce functional β-lactamase in ∆tgt (Fig. 3B), suggesting increased mistranslation or inefficient decoding of the GAC codon by tRNAAsp in the absence of Q. Decoding of GAT codon was also affected in ∆tgt in the presence of tobramycin.”

      Added figure: Figure 3B

      (6) The authors could consider replacing 5D with S4A-D, which is easier to understand in our opinion.

      Done

    1. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      This useful study integrates experimental methods from materials science with psychophysical methods to investigate how frictional stabilities influence tactile surface discrimination. The authors argue that force fluctuations arising from transitions between frictional sliding conditions facilitate the discrimination of surfaces with similar friction coefficients. However, the reliance on friction data obtained from an artificial finger, together with the ambiguous correlative analyses relating these measurements to human psychophysics, renders the findings incomplete.

      Our main goal with this paper was to show that the most common metric, i.e. average friction coefficient—widely used in tactile perception and device design – is fundamentally unsound, and to offer a secondary parameter that is compatible with the fact that human motion is unconstrained, leading to dynamic interfacial mechanics.

      We understand the Reviewers wanted us to demonstrate, through biomechanical measurements, that humans use instabilities. This is seemingly reasonable, but in individual responses, we explain the significant challenges and fundamental unknowns in those experiments. We believe this paper is an important step towards approaching this problem. At the same time, we have made several changes in the discussion, conclusion, and title to clarify that our study is correlative between mechanical characterization and human testing.

      In short, there are still several fundamental unknowns that prevented us from basing the study around biomechanical measurements: (1) a decision-making model would need to be created, but it is unknown whether tactile decision making follows other models; (2) it is further unknown what constitutes “tactile evidence”, though at our manuscript’s conclusion, we propose that friction instabilities are better suited to be tactile evidence than the averaging of friction coefficients from a narrow range of human exploration; (3) in the design of samples, from a friction mechanics and materials perspective, it is not possible at this point to pre-program surfaces a priori to deliver friction instabilities, which must instead be determined experimentally, especially when attempting to achieve this in controlled surfaces that do not create other overriding tactile cues, like macroscopic bumps or large differences in surface roughness. (4) Given that the basis for tactile percepts, like which object feels “rougher” or “smoother”, is not sufficiently established, it is necessary to use a 3-alternative forced choice task, which avoids asking participants to judge objects along a preset perceptual dimension, a challenge recognized by Reviewer 3. However, this would bring in issues of memory in the decision-making model. (5) The prior points are compounded by the fact that, we believe, tactile exploration must be performed in an unconstrained manner, i.e., without an apparatus generating motion onto a stationary finger. Work by Liu et al. (IEEE ToH, 2024) showed that recreating friction obtained during free exploration onto a stationary finger was uninterpretable by the participants, hinting at the importance of efference copies.[1] We believe that resolving many of the above-mentioned issues would constitute a significant advance in knowledge and would require discussion and dissemination with the community.

      Our changes to the manuscript

      Page 1 & SI Page 1, Title

      “Alternatives to Friction Coefficient: Fine Touch Perception Correlates with Frictional Instabilities”

      Reviewer 1 (Public review):

      Summary:

      In this paper, Derkaloustian et al. look at the important topic of what affects fine touch perception. The observations that there may be some level of correlation with instabilities are intriguing. They attempted to characterize different materials by counting the frequency (occurrence #, not of vibration) of instabilities at various speeds and forces of a PDMS slab pulled lengthwise over the material. They then had humans make the same vertical motion to discriminate between these samples. They correlated the % correct in discrimination with differences in frequency of steady sliding over the design space as well as other traditional parameters such as friction coefficient and roughness. The authors pose an interesting hypothesis and make an interesting observation about the occurrences of instability regimes in different materials while in contact with PDMS, which is interesting for the community to see in the publication. It should be noted that the finger is complex, however, and there are many factors that may be quite oversimplified with the use of the PDMS finger, and the consideration and discounting of other parameters are not fully discussed in the main text or SI. Most importantly, however, the conclusions as stated do not align with the primary summary of the data in Figure 2.

      Strengths:

      The strength of this paper is in its intriguing hypothesis and important observation that instabilities may contribute to what humans are detecting as differences in these apparently similar samples.

      We thank Reviewer 1 for their time on the manuscript, recognizing the approach we took, and offering constructive feedback. We believe that our conclusions, in fact, are supported by the primary summary of the data in Fig. 2 but we believe that our use of R<sup>2</sup> could have led to misinterpretation. The trend with friction coefficient and percent correct was indeed statistically significant but was spurious because the slope was negative. In the revision, we add clarifying comments throughout, change from R<sup>2</sup> to r as to highlight the negative trend, and adjust the figures to better focus on friction coefficient.

      Finally, we added a new section to discuss the tradeoffs between using a real human finger versus a mock finger, and which situations may warrant the use of one or the other. In short, for our goal of characterizing surfaces to be used in tactile experiments, we believe a mock finger is more sustainable and practical than using real humans because human fingers are unique per participant, humans move their fingers at constantly changing pressures and velocities, and friction generated during free human exploration cannot be satisfactorily replicated by moving a sample onto a stationary finger. However, we do not disagree that for other types of experiments, characterizing a human participant directly may be more advantageous.

      Weaknesses:

      Comment 1

      The most important weakness is that the findings do not support the statements of findings made in the abstract. Of specific note in this regard is the primary correlation in Figure 2B between SS (steady sliding) and percent correct discrimination. While the statistical test shows significance (and is interesting!), the R-squared value is 0.38, while the R-squared value for the "Friction Coefficient vs. Percent Correct" plot has an R-squared of 0.6 and a p-value of < 0.01 (including Figure 2B). This suggests that the results do not support the claim in the abstract: "We found that participant accuracy in tactile discrimination was most strongly correlated with formations of steady sliding, and response times were negatively correlated with stiction spikes. Conversely, traditional metrics like surface roughness or average friction coefficient did not predict tactile discriminability."

      We disagree that the trend with friction coefficient suggests the results do not support the claim because the correlation was found to be negative. However, we could have made the comparison more apparent and expanded on this point, given its novelty.

      While the R<sup>2</sup> value corresponding to the “Friction Coefficient vs. Percent Correct” plot is notably higher, our results show that the slope is negative, which would be statistically spurious. This is because a negative correlation between percent correct (accuracy in discriminating surfaces) and difference in friction coefficient means that the more similar two surfaces are (by friction coefficient), the easier it would be for people to tell them apart. That is, it incorrectly concludes that two identical surfaces would be much easier to tell apart than two surfaces with greatly different friction coefficients.

      This is counterintuitive to nearly all existing results, but we believe our samples were well-positioned to uncover this trend because they minimize variability by controlling multiple physical parameters. The friction coefficient, typically calculated in the field as an average, ignores all the dynamic changes in forces present in elastic systems undergoing mesoscale friction, i.e., human touch, as seen in Fig. 1 in a mock finger and Fig. 3 in a real finger. We believe that demonstrating this statistically spurious trend strongly supports our premise that an alternative to friction coefficient is needed in the design of tactile psychophysics and haptic interfaces.

      We believe that this could have been misinterpreted, so we took several steps to improve clarity, given the importance of this finding: we moved the friction coefficient plot to its own panel, we changed from R<sup>2</sup> to r throughout, and we added clarifying text. We also added a small section focusing on this spurious trend.
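      The reason for reporting r rather than R<sup>2</sup> can be illustrated with a minimal sketch: squaring the correlation coefficient discards the sign of the trend. The numbers below are invented for illustration and are not data from the study, and a plain Pearson correlation is used as a stand-in for the GLMM fits described in the manuscript.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values: difference in average friction coefficient between
# two surfaces, and the fraction of correct discriminations for that pair.
delta_mu = [0.1, 0.2, 0.3, 0.4, 0.5]
accuracy = [0.80, 0.72, 0.70, 0.61, 0.55]

r = pearson_r(delta_mu, accuracy)
r_squared = r ** 2  # always non-negative, so the direction of the trend is lost

print(f"r = {r:.2f}, R^2 = {r_squared:.2f}")
```

      In this toy case r is negative while R<sup>2</sup> is positive, so a reader shown only R<sup>2</sup> could not tell that accuracy falls as the friction coefficient difference grows.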

      Our changes to the manuscript

      Page 1, Abstract

      “In fact, the typical method of averaging friction coefficients led to a spurious correlation which erroneously suggests that distinct objects should feel identical and identical objects should feel distinct.”

      Page 7

      “As Fig. 1 was constructed from friction measurements, we can also calculate an average friction coefficient, µ, by averaging the friction coefficient obtained at each of the 16 combinations of masses and velocities (Table 1). This calculation is a standard approach in tactile studies for summarizing friction measurements, or in some cases, surfaces are never characterized at multiple masses and velocities. However, summarizing friction data in this manner has been considered as conceptually questionable by others from a mechanics perspective.[3] Fig. 1 shows that the type of instabilities and friction forces encountered on a single surface can vary widely depending on the conditions. As a result, large variations in the friction coefficient are expected, depending on the mass and velocity — even though measurements originate from the same surface. This variability in friction coefficient can be seen with the large interquartile range of friction coefficients, which shows that the variation in friction coefficient across a single surface is similar, or even larger, than the differences in average friction coefficient across two different surfaces. The observation that friction coefficients vary so widely on a single surface calls into question the approach of analyzing how humans may perceive two different objects based on their average friction coefficients.”
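      The comparison described in this passage, between the spread of friction coefficients within one surface and the difference in averages between two surfaces, can be sketched as follows. All values are invented for illustration, and the 16 entries per surface simply stand in for the 16 mass and velocity combinations mentioned in the text; none of this reproduces the authors' measurements or Table 1.

```python
from statistics import mean, quantiles


def summarize_friction(mu_values):
    """Return the mean and interquartile range of friction coefficients
    measured on one surface across all tested conditions."""
    q1, _, q3 = quantiles(mu_values, n=4)  # quartile cut points
    return {"mean": mean(mu_values), "iqr": q3 - q1}


# Hypothetical friction coefficients, one per mass/velocity combination.
surface_a = [0.4, 0.5, 0.6, 0.7,
             0.5, 0.6, 0.7, 0.8,
             0.6, 0.7, 0.8, 0.9,
             0.7, 0.8, 0.9, 1.0]
surface_b = [mu + 0.05 for mu in surface_a]

summary_a = summarize_friction(surface_a)
summary_b = summarize_friction(surface_b)
mean_difference = abs(summary_a["mean"] - summary_b["mean"])

# True when the within-surface spread exceeds the between-surface difference.
spread_exceeds_difference = summary_a["iqr"] > mean_difference
```

      When the within-surface interquartile range exceeds the between-surface difference in means, as in this made-up case, the two averages alone say little about how the surfaces would differ under any particular exploration condition.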

      Page 9, Fig. 2 Caption

      “D) GLMM of accuracy vs. difference in average friction coefficient, showing a negative correlation. E) GLMMs of accuracy vs. other commonly used material properties or parameters: ΔAverage roughness R<sub>a</sub>, ΔHurst exponent H, and ΔWater contact angle hysteresis (º) (N = 10 participants, n = 600 total trials).”

      Page 9

      “Considering all instabilities individually, we found that only steady sliding was a positive, statistically significant predictor (r = 0.62, p < 0.05, shown in Fig. 2B).”

      Page 10

      “To compare the value of looking at frictional instabilities, we also performed GLMM fits on common approaches in the field, like a friction coefficient or material property typically used in tactile discrimination, shown in Fig. 2D-E. Interestingly, in Fig. 2D, we observed a spurious, negative correlation between friction coefficient (typically and often problematically simplified as across all tested conditions) and accuracy (r = -0.64, p < 0.01); that is, the more different the surfaces are by friction coefficient, the less people can tell them apart. This spurious correlation would be the opposite of intuition, and further calls into question the common practice of using friction coefficients in touch-related studies. Interestingly, this spurious correlation was also found by Gueorguiev et al.[21] The alternative, two-term model which includes adhesive contact area for friction coefficient[32] was even less predictive (see Fig. S6A of SI). We believe such a correlation could not have been uncovered previously as our samples are minimal in their physical variations. Yet, the dynamic changes in force even within a single sample are not considered, despite being a key feature of mesoscale friction during human touch.

      We investigate different material properties in Fig. 2E. Differences in average roughness R<sub>a</sub> (or other parameters, like root mean square roughness R<sub>rms</sub> (Fig. S6A of SI)) did not show a statistically significant correlation to accuracy. Though roughness is a popular parameter, correlating any roughness parameter to human performance here could be moot: the limit of detecting roughness differences has previously been defined as 13 nm on structured surfaces[36] and much higher for randomly rough surfaces,[49] all of which are magnitudes larger than the roughness differences between our surfaces. The differences in contact angle hysteresis – as an approximation of the adhesion contributions[50] – do not present any statistically significant effects on performance.”

      Page 11-12

      “Despite the correlative nature of this study, we still obtained high correlations compared to existing biomechanical studies[4,19,21], which we speculate is because instabilities are an important predictive phenomenon for models of human touch. We believe that biomechanical studies, including more sophisticated techniques, like spatially resolved force maps from digital image correlation,[5,42] may yield stronger correlations and results if they analyze data based on instabilities.”

      Added References

      (2) Khamis, H. et al. Friction sensing mechanisms for perception and motor control: passive touch without sliding may not provide perceivable frictional information. J. Neurophysiol. 125, 809– 823 (2021).

      (6) Olczak, D., Sukumar, V. & Pruszynski, J. A. Edge orientation perception during active touch. J. Neurophysiol. 120, 2423–2429 (2018).

      Comment 2, Part 1

      Along the same lines, other parameters that were considered such as the "Percent Correct vs. Difference in Sp" and "Percent Correct vs. Difference in SFW" were not plotted for consideration in the SI. It would be helpful to compare these results with the other three metrics in order to fully understand the relationships.

      We have added these plots to the SI. We note that we had checked these relationships and discussed them briefly, but did not include the plot. The plots show that the type of instability was not as helpful as its presence or absence.

      Our changes to the manuscript

      Page 9

      “Furthermore, a model accounting for slow frictional waves alone specifically shows a significant, negative effect on performance (p < 0.01, Fig. S5 of SI), suggesting that in these samples and task, the type of instability was not as important.”

      “Fig. S5. GLMM fits of participant accuracy vs. the differences in instability incidence for individual instability types. Left: accuracy vs. differences in formation of slow frictional waves (SFW) between pairs. P1 and P5 have the same x-axis value and are shifted for clarity. Right: accuracy vs. differences in formation of stiction spikes (Sp).”

      SI Page 4

      “and no correlation between accuracy and stiction spikes (Fig. S5).”

      Comment 2, Part 2

      Other parameters such as stiction magnitude and differences in friction coefficient over the test space could also be important and interesting.

      We agree these are interesting and have thought about them. We are aware that others, like Gueorguiev et al., have studied stiction magnitudes, and though there was a correlation, the physical differences in surface roughness (glass versus PMMA) investigated made it unclear if these could be generalized further.[3] We are unsure how to proceed here with a satisfactory analysis of stiction magnitude, given that stiction spikes are not always generated. In fact, Fig. 1 shows that for many velocities and pressures, stiction spikes are not formed. In ongoing work, however, we are always cognizant that if stiction spikes are a dominant factor, then a secondary analysis on their magnitude would be important. We offer some speculation on why stiction spikes may be overrepresented in the literature:

      (1) They are prone to being created if the finger was loaded for a long time onto a surface prior to movement, thus creating adhesion by contact aging which is unlike active human exploration. We avoid this by discarding the first pull in our measurements, which is a standard practice in mechanical characterization if contact aging needs to be avoided.

      (2) The ranges of velocities and pressures explored by others were small.

      (3) In an effort to generate strong tactile stimuli, highly adhesive or rough surfaces are used.

      (4) Stiction spikes are visually distinctive on a plot, but we are unaware of any mechanistic reason that mechanoreceptors would be particularly sensitive to this low frequency event over other signals.

      We interpret “difference in friction coefficient over the test space” to be, for a single surface, like C4, to find the highest average friction for a condition of single velocity and mass and subtract that from the lowest average friction for a condition of single velocity and mass. We calculated the difference in friction coefficient in the typical manner of the field, by averaging all data collected at all velocities and masses and assigning a single value for all of a surface, like C4. We had performed this, and have the data, but we are wary of overinterpreting secondary and tertiary metrics because they do not have any fundamental basis in traditional tribology, and this value, if used by humans, would suggest that they rapidly explore a large parameter space to find a “maximum” and “minimum” friction. Furthermore, the range in friction across the test space, after averaging, can be smaller than the range of friction experienced at different masses and velocities on a single surface. We have tabulated and newly included these values (the interquartile range of friction coefficients of different masses and velocities per surface) in Table 1.

      Fig. 2D shows a GLMM fit between percent correct responses across our pairs and the differences in friction coefficient for each pair, where we see a spurious negative correlation. As we had the data of all average friction coefficients for each condition for a given material, we also looked at the difference in maximum and minimum friction coefficients. For our tested pairs, these differences also lined up on a statistically significant, negative GLMM fit (r = -0.86, p < 0.005). However, the values for a given surface can vary drastically, with an interquartile range of 1.20 to 2.09 on a single surface. We fit participant accuracy to the differences in these IQRs across pairs. This also led to a negative GLMM fit (r = -0.65, p < 0.05). However, we are hesitant to add this plot to the manuscript for the reasons stated previously.
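      The summary metrics discussed above (the average friction coefficient, the max-minus-min range, and the interquartile range across conditions) can be illustrated with a minimal sketch. The function names and the friction values below are hypothetical placeholders with no physical meaning; they are not data from the study, and the quartile convention used by `statistics.quantiles` is only one of several possible choices.

```python
from itertools import product
from statistics import mean, quantiles


def summarize_surface(mu_by_condition):
    """Collapse per-condition friction coefficients into single-number summaries."""
    values = list(mu_by_condition.values())
    q1, _, q3 = quantiles(values, n=4)  # quartile convention is a choice
    return {
        "mean": mean(values),                # the "average friction coefficient"
        "range": max(values) - min(values),  # max minus min over the test space
        "iqr": q3 - q1,                      # spread across masses and velocities
    }


def pair_difference(summary_a, summary_b, key):
    """Absolute difference between two surfaces for one summary metric."""
    return abs(summary_a[key] - summary_b[key])


# Hypothetical 4 x 4 grid of conditions (added mass in g, velocity in mm/s).
conditions = list(product([0, 25, 75, 100], [5, 10, 25, 45]))

# Made-up friction coefficients for two surfaces; illustration only.
surface_a = {(m, v): 1.0 + 0.01 * m - 0.005 * v for m, v in conditions}
surface_b = {(m, v): 1.1 + 0.01 * m - 0.005 * v for m, v in conditions}

summary_a = summarize_surface(surface_a)
summary_b = summarize_surface(surface_b)

print("difference in means:", pair_difference(summary_a, summary_b, "mean"))
print("range within surface A:", summary_a["range"])
```

      In this toy example the spread within one surface exceeds the difference in means between the two surfaces, which is the situation that makes a single averaged coefficient hard to interpret.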

      Comment 3, Part 1

      Beyond this fundamental concern, there is a weakness in the representativeness of the PDMS finger, the vertical motion, and the speed of sliding to real human exploration.

      Overall, this is a continuous debate that we think offers two solutions, and we are not advocating for an “either-or” case. There is always a tradeoff between using a synthetic model of a finger versus a real human finger, and there is a place for both models. That is, while our mock finger will be “better” the more similar it is to a human finger, it is not our goal to fully replace a human finger. Rather our goal is to provide a consistent method of characterizing surfaces that is sufficiently similar to human touch as to be a useful and predictive tool.

      The usefulness of the mock finger is in isolating the features of each surface that are independent of human variability, i.e., instabilities that form without changing loading conditions between sliding motions or even within one sliding motion. Of course, with this method, we still require confirmation that these features also form during human exploration, which we show in Fig. 3. We believe that this method of characterizing surfaces at the mesoscale will ultimately lead to more successful human studies on tactile perception. Currently, and as shown in the paper, characterizing surfaces through traditional techniques, such as a commercial tribometer (friction coefficient, using a steel or hard metal ball), roughness (via atomic force microscopy or some other metrology), or surface energy, is less predictive or not predictive at all. Thus, we believe this mock finger is better than the current state of the art in characterizing surfaces (we are also aware of a commercial mock finger company, but we were unable to purchase or obtain an evaluation model).

      One of the main – and severe – limitations of using a human finger is that all fingers are different, meaning any study focusing on a particular user may not apply to others or be recreated easily by other researchers. We do not think it is feasible to set a standard for replication around a real human finger as that participant may no longer be available, or willing to travel the world as a “standard”. Furthermore, the method in which a person changes their pressures and velocities is different. We note that this is a challenge unique to touch perception – how an object is touched changes the friction generated, and thus the tactile stimulus generated, whereas a standardized stimulus is more straightforward for light or sound.

      However, we do emphasize that we have strongly considered the balance between feasibility and ecological validity in the design of a mock finger. We have a mock finger, with the three components of stiffness of a human finger (more below). Furthermore, we have also successfully used this mock finger in correlations with human psychophysics in previous work, where findings from our mechanical experiments were more predictive of human performance[4–7] than other available methods.

      Our changes to the manuscript Added (Page 2-3)

      “Mock finger as a characterization tool

      We use a mechanical setup with a PDMS (poly(dimethylsiloxane)) mock finger to derive tactile predictors as opposed to direct biomechanical measurements on human participants. While there is a tradeoff in selecting a synthetic finger over a real human finger to model human touch, human fingers themselves are also highly variable[23] both in their physical shape and their use during human motion. Our goal is to design a consistent method of characterization of samples that can be easily accessed by other researchers and does not rely on a standard established around a single human participant. We believe that sufficient replication of surface, bulk properties, and contact geometry results in characterization that isolates consistent features of surfaces that are not derived from human-to-human variability. We have used this approach to successfully correlate human results with mock finger characterization previously.[8,9,24]

      The major component of a human finger, by volume, is soft tissue (~56%),[25] resulting in an effective modulus close to 100 kPa.[26,27] In order to achieve this same softness, we crosslink PDMS in a 1×1×5 cm mold at a 30:1 elastomer:crosslinker ratio. In addition, two more features in the human finger impart significant mechanical differences. Human fingers have a bone at the fingertip, the distal phalanx,[26–28, 8–10] which we mimic with an acrylic “bone” within our PDMS network. The stratum corneum, the stiffer, glassier outer layer of skin,[29] is replicated with the surface of the mock finger glassified, or further crosslinked, after 8 hours of UV-Ozone treatment.[30] This treatment also modifies the surface properties of the native PDMS to align with those of a human finger more closely: it minimizes the viscoelastic tack at the surface, resulting in a comparable non-sticky surface. Stabilizing one day after treatment, the mock finger surface obtains a moderate hydrophilicity (~60º), as is typically observed for a real finger.[11,31]

      The initial contact area formed before a friction trace is collected is a rectangle of 1×1 cm. While this shape is not entirely representative of a human finger with curves and ridges, human fingers flatten out enough to reduce the effects of curvature with even very light pressures.[31–33] This implies that for most realistic finger pressures, the contact area is largely load-independent, which is more accurately replicated with a rectangular mock finger.

      Lastly, we consider the role of fingerprint ridges. A key finding of our previous work is that while fingerprints enhanced frictional dynamics at certain conditions, key features were still maintained with a flat finger.[11] Furthermore, for some loading conditions, the more amplified signals could also result in more similar friction traces for different surfaces. We have observed good agreement between these friction traces and human experiments.[8,9,22,34]”

      Page 3-4, Materials and Methods

      “Mock Finger Preparation

      Friction forces across all six surfaces were measured using a custom apparatus with a polydimethylsiloxane (PDMS, Dow Sylgard 184) mock finger that mimics a human finger’s mechanical properties and contact mechanics while exploring a surface relatively closely.[8,9] PDMS and crosslinker were combined in a 30:1 ratio to achieve a stiffness of 100 kPa comparable to a real finger, then degassed in a vacuum desiccator for 30 minutes. We are aware that the manufacturer recommended crosslinking ratio for Sylgard 184 is 10:1 due to potential uncrosslinked liquid residues,[35] but further crosslinking concentrated at the surface prevents this. The prepared PDMS was then poured into a 1×1×5 cm mold also containing an acrylic 3D-printed “bone” to attach applied masses on top of the “fingertip” area contacting a surface during friction testing. After crosslinking in the mold at 60ºC for 1 hour, the finger was treated with UV-Ozone for 8 hours out of the mold to minimize viscoelastic tack.

      Mechanical Testing

      A custom device using our PDMS mock finger was used to collect macroscopic friction force traces replicating human exploration.[8,9] After placing a sample surface on a stage, the finger was lowered at a slight angle such that an initial 1×1 cm rectangle of “fingertip” contact area could be established. We considered a broad range of applied masses (M = 0, 25, 75, and 100 g) added onto the deadweight of the finger (6 g) observed during a tactile discrimination task. The other side of the sensor was connected to a motorized stage (V-508 PIMag Precision Linear Stage, Physikinstrumente) to control both displacement (4 mm across all conditions) and sliding velocity (v = 5, 10, 25, and 45 mm s<sup>-1</sup>). Forces were measured at all 16 combinations of mass and velocity via a 250 g Futek force sensor (k = 13.9 kN m<sup>-1</sup>) threaded to the bone, and recorded at an average sampling rate of 550 Hz with a Keithley 7510 DMM digitized multimeter. Force traces were collected in sets of 4 slides, discarding the first due to contact aging. Because some mass-velocity combinations were near the boundaries of instability phase transitions, not all force traces at these given conditions exhibited similar profiles. Thus, three sets were collected on fresh spots for each condition to observe enough occurrences of multiple instabilities, at a total of nine traces per combination for each surface.”
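      The test matrix described in the protocol above can be enumerated with a short sketch. The masses, velocities, deadweight, and set sizes are taken from the text; the variable and function names are assumptions introduced here for illustration only.

```python
from itertools import product

# Values taken from the protocol described above.
ADDED_MASSES_G = [0, 25, 75, 100]   # masses added on top of the finger
FINGER_DEADWEIGHT_G = 6             # deadweight of the mock finger
VELOCITIES_MM_S = [5, 10, 25, 45]   # sliding velocities
SETS_PER_CONDITION = 3              # each set collected on a fresh spot
SLIDES_PER_SET = 4                  # first slide discarded (contact aging)


def build_test_matrix():
    """Return one record per mass-velocity combination for a single surface."""
    kept_per_set = SLIDES_PER_SET - 1  # drop the first slide of every set
    return [
        {
            "added_mass_g": mass,
            "total_mass_g": mass + FINGER_DEADWEIGHT_G,
            "velocity_mm_s": velocity,
            "retained_traces": SETS_PER_CONDITION * kept_per_set,
        }
        for mass, velocity in product(ADDED_MASSES_G, VELOCITIES_MM_S)
    ]


matrix = build_test_matrix()
print(len(matrix))                   # 16 conditions per surface
print(matrix[0]["retained_traces"])  # 9 traces kept per condition
```

      This reproduces the counts stated in the Methods: 16 mass-velocity combinations, each with nine retained traces per surface.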

      Added References

      (23) Infante, V. H. P. et al. The role of skin hydration, skin deformability, and age in tactile friction and perception of materials. Sci. Rep. 15, 9935 (2025).

      (24) Nolin, A., Lo, C.-Y., Kayser, L. V. & Dhong, C. B. Transparent and Electrically Switchable Thin Film Tactile Actuators Based on Molecular Orientation. Preprint at https://doi.org/10.48550/arXiv.2411.07968 (2024).

      (25) Murai, M., Lau, H.-K., Pereira, B. P. & Pho, R. W. H. A cadaver study on volume and surface area of the fingertip. J. Hand Surg. 22, 935–941 (1997).

      (26) Abdouni, A. et al. Biophysical properties of the human finger for touch comprehension: influences of ageing and gender. R. Soc. Open Sci. (2017) doi:10.1098/rsos.170321.

      (27) Cornuault, P.-H., Carpentier, L., Bueno, M.-A., Cote, J.-M. & Monteil, G. Influence of physico-chemical, mechanical and morphological fingerpad properties on the frictional distinction of sticky/slippery surfaces. J. R. Soc. Interface (2015) doi:10.1098/rsif.2015.0495.

      (28) Qian, K. et al. Mechanical properties vary for different regions of the finger extensor apparatus. J. Biomech. 47, 3094–3099 (2014).

      (29) Yuan, Y. & Verma, R. Measuring microelastic properties of stratum corneum. Colloids Surf. B Biointerfaces 48, 6–12 (2006).

      (30) Fu, Y.-J. et al. Effect of UV-Ozone Treatment on Poly(dimethylsiloxane) Membranes: Surface Characterization and Gas Separation Performance. Langmuir 26, 4392–4399 (2010).

      Comment 3, Part 2

      The real finger has multiple layers with different moduli. In fact, the stratum corneum cells, which are the outer layer at the interface and determine the friction, have a much higher modulus than PDMS.

      We have approximated the softness of the finger with 100 kPa crosslinked PDMS, which is close to what has been reported for the bulk of a human fingertip.[9,10] However, as mentioned in the Materials and Methods, there are two additional features of the mock finger that impart greater strength. The PDMS surrounds a rigid, acrylic bone comparable to the distal phalanx, which provides an additional layer of higher modulus.[8] Additionally, the 8-hour UV-Ozone treatment decreases the viscoelastic tack of the pristine PDMS by glassifying, or further crosslinking the surface of the finger,[12] therefore imparting greater stiffness at the surface similar to the contributions of the stratum corneum, along with a similar surface energy.[13] This technique is widely used in wearables,[14] soft robotics,[15] and microfluidics[16] to induce both these material changes. Additionally, the finger is used at least a day after UV-Ozone treatment is completed to generate a stable surface that is moderately hydrophilic, similar to the outermost layer of human skin.[17]

      Comment 3, Part 3

      In addition, the slanted position of the finger can cause non-uniform pressures across the finger. Both can contribute to making the PDMS finger have much more stick-slip than a real finger.

      To ensure that there is minimal contribution from the slanted position of the finger, an initial contact area of 1×1 cm is established before sliding and recording friction measurements. As the PDMS finger is a soft object, the portion in contact with a surface flattens and the contact area remains largely unchanged during sliding. Any additional stick-slip after this alignment step is caused by contact aging at the interface, but the first trace we collect is always discarded to only consider stick-slip events caused by surface chemistry. We recognize that it is difficult to completely control the pressure distribution due to the planar interface, but this is also expected when humans freely explore a surface.

      Comment 3, Part 4

      In fact, if you look at the regime maps, there is very little space that has steady sliding. This does not represent well human exploration of surfaces. We do not tend to use a force and velocity that will cause extensive stick-slip (frequent regions of 100% stick-slip) and, in fact, the speeds used in the study are on the slow side, which also contributes to more stick-slip. At higher speeds and lower forces, all of the materials had steady sliding regions.

      We are not aware of published studies that extensively show that humans avoid stick-slip regimes. In fact, we are familiar with literature where stiction spike formation is suppressed – a recent paper by AliAbbasi, Basdogan et al. investigates electroadhesion and friction with NaCl solution-infused interfaces, resulting in significantly steadier forces.[18] We also directly showed evidence of instability formation that we observed during human exploration in Fig. 3B-C. These dynamic events are common, despite the lack of control of normal forces and sliding velocities. We also note that Reviewer 1, Comment 2, Part 2 was suggesting that we further explore possible trends from parameterizing the stiction spike.

      We note that many studies have not reached the velocities and masses required for stiction spikes, even though these masses and velocities would be routinely seen in free exploration; this is usually due to constraints of their equipment.[19] Sliding events during human free exploration of surfaces can exceed 100 mm/s for rapid touches. However, for the surfaces investigated here, we observe that large regions of stick-slip can emerge at velocities as low as 5 mm/s depending on the applied load. The incidence of steady sliding appears more dependent on the applied mass, with almost no steady sliding observed at or above 75 g. Indeed, the force categorization along our transition zones is the main point of the paper.

      Comment 3, Part 5

      Further, on these very smooth surfaces, the friction and stiction are more complex and cannot dismiss considerations such as finger material property change with sweat pore occlusion and sweat capillary forces. Also, the vertical motion of both the PDMS finger and the instructed human subjects is not the motion that humans typically use to discriminate between surfaces.

      We did not describe the task sufficiently. Humans were only given the instruction to slide their finger along a single axis from top to bottom of a sample, not vertical as in azimuthal to gravity. We have updated our wording in the manuscript to reflect this.

      Page 4

      “Participants could touch for as long as they wanted, but were asked to only use their dominant index fingers along a single axis to better mimic the conditions for instability formation during mechanical testing with the mock finger.”

      Page 11

      “The participant was then asked to explore each sample simultaneously, and ran over each surface in strokes along a single axis until the participant could decide which of the two had “more friction”.”

      Comment 3, Part 6

      Finally, fingerprints may not affect the shape and size of the contact area, but they certainly do affect the dynamic response and detection of vibrations.

      We are aware of the nuance. Our previous work on the role of fingerprints on friction experienced by a PDMS mock finger showed enhanced signals with the incorporation of ridges on the finger and used a rate-and-state model of a heterogenous, elastic body to find corresponding trends (though there is no existing model of friction that can accurately model experiments on mesoscale friction).[11] The key conclusion was that a flat finger still preserved key dynamic features, and the presence of stronger or more vibrations could result in more similar forces for different surfaces depending on the sliding conditions.

      This is also in the context that we are seeking to provide a reasonable and experimentally accessible method to characterize surfaces, which will always improve as we get closer to replicating a true human finger. But our goal here was to replicate the finger sufficiently for use in human studies. We believe the more appropriate metric of success is whether the mock finger is more successful than the traditional characterization experiments it replaces, like friction coefficient, roughness, surface energy, etc.

      Comment 4

      This all leads to the critical question, why are friction, normal force, and velocity not measured during the measured human exploration and in a systematic study using the real human finger? The authors posed an extremely interesting hypothesis that humans may alter their speed to feel the instability transition regions. This is something that could be measured with a real finger but is not likely to be correlated accurately enough to match regime boundaries with such a simplified artificial finger.

      We are excited that our manuscript offers a tractable manner to test the hypothesis that tactile decision-making models use friction instabilities as evidence. However, we lay out the challenges and barriers, and how the scope of this paper will lead us in that direction. We also clarify that our goals are to provide a method to characterize samples to better design tactile interfaces in haptics or in psychophysical experiments and to raise awareness that the common methods of sample characterization in touch, by an average friction coefficient or roughness, are fundamentally unsound. Throughout the paper, we have made changes to reflect that our study, at this point, is only correlative.

      As discussed in the summary, and with additional detail here, to further support our findings through observation on humans would require answering:

      (1) Which one, or which combination, of the multiple swipes that people make is responsible for a tactile decision? (There is a need for a decision-making model)

      (2) Establish what is, or may be, tactile evidence.

      (3) Establish whether tactile decision-making models are similar to or different from existing decision-making models.

      (4) Design a task that does not require the use of subjective tactile descriptors, like “which one feels rougher”, which we have seen causes confusion in participants, which will likely require accounting for memory effects.

      We elaborate these points below:

      To successfully perform this experiment, we note that freely exploring humans make multiple strokes on a surface. Therefore, we would need to construct a decision-making model. It has not yet been demonstrated whether tactile decision making follows visual decision making, but perhaps to start, we can assume it does. Then, in the design of our decision-making paradigm, we immediately run into the problem: What is tactile evidence?

      From Fig. 3C, we already can see that identifying evidence is challenging. Prior to this manuscript, people may have chosen the average force, or the highest force. Or we may choose the average friction force. Then, after deciding on the evidence, we need to find a method to manipulate the evidence, i.e., create samples or a machine that causes high friction, etc. We show that during the course of human touch, due to the dynamic nature of friction, the average can change a large amount and sample design becomes a central barrier to experiments. Others may suggest immobilizing the finger and applying a known force, but given how much friction changes with human exploration, there is no known method to make a machine recreate temporally and spatially varying friction forces during sliding onto a stationary finger. Perhaps most importantly, in addition to mechanical challenges, a study by Liu, Colgate et al. showed that even if they recorded the friction (2D) of a finger exploring a surface and then replicated the same friction forces onto a finger, the participant could not determine which surface the replayed friction force was supposed to represent.[1] This supports that the efference copy is important, that the forces in response to expected motion are important to determine friction. Finally, there is no known method to design instabilities a priori; they must be found through experiments, especially since introducing, say, a bump or a trough would bring in confounding variables to how participants tell surfaces apart.

      Furthermore, even if we had some consistent method to create tactile “evidence”, the paradigm also deserves some consideration. In our experience, the 3-AFC task we perform is important because the vocabulary for touch has not been established. That is, in 3-AFC, by asking to determine which one sample is unlike the others, we do not have to ask the participant questions like “which one is rougher” or “which one has less friction”. In contrast, 2-AFC, which is better for decision-making models because it does not include memory, requires the asking of a perceptual question like: “which one is rougher?”. In our ongoing work, taking two silane coatings, we found that participants could easily identify which surface is unlike the others above chance in a 3-AFC, but participants, even within their own trials, could not consistently identify one silane as perceptually “rougher” by 2-AFC. To us, this calls into question the validity of tactile descriptors, but is beyond the scope of this manuscript.
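      As a rough illustration of what “above chance” means in the two paradigms, the guessing rate is 1/3 in a 3-AFC task and 1/2 in a 2-AFC task. The sketch below computes a one-sided exact binomial tail probability against such a guessing rate. It assumes independent trials, which ignores the per-participant structure that a mixed model accounts for; the function name and the trial counts are hypothetical and are not taken from the study.

```python
from math import comb


def binomial_tail(successes, trials, chance):
    """P(X >= successes) for X ~ Binomial(trials, chance)."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical counts: the same number of correct responses is stronger
# evidence against guessing in 3-AFC (chance 1/3) than in 2-AFC (chance 1/2).
correct, total = 30, 60
p_3afc = binomial_tail(correct, total, 1 / 3)
p_2afc = binomial_tail(correct, total, 1 / 2)
print(p_3afc < p_2afc)
```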

      This is not our only goal, but in the context of human exploration, in this manuscript, we believed it was important to identify a mechanical parameter that was consistent with how humans explore surfaces, but was also a parameter that could characterize some consistent property of a surface, irrespective of whether a human was touching it. We thought that designing human decision-making models and paradigms around the friction coefficient would not be successful.

      Given the scope of these challenges, we do not think it would be possible to establish these conceptual sequences in a single manuscript. However, we think that our manuscript brings an important step forward to approach this problem.

      Reviewer 2 (Public review):

      Summary:

      In this paper, the authors want to test the hypothesis that frictional instabilities rather than friction are the main drivers for discriminating flat surfaces of different sub-nanometric roughness profiles.

      They first produced flat surfaces with 6 different coatings giving them unique and various properties in terms of roughness (picometer scale), contact angles (from hydrophilic to hydrophobic), friction coefficient (as measured against a mock finger), and Hurst exponent.

      Then, they used those surfaces in two different experiments. In the first experiment, they used a mock finger (PDMS of 100 kPa molded into a fingertip shape) and slid it over the surfaces at different normal forces and speeds. They categorized the sliding behavior as steady sliding, sticking spikes, and slow frictional waves by visual inspection, and show that the surfaces have different behaviors depending on normal force and speed. In a second experiment, participants (10) were asked to discriminate pairs of those surfaces. It is found that each of those pairs could be reliably discriminated by most participants.

      Finally, the participant's discrimination performance is correlated with differences in the physical attributes observed against the mock finger. The authors found a positive correlation between participants' performances and differences in the count of steady sliding against the mock finger and a negative correlation between participants' reaction time and differences in the count of stiction spikes against the mock finger. They interpret those correlations as evidence that participants use those differences to discriminate the surfaces.

      Strengths:

      The created surfaces are very interesting as they are flat at the nanometer scale, yet have different physical attributes and can be reliably discriminated.

      We thank Reviewer 2 for their notes on our manuscript. The responses below address the reviewer’s comments and recommendations for revised work.

      Weaknesses:

      Comment 1

      In my opinion, the data presented in the paper do not support the conclusions. The conclusions are based on a correlation between results obtained on the mock finger and results obtained with human participants but there is no evidence that the human participants' fingertips will behave similarly to the mock finger during the experiment. Figure 3 gives a hint that the 3 sliding behaviors can be observed in a real finger, but does not prove that the human finger will behave as the mock finger, i.e., there is no evidence that the phase maps in Figure 1C are similar for human fingers and across different people that can have very different stiffness and moisture levels.

      We have made changes throughout the manuscript to acknowledge that our findings are correlative, clarifying this throughout, and incorporating into the discussion how our work may enable biomechanical measurements and tactile decision making models.

      The mechanical characterization conducted with the mock finger seeks to extract significant features of friction traces of a set of surfaces to use as predictors of tactile discriminability. The goal is to find a consistent method to characterize surfaces for use in tactile experiments that can be replicated by others and used prior to any human experiments. However, in the overall response and in a response to a similar comment by Reviewer 1 (recreated below), we also explain why we believe experiments on humans to establish this fact are not yet reasonable.

      First, we discuss the mock finger. The PDMS finger is treated to have comparable surface and bulk properties to a human finger. We have approximated the softness of the finger with 100 kPa crosslinked PDMS, which is close to what has been reported for the bulk of a human fingertip.[9,10] However, as mentioned in the Materials and Methods, there are two additional features of the mock finger that impart greater strength. The PDMS surrounds a rigid, acrylic bone comparable to the distal phalanx, which provides an additional layer of higher modulus.[8] Additionally, the 8-hour UV-Ozone treatment decreases the viscoelastic tack of the pristine PDMS by glassifying, or further crosslinking the surface of the finger,[12] therefore imparting greater stiffness at the surface similar to the contributions of the stratum corneum, along with a similar surface energy.[13] Additionally, the finger is used at least a day after UV-Ozone treatment is completed in order for the surface to return to moderate hydrophilicity, similar to the outermost layer of human skin.[17] We also discuss the shape of the contact formed. To ensure that there is minimal contribution from the slanted position of the finger, an initial contact area of 1×1 cm is established before sliding and recording friction measurements. As the PDMS finger is a soft object, the portion in contact with a surface flattens and the contact area remains largely unchanged during sliding. Any additional stick-slip after this alignment step is caused by contact aging at the interface, but the first trace we collect is always discarded to only consider stick-slip events caused by surface chemistry. We recognize that it is difficult to completely control the pressure distribution due to the planar interface, but this is also expected when humans freely explore a surface. Finally, we consider flat vs. fingerprinted fingers. Our previous work on the role of fingerprints on friction experienced by a PDMS mock finger showed enhanced signals with the incorporation of ridges on the finger and used a rate-and-state model of a heterogeneous, elastic body to find corresponding trends.[11] The key conclusion was that a flat finger still preserved key dynamic features, and the presence of stronger or more vibrations could result in more similar forces for different surfaces depending on the sliding conditions. We note that we have subsequently used this flat mock finger in correlations with human psychophysics in previous work, where findings from our mechanical experiments were predictive of human performance.[4–7] We have added these details to the manuscript.

      With this adequately similar mock finger, we collected friction traces at controlled conditions of normal force and velocity in order to extract the signals unique to each material that are not caused by the influence of human variability. For example, we observe the smallest regions of steady sliding on our phase maps (Fig. 1C) for short-chain alkylsilanes C4 and C5, while the increased intermolecular forces of other silanes increase the incidence of steady sliding. We have also previously shown that comparisons of similarly collected mechanical data are predictive of human performance, using the cross-correlations between signals of two different materials.[4–7] While different participants produce different raw signals, we see that broad categories of stick-slip, i.e. instabilities, can be extracted (Fig. 3B-C) and used as a cue in a tactile discrimination task. As mentioned above, we have provided an additional section about the usefulness of our mock finger, as well as its structure, in the main manuscript.

      Second, we lay out the challenges and barriers to demonstrating this in humans in the manner requested by the reviewer, and how the scope of this paper will lead us in that direction. We also clarify that our goals are to provide a method to characterize samples to better design tactile interfaces in haptics or in psychophysical experiments, and to raise awareness that the common methods of sample characterization in touch, by an average friction coefficient or roughness, are fundamentally unsound.

      As discussed in the summary, and with additional detail here, to further support our findings through observation on humans would require answering:

      (1) Which one, or which combination, of the multiple swipes that people make is responsible for a tactile decision?

      (2) Establish what is, or may be, tactile evidence.

      (3) Establish whether tactile decision-making models are similar to or different from existing decision-making models.

      (4) Test the hypothesis, in these models, that friction instabilities are evidence, and not some other unknown metric.

      (5) Design a task that does not require the use of subjective tactile descriptors, like “which one feels rougher”, which we see cause confusion in participants, which will likely require accounting for memory effects.

      We elaborate on these points below:

      To successfully perform this experiment, we note that freely exploring humans make multiple strokes on a surface. Therefore, we would need to construct a decision-making model. It has not yet been demonstrated whether tactile decision making follows visual decision making, but perhaps to start, we can assume it does. Then, in the design of our decision-making paradigm, we immediately run into the problem: What is tactile evidence?

      From Fig. 3C, we already can see that identifying evidence is challenging. Prior to this manuscript, people may have chosen the average force, or the highest force. Or we may choose the average friction force. Then, after deciding on the evidence, we need to find a method to manipulate the evidence, i.e., create samples or a machine that causes high friction, etc. We show that during the course of human touch, due to the dynamic nature of friction, the average can change a large amount and sample design becomes a central barrier to experiments. Others may suggest immobilizing the finger and applying a known force, but given how much friction changes with human exploration, there is no known method to make a machine recreate temporally and spatially varying friction forces during sliding onto a stationary finger. Finally, perhaps most importantly, in addition to mechanical challenges, a study by Liu, Colgate, et al. showed that even if they recorded the friction (2D) of a finger exploring a surface and then replicated the same friction forces onto a finger, the participant could not determine which surface the replayed friction force was supposed to represent.[1] This supports that the efference copy is important, that the forces in response to expected motion are important to determine friction. Finally, there is no known method to design instabilities a priori. They must be found through experiments, especially since if we were to introduce, say a bump or a trough, then we bring in confounding variables to how participants tell surfaces apart.

      Furthermore, even if we had some consistent method to create tactile “evidence”, the paradigm also deserves some consideration. In our experience, the 3-AFC task we perform is important because the vocabulary for touch has not been established. That is, in 3-AFC, by asking to determine which one sample is unlike the others, we do not have to ask the participant questions like “which one is rougher” or “which one has less friction”. In contrast, 2-AFC, which is better for decision-making models because it does not include memory, requires the asking of a perceptual question like: “which one is rougher?”. In our ongoing work, taking two silane coatings, we found that participants could easily identify which surface is unlike the others above chance in a 3-AFC, but participants, even within their own trials, could not consistently identify one silane as perceptually “rougher” by 2-AFC. To us, this calls into question the validity of tactile descriptors, but is beyond the scope of the current manuscript.

      This is not our only goal, but in the context of human exploration, we believed it was important in this manuscript to identify a mechanical parameter that was consistent with how humans explore surfaces and that could also characterize some consistent property of a surface, irrespective of whether a human was touching it. We thought that designing human decision-making models and paradigms around the friction coefficient would not be successful.

      Given the scope of these challenges, we do not think it would be possible to establish this conceptual sequence in a single manuscript.

      See Reviewer 1, comment 3, part 3 for changes to the manuscript.

      Comment 2

      I believe that the authors collected the contact forces during the psychophysics experiments, so this shortcoming could be solved if the authors use the actual data, and show that the participant responses can be better predicted by the occurrence of frictional instabilities than by the usual metrics on a trial by trial basis, or at least on a subject by subject basis. I.e. Poor performers should show fewer signs of differences in the sliding behaviors than good performers.

      To fully implement this, a decision-making model is necessary because, as a counter example, a participant could have generated 10 swipes of SFW and 1 swipe of a Sp, but the Sp may have been the most important event for making a tactile decision. This type of scenario is not compatible with the analysis suggested — and similar counterpoints can be made for other types of seemingly straightforward analysis.

      While we are interested and actively working on this, the study here is critical to establish types of evidence for a future decision-making model. We know humans change their friction constantly during real exploration, so it is unclear which of these constantly changing values we should input into the decision making model, and the future challenges we anticipate are explained in Weaknesses, Comment 1.

      Comment 3

      The sample size (10) is very small.

      We recognize that, with all factors being equal, this sample size is on the smaller end. However, we emphasize that the degree of control of samples is far above typical, with minimal variations in sample properties such as surface roughness, and every sample for every trial was pristine. Furthermore, the sample preparation (> 300 individual wafers were used) became a factor. Although not typically appropriate, and thus not included in the manuscript, a post-hoc power analysis for the 100 trials of the pair closest to chance, P4 (53% accuracy, with chance at 33%), showed a power of 98.2%, suggesting that the study was appropriately powered.
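      For readers who want to see how a power figure of this kind can be obtained, the following is a minimal sketch using an exact one-sided binomial test. The test type, the significance level of 0.05, and the function names are assumptions made for illustration; the response does not state how the reported 98.2% was computed, so this sketch is not expected to reproduce that number exactly.

```python
from math import comb


def binom_sf(k, n, p):
    """Upper-tail probability P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def one_sided_power(n, p_chance, p_observed, alpha=0.05):
    """Power of an exact one-sided binomial test against chance (assumed test)."""
    # Smallest number of correct trials whose tail probability under chance is <= alpha.
    k_crit = next(k for k in range(n + 2) if binom_sf(k, n, p_chance) <= alpha)
    # Probability of reaching that count if the true accuracy equals p_observed.
    return k_crit, binom_sf(k_crit, n, p_observed)


# Values taken from the response: 100 trials, 3-AFC chance of 1/3, observed accuracy 53%.
k_crit, power = one_sided_power(n=100, p_chance=1 / 3, p_observed=0.53)
print(f"critical count = {k_crit}, power = {power:.3f}")
```

      Under these assumptions the power is high because the observed accuracy sits well above the critical count needed to reject chance performance in 100 trials.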

      Reviewer 2 (Recommendations for the authors):

      Comment 1

      Differences in SS and Sp (Table 2) are NOT physical or mechanical differences but are obtained by counting differences in the number of occurrences of each sliding behavior. It is rather a weird choice.

      We disagree that differences in SS and Sp are not physical or mechanical, as these are well-established phenomena in the soft matter and tribology literature.[20–22] These are known as “mechanical instabilities” and are generated by the interplay of two physical phenomena: the elasticity of the finger (which is constant in our mechanical testing) and the friction forces present (which change per sample type). The motivation behind using these different shapes is that the instabilities, in some conditions, can be invariant to external factors like velocity. This would be quite advantageous for human exploration because, unlike friction coefficient, which changes with nearly any factor, including velocity and mass, the instabilities being invariant to velocity would mean that we are accurately characterizing a unique identifier of the surface even though velocity may be variable.

      This “weird choice” is the central innovation of this paper. This choice was necessary because we demonstrated that the common usage of friction coefficient is fundamentally flawed: we see that friction coefficient suggests that surfaces which are more different would feel more similar – indeed the most distinctive surfaces would be two surfaces that are identical, which is clearly spurious. Furthermore, Table 1 now includes the range of friction generated on a surface; the range of friction coefficients of a single surface is large, of the order of the differences in friction between two surfaces. This is expected in soft sliding systems and emphasizes our issue with the use of average friction coefficient in psychophysical design. One potential explanation for why we were able to see this effect is that our surfaces have similar (< 0.6 nm variability) roughness, removing potential confounding factors from large scale roughness, and this type of low roughness control has not been widely used in tactile studies to the best of our knowledge.

      Comment 2

      Figures 2B-C: why are the x-data different than Table 2?

      The x-data in Fig. 2B-C are the absolute differences in the number of occurrences measured for a given instability type or material property out of 144 pulls. Modeling the human participant results in our GLMMs required the independent variables to be in this form rather than percentages. We initially chose to list percent differences in Table 2 to highlight the ranges of differences instead of an absolute value, but have added both for clarity.

      Our changes to the manuscript

      Page 7

      “To determine if humans can detect these three different instabilities, we selected six pairs of surfaces to create a broad range of potential instabilities present across all three types. These are summarized in Table 2, where the first column for each instability is the difference in occurrence of that instability formed between each pair, and the second is the percent difference.”

      “Thus, when comparing C4 versus C4-APTMS, they have a difference in steady sliding of 20 out of a maximum 144 pulls, for a |ΔSS| of 13.9%. The absolute value is taken to compare total differences present, as the psychophysical task does not distinguish between sample order.”
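      The two columns described for Table 2 can be illustrated with a short sketch. Only the difference of 20 and the total of 144 pulls come from the text above; the individual counts for each surface (50 and 30) and the function name are hypothetical values chosen so that their difference is 20.

```python
TOTAL_PULLS = 144  # number of pulls per surface, as stated in the response


def instability_difference(count_a, count_b, total=TOTAL_PULLS):
    """Return the absolute difference in occurrences and the same value as a percent of total pulls."""
    abs_diff = abs(count_a - count_b)  # order of the two samples does not matter
    pct_diff = 100.0 * abs_diff / total
    return abs_diff, pct_diff


# Hypothetical steady-sliding counts for two surfaces; only their difference (20) is from the text.
abs_diff, pct_diff = instability_difference(50, 30)
print(f"|dSS| = {abs_diff} pulls = {pct_diff:.1f}% of {TOTAL_PULLS}")
```

      The absolute count is the form described as the input to the GLMMs, while the percentage is the form originally listed in Table 2.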

      Comment 3

      "We constructed a set of coated surfaces with physical differences which were imperceptible by touch but created different types of instabilities based on how quickly a finger is slid and how hard a human finger is pressed during sliding." Yet, in your experiment, participants could discriminate them, so this is incoherent.

      To clarify the point, macroscopic objects can differ in physical shape and in chemical composition. What we meant was that the physical differences, i.e., roughness, were below a limit (Skedung et al.) that participants, without a coating, would not be able to tell these apart.[23] Therefore, the reason people could tell our surfaces apart was due to the chemical composition of the surface, and not any differences in roughness or physical effects like film stiffness (due to the molecular-scale thinness of the surface coatings, they are mechanically negligible). However, we concede that at the molecular scale, the traditional macroscopic distinction between physical and chemical is blurred.

      We have made minor revisions to the wording in the abstract. We clarify that the surface coatings had physical differences in roughness that were smaller than 0.6 nm, which based purely on roughness, would not be expected to be distinguishable to participants. Therefore, the reason participants can tell these surfaces apart is due to differences in friction generated by chemical composition, and we were able to minimize contributions from physical differences in the samples in our study.

      Our changes to the manuscript

      Page 1, Abstract

      “Here, we constructed a set of coated surfaces with minimal physical differences that by themselves, are not perceptible to people, but instead, due to modification in surface chemistry, the surfaces created different types of instabilities based on how quickly a finger is slid and how hard a human finger is pressed during sliding.”

      “In one experiment, we used a mechanical mock finger to quantify and classify differences in instability formation from different coated surfaces. In a second experiment, participants perform a discrimination task using the same coated surfaces. Using the data from these two experiments, we found that human discrimination response times were faster with surfaces where the mock finger produced more stiction spikes and discrimination accuracy was higher where the mock finger produced more steady sliding. Conversely, traditional metrics like surface roughness or average friction coefficient did not relate to tactile discriminability. In fact, the typical method of averaging friction coefficients led to a spurious correlation which erroneously suggests that distinct objects should feel identical and identical objects should feel distinct—similar to findings by others. Friction instabilities may offer a more predictive and tractable framework of fine touch perception than friction coefficients, which would accelerate the design of tactile interfaces.”

      Reviewer 3 (Public review):

      Strengths

      The paper describes a new perspective on friction perception, with the hypothesis that humans are sensitive to the instabilities of the surface rather than the coefficient of friction. The paper is very well written and with a comprehensive literature survey.

      One of the central tools used by the author to characterize the frictional behavior is the frictional instabilities maps. With these maps, it becomes clear that two different surfaces can have both similar and different behavior depending on the normal force and the speed of exploration. It puts forward that friction is a complicated phenomenon, especially for soft materials.

      The psychophysics study is centered around an odd-one-out protocol, which has the advantage of avoiding any external reference to what would mean friction or texture for example. The comparisons are made only based on the texture being similar or not.

      The results show a significant relationship between the distance between frictional maps and the success rate in discriminating two kinds of surface.

      We thank Reviewer 3 for their notes and interesting discussion points on our manuscript. Below, we address the reviewer’s feedback and comments on related works.

      Weaknesses:

      Comment 1

      The main weakness of the paper comes from the fact that the frictional maps and the extensive psychophysics study are not made at the same time, nor with the same finger. The frictional maps are produced with an artificial finger made out of PDMS which is a poor substitute for the complex tribological properties of skin.

      A similar comment was made by Reviewers 1 and 2. We agree in part and have made changes throughout to clarify that our study is correlative, but that it presents an important step forward to these biomechanical measurements and corresponding decision-making models.

      We are not claiming that our PDMS fingers are superior to real fingers, but rather, we cannot establish standards in the field by using real human fingers that vary between subjects and researchers. We believe the mock finger we designed is a reasonable mimic of the human finger by matching surface energy, heterogeneous mechanical structure, and the ability to test multiple physiologically relevant pressures and sliding velocities.

      We achieve a heterogeneous mechanical structure with the 3 primary components of stiffness of a human finger. The effective modulus of ~100 kPa, from soft tissue,[9,10] is obtained with a 30:1 ratio of PDMS to crosslinker. The PDMS also surrounds a rigid, acrylic bone comparable to the distal phalanx, which provides an additional layer of higher modulus.[8] Additionally, the 8-hour UV-Ozone treatment decreases the viscoelastic tack of the pristine PDMS by glassifying, or further crosslinking the surface of the finger,[12] therefore imparting greater stiffness at the surface similar to the contributions of the stratum corneum, along with a similar surface energy.[13] The finger is used at least a day after UV-Ozone treatment is completed in order for the surface to return to moderate hydrophilicity, similar to the outermost layer of human skin.[17] We also discuss the shape of the contact formed. To ensure that there is minimal contribution from the slanted position of the finger, an initial contact area of 1×1 cm is established before sliding and recording friction measurements. As the PDMS finger is a soft object, the portion in contact with a surface flattens and the contact area remains largely unchanged during sliding. We recognize that it is difficult to completely control the pressure distribution due to the planar interface, but this variation is also expected when humans freely explore a surface. Finally, we consider flat vs. fingerprinted fingers. Our previous work on the role of fingerprints on friction experienced by a PDMS mock finger showed enhanced signals with the incorporation of ridges on the finger and used a rate-and-state model of a heterogeneous, elastic body to find corresponding trends.[11] The key conclusion was that a flat finger still preserved key dynamic features, and the presence of stronger or more vibrations could result in more similar forces for different surfaces depending on the sliding conditions. We note that we have subsequently used the controlled mechanical data collected with this flat mock finger in correlations with human psychophysics in previous work, where findings from our mechanical experiments were predictive of human performance.[4–7] Ultimately, we see from our prior work and here that, despite the drawbacks of our mock finger, it outperforms other standard characterization techniques in providing information about the mesoscale that correlates to tactile perception. We have added these details to the manuscript.

      We also note that an intermediate option, replicating real fingers, even in a mold, may also inadvertently limit trends from characterization to a specific finger. One of the main – and severe – limitations of using a human finger is that all fingers are different, meaning any study focusing on a particular user may not apply to others or be recreated easily by other researchers. We cannot set a standard for replication around a real human finger as that participant may no longer be available, or willing to travel the world as a “standard”. Furthermore, the method in which a single person changes their pressures and velocities as they touch a surface is highly variable. As noted in the Summary Response, a study by Colgate et al. (IEEE ToH 2024) demonstrated that efference copies may be important, and thus constraining a human finger and replaying the forces recorded during free exploration will not lead to the participant identifying a surface with any consistency. Thus, it is important to allow humans to freely explore surfaces, but doing so creates nearly limitless variability in friction forces.

      This is also against the backdrop that we are seeking to provide a method to characterize surfaces. Indeed, the more features of a human finger we replicate in the mock finger, the more likely it is that the mechanical data will correlate to human performance. However, we have used this technique several times to achieve stronger correlations to human data than other available techniques. We believe the metric of success should be in comparison to the available characterization techniques, rather than a 1:1 reconstruction of forces of an arbitrary human finger. Indeed, a 1:1 reconstruction of forces of an arbitrary human finger would be limited to the finger of a single individual, perhaps even to that individual on a given day.

      See Reviewer 1 weaknesses, comment 2, part 2 for changes to the manuscript.

      Comment 2

      The evidence would have been much stronger if the measurement of the interaction was done during the psychophysical experiment. In addition, because of the protocol, the correlation is based on aggregates rather than on individual interactions.

      We agree that this would have helped further establish our argument, but in the overall statement and in other reviewer responses, we describe the significant challenges to establishing this.

      To fully implement this, a decision-making model is necessary because, as a counter example, a participant could have generated 10 swipes of SFW and 1 swipe of a Sp, but the Sp may have been the most important event for making a tactile decision. We also clarify that our goals are to provide a method to characterize samples to better design tactile interfaces in haptics or in psychophysical experiments.

      As discussed in the summary, and expanded on here, in our view, to develop a decision-making model, the challenges are as follows:

      (1) Which one, or which combination, of the multiple swipes that people make is responsible for a tactile decision?

      (2) Establish what is, or may be, tactile evidence.

      (3) Establish whether tactile decision-making models are similar to or different from existing decision-making models.

      (4) Test the hypothesis, in these models, that friction instabilities are evidence, and not some other unknown metric.

      (5) Design a task that does not require the use of subjective tactile descriptors, like “which one feels rougher”, which we see cause confusion in participants, which will likely require accounting for memory effects.

      (6) Design samples that vary in the amount of evidence generated, but this evidence cannot be controlled directly. Rather, the samples indirectly vary evidence by how likely it is for a human to generate different types of friction instabilities during standard exploration.

      We elaborate on these points below:

      To successfully perform this experiment, we note that freely exploring humans make multiple strokes on a surface. Therefore, we would need to construct a decision-making model. It has not yet been demonstrated whether tactile decision making follows visual decision making, but perhaps to start, we can assume it does. Then, in the design of our decision-making paradigm, we immediately run into the problem: What is tactile evidence?

      From Fig. 3C, we already can see that identifying evidence is challenging. Prior to this manuscript, people may have chosen the average force, or the highest force. Or we may choose the average friction force. Then, after deciding on the evidence, we need to find a method to manipulate the evidence, i.e., create samples or a machine that causes high friction, etc. We show that during the course of human touch, due to the dynamic nature of friction, the average can change a large amount and sample design becomes a central barrier to experiments. Others may suggest immobilizing the finger and applying a known force, but given how much friction changes with human exploration, there is no known method to make a machine recreate temporally and spatially varying friction forces during sliding onto a stationary finger. Finally, perhaps most importantly, in addition to mechanical challenges, a study by Liu, Colgate et al. showed that even if they recorded the friction (2D) of a finger exploring a surface and then replicated the same friction forces onto a finger, the participant could not determine which surface the replayed friction force was supposed to represent.[1] This supports that the efference copy is important, that the forces in response to expected motion are important to determine friction. Finally, there is no known method to design instabilities a priori. They must be found through experiments, especially since if we were to introduce, say a bump or a trough, then we bring in confounding variables to how participants tell surfaces apart.

      Furthermore, even if we had some consistent method to create tactile “evidence”, the paradigm also deserves some consideration. In our experience, the 3-AFC task we perform is important because the vocabulary for touch has not been established. That is, in 3-AFC, by asking to determine which one sample is unlike the others, we do not have to ask the participant questions like “which one is rougher” or “which one has less friction”. In contrast, 2-AFC, which is better for decision-making models because it does not include memory, requires the asking of a perceptual question like: “which one is rougher?”. In our ongoing work, taking two silane coatings, we found that participants could easily identify which surface is unlike the others above chance in a 3-AFC, but participants, even within their own trials, could not consistently identify one silane as perceptually “rougher” by 2-AFC. To us, this calls into question the validity of tactile descriptors, but is beyond the scope of the current manuscript.

      This is not our only goal, but in the context of human exploration, we believed it was important in this manuscript to identify a mechanical parameter that was consistent with how humans explore surfaces and that could also characterize some consistent property of a surface, irrespective of whether a human was touching it. We thought that designing human decision-making models and paradigms around the friction coefficient would not be successful.

      Given the scope of these challenges, we do not think it would be possible to establish this conceptual sequence in a single manuscript.

      Comment 3

      The authors compensate with a third experiment where they used a 2AFC protocol and an online force measurement. But the results of this third study, fail to convince the relation.

      With this experiment, our central goal was to demonstrate that the instabilities we have identified with the PDMS finger also occur with a human finger. Several instances of SS, Sp, and SFW were recorded with this setup as a participant touched surfaces in real time.

      Comment 4

      No map of the real finger interaction is shown, bringing doubt to the validity of the frictional map for something as variable as human fingers.

      Real fingers change constantly during exploration, and friction is state-dependent, meaning that the friction will depend on how the person was moving the moment prior. Therefore, a map is only valid for a single human movement – even if participants all were instructed to take a single swipe and start from zero motion, humans are unable to maintain constant velocities and pressures. Clearly, this is not sustainable for any analysis, and these drawbacks apply to any measured parameter, whether instabilities suggested here, or friction coefficients used throughout. We believe the difficulty of this approach emphasizes why a standard map of characterization of a surface by a mock finger, even with its drawbacks, is a viable path forward.

      Reviewer 3 (Recommendations for the authors):

      Comment 1

      It would be interesting to comment on a potential connection between the frictional instability maps and Schallamach waves.

      Schallamach waves are a subset of slow frictional waves (SFW) and are very specifically defined in the field. They occur when pockets of air form between a soft sliding object and a rigid surface, propagate rear-to-front (retrograde waves) relative to the sliding motion, and form buckles due to adhesive pinning. Wrinkles then form at the detached portion of the soft material, until the interface reattaches and the process repeats.[24] There is typically a high burden of proof to establish a Schallamach wave over a more general slow frictional wave. We note that it would be exceedingly difficult to design samples that can reliably create subsets of SFW, but we are aware that this may be an interesting question at a future point in our work.

      Comment 2

      The force sensors look very compliant, and given the dynamic nature of the signal, it is important to characterize the frequency response of the system to make sure that the fluctuations are not amplified.

      Thank you for noticing. We mistyped the sensor spring constant as 13.9 N m<sup>-1</sup> instead of kN m<sup>-1</sup>. However, below we show how the instabilities are derived from the mechanics at the interface due to the compliance of the finger. The “springs” of the force sensor and PDMS finger are connected in parallel. Since k<sub>sensor</sub> = 13.9 kN m<sup>-1</sup>, the spring constant of the system overall reflects the compliance of the finger, and highlights the oscillations arising solely from stick-slip. A sample calculation is shown below.

      Author response image 1.

      Fitting a line to the initial slope of the force trace for C6 gives the equation y = 25.679x – 0.2149. The slope here represents force data over time data, and is divided by the velocity (25 mm/s) to determine the spring constant of the system, k<sub>total</sub> = 1027.16 N/m. This value is lower than k<sub>sensor</sub> = 13.9 kN/m, indicating that the “springs” representing the force sensor and PDMS finger are connected in parallel:

      1/k<sub>total</sub> = 1/k<sub>sensor</sub> + 1/k<sub>finger</sub>

      The finger is the compliant component of the system, with k<sub>finger</sub> = 1.11 kN/m, and of course, real human fingers are also compliant, so this matches our goals with the design of the mock finger.
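      The arithmetic of this sample calculation can be retraced with a short sketch. It assumes that the force-versus-time slope is in N/s and that the sensor and finger stiffnesses combine as a reciprocal sum, which is the relation that reproduces the quoted k<sub>finger</sub>. The function names are illustrative and are not taken from the manuscript.

```python
# Retrace the sample calculation using only the values quoted in the response.

def system_spring_constant(slope_n_per_s, velocity_m_per_s):
    """Force-vs-time slope divided by sliding velocity gives force per displacement (N/m)."""
    return slope_n_per_s / velocity_m_per_s


def remaining_spring_constant(k_total, k_known):
    """Solve 1/k_total = 1/k_known + 1/k_other for k_other (assumed combination rule)."""
    return 1.0 / (1.0 / k_total - 1.0 / k_known)


k_total = system_spring_constant(25.679, 25e-3)        # slope of the C6 fit, 25 mm/s
k_finger = remaining_spring_constant(k_total, 13.9e3)  # k_sensor = 13.9 kN/m

print(f"k_total  = {k_total:.2f} N/m")
print(f"k_finger = {k_finger / 1e3:.2f} kN/m")
```

      Under these assumptions the sketch returns roughly 1027 N/m for the system and 1.11 kN/m for the finger, in line with the values stated above.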

      Our changes to the manuscript

      (Page 4) (k = 13.9 kN m<sup>-1</sup>)

      Comment 3

      The authors should discuss the stochastic nature of friction: Wiertlewski, Hudin, Hayward, IEEE WHC 2011; Greenspon, McLellan, Lieber, Bensmaia, JRSI 2020.

      We believe that, given the references, this comment on “stochastic” refers to the macroscopically observable fluctuations (i.e., the mechanical “noise” which is not due to instrument noise) in friction arising from the discordant network of stick-slip phenomena occurring throughout the contact zone, and not to the stochastic nature of nanoscale friction that arises from thermal fluctuations or from statistical distributions in bond breaking associated with soft contact.

      We first note that our small-scale fluctuations do not arise from a periodic surface texture that dominates in the frequency regime. However, even on our comparatively smooth surfaces, we do expect fluctuations due to nanoscale variation in contact, the generation of stick-slip at microscale length scales that occurs either concurrently or discordantly across the contact zone, and the nonlinear dependence of friction on nearly any variation in state and composition.[11]

      Perhaps most relevant to the manuscript is that a major advantage of analysis by friction is that it sidesteps these ever-present microscale fluctuations, leading to more clearly defined classifiers or categories during analysis. Wiertlewski et al. showed that repeated measurements in their systems ultimately gave rise to consistent frequencies[25] (we think their system was in a steady sliding regime and the patterning gave rise to underlying macroscopic waves). These consistent frequencies, at least in soft systems and absent obvious macroscopic patterned features, would be expected to arise from the instability categories and we see them throughout.

      Comment 4

      It is stated that "we observed a spurious, negative correlation between friction coefficient and accuracy".

      What makes you qualify that correlation as spurious?

      We mean this as in the statistical definition of “spurious”.

      This correlation would indicate that by the metric of friction coefficient, more different surfaces are perceived more similarly. Thus, two very different surfaces, like Teflon and sandpaper, by friction coefficient would be expected to feel very similar. Two nearly identical surfaces would be expected to feel very different – but of course, humans cannot consistently distinguish two identical surfaces. This finding is counterintuitive and refutes that friction coefficient is a reliable classifier of surfaces by touch. We do not think it is productive to determine a mechanism for a spurious correlation, but perhaps one reason we were able to observe this is because our study, to the best of our knowledge, is unique for having samples that are controlled in their physical differences in roughness and surface features.

      See response to Reviewer 1 weaknesses, comment 1 for changes to the manuscript

      Comment 5

      The authors should comment on the influence of friction on perceptual invariance. Despite inducing radically different frictional behavior for various conditions, these surfaces are stably perceived. Maybe this is a sign that humans extract a different metric?

      We agree – we are excited that frictional instabilities may offer a more stable perceptual cue because they are not prone to fluctuations (as discussed in Comment 3) and instability formation, in many conditions, is invariant to applied pressures and velocities – thus forming large zones where a human may reasonably encounter a given instability.

      Raw friction is highly prone to variation during human exploration (in alignment with Recommendations for the authors, Comment 3), but ongoing work seeks to explain tactile constancy, or the ability to identify objects despite these large changes in force. Very recently published work by Fehlberg et al. identified the role of modulating finger speed and normal force in amplifying the differences in friction coefficient between materials in order to identify them,[26] and we postulate that their work may be streamlined and consistent with the idea of friction instabilities, though we have not had a chance to discuss this in depth with the authors yet.

      We think that the instability maps show a viable path forward to how surfaces are stably perceived, and instabilities themselves show a potential mechanism: mathematically, instabilities for given conditions can be invariant to velocity or mass, creating zones where a certain instability is encountered. This reduces the immense variability of friction to a smaller, more stable classification of surfaces (e.g., a 30% SS surface or a 60% SS surface). A given surface will typically produce the same instability at a specific condition (we found some boundaries of experimental parameters are very condition sensitive, but many conditions are not), whereas a single friction trace which is highly prone to variation is not a stable metric.
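      As a hedged illustration of how an instability map might be reduced to a label such as “a 30% SS surface”, the sketch below counts the fraction of tested conditions assigned to each instability. It assumes the percentage refers to the share of tested conditions on the map; the grid and its labels are hypothetical and are not measured data.

```python
from collections import Counter


def instability_fractions(instability_map):
    """Return the fraction of tested conditions assigned to each instability label."""
    labels = [label for row in instability_map for label in row]
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


# Hypothetical 2 x 5 grid (rows: applied masses, columns: velocities); not measured data.
example_map = [
    ["SS", "SS", "Sp", "SFW", "SFW"],
    ["SS", "Sp", "Sp", "SFW", "SFW"],
]

fractions = instability_fractions(example_map)
for label, fraction in sorted(fractions.items()):
    print(f"{label}: {fraction:.0%}")
```

      Because the summary depends only on which instability occurs at each condition, it does not change when the force trace itself fluctuates within a zone.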

      Added Reference

      (53) M. Fehlberg, E. Monfort, S. Saikumar, K. Drewing and R. Bennewitz, IEEE Trans. Haptics, 2024, 17, 957–963.

      References

      (1) Liu, Z., Kim, J.-T., Rogers, J. A., Klatzky, R. L. & Colgate, J. E. Realism of Tactile Texture Playback: A Combination of Stretch and Vibration. IEEE Trans. Haptics 17, 441–450 (2024).

      (2) Waters, I., Alazmani, A. & Culmer, P. Engineering Incipient Slip Into Surgical Graspers to Enhance Grasp Performance. IEEE Transactions on Medical Robotics and Bionics 2, 541–544 (2020).

      (3) Gueorguiev, D., Bochereau, S., Mouraux, A., Hayward, V. & Thonnard, J.-L. Touch uses frictional cues to discriminate flat materials. Sci Rep 6, 25553 (2016).

      (4) Carpenter, C. W. et al. Human ability to discriminate surface chemistry by touch. Mater. Horiz. 5, 70–77 (2018).

      (5) Nolin, A. et al. Predicting human touch sensitivity to single atom substitutions in surface monolayers for molecular control in tactile interfaces. Soft Matter 17, 5050–5060 (2021).

      (6) Nolin, A. et al. Controlling fine touch sensations with polymer tacticity and crystallinity. Soft Matter 18, 3928–3940 (2022).

      (7) Swain, Z. et al. Self-Assembled Thin Films as Alternative Surface Textures in Assistive Aids with Users Who are Blind. J. Mater. Chem. B (2024) doi:10.1039/D4TB01646G.

      (8) Qian, K. et al. Mechanical properties vary for different regions of the finger extensor apparatus. J Biomech 47, 3094–3099 (2014).

      (9) Abdouni, A. et al. Biophysical properties of the human finger for touch comprehension: influences of ageing and gender. Royal Society Open Science (2017) doi:10.1098/rsos.170321.

      (10) Cornuault, P.-H., Carpentier, L., Bueno, M.-A., Cote, J.-M. & Monteil, G. Influence of physicochemical, mechanical and morphological fingerpad properties on the frictional distinction of sticky/slippery surfaces. Journal of The Royal Society Interface (2015) doi:10.1098/rsif.2015.0495.

      (11) Dhong, C. et al. Role of fingerprint-inspired relief structures in elastomeric slabs for detecting frictional differences arising from surface monolayers. Soft Matter 14, 7483–7491 (2018).

      (12) Fu, Y.-J. et al. Effect of UV-Ozone Treatment on Poly(dimethylsiloxane) Membranes: Surface Characterization and Gas Separation Performance. Langmuir 26, 4392–4399 (2010).

      (13) Yuan, Y. & Verma, R. Measuring microelastic properties of stratum corneum. Colloids Surf B Biointerfaces 48, 6–12 (2006).

      (14) Yu, G. et al. A wearable pressure sensor based on ultra-violet/ozone microstructured carbon nanotube/polydimethylsiloxane arrays for electronic skins. Nanotechnology 29, 115502 (2018).

      (15) Zheng, L. et al. Dual-Stimulus Smart Actuator and Robot Hand Based on a Vapor-Responsive PDMS Film and Triboelectric Nanogenerator. ACS Appl. Mater. Interfaces 11, 42504–42511 (2019).

      (16) Ma, K., Rivera, J., Hirasaki, G. J. & Biswal, S. L. Wettability control and patterning of PDMS using UV–ozone and water immersion. Journal of Colloid and Interface Science 363, 371–378 (2011).

      (17) Mavon, A. et al. Sebum and stratum corneum lipids increase human skin surface free energy as determined from contact angle measurements: A study on two anatomical sites. Colloids and Surfaces B: Biointerfaces 8, 147–155 (1997).

      (18) AliAbbasi, E. et al. Effect of Finger Moisture on Tactile Perception of Electroadhesion. IEEE Trans. Haptics 17, 841–849 (2024).

      (19) Corniani, G. et al. Sub-surface deformation of individual fingerprint ridges during tactile interactions. eLife 13, (2024).

      (20) Israelachvili, J. N. Intermolecular and Surface Forces. (Academic Press, 2011).

      (21) Das, S. et al. Stick–slip friction of gecko-mimetic flaps on smooth and rough surfaces. J R Soc Interface 12, 20141346 (2015).

      (22) Persson, B. N. J., Albohr, O., Creton, C. & Peveri, V. Contact area between a viscoelastic solid and a hard, randomly rough, substrate. The Journal of Chemical Physics 120, 8779–8793 (2004).

      (23) Skedung, L. et al. Feeling Small: Exploring the Tactile Perception Limits. Sci Rep 3, 2617 (2013).

      (24) Viswanathan, K., Sundaram, N. K. & Chandrasekar, S. Stick-slip at soft adhesive interfaces mediated by slow frictional waves. Soft Matter 12, 5265–5275 (2016).

      (25) Wiertlewski, M., Hudin, C. & Hayward, V. On the 1/f noise and non-integer harmonic decay of the interaction of a finger sliding on flat and sinusoidal surfaces. in 2011 IEEE World Haptics Conference 25–30 (2011). doi:10.1109/WHC.2011.5945456.

      (26) Fehlberg, M., Monfort, E., Saikumar, S., Drewing, K. & Bennewitz, R. Perceptual Constancy in the Speed Dependence of Friction During Active Tactile Exploration. IEEE Transactions on Haptics 17, 957–963 (2024).

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary

      The authors investigated the antigenic diversity of recent (2009–2017) A/H3N2 influenza neuraminidases (NAs), the second major antigenic protein after haemagglutinin. They used 27 viruses and 43 ferret sera and performed NA inhibition. This work was supported by a subset of mouse sera. Clustering analysis determined 4 antigenic clusters, mostly in concordance with the genetic groupings. Association analysis was used to estimate important amino acid positions, which were shown to be more likely close to the catalytic site. Antigenic distances were calculated and a random forest model was used to determine potential important sites.

      This has the potential to be a very interesting piece of work. At present, there are inconsistencies in the methods, results and presentation that limit its impact. In particular, there are weaknesses in some of the computational work.

      Strengths

      (1) The data cover recent NA evolution and a substantial number (43) of ferret (and mouse) sera were generated and titrated against 27 viruses. This is laborious experimental work and is the largest publicly available neuraminidase inhibition dataset that I am aware of. As such, it will prove a useful resource for the influenza community.

      (2) A variety of computational methods were used to analyse the data, which give a rounded picture of the antigenic and genetic relationships and link between sequence, structure and phenotype.

      Weaknesses

      (1) Inconsistency in experimental methods

      Two ferret sera were boosted with H1N2, while recombinant NA protein for the others. This, and the underlying reason, are clearly explained in the manuscript. The authors note that boosting with live virus did not increase titres. Nevertheless, these results are included in the analysis when it would be better to exclude them (Figure 2 shows much lower titres to their own group than other sera).

      As an exercise, we excluded the sera from H1N2-boosted ferrets, and no major impact on the antigenic grouping was observed (see Author response image 1a). Another way to control for differences in immunogenicity is to normalize the NAI values with the homologous ELISA titers for each antigen. Clustering based on these ELISA-normalized NAI titers reveals the same 4 distinct antigenic groups but with one change: Kan17 is shifted from group 1 to group 2 (Author response image 1b). Note that a homologous ELISA titer is not available for A/West-Virginia/17/2012 and thus this serum sample is not included in Author response image 1b.

      Author response image 1.

      Antigenic and phylogenetic relatedness of N2 NAs. Phylogenetic tree based on the N2 NA head domain amino acid sequences and heat map representing the average of normalized neuraminidase inhibition titer per H6N2 [log2 (max NAI/NAI)] determined in ferret sera after the boost (listed vertically). The red-to-blue scale indicates high-to-low NAI observed in ELLA against the H6N2 reassortants (listed at the bottom). UPGMA clustering of the H6N2 inhibition profiles is shown on top of the heat map and colored according to the phylogenetic groups. (a) Based on the ferret sera with exclusion of the sera that were obtained following prime-boost by infection with H1N2 (A/Estonia/91625/2015 and A/Stockholm/15/2014). (b) Based on serum NAI titers that were normalized by the homologous ELISA titer.
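      The normalization named in this caption, log2 (max NAI/NAI) per H6N2, can be sketched as follows. The sketch assumes one titer per serum and virus, and takes the maximum across sera for each virus; the serum names and titers are hypothetical and are not values from Table S3.

```python
import math


def normalize_nai(titers_by_serum):
    """For each virus column, return log2(max titer across sera / titer) for every serum."""
    sera = list(titers_by_serum)
    n_viruses = len(titers_by_serum[sera[0]])
    column_max = [max(titers_by_serum[s][j] for s in sera) for j in range(n_viruses)]
    return {
        s: [math.log2(column_max[j] / titers_by_serum[s][j]) for j in range(n_viruses)]
        for s in sera
    }


# Hypothetical NAI titers of three sera against two H6N2 viruses (illustration only).
toy_titers = {
    "serum_A": [640, 40],
    "serum_B": [160, 320],
    "serum_C": [20, 80],
}

normalized = normalize_nai(toy_titers)
for serum, values in normalized.items():
    print(serum, values)
```

      A value of 0 marks the serum with the highest inhibition of a given virus, and each additional unit corresponds to a two-fold lower titer.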

      (2) Inconsistency in experimental results

      Clustering of the NA inhibition results identifies three viruses which do not cluster with their phylogenetic group. Again, this is clearly pointed out in the paper. Further investigation of this inconsistency is required to determine whether this has a genetic basis or is an experimental issue. It is difficult to trust the remaining data while this issue is unresolved.

      We understand the concern of the reviewer. It is important to keep in mind that discrete grouping of antigens allows major antigenic drifts to be visualized. However, within closely related groups, the cross-reactivity of antisera is more likely distributed along a spectrum. When we constructed an antigenic map based on the antigenic cartography algorithm (as described by Smith D. et al, 2004), Kansas17, Wis15, and Ala15 are positioned more closely to antigenic group 1 than the majority of other antigens that were classified as group 2 (Author response image 2a). Similar results were obtained when individual ferret sera from the biological duplicates were used (Author response image 2b). This antigenic cartography map is now added as Figure 2 - figure supplement 3 to the revised manuscript.

      Author response image 2.

      The antigenic cartography was constructed using averaged data from pairs of ferrets (a). A similar analysis was performed on individual ferret sera (b).

      (3) Inconsistency in group labelling

      A/Hatay/4990/2016 & A/New Caledonia/23/2016 are in phylogenetic group 1 in Figure 2 and phylogenetic group 1 in Figure 5 - figure supplement 1 panel a.

      Our apologies: there was indeed a mistake in the labeling of Figure 5. A new antigenic cartography was constructed and included in the revised manuscript. As a result, Figure 5 - figure supplement 1 has become redundant and was removed from the manuscript.

      A/Kansas/14/2017 is selected as a representative of antigenic group 2, when in Figure 2 it is labelled as AC1 (although Figure 2 - supplement 4 which the text is referring to shows data for A/Singapore/Infimh-16-0019/2016 as the representative of AC2). A/Kansas/14/2017 is coloured and labelled as AC2 in Figure 2 - supplement 5.

      Thank you for pointing out this inconsistency. Kan17 clustered antigenically in group 1 based on the NAI values that were normalized relative to the serum with the maximal NAI value against the H6N2 virus that was tested. When using NAI titers that are normalized by the homologous ELISA titer, Kan17 is positioned in group 2. Likewise, antigenic cartography mapping positions Kan17 in group 2. Therefore, we conclude that A/Kansas/14/2017 NA is a representative of group 2.

      The colouring is changed for Figure 3a at the bottom. A/Heilongjiang-Xiangyang/1134/2011 is coloured the same as AC4 viruses when it is AC1 in Figure 2. This lack of consistency makes the figures misleading.

      We apologize for this mistake. The coloring in Figure 3a has been corrected.

      (4) Data not presented, without explanation

      The paper states that 44 sera and 27 H6N2 viruses were used (line 158). However, the results for the Kansas/14/2017 sera do not appear to be presented in any of the figures (e.g. Figure 2 phylogenetic tree, Figure 5 - figure supplement 1). It is not obvious why these data were not presented. The exclusion of this serum could affect the results as often the homologous titre is the highest and several heatmaps show the fold down from the highest titre.

      Serum against A/Kansas/14/2017 was not prepared. For that reason, it is not included in the analysis. We agree that such homologous serum ideally should have been included and in the NAI assay would have resulted in a high if not the highest titer. However, we noticed that homologous sera did not always have the highest titers, especially in panels like ours where some antigens are closely related. The highest titer obtained against Kan17 H6N2 was from A/Bris/16 sera: 1/104, a titer that is in the range of other, homologous titers observed in the panel (Table S3). The Bris16 and Kan17 NAs have five amino acid differences. In summary, inclusion of Kan17 homologous sera would likely not impact the analysis and interpretation of the results because there are multiple highly cross-inhibiting heterologous serum samples against Kan17.

      (5) The cMDS plot does not have sufficient quality assurance

      A cMDS plot is shown in Figure 5 - figure supplement 1, generated using classical MDS. The following support for the appropriateness of this visualisation is not given.

      a. Goodness of fit of the cMDS projection, including per point and per titre.

      b. Testing of the appropriate number of dimensions (the two sera from phylogenetic group 3 are clustered with phylogenetic group 2; additional dimensions might separate these groups).

      c. A measure of uncertainty in positioning, e.g. bootstrapping.

      d. A sensitivity analysis of the assumption about titres below the level of detection (i.e. that <20 = 10).

      Without this information, it is difficult to judge if the projection is reliable.

      We agree with these comments. We have removed Figure 5 – figure supplement 1, and added new figure 2 – figure supplement 3 (antigenic cartography) instead.

      (6) Choice of antigenic distance measure

      The measure of antigenic distance used here is the average difference between titres for two sera. This is dependent on which viruses have been included in the analysis and will be biased by the unbalanced number of viruses in the different clusters (12, 8, 2, 5).

      To verify the impact of the number of antigens on our analysis, the matrix of differences was generated with only 4 H6N2s representing at least one phylogenetic group (Per09, Sin16, Hel823 and Ind11) (Author response image 3a). This matrix is very similar to the one calculated based on all 27 antigens (Author response image 3b). The obtained matrix (Author response image 3a) was used in random forest to model antigenic distances and the result of prediction was plotted against real differences calculated based on the full data. The correlation coefficient (R2) of predicted vs observed values dropped from 0.81 to 0.71, suggesting that the number of antigens tested does not drastically affect the antigenic differences calculated based on serum values (Author response image 3e). Importantly, amino acid substitutions potentially associated with increased antigenic distances are similarly identified (Author response image 3c, d and f).

      Author response image 3.

      The matrix of differences was calculated using only 4 H6N2 antigens (a) or the full panel (b). The matrices from 4 (c) or 27 (d) antigens were used in random forest modeling to estimate the impact of amino acid changes. The random forest (rf) modeling data generated from the 4 H6N2s only were plotted against and correlated with values calculated from the full panel of 27 H6N2s (e). The multi-way importance plot indicates in red that 7 out of the 10 most important substitutions were identified by the analysis using only 4 H6N2s (f).

      Interestingly, when the matrix of differences is calculated using data from only 4 H6N2s that do not include at least one representative of antigenic group 1 or 2, the correlation between the predicted values and the values obtained from the full panel is dramatically affected (the R2 value drops from 0.81 to 0.5 and 0.57, Author response image 4a and b). It is important to note that most of the sera also correspond to antigens from phylogenetic groups 1 and 2. As a consequence, poorer prediction of those antigens would more drastically impact the correlation. No drastic drop was observed when representative H6N2s from group 3 or 4 were excluded from the data (from 0.81 to 0.75 and 0.73, Author response image 4c and d).

      Author response image 4.

      Random forest analysis was repeated using only 4 antigens, but excluding representatives of one of the phylogenetic groups (a) no group 1, (b) no group 2, (c) no group 3, and (d) no group 4.

      We also used Euclidean distances as a measure of differences (Author response image 5). The predictive values obtained in rf have a slightly reduced R2 compared to the values obtained using average of differences.

      In conclusion, neither the unbalanced number of antigens used per group nor the choice of distance metric seems to substantially affect our analysis.

      Author response image 5.

      Antigenic distances were calculated using Euclidean distances between sera. Those antigenic distances were used in rf for estimation of antigenic distance and importance of each amino acid substitution.
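      The two distance measures compared in this part of the response can be contrasted on a toy example. The sketch assumes that the “average of differences” is the mean absolute difference between two sera's normalized titer profiles across the same viruses, and that the Euclidean distance is taken over the same profiles; the profiles shown are hypothetical.

```python
import math


def mean_abs_difference(profile_a, profile_b):
    """Average absolute difference between two serum profiles over shared viruses."""
    diffs = [abs(a - b) for a, b in zip(profile_a, profile_b)]
    return sum(diffs) / len(diffs)


def euclidean_distance(profile_a, profile_b):
    """Euclidean distance between two serum profiles over shared viruses."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


# Hypothetical normalized log2 titers of two sera against four viruses.
serum_1 = [0.0, 1.0, 4.0, 5.0]
serum_2 = [3.0, 1.0, 0.0, 5.0]

average_distance = mean_abs_difference(serum_1, serum_2)
euclid_distance = euclidean_distance(serum_1, serum_2)
print(average_distance, euclid_distance)
```

      The Euclidean form weights large single-virus differences more heavily and grows with the number of viruses compared, whereas the average is bounded by the largest single difference.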

      (7) Association analysis does not account for correlations

      For each H6N2 virus and position, significance was calculated by comparing the titres between sera that did or did not have a change at that position. This does not take into account the correlations between positions. For haemagglutinin, it can be impossible to determine the true antigenic effects of such correlated substitutions with mutagenesis studies.

      Most of the potential correlated effects cannot be addressed with the panel of N2s, except for combinations of substitutions that are included in the panel, such as 245/247 with or without 468. Only mutagenesis studies would shed light on the epistatic effects. However, it is important to keep in mind that individual substitutions in such studies likely do not reflect the natural evolution of N2 (cf. the importance of the NA charge balance; Wang et al., 2021: 10.7554/eLife.72516).

      (8) Random forest method

      25 features are used to classify 43 sera, which seems high (p/3 is typical for classification). By only considering mismatches, rather than the specific amino acid changes, some signals may be lost (for example, at a given position, one amino acid change might be neutral while another has a large antigenic effect). Features may be highly, or perfectly correlated, which will give them a lower reported importance and skew the results.

      The number of features was optimized in the range from 5 to 80, with 25 being optimal (best R-value for predicted vs observed antigenic distances). These features refer to the number of amino acid substitutions used in each tree. The number of trees was also optimized in the range of 100 to 2000.

      In the random forest, the matrix of differences considers only the position, and not the type, of substitution in pairs of NAs. Indeed, substitutions with distinct effects may skew results by indicating lower reported importance.

      We have highlighted such potential bias in our discussion:

      “Also, our modelling does not consider that substitution by other amino acids can have a distinct impact on the antigenic distance. As a consequence, predictions based on the model could underestimate or overestimate the importance of a particular amino acid residue substitution in some cases.”
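      The position-only encoding discussed in this response can be illustrated with a minimal sketch. It assumes aligned sequences of equal length and 1-based positions; the sequence fragments and the function name are hypothetical.

```python
def mismatch_features(seq_a, seq_b, positions):
    """Return 1 where two aligned sequences differ at a position, else 0.

    The identity of the residues is deliberately ignored, as in a
    position-only encoding.
    """
    return [int(seq_a[p - 1] != seq_b[p - 1]) for p in positions]


# Hypothetical aligned fragments of three NAs (illustration only).
na_1 = "KSDNT"
na_2 = "KTDNA"
na_3 = "KRDNT"
positions = [1, 2, 3, 4, 5]

features_1_vs_2 = mismatch_features(na_1, na_2, positions)
features_1_vs_3 = mismatch_features(na_1, na_3, positions)
print(features_1_vs_2, features_1_vs_3)
```

      At position 2, the S-to-T and S-to-R differences receive the same value, which is the loss of information that the quoted discussion text acknowledges.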

      Reviewer #2 (Public Review):

      Summary:

      The authors characterized the antigenicity of N2 protein of 44 selected A(H3N2) influenza A viruses isolated from 2009-2017 using ferret and mice immune sera. Four antigenic groups were identified, which correlated with their respective phylogenic/ genetic groups. Among 102 amino acids differed by the 44 selected N2 proteins, the authors identified residues that differentiate the antigenicity of the four groups and constructed a machine-learning model that provides antigenic distance estimation. Three recent A(H3N2) vaccine strains were tested in the model but there was no experimental data to confirm the model prediction results.

      Strengths:

      This study used N2 protein of 44 selected A(H3N2) influenza A viruses isolated from 2009-2017 and generated corresponding panels of ferret and mouse sera to react with the selected strains. The amount of experimental data for N2 antigenicity characterization is large enough for model building.

      Weaknesses:

      The main weakness is that the strategy of selecting 44 A(H3N2) viruses from 2009-2017 was not explained. It is not clear if they represent the overall genetic diversity of human A(H3N2) viruses circulating during this time. A comprehensive N2 phylogenetic tree of human A(H3N2) viruses from 2009-2017, with the selected 44 strains labeled in the tree, would be helpful to assess the representativeness of the strains included in the study.

      The selection of antigens was performed using the method described by Bien and Tibshirani 2011 (doi: 10.1198/jasa.2011.tm10183). This method calculates MinMax distances to identify a central representative among distinct clusters.
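      The idea of a minimax representative can be sketched as follows: within a cluster, the chosen member is the one whose largest distance to any other member is smallest. This covers only the prototype step for a cluster that is already given, not the full hierarchical clustering of Bien and Tibshirani, and the distances are hypothetical.

```python
def minimax_prototype(distance, members):
    """Return the member whose largest distance to any cluster member is smallest."""
    def radius(candidate):
        return max(distance[candidate][other] for other in members)
    return min(members, key=radius)


# Hypothetical pairwise distances between three sequences in one cluster.
toy_distance = {
    "A": {"A": 0, "B": 2, "C": 5},
    "B": {"A": 2, "B": 0, "C": 3},
    "C": {"A": 5, "B": 3, "C": 0},
}

prototype = minimax_prototype(toy_distance, ["A", "B", "C"])
print(prototype)
```

      Here the largest distances are 5 for A, 3 for B, and 5 for C, so B is returned as the central representative.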

      To facilitate visualization in a phylogenetic tree, only 180 representative N2 proteins from 2009-2017 were randomly selected (20 strains per year, unlabelled). Those 180 representatives and the 44 readout panel strains (labelled) are shown in the phylogenetic tree below. Readout strains cover the major branches of the tree. The tree was built using PhyML 3.0 with the JTT substitution model and default parameters (Guindon S. et al, Systematic Biology 59(3):307-21, 2010) and visualized using ETE3 (Huerta-Cepas J. et al, Mol. Biol. Evol 33(6):1635-38, 2016).

      Author response image 6.

      The second weakness is the use of double-immune ferret sera (post-infection plus immunization with recombinant NA protein) or mouse sera (immunized twice with recombinant NA protein) to characterize the antigenicity of the selected A(H3N2) viruses. Conventionally, NA antigenicity is characterized using ferret sera after a single infection. Repeated influenza exposure in ferrets has been shown to enhance antibody binding affinity and may affect the cross-reactivity to heterologous strains (PMID: 29672713). The increased cross-reactivity is supported by the NAI titers shown in Table S3, as many of the double immune ferret sera showed the highest reactivity not against its own homologous virus but to heterologous strains. Although the authors used the post-infection ferret sera to characterize 5 viruses (Figure 2, Figure Supplement 4), the patterns did not correlate well. If the authors repeat the NA antigenic analysis using the post-infection ferret sera with lower cross-reactivity, will the authors be able to identify more antigenic groups instead of 4 groups?

      This is a very valuable remark. In their paper, Kosikova et al. (CID 2018) report that repeated infection of ferrets with antigenically slightly different H3N2 viruses results in a broader anti-HA response, compared to a prime infection of an influenza naïve ferret, which results in a narrower anti-HA response. In our ferret immunizations the boost was performed with recombinant, enzymatically active NA that was homologous to the NA of the H1N2 virus that was used for the priming by infection. We determined the NAI responses in sera from ferrets after H1N2 infection against 5 different H6N2 viruses (Figure 2 – figure supplement 5). Compared to NAI responses in sera from H1N2 infected and subsequently NA protein boosted ferrets, the NAI titers obtained after a single infection were considerably lower. Although the normalized NAI titers of day 14 and day 42 sera correlated well, we cannot exclude a degree of broadening of the NAI response in the NA protein boost sera (Author response image 7). On the other hand, repeated influenza antigen exposure is the reality for the majority of people.

      Author response image 7.

      Correlation obtained on NAI data from ferrets at day 14 after infection vs data from day 42 after boost.

      Another weakness is that the authors used the newly constructed model to predict the antigenic distance of three recent A(H3N2) viruses but there is no experimental data to validate their prediction (eg. if these viruses are indeed antigenically deviating from group 2 strains as concluded by the authors).

      Indeed, there is no experimental data from A/Hong_Kong/45/2018, A/Tasmania/503/2020, or A/Darwin/9/2021. The generation of data to determine experimental values for A/Hong_Kong/45/2018, A/Tasmania/503/2020, or A/Darwin/9/2021 would require the generation of new reassortant viruses (H1N2s), recombinant protein and immunization of new ferrets. The ferret sera would have to be analyzed against all 27 H6N2s, including duplicated control sera for normalization. The major point of the modeling was to evaluate if it is possible to predict the antigenic behavior based on amino acid substitutions.

      As an exercise, we have run the model again, this time excluding the Swe17 and HK17 antigens from the data set. Sequences of Swe17 or HK17 were then used to predict antigenic distances. The modeled versus experimental data are plotted in Author response image 8 and show a robust predictive outcome, with R² values of 0.94 and 0.91 for Swe17 and HK17, respectively.

      Author response image 8.

      Antigenic distances from Swe17 and HK17 calculated using the random forest algorithm that was constructed without experimental data from Swe17 and HK17. The predicted distances were plotted side by side to the experimental distances in (a) and correlations are shown in (b).

      Reviewer #3 (Public Review):

      Summary:

      This paper by Portela Catani et al examines the antigenic relationships (measured using monotypic ferret and mouse sera) across a panel of N2 genes from the past 14 years, along with the underlying sequence differences and phylogenetic relationships. This is a highly significant topic given the recent increased appreciation of the importance of NA as a vaccine target, and the relative lack of information about NA antigenic evolution compared with what is known about HA. Thus, these data will be of interest to those studying the antigenic evolution of influenza viruses. The methods used are generally quite sound, though there are a few addressable concerns that limit the confidence with which conclusions can be drawn from the data/analyses.

      Strengths:

      • The significance of the work, and the (general) soundness of the methods.

      • Explicit comparison of results obtained with mouse and ferret sera.

      Weaknesses:

      • Approach for assessing the influence of individual polymorphisms on antigenicity does not account for the potential effects of epistasis.

      Indeed, possible epistatic effects of individual polymorphisms were not assessed, a limitation imposed by the nature of the panel of N2s selected for the study. We now emphasize this in the discussion as follows:

      “Also, our modelling does not consider that substitution by different amino acids can have distinct impact on antigenic distance. As a consequence, predictions based on the model could underestimate the importance of a particular amino acid residue substitution in some cases.”

      • Machine learning analyses were neither experimentally validated nor shown to be better than simple, phylogenetic-based inference.

      This is a valid remark, and indeed we have found a clear correlation between NAI cross-reactivity and phylogenetic relatedness. However, besides achieving good prediction of the experimental data (as shown in Figure 5 and in Figure R7), machine learning analysis has the potential to rank or flag major antigenic divergences based on available sequences before they have consolidated into a new clade. Machine learning can also support the selection and design of more broadly reactive antigens.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Major corrections

      No major corrections, beyond the issues I touched on in the public review, for which I give a little more detail below:

      Point 2. If there's not a putative genetic basis for the unexpected clustering seen in the NAI, then reiterating a small subset of the data would show the reliability of the experimental methods and substantiate this unexpected finding.

      We thank the reviewer for this pertinent point and suggestion. We have modified our analysis by reiterating individual ferret data normalized with the homologous ELISA titers. This reiteration is shown in figure R1b. In this case both Kan17 and Wis15 are switched to antigenic group 2. The serum inhibition profile against those 2 strains, which shift from antigenic cluster 1 to 2, is clearly intermediate between the profiles observed in those 2 groups. Considering that antigenic evolution occurs gradually, it is not unexpected that those intermediate profiles would swing from one side to the other when forced into a discrete classification. Antigenic cartography mapping, as in Smith et al. (2004), also indicated that those H6N2s are located closer to G1 than the other antigens from G2 overall. The raw data distribution (max and min EC50) also does not indicate potential bias in the analysis.

      Point 5. If you want to use antigenic cartography (Smith et al 2004), there is the R CRAN package (https://CRAN.R-project.org/package=Racmacs) which can handle threshold titres (like <20) and has functions for the diagnostic tools I describe, in order to quality assure the resulting plot. It does use a different antigenic distance metric than the paper currently uses, so you might not want to take that route.

      Thank you for this suggestion. We have performed antigenic cartography using the methodology described by Smith et al made accessible by Sam Wilks. The outcome of this analysis has been added to the manuscript as Figure 2 – Figure supplement 3.

      Point 6. More robust measures of antigenic distance take into account the homologous titre, homologous and heterologous titres (Archetti & Horsfall, 1950) or use the highest observed titre for a serum (Smith et al 2004). A limitation of the first two is that the antigenic distance can only be calculated when you have the homologous titre, which will limit you as you only have this for 26/43 sera. They may give similar results to your average antigenic distance, in which case your analysis still stands. Calculating antigenic distance using the homologous or maximum titre only gives the antigenic distance between the antigen and the serum. If you want the distance between all the sera, then further analysis is required (making an antigenic map and outputting the serum-serum distances, see the point above).

      We thank the reviewer for these suggestions. A complete set of 43 H6N2 viruses that matches all 43 sera would have been ideal. This would require the generation of 17 additional H6N2 viruses and their testing in ELLA, a significant amount of work in terms of time and resources. Instead, we have generated an antigenic map of the 27 antigens and homologous sera (cfr. our response to point 5 above). Despite different methods the outcome showing 4 major antigenic groups is consistent.

      Minor corrections

      Table S1

      A/New_Castle/67/2016 should be A/Newcastle/67/2016

      A/Gambia/2012 is not the full virus name

      Corrected.

      Table S3 has multiple values of exactly 10.0. I think these should be <20 as they are below the threshold of detection for the assay.

      All the values lower than 20 in Table S3 were replaced by “< 20”.

      Line 376: A/Sidney/5/1997 should be A/Sydney/5/1997

      Corrected.

      Line 338: "25 randomly sampled data" is a bit vague, "25 randomly sampled features" would be better

      Corrected.

      Include RMSE of the random forest model.

      RMSE = 19.6 (RMSE/mean = 0.207) is now mentioned in the manuscript.
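      For readers unfamiliar with the two error measures quoted above, the following minimal Python sketch shows how an RMSE and a mean-normalized RMSE are typically computed. The function names and the four example values are hypothetical and are not data from the study; the sketch also assumes that "RMSE/mean" refers to division by the mean of the observed values.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between paired observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def normalized_rmse(observed, predicted):
    """RMSE divided by the mean of the observed values (unitless)."""
    return rmse(observed, predicted) / (sum(observed) / len(observed))


# Hypothetical antigenic distances, for illustration only.
observed = [100.0, 80.0, 120.0, 100.0]
predicted = [110.0, 70.0, 110.0, 110.0]

print(rmse(observed, predicted))             # 10.0
print(normalized_rmse(observed, predicted))  # 0.1
```

      Dividing by the mean makes the error comparable across readouts measured on different scales.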

      Figure 5 - supplement 1: These plots are difficult to interpret as the aspect ratio is not 1:1, and panels a & b are difficult to compare as they have not been aligned (using a Procrustes analysis). It would be neater if they were labelled with short names.

      We have generated an antigenic cartography map instead. As a consequence, the MDS has become redundant and Figure 5 – supplement 1 was removed.

      Line 562: 98 variable residues, where it is 102 elsewhere in the text.

      There are 4 mutations near the end of the NA stalk domain, which are not resolved in the N2 structure. Therefore, amino acid distances to these residues cannot be calculated.

      No data availability statement. Some of the raw data is available in Table S3 and there is no link to the code.

      The data and code used to generate the random forest (rf) model were uploaded to GitHub and made available. The following statement has been added to the manuscript: “The data and code used for the generation of the rf model is available at https://github.com/SaelensLAB/RF.”

      Reviewer #2 (Recommendations For The Authors):

      (1) More than 42,000 NA sequences are available for the mentioned period on GISAID, it is therefore important to understand the selection criteria for the 44 strains and if these strains represent the overall genetic diversity of N2 of human A(H3N2) viruses. To demonstrate the representativeness of the 44 selected strains, please construct a representative N2 phylogenetic tree for human A(H3N2) viruses circulated in 2009-2017 and label the 44 selected strains on the tree.

      The selection of antigens was performed using the method described by Bien and Tibshirani 2011 (doi: 10.1198/jasa.2011.tm10183). This method uses MinMax distances to identify a central representative among distinct clusters.
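      The idea of a minimax "central representative" can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' code: it assumes that cluster memberships are already known and shows just the prototype step, in which the chosen member is the one whose largest distance to any other member of its cluster is smallest. The sequence names, distances and cluster assignments are hypothetical.

```python
def minimax_prototype(members, dist):
    """Return the member whose largest distance to any cluster member is smallest."""
    def radius(candidate):
        # Worst-case distance from this candidate to the rest of the cluster.
        return max(dist[candidate][other] for other in members)
    return min(members, key=radius)


# Hypothetical symmetric pairwise distances between four sequences.
dist = {
    "A": {"A": 0, "B": 2, "C": 5, "D": 9},
    "B": {"A": 2, "B": 0, "C": 3, "D": 8},
    "C": {"A": 5, "B": 3, "C": 0, "D": 7},
    "D": {"A": 9, "B": 8, "C": 7, "D": 0},
}

# Hypothetical cluster assignments (assumed to come from a prior clustering step).
clusters = {1: ["A", "B", "C"], 2: ["D"]}

prototypes = {label: minimax_prototype(members, dist)
              for label, members in clusters.items()}
print(prototypes)  # {1: 'B', 2: 'D'}
```

      In cluster 1, sequence B is selected because its worst-case distance (3) is smaller than that of A or C (5 each).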

      To facilitate visualization of the tree, only 180 N2 proteins from 2009-2017 were randomly selected (20 strains per year, unlabelled). Those 180 representatives and the 44 readout panel strains (labelled) are shown in the phylogenetic tree below. The readout strains cover the major branches of the tree. The tree was built with PhyML 3.0 using the JTT substitution model and default parameters (Guindon S. et al, Systematic Biology 59(3):307-21, 2010) and visualized using ETE3 (Huerta-Cepas J. et al, Mol. Biol. Evol 33(6):1635-38, 2016).

      Author response image 9.

      (2) Double immune ferret sera may increase antibody binding affinity and cross-reactivity against heterologous strains. Using single-infection ferret sera may yield different antigenic grouping results (eg. may identify more antigenic groups). Can the authors repeat the NA antigenic grouping using single-infection ferret sera? Although data from a subset of 5 strains was presented (Figure 2, Figure Supplement 4), the information was not sufficient to support if the use of single-infection or double immune ferret sera will yield similar antigenic grouping results.

      In our ferret immunizations the boost was performed with recombinant, enzymatically active NA that was homologous to the NA of the H1N2 virus that was used for the priming by infection. We determined the NAI responses in sera from ferrets after H1N2 infection against 5 different H6N2 viruses (Figure 2 – figure supplement 5). Compared to NAI responses in sera from H1N2 infected and subsequently NA protein boosted ferrets, the NAI titers obtained after a single infection were considerably lower. Although the normalized NAI titers of day 14 and day 42 sera correlated well, we cannot exclude a degree of broadening of the NAI response in the NA protein boost sera (Figure R6). On the other hand, repeated influenza antigen exposure is the reality for the majority of people.

      (3) NA antigenicity data is presented in heat maps and the authors would often describe the heat map patterns matches without further explanations. Line 234-235, the heat map of mouse sera (Figure 2. Figure supplement 5) was described to match the results of ferret sera (Figure 2), but this tends to be subjective. A correlation analysis of 7 selected antigens showed a positive correlation, what about the other 37 antigens?

      The interpretation of heatmaps is indeed very subjective; for this reason, the correlation for the 7 selected antigens was also provided. The other 37 antigens were not tested. Considering the results obtained with post-boost sera, a simulation using random forest modeling indicates that data from one antigen of each antigenic group are sufficient to achieve a reliable predictive output (R² = 0.71) (Figure R3 of this rebuttal).

      (4) Can the authors explain in more detail how data in Figure 4a was generated? According to the authors, residues close to the catalytic pocket are more likely to impact NAI. Can the authors explain how they define if a residue is close to the catalytic pocket?

      The correlation between amino acid residue distances and significance values is explained as follows. Consider 7 distinct elements distributed horizontally, as shown by the squares in the figure below (Author response image 10a). The elements highlighted in yellow have a numerical property (in the case of N2 neuraminidase, this was the significance value obtained in the association study). Taking P1 as the reference, we can calculate the distances (red arrows) between P1 and each of P2, P4 and P7. Those distances can then be correlated with the intrinsic values of P2, P4 and P7, which enables the calculation of the correlation coefficient Tau. The same process is repeated for each position (that is, each amino acid); as a consequence, every position has its own correlation coefficient (Author response image 8b). This correlation coefficient can be represented as a heat map on the surface of N2.

      Author response image 10.

      The 2D scheme represents the strategy used to calculate the correlation (i.e. the Tau values) between distances and p-values. Tau values can then be presented in a heat map.
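      The scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: positions lie on a line as in the 2D scheme (in the structure they would be 3D coordinates), the significance values attached to P2, P4 and P7 are hypothetical, a reference position is excluded from its own calculation, and Kendall's tau is computed in its simplest form (tau-a, without tie correction).

```python
import math
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs; ties add 0."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    score = 0
    for i, j in combinations(range(n), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


def tau_per_position(coords, significance):
    """For each reference position, correlate distance-to-scored-positions with their values."""
    result = {}
    for ref, ref_xyz in coords.items():
        # Assumption: a scored position is not compared with itself.
        others = [p for p in significance if p != ref]
        distances = [math.dist(ref_xyz, coords[p]) for p in others]
        values = [significance[p] for p in others]
        result[ref] = kendall_tau_a(distances, values)
    return result


# Seven elements on a line, as in the scheme; only P2, P4 and P7 carry a value.
coords = {f"P{i}": (float(i),) for i in range(1, 8)}
significance = {"P2": 0.9, "P4": 0.4, "P7": 0.1}  # hypothetical values

taus = tau_per_position(coords, significance)
print(taus["P1"])  # -1.0: values fall as distance from P1 grows
print(taus["P7"])  # 1.0: values rise as distance from P7 grows
```

      Each position thus receives its own Tau, which is the quantity that can be colour-coded on the protein surface.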

      (5) Can the authors provide experimental data using the three recent A(H3N2) viruses as antigens and perform NAI assay to confirm if they are antigenic all deviating from group 2 viruses?

      The generation of data to determine experimental values for A/Hong_Kong/45/2018, A/Tasmania/503/2020, or A/Darwin/9/2021 would require the generation of new reassortant viruses (H1N2s), recombinant protein and immunization of new ferrets. The ferret sera would have to be analyzed against all 27 H6N2s, including duplicated control sera for normalization. The major point of the modeling was to evaluate if it is possible to predict the antigenic behavior based on amino acid substitutions.

      As an exercise, we have run the model again, this time excluding the Swe17 and HK17 antigens from the data set. Sequences of Swe17 or HK17 were then used to predict antigenic distances. The modeled versus experimental data are plotted in Author response image 7 and show a robust predictive outcome, with R² values of 0.94 and 0.91 for Swe17 and HK17, respectively.

      (6) According to Ge et al. 2022 (PMID: 35387078), N2 NA's before 2014 (2007-2013) showed a 329-N-glycosylation and E344, and they were subsequently replaced by H3N2 viruses with E344K and 329 non-glycosylation changing the NI reactivity in ferret antisera towards later strains. Were these residues also predicted to be important to N2 antigenicity from your machine-learning method?

      Three of the N2 NAs used in our panel, A/Victoria/361/2011, A/Hong_Kong/3089/2017, and A/Tennessee/18/2017, lack this N-glycosylation motif. The E344K substitution is present in another 3 NAs, derived from A/Nagano/2153/2017, A/Minnesota/11/2010, and A/Indiana/08/2011. The importance of those mutations is among the lowest predicted by our modeling. However, the differences in NAI reported by Ge et al. are small (not even twofold). The experimental variability in our study potentially limits the identification of substitutions with a subtle impact on NAI. We have added the following to the discussion in our revised manuscript:

      “It has been reported that an N-glycosylation site at position 329 combined with E344 in NA from human H3N2 viruses from 2007 to 2013 was gradually lost in later H3N2 viruses (Ge et al., 2022). This loss of an N-glycosylation site at position 329 combined with an E344K substitution was associated with a change in NAI reactivity in ferret sera. Three N2 NAs in our panel, derived from A/Victoria/361/2011, A/Hong_Kong/3089/2017, and A/Tennessee/18/2017, lack this N-glycosylation motif. The E344K substitution is present in three other NAs, derived from A/Nagano/2153/2017, A/Minnesota/11/2010, and A/Indiana/08/2011. The importance of those mutations is among the lowest ones predicted by our modeling. However, the differences in NAI reported by Ge et al. are very modest (lower than twofold). The experimental variability in our study potentially limits the identification of substitutions with a subtle impact NAI.”

      Reviewer #3 (Recommendations For The Authors):

      Specific suggestions:

      Line 132: Did the authors confirm the absence of compensatory mutations due to a heterologous H6 background that could potentially confound downstream NAI results?

      All NA genes of the rescued H6N2 viruses were fully sequenced and found to be identical to the expected NA sequences, the only exception being A/Tasmania/1018/2015, where a mixed population of wt and M467I was found. This substitution is located at the surface, at the top of the NA head domain, and thus could potentially impact NA antigenicity. However, A/Tasmania/1018/2015 H6N2 had a similar inhibition profile to the other H6N2s in phylogenetic and antigenic group 1. This indicates that, at least in this mixed population, antigenicity was not drastically affected by the M467I substitution.

      Line 96: how do these data rule out variation in the fraction of properly folded protein across NAs? They certainly show that properly folded NA protein is present, but not whether amounts vary between the different NAs.

      SEC-MALS (size exclusion chromatography-Multiangle light scattering) data and enzymatic activity were considered as a proxy for correctly folded NA. Although the specific activity of the recombinant N2 NAs is expressed per mass unit (microgram), we cannot exclude that the fraction of properly folded protein across the different recombinant NAs may vary.

      Lines 262-269: this analysis approach (based on my reading) seems to consider each polymorphism in isolation and thus does not seem well suited for accounting for epistatic interactions within the NA. For example, the effect of a substitution on NAI may be contingent upon other alleles within NA that are not cleanly segregated between the two serum comparator groups. Can the authors address the potential of epistasis within NA to confound the results shown in Figure 3?

      Unfortunately, epistatic interactions cannot be resolved using the panel of N2s selected for the study. This limitation is mentioned in our discussion:

      “It is important to highlight that co-occurring substitutions in our panel (the ones present in the main branches of the phylogenetic tree) cannot be individually assessed by association analysis or the random forest model. The individual weight of those mutation on NA drift thus remains to be experimentally demonstrated.”

      Line 331: is there a way to visualize and/or quantify how these two plots (F5 supplement 1a/b) reflect each other or not? Without this, it is hard to ascertain how they relate to each other.

      We have generated an antigenic cartography map instead. As a consequence, the MDS has become redundant and Figure 5 – supplement 1 was removed.

      Figure 4B structural images are not well labelled.

      The active site in 1 of the protomers is now indicated with an arrow in the top and side views of the NA tetramer.

      Lines 339-359: the ML predictions are just predictions and kind of meaningless without experimental validation of the predicted antigenic differences between recent NAs. This section would also be strengthened by an assessment of whether the ML approach obtains more accurate results than simply using phylogeny to predict antigenic relationships.

      Indeed, there is no experimental data from A/Hong_Kong/45/2018, A/Tasmania/503/2020, or A/Darwin/9/2021. The generation of data to determine experimental values for A/Hong_Kong/45/2018, A/Tasmania/503/2020, or A/Darwin/9/2021 would require the generation of new reassortant viruses (H1N2s), recombinant protein and immunization of new ferrets. The ferret sera would have to be analyzed against all 27 H6N2s, including duplicated control sera for normalization. The major point of the modeling was to evaluate if it is possible to predict the antigenic behavior based on amino acid substitutions.

      As an exercise, we have run the model again, this time excluding the Swe17 and HK17 antigens from the data set. Sequences of Swe17 or HK17 were then used to predict antigenic distances. The modeled versus experimental data are plotted in figure R7 and show a robust predictive outcome, with R² values of 0.94 and 0.91 for Swe17 and HK17, respectively. A major advantage of antigenic modeling is the potential to rank or flag major antigenic divergences based on available sequences before they have consolidated into a new clade. Support in selecting or designing more broadly reactive antigens is another advantage of machine learning analysis.

      Lines 416-421: appreciate the direct comparison of results obtained from ferrets versus mice.

      We thank the reviewer for expressing this appreciation.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      This manuscript by Tesmer and colleagues uses fiber photometry recordings, sophisticated analysis of movement, and deep learning algorithms to provide compelling evidence that activity in hypothalamic hypocretin/orexin neurons (HONs) correlates with net body movement over multiple behaviors. By examining projection targets, the authors show that hypocretin/orexin release differs in projection targets to the locus coeruleus and substantia nigra, pars compacta. Ablation of HONs does not cause differences in the power spectra of movements. The movement-tracking ability of HONs is independent of HON activity that correlates with blood glucose levels. Finally, the authors show that body movement is not encoded to the same extent in other neural populations.

      Strengths:

      The major strengths of the study are the combination of fiber photometry recordings, analysis of movement in head-fixed mice, and sophisticated classification of movement using deep learning algorithms. The experiments seem to be well performed, and the data are well presented, visually. The data support the main conclusions of the manuscript.

      We thank the reviewer for their supportive feedback.

      Weaknesses:

      The weaknesses are minor, mostly consisting of writing and data visualization throughout the manuscript. To some degree, it is already known that hypocretin/orexin neurons correlate with movement and arousal, although this manuscript studies this correlation with unprecedented sophistication and scale. It is also unfortunate that most of the experiments throughout the study were only performed in male mice. Taken together, this study is likely to be impactful to the field and our understanding of HONs across behavioral states.

      We agree that disentangling movement from arousal is an important aspect, and in the revised manuscript, we now include new data and analyses to address this (pupillometry to directly assess arousal, and multivariate analysis to assess the contributions of arousal vs movement to HON activity). In addition, we now implement many of the reviewer’s recommendations regarding writing, data presentation, and visual clarity (see our replies in the “recommendations for authors” section).

      Reviewer #1 (Recommendations for the authors):

      Some recommendations for the authors:

      (1) The first sentence of the Introduction states: "Neural activity related to body movement recently received much attention." I would rephrase or clarify this statement, as neuroscientists have been studying neural activity related to body movement for decades.

      The reviewer is correct. Our intention was to highlight the resurgence of movement-related neuroscience enabled by modern techniques such as deep learning applied to video data (e.g. DeepLabCut, etc). The passage has been updated for clarity.

      (2) The Introduction also states that HONs orchestrate "consciousness and arousal." I would delete the word "consciousness," as consciousness represents a lofty, global concept that is challenging to define and quantify in humans, let alone mice.

      We used the word consciousness to be consistent with current literature on the function of the mouse hypothalamus (e.g. Nat Neurosci 2016 Feb;19(2):290-8). But we agree it is not necessary here, and so we followed the advice to delete it.

      (3) The authors state that HON dynamics were recorded while mice were head-fixed while on a running wheel. For clarity, it would be helpful to visualize this head-fixation in Figures 1A and 5B. It would also be helpful to clarify how certain behaviors (e.g. grooming, chewing) were performed and recorded while the mouse was head-fixed.

      In the revised manuscript, updated graphics with a head-fixed mouse have now been added to relevant figures. Representative RGB frames (colors representing sequential frames) of each behaviour have been added to Figure 2A.

      (4) In the legend for Figure 1A, the reference to Gonzalez et al. 2016 seems out of place (at least the reader should be informed why the text is referring to this previous study). Additionally, because the references are ordered by number instead of alphabetically, it would be more helpful to refer to a numbered reference rather than a name.

      Gonzalez et al. 2016 references the source of the AAV construct used in this figure. This has been moved to the methods. Following eLife formatting guidelines, references will be alphabetized upon publication.

      (5) In Figure 3F, it would be helpful to show visual validation that the HON-DTR method indeed ablates all HONs. This is depicted conceptually, but representative figures would be much more convincing.

      A representative histological slice is now included for both wild type (WT) and HON-DTR mice in the new Figure 4B.

      Reviewer #2 (Public review):

      Summary:

      Despite several methodological strengths, the major and highly significant drawback is the confound of arousal with movement. This confound is not resolved, so the results could be explained by previously established relationships between orexin and arousal/wakefulness.

      This is an excellent point, and we agree. To address this directly in the revised manuscript, we now include new data and analyses (pupillometry to directly assess arousal, and multivariate analysis to assess the contributions of arousal vs movement to HON activity).

      Strengths:

      The authors show that orexin neuron activity is associated with body movement and that this information is conveyed irrespective of the fasted state. They also report differences in different orexin target brain regions for orexin release during movement. This paper contains an impressive array of cutting-edge techniques to examine a very important brain system, the orexin-hypocretin system. The authors offer an original perspective on the function of this system. The authors showed that orexin neuron activity scales to some degree with the magnitude of body movement change; this is unaffected by a fasted state and seems to be somewhat unique to orexin neurons.

      The investigation of other genetically defined subcortical neuron populations to determine the specificity of findings is also a strength, as is the ability to quantify movement and use deep learning to classify specific behaviors adds sophistication to analysis. The authors also show heterogeneity in orexin projections to specific target nuclei, which is interesting.

      The authors "speculate that narcolepsy-cataplexy, caused by HON loss-of-function, is perhaps explained by oscillations into unwanted sleep-states and motor programs due to impaired control loops for wakefulness and movement". This is quite an interesting aspect of their work and deserving of further study.

      We thank the reviewer for their supportive feedback.

      Weaknesses:

      Despite the strengths, there are several major and minor weaknesses that detract significantly from the study.

      My main concern with this work is the confound of arousal with movement so that correlations with one might reflect a relationship instead with the other. The orexin system is well known to play an important role in arousal, with elevated activity of orexin neurons reported for waking and high arousal. Orexin signaling has also been strongly associated with motivation, which also is associated with arousal and movement. The authors offer no compelling evidence that the relationships they describe between different movements and orexin signaling do not simply reflect the known relationship between arousal and motivation.

      The authors could address this concern by including classical arousal measurements, eg, cortical EEG recorded simultaneously with movements. Often, EEG arousal occurs independently of movement, so this could provide one approach to disentangling this confound. The idea that orexin signaling plays a role in arousal rather than movement is supported by their finding that orexin lesions using the orexin-DTR mouse model did not impact movements. In contrast, prior lesion and pharmacologic studies have found that decreased orexin signaling significantly decreases arousal and waking.

      Another way they could test their idea would be to paralyze and respirate animals so that orexin activity could be recorded without movement. Alternatively, animals could be trained to remain motionless to receive a reward. Thus, there are several ways to test the overall hypothesis of this work that have not been examined here.

      The authors propose that "a simple interpretation of their results is that, via HON movement tracking, the brain creates a "wake up" signal in proportion to movement". This seems to argue for the role of the orexin system in arousal and motivation rather than in movement per se.

      Thank you. We agree that disentangling between arousal and movement is indeed critical. A classic approach is a multivariate analysis, wherein multiple simultaneously recorded “predictors” of HON activity – such as arousal and movement - can be directly compared. While EEG arousal is an option, another well-accepted metric for arousal is pupil diameter. Using n = 7 mice, we now simultaneously record HON activity, movement, running speed, pupil size fluctuations, and ocular movements:

      We then fit a partial least squares multivariate regression (a regression type more robust to collinearity) using the movement metric, pupil size, and ocular movements as predictors of orexin neuron activity. Consistent with previous publications, we found that pupil size alone has a positive correlation with hORX.GCaMP6s (~0.45). However, using a drop-one feature analysis in multivariate regression, we found that movement had the highest % contribution to statistically explaining orexin neuron activity. Here are the new results (which we now added as Fig. 7A-B).

      Author response image 1.

      Furthermore, we also expanded this analysis to incorporate the different frequencies found in HON dynamics, using empirical mode decomposition. We found that pupil size had a maximum correlation at lower HON frequencies than the movement metric, while ocular movements were maximally correlated in higher frequencies (now added as Fig. 7D,E).

      Overall, this analysis suggests that – while HONs encode both movement and arousal – arousal and movement do not always co-fluctuate at the same timescales, and their impacts on HONs can be disentangled in a number of ways. We now mention this in revised text on page 5.

      There are several studies that have examined the effect of orexin antagonist treatment in rodents on locomotor and other motor activities. These studies have largely found no consistent effect of antagonizing orexin signaling, especially at the OxR1 receptor, on simple motor activity. These studies are not referenced here but should be taken into account in the authors' conclusions.

      We agree. Prior studies found that orexin antagonism – or optogenetic silencing of HONs – evokes either reduced locomotion, or no effect on locomotor movements. We now added text and references to paragraph 4 of Discussion, summarising this.

      Figure 3, panel F: I understand HON-DTR is a validated model but a picture of HONs ablation is necessary, including pictures of HONs outputs ablation within the SNc and LC.

      A representative histological slice is now included for both wild type (WT) and HON-DTR mice in the new Figure 4B. Because HONs are only found in the hypothalamus, somatic deletion of HONs in this region will result in axonal degradation in output regions.

      The discussion lacks a more extensive paragraph on the distinct signal and role of Ox>SNc and Ox-LC projections.

      We now added sentences discussing potential implications of this to Discussion (middle of paragraph 4).

      Reviewer #2 (Recommendations for the authors):

      Minor weaknesses

      A very important movement in rodents is head orientation, especially given the limitation in ocular movement. However, this paper used a fixed head model which obviated this movement and did not attempt to analyze ocular movements.

      Analysing ocular movements is something we had not considered but is very easy to check using pupillometry. In n = 7 mice, we recorded both orexin neuron activity and ocular movements, the latter captured through an infrared camera under constant lighting. Ocular movements had a small positive correlation with orexin neuron photometry (r = ~0.26). See response to the public review above.

      Author response image 2.

      The "HON" abbreviation is not commonly used for orexin neurons, and I suggest replacing that with a more well-known abbreviation.

      To the best of our knowledge, there is no universally agreed or best-known abbreviation for hypocretin/orexin neurons (we agree it would be nice if there was one!). “HONs” is a simple first letter abbreviation of hypocretin/orexin neurons, which acknowledges the two names for this peptide given by the original discoverers (de Lecea et al, and Sakurai et al, in 1998). Although this may not be the perfect abbreviation, we have kept it for now, also to be consistent with the large number (>10) of other published studies that recently used this abbreviation.

      The graphs showing Pearson's r values do not demonstrate a very strong correlation between neural activity and movement change; they also lack validation of genetic expression/ablation in some cases. The results would more strongly support the conclusions if statistically significant correlations could be demonstrated between activity and movement.

      We agree that a correlation of ~0.68 is probably not worthy of a “very strong” classification. While there is no universal ruleset for categorizing the strength of a correlation, we have toned down our language throughout the manuscript.

      Comment regarding statistical testing of correlations: we are cautious about relying on correlation significance testing for large sample sizes (~48’000 photometry & video samples in a 40-minute session). In our case, correlations were always extremely significant (p<0.0001). The reason for this is that, with inflated sample sizes, correlation p-values approach zero even for very weak effects (the “too big to fail” problem; see Lin et al. 2013). We therefore refrain from commenting on p-values and rather report between- or within-subjects statistical tests, or tests against zero. See four example experiments below.

      Author response image 3.

      Citation: Lin, M., Lucas, H. C., Jr & Shmueli, G. Research Commentary—Too Big to Fail: Large Samples and the p-Value Problem. Information Systems Research 24, 906–917 (2013).
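
      The large-sample argument can be illustrated with a short calculation. The sketch below is an assumption-laden illustration rather than the authors' analysis: it uses the standard t-statistic for Pearson's r with a normal approximation to the p-value (reasonable for large n, rough for small n), it treats samples as independent (which autocorrelated photometry samples are not), and the function name `pearson_p_value` and the example value r = 0.02 are hypothetical.

```python
import math

def pearson_p_value(r, n):
    """Two-sided p-value for Pearson's r using a large-sample normal approximation."""
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))

# The same negligible correlation at two very different sample sizes.
p_large = pearson_p_value(0.02, 48000)  # roughly one 40-minute session at 20 Hz
p_small = pearson_p_value(0.02, 100)
```

      Under these assumptions, a correlation of 0.02 falls below p = 0.0001 at 48,000 samples while remaining far from significance at 100 samples, so the p-value says little about the strength of the relationship.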

      The rationale for looking at running speed, general movement, and specific types of nonlocomotor movements could be clarified and explained more thoroughly in the introduction. Why is it important to distinguish between locomotion (represented here with running) and all other movements? Presumably, this is because orexin is known to regulate arousal/locomotion. What evidence is there for orexin's role in other types of movements, which are being grouped together in Figure 1? This could be laid out in more detail in the Introduction. Relatedly, it is not very clear in the text whether the correlation between movement and orexin neuron activity includes movement related to running.

      The main focus of our paper is on movement in general (i.e. video pixel difference, described in Results and Methods). This movement metric includes everything captured by the video and is agnostic to the type of movement or behaviour. To connect this to some of the specific innate movements/behaviours typically studied in the mouse literature (running, grooming, sniffing, etc.), we also plotted these in Figure 2. We have attempted to explain this better in the revised section 1 of Results.

      What exactly is being correlated in Figure 1C (and throughout the rest of the paper?) Is this the average signal correlated with the average movement change over the entire recording time? This could be more explicitly stated in methods/results. The correlations themselves/p-values could be shown in addition to/instead of Pearson's r values. Are the correlations themselves significant? This would strengthen the claim that orexin activity is strongly coupled to the magnitude of body movement change. As another example, in Figure 2D, there are no statistics reported on the correlation between movement metric and average neural signal. In Figure 6G, orexin neuron activity is more strongly correlated with movement than MVe glut neurons, but are either of these correlations significant? The correlation between MVe glut activity and movement overall seems similar to that of orexin neurons, and may be worth noting more explicitly.

      Throughout the paper, we have recorded both neural activity (photometry) and movement at 20 Hz. This would generate, for example, 48’000 samples of photometry and movement from a 40-minute session. All the samples were used to calculate a Pearson’s r between variables. To clarify this, we now added the subtext “whole-session” to relevant figures, as well as a clarification in the methods.
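
      As a concrete illustration of the whole-session calculation, the sketch below derives the sample count from the stated sampling rate and session length and computes a single Pearson's r over all paired samples. The signals are synthetic placeholders and the names `pearson_r`, `movement`, and `photometry` are hypothetical; only the 20 Hz rate and 40-minute duration come from the text.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient over all paired samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

SAMPLE_RATE_HZ = 20
SESSION_MINUTES = 40
n_samples = SAMPLE_RATE_HZ * SESSION_MINUTES * 60  # samples per session

# Synthetic placeholders for the movement metric and the photometry trace.
movement = [abs(math.sin(0.01 * i)) for i in range(n_samples)]
photometry = [0.5 * m + 0.1 * math.cos(0.3 * i) for i, m in enumerate(movement)]

r_whole_session = pearson_r(movement, photometry)
```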

      Individual experiment correlations for orexin neurons and MVe glut neurons were always significant (p<0.0001), even after a Bonferroni multiple comparisons correction was applied to each population. See the “too big to fail” nature of correlation hypothesis testing above.

      It could be made clearer at the end of Figure 2 that orexin neuron activity is tracking the magnitude of movement change (shown in Figure 2D), not that it is encoding different types of movement.

      We intended the original Figure 2E to illustrate this concept; however, this panel confused several readers and was perhaps ill-conceived. We have replaced Figure 2E with a new panel that addresses the reviewer’s statement more directly. We can construct three models where orexin neuron activity is predicted from the behavioral classification (sometimes called “one-hot” encoding) and/or the movement metric.

      Model 1 predicts orexin neuron activity using only a categorical predictor of behavioral state. Model 2 only uses the movement metric, and model 3 allows a different movement-metric correlation within each behavioral state. We can compare these models using AIC (Akaike Information Criterion) which is a point estimate. While the most complex model 3 was the best, model 2 was much closer to model 3 than model 1. Similarly, model 2 was much better than model 1. From this we conclude that the magnitude of movement change is a more powerful predictor than behavioral state (“type of movement”). This is now Figure 2E.
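
      The three-model comparison can be sketched as below. This is an illustration under assumptions, not the authors' fitting procedure: the models are fitted by ordinary least squares, AIC is computed from the residual sum of squares up to an additive constant, the data are synthetic, and all function and variable names are hypothetical. Model 3 is implemented as a separate line per behavioral state.

```python
import math

def mean_rss(y):
    """Residual sum of squares around the mean (intercept-only fit)."""
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y)

def line_rss(x, y):
    """Residual sum of squares of an ordinary least-squares line y ~ a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

def gaussian_aic(rss, n, k):
    """AIC (up to an additive constant) for a least-squares fit with k parameters."""
    return n * math.log(rss / n) + 2 * k

# Synthetic data: three behavioral states with overlapping movement ranges,
# where activity depends on movement magnitude rather than on the state itself.
states = {}
for s, base in enumerate([0.0, 0.5, 1.0]):
    movement = [base + 2.0 * i / 39 for i in range(40)]
    activity = [1.5 * m + 0.1 * math.sin(7 * i + s) for i, m in enumerate(movement)]
    states[s] = (movement, activity)

all_x = [m for x, _ in states.values() for m in x]
all_y = [a for _, y in states.values() for a in y]
n = len(all_y)
n_states = len(states)

rss1 = sum(mean_rss(y) for _, y in states.values())     # model 1: state only
rss2 = line_rss(all_x, all_y)                           # model 2: movement only
rss3 = sum(line_rss(x, y) for x, y in states.values())  # model 3: line per state

# Parameter counts include one residual-variance term.
aic = {
    "state_only": gaussian_aic(rss1, n, n_states + 1),
    "movement_only": gaussian_aic(rss2, n, 2 + 1),
    "movement_by_state": gaussian_aic(rss3, n, 2 * n_states + 1),
}
```

      Lower AIC indicates a better trade-off between fit and complexity. Because model 3 nests model 2, its residual error can never be larger, so the penalty term decides whether the extra per-state parameters are justified.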

      It would be interesting to see the raw movement metric data as shown in Figures 1 and 2 in the DTR mice to show that ablating orexin neurons does not impair the movement profile seen in Figures 1 and 2.

      The requested visualization has been added to Figure 4B.

      Validation that orexin was selectively ablated in these mice would be ideal.

      Histology (see response to public review) was added to a new Figure 4B.

      Figure 4A - OxLight expression in SNc does not look very robust.

      Please note this is a membrane-targeted indicator; the staining it produces is thus much weaker than that of cytosolic indicators such as the calcium indicator GCaMP.

      Figure 4 - It would be beneficial to see the same correlations that were done in Figures 1 and 2 to show OxLight activity vs. movement metric. Are they correlated?

      Individual traces showed significant correlations between OxLight and movement, and the population averages revealed similar trends:

      Author response image 4.

      Figure 6B - Targeting of MVe neurons does not look very specific. The sample size for orexin-targeted mice should be re-stated in the figure legend for clarity.

      Legend has been updated to clarify n = 15 for orexin targeted mice.

      Some citations didn't seem to match what was being referenced in the text. Similarly, in the legend for Figure 1C, the statistics do not match what is reported in the text. In Figure 1, the sample size is not noted in the text. When referring to running in Figure 1, is this referring to running speed? Perhaps the language could be more consistent.

      These typos (due to a rounding error) in the legend and text have been corrected. Sample size has been added to the text, and we have changed Figure 1D to clarify we are referring to running speed. We moved some citations to improve clarity.

      Methods - where were Cre mice obtained from?

      Sources now better referenced in Methods (JAX or Parlato et al).

      Figure 1, panel C: The authors compared Pearson's r-coefficient results for each animal and for each variable. However, it would be interesting to show the correlation curves for each variable as well here. Also, there is mention of a strong correlation but it is unclear whether these correlations are significant.

      See below for an example mouse.

      Author response image 5.

      Figure 3, panel F: I understand HON-DTR is a validated model but a picture of orexin ablation is necessary, including pictures of orexin fiber ablation within the SNc and LC.

      See our reply to the public review above.

      Figure 5, Panel A: Same comment as Figure 1, panel C.

      We have similarly clarified the panel and legend.

      Page 4: The authors mention "Within the 1st and 4th quartile of blood glucose, movement-HON correlations were not significantly different." Please add the figures.

      The requested plot has been added to Figure 6, panel G.

      Reviewer #3 (Public review):

      Summary

      The study presents an investigation into how hypothalamic orexin neurons (HONs) track body movement with high precision. Using techniques including fiber photometry, video-based movement metrics, and empirical mode decomposition (EMD), the authors demonstrate that HONs encode net body movement consistently across a range of behaviors and metabolic states. They compare the ability of HONs to track body movement with that of other subcortical neural populations, and thereby distinguish HON activity from the activity of those populations.

      Strengths:

      The study characterizes HONs activity as key indicators of movement and arousal, and this method may have potential implications for understanding sleep disorders, energy regulation, and brain-body coordination. Overall, I think this is a very interesting story, with novel findings and implications about sensorimotor systems in animals. The manuscript is clearly written and the evidence presented is rigorous. The conclusions are well supported by experimental data with clear statistical analyses.

      We thank the reviewer for their supportive feedback.

      Weaknesses/suggestions:

      There are a couple of issues I think the authors could address to make the paper better and more complete:

      (1) The study primarily focuses on steady-state behaviors. It would be interesting if the authors' current dataset allows analyses of HON dynamics during transitions between behavioral states (e.g., resting to running or grooming to sniffing). This could provide additional insights into how HONs adapt to rapid changes in body movement.

      This is a fantastic idea, and easy to check using our classification CNN. We identified the six most frequent behavioral transitions and plotted them in Figure 2H. HONs show rapid dynamics in activity aligned with behavioral changes.

      These changes are very similar to the movement magnitude along these transitions, which is now also plotted in Figure 2G.

      (2) Given the established role of HONs in arousal and wakefulness, the study could further investigate how movement-related HON dynamics interact with arousal states. For example, does HON encoding of movement differ during sleep versus wakefulness?

      To further investigate how movement encoding interacts with arousal, we now include quantification and analysis of pupil-linked arousal (see new Figure 7). We agree it would be interesting to look at what happens during sleep, especially REM sleep when some HONs are thought to be active where there is no/little body movement, but this is beyond the scope of the present study.

      (3) Although HON ablation experiments suggest that HONs do not shape movement frequency profiles, it would be more compelling if the authors could investigate whether HONs contribute to specific types of movements (e.g., fine motor vs. gross motor movements) or modulate movement initiation thresholds.

      We performed this analysis using the k-means classifier for small/large movements. Consistent with previous results, we found no significant effect (p = 0.2767) of genotype on the frequency of identified small (fine) or large (gross) movement clusters. This plot has been added to Figure 4E.
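
      A two-cluster split of movements into small and large can be sketched as follows. The response does not specify which features were clustered, so this example assumes a one-dimensional movement magnitude per event; the function name `kmeans_1d_two_clusters` and the example values are hypothetical.

```python
def kmeans_1d_two_clusters(values, iterations=50):
    """Split 1-D values into a small (label 0) and a large (label 1) cluster."""
    low, high = min(values), max(values)  # deterministic initial centers
    labels = []
    for _ in range(iterations):
        # Assign each value to the nearest center (ties go to the small cluster).
        labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        small = [v for v, lab in zip(values, labels) if lab == 0]
        large = [v for v, lab in zip(values, labels) if lab == 1]
        new_low = sum(small) / len(small)
        new_high = sum(large) / len(large) if large else high
        if new_low == low and new_high == high:
            break  # centers stopped moving
        low, high = new_low, new_high
    return labels, (low, high)

# Hypothetical movement magnitudes (arbitrary units).
magnitudes = [0.1, 0.2, 0.15, 2.0, 2.2, 1.9, 0.05, 2.5]
labels, centers = kmeans_1d_two_clusters(magnitudes)
```

      The frequency of each cluster per animal could then be compared between genotypes with whatever between-subjects test the study design calls for.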

      (4) The heterogeneous movement-related orexin dynamics observed in the LC and SNc raise intriguing questions about the circuit-level mechanisms underlying these differences. Optogenetic or chemogenetic manipulation of these projections could validate the functional implications of these dynamics.

      We agree. We now discuss some implications of this in revised Discussion (paragraph 4). Please note that previous work already demonstrated that orexin action in the SNc can produce locomotion (referenced in the paragraph), though we agree that further work would be valuable.

      Reviewer #3 (Recommendations for the authors):

      Additional feedback:

      (1) Figure 1C: the individual data points are hard to track or see. Consider using a larger marker face to help data visualization. Similar issues can be found in Figures 2C, 2E, 5E, 6C, 6F, and 6G.

      The thickness of the lines and scatterplot markers has been increased.

      (2) First Section of Results: the authors claim to use a deep-learning network to automatically classify video recordings into five distinct behaviors. However, several issues need to be addressed here:

      a. In Results, the corresponding sentence lacks a reference to the Methods Section.

      Reference has been added to the text.

      b. In Methods, the description of the CNN model is quite limited, lacking many basic, necessary components including necessary references to published papers, the model training, characterization (only an overall accuracy is not enough), as well as dataset definition, preparation, augmentation (if any), etc.

      We have expanded the methods section regarding the CNN model.

      (3) First Section of Results: in the second paragraph, the authors claim that "Overall, these results reveal HON population activity precisely tracks a general degree of body movement across recorded behaviors." This is not accurate. To indicate that HONs activity tracks the general degree of body movement across behavior states, they need to further show that behavioral states with similar levels of movement metrics can be differentiated via HON activities. However, as they showed in Figure 2D, some behaviors with similar values of movement metric do not seem to be easily discerned by HON activity levels.

      We agree with you, and this is also what we originally intended to convey – now reworded for clarity.

      (4) Technical issue: Figures 3B, 3C, 3G, using local regression to plot the solid lines makes them touch negative values, which does not make sense for "power proportion" (this quantity is always non-negative).

      This is a good point. To fix this, we first log-transformed the power metric, then performed a local regression, and used the inverse link function to transform the model predictions back to %-units for visualization. This has been noted in the methods.
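
      The idea of smoothing on the log scale and back-transforming can be sketched as below. This is a minimal illustration, not the authors' code: a hand-written local linear smoother with tricube weights stands in for whichever local regression routine was used, and the bandwidth, the example data, and the names `local_linear` and `smooth_nonnegative` are assumptions. Because the back-transform is an exponential, the plotted curve cannot drop below zero.

```python
import math

def local_linear(x, y, bandwidth):
    """Local linear regression with tricube weights, evaluated at each x."""
    fitted = []
    for x0 in x:
        sw = swx = swy = swxx = swxy = 0.0
        for xi, yi in zip(x, y):
            u = abs(xi - x0) / bandwidth
            if u >= 1.0:
                continue  # outside the local window
            w = (1.0 - u ** 3) ** 3
            sw += w
            swx += w * xi
            swy += w * yi
            swxx += w * xi * xi
            swxy += w * xi * yi
        # Weighted least-squares line through the local window.
        slope = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
        intercept = (swy - slope * swx) / sw
        fitted.append(intercept + slope * x0)
    return fitted

def smooth_nonnegative(x, y, bandwidth):
    """Smooth log(y), then exponentiate so the curve stays positive."""
    log_fit = local_linear(x, [math.log(v) for v in y], bandwidth)
    return [math.exp(v) for v in log_fit]

# Hypothetical power proportions (%) that decay toward zero with frequency.
freq = [float(i) for i in range(20)]
power_pct = [40.0 * math.exp(-0.6 * f) + 0.01 for f in freq]
smoothed = smooth_nonnegative(freq, power_pct, bandwidth=5.0)
```

      The log transform requires strictly positive inputs, so exact zeros would need a small offset before smoothing.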

      (5) Figure 3G: For a better comparison, consider combining the two plots into a single plot.

      The two plots have been merged as shown in Figure 4C.

      (6) Figure 5E: For a better data visualization, the current pair of plots can be consolidated into one single plot where the x-axis is Move and the y-axis is dGlu. In this way, it is easier to understand and the orthogonality as claimed in the manuscript can be more apparent.

      The requested plot has been added as Figure 6F.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This study takes a detailed approach to understanding the effect of menopausal hormone therapy (MHT) in the brain aging of females. Neuroimaging data from the UK Biobank is used to explore brain aging and shows an unexpected effect of current MHT use and poorer brain health outcomes relative to never users. There is considerable debate about the benefits of MHT and estrogens in particular for brain health, and this analysis illustrates that the effects are certainly not straightforward and require greater consideration.

      Strengths:

      (1) The detailed approach to obtaining important information about MHT use from primary care records. Prior studies have suggested that factors such as estrogen/progestin type, route of administration, duration, and timing of use relative to menopause onset can contribute to whether MHT benefits brain health.

      (2) Consideration of type of menopause (spontaneous or surgical) in the analysis, as well as sensitivity analyses to rule out the effect being driven by those with clinical conditions.

      (3) The incorporation of the brain age estimate along with hippocampal volume to address brain health.

      (4) The complex data are also well explained and interpretations are reasonable.

      (5) Limitations of the UK Biobank data are acknowledged.

      We thank the reviewer for their time and the positive evaluation of our manuscript.

      Weaknesses:

      (1) Lifestyle factors are listed and the authors acknowledge group differences (at least between current users and never users of MHT). I was not able to find these analyses showing these differences.

      We highlighted and tested for group differences in lifestyle scores, and the results are shown in Tables 1-3, column p-value. As highlighted in the method section (page 9): “The lifestyle score was calculated using a published formula (69), and included data on sleep, physical activity, nutrition, smoking, and alcohol consumption (see supplementary Note 3, Table S2)”. In line with Reviewer 1’s suggestion to the authors, we have now included an additional table in the supplementary materials (Table S2) testing for group differences in the specific lifestyle factors constituting the lifestyle score. Please find a more detailed response below (Recommendations for the authors, Response to Comment 1).

      (2) The distribution of women who were not menopausal was unequal across groups, and while the authors acknowledge this, one wonders to what extent this explains the observed findings.

      We agree with the reviewer that the unequal distribution of women across groups can influence the observed findings. We have made minor edits to highlight this important topic more explicitly in the discussion:

      Discussion (page 21): “Current MHT users were significantly younger than past- and never-users, and around 67% were menopausal relative to over 80% in the past- and never-user groups. The unequal distribution of age and menopausal status across groups may have influenced the observed findings. For instance, a larger proportion of the current users might be in the perimenopausal phase, which is often associated with debilitating neurological and vasomotor symptoms (1). MHT is commonly prescribed to minimize such symptoms. Although MHT initiation during perimenopause has been associated with improved memory and hippocampal function, as well as lower AD risk later in life (15), the need for MHT might in itself be an indicator of neurological changes (71), here potentially reflected in higher BAG and lower hippocampal volumes. After the transition to menopause, symptoms might subside and some perimenopausal brain changes might revert or stabilize in the postmenopausal phase (5). Although the UK Biobank lacks detailed information on menopausal symptoms and perimenopausal staging, our results might be capturing subtle disturbances during perimenopause that later stabilize. This could explain why the largely postmenopausal groups of past MHT users and never-users present with lower GM and WM BAG than the current user group. Considering the critical window hypothesis emphasizing perimenopause as a key phase for MHT action (29,43), future longitudinal studies are crucial to clarify the interplay between neurological changes and MHT use across the menopause transition.”

      Discussion (page 25): “In addition, previous studies highlight that UK Biobank participants are considered healthier than the general population based on several lifestyle and health-related factors (89, 90). This healthy volunteer bias increases with age, likely resulting in a disproportionate number of healthier older adults. Together with the imbalance in age distributions across groups, this might explain the less apparent brain aging in the older MHT user groups. We have previously highlighted that age is negatively associated with the number of APOE ε4 carriers in the UK Biobank (21), which is indicative of survivor bias.”

      (3) While the interpretations are reasonable, and relevant theories (healthy cell & critical window) are mentioned, the discussion is missing a more zoomed-out perspective of the findings. While I appreciate wanting to limit speculation, the reader is left having to synthesize a lot of complex details on their own. A particularly difficult finding to reconcile is under what conditions these women benefit from MHT and when do they not (and why that may be).

      We thank the reviewer for this comment. As the presented data is cross-sectional and does not enable causal inference, we have refrained from a more zoomed-out interpretation of the results to avoid undue speculations. However, where applicable, we have discussed our findings in a broader context such as the effects of MHT use on the brain across the menopausal transition (discussion page 21) and the effects of MHT use on the brain in the presence and absence of bilateral oophorectomy and/or hysterectomy (discussion page 25).

      To best inform the reader about the scope of our paper, we would like to highlight the following sentences in our discussion (page 24):

      “The current work represents the most comprehensive study of detailed MHT data, APOE ε4 genotype, and several brain measures in a large population-based cohort to date. Overall, our findings do not unequivocally support general neuroprotective effects of MHT, nor do they indicate severe adverse effects of MHT use on the female brain. The results suggest subtle yet complex relationships between MHT and brain health, highlighting the necessity for a personalized approach to MHT use. Importantly, our analyses provide a broad view of population-based associations and are not designed to guide individual-level decisions regarding the benefits versus risks of MHT use.”

      And the conclusion (page 25): “In conclusion, our findings suggest that associations between MHT use and female brain health might vary depending on duration of use and past surgical history. Although the effect sizes were generally modest, future longitudinal studies and RCTs, particularly focused on the perimenopausal transition window, are warranted to fully understand how MHT use influences female brain health. Importantly, considering risks and benefits, decisions regarding MHT use should be made within the clinical context unique to each individual.”

      Reviewer #1 (Recommendations for the authors):

      Can the authors provide:

      (1) More information about which aspects of lifestyle factors were different between the groups, and how these factors may have contributed to the observed findings (if possible, without burying this information in the supplemental)?

      We thank the reviewer for this suggestion. We now added a table comparing lifestyle factors contained in the lifestyle score by MHT user status using t-tests (continuous variables) or χ2 tests (see Table S2). The results are referred to in the main manuscript result section under “Sample characteristics”, and the table (Table S2) is provided in the supplements not to overburden the main text, in line with input from reviewer 3.

      We updated the main text to refer to Table S2 and updated the supplementary Note 3 (page 2-3) to include the results of the comparison of the lifestyle factors contained in the lifestyle score by MHT user status.

      Methods, page 9:“The lifestyle score was calculated using a published formula (69), and included data on sleep, physical activity, nutrition, smoking, and alcohol consumption (see supplementary Note 3, Table S2).”

      Results, page 13: “Sample demographics including lifestyle score, stratified by MHT user group, surgical history among MHT users, and estrogen only MHT or combined MHT use, are summarized in Table 1, 2 and 3, respectively. MHT user group differences for each lifestyle factor contained in the lifestyle score are shown in Table S2.”

      “Note 3| Lifestyle Score

      The lifestyle score was calculated based on sleep duration, time spent watching television, current and past smoking status, alcohol consumption frequency, physical activity level (number of days per week of moderate/vigorous activity for at least 10 minutes), intake of fruits and vegetables, and intake of oily fish, beef, lamb/mutton, pork and processed meat (for details see (10)). Each unhealthy lifestyle factor was scored with 1 point (e.g., smoking), and participants’ points were summed to generate an unweighted score (from 0-9): the higher the lifestyle score, the unhealthier the participant’s lifestyle.

      A comparison of the lifestyle factors contained in the lifestyle score by MHT user status is presented in Table S2. In summary, we found that current MHT users were more often smokers than never-users, had a higher alcohol intake than never- and past MHT users, reported the lowest fruit and vegetable intake relative to never-users and past MHT users, and stated lower moderate activity levels relative to past MHT users. Past MHT users reported higher alcohol intake than never-users, spent more time watching TV relative to never- and current-users, consumed more beef, pork, lamb/mutton, and processed meat than never-users, and reported lower vigorous activity levels relative to never-users. However, oily fish intake and fruit and vegetable intake were higher among past MHT users relative to never- and current-users. Self-reported sleep duration did not differ between MHT user groups.”
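
      The scoring rule in Note 3 (one point per unhealthy factor, summed to an unweighted 0-9 score) can be sketched as below. The nine factor names and the example participant are hypothetical placeholders chosen for illustration; the published formula cited in the note defines the actual factors and the thresholds at which each counts as unhealthy.

```python
# Hypothetical factor names; the cited formula defines the real ones and their cut-offs.
FACTORS = (
    "unhealthy_sleep_duration",
    "high_tv_time",
    "smoking",
    "frequent_alcohol",
    "low_physical_activity",
    "low_fruit_vegetable_intake",
    "low_oily_fish_intake",
    "high_red_meat_intake",
    "high_processed_meat_intake",
)

def lifestyle_score(unhealthy_flags):
    """Unweighted score: one point per factor flagged as unhealthy (0-9)."""
    missing = [name for name in FACTORS if name not in unhealthy_flags]
    if missing:
        raise ValueError(f"missing factors: {missing}")
    return sum(1 for name in FACTORS if unhealthy_flags[name])

# Example participant flagged as unhealthy on three factors.
participant = {name: False for name in FACTORS}
participant.update(smoking=True, high_tv_time=True, low_oily_fish_intake=True)
score = lifestyle_score(participant)
```

      A higher score indicates an unhealthier lifestyle, matching the direction described in the note.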

      (2) A greater description of the 2 main theories of MHT effects on the brain (healthy cell vs critical window). Can the authors also provide a more thorough explanation for how the findings fit with these theories?

      We thank the reviewer for this comment. We have described our findings in the context of the critical window hypothesis (discussion, page 21, paragraph 2), the healthy cell bias hypothesis (discussion, page 22, paragraph 3), and healthy user bias hypothesis (discussion, page 22, paragraph 4). We refrained from a more thorough explanation to avoid undue speculations.

      (3) Reflect more on what the findings may indicate as to who benefits from MHT, and why. There are some references that the authors may want to add, particularly related to recent findings from premenopausal bilateral oophorectomies that also speak to when (and for whom) MHT use might benefit.

      We thank the reviewer for this feedback. We have included additional references in the revised manuscript as follows:

      Discussion, page 23: “It is also possible that the timing between MHT use and surgery is more tightly controlled and therefore more beneficial for brain aging (43). For instance, studies suggest that MHT may mitigate the potential long-term adverse effects of bilateral oophorectomy before natural menopause on bone mineral density as well as cardiovascular, cognitive and mental health (79-81). In addition, a 2024 UK Biobank study found that ever having used MHT was associated with decreased odds of Alzheimer’s disease in women with bilateral oophorectomy (82).”

      (79) Blumel JE, Arteaga E, Vallejo MS, et al. Association of bilateral oophorectomy and menopause hormone therapy with mild cognitive impairment: the REDLINC X study. Climacteric 2022;25:195-202.

      (80) Kaunitz AM, Kapoor E, Faubion S. Treatment of Women After Bilateral Salpingo-oophorectomy Performed Prior to Natural Menopause. JAMA 2021;326:1429-1430.

      (81) Stuursma A, Lanjouw L, Idema DL, de Bock GH, Mourits MJE. Surgical Menopause and Bilateral Oophorectomy: Effect of Estrogen-Progesterone and Testosterone Replacement Therapy on Psychological Well-being and Sexual Functioning; A Systematic Literature Review. J Sex Med 2022;19:1778-1789.

      (82) Calvo N, McFall GP, Ramana S, et al. Associated risk and resilience factors of Alzheimer's disease in women with early bilateral oophorectomy: Data from the UK Biobank. J Alzheimers Dis 2024;102:119-128.

      Reviewer #2 (Public review):

      Summary:

      In this observational study, Barth et al. investigated the association between menopausal hormone therapy and brain health in middle- to older-aged women from the UK Biobank. The study evaluated detailed MHT data (never, current, or past user), duration of mHT use (age first/last used), history of hysterectomy with or without bilateral oophorectomy, APOE ε4 genotype, and brain characteristics in a large, population-based sample. The researchers found that current mHT use (compared to never-users), but not past use, was associated with a modest increase in gray and white matter brain age gap (GM and WM BAG) and a decrease in hippocampal volumes. No significant association was found between the age of mHT initiation and brain measures among mHT users. Longer duration of use and older age at last MHT use post-menopause were associated with higher GM and WM BAG, larger WMH volumes, and smaller hippocampal volumes. In a sub-sample, after adjusting for multiple comparisons, no significant associations were found between detailed mHT variables (formulations, route of administration, dosage) and brain measures. The association between mHT variables and brain measures was not influenced by APOE ε4 allele carrier status. Women with a history of hysterectomy with or without bilateral oophorectomy had lower GM BAG compared to those without such a history. Overall, these observational data suggest that the association between mHT use and brain health in women may vary depending on the duration of use and surgical history.

      Strengths:

      (1) The study has several strengths, including a large, population-based sample of women in the UK, and comprehensive details of demographic variables such as menopausal status, history of oophorectomy/hysterectomy, genetic risk factors for Alzheimer's disease (APOE ε4 status), age at mHT initiation, age at last use, duration of mHT, and brain imaging data (hippocampus and WMH volume).

      (2) In a sub-sample, the study accessed detailed mHT prescription data (formulations, route of administration, dosage, duration), allowing the researchers to study how these variables were associated with brain health outcomes. This level of detail is generally missing in observational studies investigating the association of mHT use with brain health.

      We thank the reviewer for their time and the positive evaluation of our manuscript.

      Weaknesses:

      (1) While the study has many strengths, it also has some weaknesses. As highlighted in an editorial by Kantarci & Manson (2023), women with symptoms such as subjective cognitive problems, sleep disturbances, and elevated vasomotor symptoms combined with sleep disturbances tend to seek mHT more frequently than those without these symptoms. The authors of this study have also indicated that the need for mHT, which might be associated with these symptoms, may itself be an indicator of preexisting neurological changes, potentially reflected in worse brain health scores, including higher BAG, lower hippocampal volume, and/or higher WMH. However, the study could not report how many current users had these symptoms. Women with vasomotor symptoms who are using mHT are more likely to stay longer in the healthcare system than those without these symptoms and with no history of mHT use. The authors noted that the UK Biobank lacks detailed information on menopausal symptoms and perimenopausal staging, limiting the study's ability to understand how these variables influence outcomes.

      We thank the reviewer for the succinct synopsis of the limitations highlighted in the discussion (page 21). We have now also added the mentioned reference, the 2023 editorial by Kantarci & Manson, to the discussion (see reference 71).

      Discussion (page 21): “Current MHT users were significantly younger than past- and never-users, and around 67% were menopausal relative to over 80% in the past- and never-user groups. The unequal distribution of age and menopausal status across groups may have influenced the observed findings. For instance, a larger proportion of the current users might be in the perimenopausal phase, which is often associated with debilitating neurological and vasomotor symptoms (1). MHT is commonly prescribed to minimize such symptoms. Although MHT initiation during perimenopause has been associated with improved memory and hippocampal function, as well as lower AD risk later in life (15), the need for MHT might in itself be an indicator of neurological changes (71); here potentially reflected in higher BAG and lower hippocampal volumes. After the transition to menopause, symptoms might subside and some perimenopausal brain changes might revert or stabilize in the postmenopausal phase (5). Although the UK Biobank lacks detailed information on menopausal symptoms and perimenopausal staging, our results might be capturing subtle disturbances during perimenopause that later stabilize. This could explain why the largely postmenopausal groups of past MHT users and never-users present with lower GM and WM BAG than the current user group. Considering the critical window hypothesis emphasizing perimenopause as a key phase for MHT action (29,43), future longitudinal studies are crucial to clarify the interplay between neurological changes and MHT use across the menopause transition.”

      (2)  Earlier observational studies have reported conflicting results regarding the association between mHT use and the risk of dementia and brain health. Contrary to some observational studies, three randomized trials (WHI, KEEPS, ELITE) (Espeland et al., 2013; Gleason et al., 2015; Henderson et al., 2016) demonstrated neither beneficial nor harmful effects of mHT (with varying doses and formulations) when initiated closer to menopause (<5 years). While strong efforts were made to run proper statistical analyses to investigate the association between mHT use and brain health, these results mainly reflect associations, not causal relationships, as the authors also state.

      We thank the reviewer for pointing that out.

      (3)  Furthermore, observational studies have intrinsic limitations, such as a lack of control over switching mHT doses and formulations, a lack of laboratory measures to confirm mHT use, and reliance on self-reported data, which may not always be reliable. The authors caution that these findings should not guide individual-level decisions regarding the benefits versus risks of mHT use. However, the study raises new questions that should be addressed by randomized clinical trials to investigate the varying effects of MHT on brain health and dementia risk.

      We thank the reviewer for making our efforts in providing proper disclaimers in the discussion visible.

      Reviewer #2 (Recommendations for the authors):

      (1) The study could benefit from extending these findings by adding plasma biomarkers of AD and PET imaging markers to further study the association of mHT variables with brain health.

      We agree with the reviewer that such markers would be beneficial for elucidating the association between MHT variables and brain health. Unfortunately, these markers are not readily available in the UK Biobank.

      (2) The study's reliance on a predominantly white cohort limits the generalizability of the findings to more diverse populations. This homogeneity may not capture the full spectrum of responses to MHT across different ethnic and genetic backgrounds.

      We fully agree with the reviewer's statement and state this limitation in the discussion (page 25) as follows:

      “In addition to these inherent biases in aging cohorts, the ethnic background of the sample is homogeneous (> 96% white), further reducing the generalizability of the results.”

      (3) The study may benefit by editing the following information in the introduction: "In summary, WHIMS, HERS, and KEEPS mainly relied on orally administered CEE in older-aged or recently postmenopausal females." KEEPS used two routes and formulations (transdermal estradiol and oCEE, both with micronized progesterone).

      We thank the reviewer for catching this oversight. We removed the sentence to avoid ambiguities and revised the sentence specifically referring to the KEEPS study as follows:

      Introduction, page 3: “In contrast, administering oral CEE or transdermal estradiol plus micronized progesterone in recently postmenopausal females did not alter cognition in the Kronos Early Estrogen Prevention Study (KEEPS) (28).”

      (4) The study may benefit by editing the following statement in the introduction: "oral CEE use in combination with MPA seems to increase the risk for AD regardless of timing": I would suggest revising this statement, which is based on review article 29. The statement of the adverse effect of oCEE regardless of the time of start contradicts earlier randomized clinical findings. I think it is important to make a distinction between the outcomes of randomized controlled trials and observational studies. The WHIMS (Shumaker et al., 2003), a randomized controlled trial, reported an increased risk of dementia with oCEE + MPA compared to placebo in women who were more than 10 years from the onset of menopause when the therapy was initiated. Two other long-duration randomized trials that tested the effect of oral oestrogen and progesterone treatment on cognitive function in women who started treatment shortly after menopause (within 3 or 6 years) did not find evidence that treatment benefits or harms cognitive function compared with placebo (Gleason et al., 2015; Henderson et al., 2016). A short-term (4 months) randomized trial (Maki et al., 2007; mentioned in ref 29) reported a potential negative effect of CEE/MPA on verbal memory in women who started HT shortly after menopause (within 3 years). That study did not investigate the risk of dementia, and the duration of HT use was short.

      We thank the reviewer for this detailed input. After checking the provided references, we rephrased the sentence as follows:

      Introduction, page 4: “Although emerging evidence supports this hypothesis (30, 31), oral CEE use in combination with MPA has been found to increase the risk for memory decline regardless of timing (26, 29, 32).”

      We believe this formulation is more in line with the evidence provided by Shumaker et al. 2003, Maki et al. 2007, and the other references cited in the review paper by Maki (ref. 29). The reviewer further refers to Gleason et al. 2015 and Henderson et al. 2016; however, both RCTs used micronized progesterone, not MPA, and are therefore not directly relevant to a statement about CEE combined with MPA.

      (26) Shumaker SA, Legault C, Rapp SR, et al. Estrogen plus progestin and the incidence of dementia and mild cognitive impairment in postmenopausal women: the Women's Health Initiative Memory Study: a randomized controlled trial. JAMA 2003;289:2651-2662.

      (29) Maki PM. Critical window hypothesis of hormone therapy and cognition: a scientific update on clinical studies. Menopause 2013;20:695-709.

      (32) Maki PM, Gast MJ, Vieweg AJ, Burriss SW, Yaffe K. Hormone therapy in menopausal women with cognitive complaints: a randomized, double-blind trial. Neurology 2007;69:1322-1330.

      Reviewer #3 (Public review):

      In this study Barth et al. present results of detailed analyses of the relationships between menopausal hormone therapy (MHT), APOE ε4 genotype, and measures of anatomical brain age in women in the UK Biobank. While past studies have investigated the links between some of these variables (including works by the authors themselves), this new study adds more detailed MHT variables, surgical status, and additional brain aging measures. The UK biobank sample is large, but it is a population cohort and many of the MHT measures are self-reported (as the authors point out). However, the authors present a solid analysis of the available information which shows associations between MHT user status, length of MHT use, as well as surgical status with brain age. However, as the authors themselves state, the results do not unequivocally support the neuroprotective or adverse effect of MHT on the brain. I think this work strengthens the case for the need of better-designed longitudinal studies investigating the effect of MHT on the brain in the peri/post-menopausal stage.

      Strengths:

      (1) The authors addressed the statistical analyses rigorously. For example, multiple testing corrections, outlier removal, and sensitivity analysis were performed carefully. Ample background information is provided in the introduction allowing even individuals not familiar with the field to understand the motivation behind the work. The discussion section also does a great job of addressing open questions and limitations. Very detailed results of all statistical tests are provided either in the main text or in the supplementary information.

      We thank the reviewer for their time and the positive evaluation of our manuscript.

      Weaknesses:

      (1) For me, the biggest weakness was the presentation of the results. As many variables are involved and past studies have investigated several of these questions, it would have helped to better clarify the analysis and questions that are addressed by this study in particular and what sets this work apart from past studies. The information is present in the manuscript but better organization might have helped. For example, a figure depicting the key questions near the beginning of the manuscript would have been very helpful for me. The Tables also contain a lot of information but I wonder if there might be a way to capture the most relevant information more succinctly (either in Table format or in a figure) for the main text.

      We thank the reviewer for this comment. We agree that, given the large number of analyses, it can be hard to keep an overview. We have now added a Figure summarizing the main and sensitivity analyses by sample.

      (2) Another concern I had was the linear models investigating the effects of these MHT variables on the brain age gap. The authors have included "age" as one of the parameters in this analysis. I wonder if adding a quadratic age factor (age²) to the model might have improved the fit, since many brain phenotypes tend to show quadratic age effects in the 40 to 80-year age range.

      We thank the reviewer for this suggestion. We have rerun the main analysis in the whole sample (model 1) with age squared as an additional covariate, and compared the gray matter brain age gap model fits using the corrected Akaike Information Criterion (AICc). All models with age squared fit better than the corresponding models without age squared (see Author response table 1). Hence, in the revised manuscript, we added a sensitivity analysis rerunning model 1 with age squared to account for potential non-linear effects of age. The results were largely consistent. The manuscript was revised as follows to reflect the added analysis:

      Sensitivity analysis (Methods, Page 11): “To test whether the results were influenced by the inclusion of participants with ICD-10 diagnosis or by non-linear effects of age, the main analyses (models 1-2) were re-run excluding the sub-sample with diagnosed brain disorders (see supplementary Note 2) or adding age² as additional covariate, respectively.”

      Sensitivity analysis (Results, Page 20): “The results were consistent after removing participants with ICD-10 diagnoses known to impact the brain (see Table S9 for model 1 analyses and Table S10 for model 2 analyses), after additionally adjusting for age² (see Table S11), and after removing extreme values (see Table S12 for model 1 analyses).”

      Author response table 1.

      Gray matter brain age gap model selection based on corrected Akaike Information Criterion (AICc)

      Abbreviations and explanations of parameters: MHT = menopausal hormone therapy, K = number of estimated parameters for each model, AICc = the information criterion requested for each model, ΔAICc = the appropriate delta AIC component depending on the information criteria selected, ModelLik = the relative likelihood of the model given the data, AICcWT = Akaike weights to indicate the level of support in favor of any given model being the most parsimonious among the candidate model sets, LL = log-likelihood of each model.
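
      For readers unfamiliar with the quantities listed above, the following is a minimal sketch of how AICc, ΔAICc, relative model likelihood, and Akaike weights are conventionally derived from each model's log-likelihood and parameter count. It is an illustration only, not the authors' analysis code: the function names, the sample size, and the two candidate models with their log-likelihoods are hypothetical placeholders, and the standard small-sample correction formula is assumed.

```python
import math


def aicc(log_lik: float, k: int, n: int) -> float:
    """Corrected AIC: -2*LL + 2K + 2K(K+1)/(n-K-1) (standard small-sample form)."""
    return -2.0 * log_lik + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)


def aicc_table(models: dict, n: int) -> dict:
    """Build the comparison columns for a set of candidate models.

    `models` maps a model label to a (log-likelihood, K) tuple.
    """
    scores = {name: aicc(ll, k, n) for name, (ll, k) in models.items()}
    best = min(scores.values())
    # Delta AICc: distance of each model from the best-scoring model.
    delta = {name: score - best for name, score in scores.items()}
    # Relative likelihood of each model given the data.
    rel_lik = {name: math.exp(-0.5 * d) for name, d in delta.items()}
    total = sum(rel_lik.values())
    return {
        name: {
            "K": models[name][1],
            "AICc": scores[name],
            "dAICc": delta[name],
            "ModelLik": rel_lik[name],
            "AICcWt": rel_lik[name] / total,  # Akaike weight
            "LL": models[name][0],
        }
        for name in models
    }


# Hypothetical log-likelihoods and parameter counts, for illustration only.
candidates = {
    "age": (-5000.0, 5),
    "age + age^2": (-4990.0, 6),
}
table = aicc_table(candidates, n=1000)
for name, row in sorted(table.items(), key=lambda item: item[1]["dAICc"]):
    print(name, {key: round(value, 3) for key, value in row.items()})
```

      The model with ΔAICc = 0 is the best-supported candidate, and the Akaike weights sum to 1 across the candidate set, so each weight can be read as the relative support for that model.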

      Reviewer #3 (Recommendations for the authors):

      (1) Please note typo in Figures 2 and 3 legend "GM WM".

      We thank the reviewer for catching this typo and we changed it to BAG GM and BAG WM for all Figures for consistency.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      The chromophore molecule of animal and microbial rhodopsins is retinal which forms a Schiff base linkage with a lysine in the 7-th transmembrane helix. In most cases, the chromophore is positively charged by protonation of the Schiff base, which is stabilized by a negatively charged counterion. In animal opsins, three sites have been experimentally identified, Glu94 in helix 2, Glu113 in helix 3, and Glu181 in extracellular loop 2, where a glutamate acts as the counterion by deprotonation. In this paper, Sakai et al. investigated molecular properties of anthozoan-specific opsin II (ASO-II opsins), as they lack these glutamates. They found an alternative candidate, Glu292 in helix 7, from the sequences. Interestingly, the experimental data suggested that Glu292 is not the direct counterion in ASO-II opsins. Instead, they found that ASO-II opsins employ a chloride ion as the counterion. In the case of microbial rhodopsin, a chloride ion serves as the counterion of light-driven chloride pumps. This paper reports the first observation of a chloride ion as the counterion in animal rhodopsin. Theoretical calculation using a QM/MM method supports their experimental data. The authors also revealed the role of Glu292, which serves as the counterion in the photoproduct, and is involved in G protein activation.

      The conclusions of this paper are well supported by data, while the following aspects should be considered for the improvement of the manuscript.

      We thank the reviewer for carefully reading the manuscript and providing important suggestions. Below, we address the specific comments.

      (1) Information on sequence alignment only appears in Figure S2, not in the main figures. Figure S2 is too complicated by so many opsins and residue positions. It will be difficult for general readers to follow the manuscript because of such an organization. I recommend the authors show key residues in Figure 1 by picking up from Figure S2.

      We thank the reviewer for pointing this out. As suggested, we have selected key residues (potential counterion sites) from Fig. S2 and show them now as Fig. 1B in the revised manuscript. Fig. S2 has also been simplified by showing only the most important residues.

      (2) Halide size dependence. The authors observed spectral red-shift for larger halides. Their observation is fully coincident with the chromophore molecule in solution (Blatz et al. Biochemistry 1972), though the isomeric states are different (11-cis vs all-trans). This suggests that a halide ion is the hydrogen-bonding acceptor of the Schiff base N-H group in solution and ASO-II opsins. A halide ion is not the hydrogen-bonding acceptor in the structure of halorhodopsin, whose halide size dependence is not clearly correlated with absorption maxima (Scharf and Engelhard, Biochemistry 1994). These results support their model structure (Figure 4), and help QM/MM calculations.

      We appreciate the comment, which provides a deeper insight into our results and reinforces our conclusions. We have revised the discussion of the effect of halide size on the λ<sub>max</sub> shift to cite the prior work mentioned by the reviewer.

      (3) QM/MM calculations. According to Materials and Methods, the authors added water molecules to the structure and performed their calculations. However, Figure 4 does not include such water molecules, and no information was given in the manuscript. In addition, no information was given for the chloride binding site (contact residues) in Figure 4. More detailed information should be shown with additional figures in Figure SX.

      We thank the reviewer for making us realize that Fig. 4 was oversimplified.

      We have added the following text in the “Structural modelling and QM/MM calculations of the dark state of Antho2a” section:

      Lines 220 – 223

      “The chloride ion is also coordinated by two water molecules and the backbone of Cys187 which is part of a conserved disulfide bridge (Fig. S2). The retinylidene Schiff base region also includes polar (Ser186, Tyr91) and non-polar (Ala94, Leu113) residues (Fig. 4).”

      We have updated Fig. 4 and its legend to show a more detailed environment of the protonated Schiff base and the chloride ion, including water molecules and other nearby residues.

      (4) Figure 5 clearly shows much lower activity of E292A than that of WT, whose expression levels are unclear. How did the authors normalize (or not normalize) expression levels in this experiment?

      We thank the reviewer for this valuable comment. In the previous version of the manuscript, we did not normalize the activity based on expression levels. We have considered this in the amended version.

      First, we evaluated the expression levels of wild type and E292A Antho2a by comparing absorbances at λ<sub>max</sub> (± 5 nm) of these pigments, which were expressed and purified under the same conditions. Assuming that their molar absorption coefficients at the absorption maximum wavelengths are approximately the same, this allows us to roughly compare their expression levels. The relative expression of the E292A mutant compared to the wild type (set as 1) was 0.81 at pH 6.5 and 140 mM NaCl, conditions under which 94.0% (for E292A) and 99.8% (for wild type) of the Schiff base is protonated (Fig. 3A and B). As we conducted the live cell Ca<sup>2+</sup> assay in media at pH 7.0, we estimated the proportion of the protonated states of the wild type and the E292A mutant at the same pH. The relative amounts of the protonated states, compared to the wild type at pH 6.5 (set as 1), were estimated to be 0.99 for the wild type and 0.84 for E292A. Together, the protonated pigment of the E292A mutant was calculated to be about 73% of that of the wild type at pH 7.0. From Fig. 5, the amplitude of the Ca<sup>2+</sup> response of the E292A mutant was 12.1% of the wild type, showing that even after normalizing the expression levels, the Ca<sup>2+</sup> response amplitude was lower in the E292A mutant than in the wild type. This leads to our conclusion that the E292A mutation can also influence the G protein activation efficiency.
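
      The arithmetic behind this normalization can be sketched as follows. This is one reading of the calculation that reproduces the quoted figures (about 73% effective protonated pigment and a normalized response of roughly 17%); the variable names are illustrative, and the assumption that the absorbance ratio of 0.81 reflects protonated pigment at pH 6.5 is ours rather than something stated explicitly in the response.

```python
# Values quoted in the response above; variable names are illustrative.
ABS_RATIO_E292A_PH65 = 0.81   # E292A / wild type absorbance at lambda_max, pH 6.5
PROT_WT_PH65 = 1.00           # wild type protonated fraction at pH 6.5 (reference)
PROT_E292A_PH65 = 0.94        # E292A protonated fraction at pH 6.5
PROT_WT_PH70 = 0.99           # wild type protonated fraction at pH 7.0
PROT_E292A_PH70 = 0.84        # E292A protonated fraction at pH 7.0
RAW_RESPONSE_RATIO = 0.121    # E292A / wild type Ca2+ response amplitude (Fig. 5)


def effective_pigment_ratio() -> float:
    """Protonated E292A pigment relative to wild type at pH 7.0.

    Assumption: the absorbance ratio was measured on the protonated form at
    pH 6.5, so each pigment is rescaled by its change in protonated fraction
    between pH 6.5 and pH 7.0.
    """
    e292a = ABS_RATIO_E292A_PH65 * (PROT_E292A_PH70 / PROT_E292A_PH65)
    wild_type = 1.0 * (PROT_WT_PH70 / PROT_WT_PH65)
    return e292a / wild_type


pigment_ratio = effective_pigment_ratio()
normalized_response = RAW_RESPONSE_RATIO / pigment_ratio

print(f"effective protonated pigment (E292A / WT): {pigment_ratio:.2f}")
print(f"normalized Ca2+ response (E292A / WT): {normalized_response:.2f}")
```

      Dividing the raw response ratio by the effective pigment ratio gives the per-pigment response, which remains well below 1 for the mutant.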

      We have added Fig. S11 showing the comparison of expression levels between the wild type and E292A of Antho2a (Fig. S11A) and maximum Ca<sup>2+</sup> responses after normalizing the expression levels (Fig. S11B).

      We have also revised the discussion section as follows:

      Lines 324 – 335

      “The relative expression level of the E292A mutant of Antho2a was approximately 0.81 of the wild type (set as 1), as determined by comparing absorbances at λ<sub>max</sub> for both pigments expressed and purified under identical conditions (Fig. S11A). Additionally, the fraction of protonated pigment relative to the wild type (set as 1 at pH 6.5) was estimated to be 0.94 for the E292A mutant at pH 6.5, and 0.99 and 0.84 for the wild type and the E292A mutant at pH 7.0, respectively (Fig. 3A and B). Since pH 7.0 corresponds to the conditions used in the live cell Ca<sup>2+</sup> assays, the effective amount of protonated pigment for the E292A mutant was approximately 73% of the wild type. Nevertheless, even after normalization for these differences, the Ca<sup>2+</sup> response amplitude of the E292A mutant remained significantly lower (~ 17% of wild type, compared to the observed 12% prior to normalization; Fig. 5 and Fig. S11B). These observations suggest that Glu292 serves not only as a counterion in the photoproduct but also plays an allosteric role in influencing G protein activation.”

      (5) The authors propose the counterion switching from a chloride ion to E292 upon light activation. A schematic drawing on the chromophore, a chloride ion, and E292 (and possible surroundings) in Antho2a and the photoproduct will aid readers' understanding.

      We thank the reviewer for this excellent suggestion. We have prepared a new figure with a schematic drawing of the environment of the protonated Schiff base depicting the counterion switch in Fig. S10.

      Reviewer #2 (Public review):

      Summary:

      This work reports the discovery of a new rhodopsin from reef-building corals that is characterized experimentally, spectroscopically, and by simulation. This rhodopsin lacks a carboxylate-based counterion, which is typical for this family of proteins. Instead, the authors find that a chloride ion stabilizes the protonated Schiff base and thus serves as a counterion.

      Strengths:

      This work focuses on the rhodopsin Antho2a, which absorbs in the visible spectrum with a maximum at 503 nm. Spectroscopic studies under different pH conditions, including the mutant E292A and different chloride concentrations, indicate that chloride acts as a counterion in the dark. In the photoproduct, however, the counterion is identified as E292.

      These results lead to a computational model of Antho2a in which the chloride is modeled in addition to the Schiff base. This model is improved using the hybrid QM/MM simulations. As a validation, the absorption maximum is calculated using the QM/MM approach for the protonated and deprotonated E292 residue as well as the E292A mutant. The results are in good agreement with the experiment. However, there is a larger deviation for ADC(2) than for sTD-DFT. Nevertheless, the trend is robust since the wt and E292A mutant models have similar excitation energies. The calculations are performed at a high level of theory that includes a large QM region.

      Weaknesses:

      I have a couple of questions about this study:

      We thank the reviewer for providing critical comments, particularly on the QM/MM calculations. We have carefully considered all comments and have addressed them as detailed below. Corresponding revisions have been made to the manuscript.

      (1) I find it suspicious that the absorption maximum is so close to that of rhodopsin when the counterion is very different. Is it possible that the chloride creates an environment for the deprotonated E292, which is the actual counterion?

      We think it is unlikely that the chloride ion merely facilitates deprotonation of Glu292 in such a way that it acts as the counterion of the dark state Antho2a. This conclusion is based on two results from our study. (1) λ<sub>max</sub> of wild type Antho2a in the dark is positively correlated with the ionic radius of the halide in the solution; the λ<sub>max</sub> is red shifted in the order Cl- < Br- < I- (Fig. 2E and F in the revised manuscript). This tendency is observed when the halide anion acts as a counterion of the protonated Schiff base (Blatz et al. Biochemistry 11: 848–855, 1972). (2) The QM/MM models of the dark state of Antho2a show that the calculated λ<sub>max</sub> of Antho2a with a protonated (neutral) Glu292 is much closer to the experimentally observed λ<sub>max</sub> than with a deprotonated (negatively charged) Glu292 (Fig. 4), suggesting that the Glu292 is likely to be protonated even in the presence of chloride ion. Therefore, we conclude that a solute anion, and not Glu292, acts as the counterion of the protonated Schiff base in the dark state of Antho2a. We have discussed this in the revised manuscript as follows:

      Lines 274 – 291

      “We found that the type of halide anions in the solution has a small but noticeable effect on the λ<sub>max</sub> values of the dark state of Antho2a. This is consistent with the effect observed in a counterion-less mutant of bovine rhodopsin, in which halide ions serve as surrogate counterions (Nathans, 1990; Sakmar et al., 1991). Similarly, our results align with earlier observations that the λ<sub>max</sub> of a retinylidene Schiff base in solution increases with the ionic radius of halides acting as hydrogen bond acceptors (i.e., I− > Br− > Cl−) (Blatz et al., 1972). In contrast, the λ<sub>max</sub> of halorhodopsin from Natronobacterium pharaonis does not clearly correlate with halide ionic radius (Scharf and Engelhard, 1994), as the halide ion in this case is not a hydrogen-bonding acceptor of the protonated Schiff base (Kouyama et al., 2010; Mizuno et al., 2018). Altogether, these findings support our hypothesis that in Antho2a, a solute halide ion forms a hydrogen bond with the Schiff base, thereby serving as the counterion in the dark state. Moreover, QM/MM calculations for the dark state of Antho2a suggest that Glu292 is protonated and neutral, further supporting the hypothesis that Glu292 does not serve as the counterion in the dark state. However, unlike in the dark state, Cl− has little to no effect on the visible light absorption of the photoproduct (Fig. S5). Therefore, we conclude that Cl− and Glu292, respectively, act as counterions for the protonated Schiff base of the dark state and photoproduct of Antho2a. This represents a unique example of counterion switching from an exogenous anion to a specific amino acid residue upon light irradiation (Fig. S10).”

      (2) The computational protocol states that water molecules have been added to the predicted protein structure. Are there water molecules next to the Schiff base, E292, and Cl-? If so, where are they located in the QM region?

      We have updated Fig. 4 to show amino acids and water molecules near the Schiff base, E292, and the chloride ion. These include Ser186, Tyr91, Ala94, Leu113, Cys187, and two water molecules coordinating the chloride ion. We have added the following text in the “Structural modelling and QM/MM calculations of the dark state of Antho2a” section of the revised manuscript.

      Lines 220 – 223

      “The chloride ion is also coordinated by two water molecules and the backbone of Cys187 which is part of a conserved disulfide bridge (Fig. S2). The retinylidene Schiff base region also includes polar (Ser186, Tyr91) and non-polar (Ala94, Leu113) residues (Fig. 4).”

      Water molecules, which have been modelled by homology to other GPCR structures, were not included in the QM region. In the revised version of the manuscript, we clarify this point in the “Computational modelling and QM/MM calculations” section as follows.

      Lines 515 – 517

      “The retinal-binding pocket also contains predicted water molecules (modelled based on homologous GPCR structures) close to the Schiff base and the chloride ion which were not included in the QM region.”

      (3) If the E292 residue is the counterion in the photoproduct state, I would expect the retinal Schiff base to rotate toward this side chain upon isomerization. Can this be modeled based on the recent XFEL results on rhodopsin?

      The recent XFEL studies of rhodopsin reveal that at very early stages (1 ps after photoactivation), structural changes in retinal are limited primarily to the isomerization around the C11=C12 bond of the polyene chain, without significant rotation of the Schiff base.

      Although modelling of a later active state with planar retinal and a rotated Schiff base is feasible—e.g., guided by high-resolution structures of bovine rhodopsin’s Meta II state such as PDB ID: 3PQR, see Author response image 1 below—active states of GPCRs typically exhibit substantial conformational flexibility and heterogeneity, making the generation of precise structural models suitable for accurate QM/MM calculations challenging. Despite these uncertainties, this preliminary modelling does indicate that upon isomerization to the all-trans configuration, the retinal Schiff base would rotate towards E292, supporting our hypothesis that E292 serves as the counterion in the Antho2a photoproduct. This is now shown better in the revised Fig. S10.

      Author response image 1.

      Reviewer #3 (Public review):

      Summary:

      The paper by Sakai et al. studies the properties of anthozoan-specific opsins (ASO-II) from organisms found in reef-building coral. Their goal was to test if ASO-II opsins can absorb visible light, and if so, what the key factors involved are.

      The most exciting aspect of this work is their discovery that ASO-II opsins do not have a counterion residue (Asp or Glu) located at any of the previously known sites found in other animal opsins.

      This is very surprising. Opsins are only able to absorb visible (long-wavelength) light if the retinal Schiff base is protonated, and the latter requires (as the name implies) a "counterion". However, the authors clearly show that some ASO-II opsins do absorb visible light.

      To address this conundrum, they tested if the counterion could be provided by exogenous chloride ions (Cl-). Their results find compelling evidence supporting this idea, and their studies of ASO-II mutant E292A suggest E292 also plays a role in G protein activation and is a counterion for a protonated Schiff base in the light-activated form.

      Strengths:

      Overall, the methods are well-described and carefully executed, and the results are very compelling.

      Their analysis of seven ASO-II opsin sequences undoubtedly shows they all lack a Glu or Asp residue at "normal" (previously established) counter-ion sites in mammalian opsins (typically found at positions 94, 113, or 181). The experimental studies clearly demonstrate the necessity of Cl- for visible light absorbance, as do their studies of the effect of altering the pH.

      Importantly, the authors also carried out careful QM/MM computational analysis (and corresponding calculation of the expected absorbance effects), thus providing compelling support for the Cl- acting directly as a counterion to the protonated retinal Schiff base, and thus limiting the possibility that the Cl- is simply altering the absorbance of ASO-II opsins through some indirect effect on the protein.

      Altogether, the authors achieved their aims, and the results support their conclusions. The manuscript is carefully written, and refreshingly, the results and conclusions are not overstated.

      This study is impactful for several reasons. There is increasing interest in optogenetic tools, especially those that leverage G protein-coupled receptor systems. Thus, the authors' demonstration that ASO-II opsins could be useful for such studies is of interest.

      Moreover, the finding that visible light absorbance by an opsin does not absolutely require a negatively charged amino acid to be placed at one of the expected sites (94, 113, or 181) typically found in animal opsins is very intriguing and will help future protein engineering efforts. The argument that the Cl- counterion system they discover here might have been a preliminary step in the evolution of amino acid based counterions used in animal opsins is also interesting.

      Finally, given the ongoing degradation of coral reefs worldwide, the focus on these curious opsins is very timely, as is the authors' proposal that the lower Schiff base pKa they discovered here for ASO-II opsins may cause them to change their spectral sensitivity and G protein activation due to changes in their environmental pH.

      We thank the reviewer for the comprehensive summary of the manuscript and for finding it well-described and impactful.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the authors):

      (1) p. 5, l. 102: The authors obtained three absorption spectra out of seven. Did the authors examine the reasons for no absorption spectra for the remaining four proteins?

      We have not identified the reasons for the absence of detectable absorption spectra for the remaining four opsins. We speculate that this could result from poor retinal binding under detergent-solubilized conditions, but we have not directly tested this possibility.

      (2) p. 7, l. 141: The pH value is 7.5 in the text and 7.4 in Figure S4B.

      We thank the reviewer for finding this mistake. The correct value is 7.4 and we have revised the text accordingly.

      Reviewer #2 (Recommendations for the authors):

      The structures and the simulations should be made available to the reader by providing them in a repository.

      We have deposited the Antho2a models in Zenodo (https://zenodo.org/; an open-access repository for research data). We have added the following description in the “Data and materials availability” section of the revised manuscript.

      Lines 559 – 560

      “The structural models of wild type Antho2a with a neutral or charged Glu292 and the Antho2a E292A mutant are available in Zenodo (10.5281/zenodo.15064942).”

      Reviewer #3 (Recommendations for the authors):

      (1) In the homology models for the ASO-II opsins, are there any other possible residues that could act as counter-ion residues outside of the "normal" positions at 94, 113, or 181?

      We have updated Fig. 4 to show all residues near the retinylidene Schiff base region, which include Cl−, Glu292, Ser186, Tyr91, Ala94, Leu113, Cys187, and two water molecules.

      Apart from Cl− and Glu292, the homology models of the ASO-II opsins do not reveal any other candidate counterion for the Schiff base. This is also suggested by the sequence alignment between opsins of the ASO-II group and other animal opsins in Fig. S2, where we show amino acid residues near the Schiff base (in addition to key motifs important for G protein activation).

      (2) It is mentioned that the ASO-II opsins do not appear to be bistable opsins in detergents - do these opsins show any ability to photo-switch back and forth when in cellular membranes?

      We have not directly tested whether Antho2a exhibits photo-switching in cellular membranes due to technical limitations associated with high light scattering in spectroscopic measurements. Instead, we recorded absorption spectra from crude extracts of detergent-solubilized cell membranes expressing Antho2a wild type (without purification) in the dark and after sequential light irradiation (Fig. S3C). This approach, which retains cellular lipids, can better preserve the photochemical properties of opsins, such as thermal stability and photoreactivity of their photoproducts, similar to intact cellular membranes. The first irradiation with green light (500 nm) led to a decrease in absorbance around the 550 nm region and an increase around the 450 nm region, indicating the formation of a photoproduct, consistent with observations using purified Antho2a.

      However, subsequent irradiation with violet light (420 nm) did not reverse these spectral changes but resulted in only a slight decrease in absorbance around 400 nm. Re-exposure to green light produced no further spectral changes aside from baseline distortions. These findings suggest that the Antho2a photoproduct has limited ability to revert to its original dark state under these conditions. Nevertheless, because detergent solubilization may influence these observations, further studies in intact cellular membranes using live-cell assays will be required to conclusively assess bistability or photo-switching properties.

      (3) The idea that E292 acts as a counterion for the protonated active state is intriguing - do the authors think the retinal decay process after light activation occurs with hydrolysis of the non-protonated form with subsequent retinal release?

      We thank the reviewer for raising this important question. We first examined whether the increased UV absorbance observed after incubating the photoproduct for 20 hours in the dark (Fig. S3D, E, violet curves) originated from free retinal released from the opsin pigment. Acid denaturation (performed at pH 1.9) of this photoproduct resulted in a main product absorbing around 400 nm (Fig. S3G). Typically, when retinal binds opsin via the Schiff base (whether protonated or deprotonated), acid denaturation traps the retinal chromophore as a protonated Schiff base, yielding an absorption spectrum with a λ<sub>max</sub> at approximately 440 nm, as observed in the dark state of Antho2a (Fig. S3F). Our results thus indicate that the UV absorbance in the photoproduct did not result from a deprotonated Schiff base but rather from retinal released during incubation. We have not directly tested whether the protonated or deprotonated form is more prone to retinal release. However, the decay of visible absorbance (associated with the protonated photoproduct) occurred more rapidly under alkaline conditions (pH 8.0), which generally favors deprotonation of the Schiff base (Fig. S3H). Thus, it is possible that the deprotonated photoproduct releases retinal more rapidly than the protonated form, but further studies are necessary to confirm this hypothesis.

      To address comments (2) and (3), we have added new panels (C and F–H) to Fig. S3.

      We have revised the Results section as follows:

      Lines 136 – 141

      “The photoproduct remained stable for at least 5 minutes (Fig. S3A, curves 2 and 3) but did not revert to the original dark state upon subsequent irradiation (Fig. S3A and C). Instead, it underwent gradual decay accompanied by retinal release over time (Fig. S3D–G). These findings indicate that purified Antho2a is neither strictly bleach resistant nor bistable (see also Fig. S3 legend). We also observed that the protonated photoproduct decayed more rapidly at pH 8.0 (Fig. S3H) than at pH 6.5 (Fig. 3A, D, E).”

      Text:

      (4) Page 3, line 38. Consider defining eumetazoan (for lay readers).

      As suggested, we have defined eumetazoans and revised the sentence as follows:

      Lines 38 – 40

      “Opsins are present in the genomes of all eumetazoans (i.e., all animal lineages except sponges), and based on their phylogenetic relationships, they can be classified into eight groups…”

      (5) Page 3, line 42. "But, furthermore, ..." should be changed to either word alone.

      Revised as suggested.

      (6) Page 18, line 447. The HPLC method is well-described and helpful. If possible, please add a Reference, or indicate if this is a new variation of the method.

      This is a well-established method for analyzing the composition of retinal isomers bound to different states of rhodopsin pigments. We have now cited a reference describing the methodology (Terakita et al. Vision Res. 6: 639–652, 1989).

      (7) Page 11, line 267. "..type of halide anions in the solution affected the λ<sub>max</sub> values of the dark state of".

      Since the changes are not large (but clearly occur), consider changing this sentence to "..type of halide anions in the solution has a small but visible effect on the λ<sub>max</sub> values of the dark state ..."

      We have revised this sentence as suggested.

      Figures:

      (9) Consider combining Figure FS6 with Figure 2 (effect of anions on visible absorbance).

      As suggested, the previous Fig. S6 has been included in the main text as Fig. 2E and F in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The manuscript by Li et al. investigates the metabolism-independent role of nuclear IDH1 in chromatin state reprogramming during erythropoiesis. The authors describe accumulation and redistribution of histone H3K79me3, and downregulation of SIRT1, as a cause for dyserythropoiesis observed due to IDH1 deficiency. The authors studied the consequences of IDH1 knockdown, and targeted knockout of nuclear IDH1, in normal human erythroid cells derived from hematopoietic stem and progenitor cells and HUDEP2 cells respectively. They further correlate some of the observations such as nuclear localization of IDH1 and aberrant localization of histone modifications in MDS and AML patient samples harboring IDH1 mutations. These observations are intriguing from a mechanistic perspective and they hold therapeutic significance, however there are major concerns that make the inferences presented in the manuscript less convincing.

      (1) The authors show the presence of nuclear IDH1 both by cell fractionation and IF, and employ an efficient strategy to knock out nuclear IDH1 (knockout IDH1/ Sg-IDH1 and rescue with the NES tagged IDH1/ Sg-NES-IDH1 that does not enter the nucleus) in HUDEP2 cells. However, some important controls are missing.

      A) In Figure 3C, for IDH1 staining, Sg-IDH1 knockout control is missing.

      We thank the reviewer for the suggestion. We have added staining of Sg-IDH1 knockout cells and revised Figure 3C accordingly in the revised manuscript.

      B) Wild-type IDH1 rescue control (ie., IDH1 without NES tag) is missing to gauge the maximum rescue that is possible with this system.

      We thank the reviewer for the suggestion. We overexpressed wild-type IDH1 in the IDH1-deficient HUDEP2 cell line and examined the phenotype. The results are presented in Supplementary Figure 9 in the revised manuscript. As shown in Supplementary Figure 9A, IDH1 deficiency resulted in reduced cell number in HUDEP2 cells, a phenotype that was rescued by overexpression of wild-type IDH1 but not by NES-IDH1. Given IDH1's well-established role in redox homeostasis through catalyzing the conversion of isocitrate to α-KG, we hypothesized that both wild-type IDH1 and NES-IDH1 overexpression would significantly restore α-KG levels compared to the IDH1-deficient group. Supplementary Figure 9B demonstrates that IDH1 depletion resulted in a dramatic decrease in α-KG levels, whereas overexpression of either wild-type IDH1 or NES-IDH1 almost completely restored α-KG levels, as anticipated. These results suggest that wild-type IDH1 overexpression can restore metabolic regulatory functions as effectively as NES-IDH1 overexpression. To investigate whether apoptosis contributes to the impaired cell expansion caused by IDH1 deficiency, we performed Annexin V/PI staining to quantify apoptotic cells. As shown in Supplementary Figure 9C and D, flow cytometry analysis revealed no significant changes in apoptosis rates following either IDH1 depletion or ectopic expression of wild-type IDH1 or NES-IDH1 in IDH1-deficient HUDEP2 cells.

      Flow cytometric analysis demonstrated that IDH1 deficiency triggered S-phase accumulation at day 8, indicative of cell cycle arrest. Whereas ectopic expression of wild-type IDH1 significantly rescued this cell cycle defect, overexpression of NES-IDH1 failed to ameliorate the S-phase accumulation induced by IDH1 depletion (Supplementary Figure 9E and F). Thus, although NES-IDH1 overexpression rescued the defect in metabolic regulatory function, it failed to alleviate the cell cycle arrest induced by IDH1 deficiency, whereas wild-type IDH1 overexpression fully restored normal cell cycle progression. This functional dichotomy demonstrates that nuclear-localized IDH1 executes critical roles distinct from its cytoplasmic counterpart, and that overexpression of wild-type IDH1 can efficiently restore the functions impaired by depletion of nuclear-localized IDH1.

      (2) Considering the nuclear knockout of IDH1 (Sg-NES-IDH1 referenced in the previous point) is a key experimental system that the authors have employed to delineate non-metabolic functions of IDH1 in human erythropoiesis, some critical experiments are lacking to make convincing inferences.

      A) The authors rely on IF to show the nuclear deletion of Sg-NES-IDH1 HUDEP2 cells. As mentioned earlier since a knockout control is missing in IF experiments, a cellular fractionation experiment (similar to what is shown in Figure 2F) is required to convincingly show the nuclear deletion in these cells.

      We sincerely thank the reviewer for raising this critical point. As suggested, we have performed additional IF experiments and cellular fractionation experiments to comprehensively address the subcellular localization of IDH1.

      The results of IF staining are shown in Figure 3C of the revised manuscript. In Control HUDEP2 cells, endogenous IDH1 was detected in both the cytoplasm and nucleus. This dual localization may reflect its dynamic roles in cytoplasmic metabolic processes and potential nuclear functions under specific conditions. In Sg-IDH1 cells (IDH1 knockout), IDH1 signal was undetectable, confirming efficient depletion of the protein. In Sg-NES-IDH1 cells (overexpressing NES-IDH1 in IDH1-deficient cells), IDH1 predominantly accumulated in the cytoplasm, consistent with the nuclear export signal (NES) tag on the rescue construct. The localization of IDH1 determined by IF staining was further confirmed by cellular fractionation assays, as shown in Figure 3D.

      B) Since the authors attribute nuclear localization to a lack of metabolic/enzymatic functions, it is important to show the status of ROS and alpha-KG in the Sg-NES-IDH1 in comparison to control, wild type rescue, and knockout HUDEP2 cells. The authors observe an increase of ROS and a decrease of alpha-KG upon IDH1 knockdown. If nuclear IDH1 is not involved in metabolic functions, is there only a minimal or no impact of the nuclear knockout of IDH1 on ROS and alpha-KG, in comparison to complete knockout? These studies are lacking.

      We appreciate the reviewer's insightful suggestions and agree that measuring ROS and alpha-KG is useful for demonstrating the non-canonical function of IDH1. We examined alpha-KG concentrations in control, IDH1 knockout and nuclear IDH1 knockout HUDEP2 cell lines. The results showed a significant decrease in alpha-KG content after complete knockout of IDH1, whereas there was no significant change in the nuclear IDH1 knockout cells (Supplementary Figure 9B). As for ROS levels, the commercial ROS assay kits available to us are read in the PE (Excitation: 565nm; Emission: 575nm) and FITC (Excitation: 488nm; Emission: 518nm) channels of the flow cytometer. The Sg-IDH1 and Sg-NES-IDH1 HUDEP2 cell lines that we constructed themselves express green fluorescent protein (Excitation: 488nm; Emission: 507nm) and Kusabira Orange fluorescent protein (Excitation: 500nm; Emission: 561nm). Unfortunately, due to the spectral overlap of these fluorescence channels, we were unable to detect changes in ROS levels in these HUDEP2 cell lines using the available commercial kits.
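
      The channel conflict described above can be illustrated with a minimal Python sketch. It uses only the emission peaks quoted in the response. The 20 nm separation threshold and all names in the code are assumptions made for illustration; real spectral overlap depends on the full emission spectra and the cytometer's filter sets, so this is a rough heuristic rather than the authors' procedure.

```python
from itertools import product

# Emission peaks (nm) as quoted in the response above.
ASSAY_CHANNELS = {"PE": 575, "FITC": 518}
REPORTER_PROTEINS = {"GFP": 507, "Kusabira Orange": 561}

# Hypothetical tolerance (an assumption, not from the source):
# peaks closer than this are treated as overlapping.
MIN_PEAK_SEPARATION_NM = 20


def find_conflicts(channels, reporters, min_separation=MIN_PEAK_SEPARATION_NM):
    """Return (channel, reporter) pairs whose emission peaks are too close."""
    conflicts = []
    for (channel, ch_peak), (reporter, rep_peak) in product(
        channels.items(), reporters.items()
    ):
        if abs(ch_peak - rep_peak) < min_separation:
            conflicts.append((channel, reporter))
    return conflicts


if __name__ == "__main__":
    for channel, reporter in find_conflicts(ASSAY_CHANNELS, REPORTER_PROTEINS):
        print(f"{channel} channel conflicts with {reporter}")
```

      Under this assumed threshold, both assay channels are flagged: PE against Kusabira Orange and FITC against GFP.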

      (3) The authors report abnormal nuclear phenotype in IDH1 deficient erythroid cells. It is not clear what parameters are used here to define and quantify abnormal nuclei. Based on the cytospins (eg., Figure 1A, 3D) many multinucleated cells are seen in both shIDH1 and Sg-NES-IDH1 erythroid cells, compared to control cells. Importantly, this phenotype and enucleation defects are not rescued by the administration of alpha-KG (Figures 1E, F). The authors study these nuclei with electron microscopy and report increased euchromatin in Figure 4B. However, there is no discussion or quantification of polyploidy/multinucleation in the IDH1 deficient cells, despite their increased presence in the cytospins.

      A) PI staining followed by cell cycle FACS will be helpful in gauging the extent of polyploidy in IDH1 deficient cells and could add to the discussions of the defects related to abnormal nuclei.

      We appreciate the reviewer’s insightful suggestion. Because PI is detected in the PE channel (Excitation: 565nm; Emission: 575nm) of the flow cytometer and the HUDEP2 cell line expresses Kusabira Orange fluorescent protein (Excitation: 500nm; Emission: 561nm), we were unable to use PI staining to analyze the cell cycle. EdU staining is another commonly used method to determine cell cycle progression, so we performed EdU staining followed by flow cytometry analysis on Control, Sg-IDH1 and Sg-NES-IDH1 HUDEP2 cells. The results showed that complete knockout of IDH1 resulted in S-phase block and increased polyploidy in HUDEP2 cells on day 8 of erythroid differentiation, and overexpression of NES-IDH1 did not reverse this phenotype (Supplemental Figure 9E-F). Moreover, we have added a discussion of the association between abnormal nuclei and impaired erythropoiesis.

      B) For electron microscopy quantification in Figures 4B and C, how the quantification was done and the labelling of the y-axis (% of euchromatin and heterochromatin) in Figure 4 C is not clear and is confusingly presented. The details on how the quantification was done and a clear label (y-axis in Figure 4C) for the quantification are needed.

      We thank the reviewer for the suggestion. In this study, we measured the areas of the nucleus, heterochromatin and euchromatin using ImageJ software. We describe the quantification strategy in the Supplementary Methods section of the revised Supplementary file. In addition, the y-axis label in Figure 4C was changed to “the area percentage of euchromatin and heterochromatin”.
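
      As a hedged illustration of what an "area percentage" on the y-axis could mean, the sketch below expresses each chromatin area as a share of the nuclear area. This definition, the function name, and the example areas are assumptions made for illustration. The authoritative procedure is the one in the Supplementary Methods.

```python
def chromatin_area_percentages(nuclear_area, euchromatin_area, heterochromatin_area):
    """Return (euchromatin %, heterochromatin %) relative to the nuclear area.

    Assumption: each percentage is the measured area divided by the total
    nuclear area, with all areas in the same units (e.g. pixels or um^2).
    """
    if nuclear_area <= 0:
        raise ValueError("nuclear_area must be positive")
    if euchromatin_area < 0 or heterochromatin_area < 0:
        raise ValueError("areas must be non-negative")
    euchromatin_pct = 100.0 * euchromatin_area / nuclear_area
    heterochromatin_pct = 100.0 * heterochromatin_area / nuclear_area
    return euchromatin_pct, heterochromatin_pct


if __name__ == "__main__":
    # Hypothetical areas for a single nucleus, not data from the study.
    eu_pct, hetero_pct = chromatin_area_percentages(200.0, 150.0, 50.0)
    print(f"euchromatin: {eu_pct:.1f}%, heterochromatin: {hetero_pct:.1f}%")
```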

      C) As mentioned earlier, what parameters were used to define and quantify abnormal nuclei (e.g. Figure 1A) needs to be discussed clearly. The red arrows in Figure 1A all point to bi/multinucleated cells. If this is the case, this needs to be made clear.

      We thank the reviewer for their suggestion. In our present study, nuclear malformations were defined as cells exhibiting binucleation or multinucleation based on cytospin analysis. A minimum of 300 cells per group were evaluated, and the proportion of aberrant nuclei was calculated as (number of abnormal cells / total counted cells) × 100%.
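
      The calculation stated above can be written as a short Python sketch. The function name and the example counts are hypothetical. The 300-cell minimum and the formula are taken from the response.

```python
MIN_CELLS_PER_GROUP = 300  # minimum number of cells evaluated per group, as stated above


def abnormal_nuclei_percentage(abnormal_cells, total_cells, min_cells=MIN_CELLS_PER_GROUP):
    """Percentage of bi- or multinucleated cells among all counted cells."""
    if total_cells < min_cells:
        raise ValueError(f"at least {min_cells} cells must be counted per group")
    if not 0 <= abnormal_cells <= total_cells:
        raise ValueError("abnormal_cells must be between 0 and total_cells")
    # (number of abnormal cells / total counted cells) x 100%
    return 100.0 * abnormal_cells / total_cells


if __name__ == "__main__":
    # Hypothetical counts for illustration only, not data from the study.
    print(abnormal_nuclei_percentage(abnormal_cells=45, total_cells=300))
```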

      (4) The authors mention that their previous study (reference #22) showed that ROS scavengers did not rescue dyserythropoiesis in shIDH1 cells. However, in this referenced study they did report that vitamin C, a ROS scavenger, partially rescued enucleation in IDH1 deficient cells and completely suppressed abnormal nuclei in both control and IDH1 deficient cells, in addition to restoring redox homeostasis by scavenging reactive oxygen species in shIDH1 erythroid cells. In the current study, the authors used ROS scavengers GSH and NAC in shIDH1 erythroid cells and showed that they do not rescue abnormal nuclei phenotype and enucleation defects. The differences between the results in their previous study with vitamin C vs GSH and NAC in the context of IDH1 deficiency need to be discussed.

      We appreciate the reviewer’s insightful observation. The apparent discrepancy between the effects of vitamin C (VC) in our previous study and glutathione (GSH)/N-acetylcysteine (NAC) in the current work can be attributed to divergent molecular mechanisms beyond ROS scavenging. A growing body of evidence has identified vitamin C as a multifunctional regulator. In addition to acting as an antioxidant that maintains redox homeostasis, VC also acts as a critical epigenetic modulator. VC has been identified as a cofactor for α-ketoglutarate (α-KG)-dependent dioxygenases, including TET2, which catalyzes 5-methylcytosine (5mC) oxidation to 5-hydroxymethylcytosine (5hmC) [1,2]. Structural studies confirm its direct interaction with TET2’s catalytic domain to enhance enzymatic activity in vitro [3]. The biological significance of the epigenetic modulation induced by vitamin C is illustrated by its ability to improve the generation of induced pluripotent stem cells and to induce a blastocyst-like state in mouse embryonic stem cells by promoting demethylation of H3K9 and 5mC, respectively [4,5]. In contrast, GSH and NAC are canonical ROS scavengers lacking intrinsic epigenetic-modifying activity. While they effectively neutralize oxidative stress (as validated by reduced ROS levels in our current data, Supplemental Figure 7), their inability to rescue nuclear abnormalities or enucleation defects in IDH1-deficient cells suggests that IDH1 deficiency-driven dyserythropoiesis is not solely ROS-dependent.

      References:

      (1) Blaschke K, Ebata KT, Karimi MM, Zepeda-Martínez JA, Goyal P, et al. Vitamin C induces Tet-dependent DNA demethylation and a blastocyst-like state in ES cells. Nature. 2013;500(7461):222-226.

      (2) Minor EA, Court BL, Young JI, Wang G. Ascorbate induces ten-eleven translocation (Tet) methylcytosine dioxygenase-mediated generation of 5-hydroxymethylcytosine. J Biol Chem. 2013;288(19): 13669-13674.

      (3) Yin R, Mao S, Zhao B, Chong Z, Yang Y, et al. Ascorbic acid enhances Tet-mediated 5-methylcytosine oxidation and promotes DNA demethylation in mammals. J Am Chem Soc. 2013;135(28):10396-10403.

      (4) Esteban MA, Wang T, Qin B, Yang J, Qin D, et al. Vitamin C enhances the generation of mouse and human induced pluripotent stem cells. Cell Stem Cell. 2010;6(1):71-79.

      (5) Chung T, Brena RM, Kolle G, Grimmond SM, Berman BP, et al. Vitamin C promotes widespread yet specific DNA demethylation of the epigenome in human embryonic stem cells. Stem Cells. 2010;28(10):1848-1855.

      (5) The authors describe an increase in euchromatin as the consequential abnormal nuclei phenotype in shIDH1 erythroid cells. However, in their RNA-seq, they observe an almost equal number of genes that are up and down-regulated in shIDH1 cells compared to control cells. If possible, an RNA-Seq in nuclear knockout Sg-NES-IDH1 erythroid cells in comparison with knockout and wild-type cells will be helpful to tease out whether a specific absence of IDH1 in the nucleus (ie., lack of metabolic functions of IDH) impacts gene expression differently.

      We thank the reviewer for the suggestion. ATAC-seq showed an increase in chromatin accessibility after IDH1 deletion, but the number of up-regulated genes was slightly larger than that of down-regulated genes, which may reflect the metabolic changes caused by IDH1 deletion. To explore how changes in chromatin accessibility affect gene expression after IDH1 deletion, we analyzed differential gene expression at the differential ATAC peak regions (Author response image 1). Gene expression at ATAC peak regions with increased chromatin accessibility was significantly up-regulated, which may explain how chromatin accessibility regulates gene expression.

      Author response image 1.

      Box plots of gene expression differences for genes with differential ATAC peaks located in promoters, shown for the groups with increasing and decreasing signal.

      (6) In Figure 8, the authors show data related to SIRT1's role in mediating non-metabolic, chromatin-associated functions of IDH1.

      A) The authors show that SIRT1 inhibition leads to a rescue of enucleation and abnormal nuclei. However, whether this rescues the progression through the late stages of terminal differentiation and the euchromatin/heterochromatin ratio is not clear.

      We thank the reviewer for the suggestion. As shown in Supplementary Figures 14 and 15 in the revised Supplementary Data, neither treatment of normal erythroid cells with SRT1720 nor treatment of IDH1-deficient erythroid cells with a SIRT1 inhibitor affected terminal differentiation.

      (7) In Figure 4 and Supplemental Figure 8, the authors show the accumulation and altered cellular localization of H3K79me3, H3K9me3, and H3K27me2, and the lack of accumulation of other three histone modifications they tested (H3K4me3, H3K35me4, and H3K36me2) in shIDH1 cells. They also show the accumulation and altered localization of the specific histone marks in Sg-NES-IDH1 HUDEP2 cells.

      A) To aid better comparison of these histone modifications, it will be helpful to show the cell fractionation data of the three histone modifications that did not accumulate (H3K4me3, H3K35me4, and H3K36me2), similar to what was shown in Figure 4E for H3K79me3, H3K9me3, and H3K27me2.

      We appreciate the reviewer’s insightful suggestion. We collected erythroblasts on day 15 of erythroid differentiation from cord blood-derived CD34<sup>+</sup> hematopoietic stem cells and performed ChIP assays. As shown in Author response image 2, the concentration of DNA bound to H3K9me3, H3K27me2 and H3K79me3 was too low to meet the sequencing quality requirement, which was consistent with the Western blot results. In addition, we attempted ChIP-seq for H3K79me3 and found no marked peak signal.

      Author response image 2.

      ChIP-seq analysis shows that there was no marked peak signal of H3K79me3 on D15. (A) Quality control of ChIP assay for H3K9me3, H3K27me2, and H3K79me3. (B) Representative peak chart showing normalized ChIP signal of H3K79me3 at gene body regions. (C) Heatmaps displaying normalized ChIP signal of H3K79me3 at gene body regions. The window represents ±1.5 kb regions from the gene body. TES, transcriptional end site; TSS, transcriptional start site.

      C) Among the three histone marks that are dysregulated in IDH1 deficient cells (H3K79me3, H3K9me3, and H3K27me2), the authors show via ChIP-seq (Fig5) that H3K79me3 is the critical factor. However, the ChIP-seq data shown here lacks many details and this makes it hard to interpret the data. For example, in Figure 5A, they do not mention which samples the data shown correspond to (are these differential peaks in shIDH1 compared to shLuc cells?). There is also no mention of how many replicates were used for the ChIP seq studies.

      We thank the reviewer for pointing this out. We apologize for not clearly describing the ChIP-seq data for H3K9me3, H3K27me2 and H3K79me3, and we have revised the corresponding paragraphs. Since H3 proteins gradually translocate from the nucleus to the cytoplasm starting at day 11 (late Baso-E/Poly-E) of erythroid lineage differentiation, we performed ChIP-seq for H3K9me3, H3K27me2 and H3K79me3 only for the shIDH1 group, with three independent biological replicates for each.

      Reviewer #2 (Public Review):

      Li and colleagues investigate the enzymatic activity-independent function of IDH1 in regulating erythropoiesis. This manuscript reveals that IDH1 deficiency in the nucleus leads to the redistribution of histone marks (especially H3K79me3) and chromatin state reprogramming. Their findings suggest a non-typical localization and function of the metabolic enzyme, providing new insights for further studies into the non-metabolic roles of metabolic enzymes. However, there are still some issues that need addressing:

      (1) Could the authors show the RNA and protein expression levels (without fractionation) of IDH1 on different days throughout the human CD34+ erythroid differentiation?

      We sincerely appreciate the reviewer’s constructive feedback. To address this point, we have now systematically quantified IDH1 expression dynamics across erythropoiesis stages using qRT-PCR and Western blot analyses. As quantified in Supplementary Figure 1, IDH1 expression exhibited a progressive upregulation during early erythropoiesis and subsequently stabilized throughout terminal differentiation.

      (2) Even though the human CD34+ erythroid differentiation protocol was published and cited in the manuscript, it would be helpful to specify which erythroid stages correspond to cells on days 7, 9, 11, 13, and 15.

      We sincerely thank the reviewer for raising this important methodological consideration. Our research group has previously established a robust platform for staged human erythropoiesis characterization using cord blood-derived CD34<sup>+</sup> hematopoietic stem cells (HSCs) [6-9]. This standardized protocol enables high-purity isolation and functional analysis of erythroblasts at defined differentiation stages.

      Our previous work (Hu et al., Blood 2013; Han et al., Blood 2017; Wang et al., Blood 2021) described the isolation and functional characterization of human erythroblasts at distinct stages using cord blood. These studies showed that differentiating cord blood-derived hematopoietic stem cells and purifying human erythroid cells at each stage enables comprehensive cellular and molecular characterization.

      Following isolation from cord blood, CD34<sup>+</sup> cells were cultured in a serum-free medium and induced to undergo erythroid differentiation using our standardized protocol. The process of erythropoiesis comprised two phases. During the early phase (day 0 to day 6), hematopoietic stem and progenitor cells expanded and differentiated into erythroid progenitors, including BFU-E and CFU-E cells.

      During terminal erythroid maturation (day 7 to day 15), CFU-E cells progressively transitioned through defined erythroblast stages, as validated by daily cytospin morphology and expression of band 3/α4 integrin. The stage-specific composition was quantitatively characterized as follows:

      Author response table 1.

      The composition of erythroblast during terminal stage erythropoiesis.

      This differentiation progression from proerythroblasts (Pro-E) through basophilic (Baso-E), polychromatic (Poly-E), to orthochromatic erythroblasts (Ortho-E) recapitulates physiological human erythropoiesis, confirming the validity of our differentiation system for mechanistic studies.

      Reference:

      (6) Li J, Hale J, Bhagia P, Xue F, Chen L, et al. Isolation and transcriptome analyses of human erythroid progenitors: BFU-E and CFU-E. Blood. 2014;124(24):3636-3645.

      (7) Hu J, Liu J, Xue F, Halverson G, Reid M, et al. Isolation and functional characterization of human erythroblasts at distinct stages: implications for understanding of normal and disordered erythropoiesis in vivo. Blood. 2013;121(16):3246-3253.

      (8) Wang Y, Li W, Schulz VP, Zhao H, Qu X, et al. Impairment of human terminal erythroid differentiation by histone deacetylase 5 deficiency. Blood. 2021;138(17):1615-1627.

      (9) Li M, Liu D, Xue F, Zhang H, Yang Q, et al. Stage-specific dual function: EZH2 regulates human erythropoiesis by eliciting histone and non-histone methylation. Haematologica. 2023;108(9):2487-2502.

      (3) It is important to mention on which day the lentiviral knockdown of IDH1 was performed. Will the phenotype differ if the knockdown is performed in early vs. late erythropoiesis? In Figures 1C and 1D, on which day do the authors begin the knockdown of IDH1 and administer NAC and GSH treatments?

      We sincerely appreciate the reviewer’s inquiry regarding the experimental timeline. The day on which CD34<sup>+</sup> cells were obtained was recorded as day 0. Lentiviruses carrying IDH1-shRNA or luciferase-shRNA were transduced into human CD34<sup>+</sup> cells on day 2. Puromycin selection was initiated 24 h post-transduction to eliminate non-transduced cells, and IDH1-KD cells were then selected for 3 days. The knockdown efficiency of IDH1 was determined on day 7. NAC or GSH was added to the culture medium and replenished every 2 days.
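      The timeline described above can be laid out as a short sketch. This is only an illustration of the day arithmetic, not part of the authors' protocol: the function name and dictionary keys are hypothetical, and it assumes the 3-day selection period is counted from the start of puromycin treatment. The start day of NAC/GSH treatment is not stated in the response and is therefore left out.

```python
# Day numbering follows the response: day 0 is the day CD34+ cells were obtained.
TRANSDUCTION_DAY = 2             # lentiviral shRNA transduction
SELECTION_START_DELAY_DAYS = 1   # puromycin started 24 h post-transduction
SELECTION_DURATION_DAYS = 3      # assumption: counted from the start of puromycin
KNOCKDOWN_CHECK_DAY = 7          # knockdown efficiency determined


def knockdown_timeline() -> dict:
    """Return the culture day of each step (hypothetical helper)."""
    selection_start = TRANSDUCTION_DAY + SELECTION_START_DELAY_DAYS
    selection_end = selection_start + SELECTION_DURATION_DAYS
    return {
        "cell_isolation": 0,
        "lentiviral_transduction": TRANSDUCTION_DAY,
        "puromycin_selection_start": selection_start,
        "puromycin_selection_end": selection_end,
        "knockdown_efficiency_check": KNOCKDOWN_CHECK_DAY,
    }


if __name__ == "__main__":
    for step, day in knockdown_timeline().items():
        print(f"day {day}: {step}")
```

      Under these assumptions, selection ends before the day-7 efficiency check, which coincides with the start of terminal erythroid maturation in the culture system described earlier.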

      (4) While the cell phenotype of IDH1 deficiency is quite dramatic, yielding cells with larger nuclei and multi-nuclei, the authors only attribute this phenotype to defects in chromatin condensation. Is it possible that IDH1-knockdown cells also exhibit primary defects in mitosis/cytokinesis (not just secondary to the nuclear condensation defect), given the function of H3K79Me in cell cycle regulation?

      We appreciate the reviewer’s insightful suggestion. We performed EdU-based cell cycle analysis on Control, Sg-IDH1 and Sg-NES-IDH1 HUDEP2 cells. The results showed that IDH1 deficiency resulted in S-phase block and increased polyploidy in HUDEP2 cells on day 8 of erythroid differentiation. NES-IDH1 overexpression failed to rescue these defects, indicating nuclear IDH1 depletion as the primary driving factor (Figure 3E,F). Recent studies have established a clear link between cell cycle arrest and nuclear malformation. Chromosome mis-segregation, nuclear lamina disruption, mechanical stress on the nuclear envelope, and nucleolar dysfunction all contribute to nuclear abnormalities that trigger cell cycle checkpoints [10,11]. Based on these findings, it is reasonable to speculate that the cell cycle defect in IDH1-deficient cells might be caused by the nuclear malfunction.

      Reference:

      (10) Hong T, Hogger AC, Wang D, Pan Q, Gansel J, et al. CDK4/6 inhibition initiates cell cycle arrest by nuclear translocation of RB and induces a multistep molecular response. Cell Death Discov. 2024;10(1):453.

      (11) Hervé S, Scelfo A, Marchisio GB, Grison M, Vaidžiulytė K, et al. Chromosome mis-segregation triggers cell cycle arrest through a mechanosensitive nuclear envelope checkpoint. Nat Cell Biol. 2025;27(1):73-86. 

      (5) Why are there two bands of Histone H3 in Figure 4A?

      We sincerely appreciate the reviewer's insightful observation regarding the dual bands of Histone H3 in our original Figure 4A. Upon rigorous investigation, we identified that the observed doublet pattern likely originated from the inter-batch variability of the original commercial antibody. To resolve this technical artifact, we procured a new lot of Histone H3 antibody and repeated the Western blot experiment under optimized conditions, and the results demonstrate a single band of H3.

      (6) Displaying a heatmap and profile plots in Figure 5A between control and IDH1-deficient cells will help illustrate changes in H3K79me3 density in the nucleus after IDH1 knockdown.

      Thank you for your suggestion. As presented in Author response image 2, we performed ChIP assays on erythroblasts collected at day 15. However, the concentration of H3K79me3-bound DNA was insufficient to meet the quality threshold required for reliable sequencing. Consequently, we are unable to generate the requested heatmap and profile plots due to limitations in data integrity from this experimental condition.

      Reviewer #3 (Public Review):

      Li, Zhang, Wu, and colleagues describe a new role for nuclear IDH1 in erythroid differentiation independent from its enzymatic function. IDH1 depletion results in a terminal erythroid differentiation defect with polychromatic and orthochromatic erythroblasts showing abnormal nuclei, nuclear condensation defects, and an increased proportion of euchromatin, as well as enucleation defects. Using ChIP-seq for the histone modifications H3K79me3, H3K27me2, and H3K9me3, as well as ATAC-seq and RNA-seq in primary CD34-derived erythroblasts, the authors elucidate SIRT1 as a key dysregulated gene that is upregulated upon IDH1 knockdown. They furthermore show that chemical inhibition of SIRT1 partially rescues the abnormal nuclear morphology and enucleation defect during IDH1-deficient erythroid differentiation. The phenotype of delayed erythroid maturation and enucleation upon IDH1 shRNA-mediated knockdown was described in the group's previous co-authored study (PMID: 33535038). The authors' new hypothesis of an enzyme- and metabolism-independent role of IDH1 in this process is currently not supported by conclusive experimental evidence as discussed in more detail further below. On the other hand, while the dependency of IDH1 mutant cells on NAD+, as well as cell survival benefit upon SIRT1 inhibition, has already been shown (see, e.g, PMID: 26678339, PMID: 32710757), previous studies focused on cancer cell lines and did not look at a developmental differentiation process, which makes this study interesting.

      (1) The central hypothesis that IDH1 has a role independent of its enzymatic function is interesting but not supported by the experiments. One of the author's supporting arguments for their claim is that alpha-ketoglutarate (aKG) does not rescue the IDH1 phenotype of reduced enucleation. However, in the group's previous co-authored study (PMID: 33535038), they show that when IDH1 is knocked down, the addition of aKG even exacerbates the reduced enucleation phenotype, which could indicate that aKG catalysis by cytoplasmic IDH1 enzyme is important during terminal erythroid differentiation. A definitive experiment to test the requirement of IDH1's enzymatic function in erythropoiesis would be to knock down/out IDH1 and re-express an IDH1 catalytic site mutant. The authors perform an interesting genetic manipulation in HUDEP-2 cells to address a nucleus-specific role of IDH1 through CRISPR/Cas-mediated IDH1 knockout followed by overexpression of an IDH1 construct containing a nuclear export signal. However, this system is only used to show nuclear abnormalities and (not quantified) accumulation of H3K79me3 upon nuclear exclusion of IDH1. Otherwise, a global IDH1 shRNA knockdown approach is employed, which will affect both forms of IDH1, cytoplasmic and nuclear. In this system and even the NES-IDH1 system, an enzymatic role of IDH1 cannot be excluded because (1) shRNA selection takes several days, prohibiting the assessment of direct effects of IDH1 loss of function (only a degron approach could address this if IDH1's half-life is short), and (2) metabolic activity of this part of the TCA cycle in the nucleus has recently been demonstrated (PMID: 36044572), and thus even a nuclear role of IDH1 could be linked to its enzymatic function, which makes it a challenging task to separate two functions if they exist.

      We appreciate the reviewer’s emphasis on rigorously distinguishing between enzymatic and enzyme-independent roles of IDH1. In our revised manuscript, we have removed all assertions of a "metabolism-independent" mechanism. Instead, we focus on demonstrating that nuclear-localized IDH1 contributes to chromatin state regulation during terminal erythropoiesis (e.g., H3K79me3 accumulation). While we acknowledge that nuclear IDH1’s enzymatic activity may still play a role [12], our data emphasize its spatial association with chromatin remodeling. We now explicitly state that nuclear IDH1’s function may involve both enzymatic and structural roles, and further studies are required to dissect these mechanisms.

      Reference:

      (12) Kafkia E, Andres-Pons A, Ganter K, Seiler M, Smith TS, et al. Operation of a TCA cycle subnetwork in the mammalian nucleus. Sci Adv. 2022;8(35):eabq5206.

      (2) It is not clear how the enrichment of H3K9me3, a prominent marker of heterochromatin, upon IDH1 knockdown in the primary erythroid culture (Figure 4), goes along with a 2-3-fold increase in euchromatin. Furthermore, in the immunofluorescence (IF) experiments presented in Figure 4Db, it seems that H3K9me3 levels decrease in intensity (the signal seems more diffuse), which seems to contrast the ChIP-seq data. It would be interesting to test for localization of other heterochromatin marks such as HP1gamma. As a related point, it is not clear at what stage of erythroid differentiation the ATAC-seq was performed upon luciferase- and IDH1-shRNA-mediated knockdown shown in Figure 6. If it was done at a similar stage (Day 15) as the electron microscopy in Figure 4B, then the authors should explain the discrepancy between the vast increase in euchromatin and the rather small increase in ATAC-seq signal upon IDH1 knockdown.

      Thank you for raising this important point. We agree that while H3K9me3 and H3K27me2 modifications are detectable in the nucleus, their functional association with chromatin in this context remains unclear. Our ChIP-seq data did not reveal distinct enrichment peaks for H3K9me3 or H3K27me2 (unlike the well-defined H3K79me3 peaks), suggesting that these marks may not be stably bound to specific chromatin regions under the experimental conditions tested. However, we acknowledge that the absence of clear peaks in our dataset does not definitively rule out chromatin interactions, as technical limitations or transient binding dynamics could influence these results. To avoid over-interpretation, we have removed speculative statements about the chromatin-unbound status of H3K9me3 and H3K27me2 from the revised manuscript. This revision aligns with our broader effort to present conclusions strictly supported by the current data while highlighting open questions for future investigation.

      (3) The subcellular localization of IDH1, in particular its presence on chromatin, is not convincing in light of histone H3 being enriched in the cytoplasm on the same Western blot. H3 would be expected to be mostly localized to the chromatin fraction (see, e.g., PMID: 31408165 that the authors cite). The same issue is seen in Figure 4A.

      We sincerely appreciate the reviewer's insightful comment regarding the subcellular distribution of histone H3 in our study. We agree that histone H3 is classically associated with chromatin-bound fractions, and its cytoplasmic enrichment in our Western blot analyses appears counterintuitive at first glance. However, this observation is fully consistent with the unique biology of terminal erythroid differentiation, which involves drastic nuclear remodeling and histone release, a hallmark of terminal stage erythropoiesis. Terminal erythroid differentiation is characterized by progressive nuclear condensation, chromatin compaction, and eventual enucleation. During this phase, global chromatin reorganization leads to the active eviction of histones from the condensed nucleus into the cytoplasm. This process has been extensively documented in erythroid cells, with studies demonstrating cytoplasmic accumulation of histones H3 and H4 as a direct consequence of nuclear envelope breakdown and chromatin decondensation preceding enucleation [13-16]. Our experiments specifically analyzed terminal-stage polychromatic and orthochromatic erythroblasts. At this stage, histone release into the cytoplasm is a dominant biological event, explaining the pronounced cytoplasmic H3 signal in our subcellular fractionation assays.

      In summary, the cytoplasmic enrichment of histone H3 in our data aligns with established principles of erythroid biology and reinforces the physiological relevance of our findings. We thank the reviewer for raising this critical point, which allowed us to better articulate the unique aspects of our experimental system.

      Reference:

      (13) Hattangadi SM, Martinez-Morilla S, Patterson HC, Shi J, Burke K, et al. Histones to the cytosol: exportin 7 is essential for normal terminal erythroid nuclear maturation. Blood. 2014;124(12):1931-1940.

      (14) Zhao B, Mei Y, Schipma MJ, Roth EW, Bleher R, et al. Nuclear Condensation during Mouse Erythropoiesis Requires Caspase-3-Mediated Nuclear Opening. Dev Cell. 2016;36(5):498-510.

      (15) Zhao B, Liu H, Mei Y, Liu Y, Han X, et al. Disruption of erythroid nuclear opening and histone release in myelodysplastic syndromes. Cancer Med. 2019;8(3):1169-1174.

      (16) Zhen R, Moo C, Zhao Z, Chen M, Feng H, et al. Wdr26 regulates nuclear condensation in developing erythroblasts. Blood. 2020;135(3):208-219.

      (4) This manuscript will highly benefit from more precise and complete explanations of the experiments performed, the material and methods used, and the results presented. At times, the wording is confusing. As an example, one of the "Key points" is described as "Dyserythropoiesis is caused by downregulation of SIRT1 induced by H3K79me3 accumulation." It should probably read "upregulation of SIRT1".

      We sincerely thank the reviewer for highlighting the need for improved clarity in our experimental descriptions and textual precision. We fully agree that rigorous wording is essential to accurately convey scientific findings. Specific modifications have been made and are highlighted in Track Changes mode in the resubmitted manuscript.

      The reviewer correctly identified an inconsistency in the original phrasing of one key finding. The sentence in question ("Dyserythropoiesis is caused by downregulation of SIRT1 induced by H3K79me3 accumulation") has been revised to: "Dyserythropoiesis is caused by the upregulation of SIRT1 mediated through H3K79me3 accumulation." This correction aligns with our experimental data showing that H3K79me3 elevation promotes SIRT1 transcriptional activation. We apologize for this oversight and have verified the consistency of all regulatory claims in the text.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) It will be helpful to mention/introduce the cells used for the study at the beginning of the results section. For example, for Figure 1A neither the figure legend nor the results text includes information on the cells used.

      Thanks for the reviewer’s suggestion. Detailed information on the cells used in our study has been provided in the revised manuscript.

      (2) Important details for many figures are lacking. For example, in Figure 5, there is no mention of the replicates for ChIP-Seq studies. Also, the criteria used for quantifications of abnormal nuclei, % euchromatin vs heterochromatin, the numbers of biological replicates, and how many fields/cells were used for these quantifications are missing.

      We thank the reviewer for emphasizing the importance of methodological transparency. The manuscript has been revised accordingly. The ChIP-Seq data in Figure 5 were generated from three independent biological replicates to ensure reproducibility. In this study, ImageJ software was used to measure the areas of the nucleus, heterochromatin, and euchromatin and to quantify the percentages of euchromatin and heterochromatin. A minimum of 300 cells per group were evaluated, and the proportion of aberrant nuclei was calculated as (number of abnormal cells / total counted cells) × 100%.
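      The aberrant-nuclei calculation stated above can be written as a minimal sketch. The function name, argument names, and the cell counts in the usage line are hypothetical; only the formula and the 300-cell minimum come from the response.

```python
def percent_abnormal_nuclei(abnormal_cells: int, total_cells: int,
                            min_cells: int = 300) -> float:
    """Proportion of aberrant nuclei: (abnormal / total counted) x 100.

    `min_cells` mirrors the stated minimum of 300 evaluated cells per group.
    """
    if total_cells < min_cells:
        raise ValueError(f"at least {min_cells} cells must be counted per group")
    if not 0 <= abnormal_cells <= total_cells:
        raise ValueError("abnormal_cells must lie between 0 and total_cells")
    return abnormal_cells / total_cells * 100.0


if __name__ == "__main__":
    # Hypothetical counts for illustration only; not data from the study.
    print(percent_abnormal_nuclei(abnormal_cells=75, total_cells=300))
```

      Enforcing the minimum count inside the function is a design choice made for this sketch, so that a percentage is never reported for an under-sampled group.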

      (3) It will be helpful if supplemental data are ordered according to how they are discussed in the text. Currently, the order of the supplemental data is hard to keep track of eg., the results section starts describing supplemental Figure 1, then the text jumps to supplemental Figure 5 followed by Supplemental Figure 3 (and so on).

      Thanks for the reviewer’s suggestion. It has been revised accordingly.

      (4) Overall, there are many incomplete sentences and typos throughout the manuscript including some of the figures e.g. on page 10 the sentence "Since the generation of erythroid with abnormal nucleus and reduction of mature red blood cells caused by IDH1 absence are notable characteristics of MDS and AML." is incomplete. On page 11, it reads "Histone post-modifications". This needs to be either histone modifications or histone post-translational modifications. In Figure 4C, the y-axis title is hard to understand "% of euchromatin and heterochromatin". Overall, the document needs to be proofread and revised carefully.

      Thanks for the reviewer’s suggestion. We have made the corresponding revisions in the revised manuscript. The sentence "Since the generation of erythroid with abnormal nucleus and reduction of mature red blood cells caused by IDH1 absence are notable characteristics of MDS and AML." has been revised to “The production of erythrocytes with abnormal nuclei and the reduction of mature erythrocytes due to IDH1 deletion are prominent features of MDS and AML.” In addition, “% of euchromatin and heterochromatin” has been modified to “Area ratio of euchromatin to heterochromatin”.

      Reviewer #3 (Recommendations For The Authors):

      The following critique points aim to help the authors to improve their manuscript:

      (1) The authors reason (p. 10) that because mutant IDH1 has been shown to result in altered chromatin organization, this could be the case in their system, too. However, mutant IDH1 has an ascribed metabolic consequence, the generation of 2-HG, which further weakens the author's argument for an enzymatically independent role of IDH1 in their system. The same is true for the author's observation in Supplementary Figure 9B that in IDH1-mutant AML/MDS samples, H3K79me3 colocalized with the IDH1 mutants in the nucleus. Again, this speaks in favor of IDH1's role being linked to metabolism. The authors could re-write this manuscript, not so much emphasizing the separation of function between different subcellular forms of IDH1 but rather focusing on the chromatin changes and how they could be linked to the actual phenotype, the nuclear condensation and enucleation defect - if so, addressing the surprising finding of enrichment of both active and repressive chromatin marks will be important.

      Thanks for the reviewer’s suggestion. We agree with the reviewers and editors that the data presented in the current manuscript are not robust enough to rigorously distinguish between enzymatic and enzyme-independent roles of IDH1. In our revised manuscript, we have removed all assertions of a "metabolism-independent" mechanism. Instead, we focus on demonstrating that nuclear-localized IDH1 contributes to chromatin state regulation during terminal erythropoiesis (e.g., H3K79me3 accumulation).

      (2) How come so many genes were downregulated by RNA-seq (about an equal number as upregulated genes) but not more open by ATAC-seq? The authors should discuss this result.

      Thanks for the reviewer's suggestion. ATAC-seq showed an increase in chromatin accessibility after IDH1 deletion, whereas the number of up-regulated genes was only slightly larger than that of down-regulated genes, which may be caused by the metabolic changes induced by IDH1 deletion. To explore the effect of chromatin accessibility changes on gene expression after IDH1 deletion, we analyzed differential gene expression at the differential ATAC peak regions (as shown in the figure below), and the results showed that genes at ATAC peak regions with increased chromatin accessibility were significantly up-regulated. This may explain how chromatin accessibility regulates gene expression.

      (3) For the ChIP-seq analyses of H3K79me3, H3K27me2, and H3K9me3, the authors should not just show genome-wide data but also several example gene tracks to demonstrate the differential abundance of peaks in control versus IDH1 knockdown. Furthermore, the heatmap shown in Figure 5A should include broader regions spanning the gene bodies, to visualize the intergenic H3K27me2 and H3K9me3 peaks. Expression could very well be regulated from these intergenic regions as they could bear enhancer regions. ChIP-seq for H3K27Ac in the same setting would be very useful to identify those enhancers.

      Thanks for the reviewer’s suggestion. The manuscript has been revised accordingly. We reanalyzed the ChIP-seq peak signals of H3K79me3, H3K27me2 and H3K9me3 over a wider region (±5 kb) around the gene body, and the results showed that the H3K27me2 and H3K9me3 peak signals did not change significantly. Since H3K79me3 showed a higher peak signal and was mainly enriched in the promoter region, focusing our subsequent analysis on the impact of H3K79me3 accumulation on chromatin accessibility and gene expression might be more valuable.

      Author response image 3.

      ChIP-seq analysis of the peak signals of H3K79me3, H3K27me2 and H3K9me3. (A) Heatmaps display the normalized ChIP signal of H3K9me3, H3K27me2, and H3K79me3 at gene body regions. The window represents ±5 kb regions from the gene body. TES, transcriptional end site; TSS, transcriptional start site. (B) Representative peak tracks show the normalized ChIP signal of H3K9me3, H3K27me2, and H3K79me3 at gene body regions.

      (4) The absent or very mild delay (also no significance visible in the quantification plots) in the generation of orthochromatic erythroblasts on Day 13 upon IDH1 shRNA knockdown as per a4-integrin/Band3 flow cytometry does not correspond to the already quite prominent number of multinucleated cells at that stage seen by cytospin/Giemsa staining. Why do the authors think this is the case? Cytospin/Giemsa staining might be the better method to quantify this phenotype and the authors should quantify the cells at different stages in at least 100 cells from non-overlapping cytospin images.

      Thanks for the reviewer’s suggestion. We have added the cytospin assay, and the results are presented in Supplemental Figure 4.

      (5) The pull-down assay in Figure 7E does not show a specific binding of H3K79me3 to the SIRT1 promoter. Rather, there is just more H3K79me3 in the nucleus, thus leading to generally increased binding. The authors should show that H3K79me3 does not bind more just everywhere but to specific loci. The ChIP-seq data mention only categories but don't show any gene lists that could hint at the specificity of H3K79me3 binding at genes that would promote nuclear abnormalities and enucleation defects.

      We thank the reviewer for pointing this out. The GSEA results for the H3K79me3 peaks showed enrichment of chromatin-related biological processes, and the list of associated genes is shown in Figure 7B. In addition, we also displayed the changes in H3K79me3 peak signals, ATAC peak signals, and gene expression at the loci of three chromatin-associated genes (SIRT1, KMT5A and NUCKS1).

      (6) P. 12: "Representatively, gene expression levels and ATAC peak signals at SIRT1 locus were elevated in IDH1-shRNA group and were accompanied by enrichment of H3K9me3 (Figure 7F)." Figure 7F does not show an enrichment of H3K9me3, but if the authors found such, they should explain how this modification correlates with the activation of gene expression.

      Thank you for bringing this issue to our attention. We sincerely apologize for the mistake in the description of Figure 7F on page 12. We have already corrected this error in the revised manuscript.

      (7) Related to the mild phenotype by flow cytometry on Day 13, are the "3 independent biological replicates" from culturing and differentiating CD34 cells from 3 different donors? If all are from the same donor, experiments from at least a second donor should be performed to generalize the results.

      In our current study, CD34<sup>+</sup> cells were derived from different donors. 

      (8) If the images in Supplementary Figure 4 are only the indicated cell type, then it is not clear how the data were quantified since only some cells in each image are pointed at and others do not seem to have as large nuclei. There is also no explanation in the legend what the colors mean (nuclei were presumably stained with DAPI, not clear what the cytoplasm stain is - GPA?).

      We thank the reviewer for pointing this out. We have revised the manuscript accordingly. Specifically, the nuclei were stained with DAPI (blue) and the cell membrane was stained with GPA (red). This staining method allows for clear visualization of the cell structure and helps to better understand the localization of the proteins of interest.

      (9) It is not clear to this reviewer whether Figure 4F is a quantification of the Western Blot or of the IF data.

      Figure 4F is a quantification of the Western Blot experiment.

      (10) The authors sometimes do not describe experiments well, e.g., "treatment of IDH1-deficient erythroid cells with IDH1-EX527" (p. 13). EX-527 is a SIRT1 inhibitor, which the authors only explicitly mention later in that paragraph. It is unclear to this reviewer, why the authors call it IDH1-EX527.

      Thank you for pointing out the unclear description in our manuscript. We apologize for the confusion caused by the unclear statement. We have revised the manuscript accordingly. The compound EX-527 is a SIRT1 inhibitor, and we have corrected the description to simply "EX-527" in the revised manuscript.

      (11) The end of the introduction needs revising to be more concise; the last paragraph on p. 4 ("Recently, the decreased expression of IDH1...") partially should be integrated with the previous paragraph, and partially is repeated in the last paragraph (top paragraph on p. 5). The last sentence on p. 4, "These findings strongly suggest that aberrant expression of IDH1 is also an important factor in the pathogenesis of AML and MDS.", should rather read "increased expression of IDH1", to distinguish it from mutant IDH1 (mutant IDH1 is also aberrantly expressed IDH1).

      We appreciate the reviewer's helpful suggestion. Because this paragraph did not contribute meaningfully to the formulation of the scientific question, we have removed it after careful consideration, and the revised manuscript now flows more logically.

      (12) Abstract and last sentence of the introduction: "innovative perspective" should be re-worded, as the authors present data, not a perspective. Maybe could use "evidence".

      Thanks for the reviewer’s suggestion. It has been revised accordingly.

      (13) "IDH1-mut AML/MDS" on p. 11. The authors should provide more information about these AML/MDS samples. The legend contains no information about them/their mutational status. How many samples did the authors look at? Do these cells contain mutations other than IDH1?

      Thanks for the reviewer’s suggestion. Detailed information on these AML/MDS samples is provided in Supplemental Table 1. In our current study, we collected ten AML/MDS samples, and the majority of the samples only contain IDH1 mutations at different sites.

      (14) The statement, "Taken together, these results indicated that IDH1 deficiency reshaped chromatin states and subsequently altered gene expression pattern, especially for genes regulated by H3K79me3, which was the mechanism underlying roles of IDH1 in modulation of terminal erythropoiesis." (p. 10), is not correct at that point in the manuscript as the authors have not yet introduced the RNA-seq data.

      Thanks for the reviewer’s suggestion. The statement has been revised to “Taken together, these results indicated that IDH1 deficiency reshaped chromatin states by altering the abundance and distribution of H3K79me3, which was the mechanism underlying roles of IDH1 in modulation of terminal erythropoiesis”.

      (15) For easier readability, the authors should present the data in order. For example, the supplemental data for IDH shRNA and siRNA should be presented together and not in Supplementary Figures 1 and 5. Supplementary Figure 3 is mentioned after Supplementary Figure 1, but before Supplementary Figure 2 - again, all data need to be presented in subsequent figures to be viewed together.

      Thank you for your suggestion regarding the order of data presentation. We have reorganized the figures in the manuscript to improve readability. We apologize for any confusion caused by the previous arrangement and hope that the revised version meets your expectations.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors investigate the role of HSPA2 during mouse preimplantation development. Knocking down HSPA2 in zygotes, the authors describe lower chances of developing into blastocysts, which show a reduced number of inner cell mass cells. They find that HSPA2 mRNA and protein levels show some heterogeneity among blastomeres at the 4-cell stage and propose that HSPA2 could contribute to skewing their relative contribution to embryonic lineages. To test this, the authors try to reduce HSPA2 expression in one of the 2-cell stage blastomeres and propose that it biases their contribution towards extra-embryonic lineages. To explain this, the authors propose that HSPA2 would interact with CARM1, which controls chromatin accessibility around genes regulating differentiation into embryonic lineage.

      Strengths:

      (1) The study offers simple and straightforward experiments with large sample sizes.

      Thanks for your kind recognition.

      (2) Unlike most studies in the field, this research often relies on both mRNA and protein levels to analyze gene expression and differentiation.

      Thanks for your kind recognition.

      Weaknesses:

      (1) Image and statistical analyses are not well described.

      Thanks for your helpful comment. We have re-described the image and statistical analyses in our revised version (lines 255-257).

      (2) The functionality of the overexpression construct is not validated.

      Thanks for your kind suggestion. We have validated the functionality of the overexpression construct in our revised version (Figure S3).

      (3) Tracking of KD cells in embryos injected at the 2-cell stage with GFP is unclear.

      Thanks for your kind suggestion. We randomly co-injected green fluorescent protein (Gfp) mRNA as a lineage tracer with either Hspa2-siRNA or NC-FAM into one blastomere of the 2-cell embryo, and then monitored embryo development to the blastocyst stage (lines 342-344).

      (4) A key rationale of the study relies on measuring small differences in the levels of mRNA and proteins using semi-quantitative methods to compare blastomeres. As such, it is not possible to know whether those subtle differences are biologically meaningful. For example, the lowest HSPA2 level of the embryo with the highest level is much higher than the top cell from the embryo with the lowest level. What does this level mean then? Does this mean that some blastomeres grafted from strong embryos would systematically outcompete all other blastomeres from weaker embryos? That would be very surprising. I think the authors should be more careful and consider the lack of quantitative power of their approach before reaching firm conclusions. Although to be fair, the authors only follow a long trend of studies with the same intrinsic flaw of this approach.

      Thanks for your advisable comment. Indeed, although the approach drew on previous research (Zhou Cell 2018), we were clearly aware that it can only reflect relative comparisons. That is, the relative differences among blastomeres from the same embryo were detected and compared. We did not compare the absolute levels of mRNA between different embryos. We also performed simple and straightforward experiments with large sample sizes to confirm this conclusion.

      (5) Some of the analyses on immunostaining do not take into account that this technique only allows for semi-quantitative measurements and comparisons.

      a) Some of the microscopy images are shown with an incorrect look-up table.

      b) Some of the schematics are incorrect and misleading.

      Thanks for your advisable comment. We revised microscopy images and schematics in our revised version.

      Reviewer #2 (Public review):

      Summary:

      In this study, Gao et al. use RNA-seq to identify Hspa2 as one of the earliest transcripts heterogeneously distributed between blastomeres. Functional studies are performed using siRNA knockdown showing Hspa2 may bias cells toward the ICM lineage via interaction with the known methyltransferase CARM1.

      Strengths:

      This study tackles an important question regarding the origins of the first cell fate decision in the preimplantation embryo. It provides novelty in its identification of Hspa2 as a heterogeneous transcript in the early embryo and proposes a plausible mechanism showing interactions with Carm1. Multiple approaches are used to validate their functional studies (FISH, WB, development rates, proteomics). Given only 4 other transcripts/RNA have been identified at or before the 4-cell stage (LincGET, CARM1, PRDM14, HMGA1), this would be an important addition to our understanding of how TE vs ICM fate is established.

      Thanks for your kind recognition.

      The RNA-seq results leading the authors to focus on Hspa2 are not included in the manuscript. This dataset would serve as an important resource but is neither included nor discussed. Nor is it mentioned whether Hspa2 was identified in prior RNA-seq embryos studies (for example Deng Science 2014).

      Thanks for your advisable comment. To identify genes that show significantly high variability across blastomeres in the same embryo, we regressed out the embryo effect by establishing a new method, which will be published and uploaded to the database in the future. Thus, the RNA-seq results that led us to focus on Hspa2 are not included in the manuscript.

      In addition, the functional studies are centered on Hspa2 knockdown at the zygote (1-cell) stage, which would largely target maternal transcript. Given the proposed mechanism relies on Hspa2 heterogeneity post-ZGA (late 2-cell stage), the knockdown studies don't necessarily test this and thus don't provide direct support to the authors' conclusions. The relevance of the study would be improved if the authors could show that zygotic knockdown leads to symmetric Hspa2 levels at the late 2-cell and/or 4-cell stage. It may be possible that zygotic knockdown leads to lower global Hspa2 levels, but that asymmetry is still generated at the 4-cell stage.

      Thanks for your advisable comment. We now show the Hspa2 levels at the late 2-cell and 4-cell stages after zygotic knockdown in our revised version (Figure S1 G-H, line 450-452).

      Furthermore, the authors show that Hspa2 knockdown at the 1-cell stage lowers total Carm1 levels at the 4-cell stage. However, it is unclear how total abundance within the embryo alters lineage specification within blastomeres. The authors go on to propose a plausible mechanism involving Hspa2 and Carm1 interaction, but do not discuss how expression levels may be involved.

      Thanks for your advisable comment. Previous research suggests that heterogeneous activity of the methyltransferase CARM1 results in differential methylation of histone H3R26 to modulate establishment of lineage specification (Zernicka-Goetz Cell 2018). Thus, we did not discuss how total abundance within the embryo alters lineage specification.

      Recommendations for the authors:  

      Reviewer #1 (Recommendations for the authors):

      (1) Major issue with analyses:

      Image analysis needs to be much better explained than simply saying that ImageJ was used. Where are cells measured (at their equatorial plane? What is the size of the ROI?)? Ideally, the ROI and/or raw measurements should be provided.

      Thanks for your advisable comment. We have redescribed the image analysis in our revised version (line 187-194).

      What are the objective criteria determining whether a cell is counted as GFP positive, CDX2 positive, or OCT4 positive? This is very unclear and key to the interpretation of many experiments.

      Thanks for your advisable comment. Cells containing fluorescence signal above background noise were counted as positive.
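
      The "above background" criterion can be made explicit with a fixed threshold derived from background measurements. The sketch below is an illustration under stated assumptions, not the procedure used in the manuscript: the function name `count_positive`, the rule "background mean plus k standard deviations", the value k = 3, and all intensities are hypothetical.

```python
from statistics import mean, stdev


def count_positive(cell_intensities, background_intensities, k=3.0):
    """Count cells whose intensity exceeds background mean + k * SD.

    The threshold rule and k=3 are illustrative assumptions, not values
    taken from the manuscript.
    """
    threshold = mean(background_intensities) + k * stdev(background_intensities)
    positives = [value for value in cell_intensities if value > threshold]
    return len(positives), threshold


# Hypothetical measurements in arbitrary units
background = [10.0, 12.0, 11.0, 9.0, 13.0]
cells = [11.5, 40.2, 15.9, 16.1, 75.0]
n_positive, threshold = count_positive(cells, background)
print(n_positive, round(threshold, 2))
```

      Reporting the threshold alongside the counts lets a reader judge how sensitive the GFP, CDX2, or OCT4 classification is to the chosen cutoff.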

      Statistical analyses mention ANOVA in the methods but the student's t-test in the figure legend. Which is which? Most data are heavily normalized, which would unlikely fit the description for Student's t-test analyses.

      Thanks for your advisable comment. We have redescribed the statistical analyses in the Materials and Methods (lines 253-260).

      Figure 5H describes a relative fluorescence intensity with control at 1. The legend describes a normalization to "DNA" (I guess the authors meant DAPI), which is unlikely to give 1. This suggests that additional normalization was done and is not described. Is that the case? Also, since the authors propose that HSPA2 would control Histone modification and chromatin packing, I do not think that using DAPI is an appropriate way of normalizing the fluorescence signal.

      Thanks for your advisable comment. We replaced DNA with DAPI in our revised version. Based on previous studies, we adopted DAPI as a normalized fluorescence signal (Zhou Cell 2018, Zernicka-Goetz Cell 2018).

      Figure 1E shows data normalized to the lowest level while Figure 1H is normalized to the highest level. A consistent representation would be welcome.

      Thanks for your advisable comment. We revised the Figure 1H in our revised version.

      Is Figure 1C showing a t-test between correlations?

      Yes, Figure 1C shows the t-test between correlations.

      (2) Major issue with the interpretation of semi-quantitative methods and measurements:

      qPCR, WB, immunostaining are all semi-quantitative methods that require some kind of normalization due to non-linear bias in the way the molecules are picked up. Such normalization makes it difficult to know whether a detectable difference is meaningful biologically speaking i.e. if a difference of 1 CT between blastomeres can be detected after qPCR, is it meaningful? If that were the case, then embryos with lower CT than others (Figure 1D) would not be able to develop into blastocyst, like siRNA injected embryos, or grafting a blastomere with a high CT onto an embryo with low CT would lead to the systematic differentiation of these strong blastomeres into ICM.

      Thanks for your advisable comment. The CT values represent the relative mRNA levels of Hspa2 between blastomeres, and a higher CT value represents lower expression of Hspa2 at the mRNA level. Figure 1D shows the Hspa2 mRNA levels between blastomeres. A blastomere with low-level expression of Hspa2 mRNA is not biased towards an ICM fate.
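
      To make the inverse relationship between CT and expression concrete, the following sketch converts CT values into fold differences relative to the lowest-expressing blastomere of the same embryo. It assumes roughly 100% amplification efficiency (one cycle corresponds to a 2-fold change) and omits reference-gene normalization for brevity. The function name and the CT values are hypothetical, not data from the study.

```python
def relative_expression(ct_values):
    """Return expression of each blastomere relative to the lowest-expressing one.

    Assumes ~100% amplification efficiency, so one Ct cycle is a 2-fold change.
    """
    reference_ct = max(ct_values)  # highest Ct corresponds to lowest expression
    return [2 ** (reference_ct - ct) for ct in ct_values]


# Hypothetical Ct values for the four blastomeres of one embryo
ct = [27.0, 26.0, 25.5, 25.0]
fold = relative_expression(ct)
print([round(value, 2) for value in fold])
```

      Under these assumptions, a difference of 1 CT between two blastomeres corresponds to a 2-fold difference in transcript abundance, which is the scale at which the reviewer asks whether the difference is biologically meaningful.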

      The same goes for fluorescence analyses (Figure 1F). Can the authors also provide the measurements for DAPI as they did for HSPA2? I am sure that with enough measurements, DAPI is variable enough to give a statistical difference among blastomeres with questionable biological meaning.

      I think the reasoning used here (unfortunately following the reasoning that has been used in a series of studies by other groups) of ranking blastomeres after semi-quantitative measurement is fundamentally flawed.

      Thanks for your advisable comment. The DAPI signal was determined at the maximal area using a custom Python script. Based on previous studies, we adopted DAPI as a normalized fluorescence signal (Zhou Cell 2018). This approach normalizes embryo-to-embryo variance arising from technical reasons.
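
      A two-step normalization of this kind can be sketched as follows. This is not the custom script mentioned above, which is not shown in the response; it is a minimal illustration that assumes one intensity value per blastomere for the protein channel and for DAPI, and that values are scaled to the lowest blastomere within each embryo. All names and numbers are hypothetical.

```python
def normalize_within_embryo(signal, dapi):
    """Divide each blastomere's signal by its DAPI value, then scale so that
    the lowest ratio within the embryo equals 1."""
    if len(signal) != len(dapi):
        raise ValueError("signal and dapi must have the same length")
    ratios = [s / d for s, d in zip(signal, dapi)]
    lowest = min(ratios)
    return [ratio / lowest for ratio in ratios]


# Hypothetical intensities (arbitrary units) for four blastomeres of one embryo
hspa2 = [120.0, 150.0, 200.0, 90.0]
dapi = [100.0, 100.0, 125.0, 90.0]
relative = normalize_within_embryo(hspa2, dapi)
print([round(value, 2) for value in relative])
```

      Because each embryo is scaled to its own lowest blastomere, the resulting values support comparisons within an embryo only, not comparisons of absolute levels between embryos.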

      (3) Major issue with overexpression experiment:

      While the siRNA experiment is partially validated by qPCR and WB measurements of HSPA2 after KD, the overexpression experiment is not. Do the authors have any evidence that the construct they use is produced into protein and functional? Can the authors check by WB? Can the authors rescue the siRNA with their overexpression?

      Thanks for your advisable comment. We verified the overexpression experiment by WB in our revised version (Figure S3, line 360-361). Considering that siRNA degrades mRNA and prevents mRNA translation, we did not co-inject the siRNA with the overexpression construct.

      The lack of effect of HSPA2 overexpression on blastocyst formation is difficult to reconcile with the interpretation from the authors that levels of HSPA2 bias lineages.

      Have the authors tried lower concentrations? Have the authors tried FISH on their half-injected 2-cell embryos? Of course, if the antibody against HSPA2 would work with immunostaining, that would be ideal.

      Thanks for your advisable comment. We chose the concentrations for our study based on previous research (Zernicka-Goetz Cell 2016). To verify that the injection into one blastomere at the 2-cell stage was successful, we observed green fluorescence after co-injecting GFP mRNA with either siRNA or NC-FAM into one blastomere of two-cell embryos. Thus, we did not try FISH on half-injected 2-cell embryos. We tried to perform immunostaining experiments with various HSPA2 antibodies (Proteintech: 12797-1-AP, Abcam: ab108416), but none gave satisfactory results.

      Author response image 1.

      (4) Major issue with tracking of injected cells:

      It is unclear what counts as a GFP-positive cell. In Figure 3D, most cells appear to have the same level of GFP.

      Thanks for your advisable comment. Cells containing green fluorescence signal above background noise were counted as GFP-positive in Figure 3D. Most cells seem to have the same level of GFP because they are daughter cells of the blastomeres injected with GFP.

      In the images of GFP-expressing cells used to track the control of KD cells shown in Figure 3A, it seems that the control embryos have mostly GFP cells in the ICM. Is that the case, or just a bad example?

      Thanks for your advisable comment. The green fluorescent signals in Figure 3A represented OCT4 protein, an ICM marker.

      Can the authors do FISH against HSPA2 and visualize their GFP cells to validate the heterogeneous expression in situ?

      Thanks for your advisable comment. We have verified the heterogeneous expression of HSPA2 in Figure 1.

      (5) Issue with fluorescent images:

      Many images are shown with inappropriate look-up tables with saturated DAPI, OCT4, CDX2, and FISH. This raises the doubt that analyses were made on saturated images, which would be incorrect.

      The LUT of Figure 5H should be adjusted similarly between the control and siRNA.

      Thanks for your advisable comment. We revised the images that were shown with inappropriate look-up tables in our revised version. The LUT of Figure 5H has been adjusted similarly between the control and siRNA.

      (6) Issue with schematics:

      Schematics of blastomere isolation grown into blastocyst-like structures are misleading since the final blastocyst-like structure should not have a zona pellucida and should have fewer cells than regular blastocysts.

      Thanks for your advisable comment. We revised schematics of blastomere grown into morula in our revised version (Figure 1A and Figure S1A).

      The summary schematics in the final figure should not state HSPA2 -/- since experiments in the study did not use KO but KD.

      Thanks for your advisable comment. We revised the summary schematics in our revised version.

      The blastocysts are the same sizes as the cleavage stage or morula embryos which implies that cells lose volume to the lumen, which is not the case.

      Thanks for your advisable comment. We revised the schematics in our revised version.

      (7) Issue with data presentation:

      In the tables within the figures, the number of decimals given should be the same for the mean and SE (one decimal should be more than enough).

      Thanks for your advisable comment. We revised the figure 2H in our revised version.

      The comparison of cell number and distribution within embryos (e.g. Figure 2B) would be best represented by a correlation analysis of TE vs ICM cells.

      Thanks for your advisable comment. We added a correlation analysis of TE vs ICM cells in our revised version (Figure 3B).

      The docking simulations are described in the main text as "experiments".

      Thanks for your advisable comment. We redescribed the docking simulations in our revised version.

      (8) Issue with data interpretation:

      The reduced number of ICM cells is interpreted as a slowed-down cell cycle. This could also be explained by failed cytokinesis and the generation of binucleated or polyploid cells. Have the authors checked for that? For example, by looking at their DAPI staining. 

      Thanks for your advisable comment. Our RNA-seq results revealed that the differentially expressed genes (DEGs) at the blastocyst stage after HSPA2 knockdown are closely related to negative regulation of cell cycle, G1/S transition of mitotic cell cycle, mitotic cell cycle phase transition and regulation of mitotic cell cycle phase transition. Additionally, a previous study demonstrated that knockdown of HSPA2 reduced cell proliferation and led to G1/S phase cell cycle arrest (Hu Ann Transl Med 2019). The lower cell number in the ICM may also be associated with failed cytokinesis and the generation of binucleated or polyploid cells. Thus, we speculated that HSPA2 has a role in ICM lineage establishment, although half of the ICM cells were able to survive with HSPA2 deficiency (line 463-472).

      It is unclear to me why reduced ICM should lead to fewer blastocysts. Blastocysts should be able to form as long as their TE is fine. In Figure 2G, embryos seem to be cultured in close proximity, which is fine if they are healthy but not if some of the embryos start dying and releasing toxic compounds (e.g. ROS). Have the authors tried removing the dying KD embryos to see if the development of the remaining embryos would improve?

      Thanks for your advisable comment. We think HSPA2 may affect blastocyst development by affecting other signaling pathways. In addition, the enriched GO terms were closely related to blastocyst development (Figure 2E). There was no significant difference in morula formation rate between the Hspa2-KD group and the NC group; thus, the assumption that toxic compounds released by some of the embryos lower the blastocyst rate may not be correct. Indeed, the rate of blastocyst formation in Hspa2-KD embryos was still significantly reduced when a few embryos were cultured separately. In addition, we discussed the possibility that the lower cell number in the ICM may also be associated with failed cytokinesis and the generation of binucleated or polyploid cells.

      Author response image 2.

      Reviewer #2 (Recommendations for the authors):

      One of the significant findings in the paper is the discovery portion where Hspa2 is identified as a heterogeneous transcript. To improve the logic and impact of the manuscript, it may benefit from reorganizing some of the figures and text. For example:

      (1) The paragraph in the introduction (Lines 56-68) should be moved to the discussion as the Hspa2 reveal should be in section 3.1, not prior to the RNA-seq results presented in Figure 1.

      Thanks for your advisable comment. We think it is more logical that HSPA2 needs to be introduced in the introduction.

      (2) Add text at the beginning of Section 3.1 to describe the rationale and results for the RNA-seq. It would help the readers if the authors clearly stated why they chose the 4-cell stage.

      Thanks for your advisable comment. We explain why we chose the 4-cell stage in our revised version (line 272-273).

      (3) As this is the first time Hspa2 is identified, consider moving Figure S1C to the main figure to show expression throughout development.

      Thanks for your advisable comment. We moved Figure S1C to the main figure in our revised version (line 286-291).

      (4) Figure 1C: the correlation between Hspa2 and ICM markers would be strengthened if additional transcripts were used (Oct4, Sox2, Sox21). The graph in 1C would also be more informative if represented as a scatter plot with correlation coefficients (Nanog log2TPM vs Hspa2 log2TPM), rather than bar graphs.

      Thanks for your advisable comment. We chose Nanog because, among ICM markers, it showed the strongest correlation with Hspa2 in our results. In addition, Figure 1C shows that the positive correlation between Nanog and Hspa2 expression is stronger than that of random gene pairs (n=100, where n is the number of random gene pairs). Thus, we consider Figure 1C easier to understand as a bar graph.
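
      The comparison against random gene pairs can be illustrated with a short sketch. It is not the analysis behind Figure 1C: it computes a Pearson coefficient for one gene pair and then reports the fraction of random-pair coefficients that are at least as large, which is one simple way to express "stronger than random pairs". The expression values and the random-pair coefficients below are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def fraction_at_least(observed_r, random_rs):
    """Fraction of random-pair correlations that reach the observed value."""
    return sum(r >= observed_r for r in random_rs) / len(random_rs)


# Hypothetical log2(TPM) values across four blastomeres
hspa2 = [1.0, 2.0, 3.0, 4.0]
nanog = [1.5, 2.5, 2.9, 4.1]
observed = pearson_r(hspa2, nanog)

# Hypothetical correlations for five random gene pairs
random_rs = [0.10, -0.30, 0.50, 0.99, 0.20]
fraction = fraction_at_least(observed, random_rs)
print(round(observed, 3), fraction)
```

      A scatter plot of the two genes annotated with the coefficient, as the reviewer suggests, would present the same information as `observed` without collapsing it into a bar.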

      (5) Figure 1D: how were individual blastomeres grouped into B1-4? Individually run and then pooled based on relative expression?

      Thanks for your advisable comment. Blastomeres are named B1 to B4 according to increasing Hspa2 concentration in figure 1E.

      (6) Figures 1F, 1I, 5H: the DAPI channel appears to be saturated, but is used to normalize fluorescence intensity and may incorrectly account for light scattering within the embryo. Please clarify by adding more details regarding image analysis. Were partial stacks through the nucleus used for analysis, or max projections? Graph axes should be "relative fluorescence intensity."

      Thanks for your advisable comment. We added the details of the fluorescence image analysis. The graph axes have been revised in our revised version.

      (7) Line 278: the results in Figure S1C would benefit from more text regarding expression patterns throughout development. The maternal transcript appears to have a sharp downregulation by the early 2-cell stage, and is then upregulated coinciding with ZGA.

      Thanks for your advisable comment. We added more description of the figure in the main text (lines 285-290).

      (8) For the analyses in Figure 2 I-J and 2K-L, were arrested embryos excluded from analysis? This is an important detail as including arrested embryos would significantly bias the RNA-seq results. 

      Thanks for your advisable comment. The arrested embryos were excluded in Figure 2 I-J and 2K-L.

      (9) Figures 2G-H would be aided by converting the table in 2H to a bar graph and adding development rates for all stages (2-, 4-, 8-, morula, and blast). This would also show when an arrest occurs.

      Thanks for your advisable comment. We converted the table in 2H to a bar graph.

      (10) Blast rates are represented with too many significant digits (Figures 2H, 4B). They should only be reported to the closest ones given the unit of measure (number of blasts divided by number of zygotes). For instance, a blast rate of 81.63 ± 2.000 reflects excessive precision that is not measured in the data, it should rather read 82 ± 2%. This is also true for % cells (Figures 3E, 4H).

      Thanks for your advisable comment. Values were rounded down to one decimal place.
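
      For illustration, a mean and its standard error can be reported with the same number of decimals using a small formatting helper. The helper name `format_rate` is hypothetical, and the numbers are the ones quoted in the reviewer's comment. Note that Python's format specification rounds to the nearest value rather than rounding down.

```python
def format_rate(mean_percent, se_percent, decimals=0):
    """Format mean ± SE with the same number of decimals for both values."""
    return f"{mean_percent:.{decimals}f} ± {se_percent:.{decimals}f}%"


whole = format_rate(81.63, 2.000)        # nearest whole percent
one_decimal = format_rate(81.63, 2.000, 1)  # one decimal place
print(whole)
print(one_decimal)
```

      Using a single `decimals` argument for both values guarantees that the mean and SE in a table never carry different precision.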

      (11) The clarity and impact of Figure 3A and 3D would benefit from 2D slices through the ICM. 

      Thanks for your advisable comment. In order to get more comprehensive understanding of the 3D structure of blastocyst of Figure 3A and 3D, we did not choose 2D slices.

      (12) To improve clarity and logic, separate the 1-cell and 2-cell knockdown experiments in the text and figures:

      a) 1-cell knockdown with RNA-seq results (Fig 2A-F).

      b) 1-cell knockdown showing less ICM/pluripotency markers in (combine Figures 2G-M and Figures 3A-B; "new Fig 3").

      c) 2-cell knockdown tracing lineage (Figures 2D-E; "new Fig 4").

      The new Figures 3 and 4 should mirror one another (i.e. for each knockdown experiment, development rates and cell counts should be included). For the 2-cell knockdown (Figures 2 D-E), what were the developmental rates (8-cell, morula, blast)?

      Thanks for your advisable comment. However, to preserve the overall logic of the article, we did not separate the 1-cell and 2-cell knockdown experiments in the text and figures. We added the developmental rates (8-cell, morula, blast) of the 2-cell knockdown group in our revised version (Figure S2).

      For the overexpression experiment (Figure 4), why were injections performed at the zygote stage versus the 2-cell stage? Given the significant downregulation of maternal transcript demonstrated in Figure S1C, it seems plausible that the injected RNA was also downregulated.

      Thanks for your advisable comment. For the overexpression experiment, we first chose to inject Hspa2 mRNA at the zygote stage and found that overexpression of Hspa2 does not bias blastomeres towards an ICM fate. The qRT-PCR results indicated that the expression level of Hspa2 in the overexpression group was significantly increased compared with the normal group at the 4-cell and blastocyst stages (Figure 4C, 4D). In addition, there is no guarantee that an equal amount of Hspa2 mRNA would be injected into each blastomere at the 2-cell stage. Thus, we did not microinject Hspa2 mRNA at the 2-cell stage.

      The 3.5 subheading overstates the results as the Hspa2-Carm1 interaction is not linked to lineage segregation. For example, a more specific subtitle might be, "Hspa2 interacts with Carm1 and alters H3R26me2 levels."

      Thanks for your advisable comment. We revised the subtitle in our revised version (line 376).

      Figures 5B-C and 5D-E. The qRT-PCR and WB analysis of knockdown blasts shows a correlation between Hspa2 downregulation and Carm1 downregulation. However, if the proposed mechanism is Hspa2 binding to Carm1 to mediate downstream methylation, why would it be expected to alter transcript levels at the 4-cell or blast stage? Please add further details and discussion in the results and discussion sections.

      Thanks for your advisable comment. The reason we chose to work at the 4-cell stage is because previous studies on CARM1 have focused on the 4-cell stage (Zernicka-Goetz Cell 2018,2016). 

      In the discussion, the statement in Lines 430-431 is an overinterpretation: "the heterogeneity of HSPA2... acts as an upstream factor to drive [the] first cell-fate decision." The knockdown experiments don't alter heterogeneity per se, but total abundance. Furthermore, the results do not show that heterogeneity drives heterogeneity in H3R26me2 patterns, for example.

      Thanks for your advisable comment. We redescribe the relevant statement in the discussion.

      More needs to be said regarding the ICM cells that persisted in the 1-cell KD experiment (Fig 3B). Lines 449-450 point out this result, but do not propose any plausible explanations. For instance, ICM cells may still form due to the incomplete knockdown achieved or the possibility that redundant pathways exist.

      Thanks for your advisable comment. We redescribe the relevant statement in our revised version (line 468-473).

      The 5th paragraph of the discussion seems incomplete. The authors point out a possible link between Hspa2 and Hippo and Wnt signaling pathways, but need to expand their discussion on how this may act as an additional mechanism incorporating Hspa2 with lineage segregation.

      Thanks for your advisable comment. We redescribe the 5th paragraph of the discussion (line 483-494).

      Statistics: all comparisons with greater than 2 groups should be performed with a one-way ANOVA and multiple comparisons, rather than Student's t-test (Figures 1B, 1D, 1E, 1F).

      All figure legends lack statistical test details.

      Thanks for your advisable comment. Statistical test details have been added to all figure legends.
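
      The reviewer's point about comparing more than two groups can be illustrated with the F statistic of a one-way ANOVA, written out here in plain Python so that each term is visible. This is a sketch with hypothetical data, not the analysis in the manuscript. Obtaining a p-value and running post hoc multiple comparisons would need a statistics library; SciPy's `scipy.stats.f_oneway` is one option, assuming it is available.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n_total
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of group means around the grand mean
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    # Variation of observations around their own group mean
    ss_within = sum(
        (value - m) ** 2 for group, m in zip(groups, group_means) for value in group
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical relative expression values for three groups
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_between, df_within = one_way_anova_f(groups)
print(f_stat, df_between, df_within)
```

      Unlike repeated Student's t-tests, a single F test across all groups does not inflate the false-positive rate as the number of groups grows.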

      Minor comments:

      In all graphs, individual blastomere expression levels should be represented as box-whisker/bar/scatter/violin plots since the comparison is groups rather than time points (i.e. symbols should not be connected with a line in Figures 1B, 1D, 1F-G, 1I, S1D, S1F).

      Thanks for your advisable comment. Each colored line represents a single embryo, and dots of the same color represent the blastomeres of that embryo. Thus, we used lines to connect individual blastomeres.

      For all fluorescent images, having two representative images may be confusing for the reader. Figures may be improved by just including one representative image for each stage/treatment (Figures 1F, 1I, S1F, 3A, 3D, 4E, 4G).

      Thanks for your advisable comment. The figures now include one representative image for each stage in our revised version. In addition, two representative images from each group are shown for each treatment (Figures 3A, 3D, 4E, 4G).

      The manuscript would be improved with thorough grammar and typo editing.

      For example:

      (1) Lines 18, 73, the wording is confusing, consider: "knockdown of Hspa2 in one of the two-cell blastomeres biased its progeny towards the trophectoderm lineage.".

      (2) Line 23, overstatement. Consider: "we demonstrated that HSPA2 levels correlate with ICM-associated genes and that it interacts with the CARM1.".

      (3) Line 25 confusing wording, "via the execution of commitment and differentiation phases.".

      (4) Line 37, replace "that" with "of;" replace "cell-fate decisions" with "cell-fate decision".

      (5) Line 40: needs space before (CARM1).

      (6) Line 43: the wording is confusing, consider "can result in higher expression levels of".

      (7) Line 45: wording, consider "Recent [studies have] further suggested".

      (8) Line 70: plurality, consider "analyzed gene expression pattern".

      (9) Line 73 typo: "prevents its".

      (10) Line 76-77 wording, consider "Hspa2 expression patterns can bias cell fate in the mouse embryo".

      (11) Line 276: remove "in whole embryos," since MII eggs are not embryos.

      (12) Line 617 "There" should be "Three".

      (13) Axis label in Fig 3b "Totle" should be "Total".

      (14) Lines 417, 419 missing spaces.

      (15) Line 448 missing word, "interfering [with] the cell cycle".

      (16) Line 462 incorrect word, "[a]polar cells being specified as ICM".

      (17) Line 469 incorrect plural, "cell differentiation".

      Thanks for your advisable comment. We revised the whole manuscript carefully according to the reviewers' suggestions.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      Summary:

      The manuscript by Zhang et al describes the use of a protein language model (pLM) to analyse disordered regions in proteins, with a focus on those that may be important in biological phase separation. While the paper is relatively easy to read overall, my main comment is that the authors could perhaps make it clearer which observations are new, and which support previous work using related approaches. Further, while the link to phase separation is interesting, it is not completely clear which data supports the statements made, and this could also be made clearer.

      We thank the reviewer for their thoughtful evaluation of our manuscript and for the supportive comments. As outlined in the responses below, we have made substantial revisions to clarify the novel observations presented in our study and to strengthen the connection between sequence conservation and phase separation.

      Comment 1: With respect to putting the work in a better context of what has previously been done before, this is not to say that there is not new information in it, but what the authors do is somewhat closely related to work by others. I think it would be useful to make those links more directly.

      We have addressed the specific comments as outlined below.

      Comment 1a: Alderson et al (reference 71) analysed in detail the conservation of IDRs (via pLDDT, which is itself related to conservation) to show, for example, that conserved residues fold upon binding. This analysis is very similar to the analysis used in the current study (using ESM2 as a different measure of conservation). Thus, the result that "Given that low ESM2 scores generally reflect mutational constraint in folded proteins, the presence of region a among disordered residues suggests that certain disordered amino acids are evolutionarily conserved and likely functionally significant" is in some ways very similar to the results of that (Alderson et al) paper.

      We thank the reviewer for the comment. However, we would like to clarify that our findings show subtle but important differences from those reported by Alderson et al. Specifically, Alderson et al. used AlphaFold2 predictions to identify IDRs that undergo disorder-to-order transitions, which the authors termed conditionally folded IDRs. These regions could potentially be functionally important, assuming that the function of IDRs necessitates folding.

      We argue, however, that the validity of this structure-function relationship for IDRs remains to be tested. In our opinion, the most direct way to evaluate functional significance is to evaluate evolutionary conservation.

      As shown in Author response image 1, the correlation between pLDDT scores and the conservation score, while noticeable, is significantly weaker than that between the ESM2 score and the conservation score.

      Author response image 1.

      Comparison of the correlation between AlphaFold2 pLDDT scores and conservation scores with the correlation between ESM2 scores and conservation scores. Calculations were performed using proteins in the MLO-hProt dataset. (A) Correlation between the mean AlphaFold2 pLDDT scores and conservation scores for various amino acids. Pearson correlation coefficients (r) are indicated in the figure legends. The four panels on the right present analogous correlation plots for amino acids grouped by structural order, as defined by their pLDDT scores. (B) Same as in part A but for ESM2 scores.

      Therefore, we believe that ESM2 score is a better indicator than AlphaFold2 pLDDT score for functional relevance.

      Furthermore, for the human IDRs, we explicitly selected amino acids with pLDDT scores ≤ 70. These would be classified as structureless, disordered amino acids, according to the study by Alderson et al. Nevertheless, as shown in Figures 2 and 3 of the main text, our analyses still identify conserved regions. Therefore, these regions may function via mechanisms distinct from the disorder-to-order transition.
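
      The selection step described here can be sketched as a simple filter. Only the pLDDT cutoff of 70 comes from the text above. The ESM2 cutoff, the score scale, the function name, and the per-residue values are hypothetical and serve only to show how "disordered but conserved" residues might be flagged.

```python
def conserved_disordered(plddt, esm2, plddt_cutoff=70.0, esm2_cutoff=0.5):
    """Return 0-based indices of residues that are disordered (pLDDT at or
    below the cutoff) and have a low ESM2 score.

    esm2_cutoff is a placeholder; the response does not state a threshold.
    """
    if len(plddt) != len(esm2):
        raise ValueError("plddt and esm2 must have the same length")
    return [
        index
        for index, (p, e) in enumerate(zip(plddt, esm2))
        if p <= plddt_cutoff and e <= esm2_cutoff
    ]


# Hypothetical per-residue values for a six-residue stretch
plddt = [92.1, 85.0, 70.0, 55.3, 41.8, 73.4]
esm2 = [0.2, 1.8, 0.4, 2.1, 0.3, 0.1]
flagged = conserved_disordered(plddt, esm2)
print(flagged)
```

      Residues 0 and 5 have low ESM2 scores but are excluded because their pLDDT exceeds 70, which mirrors the restriction to disordered positions.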

      We now discuss the novelty of our work in the context of existing studies in the newly added Conclusions and Discussion: Related Work, as quoted below.

      “Numerous studies have sought to identify functionally relevant amino acid groups within IDRs [cite]. For instance, using multiple sequence alignment, several groups have identified evolutionarily conserved residues that contribute to phase separation [cite]. Alderson et al. employed AlphaFold2 to detect disordered regions with a propensity to adopt structured conformations, suggesting potential functional relevance [alderson et al].

      In contrast, our approach based on ESM2 is more direct: it identifies conserved residues without relying on alignment or presupposing that functional significance requires folding into stable 3D structures. Notably, many of the conserved residues identified in our analysis exhibit low pLDDT scores (Figure 2), implying potential functional roles independent of stable conformations.”

      Comment 1b: Dasmeh et al, Lu et al and Ho & Huang analysed conservation in IDRs, including aromatic residues and their role in phase separation.

      We thank the reviewer for bringing these works to our attention! We now explicitly discuss these studies in both the Discussion section as mentioned above and in the Introduction as quoted below.

      “Evolutionary analysis of IDRs is challenging due to difficulties in sequence alignment [cite], though several studies have attempted alignment of disordered proteins with promising results [Dasmeh et al, Lu et al and Ho & Huang].”

      Comment 1c: A number of groups have performed proteome-wide saturation scans using pLMs, including variants of the ESM family. Examples include Meier (reference 89, but cited about something else) and Cagiada et al (https://doi.org/10.1101/2024.05.21.595203), who analysed variant effects in IDRs using a pLM. Thus, I think statements such as "their applicability to studying the fitness and evolutionary pressures on IDRs has yet to be established" should possibly be qualified.

      We added a new paragraph in the Introduction to discuss the application of protein language models to IDRs and cited the suggested references.

      “While protein language models have been widely applied to structured proteins [cite], it is important to emphasize that these models themselves are not inherently biased toward folded domains. For example, the Evolutionary Scale Model (ESM2) [cite] is trained as a probabilistic language model on raw protein sequences, without incorporating any structural or functional annotations. Its unsupervised learning paradigm enables ESM2 to capture statistical patterns of residue usage and evolutionary constraints without relying on explicit structural information. Thus, the success of ESM2 in modeling the mutational landscapes of folded proteins [cite] reflects the model’s ability to learn sequence-level constraints imposed by natural selection, a property that is equally applicable to IDRs if those regions are also under functional selection. Indeed, protein language models are increasingly being used to analyze variant effects in IDRs [cite].”

      Comment 2: On page 4, the authors write, "The conserved residues are primarily located in regions associated with phase separation." These results are presented as a central part of the work, but it is not completely clear what the evidence is.

      We thank the reviewer for this insightful comment. We realized that our wording was not as precise as it should have been. We meant to state that the regions associated with phase separation are significantly enriched in these conserved residues. This is a significant finding and indicates that phase separation could be a source of evolutionary pressure dictating IDP sequence conservation. However, we do not intend to suggest that phase separation is the only evolutionary pressure.

      The sentence has been revised to

      “Notably, regions associated with phase separation are significantly enriched in these conserved residues.”

      We further replaced the section title "Conserved, Disordered Residues Localize in Regions Driving Phase Separation" with "Regions Driving Phase Separation Are Enriched with Conserved, Disordered Residues" to further clarify our findings and avoid overinterpretation.

      Finally, we revised the following sentence in the discussion

      “Notably, these conserved, disordered residues are predominantly located in regions actively involved in phase separation, contributing to the formation of membraneless organelles.”

      to

      “Notably, regions actively involved in phase separation are enriched with these conserved, disordered residues, supporting their potential role in the formation of membraneless organelles.”

      The submitted manuscript provides clear evidence supporting the enrichment of conserved residues in MLO-driving IDRs. Specifically, Figures 4A and 4C demonstrate that these IDRs exhibit a substantially higher fraction of conserved residues compared to other IDRs involved in phase separation.

      In this analysis, the nMLO-hIDR group serves as a baseline, representing the distribution of conservation in disordered regions lacking MLO-related functions. In contrast, IDRs from MLO-associated groups show a pronounced downward shift in their median and interquartile ranges, indicating stronger evolutionary constraints. Within the dMLO cohort, the degree of conservation follows a distinct gradient: driving residues exhibit the highest levels of conservation, followed by participant residues, with non-participant residues showing values closer to the nMLO baseline. This pattern reflects the relative functional importance of each group in phase separation, with conservation levels corresponding to their roles in MLO scaffolding.

      To further support this, we computed the fraction of conserved amino acids for each IDR. As shown in Figure S11B, this fraction is indeed higher for IDRs that actively contribute to phase separation than for those not involved in phase separation. This analysis is now included in the SI.

      During the revision, we explicitly evaluated whether conserved residues are preferentially located in regions associated with phase separation. To this end, for each protein in the MLO-hProt dataset, we calculated the probability p of finding conserved residues within regions contributing to phase separation. These regions include both "driving" and "participating" segments as defined in Figure 4 of the main text.

      Figure S11A presents the distribution of p across all proteins. For comparison, we also include the distribution of 1− p, representing the probability of finding conserved residues in regions not associated with phase separation. On average, p exceeds 0.5, suggesting a tendency for conserved residues to be more frequently located in phase-separating regions. However, the difference between the two distributions is not statistically significant. This result may be due to the generally low density of conserved residues in IDRs, which makes the estimation of p challenging for individual proteins. Additionally, some conserved sites may be involved in functions unrelated to phase separation.
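
      The per-protein quantity p described above can be illustrated with a minimal sketch. It assumes that each protein is represented by two per-residue boolean masks, one marking conserved residues and one marking residues in "driving" or "participating" segments. The function name, the mask representation, and the toy values are hypothetical and are not taken from the manuscript, which does not show its implementation.

```python
from typing import Optional, Sequence


def conserved_fraction_in_ps(
    is_conserved: Sequence[bool], in_ps_region: Sequence[bool]
) -> Optional[float]:
    """Return p, the fraction of conserved residues lying in phase-separating regions.

    Returns None when the protein has no conserved residues, because p is
    undefined in that case (an assumption of this sketch).
    """
    if len(is_conserved) != len(in_ps_region):
        raise ValueError("masks must have the same length")
    n_conserved = sum(1 for c in is_conserved if c)
    if n_conserved == 0:
        return None
    n_inside = sum(1 for c, r in zip(is_conserved, in_ps_region) if c and r)
    return n_inside / n_conserved


# Toy protein with made-up masks: 3 conserved residues, 2 of them inside a
# driving/participating segment.
conserved = [True, False, True, True, False, False]
in_ps = [True, True, True, False, False, False]
p = conserved_fraction_in_ps(conserved, in_ps)
print(p, 1 - p)  # p and its complement for this single protein
```

      Collecting p and 1 − p over all proteins would give the two distributions that are compared in Figure S11A. Proteins with few conserved residues yield coarse values of p, which is consistent with the estimation difficulty noted above.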

      We added the following text to the Discussion section of the main text.

      “We emphasize that the results presented in Figure 4 do not directly demonstrate that conserved residues are preferentially located in regions associated with phase separation. Although these regions are more enriched in conserved amino acids, their total sequence length can be smaller than that of non-phase-separating regions. As a result, the absolute number of conserved residues may still be higher outside phase-separating regions. To quantitatively assess this, we calculated, for each protein in the MLO-hProt dataset, the probability p of finding conserved residues within regions contributing to phase separation. These regions include both "driving" and "participating" segments, as defined in Figure 4 of the main text. Figure S11 shows the distribution of p across all proteins. For comparison, we also present the distribution of 1− p, which reflects the probability of finding conserved residues in non-phase-separating regions. While the average value of p exceeds 0.5, indicating a trend toward conserved residues being more frequently located in phase-separating regions, the difference between the two distributions is not statistically significant. Future studies with expanded datasets may be necessary to clarify this trend.”

      Comment 3: It would be useful with an assessment of what controls the authors used to assess whether there are folded domains within their set of IDRs.

      We acknowledge that our previous labeling may have caused some confusion. Protein sequences used in Figures 2 and 3 include both folded and disordered domains. Results presented in these figures were constructed using full-length protein sequences to highlight the similarities and differences in ESM2 scores between folded and disordered domains.

      In contrast, the analyses presented in Figures 4 and 5 focus exclusively on IDRs to examine their role in phase separation.

      To prevent further confusion, we have renamed the dataset used in Figures 2 and 3 as MLO-hProt, emphasizing that the analysis pertains to entire protein sequences. The term MLO-hIDR is now reserved for a new dataset that includes only disordered residues, as used in Figures 4 and 5, and corresponding SI Figures.

      For the dMLO-IDR dataset, all except one amino acid (P40967, residue G592) are annotated as disordered in the MobiDB database (https://mobidb.org/). This database characterizes disordered regions based on a combination of predictive algorithms and experimental data. As illustrated in Figure S5A, 25.5% of the proteins in the dataset have direct experimental evidence supporting their disorderedness. These experimental annotations are derived from a diverse range of techniques (Figure S5B). For the remaining proteins, disorder was predicted by one or more computational tools. Although not all tools were applied to every protein, each protein in the dataset was identified as disordered by at least one method.

      For human proteins, IDRs were identified based on AlphaFold2 pLDDT scores, using a threshold of 70. As established in prior studies [1, 2], the pLDDT score provides a quantitative measure of local structural confidence, with lower values indicating greater structural disorder. IDRs associated with conditional folding or disorder-to-order transitions generally exhibit high pLDDT values (e.g., >70).

      Author response image 2 shows a violin plot of AlphaFold2 pLDDT scores for the various MLO-hIDR groups. The consistently low scores support the conclusion that these regions are structurally disordered.

      We also cross-checked the MLO-hIDR regions against the MobiDB database. As shown in Figure S6, approximately 76% of the proteins in the dataset are predicted to contain disordered regions. Among the non-labeled segments with pLDDT scores ≤ 70, the majority are relatively short, with segments of 1–5 residues accounting for approximately 80%.

      Author response image 2.

      AlphaFold pLDDT scores of hIDRs in different MLO-related groups.

      In addition to renaming the dataset, we also revised the manuscript to highlight the validation of disorderedness in the Results section: Regions Driving Phase Separation Are Enriched with Conserved, Disordered Residues.

      “The presence of evolutionarily conserved disordered residues raises the question of their functional significance. To explore this, we identified disordered regions of MLO-hProt using a pLDDT score less than 70 and partitioned these regions into two categories: drivers (dMLO-hIDR), which actively drive phase separation, and clients (cMLO-hIDR), which are present in MLOs under certain conditions but do not promote phase separation themselves [cite]. Additionally, IDRs from human proteins not associated with MLOs, termed nMLO-hIDR, were included as a control. To enhance statistical robustness, we extended our dataset by incorporating driver proteins from additional species [cite], resulting in the expanded dMLO-IDR dataset. Beyond the pLDDT-based classification, the majority of residues in these datasets are also predicted to be disordered by various computational tools and supported by experimental evidence (Figures S5 and S6).”

      Recommendation 1: The authors use the terms "evolutionary fitness of IDRs" (abstract and p. 5, for example), "fitness of amino acids" (p. 4), and "quantify the fitness of particular residues at specific sites" (p. 5). It is not clear what is meant by fitness in this context.

      We thank the reviewer for pointing out the ambiguity in the term fitness. To enhance clarity, we have replaced “fitness” with “mutational tolerance” to more directly emphasize the evolutionary conservation of specific residues.

      Recommendation 2: The authors write (p. 6) "Previous studies have demonstrated a strong correlation between ESM2 scores and changes in free energy related to protein structure stability". While that may be true, it might be worth noting that ESM2 scores report on the effects of mutations and function more broadly than stability because these models have previously been shown to capture conservation effects beyond stability.

      We fully agree with the reviewer’s comment and have revised the main text accordingly. Specifically, the referenced sentence has been revised and relocated, as shown below.

      “Our analysis demonstrated that HP1α’s structured domains consistently yield low ESM2 scores, reflecting strong mutational constraints characteristic of folded regions. These constraints are further evident in the local LLR predictions, as shown in Figure 2B, where we illustrate the folded region G120-T130. Given the functional importance of preserving the 3D structure of structured domains, mutations with greater detrimental effects are likely to disrupt protein folding substantially. This interpretation is consistent with previous studies reporting a significant correlation between ESM2 LLRs and changes in free energy associated with protein structural stability [cite].”

      Recommendation 3: p. 10: The authors write "To exclude sequences that no longer qualify as homologs, we filtered for sequences with at least 20% identity to the reference". How did they decide on 20% and why? And over which residues is this 20% calculated?

      We apologize for the earlier lack of clarity. Sequence alignment was performed using the full-length protein sequences, encompassing both folded and disordered regions. For each sequence, we calculated the percent identity by counting the number of positions, denoted as n, at which the amino acid matches the reference. The percent identity was then computed as n/N, where N represents the total length of the aligned reference sequence. This total includes residues in folded and disordered regions, as well as gap positions introduced during alignment.
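
      The identity calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the function names and the toy alignment are hypothetical, alignment rows are assumed to be equal-length strings with "-" as the gap character, and gap-to-gap positions are assumed not to count as matches.

```python
from typing import List


def percent_identity(reference_aligned: str, homolog_aligned: str) -> float:
    """Return n/N for one aligned homolog.

    n: positions where the homolog's amino acid matches the reference.
    N: total length of the aligned reference, gap positions included.
    """
    if len(reference_aligned) != len(homolog_aligned):
        raise ValueError("aligned sequences must have the same length")
    n_total = len(reference_aligned)
    if n_total == 0:
        raise ValueError("empty alignment")
    n_match = sum(
        1
        for ref, hom in zip(reference_aligned, homolog_aligned)
        if ref == hom and ref != "-"  # assumption: gap-gap is not a match
    )
    return n_match / n_total


def filter_homologs(
    reference_aligned: str, homologs: List[str], min_identity: float = 0.2
) -> List[str]:
    """Keep homologs with at least `min_identity` identity to the reference."""
    return [
        h for h in homologs
        if percent_identity(reference_aligned, h) >= min_identity
    ]


# Toy alignment with made-up sequences.
reference = "MK-TAYIAKQ"
close_homolog = "MKLTA-IARQ"    # 7 matching positions out of 10
distant_homolog = "M-LSG-VGRE"  # 1 matching position out of 10
print(percent_identity(reference, close_homolog))
print(filter_homologs(reference, [close_homolog, distant_homolog]))
```

      Because N includes gap columns, a homolog that aligns only to a short stretch of the reference receives a low identity even if that stretch matches perfectly, which is the behaviour the filter relies on.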

      We updated the Methods section of the main text to clarify.

      “We performed multi-sequence alignment (MSA) analysis using HHblits from the HH-suite3 software suite [citations], a widely used open-source toolkit known for its sensitivity in detecting sequence similarities and identifying protein folds. HHblits builds MSAs through iterative database searches, sequentially incorporating matched sequences into the query MSA with each iteration. Sequence alignment was performed using the full-length protein sequences, encompassing both folded and disordered regions.

      ...

      To refine alignment quality by focusing on closely related homologs, we filtered out sequences with < 20% identity to the query, excluding weakly related sequences where only short segments show similarity to the reference. For each sequence, we calculated the percent identity by counting the number of positions, denoted as n, at which the amino acid matches the reference. The percent identity was then computed as n/N, where N represents the total length of the aligned reference sequence. This total includes residues in folded and disordered regions, as well as gap positions introduced during alignment.”

      We selected a 20% sequence identity threshold to balance inclusion of true homologs with exclusion of distant matches that may not share functional relevance. To determine this cutoff, we compared identity thresholds of 0%, 10%, 20%, and 40% and examined the resulting distributions of conservation and ESM2 scores across aligned residues for the MLO-hProt dataset (Author response image 3). Thresholds of 10%, 20%, and 40% produced qualitatively similar results, with a consistent correspondence between low ESM2 scores and high conservation. Lower thresholds introduced highly divergent sequences that added noise to the alignment, resulting in reduced overall conservation scores. In contrast, higher thresholds excluded homologs with potentially meaningful conservation, particularly in disordered regions where conservation scores tend to be relatively low.

      Author response image 3.

      Histograms of the ESM2 score and the conservation score, presented in a format consistent with Figure 3B of the main text. The conservation scores were computed using aligned sequences with identity thresholds of ≥0%, ≥10%, ≥20%, and ≥40% (left to right). Contour lines represent different levels of −log P(CS, ESM2), where P is the joint probability density of conservation score (CS) and ESM2 score. Contours are spaced at 0.5-unit intervals, highlighting regions of distinct density.

      Recommendation 4: In their description of the "motif" searching algorithm (p. 20), I think that the search algorithm would give a different result depending on whether the search is performed N->C or C->N (because the first residue (i) needs to have a score <0.5, but the last (j) could have a score >0.5 as long as the average is below 0.5). Is that correct? And if so, why did they choose an asymmetric algorithm?

      We thank the reviewer for highlighting the asymmetry in our motif-search algorithm.

      To investigate this issue, we repeated the algorithm starting from the C-terminus and compared the resulting motifs with those obtained from the N-terminal scan. We found that the two sets of motifs overlap entirely: each motif identified from the C-terminal direction has a corresponding counterpart from the N-terminal scan. However, the motifs are not identical. The directionality of the search introduces additional amino acids—referred to here as peripheral residues—at the motif boundaries, which differ between the two sets.

      As shown in Author response image 4, the number of peripheral residues is small relative to the total motif length.

      To eliminate asymmetry and ambiguity, we have revised our method to perform bidirectional scans—from both the N- and C-termini—and define each motif as the overlapping region identified by both directions. This approach emphasizes the conserved core and avoids the inclusion of spurious terminal residues. The updated procedure is described in Methods: Motif Identification.

      “To identify motifs within a given IDR, we implemented the following iterative procedure. Starting from either the N- or C-terminus of the sequence, we first locate the initial residue i whose ESM2 score falls below 0.5. From i, residues are sequentially appended…”

      Author response image 4.

      Number of peripheral residues, and their length relative to the full motif length, for motifs identified from each direction. (A) Unique motifs identified in the N-to-C terminal direction. (B) Unique motifs identified in the C-to-N terminal direction.

      “…in the direction toward the opposite terminus until the segment’s average ESM2 score exceeds 0.5; the first residue to breach this threshold is denoted j. The segment (i,i+1,..., j−1) is then recorded as a candidate motif. This process repeats starting from j until the end of the IDR is reached.

      We perform this full procedure independently from both termini and designate the final motif as the intersection of the two candidate-motif sets. This bidirectional overlap strategy excludes terminal residues that might transiently satisfy the average-score criterion only due to adjacent low-scoring regions, thereby isolating the conserved core of each motif. All other residues—those not included in either directional pass—are classified as non-motif regions, minimizing peripheral artifacts.”
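
      The bidirectional procedure quoted above can be sketched in a few lines. This is one possible reading of the description rather than the authors' implementation: the function names, the half-open index convention, the toy scores, and the residue-wise interpretation of "intersection" are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, int]  # half-open [start, end) residue indices


def scan_motifs(scores: Sequence[float], threshold: float = 0.5) -> List[Segment]:
    """One-directional scan over per-residue ESM2 scores."""
    segments: List[Segment] = []
    n = len(scores)
    pos = 0
    while pos < n:
        if scores[pos] >= threshold:  # residue i must score below the threshold
            pos += 1
            continue
        start, end, total = pos, pos, 0.0
        while end < n:
            total += scores[end]
            if total / (end - start + 1) > threshold:
                break  # `end` is residue j, the first to breach the average
            end += 1
        segments.append((start, end))
        pos = end  # the scan resumes from j
    return segments


def bidirectional_motifs(
    scores: Sequence[float], threshold: float = 0.5
) -> List[Segment]:
    """Residue-wise intersection of the N-to-C and C-to-N candidate motifs."""
    n = len(scores)
    forward = scan_motifs(scores, threshold)
    # Scan the reversed sequence, then map indices back to N-to-C coordinates.
    backward = [(n - e, n - s) for s, e in scan_motifs(scores[::-1], threshold)]
    in_fwd = {i for s, e in forward for i in range(s, e)}
    in_bwd = {i for s, e in backward for i in range(s, e)}
    core = sorted(in_fwd & in_bwd)
    motifs: List[Segment] = []
    for i in core:  # group surviving residues into contiguous runs
        if motifs and motifs[-1][1] == i:
            motifs[-1] = (motifs[-1][0], i + 1)
        else:
            motifs.append((i, i + 1))
    return motifs


# Toy IDR with made-up scores. The residue scoring 0.9 is absorbed by the
# N-to-C scan only, so it is a peripheral residue and is trimmed.
toy_scores = [2.0, 0.1, 0.1, 0.9, 2.0, 0.2, 2.0]
print(scan_motifs(toy_scores))
print(bidirectional_motifs(toy_scores))
```

      In the toy input, the forward scan extends its first candidate over the residue scoring 0.9 because the running average is still below 0.5, whereas the reverse scan never starts a segment there. Taking the intersection removes that residue, which illustrates how the bidirectional overlap discards peripheral residues.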

      Accordingly, we have updated the Supplementary material (ESM2_motif_with_exp_ref.csv) to list the newly identified motifs that are common to both the N-terminal and C-terminal searches. The set of motifs discussed in the text changed only slightly, and these changes do not affect the main conclusions. Figures 5C, 5D, and S6 have been revised accordingly.

      Reviewer #2:

      Summary:

      Unfortunately, I do not believe that the results can be trusted. ESM2 has not been validated for IDRs through experiments. The authors themselves point out its little use in that context. In this study, they do not provide any further rationale for why this situation might have changed. Furthermore, they mention that experimental perturbations of the predicted motifs in in vivo studies may further elucidate their functional importance, but none of that is done here. That some of the motifs have been previously validated does not give any credibility to the use of ESM2 here, given that such systems were probably seen during the training of the model.

      We thank the reviewer for their detailed and thoughtful critique of our manuscript. We recognize the importance of careful model validation, especially in the context of IDRs, and appreciate the opportunity to clarify the scope and rationale of our study. Below, we respond point-by-point to the main concerns.

      (1) The use of ESM2 is not validated for IDRs, and the authors provide no rationale for its applicability in this context.

      We thank the reviewer for raising this important point.

      First, we emphasize that ESM2 is a probabilistic language model trained entirely on amino acid sequences, without any structural supervision. The model does not receive any input about protein structure — folded or disordered — during training. Instead, it learns to estimate the likelihood of each amino acid at a given position, conditioned on the surrounding sequence context. This makes ESM2 agnostic to whether a sequence is folded or disordered; the model’s capacity to identify patterns of residue usage arises solely from the statistics of natural sequences.

      As such, ESM2 is not inherently biased toward folded proteins, even though previous studies have demonstrated its usefulness in identifying conserved and functionally constrained residues in structured domains [3–9]. These findings support the broader utility of language models for uncovering evolutionary constraints — and by extension, suggest that similar signatures could exist in IDRs, particularly if they are under functional selection.

      Indeed, if certain residues or motifs in IDRs are conserved due to their importance in biological processes (e.g., phase separation), we would expect such selection to be reflected in sequence-based features, which ESM2 is designed to detect. The model’s applicability to IDRs, then, is a natural extension of its core probabilistic architecture.

      To further evaluate this, we carried out an independent in silico validation using multiple sequence alignments (MSAs). This analysis allowed us to compute the evolutionary conservation of individual amino acids without any reliance on ESM2. We then compared these conservation scores to ESM2 scores and found a strong correlation between the two. This provides direct, quantitative support for the idea that ESM2 is capturing biologically meaningful sequence constraints — even in disordered regions.
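
      At its core, the comparison described above is a correlation between two per-residue score vectors. The sketch below shows a Pearson correlation on made-up numbers. The variable names and values are hypothetical, and the manuscript's actual conservation-score calculation is not reproduced here.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-residue values: low ESM2 scores paired with high conservation.
esm2_scores = [0.2, 0.4, 1.5, 2.5, 3.0]
conservation_scores = [0.9, 0.8, 0.5, 0.3, 0.1]
r = pearson_r(esm2_scores, conservation_scores)
print(round(r, 3))
```

      With the convention that low ESM2 scores mark constrained residues, agreement between the two measures shows up as a negative coefficient.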

      While we agree that experimental testing would ultimately provide the most compelling validation, we believe that our MSA-based comparison constitutes a strong and arguably ideal computational validation of the model’s predictions. It offers an orthogonal measure of evolutionary pressure that confirms the biological plausibility of ESM2 scores.

      We added the following text in the introduction to highlight the applicability of ESM2 to IDRs.

      “While protein language models have been widely applied to structured proteins, it is important to emphasize that these models themselves are not inherently biased toward folded domains. For example, the Evolutionary Scale Model (ESM2) [cite] is trained as a probabilistic language model on raw protein sequences, without incorporating any structural or functional annotations. It operates by estimating the likelihood of observing a given amino acid at a particular position, conditioned on the entire surrounding sequence context. This unsupervised learning paradigm enables ESM2 to capture statistical patterns of residue usage and evolutionary constraints without relying on explicit structural information. Thus, the success of ESM2 in modeling fitness landscapes of folded proteins reflects the model’s ability to learn sequence-level constraints imposed by natural selection, a property that is equally applicable to IDRs if those regions are also under functional selection. Indeed, protein language models are increasingly being used to analyze variant effects in IDRs [cite].”

      (2) There is no experimental validation of the ESM2-based predictions in this study.

      We agree that experimental validation would provide definitive support for the utility of ESM2 in IDRs, and we explicitly state this as a limitation in the revised manuscript as quoted below.

      “Limitations: Despite the promising findings, our study has several limitations. Most notably, our analysis is purely computational, relying on ESM2-derived predictions and sequence-based conservation without accompanying experimental validation. While the strong correlation between ESM2 scores and evolutionary conservation provides compelling evidence that the identified motifs are functionally constrained, the precise biological roles of these motifs remain uncharacterized. ESM2 is well-suited for highlighting regions under selective pressure, but it does not provide mechanistic insights into how conserved motifs contribute to specific molecular functions such as phase separation, molecular recognition, or dynamic regulation. Determining these roles will require targeted experimental investigations, including mutagenesis and biophysical characterization.”

      In addition, we revised the manuscript title from “Protein Language Model Identifies Disordered, Conserved Motifs Driving Phase Separation” to “Protein Language Model Identifies Disordered, Conserved Motifs Implicated in Phase Separation”. This revision softens the original claim to better reflect the absence of direct experimental evidence for the motifs’ role in phase separation.

      However, we also emphasize that the goal of our study is not to claim definitive predictive power, but rather to explore whether ESM2-derived mutational profiles align with known biological features of IDRs — and in doing so, to generate new, testable hypotheses.

      In addition, while no in vivo experiments were performed, our study does include an in silico validation step, as detailed in the response to the previous comment. The strong correlation between ESM2 scores and conservation scores provides direct support for the utility of ESM2 in identifying residues under evolutionary constraint in disordered regions.

      (3) The overlap between predicted motifs and known ones may be due to training data leakage.

      We respectfully clarify that training data leakage is not possible in this case, as ESM2 is trained using unsupervised learning on raw protein sequences alone. The model has no access to experimental annotations, functional labels, or knowledge of which motifs are involved in phase separation. It only models statistical sequence patterns derived from evolutionarily observed proteins.

      Therefore, any agreement between ESM2-derived predictions and previously validated motifs arises not from memorization of experimental data, but from the model’s ability to learn meaningful sequence constraints from the natural distribution of proteins.

      (4) The authors should revamp the study with a testable predictive framework.

      We respectfully suggest that a full revamp is not necessary or appropriate in this context.

      As outlined in our previous responses, we believe that certain misunderstandings about the nature and capabilities of ESM2 may have influenced the reviewer’s assessment.

      Importantly, both Reviewer 1 and Reviewer 3 express strong support for the significance and novelty of this work, and recommend publication following minor revisions.

      In this context, we believe the manuscript provides a useful contribution as a first step toward understanding disordered regions using language models, and that it has value even in the absence of direct experimental testing. We have now better positioned the manuscript in this light, clarified limitations, and suggested concrete next steps for follow-up research.

      We hope these clarifications and revisions address the reviewer’s concerns, and we thank them again for helping us strengthen the framing, rigor, and clarity of our study.

      Reviewer #3:

      Summary:

      This is a very nice and interesting paper to read about motif conservation in protein sequences, mainly in IDRs, using the ESM2 language model. The topic of the paper is timely, with strong biological significance. The paper can be of great interest to the scientific community in the field of protein phase transitions and future applications using the ESM models. The ability of ESM2 to identify conserved motifs is crucial for disease prediction, as these regions may serve as potential drug targets. Therefore, I find these findings highly significant, and the authors strongly support them throughout the paper. The work motivates the scientific community towards further motif exploration related to diseases.

      Strengths:

      (1) Revealing conserved regions in IDRs by the ESM-2 language model.

      (2) Identification of functionally significant residues within protein sequences, especially in IDRs.

      (3) Findings supported by useful analyses.

      We appreciate the reviewer’s thoughtful words and support for our work.

      Weaknesses:

      (1) Lack of examples demonstrating the potential biological functions of these conserved regions.

      As detailed in the Response to Recommendation 6, we conducted additional analyses to connect the identified conserved regions with their biological functions.

      (2) Very limited discussion of potential future work and of limitations.

      We have substantially revised the Conclusions and Discussion section to provide a detailed analysis of the study’s limitations and to propose several directions for future research, as elaborated in our Response to Recommendation 5 below.

      Recommendation 1: The authors describe the ESM2 score such that lower scores are associated with conserved residues, stating that "lower scores indicate higher mutational constraint and reduced flexibility, implying that these residues are more likely essential for protein function, as they exhibit fewer permissible mutational states." However, when examining intrinsically disordered regions (IDRs), which are known to drive phase separation, I observe that the ESM2 score is relatively high (Figure 3C, pLDDT < 50, and Supplementary Figure S2). Could the authors clarify how this relatively high score aligns with the conservation of motifs that drive phase separation?

      We thank the reviewer for this insightful comment. We would like to clarify that most amino acids in the IDRs are not conserved, even for IDRs that contribute to phase separation. Only a small set of amino acids in these IDRs, which we term motifs, are evolutionarily conserved with low ESM2 scores. Therefore, the ESM2 scores exhibit a bimodal distribution at high and low values, as shown in Figures 4A and 4C of the manuscript. When averaged over all the amino acids, the mean ESM2 scores, plotted in Figure 3C, are relatively high due to the dominant population of non-conserved amino acids.

      Recommendation 2: The authors mention: "We first analyzed the relationship between ESM2 and pLDDT scores for human Heterochromatin Protein 1 (HP1, residues 1-191)". I appreciate this example as a demonstration of amino acid conservation in IDRs. However, I wonder whether the authors could provide some more examples to support amino acid conservation, particularly within the IDRs, along with lower ESM2 scores (e.g., could the authors provide some additional examples of "conserved disordered" regions in various proteins that are associated with relatively low ESM2 scores, as appears in Figure 2A?).

      We thank the reviewer for this valuable suggestion. We would like to note that conserved residues in IDRs are prevalent, as indicated in Figures 2D and 3B. To further illustrate the prevalence of “conserved disordered” regions, we generated ESM2 versus pLDDT score plots for the full dMLO–hProt dataset (82 proteins) in Figure S2. In these plots, residues with pLDDT ≤ 70 are highlighted in blue to denote structural disorder (dMLO-hIDR), and disordered residues with ESM2 score ≤ 1.5 are shown in purple to indicate conserved disordered segments.
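      The two thresholds quoted above can be illustrated with a minimal sketch. It assumes per-residue pLDDT and ESM2 scores are already available as plain lists; the function names and labels are hypothetical, and this is not the authors' motif identification procedure, only a literal reading of the cutoffs (pLDDT ≤ 70 for disorder, ESM2 score ≤ 1.5 for conservation).

```python
# Thresholds quoted in the response; all names below are hypothetical.
PLDDT_DISORDER_MAX = 70.0   # pLDDT <= 70 is treated as disordered
ESM2_CONSERVED_MAX = 1.5    # ESM2 score <= 1.5 is treated as conserved


def classify_residue(plddt, esm2_score):
    """Label one residue from its pLDDT and ESM2 score."""
    if plddt > PLDDT_DISORDER_MAX:
        return "structured"
    if esm2_score <= ESM2_CONSERVED_MAX:
        return "conserved_disordered"
    return "disordered"


def conserved_disordered_segments(plddt, esm2):
    """Group consecutive conserved disordered residues.

    Returns a list of (start, end) index pairs, 0-based and inclusive.
    """
    if len(plddt) != len(esm2):
        raise ValueError("score lists must have the same length")
    segments = []
    start = None
    for i, (p, s) in enumerate(zip(plddt, esm2)):
        hit = classify_residue(p, s) == "conserved_disordered"
        if hit and start is None:
            start = i                         # a new segment begins
        elif not hit and start is not None:
            segments.append((start, i - 1))   # the segment just ended
            start = None
    if start is not None:
        segments.append((start, len(plddt) - 1))
    return segments


# Toy scores, invented for illustration only.
toy_plddt = [90, 60, 55, 40, 95, 30]
toy_esm2 = [0.5, 1.0, 1.5, 3.0, 0.2, 1.2]
print(conserved_disordered_segments(toy_plddt, toy_esm2))
```

      Grouping consecutive hits into segments is one simple way to move from per-residue labels to candidate stretches; the manuscript's Methods should be consulted for the actual motif definition.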

      Recommendation 3: Could the authors plot a Violin conservation score plot for Figure 4A to emphasise the relationship between ESM2 scores and conservation scores of disordered residues?

      We thank the reviewer for this suggestion. We have included a violin plot illustrating the distribution of conservation scores for disordered residues across all four IDR groups, shown in Author response image 5. Consistent with the findings in Figure 4A, the phase separation drivers (dMLO-hIDR and dMLO-IDR) exhibit a higher proportion of conserved amino acids than the client group (cMLO-hIDR).

      We also note that the nMLO-hIDR group may contain conserved residues due to functions unrelated to MLO formation, which could contribute to the higher observed levels of conservation in this group.

      Author response image 5.

      Violin plots illustrating the distribution of conservation scores for disordered residues across the nMLO–hIDR, cMLO–hIDR, dMLO–hIDR, and dMLO–IDR datasets. Pairwise statistical comparisons were conducted using two-sided Mann–Whitney U tests on the conservation score distributions (null hypothesis: the two groups have equal medians). P-values indicate the probability of observing rank differences at least as extreme as those observed, assuming the null hypothesis is true. Statistical significance is denoted as follows: ***: p < 0.001; **: p < 0.01; *: p < 0.05.
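      For readers unfamiliar with the test named in this caption, the following is a minimal standard-library sketch of a two-sided Mann–Whitney U test. It uses the large-sample normal approximation with midranks for ties and a tie-corrected variance, without continuity correction; the authors' software and settings are not stated here, so this should be read as an illustration of the technique rather than a reproduction of their analysis. The function name is hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Ties receive midranks and the variance is tie-corrected; no continuity
    correction is applied. Returns (U statistic for x, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool the samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                      # pooled[i:j] share the same value
        t = j - i
        midrank = (i + 1 + j) / 2.0     # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t**3 - t
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0                 # all values identical
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return u_x, p_two_sided


# Toy data, invented for illustration only.
u, p = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u, round(p, 3))
```

      For small samples an exact test is usually preferred over the normal approximation used in this sketch.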

      Recommendation 4: It will be appreciated if the authors could add to Figure 4 Violin plots, a statistical comparison between the different groups.

      We thank the reviewer for this valuable suggestion. We included the p-values for Figures 4A and 4C to quantify the statistical significance of differences in the distributions.

      Most comparisons are highly significant (p < 0.001), while the largest p-value (p = 0.089), obtained for the comparison of conservation scores between the driving and non-participating groups (Figure 4C), still suggests a marginally significant trend.

      Recommendation 5: Could the authors expand more on potential future research directions using ESM2, given its usefulness in identifying conserved motifs? Specifically, how do the authors envision conserved motifs will contribute to future discoveries/applications/models using ESM (e.g, discuss the importance of conserved motifs, especially in IDRs motifs, in protein phase transition prediction in relation to diseases).

      We thank the reviewer for this insightful comment. To further assess the functional relevance of the conserved motifs, we incorporated pathogenic variant data from ClinVar [10, 11] to evaluate mutational impacts. As shown in Figure S12A and B, a substantial number of pathogenic variants in MLO-hProt proteins are associated with low ESM2 LLR values. This pattern holds for both folded and disordered residues.

      Moreover, we observed that variants located within motifs are more frequently pathogenic compared to those outside motifs (Figure S12C). In the main text, motifs were defined only for driver proteins; however, the available variant data for this subset are limited (6 data points). To improve statistical power, we extended motif identification to include both client and driver human proteins, following the same methodology described in the main text. Consistent with previous findings, variants within motifs in this expanded set are also more likely to be pathogenic. These results further support the functional importance of both low ESM2-scoring residues and the conserved motifs in which they reside.

      The following text was added in the Discussion section of the manuscript to discuss these results and outline future research directions.

      “Several promising directions could extend this work, both to refine our mechanistic understanding and to explore clinical relevance. One avenue is testing the hypothesis that conserved motifs in scaffold proteins act as functional stickers, mediating strong intermolecular interactions. This could be evaluated computationally via free energy calculations or experimentally via interaction assays. Deletion of such motifs in client proteins may also reduce their partitioning into condensates, illuminating their roles in molecular recruitment.

      To explore potential clinical implications, we analyzed pathogenicity data from ClinVar [10, 11]. As shown in Figure S12A, single-point mutations with low LLR values—indicative of constrained residues—are enriched among clinically reported pathogenic variants, while benign variants typically exhibit higher LLR values. Moreover, mutations within conserved motifs are significantly more likely to be pathogenic than those in non-motif regions (Figure S12B). These findings highlight the potential of ESM2 as a first-pass screening tool for identifying clinically relevant residues and suggest that the conserved motifs described here may serve as priorities for future studies, both mechanistic and therapeutic.”

      Moreover, the functional significance of conserved motifs, particularly their implications in disease and pathology, warrants further investigation. As an initial analysis, we incorporated ClinVar pathogenic variant data [citation] to assess mutational effects within our datasets. As illustrated in Figure R12A, single-point mutations with low LLR values are enriched among clinically reported pathogenic variants, whereas benign variants are more commonly associated with higher LLR values. Notably, mutations within conserved motifs are substantially more likely to be pathogenic compared to those in non-motif regions. These findings highlight the potential of ESM2 as a first-pass tool for identifying residues of clinical relevance. The conserved motifs identified here may be prioritized in future studies aimed at elucidating their biological roles and evaluating their viability as therapeutic targets.

      Recommendation 6: The authors mention: "Our findings provide strong evidence for evolutionary pressures acting on specific IDRs to preserve their roles in scaffolding phase separation mechanisms, emphasizing the functional importance of entire motifs rather than individual residues in MLO formation." They also present a word cloud of functional motifs in Figure 5D. Although it makes sense that evolutionarily conserved motifs, especially within the IDRs regions, act as functional units, I think there is no direct evidence for such functionality (e.g., examples of biological pathways associated with IDRs and phase separation). Hence, there is no justification to write in the figure caption: "ESM2 Identifies Functional Motifs in driving IDRs" unless the authors provide some examples of such functionality. This will even make the paper stronger by establishing a clear connection to biological pathways, and hence these motifs can serve as potential drug targets.

      We thank the reviewer for this insightful suggestion. We have replaced “functional motifs” with “conserved motifs” in the figure caption.

      Identifying the precise biological pathways associated with the conserved motifs is a complex task and a comprehensive investigation lies beyond the scope of this study. Nonetheless, as an initial effort, we explored the potential functions of these motifs using annotations available in DisProt (https://disprot.org/).

      DisProt is the leading manually curated database dedicated to IDPs, providing both structural and functional annotations. Expert curators compile experimentally validated data, including definitions of disordered regions, associated functional terms, and supporting literature references. Author response image 6 presents a representative DisProt entry for DNA topoisomerase 1 (UniProt ID: P11387), illustrating its structural and biological annotation.

      For each motif, we located the corresponding DisProt entry and assigned a functional annotation based on the annotated IDR from which the motif originates. We emphasize that this functional assignment should be regarded as an approximation. Because experimental annotations often pertain to the entire IDR, regions outside the motif may also contribute to the reported function.

      Nevertheless, the annotations provide valuable insights.

      Author response image 6.

      Screenshot of information provided by the DisProt database. Detailed annotations of biological functions and structural features, along with experimental references, are accessible via mouse click.

      Approximately 50% of ESM2-predicted IDR motifs lack functional annotations. Among those that are annotated, motifs from the dMLO-IDR dataset are predominantly associated with “molecular condensate scaffold activity,” followed by various biomolecular binding functions (Author response image 7A). These findings support the role of these motifs in MLO formation.

      For comparison, we applied the same identification procedure (described in Methods: Motif Identification) to motifs from the nMLO-hIDR dataset. In contrast to the dMLO-IDR motifs, these exhibit a broader range of annotated functions related to diverse cellular processes. Collectively, these results suggest that motifs identified by ESM2 are aligned with biologically relevant functions captured in current databases.

      Finally, as illustrated in Figure S12 and discussed in the Response to Recommendation 5, variants occurring within identified motifs are more likely to be pathogenic than those in non-motif regions, further underscoring their functional importance.

      Author response image 7.

      Biological functions of ESM2-predicted motifs. (A) Distribution of biological functions associated with all identified motifs from dMLO-IDR driving groups. (B) Distribution of biological functions associated with all identified motifs from nMLO-hIDR groups.

      Recommendation 7: In Figure 2C the authors present FE (I assume this is free energy), some discussion about the difference in the free energy referring to the "a" region is missing (i.e. both "Folded" and "Disordered" regions are associated with low ESM score but with low and high free energy (FE), respectively.

      We thank the reviewer for the comment. FE indeed stands for free energy. To improve clarity and avoid confusion, we have updated all figure captions by replacing “FE” with “−logP” to explicitly denote the negative logarithm of the probability in the contour density plots.

      We used “a” in Figures 2C and 2D to refer to regions with low ESM2 scores, which appear as a local minimum in both plots. Since most residues in folded regions are conserved, region a has a lower free energy than region b in Figure 2C. In contrast, most residues in disordered regions are not conserved, as elaborated in our Response to Recommendation 1, so region a in Figure 2D has a lower population and a higher free energy than region b.

      To avoid confusion, we have replaced “a” and “b” in Figure 2D with “I” and “II”.
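      The relationship between population and −logP used in this response can be sketched as follows. The example assumes the contour plots are built from a two-dimensional histogram of residue counts (for instance, binned by ESM2 score and pLDDT); the natural logarithm, the absence of any offset, and the function name are assumptions, since the response does not specify them.

```python
import math


def neg_log_p(counts):
    """Convert a 2D grid of histogram counts to -ln(P).

    P is the fraction of all counts falling in each bin; empty bins map to
    +infinity because their estimated probability is zero.
    """
    total = sum(sum(row) for row in counts)
    if total <= 0:
        raise ValueError("histogram must contain at least one count")
    return [
        [-math.log(c / total) if c > 0 else math.inf for c in row]
        for row in counts
    ]


# Toy counts, invented for illustration only.
toy_counts = [
    [80, 20],
    [0, 100],
]
surface = neg_log_p(toy_counts)
# The sparsely populated bin (20 counts) sits higher on the -ln(P) surface
# than the well populated bin (80 counts).
print(surface[0][0] < surface[0][1])
```

      This is why a region with few residues, such as the conserved population in disordered regions, appears at a higher value on the surface even when it is a local minimum.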

      Recommendation 8: Figure S2: It would be useful to plot the same figure for structured and disordered regions as well.

      We are not certain we fully understood this comment, as we believe the requested analysis has already been addressed. In Figure S2, we used the AlphaFold2 pLDDT score to represent the structural continuum of different protein regions, where residues with pLDDT > 70 (red and light-red bars) are classified as structured, while those with pLDDT ≤ 70 (blue and light-blue bars) are classified as disordered.

      Minor suggestion 1: Could the authors clarify the meaning of the abbreviation "FE" in the colorbar of the contour line? I assume this is free energy.

      We have updated all contour density plot figure captions by replacing “FE” with “−logP” to explicitly denote the negative logarithm of the probability.

      Minor suggestion 2: In Figure 2A - do the authors mean "Conserved folded" instead of just "Folded"? If so, could the authors indicate this?

      We thank the reviewer for this comment. The ESM2 scores indeed suggest that, within folded regions, there may be multiple distinct groups exhibiting varying degrees of evolutionary conservation. However, as our primary focus is on IDRs, we chose not to investigate these distinctions further.

      Figure 2A illustrates a randomly selected folded region based on AlphaFold2 pLDDT scores.

      References

      (1) Ruff, K. M.; Pappu, R. V. AlphaFold and Implications for Intrinsically Disordered Proteins. Journal of Molecular Biology 2021, 433, 167208.

      (2) Alderson, T. R.; Pritišanac, I.; Kolarić, Đ.; Moses, A. M.; Forman-Kay, J. D. Systematic Identification of Conditionally Folded Intrinsically Disordered Regions by AlphaFold2. Proceedings of the National Academy of Sciences of the United States of America 2023, 120, e2304302120.

      (3) Brandes, N.; Goldman, G.; Wang, C. H.; Ye, C. J.; Ntranos, V. Genome-Wide Prediction of Disease Variant Effects with a Deep Protein Language Model. Nature Genetics 2023, 55, 1512–1522.

      (4) Lin, Z. et al. Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model. 2023.

      (5) Zeng, W.; Dou, Y.; Pan, L.; Xu, L.; Peng, S. Improving Prediction Performance of General Protein Language Model by Domain-Adaptive Pretraining on DNA-binding Protein. Nature Communications 2024, 15, 7838.

      (6) Gong, J. et al. THPLM: A Sequence-Based Deep Learning Framework for Protein Stability Changes Prediction upon Point Variations Using Pretrained Protein Language Model. Bioinformatics 2023, 39, btad646.

      (7) Lin, W.; Wells, J.; Wang, Z.; Orengo, C.; Martin, A. C. R. Enhancing Missense Variant Pathogenicity Prediction with Protein Language Models Using VariPred. Scientific Reports 2024, 14, 8136.

      (8) Saadat, A.; Fellay, J. Fine-Tuning the ESM2 Protein Language Model to Understand the Functional Impact of Missense Variants. Computational and Structural Biotechnology Journal 2025, 27, 2199–2207.

      (9) Chu, S. K. S.; Narang, K.; Siegel, J. B. Protein Stability Prediction by Fine-Tuning a Protein Language Model on a Mega-Scale Dataset. PLOS Computational Biology 2024, 20, e1012248.

      (10) Landrum, M. J.; Lee, J. M.; Riley, G. R.; Jang, W.; Rubinstein, W. S.; Church, D. M.; Maglott, D. R. ClinVar: Public Archive of Relationships among Sequence Variation and Human Phenotype. Nucleic Acids Research 2014, 42, D980–D985.

      (11) Landrum, M. J. et al. ClinVar: Improving Access to Variant Interpretations and Supporting Evidence. Nucleic Acids Research 2018, 46, D1062–D1067.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Thank you for the thoughtful consideration of our work, including both reviewers’ constructive comments. Our apologies for taking some extra time with this revision: we wanted to address the comments thoroughly with new analyses, and a PhD defense, parental leave, and my teaching ultimately became the bottleneck for the team’s work!

      Reviewer #1 (Public Review):

      The authors use a combination of structural and MD simulation approaches to characterize phospholipid interactions with the pentameric ligand-gated ion channel, GLIC. By analyzing the MD simulation data using clusters of closed and open states derived previously, the authors also seek to compare lipid interactions between putative functional states. The ultimate goal of this work is to understand how lipids shape the structure and function of this channel.

      The strengths of this article include the following:

      1) The MD simulation data provide extensive sampling of lipid interactions in GLIC, and these interactions were characterized in putative closed and open states of the channel. The extensive sampling permits confident delineation of 5-6 phospholipid interaction sites per subunit. The agreement in phospholipid binding poses between structures and the all-atom MD simulations supports the utility of MD simulations to examine lipid interactions.

      2) The study presents phospholipid binding sites/poses that agree with functionally-important lipid binding sites in other pLGICs, supporting the notion that these sites are conserved. For example, the authors identify interactions of POPC at an outer leaflet intersubunit site that is specific for the open state. This result is quite interesting as phospholipids or drugs that positively modulate other pLGICs are known to occupy this site. Also, the effect of mutating W217 in the inner leaflet intersubunit site suggests that this residue, which is highly conserved in pLGICs, is an important determinant of the strength of phospholipid interactions at this site. This residue has been shown to interact with phospholipids in other pLGICs and forms the binding site of potentiating neurosteroids in the GABA(A) receptor.

      Weaknesses of this article include the following:

      1) The authors describe in detail state-dependent lipid interactions from the MD simulations; however, the functional significance of these findings is unclear. GLIC function appears to be insensitive to lipids, although this understanding is based on experiments where GLIC proteoliposomes were fused to oocyte membranes, which may not be optimal to control the lipid environment. Without functional studies of GLIC in model membranes, the lipid dependence of GLIC function is not definitively known. Therefore, it is difficult to interpret the meaning of these state-dependent lipid interactions in GLIC.

      2) It is unlikely that the bound phospholipids in the GLIC structures, which are co-purified from E. coli membranes, are POPC. Rather, these are most likely PE or PG lipids. While it is difficult to accommodate mixed phospholipid membranes in all-atom MD simulations, the choice of POPC for this model, while practically convenient, seems suboptimal, especially since it is not known if PE or PG lipids modulate GLIC function. Nevertheless, it is striking that the overall binding poses of POPC from the simulations agree with those identified in the structures. It is possible that the identity of the phospholipid headgroup will have more of an impact on the strength of interactions with GLIC than on the interaction poses (see next point).

      3) The all-atom MD simulations provide limited insight into the strength of the POPC interactions at each site, which is important to interpret the significance of these interactions. It is unlikely that the system has equilibrated within the 1.7 microseconds of simulation for each replicate preventing a meaningful assessment of the lipid interaction times. Although the authors report exchange of up to 4 POPC interacting at certain residues in M4, this may not represent binding/unbinding events (depending on how binding/interaction is defined), since the 4 Å cutoff distance for lipid interactions is relatively small. This may instead be a result of small movements of POPC in and out of this cutoff. The ability to assess interaction times may have been strengthened if the authors performed a single extended replicate up to, for example, 10-20 microseconds instead of extending multiple replicates to 1.7 microseconds.

      Reviewer #2 (Public Review):

      The authors convincingly show multiple inner and outer leaflet non-protein (lipid) densities in a cryo-EM closed state structure of GLIC, a prokaryotic homologue of canonical pentameric ligand-gated ion channels, and observe lipids in similar sites during extensive simulations at both resting and activating pH. The simulations not only corroborate structural observations, but also suggest the existence of a state-dependent lipid intersubunit site only occupied in the open state. These important findings will be of considerable interest to the ion channel community and provide new hypotheses about lipid interactions in conjunction with channel gating.

      Recommendations for the authors: please note that you control which, if any, revisions, to undertake

      In particular, the authors should discuss whether the timescale of the simulations permits measurements of residence or interaction times of the lipids.

      Reviewer #1 (Recommendations for the authors):

      Comment 1.1: The authors may consider expanding the discussion about the significance of state-dependent lipid interactions. On the one hand, they emphasize state-dependent interactions of POPC with closed and open states in the outer leaflet in the results. On the other hand, they state that GLIC is insensitive to its lipid environment. What is the significance of the state-dependent interactions of POPC in GLIC, if any? Is it possible that GLIC agonist responses are sensitive to phospholipids (such as PE or PG found in E. coli)? The state-dependent differences in lipid interaction identified in this study support this possibility and suggest the need to better understand the effects of phospholipids on GLIC function.

      Response 1.1: We agree with the reviewer that this is an interesting question and we have therefore extended the discussion with additional references on the functional effects on GLIC of various lipid membranes:

      p. 11 (Discussion)

      “Sampling was further simplified by performing simulations in a uniform POPC membrane. Prior experiments have been conducted to assess the sensitivity of GLIC in varying lipid environments (Labriola et al., 2013; Carswell et al., 2015; Menny et al., 2017), indicating that GLIC remains fully functional in pure POPC bilayers. In our cryo-EM experiments, the protein was recombinantly expressed from E. coli, which means that the experimental density would likely represent phosphatidylglycerol or phosphatidylethanolamine lipids. However, as the molecular identities of bound lipids could not be precisely determined, POPC lipids were built for straightforward comparison with simulation poses. While it appears that GLIC is capable of gating in a pure POPC bilayer, it remains plausible that its function could be influenced by different lipid species, especially due to the presence of multiple charged residues around the TMD/ECD interface which might interact differently with different lipid head groups. Further experiments would be needed to confirm whether the state dependence observed in simulations is also lipid-dependent. It is possible that certain types of lipids bind in one but not the other state, or that certain states are stabilized by a particular lipid type.”

      Comment 1.2: It would be helpful to state in the discussion that the co-purified lipids from GLIC structures are likely PE or PG from E. coli membranes. Nevertheless, it is interesting that the phospholipid poses from the structures generally agree with those identified from the MD simulations using PC.

      Response 1.2: Good point. We have clarified in the discussion that the native lipids in the cryo-EM structure are likely PG or PE lipids, as quoted in the preceding Response.

      Comment 1.3: The authors describe a more deeply penetrating interaction of POPC in the outer intrasubunit cleft in the open state, but this is difficult to appreciate from the images in Fig. 4B, 4E or S3B. The same is true of the deep POPC interaction at the outer intersubunit site. It may be helpful to show these densities from a different perspective to appreciate the depth of these binding poses.

      Response 1.3: We have added Figure 4 – figure supplement 1 to better show the depth of lipid binding poses, especially the ones in the outer leaflet intrasubunit cleft and at the inner intersubunit site, and cited the figure on p. 7 (Results).

      Comment 1.4: The representation of the lipid densities in Fig. 4B is not easy to interpret. First, the meaning of resting versus activating conditions and closed versus open states can be easily missed for readers who are not familiar with the author's previous study. It may be helpful to describe this (i.e. how open and closed state clusters were generated from structures determined in resting and activating conditions) in greater detail in either the figure legend, results or methods. Second, the authors state that there are differences in lipid poses between the closed and open states but not resting and activating conditions. With the exception of the intersubunit density, this is difficult to appreciate from Fig. 4B. As stated in point #3, the difference, for example, in the complementary intrasubunit site may be better appreciated with an image from a different perspective.

      Response 1.4: Acknowledged: the distinction between resting and activating conditions vs. open and closed states can be confusing. We have tried to clarify these differences at the beginning of the results section, in the methods section, and in the caption of Figure 4. Regarding differences in lipid poses between open and closed states, we agree that they are difficult to appreciate from Figure 4, and we refer the reader to Figure 4 – figure supplement 2 for an overlay of open and closed densities. Additionally, we have now added Figure 1 – figure supplement 1, which provides lipid densities for all five subunits and overlays with the built cryo-EM lipids, possibly making differences easier to appreciate. Regarding images from different perspectives, we trust the new figure supplement described in Response 1.3 provides a better perspective.

      p. 3 (Results)

      “For computational quantification of lipid interactions and binding sites, we used molecular simulations of GLIC conducted under either resting or activating conditions (Bergh et al., 2021a). As described in Methods, resting conditions corresponded to neutral pH with most acidic residues deprotonated; activating conditions corresponded to acidic pH with several acidic residues protonated. Both open and closed conformations were present in both conditions, albeit with different probabilities.”

      p. 8 (Figure 4)

      “Overlaid densities for each state represent simulations conducted under resting (dark shades) or activating (light shades) conditions, which were largely superimposable within each state.”

      p. 24 (Methods)

      “We analyzed previously published MSMs of GLIC gating under both resting and activating conditions (Bergh et al., 2021a). Resting conditions corresponded to pH 7, at which GLIC is nonconductive in functional experiments, with all acidic residues modeled as deprotonated. Activating conditions corresponded to pH 4.6, at which GLIC is conductive and has been crystallized in an open state (Bocquet et al., 2009). These conditions were modeled by protonating a group of acidic residues (E26, E35, E67, E75, E82, D86, D88, E177, E243; H277 doubly protonated) as previously described (Nury et al., 2011).”

      Comment 1.5: The new closed GLIC structure was obtained by merging multiple datasets. What were the conditions of the datasets used? Was it taken from samples in resting or also activating conditions?

      Response 1.5: We have updated the Results, Discussion, and Methods to clarify this important point, in particular by merging datasets and rerunning the classification:

      p. 3 (Results)

      “In our cryo-EM work, a new GLIC reconstruction was generated by merging previously reported datasets collected at pH 7, 5, and 3 (Rovšnik et al., 2021). The predominant class from the merged data corresponded to an apparently closed channel at an overall resolution of 2.9 Å, the highest resolution yet reported for GLIC in this state (Figure 1 – figure supplement 2, Table 1).”

      p. 11 (Discussion)

      “Interestingly, the occupational densities varied remarkably little between resting and activating conditions (Figure 1 – figure supplement 1), indicating state- rather than pH-dependence in lipid interactions, also further justifying the approach of merging closed-state GLIC cryo-EM datasets collected at different pH conditions to resolve lipids.”

      p. 14 (Methods)

      “After overnight thrombin digestion, GLIC was isolated from its fusion partner by size exclusion in buffer B at pH 7, or in buffer B with citrate at pH 5 or 3 substituted for Tris. The purified protein was concentrated to 3–5 mg/mL by centrifugation. [...] Data from three different grids, at pH 7, 5, and 3, were merged and processed together.”

      Comment 1.6: In Fig. 3D, do the spheres represent the double bond? If so, please state this in the legend.

      Response 1.6: We have clarified in the legend of Figure 3D that the yellow spheres on the lipid tails represent a double bond.

      Comment 1.7: In Fig. 3E, what is the scale of the color representation?

      Response 1.7: We have clarified in the legend of Figure 3E that colors span 0 (white) to 137015 contacts (dark red).

      Reviewer #2 (Recommendations For The Authors):

      Comment 2.1: I'm not sure I fully understand how the final lipids were modeled (built). Fig. 1 caption suggests they may have been manually built? I understand that the idea was to place them in the overlap of simulation densities and structure densities, but can the authors please clarify if there were any quantifiable conditions that were employed during this process or if this was entirely manual placement in a pose that looked good? Regardless, it would be helpful to see an overlay of the built lipids with both the cryo and simulation densities (e.g., overlay of Fig. 1F/H and G/H) to better visualize how the final built lipids compare.

      Response 2.1: We thank the reviewer for pointing out ambiguities in our methods. We have extended the methods section to clarify how the lipids were manually built in the cryo-EM structure. We have also added Figure 1 – figure supplement 1 showing overlays of the computational densities and built cryo-EM lipids.

      p. 15 (Methods)

      “Lipids were manually built in COOT by importing a canonical SMILES format of POPC (Kim et al., 2021) and adjusting it individually into the cryo-EM density in each of the sites associated with a single subunit, based in part on visual inspection of lipid densities from simulations, as described above. After building, 5-fold symmetry was applied to generate lipids at the same sites in the remaining four subunits.”
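      The symmetry step described in this passage can be illustrated geometrically. The sketch below assumes the 5-fold axis coincides with the z axis through the origin, which is an assumption about the coordinate frame and not a statement about the authors' workflow or about COOT; the function name is hypothetical.

```python
import math


def symmetry_copies(coords, n_fold=5):
    """Rotate a set of atoms about the z axis to generate n_fold copies.

    coords is a list of (x, y, z) tuples for the atoms built in one subunit.
    Copy k is rotated by k * 360 / n_fold degrees; copy 0 is the original.
    """
    copies = []
    for k in range(n_fold):
        theta = 2.0 * math.pi * k / n_fold
        c, s = math.cos(theta), math.sin(theta)
        copies.append([(c * x - s * y, s * x + c * y, z) for x, y, z in coords])
    return copies


# One hypothetical lipid atom placed 10 units from the symmetry axis.
for copy in symmetry_copies([(10.0, 0.0, 5.0)]):
    x, y, z = copy[0]
    print(f"{x:8.3f} {y:8.3f} {z:8.3f}")
```

      Each copy keeps the same distance from the axis and the same height along it, so a lipid built at one subunit interface lands at the equivalent interface of every other subunit.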

      Comment 2.2: Regarding the state-dependent lipid entry to the outer leaflet intersubunit site associated with channel opening, if the authors could include a movie depicting this process that would be great. The current short explanation does not do this justice. Also, what were the dynamics of this process? Beyond the correlation between site occupancy and the pore being open, how did the timing of lipid entry/exit and pore opening/closing correlate?

      Response 2.2: The point regarding the timing of state-dependent lipid binding at the subunit interface and pore opening is indeed an interesting one. We have added Figure 4 – figure supplement 3D showing that the state-dependent P250 lipid interaction precedes pore opening, as quantified by pore hydration levels, indicating a potential role in gating. The interaction between lipid binding and conformational change of the protein is also depicted in the newly added Figure 4 - video supplement 1, which we hope will be able to better communicate the conclusions regarding state-dependent interactions. We have also expanded the results and discussion to better explain these results:

      p. 9 (Results)

      “The lipid head made particularly close contacts with residue P250 on the M2-M3 loop, which undergoes substantial conformational change away from the pore upon channel opening, along with outer-leaflet regions of M1–M3 (Figure 4E, Figure 4—figure Supplement 3A,B,C, Figure 4—video 1). These conformational changes were accompanied by a flip of M1 residue F195, which blocked the site in the closed state but rotated inward to allow closer lipid interactions in the open state (Figure 4—figure Supplement 3C, Figure 4—video 1). Indeed, P250 was predominantly located within 3 Å of the nearest lipid atom in open- but not closed-state frames (Figure 4F). Despite being restricted to the open state, interactions with P250 were among the longest duration in all simulations (Figure 2C) and as these binding events preceded pore opening, it is plausible to infer a role for this state-dependent lipid interaction in the gating process (Figure 4 – figure supplement 3D).”

      p. 12 (Discussion)

      “The state-dependent binding event at this site preceded pore opening in MSMs, where lipid binding coincided with crossing a smaller energy barrier between closed and intermediate states, followed by pore opening at the main energy barrier between intermediate and open states (Figure 4 – figure supplement 3D). Further, since the P250-lipid interaction was characterized by relatively long residence times (Figure 2), it is possible this lipid interaction has a role to play in GLIC gating.”

      Comment 2.3: Although the interaction times are helpful, I didn't get a great sense of how mobile the lipids are during the simulations. Can the authors discuss this a bit more? For example, are interaction times dominated by lipids that jiggle a bit away from a residue and then back again, vs how often are lipids exchanging with other lipids initially further away from the protein?

      Response 2.3: We have now added various measures of lipid diffusion, both for initially interacting lipids and for bulk lipids, which are summarized in the new Figure 2 – figure supplement 1. We have further addressed the question of simulation timescales in Results, Discussion, and Methods. These numbers highlight that it is possible for lipids several nanometers away from the protein surface to exchange with lipids of the first lipid shell.

      p. 3,6 (Results)

      “Lateral lipid diffusion coefficients were estimated at 1.47 nm²/µs for bulk lipids and 0.68 nm²/µs for lipids of the first lipid shell (Figure 2 – figure supplement 1A), which is relatively slow compared to the timescales of each trajectory (1.7 µs). However, multiple residues throughout the M1, M3, and M4 helices exchanged contacts with 2-4 different lipid molecules in individual simulations (Figure 2C). Furthermore, the 1.7-µs root mean square displacement of lipids originally in the first lipid shell was 2.15 nm, and that of bulk lipids was 3.16 nm, indicating such exchanges are not limited to nearby lipids (Figure 2 – figure supplement 1B). Thus, exchange events and diffusion estimates indicate that the duration of lipid contacts observed in this work can be at least partly attributed to interaction stabilities and not solely to sampling limitations.”
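
      The displacement figures in this passage can be cross-checked against the reported diffusion coefficients. The sketch below assumes purely two-dimensional (lateral) diffusion, for which the root mean square displacement after time t is sqrt(4·D·t); the function name is illustrative and is not taken from the authors' analysis.

```python
import math

def rms_lateral_displacement(d_nm2_per_us: float, t_us: float) -> float:
    """RMS displacement (nm) for 2D diffusion, assuming MSD = 4 * D * t."""
    return math.sqrt(4.0 * d_nm2_per_us * t_us)

TRAJECTORY_US = 1.7  # trajectory length quoted in the response

shell_rms = rms_lateral_displacement(0.68, TRAJECTORY_US)  # first lipid shell
bulk_rms = rms_lateral_displacement(1.47, TRAJECTORY_US)   # bulk lipids

print(f"first shell: {shell_rms:.2f} nm, bulk: {bulk_rms:.2f} nm")
# prints "first shell: 2.15 nm, bulk: 3.16 nm"
```

      Under this assumption the two coefficients reproduce the quoted 2.15 nm and 3.16 nm, which suggests the displacements were derived from the two-dimensional form of the relation.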

      p. 11 (Discussion)

      “Indeed, the unrestrained atomistic MD simulations studied here were not expected to capture the maximal duration of stable contacts, as indicated by some interaction times approaching the full 1.7-µs trajectory (Figure 2). Nevertheless, simulations were of sufficient length to sample exchange of up to four lipids, particularly around the M4 helix. Calculation of lipid lateral diffusion coefficients resulted in average displacements at the end of simulations of 2.15 nm for lipids initially interacting with the protein surface, roughly corresponding to lipids diffusing out to the 4th lipid shell. Diffusion of bulk lipids was faster, allowing lipids originally 3.16 nm away from the protein surface to ingress the first lipid shell. This observation underscores the potential for lipid exchange events even among lipids initially distant from the protein surface. Of course, duration of exceptionally stable interactions, such as those involving T274 (Figure 2C), inevitably remain bounded by the length of our simulations. Still, diffusion metrics, supported by robust statistical analysis encompassing diverse starting conditions (500 trajectories), enable confident estimation of relative interaction times.”

      p. 13 (Methods)

      “Time-based measures of protein-lipid interactions, such as mean duration times and exchange of interactions, were calculated for the 100 x 1.7 µs-long simulations using prolintpy (Sejdiu and Tieleman, 2021) with a 4 Å interaction cutoff. Analysis of lateral lipid diffusion in individual simulations was carried out for two disjoint sets of lipids: the first lipid shell defined as lipids with any part within 4 Å of the protein surface (~90 lipids), and bulk lipids consisting of all other lipids (~280 lipids). Mean square displacements of each lipid set were calculated using GROMACS 2021.5 (Abraham et al., 2015b) with contributions from the protein center of mass removed. Diffusion coefficients for each set, DA, were calculated using the Einstein relation (Equation 1) by estimating the slope of the linear curve fit to the data.

      where ri(t) is the coordinate of the center of mass of lipid i of set A at time t and DA is the self-diffusion coefficient.”
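
      As a rough illustration of the slope-fitting step described in this Methods passage, the sketch below assumes the two-dimensional form of the Einstein relation, MSD(t) ≈ 4·D_A·t, and recovers D_A from an ordinary least-squares line through mean square displacement data. The function name and the synthetic data are assumptions made for illustration; this is not the authors' GROMACS workflow.

```python
def diffusion_coefficient(times_us, msd_nm2, dimensions=2):
    """Estimate D (nm^2/us) from the least-squares slope of MSD versus time.

    Assumes the Einstein relation MSD(t) = 2 * dimensions * D * t.
    """
    n = len(times_us)
    mean_t = sum(times_us) / n
    mean_m = sum(msd_nm2) / n
    # Ordinary least-squares slope of MSD against time
    sxy = sum((t - mean_t) * (m - mean_m) for t, m in zip(times_us, msd_nm2))
    sxx = sum((t - mean_t) ** 2 for t in times_us)
    slope = sxy / sxx
    return slope / (2 * dimensions)

# Synthetic, noise-free MSD curve generated with D = 0.68 nm^2/us (illustrative only)
times = [0.1 * k for k in range(1, 18)]
msd = [4 * 0.68 * t for t in times]

d_est = diffusion_coefficient(times, msd)
print(f"estimated D = {d_est:.2f} nm^2/us")
```

      With real trajectories the fit would normally be restricted to the linear regime of the MSD curve, since short-time and long-time segments are often noisy or non-diffusive.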

      Comment 2.4: How symmetric or asymmetric are the cryo and simulation densities across subunits and was there subunit asymmetry in the final build lipids? I could not tell from any of the figures beyond the casual observation that they maybe look somewhat similar in Fig. 1?

      Response 2.4: We thank the reviewer for this useful remark. We have clarified in the methods that the cryo-EM lipids were built in C5-symmetry, and thus the positions are symmetric. The computational densities were calculated independently for each subunit and are thus not necessarily symmetric. We have added Figure 1 – figure supplement 1 showing densities for all five subunits, also serving as an indication of convergence of the results.

      p. 3 (Results) “Although the stochastic nature of simulations resulted in nonidentical lipid densities associated with the five GLIC subunits, patterns of lipid association were notably symmetric (Figure 1 – figure supplement 1).”

      p. 14-15 (Methods)

      “A smaller subset of particles was used to generate an initial model. All subsequent processing steps were done using 5-fold symmetry. […] A monomer of that model was fit to the reconstructed density and 5-fold symmetry was applied with PHENIX 1.19.2-4158 through NCS restraints detected from the reconstructed cryo-EM map, to generate a complete channel. […] After building, 5-fold symmetry was applied to generate lipids at the same sites in the remaining four subunits.”

      Minor comments:

      Comment 2.5: Fig. 1 is probably not easy to follow for the general reader and the caption is very brief. I suggest adding an additional explanation to the caption and/or additional annotations to the figure to help a general reader step through this.

      Response 2.5: We have expanded the caption of Figure 1 and clarified the meanings of colors, labels, and annotations.

      Comment 2.6: Fig. 1B - Caption is confusing. I would not call the state separation lines outlines as they are not closed loops. Also, I see red/orange and two shades of blue whereas the caption mentions orange and blue only. The caption should also explicitly say what the black lines are (other cluster separations).

      Response 2.6: We have edited the caption to better describe colors, annotations, and the meaning of the data:

      p. 4 (Figure 1)

      “(B) Markov state models were used to cluster simulations conducted under resting (R) or activating (A) conditions into five states, including closed (left of the light or dark orange lines) and open (right of the light or dark blue lines). Black lines mark edges of other state clusters derived from MSM eigenvectors. Experimental structures are highlighted as white circles.”

      Comment 2.7: Fig. 3F caption appears to conflict with data where interaction with W217A appears longer than W217. I think the authors want to suggest here that W217A reduces contact time with T274 as stated in the main text.

      Response 2.7: We have clarified in this legend that “Mutation of residue W217, lining this pocket, reveals shortened interactions at the T274 binding site” (p. 6, Figure 3).

      Comment 2.8: Ref 25 and 26 are the same.

      Response 2.8: Apologies; this mistake has been corrected.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The paper from Hsu and co-workers describes a new automated method for analyzing the cell wall peptidoglycan composition of bacteria using liquid chromatography and mass spectrometry (LC/MS) combined with newly developed analysis software. The work has great potential for determining the composition of bacterial cell walls from diverse bacteria in high-throughput, allowing new connections between cell wall structure and other important biological functions like cell morphology or host-microbe interactions to be discovered. In general, I find the paper to be well written and the methodology described to be useful for the field. However, there are areas where the details of the workflow could be clarified. I also think the claims connecting cell wall structure and stiffness of the cell surface are relatively weak. The text for this topic would benefit from a more thorough discussion of the weak points of the argument and a toning down of the conclusions drawn to make them more realistic.

      Thank you for your thorough and insightful review of our manuscript. We greatly appreciate your positive and constructive feedback on our methodology. We have carefully reviewed your comments and have responded to each point as follows:

      Specific points:

      1) It was unclear to me from reading the paper whether or not prior knowledge of the peptidoglycan structure of an organism is required to build the "DBuilder" database for muropeptides. Based on the text as written, I was left wondering whether bacterial samples of unknown cell wall composition could be analyzed with the methods described, or whether some preliminary characterization of the composition is needed before the high-throughput analysis can be performed. The paper would be significantly improved if this point were explicitly addressed in the main text.

      We apologize for not making this clearer. Prior knowledge of the peptidoglycan structure of an organism is indeed required to build the “DBuilder” database and accurately identify muropeptides; otherwise, the false discovery rate might increase. Although the peptidoglycan structures of certain organisms might not have been extensively studied, users retain the flexibility to adapt the muropeptide compositions to their study, referencing closely related species for database construction. We have addressed this aspect in the main text to ensure a clearer understanding.

      “(Section HAMA platform: a High-throughput Automated Muropeptide Analysis for Identification of PGN Fragments) …(i) DBuilder... Based on their known (or putative) PGN structures, all possible combinations of GlcNAc, MurNAc and peptide were input into DBuilder to generate a comprehensive database that contains monomeric, dimeric, and trimeric muropeptides (Figure 1b).”
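
      The combinatorial idea behind such a database can be sketched in a few lines. The example below is a toy enumeration only: the glycan and stem-peptide labels are placeholders, the unordered pairing of units ignores linkage type and donor/acceptor order, and none of the names reflect DBuilder's actual inputs or internals.

```python
from itertools import combinations_with_replacement, product

# Placeholder building blocks (hypothetical labels, not DBuilder input)
glycans = ["GlcNAc-MurNAc"]
stem_peptides = ["tripeptide", "tetrapeptide"]

# Monomers: every glycan paired with every stem peptide
monomers = [f"{g}:{p}" for g, p in product(glycans, stem_peptides)]

def multimers(units, size):
    """All unordered combinations of `size` monomer units, repeats allowed."""
    return list(combinations_with_replacement(units, size))

database = {
    "monomer": [(m,) for m in monomers],
    "dimer": multimers(monomers, 2),
    "trimer": multimers(monomers, 3),
}

for kind, entries in database.items():
    print(kind, len(entries))
```

      Even this toy version shows how quickly the search space grows with each added building block, which is one reason that restricting the inputs to known or putative structures helps keep false identifications down.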

      2) The potential connection between the structure of different cell walls from bifidobacteria and cell stiffness is pretty weak. The cells analyzed are from different strains such that there are many possible reasons for the change in physical measurements made by AFM. I think this point needs to be explicitly addressed in the main text. Given the many possible explanations for the observed measurement differences (lines 445-448, for example), the authors could remove this portion of the paper entirely. Conclusions relating cell wall composition to stiffness would be best drawn from a single strain of bacteria genetically modified to have an altered content of 3-3 crosslinks.

      We understand your concern regarding the weak connection between cell wall structure and cell stiffness. We will make a clear and explicit statement in the main text to acknowledge that the cells analyzed are derived from different strains, introducing the possibility of various factors influencing the observed changes in physical measurements as determined by AFM. Furthermore, we greatly appreciate your suggestion to consider genetically modified strains to investigate the role of cross-bridge length in determining cell envelope stiffness. In this regard, we are in the process of developing a CRISPR/Cas genome editing toolbox for Bifidobacterium longum, and we plan to pursue this avenue of investigation in future work.

      Reviewer #2 (Public Review):

      The authors introduce "HAMA", a new automated pipeline for architectural analysis of the bacterial cell wall. Using MS/MS fragmentation and a computational pipeline, they validate the approach using well-characterized model organisms and then apply the platform to elucidate the PG architecture of several members of the human gut microbiota. They discover differences in the length of peptide crossbridges between two species of the genus Bifidobacterium and then show that these species also differ in cell envelope stiffness, resulting in the conclusion that crossbridge length determines stiffness.

      We appreciate your thoughtful review of our manuscript and your recognition of the potential significance of our work in elucidating the poorly characterized peptidoglycan (PGN) architecture of the human gut microbiota.

      The pipeline is solid and revealing the poorly characterized PG architecture of the human gut microbiota is worthwhile and significant. However, it is unclear if or how their pipeline is superior to other existing techniques - PG architecture analysis is routinely done by many other labs; the only difference here seems to be that the authors chose gut microbes to interrogate.

      We apologize if this could have been clearer. The HAMA platform stands apart from other pipelines by automatically analyzing LC-MS/MS data to identify muropeptides. In contrast, most routine PGN architecture analyses use LC-UV/Vis or LC-MS platforms, for which the only automated analysis software available is PGFinder. To the best of our knowledge, a comparable pipeline for automatically analyzing LC-MS/MS data was reported by Bern et al., who used the commercial Byonic software with an in-house FASTA database and specific glycan modifications. They achieved accurate and sensitive identification of monomeric muropeptides but struggled with cross-linked muropeptides due to the limitations of the Byonic software. We believe that our pipeline, which introduces automatic and comprehensive muropeptide identification (particularly for Gram-positive bacterial peptidoglycans), would be a valuable addition to the field. To enhance clarity, we have adjusted the text as follows:

      “(Introduction) … Although they both demonstrated great success in identifying muropeptide monomers, the accurate identification of muropeptide multimers and other various bacterial PGN structures still remains unresolved. This is because deciphering the compositions requires MS/MS fragmentation, but it is still challenging to automatically annotate MS/MS spectra from these complex muropeptide structures.”

      I do not agree with their conclusions about the correlation between crossbridge length and cell envelope stiffness. These experiments are done on two different species of bacteria and their experimental setup therefore does not allow them to isolate crossbridge length as the only differential property that can influence stiffness. These two species likely also differ in other ways that could modulate stiffness, e.g. turgor pressure, overall PG architecture (not just crossbridge length), membrane properties, teichoic acid composition etc.

      Regarding the conclusions drawn about the correlation between cross-bridge length and cell envelope stiffness, we understand your point and appreciate your feedback. We have revisited this section of our manuscript and toned down the conclusions drawn from this aspect of the study. We also recognize the importance of considering other potential factors that could influence stiffness, as you mentioned above. In light of this, we now mention in the main text the need for further investigations, potentially involving genetically modified strains, to isolate and accurately determine the impact of bridge length on cell envelope stiffness.

      Reviewer #1 (Recommendations For The Authors):

      Minor points:

      1) One thing to consider would be testing the robustness of the analysis pipeline with one of the well-characterized bacteria studied, but genetically modifying them to change the cell wall composition in predictable ways. Does the analysis pipeline detect the expected changes?

      We appreciate the reviewer's suggestion and would like to provide a clear response. Regarding testing the pipeline with genetically modified strains, our lab previously worked on genetically modified S. maltophilia (KJΔmrdA).1 Inactivation of mrdA increased the level of N-acetylglucosaminyl-1,6-anhydro-N-acetylmuramyl-L-alanyl-D-glutamyl-meso-diaminopimelic acid-D-alanine (GlcNAc-anhMurNAc tetrapeptide) in muropeptide profiles, which is the critical activator ligand for ΔmrdA-mediated β-lactamase expression in the mutant strain. In this case, our platform could provide rapid PGN analysis for verifying the expected change in muropeptide profiles (see Author response image 1). In addition, if the predictable changes involve genetic modification of interpeptide bridges within the PGN structure, for example through the femA/B genes of S. aureus, which are responsible for the synthesis of interpeptide bridges,2 our current HAMA pipeline is capable of detecting these anticipated changes. However, if the genetic modifications introduce novel components into PGN structures, a dedicated database specific to the genetically modified strain would need to be created.

      Author response image 1.

      2) Line 368: products catalyzed > products formed

      The sentence has been revised.

      “(Section Inferring PGN Cross-linking Types Based on Identified PGN Fragments) …Based on the muropeptide compositional analysis mentioned above, we found high abundances of M3/M3b monomer and D34 dimer in the PGNs of E. faecalis, E. faecium, L. acidophilus, B. breve, B. longum, and A. muciniphila, which may be the PGN products formed by Ldts.”

      3) Lines 400-402: Is it possible the effect is related to porosity, not "hardness".

      Thank you for the suggestion. The possibility of the slower hydrolysis rate of purified PGN in B. breve being related to porosity is indeed noteworthy. While this could be a potential factor, we would like to acknowledge the limited existing literature that directly addresses the relation between PGN architecture and porosity. It is plausible that current methods available for assessing cell wall porosity may have certain limitations, contributing to the scarcity of relevant studies. In light of this, we would like to propose a speculative explanation for the observed effect. It is plausible that the tighter PGN architecture resulting from shorter interpeptide bridges in B. breve could contribute to its harder texture. This speculation is grounded in the concept that a more compact PGN structure might lead to increased stiffness, aligning with our observations of higher cell stiffness in B. breve.

      4) Lines 403-408: See point #2 above.

      Thank you for the suggestion. We have explicitly addressed this point in the main text:

      “(Section Exploring the Bridge Length-dependent Cell Envelope Stiffness in B. longum and B. breve) … Taken together, we speculate that a tight peptidoglycan network woven by shorter interpeptide bridges or 3-3 cross-linkages could give bacteria stiffer cell walls. However, it is important to note that cell stiffness is a mechanical property that also depends on PGN thickness, overall architecture, and turgor pressure. These parameters may vary among different bacterial strains. Hence, carefully controlled, genetically engineered strains with similar characteristics will be needed to dissect the role of cross-bridge length in cell envelope stiffness.”

      5) Lines 428-429: It is not clear to me how mapping the cell wall architecture provides structural information about the synthetic system. It is also not clear how antibiotic resistance can be inferred. More detail is needed here to flesh out these points.

      Thank you for the suggestion. To provide further clarity on these important aspects, the context in the manuscript has been revised.

      “(Discussion) …Importantly, our HAMA platform provides a powerful tool for mapping peptidoglycan architecture, giving structural information on the PGN biosynthesis system. This involves the ability to infer possible PGN cross-linkages based on the type of PGN fragments obtained from hydrolysis. For instance, the identification of 3-3 cross-linkage formed by L,D-transpeptidases (Ldts) is of particular significance. Unlike 4-3 cross-linkages, the 3-3 cross-linkage is resistant to inhibition by β-Lactam antibiotics, a class of antibiotics that commonly targets bacterial cell wall synthesis through interference with 4-3 cross-linkages. Therefore, by elucidating the specific cross-linkage types within the peptidoglycan architecture, our approach offers insights into antibiotic resistance mechanisms.”

      6) Line 478: "maneuvers are proposed for" > "work is needed to generate". Also, delete "innovative". Also "in silico" > "in silico-based".

      The sentence has been revised.

      “(Discussion) …To achieve a more comprehensive identification of muropeptides, future work is needed to generate an expanded database, in silico-based fragmentation patterns, and improved MS/MS spectra acquisition.”

      7) Line 485: "Its" > "It has potential"

      The sentence has been revised.

      “(Discussion) …It has potential applications in identifying activation ligands for antimicrobial resistance studies, characterizing key motifs recognized by pattern recognition receptors for host-microbiota immuno-interaction research, and mapping peptidoglycan in cell wall architecture studies.”

      8) Figure 1 legend: Define Gb and Pb.

      Gb and Pb are the abbreviations of glycosidic bonds and peptide bonds. We have revised the legend of Figure 1 as follows:

      “(Figure legend 1) …(b) DBuilder constructs a muropeptide database containing monomers, dimers, and trimers with two types of linkage: glycosidic bonds (Gb) and peptide bonds (Pb).”

      9) Figure 2: It is hard to see what is going on in panel a and c with all the labels. Consider removing them and showing a zoomed inset with labels in addition to an unlabeled full chromatogram.

      We apologize for not making this clearer. Panels a and c in Figure 2 were generated directly by the Analyzer as software screenshots of the peak annotations on the chromatogram. Our intention was to present a comprehensive PGN mapping (approximately 70% of the peak area was assigned to muropeptide signals) using this platform. We understand that the label density might affect clarity, so we have added the output tables of all muropeptide identifications as source data (Table 1–Source Data 1&2). Additionally, we have uploaded the Analyzer output files (see Additional Files), which can be better visualized in the Viewer program; the Viewer also allows users to zoom in for detailed labeling information.

      10) Figure 3: It is worth pointing out what features of the MS/MS fingerprints are helping to discriminate between species.

      Thank you for the suggestion. We have revised Figure 3 and the legend as follows:

      “(Figure legend 3) …The sequence of each isomer was determined using in silico MS/MS fragmentation matching, with the identified sequence having the highest matching score. The key MS/MS fragments that discriminate between two isomers are labeled in bold brown.”

      Author response image 2.

      11) Figure 4 and 5 legend: Can you condense the long descriptions of the abbreviations - or at least only refer to them once?

      Certainly. To enhance clarity and conciseness in the figure legends, we have revised Figure legend 5 as follows:

      “(Figure legend 5) …(b) Heatmap displaying …. Symbols: M, monomer; D, dimer; T, trimer (numbers indicate amino acids in stem peptides). Description of symbol abbreviations as in Figure legend 4, with the addition of "Glycan-T" representing trimers linked by glycosidic bonds.”

      Reviewer #2 (Recommendations For The Authors):

      1. Please read the manuscript carefully for spelling errors.

      We appreciate your careful review of our manuscript. We have thoroughly rechecked the entire manuscript for spelling errors and have made the necessary corrections to ensure the accuracy and quality of the text.

      1. Line 46 - "multilayered" is likely only true for Gram-positive bacteria.

      We thank reviewer #2 for bringing up this concern. Indeed, Gram-negative bacteria mostly possess a single layer of peptidoglycan, although there can be up to three layers in some parts of the cell surface.3, 4 To reduce confusion, we have rewritten the text as follows: “(Introduction) …PGN is a net-like polymeric structure composed of various muropeptide molecules, with their glycans linearly conjugated and short peptide chains cross-linked through transpeptidation.”

      1. Methods section: It seems like pellets from a 10 mL bacterial culture were ultimately suspended in 1.5 L (750 mL water + 750 mL tris) - why such a large volume? And how were PG fragments subsequently washed (centrifugation? There is no information on this in the Methods).

      We apologize for mislabeling the units. The correct volume is “1.5 mL (750 µL water + 750 µL tris)”. We have corrected the volume in the Methods section (lines 99-100). For the washing of purified PGN, we added 1 mL water, centrifuged at 10,000 rpm for 5 minutes, and removed the supernatant. This information has been added to the Methods section (lines 95-98).

      1. Line 183 - why were 6 modifications chosen as the cutoff? Please make rationale more clear.

      We thank reviewer #2 for the comments. We set the maximum number of modifications to 6 on the assumption of one modification on each sugar of a trimeric muropeptide. A lower cutoff could effectively limit the identification of muropeptides with unlikely numbers of modifications, whereas a higher cutoff could allow for multiple modifications on a muropeptide. In our hands, muropeptide modifications of E. coli are mostly N-deacetyl-MurNAc and anhydro-MurNAc, and modifications of the gut microbes used here are mostly N-deacetyl-GlcNAc, anhydro-MurNAc, O-acetyl-MurNAc, loss of GlcNAc, and amidated iso-Glu. While we recommend starting data analysis with the cutoff of 6 modifications, users are free to adjust this based on their studies.
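
      The arithmetic behind this default can be made explicit. The sketch below assumes that each muropeptide unit contributes two sugars (GlcNAc and MurNAc) and allows one modification per sugar, so a trimer gives a cutoff of 6; the function names and the filtering step are hypothetical and do not describe the HAMA software interface.

```python
SUGARS_PER_UNIT = 2  # GlcNAc + MurNAc in each muropeptide unit (assumption)

def max_modifications(units: int, per_sugar: int = 1) -> int:
    """Upper bound on modifications, assuming `per_sugar` modifications per sugar."""
    return SUGARS_PER_UNIT * units * per_sugar

def within_cutoff(n_modifications: int, cutoff: int) -> bool:
    """Keep a candidate only if its modification count does not exceed the cutoff."""
    return n_modifications <= cutoff

default_cutoff = max_modifications(units=3)  # trimeric muropeptide

# Hypothetical candidate identifications with their modification counts
candidates = {"candidate_a": 2, "candidate_b": 6, "candidate_c": 7}
kept = [name for name, n in candidates.items() if within_cutoff(n, default_cutoff)]
print(default_cutoff, kept)
```

      Raising or lowering the cutoff in such a filter trades sensitivity to heavily modified muropeptides against the risk of implausible assignments, which matches the trade-off described in the response.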

      1. Line 339 - define donor vs. acceptor here (can be added in parentheses after explaining the relevant chemical reactions further above in the text)

      Thank you for the suggestion. To provide greater clarity regarding the roles of the donor and acceptor substrates in the transpeptidation process, we have revised the content in the manuscript as follows:

      “(Section Inferring PGN Cross-linking Types Based on Identified PGN Fragments) …In general, there are two types of PGN cross-linkage…. Transpeptidation involves two stem peptides which function as acyl donor and acceptor substrates, respectively. As the enzyme names imply, the donor substrates that Ddts and Ldts bind to are terminated as D,D-stereocenters and L,D-stereocenters, which structurally means pentapeptides and tetrapeptides. During D,D-transpeptidation, Ddts recognize D-Ala4-D-Ala5 of the donor stem (pentapeptide) and remove the terminal D-Ala5 residue, forming an intermediate. The intermediate then cross-links the NH2 group in the third position of the neighboring acceptor stem, forming a 4-3 cross-link.”

      1. Line 366 following - can you calculate % crosslinks based on these numbers? What does "high abundance" of 3,3 crosslinks mean in this context? Is this the majority of PG?

      Thank you for your questions. Calculating the percentage of crosslinks based on the muropeptide compositional numbers is a valid consideration. However, it's important to note that the muropeptides we analyzed were hydrolyzed by mutanolysin, and as such, deriving an accurate % crosslink value from these data might not provide a true representation of the crosslinking percentage within the PGN network. For a more precise determination of % crosslinks, methods such as solid-phase NMR on purified peptidoglycan would be required. Our research provides insights into the characterization of PGN fragments and allows us to infer potential PGN cross-linkage types and the enzymes involved based on the dominant muropeptide fragments. Regarding the phrase "high abundance" in the context, it indicates that the M3b/M4b monomer and D34 dimer muropeptides represent a significant portion of the hydrolysis products. These muropeptides are major constituents within the PGN fragments obtained from the enzymatic hydrolysis.

      1. Line 375 - I am not sure PG is a meaningful diffusion barrier for drugs and signaling molecules, given that even larger proteins can apparently diffuse through the pores.

      Thank you for raising this point. Peptidoglycan indeed possesses relatively wide pores that allow for the diffusion of larger molecules, including proteins.5 Research has provided a rough estimate of the porosity of the PGN meshwork, suggesting that it allows for the diffusion of proteins with a maximum molecular mass of around 50 kDa.6 Considering this, we acknowledge that PGN may not serve as a significant diffusion barrier for drugs and signaling molecules. The porosity of the PGN scaffold, which is defined by the degree of cross-linking, plays a role in influencing the transport of molecules to the cell membrane. Thus, while PGN may not serve as a strict diffusion barrier, its structural characteristics still impact bacterial cell mechanics and interactions. We have revised the manuscript to reflect this understanding:

      “(Section Exploring the Bridge Length-dependent Cell Envelope Stiffness in B. longum and B. breve) …The porosity of the PGN scaffold, defined by the degree of cross-linking, influences the transport of larger molecules such as proteins. Therefore, modifications to PGN structure are anticipated to significantly affect bacterial cell mechanics and interactions.”

      1. Line 400 - what does "slower hydrolysis rate" refer to, is this chemical hydrolysis or enzymatic (autolysins?). Also, I am not sure hydrolysis rate of either modality allows for solid conclusions about how hard (line 402) the PG is.

      Thank you for your comments. The hydrolysis rate here refers to enzymatic hydrolysis, specifically mutanolysin cleaving the β-N-acetylmuramyl-(1,4)-N-acetylglucosamine linkage. Indeed, there is no direct correlation between the hydrolysis rate and the hardness of the PGN architecture, although structural rigidity is a key determinant in protein digestion.7 Considering that the enzymatic hydrolysis rate depends on the accessibility of the substrate to the enzyme, we proposed that a tighter PGN architecture could also lead to a slower hydrolysis rate. This speculation aligns with our observations of higher cell stiffness, or a more compact PGN structure, in B. breve and its slower hydrolysis rate. We understand that this is indirect evidence, so the revised sentence now reads:

      “(Section Exploring the Bridge Length-dependent Cell Envelope Stiffness in B. longum and B. breve) …Furthermore, B. breve also showed a slower enzymatic hydrolysis rate in purified PGNs, implying that the cell wall structure of B. breve is characterized by a compact PGN architecture.”

      1. Line 424 - I am not convinced this pipeline can detect PG architectures that other pipelines cannot; likely, the difference between previous analyses and theirs is due to different growth conditions (3,3 crosslink formation is often modulated by environmental factors/growth stage). In the next sentence, it sounds like mutanolysin treatment is a novelty in PG analysis (which it is not).

      We apologize that this was unclear; we have revised the paragraph to describe our study more accurately. We agree that different growth conditions could influence PGN architecture and that other pipelines could identify PGN architectures manually, or automatically if they are not too complex. Our original intention was to highlight the ability of the HAMA program to automatically identify unreported PGN structures. Here are the revised sentences:

      “(Discussion) …We speculate that this finding may be influenced by the comprehensive mass spectrometric approaches we employed or by variations in growth conditions. Moreover, we utilized the well-established enzymatic method involving mutanolysin to cleave the β-N-acetylmuramyl-(1,4)-N-acetylglucosamine linkage, which preserves the original peptide linkage in intact PGN subunits.”

      1. Lines 440-442: As outlined in more detail above, I don't think you can conclude something about the relationship between bridge length and envelope stiffness based on these data.

      Thank you for your valuable feedback. We agree that our data may not definitively support a direct conclusion about the relationship between bridge length and envelope stiffness in Bifidobacterium species. Instead, we have rephrased this section to present the observed correlations accurately without overgeneralizing:

      “(Discussion) … Notably, our study suggested a potential correlation between the cell stiffness and the compactness of bacterial cell walls in Bifidobacterium species (Figure 5). B. longum, which predominantly harbors tetrapeptide bridges (Ser-Ala-Thr-Ala), exhibits a trend towards lower stiffness, whereas B. breve, characterized by PGN cross-linked with monopeptide bridges (Gly), demonstrates a trend towards higher stiffness. These findings suggested that it may be correlated between the increased rigidity and the more compact PGN architecture built by shorter cross-linked bridges.”

      References:

      1. Huang, Y.-W.; Wang, Y.; Lin, Y.; Lin, C.; Lin, Y.-T.; Hsu, C.-C.; Yang, T.-C., Impacts of Penicillin Binding Protein 2 Inactivation on β-Lactamase Expression and Muropeptide Profile in Stenotrophomonas maltophilia. mSystems 2017, 2 (4), 00077-00017.

      2. Jarick, M.; Bertsche, U.; Stahl, M.; Schultz, D.; Methling, K.; Lalk, M.; Stigloher, C.; Steger, M.; Schlosser, A.; Ohlsen, K., The serine/threonine kinase Stk and the phosphatase Stp regulate cell wall synthesis in Staphylococcus aureus. Sci. Rep. 2018, 8 (1), 13693.

      3. Labischinski, H.; Goodell, E. W.; Goodell, A.; Hochberg, M. L., Direct proof of a "more-than-single-layered" peptidoglycan architecture of Escherichia coli W7: a neutron small-angle scattering study. J. Bacteriol. 1991, 173 (2), 751-756.

      4. Rohde, M., The Gram-Positive Bacterial Cell Wall. Microbiol. Spectr. 2019, 7 (3), gpp3-0044-2018.

      5. Vollmer, W.; Höltje, J. V., The architecture of the murein (peptidoglycan) in gram-negative bacteria: vertical scaffold or horizontal layer(s)? J. Bacteriol. 2004, 186 (18), 5978-5987.

      6. Vollmer, W.; Blanot, D.; De Pedro, M. A., Peptidoglycan structure and architecture. FEMS Microbiol. Rev. 2008, 32 (2), 149-167.

      7. Li, Q.; Zhao, D.; Liu, H.; Zhang, M.; Jiang, S.; Xu, X.; Zhou, G.; Li, C., "Rigid" structure is a key determinant for the low digestibility of myoglobin. Food Chem.: X 2020, 7, 100094.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Chen et al. identified the role of endocardial id2b expression in cardiac contraction and valve formation through pharmaceutical, genetic, electrophysiology, calcium imaging, and echocardiography analyses. CRISPR/Cas9 generated id2b mutants demonstrated defective AV valve formation, excitation-contraction coupling, reduced endocardial cell proliferation in AV valve, retrograde blood flow, and lethal effects.

      Strengths:

      Their methods, data and analyses broadly support their claims.

      Weaknesses:

      The molecular mechanism is somewhat preliminary.

      We thank the reviewer for the positive assessment of our work. A detailed point-by-point response is provided in the “Recommendations for the authors” section.

      Reviewer #2 (Public review):

      Summary:

      Biomechanical forces, such as blood flow, are crucial for organ formation, including heart development. This study by Shuo Chen et al. aims to understand how cardiac cells respond to these forces. They used zebrafish as a model organism due to its unique strengths, such as the ability to survive without heartbeats, and conducted transcriptomic analysis on hearts with impaired contractility. They thereby identified id2b as a gene that is regulated by blood flow and is crucial for proper heart development, in particular for the regulation of myocardial contractility and valve formation. Using both in situ hybridization and transgenic fish, they showed that id2b is specifically expressed in the endocardium and that its expression is affected by both pharmacological and genetic perturbations of contraction. They further generated a null mutant of id2b to show that loss of id2b results in heart malformation and early lethality in zebrafish. Atrioventricular (AV) valve formation and excitation-contraction coupling were also impaired in id2b mutants. Mechanistically, they demonstrate that Id2b interacts with the transcription factor Tcf3b to restrict its activity. When id2b is deleted, the repressor activity of Tcf3b is enhanced, leading to suppression of the expression of nrg1 (neuregulin 1), a key factor for heart development. Importantly, injecting tcf3b morpholino into id2b-/- embryos partially restores the reduced heart rate. Moreover, treatment of zebrafish embryos with the Erbb2 inhibitor AG1478 results in decreased heart rate, in line with a model in which Id2b modulates heart development via the Nrg1/Erbb2 axis. The research identifies id2b as a biomechanical signaling-sensitive gene in endocardial cells that mediates communication between the endocardium and myocardium, which is essential for heart morphogenesis and function.

      Strengths:

      The study provides novel insights into the molecular mechanisms by which biomechanical forces influence heart development and highlights the importance of id2b in this process.

      Weaknesses:

      The claims are in general well supported by experimental evidence, but the following aspects may benefit from further investigation:

      (1) In Figure 1C, the heatmap shows the up-regulated and down-regulated genes upon tricaine-induced cardiac arrest. Aside from the down-regulation of id2b expression, it was also evident that id2a expression was up-regulated. As id2a is a predicted paralog of id2b, it would be interesting to see whether the up-regulation of id2a in response to tricaine treatment was a compensatory response to the down-regulation of id2b expression.

      We thank the reviewer for the comment. As suggested, we performed qRT-PCR analysis to assess id2a expression in tricaine-treated hearts. Our results demonstrate a significant upregulation of id2a following the inhibition of cardiac contraction, suggesting a potential compensatory response to the decreased id2b expression. These new results have been incorporated into the revised manuscript (Figure 1D).

      (2) The study mentioned that id2b is tightly regulated by the flow-sensitive primary cilia-klf2 signaling axis; however aside from showing the reduced expression of id2b in klf2a and klf2b mutants, there was no further evidence to solidify the functional link between id2b and klf2. It would therefore be ideal, in the present study, to demonstrate how Klf2, which is a transcriptional regulator, transduces biomechanical stimuli to Id2b.

      We have examined the expression levels of id2b in both klf2a and klf2b mutants. The whole mount in situ results clearly demonstrate a decrease in id2b signal in both mutants (Figure 3E). As noted by the reviewer, klf2 is a transcriptional regulator, suggesting that the regulation of id2b may occur at the transcriptional level. However, dissecting the molecular mechanisms underlying the crosstalk between klf2 and id2b is beyond the scope of the present study.

      (3) The authors showed the physical interaction between ectopically expressed FLAG-Id2b and HA-Tcf3b in HEK293T cells. Although the constructs being expressed are of zebrafish origin, it would be nice to show in vivo that the two proteins interact.

      We thank the reviewer for this insightful comment. As suggested, we synthesized Flag-id2b and HA-tcf3b mRNA and co-injected them into 1-cell stage zebrafish embryos. We collected 100-300 embryos at 12, 24, and 48 hpf and performed western blot analysis using the same anti-HA and anti-Flag antibodies validated in HEK293 cell experiments. Despite multiple independent attempts, we were unable to detect clear bands of the tagged proteins in zebrafish embryos. We speculate that this could be due to mRNA instability, translational efficiency, or the low abundance of Id2b and Tcf3b proteins. We have acknowledged these technical limitations in the revised manuscript and clarified that the HEK293 cell data support a potential interaction between Id2b and Tcf3b, while confirming their endogenous interaction will require further investigations (Lines 295-296).

      Reviewer #3 (Public review):

      Summary:

      How mechanical forces transmitted by blood flow contribute to normal cardiac development remains incompletely understood. Using the unique advantages of the zebrafish model system, Chen et al make the fundamental discovery that endocardial expression of id2b is induced by blood flow and required for normal atrioventricular canal (AVC) valve development and myocardial contractility by regulating calcium dynamics. Mechanistically, the authors suggest that Id2b binds to Tcf3b in endocardial cells, which relieves Tcf3b-mediated transcriptional repression of Neuregulin 1 (NRG1). Nrg1 then induces expression of the L-type calcium channel component LRRC1. This study significantly advances our understanding of flow-mediated valve formation and myocardial function.

      Strengths:

      Strengths of the study are the significance of the question being addressed, use of the zebrafish model, and data quality (mostly very nice imaging). The text is also well-written and easy to understand.

      Weaknesses:

      Weaknesses include a lack of rigor for key experimental approaches, which led to skepticism surrounding the main findings. Specific issues were the use of morpholinos instead of genetic mutants for the bmp ligands, cilia gene ift88, and tcf3b, lack of an explicit model surrounding BMP versus blood flow induced endocardial id2b expression, use of bar graphs without dots, the artificial nature of assessing the physical interaction of Tcf3b and Id2b in transfected HEK293 cells, and artificial nature of examining the function of the tcf3b binding sites upstream of nrg1.

      We thank the reviewer for the positive assessment and the constructive suggestions. We have performed additional experiments and data analysis to address these issues. A detailed point-by-point response is provided in the “Recommendations for the authors” section.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Questions/Concerns:

      (1) In the introduction, it would be beneficial to include background information on the id2b gene, what is currently known about its function in heart development/regeneration and in other animal models than just the zebrafish.

      We thank the reviewer for the constructive suggestion. In the revised manuscript, we have added a paragraph in the Introduction to provide background on id2b and its role in heart development. Specifically, we discuss its function as a member of the ID (inhibitor of DNA binding) family of helix-loop-helix (HLH) transcriptional regulators and highlight its involvement in cardiogenesis in both zebrafish and mouse models. These additions help place our findings in a broader developmental and evolutionary context (Lines 91-100).

      (2) Of the 6 differentially expressed genes identified in Figure 1C, why did the authors choose to focus on id2b and not the other 5 downregulated genes?

      We thank the reviewer for the comments. As suggested, we have added a sentence in the revised manuscript to clarify the rationale for selecting id2b as the focus of the present study (Lines 117-121).

      (3) As the authors showed representative in situ images for id2b expression with blebbistatin treatment in Figure 1E and tnnt2a MO in Figure 1F, it would also be beneficial to show relative mRNA expression levels for id2b under blebbistatin treatment and tnnt2a MO knockdown. In Fig. 1C: id2b is downregulated with tricaine, but id2a is upregulated with tricaine. Do these genes, presumably the result of a gene duplication event, perform similar or different functions?

      We thank the reviewer for the thoughtful suggestion. Our in situ hybridization results demonstrate reduced id2b expression following tricaine, blebbistatin, and tnnt2a morpholino treatment. To further validate these observations and enhance cellular resolution, we generated an id2b:eGFP knockin line. Analysis of this reporter line confirmed a significant reduction in id2b expression in the endocardium upon inhibition of cardiac contraction and blood flow (Figure 3A-D), supporting our in situ results. The divergent expression patterns of id2a and id2b in response to tricaine treatment likely reflect functional specification following gene duplication in zebrafish. While our current study focuses on characterizing the role of id2b in zebrafish heart development, the specific function of id2a remains to be determined.

      (4) In Fig. 2B, could the authors compare the id2b:eGFP fluorescence with RNAscope ISH at 24, 48, and 72 hpf? RNAscope ISH allows for the visualization of single RNA molecules in individual cells. The authors should at least compare these in the heart to demonstrate that id2b:eGFP accurately reflects the endogenous id2b expression. In Fig. 2E: Suggest showing the individual fluorescent images for id2b:eGFP and kdrl:mCherry in the same colors as the top panel images instead of in black and white. In Fig. 2F: The GFP fluorescence from id2b:eGFP signals looks overexposed.

      We thank the reviewer for the valuable comment. In response, we attempted RNAscope in situ hybridization on embryos carrying the id2b:eGFP reporter to directly compare fluorescent reporter expression with endogenous id2b transcripts. However, we encountered a significant reduction in id2b:eGFP fluorescence following the RNAscope procedure, and even subsequent immunostaining with anti-GFP antibodies yielded only weak signals. Despite this technical limitation, the RNAscope results independently confirmed id2b expression in endocardial cells (Figure 2E), supporting the specificity and cell-type localization observed with the reporter line. As suggested by the reviewer, we have updated Figure 2G to display id2b:eGFP and kdrl:mCherry images in the same color scheme as the top panel to improve consistency and clarity. Additionally, we have replaced the images in Figure 2F to avoid overexposure and better represent the spatial distribution of id2b:eGFP in the adult heart.

      (5) In Fig. 3A: are all the images in panel A taken with the same magnification? In Fig. 3e, could the authors show the localization of klf2 and id2b and confirm their expression in the same endocardial cells? In Fig. 3, the authors conclude that klf2-mediated biomechanical signaling is essential for activating id2b expression. This statement is somewhat overstated because they only demonstrated that knockout of klf2 reduced id2b expression.

      We thank the reviewer for these constructive comments. All images presented in Figure 3A were captured using the same magnification, as now clarified in the revised figure legend. We appreciate the reviewer’s question regarding the localization of klf2 and id2b. While we were unable to directly visualize both markers in the same embryos due to the current unavailability of klf2 reporter lines, prior studies using klf2a:H2B-eGFP transgenic zebrafish have demonstrated that klf2a is broadly expressed in endocardial cells, with enhanced expression in the atrioventricular canal region (Heckel et al., Curr Biol 2015, PMID: 25959969; Gálvez-Santisteban et al., eLife 2019, PMID: 31237233). Our id2b:eGFP reporter analysis revealed a similarly broad endocardial expression pattern. These independent observations support the likelihood that klf2a and id2b are co-expressed in the same endocardial cell population.

      We also appreciate the reviewer’s comments regarding the connection between biomechanical signaling and id2b expression. Previous studies have already established that biomechanical cues directly regulate klf2 expression in zebrafish endocardial cells (Vermot et al., PLoS Biol 2009, PMID: 19924233; Heckel et al., Curr Biol 2015, PMID: 25959969). In the present study, we observed a significant reduction in id2b expression in both klf2a and klf2b mutants, suggesting that id2b acts downstream of klf2. Together, these observations establish the role of the biomechanical cues-klf2-id2b signaling axis in endocardial cells. Nevertheless, we agree with the reviewer that further investigation is required to elucidate the precise mechanism by which klf2 regulates id2b expression.

      (6) In Fig. 4: What's the mRNA expression for id2b in WT and id2b mutant fish hearts?

      We performed qRT-PCR analysis on purified zebrafish hearts and observed a significant reduction in id2b mRNA levels in id2b mutants compared to wild-type controls. These new results have been incorporated into the revised manuscript (Figure 4A).

      (7) In Fig. 5E, the heart rate shows no difference between id2b+/+ and id2b-/- fish according to echocardiography analysis. However, Fig. 5B indicates a difference in heart rate. Could the authors explain this discrepancy?

      We thank the reviewer for this insightful observation. In our study, we observed a reduction in heart rate in id2b mutants during embryonic stages (120 hpf), as shown in Figure 5B. However, this difference was not evident in adult fish based on echocardiography analysis (Figure 5E). While the exact reason for these changes during development remains unclear, it is possible that the reduction in cardiac output observed in id2b mutants during early development triggers compensatory mechanisms over time, ultimately restoring heart rate in adulthood. Given that heart rate is primarily regulated by pacemaker activity, further investigation will be required to determine whether such compensatory adaptations occur and to elucidate the underlying mechanisms.

      (8) In Fig. 6A: it's a little hard to read the gene names in the left most image in the panel. In Fig. 6B, the authors conducted qRT-PCR analysis of 72 hpf embryonic hearts and validated decreased nrg1 levels in id2b-/- compared to control. Since nrg1 is not specifically expressed in endocardial cells in the developing heart, the authors should isolate endocardial cells and compare nrg1 expression in id2b-/- to control. This would ensure that the loss of id2b affects nrg1 expression derived from endocardial cells rather than other cell types. In Supp Figure S6: Suggest adding an image of the UMAP projection to show tcf3b expression in endocardial cells from sequencing analysis.

      We thank the reviewer for these helpful suggestions. In response, we have increased the font size of gene names in the leftmost panel of Figure 6A to improve readability. Regarding nrg1 expression, we acknowledge the importance of assessing its cell-type specificity. Unfortunately, due to the lack of reliable transgenic or knock-in tools for nrg1, its precise expression pattern in embryonic hearts remains unclear. We attempted to isolate endocardial cells from embryonic hearts using FACS, but the limited number of cells obtained at this stage precluded reliable qRT-PCR analysis. Nonetheless, our data show that id2b is specifically expressed in endocardial cells, and publicly available single-cell RNA-seq datasets also indicate that nrg1 is predominantly expressed in endocardial cells, but not in myocardial or epicardial cells, during embryonic heart development (Figure 6-figure supplement 1). These findings suggest that id2b may regulate nrg1 expression in a cell-autonomous manner within the endocardium. As suggested, we have also added a UMAP image to Figure 7-figure supplement 1 to show tcf3b expression in endocardial cells, further supporting the cell identity in the single-cell dataset.

      (9) In Fig. 6, Nrg1 knockout shows no gross morphological defects and normal trabeculation in larvae. Could the authors explain why they propose that endocardial id2b promotes nrg1 synthesis, thereby enhancing cardiomyocyte contractile function? Did Nrg1 knockdown with Mo lead to compromised calcium signaling and cardiac contractile function? Nrg2a has been reported to be expressed in endocardial cells in larvae, and its loss leads to heart function defects. Perhaps Nrg2a plays a more important role than Nrg1.

      We thank the reviewer for raising this important point. Although we did not directly test nrg1 knockout in our study, previous reports have shown that genetic deletion of nrg1 in zebrafish does not impair cardiac trabeculation during embryonic stages (Rasouli et al., Nat Commun 2017, PMID: 28485381; Brown et al., J Cell Mol Med 2018, PMID: 29265764). However, reduced trabecular area and signs of arrhythmia were observed in juvenile and adult fish (Brown et al., J Cell Mol Med 2018, PMID: 29265764), suggesting a potential role for nrg1 in maintaining cardiac structure and function later in development. Whether calcium signaling and cardiac contractility are affected at these stages remains to be determined. Given that morpholino-induced knockdown is limited to early embryonic stages, it is not suitable for assessing nrg1 function in juvenile or adult hearts.

      As noted by the reviewer, nrg2a is expressed in endocardial cells, and its deletion has been associated with cardiac defects (Rasouli et al., Nat Commun 2017, PMID: 28485381). To assess its potential involvement in our model, we performed qRT-PCR analysis and observed increased nrg2a expression in id2b mutant hearts (Author response image 1). This upregulation may reflect a compensatory response to the loss of id2b. Therefore, nrg2a is unlikely to play an essential role in mediating the depressed cardiac function in this context.

      Author response image 1.

      Expression levels of nrg2a. qRT-PCR analysis of nrg2a mRNA in id2b+/+ and id2b-/- adult hearts. Data were normalized to the expression of actb1. N=5 biological replicates, with each sample containing two adult hearts.

      (10) In Fig. 7A of the IP experiment, it is recommended that the authors establish a negative control using control IgG corresponding to the primary antibody source. This control helps to differentiate non-specific background signal from specific antibody signal.

      As suggested, we have included an IgG control corresponding to the primary antibody species in the immunoprecipitation (IP) experiment to distinguish specific from non-specific binding. The updated data are presented in Figure 7A of the revised manuscript.

      (11) In Pg. 5, line 115: there is no reference included for previous literature on blebbistatin.

      We have added the corresponding reference (Line 126, Reference #5).

      In Pg. 5, lines 118-119; pg. 6 line 144: It would be beneficial to include a short sentence describing why choosing a tnnt2a morpholino knockdown to help provide mechanistic context to readers.

      We thank the reviewer for the constructive suggestion. In cardiomyocytes, tnnt2a encodes a sarcomeric protein essential for cardiac contraction, and its knockdown is a well-established method for abolishing heartbeat and blood flow in zebrafish embryos, thereby allowing investigation of flow-dependent gene regulation. In the revised manuscript, we have added a sentence and corresponding reference to clarify the rationale for using tnnt2a morpholino in our study (Lines 128-129, Reference #35).

      In Pg. 6, line 140: Results title of "Cardiac contraction promotes endocardial id2b expression through primary cilia but not BMP" is misleading and contradicts the results presented in this section and corresponding figure. For example, the bmp Mo knockdown experiments led to decreased id2b fluorescence and the last statement of this results section contradicts the title that BMP does not promote endocardial id2b in lines 179-180: "Collectively, these results suggest that BMP signaling and blood flow modulate id2b expression in a developmental-stage-dependent manner." It would be helpful to clarify whether BMP signaling is involved in id2b expression or not.

      We apologize for any confusion caused by the section title. Our results demonstrate that id2b expression is regulated by both BMP signaling and biomechanical forces in a developmental-stage-specific manner. Specifically, morpholino-mediated knockdown of bmp2b, bmp4, and bmp7a at the 1-cell stage significantly reduced id2b:eGFP fluorescence at 24 hpf (Figure 3-figure supplement 1A, B), suggesting that id2b is responsive to BMP signaling during early embryonic development. However, treatment with the BMP inhibitor Dorsomorphin during later stages (24-48 or 36-60 hpf) did not significantly alter id2b:eGFP fluorescence intensity in individual endocardial cells, although a modest reduction in total endocardial cell number was noted (Figure 3-figure supplement 1C, D). These results suggest that BMP signaling is required for id2b expression during early development but becomes dispensable at later stages, when biomechanical cues may play a more prominent role. To address this concern and better reflect the data, we have revised the Results section title to: "BMP signaling and cardiac contraction regulate id2b expression". This revised title more accurately reflects the dual regulation of id2b expression (Line 153).

      In line 205: Any speculation on why the hemodynamics was preserved between id2b mutant and WT siblings at 96 hpf?

      As suggested, we have included a sentence to address this observation: “Surprisingly, the pattern of hemodynamics was largely preserved in id2b-/- embryos compared to id2b+/+ siblings at 96 hpf (Figure 4-figure supplement 1E, Video 1, 2), suggesting that the reduced number of endocardial cells in the AVC region was not sufficient to induce functional defects.” (Lines 223-225)

      In line 246: Fig. 6k and 6j are referenced, but should be figure 5k and 5j.

      We have corrected this in the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      The manuscript was overall well explained, aside from a few minor points that would help facilitate reader comprehension:

      (1) The last paragraph of the introduction could be a brief summary of the study.

      We thank the reviewer for this constructive suggestion. As recommended, we have included a paragraph in the Introduction section summarizing our key findings to provide clearer context for the study (Lines 96-100).

      (2) Lines 127-128: 'revealed a substantial recapitulation of the... of endogenous id2b expression' may need to be rephrased.

      We thank the reviewer for the valuable suggestion. In the revised manuscript, we have changed the sentence to: “Comparison of id2b:eGFP fluorescence with in situ hybridization at 24, 48, and 72 hpf revealed that the reporter signal closely recapitulates the endogenous id2b expression pattern.” (Lines 137-139)

      (3) Line 182: '... in a developmental-stage-dependent manner' sounds a bit ambiguous, may need to slightly elaborate/ clarify what this means.

      We thank the reviewer for the helpful comment. To improve clarity, we have revised the statement to: “Collectively, these results suggest that id2b expression is regulated by both BMP and biomechanical signaling, with the relative contribution of each pathway varying across developmental stages.” (Lines 195-197)

      Reviewer #3 (Recommendations for the authors):

      (1) The conclusion that BMP signaling prior to 24 hpf is necessary for id2b expression is not fully supported by the data. How do the authors envision pre-linear heart tube BMP signaling impacting endocardial id2b expression during later chamber stages? Id2b reporter fluorescence can be clearly visualized in the linear heart tube in panel B from Figure 1. Does id2b expression initiate prior to contraction? Can the model be refined by showing when id2b endocardial reporter fluorescence is first observed, and whether this early/pre-contractile expression is dependent on BMP signaling?

      We thank the reviewer for the important comment. As suggested, we performed morpholino-mediated knockdown of bmp2b, bmp4, and bmp7a at the 1-cell stage. Live imaging at 24 hpf showed significantly reduced id2b:eGFP fluorescence compared to controls (Figure 3-figure supplement 1A, B), suggesting that id2b is responsive to BMP signaling during early embryonic development. However, treatment with the BMP inhibitor Dorsomorphin during 24-48 or 36-60 hpf did not significantly impact id2b:eGFP fluorescence intensity in individual endocardial cells, although a reduction in endocardial cell number was observed (Figure 3-figure supplement 1C, D). These results suggest that BMP signaling is essential for id2b expression during early embryonic development, while it becomes dispensable at later stages, when biomechanical cues exert a more significant role.

      (2) Overexpressing tagged versions of TCF3b and Id2b in HEK293 cells is a very artificial way to make the major claim that these two proteins interact in endogenous endocardial cells. Can this be done in zebrafish embryonic or adult hearts?

      We thank the reviewer for this insightful comment. As suggested, we synthesized Flag-id2b and HA-tcf3b mRNA and co-injected them into 1-cell stage zebrafish embryos. We collected 100-300 embryos at 12, 24, and 48 hpf and performed western blot analysis using the same anti-HA and anti-Flag antibodies validated in HEK293 cell experiments. Despite multiple independent attempts, we were unable to detect clear bands of the tagged proteins in zebrafish embryos. We speculate that this could be due to mRNA instability, translational efficiency, or the low abundance of Id2b and Tcf3b proteins. We have acknowledged these technical limitations in the revised manuscript and clarified that the HEK293 cell data support a potential interaction between Id2b and Tcf3b, while confirming their endogenous interaction will require further investigations (Lines 295-296).

      (3) The data presented are consistent with the claim that the tcf3b binding sites are functional upstream of nrg1 to repress its transcription. To fully support this idea, those two sites should be disrupted with gRNAs if possible.

      We thank the reviewer for the valuable suggestion. In response, we attempted to disrupt the tcf3b binding sites using sgRNAs. However, we encountered technical difficulties in identifying sgRNAs that specifically and efficiently target these binding sites without affecting adjacent regions. Despite these challenges, our luciferase reporter assay, using tcf3b mRNA overexpression and morpholino knockdown, clearly demonstrated that tcf3b binds to and regulates the nrg1 promoter region. Nevertheless, we acknowledge that future studies using genome editing will be necessary to validate the direct binding of tcf3b to the nrg1 promoter.

      Minor Points:

      (1) Must remove all of the "data not shown" statements and add the primary data to the Supplemental Figures.

      As suggested, we have removed all of the “data not shown” statements and added the original data to the revised manuscript (Figure 4E, middle panels, and Figure 4-figure supplement 1F).

      (2) Must present the order of the panels in the figure as they are presented in the text. One example is Figure 6 where 6E is discussed in the text before 6C and 6D.

      We thank the reviewer for bringing up this important point. We have carefully revised the manuscript to ensure that the order of figure panels matches the sequence in which they are discussed in the text. Specifically, we have reorganized the presentation of Figure 6 panels to align with the text flow, discussing panels 6C and 6D before panel 6E. The figure and corresponding text have been updated accordingly.

      (3) Change the italicized gene names (e.g. tcf3b) to non-italicized names with the first letter capitalized (e.g. Tcf3b) when referencing the protein.

      As suggested, we have revised the manuscript to use non-italicized names with the first letter capitalized when referring to proteins.

      (4) All bar graphs should be replaced with dot bar graphs.

      We have replaced all bar graphs with dot bar graphs throughout the manuscript.

      (5) The new id2b mutant allele should be validated as a true null using quantitative RT-PCR to show that the message becomes destabilized through non-sense mediated decay or by immunostaining/western blot analysis if there is a zebrafish Id2b-specific antibody available.

      We thank the reviewer for this important suggestion. We have performed qRT-PCR analysis and detected a significant reduction in id2b mRNA levels in id2b<sup>-/-</sup> compared to id2b<sup>+/+</sup> controls. These new results are presented in Figure 4A of the revised manuscript.

      (6) Was tricaine used to anesthetize embryos for capturing heart rate and percent fractional area change? This analysis should be performed with no or very limited tricaine as it affects heart rate and systolic function. These parameters were captured at 120 hpf, but the authors should also look earlier at 72 hpf, a time when valves are not present but calcium transients are necessary to support heart function.

      We thank the reviewer for this important comment. When performing live imaging to assess cardiac contractile function, we used low-dose tricaine (0.16 mg/mL) to anesthetize the zebrafish embryos. We have included this important information in the Methods section (Line 503). As suggested, we have also included the heart function results at 72 hpf, which are now presented in Figure 5-figure supplement 2A-C of the revised manuscript.

      (7) The alpha-actinin staining in Figure 5-supplement 2D is very pixelated and unconvincing. This should be repeated and imaged at a higher resolution.

      As suggested, we have re-performed the α-actinin staining and acquired higher-resolution images. The updated results are now presented in Figure 5-figure supplement 2G of the revised manuscript.

      (8) The authors claim that reductions in id2b mutant heart contractility are due to perturbed calcium transients instead of sarcomere integrity. Why do the authors think that regulation of calcium dynamics was not observed in the DEG enriched GO-terms? Was significant downregulation of cacna1 identified in the bulk RNAseq?

      We thank the reviewer for raising this important point. In our bulk RNAseq dataset comparing id2b mutant and control hearts, GO term enrichment was primarily associated with pathways related to cardiac muscle contraction and heart contraction (Figure 5-figure supplement 1B). We speculate that the transcriptional changes related to calcium dynamics may be relatively subtle and thus were not captured as significantly enriched GO terms. In addition, our qRT-PCR analysis revealed a significant reduction in cacna1c expression in id2b mutant hearts compared to controls, suggesting that id2b deletion impairs calcium channel expression. However, this change was not detected by RNA-seq, likely due to limitations in sensitivity.

      (9) In line 277, the authors say, "To determine whether this interaction occurs in zebrafish, Flag-id2b and HA-tcf3b were co-expressed in HEK293 cells...". This should be re-phrased to, "To determine if zebrafish Id2b and Tcf3b interact in vitro, Flag-id2b and HA-tcf3b were co-expressed in HEK293 cells for co-immunoprecipitation analysis." The sentence in line 275 should be changed to, "....heterodimer with Tcf3b to limit its function as a potent transcriptional repressor."

      We thank the reviewer for these constructive comments and have revised the text accordingly (Lines 291-294).

      (10) Small text corrections or ideas:

      Line 63: emphasized

      We have corrected this in the revised manuscript.

      Line 71: studied signaling pathways

      We have corrected this in the revised manuscript.

      Line 106: the top 6 DEGS (I think that the authors mean top 6 GO-terms) and is Id2b in one of the enriched GO categories?

      id2b is one of the top DEGs. This point has been clarified in the revised manuscript (Lines 116-117).

      Line 125: a knockin id2b:eGFP reporter line

      We have corrected this in the revised manuscript (Line 136).

      Line 138: This paragraph could use a conclusion sentence.

      We have added a conclusion sentence in the revised manuscript (Lines 150-151).

      Line 190: id2b-/- zebrafish experienced early lethality

      We have revised the statement as suggested (Line 206).

      Line 193: The prominent enlargement of the atrium with a smaller ventricle has been characterized as cardiomyopathy in zebrafish (Weeks et al., Cardiovasc Res, 2024, PMID: 38900908), which has also been associated with disruptions in calcium transients (Kamel et al., J Cardiovasc Dev Dis, 2021, PMID: 33924051 and Kamel et al., Nat Commun, 2021, PMID: 34887420). This information should be included in the text along with these references.

      We thank the reviewer for this helpful suggestion. We have incorporated these important references into the revised manuscript and included the relevant information to acknowledge the established link between atrial enlargement, cardiomyopathy, and disrupted calcium transients in zebrafish models (Reference #41, 42, and 45; Lines 210 and 260).

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      [...] Weaknesses

      Showing that A-2 and especially A-3 are outliers in the PCA analysis is useful, but it may be hiding other interesting signals in the data. The other strains are remarkably colinear on these plots, hinting that if the outliers were removed, one main component would emerge along which they are situated. It also seems possible that this additional analysis step would allow the second dimension to better differentiate them in a way that is interesting with respect to their mutator status or mutations in key metabolic or regulatory genes.

      We thank the reviewer for their positive comments and their constructive feedback on the manuscript. Following the reviewer’s recommendation, we performed the PCA analysis on the metabolism data after removing the A-2 and A-3 data. We have detailed those results below. Consistent with a similar analysis performed on RNA-seq datasets in our previous publication, we find that removing these outliers has only a modest effect on separating mutators from non-mutators. We find that, while the new PC2 separates most mutators from the non-mutators, the separation is rather weak. Moreover, we do not see a similar distinction when looking at metabolic data in the stationary phase. In the interest of improving the readability of the manuscript, we recommend not including these analyses in the final manuscript. We have presented the data for the reviewer’s benefit in Author response images 1, 2, and 3.

      Author response image 1.

      Author response image 2.

      Author response image 3.

      There is a missed opportunity to connect some key results to what is known about LTEE mutations that reduce the activity of pykF (pyruvate kinase I). This gene is mutated in all 12 LTEE populations, and often these mutations are frameshifts or transposon insertions that should completely knock out its activity. At first glance, inactivating an enzyme for a step in glycolysis does not make sense when the nutrient source in the growth medium is glucose, even though PykF is only one of two isozymes E. coli encodes for this reaction. There has been speculation that inactivating pykF increases the concentration of phosphoenolpyruvate (PEP) in cells and that this can lead to increased rates of glucose import because PEP is used by the phosphotransferase system of E. coli to import glucose (see https://doi.org/10.1002/bies.20629). The current study has confirmed the higher PEP levels, which is consistent with this model.

      We thank the reviewer for pointing out this missed opportunity. We have expanded the discussion around the role of pykF mutations and the elevated concentrations of PEP observed in our data in section 3.4.

      In the introduction, the papers cited to show the importance of changes in metabolism for adaptation do not seem to fit the focus of this study very well. They stress production of toxins and secondary metabolites, which do not seem to be mechanisms that are at work in the LTEE. I can think of two areas of background that would be more relevant: (1) studies of how bacterial metabolism evolves in adaptive laboratory evolution (ALE) experiments to optimize metabolic fluxes toward biomass production (for example, https://doi.org/10.1038/nature01149), and (2) discussions of how cross-feeding, metabolic niche specialization, and metabolic interdependence evolve in microbial communities, including in other evolution experiments (for example, https://doi.org/10.1073/pnas.0708504105 and https://doi.org/10.1128/mBio.00036-12).

      We thank the reviewer for pointing out missed citations in our introduction. We agree that these papers are relevant to the topic and have added their citations. Additionally, following the suggestion of another reviewer, we have reorganized the introduction so that the concept of the role of metabolism in evolution is presented first and the LTEE second.

      Reviewer #2 (Public Review):

      [...] Overall, this is a significant and well-executed research study. It offers new insights into the complex relationship between genetic changes and observable traits in evolving populations and utilizes metabolomics in the LTEE, a novel approach in combination with RNA-seq and mutation datasets.

      However, the paper's overall clarity is lacking. It is spread too thin and covers many topics without a clear focus. I strongly recommend a substantial rewrite of the manuscript, emphasizing structure and readability. The science is well executed, but the current writing does not do it justice.

      We thank the reviewer for their positive comments and their constructive feedback on the lack of clarity in writing. Following the reviewer’s suggestions, we have rewritten parts of the manuscript and reorganized a few sections to improve readability. We hope the revised manuscript is significantly improved.

      Recommendations for the authors

      Reviewer #1 (Recommendations For The Authors):

      1) Title and Abstract: Add the study organism to the abstract, and probably also the title. Currently, E. coli is not mentioned in either! I'm also not sure that the LTEE is a sufficiently well-known acronym to abbreviate this in the title.

      We have revised the title of the manuscript and now spell out LTEE and included E. coli in the title and the abstract.

      2) Abstract: I would switch the usage of metabolome to metabolism in a few more places. For example, "changes in its metabolism", "networked and convoluted nature of metabolism". The metabolome, the concentrations of all metabolites, is what is being measured, but I think of this as a phenotypic readout of how metabolism is evolving.

      We have changed “metabolome” to “metabolism” in cases where we refer to what is evolving and use “metabolome” when we refer to what is being measured.

      3) Line 16: Technically, the 12 LTEE populations were not initially identical. The Ara- differed from the Ara+ ancestors by one intentional mutation and one unintentional mutation that was not discovered until whole genomes were sequenced. I would rephrase this to "where 12 replicate populations of E. coli are propagated" or something similar so that it can be correct without needing to describe this unnecessary detail.

      The line has been rephrased as suggested.

      4) General Note: The text refers to populations as Ara-3 but the figures use A-3. I'd suggest going with A-3 and similar throughout for consistency.

      Instances of “Ara” have been changed to A+/-, and a sentence noting this convention has been added to the introduction.

      5) Lines 43-44, 97-98. My understanding is that both S and L ecotypes in A-2 can use both glucose and acetate, but that the differentiation is related to their specialization that leads to each one being better on one or the other nutrient. The descriptions make it sound like each grows at a different time. Also, by definition, cells are not growing during "stationary phase". The change from glucose utilization (and acetate secretion) to acetate utilization during one cycle of growth is better described as a diauxic shift.

      We have reworded this part to remove mention of “growth” during stationary phase and changed the wording such that it no longer sounds like they grow at different times.

      6) Line 54: The statement "provide the ability to test hypotheses from previous data" is vague. Either provide an example or delete.

      We have removed this sentence as suggested.

      7) Lines 71-72: The terms "interphase" and "intraphase" sound too much like parts of the cell cycle. I'd suggest describing the comparisons as between and within growth phases.

      The use of “intraphase” and “interphase” has been changed as suggested.

      8) Line 79: The citrate is presumably still a chelating agent, so change phrasing to "Citrate is present in the medium because it was originally added as a chelating agent" or something similar.

      This sentence has been rewritten as suggested.

      9) Line 83: Write out "mutation accumulations" so it is easier to understand as "the number of mutations that have accumulated".

      The phrase has been changed as suggested.

      10) Line 116: It's unclear whether the abundances of metabolites are "strategies of survival" in stationary phase. An equally valid explanation is that there is less selection on the metabolome to have a specific composition during stationary phase to have high fitness.

      We have added a line about the possibility for alternative hypotheses.

      11) Figure 1: There seems to be some information missing from the legend. What are R06 and R07 in Panels A and B? Is panel D exponential phase and panel E stationary phase?

      This information was inadvertently missing from the caption and has been added.

      12) Figures 2 and 3: Gene names should be in italics. To me, the gray for deleted genes is hard to tell apart from the blue/red. Perhaps you could put a little X in these boxes instead? I think that having a little triangle pointing from each gene or metabolite name to its corresponding abundance panel would help the reader track which information goes with which features. In Fig. 3 the placement of L-aspartate is a bit awkward. I'd suggest moving it down so the dashed line does not have to go through the abundance panel.

      These figures have been edited to include small triangles that link a gene or metabolite and its heatmap. Additionally, an X has been added where genes have suffered inactivating mutations and the placement of some elements has been moved to improve overall clarity.

      13) Lines 183-185: It would be easier to see and judge the consistency of these argR related relationships if a correlation graph of some kind was shown, probably as a supplemental figure. This plot could, for example, have genes/metabolites across the x-axis and fold-change on the y-axis with lines connecting points corresponding to each of the twelve populations across these categories (like Fig S8 but with lines added). Alternatively, it could be a heat map with the populations across one axis and the genes/metabolites across the other axis (like Fig S3).

      We have added a supplementary figure consisting of heatmaps showing the consistency of these changes within an evolved line. It is now figure S9.

      14) Line 195: I think adding a sentence elaborating on what exactly mutation accumulation means in this context would be helpful to readers.

      We have attempted to clarify the meaning of this by specifically stating that it is due to the accumulation of deleterious mutations.

      15) Line 293: Is standard LTEE medium DM25? These omics experiments with the LTEE sometimes use similar media with different glucose concentrations, and this is a very important detail to precisely specify.

      We reference “standard” LTEE medium in the methods section and have additionally specified the amount of sugar to make it clear that we are not supplementing the media with additional sugar.

      16) Figure S8B. Is "cystine" used instead of "cysteine" on purpose here since the compound is oxidized in the metabolomics treatment?

      The use of cystine is intentional; we detect the oxidized compound.

      Reviewer #2 (Recommendations For The Authors):

      Title:

      The abbreviation "LTEE" should not be in the title. Most readers will not recognize what it means. Instead, either the full name of the experiment, "Long-Term Evolution Experiment with E. coli," should be used, or the title should be rephrased to "Linking genotypic and phenotypic changes during a long-term evolution experiment using metabolomics."

      We have spelled out LTEE and included E. coli in the title.

      Abstract:

      Sentence 1: Consider softening the statement: "Do changes in an organism's environment, genome, or gene expression patterns often lead to changes in its metabolome?"

      We have rephrased this sentence to “Changes in an organism's environment, genome, or gene expression patterns can lead to changes in its metabolism”.

      Sentence 4: Use a hyphen for "Long-Term."

      This addition has been made.

      Sentence 4: Replace "transduce" with a more appropriate term: "...how the effects of mutations can be distributed through a cellular network to eventually affect metabolism and fitness."

      We have rewritten this sentence as “to understand how mutations can eventually affect metabolism and perhaps fitness”.

      Sentence 5: Clarify the use of "both" to refer to the ancestor of the LTEE and its descendant populations as two classes.

      We have reworded this sentence so it’s clear that the ancestors and evolved lines are two separate classes “We used mass-spectrometry to broadly survey the metabolomes of the ancestral strains and all 12 evolved lines…”.

      Sentence 6: Reverse the order for better emphasis: "Our work provides a better understanding of how mutations might affect fitness through the metabolome in the LTEE, and thus provides a major step in developing a complete genotype-phenotype map for this experimental system."

      We have rearranged this sentence per the reviewer’s suggestion.

      Introduction:

      Revise the introduction for clarity, readability, and logical narrative progression. Start with the second paragraph to set up the basic scientific principles being studied and then transition to describing the LTEE as a model system to examine those principles.

      The introduction has been rearranged and reworded in parts to increase clarity.

      Sentence 1: Revise for clarity: "The Long-Term Evolution Experiment (LTEE) has studied 12 initially identical populations of Escherichia coli as they have evolved in a carbon-limited, minimal glucose medium under a daily serial transfer regime."

      Sentence 2: Suggestion: "Begun in 1988, the LTEE populations have evolved for more than 75,000 generations, making it the longest-running experiment of its kind."

      Paragraph 2, sentence 2: Italicize "Drosophila."

      Paragraph 3, sentence 2: Make an important distinction: "Ara-3 is unique in that it evolved the ability to grow aerobically on citrate."

      Paragraph 3, sentence 4: Introduce the IS-mediated loss of the rbs operon in the LTEE as if it has not been described elsewhere.

      These suggestions have been incorporated into the manuscript.

      Results:

      Section 3.1: The use of samples from hours 2 and 24 to represent exponential and stationary phase may present some issues. For instance, hour 2 captures Ara-3 during its exponential growth on glucose, but not citrate. Furthermore, except for Ara-3, the LTEE populations reach stationary phase after approximately 4 hours, and there could be significant differences between early, mid, and late stationary phase. This possibility should be acknowledged, and future follow-up work should consider exploring these differences.

      We have added sentences in the first paragraph of the results section to include these details. We have also added a short paragraph to the conclusions suggesting additional studies of stationary phase, citing work on evolution of E. coli during long term stationary phase.

      Paragraph 3: While Turner et al. 2017 is an essential reference regarding resource use differences between Ara-3 and other LTEE populations, it would be more suitable to reference Blount et al. 2012 for the mutations that enabled access to citrate. Also, it is important to note that the difference lies in the ability to grow aerobically on citrate, rather than the ability to metabolize it.

      This citation has been added.

      Paragraph 4: As mentioned elsewhere, most LTEE populations exhibit balanced polymorphisms. Therefore, it is more appropriate to state that Ara-2 is the best-understood example of long-term diversity. It is likely that there are important metabolic differences between co-existing lineages in other LTEE populations.

      We now refer to Ara-2 as being the best-understood example of long-term diversity.

      Paragraph 5: The first sentence of this paragraph should likely end with "levels."

      The word “levels” was added to the end of this sentence.

      Figure 3: It is preferable to refer to the "Superpathway of arginine and polyamine biosynthesis," citing EcoCyc as a reference, rather than a descriptor.

      This has been changed to a reference.

      Section 3.3, Paragraph 3: While higher intracellular amino acid abundances may facilitate higher translation rates and faster growth, the higher abundances themselves do not evaluate the hypothesis. To evaluate the hypothesis, it is necessary to demonstrate that higher abundances are associated with higher translation or growth rates. Therefore, the final sentence of this paragraph is not meaningful.

      We have reworded this sentence to say that it’s not possible to tell what the additional amino acids are being used for given only this data and that additional experiments are needed to confirm this hypothesis.

      Section 3.4: The first paragraph of this section misstates how evolution works. The low level of glucose in the LTEE does not drive innovation; instead, innovation occurs at random through the introduction of variation by mutation. Although the existence of the citrate resource acts as a reward that selects for variation that provides access to it, it is essential to remember that evolution is blind to such a reward. Moreover, regarding the evolution of the Cit+ trait, it is incorrect to assert that low glucose contributed to its evolution. As shown by Quandt et al. (2015), it seems probable that Cit+ evolution was potentiated by adaptation to specialization on acetate, which is produced by overflow metabolism resulting from rapid growth on glucose. This rapid growth only occurs when glucose is relatively abundant. The level of glucose seems low to us because it is low relative to traditional levels in bacteriological media, but not to the bacteria.

      We agree that this is a semantic but important distinction. We have reworded this part so as not to suggest that evolution has any forward-thinking properties; it is indeed blind to any rewards that might occur as the result of adaptation.

      In general, all instances of "utilize" and its cognates should be replaced with "use" and its cognates.

      Instances of “utilize” have been changed to “use” and its cognates.

      There is some uncertainty about the expectation of ramping up the TCA cycle in the LTEE. Overflow metabolism and acetate production appear to be prevalent in the LTEE, suggesting that many lineages only partially oxidize carbon derived from glucose, thereby bypassing the TCA cycle. While it is possible that this interpretation is incorrect, it would be helpful to see it addressed in the manuscript.

      We agree that this is a plausible hypothesis. We have added a paragraph at the end of this section that discusses the implications of overflow metabolism as an alternative hypothesis.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study provides potentially important, new information about the combination of information from the two eyes in humans. The data included frequency tagging of each eye's inputs and measures reflecting both cortical (EEG) and sub-cortical processes (pupillometry). Binocular combination is of potentially general interest because it provides -in essence- a case study of how the brain combines information from different sources and through different circuits. The strength of supporting evidence appears to be solid, showing that temporal modulations are combined differently than spatial modulations, with additional differences between subcortical and cortical pathways. However, the manuscript's clarity could be improved, including by adding more convincing motivations for the approaches used.

      We thank the editor and reviewers for their detailed comments and suggestions regarding our paper. We have implemented most of the suggested changes. In doing so we noticed a minor error in our analysis code that affected the functions shown in Figure 2e (previously Figure 1e), and have fixed this and rerun the modelling. Our main results and conclusions are unaffected by this change. We have also added a replication data set to the Appendix, as this bears on one of the points raised by a reviewer, and included a co-author who helped run this experiment.

      Reviewer #1 (Public Review):

      In this paper, the interocular/binocular combination of temporal luminance modulations is studied. Binocular combination is of broad interest because it provides a remarkable case study of how the brain combines information from different sources. In addition, the mechanisms of binocular combination are of interest to vision scientists because they provide insight into when/where/how information from two eyes is combined.

      This study focuses on how luminance flicker is combined across two eyes, extending previous work that focused mainly on spatial modulations. The results appear to show that temporal modulations are combined in different ways, with additional differences between subcortical and cortical pathways.

      1. Main concern: subcortical and cortical pathways are assessed in quite different ways. On the one hand, this is a strength of the study (as it relies on unique ways of interrogating each pathway). However, this is also a problem when the results from two approaches are combined - leading to a sort of attribution problem: Are the differences due to actual differences between the cortical and subcortical binocular combinations, or are they perhaps differences due to different methods. For example, the results suggest that the subcortical binocular combination is nonlinear, but it is not clear where this nonlinearity occurs. If this occurs in the final phase that controls pupillary responses, it has quite different implications.

      At the very least, this work should clearly discuss the limitations of using different methods to assess subcortical and cortical pathways.

      The modelling asserts that the nonlinearity is primarily interocular suppression, and that this is stronger in the subcortical pathway. Moreover, the suppression acts before binocular combination, so this is quite a specific location. We now say more about this in the Discussion, and also suggest that fMRI might avoid the limits on the conclusions we can draw from different methods.

      2. Adding to the previous point, the paper needs to do a better job of justifying not only the specific methods but also other details of the study (e.g., why certain parameters were chosen). To illustrate, a semi-positive example: Only page 7 explains why 2Hz modulation was used, while the methods for 2Hz modulation are described in detail on page 3. No justifications are provided for most of the other experimental choices. The paper should be expanded to better explain this area of research to non-experts. A notable strength of this paper is that it should be of interest to those not working in this particular field, but this goal is not achieved if the paper is written for a specialist audience. In particular, the introduction should be expanded to better explain this area of research, the methods should include justifications for important empirical decisions, and the discussion should make the work more accessible again (in addition to addressing the issues raised in point 1 above). The results also need more context. For example, why do EEG data have overtones but pupillometry does not?

      We now explain the choice of frequency in the final paragraph of the introduction as follows:

      ‘We chose a primary flicker frequency of 2Hz as a compromise between the low-pass pupil response (see Barrionuevo et al., 2014; Spitschan et al., 2014), and the relatively higher-pass EEG response (Regan, 1966).’

      We also mention why the pupil response is low-pass:

      ‘The pupil response can be modulated by periodic changes in luminance, and is temporally low-pass (Barrionuevo et al., 2014; Spitschan et al. 2014), most likely due to the mechanical limitations of the iris sphincter and dilator muscles’.

      Reviewer #2 (Public Review):

      Previous studies have extensively explored the rules by which patterned inputs from the two eyes are combined in the visual cortex. Here the authors explore these rules for un-patterned inputs (luminance flicker) at both the level of the cortex, using Steady-State Visual Evoked Potentials (SSVEPs) and at the sub-cortical level using pupillary responses. They find that the pattern of binocular combination differs between cortical and sub-cortical levels with the cortex showing less dichoptic masking and somewhat more binocular facilitation.

      Importantly, the present results with flicker differ markedly from those with gratings (Hou et al., 2020, J Neurosci; Baker and Wade, 2017, Cerebral Cortex; Norcia et al., 2000, Neuroreport; Brown et al., 1999, IOVS). When SSVEP responses are measured under dichoptic conditions where each eye is driven with a unique temporal frequency, in the case of grating stimuli, the magnitude of the response in the fixed contrast eye decreases as a function of contrast in the variable contrast eye. Here the response increases by varying (small) magnitudes. The authors favor a view that cortex and perception pool binocular flicker inputs approximately linearly using cells that are largely monocular. The lack of a decrease below the monocular level when modulation strength increases is taken to indicate that the previously observed normalization mechanism in pattern vision does not play a substantial role in the processing of flicker. The authors present a computational model of binocular combination that captures features of the data when fit separately to each data set. Because the model has no frequency dependence and is based on scalar quantities, it cannot make joint predictions for the multiple experimental conditions, which is one of its limitations.

      A strength of the current work is the use of frequency-tagging of both pupil and EEG responses to measure responses for flicker stimuli at two anatomical levels of processing. Flicker responses are interesting but have been relatively neglected. The tagging approach allows one to access responses driven by each eye, even when the other eye is stimulated which is a great strength. The tagging approach can be applied at both levels of processing at the same time when stimulus frequencies are low, which is an advantage as they can be directly compared. The authors demonstrate the versatility of frequency tagging in a novel experimental design which may inspire other uses, both within the present context and others. A disadvantage of the tagging approach for studying sub-cortical dynamics via pupil responses is that it is restricted to low temporal frequencies given the temporal bandwidth of the pupil. The inclusion of a behavioral measure and a model is also a strength, but there are some limitations in the modeling (see below).

      The authors suggest in the discussion that luminance flicker may preferentially drive cortical mechanisms that are largely monocular and in the results that they are approximately linear in the dichoptic cross condition (no effect of the fixed contrast stimulus in the other eye). By contrast, prior research using dichoptic dual frequency flickering stimuli has found robust intermodulation (IM) components in the VEP response spectrum (Baitch and Levi, 1988, Vision Res; Stevens et al., 1994 J Ped Ophthal Strab; France and Ver Hoeve, 1994, J Ped Ophthal Strab; Suter et al., 1996 Vis Neurosci). The presence of IM is a direct signature of binocular interaction and suggests that at least under some measurement conditions, binocular luminance combination is "essentially" non-linear, where essential implies a point-like non-linearity such as squaring of excitatory inputs. The two views are in striking contrast. It would thus be useful if the authors could show spectra for the dichoptic, two-frequency conditions to see if non-linear binocular IM components are present.

      This is an excellent point, and one that we had not previously appreciated the importance of. We have generated a figure (Fig 8) showing the IM response in the cross frequency conditions. There is a clear response at 0.4Hz in the pupillometry data (2-1.6Hz), and at 3.6Hz in the EEG data (2+1.6Hz). We therefore agree that this shows the system is essentially nonlinear, despite the binocular combination appearing approximately linear. We now say in the Discussion:

      ‘In the steady-state literature, one hallmark of a nonlinear system is the presence of intermodulation responses at the sums and differences of fundamental flicker frequencies (Baitch & Levi, 1988; Tsai et al., 2012). In Figure 8 we plot the amplitude spectra of conditions from Experiment 1 in which the two eyes were stimulated at different frequencies (2Hz and 1.6Hz) but at the same contrast (48%; these correspond to the binocular cross and dichoptic cross conditions in Figures 2d,e and 3d,e). Consistent with the temporal properties of pupil responses and EEG, Figure 8a reveals a strong intermodulation difference response at 0.4Hz (red dashed line), and Figure 8b reveals an intermodulation sum response at 3.6Hz (red dashed line). The presence of these intermodulation terms is predicted by nonlinear gain control models of the type considered here (Baker and Wade, 2017; Tsai et al., 2012), and indicates that the processing of monocular flicker signals is not fully linear prior to the point at which they are combined across the eyes.’

      If the IM components are indeed absent, then there is a question of the generality of the conclusions, given that several previous studies have found them with dichoptic flicker. The previous studies differ from the authors' in terms of larger stimuli and in their use of higher temporal frequencies (e.g. 18/20 Hz, 17/21 Hz, 6/8 Hz). Either retinal area stimulated (periphery vs central field) or stimulus frequency (high vs low) could affect the results and thus the conclusions about the nature of dichoptic flicker processing in cortex. It would be interesting to sort this out as it may point the research in new directions.

      This is a great suggestion about retinal area. As chance would have it, we had already collected a replication data set where we stimulated the periphery, and we now include a summary of this data set as an Appendix. In general the results are similar, though we obtain a measurable (though still small) second harmonic response in the pupillometry data with this configuration, which is a further indication of nonlinear processing.

      Whether these components are present or absent is of interest in terms of the authors' computational model of binocular combination. It appears that the present model is based on scalar magnitudes, rather than vectors as in Baker and Wade (2017), so it would be silent on this point. The final summation of the separate eye inputs is linear in the model. In the first stage of the model, each eye's input is divided by a weighted input from the other eye. If we take this input as inhibitory, then IM would not emerge from this stage either.

      We have performed the modelling using scalar values here for simplicity and transparency, and to make the fitting process computationally feasible (it took several days even done this way). This type of model is quite capable of processing sine waves as inputs, and producing a complex output waveform which is Fourier transformed and then analysed in the same way as the experimental data (see e.g. Tsai, Wade & Norcia, 2012, J Neurosci; Baker & Wade, 2017, Cereb Cortex). However our primary aim here was to fit the model, and make inferences about the parameter values, rather than to use a specific set of parameter values to make predictions. We now say more about this family of models and how they can be applied in the methods section:

      “Models from this family can handle both scalar contrast values and continuous waveforms (Tsai et al., 2012) or images (Meese and Summers, 2007) as inputs. For time-varying inputs, the calculations are performed at each time point, and the output waveform can then be analysed using Fourier analysis in the same way as for empirical data. This means that the model can make predictions for the entire Fourier spectrum, including harmonic and intermodulation responses that arise as a consequence of nonlinearities in the model (Baker and Wade, 2017). However for computational tractability, we performed fitting here using scalar contrast values.”

      As a side point, there are quite a lot of ways to produce intermodulation terms, meaning they are not as diagnostic as one might suppose. We demonstrate this in Author response image 1, which shows the Fourier spectra produced by a toy model that multiplies its two inputs together (for an interactive python notebook that allows various nonlinearities to be explored, see here). Intermodulation terms also arise when two inputs of different frequencies are summed, followed by exponentiation. So it would be possible to have an entirely linear binocular summation process, followed by squaring, and have this generate IM terms (not that we think this is necessarily what is happening in our experiments).

      Author response image 1
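
      The point about multiplication and squaring can be illustrated with a minimal sketch. This is not the notebook referred to above and not the authors' analysis code; the sampling rate, the unit amplitudes, the detection threshold and the helper name `peak_frequencies` are all assumptions made for illustration. It builds two sinusoids at the frequencies used in Experiment 1 (2Hz and 1.6Hz) over a 10-second window and lists the frequencies that carry energy after three different combination rules.

```python
import numpy as np

fs = 100.0        # sampling rate in Hz (assumed for illustration)
duration = 10.0   # seconds, matching the 10-second trial mentioned in the response
n = int(fs * duration)
t = np.arange(n) / fs

f1, f2 = 2.0, 1.6  # flicker frequencies from Experiment 1
left = np.sin(2 * np.pi * f1 * t)
right = np.sin(2 * np.pi * f2 * t)


def peak_frequencies(signal, threshold=0.1):
    """Return the non-DC frequencies (Hz) whose amplitude exceeds the threshold."""
    amplitude = 2 * np.abs(np.fft.rfft(signal)) / len(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
    keep = amplitude[1:] > threshold          # skip the DC bin
    return [float(f) for f in np.round(freqs[1:][keep], 1)]


linear_sum = left + right            # purely linear combination
product = left * right               # toy multiplicative model
squared_sum = (left + right) ** 2    # linear summation followed by squaring

print("linear sum :", peak_frequencies(linear_sum))
print("product    :", peak_frequencies(product))
print("squared sum:", peak_frequencies(squared_sum))
```

      Under these assumptions the linear sum contains only the two fundamentals, the product contains only the difference (0.4Hz) and sum (3.6Hz) terms, and the squared sum contains those two terms plus the second harmonics at 3.2Hz and 4Hz. Very different architectures can therefore produce intermodulation terms at the same frequencies.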

      Related to the model: One of the more striking results is the substantial difference between the dichoptic and dichoptic-cross conditions. They differ in that the latter has two different frequencies in the two eyes while the former has the same frequency in each eye. As it stands, if fit jointly on the two conditions, the model would make the same prediction for the dichoptic and dichoptic-cross conditions. It would also make the same prediction whether the two eyes were in-phase temporally or in anti-phase temporally. There is no frequency/phase-dependence in the model to explain differences in these cases or to potentially explain different patterns at the different VEP response harmonics. The model also fits independently to each data set which weakens its generality. An interpretation outside of the model framework would thus be helpful for the specific case of differences between the dichoptic and dichoptic-cross conditions.

      As mentioned above, the limitations the reviewer highlights are features of the specific implementation, rather than the model architecture in general. Furthermore, although this particular implementation of the model does not have separate channels for different phases, these can be added (see e.g. Georgeson et al., 2016, Vis Res, for an example in the spatial domain). In future work we intend to explore the phase relationship of flicker, but do not have space to do this here.

      Prior work has defined several regimes of binocular summation in the VEP (Apkarian et al., 1981 EEG Journal). It would be useful for the authors to relate the use of their terms "facilitation" and "suppression" to these regimes and to justify/clarify differences in usage, when present. Experiment 1, Fig. 3 shows cases where the binocular response is more than twice the monocular response. Here the interpretation is clear: the responses are super-additive and would be classed as involving facilitation in the Apkarian et al framework. In the Apkarian et al framework, a ratio of 2 indicates independence/linearity. Ratios between 1 and 2 indicate sub-additivity and are diagnostic of the presence of binocular interaction but are noted by them to be difficult to interpret mechanistically. This should be discussed. A ratio of <1 indicates frank suppression which is not observed here with flicker.

      Operationally, we use facilitation to mean an increase in response relative to a monocular baseline, and suppression to mean a decrease in response. We now state this explicitly in the Introduction. Facilitation greater than a factor of 2 indicates some form of super-additive summation. In the context of the model, we also use the term suppression to indicate divisive suppression between channels, however this feature does not always result in empirical suppression (it depends on the condition, and the inhibitory weight). We think that interpretation of results such as these is greatly aided by the use of a computational modelling framework, which is why we take this approach here. The broad applicability of the model we use in the domain of spatial contrast lends it credibility for our stimuli here.

      Can the model explore the full range of binocular/monocular ratios in the Apkarian et al framework? I believe much of the data lies in the "partial summation" regime of Apkarian et al and that the model is mainly exploring this regime and is a way of quantifying varying degrees of partial summation.

      Yes, in principle the model can produce the full range of behaviours. When the weight of suppression is 1, binocular and monocular responses are equal. When the weight is zero, the model produces linear summation. When the weight is greater than 1, suppression occurs. It is also possible to produce super-additive summation effects, most straightforwardly by changing the model exponents. However this was not required for our data here, and so we kept these parameters fixed. We agree that the model is a good way to unify the results across disparate experimental paradigms, and that is our main intention with Figure 7i.
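
      The relationship between the suppression weight and the binocular-to-monocular ratio can be sketched as follows. The exact model equation and fitted parameter values are not given in this response, so the divisive form below, the function names, the saturation constant `z` and the exponent `p` are assumptions chosen only to reproduce the qualitative behaviour described above.

```python
def eye_response(c_this, c_other, w, z=0.001, p=1.0):
    """Response of one eye's channel, divided by its own input plus a weighted
    input from the other eye (assumed form, not the authors' exact equation)."""
    return c_this ** p / (z + c_this + w * c_other)


def binocular_to_monocular_ratio(c, w, z=0.001, p=1.0):
    """Ratio of the summed response with both eyes at contrast c
    to the summed response with one eye at contrast c and the other at zero."""
    binocular = eye_response(c, c, w, z, p) + eye_response(c, c, w, z, p)
    monocular = eye_response(c, 0.0, w, z, p) + eye_response(0.0, c, w, z, p)
    return binocular / monocular


for weight in (0.0, 1.0, 2.0):
    print(weight, round(binocular_to_monocular_ratio(1.0, weight), 3))
```

      With a negligible saturation constant the ratio is close to 2 when the weight is zero (linear summation), close to 1 when the weight is 1 (binocular and monocular responses equal), and below 1 when the weight exceeds 1 (suppression). Super-additive ratios would require changing the exponents, which this sketch keeps fixed.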

      Reviewer #3 (Public Review):

      This manuscript describes interesting experiments on how information from the two eyes is combined in cortical areas, sub-cortical areas, and perception. The experimental techniques are strong and the results are potentially quite interesting. But the manuscript is poorly written and tries to do too much in too little space. I had a lot of difficulty understanding the various experimental conditions, the complicated results, and the interpretations of those results. I think this is an interesting and useful project so I hope the authors will put in the time to revise the manuscript so that regular readers like myself can better understand what it all means.

      Now for my concerns and suggestions:

      The experimental conditions are novel and complicated, so readers will not readily grasp what the various conditions are and why they were chosen. For example, in one condition different flicker frequencies were presented to the two eyes (2Hz to one and 1.6Hz to the other) with the flicker amplitude fixed in the eye presented to the lower frequency and the flicker amplitude varied in the eye presented to the higher frequency. This is just one of several conditions that the reader has to understand in order to follow the experimental design. I have a few suggestions to make it easier to follow. First, create a figure showing graphically the various conditions. Second, come up with better names for the various conditions and use those names in clear labels in the data figures and in the appropriate captions. Third, combine the specific methods and results sections for each experiment so that one will have just gone through the relevant methods before moving forward into the results. The authors can keep a general methods section separate, but only for the methods that are general to the whole set of experiments.

      We have created a new figure (now Fig 1) that illustrates the conditions from Experiment 1, and is referenced throughout the paper. We have kept the names constant, as they are rooted in a substantial existing literature, and it will be confusing to readers familiar with that work if we diverge from these conventions. We did consider separating out the methods section, but feel it helps the flow of the results section to keep it as a single section.

      I wondered why the authors chose the temporal frequencies they did. Barrionuevo et al (2014) showed that the human pupil response is greatest at 1Hz and is nearly a log unit lower at 2Hz (i.e., the change in diameter is nearly a log unit lower; the change in area is nearly 2 log units lower). So why did the authors choose 2Hz for their primary frequency? And why did the authors choose 1.6Hz which is quite close to 2Hz for their off frequency? The rationale behind these important decisions should be made explicit.

      We now explain this in the Introduction as follows:

      ‘We chose a primary flicker frequency of 2Hz as a compromise between the low-pass pupil response (see Barrionuevo et al., 2014; Spitschan et al., 2014), and the relatively higher-pass EEG response (Regan, 1966).’

      It is a compromise frequency that is not optimal for either modality, but generates a measurable signal for both. The choice of 1.6 Hz was for similar reasons - for a 10-second trial it is four frequency bins away from the primary frequency, so can be unambiguously isolated in the spectrum.
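
      The bin arithmetic behind this choice can be checked with a few lines. The only inputs are the 10-second trial duration and the two frequencies stated above; the variable names are illustrative.

```python
trial_duration = 10.0                 # seconds, as stated above
resolution = 1.0 / trial_duration     # spacing between Fourier bins in Hz

primary_hz, secondary_hz = 2.0, 1.6
primary_bin = round(primary_hz / resolution)
secondary_bin = round(secondary_hz / resolution)

print("resolution (Hz):", resolution)
print("bins apart:", primary_bin - secondary_bin)
```

      A 10-second window gives a spectral resolution of 0.1Hz, so both frequencies complete a whole number of cycles per trial and fall on exact bins, four bins apart.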

      By the way, I wondered if we know what happens when you present the same flicker frequencies to the two eyes but in counter-phase. The average luminance seen binocularly would always be the same, so if the pupil system is linear, there should be no pupil response to this stimulus. An experiment like this has been done by Flitcroft et al (1992) on accommodation where the two eyes are presented stimuli moving oppositely in optical distance and indeed there was no accommodative response, which strongly suggests linearity.

      We have not tried this yet, but it’s on our to-do list for future work. The accommodation work is very interesting, and we now cite it in the manuscript as follows:

      ‘Work on the accommodative response indicates that binocular combination there is approximately linear (Flitcroft et al. 1992), and can even cancel when signals are in antiphase (we did not try this configuration here).’

      Figures 1 and 2 are important figures because they show the pupil and EEG results, respectively. But it's really hard to get your head around what's being shown in the lower row of each figure. The labeling for the conditions is one problem. You have to remember how "binocular" in panel c differs from "binocular cross" in panel d. And how "monocular" in panel d is different than "monocular 1.6Hz" in panel e. Additionally, the colors of the data symbols are not very distinct so it makes it hard to determine which one is which condition. These results are interesting. But they are difficult to digest.

      We hope that the new Figure 1 outlining the conditions has helped with interpretation here.

      The authors make a strong claim that they have found substantial differences in binocular interaction between cortical and sub-cortical circuits. But when I look at Figures 1 and 2, which are meant to convey this conclusion, I'm struck by how similar the results are. If the authors want to continue to make their claim, they need to spend more time making the case.

      Indeed, it is hard to make direct comparisons across figures - this is why Figure 4 plots the ratio of binocular to monocular conditions, and shows a clear divergence between the EEG and pupillometry results at high contrasts.

      Figure 5 is thankfully easy to understand and shows a very clear result. These perceptual results deviate dramatically from the essentially winner-take-all results for spatial sinewaves shown by Legge & Rubin (1981); whom they should cite by the way. Thus, very interestingly the binocular combination of temporal variation is quite different than the binocular combination of spatial variation. Can the pupil and EEG results also be plotted in the fashion of Figure 5? You'd pick a criterion pupil (or EEG) change and use it to make such plots.

      We now cite Legge & Rubin. We see what you mean about plotting the EEG and pupillometry results in the same coordinates as the matching data, but we don’t think this is especially informative as we would end up only with data points along the axes and diagonal of the plot, without the points at other angles. This is a consequence of how the experiments were conducted.

      My main suggestion is that the authors need to devote more space to explaining what they've done, what they've found, and how they interpret the data. I suggest therefore that they drop the computational model altogether so that they can concentrate on the experiments. The model could be presented in a future paper.

      We feel that the model is central to the understanding and interpretation of our results, and have retained it in the revised version of the paper.

      Reviewer #2 (Recommendations For The Authors):

      I found the terms for the stimulus conditions confusing. I think a simple schematic diagram of the conditions would help the reader.

      Now added (the new Fig 1).

      In reporting the binocular to monocular ratio, please clarify whether the monocular data was from one eye alone (and how that eye was chosen) or from both eyes and then averaged, or something else. It would be useful to plot the results from the dichoptic condition in this form, as well.

      These were averaged across both eyes. We now say in the Methods section:

      ‘We confirmed in additional analyses that the monocular consensual pupil response was complete, justifying our pooling of data across the eyes.’

      Also, clarify whether the term facilitation is used as above throughout (facilitation being > 2 times monocular response under binocular condition) or if a different criterion is being used. If we take facilitation to mean a ratio > 2, then facilitation depends on temporal frequency in Figure 4.

      We now explain our use of these terms in the final paragraph of the Introduction:

      ‘Relative to the response to a monocular signal, adding a signal in the other eye can either increase the response (facilitation) or reduce it (suppression).’

      The magnitude of explicit facilitation attained is interesting, but not without precedent. Ratios of binocular to mean monocular > 2, have been reported previously and values of summation depend strongly on the stimulus used (see for example Apkarian et al., EEG Journal, 1981, Nicol et al., Doc Ophthal, 2011).

      We now mention this in the Discussion as follows:

      ‘(however we note that facilitation as substantial as ours has been reported in previous EEG work by Apkarian et al. (1981))’

      In Experiment 3, the authors say that the psychophysical matching results are consistent with the approximately linear summation effects observed in the EEG data of Experiment 1. In describing Fig. 3, the claim is that the EEG is non-linear, e.g. super-additive - at least at high contrasts. Please reconcile these statements.

      We think that the ‘superadditive’ effects are close enough to linear that we don’t want to make too much of a big deal about them - this could be measurement error, for example. So we use terms such as near-linear, or approximately linear, when referring to them throughout.

      Reviewer #3 (Recommendations For The Authors):

      Let me make some more specific comments using a page/paragraph/line format to indicate where in the text they're relevant.

      1/2 (middle)/3 from end. "In addition" seems out of place here.

      Removed.

      1/3/4. By "intensities" do you mean "contrasts"?

      Fixed.

      1/3/last. "... eyes'...".

      Fixed.

      2/5/3. By "one binocular disc", you mean into "one perceptually fused disc".

      Rewritten as: ‘to help with their perceptual fusion, giving the appearance of a single binocular disc’

      3/1/1. "calibrated" seems like the wrong word here. I think you're just changing the vergence angle to enable fusion, right?

      Now rewritten as: ‘Before each experiment, participants adjusted the angle of the stereoscope mirrors to achieve binocular fusion’

      3/1/1. "adjusting the angles...". And didn't changing the mirror angles affect the shapes of the discs in the retinal images?

      Perhaps very slightly, but this is well within the tolerance of the visual system to compensate for in the fused image, especially for such high contrast edges.

      3/3/5. "fixed contrast" is confusing here because it's still a flickering stimulus if I follow the text here. Reword.

      Now ‘fixed temporal contrast’

      3/4/1. It would be clearer to say "pupil tracker" rather than "eye tracker" because you're not really doing eye tracking.

      True, but the device is a commercial eye tracker, so this is the appropriate term regardless of what we are using it for.

      3/5/6. I'm getting lost here. "varying contrast levels" applies to the dichoptic stimulus, right?

      Yes, now reworded as ‘In the other interval, a target disc was displayed, flickering at different contrast levels on each trial, but with a fixed interocular contrast ratio across the block.’

      3/5/7. Understanding the "ratio of flicker amplitudes" is key to understanding what's going on here. More explanation would be helpful.

      Addressed in the above point.

      4/3/near end. Provide some explanation about why the Fourier approach is more robust to noise.

      Added ‘(which can make the phase and amplitude of a fitted sine wave unstable)’

      Figure 1. In panel a, explain what the numbers on the ordinate mean. What's zero, for example? Which direction is dilation? Same question for panel b. It's interesting in panel c that the response in one eye to 2Hz increases when the other eye sees 1.6Hz. Would be good to point that out in the text.

      Good idea about panel (a) - we have changed the y-axis to ‘Relative amplitude’ for clarity, and now note in the figure caption that ‘Negative values indicate constriction relative to baseline, and positive values indicate dilation.’ Panel (b) is absolute amplitude, so is unsigned. Panel (c) only contains 2Hz conditions, but there is some dichoptic suppression across the two frequencies in panels (d,e) - we now cover this in the text and include statistics.

      6/2/1. Make clear in the text that Figure 1c shows contrast response functions for the pupil.

      Now noted in the caption.

      Figure 3. I'm lost here. I feel like I should be able to construct this figure from Figures 1 and 2, but don't know how. More explanation is needed at least in the caption.

      Done. The caption now reads:

      ‘Ratio of binocular to monocular response for three data types. These were calculated by dividing the binocular response by the monocular response at each contrast level, using the data underlying Figures 2c, 3c and 3f. Each value is the average ratio across N=30 participants, and error bars indicate bootstrapped standard errors.’
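
      The calculation described in this caption can be sketched as below. The amplitudes are synthetic placeholders, not the study's data, and the number of contrast levels, the number of resamples and the choice to compute the ratio per participant before averaging are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_participants = 30   # N stated in the caption
n_contrasts = 5       # placeholder number of contrast levels

# Synthetic placeholder amplitudes (participants x contrast levels).
monocular = rng.uniform(0.5, 1.5, size=(n_participants, n_contrasts))
binocular = monocular * rng.uniform(1.0, 2.5, size=(n_participants, n_contrasts))

# Ratio at each contrast level for each participant, then the group average.
ratios = binocular / monocular
mean_ratio = ratios.mean(axis=0)


def bootstrap_se(values, n_resamples=1000, rng=rng):
    """Standard error of the mean, estimated by resampling participants with replacement."""
    n = values.shape[0]
    idx = rng.integers(0, n, size=(n_resamples, n))
    resampled_means = values[idx].mean(axis=1)   # shape: (n_resamples, n_contrasts)
    return resampled_means.std(axis=0, ddof=1)


se = bootstrap_se(ratios)
print(mean_ratio)
print(se)
```

      Resampling whole participants keeps each person's values across contrast levels together, which is one common way to obtain bootstrapped standard errors for a within-subject design.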

      9/1/1-2. I didn't find the evidence supporting this statement compelling.

      We now point the reader to Figure 4 as a reminder of the evidence for this difference.

      9/1/6-9. You said this. But this kind of problem can be fixed by moving the methods sections as I suggested above.

      As mentioned, we feel that the results section flows better with the current structure.

      Figure 4. Make clear that this is EEG data.

      Now added to caption.

      Figure 5 caption. Infinite exponent in what equation?

      Now clarified as: ‘models involving linear combination (dotted) or a winner-take-all rule (dashed)’

      Figure 6. I hope this gets dropped. No one will understand how the model predictions were derived. And those who look at the data and model predictions will surely note (as the authors do) that they are rather different from one another.

      As noted above, we feel that the model is central to the paper and have retained this figure. We have also worked out how to correct the noise parameter in the model for the number of participants included in the coherent averaging, which fixes the discrepancy at low contrasts. The correspondence between the data and model is now very good, and we have plotted the data points and curves in the same panels, which makes the figure less busy.

      12/1. Make clear in this paragraph that "visual cortex" is referring to EEG and perception results and that "subcortical" is referring to pupil. Explain clearly what "linear" would be and what the evidence for "non-linear" is.

      Good suggestion, we have added qualifiers linking to both methods. Also tidied up the language to make it clearer that we are talking about binocular combination specifically in terms of linearity, and spelled out the evidence for each point.

      12/2/6-9. Explain the Quaia et al results enough for the reader to know what reflexive eye movements were studied and how.

      We now specify that these eye movements are also known as the ‘ocular following response’ and were measured using scleral search coils.

      12/2/9-10. Same for Spitschan and Cajochen: more explanation.

      Added:

      “(melatonin is a hormone released by the pineal gland that regulates sleep; its production is suppressed by light exposure and can be measured from saliva assays)”

      12/3/2-3. Intriguing statements about optimally combining noisy signals, but explain this more. It won't be obvious to most readers.

      We have added some more explanation to this section.

      13/1. This is an interesting paragraph where the authors have a chance to discuss what would be most advantageous to the organism. They make the standard argument for perception, but basically punt on having an argument for the pupil.

      Indeed, we agree that this point is necessarily speculative, however we think it is interesting for the reader to consider.

      13/2/1. "Pupil size affects the ..." is more accurate.

      Fixed.

      13/2/2 from end. Which "two pathways"? Be clear.

      Changed to ‘the pupil and perceptual pathways’

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) The mechanism by which STAMBPL1 mediates GRHL3 transcription through its interaction with FOXO1 is not sufficiently discussed, especially in relation to how STAMBPL1 regulates FOXO1. Some reported effects are modest.

      We appreciate the reviewer’s comments. In response, we have added a discussion of the potential mechanisms by which STAMBPL1 regulates FOXO1 transcriptional activity to the Discussion, highlighted in red on page 18, lines 342 to 352. The added text is as follows: “The transcriptional activity of FOXO1 is primarily regulated by its nucleocytoplasmic shuttling process (Van Der Heide, Hoekman et al. 2004). The PI3K/AKT pathway promotes the phosphorylation of FOXO1, resulting in the formation of a complex with members of the 14-3-3 family (including 14-3-3σ, 14-3-3ε, and 14-3-3ζ), which facilitates its export from the nucleus and inhibits its transcriptional activity (Huang and Tindall 2007, Tzivion, Dobson et al. 2011). It has been reported that TDAG51 prevents the binding of 14-3-3ζ to FOXO1 in the nucleus by interacting with FOXO1, thereby enhancing its transcriptional activity through increased accumulation within the nucleus (Park, Jeon et al. 2023). Our results indicate that the overexpression of STAMBPL1 and STAMBPL1-E292A did not affect the protein levels of FOXO1 (Fig.7E and Fig.S5E), but STAMBPL1 co-localizes with FOXO1 in the nucleus (Fig.7M) and interacts with it (Fig.7N and Fig.S5I-J). This suggests that STAMBPL1 enhances the transcriptional activity of FOXO1 on GRHL3 by interacting with nuclear FOXO1.” The result was added to Supplementary Figure 5 as Fig.S5E.

      Reviewer #2 (Public review):

      (1) A potential limitation of the study is the reliance on specific cellular and animal models, which may constrain the extrapolation of these findings to the broader spectrum of human TNBC biology. Furthermore, while the study provides evidence for a novel regulatory axis involving STAMBPL1, FOXO1, and GRHL3, the multifaceted nature of angiogenesis may implicate additional regulatory factors not exhaustively addressed in this research.

      We appreciate the valuable suggestions provided by the reviewer. In the Discussion, we have added an in-depth discussion of the limitations of the study, as well as an analysis of the regulatory factors related to tumor angiogenesis, which is highlighted in red on pages 20 to 21, lines 396 to 412. The relevant content added is as follows: “In this study, we utilized two triple-negative breast cancer cell lines, HCC1806 and HCC1937, along with human primary umbilical vein endothelial cells (HUVECs) and a nude mouse breast orthotopic transplantation tumor model to investigate the regulatory mechanism by which STAMBPL1 activates the GRHL3/HIF1α/VEGFA signaling pathway through its interaction with FOXO1, thereby promoting angiogenesis in TNBC. The results of this study have certain limitations regarding their applicability to human TNBC biology. Furthermore, in addition to the HIF1α/VEGFA signaling pathway emphasized in this study, tumor cells can continuously release or upregulate various pro-angiogenic factors, such as Angiopoietin and FGF, which activate endothelial cells, pericytes (PCs), cancer-associated fibroblasts (CAFs), endothelial progenitor cells (EPCs), and immune cells (ICs). This leads to capillary dilation, basement membrane disruption, extracellular matrix remodeling, pericyte detachment, and endothelial cell differentiation, thereby sustaining a highly active state of angiogenesis (Liu, Chen et al. 2023). It is important to collect clinical TNBC tissue samples in the future to analyze the expression of the STAMBPL1/FOXO1/GRHL3/HIF1α/VEGFA signaling axis. Furthermore, patient-derived organoid and xenograft models are useful to elucidate the regulatory relationship of this axis in TNBC angiogenesis.”

      Reviewer #3 (Public review):

      The main weaknesses of this work are that the relevance of this molecular axis to the pathogenesis of TNBC is not clear, and it is not clearly established whether this is a regulatory pathway that occurs in hypoxic conditions or independently of oxygen levels.

      (1) With respect to the first point, both FOXO1 and GRHL3 have been previously described as tumor suppressors, with reports of FOXO1 inhibiting tumor angiogenesis. Therefore, this work describes an apparently contradictory function of these proteins in TNBC. While it is not surprising that the same genes perform divergent functions in different tumor contexts, stronger evidence in support of the oncogenic function of these two genes should be provided to make the data more convincing. As an example, the data in support of high STAMBPL1, FOXO1 and GRHL3 gene expression in TNBC TCGA specimens provided in Figure 8 are not very strong, and it is not clear what the non-TNBC specimens are (whether other breast cancers or other tumors, perhaps those tumors where these genes perform tumor suppressive functions). To strengthen the notion that STAMBPL1, FOXO1 and GRHL3 are overexpressed in TNBC, the authors could provide a comparison with normal tissue, as well as the analysis of other publicly available datasets (like the NCI Clinical Proteomic Tumor Analysis Consortium as an example). Finally, it is not clear what the basal protein expression levels of STAMBPL1 are in the cell lines used in this study, as based on the data presented in Figures 2D and F it appears that the protein is not expressed if not exogenously overexpressed. It would be helpful if the authors addressed this issue and provided further evidence of STAMBPL1 expression in TNBC cell lines.

      We appreciate the suggestions. In this study, we utilized the BCIP online tool to analyze the Metabric database, incorporating adjacent normal tissues as controls. Although the expression levels of STAMBPL1, FOXO1, and GRHL3 in breast cancer tissues are not uniformly higher than those in adjacent tissues, their expression levels in triple-negative breast cancer (TNBC) are significantly elevated compared to non-TNBC. The results of this re-analysis have been added in Supplementary Figure 6 as Fig.S6A-C.

      Regarding the basal protein expression levels of STAMBPL1 in the cell lines used in this study, Fig. 2A shows the endogenous level of STAMBPL1 in HCC1806 and HCC1937. For Fig. 2D and 2F, the overexpressed STAMBPL1 was fused with a 3xFlag tag, resulting in a higher molecular weight compared to the endogenous STAMBPL1. In the revised Figure 2, we have indicated the positions of the endogenous (Endo.) and exogenous (OE.) STAMBPL1 bands with arrows.

      (2) Linked to these considerations is the second major criticism, namely that it is not made clear if this new regulatory axis is proposed to act in normoxic or hypoxic conditions. The experiments presented in this paper are performed in both conditions but a clear explanation as to why cells are exposed to hypoxia is not given and would be necessary being that HIF-1a transcription and not protein stability is being analyzed. Also, different hypoxic conditions are sometimes used, resulting in different mRNA levels of HIF-1a and its downstream targets and quite significant fluctuations within the same cell line from one experimental setting to the next. The authors should provide an explanation as to why experimental conditions are changed and, more importantly, the experiments presented in Figure 2 should be performed also in normoxia.

      Thanks for the comments. Under normoxic conditions, HIF1α is recognized by pVHL due to hydroxylation and is rapidly degraded via the proteasomal pathway. In contrast, under hypoxic conditions, HIF1α protein accumulates. To investigate the effect of STAMBPL1 knockdown on HIF1A gene transcription levels, we conducted experiments under hypoxic conditions to avoid interference from the rapid degradation of HIF1α at the protein level, as shown in Figures 2B-C. Furthermore, under normoxic conditions, the overexpression of STAMBPL1 was shown to significantly enhance the protein levels of HIF1α and upregulate the transcription of VEGFA through HIF1α. To avoid the potential impact of excessive accumulation of HIF1α protein under hypoxic conditions on its protein level detection and the transcription of downstream VEGFA, the related experiments shown in Figure 2D-G were performed under normoxic conditions. We have explained the corresponding experimental conditions in the “Result” and “Figure legends” according to the reviewer's comments, highlighted in red.

      (3) Another critical point is that necessary experimental controls are sometimes missing, and this reduces the strength of some of the conclusions enunciated by the authors. As an example, experiments where overexpression of STAMBPL1 is coupled to silencing of FOXO1 to demonstrate dependency lack FOXO1 silencing in the absence of STAMBPL1 overexpression. Because diminishing FOXO1 expression affects HIF-1a/VEGF transcription even in the absence of STAMBPL1 (shown in Figure 7C, D), it is not clear if the data presented in Figure 7G are significant. The difference between HIF-1a expression upon FOXO1 silencing should be compared in the presence or absence of STAMBPL1 overexpression to understand if FOXO1 impacts HIF-1a transcription dependently or independently of STAMBPL1.

      Thank you for this comment. For Fig.7G-H, our experimental objective was to determine whether STAMBPL1 activates HIF1A/VEGFA transcription via FOXO1. Therefore, under STAMBPL1 overexpression, we knocked down FOXO1 to investigate whether FOXO1 silencing could reverse the upregulation of HIF1A/VEGFA transcription induced by STAMBPL1 overexpression.

      (4) In addition, some minor comments to improve the quality of this manuscript are provided.

      (4.1) As a general statement, the manuscript is extremely synthetic. While this is not necessarily a negative feature, sometimes results are discussed in the figure legends and not in the main text (as an example, western blots showing HIF-1a expression) and this makes it hard to read through the data in an easy and enjoyable manner.

      Thank you for this suggestion. We have revised the figure legends to make them clearer and more concise, highlighted in red.

      (4.2) The effect of STAMBPL1 overexpression on HIF-1a transcription is minor (Figure 2). The authors should explain why they think this is the case and whether hypoxia may provide a molecular environment that is more permissive to this type of regulation.

      Thank you for the comment. Under normoxic conditions, we conducted WB to examine the protein expression of HIF1α after the overexpression of STAMBPL1 and the knockdown of HIF1α. To visually illustrate the impact of STAMBPL1 overexpression on HIF1A protein levels, as well as the effectiveness of HIF1α knockdown, we annotated the grayscale analysis results of the bands in Figures 2D and 2F. As the reviewer pointed out, under normoxic conditions, HIF1α is rapidly degraded, which may explain why the upregulation of HIF1α protein levels by STAMBPL1 overexpression is not very pronounced.

      (4.3) HIF-1a does not appear upregulated at the protein level by STAMBPL1 or GRHL3 overexpression, even though this is stated in the legends of Figures 2 and 6. The authors should show unsaturated western blot images and provide quantitative data of independent experiments to make this point.

      Thank you for this comment. We have added the unsaturated image of HIF1α into Fig.2D, and performed a grayscale analysis of the HIF1α bands in Fig.2D and Fig.6A to indicate the relative protein level of HIF1α.

      Reviewer #1 (Recommendations for the authors):

      (1) The authors previously reported that STAMBPL1 stabilizes MKP1 in TNBC. However, in this study, they focus on HIF1a. Given that STAMBPL1 affects HIF1a expression, it would be valuable to examine the levels of ROS in TNBC cells with or without STAMBPL1, as ROS is known to influence HIF1a stability.

      Thank you for your comments. It’s known that STAMBPL1 functions as a deubiquitinating enzyme. However, our study reveals that the upregulation of HIF1α by STAMBPL1 is independent of its deubiquitinating activity. This conclusion is supported by the observation that overexpression of the deubiquitinase active site mutant, STAMBPL1-E292A, also upregulated HIF1α expression (Figure 1F). Moreover, STAMBPL1 overexpression enhanced HIF1α transcription (Figures 4E and S3E), while STAMBPL1 knockdown was able to inhibit the transcription of HIF1α (Figures 2B-C). These results indicate that STAMBPL1 mediates the transcription of HIF1α but does not affect the stability of HIF1α. For these reasons, we think that it is unnecessary to examine the ROS levels.

      (2) Figure 1A: The regulation of HIF1a mRNA by STAMBPL1, but not its protein levels, could be better addressed by using MG132 to rule out the impact of protein degradation.

      Thanks for this comment. Under normoxic conditions, the oxygen-sensitive prolyl hydroxylases PHD1-3 act on HIF1α, specifically inducing hydroxylation at the proline 402 and 564 residues. These hydroxylated residues are recognized by the pVHL/E3 ubiquitin ligase complex, leading to ubiquitination and subsequent degradation via the proteasome pathway. Conversely, under hypoxic conditions, PHD1-3 are inactivated, and non-hydroxylated HIF1α is not recognized by the pVHL/E3 ubiquitin ligase complex, thereby avoiding ubiquitination and proteasomal degradation (DOI: 10.1073/pnas.95.14.7987, DOI: 10.1515/BC.2004.016, and DOI: 10.1042/BJ20040620). The mechanism of HIF1α accumulation under hypoxia is analogous to the action of the proteasome inhibitor MG132. When we treated cells with hypoxia, the ubiquitination and proteasomal degradation pathway of HIF1α was blocked. At this time, STAMBPL1 knockdown could downregulate the expression of HIF1α (Fig.1A). Meanwhile, since the knockdown of STAMBPL1 significantly downregulated the mRNA level of HIF1α under hypoxia (Fig.2B-C), we concluded that STAMBPL1 affects the expression of HIF1α by mediating its transcription. In addition, MG132 will block all proteasomal substrate degradation and may affect HIF1α mRNA levels indirectly.

      (3) Figure 2D and 2F: The effect of STAMBPL1 in promoting HIF1a expression is quite mild, and the effect of HIF1a knockdown is also modest. Given the high levels of STAMBPL1 in TNBC cell lines (Figure 2A), it would be better to repeat these experiments in a STAMBPL1-knockdown setting for clearer insights.

      We appreciate this insightful suggestion. Considering that the regulation of HIF1α expression by STAMBPL1 occurs at the transcriptional level, and to prevent excessive accumulation of HIF1a during hypoxia that could confound the effect of STAMBPL1 overexpression on HIF1α regulation, we opted to overexpress STAMBPL1 under normoxic conditions and subsequently knock down HIF1α, as shown in Fig.2D and Fig.2F. This approach allowed us to observe that STAMBPL1 overexpression can upregulate HIF1a expression to some extent. Additionally, in response to the reviewer's suggestion to knock down STAMBPL1, we have conducted the corresponding experiments, with results presented in Fig.1A-E and Fig.2B-C.

      (4) Figure 4A: Why does the RNA-seq pattern differ significantly between the two siRNAs? Additionally, the authors should clarify why they focus primarily on transcription factors, as other mechanisms, such as mRNA stability and RNA modification, could also influence gene transcription.

      Thank you for this comment. Two siRNAs for STAMBPL1 were designed and synthesized by a biotechnology company. Although both siRNAs target STAMBPL1, they target different sequences. While both siRNAs effectively knocked down STAMBPL1 (Fig. 1A and Fig. 2A), the possibility of off-target effects cannot be completely ruled out. Therefore, we used the two siRNAs in parallel for RNA-seq and focused on genes downregulated by both siRNAs, ensuring that the gene expression changes observed are due to the knockdown of STAMBPL1. Additionally, among the 27 genes downregulated by both siRNAs, only 18 genes were annotated. Of these 18 genes, except for GRHL3, which is a transcription factor reported to be involved in gene transcription regulation, the remaining 17 genes have no documented association with RNA transcription, stability, or modification. Therefore, we focused on the GRHL3 gene.
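
      The filtering step described in this response, keeping only genes that are downregulated by both siRNAs, amounts to a set intersection. The following is a minimal sketch of that idea and not the authors' actual pipeline: the placeholder gene names (`GENE_A` to `GENE_C`), the log2 fold-change values, and the cutoff of -1.0 are all hypothetical and chosen only for illustration.

```python
def downregulated(log2fc, cutoff=-1.0):
    """Return the set of genes whose log2 fold change is at or below the cutoff."""
    return {gene for gene, value in log2fc.items() if value <= cutoff}


def shared_downregulated(sirna_1, sirna_2, cutoff=-1.0):
    """Return genes that pass the cutoff in BOTH knockdown experiments."""
    return downregulated(sirna_1, cutoff) & downregulated(sirna_2, cutoff)


# Hypothetical log2 fold changes (knockdown vs. control); not real data.
sirna_1 = {"GRHL3": -1.8, "GENE_A": -1.2, "GENE_B": -0.3, "GENE_C": 0.9}
sirna_2 = {"GRHL3": -1.5, "GENE_A": -0.4, "GENE_B": -1.6, "GENE_C": 1.1}

shared = shared_downregulated(sirna_1, sirna_2)
print(sorted(shared))
```

      In this toy input, `GENE_A` and `GENE_B` each pass the cutoff for only one siRNA and are therefore discarded, which is how the intersection guards against sequence-specific off-target effects.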

      (5) Figure 5G: To investigate whether STAMBPL1 and GRHL3 function epistatically in the pathway, a double knockdown of STAMBPL1 and GRHL3 should be examined. Additionally, a double knockdown of STAMBPL1 and FOXO1 should be assessed.

      Thank you for your comment. In Figure 5G, we aimed to assess the knockdown efficiency of GRHL3 using siRNAs. To determine whether STAMBPL1 upregulates the HIF1α/VEGFA axis via GRHL3, we overexpressed STAMBPL1 and subsequently knocked down GRHL3. Our findings indicated that STAMBPL1 overexpression indeed enhanced the HIF1α/VEGFA axis, and that this enhancement was reversed by the knockdown of GRHL3, as shown in Figures 4E-F and S3E-F. Similarly, upon overexpressing STAMBPL1 and knocking down FOXO1, we observed that STAMBPL1 overexpression increased the GRHL3/HIF1α/VEGFA axis, which could also be reversed by knocking down FOXO1, as shown in Figures 7F-H. These results suggest that STAMBPL1 upregulates the GRHL3/HIF1α/VEGFA axis through FOXO1. We therefore do not consider a double knockdown of STAMBPL1 together with FOXO1 or GRHL3 to be the appropriate approach.

      (6) Figure 7: It remains unclear how STAMBPL1 regulates FOXO1. The authors show that STAMBPL1 increases the transcriptional activation of FOXO1 at the GRHL3 promoter, but it is not clear if STAMBPL1 is required for FOXO1 binding to the GRHL3 promoter. To address this, STAMBPL1-knockdown should be included to examine its effect on FOXO1 binding to the GRHL3 promoter. Furthermore, it would be important to determine whether the STAMBPL1-FOXO1 interaction is essential for GRHL3 transcription. Since the interaction sites of STAMBPL1-FOXO1 have been mapped, a mutant disrupting the interaction would provide better insight into how STAMBPL1 promotes GRHL3 transcription by interacting with FOXO1.

      Thank you for this comment. It has been reported that FOXO1 promotes the transcription of the GRHL3 gene by interacting with its promoter (DOI: 10.1093/nar/gkw1276). We also verified through ChIP assay that FOXO1 can bind to the promoter of the GRHL3 gene (Fig.7I) and mediate its transcription. Specifically, knocking down FOXO1 significantly down-regulated the mRNA level of GRHL3 (Fig.7B), and the GRHL3 promoter lacking the FOXO1 binding site almost completely lost transcriptional activity (Fig.7J), indicating that FOXO1 is crucial for the transcriptional activity of the GRHL3 promoter. Overexpression of STAMBPL1 enhances the activating effect of FOXO1 on the transcriptional activity of the GRHL3 promoter (Fig.7K). However, the up-regulation of GRHL3 transcription by overexpression of STAMBPL1 is completely blocked by FOXO1 knockdown (Fig.7F), and the knockdown of FOXO1 essentially blocks the binding of STAMBPL1 to the GRHL3 promoter (Fig.7L), suggesting that STAMBPL1 affects the transcriptional expression of GRHL3 in a FOXO1-dependent manner. As we added in the Discussion, the transcription factor activity of FOXO1 is mainly regulated by its nucleocytoplasmic shuttling, and the accumulation of FOXO1 in the nucleus can enhance its transcription factor activity (DOI: 10.1042/BJ20040167; DOI: 10.15252/embj.2022111867). In our research, neither STAMBPL1 nor its deubiquitinase active-site mutant affected the expression of FOXO1 (Fig.S5E), but STAMBPL1 and FOXO1 co-localized in the nucleus (Fig.7M), and they interacted with each other (Fig.7N, Fig.S5I-J). Therefore, we speculate that STAMBPL1 interacts with FOXO1 in the nucleus, obstructs the binding of FOXO1 to members of the 14-3-3 family, and inhibits the export of FOXO1, thereby enhancing its transcriptional activity. This interaction between STAMBPL1 and FOXO1 does not necessarily affect the binding of FOXO1 to DNA, including the GRHL3 promoter.

      (7) Figure 8 A-C: What is the correlation among the expressions of STAMBPL1, FOXO1, and GRHL3 in TNBC tumors compared to non-TNBC tumors?

      Thank you for your comment. In Figure 8A-C, we analyzed the expression levels of STAMBPL1, FOXO1, and GRHL3 in both TNBC and non-TNBC samples using BCIP. The results indicate that the expression levels of these three genes are significantly higher in TNBC compared to non-TNBC samples. To investigate the correlation among the expressions of STAMBPL1, FOXO1, and GRHL3 in TNBC versus non-TNBC, we further utilized the Metabric data. Apart from a positive correlation trend between STAMBPL1 and GRHL3 expression in TNBC clinical samples (Pearson R = 0.27), no significant correlation was observed among the expression levels of STAMBPL1, FOXO1, and GRHL3 in TNBC and non-TNBC clinical samples (as shown in Author response image 1 below). Since STAMBPL1 and FOXO1 act as proteins in the transcriptional regulation of the GRHL3 gene, whereas the data obtained from the Metabric database are the transcript levels of these three genes, this might be the reason why no correlation between their expressions was observed.

      Author response image 1.
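
      For readers unfamiliar with the statistic quoted in this response, the Pearson correlation coefficient between two expression vectors can be computed as sketched below. This is a generic, minimal illustration under stated assumptions and not the authors' analysis: the function name `pearson_r` and the toy expression values are hypothetical and do not come from the Metabric data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy expression values for two hypothetical genes across five samples.
gene_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_2 = [2.0, 4.0, 6.0, 8.0, 10.0]

r = pearson_r(gene_1, gene_2)
print(round(r, 3))
```

      Values of R near 0, such as the 0.27 reported above, indicate at most a weak linear relationship between the two transcript levels, whereas values near 1 or -1 indicate a strong one.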

      Reviewer #2 (Recommendations for the authors):

      The authors have thoroughly elucidated the role of STAMBPL1 in TNBC. However, it would be beneficial to discuss the potential clinical implications of these findings, such as how targeting STAMBPL1 or FOXO1 might impact current treatment strategies for TNBC. In addition, several issues need to be addressed.

      Major:

      (1) While the study provides an exhaustive analysis of the molecular mechanisms, a comparison with other subtypes of breast cancer could enhance our understanding of the specificity of the STAMBPL1/FOXO1/GRHL3/HIF1α/VEGFA axis in TNBC.

      Thank you for your comment. According to a previous report, STAMBPL1 is significantly associated with the mesenchymal characteristics of breast cancer (DOI: 10.1038/s41416-020-0972-x). We utilized cBioPortal (http://www.cbioportal.org/) to analyze the expression of STAMBPL1 across various clinical subtypes of breast cancer. The results indicated that STAMBPL1 is highly expressed in invasive breast cancer, which has been added to Supplementary Figure 6 as Fig.S6D. Given that TNBC is an aggressive type of invasive breast cancer, we further examined the expression of STAMBPL1 in TNBC compared to non-TNBC using BCIP (http://omicsnet.org/bcancer/database). Our findings revealed that the expression level of STAMBPL1 in TNBC was elevated relative to its levels in non-TNBC (Fig.8A). Additionally, since tumor angiogenesis is a critical factor influencing the metastasis of cancer cells, our study focused specifically on the pro-angiogenic effects of STAMBPL1 in TNBC.

      (2) The authors might consider discussing any potential off-target effects of the siRNA and shRNA used in the study to bolster the conclusions drawn from the knockdown experiments.

      We appreciate the reviewer's suggestion. It is well-known that siRNA or shRNA have off-target effects. To address this concern, we employed two siRNAs for each gene knockdown in our study. Specifically, we knocked down genes such as STAMBPL1, FOXO1, GRHL3, and HIF1A in two TNBC cell lines, HCC1806 and HCC1937, using two siRNAs. Except for siRNA#1 targeting HIF1A, which did not show a significant knockdown effect in HCC1806 cells (Fig.2D and Fig.6A), the knockdown effects of other siRNAs on their respective genes were effective, and the resulting phenotypes were consistent. As shown in Fig.2F and Fig.S4H, siRNA#1 targeting HIF1A had a significant knockdown effect in HCC1937 cells. The lower knockdown efficiency of this siRNA in HCC1806 cell line might be attributed to cell-specific factors.

      (3) It would be advantageous if the authors could provide further details on the patient demographics and tumor characteristics in the TCGA database analysis to better comprehend the clinical relevance of their findings.

      Thanks for the reviewer's suggestions. We have now indicated the number of clinical samples in each group in the legend of Fig.8A-C. Since we utilized the BCIP online database to analyze and compare the expression levels of the three genes STAMBPL1, FOXO1, and GRHL3 in TNBC and non-TNBC, we are unable to obtain more specific information regarding the tumor characteristics of each sample. However, our analysis clearly shows that the expression levels of these three genes are significantly higher in TNBC compared to non-TNBC.

      (4) The authors should consider discussing any limitations regarding the generalizability of their findings, such as potential variations among different TNBC subtypes or the specificity of their observations to certain stages of the disease.

      We appreciate the reviewer's comment. Accordingly, we have added a discussion on the limitation of this study in Discussion, highlighted in red font on pages 20 to 21, lines 396 to 412. In addition, we utilized the bc-GenExMiner online database to conduct a comparative analysis of STAMBPL1 expression in different subtypes of non-TNBC and TNBC. The result indicates that STAMBPL1 is highly expressed in mesenchymal-like and basal-like TNBC, which has been added into Supplementary Figure 6 as Fig.S6E. Since these two subtypes of TNBC are highly invasive and metastatic, it suggests that targeting the signaling pathway of STAMBPL1/FOXO1/GRHL3/HIF1α/VEGFA may offer clinical benefits for patients with invasive TNBC.

      Minor:

      The paper is generally well-written, but it's crucial to maintain vigilance for subject-verb agreement, proper use of tense, and consistent terminology.

      Thank you for this suggestion. We have thoroughly revised the article for issues such as grammar, including tense, subject-verb agreement, and terminology.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This study aimed at replicating two previous findings that showed (1) a link between prediction tendencies and neural speech tracking, and (2) that eye movements track speech. The main findings were replicated which supports the robustness of these results. The authors also investigated interactions between prediction tendencies and ocular speech tracking, but the data did not reveal clear relationships. The authors propose a framework that integrates the findings of the study and proposes how eye movements and prediction tendencies shape perception.

      Strengths:

      This is a well-written paper that addresses interesting research questions, bringing together two subfields that are usually studied in separation: auditory speech and eye movements. The authors aimed at replicating findings from two of their previous studies, which was overall successful and speaks for the robustness of the findings. The overall approach is convincing, methods and analyses appear to be thorough, and results are compelling.

      Weaknesses:

      Linking the new to the previous studies could have been done in more detail, and the extent to which results were replicated could have been discussed more thoroughly.

      Eye movement behavior could have been presented in more detail and the authors could have attempted to understand whether there is a particular component in eye movement behavior (e.g., microsaccades) that drives the observed effects.

      We would like to thank you for your time and effort in reviewing our work and we appreciate the positive comments!

      We extended our manuscript, now providing intermediate results on individual prediction tendency, which can be compared to our results from Schubert et al., (2023).

      Furthermore, we expanded our discussion now detailing the extent to which our results (do not) replicate the previous findings (e.g. differences in horizontal vs. vertical ocular speech tracking, lack of distractor tracking, link between ocular speech tracking and behavioral outcomes).

      While we agree with the reviewer that it is an important and most interesting question, to what extent individual features of gaze behavior (such as microsaccades, blinks etc.) contribute to the ocular speech tracking effect, it is beyond the scope of the current manuscript. It will be methodologically and conceptually challenging to distinguish these features from one another and to relate them to diverse cognitive processes. We believe that a separate manuscript is needed to give these difficult questions sufficient space for new methodological approaches and control analyses. The primary goal of this manuscript was to replicate the findings of Gehmacher et al. (2024) using similar methods and to relate them to prediction tendencies, attention, and neural speech tracking. 

      Reviewer #2 (Public review):

      Summary

      Schubert et al. recorded MEG and eye-tracking activity while participants were listening to stories in single-speaker or multi-speaker speech. In a separate task, MEG was recorded while the same participants were listening to four types of pure tones in either structured (75% predictable) or random (25%) sequences. The MEG data from this task was used to quantify individual 'prediction tendency': the amount by which the neural signal is modulated by whether or not a repeated tone was (un)predictable, given the context. In a replication of earlier work, this prediction tendency was found to correlate with 'neural speech tracking' during the main task. Neural speech tracking is quantified as the multivariate relationship between MEG activity and speech amplitude envelope. Prediction tendency did not correlate with 'ocular speech tracking' during the main task. Neural speech tracking was further modulated by local semantic violations in the speech material, and by whether or not a distracting speaker was present. The authors suggest that part of the neural speech tracking is mediated by ocular speech tracking. Story comprehension was negatively related to ocular speech tracking.

      Strengths

      This is an ambitious study, and the authors' attempt to integrate the many reported findings related to prediction and attention in one framework is laudable. The data acquisition and analyses appear to be done with great attention to methodological detail (perhaps even with too much focus on detail; see below). Furthermore, the experimental paradigm used is more naturalistic than was previously done in similar setups (i.e. stories instead of sentences).

      Weaknesses

      For many of the key variables and analysis choices (e.g. neural/ocular speech tracking, prediction tendency, mediation) it is not directly clear how these relate to the theoretical entities under study, and why they were quantified in this particular way. Relatedly, while the analysis pipeline is outlined in much detail, an overarching rationale and important intermediate results are often missing, which makes it difficult to judge the strength of the evidence presented. Furthermore, some analysis choices appear rather ad-hoc and should be made uniform and/or better motivated.

      We would like to thank you very much for supporting our paper and your thoughtful feedback!

      To address your concerns, that our theoretical entities as well as some of our analytical choices lack transparency, we expanded our manuscript in several ways:

      (1) We now provide the intermediate results of our prediction tendency analysis (see new Figure 2 of our manuscript). These results are comparable to our findings from Schubert et al. (2023), demonstrating that on a group level there is a tendency to pre-activate auditory stimuli of high probability and illustrating the distribution of this tendency value in our subject population.

      (2) We expanded our methods section in order to explain our analytical choices (e.g. why this particular entropy modulation paradigm was used to measure individual prediction tendency).

      (3) We now provide an operationalisation of the terms “neural speech tracking” and “ocular speech tracking” at their first mention, to make these metrics more transparent to the reader.

      (4) We are summarizing important methodological information ahead of each results section, in order to provide the reader with a comprehensible background, without the necessity to read through the detailed methods section. 

      (5) We expanded our discussion section, with a special emphasis on relating the key variables of the current investigation to theoretical entities.

      Reviewer #3 (Public review):

      Summary:

      In this paper, the authors measured neural activity (using MEG) and eye gaze while individuals listened to speech from either one or two speakers, which sometimes contained semantic incongruencies.

      The stated aim is to replicate two previous findings by this group: (1) that there is "ocular speech tracking" (that eye-movements track the audio of the speech), (2) that individual differences in neural response to tones that are predictable vs. not-predictable in their pitch is linked to neural response to speech. In addition, here they try to link the above two effects to each other, and to link "attention, prediction, and active sensing".

      Strengths:

      This is an ambitious project, that tackles an important issue and combines different sources of data (neural data, eye-movements, individual differences in another task) in order to obtain a comprehensive "model" of the involvement of eye-movements in sensory processing.

      The authors use many adequate methods and sophisticated data-analysis tools (including MEG source analysis and multivariate statistical models) in order to achieve this.

      Weaknesses:

      Although I sympathize with the goal of the paper and agree that this is an interesting and important theoretical avenue to pursue, I am unfortunately not convinced by the results and find that many of the claims are very weakly substantiated in the actual data.

      Since most of the analyses presented here are derivations of statistical models and very little actual data is presented, I found it very difficult to assess the reliability and validity of the results, as they currently stand. I would be happy to see a thoroughly revised version, where much more of the data is presented, as well as control analyses and rigorous and well-documented statistical testing (including addressing multiple comparisons).

      We thank you for your thoughtful feedback. We appreciate your concerns and will address them below in greater detail.

      These are the main points of concern that I have regarding the paper, in its current format.

      (1) Prediction tendencies - assessed by listening to sequences of rhythmic tones, where the pitch was either "predictable" (i.e., followed a fixed pattern, with 25% repetition) or "unpredictable" (no particular order to the sounds). This is a very specific type of prediction, which is a general term that can operate along many different dimensions. Why was this specific design selected? Is there theoretical reason to believe that this type of prediction is also relevant to "semantic" predictions or other predictive aspects of speech processing?

      Theoretical assumptions and limitations of our quantification of individual prediction tendency are now briefly summarized in the first paragraph of our discussion section. With this paradigm we focus on anticipatory “top-down” predictions, whilst controlling for possibly confounding “bottom-up” processes. Since this study aimed to replicate our previous work, we chose the same entropy-modulation paradigm as in other studies from our group (e.g. Demarchi et al. 2019, Schubert et al. 2023;2024, Reisinger et al. 2024), which has proven to give reproducible findings of feature-specific preactivations of sounds in a context of low entropy. One advantage of this design is that it gives us the opportunity to directly compare the processing of “predictable” and “unpredictable” sounds of the same frequency in a time-resolved manner (this argument is now also included in the Methods section).

      Regarding the question to what extent this type of prediction might also be relevant to “semantic” predictions we would like to refer to our previous study (Schubert et al., 2023), where we explicitly looked at the interaction between individual prediction tendency and encoding of semantic violations in the cortex. (In short, there we found a spatially dissociable interaction effect, indicating an increased encoding of semantic violations that scales with prediction tendency in the left hemisphere, as well as a disrupted encoding of semantic violations for individuals with stronger prediction tendency in the right hemisphere.) We did not aim to replicate all our findings in the current study, but instead we focused on merging the most important results from two independent phenomena in the domain of speech processing and bringing them into a common framework. However, as now stated in our discussion, we believe that “predictions are directly linked to the interpretation of sensory information. This interpretation is likely to occur at different levels along the cognitive (and anatomical) hierarchy…” and that “this type of prediction is relevant for acoustic processing such as speech and music, whose predictability unfolds over time.”

      (2) On the same point - I was disappointed that the results of "prediction tendencies" were not reported in full, but only used later on to assess correlations with other metrics. Even though this is a "replication" of previous work, one would like to fully understand the results from this independent study. On that note, I would also appreciate a more detailed explanation of the method used to derive the "prediction tendency" metric (e.g., what portion of the MEG signal is used? Why use a pre-stimulus and not a post-stimulus time window? How is the response affected by the 3Hz steady-state response that it is riding on? How are signals integrated across channels? Can we get a sense of what this "tendency" looks like in the actual neural signal, rather than just a single number derived per participant (an illustration is provided in Figure 1, but it would be nice to see the actual data)? How is this measure verified statistically? What is its distribution across the sample? Ideally, we would want enough information for others to be able to replicate this finding).

      We now included a new figure (similar to Schubert et al. 2023) showing the interim results of the “prediction tendency” effect as well as individual prediction tendency values of all subjects.

      Furthermore, we expanded the description of the “prediction tendency” metric in the Methods section, where we explain our analytical choices in more detail. In particular, we used a pre-stimulus time window in order to capture “anticipatory predictions”. The temporally predictable design gives us the opportunity to capture this type of prediction. The integration across channels is handled by the multivariate pattern analysis (MVPA), which inherently integrates multidimensional data (as mentioned in the Methods section, we used data from 102 magnetometers) and links it to (in this case) categorical information.

      (3) Semantic violations - half the nouns ending sentences were replaced to create incongruent endings. Can you provide more detail about this - e.g., how were the words selected? How were the recordings matched (e.g., could they be detected due to audio editing?)? What are the "lexically identical controls that are mentioned"? Also, is there any behavioral data to know how this affected listeners? Having so many incongruent sentences might be annoying/change the nature of listening. Were they told in advance about these?

      We expanded the Methods section and included the missing information: 

      “We randomly selected half of the nouns that ended a sentence (N = 79) and replaced them with the other half to induce unexpected semantic violations. The swap of nouns happened in the written script before the audio material was recorded in order to avoid any effects of audio clipping. Narrators were aware of the semantic violations and had been instructed to read out the words as normal. Consequently all target words occurred twice in the text, once in a natural context (serving as lexical controls) and once in a mismatched context (serving as semantic violations) within each trial, resulting in two sets of lexically identical words that differed greatly in their contextual probabilities (see Figure 1F for an example). Participants were unaware of these semantic violations.” Since we only replaced 79 words with semantic violations in a total of ~24 minutes of audio material, we believe that natural listening was not impaired. In fact, none of the participants reported noticing the semantic violations during debriefing (even though they had an effect on speech tracking in the brain).

      (4) TRF in multi-speaker condition: was a univariate or multivariate model used? Since the single-speaker condition only contains one speech stimulus - can we know if univariate and multivariate models are directly comparable (in terms of variance explained)? Was any comparison to permutations done for this analysis to assess noise/chance levels?

      For mTRF models it depends on the direction (“encoding” vs. “decoding”) whether or not the model is comparable to a univariate model. In our case of an encoding model the TRFs are fitted to each MEG channel independently. This gives us the possibility to explore the effect over different areas (whereas a multivariate “decoding” model would result in only one speech reconstruction value).

      In both conditions (single and multi speaker) a single input feature (the envelope of the attended speech stream) was used. Of course, it would be possible to fit a multivariate encoding model that predicts the brain’s response to the total input of sounds. This would, however, target a slightly different question than ours, as we aimed to investigate how much of the attended speech is tracked.

      Regarding your suggestion of a comparison to permutations to assess noise levels, we would like to point out that we chose the same methodological approach as in our previous studies, which we aimed to replicate here. In these original studies no permuted versions were used (with the exception of the mediation analysis, where comparing a model with an additional input predictor to a single-predictor model would not result in a fair comparison). We conducted the mTRF approach following the guidelines of Crosse et al. (2016) to the best of our knowledge and in accordance with similar studies in this field.

      Crosse, M. J., Di Liberto, G. M., Bednar, A., & Lalor, E. C. (2016). The multivariate temporal response function (mTRF) toolbox: a MATLAB toolbox for relating neural signals to continuous stimuli. Frontiers in human neuroscience, 10, 604.

      (5) TRF analysis at the word level: from my experience, 2-second segments are insufficient for deriving meaningful TRFs (see for example the recent work by Mesik & Wojtczak). Can you please give further details about how the analysis of the response to semantic violations was conducted? What was the model trained on (the full speech or just the 2-second long segments?) Is there a particular advantage to TRFs here, relative - say - to ERPs (one would expect a relatively nice N400 response, not)? In general, it would be nice to see the TRF results on their own (and not just the modulation effects).

      We fully agree with the reviewer's statement that 2-second segments would have been too short to derive meaningful TRFs. To investigate the effect of semantic violations, we used the same TRFs trained on the whole dataset (with 4-fold cross validation). The resulting true as well as the predicted data were segmented into single word epochs of 2 seconds. We selected semantic violations as well as their lexically identical controls and correlated true with predicted responses for every word. Thus, we conducted the same analysis as for the overall encoding effect, focusing on only part of the data. We have reformulated the Methods section accordingly to clear up this misunderstanding. Since the TRFs are identical to the standard TRFs from the overall neural speech tracking, they are not informative about the semantic violation effect. However, since the mTRF approach is the key method throughout the manuscript (and our main focus is not on the investigation of brain responses to semantic violations) we have favoured this approach over the classical ERF analysis.
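      The word-level step described above (segment the true and the predicted responses into 2-second epochs, then correlate them per word) can be illustrated with a minimal sketch. It assumes one channel, word onsets given as sample indices, and a prediction that has already been produced by a cross-validated encoding model; the function and argument names are hypothetical and this is not the authors' pipeline.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences.

    Note: raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def epoch_tracking(true_resp, pred_resp, onsets, sfreq, epoch_sec=2.0):
    """Correlate true and predicted responses within one epoch per word.

    true_resp, pred_resp: full-length time series for one channel
        (hypothetical inputs; the prediction comes from a model that was
        fitted on the whole recording, not on the epochs).
    onsets: word-onset sample indices. Epochs that would run past the
        end of the recording are skipped.
    """
    n_samples = int(round(epoch_sec * sfreq))
    scores = []
    for onset in onsets:
        stop = onset + n_samples
        if stop > len(true_resp):
            continue
        scores.append(pearson_r(true_resp[onset:stop], pred_resp[onset:stop]))
    return scores


# Toy data: the "prediction" is a scaled copy of the true response.
true = [math.sin(0.3 * t) for t in range(1000)]
pred = [0.5 * v for v in true]
scores = epoch_tracking(true, pred, onsets=[0, 300, 900], sfreq=100)
```

      In such a scheme, the scores for violation words and for their lexically identical controls would be collected separately and then compared; the model itself is never refitted on the short epochs.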

      (6) Another related point that I did not quite understand - is the dependent measure used for the regression model "neural speech envelope tracking" the r-value derived just from the 2sec-long epochs? Or from the entire speech stimulus? The text mentions the "effect of neural speech tracking" - but it's not clear if this refers to the single-speaker vs. two-speaker conditions or to the prediction manipulation. Or is it different in the different analyses? Please spell out exactly what metric was used in each analysis.

      As suggested we now provide a clear definition of each dependent metric for each analysis.

      “Neural speech tracking” refers to the correlation coefficients between predicted and true brain responses from the aforementioned encoding model, trained and tested on the whole audio material within condition (single vs. multi-speaker).

      Recommendations for the authors:

      Reviewing Editor Comments:

      The reviewers have provided a number of recommendations to improve the manuscript, particularly requesting that more data be reported, with an emphasis on the measurements themselves (eye movements and TRFs) rather than just the numerical outputs of mathematical models.

      We appreciate all the reviewers' and editor’s comments and effort to improve our manuscript. In the revised version we provide interim findings and missing data, updated figures that include an intuitive illustration of the metrics (such as TRFs), and a thoroughly revised discussion section where we focus on the relationship between our observed quantities and theoretical entities. We now offer operationalized definitions of the relevant concepts (“prediction tendency”, “active ocular sensing” and “selective attention”) and suggest how these entities might be related in the context of speech processing, based on the current findings. We are confident that this revision has substantially improved the quality of our paper and we are grateful for all the feedback and suggestions.

      Reviewer #1 (Recommendations for the authors):

      (1) Participants had to fixate throughout the tasks. How did the authors deal with large eye movements that violated the instructed fixation?

      As described in the Methods section: “Participants were instructed to look at a black fixation cross at the center of a grey screen.” This instruction was not intended to enforce strict fixation but rather to provide a general reference point, encouraging participants to keep their gaze on the grey screen and avoid freely scanning the room or closing their eyes. Unlike trial-based designs, where strict fixation is feasible due to shorter trial durations, this approach did not impose rigid fixation requirements. Consequently, the threshold for "instruction violation" was inherently more flexible, and no additional preprocessing was applied to the gaze vectors.

      Fixating for such an extended period of time (1.5 hours?) is hard. Did fixation behavior change over time? Could (fixation) fatigue affect the correlations between eye movements and speech tracking? For example, fatigued participants had to correct their fixation more often and this drives, in part, the negative correlation with comprehension?

      Yes, participants spent approximately 2 hours in the MEG, including preparation time (~30 minutes). However, participants were given opportunities to rest their eyes between different parts and blocks of the experiment (e.g., resting state, passive listening, and audiobook blocks), which should help mitigate fatigue to some extent.

      That said, we agree that it is an intriguing idea that fatigue could drive the ocular speech tracking effect, with participants potentially needing to correct their gaze more as the experiment progresses. However, our analysis suggests this is unlikely for several reasons:

      (1) Cross-validation in encoding models: Ocular speech tracking effects were calculated using a 4-fold cross-validation approach (this detail has now been added to the Methods section; please see our response to public review #3). This approach reduces the influence of potential increases in gaze corrections over time, as the models are trained and validated on independent data splits.  Moreover, if there were substantial differences in underlying response magnitudes between folds - for instance, between the first and fourth fold - this would likely compromise the TRF's ability to produce valid response functions for predicting the left-out data. Such a scenario would not result in significant tracking, further supporting the robustness of the observed effects.

      (2) TRF time-course stability: If fatigue were driving increased gaze corrections, we would expect this to be reflected in a general offset (capturing the mean difference between folds) in the TRF time-courses shown in Figure 4 (right panel). However, no such trend / offset is evident.

      (3) Comparison of eye movement data: To directly investigate this possibility, we compared the amount of total eye movements between the first and last blocks for both the single and multi-speaker conditions. Total movement was calculated by first calculating the differences in pixel values between consecutive eye positions on both the x- and y-axes. The Euclidean distance was then computed for each difference, providing a measure of movement between successive time points. Summing these distances yielded the total movement for each block. Statistical analysis was performed separately for the single speaker (ASS) and multi-speaker (AMS) conditions. For each condition, paired comparisons were made between the first and last blocks (we resorted to non-parametric tests, if assumptions of normality were violated):

      For the single speaker condition (ASS), the normality assumption was not satisfied (p≤0.05, Kolmogorov-Smirnov test). Consequently, a Wilcoxon signed-rank test was conducted, which revealed no significant difference in total movements between the first and last blocks (z=−1.330, p=0.184). For the multi-speaker condition (AMS), the data met the normality assumption (p>0.05), allowing the use of a paired t-test. The results showed no significant difference in total movements between the first and last blocks (t=−0.184, p=0.855).

      The results are visualized in a bar plot (see below), where individual data points are displayed alongside the mean and standard error for each block. Statistical annotations indicate that neither condition demonstrated significant differences between the blocks. These findings suggest that total eye movements remained stable across the experimental conditions, regardless of whether participants were exposed to a single or multiple speakers.

      Author response image 1.

      (4) Behavioral responses: Participants’ behavioral responses did not indicate any decrease in comprehensibility for later blocks compared to earlier ones. Specifically, a comparison of comprehension scores between the first and last blocks revealed no significant difference in either the single-speaker condition (ASS; Wilcoxon signed-rank test: Z=−0.5911, p=0.5545) or the multi-speaker condition (AMS; Wilcoxon signed-rank test: Z=0.5018, p=0.6158). These findings suggest that participants maintained consistent levels of comprehension throughout the experiment, regardless of the condition or block order. The results are visualized in a bar plot (see below), where individual data points are displayed alongside the mean and standard error for each block. Statistical annotations indicate that neither condition demonstrated significant differences between the blocks.

      Author response image 2.

      Together, these factors suggest that fatigue is unlikely to be a significant driver of the ocular speech tracking effects observed in this study.
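      The total-movement measure used in point (3) above (differences between consecutive gaze positions on x and y, Euclidean distance per step, summed per block) can be written as a short sketch. It assumes gaze positions in pixels as two equal-length sequences; the function name is hypothetical and the block is an illustration rather than the authors' code.

```python
import math


def total_movement(x, y):
    """Sum of Euclidean distances between consecutive gaze samples.

    x, y: equal-length sequences of horizontal and vertical gaze
    positions (assumed to be in pixels).
    """
    return sum(
        math.hypot(x1 - x0, y1 - y0)
        for x0, x1, y0, y1 in zip(x, x[1:], y, y[1:])
    )


# Toy trace: one 3-4-5 step followed by a stationary sample.
path_length = total_movement([0, 3, 3], [0, 4, 4])  # 5.0
```

      Computing one such value per participant for the first and for the last block gives the paired samples that the tests reported above compare.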

      (2) The authors should provide descriptive statistics of fixation behavior /fixational eye movements. What was the frequency and mean direction of microsaccades, do they follow the main sequence, etc., quantify drift and tremor?

      We thank the reviewer for their suggestion regarding descriptive statistics. To address this, we computed the rates of microsaccades (which were extracted using the microsaccade detection algorithm proposed in Liu, B., Nobre, A. C. & van Ede, F. Functional but not obligatory link between microsaccades and neural modulation by covert spatial attention. Nat. Commun. 13, 3503 (2022)) and fixations, as these metrics are directly relevant to our study and the requests above.

      Microsaccade Rates:

      - Single speaker Condition: Mean = 2.306 Hz, SD = 0.363 Hz.
      - Multi speaker Condition: Mean = 2.268 Hz, SD = 0.355 Hz.

      Fixation Rates:

      - Single speaker Condition: Mean = 2.858 Hz, SD = 1.617 Hz.
      - Multi speaker Condition: Mean = 2.897 Hz, SD = 1.542 Hz.

      These values fall within the expected ranges reported in the literature (fixation rates: 2–4 Hz, microsaccade rates: ~0.5–2.5 Hz) and serve as a sanity check, confirming the plausibility of our eye-tracking data. Regarding the reviewer’s request for additional metrics (e.g., microsaccade direction, main sequence analysis, drift, and tremor), extracting these features would require advanced algorithms and analyses not supported by our current preprocessing pipeline or dataset. We hope that the provided metrics, which were the main focus of this study, serve as a sufficient sanity check and highlight the robustness of our data.

      Related to this, I am wondering whether microsaccades are the feature that drives speech tracking.

      This is an important and pressing question that we aim to address in future publications. Currently, our understanding - and the reason microsaccades and blinks are not analysed in this manuscript - is limited by methodological constraints. Specifically, microsaccades are binary response vectors, which are not compatible with TRF analyses. Addressing this would require adapting future models to handle time-continuous binary response data or exploring alternative approaches, such as regression-based ERFs (for example as in Heilbron et al. 2022). As the primary goal of this manuscript was to replicate the findings of Gehmacher et al. (2024) using similar methods and to integrate these findings into an initial unified framework, we did not investigate additional eye movement features here. However, we agree that microsaccades (and also blinks, see below) likely contribute, at least in part, to the observed ocular speech tracking effects, and we now suggest this in the Discussion:

      “Relatedly, it remains an open question whether microsaccades are a key feature driving ocular speech tracking. However, our current study does not analyze microsaccades due to methodological constraints: microsaccades are binary response vectors, which are incompatible with TRF analyses used here. Addressing this would require adapting models to handle time-continuous binary response data or potentially exploring alternative approaches, such as regression-based ERFs (e.g., as in Heilbron et al., 2022). While these limitations preclude microsaccade analysis in the current study, we hypothesize that they could enhance temporal precision and selectively amplify relevant sensory input, supporting auditory perception. Future studies should explore this possibility to uncover the specific contributions of microsaccades to speech tracking.”

      (3) Can the authors make sure that interpolated blinks did not drive any of the effects? Can interpolated blink trials be excluded?

      Using continuous audiobooks as stimuli meant that we could not exclude blink periods from the analysis without introducing substantial continuation artifacts in the TRF analysis. Importantly, the concept of covert motor routines and active sensing suggests that participants engage more strongly in motor routines - including ocular behaviors such as microsaccades and blinks - during tasks like speech tracking. These motor routines are inherently tied to individual gaze patterns, making microsaccades and blinks correlated with other ocular behaviors. This complicates efforts to disentangle their individual contributions to the observed ocular speech tracking effects.

      Engagement in these motor routines, as posited by active sensing, would naturally load onto various viewing behaviors, further intertwining their roles.

      Even if we were to examine correlations, such as the amount of blinks with the ocular speech tracking effect, it is unlikely to provide a clearer understanding due to these inherent overlaps. The methodological and conceptual challenge lies in distinguishing these features from one another and understanding their respective roles in driving the observed effects.

      However, the aim of this manuscript was not to dissect the ocular speech tracking effect in greater detail, but rather to relate it - based on similar analytical choices as in Gehmacher et al. (2024) - to prediction tendencies, attention, and neural speech tracking. While it will be crucial in future work to differentiate these patterns and their connections to diverse cognitive processes, it is beyond the scope of this study to address all these questions comprehensively.

      We acknowledge that eye movements, including microsaccades and blinks (however, see challenges for this in response 2), remain underexplored in many experimental paradigms. Their interplay with cognitive processes - such as attention, prediction, and sensory integration - will undoubtedly be an important focus for future studies. 

      (4) Could the authors provide more details on how time shuffling was done for the eye-movement predictor, and include a circularly shifted version (or a version that does not destroy temporal contiguity) in their model comparisons? Some types of shuffling can result in unrealistic time series, which would end up in an unfair comparison with the model that has the real eye movement traces as predictors.

      We thank the reviewer for their insightful question regarding the time-shuffling procedure for the eye-movement predictor and for suggesting the inclusion of a circularly shifted version in our model comparisons. Below, we provide further details about our approach and the rationale behind it:

      (1) Random Shuffling: In our analysis, the eye-movement predictor was randomly shuffled over time, meaning that individual samples were randomly replaced. This method completely disrupts the temporal structure of the signal, providing a null model that directly tests whether the temporal mediation observed is due to the specific temporal relationship between ocular movements and envelope tracking.

      (2) Circular Shifting: While circular shifting maintains temporal contiguity, it introduces certain challenges in the context of TRF analysis. Specifically:

      - Adaptation to Shifts: The TRF model could adapt to the introduced shift, potentially reducing the validity of the null comparison.

      - Similarity due to Repetition: The broadband envelope exhibits strong repetitive patterns over time, such as rhythms inherent to speech. Circular shifting can therefore produce predictors that are very similar to the original signal. As a result, this similarity may lead to null distributions that do not adequately disrupt the temporal mediation we aim to test, making it less robust as a control.

      (3) Rationale for Random Shuffling: The primary goal of our mediation analysis is to determine whether there is a temporal mediation of envelope tracking by ocular movements. By deliberately destroying the temporal structure through random shuffling, we ensure that the null model tests for the specific temporal relationship that is central to our hypothesis. Circularly shifted predictors, on the other hand, may partially preserve temporal dependencies, making them less suitable for this purpose.

      In summary, while circular shifting is a valuable approach in other contexts, it is less appropriate for the specific goals of this study. We hope this explanation clarifies our methodological choices and demonstrates their alignment with the aims of our analysis.
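      The contrast between the two null constructions discussed above can be made concrete with a toy sketch. It assumes a purely rhythmic stand-in signal (a 3 Hz sinusoid sampled at 100 Hz) and uses lag-1 autocorrelation as a crude index of temporal structure; all names and values are hypothetical, and the block does not reproduce the mediation analysis itself.

```python
import math
import random


def random_shuffle(signal, rng):
    """Return a copy with the samples in random order.

    The set of values is unchanged; their temporal order is destroyed.
    """
    out = list(signal)
    rng.shuffle(out)
    return out


def circular_shift(signal, shift):
    """Return a copy rotated by `shift` samples (temporal contiguity kept)."""
    shift %= len(signal)
    if shift == 0:
        return list(signal)
    return list(signal[-shift:]) + list(signal[:-shift])


def lag1_autocorr(signal):
    """Lag-1 autocorrelation of a sequence."""
    n = len(signal)
    mean = sum(signal) / n
    num = sum((a - mean) * (b - mean) for a, b in zip(signal, signal[1:]))
    den = sum((a - mean) ** 2 for a in signal)
    return num / den


# Toy rhythmic predictor: 3 Hz sinusoid, 10 s at 100 Hz.
predictor = [math.sin(2 * math.pi * 3 * t / 100) for t in range(1000)]
shuffled = random_shuffle(predictor, random.Random(0))
shifted = circular_shift(predictor, 250)

ac_original = lag1_autocorr(predictor)
ac_shuffled = lag1_autocorr(shuffled)
ac_shifted = lag1_autocorr(shifted)
```

      On this toy signal the shifted copy keeps essentially the same autocorrelation as the original, whereas the shuffled copy loses it. This only illustrates the property at issue in the exchange above; it does not by itself settle which null model is preferable for real gaze data.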

      (5) Replication: I want to point out that it is great that the previous findings were in principle replicated. However, I would like to suggest a more nuanced evaluation of the replication:

      a) Instead of a (direct) replication, the present study should be called a 'conceptual replication', since modifications in design and procedure were made.

      Thank you very much for this suggestion! We now use the term ‘conceptual replication’ throughout the manuscript.

      b) Not all the findings from the Gehmacher et al., 2024 study were replicated to a full extent:

      Did the authors find indications of a vertical vs. horizontal tracking difference in the Gehmacher 2024 data? Could they check this in the Gehmacher 2024 data?

      The findings for horizontal and vertical gaze tracking in Gehmacher et al. (2024) are detailed in the supplementary material of that publication. Both single-speaker and multi-speaker target conditions showed significant speech tracking effects in both horizontal and vertical directions. However, there was a slightly stronger tracking effect for the single-speaker condition in the vertical direction. Due to the highly predictable structure of words in Gehmacher et al. (2024), effects there were probably boosted overall compared with continuous audiobook listening, likely leading to the differentiation between horizontal and vertical gaze. See the figures in the Gehmacher et al. supplementary file for reference.

      c) Another difference between their previous and this study is the non-existent tracking of the multi-speaker distractor in this study. The authors should point this out clearly in the discussion and potentially provide an explanation.

      Thank you for highlighting this point! We now address this in the discussion:

      “Importantly, in contrast to Gehmacher et al. (2024), we did not observe ocular tracking of the multi-speaker distractor in this study. This difference is likely attributable to the simplistic single-trial, 5-word task structure in Gehmacher et al., which resulted in high temporal overlap between the target and distractor speech streams and likely drove the significant distractor-tracking effects observed in that study. The absence of such an effect during continuous listening in our study suggests that ocular tracking is indeed more specific to selective attention.”

      Minor:

      (1) I was a little surprised to not see an indication of eyes/eye movements in Figure 6. The intention of the authors might have been to create a general schematic illustration, but I find this a bit misleading. This paper provides nice evidence for a specific ocular effect in speech tracking. There is, to my knowledge, no indication that speech would be influenced by different kinds of active sensing (if there are, please include them in the discussion). Given that the visuomotor system is quite dominant in humans, it might actually be the case that the speech tracking the authors describe is specifically ocular.

      Taking into account all the reviewers' remarks on the findings and interpretations, we have updated this figure (now Fig. 7) in the manuscript to make it more specific and aligned with the revised discussion section. Throughout the manuscript, we now explicitly refer to active ocular sensing in relation to speech processing and have avoided the broader term 'active sensing' in this context. We hope these revisions address the concerns raised.

      (2) I find the part in the discussion (page 2, last paragraph) on cognitive processes hard to follow. I don't agree that 'cognitive processes' are easily separable from any of the measured responses (eye and brain). Referring to the example they provide, there is evidence that eye movements are correlated with brain activity that is correlated with memory performance. How, and more importantly, why would one separate those?

      Thank you for raising this important point. We have carefully considered your comments, particularly regarding the interplay between cognitive processes and measured responses (eye and brain), as well as the challenge of conceptually separating them. Additionally, we have incorporated Reviewer #2's query (13) into a unified and complementary reasoning. In response, we have rewritten the relevant paragraph in the discussion to provide a clearer and more detailed explanation of how ocular and neural responses contribute to speech processing in an interdependent manner. We hope this revision addresses your concerns and offers a more precise and coherent discussion on this topic:

      “Despite the finding that eye movements mediate neural speech tracking, the behavioural relevance for semantic comprehension appears to differ between ocular and neural speech tracking. Specifically, we found a negative association between ocular speech tracking and comprehension, indicating that participants with lower comprehension performance exhibited increased ocular speech tracking. Interestingly, no significant relationship was observed between neural tracking and comprehension.

      In this context, the negative association between ocular tracking and comprehension might reflect individual differences in how participants allocate cognitive resources. Participants with lower comprehension may rely more heavily on attentional mechanisms to process acoustic features, as evidenced by increased ocular tracking. This reliance could represent a compensatory strategy when higher-order processes, such as semantic integration or memory retrieval, are less effective. Importantly, our comprehension questions (see Experimental Procedure) targeted a broad range of processes, including intelligibility and memory, suggesting that this relationship reflects a trade-off in resource allocation between low-level acoustic focus and integrative cognitive tasks.

      Rather than separating eye and brain responses conceptually, our analysis highlights their complementary contributions. Eye movements may enhance neural processing by increasing sensitivity to acoustic properties of speech, while neural activity builds on this foundation to integrate information and support comprehension. Together, these systems form an interdependent mechanism, with eye and brain responses working in tandem to facilitate different aspects of speech processing.

      This interpretation is consistent with the absence of a difference in ocular tracking for semantic violations (e.g., words with high surprisal versus lexically matched controls), reinforcing the view that ocular tracking primarily reflects attentional engagement with acoustic features rather than direct involvement in semantic processing. This aligns with previous findings that attention modulates auditory responses to acoustic features (e.g., Forte et al., 2017), further supporting the idea that ocular tracking reflects mechanisms of selective attention rather than representations of linguistic content.

      Future research should investigate how these systems interact and explore how ocular tracking mediates neural responses to linguistic features, such as lexical or semantic processing, to better understand their joint contributions to comprehension.”

      (3) Attention vs. predictive coding. I think the authors end up with an elegant description of the observed effects, "as an "active sensing" mechanism that implements the attentional optimization of sensory precision." However, I feel the paragraph starts with the ill-posed question "whether ocular speech tracking is modulated not by predictive, but other (for example attentional) processes". If ocular tracking is the implementation of a process (optimization of sensory precision, aka attention), how could it be at the same time modulated by that process? In my opinion, adding the notion that there is a modulation by a vague cognitive concept like attention on top of what the paper shows does not improve our understanding of how speech tracking in humans works.

      Thank you for raising this point. We agree that it is critical to clarify the relationship between ocular speech tracking, attention, and predictive processes, and we appreciate the opportunity to refine this discussion.  

      To avoid the potential confusion that active ocular sensing on the one hand represents an implementation of selective attention while on the other hand seeming to be modulated by it, we now use the formulation “ocular speech tracking reflects attentional mechanisms rather than predictive processes.”

      To address your concern that the conceptualization of attention seems rather vague, we have revised the whole paragraph to redefine the theoretical entities in question (including selective attention) and to provide a clearer and more precise picture (see also our revised version of Fig. 6, now Fig. 7). We now focus on highlighting the distinct yet interdependent roles of selective attention and individual prediction tendencies for speech tracking:

      “With this speculative framework we attempt to describe and relate three important phenomena with respect to their relevance for speech processing: 1) “Anticipatory predictions” that are created in the absence of attentional demands and contain probabilistic information about stimulus features (here, inferred from frequency-specific pre-activations during passive listening to sound sequences). 2) “Selective attention” that allocates resources towards relevant (whilst suppressing distracting) information (which was manipulated by the presence or absence of a distractor speaker). And finally 3) “active ocular sensing”, which refers to gaze behavior that is temporally aligned to attended (but not unattended) acoustic speech input (inferred from the discovered phenomenon of ocular speech tracking). We propose that auditory inflow is, at a basic level, temporally modulated via active ocular sensing, which “opens the gates” in the sensory periphery at relevant timepoints. How exactly this mechanism is guided (for example where the information about crucial timepoints comes from, if not from prediction, and whether it requires habituation to a speech stream etc.) is yet unclear. Unlike predictive tendencies, active ocular sensing appears to reflect selective attention, manifesting as a mechanism that optimizes sensory precision. Individual differences with respect to anticipatory predictions, on the other hand, seem to be independent from the other two entities, but nevertheless relevant for speech processing. We therefore support the notion that representational content is interpreted based on prior probabilistic assumptions. If we consider the idea that “a percept” of an (auditory) object is actually temporally and spatially distributed (across representational spacetime - see Fig. 7), the content of information depends on where and when it is probed (see for example Dennett, 1991 for similar ideas on consciousness).
      Having to select from multiple interpretations across space and time requires a careful balance between the weighting of internal models and the allocation of resources based on current goals. We suggest that in the case of speech processing, this challenge results in an independent adaptation of feature-based precision-weighting by predictions on the one hand and temporal precision-weighting by selective attention on the other.”

      Reviewer #2 (Recommendations for the authors):

      My main recommendation is outlined in the Weaknesses above: the overarching rationale for many analysis choices should be made explicit, and intermediate results should be shown where appropriate, so the reader can follow what is being quantified and what the results truly mean. Specifically, I recommend to pay attention to the following (in no particular order):

      (1) Define 'neural speech tracking' early on. (e.g.: 'The amount of information in the MEG signal that can multivariately be explained by the speech amplitude envelope.' (is that correct?))

      Thank you for pointing out that this important definition is missing. It is now defined at the first mention in the Introduction as follows: “Here (and in the following) “neural speech tracking” refers to a correlation coefficient between actual brain responses and responses predicted from an encoding model based solely on the speech envelope”.
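
      As a minimal sketch of this definition (not the authors' pipeline, which estimates TRFs with Boosting), the following toy example fits a time-lagged linear encoding model with ridge regression, predicts a held-out response from the envelope alone, and takes the correlation between predicted and observed responses as the tracking score. The function names, the simulated kernel, the lag count, and the ridge value are all illustrative assumptions.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def lagged_design(stimulus, n_lags):
    """Row t holds stimulus[t], stimulus[t-1], ... (zero-padded at the start)."""
    return [[stimulus[t - k] if t >= k else 0.0 for k in range(n_lags)]
            for t in range(len(stimulus))]

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_trf(stimulus, response, n_lags, ridge=1.0):
    """Ridge estimate of the filter kernel (swapped in here for Boosting)."""
    X = lagged_design(stimulus, n_lags)
    xtx = [[sum(row[i] * row[j] for row in X) + (ridge if i == j else 0.0)
            for j in range(n_lags)] for i in range(n_lags)]
    xty = [sum(row[i] * y for row, y in zip(X, response)) for i in range(n_lags)]
    return solve(xtx, xty)

def predict(stimulus, trf):
    """Convolve the stimulus with the kernel to obtain a predicted response."""
    return [sum(w * x for w, x in zip(trf, row))
            for row in lagged_design(stimulus, len(trf))]

def tracking_score(train, test, n_lags, ridge=1.0):
    """Fit on the training pair, correlate the prediction with held-out data."""
    trf = fit_trf(train[0], train[1], n_lags, ridge)
    return pearson(predict(test[0], trf), test[1])

def simulate(rng, n, kernel, noise_sd):
    """Toy (envelope, response) pair; the kernel is a hypothetical ground truth."""
    stim = [rng.random() for _ in range(n)]
    resp = [r + rng.gauss(0.0, noise_sd) for r in predict(stim, kernel)]
    return stim, resp

rng = random.Random(0)
true_kernel = [0.0, 0.5, 1.0, 0.3]  # hypothetical response kernel
train = simulate(rng, 300, true_kernel, 0.1)
test = simulate(rng, 200, true_kernel, 0.1)
score = tracking_score(train, test, n_lags=4)
print(f"tracking score (r) = {score:.2f}")
```

      The same recipe applies to ocular speech tracking if the response is a horizontal or vertical gaze trace instead of an MEG channel.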

      (2) Same for 'ocular speech tracking'. Here even reading the Methods does not make it unambiguous how this term is used.

      It is now defined at the first mention in the Introduction as follows: “Ocular speech tracking” (similarly to “neural speech tracking”) refers to the correlation coefficient between actual eye movements and movements predicted from an encoding model based on the speech envelope”.

      In addition also define both (neural and ocular speech tracking) metrics in the Methods Section.

      (3) Related to this: for ocular speech tracking, are simply the horizontal and vertical eye traces compared to the speech envelope? If so, this appears somewhat strange: why should the eyes move more rightward/upward with a larger envelope? And the direction here depends on the (arbitrary) sign of right = positive, etc. (It would make more sense to quantify 'amount of movement' in some way, but if this is done, I missed it in Methods.)

      Thank you for your insightful comments. You are correct that the horizontal and vertical traces were used for ocular speech tracking, and no additional details were included in the Methods. While we agree that the observed rightward/upward movement may seem unusual, this pattern is consistent with previous findings, including those reported in Gehmacher et al. (2024). In that study, we discussed how ocular speech tracking could reflect a broader engagement of the motor system during speech perception. For example, we observed a general right-lateralized gaze bias when participants attended to auditory speech, which we hypothesized might resemble eye movements during text reading, with a similar temporal alignment (~200 ms). We also speculated that this pattern might differ in cultures that read text from right to left.

      We appreciate your suggestion to explore alternative methods for quantifying gaze patterns, such as the "amount of movement" or microsaccades. While these approaches hold promise for future studies, our primary aim here was to replicate previous findings using the same signal and analysis methods to establish a basis for further exploration.  

      (4) In the Introduction, specifically blink-related ocular activity is mentioned as being related to speech tracking (for which a reference is, incidentally, missing), while here, any blink-related activity is excluded from the analysis. This should be motivated, as it appears in direct contradiction.

      Thank you for pointing this out. The mention of blink-related ocular activity in the Introduction refers to findings by Jin et al. (2018), where such activity was shown to align with higher-order syntactic structures in artificial speech. We have now included the appropriate reference for clarity.

      While Jin et al. focused on blink-related activity, in the present study we focused on gaze patterns to investigate ocular speech tracking, replicating findings from Gehmacher et al. (2024). This approach was motivated by our goal to validate previous results using the same methodology. Importantly, the exclusion of blinks in our analysis was due to methodological constraints of TRF analysis, which requires a continuous response signal; blinks, being discrete and artifact-prone, are incompatible with this approach.

      To address your concern, we revised the Introduction to clarify this distinction and provide explicit motivation for focusing on gaze patterns. It now reads:

      “Along these lines, it has been shown that covert, mostly blink-related eye activity aligns with higher-order syntactic structures of temporally predictable, artificial speech (i.e. monosyllabic words; Jin et al., 2018). In support of ideas that the motor system is actively engaged in speech perception (Galantucci et al., 2006; Liberman & Mattingly, 1985), the authors suggest a global entrainment across sensory and (oculo)motor areas which implements temporal attention.

      In another recent study from our lab (Gehmacher et al., 2024), we showed that eye movements continuously track intensity fluctuations of attended natural speech, a phenomenon we termed ocular speech tracking. In the present study, we focused on gaze patterns rather than blink-related activity, both to replicate findings from Gehmacher et al. (2024) and because blink activity is unsuitable for TRF analysis due to its discrete and artifact-prone nature. Hence, “Ocular speech tracking” (similarly to “neural speech tracking”) refers to the correlation coefficient between actual eye movements and movements predicted from an encoding model based on the speech envelope.”

      Jin, P., Zou, J., Zhou, T., & Ding, N. (2018). Eye activity tracks task-relevant structures during speech and auditory sequence perception. Nature communications, 9(1), 5374.

      (5) The rationale for the mediation analysis is questionable. Let speech envelope = A, brain activity = B, eye movements = C. The authors wish to claim that A -> C -> B. But it is equally possible that A -> B -> C. They reflect on this somewhat in Discussion, but throughout the rest of the paper, the mediation analysis is presented as specifically testing whether A -> B is mediated by C, which is potentially misleading.

      Indeed, we share your concern regarding the directionality of the relationships in the mediation analysis. Our choice of ocular movements as a mediator was motivated by the fact that the relationship between acoustic speech and neural activity is well established, as well as by previous results indicating that oculomotor activity contributes to cognitive effects in auditory attention (Popov et al., 2022).

      Here we treat both interpretations (“ocular movements contribute to neural speech tracking” versus “neural activity contributes to ocular speech tracking”) as equally possible. We now emphasise this point thoroughly in our discussion:

      “It is important to note that our current findings do not allow for inference on directionality. Our choice of ocular movements as a mediator was motivated by the fact that the relationship between acoustic speech and neural activity is well established, as well as previous results indicating that oculomotor activity contributes to cognitive effects in auditory attention (Popov et al., 2022). However, an alternative model may suggest that neural activity mediates the effect of ocular speech tracking. Hence, it is possible that ocular mediation of speech tracking may reflect a) active (ocular) sensing for information driven by (top-down) selective attention or b) improved neural representations as a consequence of temporally aligned increase of sensory gain or c) (not unlikely) both. In fact, when rejecting the notion of a single bottom-up flow of information and replacing it with a model of distributed parallel and dynamic processing, it seems only reasonable to assume that the direction of communication (between our eyes and our brain) will depend on where (within the brain) as well as when we look at the effect. Thus, the regions and time-windows reported here should be taken as an illustration of oculo-neural communication during speech processing rather than an attempt to "explain" neural speech processing by ocular movements.”

      (6) The mediation analysis can be improved by a proper quantification of the effect (sizes or variance explained). E.g. how much % of B is explained by A total, and how much of that can in turn be explained by C being involved? For drawing directional conclusions perhaps Granger causality could be used.

      In Figure 4 (now Figure 5) of our manuscript we use standardized betas (which correspond to effect sizes) to illustrate the mediation effect. With the current mTRF approach, however, it is not possible (or insightful) to compare the variance explained. It is reasonable to assume that variance in neural activity will be explained better when oculomotor behavior is included as a second predictor along with acoustic stimulation. However, this increase gives no indication of the extent to which this oculomotor behavior was task-relevant or irrelevant (since all kinds of “arbitrary” movements will be captured in brain activity and therefore lead to an increase in variance explained). For this reason we chose to pursue the widely accepted framework of mediation (Baron & Kenny, 1986). This (correlational) approach is indeed limited in its interpretations (see previous response); however, the goal of the current study was to replicate and illustrate the triad relationship of acoustic speech input, neural activity and ocular movements, with no particular hypotheses on directionality.
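
      To make the logic of the mediation framework concrete, here is a minimal scalar sketch in the spirit of Baron & Kenny (1986). It is an illustration under simplifying assumptions, not the time-resolved mTRF analysis used in the manuscript: all three variables are treated as plain vectors, the paths are standardized least-squares betas computed from pairwise correlations, and the simulated coefficients are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def mediation_paths(x, m, y):
    """Standardized paths for predictor x, mediator m, and outcome y.

    a        : effect of x on m
    c_total  : effect of x on y without the mediator
    c_direct : effect of x on y when m is also in the model
    b        : effect of m on y when x is also in the model
    indirect : a * b, which equals c_total - c_direct for least squares
    """
    r_xm, r_xy, r_my = pearson(x, m), pearson(x, y), pearson(m, y)
    denom = 1.0 - r_xm ** 2
    c_direct = (r_xy - r_xm * r_my) / denom
    b = (r_my - r_xm * r_xy) / denom
    return {"a": r_xm, "b": b, "c_total": r_xy,
            "c_direct": c_direct, "indirect": r_xm * b}

# Toy data: envelope -> eye -> brain, plus a direct envelope -> brain path.
rng = random.Random(1)
envelope = [rng.gauss(0, 1) for _ in range(500)]
eye = [0.6 * e + rng.gauss(0, 1) for e in envelope]
brain = [0.3 * e + 0.5 * g + rng.gauss(0, 1) for e, g in zip(envelope, eye)]

paths = mediation_paths(envelope, eye, brain)
print({k: round(v, 3) for k, v in paths.items()})
```

      Calling `mediation_paths(envelope, brain, eye)` instead fits the alternative model in which brain activity is the mediator; the decomposition is equally computable in both directions, which is why such an analysis on its own cannot settle directionality.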

      (7) Both prediction tendency and neural speech tracking depend on MEG data, and thus on MEG signal-to-noise ratio (SNR). It is possible some participants may have higher SNR recordings in both tasks, which may result in both higher (estimated) prediction tendency and higher (estimated) speech tracking. This would result in a positive correlation, as the authors observe. This trivial explanation should be ruled out, by quantifying the relative SNR and testing for the absence of a mediation here.

      We agree that for both approaches (MVPA and mTRF models) individual MEG SNR plays an important role. This concern has been raised previously and addressed in our previous manuscript (Schubert et al., 2023). First, it should be noted that our prediction tendency value is the result of a condition contrast (rather than simple decoding accuracy) which compensates for the influence of subject specific signal-to-noise ratio (as no vacuous difference in SNR is to be expected between conditions). Second, in our previous study we also used frequency decoding accuracy as a control variable to correlate with speech tracking variables of interest and found no significant effect.

      (8) Much of the analysis pipeline features temporal response functions (TRFs). These should be shown in a time-resolved manner as a key intermediate step.

      We have now included the neural speech tracking TRFs in the figure (now Figure 3).

      (9) Figure 2 shows much-condensed results from different steps in the pipeline. If I understand correctly, 2A shows raw TRF weights (averaged over some time window?), while 2B-F shows standardized mean posterior regressor weights after Bayesian stats? It would be very helpful to make much more explicit what is being shown here, in addition to showing the related TRFs.

      Thank you for pointing this out! The figure description has indeed not been very informative on this issue. We have now adapted the caption and hope this resolves the confusion: “Neural speech tracking is related to prediction tendency and word surprisal, independent of selective attention. A) Envelope (x) - response (y) relationships are estimated using deconvolution (Boosting). The TRF (filter kernel, h) models how the brain processes the envelope over time. This filter is used to predict neural responses via convolution. Predicted responses are correlated with actual neural activity to evaluate model fit and the TRF's ability to capture response dynamics. Correlation coefficients from these models are then used as dependent variables in Bayesian regression models. (Panel adapted from Gehmacher et al., 2024b). B) Temporal response functions (TRFs) depict the time-resolved neural tracking of the speech envelope for the single speaker and multi speaker target condition, shown here as absolute values averaged across channels. Solid lines represent the group average. Shaded areas represent 95% Confidence Intervals. C–H) The beta weights shown in the sensor plots are derived from Bayesian regression models in A). For Panel C, this statistical model is based on correlation coefficients computed from the TRF models (further details can be found in the Methods Section). C) In a single speaker condition, neural tracking of the speech envelope was significant for widespread areas, most pronounced over auditory processing regions. D) The condition effect indicates a decrease in neural speech tracking with increasing noise (1 distractor). E) Stronger prediction tendency was associated with increased neural speech tracking over left frontal areas. F) However, there was no interaction between prediction tendency and conditions of selective attention. G) Increased neural tracking of semantic violations was observed over left temporal areas. H) There was no interaction between word surprisal and speaker condition, suggesting a representation of surprising words independent of background noise. Marked sensors indicate ‘significant’ clusters, defined as at least two neighboring channels showing a significant result. N = 29.”

      Gehmacher, Q., Schubert, J., Kaltenmaier, A., Weisz, N., & Press, C. (2024b). The "Ocular Response Function" for encoding and decoding oculomotor related neural activity. bioRxiv, 2024-11.

      (10) Bayesian hypothesis testing is not done consistently. Some parts test for inclusion of 0 in 94% HDI, while some parts adopt a ROPE approach. The same approach should be taken throughout. Additionally, Bayes factors would be very helpful (I appreciate these depend on the choice of priors, but the default Bambi priors should be fine).

      Our primary aim in this study was to replicate two recent findings: (1) the relationship between individual prediction tendencies and neural speech tracking, and (2) the tracking of the speech envelope by eye movements. To maintain methodological consistency with the original studies, we did not apply a ROPE approach when analyzing these replication effects. Instead, we followed the same procedures as the original work, focusing on the inclusion of 0 in the HDI for the neural effects and using the same methods for the ocular effects. Additionally, we were not specifically interested in potential null effects in these replication analyses, as our primary goal was to test whether we could reproduce the previously reported findings.

      For the mediation analysis, however, we chose to extend the original approach by not only performing the analysis in a time-resolved manner but also applying a ROPE approach. This decision was motivated by our interest in gaining more comprehensive insights — beyond the replication goals — by also testing for potential null effects, which can provide valuable information about the presence or absence of mediation effects.
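
      The difference between the two decision rules can be sketched as follows. This is an illustrative example and not the authors' code: the HDI is computed as the narrowest interval containing 94% of the posterior samples, the ROPE bounds of ±0.05 are a hypothetical choice, and the three-way ROPE rule is a common convention rather than something stated in the manuscript.

```python
import random
from math import ceil

def hdi(samples, mass=0.94):
    """Narrowest interval that contains `mass` of the posterior samples."""
    s = sorted(samples)
    k = ceil(mass * len(s))  # number of samples inside the interval
    start = min(range(len(s) - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[start], s[start + k - 1]

def excludes_zero(samples, mass=0.94):
    """Rule 1: call an effect 'significant' if 0 lies outside the HDI."""
    lo, hi = hdi(samples, mass)
    return lo > 0 or hi < 0

def rope_decision(samples, rope=(-0.05, 0.05), mass=0.94):
    """Rule 2: compare the HDI with a region of practical equivalence.

    Unlike rule 1, this can also support a null effect when the whole
    HDI falls inside the ROPE. The bounds here are hypothetical.
    """
    lo, hi = hdi(samples, mass)
    if hi < rope[0] or lo > rope[1]:
        return "effect"
    if rope[0] <= lo and hi <= rope[1]:
        return "practically null"
    return "undecided"

# Toy posteriors standing in for regression coefficients.
rng = random.Random(0)
clear_effect = [rng.gauss(0.2, 0.02) for _ in range(2000)]
near_zero = [rng.gauss(0.0, 0.01) for _ in range(2000)]

for name, post in [("clear_effect", clear_effect), ("near_zero", near_zero)]:
    print(name, excludes_zero(post), rope_decision(post))
```

      With the first rule a posterior centered on zero is simply "not significant", whereas the ROPE rule can positively classify it as practically null, which is the additional information sought in the mediation analysis.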

      We appreciate your thoughtful feedback and hope this clarifies our rationale for the differing approaches in our Bayesian hypothesis testing. 

      Regarding Bayes Factors: we understand that some researchers find Bayes Factors appealing, as they offer a seemingly simple and straightforward way to evaluate the evidence in favor of or against H0 in relation to H1 (e.g. BF10 > 10² = decisive, according to the Jeffreys scale). However, in practice Bayes Factors are often misunderstood, e.g. by interpreting the Bayes Factor as posterior odds or by not acknowledging the notion of relative evidence in the Bayes Factor (see Wong et al., 2022). Instead of using Bayes Factors, we prefer to rely on estimating and reporting the posterior distribution of parameters given the data, prior, and model assumptions (in the form of the 94% HDI). This allows for a continuous evaluation of evidence for a given hypothesis that is, in our eyes, easier to interpret than a Bayes Factor.

      Jeffreys, Harold (1998) [1961]. The Theory of Probability (3rd ed.). Oxford, England. p. 432. ISBN 9780191589676.

      Wong, T. K., Kiers, H., & Tendeiro, J. (2022). On the Potential Mismatch Between the Function of the Bayes Factor and Researchers’ Expectations. Collabra: Psychology, 8(1), 36357. https://doi.org/10.1525/collabra.36357

      (11) It would be helpful if Results could be appreciated without a detailed read of Methods. I would recommend a recap of each key methodological step before introducing the relevant Result. (This may also help in making the rationale explicit.)

      In addition to the short recaps of methods that were already present, and the information on the quantification of neural and ocular tracking and on Bayesian statistics (see responses 1, 2, 9), we have now added the following parts to the Results sections. Please read them in the context of the manuscript, where they should now provide a recap of the key methodological steps and the rationale needed to readily understand each analysis and its results:

      Individual prediction tendency is related to neural speech tracking:

      “Thus, this measure is a single value per subject, which comprises a) differences between two contextual probabilities (i.e. ordered vs. random) in b) feature-specific tone representations c) in advance of their observation (summed over a time-window of -0.3 - 0 s). Importantly, this prediction tendency was assessed in an independent entropy modulation paradigm (see Fig. 1). On a group level we found an increased tendency to pre-activate a stimulus of high probability (i.e. forward transition) in an ordered context compared to a random context (see Fig. 2A). This effect replicates results from our previous work (Schubert et al., 2023, 2024). Using the summed difference between entropy levels (ordered - random) across pre-stimulus time, one value was extracted per subject (Fig. 2B). This value was used as a proxy for “individual prediction tendency” and correlated with encoding of clear speech across different MEG sensors. [...]

      Neural speech tracking, quantified as the correlation coefficients between predicted and observed MEG responses to the speech envelope, was used as the dependent variable in Bayesian regression models. These models included condition (single vs. multi-speaker) as a fixed effect, with either prediction tendency or word surprisal as an additional predictor, and random effects for participants.”
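
      The reduction of the pre-stimulus time courses to one value per subject, as described in the first quoted passage, can be sketched as below. This is a toy illustration only: the function name, the example time courses, and the choice to exclude the sample at exactly 0 s are assumptions, and the actual measure is derived from multivariate decoding of MEG data.

```python
def prediction_tendency(times, ordered, random_ctx, t_min=-0.3, t_max=0.0):
    """Sum the (ordered - random) difference over the pre-stimulus window.

    `ordered` and `random_ctx` stand in for one subject's time courses of
    feature-specific pre-activation in the two entropy contexts.
    Whether the sample at t_max is included is an assumption of this sketch.
    """
    return sum(o - r for t, o, r in zip(times, ordered, random_ctx)
               if t_min <= t < t_max)

# Hypothetical time courses sampled every 100 ms around stimulus onset (0 s).
times = [-0.4, -0.3, -0.2, -0.1, 0.0, 0.1]
ordered = [0.5, 0.6, 0.6, 0.7, 0.9, 0.9]
random_ctx = [0.5, 0.5, 0.5, 0.5, 0.9, 0.9]

tendency = prediction_tendency(times, ordered, random_ctx)
print(round(tendency, 3))
```

      A positive value indicates stronger pre-activation of the likely tone in the ordered than in the random context; this per-subject value is then entered as a predictor in the regression on neural speech tracking.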

      Eye movements track acoustic speech in selective attention:

      “For this, we separately predicted horizontal and vertical eye movements from the acoustic speech envelope using temporal response functions (TRFs). The resulting model fit (i.e. correlation between true and predicted eye movements) is commonly referred to as “speech tracking”. Bayesian regression models were applied to evaluate tracking effects under different conditions of selective attention (single speaker, attended multi-speaker, unattended multi-speaker). Furthermore, we assessed whether individual prediction tendency or semantic word surprisal influenced ocular speech tracking.”

      Neural speech tracking is mediated by eye movements:

      “This model evaluates to what extent gaze behaviour functions as a mediator between acoustic speech input and brain activity.”

      Neural and ocular speech tracking are differently related to comprehension: “Bayesian regression models were used to investigate relationships between neural/ocular speech tracking and comprehension or difficulty. Ocular speech tracking was analyzed separately for horizontal and vertical eye movements.”

      (12) The research questions in the Introduction should be sharpened up, to make explicit when a question concerns a theoretical entity, and when it concerns something concretely measured/measurable.

      We sharpened them up:

      “Taking into account the aforementioned study by Schubert and colleagues (2023), the two recently uncovered predictors of neural tracking (individual prediction tendency and ocular tracking) raise several empirical questions regarding the relationship between predictive processes, selective attention, and active ocular sensing in speech processing:

      (1) Are predictive processes related to active ocular sensing in the same way they are to neural speech tracking? Specifically, do individuals with a stronger tendency to anticipate predictable auditory features, as quantified through prestimulus neural representations in an independent tone paradigm, show increased or even decreased ocular speech tracking, measured as the correlation between predicted and actual eye movements? Or is there no relationship at all?

      (2) To what extent does selective attention influence the relationship between prediction tendency, neural speech tracking, and ocular speech tracking? For example, does the effect of prediction tendency or ocular speech tracking on neural tracking differ between a single-speaker and multi-speaker listening condition?

      (3) Are individual prediction tendency and ocular speech tracking related to behavioral outcomes, such as comprehension and perceived task difficulty? Speech comprehension is assessed through accuracy on comprehension questions, and task difficulty is measured through subjective ratings.

      Although predictive processes, selective attention, and active sensing have been shown to contribute to successful listening, their potential interactions and specific roles in naturalistic speech perception remain unclear. Addressing these questions will help disentangle their contributions and establish an integrated framework for understanding how neural and ocular speech tracking support speech processing.”

      (13) The negative relationship between story comprehension and ocular speech tracking appears to go against the authors' preferred interpretation, but the reflection on this in the Discussion is very brief and somewhat vague.

      Thank you for pointing this out. We have taken your comments into careful consideration and also incorporated Reviewer #1's query (Minor point 2) into a unified and complementary reasoning. We have rewritten the relevant paragraph in the discussion to provide a clearer and more detailed explanation. We hope this revision offers a more precise and less vague discussion on this important point.

      “Despite the finding that eye movements mediate neural speech tracking, the behavioural relevance for semantic comprehension appears to differ between ocular and neural speech tracking. Specifically, we found a negative association between ocular speech tracking and comprehension, indicating that participants with lower comprehension performance exhibited increased ocular speech tracking. Interestingly, no significant relationship was observed between neural tracking and comprehension.

      In this context, the negative association between ocular tracking and comprehension might reflect individual differences in how participants allocate cognitive resources. Participants with lower comprehension may rely more heavily on attentional mechanisms to process acoustic features, as evidenced by increased ocular tracking. This reliance could represent a compensatory strategy when higher-order processes, such as semantic integration or memory retrieval, are less effective. Importantly, our comprehension questions (see Experimental Procedure) targeted a broad range of processes, including intelligibility and memory, suggesting that this relationship reflects a trade-off in resource allocation between low-level acoustic focus and integrative cognitive tasks.

      Rather than separating eye and brain responses conceptually, our analysis highlights their complementary contributions. Eye movements may enhance neural processing by increasing sensitivity to acoustic properties of speech, while neural activity builds on this foundation to integrate information and support comprehension. Together, these systems form an interdependent mechanism, with eye and brain responses working in tandem to facilitate different aspects of speech processing.

      This interpretation is consistent with the absence of a difference in ocular tracking for semantic violations (e.g., words with high surprisal versus lexically matched controls), reinforcing the view that ocular tracking primarily reflects attentional engagement with acoustic features rather than direct involvement in semantic processing. This aligns with previous findings that attention modulates auditory responses to acoustic features (e.g., Forte et al., 2017), further supporting the idea that ocular tracking reflects mechanisms of selective attention rather than representations of linguistic content.

      Future research should investigate how these systems interact and explore how ocular tracking mediates neural responses to linguistic features, such as lexical or semantic processing, to better understand their joint contributions to comprehension.”

      (14) Page numbers would be helpful.

      We added the page numbers.

      Reviewer #3 (Recommendations for the authors):

      Results

      (1) Figure 2 - statistical results are reported in this figure, but they are not fully explained in the text, nor are statistical values provided for any of the analyses (as far as I can tell).

      Also, how were multiple comparisons dealt with (the choice of two neighboring channels seems quite arbitrary)? Perhaps for this reason, the main result - namely the effect of "prediction tendency" and "semantic violations" - is quite sparse and might not survive a more rigorous statistical criterion. I would feel more comfortable with these results if the reporting of the statistical analysis had been more thorough (ideally, including comparison to control models).

      We would like to thank you again for your detailed queries, comments, and questions on our work. First of all, we adapted this figure (now Figure 3 in the manuscript, please see responses 8 and 9 to Reviewer #2) to help readers understand the metrics and values within each statistical analysis. In addition, we had indeed not included the detailed statistics in the text. We have now added the missing statistical reports, calculated as averages over ‘clusters’:

      “Replicating previous findings (Schubert et al., 2023), we found widespread encoding of clear speech (average over cluster: β = 0.035, 94%HDI = [0.024, 0.046]), predominantly over auditory processing regions (Fig. 3C), that was decreased (β = -0.018, 94%HDI = [-0.029, -0.006]) in a multi-speaker condition (Fig. 3D). Furthermore, a stronger prediction tendency was associated with increased neural speech tracking (β = 0.014, 94%HDI = [0.004, 0.025]) over left frontal sensors (see Fig. 3E). We found no interaction between prediction tendency and condition (see Fig. 3F).” [...] “In a direct comparison with lexically identical controls, we found an increased neural tracking of semantic violations (β = 0.039, 94%HDI = [0.007, 0.071]) over left temporal areas (see Fig. 3G). Furthermore, we found no interaction between word surprisal and speaker condition (see Fig. 3H).”

      Regarding the "prediction tendency" effect, it is important to note that this finding replicates a result from Schubert et al. (2023). The left frontal location of this effect is also consistent across studies, which convinces us of the robustness of the finding. Furthermore, testing this relationship properly requires a mixed-effects model in order to account for the variability across subjects that is not explained by fixed effects and for the repeated-measures design. For this reason, a random intercept had to be fitted for each subject (1|subject in the respective model formula). This statistical requirement motivated our decision to use Bayesian statistics, as (at least to our knowledge) there is no implementation of a cluster-based permutation mixed-effects model yet. In order to provide a more conservative criterion (as Bayesian statistics do not require a multiple comparison correction), we additionally imposed the requirement of a “clustered” effect.

      The choice of using two neighboring channels is consistent with the default parameter settings in FieldTrip’s cluster-based permutation testing (cfg.minnbchan = 2). This parameter specifies the minimum number of neighboring channels required for a sample to be included in the clustering algorithm, ensuring spatial consistency in the identified clusters. This alignment ensures that our methodology is comparable to numerous prior studies in the field, where such thresholds are standard. While it is true that all statistical analyses involve some degree of arbitrariness in parameter selection (e.g., alpha levels or clustering thresholds), our approach reflects established conventions and ensures comparability with previous findings.
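
      As an illustration of the spatial criterion described above, the sketch below keeps a channel only if at least two of its neighbouring channels also show a credible effect. This is one plausible reading of the two-neighbour requirement, not FieldTrip's implementation or the authors' code; the function name `clustered_channels`, the sensor labels, and the neighbourhood layout are all hypothetical.

```python
def clustered_channels(credible, neighbours, min_neighbours=2):
    """Keep credible channels with at least `min_neighbours` credible neighbours."""
    return {
        ch
        for ch in credible
        if sum(nb in credible for nb in neighbours.get(ch, ())) >= min_neighbours
    }


# Hypothetical neighbourhood structure for five sensors.
neighbours = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B", "E"],
    "E": ["D"],
}
# Channels whose interval excludes zero (made-up example).
credible = {"A", "B", "C", "E"}

kept = clustered_channels(credible, neighbours)
print(sorted(kept))
```

      In this toy layout the isolated channel "E" is discarded because none of its neighbours shows an effect, whereas "A", "B", and "C" support one another.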

      While the original study utilized source space analyses, we replicated this effect using only 102 magnetometers. This choice was made for computational simplicity, demonstrating that the effect is robust even without source-level modeling. Similarly, the "semantic violation" effect, while perceived as sparse, is based solely on magnetometer data and - in our opinion - should not be viewed as overly sparse given the methods employed. This effect aligns with the two-neighbor clustering approach, ensuring spatial consistency across magnetometers. The results reflect the robustness of the effects within the constraints of magnetometer-level analyses.

      Overall, the methodological choices, including the choice of a Bayesian linear mixed-effects model, the use of two neighboring channels, and the reliance on magnetometers, are grounded in established practices and methodological considerations. While stricter thresholds or alternative approaches might yield different results, our methods align with best practices in the field and ensure the robustness, comparability, and replicability of our findings.

      (2) Figure 3 - the difference between horizontal and vertical eye-movements. This result is quite confusing and although the authors do suggest a possible interpretation for this in the discussion, I do wonder how robust this difference is or whether the ocular signal (in either direction) is simply too noisy or the effect too small to be detected consistently across conditions. Also, the ocular-TRFs themselves are not entirely convincing in suggesting reliable response/tracking of the audio - despite the small-but-significant increase in prediction accuracy.

      The horizontal versus vertical comparison was conducted to explore potential differences in how these dimensions contribute to ocular tracking of auditory stimuli (please also see our response to Reviewer #1, Response 5b, that includes the vertical vs. horizontal effects of Gehmacher et al. 2024). It would indeed be interesting to develop a measure that combines the two directions into a more natural representation of 'viewing,' such as a combined vector. However, this approach would require the use of complex numbers to represent both magnitude and direction simultaneously, and hence the development of novel TRF algorithms capable of modeling this multidimensional signal. While beyond the scope of the current study, this presents an exciting avenue for future research and would allow us to move closer to understanding ocular speech tracking and the robustness of these effects, above and beyond the already successful replication.

      It is also important to emphasize that ocular-TRFs are derived from (viewing) behavioral data rather than neural signals, and are thus inherently subject to greater variability across participants and time. This higher variability does not necessarily indicate a small or unreliable effect but reflects the dynamic and task-dependent nature of eye movement behavior. The TRFs with shaded error margins represent this variability, highlighting how eye movements are influenced by both individual differences and moment-to-moment changes in task engagement.

      Despite this inherent variability, the significant prediction accuracy improvements confirm that ocular-TRFs reliably capture meaningful relationships between eye movements and auditory stimuli. The observed differences between horizontal and vertical TRFs further support the hypothesis that these dimensions are differentially involved in the task, possibly driven by the specific roles they play in sensorimotor coupling.

      (3) Figure 4 - this figure shows source distribution of 3 PCA components, derived from the results of the mediation effect of eye movements on the speech-tracking. Here too I am having difficulty in interpreting what the results actually are. For one, all three components are quite widespread and somewhat overlapping, so although they are statistically "independent" it is hard to learn much from them about the brain regions involved and whether they truly represent separable contributions. Similarly difficult to interpret are the time courses, which share some similarities with the known TRFs to speech (especially PC3). I would have expected to find a cleaner "auditory" response, and clearer separation between sensory regions and regions involved in the control of eye movements. I also wonder why the authors chose not to show the source localization of the neural and ocular speech-tracking responses alone - this could have helped us better understand what "mediation" of the neural response might look like.

      We appreciate the reviewer’s interest in better understanding the source distribution and time courses of the PCA components. While we acknowledge that the widespread and overlapping nature of the components may make a more fine-grained interpretation challenging, it is important to emphasize that our analysis simply reflects the data; hence, we can only present and interpret what the analysis revealed.

      Regarding your suggestion to show the source localization of ocular speech tracking and neural speech tracking alone, we would like to point out that ocular tracking is represented by only one channel for vertical and one channel for horizontal eye movements. Thus, in this case the estimated source of the effect is the eyes themselves. We believe that the source localization of neural speech tracking has been thoroughly studied so far (locating it to perisylvian, auditory areas with a stronger preference for the left hemisphere) and can also be seen in Schubert et al. (2023). Nevertheless, we believe the observed PCA components still provide valuable and, most importantly, novel insights into the interplay between eye movements and neural responses in speech tracking.

      Discussion/interpretation

      (1) Although I appreciate the authors' attempt to propose a "unified" theoretical model linking predictions about low-level features to higher features, and the potential involvement of eye movements in 'active sensing' I honestly think that this model is overambitious, given the data presented in the current study. Moreover, there is very little discussion of past literature and existing models of active sensing and hierarchical processing of speech, that could have helped ground the discussion in a broader theoretical context. The entire discussion contains fewer than 20 citations (some of which are by these authors) and needs to be substantially enriched in order to provide context for the authors' claims.

      Thank you very much for your thoughtful feedback and for appreciating our approach. We hope that the revised manuscript addresses your concerns. Specifically, we now emphasize that our proposal is a conceptual framework, with the main goal to operationalize “prediction tendency”, “active ocular sensing”, and “selective attention” and to “organise these entities according to their assumed function for speech processing and to describe their relationship with each other.” We did this by thoroughly revising our discussion section with a clear emphasis on the definition of terms, for example:

      “With this speculative framework we attempt to describe and relate three important phenomena with respect to their relevance for speech processing: 1) “Anticipatory predictions” that are created in absence of attentional demands and contain probabilistic information about stimulus features (here, inferred from frequency-specific pre-activations during passive listening to sound sequences). 2) “Selective attention” that allocates resources towards relevant (whilst suppressing distracting) information (which was manipulated by the presence or absence of a distractor speaker). And finally 3) “active ocular sensing”, which refers to gaze behavior that is temporally aligned to attended (but not unattended) acoustic speech input (inferred from the discovered phenomenon of ocular speech tracking).”

      Our theoretical proposals are now followed by a recap of our results that support the respective idea, for example: 

      “...these predictions are formed in parallel and carry high feature-specificity but low temporal precision (as they are anticipatory in nature). This idea is supported by our finding that pure-tone anticipation is visible over a widespread prestimulus interval, instead of being locked to sound onset”

      “....we suggest that active (ocular) sensing does not necessarily convey feature- or content-specific information, it is merely used to boost (and conversely filter) sensory input at specific timescales (similar to neural oscillations). This assumption is supported by our finding that semantic violations are not differentially encoded in gaze behaviour than lexical controls.”

      We also put a strong focus on highlighting the boundaries of these ideas, in order to avoid theoretical confusion, misunderstandings, or implicit theoretical assumptions that are not grounded in data, in particular:

      “In fact, when rejecting the notion of a single bottom-up flow of information and replacing it with a model of distributed parallel and dynamic processing, it seems only reasonable to assume that the direction of communication (between our eyes and our brain) will depend on where (within the brain) as well as when we look at the effect. Thus, the regions and time-windows reported here should be taken as an illustration of oculo-neural communication during speech processing rather than an attempt to "explain" neural speech processing by ocular movements.”

      “Even though the terminology [“hierarchy”] is suggestive of a fixed sequence (similar to a multi storey building) with levels that must be traversed one after each other (and even the more spurious idea of a rooftop, where the final perceptual experience is formed and stored into memory), we distance ourselves from these (possibly unwarranted) ideas. Our usage of “higher” or “lower” simply refers to the observation that the probability of a feature at a higher (as in more associative) level affects the interpretation (and thus the representation and prediction) of a feature at lower (as in more segregated) levels (Caucheteux et al., 2023).”

      Additionally, we have made substantial efforts to present complementary results (see response to Reviewer #2, point 8) to further substantiate our interpretation. Importantly, we have updated the illustration of the model (see response to Reviewer #, minor point 1) and refined both our interpretations and the conceptual language in the Discussion. Furthermore, we have included additional citations where appropriate to strengthen our argument.

      We would also like to briefly note that this section of the Discussion aimed to highlight existing literature that bridges the gap our model seeks to address. However, as this is a relatively underexplored area, the references available are necessarily limited.

      (2) Given my many reservations about the data, as presented in the current version of the manuscript, I find much of the discussion to be an over-interpretation of the results. This might change if the authors are able to present more robust results, as per some of my earlier comments.

      We sincerely hope that our comprehensive revisions have addressed your concerns and improved the manuscript to your satisfaction.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer 1:

      The main weaknesses of the paper are a lack of significance in key findings, and relatedly, concluding effects from insignificant findings. Additional elements could be improved to help strengthen this overall well-rounded and intriguing set of results.

      In the original manuscript, we reported that chemogenetic silencing of POA-social neurons (previously called POA-iso neurons; more details on rationale for renaming below in our responses to reviewer recommendations) tended to reduce mounting in both single-housed female and single-housed male mice, although these effects were non-significant. We have added samples to both datasets and now report that chemogenetic silencing of POA-social neurons significantly reduces the proportion of trials with mounting in both sexes (Fig. 2C and Fig. 6G). 

      We have also included new analyses to test whether optogenetic activation of POA-social neurons in group-housed females promotes social investigation (in addition to USV production, as reported in the original manuscript). We now report that optogenetic activation of POA-social neurons significantly increases the probability of social investigation (Fig. 4E-F) and significantly increases the duration of social investigation bouts (Fig. 4G).

      Additional recommendations from the reviewer are addressed in detail below. Thank you for your critical and insightful feedback.

      Reviewer 2:

      All the activity-dependent labeling experiments with TRAP mice, including the subsequent neural activity manipulation experiments (Figures 2, 3, 4, 5E-F), were conducted by labeling neurons only in socially isolated animals, not group-housed animals. The authors labeled neurons after 30-minute social interactions, raising the possibility that the labeled neurons simply represent a "social interaction/behavior population" (mediating mounting and USVs in females and males) rather than a set of neurons specific to social isolation.

      I strongly recommend including experimental groups that involve labeling neurons after 30-minute social interactions in group-housed female or male mice and inhibit TRAPed neurons after social isolation or activate TRAPed neurons after group housing. If manipulating the group-housed TRAP neurons has similar effects to manipulating the isolated TRAP neurons, it would suggest the current labeling paradigm is not isolating neurons specific to the effect of social isolation per se. Rather, the neurons may mediate more general social interaction or motivation-related activities. Given the known role of POA in male mating behavior, a group-housed TRAP experiment in males with a female visitor is especially important for understanding the selectivity of the labeled cells.

      Without proper controls, referring to the labeled neurons as "POAiso" neurons is potentially misleading. The data thus far suggests these neurons may predominantly reflect a "POA social behavior" population rather than a set of cells distinctly responsive to isolated housing.

      We agree with the reviewer that the POA neurons we are studying regulate the production of social behaviors in females and males, rather than representing a set of cells distinctly responsive to single housing. To more clearly reflect our thinking, we have changed the name of the neurons from “POA-iso neurons” to “POA-social neurons”. Thank you for this helpful criticism.

      Our Fos data are consistent with the idea that the POA may regulate social behaviors in group-housed females (not just single-housed females). Namely, we found that counts of Fos-positive POA neurons are significantly related to rates of social investigation (p = 0.01) and tend to be related to USV rates (p = 0.05) in group-housed females that engaged in same-sex interactions (Fig. S1C). We now include two new sets of experiments aimed at further testing this idea.

      First, we include 2 control groups in which TRAPing sessions were performed in group-housed females following same-sex interactions. We find that chemogenetic silencing of group-housed-TRAPed POA neurons fails to reduce social behaviors in females that are subsequently single-housed and given a same-sex social interaction (Fig. 5A-D), and that optogenetic activation of group-housed-TRAPed POA neurons fails to promote female social behavior (Fig. 5E-H). At face value, these findings do not support the idea that the POA contains neurons that regulate social behaviors in group-housed females.

      However, one important caveat is that group-housed females engage in low rates of social behaviors (low investigation time, no mounting, and few USVs), and thus TRAP-based labeling may not work efficaciously in these mice. There may be POA neurons that regulate social behaviors in group-housed females but that do not upregulate Fos following production of relatively low rates of social behaviors. To test this idea, we also include females in which POA neurons are chemogenetically silenced using a viral strategy that does not depend on activity-dependent labeling. In this new experiment, we report that silencing of POA neurons significantly reduces USV production in group-housed females (Fig. 5J-L) and significantly reduces social investigation, mounting, and USV production when these same females are retested following single-housing (Fig. 5M-O). Together, these experiments suggest that the POA may regulate the production of social behaviors during same-sex interactions in group-housed females, but that these effects may be difficult to detect in some cases given the low rates at which group-housed females engage in social behaviors during same-sex interactions relative to single-housed females.

      Finally, we want to highlight an additional new dataset that supports the idea that POA-social neurons regulate social behaviors, rather than encoding the “state” of social isolation. We now include a control group for the chemogenetic silencing of female POA-social neurons, in which females were single-housed but were not given a social interaction prior to 4-OHT treatment (N = 5 non-social controls). Rates of social behaviors were subsequently unaffected following CNO delivery in these females (Fig. S2D-G). These new data support the conclusion that POA-social neurons regulate the production of social behaviors, rather than encoding the state of social isolation.

      Reviewer 3:

      While the authors should be commended for performing and reporting multiple circuit perturbation experiments (e.g., chemogenetics, ablation), the conflicting effects on behavior are hard to interpret without additional experiments. For example, chemogenetic silencing of the POA neurons (using DREADDs) attenuated all three behavioral measures but the ablation of the same POA neurons (using caspase) decreased mounting duration without impacting social investigation or USV production. Similarly, optogenetic activation of POA neurons was sufficient to generate USV production as reported in earlier studies but mounting or social investigation remained unaffected.

      Do these discrepancies arise due to the efficiency differences between DREADD-mediated silencing vs. Casp3 ablation? Or does the chemogenetic result reflect off-manifold effects on downstream circuitry whereas a more permanent ablation strategy allows other brain regions to compensate due to redundancy? It is important to resolve whether these arise due to technical reasons or whether these reflect the underlying (perhaps messy) logic of neural circuitry. Therefore, while it is clear that POA neurons likely contribute to multiple behavioral readouts of social isolation, understanding their exact roles in any greater detail will require further experiments.

      We have added new analyses to consider the possibility that optogenetic activation of female POA-social neurons promotes social investigation. In the original manuscript, we analyzed the duration of social investigation bouts in POA-social-ChR2 females according to whether they overlapped with laser stimulation or whether they did not overlap. We realized that we made an error in this first analysis and inadvertently included social investigation bouts that occurred during the first 5 minutes of the social sessions, prior to any laser stimulation. Because these earlier bouts tend to be longer duration than later bouts, this mistake washed out the effect of laser stimulation on social bout duration. After correcting that error, we now report that optogenetic activation of female POA-social neurons lengthens social investigation bout duration (Fig. 4G). Inspired by this interesting finding, we also included analyses of the probability of social investigation following laser stimulation (Fig. 4E-F; excluding laser stimulations that were preceded by social investigation in the pre-laser baseline period). These analyses support the conclusion that optogenetic activation of POA-social neurons promotes both USV production and social investigation in group-housed females.  

      The majority of the females that we used in our TRAP2-based ablation experiments were heterozygous for TRAP2 (N = 11 of 15 POA-social-caspase subjects were TRAP2;Ai14 females), whereas all females used in our chemogenetic silencing experiments were homozygous for TRAP2. To test whether a more effective ablation of POA-social neurons might drive decreases in social investigation and USV production, we set up additional TRAP2 homozygous POA-social-caspase females and directly compare the effects of ablation between the two genotypes (Fig. S3; N = 11 hets in total and N = 9 homozygotes in total). These experiments revealed that effects on mounting were more pronounced following POA-social ablation in TRAP2 homozygotes vs. heterozygotes, but that neither group exhibited decreased social investigation or USV production following 4-OHT treatment.

      To ask whether caspase-mediated ablation in TRAP2 homozygotes was effective in eliminating neural activity associated with social behaviors in females, we performed Fos immunostaining in a subset of the POA-social-caspase TRAP2 homozygotes following a same-sex interaction. We found that POA Fos expression was robustly reduced in these females relative to control group-housed and control single-housed females that also engaged in same-sex interactions, down to levels seen in group-housed and single-housed females that did not engage in a social interaction (comparison shown in Fig. S3D; control female data same as in Fig. 1). Moreover, the remaining POA Fos in these TRAP2 homozygotes was no longer positively correlated with social investigation or USV production (Fig. S3E-F). Together, these findings lead us to favor the interpretation suggested by the reviewer below, that permanent ablation of POA-social neurons leads to compensation from other brain regions due to redundancy. In addition, our finding that optogenetic activation of POA-social neurons promotes both USV production and social investigation supports the idea that POA-social neurons directly regulate these behaviors. We agree with the reviewer that additional work is needed to understand the complex sex- and context-dependent role played by the POA in the regulation of mouse social behaviors.

      Recommendations for the Authors:

      Reviewer 1 Recommendations:

      (1) The largest issue is that many of the stated "key" behavioral findings are not statistically significant.

      (1a) Figure 2C is not significant and Figure 5G is not significant

      We have added N = 5 POA-social-hM4Di females, N = 3 POA-social-hM4Di males, and N = 3 POA-social-GFP males to the dataset. The decrease in mounting following chemogenetic silencing of POA-social neurons is now statistically significant in both sexes (p < 0.05 for both; see current Figs. 2C and 6G). We also simplified our statistical analysis of mounting in these experiments to consider the proportion of trials with and without resident-initiated mounting on saline vs. CNO days, using McNemar’s test for paired proportions. 
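
      For readers unfamiliar with McNemar’s test for paired proportions, the sketch below computes the exact (binomial) two-sided version from the two discordant counts, i.e., animals that mounted on only one of the two treatment days. The counts are hypothetical and are not taken from the study, the function name `mcnemar_exact` is our own, and the response does not state whether the exact or the chi-square variant was used.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of k or fewer discordant pairs in the rarer direction under
    # the null hypothesis that both directions are equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 8 animals mounted only on the saline day,
# 1 animal mounted only on the CNO day.
p_value = mcnemar_exact(8, 1)
print(f"p = {p_value:.4f}")
```

      Only the discordant pairs enter the test; animals that mounted on both days or on neither day do not affect the p-value.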

      (1b) Mounting graphs are completely omitted in Figure 4. 

      Given that mounting was only observed infrequently in POA-social-ChR2 females, we simply report this information in the Results text (lines 382-388). In our prior summary of the mounting results, we reported that mounting was observed in a total of 3 trials from 2 females, but we inadvertently included information from a duplicate trial from one of the POA-social-ChR2 females in this summary (all other analyses of the POA-social-ChR2 females included one trial per female). We have corrected that error and now report that we observed mounting following laser stimulation in 1 trial from 1 POA-social-ChR2 female. We have expanded our consideration of potential effects of optogenetic activation of POA-social neurons on social investigation and include these new analyses as part of Figure 4 (Fig. 4E-G), following the existing analyses of USV production.

      (1c) Figure 3C shows a reduction of mounting following the ablation of POA (although no stats on the graph to denote significance), but this ablation approach can't resolve whether POA is required to encode the state produced by the short period of isolation, and/or whether it needs to be online at test.

      We have now added an asterisk in Fig. 3C to denote a p value less than 0.05. Thank you for catching our oversight.

      We designed our activity-dependent labeling experiments to TRAP and express viruses in POA neurons that increase their activity in conjunction with the production of social behaviors in single-housed females. We believe our findings are most consistent with the conclusion that these neurons regulate the production of social behaviors, rather than encoding the state of social isolation, and we have renamed these neurons as “POA-social” neurons to better reflect our thinking.

      We also now include control experiments (albeit chemogenetic inhibition, not caspase ablation) in which the TRAP2 strategy is used to express hM4Di in the POA of single-housed females that do not experience a social interaction prior to 4-OHT delivery (non-social controls, Fig. S2D-G). We report that chemogenetic inhibition of these neurons does not decrease social behavior in single-housed females during a subsequent same-sex interaction (p > 0.05 for saline vs. CNO rates of social investigation, mounting, and USVs). These additional findings support the idea that the activity of POA-social neurons is related to the production of social behaviors rather than to the state of social isolation. 

      The reviewer is correct that our ablation approach cannot resolve the question of whether POA-social neuronal activity is required online during testing, but our reversible chemogenetic inhibition experiments provide evidence that the activity of POA-social neurons is required online at the time of testing to regulate social behavior.

      (1d) A similar issue is seen regarding investigation (a general lack of significance with most of the LOF and GOF manipulations).

      As reported in the original manuscript, we find that chemogenetic inhibition of POA-social neurons reduces social investigation in females, while caspase-mediated ablation of female POA-social neurons does not. Our original caspase dataset used mostly but not all TRAP2 heterozygous females (N = 11 TRAP2 heterozygotes (TRAP2;Ai14), generated by crossing TRAP2 mice with Ai14 mice, for the purpose of visualizing the absence of tdTomato labeling to estimate spread of the caspase virus; and N = 4 TRAP2 homozygotes). By adding to the TRAP2 homozygous caspase dataset and comparing the effects on female social behavior of ablation of POA-social neurons in TRAP2 heterozygous vs. TRAP2 homozygous females, we now provide evidence that the attenuation of mounting is more efficacious in TRAP2 homozygous females than in heterozygotes (Fig. S3B). Nonetheless, we fail to see effects on social investigation and USV production, even when caspase ablation of POA-social neurons is performed in TRAP2 homozygous females (Fig. S3A,C).

      In spite of the lack of effect on these behaviors, we show that caspase-mediated ablation of POA-social neurons in TRAP2 homozygous females leads to a dramatic reduction in social interaction-induced Fos expression in the POA. POA Fos expression in these caspase females is reduced to the levels seen in control group-housed and single-housed females that are not given social interactions and are significantly lower than Fos expression in group-housed and single-housed females that are given a same-sex interaction (Fig. S3D). Moreover, the remaining POA Fos expression in the caspase females is no longer related to rates of social investigation (Fig. S3E), as is normally the case in group-housed and single-housed control females (Fig. S1C, left). Together, these data support the idea that some type of neuronal compensation outside of the POA is occurring following ablation of POA-social neurons, and this compensation permits normal levels of USV production and social investigation.

      As in the original manuscript, we report that chemogenetic inhibition of POA-social neurons in male mice reduces mounting but does not reduce social investigation (or USV production). We now include quantification of social behaviors produced by male and female POA-social-hM4Di mice in the TRAPing sessions that preceded 4-OHT delivery (Fig. S5). These measurements show that males spent significantly more time than females engaged in mounting, and we speculate that this bias in TRAPing session behavior might have led to a bias in TRAP-mediated viral labeling of male POA neurons that regulate mounting, at the expense of male POA neurons that regulate social investigation (or USV production).

      We have added new analyses to consider the possibility that optogenetic activation of female POA-social neurons promotes social investigation. In the original manuscript, we analyzed the duration of social investigation bouts in POA-social-ChR2 females according to whether they overlapped with laser stimulation or whether they did not overlap. We realized that we made an error in this first analysis and inadvertently included social investigation bouts that occurred during the first 5 minutes of the social sessions, prior to any laser stimulation. Because these earlier bouts tend to be longer duration than later bouts, this mistake washed out the effect of laser stimulation on social bout duration. After correcting that error, we now report that optogenetic activation of female POA-social neurons lengthens social investigation bout duration (Fig. 4G). Inspired by this encouraging finding, we also included analyses of the probability of social investigation following laser stimulation (Fig. 4E-F; excluding laser stimulations that were preceded by social investigation in the pre-laser baseline period). These analyses support the conclusion that optogenetic activation of POA-social neurons promotes both USV production and social investigation in group-housed females.

      (2) In Figure 1 and elsewhere, the authors use a Mann-Whitney U test, which should be used for non-parametric data, but in other places, they use statistical tests for normally distributed data. Why? How was the normality of distributions tested?

      We tested the normality of data distributions using the Shapiro-Wilk test. Parametric tests were used for analyses that contained normally distributed data, and non-parametric tests were used for analyses that contained non-normally distributed data. This information is included in the Methods (lines 997-1000), and full details of statistical analyses can be found in Table S1.

      (3) The method for "trapping" neurons that are part of the short-term isolation ensemble has some caveats that have not been adequately addressed. First, 4-OHT was administered after social interaction, but before 24 hours of isolation, making it unclear exactly WHAT is being trapped.

      i) Is it neurons that encode the recent 3-day iso experience? (seems unlikely, as this would have been hours after the end of that iso window)

      We now include a group of control females to directly test this possibility (Fig. S2D-G). These TRAP2 females were single-housed for 3 days but were not given a social interaction prior to 4-OHT treatment (N = 5 non-social controls). Presumably, POA neurons TRAPed in these females might encode the experience of short-term isolation. However, we found that chemogenetic inactivation of these TRAPed neurons during a subsequent same-sex interaction failed to decrease social behaviors in single-housed females (Fig. S2E-G; p > 0.05 for CNO vs. saline rates of social investigation, mounting, and USV production). These control experiments support the idea that we are TRAPing neurons whose activity is related to the production of social behaviors, and we have renamed the neurons as “POA-social” neurons to reflect this thinking.

      ii) Is it neurons that encode the recent behavior impacted by the 3-day iso? (this seems to be the goal, but the authors do not provide evidence that the time course of their injection is efficient enough to recruit the recently activated neurons, nor do they provide evidence that opening the trapping window directly after the behavior is better than directly before)

      We opted to perform IP injections of 4-OHT immediately following the behavior session, rather than before it, due to concern that handling the mice and delivering IP injections prior to behavior sessions would stress the mice, leading to lower rates of social behaviors. The nonsocial female hM4Di experiments described above support the idea that we are TRAPing neurons related to the production of social behaviors, as the reviewer suggests.

      iii) Is it trapping neurons active during the subsequent 24 hours of isolation? (seems possible, but this would mean that the authors are looking at a different population of neurons than they claim).

      If chemogenetic silencing of POA neurons that were TRAPed following 3-days of social isolation but in the absence of a social interaction (N = 5 non-social controls, Fig. S2D-G) does not alter social behaviors, there is no compelling reason to hypothesize that TRAPing POA neurons activated following the 24 hours of social isolation that follow a social interaction would do so. Moreover, in the original study characterizing the TRAP2 mice (DeNardo et al., 2019), the authors performed experiments to characterize the time course of TRAPing relative to 4-OHT treatment and concluded that the majority of TRAPing occurs within a 6-hour window centered around the 4-OHT injection.

      (4) Relatedly, the authors seem to find a fair bit of variability in their TRAP-mediated experiments. This begs the question - are the effects of their GOF and LOF approaches

      i) dependent on the iso-behaviors that were "trapped" for each animal (in other words, how does behavior at test 1 correlate with behavior at test 2)? 

      To test the reviewer’s idea, we compared rates of TRAPing session behaviors for the POA-social-hM4Di females to the subsequent effects of neuronal silencing on these behaviors (calculated as CNO behavior – saline behavior). These correlations are shown in Fig. S2A-C and are all non-significant. We also include below for the reviewer the same types of correlations for the other datasets in our study (loss-of-function experiments: female POA-social-caspase, male POA-social-hM4Di; and gain-of-function experiments: female POA-social-ChR2).

      Author response image 1.

      The only loss-of-function experiment comparison in the above figure that reveals a negative and significant correlation is the mounting comparison for the POA-social-hM4Di males (time spent mounting during TRAPing session vs. (CNO time spent mounting – saline time spent mounting)). This significant correlation likely reflects the fact that (1) no males mounted in the CNO session and (2) mounting rates for individual males are relatively consistent over time (in comparison to female mounting, which is more variable; see Author response image 2 below of TRAPing session vs. saline mounting in male vs. female POA-social-hM4Di experiments). The correlation between TRAPing session and testing session mounting is significant for the POA-social-ChR2 females, but despite the significant correlation, we would want to see more instances of optogenetically-elicited mounting to make any claim about its relationship to TRAPing session behavior.

      Author response image 2.

      Nonetheless, we agree with the reviewer’s intuition that one would expect the effects of POA activity manipulations on different behaviors to scale with the rates at which these behaviors were performed during the TRAPing session. We speculate that variability in the TRAPing process might have obscured such a relationship. There is inevitable variability in the exact body cavity placement of IP injections, which can affect drug absorption. In addition, we delivered a fixed volume of 4-OHT (10 mg/mL 4-OHT in 150 uL filtered corn oil) to all mice in the study, regardless of their weight, which likely added variability in TRAPing efficacy from animal to animal. This detail was reported inaccurately in the Methods, and that error has been corrected (line 920). With regard to our male POA-social-hM4Di dataset, we find that these males spend more time mounting during their TRAPing sessions than female POA-social-hM4Di mice (Fig. S5; males also spent less time investigating and tended to produce fewer USVs than females), a fact that we hypothesize may have led to a bias toward TRAPing mounting-related POA neurons in male subjects. In addition, the fact that male mice typically weigh more than females and would have received a slightly lower effective dosage of 4-OHT may also have contributed to the weaker effects on behavior in the male POA-social-hM4Di experiments relative to the female POA-social-hM4Di experiments.
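      The weight dependence of a fixed-volume injection can be made concrete with a short calculation. The following is a minimal sketch, not part of the study's analysis: the concentration (10 mg/mL) and volume (150 uL) are taken from the text above, whereas the body weights are illustrative placeholders and not measurements from these animals.

```python
def effective_dose_mg_per_kg(conc_mg_per_ml, volume_ul, body_weight_g):
    """Weight-normalized dose (mg/kg) delivered by a fixed-volume injection."""
    total_mg = conc_mg_per_ml * (volume_ul / 1000.0)   # uL -> mL
    return total_mg / (body_weight_g / 1000.0)         # g -> kg


CONC_MG_PER_ML = 10.0   # from the response text
VOLUME_UL = 150.0       # from the response text

# Hypothetical body weights (g), chosen only to illustrate the trend.
for weight_g in (20.0, 25.0, 30.0):
    dose = effective_dose_mg_per_kg(CONC_MG_PER_ML, VOLUME_UL, weight_g)
    print(f"{weight_g:.0f} g mouse: {dose:.1f} mg/kg")
```

      Under these assumed weights, the same 1.5 mg of 4-OHT corresponds to a lower mg/kg dose in the heavier animal, which is the direction of the bias described above.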

      We also want to highlight that interpreting correlations for females between time spent mounting during the TRAPing session and time spent mounting during the test sessions can be complicated. For example, we see 2 cases in the female POA-social-hM4Di dataset in which the female did not mount in the TRAPing session, and then mounted on the saline day (12s and 10s total mounting for those 2 females) but not on the CNO day. One interpretation of the data from these 2 females is that mounting on the TRAPing day is not required to attenuate mounting on the later test days. However, female mounting behavior itself is variable, both across different females and across different tests of a given female, as noted above. If we consider all single-housed females included in our dataset for which we quantified control behavioral data (i.e., behavior trials from unmanipulated females and TRAPing sessions from females that were later manipulated), we find that mounting is not observed in ~30% of the females (24 of 83). In ongoing behavioral experiments not included in this manuscript, we are investigating factors that regulate female mounting following single-housing. In that dataset, we also see little evidence that female mounting in one social interaction predicts mounting in a subsequent interaction (i.e., there don’t appear to be stable “high mounters” and “low mounters” following single housing). Thus, the small number of cases in which females did not mount in the TRAPing session and then displayed mounting on the CNO only day are difficult to interpret.

      Two additional considerations are that TRAPing may not be equally efficacious for POA neurons that regulate different behaviors, and that different behaviors may be differentially sensitive to perturbations of the POA. Previous elegant calcium imaging work has shown that different subsets of Esr1+ POA neurons exhibit activity that is “tuned” to specific behaviors (sniffing vs. mounting in males interacting with females; Yang et al., 2023). However, it is possible that these subsets of neurons display differential levels of Fos expression following the production of their preferred behavior and that some behavior-related subsets may thus be more easily TRAPed than others. It may also be the case that some behaviors are more easily disrupted by POA activity manipulations than others (e.g., perturbation in a smaller percentage of behavior-related POA neurons may be required to disrupt some behaviors relative to others). 

      Despite these caveats, we have two lines of evidence that the effects of chemogenetic silencing of POA-social neurons depend on the behaviors produced during the TRAPing sessions.

      (1) Social behavior is required during the TRAPing session to see subsequent effects on social behavior following chemogenetic silencing of TRAPed POA neurons. In control females that were single-housed but were not given a social interaction prior to 4-OHT treatment, social behaviors are not reduced by chemogenetic silencing of TRAPed POA neurons (Figs. S2D-G).

      (2) To directly test whether mounting in the TRAPing session is required to see attenuation of mounting during subsequent chemogenetic silencing of POA-social neurons, we performed control experiments in which single-housed females interacted with a female visitor that was placed under a cup during the TRAPing session prior to 4-OHT treatment. Mounting was not possible in this context, and we also found that females produced lower rates of USVs during the TRAPing session relative to single-housed females engaged in free social interaction. However, subject females spent more time engaged in social investigation of the visitor relative to single-housed females engaged in free social interactions (see Author response image 3 below).

      Author response image 3.

      Unfortunately, none of the experimental females in this cohort displayed mounting in the CNO or saline sessions. Given that we could not use this dataset to address the intended question, we did not include it in the manuscript. However, it is quite interesting that female subjects displayed higher than normal social investigation and lower than normal USV production in their TRAPing sessions (relative to single-housed females engaged in free interactions), and subsequently, chemogenetic inhibition of TRAPed POA neurons decreased social investigation but did not decrease USV production (Author response image 4 below).

      Author response image 4.

      Together, we think our data support the idea that the POA neurons that are TRAPed are related to the social behaviors performed by the animals, but these relationships may be complex and difficult to detect from comparisons across animals within a single experimental group.

      And/or are they

      ii) influenced by the spread or amount of virus for each animal? These correlations could help shed light on what exactly is being trapped - is it specific behaviors or is it the "state" of short-term isolation?

      Our control experiments with females that were single-housed but did not receive a social interaction prior to 4-OHT treatment provide evidence that the production of social behaviors is required to see subsequent effects on behavior following chemogenetic inhibition of TRAPed POA neurons (Figs. S2D-G).

      The same volume of virus was injected across all activity manipulation experiments (200 nL). Because of the trajectory of our POA viral injections (performed at a slight rostral angle relative to vertical), we did sometimes see viral labeling that spread into the AH caudal to the POA. For this reason, we included the AH TRAPed control group (Fig. 2), to rule out the possibility that viral spread into the AH could account for the effects of chemogenetic silencing of POA-social neurons on female social behaviors. Also because of the injection angle used, we don’t see substantial viral spread rostral to our injection coordinates. In short, there isn’t systematic variability in the targeting or spread of our POA viral injections that can account for variability in the effects on USV production and social investigation of our LOF and GOF manipulations (female hM4Di and female ChR2 experiments).

      In older lesion studies in male rodents and birds, there is some support for the idea that rostral vs. caudal POA neurons differentially regulate appetitive vs. consummatory sexual behaviors (as reviewed in Balthazart and Ball, 2007). However, all of our viral injections were placed in what that review paper would have considered ‘caudal’ POA. We also note that more recent imaging studies have reported that subsets of POA neurons are differentially tuned to male sniffing vs. male mounting (Yang et al., 2023), and these subsets must be relatively co-localized given that they are imaged in the same field of view. Whether distinct subsets of POA neurons regulate the production of different female social behaviors, and if so, how these subsets are localized within the POA, remains an important question for future study.

      (5) The authors label their region of interest as the "POA" but images throughout (e.g. their fos image, Figure 1E), look more like the MPO. Why label it POA?

      The POA neurons in our study are found in a band that spans the medial POA, as well as a bit of the lateral POA. To avoid over-specifying, we call this region the POA more generally.

      (6) In all the experiments, mice are isolated and then re-group housed with siblings. Do all the siblings in the group belong to the same experimental group, or are siblings naïve? This may be critical to help determine whether some of the effects observed may be "group" effects.

      In general, multiple (although not always all) mice in a cage belonged to the same experimental group. In our inhibitory DREADDs experiments, it is unclear how that could drive our observed effects on behavior, given that home cage behavior would only be expected to differ for a given mouse in the time period following their CNO session. 

      For the female POA-social-caspase mice, we cannot rule out the possibility that their home cage behaviors differed in the time period following 4-OHT treatment and re-group-housing and prior to post-4-OHT behavior measurements. However, given that the only social behavior affected by ablation of POA-social neurons was mounting, and that rates of mounting would be expected to be very low in group-housed females within home cages, it is unclear how our experimental result could be attributed to group effects.

      If by “group” effects the reviewer means “litter” effects, we include a plot below that shows the CNO vs. saline behaviors for the POA-social-hM4Di females, separated by cage ID. There is no evidence that the effects of chemogenetic silencing of POA-social-hM4Di females are being driven by only certain cages (only social investigation and USVs are shown, because mounting was uniformly low (1 of 17 females mounted) in the CNO session).

      Author response image 5.

      (7) For chemogenetic experiments, the authors state that CNO and Saline were given in a counterbalanced order (eg line 189). Did the authors see any order effects?

      We did not see order effects, and we can include plots of those data below for the female and male POA-social-hM4Di groups, with mice plotted according to which treatment they received first.

      Author response image 6.

      (8) In the control experiments in Figure 2 where VMH or AH are chemogenetically silenced, it isn't clear whether these groups include mice that were subjected to 3 days of isolation. Please clarify.

      Yes, these female groups were also subjected to 3 days of isolation (first prior to the TRAPing session, and for a second time prior to the onset of the CNO/saline testing sessions). That information has been clarified in the Results section (line 214) and in the Methods (lines 935-938).

      (9) Line 312. The title for this section, "POA neurons increase their activity....." is somewhat misleading. It sounds like the authors imaged trapped neurons. I think what they mean is that more POA neurons are activated following opposite-sex interactions with males.

      Thanks for this catch. We have modified the section title, as well as the title of the first results sub-section.

      (10) Figure 5A, right panels. The authors fail to find an increase in the investigation of male-male pairs following the short-term isolation of one. This contrasts with the main finding in Matthews et al., 2016 Cell, where short periods of isolation are said to promote pro-social behaviors. The authors could comment on this discrepancy in their discussion (eg difference in testing apparatus/test type? Difference in the number of days of isolation? etc.).

      In current Fig. 6A, there is no significant interaction between the two main effects, but each main effect is significant: single-housed males spend more time investigating partners than group-housed males, and males spend more time investigating female partners than male partners. The significant main effect of housing condition is consistent with the findings of Matthews et al., 2016 and is included within the Results (lines 486-492). 

      (11) Figure 5F, the authors seem to have a main effect of virus (more overall investigation in dreadds mice). Nothing about this is addressed.

      We sometimes see differences in social behavior between cohorts of males when they are tested at different times and, correspondingly, with different groups of female social partners. Our POA-social-hM4Di and POA-social-GFP males were set-up and tested at largely non-overlapping times. We have added a brief note to the Results section to include this information (lines 535-539).

      Reviewer 2 Recommendations:

      (1) (C)ritical control experiments are missing to support this claim (that a population of preoptic hypothalamic neurons contribute to the effects of short-term social isolation on the social behaviors of female mice).  

      (1a) All the activity-dependent labeling experiments with TRAP mice, including the subsequent neural activity manipulation experiments (Figures 2, 3, 4, 5E-F), were conducted by labeling neurons only in socially isolated animals, not group-housed animals. The authors labeled neurons after 30-minute social interactions, raising the possibility that the labeled neurons simply represent a "social interaction/behavior population" (mediating mounting and USVs in females and males) rather than a set of neurons specific to social isolation behaviors of mice)… The data thus far suggests these neurons may predominantly reflect a "POA social behavior" population rather than a set of cells distinctly responsive to isolated housing.

      We agree with the reviewer that the POA neurons we are studying regulate the production of social behaviors in females and males, rather than representing a set of cells distinctly responsive to single housing. To more clearly reflect our thinking, we have changed the name of the neurons from “POA-iso neurons” to “POA-social neurons”. Thank you for this helpful criticism.

      Our Fos data are consistent with the idea that the POA may regulate social behaviors in group-housed females (not just single-housed females). Namely, we found that counts of Fos-positive POA neurons are significantly related to rates of social investigation (p = 0.01) and tend to be related to USV rates (p = 0.05) in group-housed females that engaged in same-sex interactions (Fig. S1C). We now include two new sets of experiments aimed at further testing this idea.

      First, we include 2 control groups in which TRAPing sessions were performed in group-housed females following same-sex interactions. We find that chemogenetic silencing of these group-housed-TRAPed POA neurons fails to reduce social behaviors in females that are subsequently single-housed and given a same-sex social interaction (Fig. 5A-D; GH-TRAPed POA hM4Di females), and that optogenetic activation of group-housed-TRAPed POA neurons fails to promote female social behavior (Fig. 5E-H; GH-TRAPed POA ChR2 females). At face value, these findings do not support the idea that the POA contains neurons that regulate social behaviors in group-housed females.

      However, one important caveat is that group-housed females engage in low rates of social behaviors (low investigation time, no mounting, and few USVs), and thus TRAP-based labeling may not work efficaciously in these mice. There may be POA neurons that regulate social behaviors in group-housed females but that do not upregulate Fos following production of relatively low rates of social behaviors. To test this idea, we also include females in which POA neurons are chemogenetically silenced using a viral strategy that does not depend on activity-dependent labeling. In this new experiment, we report that silencing of POA neurons significantly reduces USV production in group-housed females (Fig. 5J-L) and significantly reduces social investigation, mounting, and USV production when these same females are retested following single-housing (Fig. 5M-O).

      (2) Please add strain background information of subject animals in the methods.

      This information has been added to the Animals section within the Methods (lines 788-802).

      Responses to Reviewer 3 Recommendations:

      (1a) (T)he conflicting effects on behavior are hard to interpret without additional experiments….Similarly, optogenetic activation of POA neurons was sufficient to generate USV production as reported in earlier studies but mounting or social investigation remained unaffected. 

      We have added new analyses to consider the possibility that optogenetic activation of female POA-social neurons promotes social investigation. In the original manuscript, we analyzed the duration of social investigation bouts in POA-social-ChR2 females according to whether they overlapped with laser stimulation or whether they did not overlap. We realized that we made an error in this first analysis and inadvertently included social investigation bouts that occurred during the first 5 minutes of the social sessions, prior to any laser stimulation. Because these earlier bouts tend to be longer duration than later bouts, this mistake washed out the effect of laser stimulation on social bout duration. After correcting that error, we now report that optogenetic activation of female POA-social neurons lengthens social investigation bout duration (Fig. 4G). Inspired by this interesting finding, we also included analyses of the probability of social investigation following laser stimulation (Fig. 4E-F; excluding laser stimulations that were preceded by social investigation in the pre-laser baseline period). These analyses support the conclusion that optogenetic activation of POA-social neurons promotes both USV production and social investigation in group-housed females.

      (1b) Do these discrepancies (between hM4Di and caspase) arise due to the efficiency differences between DREADD-mediated silencing vs. Casp3 ablation? Or does the chemogenetic result reflect off-manifold effects on downstream circuitry whereas a more permanent ablation strategy allows other brain regions to compensate due to redundancy? It is important to resolve whether these arise due to technical reasons or whether these reflect the underlying (perhaps messy) logic of neural circuitry.  

      The possibility that the difference in effects on behavior between chemogenetic silencing and caspase ablation reflects a difference in efficiency seems, at face value, inconsistent with the findings of previous experiments, in which ablation of large numbers of POA neurons failed to reduce USV production in male mice (POA lesions in Bean et al., 1981; ablation of VGAT+ POA neurons by Gao et al., 2018). These findings stand in contrast to those using chemogenetic silencing of large numbers of POA neurons, which report reduced USV production in male mice (VGAT+/Esr1+ in Karigo et al., 2021; Esr1+ in Chen et al., 2021).

      However, it is the case that the majority of the females that we used in our TRAP2-based ablation experiments were heterozygous for TRAP2 (N = 11 of 15 POA-social-caspase subjects were TRAP2;Ai14 females), whereas all females used in our chemogenetic silencing experiments were homozygous for TRAP2. To test whether a more effective ablation of POA-social neurons might drive decreases in social investigation and USV production, we set up additional TRAP2 homozygous POA-social-caspase females and directly compared the effects of ablation between the two genotypes (Fig. S3; N = 11 hets in total and N = 9 homozygotes in total). These experiments revealed that effects on mounting were more pronounced following POA-social ablation in TRAP2 homozygotes vs. heterozygotes, but that neither group exhibited decreased social investigation or USV production following 4-OHT treatment.

      To ask whether caspase-mediated ablation in TRAP2 homozygotes was effective in eliminating neural activity associated with social behaviors in females, we performed Fos immunostaining in a subset of the POA-social-caspase TRAP2 homozygotes following a same-sex interaction. We found that POA Fos expression was robustly reduced in these females relative to control group-housed and control single-housed females that also engaged in same-sex interactions, down to levels seen in group-housed and single-housed females that did not engage in a social interaction (comparison shown in Fig. S3D; control female data same as in Fig. 1). Moreover, the remaining POA Fos in these TRAP2 homozygotes was no longer positively correlated to social investigation or USV production (Fig. S3E-F). Together, these findings lead us to favor the interpretation suggested by the reviewer below, that permanent ablation of POA-social neurons leads to compensation from other brain regions due to redundancy.

      Given the negative results above, we favor this possibility and indicate so in our Discussion. In addition, our finding that optogenetic activation of POA-social neurons promotes both USV production and social investigation supports the idea that POA-social neurons directly regulate these behaviors. We agree with the reviewer that additional work is needed to understand the complex sex- and context-dependent role played by the POA in the regulation of mouse social behaviors.

      (2) L 49: Please define Mesolimbic circuitry the first time it is mentioned.

      We have added a definition (lines 52-53).

      (3) L 210: In Figure 2C, the mounting duration baseline (saline) distribution seems lower than the same experimental baseline in Figures 1C and 3C. Does this reflect natural variability in the behavioral assay and might this be mitigated by additional sampling of animals?

      Yes, there is substantial variability in the display of mounting behavior by single-housed females, including in the proportion of trials with mounting as well as in the total duration of mounting. In the revised manuscript, we have simplified our analysis of mounting in our TRAP-based experiments to quantify the proportion of trials with mounting, rather than considering the total time spent mounting. After adding N = 5 additional females to the POA-social-hM4Di dataset, we now report a statistically significant decrease in the proportion of trials with mounting following chemogenetic silencing of POA-social neurons (Fig. 2C; McNemar’s test for paired proportions).
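      For readers unfamiliar with McNemar’s test, the sketch below illustrates how a paired-proportions comparison of this kind can be computed. It is a minimal illustration under stated assumptions: it uses the exact (binomial) form of the test, which may differ from the variant used in the manuscript, and the counts in the usage example are hypothetical placeholders rather than data from the study. Only discordant pairs (animals that mounted in one session but not the other) contribute to the test.

```python
from math import comb


def count_discordant(saline_mounted, cno_mounted):
    """Count discordant pairs from per-animal booleans (same animal order).

    Returns (b, c): b = mounted on saline only, c = mounted on CNO only.
    """
    b = sum(1 for s, d in zip(saline_mounted, cno_mounted) if s and not d)
    c = sum(1 for s, d in zip(saline_mounted, cno_mounted) if d and not s)
    return b, c


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) probability of k or fewer, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical example: 10 animals, 8 mounted on saline only, none on CNO only.
saline = [True] * 8 + [False] * 2
cno = [False] * 10
b, c = count_discordant(saline, cno)
print(b, c, mcnemar_exact_p(b, c))
```

      Animals that mounted in both sessions or in neither session do not affect the result, which is why the test suits a within-animal comparison of a binary outcome.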

      (4) L 310: The authors claim that "These findings suggest that a subset of POAiso neurons overlap with GABAergic, PAG-projecting POA neurons that have been demonstrated in previous work to promote USVs via disinhibition of excitatory PAG neurons important to USV production (Chen et al., 2021; Michael et al., 2020)." I think the data reported suggests the opposite since only 18.3% of all POA->PAG neurons are cFos+. Perhaps better rephrased as "A subset (18.3%) of POA->PAG neurons are labelled by cFos and that is sufficient to drive the production of USVs". Is it surprising?

      We modified the phrasing (lines 468-469), but a bit differently than suggested above, because although we suspect that optogenetic activation of the PAG-projecting neurons within the larger population of POA-social neurons is responsible for eliciting USV production, we did not technically demonstrate this to be the case in the current dataset. 

      We do find it surprising that so few (only ~20%) of PAG-projecting POA neurons upregulate Fos following female-female interactions marked by high rates of USV production. Even though optogenetic activation of PAG-projecting POA neurons elicits USV production, our finding suggests that the majority of PAG-projecting POA neurons may not play a role in regulating vocalization. In future work, it may be useful to apply an intersectional approach to further understand how the POA regulates USV production (for example, measure or manipulate activity selectively in projection-defined subsets of POA-social neurons).

      (5) Given the considerable prior evidence of POA->PAG circuit in promoting USVs, it is hard to understand why chemogenetic inactivation of POA neurons in males affects mounting but not USV production (Figures 5F-H). Any potential explanation for this discrepancy?

      We have two ideas about this surprising result. First, we examined the TRAPing session social behaviors of female and male POA-social-hM4Di mice. We found that male POA-social-hM4Di mice spent more time than female subjects mounting during the TRAPing sessions, and conversely, males spent less time investigating visitors and tended to produce fewer USVs than female subjects (Fig. S5). Given that our labeling method is activity-dependent, one possibility is that this bias in behavior is reflected in a bias toward labeling of POA neurons related to mounting.

      Second, each mouse in the TRAP2-based hM4Di datasets received an IP injection of the same amount of 4-OHT (150 uL of 10 mg/mL 4-OHT in filtered corn oil), not adjusted for the weight of the mouse. This information was not reported accurately in the Methods, and we have adjusted that section accordingly (line 920). As a result, because male mice typically weigh more than females and would have received a lower effective dosage of 4-OHT, another possibility is that TRAPing in males was less efficient than in females and accounts for the less complete effects on social behaviors. We have added language to the Results to discuss these possibilities (lines 540-560).

      (6) L 472: Typo. "we found that short-term isolation exerts more robust on the effects of male behavior during subsequent interactions with females than during interactions with males."

      Thank you for catching this mistake.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, the authors address whether the dorsal nucleus of the inferior colliculus (DCIC) in mice encodes sound source location within the front horizontal plane (i.e., azimuth). They do this using volumetric two-photon Ca2+ imaging and high-density silicon probes (Neuropixels) to collect single-unit data. Such recordings are beneficial because they allow large populations of simultaneous neural data to be collected. Their main results and the claims about those results are the following:

      (1) DCIC single-unit responses have high trial-to-trial variability (i.e., neural noise);

      (2) approximately 32% to 40% of DCIC single units have responses that are sensitive to sound source azimuth;

      (3) single-trial population responses (i.e., the joint response across all sampled single units in an animal) encode sound source azimuth "effectively" (as stated in title) in that localization decoding error matches average mouse discrimination thresholds;

      (4) DCIC can encode sound source azimuth in a similar format to that in the central nucleus of the inferior colliculus (as stated in Abstract);

      (5) evidence of noise correlation between pairs of neurons exists;

      (6) noise correlations between responses of neurons help reduce population decoding error.

      While simultaneous recordings are not necessary to demonstrate results #1, #2, and #4, they are necessary to demonstrate results #3, #5, and #6.

      Strengths:

      - Important research question to all researchers interested in sensory coding in the nervous system.

      - State-of-the-art data collection: volumetric two-photon Ca2+ imaging and extracellular recording using high-density probes. Large neuronal data sets.

      - Confirmation of imaging results (lower temporal resolution) with more traditional microelectrode results (higher temporal resolution).

      - Clear and appropriate explanation of surgical and electrophysiological methods. I cannot comment on the appropriateness of the imaging methods.

      Strength of evidence for claims of the study:

      (1) DCIC single-unit responses have high trial-to-trial variability - The authors' data clearly shows this.

      (2) Approximately 32% to 40% of DCIC single units have responses that are sensitive to sound source azimuth - The sensitivity of each neuron's response to sound source azimuth was tested with a Kruskal-Wallis test, which is appropriate since response distributions were not normal. Using this statistical test, only 8% of neurons (median for imaging data) were found to be sensitive to azimuth, and the authors noted this was not significantly different than the false positive rate. The Kruskal-Wallis test was not performed on electrophysiological data. The authors suggested that low numbers of azimuth-sensitive units resulting from the statistical analysis may be due to the combination of high neural noise and relatively low number of trials, which would reduce statistical power of the test. This may be true, but if single-unit responses were moderately or strongly sensitive to azimuth, one would expect them to pass the test even with relatively low statistical power. At best, if their statistical test missed some azimuth-sensitive units, they were likely only weakly sensitive to azimuth. The authors went on to perform a second test of azimuth sensitivity, a chi-squared test, and found 32% (imaging) and 40% (e-phys) of single units to have statistically significant sensitivity. This feels a bit like fishing for a lower p-value. The Kruskal-Wallis test should have been left as the only analysis. Moreover, the use of a chi-squared test is questionable because it is meant to be used between two categorical variables, and neural response had to be binned before applying the test.

      The determination of what is a physiologically relevant “moderate or strong azimuth sensitivity” is not trivial, particularly when comparing tuning across different relays of the auditory pathway like the CNIC, auditory cortex, or in our case DCIC, where physiologically relevant azimuth sensitivities might be different. This is likely the reason why azimuth sensitivity has been defined in diverse ways across the bibliography (see Groh, Kelly & Underhill, 2003 for an early discussion of this issue). These diverse approaches include reaching a certain percentage of maximal response modulation, like used by Day et al. (2012, 2015, 2016) in CNIC, and ANOVA tests, like used by Panniello et al. (2018) and Groh, Kelly & Underhill (2003) in auditory cortex and IC respectively. Moreover, the influence of response variability and biases in response distribution estimation due to limited sampling has not been usually accounted for in the determination of azimuth sensitivity.

      As Reviewer #1 points out, in our study we used an appropriate ANOVA test (Kruskal-Wallis) as a starting point to study response sensitivity to stimulus azimuth at DCIC. Please note that the alpha = 0.05 used for this test is not based on experimental evidence about physiologically relevant azimuth sensitivity but instead is an arbitrary p-value threshold. Using this test on the electrophysiological data, we found that ~21% of the simultaneously recorded single units reached significance (n = 4 mice). Nevertheless, these percentages, in our small sample (n = 4), were not significantly different from our false positive detection rate (p = 0.0625, Mann-Whitney; see Author response image 1 below). In consequence, for both our imaging (Fig. 3C) and electrophysiological data, we could not ascertain whether the neurons reaching significance in these ANOVA tests were indeed meaningfully sensitive to azimuth or whether this was due to chance.

      Author response image 1.

      Percentage of the neuropixels recorded DCIC single units across mice that showed significant median response tuning, compared to false positive detection rate (α = 0.05, chance level).

      We reasoned that the observed markedly variable responses from DCIC units, which frequently failed to respond in many trials (Fig. 3D, 4A), in combination with the limited number of trial repetitions we could collect, result in under-sampled response distribution estimations. This under-sampling can bias the determination of stochastic dominance across azimuth response samples in Kruskal-Wallis tests. We would like to highlight that we decided not to implement resampling strategies to artificially increase the azimuth response sample sizes with “virtual trials”, in order to avoid “fishing for a smaller p-value”, when our collected samples might not accurately reflect the actual response population variability.

      As an alternative to hypothesis testing based on ranking and determining stochastic dominance of one or more azimuth response samples (Kruskal-Wallis test), we evaluated the overall statistical dependency to stimulus azimuth of the collected responses. To do this, we implemented the Chi-square test by binning neuronal responses into categories. Binning responses into categories can reduce the influence of response variability to some extent, which constitutes an advantage of the Chi-square approach, but we note the important consideration that these response categories are arbitrary.

      Altogether, we acknowledge that our Chi-square approach to define azimuth sensitivity is not free of limitations and that, despite enabling the interrogation of azimuth sensitivity at DCIC, its interpretability might not extend to other brain regions like CNIC or auditory cortex. Nevertheless, we hope the aforementioned arguments justify why the Kruskal-Wallis test simply could not “have been left as the only analysis”.
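      The dependency test described above can be illustrated with a minimal sketch. It bins a unit's single-trial responses into categories and computes the Pearson chi-square statistic of the azimuth-by-category table. The bin edges, the function names, and the permutation-based p-value are assumptions made for this illustration only (the permutation step is swapped in so the sketch needs nothing beyond the Python standard library); it is not the authors' analysis code.

```python
import random
from collections import Counter


def bin_response(value, edges):
    """Return the index of the response category for `value`.

    `edges` are ascending upper bounds; values above the last edge
    fall into the final category. The edges are arbitrary choices.
    """
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)


def chi_square_statistic(azimuths, categories):
    """Pearson chi-square statistic for the azimuth x category table."""
    n = len(azimuths)
    row_totals = Counter(azimuths)
    col_totals = Counter(categories)
    observed = Counter(zip(azimuths, categories))
    stat = 0.0
    for a, row_n in row_totals.items():
        for c, col_n in col_totals.items():
            expected = row_n * col_n / n
            stat += (observed[(a, c)] - expected) ** 2 / expected
    return stat


def permutation_p_value(azimuths, categories, n_perm=2000, seed=0):
    """Assumed stand-in for the chi-square p-value: shuffle the category
    labels across trials and count how often the shuffled statistic
    reaches the observed one."""
    rng = random.Random(seed)
    observed = chi_square_statistic(azimuths, categories)
    shuffled = list(categories)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if chi_square_statistic(azimuths, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 3 azimuths, 4 trials each, one unit.
toy_azimuths = [-60] * 4 + [0] * 4 + [60] * 4
toy_responses = [0.0, 0.1, 0.0, 0.3, 0.2, 1.1, 0.9, 0.0, 2.5, 3.0, 0.0, 2.2]
toy_categories = [bin_response(r, edges=[0.5, 2.0]) for r in toy_responses]
toy_stat = chi_square_statistic(toy_azimuths, toy_categories)
```

      Because the category boundaries are arbitrary, rerunning such a sketch with different `edges` is a quick way to see how strongly the outcome depends on the binning.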

      (3) Single-trial population responses encode sound source azimuth "effectively" in that localization decoding error matches average mouse discrimination thresholds - If only one neuron in a population had responses that were sensitive to azimuth, we would expect that decoding azimuth from observation of that one neuron's response would perform better than chance. By observing the responses of more than one neuron (if more than one were sensitive to azimuth), we would expect performance to increase. The authors found that decoding from the whole population response was no better than chance. They argue (reasonably) that this is because of overfitting of the decoder model (too few trials used to fit too many parameters) and provide evidence from decoding combined with principal components analysis which suggests that overfitting is occurring. What is troubling is the performance of the decoder when using only a handful of "top-ranked" neurons (in terms of azimuth sensitivity) (Fig. 4F and G). Decoder performance seems to increase when going from one to two neurons, then decreases when going from two to three neurons, and doesn't get much better for more neurons than for one neuron alone. It seems likely there is more information about azimuth in the population response, but decoder performance is not able to capture it because spike count distributions in the decoder model are not being accurately estimated due to too few stimulus trials (14, on average). In other words, it seems likely that decoder performance is underestimating the ability of the DCIC population to encode sound source azimuth.

      To get a sense of how effective a neural population is at coding a particular stimulus parameter, it is useful to compare population decoder performance to psychophysical performance. Unfortunately, mouse behavioral localization data do not exist. Therefore, the authors compare decoder error to mouse left-right discrimination thresholds published previously by a different lab. However, this comparison is inappropriate because the decoder and the mice were performing different perceptual tasks. The decoder is classifying sound sources to 1 of 13 locations from left to right, whereas the mice were discriminating between left or right sources centered around zero degrees. The errors in these two tasks represent different things. The two data sets may potentially be more accurately compared by extracting information from the confusion matrices of population decoder performance. For example, when the stimulus was at -30 deg, how often did the decoder classify the stimulus to a lefthand azimuth? Likewise, when the stimulus was +30 deg, how often did the decoder classify the stimulus to a righthand azimuth?
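      The confusion-matrix readout suggested in the preceding paragraph could be sketched as follows. The azimuth grid, the trial counts, and the function name are hypothetical values chosen for illustration, with negative azimuths taken as left and positive as right; none of the numbers come from the study.

```python
def hemifield_accuracy(azimuths, confusion, target):
    """Fraction of trials presented at `target` that were decoded to an
    azimuth on the same side (same sign) as the target.

    `confusion[i][j]` counts trials presented at azimuths[i] and decoded
    as azimuths[j]. A decoded azimuth of 0 counts as neither side.
    """
    if target == 0:
        raise ValueError("hemifield accuracy is undefined for the midline")
    row = confusion[azimuths.index(target)]
    total = sum(row)
    if total == 0:
        raise ValueError("no trials recorded for this azimuth")
    same_side = sum(n for az, n in zip(azimuths, row) if az * target > 0)
    return same_side / total


# Hypothetical 5-azimuth example with 14 trials per presented azimuth.
toy_azimuths = [-60, -30, 0, 30, 60]
toy_confusion = [
    [7, 4, 1, 1, 1],
    [3, 6, 2, 2, 1],
    [2, 3, 4, 3, 2],
    [1, 2, 2, 6, 3],
    [1, 1, 1, 4, 7],
]
left_at_minus_30 = hemifield_accuracy(toy_azimuths, toy_confusion, -30)
right_at_plus_30 = hemifield_accuracy(toy_azimuths, toy_confusion, 30)
```

      Collapsing the decoder output to a side in this way makes it closer to a two-choice judgment, although it still does not reproduce the behavioral task.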

      The azimuth discrimination error reported by Lauer et al. (2011) comes from engaged and highly trained mice, which is a very different context to our experimental setting with untrained mice passively listening to stimuli from 13 random azimuths. Therefore we did not perform analyses or interpretations of our results based on the behavioral task from Lauer et al. (2011) and only made the qualitative observation that the errors match for discussion.

      We believe it is further important to clarify that Lauer et al. (2011) tested the ability of mice to discriminate between a positively conditioned stimulus (reference speaker at 0º center azimuth associated to a liquid reward) and a negatively conditioned stimulus (coming from one of five comparison speakers positioned at 20º, 30º, 50º, 70º and 90º azimuth, associated to an electrified lickport) in a conditioned avoidance task. In this task, mice are not precisely “discriminating between left or right sources centered around zero degrees”, making further analyses to compare the experimental design of Lauer et al. (2011) and ours even more challenging for valid interpretation.

      (4) DCIC can encode sound source azimuth in a similar format to that in the central nucleus of the inferior colliculus - It is unclear what exactly the authors mean by this statement in the Abstract. There are major differences in the encoding of azimuth between the two neighboring brain areas: a large majority of neurons in the CNIC are sensitive to azimuth (and strongly so), whereas the present study shows a minority of azimuth-sensitive neurons in the DCIC. Furthermore, CNIC neurons fire reliably to sound stimuli (low neural noise), whereas the present study shows that DCIC neurons fire more erratically (high neural noise).

      Since sound source azimuth is reported to be encoded by population activity patterns at CNIC (Day and Delgutte, 2013), we refer to a population activity pattern code as the “similar format” in which this information is encoded at DCIC. Please note that this is a qualitative comparison and we do not claim this is the “same format”, due to the differences the reviewer precisely describes in the encoding of azimuth at CNIC where a much larger majority of neurons show stronger azimuth sensitivity and response reliability with respect to our observations at DCIC. By this qualitative similarity of encoding format we specifically mean the similar occurrence of activity patterns from azimuth sensitive subpopulations of neurons in both CNIC and DCIC, which carry sufficient information about the stimulus azimuth for a sufficiently accurate prediction with regard to the behavioral discrimination ability.

      (5) Evidence of noise correlation between pairs of neurons exists - The authors' data and analyses seem appropriate and sufficient to justify this claim.

      (6) Noise correlations between responses of neurons help reduce population decoding error - The authors show convincing analysis that performance of their decoder was higher when simultaneously measured responses were tested (which include noise correlation) than when scrambled-trial responses were tested (eliminating noise correlation). This makes it seem likely that noise correlation in the responses improved decoder performance. The authors mention that the naïve Bayesian classifier was used as their decoder for computational efficiency, presumably because it assumes no noise correlation and, therefore, assumes responses of individual neurons are independent of each other across trials to the same stimulus. The use of a decoder that assumes independence seems key here in testing the hypothesis that noise correlation contains information about sound source azimuth. The logic of using this decoder could be more clearly spelled out to the reader. For example, if the null hypothesis is that noise correlations do not carry azimuth information, then a decoder that assumes independence should perform the same whether population responses are simultaneous or scrambled. The authors' analysis showing a difference in performance between these two cases provides evidence against this null hypothesis.

      We sincerely thank the reviewer for this careful and detailed consideration of our analysis approach. Following the reviewer’s constructive suggestion, we justified the decoder choice in the results section at the last paragraph of page 18:

      “To characterize how the observed positive noise correlations could affect the representation of stimulus azimuth by DCIC top ranked unit population responses, we compared the decoding performance obtained by classifying the single-trial response patterns from top ranked units in the modeled decorrelated datasets versus the acquired data (with noise correlations). With the intention to characterize this with a conservative approach that would be less likely to find a contribution of noise correlations as it assumes response independence, we relied on the naive Bayes classifier for decoding throughout the study. Using this classifier, we observed that the modeled decorrelated datasets produced stimulus azimuth prediction error distributions that were significantly shifted towards higher decoding errors (Fig. 5B, C) and, in our imaging datasets, were not significantly different from chance level (Fig. 5B). Altogether, these results suggest that the detected noise correlations in our simultaneously acquired datasets can help reduce the error of the IC population code for sound azimuth.”
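      One common way to build a decorrelated comparison dataset is to shuffle trial order independently for each unit within a stimulus condition, in line with the "scrambled-trial" description used by the reviewer. The sketch below assumes that approach; how the manuscript actually models its decorrelated datasets may differ, and all function names and the simulated data are hypothetical.

```python
import random


def decorrelate_trials(trials, rng):
    """Shuffle trials independently per unit for ONE stimulus condition.

    `trials` is a list of trials, each a list of per-unit responses.
    Each unit keeps its own set of responses to this stimulus, but the
    trial-by-trial pairing across units (noise correlation) is broken.
    """
    n_units = len(trials[0])
    columns = []
    for u in range(n_units):
        column = [trial[u] for trial in trials]
        rng.shuffle(column)
        columns.append(column)
    return [list(row) for row in zip(*columns)]


def pearson(x, y):
    """Pearson correlation of two equal-length response vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def simulate_correlated_trials(n_trials, rng):
    """Hypothetical two-unit data sharing a trial-wise fluctuation."""
    trials = []
    for _ in range(n_trials):
        shared = rng.gauss(0.0, 1.0)
        trials.append([shared + rng.gauss(0.0, 0.3),
                       shared + rng.gauss(0.0, 0.3)])
    return trials


simultaneous = simulate_correlated_trials(200, random.Random(0))
decorrelated = decorrelate_trials(simultaneous, random.Random(1))
r_simultaneous = pearson([t[0] for t in simultaneous],
                         [t[1] for t in simultaneous])
r_decorrelated = pearson([t[0] for t in decorrelated],
                         [t[1] for t in decorrelated])
```

      Applying the same decoder to `simultaneous` and `decorrelated` data then isolates the contribution of the trial-by-trial pairing, since each unit's response distribution per stimulus is unchanged.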

      Minor weakness:

      - Most studies of neural encoding of sound source azimuth are done in a noise-free environment, but the experimental setup in the present study had substantial background noise. This complicates comparison of the azimuth tuning results in this study to those of other studies. One is left wondering if azimuth sensitivity would have been greater in the absence of background noise, particularly for the imaging data where the signal was only about 12 dB above the noise. The description of the noise level and signal + noise level in the Methods should be made clearer. Mice hear from about 2.5 - 80 kHz, so it is important to know the noise level within this band as well as specifically within the band overlapping with the signal.

      We agree with the reviewer that this information is useful. In our study, the background R.M.S. SPL during imaging across the mouse hearing range (2.5-80kHz) was 44.53 dB and for neuropixels recordings 34.68 dB. We have added this information to the methods section of the revised manuscript.

      Reviewer #2 (Public Review):

      In the present study, Boffi et al. investigate the manner in which the dorsal cortex of the inferior colliculus (DCIC), an auditory midbrain area, encodes sound location azimuth in awake, passively listening mice. By employing volumetric calcium imaging (scanned temporal focusing or s-TeFo), complemented with high-density electrode electrophysiological recordings (neuropixels probes), they show that sound-evoked responses are exquisitely noisy, with only a small portion of neurons (units) exhibiting spatial sensitivity. Nevertheless, a naïve Bayesian classifier was able to predict the presented azimuth based on the responses from small populations of these spatially sensitive units. A portion of the spatial information was provided by correlated trial-to-trial response variability between individual units (noise correlations). The study presents a novel characterization of spatial auditory coding in a non-canonical structure, representing a noteworthy contribution specifically to the auditory field and generally to systems neuroscience, due to its implementation of state-of-the-art techniques in an experimentally challenging brain region. However, nuances in the calcium imaging dataset and the naïve Bayesian classifier warrant caution when interpreting some of the results.

      Strengths:

      The primary strength of the study lies in its methodological achievements, which allowed the authors to collect a comprehensive and novel dataset. While the DCIC is a dorsal structure, it extends up to a millimetre in depth, making it optically challenging to access in its entirety. It is also more highly myelinated and vascularised compared to e.g., the cerebral cortex, compounding the problem. The authors successfully overcame these challenges and present an impressive volumetric calcium imaging dataset. Furthermore, they corroborated this dataset with electrophysiological recordings, which produced overlapping results. This methodological combination ameliorates the natural concerns that arise from inferring neuronal activity from calcium signals alone, which are in essence an indirect measurement thereof.

      Another strength of the study is its interdisciplinary relevance. For the auditory field, it represents a significant contribution to the question of how auditory space is represented in the mammalian brain. "Space" per se is not mapped onto the basilar membrane of the cochlea and must be computed entirely within the brain. For azimuth, this requires the comparison of minuscule differences in the timing and intensity of sounds arriving at each ear. It is now generally thought that azimuth is initially encoded in two, opposing hemispheric channels, but the extent to which this initial arrangement is maintained throughout the auditory system remains an open question. The authors observe only a slight contralateral bias in their data, suggesting that sound source azimuth in the DCIC is encoded in a more nuanced manner compared to earlier processing stages of the auditory hindbrain. This is interesting, because the DCIC is also known to be an auditory structure that receives more descending inputs from the cortex.

      Systems neuroscience continues to strive for the perfection of imaging novel, less accessible brain regions. Volumetric calcium imaging is a promising emerging technique, allowing the simultaneous measurement of large populations of neurons in three dimensions. But this necessitates corroboration with other methods, such as electrophysiological recordings, which the authors achieve. The dataset moreover highlights the distinctive characteristics of neuronal auditory representations in the brain. Its signals can be exceptionally sparse and noisy, which provide an additional layer of complexity in the processing and analysis of such datasets. This will be undoubtedly useful for future studies of other less accessible structures with sparse responsiveness.

      Weaknesses:

      Although the primary finding that small populations of neurons carry enough spatial information for a naïve Bayesian classifier to reasonably decode the presented stimulus is not called into question, certain idiosyncrasies, in particular the calcium imaging dataset and model, complicate specific interpretations of the model output, and the readership is urged to interpret these aspects of the study's conclusions with caution.

      I remain in favour of volumetric calcium imaging as a suitable technique for the study, but the presently constrained spatial resolution is insufficient to unequivocally identify regions of interest as cell bodies (and are instead referred to as "units" akin to those of electrophysiological recordings). It remains possible that the imaging set is inadvertently influenced by non-somatic structures (including neuropil), which could report neuronal activity differently than cell bodies. Due to the lack of a comprehensive ground-truth comparison in this regard (which to my knowledge is impossible to achieve with current technology), it is difficult to imagine how many informative such units might have been missed because their signals were influenced by spurious, non-somatic signals, which could have subsequently misled the models. The authors reference the original Nature Methods article (Prevedel et al., 2016) throughout the manuscript, presumably in order to avoid having to repeat previously published experimental metrics. But the DCIC is neither the cortex nor hippocampus (for which the method was originally developed) and may not have the same light scattering properties (not to mention neuronal noise levels). Although the corroborative electrophysiology data largely alleviates these concerns for this particular study, the readership should be cognisant of such caveats, in particular those who are interested in implementing the technique for their own research.

      A related technical limitation of the calcium imaging dataset is the relatively low number of trials (14) given the inherently high level of noise (both neuronal and imaging). Volumetric calcium imaging, while offering a uniquely expansive field of view, requires relatively high average excitation laser power (in this case nearly 200 mW), a level of exposure the authors may have wanted to minimise by maintaining a low number of repetitions, but I yield to them to explain.

      We assumed that the levels of heating by excitation light measured at the neocortex in Prevedel et al. (2016) were also representative for DCIC. Nevertheless, we recognize this approximation might not be very accurate, due to differences in tissue architecture and vascularization between these two brain areas, to name just a few factors. The limiting factor preventing us from collecting more trials in our imaging sessions was that we observed signs of discomfort or slight distress in some mice after ~30 min of imaging in our custom setup, which we established as a humane end point to prevent distress. In consequence, imaging sessions were kept to 25 min in duration, limiting the number of trials collected. However, we cannot rule out that, with more extensive habituation prior to experiments, the imaging sessions could be prolonged without these signs of discomfort, nor can we rule out that some influence of our custom setup, such as potential heating of the brain by illumination light, caused the observed distress. Nevertheless, we note that previous work has shown that ~200 mW average power is a safe regime for imaging in the cortex, keeping brain heating minimal (Prevedel et al., 2016) without producing the lasting damage observed by immunohistochemistry against apoptosis markers above 250 mW (Podgorski and Ranganathan 2016, https://doi.org/10.1152/jn.00275.2016).

      Calcium imaging is also inherently slow, requiring relatively long inter-stimulus intervals (in this case 5 s). This unfortunately renders any model designed to predict a stimulus (in this case sound azimuth) from particularly noisy population neuronal data like these as highly prone to overfitting, to which the authors correctly admit after a model trained on the entire raw dataset failed to perform significantly above chance level. This prompted them to feed the model only with data from neurons with the highest spatial sensitivity. This ultimately produced reasonable performance (and was implemented throughout the rest of the study), but it remains possible that if the model was fed with more repetitions of imaging data, its performance would have been more stable across the number of units used to train it. (All models trained with imaging data eventually failed to converge.) However, I also see these limitations as an opportunity to improve the technology further, which I reiterate will be generally important for volume imaging of other sparse or noisy calcium signals in the brain.

      Transitioning to the naïve Bayesian classifier itself, I first openly ask the authors to justify their choice of this specific model. There are countless types of classifiers for these data, each with their own pros and cons. Did they actually try other models (such as support vector machines), which ultimately failed? If so, these negative results (even if mentioned en passant) would be extremely valuable to the community, in my view. I ask this specifically because different methods assume correspondingly different statistical properties of the input data, and to my knowledge naïve Bayesian classifiers assume that predictors (neuronal responses) are independent within a class (azimuth). As the authors show that noise correlations are informative in predicting azimuth, I wonder why they chose a model that doesn't take advantage of these statistical regularities. It could be because of technical considerations (they mention computing efficiency), but I am left generally uncertain about the specific logic that was used to guide the authors through their analytical journey.

      One of the main reasons we chose the naïve Bayesian classifier is indeed because it assumes that the responses of the simultaneously recorded neurons are independent and therefore it does not assume a contribution of noise correlations to the estimation of the posterior probability of each azimuth. This model would represent the null hypothesis that noise correlations do not contribute to the encoding of stimulus azimuth, which would be verified by an equal decoding outcome from correlated or decorrelated datasets. Since we observed that this is not the case, the model supports the alternative hypothesis that noise correlations do indeed influence stimulus azimuth encoding. We wanted to test these hypotheses with the most conservative approach possible that would be least likely to find a contribution of noise correlations. Other relevant reasons that justify our choice of the naive Bayesian classifier are its robustness against the limited numbers of trials we could collect in comparison to other more “data hungry” classifiers like SVM, KNN, or artificial neuronal nets. We did perform preliminary tests with alternative classifiers but the obtained decoding errors were similar when decoding the whole population activity (Author response image 2A). Dimensionality reduction following the approach described in the manuscript showed a tendency towards smaller decoding errors observed with an alternative classifier like KNN, but these errors were still larger than the ones observed with the naive Bayesian classifier (median error 45º). Nevertheless, we also observe a similar tendency for slightly larger decoding errors in the absence of noise correlations (decorrelated, Author response image 2B). Sentences detailing the logic of classifier choice are now included in the results section at page 10 and at the last paragraph of page 18 (see responses to Reviewer 1).
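      The independence assumption discussed above can be made concrete with a small naive Bayes decoder, in which the log-likelihood of a population response is the sum of per-unit log-likelihoods and no covariance term between units appears. The Gaussian likelihood, the variance floor, the leave-one-out scheme, and the toy data are assumptions for this sketch; the likelihood model and cross-validation used in the manuscript are not specified here.

```python
import math
import statistics


def fit_naive_bayes(responses, azimuths, var_floor=1.0):
    """Estimate per-azimuth, per-unit mean and variance plus class priors.

    `var_floor` (a hypothetical choice) prevents near-zero variances when
    only a few trials are available per azimuth.
    """
    by_class = {}
    for x, y in zip(responses, azimuths):
        by_class.setdefault(y, []).append(x)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [max(sum((v - m) ** 2 for v in col) / n, var_floor)
                     for col, m in zip(zip(*rows), means)]
        model[y] = (means, variances, n / len(azimuths))
    return model


def predict(model, x):
    """Return the azimuth with the highest posterior, treating units as
    independent given the azimuth (the 'naive' assumption)."""
    best, best_score = None, -math.inf
    for y, (means, variances, prior) in model.items():
        score = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        if score > best_score:
            best, best_score = y, score
    return best


def leave_one_out_errors(responses, azimuths):
    """Absolute single-trial prediction errors (degrees), leave-one-out."""
    errors = []
    for i in range(len(responses)):
        model = fit_naive_bayes(responses[:i] + responses[i + 1:],
                                azimuths[:i] + azimuths[i + 1:])
        errors.append(abs(predict(model, responses[i]) - azimuths[i]))
    return errors


# Hypothetical toy data: 2 units, 3 azimuths, 3 trials each.
toy_responses = [[5, 0], [6, 1], [5, 1],
                 [0, 0], [1, 0], [0, 1],
                 [0, 5], [1, 6], [1, 5]]
toy_azimuths = [-60, -60, -60, 0, 0, 0, 60, 60, 60]
toy_errors = leave_one_out_errors(toy_responses, toy_azimuths)
toy_median_error = statistics.median(toy_errors)
```

      Running such a decoder on both the simultaneously acquired and the decorrelated responses, and comparing the medians of the two error distributions, mirrors the logic of the comparison described in the response.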

      Author response image 2.

      A) Cumulative distribution plots of the absolute cross-validated single-trial prediction errors obtained using different classifiers (blue; KNN: K-nearest neighbors; SVM: support vector machine ensemble) and chance level distribution (gray) on the complete populations of imaged units. B) Cumulative distribution plots of the absolute cross-validated single-trial prediction errors obtained using a Bayes classifier (naive approximation for computation efficiency) to decode the single-trial response patterns from the 31 top ranked units in the simultaneously imaged datasets across mice (cyan), modeled decorrelated datasets (orange) and the chance level distribution associated with our stimulation paradigm (gray). Vertical dashed lines show the medians of cumulative distributions. K.S. w/Sidak: Kolmogorov-Smirnov with Sidak.

      That aside, there remain other peculiarities in model performance that warrant further investigation. For example, what spurious features (or lack of informative features) in these additional units prevented the models of imaging data from converging?

      Considering the amount of variability observed throughout the neuronal responses both in imaging and neuropixels datasets, it is easy to suspect that the information about stimulus azimuth carried in different amounts by individual DCIC neurons can be mixed up with information about other factors (Stringer et al., 2019). In an attempt to study the origin of these features that could confound stimulus azimuth decoding we explored their relation to face movement (Supplemental Figure 2), finding a correlation to snout movements, in line with previous work by Stringer et al. (2019).

      In an orthogonal question, did the most spatially sensitive units share any detectable tuning features? A different model trained with electrophysiology data in contrast did not collapse in the range of top-ranked units plotted. Did this model collapse at some point after adding enough units, and how well did that correlate with the model for the imaging data?

      Our electrophysiology datasets were much smaller in size (number of simultaneously recorded neurons) compared to our volumetric calcium imaging datasets, resulting in a much smaller total number of top ranked units detected per dataset. This precluded the determination of a collapse of decoder performance due to overfitting beyond the range plotted in Fig 4G.

      How well did the form (and diversity) of the spatial tuning functions as recorded with electrophysiology resemble their calcium imaging counterparts? These fundamental questions could be addressed with more basic, but transparent analyses of the data (e.g., the diversity of spatial tuning functions of their recorded units across the population). Even if the model extracts features that are not obvious to the human eye in traditional visualisations, I would still find this interesting.

      The diversity of the azimuth tuning curves recorded with calcium imaging (Fig. 3B) was qualitatively larger than the ones recorded with electrophysiology (Fig. 4B), potentially due to the larger sampling obtained with volumetric imaging. We did not perform a detailed comparison of the form and a more quantitative comparison of the diversity of these functions because the signals compared are quite different, as calcium indicator signal is subject to non linearities due to Ca2+ binding cooperativity and low pass filtering due to binding kinetics. We feared this could lead to misleading interpretations about the similarities or differences between the azimuth tuning functions in imaged and electrophysiology datasets. Our model uses statistical response dependency to stimulus azimuth, which does not rely on features from a descriptive statistic like mean response tuning. In this context, visualizing the trial-to-trial responses as a function of azimuth shows “features that are not obvious to the human eye in traditional visualizations” (Fig. 3D, left inset).

      Finally, the readership is encouraged to interpret certain statements by the authors in the current version conservatively. How the brain ultimately extracts spatial neuronal data for perception is anyone's guess, but it is important to remember that this study only shows that a naïve Bayesian classifier could decode this information, and it remains entirely unclear whether the brain does this as well. For example, the model is able to achieve a prediction error that corresponds to the psychophysical threshold in mice performing a discrimination task (~30°). Although this is an interesting coincidental observation, it does not mean that the two metrics are necessarily related. The authors correctly do not explicitly claim this, but the manner in which the prose flows may lead a non-expert into drawing that conclusion.

      To avoid misleading the non-expert readers, we have clarified in the manuscript that the observed correspondence between decoding error and psychophysical threshold is explicitly coincidental.

      Page 13, end of middle paragraph:

      “If we consider the median of the prediction error distribution as an overall measure of decoding performance, the single-trial response patterns from subsamples of at least the 7 top ranked units produced median decoding errors that coincidentally matched the reported azimuth discrimination ability of mice (Fig 4G, minimum audible angle = 31º) (Lauer et al., 2011).”

      Page 14, bottom paragraph:

      “Decoding analysis (Fig. 4F) of the population response patterns from azimuth dependent top ranked units simultaneously recorded with neuropixels probes showed that the 4 top ranked units are the smallest subsample necessary to produce a significant decoding performance that coincidentally matches the discrimination ability of mice (31° (Lauer et al., 2011)) (Fig. 5F, G).”

      We also added to the Discussion sentences clarifying that a relationship between these two variables remains to be determined and it also remains to be determined if the DCIC indeed performs a bayesian decoding computation for sound localization.

      Page 20, bottom:

      “… Concretely, we show that sound location coding does indeed occur at DCIC on the single trial basis, and that this follows a comparable mechanism to the characterized population code at CNIC (Day and Delgutte, 2013). However, it remains to be determined if indeed the DCIC network is physiologically capable of Bayesian decoding computations. Interestingly, the small number of DCIC top ranked units necessary to effectively decode stimulus azimuth suggests that sound azimuth information is redundantly distributed across DCIC top ranked units, which points out that mechanisms beyond coding efficiency could be relevant for this population code.

      While the decoding error observed from our DCIC datasets obtained in passively listening, untrained mice coincidentally matches the discrimination ability of highly trained, motivated mice (Lauer et al., 2011), a relationship between decoding error and psychophysical performance remains to be determined. Interestingly, a primary sensory representations should theoretically be even more precise than the behavioral performance as reported in the visual system (Stringer et al., 2021).”

      Moreover, the concept of redundancy (of spatial information carried by units throughout the DCIC) is difficult for me to disentangle. One interpretation of this formulation could be that there are non-overlapping populations of neurons distributed across the DCIC that each could predict azimuth independently of each other, which is unlikely what the authors meant. If the authors meant generally that multiple neurons in the DCIC carry sufficient spatial information, then a single neuron would have been able to predict sound source azimuth, which was not the case. I have the feeling that they actually mean "complementary", but I leave it to the authors to clarify my confusion, should they wish.

      We observed that the response patterns from relatively small fractions of the azimuth sensitive DCIC units (4-7 top ranked units) are sufficient to generate an effective code for sound azimuth, while 32-40% of all simultaneously recorded DCIC units are azimuth sensitive. In light of this observation, we interpreted that the azimuth information carried by the population should be redundantly distributed across the complete subpopulation of azimuth sensitive DCIC units.

      In summary, the present study represents a significant body of work that contributes substantially to the field of spatial auditory coding and systems neuroscience. However, limitations of the imaging dataset and model as applied in the study muddle concrete conclusions about how the DCIC precisely encodes sound source azimuth and, even more so, about sound localisation in a behaving animal. Nevertheless, it presents a novel and unique dataset, which, regardless of secondary interpretation, corroborates the general notion that auditory space is encoded in an extraordinarily complex manner in the mammalian brain.

      Reviewer #3 (Public Review):

      Summary:

      Boffi and colleagues sought to quantify the single-trial, azimuthal information in the dorsal cortex of the inferior colliculus (DCIC), a relatively understudied subnucleus of the auditory midbrain. They used two complementary recording methods while mice passively listened to sounds at different locations: a large volume but slow sampling calcium-imaging method, and a smaller volume but temporally precise electrophysiology method. They found that neurons in the DCIC were variable in their activity, unreliably responding to sound presentation and responding during inter-sound intervals. Boffi and colleagues used a naïve Bayesian decoder to determine if the DCIC population encoded sound location on a single trial. The decoder failed to classify sound location better than chance when using the raw single-trial population response but performed significantly better than chance when using intermediate principal components of the population response. In line with this, when the most azimuth dependent neurons were used to decode azimuthal position, the decoder performed equivalently to the azimuthal localization abilities of mice. The top azimuthal units were not clustered in the DCIC, possessed a contralateral bias in response, and were correlated in their variability (e.g., positive noise correlations). Interestingly, when these noise correlations were perturbed by inter-trial shuffling decoding performance decreased. Although Boffi and colleagues display that azimuthal information can be extracted from DCIC responses, it remains unclear to what degree this information is used and what role noise correlations play in azimuthal encoding.

      Strengths:

      The authors should be commended for collection of this dataset. When done in isolation (which is typical), calcium imaging and linear array recordings have intrinsic weaknesses. However, those weaknesses are alleviated when done in conjunction with one another - especially when the data largely recapitulates the findings of the other recording methodology. In addition to the video of the head during the calcium imaging, this data set is extremely rich and will be of use to those interested in the information available in the DCIC, an understudied but likely important subnucleus in the auditory midbrain.

      The DCIC neural responses are complex; the units unreliably respond to sound onset, and at the very least respond to some unknown input or internal state (e.g., large inter-sound interval responses). The authors do a decent job in wrangling these complex responses: using interpretable decoders to extract information available from population responses.

      Weaknesses:

      The authors observe that neurons with the most azimuthal sensitivity within the DCIC are positively correlated, but they use a naïve Bayesian decoder, which assumes independence between units. Although this is a bit strange given their observation that some of the recorded units are correlated, it is unlikely to be a critical flaw. At one point the authors reduce the dimensionality of their data through PCA and use the loadings onto these components in their decoder. PCA incorporates the correlational structure when finding the principal components and constrains these components to be orthogonal and uncorrelated. This should alleviate some of the concern regarding the use of the naïve Bayesian decoder because the projections onto the different components are independent. Nevertheless, the decoding results are a bit strange, likely because there is not much linearly decodable azimuth information in the DCIC responses. Raw population responses failed to provide sufficient information concerning azimuth for the decoder to perform better than chance. Additionally, it only performed better than chance when certain principal components or top ranked units contributed to the decoder but not as more components or units were added. So, although there does appear to be some azimuthal information in the recorded DCIC populations, it is somewhat difficult to extract and likely not an 'effective' encoding of sound localization as their title suggests.

      As described in the responses to reviewers 1 and 2, we chose the naïve Bayes classifier as a decoder to determine the influence of noise correlations through the most conservative approach possible, as this classifier would be least likely to find a contribution of correlated noise. Also, we chose this decoder due to its robustness against limited numbers of trials collected, in comparison to “data hungry” non linear classifiers like KNN or artificial neuronal nets. Lastly, we observed that small populations of noisy, unreliable (do not respond in every trial) DCIC neurons can encode stimulus azimuth in passively listening mice matching the discrimination error of trained mice. Therefore, while this encoding is definitely not efficient, it can still be considered effective.
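
      To make the independence assumption discussed above concrete, the following is a minimal illustrative sketch of a Gaussian naïve Bayes decoder evaluated with leave-one-out cross-validation. It is not the authors' implementation: the Gaussian likelihood, the variance floor, the function names, the simulated tuning gains and the azimuth grid (-90° to 90° in 15° steps) are all assumptions chosen for illustration. Only the trial structure (13 azimuths, 14 repetitions) follows the numbers quoted in the reviews.

```python
import math
import random
from statistics import median


def fit_gaussian_nb(X, y, var_floor=1e-3):
    """Fit per-class, per-unit means and variances.

    Each unit is modelled independently of the others, which is the
    'naive' assumption. var_floor (an assumed value) avoids zero variance.
    """
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, var_floor)
            for col, m in zip(columns, means)
        ]
        model[label] = (means, variances, n / len(y))
    return model


def predict(model, x):
    """Return the class with the highest log posterior for one trial."""
    best_label, best_score = None, -math.inf
    for label, (means, variances, prior) in model.items():
        score = math.log(prior)
        # Independence across units: log likelihoods simply add up.
        for v, m, s2 in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def leave_one_out_errors(X, y):
    """Absolute decoding error for each held-out trial."""
    errors = []
    for i in range(len(X)):
        model = fit_gaussian_nb(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        errors.append(abs(predict(model, X[i]) - y[i]))
    return errors


# Simulated data: 4 hypothetical units with linear azimuth tuning plus noise.
random.seed(0)
azimuths = list(range(-90, 91, 15))   # 13 classes (assumed grid)
gains = [0.02, -0.02, 0.01, 0.03]     # hypothetical tuning slopes
X, y = [], []
for az in azimuths:
    for _ in range(14):               # 14 repetitions per azimuth
        X.append([g * az + random.gauss(0, 1) for g in gains])
        y.append(az)

errors = leave_one_out_errors(X, y)
print("median absolute decoding error (deg):", median(errors))
```

      The median of the per-trial absolute errors plays the role of the summary statistic that the authors compare with the behavioral threshold; the simulated value itself carries no biological meaning.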

      Although this is quite a worthwhile dataset, the authors present relatively little about the characteristics of the units they've recorded. This may be due to the high variance in responses seen in their population. Nevertheless, the authors note that units do not respond on every trial but do not report what percent of trials fail to evoke a response. Is it that neurons are noisy because they do not respond on every trial or is it also that when they do respond they have variable response distributions? It would be nice to gain some insight into the heterogeneity of the responses.

      The limited number of azimuth trial repetitions that we could collect precluded us from making any quantification of the unreliability (failures to respond) and variability in the response distributions from the units we recorded, as we feared they could be misleading. In qualitative terms, “due to the high variance in responses seen” in the recordings and the limited trial sampling, it is hard to make any generalization. In consequence we referred to the observed response variance altogether as neuronal noise. Considering these points, our datasets are publicly available for exploration of the response characteristics.

      Additionally, is there any clustering at all in response profiles or is each neuron they recorded in the DCIC unique?

      We attempted to qualitatively visualize response clustering using dimensionality reduction, observing different degrees of clustering or lack thereof across the azimuth classes in the datasets collected from different mice. It is likely that the limited number of azimuth trials we could collect and the high response variance contribute to an inconsistent response clustering across datasets.

      They also only report the noise correlations for their top ranked units, but it is possible that the noise correlations in the rest of the population are different.

      For this study, since our aim was to interrogate the influence of noise correlations on stimulus azimuth encoding by DCIC populations, we focused on the noise correlations from the top ranked unit subpopulation, which likely carries the bulk of the sound location information. Noise correlations can be defined as correlations in the trial-to-trial response variation of neurons. In this respect, it is hard to ascertain whether the units in the rest of the population, outside the top ranked unit percentage, are really responding and showing response variation that would allow this correlation to be evaluated, or are simply not responding at all and show unrelated activity altogether. This makes observations about noise correlations from “the rest of the population” potentially hard to interpret.

      It would also be worth digging into the noise correlations more - are units positively correlated because they respond together (e.g., if unit x responds on trial 1 so does unit y) or are they also modulated around their mean rates on similar trials (e.g., unit x and y respond and both are responding more than their mean response rate). A large portion of trials with no response can occlude noise correlations. More transparency around the response properties of these populations would be welcome.

      Due to the limited number of azimuth trial repetitions collected, to evaluate noise correlations we used the non parametric Kendall tau correlation coefficient which is a measure of pairwise rank correlation or ordinal association in the responses to each azimuth. Positive rank correlation would represent neurons more likely responding together. Evaluating response modulation “around their mean rates on similar trials” would require assumptions about the response distributions, which we avoided due to the potential biases associated with limited sample sizes.
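
      As an illustration of the rank-based measure described above, the sketch below computes the Kendall tau-b coefficient between the single-trial responses of two units at each azimuth and averages it across azimuths. The function names, the averaging step and the toy responses are assumptions made for illustration; the response does not specify how the authors pooled coefficients across azimuths.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall tau-b rank correlation between two equal-length sequences."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                  # tied in both variables
        elif dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_in_x = concordant + discordant + ties_y_only
    untied_in_y = concordant + discordant + ties_x_only
    denom = math.sqrt(untied_in_x * untied_in_y)
    return (concordant - discordant) / denom if denom else float("nan")


def noise_correlation(resp_a, resp_b):
    """Average per-azimuth tau between two units.

    resp_a, resp_b: dict mapping azimuth -> list of single-trial responses,
    with trials in the same order for both units. Azimuths where tau is
    undefined (for example, a unit that never varies) are skipped.
    """
    taus = [kendall_tau_b(resp_a[az], resp_b[az]) for az in resp_a]
    taus = [t for t in taus if not math.isnan(t)]
    return sum(taus) / len(taus) if taus else float("nan")


# Toy responses for two hypothetical units at three azimuths.
unit_a = {-30: [0, 2, 1, 3], 0: [1, 4, 2, 5], 30: [2, 6, 3, 7]}
unit_b = {-30: [1, 3, 0, 4], 0: [0, 5, 1, 6], 30: [1, 7, 2, 9]}
print("noise correlation (mean tau-b):", noise_correlation(unit_a, unit_b))
```

      Because the coefficient depends only on the ordering of responses across trials, it requires no assumption about the shape of the response distributions, which is the property the authors invoke given their limited number of repetitions.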

      It is largely unclear what the DCIC is encoding. Although the authors are interested in azimuth, sound location seems to be only a small part of DCIC responses. The authors report responses during inter-sound interval and unreliable sound-evoked responses. Although they have video of the head during recording, we only see a correlation to snout and ear movements (which are peculiar since in the example shown it seems the head movements predict the sound presentation). Additional correlates could be eye movements or pupil size. Eye movements are of particular interest due to their known interaction with IC responses - especially if the DCIC encodes sound location in relation to eye position instead of head position (though much of the eye-position-IC work was done in primates and not rodents). Alternatively, much of the population may only encode sound location if an animal is engaged in a localization task. Ideally, the authors could perform more substantive analyses to determine if this population is truly noisy or if the DCIC is integrating un-analyzed signals.

      We unsuccessfully attempted eye tracking and pupillometry in our videos. We suspect that the reason behind this is a generally overly dilated pupil due to the low visible light illumination conditions we used which were necessary to protect the PMT of our custom scope.

      It is likely that DCIC population activity is integrating un-analyzed signals, like the signal associated with spontaneous behaviors including face movements (Stringer et al., 2019), which we observed at the level of spontaneous snout movements. However, investigating if and how these signals are integrated into stimulus azimuth coding requires extensive behavioral testing and experimentation, which is beyond the scope of this study. For the purpose of our study, we referred to trial-to-trial response variation as neuronal noise. We note that this definition of neuronal noise can, and likely does, include an influence from un-analyzed signals like the ones from spontaneous behaviors.

      Although this critique is ubiquitous among decoding papers in the absence of behavioral or causal perturbations, it is unclear what - if any - role the decoded information may play in neuronal computations. The interpretation of the decoder means that there is some extractable information concerning sound azimuth - but not if it is functional. This information may just be epiphenomenal, leaking in from inputs, and not used in computation or relayed to downstream structures. This should be kept in mind when the authors suggest their findings implicate the DCIC functionally in sound localization.

      Our study builds upon previous reports by other independent groups relying on “causal and behavioral perturbations” and implicating DCIC in sound location learning induced experience dependent plasticity (Bajo et al., 2019, 2010; Bajo and King, 2012), which altogether argues in favor of DCIC functionality in sound localization.

      Nevertheless, we clarified in the discussion of the revised manuscript that a relationship between the observed decoding error and the psychophysical performance, or the ability of the DCIC network to perform Bayesian decoding computations, both remain to be determined (please see responses to Reviewer #2).

      It is unclear why positive noise correlations amongst similarly tuned neurons would improve decoding. A toy model exploring how positive noise correlations in conjunction with unreliable units that inconsistently respond may anchor these findings in an interpretable way. It seems plausible that inconsistent responses would benefit from strong noise correlations, simply by units responding together. This would predict that shuffling would impair performance because you would then be sampling from trials in which some units respond, and trials in which some units do not respond - and may predict a bimodal performance distribution in which some trials decode well (when the units respond) and poor performance (when the units do not respond).

      In samples with more than 2 dimensions, the relationship between signal and noise correlations is more complex than in two-dimensional samples (Montijn et al., 2016), which makes constructing simple, interpretable toy models challenging. Montijn et al. (2016) provide a detailed characterization and model describing how the accuracy of a multidimensional population code can improve when including “positive noise correlations amongst similarly tuned neurons”. Unfortunately, we could not successfully test their model based on Mahalanobis distances, as we could not verify that the recorded DCIC population responses followed a multivariate Gaussian distribution, due to the limited azimuth trial repetitions we could sample.

      Significance:

      Boffi and colleagues set out to parse the azimuthal information available in the DCIC on a single trial. They largely accomplish this goal and are able to extract this information when allowing the units that contain more information about sound location to contribute to their decoding (e.g., through PCA or decoding on top unit activity specifically). The dataset will be of value to those interested in the DCIC and also to anyone interested in the role of noise correlations in population coding. Although this work is a first step into parsing the information available in the DCIC, it remains difficult to interpret if/how this azimuthal information is used in localization behaviors of engaged mice.

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      General:

      The manuscript is generally well written, but could benefit from a quick proof by a native English speaker (e.g., "the" inferior colliculus is conventionally used with its article). The flow of arguments is also generally easy to follow, but I would kindly ask the authors to consider elaborating or clarifying the following points (including those already mentioned in my public review).

      (1) Choice of model:

      There are countless ways one can construct a decoder or classifier that can predict a presented sensory stimulus based on a population neuronal response. Given the assumptions of independence as mentioned in my public review, I would ask the authors to explicitly justify their choice of a naïve Bayesian classifier.

      A section detailing the logic of classifier choice is now included in the results section at page 10 and the last paragraph of page 18 from the revised version of the manuscript.

      (2) Number of imaging repetitions:

      For particularly noisy datasets, 14 repetitions is indeed quite few. I reckon this was not the choice of the authors, but rather limited by the inherent experimental conditions. Despite minimisation of required average laser power during the development of s-TeFo imaging, the authors still required almost 200 mW (which is still quite a lot of exposure). Although 14 repetitions for 13 azimuthal locations every 5 s is at face value a relatively short imaging session (~15 min.), at 191 mW, with the desire to image mice multiple times, I could imagine that this is a practical limitation the authors faced (to avoid excessive tissue heating or photodamage, which was assessed in the original Nature Methods article, but not here). Nevertheless, this logic (or whatever logic they had) should be explained for non-imaging experts in the readership.

      This is now addressed in the answers to the public reviews.

      (3) Redundancy:

      It is honestly unclear to me what the authors mean by this. I don't speculate that they mean there are "redundant" (small) populations of neurons that sufficiently encode azimuth, but I'm actually not certain. If that were the case, I believe this would need further clarification, since redundant representations would be both inconsistent with the general (perhaps surprising) finding that large populations are not required in the DCIC, which is thought to be the case at earlier processing stages.

      In the text we are referring to the azimuth information being redundantly distributed across DCIC top ranked units. We do not mention redundant “populations of neurons”.

      (4) Correspondence of decoding accuracy with psychometric functions in mice: While this is an interesting coincidental observation, it should not be interpreted that the neuronal detection threshold in the DCIC is somehow responsible for its psychometric counterpart (which is an interesting yet exceedingly complex question). Although I do not believe the authors intended to suggest this, I would personally be cautious in the way I describe this correspondence. I mention this because the authors point it out multiple times in the manuscript (whereas I would have just mentioned it once in passing).

      This is now clarified in the revised manuscript.

      (5) Noisy vs. sparse:

      I'm confident that the authors understand the differences between these terms, both in concept (stochastic vs. scattered) and in context (neuronal vs. experimental), but I personally would be cautious in the way I use them in the description of the study. Indeed, auditory neuronal signals are to my knowledge generally thought to be both sparse and noisy, which is in itself interesting, but the study also deals with substantial experimental (recording) noise, and I think it's important for the readership to understand when "noise" refers to the recordings (in particular the imaging data) and to neuronal activity. I mention this specifically because "noisy" appears in the title.

      We have clarified this issue at the bottom of page 5 by adding the following sentences to the revised manuscript:

      “In this section we used the word “noise” to refer to the sound stimuli used and recording setup background sound levels or recording noise in the acquired signals. To avoid confusion, from now on in the manuscript the word “noise” will be used in the context of neuronal noise, which is the trial-to-trial variation in neuronal responses unrelated to stimuli, unless otherwise noted.”

      (6) More details in the Methods:

      The Methods section is perhaps the least-well structured part of the present manuscript in my view, and I encourage the authors to carefully go through it and add the following information (in case I somehow missed it).

      a. Please also indicate the number of animals used here.

      Added.

      b. How many sessions were performed on each mouse?

      This is already specified in the methods section in page 25:

      “mice were imaged a total of 2-11 times (sessions), one to three times a week.”

      We added for clarification:

      “Datasets here analyzed and reported come from the imaging session in which we observed maximal calcium sensor signal (peak AAV expression) and maximum number of detected units.”

      c. For the imaging experiments, was it possible to image the same units from session to session?

      This is not possible for sTeFo 2P data due to low spatial resolution which makes precisely matching neuron ROIs across sessions challenging.

      d. Could the authors please add more detail to the analyses of the videos (to track facial movements) or provide a reference?

      Added citation.

      e. The same goes for the selection of subcellular regions of interest that were used as "units."

      Added to page 25:

      “We used the CaImAn package (Giovannucci et al., 2019) for automatic ROI segmentation through constrained non negative matrix factorization and selected ROIs (Units) showing clear Ca transients consistent with neuronal activity, and IC neuron somatic shape and size (Schofield and Beebe, 2019).”

      Specific: In order to maximise the efficiency of my comments and suggestions (as there are no line numbers), my numerated points are organised in sequential order.

      (1) Abstract: I wouldn't personally motivate the study with the central nucleus of the IC (i.e. I don't think this is necessary). I think the authors can motivate it simply with the knowledge gaps in spatial coding throughout the auditory system, in which such large data sets such as the ones presented here are of general value.

      (2) Page 4: 15-50 kHz "white" noise is incorrect. It should be "band-passed" noise.

      Changed.

      (3) Supplemental figure 1, panel A: Since the authors could not identify cell bodies unequivocally from their averaged volume timeseries data, it would be clearer to the readership if larger images are shown, so that they can evaluate (speculate) for themselves what subcellular structures were identified as units. Even better would be to include a planar image through a cross-section. As mentioned above, not everything determined for the cortex or hippocampus can be assumed to be true for the DCIC.

      The raw images and segmentations are publicly available for detailed inspections.

      (4) Supplemental figure 2, panel A: This panel requires further explanation, in particular the panel on the right. I assume that to be a simple subtraction of sequential frames, but I'm thrown off by the "d(Grey)" colour bar. Also, if "grey" refers to the neutral colour, it is conventionally spelled "gray" in US-American English.

      Changed.

      (5) Supplemental figure 2, panel B: I'm personally curious why the animals exhibited movement just prior to a stimulus. Did they learn to anticipate the presentation of a sound after some habituation? Is that somehow a pre-emptive startle response? We observe that in our own experiments (but as we stochastically vary the inter-trial-intervals, the movement typically occurs directly after the stimulus). I don't suggest the authors dwell on this, but I find it an interesting observation.

      It is indeed interesting, but we can’t conclude much about it without comparing it to random inter-trial-intervals.

      (6) Supplemental figure 3: I personally find these data (decoding of all electrophysiological data) of central relevance to the study, since they mirror the analyses presented for the imaging data counterpart, and I encourage the authors to move them to the main text.

      Changed.

      (7) Page 12: Do the authors have any further analyses of spatial tuning functions? We all know they can be parametrically obscure (i.e., bi-lobed, non-monotonic, etc.), but having these parameters (even if just in a supplemental figure) would be informative for the spatial auditory community.

      We dedicated significant effort to attempt to parametrize and classify the azimuth response dependency functions from the recorded DCIC cells in an unbiased way. Nevertheless, given the observed response noise and the “obscure” properties of spatial tuning functions mentioned by the reviewer, we could only reach the general qualitative observation of having a more frequent contralateral selectivity.

      (8) Page 14 (end): Here, psychometric correspondence is referenced. Please add the Lauer et al., (2011) reference, or, as I would, remove the statement entirely and save it for the discussion (where it is also mentioned and referenced).

      Changed.

      (9) Figure 5, Panels B and C: Why don't the authors report the Kruskal-Wallis tests (for increasing number of units training the model), akin to e.g., Panel G of Figure 4? I think that would be interesting to see (e.g., if the number of required units to achieve statistical significance is the same).

      Within class randomization produced a moderate effect on decoder performance, achieving statistical significance at similar numbers of units, as seen in figure 5 panels B and C. We did not include these plots for the sake of not cluttering the figure with dense distributions and fuzzing the visualization of the differences between the distributions shown.
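
      For readers unfamiliar with within-class randomization, the sketch below shows one common way to implement it: each unit's single-trial responses are permuted independently within every azimuth class, which preserves each unit's response distribution per azimuth while breaking trial-to-trial covariation between units. The function name and toy data are hypothetical, and the authors' exact shuffling procedure may differ.

```python
import random


def shuffle_within_class(X, y, rng):
    """Return a copy of X with trials permuted per unit within each class.

    X: list of trials, each a list of unit responses.
    y: list of class labels (e.g., azimuths), one per trial.
    rng: a random.Random instance, so the shuffle is reproducible.
    """
    shuffled = [list(row) for row in X]      # leave the input untouched
    n_units = len(X[0])
    for label in set(y):
        idx = [i for i, lab in enumerate(y) if lab == label]
        for unit in range(n_units):
            permuted = idx[:]
            rng.shuffle(permuted)            # independent permutation per unit
            for dst, src in zip(idx, permuted):
                shuffled[dst][unit] = X[src][unit]
    return shuffled


# Toy example: two units whose responses covary perfectly within each class.
X = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50], [6, 60]]
y = [0, 0, 0, 90, 90, 90]
X_shuffled = shuffle_within_class(X, y, random.Random(1))
for original, permuted in zip(X, X_shuffled):
    print(original, "->", permuted)
```

      Comparing decoder performance on the original and the shuffled responses then isolates the contribution of noise correlations, since the per-unit, per-azimuth response statistics are identical in both cases.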

      (10) Figure 5, Panels B and C (histograms): I see a bit of skewness in the distributions (even after randomisation). Where does this come from? This is just a small talking point.

      We believe this is potentially due to more than one distribution of pairwise correlations combined into one histogram (like in a Gaussian mixture model).

      (11) Page 21: Could the authors please specify that the Day and Delgutte (2013) study was performed on rabbits? Since rabbits have an entirely different spectral hearing range compared to mice, spatial coding principles could very well be different in those animals (and I'm fairly certain such a study has not yet been published for mice).

      Specified.

      (12) Page 22: I'd encourage the authors to remove the reference to Rayleigh's duplex theory, since mice hardly (if at all) use interaural time differences for azimuthal sound localisation, given their generally high-frequency hearing range.

      That sentence is meant to discuss beyond the mouse model an exciting outlook of our findings in light of previous reports, which is a hypothetical functional relationship between the tonotopy in DCIC and the spatial distribution of azimuth sensitive DCIC neurons. We have clarified this now in the text.

      (13) Page 23: I believe the conventional verb for gene delivery with viruses is still "transduce" (or "infect", but not "induce"). What was the specific "syringe" used for stereotactic injections? Also, why were mice housed separately after surgery? This question pertains to animal welfare.

      Changed. The syringe was a 10ml syringe to generate positive or negative pressure, coupled to the glass needle through a silicon tubing via a luer 3-way T valve. Single housing was chosen to avoid mice compromising each other’s implantations. Therefore this can be seen as a refinement of our method to maximize the chances of successful imaging per implanted mouse.

      (14) Page 25: Could the authors please indicate the refractory period violation time window here? I had to find it buried in the figure caption of Supplementary figure 1.

      Added.

      (15) Page 27: What version of MATLAB was used? This could be important for reproduction of the analyses, since The Mathworks is infamously known to add (or even more deplorably, modify) functions in particular versions (and not update older ones accordingly).

      Added.

      Reviewer #3 (Recommendations For The Authors):

      Overall I thought this was a nice manuscript and a very interesting dataset. Here are some suggestions and minor corrections:

      You may find this work of interest - 'A monotonic code for sound azimuth in primate inferior colliculus' 2003, Groh, Kelly & Underhill.

      We thank the reviewer for pointing out this extremely relevant reference, which we regrettably failed to cite. It is now included in the revised version of the manuscript.

      In your introduction, you state "our findings point to a functional role of DCIC in sound location coding". Though your results show that there is azimuthal information contained in a subset of DCIC units there's no evidence in the manuscript that shows a functional link between this representation and sound localization.

      This is now addressed in the answers to the public reviews.

      I found the variability in your DCIC population quite striking - especially during the intersound intervals. The entrainment of the population in the imaging dataset suggests some type of input activating the populations - maybe these are avenues for further probing the variability here:

      (1) I'm curious if you can extract eye movements from your video. Work from Jennifer Groh shows that some cells in the primate inferior colliculus are sensitive to different eye positions (Groh et al., 2001). With recent work showing eye movements in rodents, it may explain some of the variance in the DCIC responses.

      This is now addressed in the answers to the public reviews.

      (2) I was also curious if the motor that moves the speaker made noise. It could be possible that some of the 'ongoing' activity is a sound-evoked response.

      We were careful to set the stepper motor speed so that it produced low-frequency noise, within a band mostly outside of the hearing range of mice (<4 kHz). Nevertheless, we cannot fully rule out that a very quiet but perhaps very salient component of the motor noise could influence the activity during the inter-trial periods. The motor was stationary and quiet for a period of at least one stimulus duration before and during stimulus presentation.

      (3) Was the sound you presented frozen or randomly generated on each trial? Could there be some type of structure in the noise you presented that sometimes led cells to respond to a particular azimuth location but not others?

      The sound presented was frozen noise. This is now clarified in the methods section.

      It may be useful to quantify the number of your units that had refractory period violations.

      Our manual curation of sorted units was very stringent to avoid mixing differently tuned neurons. The single units analyzed had very infrequent refractory period violations, in less than ~5% of the spikes, considering a 2 ms refractory period.
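
      The violation rate quoted above can be illustrated with a minimal sketch. It assumes that a violation is counted whenever a spike follows the previous spike of the same unit by less than the 2 ms refractory period, and that the rate is expressed relative to the unit's total spike count; the response does not spell out the exact definition, and the function name and toy spike times are hypothetical.

```python
def refractory_violation_fraction(spike_times_s, refractory_s=0.002):
    """Fraction of a unit's spikes that follow the previous spike
    by less than `refractory_s` seconds (assumed definition)."""
    times = sorted(spike_times_s)
    if len(times) < 2:
        return 0.0
    # Count inter-spike intervals shorter than the refractory period.
    violations = sum(
        1 for earlier, later in zip(times, times[1:])
        if (later - earlier) < refractory_s
    )
    return violations / len(times)


# Hypothetical spike train (seconds): two of the five intervals are < 2 ms.
toy_spikes = [0.000, 0.001, 0.010, 0.020, 0.0215, 0.050]
toy_fraction = refractory_violation_fraction(toy_spikes)
```

      Under this definition a unit would pass the criterion described in the response when the returned fraction stays below roughly 0.05.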

      Was the video recording contralateral or ipsilateral to the recording?

      The side of the face ipsilateral to the imaged IC was recorded. Added to methods.

      I was struck by the snout and ear movements - in the example shown in Supplementary Figure 2B it appears as they are almost predicting sound onset. Was there any difference in ear movements in the habituated and non-habituated animals? Also, does the placement of the cranial window disturb any of the muscles used in ear movement?

      Mouse snout movements appear to be quite active, perhaps reflecting arousal (Stringer et al., 2019). We cannot rule out that the cranial window implantation disturbed ear movement, but while moving the head-fixed mouse we observed what could be considered normal ear movements.

      Did you correlate time-point by time-point in the average population activity and movement, or did you try different temporal lags/leads in case the effect of the movements was delayed in some way?

      Point by point, due to the 250 ms time resolution of imaging.
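
      For readers unfamiliar with the lag/lead analysis the reviewer asks about, the following is a minimal sketch of a lagged Pearson correlation between two traces sampled at the same frame rate (one frame would correspond to 250 ms here). It is not the analysis performed in the study, which used the zero-lag comparison only; the function names and the toy traces are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom else 0.0


def lagged_correlations(activity, movement, max_lag_frames):
    """Correlate activity[t + lag] with movement[t] for each lag.

    A positive lag means the activity follows the movement by `lag` frames;
    lag 0 is the point-by-point comparison.
    """
    result = {}
    for lag in range(-max_lag_frames, max_lag_frames + 1):
        if lag >= 0:
            a = activity[lag:]
            m = movement[:len(movement) - lag]
        else:
            a = activity[:len(activity) + lag]
            m = movement[-lag:]
        result[lag] = pearson(a, m)
    return result


# Hypothetical traces: the activity is the movement delayed by one frame.
toy_movement = [0, 1, 0, 0, 2, 0, 0, 3, 0, 0]
toy_activity = [0, 0, 1, 0, 0, 2, 0, 0, 3, 0]
toy_lags = lagged_correlations(toy_activity, toy_movement, max_lag_frames=3)
best_lag = max(toy_lags, key=toy_lags.get)
```

      With a 250 ms frame interval, each lag step spans a quarter of a second, so delays shorter than one frame cannot be resolved by this kind of analysis.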

      Are the video recordings only available during the imaging? It would be nice to see the same type of correlations in the neuropixel-acquired data as well.

      Only imaging. For neuropixels recordings, we were skeptical about face videography as we suspected that face movements were likely influenced by the acute nature of the preparation procedure. Our cranial window preparation, on the other hand, involved a recovery period of at least 4 weeks. Therefore we were inclined to perform videographical interrogation of face movements on these mice instead.

      If you left out more than 1 trial, do you think this would help your overfitting issue (e.g. leaving out 20% of the data)?

      Due to the relatively small number of trial repetitions collected, fitting the model with an even smaller training dataset is unlikely to help overfitting and will likely decrease decoder performance.

      It would be nice to see a confusion matrix - even though azimuthal error and cumulative distribution of error are a fine way to present the data - a confusion matrix would tell us which actual sounds the decoder is confusing. Just looking at errors could result in some funky things where you reduce the error generally but never actually estimate the correct location.

      We considered confusion matrices early on in our study but they were not easily interpretable or insightful, likely due to the relatively low discrimination ability of the mouse model with +/- 30º error after extensive training. Therefore, we reasoned that in passively listening mice (and likely trained mice too) with limited trial repetitions, an undersampled and diffuse confusion matrix is expected which is not an ideal means of visualizing and comparing decoding errors. Hence we relied on cumulative error distributions.

      Do your top-ranked units have stronger projections onto your 10-40 principal components?

      It would be interesting to know if the components are mostly taking into account those 30ish percent of the population that is dependent upon azimuth.

      Inspection of PC loadings across units ranked based on response dependency to stimulus azimuth does not show a consistent stronger projection of top ranked units onto the first 10-40 principal components (Author response image 3).

      Author response image 3.

      PC loading matrices for each recorded mouse. The units recorded in each mouse are ranked in descending order of response dependency to stimulus azimuth based on  the p value of the chi square test. Units above the red dotted line display a chi square p value < 0.05, units below this line have p values >= 0.05.

      How much overlap is there in the tuning of the top-ranked units?

      This varies considerably from mouse to mouse and between imaging and electrophysiology, which makes it hard to generalize, since it might depend on the unique DCIC population sampled in each mouse.

      I'm not really sure I follow what the nS/N adds - it doesn't really measure tuning but it seems to be introduced to discuss/extract some measure of tuning.

      nS/N is used to quantify how noisy neurons are, independent of how sensitive their responses are to the stimulus azimuth.

      Is the noise correlation - observed to become more positive - for more contralateral stimuli a product of higher firing rates due to a more preferred stimulus presentation or a real effect in the data? Was there any relationship between distance and strength of observed noise correlation in the DCIC?

      We observed a consistent and homogeneous trend of pairwise noise correlation distributions either shifted or tailed towards more positive values across stimulus azimuths, for imaging and electrophysiology datasets (Author response image 3). The lower firing frequency observed in neuropixels recordings in response to ipsilateral azimuths could have affected the statistical power of the comparison between the pairwise noise correlation coefficient distribution to its randomized chance level, but the overall histogram shapes qualitatively support this consistent trend across azimuths (Author response image 4).

      Author response image 4.

      Distribution histograms for the pairwise correlation coefficients (Kendall tau) from pairs of simultaneously recorded top ranked units across mice (blue) compared to the chance level distribution obtained through randomization of the temporal structure of each unit’s activity to break correlations (purple). Vertical lines show the medians of these distributions. Imaging data comes from n = 12 mice and neuropixels data comes from n = 4 mice.
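
      The comparison described in this caption can be sketched as follows, assuming each unit's activity is available as a list of per-trial (or per-frame) responses. The sketch computes Kendall's tau-b for one pair of units and a chance-level distribution obtained by shuffling one unit's responses to break their temporal alignment. The exact randomization procedure used by the authors is not specified here, so the shuffle below, the function names, and the toy responses are assumptions.

```python
import math
import random


def kendall_tau_b(x, y):
    """Kendall's tau-b (tie-corrected) for two equal-length sequences."""
    n = len(x)
    concordant = discordant = 0
    untied_x = untied_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx != 0:
                untied_x += 1
            if dy != 0:
                untied_y += 1
            product = dx * dy
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    denom = math.sqrt(untied_x * untied_y)
    return (concordant - discordant) / denom if denom else 0.0


def shuffled_taus(x, y, n_shuffles=200, seed=0):
    """Chance-level tau values after shuffling y to break its alignment with x."""
    rng = random.Random(seed)
    taus = []
    for _ in range(n_shuffles):
        y_shuffled = list(y)
        rng.shuffle(y_shuffled)
        taus.append(kendall_tau_b(x, y_shuffled))
    return taus


# Hypothetical responses of two simultaneously recorded units.
unit_a = [3, 5, 2, 8, 7, 1, 6, 4]
unit_b = [2, 6, 1, 7, 8, 3, 5, 4]
observed_tau = kendall_tau_b(unit_a, unit_b)
chance_taus = shuffled_taus(unit_a, unit_b)
```

      Repeating this over all pairs of top ranked units would give the two histograms compared in the figure, with the observed values in one and the shuffled values in the other.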

      Typos:

      'a population code consisting on the simultaneous" > should on be of?

      'half of the trails' > trails should be trials?

      'referncing the demuxed channels' > should it be demixed?

      Corrected.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study presents a valuable finding on the immunophenotypes of cancer treatment-related pneumonitis. The evidence supporting the claims of the authors is solid, although the inclusion of controls, as suggested by one of the reviewers, strengthened the study. The work will be of interest to cancer immunologists.

      Response: We are thankful for the editor's recognition of the contribution our study makes to understanding the immunophenotypes associated with cancer treatment-related pneumonitis. We agree that the inclusion of control data is pivotal for benchmarking biomarkers. While our initial study design was constrained by the availability of BALF from healthy individuals within clinical settings, we addressed this limitation by incorporating scRNA-seq data from healthy control and COVID-19 BALF cells sourced from the GSE145926 dataset. This additional analysis has provided a baseline for comparison, revealing that CD16 is expressed in a minority of T cells in healthy BALF, specifically 1.0% of CD4+ T cells and 1.6% of CD8+ T cells. The inclusion of this data as Figures 6H and 6I in our manuscript offers a robust context for the significant increase in CD16-expressing T cells observed in patients with PCP, thus enhancing the robustness of our study's conclusions.

      Author response image 1.

      Reviewer #1 (Recommendations For The Authors):

      Many thanks for giving me the opportunity to review your paper. I really enjoyed the way you carried out this work - for example, your use of a wide panel of markers and the use of two analytical methods - you have clearly given great thought to bias avoidance. I also greatly appreciated your paragraph on the limitations, as there are several, but you do not 'over-sell' your conclusions so there is no issue here for me.

      To improve the piece, there are a few typos (eg 318 - specific to alpha-myosin) and I was briefly confused about the highlighted clusters in Figure 4. Perhaps mention why they are highlighted when they first appear in 4D instead of E?

      Response: We have corrected the typos, and we have rearranged the sequence of Figures 3E and 3F, as well as 4D and 4E, to ensure a logical flow. Citrus-generated violin plots are now presented prior to the heatmap of the clusters, which better illustrates the progression of our analysis and the derivation of the clusters.

      In terms of improvements to the data, obviously it would have been ideal if you had had some sort of healthy control as a point of reference for all cohorts, but working in the field I understand the difficulties in getting healthy BAL. It would be worth your while however trying to find more supportive data in the literature in general. There are studies which assess various immune markers in healthy BAL eg https://journal-inflammation.biomedcentral.com/articles/10.1186/1476-9255-11-9. and so I think it is worth looking wrt the main findings. For example, are CD16+ T cells seen in healthy BAL or any other conditions (at present the COVID study is being over-relied on)? Could these cells be gamma deltas? (gamma deltas frequently express CD8 and CD16, and can switch to APC like phenotypes).

      Response: We are grateful for the reviewer's consideration of the practical challenges associated with collecting BALF from healthy individuals. As an alternative, we have supplemented our analysis with single-cell RNA sequencing data from BALF cells of healthy controls, as found in the existing literature (Nature Medicine 2020; 26: 842-844). We accessed GSE145926 and downloaded data for BALF cells from healthy controls (n=3) and patients with severe COVID-19 (n=6). The filtered gene-barcode matrix was first normalized using the ‘NormalizeData’ method in Seurat v.4 with default parameters. The top 2,000 variable genes were then identified using the ‘vst’ method in the Seurat FindVariableFeatures function. PCA and UMAP were then performed. T cells were identified as CD2 >1 and CD3E >1, and FCGR3A expression was explored using an expression threshold of 0.5. Violin plots and bar plots were generated with the ggplot function.
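
      The thresholding step in this response can be illustrated with a minimal sketch. It is written in plain Python on toy data rather than in Seurat, and it assumes that each cell is represented by its normalized expression values for CD2, CD3E and FCGR3A (the gene encoding CD16). The cutoffs are the ones quoted above; the function name and the example cells are hypothetical.

```python
def fraction_fcgr3a_positive(cells, t_cell_cutoff=1.0, fcgr3a_cutoff=0.5):
    """Among cells gated as T cells (CD2 > cutoff and CD3E > cutoff),
    return the fraction with FCGR3A expression above `fcgr3a_cutoff`."""
    t_cells = [
        cell for cell in cells
        if cell["CD2"] > t_cell_cutoff and cell["CD3E"] > t_cell_cutoff
    ]
    if not t_cells:
        return 0.0
    positive = sum(1 for cell in t_cells if cell["FCGR3A"] > fcgr3a_cutoff)
    return positive / len(t_cells)


# Hypothetical normalized expression values for four cells.
toy_cells = [
    {"CD2": 1.8, "CD3E": 2.1, "FCGR3A": 0.9},  # T cell, FCGR3A-positive
    {"CD2": 1.5, "CD3E": 1.2, "FCGR3A": 0.0},  # T cell, FCGR3A-negative
    {"CD2": 0.0, "CD3E": 0.0, "FCGR3A": 2.3},  # not gated as a T cell
    {"CD2": 1.4, "CD3E": 0.6, "FCGR3A": 0.7},  # fails the CD3E cutoff
]
toy_fraction_positive = fraction_fcgr3a_positive(toy_cells)
```

      Applying the same gate separately to the CD4+ and CD8+ subsets would yield per-subset percentages of the kind reported in the response.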

      Regarding the pivotal finding of increased CD16-expressing T cells in patients with PCP, the scRNA-seq data mining indicates that CD16 is expressed by a minority of T cells in healthy BALF—1.0% of CD4+ T cells and 1.6% of CD8+ T cells. These figures, now incorporated into our revised manuscript as Figures 6H and 6I, substantiate our findings. These cells could be gamma delta T cells, but we could not confirm this with the limited data. We will investigate this in a future study. The main text has been updated to reflect these findings.

      Author response image 2.

      I would agree with your approach of not going down the transcript route, so just focus on protein expression.

      I think you need to mention more about the impact of ICI on PD1 expression - in the methods you lose one approach owing to low T cell expression (132) but in the discussion you mention ICI induced high expression (311) as previously reported. This apparent contradiction needs an explanation.

      Response: We acknowledge the need for clarification regarding the impact of ICIs on PD-1 expression. In the methods section, the low detection of PD-1 expression on T cells in patients treated with nivolumab was indeed noted; this was due to the competitive nature of the PD-1 detection antibody EH12.2 with nivolumab. As reported by Suzuki et al. (International Immunology 2020; 32: 547-557), T cells from patients with ICI-induced ILD, including those treated with nivolumab, exhibit upregulated PD-1 expression when detected with the PD-1 antibody clone MIH4. Conversely, as outlined by Yanagihara et al. (BBRC 2020; 527: 213-217), the PD-1 detection antibody clone EH12.2 conjugated with 155Gd (#3155009B) used in our study is unable to detect PD-1 when patients are under nivolumab treatment due to competitive inhibition. The absence of a metal-conjugated PD-1 antibody with the MIH4 clone presented a limitation in our study. Ideally, we would have conjugated the MIH4 antibody with 155Gd for our analysis, which is a refinement we aim to incorporate in future research. We have now included this discussion in our manuscript to clarify the contradiction between the methodological limitations and the high PD-1 expression induced by ICIs, as reported in the literature. This addition will guide readers through the nuances of antibody selection and its implications for detecting PD-1 expression in the context of ICI treatment.

      Finally, since you have the severity data, it would be good to assess all the significantly different clusters against this metric, as you have done for CD16+ T cells. Not only may this reveal more wrt the impact of other immune populations, but it'll also give a point of reference for the CD16+ T cell data.

      Response: Thank you for the suggestion to assess all significantly different clusters against the disease severity metric. We have expanded our analysis to include a thorough correlation study between the disease severity and intensity of various T-cell markers. Notably, we observed that intensity of CCR7 expression correlates with the disease severity. Although the precise biological significance of this correlation remains to be elucidated, it may suggest a role for CCR7+ T cells in the pathogenesis or progression of the disease. We have considered the potential implications of this finding and included it as Supplementary Figure 5. We have also discussed this observation in the discussion section.

      Author response image 3.

      Overall though I think this is a really nice study, with a potentially very significant finding in linking CD16+ T cells with severity. Congratulations.

      Response: We would like to thank the reviewer for their heartfelt comments on our manuscript.

      Reviewer #2 (Recommendations For The Authors):

      General:

      1) The fact that this is a retrospective study should be indicated earlier in the paper.

      Response: Now we have mentioned the retrospective nature of the study in the method section as follows: In this retrospective study, patients who were newly diagnosed with PCP, DI-ILD, and ICI-ILD and had undergone BALF collection at Kyushu University Hospital from January 2017 to April 2022 were included. The retrospective study was approved by the Ethics Committee of Kyushu University Hospital (reference number 22117-00).

      2) tSNE and UMAP are dimensionality reduction techniques that don't cluster the cells, the authors should specify what clustering algorithm was used subsequently (e.g FlowSOM)

      Response: The clusters were determined manually based on their expression patterns.

      3) With regards to the role of CD16 in a potential exacerbated cytotoxicity in the fatal PCP case, the authors could measure the levels of C3a related proteins in patient serum to link to a common immunopathogenic pathway with COVID.

      Response: We did not collect serum from the patients in this study as our research protocol was approved by the Ethics committee for the use of BALF only. However, we agree with your assessment that the measurement of serum C3a levels would be informative. In future studies, we will incorporate the measurement of serum C3a levels to provide more comprehensive insights into the impact of C3a on immune function. Thank you for your valuable feedback and for helping us to improve the quality of our research.

      Line-specific:

      101 The authors should provide some information on how the cryopreservation of the BALF was carried out.

      Response: Upon collection, BALF samples were immediately centrifuged at 300 g for 5 minutes to pellet the cells. The resultant cell pellets were then resuspended in Cellbanker 1 cryopreservation solution (Takara, catalog #210409). This suspension was aliquoted into cryovials and gradually frozen to –80ºC using a controlled rate freezing method to ensure cell viability. The samples were stored at –80ºC until required for experimental analysis. We have added the information in the method section.

      Fig 3B: It would be very helpful if the authors could add a supplementary figure with marker expression on the UMAP projection.

      Response: We have added Supplementary Figure 4 with marker expression on the UMAP projection in Figure 3B.

      Fig 4A: Same as Fig 3B

      Response: We have added Supplementary Figure 5 with marker expression on the UMAP projection in Figure 4A.

      Fig 5B: Same as Fig 3B

      Response: We have added Supplementary Figure 6 with marker expression on the tSNE projection in Figure 5B.

      266 Authors should state if the data is not shown with regards to differences in myeloid cell fractions

      430 Marker intensity is not shown in panel D

      Re: Corrected as follows: “Citrus network tree visualizing the hierarchical relationship of each marker between identified T cell ~”

      446 The legend says patients have IPF, CTD-ILD, sarcoidosis but the figure shows PCP, DI-ILD, ICI-ILD.

      Re: Corrected.

      451 What do the authors mean in "Graphical plots represent individual samples"? Panel B is a dot plot of all samples.

      Response: Corrected as “Dot plots represent ~”.

      472 What do the authors mean in "Graphical plots represent individual samples"? Panel C is a dot plot of all samples.

      Response: Corrected as “Dot plots represent ~”.

      Reviewer #3 (Recommendations For The Authors):

      An important thing is to add comparisons against healthy donors, at least. A common baseline is needed to firmly establish any biomarkers.

      Response: We acknowledge the reviewer's concern regarding the comparison with healthy donors. Although our study did not initially include BALF collection from healthy controls due to the constraints of clinical practice, we recognize the importance of a control baseline to validate biomarkers. To address this, we have integrated scRNA-seq data from healthy control BALF cells available in public datasets (Nature Medicine 2020; 26: 842-844), accessed from GSE145926. This dataset includes BALF cells from healthy controls (n=3) alongside severe COVID-19 patients (n=6). Data mining confirmed that CD16 expression is in a minority of T cells in healthy BALF—1.0% of CD4+ T cells and 1.6% of CD8+ T cells. We have included this comparative data in our manuscript as Figures 6H and 6I to provide context for the observed increase in CD16-expressing T cells in PCP patients, which substantiates our findings.

      Author response image 4.

      Data analysis needs to go deeper. There are several other tools on Cytobank alone that would allow a more quantitative analysis of the data. Fold changes in marker expressions would be very important as measurements of phenotypic changes.

      Response: We thank the reviewer for their constructive feedback on the depth of our data analysis. We acknowledge the value of a more quantitative approach, including the use of fold change measurements to assess phenotypic alterations, and recognize the potential insights such tools on Cytobank could provide. Due to the scope and limited space of the current study, we have focused our analysis on the most pertinent findings relevant to our research questions. We believe the present analysis serves the immediate objectives of this study. However, we agree that further quantitative analysis would enhance the understanding of the data. We have expanded our analysis to include a thorough correlation study between the disease severity of PCP and intensity of various T-cell markers. Notably, we observed that intensity of CCR7 expression correlates with the disease severity of PCP. Although the precise biological significance of this correlation remains to be elucidated, it may suggest a role for CCR7+ T cells in the pathogenesis or progression of the disease. We have considered the potential implications of this finding and included it as Supplementary Figure 5. We have also discussed this observation in the discussion section. We aim to consider these approaches in future work to build upon the foundation laid by this study. Your suggestions are invaluable and will be kept at the forefront as we plan subsequent research phases.

      Author response image 5.

      Reviewer #1 (Public Review):

      Cytotoxic agents and immune checkpoint inhibitors are the most commonly used and efficacious treatments for lung cancers. However their use brings two significant pulmonary side-effects; namely Pneumocystis jirovecii infection and resultant pneumonia (PCP), and interstitial lung disease (ILD). To observe the potential immunological drivers of these adverse events, Yanagihara et al. analysed and compared cells present in the bronchoalveolar lavage of three patient groups (PCP, cytotoxic drug-induced ILD [DI-ILD], and ICI-associated ILD [ICI-ILD]) using mass cytometry (64 markers). In PCP, they observed an expansion of the CD16+ T cell population, with the highest CD16+ T proportion (97.5%) in a fatal case, whilst in ICI-ILD, they found an increase in CD57+ CD8+ T cells expressing immune checkpoints (TIGIT+ LAG3+ TIM-3+ PD-1+), FCRL5+ B cells, and CCR2+ CCR5+ CD14+ monocytes. Given the fatal case, the authors also assessed for, and found, a correlation between CD16+ T cells and disease severity in PCP, postulating that this may be owing to endothelial destruction. Although n numbers are relatively small (n=7-9 in each cohort; common numbers for CyTOF papers), the authors use a wide panel (n=65) and two clustering methodologies giving greater strength to the conclusions. The differential populations discovered using one or two of the analytical methods are robust: whole population shifts with clear and significant clustering. These data are an excellent resource for clinical disease specialists and pan-disease immunologists, with a broad and engaging contextual discussion about what they could mean.

      Strengths:

      • The differences in immune cells in BAL in these specific patient subgroups is relatively unexplored.

      • This is an observational study, with no starting hypothesis being tested.

      • Two analytical methods are used to cluster the data.

      • A relatively wide panel was used (64 markers), with particular strength in the alpha beta T cells and B cells.

      • Relevant biomarkers, beta-D-glucan and KL-6 were also analysed

      • Appropriate statistics were used throughout.

      • Numbers are low (7 cases of PCP, 9 of DI-ILD, and 9 of ICI-ILD) but these are difficult samples to collect and so in relative terms, and considering the use of CyTOF, these are good numbers.

      • Beta-D-glucan shows potential as a biomarker for PCP (as previously reported) whilst KL-6 shows potential as a biomarker for ICI-ILD (not reported before). Interestingly, KL-6 was not seen to be increased in DI-ILD patients.

      • Despite the relatively low n numbers and lack of matching there are some clear differentials. The CD4/CD8+CD16+HLA-DR+CXCR3+CD14- T cell result is striking - up in PCP (with EM CD4s significantly down) - whilst the CD8 EMRA population is clear in ICI-ILD and 'non-exhausted' CD4s, with lower numbers of EMRA CD8s in DI-ILD.

      • The authors identify 17/31 significantly differentiated clusters of myeloid cells, eg CD11bhi CD11chi CD64+ CD206+ alveolar macrophages with HLA-DRhi in PCP.

      • With respect to B cells, the authors found that FCRL5+ B cells were more abundant in patients with ICI-ILD compared to those with PCP and DI-ILD, suggesting these FCRL5+ B cells may have a role in irAE.

      • One patient's extreme CD16+ T cell (97.5% positive) and death, led the authors to consider CD16+ T cells as an indicator of disease severity in PCP. This was then tested and found to be correct.

      • Authors discuss results in context of literature leading them to suggest that CD16+ T cells may target endothelial cells and wonder if anti-complement therapy may be efficacious in PCP.

      • Great discussion on auto-reactive T cell clones where the authors suggest that in ICI-ILD CD8s may react against healthy lung, driving ILD.

      • An observation of CXCR3 in different CD8 populations in ICI-ILD and PCP led the authors to hypothesise on the chemoattractants in the microenvironment.

      • Excellent point suggesting CD57 may not always be a marker of senescence on T cells - reflective of growing change within the community.

      • Well considered suggestion that FCRL5+ B cells may be involved in ICI-ILD driven autoimmunity.

      • The authors discuss the main weaknesses in the discussion and stress that the findings detailed in the paper "demonstrate a correlation rather than proof of causation".

      • Figures and legends are clear and pleasing to the eye.

      Weaknesses:

      • This is an observational study, with no starting hypothesis being tested.

      • Only patients who were able to have a lavage taken have been recruited.

      • One set of analysis wasn't carried out for one subgroup (ICI-ILD) as PD1 expression was negative owing to the use of nivolumab.

      • Some immune cell subsets wouldn't be picked up with the markers and gating strategies used; e.g. NK cells.

      • Some immune cells would be disproportionately damaged by the storage, thawing and preparation of the samples; e.g. granulocytes.

      • Numbers are low (7 cases of PCP, 9 of DI-ILD, and 9 of ICI-ILD), sex, age and adverse event matching wasn't performed, and treatment regimens are varied and 'suspected' (suggesting incomplete clinical data) - but these are difficult samples to collect. These numbers drop further for some analyses e.g. T cell clustering owing to factors such as low cell number.

      • The disease comparisons are with each other, there is no healthy control.

      • Samples are taken at one time point.

      • The discussion on probably the stand out result - the CD16+ T cells in PCP - relies on two papers - leading to a slightly skewed emphasis on one paper on CD16+ cells in COVID. There are other papers out there that have observed CD16+ T cells in other conditions. It is also worth bearing in mind that given the markers used, these CD16+ T cells may be gamma deltas.

      • The discussion on ICI patients consistently showing increased PD1 could have been greater, as given the ICI is targeting PD1, one would expect the opposite, as commented on, and observed, in the methods section.

      Reviewer #2 (Public Review):

      Yanagihara and colleagues investigated the immune cell composition of bronchoalveolar lavage fluid (BALF) samples in a cohort of patients with malignancy undergoing chemotherapy and with lung adverse reactions including Pneumocystis jirovecii pneumonia (PCP) and immune-checkpoint inhibitors (ICIs) or cytotoxic drug induced interstitial lung diseases (ILDs). Using mass cytometry, their aim was to characterize the cellular and molecular changes in BAL to improve our understanding of their pathogenesis and identify potential biomarkers and therapeutic targets. In this regard, the authors identify a correlation between CD16 expression in T cells and the severity of PCP and an increased infiltration of CD57+ CD8+ T cells expressing immune checkpoints and FCRL5+ B cells in ICI-ILD patients.

      The conclusions of this paper are mostly well supported by data, but some aspects of the data analysis need to be clarified and extended.

      1) The authors should elaborate on why different sets of markers were selected for each analysis step. E.g., different sets of markers were used for UMAP, CITRUS and viSNE in the T cell and myeloid analysis.

      2) The authors should state if a normality test for the distribution of the data was performed. If not, non-parametric tests should be used.

      3) The authors should explore the correlation between CD16 intensity and the CTCAE grade in T cell subsets such as EMRA CD8 T cells, effector memory CD4, etc as identified in Figure 1B.

      4) The authors could use CITRUS to better assess the B cell compartment.

      Reviewer #3 (Public Review):

      The authors collected BALF samples from lung cancer patients newly diagnosed with PCP, DI-ILD or ICI-ILD. CyTOF was performed on these samples, using two different panels (T-cell and B-cell/myeloid cell panels). Results were collected, cleaned-up, manually gated and pre-processed prior to visualisation with manifold learning approaches t-SNE (in the form of viSNE) or UMAP, and analysed by CITRUS (hierarchical clustering followed by feature selection and regression) for population identification - all using Cytobank implementation - in an attempt to identify possible biomarkers for these disease states. By comparing cell abundances from CITRUS results and qualitative inspection of a small number of marker expressions, the authors claimed to have identified an expansion of CD16+ T-cell population in PCP cases and an increase in CD57+ CD8+ T-cells, FCRL5+ B-cells and CCR2+ CCR5+ CD14+ monocytes in ICI-ILD cases.

      By the authors' own admission, there is an absence of healthy donor samples and, perhaps as a result of retrospective experimental design, also an absence of pre-treatment samples. The entire analysis effectively compares three yet-established disease states with no common baseline - what really constitutes a "biomarker" in such cases? The introduction asserts that "[b]y characterizing the cellular and molecular changes in BAL from patients with these complications, we aim to improve our understanding of their pathogenesis and identify potential therapeutic targets" (lines 82-84). Given these obvious omissions, no real "changes" have been studied in the paper. These are very limited comparisons among three, and only these three, states.

      Even assuming a more thorough experimental design, the data analysis is unfortunately too shallow and has not managed to explore the wealth of information that could potentially be extracted from the results. CITRUS is accessible and convenient, but also makes a couple of big assumptions which could affect data analysis - 1) Is it justified to concatenate all FCS files to analyse the data in one batch / small batches? Could there be batch effects or other biological events that could confuse the algorithm? 2) With a relatively small number of samples, and after internal feature selection of CITRUS, is the regression model suitable for population identification or would it be too crude and miss out rare populations? There are plenty of other established methods that could be used instead. Have those methods been considered?

      Colouring t-SNE or UMAP (e.g. Figure 6C) plots by marker expression is useful for quick identification of cell populations but it is not a quantitative analysis. In a CyTOF analysis like this, it is common to work out fold changes of marker expressions between conditions. It is inadequate to judge expression levels and infer differences simply by looking at colours.
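
      The kind of quantification the reviewer asks for can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: it assumes one median marker intensity per sample is already available for each condition, that intensities are positive, and it uses made-up numbers and a hypothetical function name.

```python
import math
from statistics import median

def log2_fold_change(condition_a, condition_b):
    """log2 ratio of the group medians (condition_b relative to condition_a).

    Each argument is a list of per-sample median marker intensities.
    Assumes all values are positive; the inputs below are invented.
    """
    return math.log2(median(condition_b) / median(condition_a))

# Hypothetical per-sample median intensities for one marker.
group_a = [10.0, 12.0, 11.0]
group_b = [40.0, 44.0, 48.0]

lfc = log2_fold_change(group_a, group_b)
print(f"log2 fold change: {lfc:.2f}")
```

      A number like this, reported per marker and per population together with an appropriate test, gives readers something to compare that colour scales on an embedding cannot.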

      The relatively small number of samples also means that most results presented in the paper are not statistically significant. Whilst it is understandable that it is not always possible to collect a large number of patient samples for studies like this, having several entire major figures showing "n.s." (e.g. Figures 3A, 4B and 5C), together with limitations in the comparisons themselves and inadequate analysis, makes the observations unconvincing, and even less so for the single fatal PCP case where N = 1.

      It would also be good scientific practice to show evidence of sample data quality control. Were individual FCS files examined? Did the staining work? Some indication of QC would also be great.

      The dataset generated and studied by the authors has the potential to address the question they set out to answer and thus potentially be useful for the field. However, in the current state of presentation, more evidence and more thorough data analysis are needed to draw any conclusions, or correlations, as the authors would like to frame them.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This paper provides useful information about how the ionome of Arabidopsis thaliana adapts to very high CO2-levels, backed up by solid evidence and carefully designed studies. However, the broader claims of the paper about climate change and food security - heavily emphasized in the abstract, introduction, and discussion - are inappropriate, as there is no direct link to the presented work.

      We sincerely thank you for the work you have done in reviewing our manuscript. We very much appreciate your overall positive assessment of the experimental work as a whole, its value and robustness.

      In this revised version, we took on board the majority of your suggestions and your comments. In particular, we understood your critical point about overstating our objectives, which might in turn seem uncorrelated with our results. We fully agree with the comments that have been made on this point. Consequently, we have made substantial modifications and corrections in order to clarify our objectives and their implications: exploring in depth the natural variation of the shoot ionome response to elevated CO2, and generating a valuable resource allowing a better understanding of the genetic and molecular mechanisms involved in the regulation of plant mineral nutrition by the elevation of atmospheric CO2.

      We also made modifications in response to the other suggestions, including a clarification of the functional experiments carried out around the function of TIP2;2 in response to elevated CO2. Figure 7 now comprises the comparison between both ambient and elevated CO2 conditions, which is much more informative than what appeared in the previous version.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The study's abstract, introduction, and conclusions are not supported by the methods and results conducted. In fact, the results presented suggest that Arabidopsis could easily adapt to an extremely high CO2 environment.

      We understand the reviewer’s comment. Although our work is considered useful, robust and well designed, we agree with the reviewer's point. We have certainly overemphasized the significance of our work to address the issue of food security in response to rising atmospheric CO2, at the expense of the factual description of the results of our fundamental study of the mechanisms at the interface between CO2 and mineral nutrition. We have clarified this focus by modifying the text of the introduction, objectives and discussion. We hope that these modifications will enable readers to better appreciate the core of this work.

      Regarding the last part of the comment, our results do suggest that genetic variation could allow adaptation to rising atmospheric CO2, and our study does indeed aim to identify the extent and basis of this genetic variation.

      This study offers good evidence pointing to a genetic basis for Arabidopsis thaliana's response to elevated CO2 (eCO2) levels and its subsequent impact on the leaf ionome. The natural variation analyses in the study support the hypothesis that genetic factors, rather than local adaptation, guide the influence of eCO2 on the ionome of rosette leaves in Arabidopsis. However, the manuscript's claim regarding its role in "the development of biofortified crops adapted to a high-CO2 world" (line 23) is overstated, especially given the absence of any analysis of the influence of eCO2 on the seed ionome and given that Arabidopsis is a poor model for the harvest index of any crop. The manuscript, in its current form, necessitates massive revisions, particularly in clarifying its broader implications and in providing more substantial evidence for some of its assertions.

      We thank the reviewer for this comment and for the positive appreciation of our identification of a genetic basis for Arabidopsis thaliana's response to elevated CO2 and its subsequent impact on the leaf ionome. Nevertheless, it is true that the study of the leaf ionome is far from being able to lead to the development of biofortified plants. Some papers have described the nutrient harvest index in Arabidopsis as a potential indicator of nutrient use efficiency (for instance, Masclaux-Daubresse and Chardon, Journal of Experimental Botany 2011 or Aranjuelo et al., Journal of Experimental Botany 2013). However, as we did not include any seed ionome data in the paper, we added clear mentions that our analyses were made on leaves (lines 56/57/250/319) and a comment in the discussion section to address this limitation (lines 325-328).

      Major Drawbacks and Questions:

      (1) Evidence for the Central Premise:

      The foundational premise of the study is the assertion that rising atmospheric CO2 levels result in a decline in plant mineral content. This phenomenon is primarily observed in C3 plants, with C4 plants seemingly less affected. The evidence provided on this topic is scant and, in some instances, contradicts the authors' own references. The potential reduction of certain minerals, especially in grains, can be debated. For instance, reduced nitrogen (N) and phosphorus (P) content in grains might not necessarily be detrimental for human and animal consumption. In fact, it could potentially mitigate issues like nitrogen emissions and phosphorus leaching. Labeling this as a "major threat to food security" (line 30) is exaggerated. While the case for microelements might be more compelling, the introduction fails to articulate this adequately. Furthermore, the introduction lacks any discussion on how eCO2 might influence nutrient allocation to grains, which would be crucial in substantiating the claim that eCO2 poses a threat to food security. A more comprehensive introduction that clearly delineates the adverse effects of eCO2 and its implications for food security would greatly enhance the manuscript.

      We partially agree with this comment. The decline in mineral status of C3 plants under conditions of elevated atmospheric CO2 has been widely described in the literature, and specifically documented for the cereal grains. While there are variations in this effect (depending on species, ecotype, cultivar), there is no debate about its acceptance. Here are just a few of the many works describing this effect, both on a global scale and at the level of the individual plant (Cotrufo MF (1998) Elevated CO2 reduces the nitrogen concentration of plant tissues. Global Change Biology 4: 43-54; Loladze I (2014) Hidden shift of the ionome of plants exposed to elevated CO(2) depletes minerals at the base of human nutrition. eLife 3: e02245; Myers SS (2014) Increasing CO2 threatens human nutrition. Nature 510: 139-142; Poorter H (1997) The effect of elevated CO2 on the chemical composition and construction costs of leaves of 27 C3 species. Plant, Cell & Environment 20: 472-482; Soares JC (2019) Preserving the nutritional quality of crop plants under a changing climate: importance and strategies. Plant and Soil 443: 1-26; Stitt M (1999) The interaction between elevated carbon dioxide and nitrogen nutrition: the physiological and molecular background. Plant, Cell & Environment 22: 583-621; Uddling J (2018) Crop quality under rising atmospheric CO2. Curr Opin Plant Biol 45: 262-267).

      In addition to this, the threat to food security posed by this alteration in plant mineral status has also been well described in the literature by several modeling approaches (Beach RH (2019) Combining the effects of increased atmospheric carbon dioxide on protein, iron, and zinc availability and projected climate change on global diets: a modelling study. Lancet Planet Health 3: e307-e317; Ebi KL (2019) Elevated atmospheric CO(2) concentrations and climate change will affect our food's quality and quantity. Lancet Planet Health 3: e283-e284; Medek DE (2017) Estimated Effects of Future Atmospheric CO2 Concentrations on Protein Intake and the Risk of Protein Deficiency by Country and Region. Environ Health Perspect 125: 087002; Smith MR (2018) Impact of anthropogenic CO2 emissions on global human nutrition. Nature Climate Change 8: 834-839; Weyant C (2018) Anticipated burden and mitigation of carbon-dioxide-induced nutritional deficiencies and related diseases: A simulation modeling study. PLoS Med 15: e1002586; Zhu C (2018) Carbon dioxide (CO2) levels this century will alter the protein, micronutrients, and vitamin content of rice grains with potential health consequences for the poorest rice-dependent countries. Sci Adv 4: eaaq1012). To reinforce this point, we have added a sentence and references (lines 30-33). Nevertheless, we understand the reviewer's comment on the nuance to be given to the intensity of this potential threat. We have therefore modified the text, replacing "major threat" by "significant threat" (lines 3 and 29).

      We would also like to answer the reviewer’s comment on the potential environmental benefit associated with reduced N and P content in grains (mitigation of N emissions and P leaching). Indeed, if this reduced N and P content results from a lowered use efficiency of soil nutrients by plants, as suggested by several studies (Bloom 2010, Cassan 2023, Gojon 2023 and references therein), this may, on the contrary, favor N oxide emissions and P leaching from the soil.

      (2) Exaggerated Concerns:

      The paper begins with the concern that carbon fertilization will lead to carbon dilution in our foods. While we indeed face numerous genuine threats in the coming decades, this particular issue is manageable. The increase in CO2 alone offers many opportunities for boosting yield. However, the heightened heat and increased evapotranspiration will pose massive challenges in many environments.

      While there are indeed multiple threats that we are facing in the coming decades, we don't fully agree with this comment. At present, there's no evidence to say that the negative effect of CO2 on plant mineral content will be manageable. Furthermore, there is compelling evidence that altered mineral nutrition and mineral status of plants will be an important factor limiting the high CO2-induced increase in yield, as will heat or increased evapotranspiration (see for instance Coskun et al (2016) Nutrient constraints on terrestrial carbon fixation: The role of Nitrogen. J. Plant Physiol. 203: 95-109; Jiang M (2020) Low phosphorus supply constrains plant responses to elevated CO2: A meta-analysis. Glob Chang Biol 26: 5856-5873; Reich PB (2006) Nitrogen limitation constrains sustainability of ecosystem response to CO2. Nature 440: 922-925). Thus, although we do not negate the crucial importance of heat and water stress, we believe it is relevant to study the basic mechanisms responsible for the negative effect of CO2 on plant mineral composition.

      Figure 4 in fact suggests that 43% of the REGMAP panel (cluster 3) is already pre-adapted to very high CO2 levels. This suggests annual species could adapt very rapidly.

      We agree with the reviewer. However, this suggests that genetic variation exists in some ecotypes to support adaptation to elevated CO2. The purpose of this work is indeed to identify this genetic variation, in order to characterize the mechanisms behind it.

      (3) Assumptions on CO2 Levels:

      The assumption of 900ppm seems to be based on a very extreme climate change scenario. Most people believe we will overshoot the 1.5°C scenario, however, it seems plausible that 2.5 to 3°C scenarios are more likely. This would correspond to around 500ppm of CO2. https://www.nature.com/articles/s41597-022-01196-7/tables/4

      We agree with the reviewer that the CO2 concentration we used corresponds to a high value in the IPCC projections. That said, this value is currently considered very plausible: the following figure (from Smith and Myers (2018) Nature Climate Change) shows that current CO2 emissions align with the IPCC's most extreme model (RCP 8.5), which would result in a CO2 concentration of around 900 ppm in 2100. Furthermore, nothing in the 6th IPCC report allows us to exclude the 4°C scenario.

      Author response image 1.

      (4) Focus on Real Challenges:

      We have numerous real challenges, such as extreme heat and inconsistent rainfall, to address in the context of climate change. However, testing under extreme CO2 conditions and then asserting that carbon dilution will negatively impact nutrition is exaggerated.

      While we fully agree that several threats linked to climate change exist, and all deserve to be studied, we find it questionable to consider that the potential effect of high CO2 on the mineral nutrition of plants is not a real challenge. The mineral nutrition of plants is already a current major environmental challenge. This perspective seems to reflect the reviewer's personal opinion rather than an analysis of our work.

      In contrast, the FACE experiments are fundamental and are conducted at more realistic eCO2 levels. Understanding the interaction between a 20% increase in CO2 and new precipitation patterns is key for global carbon flux prediction.

      Again, we do not fully understand this comment, as the aim of our study was not to perform a global carbon flux prediction, but to unravel genes and mechanisms underlying the negative effect of elevated CO2 on the nutrient content of Arabidopsis rosettes. However, we agree with the reviewer that FACE facilities are useful for exploring the CO2 response in more natural environments, and we highlight the fact that the decrease in mineral status of C3 plants has been widely documented in FACE studies. FACE experiments do not, however, allow fully controlled conditions (temperature, rainfall, wind and light intensity are not controllable in FACE), which are needed to disentangle the mechanisms by which elevated CO2 regulates the signaling pathways associated with the plant mineral composition. In the longer term, studying the mechanisms we have identified in a more global context of climate change could be highly relevant.

      As I look at the literature on commercial greenhouse tomato production, 1000ppm of eCO2 is common, but it also looks like the breeders and growers have already solved for flavor and nutrition under these conditions.

      Indeed, tomato is often cultivated in CO2-enriched greenhouses at 1000 ppm. According to the literature, this results in a 20-25% reduction in vitamin C or lycopene, and requires a significantly higher nitrogen and water intake to reach expected sugar levels (Doddrell H (2023) Horticulture Research). In addition, the negative effect of elevated CO2 on tomato nutrient content seems to have significant repercussions on nutrition-health properties (Boufeldja (2023), Molecules).

      Conclusion:

      While the study provides valuable insights into the genetic underpinnings of Arabidopsis thaliana's response to elevated CO2 levels, it requires an entirely revised writeup, especially in its abstract, broader claims and implications. The manuscript would benefit from a more thorough introduction, a clearer definition of its scope, and a clear focus on the limits of this study.

      We thank the reviewer for the comments made on our manuscript. In addition to the responses that we provide to these comments, we have modified the main text of the introduction, objectives and discussion to take these comments into consideration. We believe that this will significantly improve the manuscript.

      Reviewer #2 (Public Review):

      Strengths:

      The authors have conducted a large, well-designed experiment to test the response to eCO2. Overall, the experimental design is sound and appropriate for the questions about how a change in CO2 affects the ionome of Arabidopsis. Most of the conclusions in this area are well supported by the data that the authors present.

      We thank the reviewer for this positive appreciation.

      Weakness:

      While the authors have done good experiments, it is a big stretch from Arabidopsis grown in an arbitrary concentration of CO2 to relevance to human and animal nutrition in future climates. Arabidopsis is a great model plant, but its leaves are not generally eaten by humans or animals.

      We agree with the reviewer’s comment. We recognize that implying a direct contribution of our work to human nutrition in future climates is overstated, as Reviewer 1 also noted. This was not an intentional overstatement, as we have always been convinced that our work contributed to the understanding of the basic mechanisms involved in the negative regulation of plant mineral nutrition by high CO2. We have significantly modified the text to correct any misunderstanding of our work’s implication.

      The authors don't justify their choice of a CO2 concentration. Given the importance of the parameter for the experiment, the rationale for selecting 900 ppm as elevated CO2 compared to any other concentration should be addressed. And CO2 is just one of the variables that plants will have to contend with in future climates, other variables will also affect elemental concentrations.

      We agree with this comment. We added a justification of the high CO2 concentration used in this work in the Material and Methods section (lines 343-344). The choice is also explained in our response to Reviewer 1’s point 3.

      Given these concerns, I think the emphasis on biofortification for future climates is unwarranted for this study.

      Again, we agree with this comment and we have significantly modified the text to correct any misunderstanding of our work’s implication.

      Additionally, I have trouble with these conclusions:

      -Abstract "Finally, we demonstrate that manipulating the function of one of these genes can mitigate the negative effect of elevated CO2 on the plant mineral composition."

      -Discussion "Consistent with these results, we show that manipulating TIP2;2 expressions with a knock-out mutant can modulate the Zn loss observed under high CO2."

      The authors have not included the data to support this conclusion as stated. They have shown that this mutant increases the Zn content of the leaves when compared to WT but have not demonstrated that this response is different than in ambient CO2. This is an important distinction: one way to ameliorate the reduction of nutrients due to eCO2 is to try to identify genes that are involved in the mechanism of eCO2-induced reduction. Another way is to increase the concentration of nutrients so that the eCO2-induced reduction is not as important (i.e. a 10% reduction in Zn due to eCO2 is not as important if you have increased the baseline Zn concentration by 20%). The authors identified tip2 as a target from the GWAS on difference, but their validation experiment only looks at eCO2.
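
      The arithmetic behind the reviewer's parenthetical example can be checked with a short sketch. The numbers are the reviewer's illustrative percentages, not measurements from the study, and the function name is hypothetical.

```python
def relative_content(baseline_gain: float, eco2_loss: float) -> float:
    """Nutrient content under eCO2, relative to the unmodified line at ambient CO2.

    baseline_gain: fractional increase in baseline content (e.g. 0.20 for +20%).
    eco2_loss: fractional reduction caused by eCO2 (e.g. 0.10 for -10%).
    """
    return (1.0 + baseline_gain) * (1.0 - eco2_loss)

# Illustrative values taken from the reviewer's example.
unmodified = relative_content(0.0, 0.10)   # 10% loss, no change in baseline
raised = relative_content(0.20, 0.10)      # 20% higher baseline, same 10% loss

print(round(unmodified, 2), round(raised, 2))
```

      With these values the raised-baseline line still ends up above the unmodified ambient level even though its proportional eCO2 loss is unchanged, which is why the two strategies can only be told apart by measuring both CO2 conditions.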

      We thank the reviewer for this comment, and we agree with it. It is much more interesting, especially in the context of this paper, to analyze the function of a candidate gene not only in elevated CO2, but in both ambient and elevated CO2. Therefore, we added in Figure 7 data for the expression of TIP2;2 in contrasting haplotypes under ambient CO2, in comparison to those already presented under elevated CO2 (now Fig. 7C and 7D). This showed that TIP2;2 expression is lower in haplotype 0 also under ambient CO2. We also added in Figure 7 (Fig. 7E) the Zn level in WT and tip2;2-1 mutant under ambient CO2, in comparison to those already presented under elevated CO2. This showed that the tip2;2-1 mutant line did not present any decrease in Zn shoot content in response to elevated CO2, in contrast to what is observed for the WT.

      We have added comments on these new results in the Results and Discussion sections (lines 233-242 in the results section, and lines 310-314 in the discussion section).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Reviewer Comments on the Article's Approach to Ionome Analysis

      (1) Omission of Phosphorus from the Ionome:

      It's surprising that phosphorus (P) was not measured in the ionome. After nitrogen (N), P is often the most limiting mineral for plant development and yield, making it a significant component of the ionome. Why did the authors omit this crucial element?

      We agree with the reviewer that P is an important mineral for plant growth. The absence of data related to P content is due to feasibility constraints rather than oversight. The MP-AES instrument we used to analyze the ionome (except N and C, which we obtained from an Elementar Analyzer) would have required an extra step and an extra analysis to obtain data for macronutrients such as P or K. In the context of this large-scale experiment, we had to compromise and proceed without these data.

      (2) Relationship Between Leaf Ionome and Seed:

      The manuscript lacks evidence demonstrating the relationship between the leaf ionome and the seed. This connection is vital to establish the study's aims as outlined in lines 20-24. If the central argument is that eCO2 threatens food security, it's essential for the authors to either:

      • Provide evidence that eCO2 induces changes in the ionome profiles of seeds.

      • Show that changes in the rosette leaf ionome lead to alterations in seed ionome profiles.

      We agree with the reviewer. Although we know that the seed ionome composition of Arabidopsis model accessions such as Columbia is indeed negatively affected by eCO2, we do not provide the data that support some of the terms used in lines 20-24. The correspondence between leaf and seed ionome in natural populations under eCO2 is certainly a next question that we will address. Therefore, to align our stated objectives with our data, we have modified the sentence in lines 20-24. We also added a comment on this point in the discussion section (lines 324-328).

      (3) Analysis of Ionome in Rosette Leaves:

      Why did the authors choose to analyze the ionome specifically in rosette leaves? Is there a known correlation between the ionome profile in rosette leaves and seeds?

      See our answer to the above comment.

      (4) Experimental Design Comments:

      • The layout of the accession growouts, the methods of randomization, blocking, and controls/checks should be detailed.

      • Were BLUEs (Best Linear Unbiased Estimators) or BLUPs (Best Linear Unbiased Predictors) employed to account for experimental design conditions? If not, it's recommended that they be used.

      We thank the reviewer for this comment. A note on replicates has been added in the Method/Plant Material section. Concerning the BLUEs/BLUPs, although I am not familiar with their use, I do not think that these approaches are relevant in our experimental design. Indeed, we pooled 3 to 5 replicates for each accession to measure the ionome (as mentioned in the Method/Ionome analysis section – we realized this was perhaps not clear enough, and thus we reinforced this point in this section). Therefore, we do not have the variance data required to perform BLUEs/BLUPs.

      (5) Carbon Dilution Effect:

      The statement, "The first component of the PCA described a clear antagonistic trend between C content and the change of other mineral elements (Fig. 3B)..." suggests a well-understood carbon dilution effect. These results are anticipated and align with existing knowledge.

      We thank the reviewer for this comment. However, this sentence does not relate to the biomass dilution hypothesis referred to by the reviewer. Indeed, the composition of each mineral (C and others) is expressed as a percentage of biomass, not as an absolute value. Therefore, this reflects more a probable effect of the increase in carbon compounds (notably soluble sugars), which could influence mineral composition.

      (6) Heritability Estimates:

      The authors should report both the broad-sense heritability and an estimate of heritability based on a GRM or Kinship matrix.

      We thank the reviewer for this suggestion. We are skeptical of using a kinship matrix to estimate heritability in our study. Estimating narrow-sense heritability using a kinship matrix is conceptually based on the infinitesimal model of Fisher, thereby meaning that phenotypic variation is driven by hundreds to thousands of QTLs with small effects. If this is the case, GWAS conducted on several hundred (or even thousands) of genotypes will not be powerful enough to detect such QTLs. Accordingly, estimates of broad-sense heritability based on estimates of variance components can drastically differ from estimates of narrow-sense heritability based on the use of a kinship matrix, as illustrated in the study of Bergelson et al. (2019 Scientific Reports).

      (7) Application of the Breeder's Equation:

      It would be beneficial if the authors applied the breeder's equation to estimate the species' potential rate of response. Based on the allele frequency of the adapted cluster 3 (69 ecotypes or 43% frequency of Figure 3B), it seems plausible that the populations could adapt within 23 generations.

      We thank the reviewer for this suggestion. Indeed, it would be really interesting to test whether sub-populations could adapt in comparison with others, and over what period of time. It is nevertheless not possible to do so using the Breeder’s equation in our case, as this requires fitness data under conditions of ambient or elevated CO2 (i.e. production of seeds) to be applied, and we do not have these data at the level of the whole population.

      (8) Overall Quality:

      In general, the authors have executed a high-quality ionome mapping experiment. However, the abstract, introduction, and discussion should be entirely rewritten and reframed.

      We thank the reviewer for the positive evaluation of our experiment. As previously mentioned, we are for the most part in agreement with the comments made about the need to align our stated objectives with our experimental data and conclusions. To do so, we have rewritten part of the abstract, introduction and discussion. The details of these modifications are described in the responses made to each comment.

      Here's a line-by-line list of suggestions on writing:

      Line 30 would read better with a comma after thus (or by replacing thus with therefore and then a comma at the start of the sentence).

      Line 33 nevertheless would read better in between commas.

      Lines 45 - 48 sentence is too long, could probably divide it into two.

      Lines 90 - 94 are hard to interpret, recommend rephrasing for clarity.

      Line 130 - keep verbs in the past tense for consistency (ran instead of run).

      Line 194 - what do the authors mean by crossed? I'm inferring they looked at the intersection of DEGs with the list of genes identified by GWA mapping, probably should use a more concise word.

      There's a concurrent use of the adjective strong (Lines 80, 142, 144, 197, 245). I would advise using a more concise adjective or avoiding its use to let the reader form their own opinion on the data.

      Lines 174-176 the cited reference (No. 15) is incorrect. The study by Katz et al. (2022) does not provide information on the role of ZIF1 in zinc sequestration mechanisms under elevated CO2 conditions.

      We thank the reviewer for these detailed recommendations. We have corrected or rephrased the text according to these suggestions.

      Reviewer #2 (Recommendations For The Authors):

      Technical points:

      900 ppm as elevated CO2: Given the importance of the parameter for the experiment, the rationale for selecting 900 ppm as elevated CO2 compared to any other concentration should be addressed.

      We acknowledge the reviewer's point and have addressed related aspects earlier in our response. In line with this, we have included a justification for this particular parameter in the Method section.

      The authors do not mention what genotype was used for their root/shoot RNAseq experiment.

      We thank the reviewer for this comment, and indeed, this information was not mentioned. This is now done, in the Method section.

      Line 125: Spelling error "REGMPA".

      This has been corrected.

      Line 338: Removal of outlier observations - "Prior to GWAS and multivariate analyses such as PCA or clustering, mineral composition measures were pre-processed to remove technical outliers". The authors should mention the exact number of outliers that were removed and what the explicit criteria were for removal.

      The number of outliers removed from each dataset is now indicated in Supplemental Table 7 (this is cited in the Method section). The explicit criterion used for this analysis is stated in the corresponding Method section: “the values positioned more than 5 median absolute deviations away from the median were removed from the dataset”.
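
      The quoted criterion can be illustrated with a minimal sketch. The function name `remove_mad_outliers`, the example values, and the use of the unscaled median absolute deviation are assumptions made for illustration; the response does not state whether a scaling constant was applied or which software was used.

```python
from statistics import median


def remove_mad_outliers(values, n_mads=5.0):
    """Keep values lying within n_mads median absolute deviations of the median.

    Assumption: the MAD is used unscaled (no 1.4826 normal-consistency factor).
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half of the values equal the median, so the threshold
        # would collapse to zero; keep everything rather than over-filter.
        return list(values)
    return [v for v in values if abs(v - med) <= n_mads * mad]


# Hypothetical mineral measurements with one technical outlier.
measurements = [10, 11, 12, 13, 14, 100]
cleaned = remove_mad_outliers(measurements)
```

      In this hypothetical example the median is 12.5 and the MAD is 1.5, so the cut-off is 7.5 units from the median and only the value 100 is discarded.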

      Line 379: "Lowly expressed genes with an average value across conditions under 25 reads were excluded from the analysis". Providing information about the number of the lowly expressed genes that were removed from the analysis can help with the interpretation of the likelihood of the candidates selected being correct.

      This is a standard procedure in RNAseq analysis. It avoids many false positives in the differential analysis of gene expression based on ratios (where a very small number in the denominator can lead to a very high variation in expression, of no real significance). For information, this step led to the removal of 11607 and 10121 genes for the shoot and root datasets.
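
      A minimal sketch of this filtering step is given below, assuming read counts are held as one list per gene with one entry per condition. The function name `filter_low_expression` and the gene identifiers are hypothetical; only the threshold of 25 reads comes from the manuscript text quoted above.

```python
def filter_low_expression(counts, min_mean_reads=25):
    """Drop genes whose average read count across conditions is under the threshold.

    counts: dict mapping gene ID -> list of read counts (one per condition).
    Returns the retained genes and the number of genes removed.
    """
    kept = {
        gene: reads
        for gene, reads in counts.items()
        if sum(reads) / len(reads) >= min_mean_reads
    }
    return kept, len(counts) - len(kept)


# Hypothetical counts: geneB averages 2.5 reads, so a ratio between its two
# conditions (4 / 1 = 4-fold) would look large despite being noise.
example_counts = {"geneA": [100, 200], "geneB": [1, 4], "geneC": [25, 25]}
kept, n_removed = filter_low_expression(example_counts)
```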

      Line 384: It's not clear how many biological replicates were used.

      This has been corrected.

      Additional comment: We have also become aware of a confusion concerning one of the candidate genes located close to GWA peaks: in line 180 of the first version, we mentioned CAX1 (AT1G16380) for its role in the nutrient deficiency response. There are actually two genes annotated as CAX1 in TAIR (both are cation exchangers), but the one involved in the nutrient deficiency response is AT2G38170. We therefore removed the sentence mentioning AT1G16380/CAX1 as a potential candidate gene.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      This paper performed a functional analysis of the poorly characterized pseudo-phosphatase Styxl2, one of the targets of the Jak/Stat pathway in muscle cells. The authors propose that Styxl2 is essential for de novo sarcomere assembly by regulating autophagic degradation of non-muscle myosin IIs (NM IIs). Although a previous study by Fero et al. (2014) has already reported that Styxl2 is essential for the integrity of sarcomeres, this study provides new mechanistic insights into the phenomenon. In vivo studies in this manuscript are compelling; however, I feel the contribution of autophagy in the degradation of NM IIs is still unclear.

      Major concerns:

      1) The contribution of autophagy in the degradation of Myh9 is still unclear to this reviewer.

      It has been reported that autophagy is dispensable for sarcomere assembly in mice (Cell Metab, 2009, PMID; 1994508). In Fig. 7A, the authors showed that overexpressed Styxl2 downregulated the amount of ectopically expressed Myh9 in an ATG5-dependent manner in C2C12 cells; however, the experiment is far from a physiological condition. Therefore, the authors should test ATG5 knockdown and the genetic interaction between Styxl2 and ATG5 in vivo. That is, 1) loss of ATG5 on sarcomere assembly in zebrafish, and 2) the genetic interaction between Styxl2 and ATG5; co-injection of Styxl2 mRNA and ATG5-MO into the zebrafish embryos.

      Our response: In fact, the reference cited by the reviewer (Cell Metab, 2009, PMID: 19945408) clearly indicated that autophagy is required for sarcomere assembly. Moreover, another paper using the fish extraocular muscle regeneration model (Autophagy, 2014, PMID: 27467399) also showed that the sarcomere structure was disrupted in the regenerated muscles when autophagy was inhibited by chloroquine. In addition, other references (Nature medicine, 2007, PMID: 17450150; Autophagy, 2010, PMID: 20431347) also showed that loss of Atg5 in mouse cardiac muscles led to disorganized sarcomere structure. We also performed the Atg5 knockdown experiments as suggested by the reviewer. However, the sarcomere structure defects were not as obvious as those caused by Styxl2 knockdown (see Author response image 1 below). In fact, it was reported that Atg5 knockdown may not be a desirable strategy to disrupt autophagy, as it was found that “--- only a small amount of Atg5 is needed for autophagy, knockdown of Atg5 to levels low enough to block autophagy might be difficult to achieve, --” (Nature medicine, 2007, PMID: 17450150). Due to the ineffectiveness of the Atg5 MO in our assays, we did not perform the second experiment suggested by the reviewer. Moreover, as Styxl2 is not a key component of the autophagy machinery, it is unlikely that overexpression of Styxl2 alone can rescue the autophagy defects caused by Atg5 knockdown.

      Author response image 1.

      The fish zygotes were injected with Atg5 or Ctrl MO. At 48 hpf, the fish were stained with an anti-Actinin antibody. Some fast muscle fibers were disrupted when Atg5 was knocked down. The numerator at the bottom of each image represents the number of fish embryos showing a normal Actinin staining pattern, while the denominator represents the total number of embryos examined. Scale bar, 10 µm.

      2) As referenced, Yamamoto et al. reported that Myh9 is degraded by autophagy. Mechanistically, Nek9 acts as an autophagic adaptor that bridges Atg8 and Myh9 through interactions with both. Inconsistent with the model, the authors mentioned on page 12, lines 365-367, "A recent report showed that Myh9 could also undergo Nek9-mediated selective autophagy (Yamamoto et al., 2021), suggesting that Myh9 is ubiquitinated". I think it is not yet explored whether autophagic degradation of Myh9 requires its ubiquitination. Moreover, I cannot judge whether Myh9 is ubiquitinated in a Styxl2-dependent manner from the data in Fig. 7C. The author should test whether Nek9 is required for Myh9 degradation in muscles. If Nek plays a role in the Myh9 degradation, it would be better to remove Fig. 7C.

      Our response: Indeed, as pointed out by the reviewer, it has not been explored whether Myh9 is ubiquitinated or not. However, it has been well-established that some proteins undergoing autophagic degradation are ubiquitinated, which are linked to Atg8/LC3 via p62 and NBR1 (Mol Cell, 2009, PMID: 19250911; J Biol Chem, 2007, PMID: 17580304). To improve the data quality, we repeated the Myh9 ubiquitination experiment in cells with or without Styxl2 by using a slightly different strategy: as shown in the revised Figure 7C, we first co-transfected HEK 293T cells with HA-Myh9, Myc-ubiquitin, and Flag-Styxl2. We then immunoprecipitated Myc-tagged Ubiquitin from the whole cell lysates and blotted for HA-Myh9. We detected an obvious increase in Ubiquitin-conjugated HA-Myh9 (revised Figure 7C). As suggested by the reviewer, we also tested whether knockdown of Nek9 affects the degradation of Myh9. We failed to detect an obvious effect (see Author response image 2 below) caused by Nek9 knockdown. One possible explanation for this negative result is that Nek9 itself is a negative regulator of selective autophagy (J Biol Chem, 2020, PMID: 31857374). By knocking it down, the functions of the autophagy machinery are expected to be enhanced instead of being impaired. This may explain why we failed to detect an effect on Myh9 degradation simply by knocking down Nek9. To further elucidate whether Nek9 is involved in Myh9 degradation in myoblasts, we may need to use a dominant-negative mutant of Nek9 missing the LCIII-binding motif as shown by Yamamoto (Nat Commun, 2021, PMID: 34078910). This will be addressed in our future study.

      Author response image 2.

      C2C12 cells were transfected with negative control siRNA (NC), siNek9#2 or siNek9#3. 18 h later, the cells were transfected with plasmids HA-Myh9 and Flag-Styxl2 or Flag-Stk24. After another 24 h, the cells were harvested for RT-qPCR (left panel) or western blot (right panel).

      3) In Fig. 5F, the protein level of Styxl2 and Myh10 should be checked because the efficiency of Myh10-MO was not shown anywhere in this manuscript.

      Our response: As suggested by the reviewer, a Western blot showing the protein levels of Myh10 is now provided in Figure 5-figure supplement 1B.

      Reviewer #2 (Public Review):

      The authors investigated the role of the Jak1-Stat1 signaling pathway in myogenic differentiation by screening the transcriptional targets of Jak1-Stat1 and identified Styxl2, a pseudophosphatase, as one of them. Styxl2 expression was induced in differentiating muscles. The authors used a zebrafish knockdown model and conditional knockout mouse models to show that Styxl2 is required for de novo sarcomere assembly but is dispensable for the maintenance of existing sarcomeres. Styxl2 interacts with the non-muscle myosin IIs, Myh9 and Myh10, and promotes the replacement of these non-muscle myosin IIs by muscle myosin IIs through inducing autophagic degradation of Myh9 and Myh10. This function is independent of its phosphatase domain.

      A previous study using zebrafish found that Styxl2 (previously known as DUSP27) is expressed during embryonic muscle development and is crucial for sarcomere assembly, but its mechanism remains unknown. This paper provides important information on how Styxl2 mediates the replacement of non-muscle myosin with muscle myosin during differentiation. This study may also explain why autophagy deficiency in muscles and the heart causes sarcomere assembly defects in previous mouse models.

      Reviewer #3 (Public Review):

      Wu and colleagues are characterising the function of Styxl2 during muscle development, a pseudo-phosphatase that was already described to have some function in sarcomere morphogenesis or maintenance (Fero et al. 2014). The authors verify a role for Styxl2 in sarcomere assembly/maintenance using zebrafish embryonic muscles by morpholino knockdown and by a conditional Styxl2 allele in mice (knocked-out in satellite cells with Pax7 Cre).

      Experiments using a tamoxifen inducible Cre suggest that Styxl2 is dispensable for sarcomere maintenance and only needed for sarcomere assembly.

      BioID experiments with Styxl2 in C2C 12 myoblasts suggest binding of nonmuscle myosins (NMs) to Styxl2. Interestingly, both NMs are downregulated when muscles differentiate after birth or during regeneration in mice. This down-regulation is reduced in the Styxl2 mutant mice, suggesting that Styxl2 is required for the degradation of these NMs.

      Impressively, reducing one NM (zMyh10) by double morpholino injection in a Styxl2 morphant zebrafish does improve zebrafish mobility and sarcomere structure. Degradation of Myh9 is also stimulated in cell culture if Styxl2 is co-expressed. Surprisingly, the phosphatase domain is not needed for these degradation and sarcomere structure rescue effects. Inhibitor experiments suggest that Styxl2 does promote the degradation of NMs by promoting the selective autophagy pathway.

      Strengths:

      A major strength of the paper is the combination of various systems, mouse and fish muscles in vivo to test Styxl2 function, and cell culture including a C2C12 muscle cell line to assay protein binding or protein degradation as well as inhibitor studies that can suggest biochemical pathways.

      Weakness:

      The weakness of this manuscript is that the sarcomere phenotypes and also the western blots are not quantified. Hence, we rely on judging the results from a single image or blot. Also, Styxl2 role in sarcomere biology was not entirely novel.

      Few high resolution sarcomere images are shown, myosins have not been stained for.

      Reviewer #1 (Recommendations For The Authors):

      Minor concerns:

      4) The position of molecular weight markers should be shown in all Western blot data.

      Our response: As suggested by the reviewer, the molecular weight markers have been added in the Western blot data.

      5) Schematic models of Styxl2deltaN509 and N513 construct would be helpful for the readers.

      Our response: A schematic has been added in Figure 6B (upper panel) to show Styxl2deltaN509 and Styxl2N513.

      6) Several data were described but not shown (data not shown). I think the data need to be included in the main or supplemental figures.

      Our response: As suggested by the reviewer, the raw data are now included in Figure 6-figure supplement 1A and Figure 7-figure supplement 1.

      Reviewer #2 (Recommendations For The Authors):

      1) In Fig. 5E, the authors suggest that the needle touch response was improved by additional knockdown of Myh10. This is a bit confusing because the germline knockout of Myh10 is lethal (line 445). The authors should provide more explanation on this point. Additionally, it would be better to include Myh10-MO in Fig. 5E.

      Our response: In line 445 of our original manuscript, we stated that germline knockout of the mouse Myh10 gene is lethal based on a published report (Proc Natl Acad Sci USA, 1997, PMID: 9356462). Here, in zebrafish zygotes, we only knocked down zMyh10; thus, we do not expect to get a lethal phenotype. In addition, other groups who knocked down Myh10 in fish also did not get a lethal phenotype (Dev Biol, 2015, PMID: 25446029). As to the control involving Myh10-MO in the experiment in Fig. 5E, we did include it in our experiments. As we did not observe any obvious effects on either motility or sarcomere structures, we did not include the data set in the figure.

      2) It was suggested that Myh9 and Myh10 form a complex (Rao et al. PLoS One 9, e114087, 2014). Thus, the IP experiments do not rule out the possibility that Styxl2 directly interacts with either Myh9 or Myh10 and indirectly with the other.

      Our response: In known myosin-II complexes, different myosin molecules can associate with each other through their tail domains (Bioarchitecture, 2013, PMID: 24002531). Thus, if we use full-length myosin molecules in our co-immunoprecipitation assays, it will be difficult to exclude the possibility raised by the reviewer. However, by using truncated myosin proteins, we showed that the head domain of either Myh9 or Myh10 could interact with Styxl2 in the absence of the tail domain (Figure 4E, F). This result strongly suggests that both Myh9 and Myh10 can independently interact with Styxl2.

      Reviewer #3 (Recommendations For The Authors):

      1) The western blot shown in Figure 3B supporting the induced deletion of Styxl2 should be quantified. Ideally, some other blots, e.g., in Figure 5, too. Please add the age of the mice in Figure 5B to the figure legend.

      Our response: As suggested by the reviewer, we quantified the data in Figures 3B, 3F, 5B, 5D, and 7A and the data were included in the revised figures. In Fig. 5B, we already indicated the age of the mice (i.e., P1) in the legend.

      2) A quantification of the sarcomere phenotypes in the double knock-down of zMyh10 and Styxl2 compared to Styxl2 single would make the paper significantly stronger. Furthermore, a double morpholino control should be included to rule out any RNAi machinery 'dilution effect'.

      Our response: As suggested by the reviewer, we quantified the sarcomere structures using the line scan analysis in ImageJ and the scan images were placed as insets in the upper corner of the immunofluorescent images (revised Figures 5F and 6C). To avoid potential “dilution effects”, in all the experiments involving the use of two different MOs, the total amount of MO was kept the same in all control samples by including a control MO (e.g., in samples treated with one specific MO, an equal amount of a control MO was also included, while in samples without any specific MO, twice as much control MO was used).

      3) The sarcomere phenotypes in figure 6 should also be better quantified, for example using simple line scans of the alpha-actinin stains and assay periodicity or calculating the autocorrelation coefficients. How about myosin stains?

      Our response: We quantified Figure 6C as suggested by the reviewer. We also performed myosin staining. The results were similar to those obtained with the α-actinin antibody (see revised Figure 6-figure supplement 1B).
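
      The reviewer's alternative suggestion, calculating autocorrelation coefficients from a line scan, can be sketched as follows. This is not the ImageJ line scan procedure the authors report using; the function names `autocorrelation` and `dominant_period` and the synthetic cosine profile are assumptions made for illustration, and a real intensity profile would be exported from the image first.

```python
import math


def autocorrelation(profile, lag):
    """Autocorrelation coefficient of an intensity profile at a given pixel lag."""
    n = len(profile)
    mean = sum(profile) / n
    var = sum((x - mean) ** 2 for x in profile)
    if var == 0:
        return 0.0  # a flat profile has no periodic structure
    cov = sum((profile[i] - mean) * (profile[i + lag] - mean) for i in range(n - lag))
    return cov / var


def dominant_period(profile):
    """Return (lag, coefficient) of the first local autocorrelation maximum.

    The lag estimates the repeat distance in pixels; a coefficient close to 1
    indicates regular striation, one close to 0 indicates disorder.
    """
    max_lag = len(profile) // 2
    acf = [autocorrelation(profile, k) for k in range(max_lag + 1)]
    for lag in range(1, max_lag):
        if acf[lag] > acf[lag - 1] and acf[lag] >= acf[lag + 1]:
            return lag, acf[lag]
    return None, 0.0


# Synthetic stand-in for a line scan: a stripe pattern repeating every 10 pixels.
synthetic_profile = [math.cos(2 * math.pi * i / 10) for i in range(100)]
period_px, peak = dominant_period(synthetic_profile)
```

      Multiplying the recovered lag by the pixel size would give the repeat distance in µm, which could then be compared between knockdown and control fibers.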

      4) Do the authors see periodic NMs patterns in developing mouse muscle fibers as indicated by the model in figure 7D? It is unclear if nonmuscle myosin is present in a PERIODIC pattern in early myofibrils. NM myosin periodic patterns that have been observed have a periodicity of only about 1 µm fitting the shorter length of the NM bipolar filaments (about 300 nm only, PMID 28114270).

      Our response: The reviewer raised a good point here. Ideally, we should examine developing mouse muscle fibers to prove that NM shows periodic patterns. However, due to the difficulty in catching myocytes undergoing sarcomere assembly, the majority of the studies involving NM in sarcomeres use cultured cardiomyocytes. Using TA muscles from P1 new-born mice, we failed to detect the presence of NM in sarcomeres (see Author response image 3 below). Actually, nearly all the myofibers showed a mature sarcomere pattern without the NM signal. More work is needed in the future to examine developing mouse fibers at different embryonic stages to look for the presence of NM in developing sarcomeres.

      Author response image 3.

      The TA muscles were collected from male and female P1 mice. The muscles were sectioned and co-stained for α-actinin (Actn) and Myh9. The majority of myofibrils are mature without the NM II signal. Scale bar, 10 µm.

      5) Recent work suggested that mechanical tension is key to assemble the first long periodic myofibril containing immature sarcomeres. Tension is likely produced by a combination of NM and Mhc in the assembling sarcomeres themselves. This could be included in the introduction or discussion (PMIDs 24631244, 29316444, 29702642, 35920628).

      Our response: We thank the reviewer for pointing us to additional relevant references. We have added them in the Introduction.

      6) I suggest replacing "sarcomeric muscles" with "striated muscles".

      Our response: We revised the term in the manuscript as suggested by the reviewer.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      We appreciate the valuable and constructive comments of Reviewer #1 on our manuscript. We have addressed the comments from Reviewer #1 in the public review in the response to the recommendations for the authors, as the public review comments largely overlap with the recommendations for the authors.

      Reviewer #1 (Recommendations For The Authors):

      (1.1) Figure 1 did not use a mock-infected control for the development of R-loops but only a time before infection. I think it would have been a good control to have that after the same time of infection non-infected cells did not show increases in R-loops and this is not a product of the cell cycle.

      We prepared our DRIPc-seq library using cell extracts harvested at 0, 3, 6, and 12 h post-infection (hpi), all at the same post-seeding time point. Each sample was infected with HIV-1 virus in a time-dependent manner. Therefore, it is unlikely that the host cellular R-loop induction observed in our DRIPc-seq results was due to R-loop formation during the cell cycle. In Lines 93–95 of the Results section of the revised manuscript, we have provided a more detailed description of our DRIPc-seq library experimental scheme. Thank you. 

      (1.2) Figure 2 should have included a figure showing the proportion of DRIPc-seq peaks located in different genome features relative to one another instead of whether they were influenced by time post-infection. Figure 2C was performed in HeLa cells, but primary T cell data would have been more relevant as primary CD4+ T cells are more relevant to HIV infection.

      We have included a new figure presenting the relative proportion of DRIPc-seq peaks mapped to different genomic features at each hpi (Fig. 2C of the revised manuscript). We found that the proportion of DRIPc-seq peaks mapped to various genomic compartments remained consistent over the hours following the HIV-1 infection. This further supports our original claim that HIV-1 infection does not induce R-loop enrichment at specific genomic features but that the accumulation of R-loops after HIV-1 infection is widely distributed.

      We considered HeLa cells as the primary in vitro infection model; therefore, we conducted RNA-seq only on HeLa cells. However, we agree with the reviewer's opinion that data from primary CD4+ T cells may be more physiologically relevant. Nevertheless, as demonstrated in the new figure (Fig. 2C of the revised manuscript), HIV-1 infection did not significantly alter the proportion of R-loop peaks mapped to specific genomic compartments, such as gene body regions, in HeLa, primary CD4+ T, and Jurkat cells. Therefore, we anticipate no clear correlation between changes in gene expression levels and R-loop peak detection upon HIV-1 infection, even in primary T cells. Thank you.

      (1.3) Figure 5G is very hard to see when printed, is there a change in brightness or contrast that could be used? The arrows are helpful but they don't seem to be pointing to much.

      We have highlighted the intensity of the PLA foci and magnified the images in Fig. 5G in the revised manuscript. While editing the images according to your suggestion, we found a misannotation regarding the multiplicity of infection in the number of PLA foci per nucleus quantification analysis graph in Fig. 5G of the original manuscript. We have corrected this issue and hope that it is now much clearer. 

      (1.4) The introduction provided a good background for those who may not have a comprehensive understanding of DNA-RNA hybrids and R-loops, but the rationale that integration in non-expressed sequence implies that R-loops may be involved is very weak and was not addressed experimentally. A better rationale would have been to point out that, although integration in genes is strongly associated with gene expression, the association is not perfect, particularly in that some highly expressed genes are, nonetheless, poor integration targets.

      In accordance with the reviewer's comment, we revised the Introduction. We have deleted the statement and reference in the introduction "... the most favored region of HIV-1 integration is an intergenic locus, ...”, which may overstate the relevance of the R-loop in HIV-1 integration events in non-expressed sequences. Instead, we introduced a more recent finding that high levels of gene expression do not always predict high levels of integration, together with the corresponding citation (Lines 46– 47 of the revised manuscript), according to the reviewer’s suggestion in the reviewer's public review 2)-(a).

      (1.5) The discussion was seriously lacking in connecting their conclusions regarding R-loop targeting of integration to how integration works at the structural level, where it is very clear that concerted integration on the two DNA strands ca 5 bp apart is essential to correct, 2-ended integration. It is very difficult to visualize how this would be possible with the triple-stranded R-loop as a target. The manuscript would be greatly strengthened by an experiment showing concerted integration into a triple-stranded structure in vitro using PICs or pure integrase.

      We believe there has been a misunderstanding of our interpretation regarding the putative role of R-loop structures in the HIV-1 integration site mechanism because of some misleading statements in our original manuscript. Based primarily on our current data, we believe that R-loop structures are bound by HIV-1 integrase proteins and lead to HIV-1 viral genome integration into the vicinity regions of the host genomic R-loops. On carefully reviewing our manuscript, we found that the title, abstract, and discussion of our original manuscript include phrases such as “HIV-1 targets R-loops for integration,” which may overstate our finding on the role of R-loops in HIV-1 integration site selection. We replaced these phrases. For example, we used phrases such as “HIV-1 favors vicinity regions of R-loop for the viral genome integration,” in the revised manuscript. We apologize for the inconvenience caused by the unclear and nonspecific details of our findings.

      Using multiple biochemical experiments, we successfully demonstrated the interaction between the cellular R-loop and HIV-1 integrase proteins in cells and in vitro (Fig. 5 of the revised manuscript). However, we could not validate whether the center of the triple-stranded R-loops is the exact site of HIV-1 integration, where the strand transfer reaction by integrase occurs. This is because an R-loop can be multi-kilobase in size (1, 2); therefore, we displayed a large-scale genomic region (30-kb windows) to present the integration sites surrounding the R-loop centers. Nevertheless, we believe that we validated R-loop-mediated HIV-1 integration in R-loop-forming regions using our pgR-poor and pgR-rich cell line models. When infected with HIV-1, pgR-rich cells, but not pgR-poor cells, showed higher infectivity upon R-loop induction in designated regions following DOX treatment (Fig. 3C and 3D of the revised manuscript). In addition, we quantified site-specific integration events in R-loop regions, and found that a greater number of integration events occurred in designated regions of the pgR-rich cellular genome upon R-loop induction by DOX treatment, but not in pgR-poor cells (Fig. 3E–G of the revised manuscript).

      We agree with the reviewer that an experiment showing the concerted integration of purified PICs into a triple-stranded structure in vitro would greatly strengthen our manuscript. We attempted the purification of viral DNA (vDNA)-bound PICs using either Sso7d-tagged HIV-1 integrase proteins or non-tagged HIV-1 integrase proteins (F185K/C280S) procured from the NIH HIV reagent program (HRP-20203), following the method described by Passos et al., Science, 2017; 355 (89-92) (3). Despite multiple attempts, we could not purify the nucleic acid-bound protein complexes for in vitro integration assays. However, we believe that pgR-poor and pgR-rich cell line models provide a strong advantage in specificity of our primer readouts. Compounded with our in cellulo observation, we believe that our work provides strong evidence for a causative relationship between R-loop formation/R-loop sites and HIV-1 integration.

      Additionally, in the Discussion section of the revised manuscript, we have expanded our discussion on the role of genomic R-loops in molding the host genomic environment for HIV-1 integration site selection, and on the potential explanation of how R-loops drive integration over long-range genomic regions. Thank you.

      (1.6) There are serious concerns with the quantitation of integration sites used here, which should be described in detail following line 503 but isn't. In Figure 3, E-G, they are apparently shown as reads per million, while in Figure 4B as "sites (%)" and in 4C as "log10 integration frequency." Assuming the authors mean what they say, they are using the worst possible method for quantitation. Counting reads from restriction enzyme-digested, PCR-digested DNA can only mislead. At the numbers provided (MOI 0.6, 10 µg DNA assayed) there would be about 1 million proviruses in the samples assayed, so the probability of any specific site being used more than once is very low, and even less when one considers that a 10% assay efficiency is typical of integration site assays. Although the authors may obtain millions of reads per experiment, the number of reads per site is an irrelevant value, determined only by technical artefacts in the PCR reactions, most significantly the length of the amplicons, a function of the distance from the integration site to the nearest MstII site, further modified by differences in Tm. Better is to collapse identical reads to 1 per site, as may have been done in Figure 4B, however, the efficiency of integration site detection will still be inversely related to the length of the amplicon. Indeed, if the authors were to plot the read frequency against distance to the nearest MstII site, it is likely that they would get plots much like those in Figure 4B.

      Detailed methods for integration site sequencing data processing are described in the Materials and Methods section of the revised manuscript (Lines 621–631 of the revised manuscript). We primarily followed HIV-1 integration site sequencing data processing methods previously described by Li et al., mBio, 2020; 11(5) (4).

      While it may be correct that the HIV-1 integration event cannot occur more than once at a given site, our Fig. 3E, 4C, and 4D of the revised manuscript present the number of integration-site sequencing read counts expressed in reads-per-million (RPM) units or as log10-normalized values. Based on the number of mapped reads from the integration site sequencing results, we can infer that there was an integration event at this site, whether it was a single or multiple event.

      We believe that the original y-axis label, “Integration frequency,” may be misleading, as it can be interpreted as the probability of any specific site being used for HIV-1 integration. Therefore, we changed it to “number of mapped read” for clarity (Fig. 3E–G, 4C and 4D, and the corresponding figure legends of the revised manuscript). We apologize for any confusion. Thank you.
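
      The two quantities under discussion, per-site read counts normalized to reads per million and the set of unique sites obtained by collapsing identical reads, can be contrasted in a short sketch. The function names and the coordinates below are hypothetical and are not taken from the authors' pipeline.

```python
from collections import Counter


def reads_per_million(site_reads):
    """Normalize per-site read counts to reads per million (RPM).

    site_reads: list with one (chromosome, position) entry per mapped read.
    """
    total = len(site_reads)
    counts = Counter(site_reads)
    return {site: n * 1_000_000 / total for site, n in counts.items()}


def unique_sites(site_reads):
    """Collapse identical reads so that each integration site is counted once."""
    return sorted(set(site_reads))


# Hypothetical mapped reads: three from one site and one from another.
mapped = [("chr1", 100)] * 3 + [("chr2", 500)]
rpm = reads_per_million(mapped)
sites = unique_sites(mapped)
```

      In this toy example the RPM values differ threefold between the two sites, whereas the collapsed list records each site exactly once, which is the distinction the reviewer draws between read abundance and the number of integration events.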

      Other points:

      (1.7) Overall: There are numerous grammatical and usage errors, especially in agreement of subject and verb, and missing articles, sometimes multiple times in the same sentence. These must be corrected prior to resubmission.

      The revised manuscript was edited by a professional editing service. Thank you.

      (1.8) Line 126-134: A striking result, but it needs more controls, as discussed above, including a dose-response analysis.

      We determined the doses of NVP and RAL inhibitors in HeLa cells by optimizing the minimum dose of drug treatment that provided a sufficient inhibitory effect on HIV-1 infection (Author response image 1). The primary objective of this experiment was to determine R-loop formation while reverse transcription or integration of the HIV-1 life cycle was blocked; therefore, we do not think that a dose-dependent analysis of inhibitors is required.

      Author response image 1.

      (A and B) Representative flow cytometry histograms of VSV-G-pseudotyped HIV-1-EGFP-infected HeLa cells at an MOI of 1, harvested at 48 hpi. The cells were treated with DMSO, the indicated doses of nevirapine (NVP) (A) or indicated doses of raltegravir (RAL) (B) for 24 h before infection. 

      (1.9) Line 183: Please tell us what ECFP is and why it was chosen. Is there a reference for its failure to form R-loops?

      Ibid: The human AIRN gene is a very poor target for HIV integration in PBMC.

      A high GC skew value (> 0) is a predisposing factor for R-loop formation at the transcription site. This is because a high GC skew causes a newly synthesized RNA strand to hybridize to the template DNA strand, and the non-template DNA strand remains looped out in a single-stranded conformation (5) (Ref 36 in the revised manuscript). The ECFP sequence possesses a low GC skew value and has previously been used as an R-loop-negative control sequence (6) (Ref 17 of the revised manuscript). We have added this description and the corresponding references to Lines 188–192 of the revised manuscript.

      The human AIRN gene (RefSeq DNA sequence: NC_000006.12) sequence possesses a GC skew value of -0.04, in a window centered at base 2186, while the mouse AIRN (mAIRN) sequence is characterized by a GC skew value of 0.213. The ECFP sequence gave a GC skew value of -0.086 in our calculation. We anticipated that the human AIRN gene region does not form a stable R-loop, and in fact, it did not harbor R-loop enrichment upon HIV-1 infection in our DRIPc-seq data analysis of multiple cell types (Author response image 2).

      Author response image 2.

      Genome browser screenshot over the chromosomal regions in 20-kb windows centered on human AIRN showing results from DRIPc-seq in the indicated HIV-1-infected cells (blue, 0 hpi; yellow, 3 hpi; green, 6 hpi; red, 12 hpi)
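
For readers who wish to reproduce this kind of comparison, GC skew is commonly computed as (G − C)/(G + C) over a sequence window. The following Python sketch illustrates that calculation. The function names, the window handling, and the assumption that the sequence is supplied as the non-template strand read in the direction of transcription are illustrative assumptions; they are not the pipeline used for the values reported above.

```python
def gc_skew(seq: str) -> float:
    """Return (G - C) / (G + C) for a DNA sequence; 0.0 if it contains no G or C."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq: str, center: int, window: int) -> float:
    """GC skew in a window of about `window` bases around 0-based index `center`.

    The window is clipped at the sequence ends (hypothetical convention).
    """
    half = window // 2
    start = max(0, center - half)
    end = min(len(seq), center + half)
    return gc_skew(seq[start:end])
```

A positive value indicates more G than C in the window, and a negative value indicates more C than G.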

      (1.10) Line 190: You haven't shown dependence. Associated is a better word.

      Thank you for the suggestion. We have changed “R-loop-dependent site-specific HIV-1 integration events...” to “R-loop-associated site-specific HIV-1 integration events...” (Line 198 of the revised manuscript), as the reviewer suggested.

      (1.11) Line 239: What happened to P1? What is the relationship of the P and N regions to genes?

      We have added superimpositions of the P1 chromatin region on DRIPc-seq and the HIV-1 integration frequency to Figure 4C of the revised manuscript. We observed a relevant integration event within the P1 R-loop region, but to a lesser extent than in the P2 and P3 R-loop regions, perhaps because the P1 region has relatively less R-loop enrichment than the P2 and P3 regions, as examined by DRIP-qPCR in S3A Fig. of the revised manuscript.

      Genome browser screenshots with annotations of the genes located in the P and N regions are shown in S2A–E Fig. of the revised manuscript, and the RNA-seq analysis of the relative gene expression levels of the P1-3 and N1,2 R-loop regions is shown in S4 Table of the revised manuscript. Thank you.

      (1.12) Line 261: But the binding affinity of integrase to the R-loop is somewhat weaker than to double-stranded DNA according to Figure 5A.

      Nucleic acid substrates were loaded at the same molarity, and the percentage of the unbound fraction was calculated by dividing the intensity of the unbound fraction in each lane by the intensity of the unbound fraction in the lane with 0 nM integrase in the binding reaction. The calculated percentages of the unbound fraction from three independent replicate experiments are shown in Fig. 5A (right) of the revised manuscript. In our analysis and measurements, the integrase proteins showed higher binding affinities to the R-loop and to R-loop-comprising nucleic acid structures than to dsDNA in vitro. We hope that this explanation clarifies this point.
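
The normalization described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name is hypothetical, and the first entry of the list is assumed to be the lane with 0 nM integrase.

```python
def percent_unbound(intensities: list[float]) -> list[float]:
    """Express each lane's unbound-band intensity as a percentage of the 0 nM lane.

    `intensities[0]` is assumed to be the unbound band of the 0 nM integrase lane.
    """
    reference = intensities[0]
    if reference <= 0:
        raise ValueError("the 0 nM lane must have a positive unbound intensity")
    return [100.0 * value / reference for value in intensities]
```

A substrate that is bound more tightly shows a steeper fall in this percentage as the integrase concentration increases.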

      (1.13) Line 337: "accumulate". This is a not uncommon misinterpretation of the results of studies on the distribution of intact proviruses in elite controllers. The only possible correct interpretation of the finding is that proviruses form everywhere else but cells containing them are eliminated, most likely by the immune system.

      Thank you for the suggestion. We have changed Line 337 of the original manuscript to “... HIV-1 proviruses in heterochromatic regions are not eliminated but selected by immune system,” in Lines 361-363 of the revised manuscript. 

      (1.14) Line 371 How many virus particles per cell does this inoculum amount to?

      We determined the amount of GFP reporter viruses required to transduce ∼50% of WT Jurkat T cells, corresponding to an approximate MOI of 0.6. We consistently obtained 30–50% EGFP-positive cells after infection with VSV-G-pseudotyped HIV-1-EGFP when constructing HIV-1 integration site sequencing libraries from Jurkat T cells. 
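
The relationship between the fraction of reporter-positive cells and the MOI can be illustrated with the standard Poisson model of infection, in which the fraction of cells receiving at least one infectious event is 1 − exp(−MOI). The sketch below assumes that model; the function names are hypothetical, and the model is not necessarily how the MOI above was derived. Under this assumption, an MOI of 0.6 corresponds to roughly 45% positive cells, and 50% positive cells corresponds to an MOI of about 0.69.

```python
import math


def fraction_infected(moi: float) -> float:
    """Expected fraction of cells with at least one infectious event (Poisson model)."""
    return 1.0 - math.exp(-moi)


def moi_from_fraction(fraction: float) -> float:
    """Invert the Poisson model: MOI implied by an observed positive fraction."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return -math.log(1.0 - fraction)
```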

      (1.15) Line 503 and Figures 3 and 4: There must be a clear description of how integration events are quantitated.

      Detailed methods for integration site sequencing data processing are described in the Materials and Methods section of the revised manuscript (Lines 621–631 of the revised manuscript). We primarily followed the HIV-1 integration site sequencing data processing methods previously described in Li et al., mBio, 2020; 11(5) (4).

      Reviewer #2 (Public Review):

      Retroviral integration in general, and HIV integration in particular, takes place in dsDNA, not in R-loops. Although HIV integration can occur in vitro on naked dsDNA, there is good evidence that, in an infected cell, integration occurs on DNA that is associated with nucleosomes. This review will be presented in two parts. First, a summary will be provided giving some of the reasons to be confident that integration occurs on dsDNA on nucleosomes. The second part will point out some of the obvious problems with the experimental data that are presented in the manuscript.

      We appreciate your comments. We have carefully addressed the concerns expressed as follows (your comments are in italics):  

      (2.1) 2017 Dos Passos Science paper describes the structure of the HIV intasome. The structure makes it clear that the target for integration is dsDNA, not an R-loop, and there are very good reasons to think that structure is physiologically relevant. For example, there is data from the Cherepanov, Engelman, and Lyumkis labs to show that the HIV intasome is quite similar in its overall structure and organization to the structures of the intasomes of other retroviruses. Importantly, these structures explain the way integration creates a small duplication of the host sequences at the integration site. How do the authors propose that an R-loop can replace the dsDNA that was seen in these intasome structures?

      We do appreciate the current understanding of the HIV-1 integration site selection mechanism and the known structure of the dsDNA-bound intasome. Our study proposes an R-loop as another contributor to HIV-1 integration site selection. Recent studies providing new perspectives on HIV-1 integration site targeting motivated our current work. For instance, Ajoge et al., 2022 (7) indicated that a guanine-quadruplex (G4) structure formed in the non-template DNA strand of the R-loop influences HIV-1 integration site targeting. Additionally, I. K. Jozwik et al., 2022 (8) showed retroviral integrase protein structure bound to B-to-A transition in target DNA. R-loop structures are a prevalent class of alternative non-B DNA structures (9). We acknowledge the current understanding of HIV-1 integration site selection and explore how R-loop interactions may contribute to this knowledge in the Discussion section of our manuscript. 

      Based primarily on our current data, we believe that HIV-1 integrase proteins bind R-loop structures and that this leads to integration of the HIV-1 genome in the vicinity of host genomic R-loops, but we do not claim that R-loops completely replace dsDNA as the target for HIV-1 integration. An R-loop can be multiple kilobases in size, and R-loop peak length varies widely depending on the immunoprecipitation and library construction methods (1, 2); therefore, we could not validate whether the center of the triple-stranded R-loop is the exact site of HIV-1 integration where the strand transfer reaction by integrase occurs. Accordingly, we replaced phrases such as “HIV-1 targets R-loops for integration,” which may overstate our finding on the role of R-loops in HIV-1 integration site selection, with phrases such as “HIV-1 favors vicinity regions of R-loop for the viral genome integration” in the revised manuscript. We apologize for the inconvenience caused by the unclear and non-specific details of our findings. Nevertheless, we believe that we validated R-loop-mediated HIV-1 integration in R-loop-forming regions using our pgR-poor and pgR-rich cell line models. We quantified site-specific integration events in the R-loop regions and found that a greater number of integration events occurred in the designated regions of the pgR-rich cellular genome upon R-loop induction by DOX treatment, but not in pgR-poor cells (Fig. 3E–G of the revised manuscript). 

      dsDNA may appear to be the sole target of the intasome in vitro because only dsDNA has so far been considered as a substrate for in vitro intasome assembly. We hope that our work will initiate and advance future investigations on target-bound intasome structures by considering R-loops as potential new targets for integrase proteins and intasomes.  

      (2.2) As noted above, concerted (two-ended) integration can occur in vitro on a naked dsDNA substrate. However, there is compelling evidence that, in cells, integration preferentially occurs on nucleosomes. Nucleosomes are not found in R loops. In an infected cell, the viral RNA genome of HIV is converted into DNA within the capsid/core which transits the nuclear pore before reverse transcription has been completed. Integration requires the uncoating of the capsid/core, which is linked to the completion of viral DNA synthesis in the nucleus. Two host factors are known to strongly influence integration site selection, CPSF6 and LEDGF. CPSF6 is involved in helping the capsid/core transit the nuclear pore and associate with nuclear speckles. LEDGF is involved in helping the preintegration complex (PIC) find an integration site after it has been released from the capsid/core, most commonly in the bodies of highly expressed genes. In the absence of an interaction of CPSF6 with the core, integration occurs primarily in the lamin-associated domains (LADs). Genes in LADs are usually not expressed or are expressed at low levels. Depending on the cell type, integration in the absence of CPSF6 can be less efficient than normal integration, but that could well be due to a lack of LEDGF (which is associated with expressed genes) in the LADs. In the absence of an interaction of IN with LEDGF (and in cells with low levels of HRP2) integration is less efficient and the obvious preference for integration in highly expressed genes is reduced. Importantly, LEDGF is known to bind histone marks, and will therefore be preferentially associated with nucleosomes, not R-loops. LEDGF fusions, in which the chromatin binding portion of the protein is replaced, can be used to redirect where HIV integrates, and that technique has been used to map the locations of proteins on chromatin. 
      Importantly, LEDGF fusions in which the chromatin binding component of LEDGF is replaced with a module that recognizes specific histone marks direct integration to those marks, confirming integration occurs efficiently on nucleosomes in cells. It is worth noting that it is possible to redirect integration to portions of the host genome that are poorly expressed, which, when taken with the data on integration into LADs (integration in the absence of a CPSF6 interaction) shows that there are circumstances in which there is reasonably efficient integration of HIV DNA in portions of the genome in which there are few if any R-loops.

      Although R-loops may not wrap around nucleosomes, long and stable R-loops likely cover stretches of DNA corresponding to multiple nucleosomes (10). For example, R-loops are associated with high levels of histone marks, such as H3K36me3, which LEDGF recognizes (2, 11). R-loops dynamically regulate the chromatin architecture. Possibly by altering nucleosome occupancy, positioning, or turnover, R-loop structures relieve superhelical stress and are often associated with open chromatin marks and active enhancers (2, 10). These features are also distributed over HIV-1 integration sites (12). In the Discussion section of the revised manuscript, we explored the R-loop molding mechanisms in the host genomic environment for HIV-1 integration site selection and its potential collaborative role with LEDGF/p75 and CPSF6 governing HIV-1 integration site selection. 

      While carefully revising our original manuscript in light of the reviewer's comment, we recognized the need to tone down our statements. We found that the title, abstract, and discussion of our original manuscript included phrases such as “HIV-1 targets R-loops for integration,” which may overstate our finding on the role of R-loops in HIV-1 integration site selection. We replaced these phrases. For example, we used phrases such as “HIV-1 favors vicinity regions of R-loop for the viral genome integration” in the revised manuscript. We apologize for the inconvenience caused by the unclear and non-specific details of our findings.

      (2.3) Given that HIV DNA is known to preferentially integrate into expressed genes and that R-loops must necessarily involve expressed RNA, it is not surprising that there is a correlation between HIV integration and regions of the genome to which R loops have been mapped. However, it is important to remember that correlation does not necessarily imply causation.

      We understand the reviewer's concern regarding the possibility of a coincidental correlation between the R-loop regions and HIV-1 integration sites, particularly when the interpretation of this correlation is primarily based on a global analysis. 

      Therefore, we designed pgR-poor and pgR-rich cell lines, which we believe are suitable models for distinguishing between integration events driven by transcription and those driven by the presence of R-loops. Although the two cell lines showed comparable levels of transcription at the designated region upon DOX treatment via TRE promoter activation (Fig. 3B of the revised manuscript), only pgR-rich cells formed R-loops at the designated regions (Fig. 3C of the revised manuscript). When infected with HIV-1, pgR-rich cells, but not pgR-poor cells, showed higher infectivity after DOX treatment (Fig. 3D of the revised manuscript). Moreover, we quantified site-specific integration events in the R-loop regions and found that a greater number of integration events occurred in the designated regions of the pgR-rich cellular genome upon R-loop induction by DOX treatment, but not in pgR-poor cells (Fig. 3E of the revised manuscript). Therefore, we concluded that transcriptional activation without an R-loop (in pgR-poor cells) may not be sufficient to drive HIV-1 integration. We believe that our work provides strong evidence for a causative relationship between R-loop formation/R-loop sites and HIV-1 integration. We hope that our explanation addresses your concerns. Thank you.

      If we consider some of the problems in the experiments that are described in the manuscript:

      (2.4) In an infected individual, cells are almost always infected by a single virion and the infecting virion is not accompanied by large numbers of damaged or defective virions. This is a key consideration: the claim that infection by HIV affects R-loop formation in cells was done with a VSVg vector in experiments in which there appears to have been about 6000 virions per cell. Although most of the virions prepared in vitro are defective in some way, that does not mean that a large fraction of the defective virions cannot fuse with cells. In normal in vivo infections, HIV has evolved in ways that avoid signaling infected the cell of its presence. To cite an example, carrying out reverse transcription in the capsid/core prevents the host cell from detecting (free) viral DNA in the cytoplasm. The fact that the large effect on R-loop formation which the authors report still occurs in infections done in the absence of reverse transcription strengthens the probability that the effects are due to the massive amounts of virions present, and perhaps to the presence of VSVg, which is quite toxic. To have physiological relevance, the infections would need to be carried out with virions that contain HIV even under circumstances in which there is at most one virion per cell.

      Our virus production and our in vitro and ex vivo HIV-1 infection conditions, designed for infecting cell types such as HeLa cells and primary CD4+ T cells with VSV-G-pseudotyped HIV-1, were based on a comprehensive review of numerous references. At the very beginning of this study, we tested whether host genomic R-loop induction was specific to HIV-1 by using empty virion particles (virus-like particles, VLP) or other types of viruses (non-retrovirus, SeV; retroviruses, FMLV and FIV), all produced with a VSV G protein donor. To prevent viral spread in culture, we could not include a control that omitted the VSV G protein or used the natural HIV-1 envelope protein. We observed that, although all virus stocks were prepared using VSV-G, only cells infected with HIV-1 showed R-loop signal enrichment (Author response image 3). Therefore, we omitted the control for the VSV G protein in subsequent analyses, such as DRIPc-seq. We have also revised our manuscript to provide a clearer description of the experimental conditions. In particular, we now clearly state throughout the abstract, results, and discussion sections of the revised manuscript that we used VSV-G-pseudotyped HIV-1 in this study. Thank you.

      Author response image 3.

      (A) Dot blot analysis of the R-loop in gDNA extracts from HIV-1 infected U2OS cells with MOI of 0.6 harvested at 6 hpi. The gDNA extracts were incubated with or without RNase H in vitro before membrane loading (anti-S9.6 signal). (B) Dot blot analysis of the R-loop in gDNA extracts from HeLa cells infected with 0.3 MOI of indicated viruses. The infected cells were harvested at 6 hpi. The gDNA extracts were incubated with or without RNase H in vitro before membrane loading (anti-S9.6 signal).

      HIV-1 co-infection may also be expected in cell-free HIV-1 infections. However, it was previously suggested that the average number of infection events varies within 1.02 to 1.65 based on a mathematical model that estimates the frequency of multiple infections with the same virus (Figure 4c of Ito et al., Sci. Rep, 2017; 6559) (13). 

      (2.5) Using the Sso7d version of HIV IN in the in vitro binding assays raises some questions, but that is not the real question/problem. The real problem is that the important question is not what/how HIV IN protein binds to, but where/how an intasome binds. An intasome is formed from a combination of IN bound to the ends of viral DNA. In the absence of viral DNA ends, IN does not have the same structure/organization as it has in an intasome. Moreover, HIV IN (even Sso7d, which was modified to improve its behavior) is notoriously sticky and hard to work with. If viral DNA had been included in the experiment, intasomes would need to be prepared and purified for a proper binding experiment. To make matters worse, there are multiple forms of multimeric HIV IN and it is not clear how many HIV INs are present in the PICs that actually carry out integration in an infected cell.

      As the reviewer has noted, HIV IN, even with Sso7d tagging, is difficult to work with. We attempted to purify viral DNA (vDNA)-bound PICs using either Sso7d-tagged HIV-1 integrase proteins or non-tagged HIV-1 integrase proteins (F185K/C280S), procured from the NIH HIV Reagent Program (HRP-20203), following the method described by Passos et al., Science, 2017; 355 (89-92) (3). Despite multiple attempts, we were unable to purify the vDNA-bound IN protein complexes for in vitro assays. However, we believe that our multiple biochemical experiments successfully demonstrated the interaction between cellular R-loops and HIV-1 integrase proteins both in cells and in vitro (Fig. 5A–F of the revised manuscript). We also observed a close association between integrase proteins and host cellular R-loops in HIV-1-infected cells, using a fluorescent recombinant virus (HIV-IN-EGFP) with intact IN-EGFP PICs (Fig. 5G of the revised manuscript). 

      (2.6) As an extension of comment 2, the proper association of an HIV intasome/PIC with the host genome requires LEDGF and the appropriate nucleic acid targets need to be chromatinized.

      The interaction between cellular R-loops and HIV-1 integrase proteins in HeLa cells endogenously expressing LEDGF/p75 was examined using reciprocal immunoprecipitation assays in Fig. 5C–F, S6B, and S6D Fig. of the revised manuscript. In addition, as discussed in more detail in our response to comment [2-8], we observed a close association between host cellular R-loops and HIV-1 integrase proteins by PLA in HIV-1-infected HeLa cells. 

      (2.7) Expressing any form of IN, by itself, in cells to look for what IN associates with is not a valid experiment. A major factor that helps to determine both where integration takes place and the sites chosen for integration is the transport of the viral DNA and IN into the nucleus in the capsid core. However, even if we ignore that important part of the problem, the IN that the authors expressed in HeLa cells won't be bound to the viral DNA ends (see comment 2), even if the fusion protein would be able to form an intasome. As such, the IN that is expressed free in cells will not form a proper intasome/PIC and cannot be expected to bind where/how an intasome/PIC would bind.

      As discussed in more detail in our response to comment [2-8], we believe that our PLA experiment using the pVpr-IN-EGFP virus, which has previously been examined for virion integrity, as well as the IN-EGFP PICs (14), demonstrated a close association between host cellular R-loops and HIV-1 integrase proteins in HIV-1-infected cells. 

      (2.8) As in comment 1, for the PLA experiments presented in Figure 5 to work, the number of virions used per cell (which differs from the MOI measured by the number of cells that express a viral marker) must have a high, which is likely to have affected the cells and the results of the experiment. However, there is the additional question of whether the IN-GFP fusion is functional. The fact that the functional intasome is a complex multimer suggests that this could be a problem. There is an additional problem, even if IN-GFP is fully functional. During a normal infection, the capsid core will have delivered copies of IN (and, in the experiments reported here, the IN-GFP fusion) into the nucleus that is not part of the intasome. These "free" copies of IN (here IN-GFP) are not likely to go to the same sites as an intasome, making this experiment problematic (comment 4).

      The HIV-IN-EGFP virus stock was produced by polyethylenimine-mediated transfection of HEK293T cells with 6 µg of pVpr-IN-EGFP, 6 µg of HIV-1 NL4-3 noninfectious molecular clone (pD64E; NIH AIDS Reagent Program 10180), and 1 µg of pVSV-G, as previously described (14) and as detailed in the Materials and Methods section of our manuscript. The pVpr-IN-EGFP vector used to produce the HIV-1-IN-EGFP virus stock was provided by the Anna Cereseto group (Albanese et al., PLOS ONE, 2008; 6(6); Ref 34 of the revised manuscript). It was previously reported that the HIV-1-IN-EGFP virions produced by trans-incorporation of IN-EGFP through Vpr are intact and infective viral particles (Figure 1 of Albanese et al., PLOS ONE, 2008; 6(6)). Therefore, we believe that the HIV-IN-EGFP used in our PLA experiments was functional. 

      Additionally, Albanese et al. showed, by labeling and visualizing newly synthesized cDNA, that the EGFP signal of HIV-IN-EGFP virions colocalizes with the viral matrix (p17MA) and capsid (p24CA) proteins as well as with the cDNA produced by reverse transcriptase (14). In addition, the fluorescent recombinant virus (HIV-IN-EGFP) was structurally intact at the nuclear level (Figure 6 of Albanese et al., PLOS ONE, 2008; 6(6)). Given the integrity of the HIV-IN-EGFP virions and of the IN-EGFP PICs, we believe that our PLA results are unlikely to be misleading in the way the reviewer is concerned about.

      Furthermore, the in vitro HIV-1 infection setting of our PLA experiments was carefully determined based on multiple studies that performed image-based assays on HIV-1-infected cells. For instance, Albanese et al. infected 4 × 10⁴ cells with viral loads equivalent to 1.5 or 3 µg of HIV-1 p24 for their immunofluorescence analysis (14). We titrated the fluorescent HIV-1 virus stocks by determining the multiplicity of infection (MOI) and quantifying the HIV-1 p24 antigen content (Author response image 4). By our calculation, we infected 5 × 10⁴ HeLa cells with viral loads equivalent to 1.3 µg of HIV-1 p24, which is indicated as 2 MOI in Fig. 5G of our manuscript, for our PLA experiments. 
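
To make the comparison of viral input explicit, the p24 amounts quoted above can be expressed per cell. The following Python sketch performs only this unit conversion; the function name is hypothetical, and no conversion from p24 mass to virion number is attempted because that would require an additional assumption. With the figures quoted above, 1.3 µg of p24 on 5 × 10⁴ cells is about 26 pg of p24 per cell, compared with 37.5 and 75 pg per cell for 1.5 and 3 µg on 4 × 10⁴ cells.

```python
def p24_pg_per_cell(p24_micrograms: float, cell_count: float) -> float:
    """Convert a p24 input in micrograms to picograms of p24 per cell."""
    if cell_count <= 0:
        raise ValueError("cell_count must be positive")
    picograms = p24_micrograms * 1e6  # 1 µg = 1,000,000 pg
    return picograms / cell_count
```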

      Image-based assays often require an increased signal for statistical robustness. For example, Achuthan et al. infected cells with VSV-G-pseudotyped HIV-1 at an approximate MOI of 350 for vDNA and PIC visualization (15). Therefore, we believe that the conditions of our PLA experiments, which we carefully designed based on frequently cited previous studies, are reasonable. We hope that our discussion sufficiently addresses the reviewer’s concern. 

      Author response image 4.

      Gating strategy used to determine HIV-1-infectivity in HeLa cells at 48 hpi. Cells were infected with a known p24 antigen content in the stock of the VSV-G-pseudotyped HIV-1-EGFP-virus. The percentages of GFP-positive cell population are indicated.

      (2.9) In the Introduction, the authors state that the site of integration affects the probability that the resulting provirus will be expressed. Although this idea is widely believed in the field, the actual data supporting it are, at best, weak. See, for example, the data from the Bushman lab showing that the distribution of integration sites is the same in cells in which the integrated proviruses are, and are not, expressed. However, given what the authors claim in the introduction, they should be more careful in interpreting enzyme expression levels (luciferase) as a measure of integration efficiency in experiments in which they claim proviruses are integrated in different places.

      We thank the reviewer for the constructive comment. We have changed the statement in Lines 41–42 in the Introduction section of our original manuscript to “The chromosomal landscape of HIV-1 integration influences proviral gene expression, persistence of integrated proviruses, and prognosis of antiretroviral therapy.” (Lines 39-41 of the revised manuscript). We believe that this change tones down the implied link between the site of integration and the provirus expression level.

      The piggyBac transposase randomly inserts the “cargo (transposon)” into TTAA chromosomal sites of the target genome, generating efficient insertions at different genomic loci (16, 17). We believe that this random insertion of the pgR-poor/rich vector by the piggyBac system prevents locus bias in vector insertion from confounding our analysis of R-loop-mediated HIV-1 integration sites. Figure 3 of our manuscript therefore makes no claim about the relationship between the site of integration and the resulting provirus expression level. Instead, as noted in Line 214 of the revised manuscript, we used the luciferase reporter HIV-1 virus to examine HIV-1 infection in cells with an “extra number of R-loops” in the host cellular genome. We observed that pgR-rich cells showed higher luciferase activity upon DOX treatment than pgR-poor cells (Fig. 3D of the revised manuscript). We believe that this is because a greater number of HIV-1 integration events may occur in pgR-rich cells, where DOX-inducible de novo R-loop regions are introduced. This has been further examined in Fig. 3E–G of the revised manuscript. We hope that this explanation clarifies Figure 3. Thank you. 

      (2.10) Using restriction enzymes to create an integration site library introduces biases that derive from the uneven distribution of the recognition sites for the restriction enzymes.

      As described in the Materials and Methods section, we constructed sequencing libraries using a previously established protocol (18, 19). We recognize the advantages of DNA fragmentation by sonication. However, in in vitro or ex vivo HIV-1 infection settings, where the multiplicity of infection is carefully determined based on multiple references, more copies of integrated viral sequences are expected than in samples from infected patients (18). In these settings, restriction enzyme-based DNA fragmentation and ligation-mediated PCR sequencing are therefore well-established methods that provide significant data sources for HIV-1 integration site sequencing (15, 20-22). Furthermore, our data showing the proportion of integration sites over R-loop regions (Fig. 4B of the revised manuscript) are presented alongside the respective random controls (i.e., proportion of integration sites within the 30-kb windows centered on randomized DRIPc-seq peaks, gray dotted lines; control comparisons between randomized integration sites with DRIPc-seq peaks, black dotted lines; and randomized integration sites with randomized DRIPc-seq peaks, gray solid lines), which do not show such a correlation between the HIV-1 integration sites and nearby areas of the R-loop regions. Therefore, we believe that our results from the integration site sequencing data analysis are unlikely to be biased. 
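
The logic of the random controls described above can be illustrated with a simplified Python sketch. It treats the genome as a single linear coordinate, draws randomized peak centers uniformly, and uses hypothetical function names and parameters; it is intended only to show the type of comparison and is not the analysis code used for Fig. 4B.

```python
import random


def fraction_near_peaks(sites, peak_centers, window=30_000):
    """Fraction of integration sites lying within window/2 of any peak center."""
    half = window / 2
    hits = sum(1 for s in sites if any(abs(s - p) <= half for p in peak_centers))
    return hits / len(sites) if sites else 0.0


def randomized_positions(count, genome_length, rng):
    """Draw `count` uniformly random positions on a single linear coordinate."""
    return [rng.randrange(genome_length) for _ in range(count)]


def control_fractions(sites, peak_centers, genome_length,
                      n_shuffles=100, seed=0, window=30_000):
    """Repeat the proximity measurement with randomized peak centers."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    return [
        fraction_near_peaks(
            sites,
            randomized_positions(len(peak_centers), genome_length, rng),
            window,
        )
        for _ in range(n_shuffles)
    ]
```

An observed fraction that clearly exceeds the distribution returned by `control_fractions` would indicate enrichment of integration sites near peaks beyond what random placement produces. The same functions can be applied with randomized integration sites to mirror the other two controls.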

      Reviewer #3 (Public Review):

      In this manuscript, Park and colleagues describe a series of experiments that investigate the role of R-loops in HIV-1 genome integration. The authors show that during HIV-1 infection, R-loops levels on the host genome accumulate. Using a synthetic R-loop prone gene construct, they show that HIV-1 integration sites target sites with high R-loop levels. They further show that integration sites on the endogenous host genome are correlated with sites prone to R-loops. Using biochemical approaches, as well as in vivo co-IP and proximity ligation experiments, the authors show that HIV-1 integrase physically interacts with R-loop structures.

      My primary concern with the paper is with the interpretations the authors make about their genome-wide analyses. I think that including some additional analyses of the genome-wide data, as well as some textual changes can help make these interpretations more congruent with what the data demonstrate. Here are a few specific comments and questions:

      We are grateful for the time and effort the reviewer spent on our behalf and for the reviewer’s appreciation of the novelty of our work, in particular, R-loop induction by HIV-1 infection and the correlation between host R-loops and the genomic sites of HIV-1 integration. In the following sections, we provide our responses to your comments and suggestions. Your comments are in italics. We have carefully addressed the following issues.

      (3.1) I think Figure 1 makes a good case for the conclusion that R-loops are more easily detected HIV-1 infected cells by multiple approaches (all using the S9.6 antibody). The authors show that their signals are RNase H sensitive, which is a critical control. For the DRIPc-Seq, I think including an analysis of biological replicates would greatly strengthen the manuscript. The authors state in the methods that the DRIPc pulldown experiments were done in biological replicates for each condition. Are the increases in DRIPc peaks similar across biological replicates? Are genomic locations of HIV-1-dependent peaks similar across biological replicates? Measuring and reporting the biological variation between replicate experiments is crucial for making conclusions about increases in R-loop peak frequency. This is partially alleviated by the locus-specific data in Figure S3A. However, a better understanding of how the genome-wide data varies across biological replicates will greatly enhance the quality of Figure 1.

      DRIPc-seq experiments were conducted with two biological replicates. To define consensus DRIPc-seq peaks using these two replicates, we used two methods applicable to ChIP-seq analysis: the irreproducible discovery rate (IDR) method and sequencing data pooling. We found that the sequencing data pooling method yielded significantly more DRIPc-seq peaks than consensus peak identification through IDR, and we decided to utilize R-loop peaks from pooled sequencing data for our downstream analyses, as described in the figure legends and Materials and Methods of the revised manuscript. 

      As noted by the reviewer, it is important to verify whether the increasing trend in the number of R-loop peaks and genomic locations of HIV-1 dependent R-loops were consistently observed across the two biological replicates. Therefore, we independently performed R-loop calling on each replicate of the sequencing data of primary CD4+ T cells from two individual donors to verify that the increase in R-loop numbers was consistent (Author response image 5). Additionally, the overlap of the R-loop peaks between the two replicates was statistically significant across the genome (Author response table 1). Thank you.

      Author response image 5.

      Bar graph indicating DRIPc-seq peak counts for HIV-1-infected primary CD4+ T cells harvested at the indicated hours post infection (hpi). Pre-immunoprecipitated samples were untreated (−) or treated (+) with RNase H, as indicated. Each dot corresponds to an individual data set from two biologically independent experiments.

      Author response table 1.

      DRIPc-seq peak length and Chi-square p-value in CD4+ T cells from individual donor 1 and 2 

      (3.2) I think that the conclusion that R-loops "accumulate" in infected cells is acceptable, given the data presented. However, in line 134 the authors state that "HIV1 infection induced host genomic R-loop formation". I suggest being very specific about the observation. Accumulation can happen by (a) inducing a higher frequency of the occurrence of individual R-loops and/or (b) stabilizing existing R-loops. I'm not convinced the authors present enough evidence to claim one over the other. It is altogether possible that HIV-1 infection stabilizes R-loops such that they are more persistent (perhaps by interactions with integrase?), and therefore more easily detected. I think rephrasing the conclusions to include this possibility would alleviate my concerns.

      We thank the reviewer for the considerable discussion on our manuscript. We have now changed Line 134 to, “HIV-1 infection induces host genomic R-loop enrichment” (Lines 132-133 of the revised manuscript), and added a new conclusion sentence implicating the possible explanation for the R-loop signal enrichment upon HIV-1 infection (Lines 133–135 of the revised manuscript), according to the reviewer's suggestion.    

      (3.3) A technical problem with using the S9.6 antibody for the detection of R-loops via microscopy is that it cross-reacts with double-stranded RNA. This has been addressed by the work of Chedin and colleagues (as well as others). It is absolutely essential to treat these samples with an RNA:RNA hybrid-specific RNase, which the authors did not include, as far as their methods section states. Therefore, it is difficult to interpret all of the immunofluorescence experiments that depend on S9.6 binding.

      We understand the reviewer's concern regarding the cross-reactivity of the S9.6 antibody with more abundant dsRNA, particularly in imaging applications. We carefully designed the experimental and analytical methods for R-loop detection using microscopy. For example, we pre-extracted the cytoplasmic fraction before staining with the S9.6 antibody and quantified the R-loop signal by subtracting the nucleolar signal. Both of these steps were taken to eliminate the possibility of misdetecting R-loops via microscopy because of the prominent cytoplasmic and nucleolar S9.6 signals, which primarily originate from ribosomal RNA. In addition, we included R-loop negative control samples in our microscopy analysis that were subjected to intensive RNase H treatment (60 U/mL RNase H for 36 h) and observed a significant reduction in the S9.6 signal (Figure 1E of the revised manuscript). RNase H-treated samples served as essential and widely accepted negative controls for R-loop detection.

      We would like to point out that recent studies have reported strong intrinsic specificity of the S9.6 antibody for DNA:RNA hybrid duplexes over dsDNA and dsRNA, along with structural elucidation of S9.6 antibody recognition of hybrids (23, 24). Therefore, our interpretation of host cellular R-loop enrichment after HIV-1 infection using S9.6 antibodies in multiple biochemical approaches is well supported. Nevertheless, we agree with the reviewer's opinion that additional negative controls for the detection of R-loops via microscopy, such as RNase T1- and RNase III-treated samples, could improve the robustness and accuracy of R-loop imaging data (25).

      (3.4) Given that there is no clear correlation between expression levels and R-loop peak detection, combined with the data that show increased detection of R-loop frequency in non-genic regions, I think it will be important to show that the R-loop forming regions are indeed transcribed above background levels. This will help alleviate possible concerns that there are technical errors in R-loop peak detection.

      Figures S5D and S5E in the revised manuscript show the relative gene expression levels of the R-loop-forming positive regions (P1-3) and the referenced R-loop-positive loci (RPL13A and CALM3). The gene expression levels of these R-loop-forming regions were significantly higher than those of the ECFP or mAIRN genes without DOX treatment, which can be considered background levels of transcription in cells. Thank you.

      (3.5) In Figures 4C and D the hashed lines are not defined. It is also interesting that the integration sites do not line up with R-loop peaks. This does not necessarily directly refute the conclusions (especially given the scale of the genomic region displayed), but should be addressed in the manuscript. Additionally, it would greatly improve Figure 4 to have some idea about the biological variation across replicates of the data presented 4A.

      We thank the reviewer for the considerable comment on our study. First of all, we added an annotation for the dashed lines in the figure legends of Figures 4C and 4D in the revised manuscript.

      We agree with the reviewer's interpretation of the relationship between the integration sites and R-loop peaks. Primarily based on our current data, we believe R-loop structures are bound by HIV-1 integrase proteins and lead HIV-1 viral genome integration into the “vicinity” regions of the host genomic R-loops. We displayed a large-scale genomic region (30-kb windows) to present integration sites surrounding R-loop centers because an R-loop can be multi-kilobase in size (1, 2). Depending on the immunoprecipitation and library construction methods, the R-loop peaks varied in size, and the peak length showed a wide distribution (Figure 3B of Malig et al., 2020, Figure 1B of Sanz et al., 2016, and Figure 2A of the revised manuscript). Therefore, presenting integration site events within a wide window of R-loop peaks could be more informative and better reflect the current understanding of R-loop biology.

      R-loop formation recruits diverse chromatin-binding protein factors, such as H3K4me1, p300, CTCF, RAD21, and ZNF143 (Figure 6A and 6B of Sanz et al., 2016) (26), which allow R-loops to exhibit enhancer and insulator chromatin states that can act as distal regulatory elements (26, 27). We have demonstrated physical interactions between host cellular R-loops and HIV-1 integrase proteins (Figure 5 of the revised manuscript). Therefore, we believe that this ‘distal regulatory element-like feature’ of the R-loop can be a potential explanation for how R-loops drive integration over long-range genomic regions.

      According to your suggestion, we added this explanation to the relevant literature in the Discussion section of the revised manuscript.

      Author response image 6 shows the biological variation across replicates of the data shown in Figure 4A. The integration site sequencing data for Jurkat cells were adopted from SRR12322252 (4), which consists of the integration site sequencing data of HIV-1-infected wild-type Jurkat cells with one biological replicate. We hope that our explanations and discussion have successfully addressed your concerns. Thank you.

      Author response image 6.

      Bar graphs showing the quantified number of HIV-1 integration sites per Mb pair in total regions of 30-kb windows centered on DRIPc-seq peaks from HIV-1 infected HeLa cells and primary CD4+ T cells (magenta) or non-R-loop region in the cellular genome (gray). Each dot corresponds to an individual data set from two biologically independent experiments.
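
      The quantity plotted in this image, integration sites per Mb within 30-kb windows centered on peaks, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the single-chromosome coordinates, and the decision to merge overlapping windows so that no base is counted twice are all assumptions made for illustration, and the coordinates are made-up toy values.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals (half-open)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of counting bases twice.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def sites_per_mb(peak_centers, integration_sites, window=30_000):
    """Integration sites per Mb inside windows centered on peak centers.

    All positions are assumed to lie on one chromosome (hypothetical setup).
    """
    half = window // 2
    windows = merge_intervals(
        [(max(0, center - half), center + half) for center in peak_centers]
    )
    total_bp = sum(end - start for start, end in windows)
    if total_bp == 0:
        raise ValueError("no window sequence to normalise by")
    hits = sum(
        any(start <= pos < end for start, end in windows)
        for pos in integration_sites
    )
    return hits / (total_bp / 1e6)


# Toy coordinates for illustration only (not data from the study).
peaks = [100_000, 110_000, 500_000]
sites = [95_000, 120_000, 300_000, 505_000]
density = sites_per_mb(peaks, sites)
```

      The same function applied to windows drawn from non-R-loop regions would give the comparison value; how such control regions are chosen is not specified here.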

      (3.6) The authors do not adequately describe the Integrase mutant that they use in their biochemical experiments in Figure 5A. Could this impact the activity of the protein in such a way that interferes with the interpretation of the experiment? The mutant is not used in subsequent experiments for Figure 5 and so even though the data are consistent with each other (and the conclusion that Integrase interacts with R-loops) a more thorough explanation of why that mutant was used and how it impacts the biochemical activity of the protein will help the interpretation of the data presented in Figure 5.

      We appreciate the reviewer’s suggestions. In our EMSA analysis, we purified and used Sso7d-tagged HIV-1 integrase proteins with an active-site amino acid substitution, E152Q. First, we used the Sso7d-tagged HIV-1 integrase protein because previous studies have suggested that the fusion of small domains, such as Sso7d (a DNA-binding domain), can significantly improve the solubility of HIV integrase proteins without affecting their ability to assemble with substrate nucleic acids or their enzymatic activity (Figure 1B of Li et al., PLOS ONE, 2014;9(8)) (28, 29). Second, we used an integrase protein with an active-site amino acid substitution, E152Q, in our mobility shift assay because the primary goal of this experiment was to examine the ability of the protein to bind or form a complex with different nucleic acid substrates. We thought that abolishing the enzymatic activity of the integrase protein, such as the 3'-processing that cleaves DNA substrates, would be more appropriate for our experimental objective. This Sso7d-tagged HIV-1 integrase with the E152Q mutation has also been used to elucidate the structural model of the integrase complex with a nucleic acid substrate by cryo-EM (3) and has been shown not to disturb substrate binding. Based on the reviewer’s comments, we have added a description of the E152Q mutant integrase protein in Lines 268–270 of the revised manuscript. Thank you.

      Reviewer #3 (Recommendations For The Authors):

      The paper suffers from many grammatical errors, which sometimes interfere with the interpretations of the experiments. In the view of this reviewer, the manuscript must be carefully revised prior to publication. For example, lines 247-248 "Intasomes consist of HIV-1 viral cDNA and HIV-1 coding protein, integrases." It is unclear from this sentence whether there are multiple integrases or multiple proteins that interact with the viral genome to facilitate integration. This makes the subsequent experiments in Figure 5 difficult to interpret. There are many other examples, too numerous to point out individually.

      We thoughtfully revised the original manuscript, making the best efforts to provide clearer details of our findings. We believe that we have made substantial changes to the manuscript, including Lines 247–248 of the original manuscript that the reviewer noted. Furthermore, the revised manuscript was edited by a professional editing service. Thank you.

      (1) M. Malig, S. R. Hartono, J. M. Giafaglione, L. A. Sanz, F. Chedin, Ultra-deep Coverage Single-molecule R-loop Footprinting Reveals Principles of R-loop Formation. J Mol Biol 432, 2271-2288 (2020).

      (2) L. A. Sanz et al., Prevalent, Dynamic, and Conserved R-Loop Structures Associate with Specific Epigenomic Signatures in Mammals. Mol Cell 63, 167-178 (2016).

      (3) D. O. Passos et al., Cryo-EM structures and atomic model of the HIV-1 strand transfer complex intasome. Science 355, 89-92 (2017).

      (4) W. Li et al., CPSF6-Dependent Targeting of Speckle-Associated Domains Distinguishes Primate from Nonprimate Lentiviral Integration. mBio 11,  (2020).

      (5) P. A. Ginno, Y. W. Lim, P. L. Lott, I. Korf, F. Chedin, GC skew at the 5' and 3' ends of human genes links R-loop formation to epigenetic regulation and transcription termination. Genome Res 23, 1590-1600 (2013).

      (6) S. Hamperl, M. J. Bocek, J. C. Saldivar, T. Swigut, K. A. Cimprich, Transcription-Replication Conflict Orientation Modulates R-Loop Levels and Activates Distinct DNA Damage Responses. Cell 170, 774-786 e719 (2017).

      (7) H. O. Ajoge et al., G-Quadruplex DNA and Other Non-Canonical B-Form DNA Motifs Influence Productive and Latent HIV-1 Integration and Reactivation Potential. Viruses 14,  (2022).

      (8) I. K. Jozwik et al., B-to-A transition in target DNA during retroviral integration. Nucleic Acids Res 50, 8898-8918 (2022).

      (9) F. Chedin, C. J. Benham, Emerging roles for R-loop structures in the management of topological stress. J Biol Chem 295, 4684-4695 (2020).

      (10) F. Chedin, Nascent Connections: R-Loops and Chromatin Patterning. Trends Genet 32, 828-838 (2016).

      (11) P. B. Chen, H. V. Chen, D. Acharya, O. J. Rando, T. G. Fazzio, R loops regulate promoter-proximal chromatin architecture and cellular differentiation. Nat Struct Mol Biol 22, 999-1007 (2015).

      (12) A. R. Schroder et al., HIV-1 integration in the human genome favors active genes and local hotspots. Cell 110, 521-529 (2002).

      (13) Y. Ito et al., Number of infection events per cell during HIV-1 cell-free infection. Sci Rep 7, 6559 (2017).

      (14) A. Albanese, D. Arosio, M. Terreni, A. Cereseto, HIV-1 pre-integration complexes selectively target decondensed chromatin in the nuclear periphery. PLoS One 3, e2413 (2008).

      (15) V. Achuthan et al., Capsid-CPSF6 Interaction Licenses Nuclear HIV-1 Trafficking to Sites of Viral DNA Integration. Cell Host Microbe 24, 392-404 e398 (2018).

      (16) X. Li et al., piggyBac transposase tools for genome engineering. Proc Natl Acad Sci U S A 110, E2279-2287 (2013).

      (17) Y. Cao et al., Identification of piggyBac-mediated insertions in Plasmodium berghei by next generation sequencing. Malar J 12, 287 (2013).

      (18) E. Serrao, P. Cherepanov, A. N. Engelman, Amplification, Next-generation Sequencing, and Genomic DNA Mapping of Retroviral Integration Sites. J Vis Exp,  (2016).

      (19) K. A. Matreyek et al., Host and viral determinants for MxB restriction of HIV-1 infection. Retrovirology 11, 90 (2014).

      (20) G. A. Sowd et al., A critical role for alternative polyadenylation factor CPSF6 in targeting HIV-1 integration to transcriptionally active chromatin. Proc Natl Acad Sci U S A 113, E1054-E1063 (2016).

      (21) B. Lucic et al., Spatially clustered loci with multiple enhancers are frequent targets of HIV-1 integration. Nat Commun 10, 4059 (2019).

      (22) P. K. Singh, G. J. Bedwell, A. N. Engelman, Spatial and Genomic Correlates of HIV-1 Integration Site Targeting. Cells 11,  (2022).

      (23) C. Bou-Nader, A. Bothra, D. N. Garboczi, S. H. Leppla, J. Zhang, Structural basis of R-loop recognition by the S9.6 monoclonal antibody. Nat Commun 13, 1641 (2022).

      (24) Q. Li et al., Cryo-EM structure of R-loop monoclonal antibody S9.6 in recognizing RNA:DNA hybrids. J Genet Genomics 49, 677-680 (2022).

      (25) J. A. Smolka, L. A. Sanz, S. R. Hartono, F. Chedin, Recognition of RNA by the S9.6 antibody creates pervasive artifacts when imaging RNA:DNA hybrids. J Cell Biol 220,  (2021).

      (26) L. A. Sanz, F. Chedin, High-resolution, strand-specific R-loop mapping via S9.6-based DNA-RNA immunoprecipitation and high-throughput sequencing. Nat Protoc 14, 1734-1755 (2019).

      (27) M. Merkenschlager, D. T. Odom, CTCF and cohesin: linking gene regulatory elements with their targets. Cell 152, 1285-1297 (2013).

      (28) M. Li, K. A. Jurado, S. Lin, A. Engelman, R. Craigie, Engineered hyperactive integrase for concerted HIV-1 DNA integration. PLoS One 9, e105078 (2014).

      (29) M. Li et al., A Peptide Derived from Lens Epithelium-Derived Growth Factor Stimulates HIV-1 DNA Integration and Facilitates Intasome Structural Studies. J Mol Biol 432, 2055-2066 (2020).

    1. Author Response

      The following is the authors’ response to the original reviews.

      General remarks for the Editor and the Reviewers

      We would like to thank the Editor and the Reviewers for their feedback. Below we address their comments and present our point-by-point responses as well as the related changes in the manuscript.

      In addition to these changes, in a few cases we have found it necessary to move some texts and provide some additional explanations within the manuscript. We emphasize that these amendments have been made for only technical reasons, and do not alter the results and conclusions of the paper, but may help to render the text more coherent and understandable to readers with little knowledge of the subject.

      These minor corrections are:

      • We extended the Introduction section by a sentence (lines 40-42) that is intended to fit the proposed template-directed, non-enzymatic replication mechanism into a more general prebiotic evolutionary context, thus emphasizing its biological relevance. This sentence includes an additional reference (Rosenberger et al., 2021).

      • Two very methodologically oriented and repeated descriptions of random sequence generation have been moved to the Methods section (lines 178-185) from the Results section (lines 336-339 and lines 351-354).

      • We complemented the Data availability statement with licensing information (lines 684-685).

      • Further minor changes (also indicated by red texts) have been implemented to remedy logical and grammatical glitches.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Szathmary and colleagues explore the parabolic growth regime of replicator evolution. Parabolic growth occurs when nucleic acid strand separation is the rate-limiting step of the replication process, which would have been the case for non-enzymatic replication of short oligonucleotides that could precede the emergence of ribozyme polymerases and helicases. The key result is that parabolic replication is conducive to the maintenance of genetic diversity, that is, the coexistence of numerous master sequences (the Gause principle does not apply). Another important finding is that there is no error threshold for parabolic replication except for the extreme case of zero fidelity.

      Strengths:

      I find both the analytic and the numerical results to be quite convincing and well-described. The results of this work are potentially important because they reveal aspects of a realistic evolutionary scenario for the origin of replicators.

      Weaknesses:

      There are no obvious technical weaknesses. It can be argued that the results represent an incremental advance because many aspects of parabolic replication have been explored previously (the relevant publications are properly cited). Obviously, the work is purely theoretical, and an experimental study of parabolic replication is due. In the opinion of this reviewer, though, these are understandable limitations that do not actually detract from the value of this work.

      We are grateful that this Reviewer appreciates our work. We completely agree that the ultimate validation must come from experiments. It is important to stress that in this field theory often preceded experimental work by decades, and the former often guided the latter. We hope that for the topic of the present paper experiments will follow considerably faster.

      Reviewer #2 (Public Review):

      Summary:

      A dominant hypothesis concerning the origin of life is that, before the appearance of the first enzymes, RNA replicated non-enzymatically by templating. However, this replication was probably not very efficient, due to the propensity of single strands to bind to each other, thus inhibiting template replication. This phenomenon, known as product inhibition, has been shown to lead to parabolic growth instead of exponential growth. Previous works have shown that this situation limits competition between alternative replicators and therefore promotes RNA population diversity. The present work examines this scenario in a model of RNA replication, taking into account finite population size, mutations, and differences in GC content. The main results are (1) confirmation that parabolic growth promotes diversity, but that when the population size is small enough, sequences least efficient at replicating may nevertheless go extinct; (2) the observation that fitness is not only controlled by the replicability of sequences, but also by their GC content; (3) the observation that parabolic growth attenuates the impact of mutations and, in particular, that the error threshold to which exponentially growing sequences are subject can be exceeded, enabling sequence identity to be maintained at higher mutation rates.

      Strengths:

      The analyses are sound and the observations are intriguing. Indeed, although it has been noted previously that parabolic growth promotes coexistence, its role in mitigating the error threshold catastrophe - which is often presented as a major obstacle to our understanding of the origin of life - had not been examined before.

      Weaknesses:

      Although all the conclusions are interesting, most are not very surprising for people familiar with the literature. As the authors point out, parabolic growth is well known to promote diversity (Szathmary-Gladkih 89) and it has also been noted previously that a form of Darwinian selection can be found at small population sizes (Davis 2000).

      Given that under parabolic growth, no sequence is ever excluded for infinite populations, it is also not surprising to find that mutations have a less dramatic exclusionary impact.

      In the two articles cited (Szathmary-Gladkih 1989 and Davis 2000) the subexponentiality of the system was implemented in a mechanistic way, by introducing the exponent 0 < 𝑝 < 1. Although the behaviour of these models is more or less consistent with experimental findings (von Kiedrowski, 1986; Zielinski and Orgel, 1987), the divergence of per capita growth rates (𝑥̇/𝑥) at very low concentrations–which guarantees the ability to maintain unlimited diversity in the case of infinite population sizes–makes this formal approach partly unrealistic.
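
      The divergence mentioned above can be made concrete with a short numerical sketch. It assumes the usual subexponential growth law dx/dt = k·x^p with 0 < p < 1, so that the per capita rate is k·x^(p−1); the function name, the rate constant k = 1, and the sample concentrations are illustrative assumptions, not values from the manuscript.

```python
def per_capita_growth(x, k=1.0, p=0.5):
    """Per capita growth rate (dx/dt)/x for the assumed law dx/dt = k * x**p."""
    return k * x ** (p - 1.0)


# Illustrative concentrations, decreasing towards zero.
concentrations = [1.0, 1e-2, 1e-4, 1e-6]

# Parabolic case (p = 0.5): the per capita rate grows without bound as x -> 0.
parabolic = [per_capita_growth(x, p=0.5) for x in concentrations]

# Exponential case (p = 1): the per capita rate is constant.
exponential = [per_capita_growth(x, p=1.0) for x in concentrations]

for x, rp, re in zip(concentrations, parabolic, exponential):
    print(f"x = {x:g}: parabolic {rp:g}, exponential {re:g}")
```

      Under this assumed form a rare type always has a growth advantage, which is why no type is lost in an infinite population and why the behaviour at very low concentrations is the questionable part of the formal approach.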

      To avoid the possible artefacts of this mechanistic approach, and as there are no previous studies analysing the diversity maintaining ability of finite populations of parabolic replicators in an individual-based model context, we implemented a simplified template replication mechanism leading to parabolic growth and analysed the dynamics in an individual-based stochastic model context. The key point of our investigation is that considerable diversity can be maintained in the system even when the population size is quite small.

      Regarding the Reviewer’s comment on selection: Darwinian selection can only occur in a simple subexponential dynamics if the ratio of replicabilities diverges, cf. Eq. (8) and the preceding paragraph in Davis, 2000.

      Our results also show (Figs. 4B and 4C) that high mutation rates and the error threshold problem can still be considered as a major limiting factor for parabolically replicating systems in terms of their diversity-maintaining ability. In the light of the above, potential mechanisms to relax the error threshold in such systems, one of which is demonstrated in the present study, seem to be important steps to account for the sequence diversification and increase in molecular complexity during the early evolution of RNA replicators.

      A general weakness is the presentation of models and parameters, whose choices often appear arbitrary. Modeling choices that would deserve to be further discussed include the association of the monomers with the strands and the ensuing polymerization, which are combined into a single association/polymerization reaction (see also below), or the choice to restrict to oligomers of length L = 10. Other models, similar to the one employed here, have been proposed that do not make these assumptions, e.g. Rosenberger et al. Self-Assembly of Informational Polymers by Templated Ligation, PRX 2021. To understand how such assumptions affect the results, it would be helpful to present the model from the perspective of existing models.

      The assumption of one-step polymerization reactions that we used here is a common technique for modelling template replication of sequence-represented replicators [see, e.g., Fontana and Schuster, 1998 (10.1126/science.280.5368.1451), Könnyű et al., 2008 (10.1186/1471-2148-8-267), Vig-Milkovics et al., 2019 (10.1016/j.jtbi.2018.11.020) or Szilágyi et al., 2020 (10.1371/journal.pgen.1009155)]. This is because assuming base-to-base polymerisation of the copy would lead to a very large number of different types of intermediates, which a Gillespie-type stochastic simulation algorithm could not handle in reasonable computation times, even if the sequences were relatively short. For comparison, in our model, where polymerization is one-step, the characteristic time of a simulation for 𝐿 = 10, 𝑁 = 10⁵ and 𝛿 = 0.01 was 552 hours.

      Note that in Rosenberger et al. (PRX 2021), in contrast to a pioneering work [Fernando et al., 2007 (10.1007/s00239-006-0218-4)], sequences of replicators are not represented, which makes this approach completely inapplicable to our case, in which sequence defines the fitness. In sum, we suggest that this valid criticism points to possible future work.

      The values of the (many) parameters, often very specific, also very often lack justifications. For example, why is the "predefined error factor" ε = 0.2 and not lower or higher? How would that affect the results?

      A general remark. For the more important parameters, several values were used to test the behaviour of the model (see Table 1), but due to the considerable number of parameters, it is impossible to examine all possible combinations. 𝑐+ = 1 fixes the timescale, and 𝐿 is set to 10 to obtain reasonable running times (see above).

      𝜀 characterizes how replicability decreases as the number of mutations increases. In the manuscript we used the following default vector: 𝜀 = (0.05, 0.2, 1), in which the third element corresponds to the mutation-free sequence, so it must be 1. The first element determines the baseline replicability (see Methods), which we preferred not to change because it would fundamentally alter the ratio of replication propensities to association and dissociation propensities (as a substantial fraction of the complementary sequences of the master sequences are of baseline replicability) and would thus alter the reaction kinetics to such an extent that the outcome would not be comparable with the original results. Therefore, only the second element can be adjusted. Accordingly, we have analysed the behaviour of the model in the cases of a steeper and a more gradual loss of replicability using the following two vectors, respectively: 𝜀′ = (0.05, 0.05, 1) and 𝜀″ = (0.05, 0.5, 1). The choice of 𝜀′ is chemically more plausible, since for very short oligomers the loss of chemical activity and replicability as a function of the number of mutations can be very sharp. We performed a series of simulations with all possible combinations of 𝛿 = 0.001, 0.005, 0.1 and 𝑁 = 10³, 10⁴, 10⁵ for 𝜀′ and 𝜀″ in the constant population and chemostat model context (36 different runs). For other parameters, we took the default values, see Table 1. These values also correspond to the parameters we used in Figures 2 and 6. The results show that the steeper loss of replicability (𝜀′) slightly increases the diversity-maintaining ability of the system, whereas the more gradual loss of replicability (𝜀″) moderately decreases the diversity-maintaining ability of the system, and that these shifts are more pronounced in the constant population size model (Author response image 1) than in the chemostat model (Author response image 2).

      Altogether, these results confirm that the qualitative outcome of the model is robust in a wide range of loss of replicability (𝜀 vector) values.

      Author response image 1.

      Replicator coexistence in the constant population model with different loss of replicability (𝜀 vector) values. Within a given combination of 𝛿 and 𝑁 parameter values, the upper panel corresponds to the steeper loss of replicability (𝜀′), the middle panel to the default 𝜀 vector (Figure 2A), and the bottom panel to the more gradual loss of replicability vector (𝜀″). Within each 𝛿; 𝑁 parameter combination, the same master sequence set was used with the three different 𝜀 vectors for comparability.

      Author response image 2.

      Replicator coexistence in the chemostat model with different loss of replicability (𝜀 vector) values. Within a given combination of 𝛿 and 𝑁 parameter values, the upper panel corresponds to the steeper loss of replicability (𝜀′), the middle panel to the default 𝜀 vector (Figure 6A), and the bottom panel to the more gradual loss of replicability vector (𝜀″). Within each 𝛿; 𝑁 parameter combination, the same master sequence set was used with the three different 𝜀 vectors for comparability.

      Similarly, in equation (11), where does the factor 0.8 come from?

      This factor scales the decay rate of duplex sequences as a function of the binding energy (𝐸b). The value of 0.8 is an arbitrary choice; the value should be in the interval (0,1) and is only relevant in the chemostat model. It is expected to have a similar effect on the dynamics as the duplex decay factor parameter 𝑓, which we have investigated in a wide range of different values (cf. Table 1, Fig. 6), although 𝑓 is independent of the binding energy (𝐸b): increasing/decreasing the 0.8 factor is expected to decrease/increase the average total population size. We have investigated the diversity-maintaining ability of the system at smaller (0.6) and larger (0.9) parameter values at different population sizes (𝑁 ≈ 10³, 10⁴ and 10⁵) and at different replicability distances (𝛿 = 0.001, 0.005 and 0.01) as shown in Fig. 6. We have found that the number of coexisting master types changes very little in response to changes in this factor. Only two shifts could be detected (underlined): factor 0.9 combined with 𝑁 ≈ 10⁴ and 𝛿 = 0.001 caused the number of surviving master types to decrease by one, while factor 0.9 combined with 𝑁 ≈ 10³ and 𝛿 = 0.01 caused the number of surviving master types to increase by one (Author response table 1). Factor 0.6 produced the same number of surviving types as the default (Author response table 1). In summary, the model shows marked robustness to changes in the values of this parameter.

      Author response table 1.

      Number of coexisting master types in the chemostat model with different binding energy dependent duplex decay rates. Within each 𝛿; 𝑁 parameter combination, the same master sequence set was used with the three different factor values: 0.6, 0.8 (the original) and 0.9 for comparability.

      Why is the kinetic constant for duplex decay reaction 1.15 ⋅ 10⁻⁸?

      Note that this value is the minimum of the duplex decay rate; Table 1 correctly shows the interval of this kinetic constant as: [1.15 ⋅ 10⁻⁸, 6.4 ⋅ 10⁻⁵]. Both values are derived from the basic parameters of the system and can be computed according to Eq. (11). The minimum: as the parameter set corresponding to this value is: . The maximum: with .

      Are those values related to experiments, or are they chosen because specific behaviors can happen only then?

      See above.

      The choice of the model and parameters potentially impact the two main results, the attenuation of the error threshold and the role of GC content:

      Regarding the error threshold, it is also noted (lines 379-385) that it disappears when back mutations are taken into account. This suggests that overcoming the error threshold might not be as difficult as suggested, and can be achieved in several ways, which calls into question the importance of the particular role of parabolic growth. Besides, when the concentration of replicators is low, product inhibition may be negligible, such that a "parabolic replicator" is effectively growing exponentially and an error catastrophe may occur. Do the authors think that this consideration could affect their conclusion? Can simulations be performed?

      The assumption of back mutation only provides a theoretical solution to the error threshold problem: back mutation guarantees a positive (non-zero) concentration of a master type, but, since the probability of back mutation is generally very low, this equilibrium concentration may be extremely low, or negligible for typical system sizes. Consequently, back mutation alone does not solve the problem of the error catastrophe: in our system back mutation is present (the probability that a sequence with 𝑘 errors mutates back to a master sequence is 𝜇^𝑘 (1−𝜇)^(𝐿−𝑘)), and the diversity-maintaining ability is limited. The effect of back mutation decreases exponentially with increasing sequence length.
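
      To make the scaling concrete, the following minimal sketch evaluates the back-mutation probability exactly as quoted above, 𝜇^𝑘 (1−𝜇)^(𝐿−𝑘). The function name and the numerical values of the mutation rate and sequence lengths are illustrative assumptions, not parameters taken from the manuscript.

```python
def back_mutation_probability(mu: float, k: int, length: int) -> float:
    """Return mu**k * (1 - mu)**(length - k), the expression quoted in the response.

    mu     -- per-base mutation probability
    k      -- number of erroneous positions in the mutant
    length -- sequence length L
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    if not 0 <= k <= length:
        raise ValueError("k must lie in [0, length]")
    return mu ** k * (1.0 - mu) ** (length - k)


if __name__ == "__main__":
    mu = 0.01  # assumed per-base mutation rate, for illustration only
    for length in (10, 20, 40):  # assumed sequence lengths
        for k in (1, 2, 3):
            p = back_mutation_probability(mu, k, length)
            print(f"L={length:3d}  k={k}  P(back to master)={p:.3e}")
```

      At fixed 𝑘, each additional base multiplies the probability by (1−𝜇), and each additional error multiplies it by 𝜇/(1−𝜇), which is why the contribution of back mutation becomes negligible for longer sequences.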

      Regarding the role of the GC content, GC-rich oligomers are found to perform the worst but no rationale is provided.

      For GC-rich oligonucleotides the dissociation probability of a template-copy complex is relatively low (cf. Eqs. (9, 10)), thus they have a relatively low number of offspring, cf. lines 557-561: “a relatively high dissociation probability and the consequential higher propensity of being in a simple stranded form provides an advantage for sequences with relatively low GC content in terms of their replication affinity, that is, the expected number of offspring in case of such variants will be relatively high.”. Note that the simulation results shown in Fig. 3A demonstrate the realization of this effect with prepared sequences (along a GC content gradient).

      One may assume that it happens because GC-rich sequences are comparatively slower to release the product. However, it is also conceivable that higher GC content may help in the polymerization of the monomers as the monomers stay attached longer on the template (as described in Eq. (9)). This is an instance where the choice to pull the association and polymerization reactions into a single step independent of GC content may be critical.

      It would be important to show that the result arises from the actual physics and not from this modeling choice.

      Some more specific points that would deserve to be addressed:

      • Line 53: it is said that p "reflects how easily the template-reaction product complex dissociates". This statement is not correct. A reaction order p<1 reflects product inhibition, the propensity of templates to bind to each other, not slow product release. Product release can be limiting, yet a reaction order of 1 can be achieved if substrate concentrations are sufficiently high relative to oligomer concentrations (von Kiedrowski et al., 1991).

      We think the key reference is Von Kiedrowski (1993) in this case. Other things being equal, his Table 1 on p. 134 shows that a sufficient increase in 𝐾4, i.e., the stability of the duplex (template and copy) (association rate divided by dissociation rate) throws the system into the parabolic regime. This is what we had in mind. In order to clarify this, we modified the quoted sentence thus: “In this kinetics, the growth order is equal or close to 0.5 (i.e., the dynamics is sub-exponential) because increased stability of the template-copy complex (rate of association divided by dissociation) promotes parabolic growth (von Kiedrowski et al., 1991; von Kiedrowski & Szathmáry, 2001).”

      • Population size is a key parameter, and a comparison is made between small (10^3) and large (10^5) populations, but without explaining what determines the scale (small/large relative to what?).

      The “small” value (10^3) corresponds to the smallest meaningful population size; significantly smaller population sizes (e.g. 10^2) cannot maintain the 10 master types (or any subset of them) and are chemically unrealistic. The “large” value (10^5) is the largest population size for which simulation times are still acceptable; in the case of 10^6 the runtimes are on the order of months.

      • In the same vein, we might expect size not to be the only important parameter, but also concentration.

      With constant volume, population size and concentration are strictly coupled.

      • Lines 543-546: if understanding correctly, the quantitative result is that the error threshold rises from 0.1 in the exponential case to 0.196 in the parabolic. Are the authors suggesting that a factor of 2 is a significant difference?

      In this paragraph we compared the empirical error threshold of our system (which is close to 𝑝mut = 0.15) with the error threshold of the well-known single peak fitness landscape (which can be approximated by ) as a reference case. To make the message even clearer we have extended the last sentence (lines 596-597) as follows: “but note that applying this approach to our system is a serious oversimplification”. The 0.196 is simply the probability of error-free replication of a sequence when , but we have removed this sentence (“corresponding to the replication accuracy of a master sequence”) from the manuscript as it seems to be confusing.

      • Figure 3C: this figure shows no statistically significant effect?

      Thank you for pointing this out. We statistically tested the hypothesis that the GC content of the surviving and the extinct master subsets differs. This analysis revealed that the differences between these two groups are statistically significant, which we have now included in the manuscript at lines 380-390: “A direct investigation of whether the sequence composition of the master types is associated with their survival outcome was conducted using the data from the constant population model simulation results (Figure 2). In these data, the average GC content was measured to be lower in the surviving master subpopulations than in the extinct subpopulations (Figure 3C). To determine whether this difference was statistically significant, nonparametric, two-sample Wilcoxon rank-sum tests (Hollander & Wolfe, 1999) were performed on the GC content of the extinct-surviving master subsets. The GC content was significantly different between these two groups in all nine investigated parameter combinations of population size (N) and replicability distance (δ) at p<0.05 level, indicating a selective advantage for a lower GC content in the constant population model context. The exact p values obtained from this analysis are shown in Figure 3C.”
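
      For readers unfamiliar with the test named above, here is a minimal sketch of a two-sample Wilcoxon rank-sum test using the large-sample normal approximation. The approximation (without tie or continuity correction) is a simplification chosen for brevity; the response does not state which variant was used. The GC-content values are invented purely for illustration and are not data from the manuscript.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (z, two-sided p) for the rank-sum statistic of sample x."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2.0        # expectation under the null
    var_w = n1 * n2 * (n1 + n2 + 1) / 12.0   # variance under the null (no ties)
    z = (w - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal tail probability
    return z, p


if __name__ == "__main__":
    # Hypothetical GC fractions, for illustration only.
    surviving = [0.40, 0.42, 0.45, 0.47, 0.48]
    extinct = [0.55, 0.58, 0.60, 0.62, 0.65]
    z, p = rank_sum_test(surviving, extinct)
    print(f"z = {z:.3f}, two-sided p = {p:.4f}")
```

      A negative z here means the first sample (surviving types) tends to have lower GC content than the second. For the small subsets involved in practice, an exact test would be preferable to the normal approximation.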

      • line 542: "phase transition-like species extension (Figure 4B)": such a clear threshold is not apparent.

      Thank you for pointing out the incorrect phrasing. As there is no clear threshold in the number of coexisting types as a function of the mutation rate, we removed the “phase transition-like” expression: “However, when finite population sizes and stochastic effects are taken into account, at the largest investigated per-base mutation rate (𝑝mut = 0.15), the summed relative steady-state master frequencies approach zero (Figure 4C) with accelerating species extinction (Figure 4B), indicating that this value is close to the system's empirical error threshold.” (lines 589-594).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      On the whole, the work is well done and presented, there are no major recommendations. It seems a good idea to cite and briefly discuss this recent paper: https://pubmed.ncbi.nlm.nih.gov/36996101/ which develops a symbiotic scenario of the coevolution of primordial replicators and reproducers that appears to be fully compatible with the results of the current work.

      Thank you for bringing this article to our attention. We have inserted the following sentence at lines 621-624: “The demonstrated diversity-maintaining mechanism of finite parabolic populations can be used as a plug-in model to investigate the coevolution of naked and encapsulated molecular replicators (e.g., Babajanyan et al., 2023).”

      The manuscript is well written, but there are some minor glitches that merit attention. For example:

      l. 5 "carriers presents a problem, because product formation and mutual hybridization" - "mutual" is superfluous here, delete

      l. 13 "amplification. In addition, sequence effects (GC content) and the strength of resource" - hardly "effects" - should be 'features' or 'properties'

      l. 41 "If enzyme-free replication of oligomer modules with a high degree of sequence" - "modules" here is only confusing - simply, "oligomers"

      l. 44 "under ecological competition conditions with which distinct replicator types with different" - delete "with" etc, there are many such minor glitches that are best corrected.

      Thank you for pointing these out; we have corrected them. Other drafting errors, glitches and superfluous sentences have also been corrected.

      Reviewer #2 (Recommendations For The Authors):

      None

      Editor (Recommendations For The Authors):

      In the manuscript, it appears that coexistence is assessed at a given point in time, while figures seem to show that it remains time-dependent. It would be great if the authors could clarify this and/or discuss this.

      We appreciate you bringing this to our attention, as we indeed neglected to elaborate on this important point. The steady-state character of the coexistence is assessed in our model in the following way: the relative frequency of each master sequence is tested for the condition of ≥ 100- (cut-off relative frequency for survival) in every 2,000th replication step in the interval between 10,000 replication steps before termination and actual termination (10= replication steps). If the above condition is true more than once, we consider the master type in question as having survived (we have included this explanation in the Methods section: lines 258-268). Although this relatively narrow time interval can still be regarded as a snapshot of the state of the system, in our numerical experience the resulting measure is a reliable quantitative indicator of the apparent stability of species coexistence in the parabolic dynamics.
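
      The survival criterion described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name and data layout are hypothetical, both ends of the 10,000-step window are assumed to be sampled, and the cut-off frequency is left as a required argument because its value is not legible in this response.

```python
from typing import Mapping


def survived(freq_by_step: Mapping[int, float],
             final_step: int,
             cutoff: float,
             window: int = 10_000,
             stride: int = 2_000) -> bool:
    """Decide whether one master type counts as surviving.

    freq_by_step -- relative frequency of the master type, keyed by replication step
    final_step   -- replication step at which the simulation terminated
    cutoff       -- cut-off relative frequency for survival (supplied by the caller)
    window       -- length of the terminal interval that is inspected
    stride       -- sampling interval within that window
    """
    # Checkpoints: final_step - window, ..., final_step (assumed inclusive).
    checkpoints = range(final_step - window, final_step + 1, stride)
    hits = sum(1 for step in checkpoints
               if freq_by_step.get(step, 0.0) >= cutoff)
    # "True more than once" in the response means at least two checkpoints.
    return hits > 1


if __name__ == "__main__":
    final = 100_000        # hypothetical termination step
    cutoff = 1e-3          # hypothetical cut-off, for illustration only
    trace = {90_000: 2e-3, 92_000: 0.0, 94_000: 3e-3,
             96_000: 0.0, 98_000: 0.0, 100_000: 0.0}
    print(survived(trace, final, cutoff))
```

      Requiring more than one hit makes the measure less sensitive to a single transient fluctuation above the cut-off near the end of a run.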

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer 1:

      (1) General comment: The evidence for these highly novel, potentially interesting roles (of the exocyst) would need to be more compelling to support direct involvement.

      We wish to thank the reviewer for his/her comments, and for considering that the proposed functions are highly novel and potentially interesting. To strengthen the evidence supporting the new roles of the exocyst, we have performed a number of additional experiments that are depicted in novel figures or figure panels of the new version of the manuscript. Particularly, we aimed at providing further support of the direct involvement of the exocyst in different steps of the regulated secretory pathway. Please see the details below.

      (2) For instance, the localization of exocyst to Golgi or to granule-granule contact sites does not seem substantial.

      We have performed quantitative colocalization studies, as suggested by the reviewer, to further substantiate our initial findings. We have carefully analysed GFP-Sec15 distribution in relation to the Golgi complex and secretory Glue granules at relevant time points of salivary gland development. Overall, we found that GFP-Sec15 distribution is dynamic during salivary gland development. Before Glue synthesis (72 h AEL), Sec15 was observed in close association (defined as a distance equal to, or less than 0.6 µm) with the Golgi complex (please see below Author response image 1). This association was lost once Glue granules had begun to form (96 h AEL). Importantly, we do not see relevant association between GFP-Sec15 and the ER (please see Author response image 2). These observations support our conclusion that the exocyst plays a role at the Golgi complex. New images supporting these conclusions, as well as quantitative data, have been included in Figure 5 of the new version of the manuscript. In addition, real time imaging, as well as 3D reconstruction analyses, confirming the close association between Sec15 and Golgi cisternae are now included in the manuscript. Please see Supplementary Videos 1-3. These new data are described in text lines 200-210 of the Results section and text lines 359-368 of the Discussion section.

      Interestingly, at the time when Sec15-Golgi association is lost (96 h AEL), Sec15 foci associate instead with newly formed secretory granules (< 1 µm diameter). This association persists during secretory granule maturation (100-116 h AEL), when Sec15 foci localize specifically in between neighbouring, immature secretory granules. When maturation has ended and Glue granule exocytosis begins (116-120 h AEL), this localization between granules is lost. These observations are consistent with a role of the exocyst in homotypic fusion during SG maturation. We have included new images showing that association between Sec15 and secretory granules is dynamic and depends on the developmental stage. We have quantified this association both during maturation and at a stage when SGs are already mature. We have in addition performed a 3D reconstruction analysis of these images to confirm the close association between Sec15 and immature SGs. These new data are now depicted in Figure 7B-C, Supplementary Videos 4-5, and described in text lines 216-221 of the Results section. In addition, a lower magnification image is provided below in this letter (Author response image 3), quantifying the proportion of Sec15 foci localized in between SGs (yellow arrows) relative to the total number of Sec15 foci (yellow arrows + green arrowheads).

      Author response image 1.

      Criteria utilized to define Sec15 foci that were “associated” or “not associated” with the trans-Golgi network in the experiments of Figure 5C-E of the manuscript. When the distance between maximal intensities of GFP-Sec15 and Golgi-RFP signals was equal to or less than 0.6 µm, the signals were considered “associated” (upper panels). When the distance was more than 0.6 µm, the signals were considered “not associated” (lower panels).

      Author response image 2.

      Criteria utilized to define Sec15 foci that were “associated” or “not associated” with the ER in the experiments of Figure 5A-B of the manuscript. When the distance between maximal intensities of GFP-Sec15 and KDEL-RFP signals was equal to or less than 0.6 µm, the signals were considered “associated”. When the distance was more than 0.6 µm, the signals were considered “not associated”.
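
      The distance criterion used in both captions above can be expressed as a short sketch. The 0.6 µm threshold comes from the response; the function names, the coordinate representation of intensity maxima, and the example coordinates are hypothetical and only illustrate the classification rule.

```python
import math
from typing import Iterable, Sequence, Tuple

ASSOCIATION_THRESHOLD_UM = 0.6  # threshold stated in the response

Point = Sequence[float]  # (x, y) or (x, y, z) position of an intensity maximum, in µm


def is_associated(focus: Point, organelle: Point,
                  threshold_um: float = ASSOCIATION_THRESHOLD_UM) -> bool:
    """True if the two intensity maxima lie within threshold_um of each other."""
    return math.dist(focus, organelle) <= threshold_um


def fraction_associated(pairs: Iterable[Tuple[Point, Point]],
                        threshold_um: float = ASSOCIATION_THRESHOLD_UM) -> float:
    """Fraction of (focus, nearest organelle maximum) pairs classed as associated."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no foci supplied")
    hits = sum(is_associated(f, o, threshold_um) for f, o in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    # Hypothetical coordinates in µm, for illustration only.
    example = [((0.0, 0.0), (0.3, 0.4)),   # 0.5 µm apart
               ((1.0, 1.0), (1.0, 1.7)),   # 0.7 µm apart
               ((2.0, 2.0), (2.1, 2.0))]   # 0.1 µm apart
    print(f"{fraction_associated(example):.2f} of foci associated")
```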

      Author response image 3.

      (A) GFP-Sec15 foci (cyan) and SGs (red) are shown in cells bearing immature SGs or (B) mature SGs. Yellow arrows indicate GFP-Sec15 foci localized in between SGs; green arrowheads indicate GFP-Sec15 foci that are not in between SGs. (C) Quantification of the percentage (%) of Sec15 foci localized in between SGs relative to the total number of Sec15 foci in cells filled with immature SGs (ISG) vs cells with mature SGs (MSG).

      It is interesting to mention that previous evidence from mammalian cultured cells (Yeaman et al, 2001) shows that the exocyst localizes both at the trans-Golgi network and at the plasma membrane, weighing in favour of our claim that the exocyst is required at various steps of the exocytic pathway. Thus, the exocyst may play multiple roles in the secretion pathway in other biological models as well. This concept has now been included in the Discussion section of the revised version of the manuscript (lines 359-368).

      To make the conclusions of our work clearer, in the revised version of the manuscript, we have now included a graphical abstract, summarizing the dynamic localization of the exocyst in relation to the processes of SG biogenesis, maturation and exocytosis reported in our work. 

      (3) Instead, it is possible that defects in Golgi traffic and granule homotypic fusion are not due to direct involvement of the exocyst in these processes, but secondary to a defect in canonical exocyst roles at the plasma membrane. A block in the last step of glue exocytosis could perhaps propagate backward in the secretory pathway to disrupt Golgi complexes or cause poor cellular health due to loss of cell polarity or autophagy.

      We thank the reviewer for these thoughtful comments. We have performed a number of additional experiments to assess “cellular health” or to identify possible defects in cell polarity after knock-down of exocyst subunits. These new data have been included in new supplementary figures 5 and 6 of the revised version of the manuscript (please see below). 

      In our view, the precise localization of GFP-Sec15 at the Golgi complex (Figure 5C-E), as well as in between immature secretory granules (Figure 7B-D), argues in favour of a direct involvement of the exocyst in SG biogenesis and homofusion respectively. 

      We truly appreciate the comment of the reviewer raising the possibility that the defects that we observe at early steps of the pathway (SG biogenesis and SG maturation) may actually stem from a backward effect of the role of the exocyst in SG-plasma membrane tethering. We wish to respectfully point out that the processes of biogenesis, maturation and plasma membrane tethering/fusion of SGs do not occur simultaneously in the Drosophila larval salivary gland in vivo, as they do in other secretory model systems (e.g. cell culture). In this regard, the experimental model is unique in terms of synchronization. In each cell of the salivary gland, the three processes (biogenesis, maturation and exocytosis) occur sequentially and are controlled by developmental cues. At the developmental stage when SGs fuse with the plasma membrane, SG biogenesis has already ceased many hours earlier: SG biogenesis occurs at 96-100 hours after egg lay (AEL), SG maturation takes place at 100-112 hours AEL, and SG-plasma membrane fusion happens only when all SGs have undergone maturation and are ready to fuse with the plasma membrane at 116-120 h AEL. Thus, in our view it is not conceivable that a defect in SG-plasma membrane tethering/fusion (116-120 h AEL) may retroactively affect the processes of SG biogenesis or SG maturation, which have occurred earlier in development (96-112 h AEL).

      As suggested by the reviewer, we have analysed several markers of cellular health and cell polarity, comparing conditions of exocyst subunit silencing (exo70RNAi, sec3RNAi or exo84RNAi) with wild type controls (whiteRNAi). These new data are depicted in Supplementary Figures 5 and 6, and described in lines 172-179 of the Results section of the revised version of the manuscript. Notably, for these experiments we have applied silencing conditions that block secretory granule maturation, bringing about mostly immature SGs. Our analyses included: 1) subcellular distribution of PI(4,5)P2, 2) subcellular distribution of the tetraspanin CD63, 3) of Rab11, 4) of filamentous actin, and 5) of CD8. We have also compared 6) nuclear size and general nuclear morphology, 7) the number and distribution of mitochondria, 8) morphology and subcellular distribution of the cis- and 9) trans-Golgi networks. Finally, 10) we have compared basal autophagy in salivary cells with or without knocking down exocyst subunits. The markers that we have analysed behaved similarly to those of control salivary glands, suggesting that the observed defects in regulated exocytosis indeed reflect different roles of the exocyst in the secretory pathway, rather than poor cellular health or impaired cell polarity.

      Our conclusions are in line with previous studies in which apico-basal polarity, Golgi complex morphology and distribution, as well as apical membrane trafficking were also evaluated in exocyst mutant backgrounds, finding no anomalies (Jafar-Nejad et al, 2005). 

      Conversely, in studies in which apical polarity was disturbed by interfering with Crumbs levels, SG biogenesis, maturation and exocytosis were not affected (Lattner et al, 2019), indicating that these processes do not necessarily interfere with one another.

      (4) Final recommendation: In the absence of stronger evidence for these other exocyst roles, I would suggest focusing the study on the canonical role (interesting, as it was previously reported that Drosophila exocyst had no function in the salivary gland and limited function elsewhere [DOI: 10.1034/j.1600-0854.2002.31206.x]), and leave the alternative roles for discussion and deeper study in the future.  

      We appreciate the reviewer's recommendation. However, we believe that the major strength of our work is the discovery of non-canonical roles of the exocyst complex, unrelated to its function as a tethering complex for vesicle-plasma membrane fusion. We believe that in the new version of our manuscript, we provide stronger evidence supporting the two novel roles of the exocyst:

      a) Its participation in maintaining the normal structure of the Golgi complex, and b) Its function in secretory granule maturation.

      Reviewer 2:

      (5) General comment: A key strength is the breadth of the assays and study of all 8 exocyst subunits in a powerful model system (fly larvae). Many of the assays are quantitated and roles of the exocyst in early phases of granule biogenesis have not been ascribed. 

      We are grateful that the reviewer appreciates the novelty of our contribution.

      (6) However, there are several weaknesses, in terms of experimental controls, concrete statements about the granules (better resolution), and making a clear conceptual framework. Namely, why do KD of different exocysts have different effects on presumed granule formation?

      The reviewer has raised a point that is central to the interpretation of all our data throughout the manuscript. The short answer is that the extent of RNAi-dependent silencing of exocyst subunits determines the phenotype: 

      1) Maximum silencing affects Golgi complex morphology and prevents SG biogenesis. 2) Intermediate silencing blocks SG maturation, without affecting Golgi complex morphology and SG biogenesis. 3) Weak silencing blocks SG tethering and fusion with the plasma membrane, without affecting Golgi complex morphology, SG biogenesis or SG maturation. 

      In other words, 1) Low levels of exocyst subunits are sufficient for normal Golgi complex morphology and SG biogenesis. 2) Intermediate levels of exocyst subunits are sufficient for SG maturation (and also sufficient for SG biogenesis). 3) High levels of exocyst subunits are required for SG tethering and subsequent fusion with the plasma membrane. 

      Based on the above notion, we have exploited the fact that temperature can fine-tune the level of Gal4/UAS-dependent transcription, thereby achieving different levels of silencing, as shown by Norbert Perrimon et al in their seminal paper “the level of RNAi knockdown can also be altered by using Gal4 lines of various strengths, rearing flies at different temperatures, or via coexpression of UAS-Dicer2” (Perkins et al, 2015). 

      We found in our system that indeed, by applying appropriate silencing conditions (RNAi line and temperature) to any of the eight subunits of the exocyst, we have been able to obtain one of the three alternative phenotypes: Impaired SG biogenesis, or impaired SG maturation, or impaired SG tethering/fusion with the plasma membrane.

      These concepts are summarized below in Author response image 4. Please see also at point 26, the general comment of Reviewer #3. 

      We have conducted qRT-PCR assays to provide experimental support to the notions summarized above in Author response image 4. We measured the remaining levels of mRNAs of some of the exocyst subunits, after inducing RNAi-mediated silencing at different temperatures, or with different RNAi transgenic lines. The remaining RNA levels after silencing correlate well with the observed phenotypes, following the predictions of Author response image 4 and summarized in Author response image 5. These new data are now shown in Supplementary Figure 2 of the revised version of the manuscript, and described in lines 153-159 at the Results section.
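
      The dosage-threshold reasoning in the response to point 6 can be summarized in a small sketch that maps the remaining exocyst level to the expected phenotype. Only the ordering of the three requirements comes from the response; the function name and the numerical thresholds (expressed as fractions of the wild-type level) are hypothetical placeholders, since no such values are reported.

```python
def expected_phenotype(remaining_level: float,
                       biogenesis_min: float = 0.2,
                       maturation_min: float = 0.5,
                       fusion_min: float = 0.8) -> str:
    """Map the remaining exocyst level (fraction of wild type) to a phenotype.

    The default thresholds are illustrative placeholders. The model only
    requires biogenesis_min < maturation_min < fusion_min: Golgi structure and
    SG biogenesis need the least complex activity, and tethering/fusion with
    the plasma membrane needs the most.
    """
    if not 0.0 <= remaining_level <= 1.0:
        raise ValueError("remaining_level must lie in [0, 1]")
    if not biogenesis_min < maturation_min < fusion_min:
        raise ValueError("thresholds must be strictly increasing")
    if remaining_level < biogenesis_min:
        # Strongest silencing: Golgi morphology and SG biogenesis fail.
        return "impaired Golgi morphology / SG biogenesis"
    if remaining_level < maturation_min:
        # Intermediate silencing: granules form but do not mature.
        return "impaired SG maturation"
    if remaining_level < fusion_min:
        # Weak silencing: mature granules fail to tether/fuse.
        return "impaired SG tethering/fusion with the plasma membrane"
    return "wild-type secretion"


if __name__ == "__main__":
    for level in (0.1, 0.35, 0.65, 0.95):  # hypothetical remaining levels
        print(f"{level:.2f} -> {expected_phenotype(level)}")
```

      Because the earliest block in the pathway masks all later ones, each level yields exactly one observable phenotype, which matches the mutually exclusive phenotypes described in the response.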

      (7) Why does just overexpression of a single subunit (Sec15) induce granule fusion?

      The reviewer raises a very important point. Based on available data from the literature, Sec15 behaves as a seed for assembly of the holocomplex and it also mediates the recruitment of the holocomplex to SGs through its interaction with Rab11 (Escrevente et al, 2021; Bhuin and Roy, 2019; Wu et al, 2005; Zhang et al, 2004; Guo et al, 1999). Thus, overexpression of Sec15 is expected to enhance exocyst assembly, thereby potentiating the activities carried out by the complex in the cell, including SG homofusion. In the revised version of the manuscript we have also performed the overexpression of Sec8, finding that, unlike Sec15, Sec8 fails to induce homotypic fusion. These results were expected, as they confirm that Sec8 does not behave as a seed for mounting the whole complex. These new data have been included in Figure 7E-H, and are described in text lines 221-229 of the Results section. 

      Author response image 4.

      Conceptual model of RNAi expression at different temperatures, remaining mRNA/protein levels, and phenotypes obtained at each temperature.

      Author response image 5.

      qRT-PCR assays presented in Supplementary Figure 2 are shown in combination with the phenotypes observed at each of the conditions analyzed. Note the correlation between phenotypes and the extent of mRNA downregulation.

      (8) While the paper is fascinating, the major comments need to be addressed to make better sense of this work, in which at present it is hard to disentangle direct vs. secondary effects, especially as much of the TGN seems to be altered in the KDs.

      We hope that our response to point 6) has helped to clarify this important point raised by the Reviewer. After applying silencing conditions where normal structure of the trans-Golgi network is impaired, SG biogenesis does not occur. Thus, since SGs do not form, it is not conceivable to detect defects in SG maturation or SG fusion with the plasma membrane in the same cell.

      (9) The authors conveniently ascribe many of the results to the holocomplex, but their own data (Fig. 4 and Fig. 6) are at odds with this.

      This is another central point of our work, so we thank the reviewer for his/her comment. In Figures 4A, 7A and 9A of the revised version of the manuscript, we show that, by inducing appropriate levels of silencing of any of the 8 subunits of the exocyst, each of the three alternative phenotypic manifestations can occur. In our opinion, this argues in favour of a function for the whole exocyst complex in each of the three specific activities proposed in our study: 1) SG biogenesis, 2) SG maturation, and 3) SG tethering/fusion with the plasma membrane. In detailed characterizations of these three phenotypes performed throughout the study, we decided to induce silencing of just two or three of the subunits of the exocyst, assuming that the whole complex accounts for the mechanisms involved.

      Major comments

      (10) Resolution not sufficient. Identification of "mature secretory granules" (MSG) in Fig. 3 is based on low-resolution images in which the MSG are not clearly seen (see control in Fig. 3A) and rather appear as a diffuse haze, and not as clear granules. There may be granules here, but as shown it is not clear. Thus it would be helpful to acquire images at higher resolution (at the diffraction limit, or higher) to see and count the MSG.

      We thank the reviewer for raising this point, as it may not be straightforward for the reader to identify the SGs throughout the figures of our study. To make it clearer, in Figure 3A (magnified insets on the right), we have delimited individual SGs with a green dotted line, and included diagrams (far right), which we hope will help the identification of SGs. In Figure 3B, we show that after silencing Exo84, a mosaic phenotype was observed: in some cells SGs fail to undergo maturation, and remain smaller than normal. In other cells of this mosaic phenotype, biogenesis of SGs was impaired and the fluorescent cargo remained trapped in a mesh-like structure (which we later show corresponds to the ER). The dotted line marks individual SGs, and the diagrams included on the right are intended to help the interpretation of the phenotype. The mesh-like structures where Sgs3-GFP was retained are also marked with a dotted line, and schematized on the right. These new schemes are described in the Figure 3 caption of the revised version of the manuscript.

      We wish to mention that all the confocal images depicted in this figure and throughout the manuscript have been captured at high resolution, with a theoretical resolution limit of 168-177 nm (d = λ/2NA). Given that secretory granules range from 0.8-7 µm in diameter, the resolution is more than sufficient to clearly resolve these structures.

      (11) Note: the authors are not clear on which objective was used (maybe the air objective, as the resolution appears poor).

      In this particular figure, we have utilized a Plan-Apochromat 63X/1.4NA oil objective of the inverted Carl Zeiss LSM 880 confocal microscope (mentioned in materials and methods).
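
      As a quick check of the resolution figures given in the response to point 10, the sketch below evaluates the Abbe relation d = λ/(2·NA) for the 1.4 NA objective named above. The function names are hypothetical, and the 488 nm wavelength is an assumption (a common choice for GFP imaging); the response does not state which wavelengths were used.

```python
def abbe_limit_nm(wavelength_nm: float, numerical_aperture: float) -> float:
    """Lateral resolution limit d = wavelength / (2 * NA), in nanometres."""
    if wavelength_nm <= 0 or numerical_aperture <= 0:
        raise ValueError("wavelength and NA must be positive")
    return wavelength_nm / (2.0 * numerical_aperture)


def wavelength_for_limit_nm(limit_nm: float, numerical_aperture: float) -> float:
    """Invert the Abbe relation: wavelength that yields a given limit."""
    if limit_nm <= 0 or numerical_aperture <= 0:
        raise ValueError("limit and NA must be positive")
    return limit_nm * 2.0 * numerical_aperture


if __name__ == "__main__":
    na = 1.4  # Plan-Apochromat 63X/1.4NA oil objective, as stated in the response
    # Assumed wavelength, for illustration only.
    print(f"d at 488 nm: {abbe_limit_nm(488.0, na):.1f} nm")
    # Wavelengths implied by the quoted 168-177 nm range at this NA.
    for d in (168.0, 177.0):
        print(f"d = {d:.0f} nm corresponds to {wavelength_for_limit_nm(d, na):.1f} nm light")
```

      Under these assumptions the quoted limits correspond to visible light of roughly 470-496 nm, and the smallest granules mentioned (0.8 µm) are several times larger than the limit.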

      (12) They need to prove that the diffuse Sgs3-GFP haze is indeed due to MSG.  

      If we interpret the concern of the reviewer correctly, what he/she calls “diffuse haze” is actually the distribution of Sgs3-GFP within individual SGs, which, as previously reported by other authors, is not homogeneous at this stage (Syed et al. 2022). We hope that the diagrams that we have included in Figure 3 A, B (point 10) will help readers interpret the images.

      (13) Relatedly, it is unclear which granule structures correspond to immature secretory granules (ISG) and to cells with mesh-like structures (MLS).

      We are confident that the diagrams now included in Figure 3A and B will help the interpretation, and particularly to identify immature granules and the mesh-like structure generated after silencing of exocyst subunits.

      (14) Similarly, Sgs3 images of KD of 8 exocyst subunits were interpreted to be identical, in Fig. 4, but the resolution is poor.

      We hope that the issue related to resolution of our images has been properly addressed in the response to point 10) of this letter. In Figure 4A, we show that after silencing of any of the 8 subunits (with the appropriate conditions), in all cases SG biogenesis was impaired, and Sgs3-GFP was instead retained in a mesh-like structure. Images obtained after silencing different exocyst subunits are of course not identical, but in all cases, a mesh-like structure has replaced the formation of SGs (Figure 4A). Hopefully, the diagrams now included in Figure 3A and B help the correct interpretation of the phenotypes throughout the study.

      To demonstrate that the structure in which Sgs3-GFP was retained upon exocyst complex knockdown corresponds to the ER, we performed a colocalization analysis between Sgs3-GFP and the ER markers GFP-KDEL or Bip-sfGFP-HDEL, after which we calculated the Pearson's coefficient, which indicated substantial colocalization (Figure 4B-G and Supplementary Figures 7 and 8). These new data are described in lines 196-199 of the revised version of the manuscript. To facilitate the visualization of the results, in the revised version of the manuscript we have included magnified cropped areas of the images shown in Figure 4A.
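
      For reference, the Pearson's coefficient used in colocalization analysis is the ordinary correlation coefficient computed over paired pixel intensities from two channels. The sketch below is a generic illustration with invented intensity values; it is not the authors' analysis pipeline, and real analyses typically also involve background subtraction and region-of-interest selection, which are omitted here.

```python
import math
from typing import Sequence


def pearson_coefficient(channel_a: Sequence[float],
                        channel_b: Sequence[float]) -> float:
    """Pearson correlation between paired pixel intensities of two channels."""
    if len(channel_a) != len(channel_b):
        raise ValueError("channels must contain the same number of pixels")
    n = len(channel_a)
    if n < 2:
        raise ValueError("at least two pixels are required")
    mean_a = sum(channel_a) / n
    mean_b = sum(channel_b) / n
    # Sums of centred products and squares (the 1/n factors cancel in the ratio).
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(channel_a, channel_b))
    var_a = sum((a - mean_a) ** 2 for a in channel_a)
    var_b = sum((b - mean_b) ** 2 for b in channel_b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("a channel with constant intensity has no defined correlation")
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    # Hypothetical flattened pixel intensities, for illustration only.
    cargo_channel = [10.0, 50.0, 200.0, 180.0, 20.0, 90.0]
    er_marker_channel = [12.0, 60.0, 190.0, 170.0, 25.0, 100.0]
    r = pearson_coefficient(cargo_channel, er_marker_channel)
    print(f"Pearson's r = {r:.3f}")
```

      Values near 1 indicate that the two signals rise and fall together across pixels, values near 0 indicate no linear relationship, and negative values indicate mutual exclusion.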

      (15) What is remarkable is a highly variable effect of different subunit KD on the percentage of cells with MLS (Fig. 4C). Controls = 100 %, Exo70=~75% (at 19 deg), Sec3 = ~30%, Sec10 = 0%, Exo84 = 100% ... This is interesting because the functional exocyst is an octameric holocomplex, so why the huge subunit variability in the phenotypes? The trivial explanation is either: i) variable exocyst subunit KD (not shown) or ii) variability between experiments (no error bars are shown). Both should be addressed by quantification of the KD of different proteins and secondly by replicating the experiments.

      We agree with the reviewer's statement. We believe that both variability of KD efficiency (i) and variability between experiments (ii) contribute to the variable effect observed after knocking down the different subunits. As detailed in the response to point 6), we have performed qRT-PCR determinations to confirm that the severity of the phenotype depends on the efficiency of RNAi-mediated silencing. We chose to analyse in detail the effect on the subunits exo70 and sec3, which were those with the highest phenotypic differences between the three silencing temperatures utilized. We found that, as expected, the levels of silencing were temperature-dependent, being higher at 29°C and lower at 19°C. These data were included in Supplementary Figure 2, described in lines 153-159 of the Results section, and also summarized in Author response images 4 and 5 of this rebuttal letter.

      We thank the reviewer for his/her comment on the replication of experiments and statistics. We failed to include detailed numerical information in the original submission, such as the number of replicates and standard deviations of the data depicted in Figure 3C and Supplementary Figure 1, so we apologize for this omission. In the revised version of the manuscript, we have included a table (Supplementary Table 3) in which all the raw data of Figure 3C and Supplementary Figure 1, including standard deviations, are now depicted.

      (16) If their data holds up then the underlying mechanism here needs to be considered.

      (Note: there is some precedent from the autophagy field of differential exocyst effects)

      Our proposed mechanism is essentially that the holocomplex is required for multiple processes along the secretory pathway. Each of these actions (Golgi structure maintenance, SG maturation and SG tethering/fusion with the plasma membrane) requires a different amount of holocomplex activity, which is why each phenotype manifests at a different level of RNAi-mediated silencing (Author response image 4 of this letter). The model predicts that Golgi structure maintenance requires minimal levels of complex activity, and that is why strong knock-down of exocyst subunits is required to obtain this phenotype. In line with our results, it has been reported that other tethering complexes of the CATCHR family are also required for maintaining Golgi cisternae stuck together (D'Souza et al, 2020; Khakurel and Lupashin, 2023; Liu et al, 2019). One possibility is that the exocyst may play a redundant role in the maintenance of the normal structure of the Golgi complex, along with other CATCHR complexes. This potential redundancy could explain why severe exocyst knock-down is required to observe structural anomalies at this organelle. On the other end of the spectrum, we propose that tethering/fusion with the plasma membrane is very susceptible to even a slight reduction of complex activity, so that mild RNAi-mediated silencing is sufficient to provoke defects in this process. This proposed model is depicted in Author response image 4 and discussed in lines 395-405 of the Discussion section.

      (17) In the salivary glands the authors state that the exocyst is needed for Sgs3-GFP exit from the ER. First, Pearson's coefficient should be shown so as to quantitate the degree of ER localizations of all KDs.

      We thank the reviewer for this comment that helped us to strengthen the observation that when SG biogenesis is impaired, Sgs3-GFP remains trapped in the ER. In the revised version of the manuscript, we have calculated Pearson's coefficient to assess colocalization between ER markers (GFP-KDEL or Bip-sfGFP-HDEL) and Sgs3-GFP in salivary gland cells that express sec15RNAi. The Pearson's coefficient was around 0.6 for both ER markers, indicating that colocalization with Sgs3-GFP was substantial (Supplementary Figure 8, text lines 196-199 of the Results section).
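
For readers unfamiliar with the metric, the following is a minimal sketch of how a Pearson colocalization coefficient between two image channels can be computed. It is an illustration only and not the analysis pipeline used for the manuscript. The function name `pearson_colocalization`, the optional `mask` argument and the synthetic test images are all assumptions introduced here.

```python
import numpy as np

def pearson_colocalization(channel_a, channel_b, mask=None):
    """Pearson's coefficient between two same-shaped intensity images.

    `mask` (hypothetical helper argument) is an optional boolean array that
    restricts the calculation to a region of interest, e.g. a single cell.
    """
    a = np.asarray(channel_a, dtype=float)
    b = np.asarray(channel_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("channels must have the same shape")
    if mask is None:
        mask = np.ones(a.shape, dtype=bool)
    a = a[mask]
    b = b[mask]
    # Centre each channel on its mean intensity inside the mask.
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    denom = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    if denom == 0:
        raise ValueError("a channel has zero variance inside the mask")
    return float((a_centered * b_centered).sum() / denom)

# Synthetic example: a "cargo" channel that partly follows an "ER" channel.
rng = np.random.default_rng(0)
er_marker = rng.random((64, 64))
cargo = 0.6 * er_marker + 0.4 * rng.random((64, 64))
print(pearson_colocalization(er_marker, cargo))
```

The coefficient ranges from -1 to 1. Values near 1 indicate that pixel intensities in the two channels rise and fall together, and values near 0 indicate no linear relationship.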

      (18) Second, there should be some rescue performed (if possible) to support specificity. 

      As suggested by the reviewer, we have performed a rescue experiment of the phenotype provoked by the expression of sec15 RNAi, which consisted of the retention of Sgs3-GFP in the endoplasmic reticulum: expression of Sec15-GFP substantially reverted the ER retention phenotype, rescuing SG biogenesis and also SG maturation in most cells (over 60% of the cells). These new data are now shown in Supplementary Figure 4, and described in lines 168-171 of the Results section.

      (19) Third, importantly other proteins that should traffic to the PM need to be shown to traffic normally so as to rule out a non-specific effect.

      We have addressed this issue (also mentioned by Reviewer #1), by analyzing the localization of a number of polarization markers, finding that the overall polarization of the cell was not affected by loss of function of exocyst subunits. Please, see our response to the point 3) raised by Reviewer #1. The new data showing cell polarization markers are shown in Supplementary Figure 6 of the revised version of the manuscript, and described on text lines 172-179 of the Results section.

      (20) It is unclear from their model (Fig. 5) why after exocyst KD of Sec15 the cis-Golgi is more preserved than the TGN, which appears as large vacuoles. This is not quantitated and not shown for the 8 subunits.

      We thank the reviewer for this relevant comment. We agree that the phenotype of either sec15 or sec3 loss-of-function cells manifests differently with cis-Golgi and trans-Golgi markers. While the cis-Golgi marker looked fragmented and aggregated, the trans-Golgi marker adopted a swollen appearance. However, in our view, the different appearance of the two markers does not necessarily imply that one compartment is more preserved than the other. In the revised version of the manuscript, we have quantified the penetrance of the phenotypes provoked by sec15 or sec3 silencing, using both cis-Golgi and trans-Golgi markers. In both cases, the penetrance was high, although even higher with the trans-Golgi marker. These new data are now depicted in Supplementary Figure 9 of the revised version of the manuscript.

      It is interesting to mention that in HeLa cells, as well as in the retinal epithelial cell line hTERT, Golgi phenotypes similar to those we have described here have been reported after loss-of-function of other tethering complexes, which were shown to maintain the Golgi cisternae stuck together, including the COG and GARP complexes (D'Souza et al, 2020; Khakurel and Lupashin, 2023; Liu et al, 2019). As we did throughout our work, not every aspect of the analysis included the silencing of all eight subunits. In this case, we chose to silence Sec3 and Sec15. Please note that we have modified the model depicted in Figure 6E-F, to highlight the cis- and trans-Golgi phenotypes upon exocyst knock-down, as well as the localization of the exocyst in cisternae of the Golgi complex.

      (21) Acute/Chronic control: It would be nice to acutely block the exocyst so as to better distinguish if the effects observed are primary or secondary effects (e.g. on a recycling pathway).

      We thank the reviewer for raising this important issue. To address this point, and to be able to induce silencing of exocyst subunits at specific time intervals of larval development, we utilized a strategy based on a thermosensitive variant of the Gal4 inhibitor Gal80 (Gal80ts) (Lee and Luo, 1999). We blocked Gal4 activity (and therefore RNAi expression) by maintaining the larvae at 18 °C during the 1st and 2nd instars (until 120 hours after egg lay), and then induced the activity of Gal4 specifically at the 3rd larval instar by raising the temperature to 29 °C, a condition in which Gal80ts becomes inactive. After silencing the expression of sec3 or sec15 at the 3rd larval instar only, the phenotype was very similar to that observed after chronic silencing of exocyst subunits (larvae maintained at 29 °C throughout development, where Gal4 was never inhibited). These observations suggest that the defects observed in the secretory pathway after knock down of exocyst subunits reflect genuine functions of the exocyst in this pathway, rather than a secondary effect derived from impaired development of the salivary glands at early larval stages. These new results are now shown in Supplementary Figure 3, and described in manuscript lines 160-171 of the Results section.

      (22) Granule homotypic fusion. Strangely over-expression of just one subunit, Sec15-GFP, made giant secretory granules (SG) that were over 8 microns big! Why is that, especially if normally the exocyst is normally a holocomplex. Was this an effect that was specific to Sec15 or all exocyst subunits? Is the Sec15 level rate limiting in these cells? It may be that a subcomplex of Sec15/10 plays earlier roles, but in any case this needs to be addressed across all (or many) of the exocyst subcomplex members.

      Please, see our response to point 7) of this letter. Sec15 is believed to act as a seed for the formation of the whole complex.

      (23) In summary, there are clearly striking effects on secretory granule biogenesis by dysfunction of the exocyst, however right now it is hard to disentangle effects on ER-Golgi traffic, loss of the TGN, and a problem in maturation or fusion of granules.

      As discussed in detail in our response to the point 3 raised by Reviewer #1, the secretory pathway is highly synchronized in each of the cells of the Drosophila salivary gland. SG biogenesis, SG maturation and SG fusion with the plasma membrane never occur simultaneously in the same cell. Thus, in a cell in which ER-Golgi traffic is impaired (and SG biogenesis does not occur), SGs do not exist, and therefore, they cannot exhibit defects in the process of maturation or fusion with the plasma membrane. In summary, we believe that our work has shown that in Drosophila larval salivary glands the exocyst holocomplex is required for (at least) three functions along the secretory pathway: 1) To maintain the appropriate Golgi complex architecture, thus enabling ER-Golgi transport; 2) For secretory granule maturation: both homotypic fusion and acquisition of maturation factors; 3) For secretory granule exocytosis: secretory granule tethering to enable subsequent fusion with the plasma membrane. As mentioned above (point 6 of this letter), these three functions require different amounts of the holocomplex, and therefore can be revealed by inducing different levels of silencing.

      (24) It is also confusing if the entire exocyst holocomplex or subcomplex plays a key role 

      The fact that, by silencing any of the subunits (with the appropriate conditions), it is possible to obtain any of the 3 phenotypes (impaired SG biogenesis, impaired SG maturation or impaired SG fusion with the plasma membrane) argues in favour of a function of the complex as a whole in each of these three functions.

      Reviewer 3:

      (25) General comment: Freire and co-authors examine the role of the exocyst complex during the formation and secretion of mucins from secretory granules in the larval salivary gland of Drosophila melanogaster. Using transgenic lines with a tagged Sgs3 mucin the authors KD expression of exocyst subunit members and observe a defect in secretory granules with a heterogeneity of phenotypes. By carefully controlling RNAi expression using a Gal4-based system the authors can KD exocyst subunit expression to varying degrees. The authors find that the stronger the inhibition of expression of exocyst the earlier in the secretory pathway the defect. The manuscript is well written, the model system is physiological, and the techniques are innovative.

      We appreciate the reviewer's assessment of our work.

      (26) My major concern is that the evidence underlying the fundamental claim of the manuscript that "the exocyst complex participates" in multiple secretory processes lacks direct evidence.

      We thank the reviewer for raising this important issue. We believe that the analysis of Sec15 subcellular localization during salivary gland development (Figures 5, 7B-D and 9E-F), in combination with the detailed analysis of the phenotypes provoked by loss-of-function of each of the exocyst subunits, provides evidence supporting multiple functions of the exocyst in the secretory pathway. We have also included 3D reconstructions and videos of GFP-Sec15 colocalization with Golgi and SG markers to support exocyst localization associated with these structures (Supplementary Videos 1-7; text lines 200-210, 216-221 and 303-305).

      (27) It is clear from multiple lines of evidence, which are discussed by the authors, that exocyst is essential for an array of exocytic events. The fundamental concern is that loss of homeostasis on the plasma membrane proteome and lipidome might have severe pleiotropic effects on the cell.

      We agree with the reviewer that this is an important point that needed to be addressed. As discussed in detail above at the response to point 3 raised by Reviewer #1, we have analysed several plasma membrane markers (including a PI(4,5)P2 lipid reporter), and found that overall, plasma membrane integrity and polarity were not substantially affected (Supplementary Figure 6). In addition, we have analyzed several markers of general cellular “health” that indicate that salivary gland cells do not seem to be distressed by the reduction of exocyst complex activity (Supplementary Figure 5). These new data are described in lines 172-179 of the Results section.

      (28) Perhaps the authors have more evidence that exocyst is important for homeotypic fusion of the SGs, as supported by the localisation of Sec15 on the fusion sites.

      We believe that the fact that, by silencing any of the exocyst subunits (with the appropriate conditions), immature smaller-than-normal granules were observed, argues in favour of the exocyst as a whole participating in SG homofusion (Figure 7A). In addition, we have included more images, quantifications, 3D reconstructions and videos of GFP-Sec15 localized just at the contact sites between immature SGs. We have quantified and compared GFP-Sec15 localization at immature SGs vs its localization at mature SGs, finding that it localizes preferentially at immature SGs, supporting a role of the exocyst as a tethering complex during homotypic fusion (shown in Figure 7B-C and Supplementary Videos 4-6, and described in lines 216-221 of the Results section). Please see also our response to the point 2 raised by reviewer 1 in this rebuttal letter, and Author response image 3 above in this letter.

      (29) The second question that I think is important to address is, what exactly do the varying RNAi levels correspond to in terms of experiments, and have these been validated? Due to the fundamental claim being that the severity of the phenotype being correlated with the level of KD, I think validation of this model is absolutely essential.  

      We thank the Reviewer for raising this important point, and agree it was lacking in the original version of our manuscript. As discussed in our response to the point 6) raised by Reviewer #2, we have performed qRT-PCR determinations for exo70 and sec3 mRNA levels after inducing silencing of these subunits at different temperatures, or with different RNAi transgenic lines. The remnant mRNA levels correlate well with the observed phenotypes. Please see Supplementary Figure 2 of the revised manuscript, and Author response image 5 of this rebuttal letter; described in lines 155-159 of the Results section. 

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      -  The authors assert in the discussion that exocyst involvement in constitutive secretion is well documented. This is based on a very recent study in mammalian culture cells. Therefore, I would not dismiss the issue as completely settled. Furthermore, a previous study of Drosophila sec10 reported no roles outside the ring gland (DOI: 10.1034/j.1600-0854.2002.31206.x).

      We have included these observations in the Discussion section. Lines 326-329.

      -  A salivary gland screening by Julie Brill's lab reported exocyst components as hits (DOI: 10.1083/jcb.201808017).

      We have referred to this paper in the Discussion section. Lines 326-329.

      -  It should be explained in more detail what is measured in graphs 7C, F, and others quantifying fluorescence around secretory granules. Looking at the images, the decrease in Rab1 and Rab11 seems less convincing.

      We have made a clearer description of how fluorescence intensity was measured in the Methods section lines 558-561. Also, we have uploaded a source data file in which the raw data of each experiment used for quantifications are disclosed. 

      Please note that the data indicates that Rab11 levels are higher in sec5 (Figure 8J-L) and sec3 (supplementary Figure 11M-R).

      Reviewer #2 (Recommendations For The Authors):

      No major issues.

      Writing - The authors should better frame their interpretations of other studies of the exocyst that include the role in autophagy, Palade body trafficking, and differential roles of the subunits.

      We have discussed these specific points in the Discussion section, lines 348-355 and 409-410.

      Minor - Fig. 6A: Why are variable temperatures (19-29 deg C used for the 8 KD experiments)?

      Please show it all at the same temperature (control too).

      The need for the usage of specific temperatures to obtain specific phenotypes with each of the RNAi lines used was explained in point 6 of this letter.

      Reviewer #3 (Recommendations For The Authors):

      In the abstract, the authors refer to the exocytic process and go on to describe secretory granule biogenesis and exocytosis. However, there are many exocytic processes aside from secretory granule biogenesis, and I think the authors should clarify this.

      Corrected in the Abstract. Lines 19-21

      Page 17 Thomas, 2021 reference, there is a glitch with the reference.

      Thanks for noticing. Fixed.

      References

      Bhuin T, Roy JK. Developmental expression, co-localization and genetic interaction of exocyst component Sec15 with Rab11 during Drosophila development. Exp Cell Res. 2019 Aug 1;381(1):94-104. doi: 10.1016/j.yexcr.2019.04.038. Epub 2019 May 7. PMID: 31071318.

      D'Souza Z, Taher FS, Lupashin VV. Golgi inCOGnito: From vesicle tethering to human disease. Biochim Biophys Acta Gen Subj. 2020 Nov;1864(11):129694. doi: 10.1016/j.bbagen.2020.129694. Epub 2020 Jul 27. PMID: 32730773; PMCID: PMC7384418.

      Escrevente C, Bento-Lopes L, Ramalho JS, Barral DC. Rab11 is required for lysosome exocytosis through the interaction with Rab3a, Sec15 and GRAB. J Cell Sci. 2021 Jun 1;134(11):jcs246694. doi: 10.1242/jcs.246694. Epub 2021 Jun 8. PMID: 34100549; PMCID: PMC8214760.

      Guo W, Roth D, Walch-Solimena C, Novick P. The exocyst is an effector for Sec4p, targeting secretory vesicles to sites of exocytosis. EMBO J. 1999 Feb 15;18(4):1071-80. doi: 10.1093/emboj/18.4.1071. PMID: 10022848; PMCID: PMC1171198.

      Jafar-Nejad H, Andrews HK, Acar M, Bayat V, Wirtz-Peitz F, Mehta SQ, Knoblich JA, Bellen HJ. Sec15, a component of the exocyst, promotes notch signaling during the asymmetric division of Drosophila sensory organ precursors. Dev Cell. 2005 Sep;9(3):351-63. doi: 10.1016/j.devcel.2005.06.010. PMID: 16137928.

      Khakurel A, Lupashin VV. Role of GARP Vesicle Tethering Complex in Golgi Physiology. Int J Mol Sci. 2023 Mar 23;24(7):6069. doi: 10.3390/ijms24076069. PMID: 37047041; PMCID: PMC10094427.

      Lattner J, Leng W, Knust E, Brankatschk M, Flores-Benitez D. Crumbs organizes the transport machinery by regulating apical levels of PI(4,5)P2 in Drosophila. Elife. 2019 Nov 7;8:e50900. doi: 10.7554/eLife.50900. PMID: 31697234; PMCID: PMC6881148.

      Lee T, Luo L. Mosaic analysis with a repressible cell marker for studies of gene function in neuronal morphogenesis. Neuron. 1999 Mar;22(3):451-61. doi: 10.1016/s0896-6273(00)80701-1. PMID: 10197526.

      Liu S, Majeed W, Grigaitis P, Betts MJ, Climer LK, Starkuviene V, Storrie B. Epistatic Analysis of the Contribution of Rabs and Kifs to CATCHR Family Dependent Golgi Organization. Front Cell Dev Biol. 2019 Aug 2;7:126. doi: 10.3389/fcell.2019.00126. PMID: 31428608; PMCID: PMC6687757.

      Perkins LA, Holderbaum L, Tao R, Hu Y, Sopko R, McCall K, Yang-Zhou D, Flockhart I, Binari R, Shim HS, Miller A, Housden A, Foos M, Randkelv S, Kelley C, Namgyal P, Villalta C, Liu LP, Jiang X, Huan-Huan Q, Wang X, Fujiyama A, Toyoda A, Ayers K, Blum A, Czech B, Neumuller R, Yan D, Cavallaro A, Hibbard K, Hall D, Cooley L, Hannon GJ, Lehmann R, Parks A, Mohr SE, Ueda R, Kondo S, Ni JQ, Perrimon N. The Transgenic RNAi Project at Harvard Medical School: Resources and Validation. Genetics. 2015 Nov;201(3):843-52. doi: 10.1534/genetics.115.180208. Epub 2015 Aug 28. PMID: 26320097; PMCID: PMC4649654.

      Wu S, Mehta SQ, Pichaud F, Bellen HJ, Quiocho FA. Sec15 interacts with Rab11 via a novel domain and affects Rab11 localization in vivo. Nat Struct Mol Biol. 2005 Oct;12(10):879-85. doi: 10.1038/nsmb987. Epub 2005 Sep 11. PMID: 16155582.

      Yeaman C, Grindstaff KK, Wright JR, Nelson WJ. Sec6/8 complexes on trans-Golgi network and plasma membrane regulate late stages of exocytosis in mammalian cells. J Cell Biol. 2001 Nov 12;155(4):593-604. doi: 10.1083/jcb.200107088. Epub 2001 Nov 5. PMID: 11696560; PMCID: PMC2198873.

      Zhang XM, Ellis S, Sriratana A, Mitchell CA, Rowe T. Sec15 is an effector for the Rab11 GTPase in mammalian cells. J Biol Chem. 2004 Oct 8;279(41):43027-34. doi: 10.1074/jbc.M402264200. Epub 2004 Jul 29. PMID: 15292201.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The paper proposes an interesting perspective on the spatio-temporal relationship between FC in fMRI and electrophysiology. The study found that while similar network configurations are found in both modalities, there is a tendency for the networks to spatially converge more commonly at synchronous than asynchronous time points. However, my confidence in the findings and their interpretation is undermined by an apparent lack of justification for the expected outcomes for each of the proposed scenarios, and in the analysis pipeline itself.

      Main Concerns

      (1) Figure 1 makes sense to me conceptually, including the schematics of the trajectories, i.e.

      Scenario 1: Temporally convergent, same trajectories through connectome state space

      Scenario 2: Temporally divergent, different trajectories through connectome state space

      However, based on my understanding I am concerned that these scenarios do not necessarily translate into the schematic CRP plots shown in Figure 2C, or the statements in the main text:

      For Scenario 1: "epochs of cross-modal spatial similarity should occur more frequently at on-diagonal (synchronous) than off-diagonal (asynchronous) entries, resulting in an on-/off-diagonal ratio larger than unity"

      For Scenario 2: "epochs of spatial similarity could occur equally likely at on-diagonal and off-diagonal entries (ratio≈1)"

      Where do the authors get these statements and the schematics in Figure 2C from? Are they based on previous literature, theory, or simulations?

      I am not convinced based on the evidence currently in the paper, that the ratio of off- to on-diagonal entries (and under what assumptions) is a definitive way to discriminate between scenarios 1 and 2.

      For example, what about the case where the same network configuration reoccurs in both modalities at multiple time points? It seems to me that one would get a CRP with entries occurring equally on the on-diagonal as on the off-diagonal, regardless of whether the dynamics are matched between the two modalities or not (i.e. regardless of scenario 1 or 2 being true).

      This thought experiment example might have a flaw in it, and the authors might ultimately be correct, but nonetheless, a systematic justification needs to be provided for using the ratio of off- to on-diagonal entries to discriminate between scenarios 1 and 2 (and under what assumptions it is valid).

      In the absence of theory, a couple of ways I can think of to gain insight into this key aspect are:

      (1) Use surrogate data for scenarios 1 and 2:

      a. For scenario 1: Run the CRP using a single modality. E.g. feed in the EEG into the analysis as both modality 1 AND modality 2. This should provide at least one example of CRP under scenario 1 (although it does not ensure that all CRPs under this scenario will look like this, it is at least a useful sanity check)

      b. For scenario 2: Run the CRP using a single modality plus a shuffled version. E.g. feed in the EEG into the analysis as both modality 1 AND a temporally shuffled version of the EEG as modality 2. The temporal shuffling of the EEG could be done by simply splitting the data into blocks of say ~10s and then shuffling them into a new order. This should provide a version of the CRP under scenario 2 (although it does not ensure that all CRPs under this scenario will look like this, it is at least a useful sanity check).

      (2) Do simulations, with clearly specified assumptions, for scenarios 1 and 2. One way of doing this is to use a simplified (state-space) setup and randomly simulate N spatially fixed networks that are independently switching on and off over time (i.e. "activation" is 0 or 1). Note that this would result in a N-dimensional connectome state space.

      The authors would only need to worry about simulating the network activation time courses, i.e. they would not need to bother with specifying the spatial configuration of each network, instead, they would make the implied assumption that each of these networks has the same spatial configuration in modality 1 and modality 2.

      With that assumption, the CRP calculation should simply correspond to calculating, at each time i in modality 1 and time j in modality 2, the number of networks that are activating in both modality 1 and modality 2, by using their activation time courses. Using this, one can simulate and compute the CRPs for the two scenarios:

      a. Scenario 1: where the simulated activation timecourses are set to be the same between both modalities

      b. Scenario 2: where the simulated activation timecourses are simulated separately for each of the modalities
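
The simulation outlined in suggestion (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not an analysis from the paper: the number of networks, the number of time points, the per-step switching probability and all function names are hypothetical choices made here, and the on-/off-diagonal ratio is taken simply as the mean of the diagonal entries divided by the mean of all other entries.

```python
import numpy as np

def simulate_activations(n_networks, n_timepoints, switch_prob, rng):
    """Binary (0/1) activation time courses of independently switching networks.

    At every time step each network flips its state with probability
    `switch_prob` (an assumed, simplified switching model).
    """
    state = rng.integers(0, 2, size=n_networks)
    activations = np.empty((n_networks, n_timepoints), dtype=int)
    for t in range(n_timepoints):
        flips = rng.random(n_networks) < switch_prob
        state = np.where(flips, 1 - state, state)
        activations[:, t] = state
    return activations

def coactivation_crp(act_1, act_2):
    """CRP[i, j] = number of networks active at time i in modality 1
    and at time j in modality 2."""
    return act_1.T @ act_2

def on_off_diagonal_ratio(crp):
    """Mean on-diagonal entry divided by mean off-diagonal entry."""
    on_mean = np.diag(crp).mean()
    off_mask = ~np.eye(crp.shape[0], dtype=bool)
    return on_mean / crp[off_mask].mean()

rng = np.random.default_rng(0)
modality_1 = simulate_activations(20, 2000, 0.1, rng)
independent = simulate_activations(20, 2000, 0.1, rng)

# Scenario 1: both modalities share the same activation time courses.
ratio_scenario_1 = on_off_diagonal_ratio(coactivation_crp(modality_1, modality_1))
# Scenario 2: the time courses are simulated separately for each modality.
ratio_scenario_2 = on_off_diagonal_ratio(coactivation_crp(modality_1, independent))
print(ratio_scenario_1, ratio_scenario_2)
```

Under these assumptions the ratio is expected to exceed one when the time courses are shared and to stay near one when they are independent. Recurring network configurations raise the off-diagonal entries in both scenarios, which is the concern raised in the thought experiment above, so the sketch is useful mainly for exploring how the ratio behaves as the assumptions change.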

      We thank the reviewer for raising this important matter as it directly relates to our study hypothesis. To address this point, we chose to focus on the first of the two alternative suggestions of the reviewer, as it provides evidence based on empirical data. In line with the reviewer’s suggestion 1, recurrence plots have indeed been previously applied to connectome dynamics data from the same modality [Hansen et al., NeuroImage 2015; Fig. 2B]. As shown in the referenced study, where the recurrence plot has been estimated within fMRI connectome dynamics, the on-diagonal entries have noticeably larger correlation values in comparison to off-diagonal entries. As the authors state, this contrast emphasizes the autocorrelation of connectome dynamics in their single modality recurrence plot. Extending these findings to our cross-modal recurrence plots, more synchronicity of connectome dynamics across fMRI and EEG will, in theory, translate into stronger correlation values along the diagonal axis as it represents neighboring timepoints in the data. On the other hand, less cross-modal synchronicity translates to a lack of such correlation prevalence along the diagonal axis.

      Complementing these statements with empirical data, Author response image 1 shows the fMRI-to-iEEG and fMRI-to-fMRI CRPs side by side as suggested by the reviewer. For simplicity, we thresholded each CRP at the top 5% of entries and calculated their corresponding on-/off-diagonal ratios. The on-/off-diagonal ratio for fMRI-to-fMRI CRP was 4.32 ± 6.26 across -5 to +5 TR lags (with a maximum of 16.56 at a lag of one TR), while this value was 1.00 ± 0.31 for fMRI-to-iEEG CRP. Thus, it becomes apparent that synchronicity of connectome dynamics directly translates to the on-/off-diagonal ratio in CRP.
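
One way to read the thresholded ratio described above is sketched below. The response does not spell out the exact definition, so the sketch makes assumptions: entries at or above the 95th percentile count as hits, the ratio at a given lag is the hit rate on the diagonal offset by that lag divided by the hit rate over all remaining entries, and the function name and synthetic matrix are hypothetical.

```python
import numpy as np

def thresholded_diagonal_ratio(crp, lag=0, top_fraction=0.05):
    """On-/off-diagonal ratio of supra-threshold entries at a given lag.

    Entries in the top `top_fraction` of the CRP count as hits. The lagged
    diagonal contains entries (i, j) with j - i == lag, i.e. modality 2
    shifted by `lag` frames relative to modality 1 (assumed convention).
    """
    crp = np.asarray(crp, dtype=float)
    if crp.ndim != 2 or crp.shape[0] != crp.shape[1]:
        raise ValueError("crp must be a square matrix")
    n = crp.shape[0]
    if abs(lag) >= n:
        raise ValueError("lag must be smaller than the number of frames")
    threshold = np.quantile(crp, 1.0 - top_fraction)
    hits = crp >= threshold
    rows, cols = np.indices((n, n))
    on_mask = (cols - rows) == lag
    on_rate = hits[on_mask].mean()
    off_rate = hits[~on_mask].mean()
    if off_rate == 0:
        return float("inf")
    return float(on_rate / off_rate)

# Synthetic CRP with an artificially strengthened main diagonal.
rng = np.random.default_rng(0)
synthetic_crp = rng.random((100, 100))
synthetic_crp[np.diag_indices(100)] += 2.0
ratios_by_lag = {lag: thresholded_diagonal_ratio(synthetic_crp, lag) for lag in range(-5, 6)}
print(ratios_by_lag)
```

In this synthetic case only lag zero carries structure, so its ratio is far above those at the other lags. For a within-modality CRP the zero-lag diagonal compares identical frames and would normally be excluded.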

      Author response image 1.

      Sample CRP shown for a subject for comparing two cases: fMRI-to-iEEG (left) and fMRI-to-fMRI (right). The comparison shows that in the presence of genuine synchronous connectome dynamics, as expected for the within-modality case (right panel), the on-/off-diagonal ratio is expected to show noticeably higher values. This figure establishes a strong link between our proposed metric of on-/off-diagonal ratio and the extent of synchronicity of connectome dynamics.

      Author response image 2.

      On-/off-diagonal ratio in the fMRI-to-fMRI recurrence plot is considerably higher than the cross-modal fMRI-to-iEEG case. Horizontal axis shows the lag where the metric was calculated in the CRP. The bars reflect the group average metric while the whiskers show standard deviation. Note that for the within-modality case, the ratio is not defined at lag zero because of identical connectome frames.

      (2) Choices in the analysis pipeline leading up to the computation of FC in fMRI or EEG will affect the quality of information available in the FC. For example, but not only, the choice of parcellation (in the study, the number of parcels is very high given the number of EEG sensors). I think it is important that we see the impact of the chosen pipeline on the time-averaged connectomes, an output that the field has some idea about what is sensible. This would give confidence that the information being used in the main analyses in the paper is based on a sensible footing and relates to what the field is used to thinking about in terms of FC. This should be trivial to compute, as it is just a case of averaging the time-varying FCs being used for the CRP over all time points. Admittedly, this approach is less useful for the intracranial EEG.

      We agree with the reviewer on ensuring that the time-averaged FC aligns with expectations of the field and prior work. For this reason, our supplementary analysis already included an analysis that replicates the well-established (albeit modest) spatial similarity between fMRI static connectome and EEG/iEEG static connectomes:

      “In scalp EEG-fMRI data, cross-modal spatial (2D) Pearson correlation of group-level time-averaged connectomes between fMRI and EEG-FCAmp or fMRI and EEG-FCPhase were calculated across all frequency bands. The average spatial correlation value across frequency bands was r = 0.28 and r = 0.28 for EEG-FCAmp and EEG-FCPhase, respectively. The spatial correlation values across all frequency bands and connectivity measures were significantly higher than the corresponding null distributions generated by phase-permuted group-level fMRI-FC spatial organization (p<0.005; 200 repetitions; FDR-corrected at q<0.05 for the number of frequency bands). …. Of note, the small effect sizes are strongly in line with prior literature (Hipp and Siegel, 2015; Wirsich et al., 2017; Betzel et al., 2019) and may point to possible divergence in the dynamic domain as investigated in the main manuscript.”
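
The time-averaged comparison quoted above can be illustrated with a short sketch. It assumes that each modality provides time-varying FC as an array of shape (time, regions, regions) and that spatial similarity is the Pearson correlation over unique region pairs. The function names and the synthetic data are hypothetical, and the phase-permutation null model mentioned in the quote is not reproduced.

```python
import numpy as np

def static_connectome(dynamic_fc):
    """Average time-varying FC, shape (time, regions, regions), over time."""
    return np.asarray(dynamic_fc, dtype=float).mean(axis=0)

def spatial_similarity(fc_a, fc_b):
    """Pearson correlation between two connectomes over unique region pairs.

    Only the upper triangle (excluding the diagonal) is used, so each
    connection is counted once and self-connections are ignored.
    """
    fc_a = np.asarray(fc_a, dtype=float)
    fc_b = np.asarray(fc_b, dtype=float)
    if fc_a.shape != fc_b.shape:
        raise ValueError("connectomes must have the same shape")
    upper = np.triu_indices(fc_a.shape[0], k=1)
    return float(np.corrcoef(fc_a[upper], fc_b[upper])[0, 1])

# Synthetic example: two modalities sharing a weak common spatial pattern.
rng = np.random.default_rng(0)
n_regions, n_frames = 10, 200
common = rng.random((n_regions, n_regions))
common = (common + common.T) / 2
dyn_a = common + rng.normal(scale=1.0, size=(n_frames, n_regions, n_regions))
dyn_b = common + rng.normal(scale=1.0, size=(n_frames, n_regions, n_regions))
print(spatial_similarity(static_connectome(dyn_a), static_connectome(dyn_b)))
```

Averaging over frames suppresses frame-to-frame noise, which is why a shared spatial pattern can be visible in the static connectomes even when individual frames are only weakly similar.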

      This replication directly confirms the validity of our selected atlas for further investigations into the connectome dynamics. We acknowledge that with 64 EEG channels, one can only estimate a relatively coarse connectome. Among the well-known coarse atlases, we chose the Desikan-Killiany atlas as it is based on anatomical features, eliminating possible biases towards a particular functional data modality. Moreover, this atlas has been commonly used for multimodal functional connectivity studies, facilitating the confirmation of prior findings in the time-averaged domain [Deligianni et al., Front. Neurosci. 2014; Wirsich et al., NeuroImage 2020; Wirsich et al., NeuroImage 2021].

      (3) Leakage correction. The paper states: "To mitigate this issue, we provide results from source-localized data both with and without leakage correction (supplementary and main text, respectively)." Given that FC in EEG is dominated by spatial leakage (see Hipp paper), then I cannot see how it can be justified to look at non-spatial leakage correction results at all, let alone put them up front as the main results. All main results/figures for the scalp EEG should be done using spatial leakage-corrected EEG data.

      We agree that relying on leakage-uncorrected scalp EEG alone would be problematic. It is for this reason that the intracranial data forms the core of our results, emphasizing that the observed multiplex architecture of connectomes is indeed present in the absence of source leakage. Only once this finding is established in the intracranial EEG do we provide the scalp EEG data as a generalization to whole-cortex coverage connectomes of healthy subjects. Moreover, it is known that existing source-leakage correction algorithms may inadvertently remove some of the genuine zero-lag connectivity. For instance, Finger and colleagues have shown that the similarity of functional connectivity to structural connectivity diminishes after correction for source leakage (Finger et al., PLOS Comp. Biol. 2016). Therefore, we have deliberately chosen to include our generalization findings before source-leakage correction (main text) as well as after source-leakage correction, which reflects a more stringent approach (supplementary analysis). Importantly, our conclusions hold true both before and after source-leakage correction.

      Reviewer #2 (Public Review):

      Summary:

      The study investigates the brain's functional connectivity (FC) dynamics across different timescales using simultaneous recordings of intracranial EEG/source-localized EEG and fMRI. The primary research goal was to determine which of three convergence/divergence scenarios is the most likely to occur.

      The results indicate that despite similar FC patterns found in different data modalities, the time points were not aligned, indicating spatial convergence but temporal divergence.

      The researchers also found that FC patterns in different frequencies do not overlap significantly, emphasizing the multi-frequency nature of brain connectivity. Such asynchronous activity across frequency bands supports the idea of multiple connectivity states that operate independently and are organized into a multiplex system.

      Strengths:

      The data supporting the authors' claims are convincing and come from simultaneous recordings of fMRI and iEEG/EEG, which has been recently developed and adapted.

      The analysis methods are solid and involve a novel approach to analyzing the co-occurrence of FC patterns across modalities (cross-modal recurrence plot, CRP) and robust statistics, including replication of the main results using multiple operationalizations of the functional connectome (e.g., amplitude, orthogonalized, and phase-based coupling).

      In addition, the authors provided a detailed interpretation of the results, placing them in the context of recent advances and understanding of the relationships between functional connectivity and cognitive states.

      Weaknesses:

      Despite the impressive work, the paper still lacks some analyses to make it complete.

      Firstly, the effect of the window size is unclear, especially in the case of different frequencies where the number of cycles that fall in a window will vary drastically. A typical oscillation lasts just a few cycles (see Myrov et al., 2024), and brain states are usually short-lived because of meta-stability (see Roberts et al., 2019).

      We now replicate our results with an additional window size. Please see section “Recommendations for the authors”.

      Secondly, the authors didn't examine frequencies lower than 1Hz despite similarities between fMRI and infra-slow oscillations found in prior literature (see Palva et al., 2014; Zhang et al., 2023).

      We address this issue below. Please see section “Recommendations for the authors”.

      On a minor note, the phase-locking value (PLV) is positively biased for EEG data (see Palva et al., 2018) and a different metric for phase coupling could be a more appropriate choice (e.g., iPLV/wPLI, see Vinck et al., 2011).

      While iPLV and wPLI are not positively biased, they may reduce genuine zero-phase connectivity as they were initially designed to address spurious zero-phase connectivity from source leakage in scalp EEG. Indeed, PLV connectivity is shown to be more strongly correlated with structural connectivity than wPLI and other phase coupling methods [Finger et al., PLOS Comp. Biol. 2016], emphasizing that it contains genuine connectivity that may be lacking when zero-phase connectivity is removed. We chose PLV because it is a widely used functional connectivity metric, particularly in intracranial data where source leakage is not a critical concern. Thus, using PLV facilitates cross-study comparisons including to our prior work [e.g. Mostame et al. NeuroImage 2020, Mostame et al. J Neurosci 2021].
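
      The trade-off described above can be made concrete with a minimal sketch that computes PLV and the imaginary PLV (iPLV) from instantaneous phases. The function names and the synthetic phase series are hypothetical; the formulas are the standard definitions (magnitude of the mean phase-difference phasor for PLV, magnitude of its imaginary part for iPLV).

```python
import cmath
import math

def mean_phase_vector(phase_x, phase_y):
    """Average unit phasor of the phase difference between two signals."""
    phasors = [cmath.exp(1j * (a - b)) for a, b in zip(phase_x, phase_y)]
    return sum(phasors) / len(phasors)

def plv(phase_x, phase_y):
    """Phase-locking value: length of the mean phase-difference phasor."""
    return abs(mean_phase_vector(phase_x, phase_y))

def iplv(phase_x, phase_y):
    """Imaginary PLV: discards the zero-lag (real) part of the coupling."""
    return abs(mean_phase_vector(phase_x, phase_y).imag)

# Synthetic phases (hypothetical): one signal coupled at zero lag,
# another coupled at a constant quarter-cycle lag.
phases = [0.1 * k for k in range(200)]
zero_lag = list(phases)
quarter_cycle = [p - math.pi / 2 for p in phases]

plv_zero, iplv_zero = plv(phases, zero_lag), iplv(phases, zero_lag)
plv_quarter, iplv_quarter = plv(phases, quarter_cycle), iplv(phases, quarter_cycle)
print(plv_zero, iplv_zero, plv_quarter, iplv_quarter)
```

      Both pairs are perfectly phase-locked according to PLV, but iPLV registers only the lagged pair. This is why leakage-robust metrics can also suppress genuine zero-phase coupling.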

      The repository with the code is also unavailable.

      Thank you for bringing this to our attention. We have now made our repository publicly accessible at: https://github.com/connectlab/Mostame2024_Multiplex_iEEG_fMRI.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      The window widths used to compute FC as a function of time are an important aspect, so I feel that this should be briefly described up-front in the main Results text.

      Methods. "Finally, to compensate for the time lag between hemodynamic and neural responses of the brain (Logothetis et al., 2001), we shifted the fMRI-FC time course 6 seconds backwards in time." What about the effects of temporal blurring from the HRF? Do we need to care about that?

      We agree that it is important to investigate the effect of temporal blurring by the HRF. The main text already included a replication of findings from CRPs generated using fMRI data and EEG amplitude signals convolved with the canonical HRF. This method serves as an alternative to the 6-second shifting. Both approaches produced similar results.
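
      For readers unfamiliar with the lag compensation, a minimal sketch of the 6-second shift follows. It assumes one FC frame per TR and a TR of 3 s, so the shift amounts to two frames; the function name and the toy frame indices are hypothetical, and the HRF-convolution alternative is not shown.

```python
def align_with_lag(fmri_frames, eeg_frames, lag_s=6.0, tr_s=3.0):
    """Pair each EEG frame at time t with the fMRI frame at time t + lag_s.

    Shifting the fMRI series backwards in time is equivalent to dropping its
    first few frames and trimming both series to a common length.
    """
    shift = round(lag_s / tr_s)
    shifted_fmri = fmri_frames[shift:]
    n = min(len(shifted_fmri), len(eeg_frames))
    return shifted_fmri[:n], eeg_frames[:n]

# Toy series: integers stand in for FC frames indexed by acquisition order.
fmri = list(range(10))
eeg = list(range(10))
fmri_aligned, eeg_aligned = align_with_lag(fmri, eeg)
print(list(zip(eeg_aligned, fmri_aligned))[:3])
```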

      Methods. In fMRI connectome computation it is common to look at partial correlation rather than full correlation. Partial correlation focuses more on direct connections. It would be good if the paper acknowledged and justified why it is OK to use full correlation.

      We have now added a brief explanation in this regard in the main text (Methods section) as follows:

      “In fMRI connectome computation, some prior work has used partial correlation instead of full correlation. Partial correlation emphasizes direct connections by calculating correlation between any pair of brain regions after regressing out the timeseries of all other regions. However, we have opted to use full correlation because this permits interpretation of our outcomes in the context of the vast existing literature that uses full correlations in fMRI including the majority of bimodal (EEG-fMRI) connectome studies (e.g. Tagliazucchi et al., 2012; Deligianni et al., 2014; Wirsich et al., 2017b, 2020, 2021; Allen et al., 2018).”

      The paper should relate the results to findings showing clear links between simultaneously recorded EEG and fMRI beyond FC. E.g. Mantini (PNAS) 2007 and Van De Ville (PNAS) 2010 to name two.

      In line with this important point, we have extended the existing discussion section that compares our outcomes to EEG-fMRI beyond functional connectivity:

      “Prior multi-modal studies of neural dynamics have predominantly aimed at methodologically cross-validating hemodynamic and electrophysiological observations, thus focusing on their convergence. These important foundational studies include e.g., the cross-modal comparison of region-wise (Mukamel et al., 2005; Nir et al., 2007) or ICN-wise (Mantini et al., 2007) activity fluctuations, instantaneous activity maps (Hunyadi et al., 2019; Zhang et al., 2020) or EEG microstates (Van de Ville 2010), infraslow connectome states (Abreu et al., 2020), or connection-wise FC including studies in the iEEG-fMRI and scalp EEG-fMRI data used in the current study (Ridley et al., 2017; and Wirsich et al., 2020, respectively). In contrast to this prior work, the current study investigated the highly time-resolved cross-modal temporal relationship at the level of FC patterns distributed over all available pairwise connections, and found a connectome-level temporal divergence. The discrepancy between temporal divergence in our study and convergence in prior studies implies that infraslow fluctuations of activity in individual regions or of FC in individual region-pairs observable in both modalities (prior studies) are neurally distinct from connectome-wide FC dynamics observable separately in each modality (current study). Indeed, we confirmed the existence of infraslow electrophysiological FC dynamics driving cross-modal temporal associations at the level of individual connections (Fig. S3) …”

      Reviewer #2 (Recommendations For The Authors):

      (1) Check different window sizes and stability of the FC patterns as a function of it.

      We thank the reviewer for the helpful feedback. We agree that the window size could possibly affect the estimation of individual connectome frames, particularly given that neural processes unfold at hundreds of milliseconds rather than seconds. However, we expect that the asynchronous nature of cross-modal convergence observed in our data would remain intact regardless of the specific window length used for FC calculations. To confirm this, we replicated some of our main analyses in the iEEG-fMRI data with a window length of 500ms (as opposed to 3s, equivalent to one TR) as follows:

      First, we showed that changing the window length does not substantially impact the overall architecture of the connectomes (Author response image 3). Particularly, the time-averaged connectome patterns across different frequency bands were all strongly correlated between the two analyses (500ms and 3s window lengths).

      Author response image 3.

      Time-averaged connectome patterns are highly replicable when calculated using 3s or 500ms window lengths. Horizontal axis represents frequency bands, while each dot represents a subject. Vertical axis shows 2D Pearson correlation of the two connectomes. The group average within each frequency band is marked by a horizontal line.

      Second, we replicated our major findings of CRP and its on-/off-diagonal ratio in the iEEG-fMRI dataset using a window length of 500ms for FC calculations. Indeed, the data does not show a substantial difference in the on-/off-diagonal ratios of the CRP entries between the 3s and 500ms window lengths. Specifically, the ratio was equal to 1.02 ± 0.07 for 500ms window length, emphasizing absence of significant temporal convergence of the connectome dynamics (see Author response image 4). A paired t-test between group-averaged ratios across different lags confirms a lack of significant difference between the two analyses (p= 0.50). This finding further emphasizes the genuine asynchronous nature of connectome dynamics across the neural timescales measured in fMRI and electrophysiology. We have added this analysis to the supplementary data.

      Author response image 4.

      On-/off-diagonal ratio is shown across lags for both analyses: 3s window length (blue) and 500ms window length (red). Each bar shows the mean across subjects, while the whiskers show the corresponding standard deviations.

      (2) Try to decrease the lowest frequency of the analysis below 1Hz or just compute it for multiple log-spaced frequencies from infra-slow delta to high-gamma band.

      Thank you for pointing out this matter. We do not expect considerable signal in the frequency range below the current lower bound of delta (1Hz) because, as in most other EEG recordings, the EEG was not recorded in a DC setting and was acquired with a hardware high-pass filter of 0.1Hz. Nonetheless, we investigated the power spectral density of our iEEG-fMRI data and found that there is indeed little signal power left in the available infraslow range [0.5 – 1 Hz] after the preprocessing steps (Author response image 5).

      Author response image 5.

      Power spectral density of all subjects in the fMRI-iEEG dataset shows lack of sufficient power in the infraslow range. Infraslow range signals are almost always filtered out during recording unless the recording setup includes a DC amplifier. The infraslow EEG signal that is often considered correlated with the fMRI signal in the literature is most commonly extracted from the slow-changing envelope of band-limited signals, such as the envelope of gamma oscillations.

      Accordingly, when the iEEG signals are filtered within the range of [0.5, 1] Hz, there is little signal variation observed in the signal timeseries, in contrast to the adjacent delta band signal (Author response image 6). Importantly, the power envelope of the delta band (and all other canonical bands not shown here) comprises major fluctuations in the infraslow range, as expected. We would like to emphasize that the existing studies addressing infraslow EEG signal dynamics typically consider the infraslow envelope fluctuations of band-limited signals in traditional frequency bands [e.g. Nir et al., Nat Neurosci 2008] rather than direct recordings in the infraslow frequency range. Investigating HRF-convolved EEG signals similarly captures the infraslow characteristics of the timeseries [e.g. Mantini et al. PNAS 2007, Sadaghiani et al., J Neurosci 2010] (and note that HRF-convolved analyses are included as supplementary investigation in the current study). To the best of our knowledge, very few studies have investigated direct infraslow EEG signals using DC EEG, and we are aware of only two DC-EEG studies with concurrent fMRI [Hiltunen et al., J Neurosci 2014, Grooms et al., Brain Connectivity 2017]. The infraslow correlates of fMRI in electrophysiological signals reported in prior work therefore reflect the slow changes in faster activity or connectivity of traditional frequency bands, which is indeed already included in the current study.

      Author response image 6.

      Sample timeseries of the iEEG signal of the nine subjects (nine rows) for a 400 second interval. Blue signals show the bandlimited delta with its envelope shown as darker blue. The red signal represents the infraslow signal component left in the data, which is much lower in power.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Ritvo and colleagues present an impressive suite of simulations that can account for three findings of differentiation in the literature. This is important because differentiation (in which items that have some features in common, or share a common associate, are less similar to one another than are unrelated items) is difficult to explain with classic supervised learning models, as these predict the opposite (i.e., an increase in similarity). A few of their key findings are that differentiation requires a high learning rate and low inhibitory oscillations, and is virtually always asymmetric in nature.

      This paper was very clear and thoughtful: an absolute joy to read. The model is simple and elegant, and powerful enough to re-create many aspects of existing differentiation findings. The interrogation of the model and presentation of the findings were both extremely thorough. The potential for this model to be used to drive future work is huge. I have only a few comments for the authors, all of which are relatively minor.

      (1) I was struck by the fact that the "zone" of repulsion is quite narrow, compared with the zone of attraction. This was most notable in the modeling of Chanales et al. (i.e., just one of the six similarity levels yielded differentiation). Do the authors think this is a generalizable property of the model or phenomenon, or something idiosyncratic to do with the current investigation? It seems curious that differentiation findings (e.g., in hippocampus) are so robustly observed in the literature despite the mechanism seemingly requiring a very particular set of circumstances. I wonder if the authors could speculate on this point a bit. For example, might the differentiation zone be wider when competitor "pop up" is low (i.e., low inhibitory oscillations), which could help explain why it's often observed in hippocampus? This seems related a bit to the question about what makes something "moderately" active, or how one could ensure "moderate" activation if they were, say, designing an experiment looking at differentiation.

      We thank the reviewer for this comment. In the previous version of the manuscript, in the section entitled “Differentiation Requires a High Learning Rate and Is Sensitive to Activation Dynamics”, we discussed some reasons why differentiation may be more likely to be found in the hippocampus – namely, the high learning rate of the hippocampus and the sparsity of hippocampal activation patterns (pp. 27-28):

      “These results have implications for where to look for differentiation in the brain. Our finding that differentiation requires a high learning rate suggests that differentiation will be more evident in the hippocampus than in neocortex, insofar as hippocampus is thought to have a higher learning rate than neocortex (McClelland et al., 1995). In keeping with this prediction, numerous studies have found differentiation effects in hippocampus but not in neocortical regions involved in sensory processing (e.g., Chanales et al., 2017; Favila et al., 2016; Zeithamova et al., 2018). At the same time, some studies have found differentiation effects in neocortex (e.g., Schlichting et al., 2015; Wammes et al., 2022). One possible explanation of these neocortical differentiation effects is that they are being “propped up” by top-down feedback from differentiated representations in the hippocampus. This explanation implies that disruptions of hippocampal processing (e.g., lesions, stimulation) will eliminate these neocortical differentiation effects; we plan to test this prediction in future work.

      Additionally, the simulations where we adjusted the oscillation amount (using our model of Schlichting et al., 2015) imply that differentiation will be most evident in brain regions where it is relatively hard to activate competitors. Given the U shape of the NMPH learning rule, limiting competitor activity makes it less likely that plasticity will “cross over” from weakening (and differentiation) to strengthening (and integration). Thus, within the hippocampus, subregions with sparser activity (e.g., dentate gyrus, and to a lesser extent, CA3; Barnes et al., 1990; GoodSmith et al., 2017; West et al., 1991) will be more prone to differentiation. There is strong empirical support for this prediction. For example, Wammes et al. (2022) manipulated the similarity of stimuli in a statistical learning experiment and found that moderate levels of visual similarity were associated with significant differentiation in the dentate gyrus but not other subregions. Also, numerous studies have found greater differentiation in dentate gyrus / CA3 than in CA1 (e.g., Dimsdale-Zucker et al., 2018; Wanjia et al., 2021; Molitor et al., 2021; Kim et al., 2017; but see Zheng et al., 2021).”

      In the revised draft we have supplemented this discussion with a new section entitled “Reconciling the Prevalence of Differentiation in the Model and in the Data” (pp. 30-31):

      “A key lesson from our model is that, from a computational perspective, it is challenging to obtain differentiation effects: The region of parameter space that gives rise to differentiation is much smaller than the one that gives rise to integration (for further discussion of this issue, see the section in Methods on Practical Advice for Getting the Model to Show Differentiation). However, the fact that integration is more prevalent in our simulations across parameter configurations does not mean that integration will be more prevalent than differentiation in real-life circumstances. What really matters in predicting the prevalence of differentiation in real life is how the parameters of the brain map on to parameters of the model: If the parameters of the brain align with regions of model parameter space that give rise to differentiation (even if these regions are small), this would explain why differentiation has been so robustly observed in extant studies. Indeed, this is exactly the case that we sought to make above about the hippocampus – i.e., that its use of especially sparse coding and a high learning rate will give rise to the kinds of neural dynamics that cause differentiation (as opposed to integration). As another example, while it is true that half of the overlap conditions in our simulation of Chanales et al. (2021) give rise to integration, this does not imply that integration will occur half of the time in the Chanales et al. (2021) study; it may be that the levels of overlap that are actually observed in the brain in Chanales et al. (2021) are more in line with the levels of overlap that give rise to differentiation in our model.”

      (2) With real fMRI data we know that the actual correlation value doesn't matter all that much, and anti-correlations can be induced by things like preprocessing decisions. I am wondering if the important criterion in the model is that the correlations (e.g., as shown in Figure 6) go down from pre to post, versus that they are negative in sign during the post learning period. I would think that here, similar to in neural data, a decrease in correlation would be sufficient to conclude differentiation, but would love the authors' thoughts on that.

      We thank the reviewer for bringing this up. In the paper, we define differentiation as the moving apart of representations – so we agree with the reviewer that it would be appropriate to conclude that differentiation is taking place when correlations go down from pre to post.

      In addition to the definitional question (“what counts as differentiation”), one can also ask the mechanistic question of what is happening in the model at the (simulated) neuronal level in conditions where differentiation (i.e., an average decrease in similarity from pre to post) occurs. Here, the model’s answer is clear: When the similarity of two pairmates decreases, it is because the pairmates have acquired anticorrelated representations at the (simulated) neuronal level. When similarity decreases on average from pre to post, but the average “post” similarity value is not negative, this is because there is a mix of outcomes across runs of the model (due to variance in the initial, random model weights and also variance in the order in which items are presented across training epochs) – some runs lead to differentiation (manifested as anticorrelated pairmate representations) whereas others lead to no change or integration. The average pre-to-post change depends on the relative frequencies with which these different outcomes occur.
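
      A small worked example may help here. The sketch below computes the aggregate post-learning similarity as a proportion-weighted mixture of run-level outcomes; the proportions and similarity values are purely hypothetical and are not taken from the simulations.

```python
def mean_post_similarity(outcomes):
    """outcomes: list of (proportion of runs, post-learning similarity) pairs."""
    total = sum(p for p, _ in outcomes)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    return sum(p * s for p, s in outcomes)

pre_similarity = 0.4  # hypothetical pre-learning pairmate correlation
outcomes = [
    (0.5, -0.3),            # runs that differentiate (anticorrelated pairmates)
    (0.2, 0.9),             # runs that integrate
    (0.3, pre_similarity),  # runs with no change
]
post = mean_post_similarity(outcomes)
change = post - pre_similarity
print(f"mean post similarity: {post:.2f}, pre-to-post change: {change:.2f}")
```

      With these made-up numbers the average similarity drops from pre to post while the average post value stays positive, even though every differentiating run ends up anticorrelated.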

      We have made several edits to the paper to clarify this point.

      We added a new section under “Results” in our simulation of Chanales et al. (2021) entitled, “Pairs of Items that Differentiate Show Anticorrelated Representations” (p. 15):

      “Figure 6B also highlights that, for learning rates where robust differentiation effects occur in aggregate (i.e., there is a reduction in mean pattern similarity, averaging across model runs), these aggregate effects involve a bimodal distribution across model runs: For some model runs, learning processes give rise to anticorrelated representations, and for other model runs the model shows integration; this variance across model runs is attributable to random differences in the initial weight configuration of the model. The aggregate differentiation effect is therefore a function of the proportion of model runs showing differentiation (here, anticorrelation) and the proportion of model runs showing integration. The fact that differentiation shows up as anticorrelation in the model's hidden layer relates to the learning effects discussed earlier: Unique competitor units are sheared away from (formerly) shared units, so the competitor ends up not having any overlap with the target representation (i.e., the level of overlap is less than you would expect due to chance, which mathematically translates into anticorrelation). We return to this point and discuss how to test for anticorrelation in the Discussion section.”

      We added new text to the “Take-Home Lessons” section in the Chanales et al. (2021) simulation (p. 17):

      “In particular, the simulations expose some important boundary conditions for when representational change can occur according to the NMPH (e.g., that differentiation depends on a large learning rate, but integration does not), and the simulations provide a more nuanced account of exactly how representations change (e.g., that differentiation driven by the NMPH is always asymmetric, whereas integration is sometimes asymmetric and sometimes symmetric; and that, when differentiation occurs on a particular model run, it tends to give rise to anticorrelated representations in the model's hidden layer).”

      We added new text to the “Nature of Representational Change” section in the Favila et al. (2016) simulation (p. 21):

      “Figure 8 - Supplement 1 also indicates that, as in our simulation of Chanales et al. (2021), individual model runs where differentiation occurs show anticorrelation between the pairmate representations, and gradations in the aggregate level of differentiation that is observed across conditions reflect differences in the proportion of trials showing this anticorrelation effect.”

      We added new text to the “Take-Home Lessons” section in the Favila et al. (2016) simulation (p. 21):

      “As in our simulation of Chanales et al. (2021), we found that the NMPH-mediated differentiation was asymmetric, manifested as anticorrelation between pairmate representations on individual model runs, and required a high learning rate, leading to abrupt representational change.”

      We added new text to the “Nature of Representational Change” section in the Schlichting et al. (2015) simulation (p. 26):

      “Also, as in our other simulations, when differentiation occurs on a particular model run it tends to give rise to anticorrelated representations (results not shown).”

      We added new text to the “Take-Home Lessons” section in the Schlichting et al. (2015) simulation (pp. 26-27):

      “As in the other versions of our model, differentiation requires a high learning rate, and – on model runs when it occurs – it is asymmetric and gives rise to anticorrelated representations.”

      We added new text at the start of the Discussion (p. 27):

      “In addition to qualitatively replicating the results from the studies we simulated, our model gives rise to several novel predictions – most notably, that differentiation driven by the NMPH requires a rapid learning rate and, when it occurs for a particular pair of items, it is asymmetric and gives rise to anticorrelated representations.”

      We also added a new section in the Discussion entitled “Testing the Model's Prediction about Anticorrelation”, which (among other things) highlights the reviewer’s point that fMRI pattern similarity values can be affected by preprocessing choices (p. 30):

      “Even though we operationally define differentiation as a reduction in similarity with learning, the way that it actually shows up on individual model runs is as anticorrelation between pairmates; in the model, the size of the aggregate differentiation effect is determined by the proportion of model runs that show this anticorrelation effect (vs. no change or integration). This implies that, if we could get a clean measurement of the similarity of pairmates in an experiment, we might see a multimodal distribution, with some pairmates showing anticorrelation, and others showing increased correlation (integration) or no change in similarity. This kind of clean readout of the similarity of individual pairs might be difficult to obtain with fMRI; it is more feasible that this could be obtained with electrophysiology. Another challenge with using fMRI to test this prediction is that anticorrelation at the individual-neuron level might not scale up to yield anticorrelation at the level of the BOLD response; also, fMRI pattern similarity values can be strongly affected by preprocessing choices – so a negative pattern similarity value does not necessarily reflect anticorrelation at the individual-neuron level. A final caveat is that, while we predict that differentiation will show up as anticorrelation in the brain region that gives rise to the differentiation effect, this might not translate into anticorrelation in areas that are downstream of this region (e.g., if the hippocampus is the source of the differentiation effect, we would expect anticorrelation there, but not necessarily in neocortical regions that receive input from the hippocampus; we revisit this point later in the discussion, when we address limitations and open questions).”

      We added new text in the Discussion, under “Limitations and Open Questions” (p. 31):

      “Importantly, while hippocampus can boost the representation of unique features in neocortex, we expect that neocortex will continue to represent shared perceptual features (e.g., in Favila et al., 2016, the fact that both pairmates are photos of barns). For this reason, in paradigms like the one used by Favila et al. (2016), the predicted effect of hippocampal differentiation on neocortical representations will be a reduction in pattern similarity (due to upregulation in the representation of unique pairmate features) but neocortex should not cross over into anticorrelation in these paradigms (due to its continued representation of shared perceptual features). Indeed, this is exactly the pattern that Wanjia et al. (2021) observed in their study, which used similar stimuli to those used in Favila et al. (2016).”

      Lastly, we updated the Abstract (p. 1):

      “What determines when neural representations of memories move together (integrate) or apart (differentiate)? Classic supervised learning models posit that, when two stimuli predict similar outcomes, their representations should integrate. However, these models have recently been challenged by studies showing that pairing two stimuli with a shared associate can sometimes cause differentiation, depending on the parameters of the study and the brain region being examined. Here, we provide a purely unsupervised neural network model that can explain these and other related findings. The model can exhibit integration or differentiation depending on the amount of activity allowed to spread to competitors – inactive memories are not modified, connections to moderately active competitors are weakened (leading to differentiation), and connections to highly active competitors are strengthened (leading to integration). The model also makes several novel predictions – most importantly, that when differentiation occurs as a result of this unsupervised learning mechanism, it will be rapid and asymmetric, and it will give rise to anticorrelated representations in the region of the brain that is the source of the differentiation. Overall, these modeling results provide a computational explanation for a diverse set of seemingly contradictory empirical findings in the memory literature, as well as new insights into the dynamics at play during learning.”
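
      The three activity regimes described in the abstract can be sketched as a U-shaped, piecewise-linear learning function. The thresholds, magnitudes, and function name below are hypothetical choices for illustration; they are not the parameters of the authors' model.

```python
def nmph_weight_change(activity, theta_low=0.2, theta_cross=0.6,
                       max_weaken=-1.0, max_strengthen=1.0):
    """Hypothetical U-shaped rule mapping competitor activity (0 to 1) to weight change.

    Below theta_low: no change. Between theta_low and theta_cross: weakening,
    deepest midway. Above theta_cross: strengthening that grows with activity.
    """
    if activity <= theta_low:
        return 0.0
    trough = (theta_low + theta_cross) / 2
    if activity <= trough:
        return max_weaken * (activity - theta_low) / (trough - theta_low)
    if activity <= theta_cross:
        return max_weaken * (theta_cross - activity) / (theta_cross - trough)
    return max_strengthen * (activity - theta_cross) / (1.0 - theta_cross)

for level, label in [(0.1, "inactive"), (0.4, "moderately active"), (0.9, "highly active")]:
    print(f"{label:>17}: weight change = {nmph_weight_change(level):+.2f}")
```

      Under this sketch, limiting how far competitor activity can rise keeps it in the weakening zone, which mirrors the argument that sparse activity favors differentiation over integration.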

      (3) For the modeling of the Favila et al. study, the authors state that a high learning rate is required for differentiation of the same-face pairs. This made me wonder what happens in the low learning rate simulations. Does integration occur?

      For the same-face condition of the Favila simulation, lowering learning rate does not result in an overall integration effect:

      Author response image 1.

      In other cases, we do see integration emerge at lower learning rates – e.g., in the Schlichting interleaved condition we see a small integration effect emerge for a learning rate value of 0.3:

      Author response image 2.

      Our view is that, while integration can emerge at low learning rates, it is not a reliable property of the model – in some cases, there is a “window” of learning rates where there is enough learning to drive integration but not enough to drive differentiation, and in other cases there is not. Given this lack of reliability across simulations, we would prefer not to discuss this in the paper.

      This paradigm has a lot of overlap with acquired equivalence, and so I am thinking about whether these are the sorts of small differences (e.g., same-category scenes and perhaps a high learning rate) that bias the system to differentiate instead of integrate.

      We agree that it would be very interesting to use the model to explore acquired equivalence and related phenomena, but we think it is out of scope of the current paper. We have added some text to the Discussion under “Limitations and Open Questions” (p. 32):

      “Another important future direction is to apply the model to a wider range of learning phenomena involving representational change – for example, acquired equivalence, which (like some of the studies modeled here) involves linking distinct stimuli to a shared associate (see, e.g., Honey and Hall, 1989; Shohamy and Wagner, 2008; Myers et al., 2003; Meeter et al., 2009; de Araujo Sanchez and Zeithamova, 2023). It is possible that some of these phenomena might be better explained by supervised learning, or a mixture of unsupervised and supervised learning, than by unsupervised learning alone.”

      (4) For the simulations of the Schlichting et al. study, the A and B items appear to have overlap in the hidden layer based on Figure 9, despite there being no similarity between the A and B items in the study (in contrast to Favila et al., in which they were similar kinds of scenes, and Chanales et al., in which they were similar colors). Why was this decision made? Do the effects depend on some overlap within the hidden layer? (This doesn't seem to be explained in the paper that I saw though, so maybe it's just a visualization error?)

      Overlap in the pretrained hidden representations of A and B is not strictly necessary for these effects – it would be possible to reconfigure other parameters to get high levels of competition even if there were no overlap (e.g., by upregulating the strengths of connections from shared input features). Having said that, it is definitely true that overlap between the pretrained hidden representations boosts competition, and we think it is justified to posit this in the Schlichting simulation. We have now added an explanation for this in the paper (p. 23):

      “New text in Schlichting, “Knowledge Built into the Network”

      Matching the previous two simulations, we pretrained the weights so the hidden representations of the stimuli initially had 2/6 units in common. Even though the A and B stimuli used in the actual experiment did not have obvious feature overlap (they were randomly selected novel objects), it is important to note that the hidden layer is not simply a representation of the sensory features of the A and B stimuli; the hidden layer also receives input from the output layer, which represents the shared associate of A and B (X). We think that the presence of this shared associate justifies our use of initially-overlapping hidden representations.”

      (5) It seems as though there were no conditions under which the simulations produced differentiation in both the blocked and intermixed conditions, which Schlichting et al. observed in many regions (as the present authors note). Is there any way to reconcile this difference?

      We thank the reviewer for bringing this up. If we set the connection strength between X (in the output layer) and A (in the hidden layer) in the blocked condition to .9 instead of .999 (keeping this connection strength at .8 for the interleaved condition) and we set Osc to .0615, we observe differentiation in both conditions.

      Rather than replacing the original results in the paper, which would entail re-making the associated videos, etc., we have added a supplementary figure (Figure 10 - Supplement 1), which is included on p. 46.

      We also added the following to the Results section of the Schlichting simulation in the main text (p. 26):

      “Figure 10 - Supplement 1 shows results from an alternative parameterization where, in the low-oscillation-amplitude condition, differentiation is observed in both the blocked and interleaved conditions (mirroring results from Schlichting et al., 2015, who found differentiation in both conditions in several regions of interest, including parts of the hippocampus and medial prefrontal cortex).”

      (6) A general question about differentiation/repulsion and how it affects the hidden layer representation in the model: Is it the case that the representation is actually "shifted" or repelled over so it is no longer overlapping? Or do the shared connections just get pruned, such that the item that has more "movement" in representational space is represented by fewer units on the hidden layer (i.e., is reduced in size)? I think, if I understand correctly, that whether it gets shifted vs. reduced would depend on the strength of connections along the hidden layer, which would in turn depend on whether it represents some meaningful continuous dimension (like color) or not. But, if the connections within the hidden layer are relatively weak and it is the case that representations become reduced in size, would there be any anticipated consequences of this (e.g., cognitively/behaviorally)?

      The representations are shifted – this is discussed in the Chanales results section:

      “Because the activity “set point” for the hidden layer (determined by the kWTA algorithm) involves having 6 units active, and the unique parts of the competitor only take up 4 of these 6 units, this leaves room for activity to spread to additional units. Given the topographic projections in the output layer, the model is biased to “pick up” units that are adjacent in color space to the currently active units; because activity cannot flow easily from the competitor back to the target (as a result of the aforementioned severing of connections), it flows instead *away* from the target, activating two additional units, which are then incorporated into the competitor representation. This sequence of events (first a severing of the shared units, then a shift away from the target) completes the process of neural differentiation, and is what leads to the behavioral repulsion effect in color recall (because the center-of-mass of the color representation has now shifted away from the target).”
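      As an illustration of the “set point” idea in the quoted passage, the following minimal sketch selects the 6 most strongly driven hidden units. It is not the model's own code: the kWTA algorithm in the model may work differently (for example, by setting a shared inhibition level), and the function name, the 14-unit layer, and all input values here are hypothetical. The sketch only shows how a fixed budget of 6 active units, combined with weakened input to the 2 formerly shared units, leaves room for 2 neighboring units on the far side of the competitor.

```python
import numpy as np

def k_winners_take_all(net_input, k=6):
    """Hard top-k selection: the k most strongly driven units become active."""
    net_input = np.asarray(net_input, dtype=float)
    winners = np.argsort(net_input)[-k:]      # indices of the k largest inputs
    active = np.zeros(net_input.shape, dtype=int)
    active[winners] = 1
    return active

# Hypothetical hidden layer of 14 units ordered along a color axis.
# The target occupies units 2-7; the competitor originally occupied units 6-11.
net_input = np.zeros(14)
net_input[8:12] = 1.0    # competitor's 4 unique units: strongly driven
net_input[6:8] = 0.1     # formerly shared units: connections severed, weak drive
net_input[12:14] = 0.5   # neighbors away from the target: moderate spreading drive

active = k_winners_take_all(net_input, k=6)
print(np.flatnonzero(active))  # indices of the units that end up active
```

      With these made-up inputs, the two extra slots are filled by the neighboring units rather than by the severed shared units, which is the qualitative shift the passage describes.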

      Reviewer #2 (Public Review):

      This paper addresses an important computational problem in learning and memory. Why do related memory representations sometimes become more similar to each other (integration) and sometimes more distinct (differentiation)? Classic supervised learning models predict that shared associations should cause memories to integrate, but these models have recently been challenged by empirical data showing that shared associations can sometimes cause differentiation. The authors have previously proposed that unsupervised learning may account for these unintuitive data. Here, they follow up on this idea by actually implementing an unsupervised neural network model that updates the connections between memories based on the amount of coactivity between them. The goal of the authors' paper is to assess whether such a model can account for recent empirical data at odds with supervised learning accounts. For each empirical finding they wish to explain, the authors built a neural network model with a very simple architecture (two input layers, one hidden layer, and one output layer) and with prewired stimulus representations and associations. On each trial, a stimulus is presented to the model, and inhibitory oscillations allow competing memories to pop up. Pre-specified u-shaped learning rules are used to update the weights in the model, such that low coactivity leaves model connections unchanged, moderate coactivity weakens connections, and high coactivity strengthens connections. In each of the three models, the authors manipulate stimulus similarity (following Chanales et al), shared vs distinct associations (following Favila et al), or learning strength (a stand-in for blocked versus interleaved learning schedule; following Schlichting et al) and evaluate how the model representations evolve over trials.

      As a proof of principle, the authors succeed in demonstrating that unsupervised learning with a simple u-shaped rule can produce qualitative results in line with the empirical reports. For instance, they show that pairing two stimuli with a common associate (as in Favila et al) can lead to *differentiation* of the model representations. Demonstrating these effects isn't trivial and a formal modeling framework for doing so is a valuable contribution. Overall, the authors do a good job of both formally describing their model and giving readers a high level sense of how their critical model components work, though there are some places where the robustness of the model to different parameter choices is unclear. In some cases, the authors are very clear about this (e.g. the fast learning rate required to observe differentiation). However, in other instances, the paper would be strengthened by a clearer reporting of the critical parameter ranges.

      We thank the reviewer for raising this point. The interdependence of parameters in our model makes it infeasible to identify critical parameter ranges. We have added a paragraph to the “Approach to Parameterization and Data Fitting” section in the Methods to address this point (p. 33):

      “The overall goal of this modeling work is to account for key empirical regularities regarding differentiation and integration and to establish boundary conditions on these regularities. As such, the modeling work described below focuses more on qualitative fits to general properties of the data space than on quantitative fits to results from specific studies. Automatic parameter optimization is not feasible for this kind of model, given the large number of model parameters and the highly interactive, nonlinear nature of competitive dynamics in the model; consequently, model fitting was done by hand.

      These complex interactions between parameters also make it infeasible to list “critical parameter ranges” for generating particular model outcomes. Our experience in working with the model has been that activation dynamics are what matter most for learning, and that disparate parameter sets can give rise to the same activation dynamics and -- through this -- the same learning effects; likewise, similar parameter sets can give rise to different activation dynamics and different learning outcomes. Consequently, in this paper we have focused on characterizing the dynamics that give rise to different learning effects (and how they can be affected by local parameter perturbations, e.g., relating to learning rate and oscillation size), rather than the – impossible, we believe – task of enumerating the full set of parameter configurations that give rise to a particular result.”

      For instance, it's clear from the manipulation of oscillation strength in the model of Schlichting et al that this parameter can dramatically change the direction of the results. The authors do report the oscillation strength parameter values that they used in the other two models, but it is not clear how sensitive these models are to small changes in this value.

      In some cases, the effects of oscillation strength are relatively smooth. For example, in the Favila simulation, increasing the oscillation amplitude Osc effectively recapitulates the U-shaped curve (i.e., higher levels of Osc lead to more competitor activation, which initially leads to weakening / differentiation but then gives way to strengthening / integration), as is shown for the Favila Different Face condition in this plot:

      Author response image 3.

      In the Chanales 2/6 overlap condition, the effects of varying Osc are more nonlinear:

      Author response image 4.

      We think this is attributable to the increased “all-or-none” recurrent dynamics in this simulation (due to the recurrent projections within the output layer), which make it more difficult to evoke moderate (vs. high) levels of activation. This difficulty in reliably obtaining graded activation dynamics is likely a consequence of the small-scale (“toy”) nature of the model and the simple inhibitory mechanisms employed here, as opposed to being a generalizable property of the brain – presumably, the actual brain employs more nuanced and effective means of controlling activation. Furthermore, we don’t think that the high prevalence of integration in the model’s parameter space necessarily translates into a prediction that integration should be more prevalent overall – see the new “Reconciling the Prevalence of Differentiation in the Model and in the Data” section described in response to one of the reviewer’s other points below. Due to the paper already being quite long, we have opted not to include the above plots / discussion in the paper.

      Similarly, it's not clear whether the 2/6 hidden layer overlap (only explicitly manipulated in the model of Chanales et al) is required for the other two models to work.

      When we were parameterizing the model, we opted to keep the 2/6 level of overlap for all of the simulations and we adjusted other parameters to fit the data; in part, this was because overlap can only be adjusted in discrete jumps, whereas other influential parameters in the model can be adjusted in a more graded, real-valued way. Our use of 2/6 overlap (as opposed to, say, 1/6 or 3/6 overlap) for the Favila and Schlichting models was done out of convenience, and should not be interpreted as a strong statement that this particular level of overlap is necessary for obtaining differentiation; we could easily get the model to show differentiation given other overlap levels by adjusting other parameters.
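      To make the “discrete jumps” point concrete, the sketch below enumerates every possible overlap level between two binary hidden patterns with 6 active units each and reports the Pearson correlation between them. The layer size of 50 units and all function names are assumptions chosen for illustration, not values taken from the model. The sketch also shows a general property of sparse binary codes: when two patterns share fewer units than expected by chance, their correlation is negative.

```python
import numpy as np

N_UNITS = 50   # hypothetical hidden-layer size (not specified in the text)
K_ACTIVE = 6   # units active per pattern, as in the 2/6 overlap description

def hidden_pattern(n_units, active_units):
    """Binary pattern with the listed units set to 1."""
    pattern = np.zeros(n_units)
    pattern[list(active_units)] = 1.0
    return pattern

def correlation_for_overlap(shared, n_units=N_UNITS, k_active=K_ACTIVE):
    """Pearson correlation between two k-active patterns sharing `shared` units."""
    pairmate_a = hidden_pattern(n_units, range(0, k_active))
    start = k_active - shared                 # slide B so that it shares `shared` units with A
    pairmate_b = hidden_pattern(n_units, range(start, start + k_active))
    return float(np.corrcoef(pairmate_a, pairmate_b)[0, 1])

# Overlap can only take the values 0/6, 1/6, ..., 6/6.
for shared in range(K_ACTIVE + 1):
    print(f"{shared}/{K_ACTIVE} overlap: r = {correlation_for_overlap(shared):+.3f}")
```

      Because overlap moves in steps of one unit, the correlation between pairmates also moves in fixed steps, whereas real-valued parameters such as learning rate can be tuned continuously.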

      Finally, though the u-shaped learning rule is essential to this framework, the paper does little formal investigation of this learning rule. It seems obvious that allowing the u-shape to collapse too much toward a horizontal line would reduce the model's ability to account for empirical results, but there may be other more interesting features of the learning rule parameterization that are essential for the model to function properly.

      Given that the paper is already quite long, we have opted not to include further exploration of the parameters of the U-shaped learning rule in the paper. However, for the reviewer’s information, we report the effects of a few illustrative manipulations of these parameters below. As a general principle, the effects of these manipulations make sense in light of the theoretical framework described in the paper.

      For example, the parameter “DRevMag” controls the size of the negative “dip” in the U-shaped curve (more negative values = a larger dip). Given that this negative dip is essential for severing weights to competitors and causing differentiation, shifting DRevMag upwards towards zero should shift the balance of the model away from differentiation and towards integration. This is indeed what we observe, as shown in this parameter sweep from the Chanales simulation:

      Author response image 5.

      As another example: The “DRev” parameter controls where the U-shaped curve transitions from negative weight change to positive weight change. Lower values of DRev mean that the region of coactivity values leading to negative weight change will be smaller, and the region of coactivity values leading to positive weight change will be larger. As such, we would expect that lower values of DRev would bias the model toward integration. That is indeed the case, as shown in this parameter sweep from the Schlichting Blocked simulation:

      Author response image 6.

      There are a few other points that may limit the model's ability to clearly map onto or make predictions about empirical data. The model(s) seem very keen to integrate and do so more completely than the available empirical data suggest. For instance, there is a complete collapse of representations in half of the simulations in the Chanales et al model and the blocked simulation in the Schlichting et al model also seems to produce nearly complete integration. Even if the Chanales et al paper had observed some modest behavioral attraction effects, this model would seem to over-predict integration. The authors somewhat implicitly acknowledge this when they discuss the difficulty of producing differentiation ("Practical Advice for Getting the Model to Show Differentiation") and not of producing integration, but don't address it head on.

      We thank the reviewer for this comment – R1 had a similar comment. We have added a new section to the Discussion to address this point (p. 30):

      “Reconciling the Prevalence of Differentiation in the Model and in the Data.

      A key lesson from our model is that, from a computational perspective, it is challenging to obtain differentiation effects: The region of parameter space that gives rise to differentiation is much smaller than the one that gives rise to integration (for further discussion of this issue, see the section in Methods on Practical Advice for Getting the Model to Show Differentiation). However, the fact that integration is more prevalent in our simulations across parameter configurations does not mean that integration will be more prevalent than differentiation in real-life circumstances. What really matters in predicting the prevalence of differentiation in real life is how the parameters of the brain map on to parameters of the model: If the parameters of the brain align with regions of model parameter space that give rise to differentiation (even if these regions are small), this would explain why differentiation has been so robustly observed in extant studies. Indeed, this is exactly the case that we sought to make above about the hippocampus – i.e., that its use of especially sparse coding and a high learning rate will give rise to the kinds of neural dynamics that cause differentiation (as opposed to integration). As another example, while it is true that half of the overlap conditions in our simulation of Chanales et al. (2021) give rise to integration, this does not imply that integration will occur half of the time in the Chanales et al. (2021) study; it may be that the levels of overlap that are actually observed in the brain in Chanales et al. (2021) are more in line with the levels of overlap that give rise to differentiation in our model.”

      Second, the authors' choice of strongly prewiring associations in the Chanales and Favila models makes it difficult to think about how their model maps onto experimental contexts where competition is presumably occurring while associations are only weakly learned. In the Chanales et al paper, for example, the object-face associations are not well learned in initial rounds of the color memory test. While the authors do justify their modeling choice and their reasons have merit, the manipulation of AX association strength in the Schlichting et al model also makes it clear that the association strength has a substantial effect on the model output. Given the effect of this manipulation, more clarity around this assumption for the other two models is needed.

      We thank the reviewer for bringing this up. We have edited the section entitled “A Note on Prewiring Representations” in the Methods to further justify our choice to prewire associations in the Chanales and Favila models (p. 37):

      “In our model, our practice of “prewiring” memory representations for the A and B pairmates serves two functions. In some cases, it is meant to stand in for actual training (as in the blocked / interleaved manipulation; the connections supporting the AX association are prewired to be stronger in the blocked condition than in the interleaved condition). However, the other, more fundamental role of prewiring is to ensure that the A and B input patterns evoke sparse distributed representations in the hidden layer (i.e., where some units are strongly active but most other units are inactive). In the real brain, this happens automatically because the weight landscape has been extensively sculpted by both experience and evolution. For example, in the real hippocampus, when the second pairmate is presented for the first time, it will evoke a sparse distributed representation in the CA3 subfield (potentially overlapping with the first pairmate’s CA3 representation) even before any learning of the second pairmate has occurred, due to the strong, sparse mossy fiber projections that connect the dentate gyrus to CA3 (McNaughton & Morris, 1987). As discussed above, we hypothesize that this initial, partial overlap between the second pairmate’s representation and the first pairmate’s representation can lead to pop-up of the unique features of the first pairmate’s representation, triggering learning that leads to differentiation or integration. In our small-scale model, we are effectively starting with a “blank brain”; in the absence of prewiring, the A and B inputs would activate overly diffuse representations that do not support these kinds of competitive dynamics. As such, prewiring in our model is necessary for proper functioning.

      The presence of prewired A and B representations should therefore not be interpreted as reflecting a particular training history (except in the blocked / interleaved case above); rather, these prewired representations constitute the minimum step we would take to ensure well-defined competitive dynamics in our small-scale model.

      The fact that connection strengths serve this dual function – sometimes reflecting effects of training (as in our simulation of Schlichting et al., 2015) and in other cases reflecting necessary prewiring – complicates the interpretation of these strength values in the model. Our view is that this is a necessary limitation of our simplified modeling approach – one that can eventually be surmounted through the use of more biologically-detailed architectures (see Limitations and Open Questions in the Discussion).”

      Overall, this is strong and clearly described work that is likely to have a positive impact on computational and empirical work in learning and memory. While the authors have written about some of the ideas discussed in this paper previously, a fully implemented and openly available model is a clear advance that will benefit the field. It is not easy to translate a high-level description of a learning rule into a model that actually runs and behaves as expected. The fact that the authors have made all their code available makes it likely that other researchers will extend the model in numerous interesting ways, many of which the authors have discussed and highlighted in their paper.

      Reviewer #3 (Public Review):

      This paper proposes a computational account for the phenomenon of pattern differentiation (i.e., items having distinct neural representations when they are similar). The computational model relies on the learning mechanism of the nonmonotonic plasticity hypothesis, a fast learning rate, and inhibitory oscillations. The relatively simple architecture of the model makes its dynamics accessible to the human mind. Furthermore, using similar model parameters, this model produces simulated data consistent with empirical data of pattern differentiation. The authors also provide insightful discussion on the factors contributing to differentiation as opposed to integration. The authors may consider the following to further strengthen this paper:

      The model compares different levels of overlap at the hidden layer and reveals that partial overlap seems necessary to lead to differentiation. While I understand this approach from the perspective of modeling, I have concerns about whether this is how the human brain achieves differentiation. Specifically, if we view the hidden layer activation as a conjunctive representation of a pair that is the outcome of encoding, differentiation should precede the formation of the hidden layer activation pattern of the second pairmate. Instead, the model assumes such a pattern already exists before differentiation. Maybe the authors indeed argue that mechanistically differentiation follows initial encoding that does not consider similarity with other memory traces?

      Related to the point above, because the simulation setup is different from how differentiation actually occurs, I wonder how valid the prediction of asymmetric reconfiguration of hidden layer connectivity pattern is.

      We thank the reviewer for this comment. In the revised manuscript, we have edited the “Note on Prewiring Representations” in the Methods to clarify how our assumptions about prewiring relate to what we really think is happening in the brain (p. 37):

      “In our model, our practice of “prewiring” memory representations for the A and B pairmates serves two functions. In some cases, it is meant to stand in for actual training (as in the blocked / interleaved manipulation; the connections supporting the AX association are prewired to be stronger in the blocked condition than in the interleaved condition). However, the other, more fundamental role of prewiring is to ensure that the A and B input patterns evoke sparse distributed representations in the hidden layer (i.e., where some units are strongly active but most other units are inactive). In the real brain, this happens automatically because the weight landscape has been extensively sculpted by both experience and evolution. For example, in the real hippocampus, when the second pairmate is presented for the first time, it will evoke a sparse distributed representation in the CA3 subfield (potentially overlapping with the first pairmate’s CA3 representation) even before any learning of the second pairmate has occurred, due to the strong, sparse mossy fiber projections that connect the dentate gyrus to CA3 (McNaughton & Morris, 1987). As discussed above, we hypothesize that this initial, partial overlap between the second pairmate’s representation and the first pairmate’s representation can lead to pop-up of the unique features of the first pairmate’s representation, triggering learning that leads to differentiation or integration. In our small-scale model, we are effectively starting with a “blank brain”; in the absence of prewiring, the A and B inputs would activate overly diffuse representations that do not support these kinds of competitive dynamics. As such, prewiring in our model is necessary for proper functioning.

      The presence of prewired A and B representations should therefore not be interpreted as reflecting a particular training history (except in the blocked / interleaved case above); rather, these prewired representations constitute the minimum step we would take to ensure well-defined competitive dynamics in our small-scale model.

      The fact that connection strengths serve this dual function – sometimes reflecting effects of training (as in our simulation of Schlichting et al., 2015) and in other cases reflecting necessary prewiring – complicates the interpretation of these strength values in the model. Our view is that this is a necessary limitation of our simplified modeling approach – one that can eventually be surmounted through the use of more biologically-detailed architectures (see Limitations and Open Questions in the Discussion).”

      Although, as the authors mentioned, there haven't been formal empirical tests of the relationship between learning speed and differentiation/integration, I am also wondering to what degree the prediction of fast learning being necessary for differentiation is consistent with current data. According to Figure 6, the learning rates that lead to differentiation in the 2/6 condition achieved it after just one shot most of the time. On the other hand, Guo et al (2021), for example, showed that humans may need a few blocks of training and test to start showing differentiation.

      We thank the reviewer for mentioning this. We have added a paragraph to the “Differentiation Requires a High Learning Rate and Is Sensitive to Activity Dynamics” section of the Discussion that addresses this point (pp. 28-29):

      “Although the results from Wanjia et al. (2021) provide strong support for the model's prediction that differentiation will be abrupt, they raise another question: What explains variance across items in when this abrupt change takes place? The answer to this question remains to be seen, but one possibility is encoding variability: If we assume that participants stochastically sample (i.e., attend to) the features of the scene pairmates, it is possible that participants might initially fail to sample the features that distinguish the scene pairmates, which can be quite subtle – and if the distinguishing features of the pairmates are not represented in high-level visual regions (i.e., the pairmates are represented in these regions as having the same features), this could delay the onset of differentiation until the point at which the distinguishing features happen (by chance) to be sampled.”

      Related to the point above, the high learning rate prediction also seems to be at odds with the finding that the cortex, which has slow learning (according to the theory of complementary learning systems), also shows differentiation in Wammes et al (2022).

      We now address this point in the section of the Discussion entitled “Differentiation Requires a High Learning Rate and Is Sensitive to Activity Dynamics” (p. 27):

      “Our finding that differentiation requires a high learning rate suggests that differentiation will be more evident in the hippocampus than in neocortex, insofar as hippocampus is thought to have a higher learning rate than neocortex (McClelland et al., 1995). In keeping with this prediction, numerous studies have found differentiation effects in hippocampus but not in neocortical regions involved in sensory processing (e.g., Chanales et al., 2017; Favila et al., 2016; Zeithamova et al., 2018). At the same time, some studies have found differentiation effects in neocortex (e.g., Schlichting et al., 2015; Wammes et al., 2022). One possible explanation of these neocortical differentiation effects is that they are being “propped up” by top-down feedback from differentiated representations in the hippocampus.”

      More details about the learning dynamics would be helpful. For example, equation(s) showing how activation, learning rate and the NMPH function work together to change the weight of connections may be added. Without the information, it is unclear how each connection changes its value after each time point.

      We thank the reviewer for this comment. We have made two major changes to address this concern. First, we have edited the “Learning” section within “Basic Network Properties” in the main text (pp. 6-7):

      “Connection strengths in the model between pairs of connected units x and y were adjusted at the end of each trial (i.e., after each stimulus presentation) as a U-shaped function of the coactivity of x and y, defined as the product of their activations on that trial. The parameters of the U-shaped learning function relating coactivity to change in connection strength (i.e., weakening / strengthening) were specified differently for each projection where learning occurs (bidirectionally between the input and hidden layers, the hidden layer to itself, and the hidden to output layer). Once the U-shaped learning function for each projection in each version of the model was specified, we did not change it for any of the various conditions. Details of how we computed coactivity and how we specified the U-shaped function can be found in the Methods section.”
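
      As a rough illustration of the learning rule described above (not the authors' actual equations, which are given in their Methods), a U-shaped update can be sketched in Python. The piecewise-linear shape, the breakpoint values, the learning rate, the weight bounds, and the names `nmph_delta` and `update_weight` are all assumptions made for this sketch.

```python
import numpy as np

def nmph_delta(coactivity, low=0.2, dip=0.4, cross=0.6, high=0.9,
               max_weaken=-0.3, max_strengthen=0.3):
    """Hypothetical U-shaped function: no change at low coactivity,
    weakening at moderate coactivity, strengthening at high coactivity."""
    xs = [0.0, low, dip, cross, high, 1.0]
    ys = [0.0, 0.0, max_weaken, 0.0, max_strengthen, max_strengthen]
    return np.interp(coactivity, xs, ys)

def update_weight(w, act_x, act_y, lrate=1.0, w_min=0.0, w_max=1.0):
    """Adjust one connection at the end of a trial.
    Coactivity is the product of the two unit activations."""
    coactivity = act_x * act_y
    new_w = w + lrate * nmph_delta(coactivity)
    # Keep the weight inside assumed lower and upper bounds.
    return float(np.clip(new_w, w_min, w_max))

# Moderate coactivity weakens the connection; high coactivity strengthens it.
print(update_weight(0.5, 1.0, 0.4))
print(update_weight(0.5, 1.0, 1.0))
```

      In the model, a separate set of function parameters would be needed for each projection, since the text states that the function was specified differently per projection.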

      Second, we have added the requested equations to the “Learning” part of the Methods (pp. 37-38):

      On the right side of the function, strong activation leads to strengthening of the connectivity, which I assume will lead to stronger activation at the next time point. The model has an upper limit on connection strength to prevent connections from strengthening too much. The same idea can be applied to the left side of the function: instead of having two turning points, it can be a linear function such that low activation keeps weakening the connection until the lower limit is reached. This way the NMPH function can take a simpler form (e.g., two line segments if you think the weakening and strengthening take different rates) and may still simulate the data.

      We thank the reviewer for mentioning this. We have added a new paragraph in the “Learning” section of the Methods to justify the particular shape of the learning curve (pp. 38-39):

      “Evidence for the U-shaped plasticity function used here (where low activation leads to no change, moderate activation leads to weakening, and higher levels of activation lead to strengthening) was previously reviewed in Ritvo et al. (2019). In brief, there are three lines of work that support the U shape: First, multiple neurophysiological studies have found that moderate postsynaptic depolarization leads to synaptic weakening and higher levels of depolarization lead to synaptic strengthening (e.g., Artola et al., 1990; Hansel et al., 1996). Second, human neuroscience studies have used pattern classifiers, applied to fMRI and EEG data, to measure memory activation, and have related this measure to subsequent memory accessibility; several studies using this approach have found that low levels of activation lead to no change in memory strength, moderate levels of activation lead to impaired subsequent memory, and higher levels of activation lead to increased subsequent memory (e.g., Newman and Norman, 2010; Detre et al., 2013; Kim et al., 2014; for related findings, see Lewis-Peacock and Norman, 2014; Wang et al., 2019). Third, a recent human fMRI study by Wammes et al. (2022) manipulated memory activation by varying the visual similarity of pairmates and observed a U-shaped function relating visual similarity to representational change in the hippocampus, whereby low levels of pairmate similarity were associated with no change, moderate levels of similarity were associated with differentiation, and the differentiation effect went away at higher levels of similarity.”

      We have also included a pointer to this new paragraph in the “Nonmonotonic Plasticity Hypothesis” section of Introduction (p. 2):

      “(for further discussion of the empirical justification for the NMPH, see the Learning subsection in the Methods)”

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      A few additional minor things about data presentation and the like:

      (1) Figure 1 legend - a more general description of how to interpret the figure might be helpful for more naive readers (e.g., explaining how one can visualize in the schematic that there is overlap in the hidden layer between A and B). Also, from the Figure 1 depiction, it's not clear what is different about the setup from the initial left hand side panels in A, B, C, to make it such that activity spreads strongly to A in panel A, weakly in panel B, and not at all in panel C since the weights are the same. Is there a way to incorporate this into the graphic, or describe it in words?

      To address this point, we have added the following text to the Figure 1 caption (p. 3):

      “Note that the figure illustrates the consequences of differences in competitor activation for learning, without explaining why these differences would arise. For discussion of circumstances that could lead to varying levels of competitor activation, see the simulations described in the text.”

      (2) I believe not all of the papers cited on lines 193-195 actually have similarity manipulations in them. I'd recommend double checking this list and removing those less relevant to the statement.

      Thank you for pointing this out; we have removed the Ballard reference and we have clarified what we mean by similarity reversal (p. 7):

      “The study was inspired by recent neuroimaging studies showing ‘similarity reversals’, wherein stimuli that have more features in common (or share a common associate) show less hippocampal pattern similarity (Favila et al., 2016; Schlichting et al., 2015; Molitor et al., 2021; Chanales et al., 2017; Dimsdale-Zucker et al., 2018; Wanjia et al., 2021; Zeithamova et al., 2018; Jiang et al., 2020; Wammes et al., 2022).”

      (3) I wanted a bit more detail about how the parameters were set in the main paper, not just in the methods. Even something as brief as noting that model fitting was done by hand by tweaking parameters to re-create the empirical patterns (if I'm understanding correctly) would have been helpful for me.

      To address this point, we have added the following text under “Basic Network Properties” (p. 4):

      “Our goal was to qualitatively fit key patterns of results from each of the aforementioned studies. We fit the parameters of the model by hand as they are highly interdependent (see the Methods section for more details).”

      (4) In Figure 4E, it would be helpful to describe the x and y axes of the MDS plots in the legend.

      To address this point, we have added the following new text to the Figure 4 caption that clarifies how the MDS plots were generated (p. 11):

      “MDS plots were rotated, shifted, and scaled such that pairmate 1_before is located at (0,0), pairmate 2_before is located directly to the right of pairmate 1_before, and the distance between pairmate 1_before and pairmate 2_before is proportional to the baseline distance between the pairmates.”
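
      The alignment described in this caption can be sketched as a shift, a rotation, and a uniform scaling of the 2D MDS coordinates. This is a minimal sketch under stated assumptions, not the authors' plotting code: the function name `align_mds`, its arguments, and the choice to make the pre-learning distance equal to a supplied target value are all hypothetical, and reflections are not handled.

```python
import numpy as np

def align_mds(points, idx_p1_before, idx_p2_before, target_distance):
    """Shift, rotate, and scale 2D points so that pairmate 1 (before) sits at
    the origin and pairmate 2 (before) lies on the positive x-axis at
    target_distance. Assumes the two reference points are distinct."""
    pts = np.asarray(points, dtype=float)
    # Shift: move pairmate 1 (before) to the origin.
    shifted = pts - pts[idx_p1_before]
    # Rotate: bring pairmate 2 (before) onto the positive x-axis.
    dx, dy = shifted[idx_p2_before]
    angle = np.arctan2(dy, dx)
    c, s = np.cos(-angle), np.sin(-angle)
    rotation = np.array([[c, -s], [s, c]])
    rotated = shifted @ rotation.T
    # Scale: set the pairmate 1 to pairmate 2 distance to the target value.
    current_distance = rotated[idx_p2_before, 0]
    return rotated * (target_distance / current_distance)

coords = [[1.0, 1.0], [1.0, 3.0], [2.0, 1.0]]
print(align_mds(coords, idx_p1_before=0, idx_p2_before=1, target_distance=1.0))
```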

      (5) Figure 6 - at first I thought the thicker line was some sort of baseline, but I think it is just many traces on top of one another. If other readers may be similarly confused, perhaps this could be stated.

      Thanks for this comment. We have updated Figure 6 (p. 16).

      We have also updated the caption.

      (6) I am having a lot of difficulty understanding the terms "competitor-to-competitor," "competitor-to-target/shared," and "target/shared-to-target/shared," and therefore I don't fully get Figure 5. I think it might be helpful to expand the description of these terms where they are first introduced in the paper (p. 13?). I think I am missing something crucial here, and I am not quite sure what that is, which I know is not very helpful! But, to narrate my confusion a bit, I thought that these terms would somehow relate to connections between different connections of the network. For example, is competitor-to-competitor within the hidden layer? Or is this somehow combining across relevant connections that might span different pairs of layers in the model? And, I really have no idea why it is "target/shared."

      Thank you for these comments. We have updated Figure 5 and we have also made several changes to the main text and the figure caption to address these points.

      Changes to the main text (p. 13):

      “Whether symmetric or asymmetric integration occurs depends on the relative strengths of connections between pairs of unique competitor units (competitor-competitor connections) compared to connections between unique competitor units and shared units (competitor-shared connections) after the first trial (Figure 5; note that the figure focuses on connections between hidden units, but the principle also applies to connections that span across layers). Generally, coactivity between unique competitor units (competitor-competitor coactivity) is less than coactivity between unique competitor units and shared units (competitor-shared coactivity), which is less than coactivity between unique target units and shared units (target-shared coactivity).”

      (7) Relatedly in Figure 13, I understand how some competitor-to-target/shared connections could be spared in the bottom instance given panel B. However, I'm struggling to understand how that relates to the values in the corresponding chart in panel A. What about panel A, bottom (vs. the top) means lower coactivities between some competitor-to-target/shared? Is it because if the noise level is higher, the "true" activation of competitor-to-target/shared connections is weaker? I think again, I'm missing something critical here! and wonder if other readers may be in the same situation. (I know the authors described this also on p. 36, but I'm still confused!)

      We have updated Figure 13 to clarify these points.

      (8) In Figure 9, I believe there is no caption for panel D. Also, it looks as though the item unit active for A and B is the same. I wonder if this is an error?

      Thank you for catching these errors! They have both been fixed.

      Reviewer #2 (Recommendations For The Authors):

      -Perhaps I missed it, but I think defining coactivity (how it is computed) in the main text would be useful for readers, as this is critical for understanding the model. I did find it in the methods.

      We thank the reviewer for this suggestion. We have updated the “Learning” section within “Basic Network Properties” in the main text to address this point (pp. 6-7):

      “Connection strengths in the model between pairs of connected units x and y were adjusted at the end of each trial (i.e., after each stimulus presentation) as a U-shaped function of the coactivity of x and y, defined as the product of their activations on that trial. The parameters of the U-shaped learning function relating coactivity to change in connection strength (i.e., weakening / strengthening) were specified differently for each projection where learning occurs (bidirectionally between the input and hidden layers, the hidden layer to itself, and the hidden to output layer). Once the U-shaped learning function for each projection in each version of the model was specified, we did not change it for any of the various conditions. Details of how we computed coactivity and how we specified the U-shaped function can be found in the Methods section.”

      -The modeling results in the different face condition are at odds with the data for the Favila et al model (they observe some differentiation in the paper and the model predicts no change). This could be due to a number of unmodeled factors, but it is perhaps worth noting.

      Thank you for pointing this out. It is possible to better capture the pattern of results observed by Favila et al. in their paper (with some differentiation in the different-face condition and even more differentiation in the same-face condition) by slightly adjusting the model parameters (specifically, by setting the oscillation amplitude Osc for the hidden layer to .1 instead of .067).

      Rather than replacing the old (Osc = .067) results in the paper, which would entail re-making the associated videos, etc., we have added a supplementary figure (Figure 8 - Supplement 1; see p. 45):

      We also added new text to the Favila Results, under “Differentiation and Integration” (p. 20):

      “Note also that the exact levels of differentiation that are observed in the different-face and same-face conditions are parameter dependent; for an alternative set of results showing some differentiation in the different-face condition (but still less than is observed in the same-face condition), see Figure 8 - Supplement 1.”

      -Related to my comment in the public review about pre-wiring associations, in the caption for Figure 9 (Schlichting model), the authors report "In both conditions, the pre-wired connection linking the "item B" hidden units to the "item X" output unit is set to .7. In the interleaved condition, the connection linking the "item A" hidden units to the "item X" output unit is set to .8, to reflect some amount of initial AX learning. In the blocked condition, the connection linking the "item A" hidden units to the "item X" output unit is set a higher value (.999), to reflect extra AX learning." What are the equivalent values for the other models, especially the Favila model since the structure is the same as Schlichting? I understood all the "strong" connections to be .99 unless otherwise stated. If that's the case, I don't understand why the blocked Schlichting model and the Favila model produce opposite effects. More clarity would be useful here.

      We have added a new paragraph to the results section for the Schlichting model (under “Differentiation and Integration”) to clarify why the blocked Schlichting model and the Favila model show different results (p. 24):

      “Note that the key feature driving integration in the blocked condition of this simulation is not the high strength of the connection from X to A on its own – rather, it is the asymmetry in the pretrained connection strengths from X to A (.999) and from X to B (.7). This asymmetry, which is meant to reflect the extensive training on A-X that occurred before the initial presentation of B-X, results in the A-X hidden representation decisively winning the competition during B-X presentation, which then leads to the B input also being linked to this representation (i.e., integration). It is instructive to compare this to the same-face condition from our simulation of Favila et al. (2016): In that simulation, the two pairmates are also linked strongly (.99 initial connection strength) to a shared associate, but in that case the connections are equally strong, so there is more balanced competition -- in this case, the competitor representation only comes to mind moderately (instead of displacing the target representation), so the result is differentiation instead of integration.”

      -The meaning of the different colored dots in Figure 5 is a bit hard to keep track of, even given the legend labels. The figure might benefit from a model sketch highlighting each of the different coactivity types. The left side of Fig 13 was useful but again somehow mapping on the colors would help further. Another note on these figures: what does having two dots of each color mean? Is it just an illustration of the variance? There would be more dots if there was one dot per coactivity value.

      We have updated Figure 5 and Figure 13 to clarify these points (including a clarification that the dots only represent a subset of the possible pairings between units).

      -While I appreciate the goal of the paper is to account for these three studies, readers who aren't familiar with or specifically interested in these studies may appreciate a small amount of intuition on why formalizing unsupervised learning models may be broadly important for computational investigations of learning/memory/cognition.

      We have added the following text under “Basic Network Properties” in the Introduction to address this point (p. 4):

      “Achieving a better understanding of unsupervised learning is an important goal for computational neuroscience, given that learning agents have vastly more opportunities to learn in an unsupervised fashion than from direct supervision (for additional discussion of this point, see, e.g., Zhuang et al., 2021).”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This paper presents a compelling and comprehensive study of decision-making under uncertainty. It addresses a fundamental distinction between belief-based (cognitive neuroscience) formulations of choice behaviour and reward-based (behavioural psychology) accounts. Specifically, it asks whether active inference provides a better account of planning and decision-making, relative to reinforcement learning. To do this, the authors use a simple but elegant paradigm that includes choices about whether to seek both information and rewards. They then assess the evidence for active inference and reinforcement learning models of choice behaviour, respectively. After demonstrating that active inference provides a better explanation of behavioural responses, the neuronal correlates of epistemic and instrumental value (under an optimised active inference model) are characterised using EEG. Significant neuronal correlates of both kinds of value were found in sensor and source space. The source space correlates are then discussed sensibly, in relation to the existing literature on the functional anatomy of perceptual and instrumental decision-making under uncertainty.

      Strengths:

      The strengths of this work rest upon the theoretical underpinnings and careful deconstruction of the various determinants of choice behaviour using active inference. A particular strength here is that the experimental paradigm is designed carefully to elicit both information-seeking and reward-seeking behaviour; where the information-seeking is itself separated into resolving uncertainty about the context (i.e., latent states) and the contingencies (i.e., latent parameters), under which choices are made. In other words, the paradigm - and its subsequent modelling - addresses both inference and learning as necessary belief and knowledge-updating processes that underwrite decisions.

      The authors were then able to model belief updating using active inference and then look for the neuronal correlates of the implicit planning or policy selection. This speaks to a further strength of this study; it provides some construct validity for the modelling of belief updating and decision-making; in terms of the functional anatomy as revealed by EEG. Empirically, the source space analysis of the neuronal correlates licenses some discussion of functional specialisation and integration at various stages in the choices and decision-making.

      In short, the strengths of this work rest upon a (first) principles account of decision-making under uncertainty in terms of belief updating that allows them to model or fit choice behaviour in terms of Bayesian belief updating - and then use relatively state-of-the-art source reconstruction to examine the neuronal correlates of the implicit cognitive processing.

      Response: We are deeply grateful for your careful review of our work and for the thoughtful feedback you have provided. Your dedication to ensuring the quality and clarity of the work is truly admirable. Your comments have been invaluable in guiding us towards improving the paper, and we appreciate your time and effort in not just offering suggestions but also providing specific revisions that we can implement. Your insights have helped us identify areas where we can strengthen the arguments and clarify the methodology.

      Comment 1:

      The main weakness of this report lies in the communication of the ideas and procedures. Although the language is generally excellent, there are some grammatical lapses that make the text difficult to read. More importantly, the authors are not consistent in their use of some terms; for example, uncertainty and information gain are sometimes conflated in a way that might confuse readers. Furthermore, the descriptions of the modelling and data analysis are incomplete. These shortcomings could be addressed in the following way.

      First, it would be useful to unpack the various interpretations of information and goal-seeking offered in the (active inference) framework examined in this study. For example, it will be good to include the following paragraph:

      "In contrast to behaviourist approaches to planning and decision-making, active inference formulates the requisite cognitive processing in terms of belief updating in which choices are made based upon their expected free energy. Expected free energy can be regarded as a universal objective function, specifying the relative likelihood of alternative choices. In brief, expected free energy can be regarded as the surprise expected following some action, where the expected surprise comes in two flavours. First, the expected surprise is uncertainty, which means that policies with a low expected free energy resolve uncertainty and promote information seeking. However, one can also minimise expected surprise by avoiding surprising, aversive outcomes. This leads to goal-seeking behaviour, where the goals can be regarded as prior preferences or rewarding outcomes.

      Technically, expected free energy can be expressed in terms of risk plus ambiguity - or rearranged to be expressed in terms of expected information gain plus expected value, where value corresponds to (log) prior preferences. We will refer to both decompositions in what follows; noting that both decompositions accommodate information and goal-seeking imperatives. That is, resolving ambiguity and maximising information gain have epistemic value, while minimising risk or maximising expected value have pragmatic or instrumental value. These two kinds of values are sometimes referred to in terms of intrinsic and extrinsic value, respectively [1-4]."

      Response 1: We deeply thank you for your comments and corresponding suggestions about our interpretations of active inference. In response to your identified weaknesses and suggestions, we have added corresponding paragraphs in the Methods section (The free energy principle and active inference, line 95-106):

      “Active inference formulates the necessary cognitive processing as a process of belief updating, where choices depend on agents' expected free energy. Expected free energy serves as a universal objective function, guiding both perception and action. In brief, expected free energy can be seen as the expected surprise following some policies. The expected surprise can be reduced by resolving uncertainty, and one can select policies with lower expected free energy which can encourage information-seeking and resolve uncertainty. Additionally, one can minimize expected surprise by avoiding surprising or aversive outcomes (Oudeyer and Kaplan, 2007; Schmidhuber, 2010). This leads to goal-seeking behavior, where goals can be viewed as prior preferences or rewarding outcomes.

      Technically, expected free energy can also be expressed as expected information gain plus expected value, where the value corresponds to (log) prior preferences. We will refer to both formulations in what follows. Resolving ambiguity, minimizing risk, and maximizing information gain have epistemic value, while maximizing expected value has pragmatic or instrumental value. These two types of values can be referred to in terms of intrinsic and extrinsic value, respectively (Barto et al., 2013; Schwartenbeck et al., 2019).”
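
      For orientation, the two decompositions mentioned above are commonly written as follows in the active inference literature. This is the standard textbook form for hidden states only, given here as an assumption about notation; it is not the manuscript's Eq. 9, which has three weighted terms. Here $\pi$ is a policy, $s$ hidden states, $o$ outcomes, $Q$ the predictive beliefs with $Q(o, s \mid \pi) = P(o \mid s)\,Q(s \mid \pi)$, and $P(o)$ the prior preferences over outcomes. The negative signs reflect that information gain and value are maximized while expected free energy is minimized.

```latex
\begin{align}
G(\pi) &= \underbrace{D_{\mathrm{KL}}\!\left[\,Q(o \mid \pi)\,\|\,P(o)\,\right]}_{\text{risk}}
        + \underbrace{\mathbb{E}_{Q(s \mid \pi)}\!\left[\,H\!\left[P(o \mid s)\right]\right]}_{\text{ambiguity}} \\
       &= -\underbrace{\mathbb{E}_{Q(o \mid \pi)}\!\left[\,D_{\mathrm{KL}}\!\left[\,Q(s \mid o, \pi)\,\|\,Q(s \mid \pi)\,\right]\right]}_{\text{expected information gain}}
        \;-\; \underbrace{\mathbb{E}_{Q(o \mid \pi)}\!\left[\,\ln P(o)\,\right]}_{\text{expected value}}
\end{align}
```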

      Oudeyer, P. Y., & Kaplan, F. (2007). What is intrinsic motivation? A typology of computational approaches. Frontiers in neurorobotics, 1, 108.

      Schmidhuber, J. (2010). Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE transactions on autonomous mental development, 2(3), 230-247.

      Barto, A., Mirolli, M., & Baldassarre, G. (2013). Novelty or surprise? Frontiers in psychology, 4, 61898.

      Schwartenbeck, P., Passecker, J., Hauser, T. U., FitzGerald, T. H., Kronbichler, M., & Friston, K. J. (2019). Computational mechanisms of curiosity and goal-directed exploration. eLife, 8, e41703.

      Comment 2:

      The description of the modelling of choice behaviour needs to be unpacked and motivated more carefully. Perhaps along the following lines:

      "To assess the evidence for active inference over reinforcement learning, we fit active inference and reinforcement learning models to the choice behaviour of each subject. Effectively, this involved optimising the free parameters of active inference and reinforcement learning models to maximise the likelihood of empirical choices. The resulting (marginal) likelihood was then used as the evidence for each model. The free parameters for the active inference model scaled the contribution of the three terms that constitute the expected free energy (in Equation 6). These coefficients can be regarded as precisions that characterise each subjects' prior beliefs about contingencies and rewards. For example, increasing the precision or the epistemic value associated with model parameters means the subject would update her beliefs about reward contingencies more quickly than a subject who has precise prior beliefs about reward distributions. Similarly, subjects with a high precision over prior preferences or extrinsic value can be read as having more precise beliefs that she will be rewarded. The free parameters for the reinforcement learning model included..."

      Response 2: We deeply thank you for your comments and corresponding suggestions about our description of the behavioral modelling. In response to your identified weaknesses and suggestions, we have added corresponding content in the Results section (Behavioral results, line 279-293):

      “To assess the evidence for active inference over reinforcement learning, we fit active inference (Eq.9), model-free reinforcement learning, and model-based reinforcement learning models to the behavioral data of each participant. This involved optimizing the free parameters of active inference and reinforcement learning models. The resulting likelihood was used to calculate the Bayesian Information Criterion (BIC) (Vrieze 2012) as the evidence for each model. The free parameters for the active inference model (AL, AI, EX, prior, and α) scaled the contribution of the three terms that constitute the expected free energy in Eq.9. These coefficients can be regarded as precisions that characterize each participant's prior beliefs about contingencies and rewards. For example, increasing α means participants would update their beliefs about reward contingencies more quickly, increasing AL means participants would like to reduce ambiguity more, and increasing AI means participants would like to learn the hidden state of the environment and avoid risk more. The free parameters for the model-free reinforcement learning model are the learning rate α and the temperature parameter γ, and the free parameters for the model-based reinforcement learning model are the learning rate α, the temperature parameter γ, and prior (the details for the model-free reinforcement learning model can be seen in Eq.S1-11 and the details for the model-based reinforcement learning model can be seen in Eq.S12-23 in the Supplementary Method). The parameter fitting for these three models was conducted using the ‘BayesianOptimization’ package in Python (Frazier 2018), first randomly sampling 1000 times and then iterating for an additional 1000 times.”
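
      The model comparison step described above can be illustrated with the usual definition of the BIC, k·ln(n) − 2·ln(L), where k is the number of free parameters, n the number of observations, and L the maximized likelihood. The sketch below is illustrative only: the parameter counts follow the text (5, 2, and 3), while the log-likelihoods, the trial count, and the names `bic` and `fits` are hypothetical.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; lower values indicate a better
    trade-off between fit and model complexity."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Hypothetical maximized log-likelihoods for one participant (made-up numbers).
n_trials = 200
fits = {
    "active_inference": {"log_lik": -95.0, "n_params": 5},   # AL, AI, EX, prior, alpha
    "model_free_rl": {"log_lik": -120.0, "n_params": 2},     # alpha, gamma
    "model_based_rl": {"log_lik": -110.0, "n_params": 3},    # alpha, gamma, prior
}

scores = {name: bic(f["log_lik"], f["n_params"], n_trials) for name, f in fits.items()}
best_model = min(scores, key=scores.get)
print(scores)
print(best_model)
```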

      Vrieze, S. I. (2012). Model selection and psychological theory: a discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychological methods, 17(2), 228.

      Frazier, P. I. (2018). A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811.

      Comment 3:

      In terms of the time-dependent correlations with expected free energy - and its constituent terms - I think the report would benefit from overviewing these analyses with something like the following:

      "In the final analysis of the neuronal correlates of belief updating - as quantified by the epistemic and intrinsic values of expected free energy - we present a series of analyses in source space. These analyses tested for correlations between constituent terms in expected free energy and neuronal responses in source space. These correlations were over trials (and subjects). Because we were dealing with two-second timeseries, we were able to identify the periods of time during decision-making when the correlates were expressed.

      In these analyses, we focused on the induced power of neuronal activity at each point in time, at each brain source. To illustrate the functional specialisation of these neuronal correlates, we present whole-brain maps of correlation coefficients and pick out the most significant correlation for reporting fluctuations in selected correlations over two-second periods. These analyses are presented in a descriptive fashion to highlight the nature and variety of the neuronal correlates, which we unpack in relation to the existing EEG literature in the discussion. Note that we did not attempt to correct for multiple comparisons; largely, because the correlations observed were sustained over considerable time periods, which would be almost impossible under the null hypothesis of no correlations."

      Response 3: We deeply thank you for your comments and corresponding suggestions about our description of the regression analysis in the source space. In response to your suggestions, we have added corresponding content in the Results section (EEG results at source level, line 331-347):

      “In the final analysis of the neural correlates of the decision-making process, as quantified by the epistemic and intrinsic values of expected free energy, we presented a series of linear regressions in source space. These analyses tested for correlations over trials between constituent terms in expected free energy (the value of avoiding risk, the value of reducing ambiguity, extrinsic value, and expected free energy itself) and neural responses in source space. Additionally, we also investigated the neural correlate of (the degree of) risk, (the degree of) ambiguity, and prediction error. Because we were dealing with a two-second time series, we were able to identify the periods of time during decision-making when the correlates were expressed. The linear regression was run by the "mne.stats.linear_regression" function in the MNE package (Activity ~ Regressor + Intercept). Activity is the activity amplitude of the EEG signal in the source space and regressor is one of the regressors that we mentioned (e.g., expected free energy, the value of reducing ambiguity, etc.).

      In these analyses, we focused on the induced power of neural activity at each time point, in the brain source space. To illustrate the functional specialization of these neural correlates, we presented whole-brain maps of correlation coefficients and picked out the brain region with the most significant correlation for reporting fluctuations in selected correlations over two-second periods. These analyses were presented in a descriptive fashion to highlight the nature and variety of the neural correlates, which we unpacked in relation to the existing EEG literature in the discussion. Note that we did not attempt to correct for multiple comparisons; largely, because the correlations observed were sustained over considerable time periods, which would be almost impossible under the null hypothesis of no correlations.”
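
      The trial-wise regression described above (Activity ~ Regressor + Intercept, fitted separately at each source and time point) can be sketched with ordinary least squares in plain NumPy. This is a simplified stand-in for illustration, not the MNE routine the authors used; the function name `regress_per_timepoint` and the array layout (trials × sources × time points) are assumptions.

```python
import numpy as np

def regress_per_timepoint(activity, regressor):
    """Fit Activity ~ Regressor + Intercept over trials, independently for
    every source and time point.

    activity:  array of shape (n_trials, n_sources, n_times)
    regressor: array of shape (n_trials,), e.g. expected free energy per trial
    Returns slope and intercept arrays of shape (n_sources, n_times).
    """
    activity = np.asarray(activity, dtype=float)
    regressor = np.asarray(regressor, dtype=float)
    n_trials = activity.shape[0]
    # Design matrix with one regressor column and one intercept column.
    design = np.column_stack([regressor, np.ones(n_trials)])
    # Flatten sources and time points so one least-squares call fits them all.
    flat = activity.reshape(n_trials, -1)
    coefs, *_ = np.linalg.lstsq(design, flat, rcond=None)
    slope = coefs[0].reshape(activity.shape[1:])
    intercept = coefs[1].reshape(activity.shape[1:])
    return slope, intercept

# Synthetic check: activity built as 2 * regressor + 3 at every source and time.
x = np.arange(10.0)
y = 2.0 * x[:, None, None] + 3.0 + np.zeros((10, 2, 4))
slope, intercept = regress_per_timepoint(y, x)
print(slope.shape, intercept.shape)
```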

      Comment 4:

      There was a slight misdirection in the discussion of priors in the active inference framework. The notion that active inference requires a pre-specification of priors is a common misconception. Furthermore, it misses the point that the utility of Bayesian modelling is to identify the priors that each subject brings to the table. This could be easily addressed with something like the following in the discussion:

      "It is a common misconception that Bayesian approaches to choice behaviour (including active inference) are limited by a particular choice of priors. As illustrated in our fitting of choice behaviour above, priors are a strength of Bayesian approaches in the following sense: under the complete class theorem [5, 6], any pair of choice behaviours and reward functions can be described in terms of ideal Bayesian decision-making with particular priors. In other words, there always exists a description of choice behaviour in terms of some priors. This means that one can, in principle, characterise any given behaviour in terms of the priors that explain that behaviour. In our example, these were effectively priors over the precision of various preferences or beliefs about contingencies that underwrite expected free energy."

      Response 4: We deeply thank you for your comments and corresponding suggestions about the prior of Bayesian methods. In response to your suggestions, we have added corresponding content in the Discussion section (The strength of the active inference framework in decision-making, line 447-453):

      “However, it may be the opposite. As illustrated in our fitting results, priors can be a strength of Bayesian approaches. Under the complete class theorem (Wald 1947; Brown 1981), any pair of behavioral data and reward functions can be described in terms of ideal Bayesian decision-making with particular priors. In other words, there always exists a description of behavioral data in terms of some priors. This means that one can, in principle, characterize any given behavioral data in terms of the priors that explain that behavior. In our example, these were effectively priors over the precision of various preferences or beliefs about contingencies that underwrite expected free energy.”

      Wald, A. (1947). An essentially complete class of admissible decision functions. The Annals of Mathematical Statistics, 549-555.

      Brown, L. D. (1981). A complete class theorem for statistical problems with finite sample spaces. The Annals of Statistics, 1289-1300.

      Reviewer #2 (Public Review):

      Summary:

      Zhang and colleagues use a combination of behavioral, neural, and computational analyses to test an active inference model of exploration in a novel reinforcement learning task.

      Strengths:

      The paper addresses an important question (validation of active inference models of exploration). The combination of behavior, neuroimaging, and modeling is potentially powerful for answering this question.

      Response: We want to express our sincere gratitude for your thorough review of our work and for the valuable comments you have provided. Your attention to detail and dedication to improving the quality of the work are truly commendable. Your feedback has been invaluable in guiding us towards revisions that will strengthen the work. We have made targeted modifications based on most of the comments. However, due to factors such as time and energy constraints, we have not added corresponding analyses for several comments.

      Comment 1:

      The paper does not discuss relevant work on contextual bandits by Schulz, Collins, and others. It also does not mention the neuroimaging study of Tomov et al. (2020) using a risky/safe bandit task.

      Response 1:

      We deeply thank you for your suggestions about the relevant work. We now discuss and cite these representative papers in the Introduction section (line 42-55):

      “The decision-making process frequently involves grappling with varying forms of uncertainty, such as ambiguity - the kind of uncertainty that can be reduced through sampling, and risk - the inherent uncertainty (variance) presented by a stable environment. Studies have investigated these different forms of uncertainty in decision-making, focusing on their neural correlates (Daw et al., 2006; Badre et al., 2012; Cavanagh et al., 2012).

      These studies utilized different forms of multi-armed bandit tasks, e.g., the restless multi-armed bandit tasks (Daw et al., 2006; Guha et al., 2010), risky/safe bandit tasks (Tomov et al., 2020; Fan et al., 2022; Payzan et al., 2013), contextual multi-armed bandit tasks (Schulz et al., 2015; Schulz et al., 2015; Molinaro et al., 2023). However, these tasks either separate risk from ambiguity in uncertainty, or separate action from state (perception). In our work, we develop a contextual multi-armed bandit task to enable participants to actively reduce ambiguity, avoid risk, and maximize rewards using various policies (see Section 2.2 and Figure 4(a)). Our task makes it possible to study whether the brain represents these different types of uncertainty distinctly (Levy et al., 2010) and whether the brain represents both the value of reducing uncertainty and the degree of uncertainty. The active inference framework presents a theoretical approach to investigate these questions. Within this framework, uncertainties can be reduced to ambiguity and risk. Ambiguity is represented by the uncertainty about model parameters associated with choosing a particular action, while risk is signified by the variance of the environment's hidden states. The value of reducing ambiguity, the value of avoiding risk, and extrinsic value together constitute expected free energy (see Section 2.1).”

      Daw, N. D., O'doherty, J. P., Dayan, P., Seymour, B., & Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095), 876-879.

      Badre, D., Doll, B. B., Long, N. M., & Frank, M. J. (2012). Rostrolateral prefrontal cortex and individual differences in uncertainty-driven exploration. Neuron, 73(3), 595-607.

      Cavanagh, J. F., Figueroa, C. M., Cohen, M. X., & Frank, M. J. (2012). Frontal theta reflects uncertainty and unexpectedness during exploration and exploitation. Cerebral cortex, 22(11), 2575-2586.

      Guha, S., Munagala, K., & Shi, P. (2010). Approximation algorithms for restless bandit problems. Journal of the ACM (JACM), 58(1), 1-50.

      Tomov, M. S., Truong, V. Q., Hundia, R. A., & Gershman, S. J. (2020). Dissociable neural correlates of uncertainty underlie different exploration strategies. Nature communications, 11(1), 2371.

      Fan, H., Gershman, S. J., & Phelps, E. A. (2023). Trait somatic anxiety is associated with reduced directed exploration and underestimation of uncertainty. Nature Human Behaviour, 7(1), 102-113.

      Payzan-LeNestour, E., Dunne, S., Bossaerts, P., & O’Doherty, J. P. (2013). The neural representation of unexpected uncertainty during value-based decision making. Neuron, 79(1), 191-201.

      Schulz, E., Konstantinidis, E., & Speekenbrink, M. (2015, April). Exploration-exploitation in a contextual multi-armed bandit task. In International conference on cognitive modeling (pp. 118-123).

      Schulz, E., Konstantinidis, E., & Speekenbrink, M. (2015, November). Learning and decisions in contextual multi-armed bandit tasks. In CogSci.

      Molinaro, G., & Collins, A. G. (2023). Intrinsic rewards explain context-sensitive valuation in reinforcement learning. PLoS Biology, 21(7), e3002201.

      Levy, I., Snell, J., Nelson, A. J., Rustichini, A., & Glimcher, P. W. (2010). Neural representation of subjective value under risk and ambiguity. Journal of neurophysiology, 103(2), 1036-1047.

      Comment 2:

      The statistical reporting is inadequate. In most cases, only p-values are reported, not the relevant statistics, degrees of freedom, etc. It was also not clear if any corrections for multiple comparisons were applied. Many of the EEG results are described as "strong" or "robust" with significance levels of p<0.05; I am skeptical in the absence of more details, particularly given the fact that the corresponding plots do not seem particularly strong to me.

      Response 2: We deeply thank you for your comments about our statistical reporting. We have optimized the fitting model and rerun all the statistical analyses. As can be seen (Figure 6, 7, 8, S3, S4, S5), the new regression results are significantly improved compared to the previous ones. Due to the limitation of space, we place the other relevant statistical results, including t-values, std err, etc., on our GitHub (https://github.com/andlab-um/FreeEnergyEEG). Following Reviewer 1’s Comment 3, we have not applied corrections for multiple comparisons, for the reason stated there: “Note that we did not attempt to correct for multiple comparisons; largely, because the correlations observed were sustained over considerable time periods, which would be almost impossible under the null hypothesis of no correlations”.

      Author response image 1.

      Comment 3:

      The authors compare their active inference model to a "model-free RL" model. This model is not described anywhere, as far as I can tell. Thus, I have no idea how it was fit, how many parameters it has, etc. The active inference model fitting is also not described anywhere. Moreover, you cannot compare models based on log-likelihood, unless you are talking about held-out data. You need to penalize for model complexity. Finally, even if active inference outperforms a model-free RL model (doubtful given the error bars in Fig. 4c), I don't see how this is strong evidence for active inference per se. I would want to see a much more extensive model comparison, including model-based RL algorithms which are not based on active inference, as well as model recovery analyses confirming that the models can actually be distinguished on the basis of the experimental data.

      Response 3: We deeply thank you for your comments about the model comparison details. We previously omitted some information about the comparison model, as classical reinforcement learning is not the focus of our work, so we put the specific details in the supplementary materials. Now we have placed relevant information in the main text (see the part we have highlighted in yellow). We have now added the relevant information regarding the model comparison in the Results section (Behavioral results, line 279-293):

      “To assess the evidence for active inference over reinforcement learning, we fit active inference (Eq.9), model-free reinforcement learning, and model-based reinforcement learning models to the behavioral data of each participant. This involved optimizing the free parameters of active inference and reinforcement learning models. The resulting likelihood was used to calculate the Bayesian Information Criterion (BIC) as the evidence for each model. The free parameters for the active inference model (AL, AI, EX, prior, and α) scaled the contribution of the three terms that constitute the expected free energy in Eq.9. These coefficients can be regarded as precisions that characterize each participant's prior beliefs about contingencies and rewards. For example, increasing α means participants would update their beliefs about reward contingencies more quickly, increasing AL means participants would like to reduce ambiguity more, and increasing AI means participants would like to learn the hidden state of the environment and avoid risk more. The free parameters for the model-free reinforcement learning model are the learning rate α and the temperature parameter γ, and the free parameters for the model-based reinforcement learning model are the learning rate α, the temperature parameter γ, and prior (the details for the model-free reinforcement learning model can be found in Eq.S1-11 and the details for the model-based reinforcement learning model can be found in Eq.S12-23 in the Supplementary Method). The parameter fitting for these three models was conducted using the 'BayesianOptimization' package in Python, first randomly sampling 1000 times and then iterating for an additional 1000 times.”
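      As an illustration of the comparison described in the passage above, the following is a minimal sketch of a BIC calculation for the three models. The parameter counts follow the text (five, two, and three free parameters); the log-likelihood values and the number of observations (here assumed to be 240, i.e., two choices in each of 120 trials) are hypothetical placeholders rather than values from the study.

```python
import math


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian Information Criterion; lower values indicate a better model."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Free-parameter counts as listed in the text above.
N_PARAMS = {
    "active_inference": 5,  # AL, AI, EX, prior, alpha
    "model_free_rl": 2,     # learning rate, temperature
    "model_based_rl": 3,    # learning rate, temperature, prior
}


def compare_models(log_likelihoods: dict, n_obs: int) -> dict:
    """Return the BIC of every model for one participant."""
    return {
        name: bic(ll, N_PARAMS[name], n_obs)
        for name, ll in log_likelihoods.items()
    }


# Hypothetical maximized log-likelihoods for a single participant.
example_ll = {
    "active_inference": -120.0,
    "model_free_rl": -140.0,
    "model_based_rl": -135.0,
}

# Assumption: 120 trials with two choices each gives 240 observations.
scores = compare_models(example_ll, n_obs=240)
best_model = min(scores, key=scores.get)
```

      Because the penalty term grows with the number of parameters, the five-parameter model is preferred only when its likelihood gain outweighs that penalty, which addresses the concern about comparing raw log-likelihoods.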

      We have now incorporated model-based reinforcement learning into our comparison models and placed the descriptions of both model-free and model-based reinforcement learning algorithms in the supplementary materials. We have also changed the criterion for model comparison to Bayesian Information Criterion. As indicated by the results, the performance of the active inference model significantly outperforms both comparison models.

      We had not performed model recovery previously; we have now done so and placed the relevant results in the supplementary materials. The result figures show that each model best fits the simulated data that it generated:

      “To demonstrate how reliable our models are (the active inference model, model-free reinforcement learning model, and model-based reinforcement learning model), we ran simulation experiments for model recovery. We used these three models, with their own fitted parameters, to generate simulated data. We then fit all three sets of simulated data using each of the three models.

      The model recovery results are shown in Fig.S6. This is the confusion matrix of models: the percentage of all subjects simulated based on a certain model that is fitted best by a certain model. The goodness-of-fit was compared using the Bayesian Information Criterion. We can see that the result of model recovery is very good, and the simulated data generated by a model can be best explained by this model.”
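      The confusion matrix described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function name, the array layout, and the toy BIC values are all hypothetical.

```python
import numpy as np


def recovery_confusion_matrix(bic_scores: np.ndarray) -> np.ndarray:
    """Build a model-recovery confusion matrix from BIC values.

    bic_scores[g, s, f] is the BIC obtained when model f is fitted to
    simulated data set s generated by model g (hypothetical layout).
    Entry [g, f] of the result is the fraction of data sets simulated
    from model g that are best fitted (lowest BIC) by model f.
    """
    n_models = bic_scores.shape[2]
    winners = np.argmin(bic_scores, axis=2)  # shape: (generating, data set)
    columns = [(winners == f).mean(axis=1) for f in range(n_models)]
    return np.stack(columns, axis=1)


# Toy example: 3 models, 4 simulated data sets per generating model.
toy_bic = np.full((3, 4, 3), 100.0)
for g in range(3):
    toy_bic[g, :, g] = 90.0   # the generating model usually wins
toy_bic[0, 0, 1] = 80.0       # one data set from model 0 is misattributed

confusion = recovery_confusion_matrix(toy_bic)
```

      A matrix close to the identity indicates that the models can be distinguished on the basis of the simulated data; off-diagonal mass shows which models are confused with one another.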

      Author response image 2.

      Comment 4:

      Another aspect of the behavioral modeling that's missing is a direct descriptive comparison between model and human behavior, beyond just plotting log-likelihoods (which are a very impoverished measure of what's going on).

      Response 4: We deeply thank you for your comments about the comparison between the model and human behavior. Due to the slight differences between our simulation experiments and real behavioral experiments (the "you can ask" stage), we cannot directly compare the model and participants' behaviors. However, we can observe that in the main text's simulation experiment (Figure 3), the active inference agent's behavior is highly consistent with humans (Figure 4), exhibiting an effective exploration strategy and a desire to reduce uncertainty. Moreover, we have included two additional simulation experiments in the supplementary materials, which demonstrate that active inference may potentially fit a wide range of participants' behavioral strategies.

      Author response image 3.

      (An active inference agent with AL=AI=EX=0. It can accomplish tasks efficiently like a human being, reducing the uncertainty of the environment and maximizing the reward.)

      Author response image 4.

      (An active inference agent with AL=AI=0, EX=10. It will only pursue immediate rewards (not choosing the "Cue" option due to additional costs), but it can also gradually optimize its strategy due to random effects.)

      Author response image 5.

      (An active inference agent with EX=0, AI=AL=10. It will only pursue environmental information to reduce the uncertainty of the environment. Even in "Context 2" where immediate rewards are scarce, it will continue to explore.) (a) shows the decision-making of active inference agents in the Stay-Cue choice. Blue corresponds to agents choosing the "Cue" option and acquiring "Context 1"; orange corresponds to agents choosing the "Cue" option and acquiring "Context 2"; purple corresponds to agents choosing the "Stay" option and not knowing the information about the hidden state of the environment. The shaded areas below correspond to the probability of the agents making the respective choices. (b) shows the decision-making of active inference agents in the Safe-Risky choice. The shaded areas below correspond to the probability of the agents making the respective choices. (c) shows the rewards obtained by active inference agents. (d) shows the reward prediction errors of active inference agents. (e) shows the reward predictions of active inference agents for the "Risky" path in "Context 1" and "Context 2".

      Comment 5:

      The EEG results are intriguing, but it wasn't clear that these provide strong evidence specifically for the active inference model. No alternative models of the EEG data are evaluated.

      Overall, the central claim in the Discussion ("we demonstrated that the active inference model framework effectively describes real-world decision-making") remains unvalidated in my opinion.

      Response 5: We deeply thank you for your comments. We applied the active inference model to analyze EEG results because it best fit the participants' behavioral data among our models, including the newly added results. Further, our EEG results serve only to verify that the active inference model can be used to analyze the neural mechanisms of decision-making in uncertain environments (in principle, one could certainly design a better reinforcement learning model with a similar exploration strategy). We aim to emphasize the consistency between active inference and human decision-making in uncertain environments, as we have discussed in the article. Active inference emphasizes both perception and action, which is also what we wish to highlight: during the decision-making process, participants not only passively receive information, but also actively adopt different strategies to reduce uncertainty and maximize rewards.

      Reviewer #3 (Public Review):

      Summary:

      This paper aims to investigate how the human brain represents different forms of value and uncertainty that participate in active inference within a free-energy framework, in a two-stage decision task involving contextual information sampling, and choices between safe and risky rewards, which promotes a shift from exploration to exploitation. They examine neural correlates by recording EEG and comparing activity in the first vs second half of trials and between trials in which subjects did and did not sample contextual information, and perform a regression with free-energy-related regressors against data "mapped to source space." Their results show effects in various regions, which they take to indicate that the brain does perform this task through the theorised active inference scheme.

      Strengths:

      This is an interesting two-stage paradigm that incorporates several interesting processes of learning, exploration/exploitation, and information sampling. Although scalp/brain regions showing sensitivity to the active-inference-related quantities do not necessarily suggest what role they play, it can be illuminating and useful to search for such effects as candidates for further investigation. The aims are ambitious, and methodologically it is impressive to include extensive free-energy theory, behavioural modelling, and EEG source-level analysis in one paper.

      Response: We would like to express our heartfelt thanks to you for carefully reviewing our work and offering insightful feedback. Your attention to detail and commitment to enhancing the overall quality of our work are deeply admirable. Your input has been extremely helpful in guiding us through the necessary revisions to enhance the work. We have implemented focused changes based on a majority of your comments. Nevertheless, owing to limitations such as time and resources, we have not included corresponding analyses for a few comments.

      Comment 1:

      Though I could surmise the above general aims, I could not follow the important details of what quantities were being distinguished and sought in the EEG and why. Some of this is down to theoretical complexity - the dizzying array of constructs and terms with complex interrelationships, which may simply be part and parcel of free-energy-based theories of active inference - but much of it is down to missing or ambiguous details.

      Response 1: We deeply thank you for your comments about our work’s readability. We have significantly revised the descriptions of active inference, models, research questions, etc. Focusing on active inference and the free energy principle, we have added relevant basic descriptions and unified the terminology. We have added information related to model comparison in the main text and supplementary materials. We presented our regression results in clearer language. Our research focused on the brain's representation of decision-making in uncertain environments, including expected free energy, the value of reducing ambiguity, the value of avoiding risk, extrinsic value, ambiguity, and risk.

      Comment 2:

      In general, an insufficient effort has been made to make the paper accessible to readers not steeped in the free energy principle and active inference. There are critical inconsistencies in key terminology; for example, the introduction states that aim 1 is to distinguish the EEG correlates of three different types of uncertainty: ambiguity, risk, and unexpected uncertainty. But the abstract instead highlights distinctions in EEG correlates between "uncertainty... and... risk" and between "expected free energy .. and ... uncertainty." There are also inconsistencies in mathematical labelling (e.g. in one place 'p(s|o)' and 'q(s)' swap their meanings from one sentence to the very next).

      Response 2: We deeply thank you for your comments about the problem of inconsistent terminology. First, we have unified the symbols and letters (P, Q, s, o, etc.) that appeared in the article and described their respective meanings more clearly. We have also revised the relevant expressions of "uncertainty" throughout the text. In our work, uncertainty refers to ambiguity and risk. Ambiguity can be reduced through continuous sampling and is referred to as uncertainty about model parameters in our work. Risk, on the other hand, is the inherent variance of the environment and cannot be reduced through sampling, which is referred to as uncertainty about hidden states in our work. In the analysis of the results, we focused on how the brain encodes the value of reducing ambiguity (Figure 8), the value of avoiding risk (Figure 6), and (the degree of) ambiguity (Figure S5) during action selection. We also analyzed how the brain encodes reducing ambiguity and avoiding risk during belief update (Figure 7).

      Comment 3:

      Some basic but important task information is missing, and makes a huge difference to how decision quantities can be decoded from EEG. For example:

      - How do the subjects press the left/right buttons - with different hands or different fingers on the same hand?

      Response 3: We deeply thank you for your comments about the missing task information. We have added the relevant content in the Methods section (Contextual two-armed bandit task and Data collection, line 251-253):

      “Each stage was separated by a jitter ranging from 0.6 to 1.0 seconds. The entire experiment consists of a single block with a total of 120 trials. The participants are required to use any two fingers of one hand to press the buttons (left arrow and right arrow on the keyboard).”

      Comment 4:

      - Was the presentation of the Stay/cue and safe/risky options on the left/right sides counterbalanced? If not, decisions can be formed well in advance especially once a policy is in place.

      Response 4: The presentation of the Stay/cue and safe/risky options on the left/right sides was not counterbalanced. It is true that participants may have made decisions ahead of time. However, to better study the state of participants during decision-making, our choice stages consist of two parts. In the first two seconds, we ask participants to consider which option they would choose, and after these two seconds, participants are allowed to make their choice (by pressing the button).

      We also updated the figure of the experiment procedure as below (We circled the time that the participants spent on making decisions).

      Author response image 6.

      Comment 5:

      - What were the actual reward distributions ("magnitude X with probability p, magnitude y with probability 1-p") in the risky option?

      Response 5: We deeply thank you for your comments about the missing task information. We have placed the relevant content in the Methods section (Contextual two-armed bandit task and Data collection, line 188-191):

      “The actual reward distribution of the risky path in "Context 1" was [+12 (55%), +9 (25%), +6 (10%), +3 (5%), +0 (5%)] and the actual reward distribution of the risky path in "Context 2" was [+12 (5%), +9 (5%), +6 (10%), +3 (25%), +0 (55%)].”
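      To make the two contexts easier to compare with the safe path (a fixed 6 apples, according to the task instructions quoted elsewhere in this response), the short calculation below derives the mean and variance of each risky distribution from the probabilities given above. The variable and function names are illustrative only.

```python
# Reward magnitudes and probabilities exactly as listed in the text above.
REWARDS = [12, 9, 6, 3, 0]
CONTEXT_PROBS = {
    "Context 1": [0.55, 0.25, 0.10, 0.05, 0.05],
    "Context 2": [0.05, 0.05, 0.10, 0.25, 0.55],
}


def mean_and_variance(values, probs):
    """Mean and variance of a discrete reward distribution."""
    mean = sum(v * p for v, p in zip(values, probs))
    variance = sum(p * (v - mean) ** 2 for v, p in zip(values, probs))
    return mean, variance


summary = {
    name: mean_and_variance(REWARDS, probs)
    for name, probs in CONTEXT_PROBS.items()
}

for name, (mean, variance) in summary.items():
    print(f"{name}: mean = {mean:.2f}, variance = {variance:.2f}")
```

      The risky path is therefore better than the safe path on average in "Context 1" (mean 9.6) and worse in "Context 2" (mean 2.4), while the two distributions are mirror images and have the same variance (11.34).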

      Comment 6:

      The EEG analysis is not sufficiently detailed and motivated.

      For example,

      - why the high lower-filter cutoff of 1 Hz, and shouldn't it be acknowledged that this removes from the EEG any sustained, iteratively updated representation that evolves with learning across trials?

      Response 6: We deeply thank you for your comments about our EEG analysis. The 1Hz high-pass filter may indeed filter out some useful information. We chose a 1Hz high-pass filter to filter out most of the noise and prevent the noise from affecting our results analysis. Additionally, there are also many decision-related works that have applied 1Hz high-pass filtering in EEG data preprocessing (Yau et al., 2021; Cortes et al., 2021; Wischnewski et al., 2022; Schutte et al., 2017; Mennella et al., 2020; Giustiniani et al., 2020).

      Yau, Y., Hinault, T., Taylor, M., Cisek, P., Fellows, L. K., & Dagher, A. (2021). Evidence and urgency related EEG signals during dynamic decision-making in humans. Journal of Neuroscience, 41(26), 5711-5722.

      Cortes, P. M., García-Hernández, J. P., Iribe-Burgos, F. A., Hernández-González, M., Sotelo-Tapia, C., & Guevara, M. A. (2021). Temporal division of the decision-making process: An EEG study. Brain Research, 1769, 147592.

      Wischnewski, M., & Compen, B. (2022). Effects of theta transcranial alternating current stimulation (tACS) on exploration and exploitation during uncertain decision-making. Behavioural Brain Research, 426, 113840.

      Schutte, I., Kenemans, J. L., & Schutter, D. J. (2017). Resting-state theta/beta EEG ratio is associated with reward-and punishment-related reversal learning. Cognitive, Affective, & Behavioral Neuroscience, 17, 754-763.

      Mennella, R., Vilarem, E., & Grèzes, J. (2020). Rapid approach-avoidance responses to emotional displays reflect value-based decisions: Neural evidence from an EEG study. NeuroImage, 222, 117253.

      Giustiniani, J., Nicolier, M., Teti Mayer, J., Chabin, T., Masse, C., Galmès, N., ... & Gabriel, D. (2020). Behavioral and neural arguments of motivational influence on decision making during uncertainty. Frontiers in Neuroscience, 14, 583.

      Comment 7:

      - Since the EEG analysis was done using an array of free-energy-related variables in a regression, was multicollinearity checked between these variables?

      Response 7: We deeply thank you for your comments about our regression. Indeed, we didn't specify our regression formula in the main text. We conducted regression on one variable each time, so there was no need for a multicollinearity check. We have now added the relevant content in the Results section (“EEG results at source level” section, line 337-340):

      “The linear regression was run by the "mne.stats.linear_regression" function in the MNE package (Activity ~ Regressor + Intercept). Activity is the activity amplitude of the EEG signal in the source space and regressor is one of the regressors that we mentioned (e.g., expected free energy, the value of reducing ambiguity, etc.).”
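      For readers unfamiliar with this kind of single-regressor, mass-univariate analysis, the sketch below fits Activity ~ Regressor + Intercept independently at every source and time point. It uses plain NumPy least squares rather than the MNE function named above, and the array shapes, function name, and simulated data are assumptions made purely for illustration.

```python
import numpy as np


def mass_univariate_regression(activity: np.ndarray, regressor: np.ndarray):
    """Fit Activity ~ Regressor + Intercept at each source and time point.

    activity:  array of shape (n_trials, n_sources, n_times)
    regressor: array of shape (n_trials,), e.g. a trial-wise model quantity
    Returns the slope and its t-value, each of shape (n_sources, n_times).
    """
    n_trials = activity.shape[0]
    design = np.column_stack([regressor, np.ones(n_trials)])
    flat = activity.reshape(n_trials, -1)

    # Ordinary least squares for all source/time columns at once.
    coef, _, _, _ = np.linalg.lstsq(design, flat, rcond=None)
    residuals = flat - design @ coef
    dof = n_trials - design.shape[1]
    sigma2 = (residuals ** 2).sum(axis=0) / dof

    # Standard error of the slope from the (X'X)^-1 diagonal.
    xtx_inv = np.linalg.inv(design.T @ design)
    slope_se = np.sqrt(sigma2 * xtx_inv[0, 0])

    out_shape = activity.shape[1:]
    slope = coef[0].reshape(out_shape)
    t_values = (coef[0] / slope_se).reshape(out_shape)
    return slope, t_values


# Simulated data: 120 trials, 3 sources, 5 time points (all hypothetical).
rng = np.random.default_rng(0)
regressor = rng.normal(size=120)
activity = 0.1 * rng.normal(size=(120, 3, 5))
activity[:, 0, :] += 2.0 * regressor[:, None]  # only source 0 carries signal

slope, t_values = mass_univariate_regression(activity, regressor)
```

      Because each regressor enters its own model, collinearity between regressors does not affect any single fit, although correlated regressors will still tend to produce similar maps.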

      Comment 8:

      - In the initial comparison of the first/second half, why just 5 clusters of electrodes, and why these particular clusters?

      Response 8: We deeply thank you for your comments about our sensor-level analysis. These five clusters are relatively common scalp EEG regions to analyze (left frontal, right frontal, central, left parietal, and right parietal), and we followed previous work that analyzed these five clusters of electrodes (Laufs et al., 2006; Ray et al., 1985; Cole et al., 1985). In addition, our work pays more attention to the analysis in source space, exploring the corresponding functions of specific brain regions based on active inference models.

      Laufs, H., Holt, J. L., Elfont, R., Krams, M., Paul, J. S., Krakow, K., & Kleinschmidt, A. (2006). Where the BOLD signal goes when alpha EEG leaves. Neuroimage, 31(4), 1408-1418.

      Ray, W. J., & Cole, H. W. (1985). EEG activity during cognitive processing: influence of attentional factors. International Journal of Psychophysiology, 3(1), 43-48.

      Cole, H. W., & Ray, W. J. (1985). EEG correlates of emotional tasks related to attentional demands. International Journal of Psychophysiology, 3(1), 33-41.

      Comment 9:

      How many different variables are systematically different in the first vs second half, and how do you rule out less interesting time-on-task effects such as engagement or alertness? In what time windows are these amplitudes being measured?

      Response 9 (and the Response for Weaknesses 11): There were no systematic differences between the first half and the second half of the trials, with the only difference being the participants' experience. In the second half, participants had a better understanding of the reward distribution of the task (less ambiguity). The simulation results can well describe these.

      Author response image 7.

      As shown in Figure (a), agents can only learn about the hidden state of the environment ("Context 1" (green) or "Context 2" (orange)) by choosing the "Cue" option. If agents choose the "Stay" option, they will not be able to know the hidden state of the environment (purple). The risk of agents is only related to whether they choose the "Cue" option, not the number of rounds. Figure (b) shows the Safe-Risky choices of agents, and Figure (e) is the reward prediction of agents for the "Risky" path in "Context 1" and "Context 2". We can see that agents update the expected reward and reduce ambiguity by sampling the "Risky" path. The ambiguity of agents is not related to the "Cue" option, but to the number of times they sample the "Risky" path (rounds).

      In our choosing stages, participants were required to think about their choices for the first two seconds (during which they could not press buttons). Then, they were asked to make their choices (press buttons) within the next two seconds. This setup effectively kept participants' attention focused on the task. The two-second period of the “Second choice” stage, during which participants decide which option to choose (and cannot press buttons), was used for the sensor-level analysis.

      Comment 10:

      In the comparison of asked and not-asked trials, what trial stage and time window is being measured?

      Response 10: We have added relevant descriptions in the main text. The two-second period of the “Second choice” stage, during which participants decide which option to choose (and cannot press buttons), was used for the sensor-level analysis.

      Author response image 8.

      Comment 11:

      Again, how many different variables, of the many estimated per trial in the active inference model, are different in the asked and not-asked trials, and how can you know which of these differences is the one reflected in the EEG effects?

      Response 11: The difference between asked trials and not-asked trials lies only in whether participants know the specific context of the risky path (the level of risk for the participants). A simple comparison indeed cannot tell us which of these differences is reflected in the EEG effects. Therefore, we subsequently conducted model-based regression analysis in the source space.

      Comment 12:

      The authors choose to interpret that on not-asked trials the subjects are more uncertain because the cue doesn't give them the context, but you could equally argue that they don't ask because they are more certain of the possible hidden states.

      Response 12: Our task design involves randomly varying the context of the risky path. Only by choosing to inquire can participants learn about the context. Participants can become increasingly certain about the reward distribution of each context of the risky path, but they cannot determine which specific context applies on a given trial without asking. The task instructions given to participants are as follows (lines 226-231).

      "You are on a quest for apples in a forest, beginning with 5 apples. You encounter two paths: 1) The left path offers a fixed yield of 6 apples per excursion. 2) The right path offers a probabilistic reward of 0/3/6/9/12 apples, and it has two distinct contexts, labeled "Context 1" and "Context 2," each with a different reward distribution. Note that the context associated with the right path will randomly change in each trial. Before selecting a path, a ranger will provide information about the context of the right path ("Context 1" or "Context 2") in exchange for an apple. The more apples you collect, the greater your monetary reward will be."

      Comment 13:

      - The EEG regressors are not fully explained. For example, an "active learning" regressor is listed as one of the 4 at the beginning of section 3.3, but it is the first mention of this term in the paper and the term does not arise once in the methods.

      Response 13: We have accordingly revised the relevant content in the main text (as in Eq.8). Our regressors now include expected free energy, the value of reducing ambiguity, the value of avoiding risk, extrinsic value, prediction error, (the degree of) ambiguity, reducing ambiguity, and avoiding risk.

      Comment 14:

      - In general, it is not clear how one can know that the EEG results reflect that the brain is purposefully encoding these very parameters while implementing this very mechanism, and not other, possibly simpler, factors that correlate with them since there is no engagement with such potential confounds or alternative models. For example, a model-free reinforcement learning model is fit to behaviour for comparison. Why not the EEG?

      Response 14: We deeply thank you for your comments. Due to constraints of time and effort, and because the active inference model best fits the behavioral data of the participants, we did not use other models to analyze the EEG data. At both the sensor and source levels, we observed EEG signals and brain regions that can encode different levels of uncertainty (risk and ambiguity). The brain's uncertainty-driven exploration mechanism cannot be explained solely by a simple model-free reinforcement learning approach.

      Recommendations for the authors:

      Response: We have made point-to-point revisions according to the reviewer's recommendations, and as these revisions are relatively minor, we have only responded to the longer recommendations here.

      Reviewer #1 (Recommendations For The Authors)

      I enjoyed reading this sophisticated study of decision-making. I thought your implementation of active inference and the subsequent fitting to choice behaviour - and study of the neuronal (EEG) correlates - was impressive. As noted in my comments on strengths and weaknesses, some parts of your manuscript were difficult to read because of slight collapses in grammar and an inconsistent use of terms when referring to the mathematical quantities. In addition to the paragraphs I have suggested, I would recommend the following minor revisions to your text. You will also have to fill in some of the details that were missing from the current version of the manuscript. For example:

      Recommendation 1:

      Which RL model did you use to fit the behavioural data? What were its free parameters?

      Response 1: We have now added information related to the comparison models in the behavioral results and supplementary materials. We applied both simple model-free reinforcement learning and model-based reinforcement learning. The free parameters for the model-free reinforcement learning model are the learning rate α and the temperature parameter γ, while the free parameters for the model-based approach are the learning rate α, the temperature parameter γ, and the prior.
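
      To make the role of the two free parameters concrete, the following is a minimal sketch of a model-free learner with a learning rate (alpha) and a softmax temperature. It is not the authors' implementation: the delta-rule update, the softmax form, the uniform reward distribution for the risky path, and all function names are assumptions made for illustration only.

```python
import math
import random


def softmax(values, temperature):
    """Turn action values into choice probabilities (higher temperature = more random)."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def update_value(value, reward, alpha):
    """Delta-rule update: move the value toward the reward by a fraction alpha."""
    return value + alpha * (reward - value)


def simulate(n_trials, alpha, temperature, seed=0):
    """Simulate choices between a safe path (index 0) and a risky path (index 1)."""
    rng = random.Random(seed)
    values = [0.0, 0.0]
    choices = []
    for _ in range(n_trials):
        p_safe = softmax(values, temperature)[0]
        action = 0 if rng.random() < p_safe else 1
        # Safe path: fixed 6 apples. Risky path: uniform draw (a hypothetical distribution).
        reward = 6 if action == 0 else rng.choice([0, 3, 6, 9, 12])
        values[action] = update_value(values[action], reward, alpha)
        choices.append(action)
    return values, choices
```

      In this sketch, a larger alpha makes the learned values track recent rewards more closely, and a larger temperature makes choices less sensitive to the difference between the two values.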

      Recommendation 2:

      When you talk about neuronal activity in the final analyses (of time-dependent correlations) what was used to measure the neuronal activity? Was this global power over frequencies? Was it at a particular frequency band? Was it the maximum amplitude within some small window et cetera? In other words, you need to provide the details of your analysis that would enable somebody to reproduce your study at a certain level of detail.

      Response 2: In the final analyses, we used the activity amplitude at each point in the source space for our analysis. Previously, we had planned to make our data and models available on GitHub to facilitate easier replication of our work.

      Reviewer #3 (Recommendations For The Authors)

      Recommendation 1:

      It might help to explain the complex concepts up front, to use the concrete example of the task itself - presumably, it was designed so that the crucial elements of the active inference framework come to the fore. One could use hypothetical choice patterns in this task to exemplify different factors such as expected free energy and unexpected uncertainty at work. It would also be illuminating to explain why behaviour on this task is fit better by the active inference model than a model-free reinforcement learning model.

      Response 1: Thank you for your suggestions. We have given clearer explanations to the three terms in the active inference formula: the value of reducing ambiguity, the value of avoiding risk, and the extrinsic value (Eq.8), which makes it easier for readers to understand active inference.

      In addition, we can simply view active inference as a computational model similar to model-based reinforcement learning, where the expected free energy represents a subjective value, without needing to understand its underlying computational principles or neurobiological background. In our discussion, we have argued why the active inference model fits the participants' behavior better than our reinforcement learning model, as the active inference model has an inherent exploration mechanism that is consistent with humans, who instinctively want to reduce environmental uncertainty (line 435-442).

      “Active inference offers a superior exploration mechanism compared with basic model-free reinforcement learning  (Figure 4 (c)). Since traditional reinforcement learning models determine their policies solely on the state, this setting leads to difficulty in extracting temporal information (Laskin et al., 2020) and increases the likelihood of entrapment within local minima. In contrast, the policies in active inference are determined by both time and state. This dependence on time (Wang et al., 2016) enables policies to adapt efficiently, such as emphasizing exploration in the initial stages and exploitation later on. Moreover, this mechanism prompts more exploratory behavior in instances of state ambiguity. A further advantage of active inference lies in its adaptability to different task environments (Friston et al., 2017). It can configure different generative models to address distinct tasks, and compute varied forms of free energy and expected free energy.”

      Laskin, M., Lee, K., Stooke, A., Pinto, L., Abbeel, P., & Srinivas, A. (2020). Reinforcement learning with augmented data. Advances in neural information processing systems, 33, 19884-19895.

      Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., ... & Botvinick, M. (2016). Learning to reinforcement learn. arXiv preprint arXiv:1611.05763.

      Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., & Pezzulo, G. (2017). Active inference: a process theory. Neural computation, 29(1), 1-49.

      Recommendation 2:

      Figure 1A provides a key example of the lack of effort to help the reader understand. It suggests the possibility of a concrete example but falls short of providing one. From the caption and text, applied to the figure, I gather that by choosing either to run or to raise one's arms, one can control whether it is daytime or nighttime. This is clearly wrong but it is what I am led to think by the paper.

      Response 2: Thank you for your suggestion, which we had not considered before. In this figure, we aim to illustrate that "the agent receives observations and optimizes his cognitive model by minimizing variational free energy → the agent makes the optimal action by minimizing expected free energy → the action changes the environment → the environment generates new observations for the agent." We have now modified the image to be simpler to prevent any possible confusion for readers. Correspondingly, we removed the figure of a person raising their hand and the shadowed house in Figure a.

      Author response image 9.

      Recommendation 3:

      I recommend an overhaul in the labelling and methodological explanations for consistency and full reporting. For example, line 73 says sensory input is 's' and the cognitive model is 'q(s),' and the cause of the sensory input is 'p(s|o)' but on the very next line, the cognitive model is 'p(s|o)' and the causes of sensory input are 'q(s).' How this sensory input s relates to 'observations' or 'o' is unclear, and meanwhile, capital S is the set of environmental states. P seems to refer to the generative distribution, but it also means probability.

      Response 3: Thank you for your advice. Now we have revised the corresponding labeling and methodological explanations in our work to make them consistent. However, we are not sure how to make a good modification to P here. In many works, P can refer to a certain probability distribution or some specific probabilities.

      Recommendation 4:

      Even the conception of a "policy" is unclear (Figure 2B). They list 4 possible policies, which are simply the 4 possible sequences of steps, stay-safe, cue-risky, etc, but with no contingencies in them. Surely a complete policy that lists 'cue' as the first step would entail a specification of how they would choose the safe or risky option BASED on the information in that cue.

      Response 4: Thank you for your suggestion. In active inference, a policy actually corresponds to a sequence of actions. The policy of "first choosing 'Cue' and then making the next decision based on specific information" differs from the meaning of policy in active inference.

      Recommendation 5:

      I assume that the heavy high pass filtering of the EEG (1 Hz) is to avoid having to baseline-correct the epochs (of which there is no mention), but the authors should directly acknowledge that this eradicates any component of decision formation that may evolve in any way gradually within or across the stages of the trial. To take an extreme example, as Figure 3E shows, the expected rewards for the risky path evolve slowly over the course of 60 trials. The filter would eliminate this.

      Response 5: Thank you for your suggestion. The heavy high pass filtering of the EEG (1 Hz) is to minimize the noise in the EEG data as much as possible.
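
      The reviewer's concern about slowly evolving components can be illustrated with a toy example. The sketch below applies a simple first-order high-pass filter with a 1 Hz cutoff to a slow ramp and to a 10 Hz oscillation. It is not the filter used in the study (the filter design is not described here); the sampling rate, the signals, and the function name are hypothetical.

```python
import math


def highpass(signal, cutoff_hz, sampling_hz):
    """First-order (RC) high-pass filter applied sample by sample."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sampling_hz
    a = rc / (rc + dt)
    out = [0.0] * len(signal)
    for i in range(1, len(signal)):
        out[i] = a * (out[i - 1] + signal[i] - signal[i - 1])
    return out


fs = 250.0  # hypothetical sampling rate in Hz
t = [i / fs for i in range(int(10 * fs))]
slow_drift = [0.1 * ti for ti in t]  # ramps from 0 to about 1 over 10 s
fast_wave = [math.sin(2 * math.pi * 10 * ti) for ti in t]  # 10 Hz oscillation

drift_out = highpass(slow_drift, 1.0, fs)
wave_out = highpass(fast_wave, 1.0, fs)

print("drift amplitude before/after:", max(slow_drift), max(abs(v) for v in drift_out))
print("10 Hz amplitude after:", max(abs(v) for v in wave_out[len(wave_out) // 2:]))
```

      Under these assumptions the slow ramp is strongly attenuated while the 10 Hz oscillation passes almost unchanged, which is the trade-off the reviewer asks the authors to acknowledge.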

      Recommendation 6:

      There is no mention of the regression itself in the Methods section - the section is incomplete.

      Response 6: Thank you for your suggestion. We have now added the relevant content in the Results section (EEG results at source level, line 337-340):

      “The linear regression was run with the "mne.stats.linear_regression" function in the MNE package (Activity ∼ Regressor + Intercept, where Activity is the activity amplitude of the EEG signal in the source space and Regressor is one of the regressors that we mentioned).”
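
      The model form quoted above (Activity ∼ Regressor + Intercept) can be sketched for a single source point and a single regressor as an ordinary least-squares fit. This is a plain-Python illustration of the model form only, not the MNE routine or the authors' pipeline; the function name and the example values are hypothetical.

```python
def fit_simple_regression(regressor, activity):
    """Ordinary least squares for activity = slope * regressor + intercept."""
    n = len(regressor)
    if n != len(activity) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(regressor) / n
    mean_y = sum(activity) / n
    sxx = sum((x - mean_x) ** 2 for x in regressor)
    if sxx == 0:
        raise ValueError("regressor is constant; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(regressor, activity))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical per-trial values: one model-derived regressor and one activity amplitude.
slope, intercept = fit_simple_regression([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

      In a source-space analysis, a fit of this kind would be repeated for every source point and time sample, with the slope indicating how strongly the activity covaries with the regressor.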

      Recommendation 7:

      On Lines 260-270 the same results are given twice.

      Response 7: Thank you for your suggestion. We have now deleted redundant content.

      Recommendation 8:

      Frequency bands are displayed in Figure 5 but there is no mention of those in the Methods. In Figure 5b Theta in the 2nd half is compared to Delta in the 1st half- is this an error?

      Response 8: Thank you for your suggestion. It indeed was an error (they should all be Theta) and now we have corrected it.

      Author response image 10.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer 1

      Major points:

      R1C1: I appreciate that the data are aligned, in some points, with related studies of this niche. However, it would help the reader to have this alignment explored more extensively in the Discussion as well.

      Answer: We acknowledge that the discussion would benefit from additional comparisons to the available datasets. We have thus added the following comment after the first paragraph of the discussion: “Previous studies of the different sub-populations of SVZ progenitors were carried out using transcriptomic approaches based on the expression of various more or less specific markers. These approaches have made it possible to identify quiescent and activated neural stem cells as well as mature neuroblasts, but have been faced with the strong influence of the cell cycle on cell clustering. Indeed, cycling neural progenitors in these studies have been gathered in either “mitotic” clusters (Llorens et al. 2015, Zywitza et al. 2018, Cebrian et al. 2021) or “neural progenitor cells” clusters (Dulken et al. 2017) that had no clear biological significance, hindering the identification of subtypes of SVZ cycling progenitors. Our study, combining, for the first time, characterization of FACS-isolated cells and an irradiation-based model of sequential regeneration, allowed us to clearly distinguish the molecular profiles of TAP and iNB among cycling progenitors, reflecting differences in their respective in vitro and in vivo potentials”.

      R1C2: The data on multilineage differentiation, both in culture and upon engraftment, would be greatly strengthened by quantification. What is the relative yield of TUJ1/DCX-positive cells versus the other marker combinations? Specifically regarding the multilineage differentiation in vitro - because different media conditions are used to generate each lineage, it may be difficult to determine relative yield. Could a differentiation system that allows production of all 3 lineages be used instead?

      If the fraction of non-DCX/TUJ1-labeled progeny is low, particularly in vivo, this might suggest that while multilineage differentiation is possible, it is a much less likely cellular state outcome than production of mature neuroblasts. Some suggested references with examples of the culture conditions, experimental conditions, and discussions highlighted in the public review: Culture conditions that allow simultaneous trilineage differentiation. PMID: 17615304 Influence of culture conditions on potency: similar to issues covered in PMID: 21549325.

      Answer: We agree with the reviewer that quantification of multilineage differentiation in vitro would improve the characterization of the relative potencies of the different SVZ progenitors.

      According to PMID: 17615304 and PMID: 21549325, and in agreement with our own experience, the only culture condition that allows neurosphere-derived neural progenitors to differentiate in vitro into the three lineages is the removal of mitogens from the culture medium. However, this does not work on freshly isolated SVZ cells, which remain in an undifferentiated state in this condition.

      This is why we chose to use specific differentiation media for each of the 3 lineages as in Figure 1C. It is also for this reason that we performed as many experiments as possible in vivo rather than in vitro, as in Figure S2. In the new version, we have added a quantitative analysis of staining with antibodies against GFAP, CNPase or DCX in GFP-positive cells persisting at the IS, where a high number of grafted cells was found (Figure S2B). This was performed by using the NIS software to measure eGFP-, GFAP-, CNPase- and DCX-positive areas. The intersection between each marker area and the eGFP area was then expressed as a percentage of staining (Figure S2C). The results showed that approximately one third of GFP+ cells expressed GFAP or DCX. The quantitative analysis of CNPase expression was complicated by CNPase-positive host cells, but the stronger CNPase staining in eGFP-positive areas clearly revealed the expression of CNPase by a significant proportion of eGFP-positive cells.
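
      The area-intersection measure described above can be sketched as follows. This is not the NIS software procedure; it assumes binary masks of equal size and assumes that the percentage is taken relative to the eGFP-positive area, which is one possible reading of the description. The function name and the example masks are hypothetical.

```python
def percent_overlap(marker_mask, egfp_mask):
    """Percentage of eGFP-positive pixels that are also marker-positive."""
    egfp_pixels = 0
    double_positive = 0
    for marker_row, egfp_row in zip(marker_mask, egfp_mask):
        for marker_on, egfp_on in zip(marker_row, egfp_row):
            if egfp_on:
                egfp_pixels += 1
                if marker_on:
                    double_positive += 1
    if egfp_pixels == 0:
        raise ValueError("no eGFP-positive pixels in the mask")
    return 100.0 * double_positive / egfp_pixels


# Hypothetical 2 x 3 masks (1 = positive pixel, 0 = negative pixel).
egfp = [[1, 1, 0], [1, 0, 0]]
marker = [[1, 0, 1], [0, 0, 1]]
overlap = percent_overlap(marker, egfp)
```

      Marker-positive pixels outside the eGFP area, such as host cells, do not enter this percentage.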

      R1C3: Additionally, for claims similar to what is currently made in the text, it would be extremely valuable to confirm the purity of the sort for each population - for example by fixing and staining the sorted fraction with additional antibodies that confirm cell identity.

      Answer: We have previously shown in Daynac et al. 2013 that s-iNB expressed the neuroblast markers CD24 and DCX, but also markers of neural progenitors such as Mash1, a basic helix-loop-helix transcription factor. As suggested by the reviewer, we have further investigated the expression of other markers of neural progenitors by sorted cells. The results showed that the proportion of cells positive for DLX2, a marker of proliferating progenitors (Doetsch et al. 2002), was very high in aNSC/TAP (98%) and progressively decreased in iNB (82%) and mNB (25%). Similarly, the transcription factor SOX2, which plays an essential role in the maintenance of neural progenitors (PMID: 25126380), was expressed by 78% of aNSC/TAP, 70% of iNB and 17% of mNB.

      Altogether, these new data confirmed the identity of the different cell populations and particularly that of iNB. They are commented at the beginning of the Results and shown in Figure S1.

      R1C4: Line 125: GFAP alone doesn't necessarily indicate a "conversion to NSCs" - this conclusion could be greatly strengthened by inclusion of more markers, particularly at the protein level, or cyto-architectural studies.

      Answer: We agree with the reviewer that GFAP expression alone is not sufficient to evidence the presence of NSC in the SVZ. We have thus modified the text accordingly: “Importantly, eGFP+ cells were present in the SVZ of all the animals transplanted with eGFP+s-iNB and eGFP+s-NSC/TAP (Fig. 1Db, Fig. 1Dc), some of them expressing GFAP indicating the generation of astrocytes, and therefore possibly NSC”.

      R1C5: Could these cellular states be reflective of preferential translation of DCX? It would be very helpful to see the flow cytometry sort data for iNBs / mNBs used in Figure 6, particularly if these cells were also fixed and stained directly for DCX protein.

      Answer: As suggested by the reviewer, freshly FAC-sorted iNB and mNB were fixed and labelled with an anti-DCX monoclonal antibody after permeabilization. As shown in the figure below, we found a higher level of DCX expression in mNB than in iNB. This result therefore tends to indicate that the proliferation capacity is somehow related to the level of DCX expression. However, because of the relatively low importance of this result, we decided not to include it in the manuscript.

      Author response image 1.

      Modal histogram representation of DCX expression level in unstained, iNB and mNB cells determined by flow cytometry (FlowJo).

      R1C6: Figure S8 is all zeroes, showing the GFP+Dcxhigh NBs do not retain proliferative capacity. But we don't get a direct experimental comparison to EGFPnegative/lowDcxlow iNB engraftment, which would strengthen the conclusions of the paper.

      Answer: Unfortunately, there is no method available to analyse the eGFPnegative/lowDcxlow iNB engraftment: by definition, these cells do not express eGFP and the use of a tracker is not appropriate for long periods of time — and thus a high number of cell divisions — after engraftment. However, to us, this control is not needed to conclude that GFP+Dcxhigh iNB have no (or at least a lower) stem cell potential in vivo considering that we have shown in Figure 1 and Table 1 that the whole iNB population is able to generate the different types of neural cells.

      R1C7: Transplant data in Table 1 - a relatively small proportion of transplant derived cells are in OB, etc. Given that A cells are thought to cycle at least once in vivo, is this expected?

      Answer: The reviewer is right that a relatively small proportion of transplant-derived cells were found in the OB. However, we should consider that we used immunocompetent mice as recipients, which could have significantly reduced the engraftment efficiency and the migration of engrafted cells outside the injection site.

      R1C8: A caveat is that there is not much functional testing of the proposed model, especially for the interconversion of iNB states suggested by the diagram in Figure 7. The text is relatively restrained in proposing this model, so it is reasonable to keep - but perhaps should be noted that this part of the model will need additional testing.

      Answer: Data presented in Figure 6 clearly suggest that Dcxhigh iNB have an in vitro potential similar to that of Dcxlow iNB, whereas they do not have such potential in vivo (Figure S10). This suggests that, provided they are in appropriate conditions, Dcxhigh iNB could reacquire stem/progenitor properties. However, we agree that this hypothesis requires further investigation. Therefore, as suggested by the reviewer, we have added in the Figure 7 legend: “Possible interconversion of iNB states would require further experimental confirmation.”

      Additional minor points:

      R1C9: Introduction: the SVZ is described as "the lateral wall" - however, several works in the mouse have also examined the medial wall and callosal roof, as cited later in the intro. Suggest rephrasing the second sentence (line 48) and later sentence (line 66) to clarify that "the SVZ" encompasses all of these subregions, they are not necessarily separate niches.

      Answer: As indicated by the reviewer, the SVZ encompasses distinct subdomains, with NSCs having a regional identity based on their location in the lateral or septal wall of the ventricle and generating different types of neuronal and glial progeny (PMID: 34259628). To address the reviewer's concern about possible confusion and to clearly indicate that the SVZ encompasses several subdomains, we have modified the sentence at line 66 as follows: “Since then, the single cell RNA-sequencing has revolutionized the field and has made it possible to precisely elucidate the transcriptome of SVZ cells present in the LW and in the septal wall which also harbors NSC niches”.

      However, we did not modify line 48, since in this sentence we only indicate that the largest neurogenic niche in the adult brain resides in the LW of the SVZ.

      R1C10: Line 77: "exposure" not "exposition"

      Answer: The error has been corrected in the revised manuscript.

      R1C11: As noted in the Public Review - the use of the term "D1/D2" cells seems likely to confuse readers who are also versed in dentate gyrus neurogenesis. Recommend removing this term from the manuscript.

      Answer: We agree that the D1/D2 terminology could bring confusion, D cells referring to Tanycytes in the hypothalamus. We now refer to iNB1 for DcxLow iNB and iNB2 for DcxHigh iNB in the revised manuscript.

      Reviewer 2

      Major comments:

      Lack of rigor

      R2C1: There is a lack of appropriate normalization controls for the microarray data. As there is a decreased level of transcription in quiescent NSCs, there needs to be a cell number control (spike-ins based on cell numbers). Without this normalization, the readout can be greatly skewed.

      Answer: We agree that qNSC are marked by a decreased level of transcription due to quiescence. To overcome this problem in the Clariom assays, we chose to calibrate each population with a fixed amount of cRNA and cDNA, using HeLa cells as an internal control. We fully agree that this method is not optimal, but it appears to be efficient in the end. Indeed, it should be noted that it has been adopted, with the same rigor, in other microarray studies published in the field (PMID: 24811379) and also on skeletal muscle cells (PMID: 29273087). Moreover, the transcriptomic signature of qNSC matches closely those from other studies, and particularly those of related clusters in single cell experiments (including ours, Figure S5). This is probably because, more importantly than the number of cells, the main characteristic of these cells is the lack of expression of genes involved in cell proliferation and metabolism. In any case, these data, which confirm previously published results, are not the main finding of our manuscript, which is mainly dedicated to the characterization of proliferating cells, and this characterization is not impaired by our choice of normalization.

      R2C2: The absolute segregation of clusters in the single-cell analysis is currently entirely in agreement with the cell cycle stage. This suggests that in the author's analysis, the clustering in 3F is entirely shaped by the cell cycle, making that the defining characteristic of the author's definitions for their cell types. Has an analysis been done that regresses out cell cycle-associated genes to see if there are clusters for different cell states/types that are identified in the absence of cell cycle stage being the defining factor? (Barron and Li, 2016). For example, just as you would see a difference in cluster if you are a quiescent or activated NSC as compared to a neuroblast for example, even without the contribution of cell cycle. These are different cell types.

      Answer: We agree that cell cycle regression would theoretically allow for further discrimination between cycling cells along successive neurogenic stages. We have already performed regression using several methods, including S- and G2/M-score regression as indicated in the Seurat workflow, removal of cell cycle-related PCs from the UMAP calculation as used in the Cebrian-Sylla study, and the use of alternative gene sets such as the ones provided by the tricycle method (PMID: 35101061). These regression methods have all been applied to our datasets, to the original Cebrian-Sylla datasets, and to a combination of both, in order to increase cell number and clustering resolution. However, none of these methods modified the clustering of cycling cells.

      In fact, the strong influence of the cell cycle over clustering highlights the relevance of our depletion/replenishment approaches to decipher the molecular changes masked by the cell cycle, as discussed below.

      R2C3: The use of the DCX-CreERT2 line is a lineage tracing line. Once DCX is expressed, Cre recombines the DNA to allow for fluorescence. It is binary, on or off associated with DCX expression. And once on, it is always on, whether the cell is currently expressing DCX or not. As the authors had previously described a DCXlow condition, the eGFP- cells would not reflect DCXlow, but no DCX at all. And the eGFP+ cells may not be currently expressing DCX anymore. The authors should have used a system where the DCX promoter itself drives fluorescence.

      Answer: We took advantage of the DCX-CreERT2 line to demonstrate that some neural cells that have recently acquired DCX expression (i.e. eGFP+ iNB) could keep (or recover) the potential of neural progenitors in vitro. Of course, some of these GFP+ cells could have stopped expressing DCX. This is probably the case when they differentiate into astrocytes and oligodendrocytes in vitro as shown in Figure 6.

      In any case, using the Dcx promoter as a direct driver of eGFP fluorescence would have prevented us from demonstrating such changes in cell fate in vivo, since oligodendrocytes or astrocytes derived from iNB could not be tracked once they lose Dcx expression.

      R2C4: The lack of analysis of images (differentiation, for example) limits the conclusions of the in-vitro data, and the images with unclear staining, limit the conclusions of the in-vivo experiments.

      Answer: This comment is similar to that of R1C2. We have now added a quantification in Figure S2.

      R2C5: The cited difference in splicing differences in cell types was interesting (though did not show up in the transcriptome enrichment analyses Fig S2) and would be something to further pursue, however, this was a very limited analysis. There was no further study of these splicing mediators beyond single-cell data.

      Answer: We now show enrichments of GO terms corresponding to mRNA splicing isoforms in the different types of sorted SVZ cells (Figure S4). This analysis clearly revealed that spliced genes in SVZ cells are mainly involved in neuron development and neurogenesis. Interestingly this also showed that qNSC logically differed from the other cell types by splicing concerning genes involved in mitosis and cell cycle, consistently with their quiescent state. More importantly, GO annotations of differentially spliced isoforms further confirmed that s-TAP and s-iNB have distinct features. We agree with the reviewer that further analysis of splicing mediators would be very important for understanding molecular changes involved in neurogenesis. However, we think that it is largely beyond the scope of this study.

      R2C6: Fig 1C - Show values, not just pictures. You may need to shift your current differentiation paradigm to do so by removing growth factors instead of unique differentiation conditions.

      Answer: See the answer to R1C2.

      R2C7: Fig S1A - Stainings for GFAP and DCX are not clear. It is very hard to distinguish which cells are associated with these signals.

      Answer: This figure (now Figure S2A) shows an eGFP+iNB cell (white arrow) that has reached the rostral migratory stream and expressed DCX (inset a3), but not GFAP (inset a2). This is now indicated in the figure legend. We have also moved the arrow for more clarity.

      R2C8: Fig S1B2 - There is red staining everywhere, so it is very hard to see a specific CNPase signal.

      Answer: We have added a new figure (Fig S2B) distinguishing eGFP+CNPase+ cells (yellow arrows) from eGFP+CNPase- cells (white arrow).

      R2C9: Line 174 - It's the mRNA that you are detecting is being downregulated - be more specific as you are not showing protein downregulation.

      Answer: We have added "encoding" to the text at line 174 to make clear that we refer to the mRNA: “Interestingly, Ptbp1, encoding a major splicing repressor”.

      R2C10: Line 189 - text in this line have some clusters not shown in the figure - (clusters 6 and 15, DCX+ Ki67+ neuroblasts) - which would be an important thing to visualize. As is shown now, the authors are only showing that iNBs are similar to mitotic TAPs.

      Answer: Clusters 6 and 15 have been added to Figure S5.

      R2C11: Fig 3D-E - Why is cluster 17 called aNSCs (3E) when it has the highest GFAP (Fig 3D). Typically, the highest GFAP cells are qNSCs or astrocytes, not aNSCs.

      Answer: We previously reported that the level of gfap mRNA expression in neural stem cells (quiescent and activated) did not exactly reflect the amount of protein in these cells. This is the reason why we also used the Slc1a3 marker (Glast), which is highly expressed both at the RNA and protein levels in quiescent NSCs (Daynac et al. 2013).

      R2C12: Line 216 - You said in line 216 cluster 13 were astrocytes, then you said in line 227 that cluster 13 was s-qNSC. Which is it?

      Answer: This is due to the fact that we performed two distinct analyses.

      In the first one (line 216), cells were scored based on datasets provided by Cebrian et al. with one dataset containing genes enriched in astrocytes, and another one, genes enriched in quiescent B-cells. Therefore, cluster 13 was shown to contain 73% cells expressing astrocyte markers, whereas cluster 4 gathered cells expressing both qNSC (B-cells, 48%) and astrocyte (52%) genes.

      In the second one (line 227), cells were scored using our transcriptomic signatures of FAC-sorted SVZ cells, which do not include differentiated astrocytes. We demonstrated that the cluster 13 cells only expressed s-qNSC genes.

      R2C13: Line 214 - While other clusters were all named in lines 214-221 that were then further discussed in lines 227-230, clusters 15 and 19 were not. You associate both of those clusters with s-iNB - what was it associated with in the above section?

      Answer: Lines 219-221 have been reworded as follows: Clusters 10, 5, 15, 12, and 8 were defined as cycling progenitors based on the expression of proliferative markers such as Top2a, Mki67, Ascl1. Clusters 1, 3, 7 and 9 were identified as mNB due to the loss of Mki67, Top2a and Ascl1 expression and the expression of Robo2 and Dcx. Cluster 19, which has lost Ascl1 but still expresses Top2a and Mki67 together with Robo2 and Dcx, appears at the transition between iNB and mNB.

      R2C14: Fig 3I-J - 5 days after irradiation, I would like to see from tissue slices how many cells are dividing compared to 1day post-irradiation and controls. In other paradigms, such as temozolomide experiments (Kalamakis et al), by 5 days we should see less cells in quiescence and more of those quiescent cells exiting quiescence into the cell cycle. Why would there be more cells in quiescence in the irradiated brain? Even if they are radiation resistant, the base number should be comparative between controls and irradiated, which is not what you show in Fig 3I-J.

      Line 234-235 - the text says normalized to numbers of qNSCs which is supposed to be the same (which I agree should be the same). However, your graph in 3I and J shows more qNSCs in irradiated conditions, which would influence greatly and is currently hard to interpret.

      Answer: As stated by the reviewer, there is no increase in the absolute number of quiescent cells in the irradiated SVZ. The reconstitution of SVZ cell populations after 4Gy irradiation has already been studied by our group (Daynac et al. 2013, see Fig. 3F), showing that s-iNB and s-mNB are still under-represented after 5 days, while qNSC are present in numbers similar to those in the unirradiated SVZ. This leads to an over-representation of quiescent cells and early SVZ progenitors in Figure 3J compared with Figure 3I.

      R2C15: Fig 6A - the authors show a significant difference in neurospheres between eGFP- (DCX-) and eGFP+ (DCX+) iNBs - as would be expected as DCX suggests a further commitment towards neurogenic fates, yet your population doubling is the same.

      Answer: To determine the population doublings, the medium was changed and cells were counted every 7 days. This interval masked the differences between two cell populations that reach the plateau phase at different times, which explains why eGFP-iNB and eGFP+iNB could not be clearly distinguished by this technique.

      R2C16: Fig 6C - Differentiation data (in-vitro) should be quantified in 6C, just as was mentioned for 1C. These values should be done for both of the populations (eGFP-iNB, and eGFP+iNB) and not just compared to the previous pictures which were on total iNB. Again, numbers are required, not just picture examples.

      Answer: Quantitative data are now given in Figure 6D, showing that approximately 60-80% of eGFP+iNB cells are able to differentiate into neurons, oligodendrocytes or astrocytes. We did not analyze the differentiation of eGFP-iNB since it would not add any supplementary information.

      R2C17: Fig S8 - The authors did not show if the lack of engraftment of eGFP+ cells is due to the transplant (previously you showed only 2/3 worked in a similar paradigm). It would be helpful if the authors would have some means to visualize the DCX low cells to confirm they worked as before in the transplantation (another color? Another type of mouse (Thy1 antigen differences)?)

      Answer: Unfortunately, the Thy1 antigen has not been documented in mouse subventricular zone progenitors, but only in neurons (PMID: 10813783). Thy1 antigen has also been described in bipotent glial progenitor cells (GCP) from the developing human brain giving rise to oligodendrocytes (PMID: 36931245).

      As shown in Figure S10, we performed 5 grafts with s-iNB eGFP+ cells, 2 alone and 3 mixed with eGFP- cells, and never found any eGFP+ cells 5 weeks after grafting. Moreover, we did not find any eGFP+ cells in the brains of 3 other animals 2 weeks after grafting with s-iNB eGFP+ cells (these data have been added to Figure S10). Compared with the results described in Figure 1, this clearly shows that iNB DCXhigh, like mNB, are not able to generate persistent cells in the grafted brains.

      R2C18: Fig S8 - Why were there no eGFP cells even at the injection site? DCX expression promotes migration, indeed DCX expression becomes very high in cells in the SVZ as they begin to exit to go to the migratory stream. If one didn't see migration, one would expect you would still have survival. Currently, the authors show no cells at 5 weeks, however, they would need to show earlier timepoints as well to determine what is happening with these cells. It is possible these GFP+ cells are not even expressing DCX anymore (see above).

      Answer: As stated above, we did not find any GFP+ cells in the brains of 3 other animals 2 weeks after grafting with s-iNB eGFP+ cells (see Figure S10).

      R2C19: Line 320 - the authors suggest a subpopulation of NEURONS continues to divide and cite 2 works from the 1990s showing proliferating SVZ cells can differentiate. Our knowledge of this system has come dramatically forward since the 1990s as well as technologically, and to date, neurons have not been shown to divide.

      Answer: We apologize for this lack of clarity. We agree that neurons are differentiated, non-cycling cells; we had simply retained the terminology used in these articles. The incorrect part of the sentence at Line 320 has been deleted from the text.

      R2C20: Fig 7 - The whole figure is based on changing levels of RSR genes which were not confirmed in any way to be involved in any of these stages, only descriptively in single-cell analyses.

      Answer: As stated above, in our opinion, further characterization of the involvement of RSR genes in neurogenesis is largely beyond the scope of our manuscript. Nevertheless, we think that the role of RSR genes in neurogenesis is an important question that should be addressed in further studies.

      Overstatement of findings

      R2C21: Fig 1 - Authors did not compare all cell types in each condition but made overstatements about their relationships to each other between graphs. There should also be separate graphs showing all cell types at 4% and a separate one at 20%.

      Answer: In the revised version, Figure 1 shows a graph comparing all cell types at 4%O2 and a separate one at 20%, as requested by the reviewer. The graphs clearly show that 4%O2 promotes iNB proliferation compared to the 20% condition.

      R2C22: Fig 1D-b2 - Why does DCX look nuclear? One can't say they are only NSCs if they are GFAP as astrocytes also express GFAP. The authors would need another marker to separate those populations. In the text, the authors say expressing GFAP (line 124) which means NSC, but then in line 127 expressing GFAP means astrocytes - which further shows you need additional markers to validate those 2 different cell types.

      Answer: DCX nuclear translocation has been shown to improve cellular proliferation (PMID:32050972).

      As indicated in R1C4, the text has been modified as follows: “Importantly, eGFP+ cells were present in the SVZ of all the animals transplanted with s-iNB eGFP+ and s-NSC/TAP eGFP+ (Fig. 1Db, 1Dc), some of them expressing GFAP indicating the generation of astrocytes, and therefore possibly NSC”.

      R2C23: Fig S2 - The transcriptome signature for s-iNBs is very similar to s-TAP, basically suggesting the iNBs are further along in cell cycle.

      Answer: This is now the Figure S3. Functional enrichment analysis of individual transcriptome signatures revealed that both s-TAP and s-iNB are enriched in genes related to the cell cycle although with different GO terms enrichments. Indeed, s-TAP are enriched in genes related to G1, G1/S and S phase (but with low -log10 adjusted p-values) and s-iNB with genes related to cell cycle mitosis and M phase (with high -log10 adjusted p-values).

      We have previously shown that around 33% of s-iNB have a DNA content >2N, versus around 26% of s-TAP and s-aNSC (Daynac et al. 2013), which is in accordance with the GO term enrichments. However, these data have also shown that most s-iNB and s-TAP are in G1, indicating that s-iNB are not just further along mitosis than TAP.

      Moreover, our transcriptomic data clearly show that s-iNB are distinct from s-TAP: 1) according to principal component analyses (Figure 2B and C), the whole transcriptome of s-TAP is closer to that of s-aNSCs than to that of s-iNB (10% variations in PCA2), 2) the heatmap in Figure 2D shows that they have different RSR gene expression profiles, 3) the new Figure S4 shows that GO annotations of differentially spliced isoforms further confirm that s-TAP and s-iNB have distinct features, and 4) Figure S5 shows that s-iNB express genes associated with either TAP or NB that have been described in previous studies, whereas s-TAP do not express genes associated with NB and look closer to aNSC. Finally, the scRNAseq cell clusters related to s-iNB are distinct from the cluster related to s-TAP, as shown 1) in Figure 3D and 2) in Figure 4.

      R2C24: Fig 3 - The lack of information about timepoint 0 after irradiation, and when proliferation and cell cycle entry begins again following irradiation, limits our interpretation of the single-cell irradiated data.

      Answer: We have previously reported the relative abundance of each SVZ neural progenitor population in the young adult mouse brain in several papers. In particular, we based our interpretation on our SVZ irradiation model reported in Daynac et al. 2013, which demonstrated the radioresistance of qNSC: they re-enter the cell cycle as early as 2 days after 4Gy irradiation and successively regenerate aNSC, TAP, then iNB and mNB.

      R2C25: Fig S3 - These results effectively show that the s-aNSCs and s-TAPs are actually less specific when compared to that same identity in other studies, and that the iNBs are most similar to mitotic TAPs. This supports what was mentioned above, which is that the transcriptional signatures are very similar between the s-TAPs and i-NBs, showing these are not a unique cell state, but just a bit further along mitosis within the TAP cell state.

      Answer: This is now Figure S5. In this figure, we show that s-iNB express genes associated with either TAP or NB that have been described in previous studies, whereas s-TAP do not express genes associated with NB and look closer to aNSC. As indicated above in R2C23, s-iNB are not just a bit further along mitosis within the TAP cell state. Indeed, we provide several lines of evidence showing that s-iNB and s-TAP have different transcriptomic profiles.

      R2C26: Fig 4B - The focus on Ptbp1 as being associated with the iNB cluster border to mNB is expected as all previous studies of Ptbp1 have focused on its role in the progression of other cell types through the cell cycle, its control of cell cycle regulators, and a cell cycle mRNA regulon (Monzon-Casanova et al, 2018, 2019, 2020). This further supports these analyses are specifically defined by cell cycle stages.

      Answer: We fully agree that Ptbp1 expression distinguishes cycling cells from postmitotic neuroblasts, in accordance with previously published papers, and that this single gene does not reveal any differences between cycling cells, i.e. aNSC, TAP and iNB. However, as shown in the manuscript and stated above (R2C23 and R2C25), these cells can be distinguished by their respective expression of many other genes, including other RSR genes.

      R2C27: Line 281-282 is an overstatement - the authors suggest that this is a new type of cycling neural progenitor - when all studies point to it being the end of mitosis TAPs as they go on their way to mNBs. This clearly shows a trajectory and not a defined, binary cell type.

      Answer: We agree with this statement that the use of the word "type" was misleading, and changed it to "stage" to better reflect that s-iNB are a distinct stage along the differentiation process according to our pseudotime cell-trajectory analysis.

      Author response image 2.

      Pseudotime analysis using Monocle 3 (excluding the cluster 13 corresponding to astrocytes and starting from s-qNSC) revealed two branches starting from s-TAP, one towards cell cycle the other towards neuronal differentiation.

      minor comments:

      R2C28: Fig 3D - For ease, please define what you called the clusters in 3D - not just cluster numbers

      Answer: We chose not to call the clusters in 3D because their identification (Group names) is based on data presented after in Figures 3E, F and G.

      R2C29: Fig 3E-F - Show astrocytes by text in 3E and F

      Answer: As discussed above, astrocytes cannot be shown in these figures because the figures are based on our signatures, which do not include an astrocyte signature.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study investigated the factors related to understudied genes in biomedical research. It showed that understudied genes are largely abandoned at the writing stage, and it identified a number of biological and experimental factors that influence which genes are selected for investigation. The study is a valuable contribution to this branch of meta-research, and while the evidence in support of the findings is solid, the interpretation and presentation of the results (especially the figures) needs to be improved.

      We thank the editor and reviewers for their detailed and thoughtful assessment of our work. Below, we present detailed responses to reviewers’ comments and suggestions. We are also submitting a version edited for clarity of presentation and precision of interpretation.

      Following the eLife assessment, we also tried to identify further statements where results could be presented in a more precise way.

      First, in the section Subsequent reception by other scientists does not penalize studies on understudied genes, we now state “This result again opposes the hypothesis that less-investigated genes will yield articles with lower impact.”

      Second, in section Identification of biological and experimental factors associated with selection of highlighted genes, we now state:

      “We cautiously hypothesize that this might reflect on many different research groups producing reagents surrounding the genes that they actively study. The most informative continuous factor is the number of research articles about a gene (Figure 1B).”, removing claims of causality.

      Finally, for improved readability, we have moved all supplemental tables into separate .xlsx files.

      Reviewer #1 (Public Review):

      Summary and strengths

      The authors tried to address why only a subset of genes are highlighted in many publications. Is it because these highlighted genes are more important than others? Or is it because there are non-genetic reasons? This is a critical question because in the effort to discover new genes for drug targets and clinical benefit, we need to expand a pool of genes for deep analyses. So I appreciate the authors' efforts in this study, as it is timely and important. They also provided a framework called FMUG (short for Find My Understudied Gene) to evaluate genes for a number of features for subsequent analyses.

      We thank the reviewer for their insightful comments and are pleased that the reviewer shares our appreciation for the gravity of these questions. As the reviewer emphasizes, it is critical to understand whether the choice of genes reflects their importance or non-genetic reasons. Previously we and others demonstrated that this choice does not reflect biological importance, when the latter is assessed through unbiased genome-wide data (e.g.: Haynes et al., 2018; Stoeger et al. 2018). Now we contribute to this critical question by systematically evaluating individual non-genetic reasons. We address the reviewer’s comments below.

      Weaknesses

      Many of the figures are hard to comprehend, and the figure legends do not sufficiently explain them.

      For example, what was plotted in Fig 1b? The number of articles increased from results -> write-ups -> follow-ups in all four categories with different degrees. But it does not seem to match what the authors meant to deliver.

      We apologize for the lack of clarity. We identified two interrelated elements that we have now fixed: i) the prior figure legend provided for each genomics approach n number of articles, such as “GWAS (n=450 articles)”; ii) the prior y-axis was labelled “Number of articles”.

      Addressing the first element, we now rephrased the legend for clarity:

      “b, We identified articles reporting on genome-wide CRISPR screens (CRISPR, 15 focus articles and 18 citing articles), transcriptomics (T-omics, 148 focus articles and 1,678 citing articles), affinity purification–mass spectrometry (AP-MS, 296 focus articles and 1,320 citing articles), and GWAS (450 focus articles and 3,524 citing articles). Focusing only on protein-coding genes (white box plot), we retrieved data uploaded to repositories describing which genes came up as “hits” in each experiment (first colored box plot). We then retrieved the hits mentioned in the titles and abstracts of those articles (second colored box plot) and hits mentioned in the titles and abstracts of articles citing those articles (third colored box plot). Unique hit genes are only counted once.”

      The number of genes in each box plot is now reported in the x-axis labels for each step. For example, the results for CRISPR were obtained from 15 focus studies (original research) and 18 subsequent studies (papers citing focus articles). Those 15 studies identified 9,268 genes where loss-of-function changed phenotypes but, in their titles and abstracts, mentioned only 18 of those 9,268 genes. While the 9,268 hit genes have received similar research attention to the entirety of protein-coding genes, the 18 hit genes mentioned in the title or abstract are significantly more well studied. The articles citing the focus articles also only mentioned in their titles and abstracts 19 highly studied hit genes.

      Addressing the second element, we updated the axis label to “Number of articles about gene” to distinguish it from the number of articles mentioned in the legend and to convey that this is the number of articles about each gene that were published independently of the genomics assays we inspect. To further underscore this point, we now label the “20% highest-studied genes” that we mention in the main text, and we reworded the figure caption to better capture where the critical increase occurs: “A shift in focus towards well-studied genes occurs during the summarization and write-up of results and remains in subsequent studies.”

      Fig 4 is also confusing. It appears that the genes were clustered by many features that the authors developed. But does it have any relationship with genes being under- or over-studied?

      We again apologize for the lack of clarity. As described in the main text, while the results of Figs. 1-2 suggest that gene popularity may predict the highlighting of a differentially expressed gene in the title or abstract, we wanted to conduct a systematic analysis of the factors that correlate with such a decision. We thus built a set of 45 factors that have been discussed as explanations for why some genes receive increased research attention.

      The data in Fig. 4 show that those 45 factors are not independent and that some are highly correlated. Because of those correlations, we are able to select a smaller number as representative of the full set. Those are the default factors shown to users of FMUG. While users can choose all factors that are significantly correlated with highlighting in the title or abstract, presenting by default only factors that represent different clusters enabled us to limit the number of factors that are initially displayed.

      Please note that following the suggestion of Reviewer 3, we have now moved this Figure to the supplemental material, as Figure S11.

      Reviewer #2 (Public Review)

      Summary and strengths

      In this manuscript the authors analyse the trajectory of understudied genes (UGs) from experiment to publication and study the reasons for why UGs remain underrepresented in the scientific literature. They show that UGs are not underrepresented in experimental datasets, but in the titles and abstracts of the manuscripts reporting experimental data as well as subsequent studies referring to those large-scale studies. They also develop an app that allows researchers to find UGs and their annotation state. Overall, this is a timely article that makes an important contribution to the field. It could help to boost the future investigation of understudied genes, a fundamental challenge in the life sciences. It is concise and overall well-written, and I very much enjoyed reading it. However, there are a few points that I think the authors should address.

      We thank the reviewer for their kind assessment.

      Weaknesses

      The authors conclude that many UGs "are lost" from genome-wide assay at the manuscript writing stage. If I understand correctly, this is based on gene names not being reported in the title or abstract of these manuscripts. However, for genome-wide experiments, it would be quite difficult for authors to mention large numbers of understudied genes in the abstract. In contrast, one might highlight the expected behaviour of a well-studied protein simply to highlight that the genome-wide study provides credible results.

      We agree that it is not reasonable to expect a title or abstract to highlight hundreds or even thousands of differentially expressed genes. We’ve now extended our Study Limitations section to address this:

      “we take a gene being mentioned in the title or abstract of an article as a proxy for a gene receiving attention by the article’s authors. The title and abstract are space-limited and thus cannot accommodate discussion of large numbers of genes.”

      We also agree that highlighting the expected behavior of a well-studied protein may provide credibility to a study and increase confidence on other results. The soundness of such a strategy was quantitatively studied in a study by Uzzi et al. (Science 2013), which we now include in the section on study limitations as:

      “authors beginning manuscripts with something familiar before introducing something new”.

      To convey the practical limitation of abstracts needing to be concise, we added the following sentence to our discussion section, when suggesting controlled trials that add genes to abstracts:

      “This intervention would need to be carefully designed since abstracts are limited in their size.”

      To avoid over-interpretation we have in the discussion also extended the sentence on “lost in a leaky pipeline” to “lost to titles and abstracts of research articles in a leaky pipeline”.

      Our focus on titles and abstracts has been equally motivated by their availability (full text still is often behind paywalls and/or not accessible for bulk-download and text-mining) and by abstracts being the most visible and most read parts of research articles (e.g.: bioRxiv estimates that for the preprint for the present manuscript, the abstract was read ~10 times more frequently than full-text HTML and 4 times more frequently than the pdf).

      Could this bias the authors' conclusions and, if so, how could this be addressed? For example, would it be worth to normalise studies based on the total number of genes they cover?

      We previously described that – in line with the reviewer’s expectations – unstudied genes are preferentially added to the title or abstract of articles that feature more genes in the title or abstract (Stoeger et al., PLOS Biology, 2022; Fig. 2B). Normalizing by the total number of genes should thus preserve the pronounced division between well-studied genes and unstudied genes shown in Figure 1B. In line with this prediction, we randomly selected one gene per title/abstract and found that the effect remains (see new Figure S7).

      Author response image 1.
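
      The one-gene-per-title/abstract check described above can be sketched as follows. This is a minimal illustration only: the input format, the function name `sample_one_gene_per_article`, the gene labels and the article counts are all hypothetical and are not taken from the study's code or data.

```python
import random
from statistics import median


def sample_one_gene_per_article(article_genes, seed=0):
    """Pick one highlighted gene at random from each article.

    article_genes: dict mapping an article ID to the set of genes named
    in its title/abstract (hypothetical input format).
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {
        article: rng.choice(sorted(genes))  # sort first: sets have no stable order
        for article, genes in article_genes.items()
        if genes  # skip articles that highlight no gene
    }


# Hypothetical toy data, not taken from the study.
article_genes = {"a1": {"TP53", "GENE_X"}, "a2": {"EGFR"}, "a3": set()}
articles_per_gene = {"TP53": 9000, "EGFR": 7000, "GENE_X": 3}

sampled = sample_one_gene_per_article(article_genes)
# Median prior literature of the sampled genes, one gene per article.
print(median(articles_per_gene[g] for g in sampled.values()))
```

      Because every article contributes exactly one gene, articles that list many genes in their abstract cannot dominate the resulting distribution.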

      Figure 1B is confusing in its present form. I think the plot and/or the legend need revising. For example, what "numbers to the right of each box plot" are the authors referring to? Also, I assume that the filled boxes are understudied genes and the empty/white box is "all genes", but that's not explained in the legend. In the main text, the figure is referred to with the sentence "we found that hit genes that are highlighted in the title or abstract are strongly over-represented among the 20% highest-studied genes in all biomedical literature ". I cannot follow how the figure shows this. My interpretation is that the y-axis is not showing the number of articles, but represents the percentage of articles mentioning a gene in the title/abstract, displayed on a log scale. If so, perhaps a better axis labels and legend text could be sufficient. But then one would also need to somehow connect this to the statement in the main text about the 20% highest-studied genes (a dashed line?). Alternatively, the authors could consider other ways of plotting these data, e.g. simply plotting the "% of publication in which a gene appears" from 0-100% or so.

      Reviewer 1 raised a similar point on overall figure clarity. We identified two interrelated elements that contribute to overall confusion and have now fixed them (see response to Reviewer 1 beginning on page 2 of this document).

      We attempted an alternative plotting of Fig 1B according to the reviewer’s suggestion. In the version below, the y-axis instead shows the percent of gene-related articles that are about each gene. We chose to keep the original y-axis (showing number of articles about each gene) as it additionally conveys the absolute scale of scholarship on individual genes.

      Author response image 2.

      Reviewer #3 (Public Review):

      Summary and strengths

      The manuscript investigated the factors related to understudied genes in biomedical research. It showed that understudied genes are largely abandoned at the writing stage and identified biological and experimental factors associated with selection of highlighted genes.

      It is very important for the research community to recognize the systematic bias in research of human genes and take precautions when designing experiments and interpreting results. The authors have tried to profile this issue comprehensively and promoted more awareness and investigation of understudied genes.

      We thank the reviewer for their kind assessment of our work.

      Weaknesses

      Regarding result section 1 "Understudied genes are abandoned at synthesis/writing stage", the figures are not clear and do not convey the messages written in the main text. For example, in Figure 1B, figure S5 and S6,

      • There is no "numbers to the right of each box plot".

      The “numbers to the right” statement in the caption was an erroneous inclusion from an earlier version of the figure. We apologize for our error and have now removed this statement.

      • Do these box plots only show understudied genes? How many genes are there in each box plot? The definition and numbers of understudied genes are not clear.

      The x-axis describes genes featured in each stage of the publication process (from all protein-coding genes to genes found as hits in genome-wide screen to genes found in the title/abstract to genes found in the title/abstract of citing articles) and the y-axis describes the number of articles annotated to those genes. We have also now added the number of genes in each box plot to the figure. This information is also in Materials and Methods under each technology’s heading (see also response to Reviewer 1 beginning on page 2 of this document).

      Author response image 3.

      • "We found that hit genes that are highlighted in the title or abstract are strongly over-represented among the 20% highest-studied genes in all biomedical literature (Figure 1B)". This is not clear from the figure.

      We have revised Figure 1B and its caption to better communicate the main point of the figure: that genes which make it to the title/abstract of the reporting article tend to be more popular than genes which are hits in genome-wide experiments from those articles. We have added a horizontal line that shows the cutoff for the top 20% most popular genes.
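
      As an illustration of what such a cutoff line represents, the sketch below computes a top-20% threshold from per-gene article counts and the share of highlighted genes that reach it. The helper names and the counts are hypothetical, and the rank-based cutoff used here is an assumption rather than the exact procedure behind the figure.

```python
def top_fraction_cutoff(article_counts, fraction=0.20):
    """Return the smallest article count that still falls in the top `fraction` of genes."""
    if not article_counts:
        raise ValueError("article_counts must not be empty")
    ranked = sorted(article_counts, reverse=True)
    n_top = max(1, int(len(ranked) * fraction))  # number of genes in the top group
    return ranked[n_top - 1]


def share_in_top(highlighted_counts, cutoff):
    """Fraction of highlighted genes whose article count reaches the cutoff."""
    return sum(c >= cutoff for c in highlighted_counts) / len(highlighted_counts)


# Hypothetical article counts for ten genes (not data from the study).
all_genes = [1, 2, 3, 5, 8, 13, 40, 90, 400, 2500]
cutoff = top_fraction_cutoff(all_genes)

# Hypothetical counts for the genes highlighted in titles/abstracts.
highlighted = [400, 2500, 90, 13]
print(cutoff, share_in_top(highlighted, cutoff))
```

      A share well above 0.20 among highlighted genes would indicate over-representation of highly studied genes relative to all genes.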

      Regarding result section 2 "Subsequent reception by other scientists does not penalize studies on understudied genes", the authors showed in figure 2 that there is a negative correlation between articles per gene before 2015 and median citations to articles published in 2015. Another explanation could be that for popular genes, there are more low-quality articles that didn't get citations, not necessarily that less popular genes attract more citations.

      We believe that the two explanations for the observed phenomenon are not mutually exclusive. Previously, we focused on the median of citations to articles about a gene to capture the typical effect. In a new analysis, we also find support for the possibility outlined by the reviewer and believe that adding this to our manuscript complements and balances our analysis of citations. Specifically, in the new Figure S8B we find that articles about the most popular genes are slightly more likely to be among the least cited papers (and in Figure S8A that articles about the least studied genes have been much more likely to be among the most cited papers). In-text, we state:

      “Further, since 1990, articles about the least popular genes have at times been 3 to 4 times more likely to be among the most cited articles than articles on the most popular genes whereas articles on the most popular genes have been slightly less likely to be highly cited than lowly cited (Figure S8)”.

      We thank the reviewer for their suggestion, which strengthens our manuscript. The figure caption reads:

      “Figure S8: Likelihoods of being highly cited (top 5% of citations among all articles about genes, panel a) or lowly cited (bottom 5% of citations among all articles about genes, panel b) for articles about the most popular genes (top 5% accumulated articles) versus articles about the least popular genes (bottom 5% accumulated articles) by year of publication. Only articles with a single gene in the title/abstract are considered. Shaded regions show ±1 standard error of the proportion."

      Author response image 4.
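
      The shaded ±1 standard error bands mentioned in the Figure S8 caption can be illustrated with the usual binomial formula, SE = sqrt(p(1 - p)/n). This is a sketch under that assumption; the response does not state which estimator was used, and the counts below are hypothetical.

```python
import math


def proportion_with_se(successes, total):
    """Return a proportion and its binomial standard error sqrt(p * (1 - p) / n)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    se = math.sqrt(p * (1 - p) / total)
    return p, se


# Hypothetical example: 12 of 150 articles about little-studied genes
# published in one year fall in the top 5% of citations.
p, se = proportion_with_se(12, 150)
print(f"proportion = {p:.3f}, band = [{p - se:.3f}, {p + se:.3f}]")
```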

      Regarding result section 3 "Identification of biological and experimental factors associated with selection of highlighted genes", in Figure 3 and table s2, the author stated that "hits with a compound known to affect gene activity are 5.114 times as likely to be mentioned in the title/abstract in an article using transcriptomics", The number 5.144 comes out of nowhere both in the figure and the table. In addition, figure 4 is not informative enough to be included as a main figure.

      This is the result of both a typo and imprecise terminology. The number should read 4.262 (the likelihood ratio of being mentioned in the title/abstract between genes with and without a compound), which corresponds to an odds ratio of 4.331. We have clarified this in the table caption, stating:

      “e.g. hits with a compound known to affect gene activity are 4.262 times as likely to be mentioned in the title/abstract in an article using transcriptomics, corresponding to an odds ratio of 4.331".
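      The difference between a likelihood ratio (a ratio of proportions) and an odds ratio can be made concrete with a short sketch. The 2x2 counts below are hypothetical and chosen only for illustration; they are not the study's data, and the function names are ours. When the outcome is rare, as with title/abstract mentions, the two ratios are close but not identical, which is the pattern seen in the corrected values above.

```python
def likelihood_ratio(a, b, c, d):
    """Ratio of the proportion mentioned with vs. without a compound.

    a: genes with a compound, mentioned in title/abstract
    b: genes with a compound, not mentioned
    c: genes without a compound, mentioned
    d: genes without a compound, not mentioned
    """
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Ratio of the odds of being mentioned with vs. without a compound."""
    return (a / b) / (c / d)


# Hypothetical counts (assumption, for illustration only).
a, b, c, d = 20, 980, 5, 995

lr = likelihood_ratio(a, b, c, d)  # (20/1000) / (5/1000)
oratio = odds_ratio(a, b, c, d)    # (20/980) / (5/995)
print(f"likelihood ratio = {lr:.3f}, odds ratio = {oratio:.3f}")
```

      Reporting both values, as in the revised caption, removes the ambiguity about which of the two quantities "times as likely" refers to.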

      We have removed Figure 4 as a main-text figure and added a version, with a revised color scheme following the comments of Reviewer 1, as Figure S11. We added to the figure caption “Bold indicates FMUG’s default factors, which we selected based on this clustering and based on their strength of association with gene selection (Figure 3, Table S2 and Table S3)."

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      • Fig 2a shows that papers highlighting understudied genes are actually cited more. I wonder why authors only looked at data before 2015. Fig 2b shows an increased correlation since 2015. Please consider redrawing Fig 2a to include data from 2015-2020?

      We highlight data from 2015 because, in the version of iCite we used (v32, released July 2022, covering citations made through most of 2021), papers published in 2015 have had about 6 years to accumulate citations. With fewer years to accumulate citations, insufficient signal may cause the correlation to converge toward zero. Below, we repeat the analysis in Figure 2 but only consider citations made within a year of an article’s publication, which substantially reduces the correlation (although it remains significant).

      Author response image 5.

      We added a note to the figure caption:

      “We forgo depicting more recent years than 2015 to allow for citations to accumulate over multiple years, providing a more sensitive and robust readout of long-term impact.”

      For Figure 2B, we add:

      “For more recent years, where articles have had less time to accumulate citations, insufficient signal may cause correlation to converge toward zero.”

      • Can FMUG be posted on the web for easy access by researchers with non-computational backgrounds?

      Regrettably, we do not currently have the resources to create or maintain a web-based version. We hope that the publication of this manuscript will enable us to attract resources to create and maintain one.

      Reviewer #2 (Recommendations for the authors):

      • Related to the first weakness in my public review: The observed disparity between CRISPR and GWAS studies in terms of which genes they promote to the abstract is interesting. I wonder if this has to do with the application of these techniques. GWAS studies will often highlight that they retrieve known associations between a gene and a phenotype, to show that a screen is working. I guess often the point is to subsequently identify more genes associated with a particular phenotype, but often it is unclear how to validate/verify newly found associations. In contrast, CRISPR screens might be more focussed on functionally/mechanistically understanding unknown processes, e.g. observing a phenotype that appears/disappears in response to a gene deletion. In such studies, the follow-up of a previously unknown gene could be more straightforward and relevant to the outcome. Does that mean CRISPR screens are better than GWAS studies for addressing the UG problem? Perhaps the authors could briefly discuss this issue.

      The number of studies we included featuring CRISPR screens is relatively small (n = 15 compared to n = 450 for GWAS). Thus, it is not possible to conclude in a statistically sound manner whether authors of CRISPR screens are truly more likely to highlight understudied genes.

      However, the reviewer raises compelling reasons for why this might be the case, and we now embed the broader discussion point that some techniques might be more powerful toward understudied genes.

      The discussion now includes:

      “Further, the observed discrepancy between the popularity of hits highlighted by GWAS versus other technologies suggests that some -omics technologies may be more powerful than others for characterizing understudied genes. This possibility merits further research and researchers participating in unknomics should consider the relative strengths of each technology towards providing tractable results for follow-up.”

      • Affinity capture mass spectrometry (Aff-MS): Perhaps I misunderstood this, but typically this is referred to as affinity purification MS (AP-MS)

      Thank you for the clarification. We have changed ‘Aff-MS’ to ‘AP-MS’ throughout the manuscript.

      • Page 3, line 96. The sentence "The first possibility is that seemingly understudied genes are, in fact, not understudied as they would rarely be identified through experiments.". Would they not still be understudied, just not intentionally?

      We have rephrased this sentence to:

      “The first possibility is that some genes are less studied because they are rarely identified as hits in experiments.”

      • Fig 4 is very interesting, but I also found it a bit confusing. First, the choice of colour scheme, where blue shows the absence and white shows the presence of something, seems counterintuitive, especially on a white background. Second, I find it confusing that only some of the experiments are labelled in the heatmap. Could the authors not simply use Fig S9 as Fig 4? Or alternatively, only include the 8 labelled factors in the simplified figure.

      In line with this feedback and that of Reviewers #1 and #3, we have removed Figure 4 as a main-text figure and instead include this figure as Supplementary Figure S11. We have reversed the color scheme so that purple indicates one and white indicates zero. We also now label all factors. Previously we had only listed the default features of FMUG. We have also updated the figure legend to convey how it assisted the choice of default factors in FMUG. It reads:

      “Bold indicates FMUG’s default factors, which we selected based on this clustering and based on their strength of association with gene selection (Figure 3, Table S2 and Table S3)”.

      • The FMUG app is fantastic and sounds exactly like something that is required to boost the visibility of understudied genes and overcome the understudied gene bias. However, I did not understand the choice of reporting this in the Discussion section.

      We thank the reviewer for their enthusiasm, and have now moved FMUG into the results section.

      • To further increase usability of the FMUG app, is there a way it could be deployed online? I appreciate this could require a major amount of coding work, which would not be reasonable to demand. So please consider this a suggestion, potentially for a future implementation.

      Regrettably, we do not currently have the resources to create or maintain a web-based version. We hope that the publication of this manuscript will enable us to attract resources to create and maintain one.

      Reviewer #3 (Recommendations for the authors):

      Table s2 and s3: p values are indicated by star signs. However, with so many hypothesis tests, the p values should be corrected for multiple tests.

      We have now applied Benjamini-Hochberg multiple hypothesis correction to these tables, correcting p-values within each of the four technologies. We update our significance calling to read:

      “We identified 45 factors that relate to genes and found 33 (12 out of 23 binary factors and 21 out of 22 continuous factors) associated with selection in at least one assay type at Benjamini-Hochberg FDR < 0.001.”
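      The correction described above, applied separately within each technology, can be sketched as follows. This is a minimal pure-Python illustration of the Benjamini-Hochberg step-up adjustment under our own assumptions: the function name, the grouping dictionary, and the p-values are hypothetical and are not taken from Tables S2 or S3.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, grouped by technology (assumption, illustration only).
pvalues_by_technology = {
    "GWAS": [0.0001, 0.02, 0.5],
    "CRISPR": [0.03, 0.04],
}

# Correct within each technology rather than across the pooled set.
adjusted_by_technology = {
    tech: benjamini_hochberg(p) for tech, p in pvalues_by_technology.items()
}
print(adjusted_by_technology)
```

      Correcting within each technology treats every assay type as its own family of tests; pooling all tests into one family would generally give different adjusted values.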

      Figure S1 - S4

      These figures contain too many noninformative boxes. In all the figures, only the last three boxes are informative (reports assessed for eligibility, reports excluded, and studies included in review). The remaining boxes convey little information and should be simplified.

      We have simplified these diagrams, removing boxes which contained no information.

      Figure S6: what does it mean by "prior to the publication of the first article represented in this sample"? What is "this sample"?

      “This sample” refers to the collection of 450 GWAS articles, 296 articles using AP-MS, 148 transcriptomics articles, and 15 genome-wide CRISPR screen articles. We have rephrased this sentence to make this clear. It now reads:

      “Variant of Figure 1B only considering articles published in 2002 or before, prior to the publication of any of the articles featuring -omics experiments which we considered for this analysis.”

    1. Author response:

      The following is the authors’ response to the current reviews.

      eLife Assessment

      This neuroimaging and electrophysiology study in a small cohort of congenital cataract patients with sight recovery aims to characterize the effects of early visual deprivation on excitatory and inhibitory balance in visual cortex. While contrasting sight-recovery with visually intact controls suggested the existence of persistent alterations in Glx/GABA ratio and aperiodic EEG signals, it provided only incomplete evidence supporting claims about the effects of early deprivation itself. The reported data were considered valuable, given the rare study population. However, the small sample sizes, lack of a specific control cohort and multiple methodological limitations will likely restrict usefulness to scientists working in this particular subfield.

      We thank the reviewing editors for their consideration and updated assessment of our manuscript after its first revision.

      In order to assess the effects of early deprivation, we included an age-matched, normally sighted control group recruited from the same community, measured in the same scanner and laboratory. This study design is analogous to numerous studies in permanently congenitally blind humans, which typically recruited sighted controls, but hardly ever individuals with a different, e.g. late blindness history. In order to improve the specificity of our conclusions, we used a frontal cortex voxel in addition to a visual cortex voxel (MRS). Analogously, we separately analyzed occipital and frontal electrodes (EEG).

      Moreover, we relate our findings in congenital cataract reversal individuals to findings in the literature on permanent congenital blindness. Note, there are, to the best of our knowledge, neither MRS nor resting-state EEG studies in individuals with permanent late blindness.

      Our participants necessarily have nystagmus and low visual acuity due to their congenital deprivation phase, and the existence of nystagmus is a recruitment criterion to diagnose congenital cataracts.

      It might be interesting for future studies to investigate individuals with transient late blindness. However, such a study would be ill-motivated had we not found differences between the most “extreme” of congenital visual deprivation conditions and normally sighted individuals (analogous to why earlier research on permanent blindness investigated permanently congenitally blind humans first, rather than permanently late blind humans, or both in the same study). Any results of such future work would need to refer to our study, and results in these additional groups would not invalidate our findings.

      Since all our congenital cataract reversal individuals by definition had visual impairments, we included an eyes closed condition, both in the MRS and EEG assessment. Any group effect during the eyes closed condition cannot be due to visual acuity deficits changing the bottom-up driven visual activation.

      As we detail in response to review 3, our EEG analyses followed the standards in the field.

      Public Reviews:

      Reviewer 1 (Public review):

      Summary

      In this human neuroimaging and electrophysiology study, the authors aimed to characterise effects of a period of visual deprivation in the sensitive period on excitatory and inhibitory balance in the visual cortex. They attempted to do so by comparing neurochemistry conditions ('eyes open', 'eyes closed') and resting state, and visually evoked EEG activity between ten congenital cataract patients with recovered sight (CC), and ten age-matched control participants (SC) with normal sight.

      First, they used magnetic resonance spectroscopy to measure in vivo neurochemistry from two locations, the primary location of interest in the visual cortex, and a control location in the frontal cortex. Such voxels are used to provide a control for the spatial specificity of any effects, because the single-voxel MRS method provides a single sampling location. Using MR-visible proxies of excitatory and inhibitory neurotransmission, Glx and GABA+ respectively, the authors report no group effects in GABA+ or Glx, no difference in the functional conditions 'eyes closed' and 'eyes open'. They found an effect of group in the ratio of Glx/GABA+ and no similar effect in the control voxel location. They then perform multiple exploratory correlations between MRS measures and visual acuity, and report a weak positive correlation between the 'eyes open' condition and visual acuity in CC participants.

      The same participants then took part in an EEG experiment. The authors selected two electrodes placed in the visual cortex for analysis and report a group difference in an EEG index of neural activity, the aperiodic intercept, as well as the aperiodic slope, considered a proxy for cortical inhibition. Control electrodes in the frontal region did not present with the same pattern. They report an exploratory correlation between the aperiodic intercept and Glx in one out of three EEG conditions.

      The authors report the difference in E/I ratio, and interpret the lower E/I ratio as representing an adaptation to visual deprivation, which would have initially caused a higher E/I ratio. Although intriguing, the strength of evidence in support of this view is not strong. Amongst the limitations are the low sample size, a critical control cohort that could provide evidence for higher E/I ratio in CC patients without recovered sight for example, and lower data quality in the control voxel. Nevertheless, the study provides a rare and valuable insight into experience-dependent plasticity in the human brain.

      Strengths of study

      How sensitive period experience shapes the developing brain is an enduring and important question in neuroscience. This question has been particularly difficult to investigate in humans. The authors recruited a small number of sight-recovered participants with bilateral congenital cataracts to investigate the effect of sensitive period deprivation on the balance of excitation and inhibition in the visual brain using measures of brain chemistry and brain electrophysiology. The research is novel, and the paper was interesting and well written.

      Limitations

      Low sample size. Ten for CC and ten for SC, and further two SC participants were rejected due to lack of frontal control voxel data. The sample size limits the statistical power of the dataset and increases the likelihood of effect inflation.

      In the updated manuscript, the authors have provided justification for their sample size by pointing to prior studies and the inherent difficulties in recruiting individuals with bilateral congenital cataracts. Importantly, this highlights the value the study brings to the field while also acknowledging the need to replicate the effects in a larger cohort.

      Lack of specific control cohort. The control cohort has normal vision. The control cohort is not specific enough to distinguish between people with sight loss due to different causes and patients with congenital cataracts with co-morbidities. Further data from more specific populations, such as patients whose cataracts have not been removed, patients with developmental cataracts, or congenitally blind participants, would greatly improve the interpretability of the main finding. The lack of a more specific control cohort is a major caveat that limits a conclusive interpretation of the results.

      In the updated version, the authors have indicated that future studies can pursue comparisons between congenital cataract participants and cohorts with later sight loss.

      MRS data quality differences. Data quality in the control voxel appears worse than in the visual cortex voxel. The frontal cortex MRS spectrum shows far broader linewidth than the visual cortex (Supplementary Figures). Compared to the visual voxel, the frontal cortex voxel has less defined Glx and GABA+ peaks; lower GABA+ and Glx concentrations, lower NAA SNR values; lower NAA concentrations. If the data quality is a lot worse in the FC, then small effects may not be detectable.

      In the updated version, the authors have added more information that informs the reader of the MRS quality differences between voxel locations. This increases the transparency of their reporting and enhances the assessment of the results.

      Because of the direction of the difference in E/I, the authors interpret their findings as representing signatures of sight improvement after surgery without further evidence, either within the study or from the literature. However, the literature suggests that plasticity and visual deprivation drive the E/I index up rather than down. Decreasing GABA+ is thought to facilitate experience dependent remodelling. What evidence is there that cortical inhibition increases in response to a visual cortex that is over-sensitised due to congenital cataracts? Without further experimental or literature support this interpretation remains very speculative.

      The updated manuscript contains key references from non-human work to justify their interpretation.

      Heterogeneity in patient group. Congenital cataract (CC) patients experienced a variety of duration of visual impairment and were of different ages. They presented with co-morbidities (absorbed lens, strabismus, nystagmus). Strabismus has been associated with abnormalities in GABAergic inhibition in the visual cortex. The possible interactions with residual vision and confounds of co-morbidities are not experimentally controlled for in the correlations, and not discussed.

      The updated document has addressed this caveat.

      Multiple exploratory correlations were performed to relate MRS measures to visual acuity (shown in Supplementary Materials), and only specific ones shown in the main document. The authors describe the analysis as exploratory in the 'Methods' section. Furthermore, the correlation between visual acuity and E/I metric is weak, not corrected for multiple comparisons. The results should be presented as preliminary, as no strong conclusions can be made from them. They can provide a hypothesis to test in a future study.

      This has now been done throughout the document and increases the transparency of the reporting.

      P.16 Given the correlation of the aperiodic intercept with age ("Age negatively correlated with the aperiodic intercept across CC and SC individuals, that is, a flattening of the intercept was observed with age"), age needs to be controlled for in the correlation between neurochemistry and the aperiodic intercept. Glx has also been shown to negatively correlate with age.

      This caveat has been addressed in the revised manuscript.

      Multiple exploratory correlations were performed to relate MRS to EEG measures (shown in Supplementary Materials), and only specific ones shown in the main document. Given the multiple measures from the MRS, the correlations with the EEG measures were exploratory, as stated in the text, p.16, and in Fig.4. yet the introduction said that there was a prior hypothesis "We further hypothesized that neurotransmitter changes would relate to changes in the slope and intercept of the EEG aperiodic activity in the same subjects." It would be great if the text could be revised for consistency and the analysis described as exploratory.

      This has been done throughout the document and increases the transparency of the reporting.

      The analysis for the EEG needs to take more advantage of the available data. As far as I understand, only two electrodes were used, yet far more were available as seen in their previous study (Ossandon et al., 2023). The spatial specificity is not established. The authors could use the frontal cortex electrode (FP1, FP2) signals as a control for spatial specificity in the group effects, or even better, all available electrodes and correct for multiple comparisons. Furthermore, they could use the aperiodic intercept vs Glx in SC to evaluate the specificity of the correlation to CC.

      This caveat has been addressed. The authors have added frontal electrodes to their analysis, providing an essential regional control for the visual cortex location.

      Comments on the latest version:

      The authors have made reasonable adjustments to their manuscript that addressed most of my comments by adding further justification for their methodology, essential literature support, pointing out exploratory analyses, limitations and adding key control analyses. Their revised manuscript has overall improved, providing valuable information, though the evidence that supports their claims is still incomplete.

      We thank the reviewer for suggesting ways to improve our manuscript and carefully reassessing our revised manuscript.

      Reviewer 2 (Public review):

      Summary:

      The study examined 10 congenitally blind patients who recovered vision through the surgical removal of bilateral dense cataracts, measuring neural activity and neuro chemical profiles from the visual cortex. The declared aim is to test whether restoring visual function after years of complete blindness impacts excitation/inhibition balance in the visual cortex.

      Strengths:

      The findings are undoubtedly useful for the community, as they contribute towards characterising the many ways in which this special population differs from normally sighted individuals. The combination of MRS and EEG measures is a promising strategy to estimate a fundamental physiological parameter - the balance between excitation and inhibition in the visual cortex, which animal studies show to be heavily dependent upon early visual experience. Thus, the reported results pave the way for further studies, which may use a similar approach to evaluate more patients and control groups.

      Weaknesses:

      The main methodological limitation is the lack of an appropriate comparison group or condition to delineate the effect of sight recovery (as opposed to the effect of congenital blindness). Few previous studies suggested that Excitation/Inhibition ratio in the visual cortex is increased in congenitally blind patients; the present study reports that E/I ratio decreases instead. The authors claim that this implies a change of E/I ratio following sight recovery. However, supporting this claim would require showing a shift of E/I after vs. before the sight-recovery surgery, or at least it would require comparing patients who did and did not undergo the sight-recovery surgery (as common in the field).

      We thank the reviewer for suggesting ways to improve our manuscript and carefully reassessing our revised manuscript.

      Since we have not been able to acquire longitudinal data with the experimental design of the present study in congenital cataract reversal individuals, we compared the MRS and EEG results of congenital cataract reversal individuals to published work in permanently congenitally blind individuals. We consider this a resource-saving approach. We think that the results of our cross-sectional study now justify the costs and enormous efforts (and time for the patients, who often have to travel long distances) associated with longitudinal studies in this rare population.

      There are also more technical limitations related to the correlation analyses, which are partly acknowledged in the manuscript. A bland correlation between GLX/GABA and the visual impairment is reported, but this is specific to the patients group (N=10) and would not hold across groups (the correlation is positive, predicting the lowest GLX/GABA ratio values for the sighted controls - opposite of what is found). There is also a strong correlation between GLX concentrations and the EEG power at the lowest temporal frequencies. Although this relation is intriguing, it only holds for a very specific combination of parameters (of the many tested): only with eyes open, only in the patients group.

      Given the exploratory nature of the correlations, we do not base the majority of our conclusions on this analysis. There are no doubts that the reported correlations need replication; however, replication is only possible after a first report. Thus, we hope to motivate corresponding analyses in further studies.

      It has to be noted that in the present study significance testing for correlations was corrected for multiple comparisons, and that some findings replicate earlier reports (e.g. effects on EEG aperiodic slope, alpha power, and correlations with chronological age).

      Conclusions:

      The main claim of the study is that sight recovery impacts the excitation/inhibition balance in the visual cortex, estimated with MRS or through indirect EEG indices. However, due to the weaknesses outlined above, the study cannot distinguish the effects of sight recovery from those of visual deprivation. Moreover, many aspects of the results are interesting but their validation and interpretation require additional experimental work.

      We interpret the group differences between individuals tested years after congenital visual deprivation and normally sighted individuals as supportive of the E/I ratio being impacted by congenital visual deprivation. In the absence of a sensitive period for the development of an E/I ratio, individuals with a transient phase of congenital blindness might have developed a visual system indistinguishable from that of normally sighted individuals. As we demonstrate, this is not so. Comparing the results of congenital cataract reversal individuals with those of permanently congenitally blind humans (from previous studies) allowed us to identify changes of E/I ratio, which add to those found for congenital blindness.

      We thank the reviewer for the helpful comments and suggestions related to the first submission and first revision of our manuscript. We are keen to translate some of them into future studies.

      Reviewer 3 (Public review):

      This manuscript examines the impact of congenital visual deprivation on the excitatory/inhibitory (E/I) ratio in the visual cortex using Magnetic Resonance Spectroscopy (MRS) and electroencephalography (EEG) in individuals whose sight was restored. Ten individuals with reversed congenital cataracts were compared to age-matched, normally sighted controls, assessing the cortical E/I balance and its interrelationship and to visual acuity. The study reveals that the Glx/GABA ratio in the visual cortex and the intercept and aperiodic signal are significantly altered in those with a history of early visual deprivation, suggesting persistent neurophysiological changes despite visual restoration.

      First of all, I would like to disclose that I am not an expert in congenital visual deprivation, nor in MRS. My expertise is in EEG (particularly in the decomposition of periodic and aperiodic activity) and statistical methods.

      Although the authors addressed some of the concerns of the previous version, major concerns and flaws remain in terms of methodological and statistical approaches along with the (over)interpretation of the results. Specific concerns include:

      (1) 3.1 Response to Variability in Visual Deprivation

      Rather than listing the advantages and disadvantages of visual deprivation, I recommend providing at least a descriptive analysis of how the duration of visual deprivation influenced the measures of interest. This would enhance the depth and relevance of the discussion.

      Although Review 2 and Review 3 (see below) pointed out problems in interpreting multiple correlational analyses in small samples, we addressed this request by reporting such correlations between visual deprivation history and measured EEG/MRS outcomes.

      Calculating the correlation between duration of visual deprivation and behavioral or brain measures is, in fact, a common suggestion. The existence of sensitive periods, which are typically assumed not to follow a linear gradual decline of neuroplasticity, does not necessarily allow predicting a correlation with duration of blindness. Daphne Maurer has additionally worked on the concept of “sleeper effects” (Maurer et al., 2007), that is, effects of early deprivation on the brain and behavior which are observed only later in life when the function/neural circuits mature.

      In accordance with this reasoning, we did not observe a significant correlation between duration of visual deprivation and any of our dependent variables.

      (2) 3.2 Small Sample Size

      The issue of small sample size remains problematic. The justification that previous studies employed similar sample sizes does not adequately address the limitation in the current study. I strongly suggest that the correlation analyses should not feature prominently in the main manuscript or the abstract, especially if the discussion does not substantially rely on these correlations. Please also revisit the recommendations made in the section on statistical concerns.

      In the revised manuscript, we explicitly mention that our sample size is not atypical for the special group investigated, but that a replication of our results in larger samples would foster their impact. We only explicitly mention correlations that survived stringent testing for multiple comparisons in the main manuscript.

      Given the exploratory nature of the correlations, we have not based the majority of our claims on this analysis.

      (3) 3.3 Statistical Concerns

      While I appreciate the effort of conducting an independent statistical check, it merely validates whether the reported statistical parameters, degrees of freedom (df), and p-values are consistent. However, this does not address the appropriateness of the chosen statistical methods.

      We did not intend for the statcheck report to justify the methods used for statistics, which we have done in a separate section with normality and homogeneity testing (Supplementary Material S9), and references to it in the descriptions of the statistical analyses (Methods, Page 13, Lines 326-329 and Page 15, Lines 400-402).

      Several points require clarification or improvement:

      (4) Correlation Methods: The manuscript does not specify whether the reported correlation analyses are based on Pearson or Spearman correlation.

      The depicted correlations are Pearson correlations. We will add this information to the Methods.

      (5) Confidence Intervals: Include confidence intervals for correlations to represent the uncertainty associated with these estimates.

      We will add the confidence intervals to the second revision of our manuscript.
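      One common way to obtain such intervals for a Pearson correlation is the Fisher z-transformation. The sketch below is our illustration, not the authors' analysis code: the data are hypothetical, the function names are ours, and the interval assumes approximate bivariate normality, which is a strong assumption at n = 10.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, level=0.95):
    """Confidence interval for r via the Fisher z-transformation (n > 3)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    zcrit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - zcrit * se), math.tanh(z + zcrit * se)


# Hypothetical measurements for ten participants (illustration only).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

r = pearson_r(x, y)
lo, hi = fisher_ci(r, len(x))
print(f"r = {r:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

      With ten participants the standard error on the z scale is 1/sqrt(7), so intervals are wide even for strong correlations, which is the uncertainty the reviewer asks to have displayed.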

      (6) Permutation Statistics: Given the small sample size, I recommend using permutation statistics, as these are exact tests and more appropriate for small datasets.

      Our study focuses on a rare population, with a sample size limited by the availability of participants. Our findings provide exploratory insights rather than make strong inferential claims. To this end, we have ensured that our analysis adheres to key statistical assumptions (Shapiro-Wilk as well as Levene’s tests, Supplementary Material S9), and reported our findings with effect sizes, appropriate caution and context.
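      For reference, the kind of permutation approach raised in comment (6) can be sketched as below. This is not an analysis reported in the manuscript; it is a hypothetical illustration with made-up data and our own function names. It uses a Monte Carlo sample of permutations rather than the full enumeration, so the p-value is an approximation of the exact permutation p-value.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a Pearson correlation."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    count = 0
    for _ in range(n_permutations):
        # Shuffling y breaks any pairing with x, simulating the null.
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (count + 1) / (n_permutations + 1)


# Hypothetical measurements for ten participants (illustration only).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

p = permutation_p_value(x, y)
print(f"permutation p = {p:.4f}")
```

      The test makes no normality assumption, which is why it is often suggested for small samples; its resolution is limited by the number of permutations drawn.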

      (7) Adjusted P-Values: Ensure that reported Bonferroni corrected p-values (e.g., p > 0.999) are clearly labeled as adjusted p-values where applicable.

      In the revised manuscript, we will change Figure 4 to say ‘adjusted p,’ which we indeed reported.
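
      A Bonferroni-adjusted p-value is the raw p-value multiplied by the number of comparisons and capped at 1, which is why adjusted values such as p > 0.999 can appear. A minimal sketch, with a hypothetical function name and made-up p-values:

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up raw p-values, for illustration only.
raw_p = [0.01, 0.04, 0.5]
adjusted_p = bonferroni_adjust(raw_p)
print(adjusted_p)
```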

      (8) Figure 2C

      Figure 2C still lacks crucial information that the correlation between Glx/GABA ratio and visual acuity was computed solely in the control group (as described in the rebuttal letter). Why was this analysis restricted to the control group? Please provide a rationale.

      Figure 2C depicts the correlation between Glx/GABA+ ratio and visual acuity in the congenital cataract reversal group, not the control group. This is mentioned in the Figure 2 legend, as well as in the main text where the figure is referred to (Page 18, Line 475).

      The correlation analyses between visual acuity and MRS/EEG measures were only performed in the congenital cataract reversal group, since the sighted control group comprised individuals with vision in the normal range; thus, these analyses would not make sense in that group. Table 1, which lists the individual visual acuities for all participants, including the normally sighted controls, shows the low variance in the latter group.

      For variables in which no a priori group differences in variance were predicted, we performed the correlation analyses across groups (see Supplementary Material S12, S15).

      We will highlight these motivations more clearly in the Methods of the revised manuscript.

      (9 3.4) Interpretation of Aperiodic Signal

      Relying on previous studies to interpret the aperiodic slope as a proxy for excitation/inhibition (E/I) does not make the interpretation more robust.

      How to interpret aperiodic EEG activity has been the subject of extensive investigation. We cite studies which provide evidence from multiple species (monkeys, humans) and measurements (EEG, MEG, ECoG), including studies which pharmacologically manipulated E/I balance.

      Whether our findings are robust, in fact, requires a replication study. Importantly, we analyzed the intercept of the aperiodic activity fit as well, and discuss results related to the intercept.

      Quote:

      “3.4 Interpretation of aperiodic signal:

      - Several recent papers demonstrated that the aperiodic signal measured in EEG or ECoG is related to various important aspects such as age, skull thickness, electrode impedance, as well as cognition. Thus, currently, very little is known about the underlying effects which influence the aperiodic intercept and slope. The entire interpretation of the aperiodic slope as a proxy for E/I is based on a computational model and simulation (as described in the Gao et al. paper).

      Response: Apart from the modeling work from Gao et al., multiple papers, which have also been cited, used ECoG, EEG and MEG and showed concomitant changes in aperiodic activity with pharmacological manipulation of the E/I ratio (Colombo et al., 2019; Molina et al., 2020; Muthukumaraswamy & Liley, 2018). Further, several prior studies have interpreted changes in the aperiodic slope as reflective of changes in the E/I ratio, including studies of developmental groups (Favaro et al., 2023; Hill et al., 2022; McSweeney et al., 2023; Schaworonkow & Voytek, 2021) as well as patient groups (Molina et al., 2020; Ostlund et al., 2021).

      - The authors further wrote: We used the slope of the aperiodic (1/f) component of the EEG spectrum as an estimate of E/I ratio (Gao et al., 2017; Medel et al., 2020; Muthukumaraswamy & Liley, 2018). This is a highly speculative interpretation with very little empirical evidence. These papers were conducted with ECoG data (mostly in animals) and mostly under anesthesia. Thus, these studies only allow an indirect interpretation by what the 1/f slope in EEG measurements is actually influenced.

      Response: Note that Muthukumaraswamy and Liley (2018) used different types of pharmacological manipulations and analyzed periodic and aperiodic MEG activity in humans, in addition to monkey ECoG. Further, Medel et al. (now published as Medel et al., 2023) compared EEG activity in addition to ECoG data after propofol administration. The interpretation of our results is in line with a number of recent studies in developing (Hill et al., 2022; Schaworonkow & Voytek, 2021) and special populations using EEG. As mentioned above, several prior studies have used the slope of the 1/f component/aperiodic activity as an indirect measure of the E/I ratio (Favaro et al., 2023; Hill et al., 2022; McSweeney et al., 2023; Molina et al., 2020; Ostlund et al., 2021; Schaworonkow & Voytek, 2021), including studies using scalp-recorded EEG from humans.

      In the introduction of the revised manuscript, we have made more explicit that this metric is indirect (Page 3, Line 91), (additionally see Discussion, Page 24, Lines 644-645, Page 25, Lines 650-657).

      While a full understanding of aperiodic activity needs to be provided, some convergent ideas have emerged. We think that our results contribute to this enterprise, since our study is, to the best of our knowledge, the first which assessed MRS measured neurotransmitter levels and EEG aperiodic activity.“

      (10) Additionally, the authors state:

      "We cannot think of how any of the exploratory correlations between neurophysiological measures and MRS measures could be accounted for by a difference e.g. in skull thickness."

      (11) This could be addressed directly by including skull thickness as a covariate or visualizing it in scatterplots, for instance, by representing skull thickness as the size of the dots.

      We are not aware of any study that would justify such an analysis.

      Our analyses were based on previous findings in the literature.

      Since to the best of our knowledge, no evidence exists that congenital cataracts go together with changes in skull thickness, and that skull thickness might selectively modulate visual cortex Glx/GABA+ but not NAA measures, we decided against following this suggestion.

      Notably, the neurotransmitter concentration reported here is after tissue segmentation of the voxel region. The tissue fraction was shown to not differ between groups in the MRS voxels (Supplementary Material S4). The EEG electrode impedance was lowered to <10 kOhm in every participant (Methods, Page 13, Line 344), and preparation was identical across groups.

      (12 3.5) Problems with EEG Preprocessing and Analysis

      Downsampling: The decision to downsample the data to 60 Hz "to match the stimulation rate" is problematic. This choice conflates subsequent spectral analyses due to aliasing issues, as explained by the Nyquist theorem. While the authors cite prior studies (Schwenk et al., 2020; VanRullen & MacDonald, 2012) to justify this decision, these studies focused on alpha (8-12 Hz), where aliasing is less of a concern compared to analyzing the aperiodic signal. Furthermore, the current study analyzes the frequency range from 1-20 Hz, which is too narrow for interpreting the aperiodic signal as E/I. Typically, this analysis should include higher frequencies, spanning at least 1-30 Hz or even 1-45 Hz (not 20-40 Hz).

      As mentioned in the Methods (Page 15, Line 376) and in the previous response, the pop_resample function used by EEGLAB applies an anti-aliasing filter at half the resampling frequency (as per the Nyquist theorem; https://eeglab.org/tutorials/05_Preprocess/resampling.html). The upper cutoff of the low-pass filter set by EEGLAB prior to downsampling (30 Hz) is still far above the frequency range of interest in the current study (1-20 Hz), thus allowing us to derive valid results.
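
      The aliasing concern can be made concrete with a short frequency-folding calculation. The sketch below is an illustration only (the function name is hypothetical, and a line-noise frequency of 50 Hz is assumed): it computes where a sinusoid would appear if data were resampled to 60 Hz without any low-pass filtering, which is the situation an anti-aliasing filter at the new Nyquist frequency of 30 Hz is meant to prevent.

```python
def aliased_frequency(f_hz, fs_hz):
    """Apparent frequency of a sinusoid at f_hz when sampled at fs_hz with no anti-aliasing filter."""
    return abs(f_hz - fs_hz * round(f_hz / fs_hz))


fs_new = 60.0                 # resampling rate in Hz
nyquist = fs_new / 2          # components above this frequency fold back

for f in (10.0, 35.0, 50.0):
    folded = aliased_frequency(f, fs_new)
    status = "unchanged" if f <= nyquist else "would alias without filtering"
    print(f"{f:5.1f} Hz -> {folded:5.1f} Hz ({status})")
```

      Under these assumptions, unfiltered 50 Hz line noise would fold into the 1-20 Hz fit range as a narrow peak, which is why the presence of the filter and the absence of such a peak in the spectrum are the relevant checks.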

      Quote:

      “- The authors downsampled the data to 60Hz to "to match the stimulation rate". What is the intention of this? Because the subsequent spectral analyses are conflated by this choice (see Nyquist theorem).

      Response: These data were collected as part of a study designed to evoke alpha activity with visual white-noise, which ranged in luminance with equal power at all frequencies from 1-60 Hz, restricted by the refresh rate of the monitor on which stimuli were presented (Pant et al., 2023). This paradigm and method were developed by VanRullen and colleagues (Schwenk et al., 2020; VanRullen & MacDonald, 2012), wherein the analysis requires the same sampling rate between the presented frequencies and the EEG data. The downsampling function used here automatically applies an anti-aliasing filter (EEGLAB 2019).”

      Moreover, the resting-state data were not resampled to 60 Hz. We will make this clearer in the Methods of the revised manuscript.

      Our consistent results of group differences across all three EEG conditions, thus, exclude any possibility that they were driven by aliasing artifacts.

      The expected effects of this anti-aliasing filter can be seen in the attached Figure R1, showing an example participant’s spectrum in the 1-30 Hz range (as opposed to the 1-20 Hz plotted in the manuscript), clearly showing a 30-40 dB drop at 30 Hz. Any aliasing due to, for example, remaining line noise, would additionally be visible in this figure (as well as Figure 3) as a peak.

      Author response image 1.

      Power spectral density of one congenital cataract-reversal (CC) participant in the visual stimulation condition across all channels. The reduced power at 30 Hz shows the effects of the anti-aliasing filter applied by EEGLAB’s pop_resample function.

      As we stated in the manuscript, and in previous reviews, so far there has been no consensus on the exact range of measuring aperiodic activity. We made a principled decision based on the literature (showing a knee in aperiodic fits of this dataset at 20 Hz) (Medel et al., 2023; Ossandón et al., 2023), data quality (possible contamination by line noise at higher frequencies) and the purpose of the visual stimulation experiment (to look at the lower frequency range by stimulating up to 60 Hz, thereby limiting us to quantifying below 30 Hz), that 1-20 Hz would be the fit range in this dataset.

      Quote:

      “(3) What's the underlying idea of analyzing two separate aperiodic slopes (20-40Hz and 1-19Hz). This is very unusual to compute the slope between 20-40 Hz, where the SNR is rather low.

      "Ossandón et al. (2023), however, observed that in addition to the flatter slope of the aperiodic power spectrum in the high frequency range (20-40 Hz), the slope of the low frequency range (1-19 Hz) was steeper in both, congenital cataract-reversal individuals, as well as in permanently congenitally blind humans."

      Response: The present manuscript computed the slope between 1-20 Hz. Ossandón et al. as well as Medel et al. (2023) found a “knee” of the 1/f distribution at 20 Hz and describe further the motivations for computing both slope ranges. For example, Ossandón et al. used a data driven approach and compared single vs. dual fits and found that the latter fitted the data better. Additionally, they found the best fit if a knee at 20 Hz was used. We would like to point out that no standard range exists for the fitting of the 1/f component across the literature and, in fact, very different ranges have been used (Gao et al., 2017; Medel et al., 2023; Muthukumaraswamy & Liley, 2018).“

      (13) Baseline Removal: Subtracting the mean activity across an epoch as a baseline removal step is inappropriate for resting-state EEG data. This preprocessing step undermines the validity of the analysis. The EEG dataset has fundamental flaws, many of which were pointed out in the previous review round but remain unaddressed. In its current form, the manuscript falls short of standards for robust EEG analysis. If I were reviewing for another journal, I would recommend rejection based on these flaws.

      The baseline removal step from each epoch serves to remove the DC component of the recording and detrend the data. This is a standard preprocessing step (included as an option in preprocessing pipelines recommended by the EEGLAB toolbox, FieldTrip toolbox and MNE toolbox), additionally necessary to improve the efficacy of ICA decomposition (Groppe et al., 2009).

      In the previous review round, a clarification of the baseline timing was requested, which we added. Beyond this request, there was no mention of the appropriateness of the baseline removal and/or a request to provide reasons for why it might not undermine the validity of the analysis.

      Quote:

      “- "Subsequently, baseline removal was conducted by subtracting the mean activity across the length of an epoch from every data point." The actual baseline time segment should be specified.

      Response: The time segment was the length of the epoch, that is, 1 second for the resting state conditions and 6.25 seconds for the visual stimulation conditions. This has been explicitly stated in the revised manuscript (Page 13, Line 354).”

      Prior work in the time (not frequency) domain on event-related potential (ERP) analysis has suggested that the baselining step might cause spurious effects (Delorme, 2023) (although see (Tanner et al., 2016)). We did not perform ERP analysis at any stage. One recent study suggests spurious group differences in the 1/f signal might be driven by an inappropriate dB division baselining method (Gyurkovics et al., 2021), which we did not perform.

      Any effect of our baselining procedure on the FFT spectrum would be below the 1 Hz range, which we did not analyze.  
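
      This point can be checked numerically. The sketch below is a hypothetical illustration, not the authors' pipeline: it applies a plain discrete Fourier transform (no taper or windowing) to a synthetic 1-second epoch sampled at 60 Hz and compares the spectrum before and after subtracting the epoch mean. Under these assumptions only the 0 Hz bin changes.

```python
import cmath
import math


def dft(x):
    """Plain discrete Fourier transform of a real sequence (no windowing)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def remove_epoch_mean(x):
    """Subtract the mean across the whole epoch from every sample."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


fs = 60            # sampling rate in Hz
n = 60             # one 1-second epoch, so bin k corresponds to k Hz
# Synthetic epoch: a DC offset plus 3 Hz and 10 Hz components (made-up values).
epoch = [
    5.0 + 0.5 * math.sin(2 * math.pi * 3 * t / fs) + math.sin(2 * math.pi * 10 * t / fs)
    for t in range(n)
]

raw_spec = dft(epoch)
demeaned_spec = dft(remove_epoch_mean(epoch))

largest_change = max(abs(raw_spec[k] - demeaned_spec[k]) for k in range(1, n))
print(f"DC bin before: {abs(raw_spec[0]):.1f}, after: {abs(demeaned_spec[0]):.1e}")
print(f"largest change in any non-DC bin: {largest_change:.1e}")
```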

      Each of the preprocessing steps in the manuscript matches pipelines described and published in extensive prior work. We document how multiple aspects of our EEG results replicate prior findings (Supplementary Material S15, S18, S19), reports of other experimenters, groups and locations, validating that our results are robust.

      We therefore reject the claim of methodological flaws in our EEG analyses in the strongest possible terms.

      Quote:

      “3.5 Problems with EEG preprocessing and analysis:

      - It seems that the authors did not identify bad channels nor address the line noise issue (even a problem if a low pass filter of below-the-line noise was applied).

      Response: As pointed out in the Methods and Figure 1, we only analyzed data from two occipital channels, O1 and O2, neither of which was rejected for any participant. Channel rejection was performed for the larger dataset, published elsewhere (Ossandón et al., 2023; Pant et al., 2023). As control sites, we added the frontal channels Fp1 and Fp2 (see Supplementary Material S14).

      Neither Ossandón et al. (2023) nor Pant et al. (2023) considered frequency ranges above 40 Hz to avoid any possible contamination with line noise. Here, we focused on activity between 0 and 20 Hz, definitely excluding line noise contaminations (Methods, Page 14, Lines 365-367). The low pass filter (FIR, 1-45 Hz) guaranteed that any spill-over effects of line noise would be restricted to frequencies just below the upper cutoff frequency.

      Additionally, a prior version of the analysis used spectrum interpolation to remove line noise; the group differences remained stable (Ossandón et al., 2023). We have reported this analysis in the revised manuscript (Page 14, Lines 364-357).

      Further, both groups were measured in the same lab, making line noise (~ 50 Hz) as an account for the observed group effects in the 1-20 Hz frequency range highly unlikely. Finally, any of the exploratory MRS-EEG correlations would be hard to explain if the EEG parameters would be contaminated with line noise.

      - What was the percentage of segments that needed to be rejected due to the 120μV criteria? This should be reported specifically for EO & EC and controls and patients.

      Response: The mean percentage of 1-second segments rejected for each resting state condition and the percentage of 6.25-second segments rejected in each group for the visual stimulation condition have been added to the revised manuscript (Supplementary Material S10) and are referred to in the Methods (Page 14, Lines 372-373).

      - The authors downsampled the data to 60Hz to "to match the stimulation rate". What is the intention of this? Because the subsequent spectral analyses are conflated by this choice (see Nyquist theorem).

      Response: These data were collected as part of a study designed to evoke alpha activity with visual white-noise, which changed in luminance with equal power at all frequencies from 1-60 Hz, restricted by the refresh rate of the monitor on which stimuli were presented (Pant et al., 2023). This paradigm and method were developed by VanRullen and colleagues (Schwenk et al., 2020; VanRullen & MacDonald, 2012), wherein the analysis requires the same sampling rate between the presented frequencies and the EEG data. The downsampling function used here automatically applies an anti-aliasing filter (EEGLAB 2019).

      - "Subsequently, baseline removal was conducted by subtracting the mean activity across the length of an epoch from every data point." The actual baseline time segment should be specified.

      Response: The time segment was the length of the epoch, that is, 1 second for the resting state conditions and 6.25 seconds for the visual stimulation conditions. This has now been explicitly stated in the revised manuscript (Page 14, Lines 379-380).

      - "We excluded the alpha range (8-14 Hz) for this fit to avoid biasing the results due to documented differences in alpha activity between CC and SC individuals (Bottari et al., 2016; Ossandón et al., 2023; Pant et al., 2023)." This does not really make sense, as the FOOOF algorithm first fits the 1/f slope, for which the alpha activity is not relevant.

      Response: We did not use the FOOOF algorithm/toolbox in this manuscript. As stated in the Methods, we used a 1/f fit to the 1-20 Hz spectrum in the log-log space, and subtracted this fit from the original spectrum to obtain the corrected spectrum. Given the pronounced difference in alpha power between groups (Bottari et al., 2016; Ossandón et al., 2023; Pant et al., 2023), we were concerned it might drive differences in the exponent values. Our analysis pipeline had been adapted from previous publications of our group and other labs (Ossandón et al., 2023; Voytek et al., 2015; Waschke et al., 2017).

      We have conducted the analysis with and without the exclusion of the alpha range, as well as using the FOOOF toolbox both in the 1-20 Hz and 20-40 Hz ranges (Ossandón et al., 2023). The findings of a steeper slope in the 1-20 Hz range as well as lower alpha power in CC vs SC individuals remained stable. In Ossandón et al., the comparison between the piecewise fits and FOOOF fits led the authors to use the former, as it outperformed the FOOOF algorithm for their data.

      - The model fits of the 1/f fitting for EO, EC, and both participant groups should be reported.

      Response: In Figure 3 of the manuscript, we depicted the mean spectra and 1/f fits for each group.

      In the revised manuscript, we added the fit quality metrics (average R² values > 0.91 for each group and condition) (Methods Page 15, Lines 395-396; Supplementary Material S11) and additionally show individual subjects’ fits (Supplementary Material S11).“
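
      The fitting procedure described in the quoted response (a straight-line fit to the 1-20 Hz spectrum in log-log space, with the 8-14 Hz alpha range excluded, followed by subtraction of the fit from the spectrum) can be sketched as follows. This is a minimal illustration on a synthetic spectrum, not the authors' code: the function names, the ordinary least-squares fit and all numeric values are assumptions.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def fit_aperiodic(freqs, power, fit_range=(1.0, 20.0), exclude=(8.0, 14.0)):
    """Fit log10(power) against log10(frequency), optionally skipping a band such as alpha."""
    xs, ys = [], []
    for f, p in zip(freqs, power):
        in_range = fit_range[0] <= f <= fit_range[1]
        excluded = exclude is not None and exclude[0] <= f <= exclude[1]
        if in_range and not excluded:
            xs.append(math.log10(f))
            ys.append(math.log10(p))
    return fit_line(xs, ys)


# Synthetic spectrum: power = 100 * f^-1.5, with extra power confined to 8-14 Hz.
freqs = [float(f) for f in range(1, 21)]
power = [100.0 * f ** -1.5 for f in freqs]
power = [p * 3.0 if 8.0 <= f <= 14.0 else p for f, p in zip(freqs, power)]

slope_excl, intercept_excl = fit_aperiodic(freqs, power)                 # alpha excluded
slope_incl, intercept_incl = fit_aperiodic(freqs, power, exclude=None)   # alpha included

# Corrected spectrum: log power minus the aperiodic fit (alpha-excluded version).
corrected = [
    math.log10(p) - (intercept_excl + slope_excl * math.log10(f))
    for f, p in zip(freqs, power)
]

print(f"slope with alpha excluded: {slope_excl:.3f}")
print(f"slope with alpha included: {slope_incl:.3f}")
```

      In this toy case the alpha-excluded fit recovers the simulated slope, whereas including the alpha band pulls the estimate away from it, which is the bias the authors describe wanting to avoid.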

      (14) The authors mention:

      "The EEG data sets reported here were part of data published earlier (Ossandón et al., 2023; Pant et al., 2023)." Thus, the statement "The group differences for the EEG assessments corresponded to those of a larger sample of CC individuals (n=38)" is a circular argument and should be avoided.

      The authors addressed this comment and adjusted the statement. However, I do not understand why the full sample published earlier (Ossandón et al., 2023) was not used in the current study.

      The recording of EEG resting state data started in 2013, while MRS testing could only be set up by the end of 2019. Moreover, not all subjects who qualify for EEG recording qualify for being scanned (e.g., due to MRI safety or claustrophobia).

      References

      Bottari, D., Troje, N. F., Ley, P., Hense, M., Kekunnaya, R., & Röder, B. (2016). Sight restoration after congenital blindness does not reinstate alpha oscillatory activity in humans. Scientific Reports. https://doi.org/10.1038/srep24683

      Colombo, M. A., Napolitani, M., Boly, M., Gosseries, O., Casarotto, S., Rosanova, M., Brichant, J. F., Boveroux, P., Rex, S., Laureys, S., Massimini, M., Chieregato, A., & Sarasso, S. (2019). The spectral exponent of the resting EEG indexes the presence of consciousness during unresponsiveness induced by propofol, xenon, and ketamine. NeuroImage, 189, 631–644. https://doi.org/10.1016/j.neuroimage.2019.01.024

      Delorme, A. (2023). EEG is better left alone. Scientific Reports, 13(1), 2372. https://doi.org/10.1038/s41598-023-27528-0

      Favaro, J., Colombo, M. A., Mikulan, E., Sartori, S., Nosadini, M., Pelizza, M. F., Rosanova, M., Sarasso, S., Massimini, M., & Toldo, I. (2023). The maturation of aperiodic EEG activity across development reveals a progressive differentiation of wakefulness from sleep. NeuroImage, 277. https://doi.org/10.1016/J.NEUROIMAGE.2023.120264

      Gao, R., Peterson, E. J., & Voytek, B. (2017). Inferring synaptic excitation/inhibition balance from field potentials. NeuroImage, 158, 70–78. https://doi.org/10.1016/j.neuroimage.2017.06.078

      Groppe, D. M., Makeig, S., & Kutas, M. (2009). Identifying reliable independent components via split-half comparisons. NeuroImage, 45(4), 1199–1211. https://doi.org/10.1016/j.neuroimage.2008.12.038

      Gyurkovics, M., Clements, G. M., Low, K. A., Fabiani, M., & Gratton, G. (2021). The impact of 1/f activity and baseline correction on the results and interpretation of time-frequency analyses of EEG/MEG data: A cautionary tale. NeuroImage, 237. https://doi.org/10.1016/j.neuroimage.2021.118192

      Hill, A. T., Clark, G. M., Bigelow, F. J., Lum, J. A. G., & Enticott, P. G. (2022). Periodic and aperiodic neural activity displays age-dependent changes across early-to-middle childhood. Developmental Cognitive Neuroscience, 54, 101076. https://doi.org/10.1016/J.DCN.2022.101076

      Maurer, D., Mondloch, C. J., & Lewis, T. L. (2007). Sleeper effects. In Developmental Science. https://doi.org/10.1111/j.1467-7687.2007.00562.x

      McSweeney, M., Morales, S., Valadez, E. A., Buzzell, G. A., Yoder, L., Fifer, W. P., Pini, N., Shuffrey, L. C., Elliott, A. J., Isler, J. R., & Fox, N. A. (2023). Age-related trends in aperiodic EEG activity and alpha oscillations during early- to middle-childhood. NeuroImage, 269, 119925. https://doi.org/10.1016/j.neuroimage.2023.119925

      Medel, V., Irani, M., Crossley, N., Ossandón, T., & Boncompte, G. (2023). Complexity and 1/f slope jointly reflect brain states. Scientific Reports, 13(1), 21700. https://doi.org/10.1038/s41598-023-47316-0

      Molina, J. L., Voytek, B., Thomas, M. L., Joshi, Y. B., Bhakta, S. G., Talledo, J. A., Swerdlow, N. R., & Light, G. A. (2020). Memantine Effects on Electroencephalographic Measures of Putative Excitatory/Inhibitory Balance in Schizophrenia. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 5(6), 562–568. https://doi.org/10.1016/j.bpsc.2020.02.004

      Muthukumaraswamy, S. D., & Liley, D. T. (2018). 1/f electrophysiological spectra in resting and drug-induced states can be explained by the dynamics of multiple oscillatory relaxation processes. NeuroImage, 179, 582–595. https://doi.org/10.1016/j.neuroimage.2018.06.068

      Ossandón, J. P., Stange, L., Gudi-Mindermann, H., Rimmele, J. M., Sourav, S., Bottari, D., Kekunnaya, R., & Röder, B. (2023). The development of oscillatory and aperiodic resting state activity is linked to a sensitive period in humans. NeuroImage, 275, 120171. https://doi.org/10.1016/J.NEUROIMAGE.2023.120171

      Ostlund, B. D., Alperin, B. R., Drew, T., & Karalunas, S. L. (2021). Behavioral and cognitive correlates of the aperiodic (1/f-like) exponent of the EEG power spectrum in adolescents with and without ADHD. Developmental Cognitive Neuroscience, 48, 100931. https://doi.org/10.1016/j.dcn.2021.100931

      Pant, R., Ossandón, J., Stange, L., Shareef, I., Kekunnaya, R., & Röder, B. (2023). Stimulus-evoked and resting-state alpha oscillations show a linked dependence on patterned visual experience for development. NeuroImage: Clinical, 103375. https://doi.org/10.1016/J.NICL.2023.103375

      Schaworonkow, N., & Voytek, B. (2021). Longitudinal changes in aperiodic and periodic activity in electrophysiological recordings in the first seven months of life. Developmental Cognitive Neuroscience, 47. https://doi.org/10.1016/j.dcn.2020.100895

      Schwenk, J. C. B., VanRullen, R., & Bremmer, F. (2020). Dynamics of Visual Perceptual Echoes Following Short-Term Visual Deprivation. Cerebral Cortex Communications, 1(1). https://doi.org/10.1093/TEXCOM/TGAA012

      Tanner, D., Norton, J. J. S., Morgan-Short, K., & Luck, S. J. (2016). On high-pass filter artifacts (they’re real) and baseline correction (it’s a good idea) in ERP/ERMF analysis. Journal of Neuroscience Methods, 266, 166–170. https://doi.org/10.1016/j.jneumeth.2016.01.002

      VanRullen, R., & MacDonald, J. S. P. (2012). Perceptual echoes at 10 Hz in the human brain. Current Biology. https://doi.org/10.1016/j.cub.2012.03.050

      Voytek, B., Kramer, M. A., Case, J., Lepage, K. Q., Tempesta, Z. R., Knight, R. T., & Gazzaley, A. (2015). Age-related changes in 1/f neural electrophysiological noise. Journal of Neuroscience, 35(38). https://doi.org/10.1523/JNEUROSCI.2332-14.2015

      Waschke, L., Wöstmann, M., & Obleser, J. (2017). States and traits of neural irregularity in the age-varying human brain. Scientific Reports, 7(1), 1–12. https://doi.org/10.1038/s41598-017-17766-4


      The following is the authors’ response to the original reviews.

      eLife Assessment

      This potentially useful study involves neuro-imaging and electrophysiology in a small cohort of congenital cataract patients after sight recovery and age-matched control participants with normal sight. It aims to characterize the effects of early visual deprivation on excitatory and inhibitory balance in the visual cortex. While the findings are taken to suggest the existence of persistent alterations in Glx/GABA ratio and aperiodic EEG signals, the evidence supporting these claims is incomplete. Specifically, small sample sizes, lack of a specific control cohort, and other methodological limitations will likely restrict the usefulness of the work, with relevance limited to scientists working in this particular subfield.

      As pointed out in the public reviews, there are very few human models which allow for assessing the role of early experience on neural circuit development. While the prevalent research in permanent congenital blindness reveals the response and adaptation of the developing brain to an atypical situation (blindness), research in sight restoration addresses the question of whether and how atypical development can be remediated if typical experience (vision) is restored. The literature on the role of visual experience in the development of E/I balance in humans, assessed via Magnetic Resonance Spectroscopy (MRS), has been limited to a few studies on congenital permanent blindness. Thus, we assessed sight recovery individuals with a history of congenital blindness, as limited evidence from other researchers indicated that the visual cortex E/I ratio might differ compared to normally sighted controls.

      Individuals with total bilateral congenital cataracts who remained untreated until later in life are extremely rare, particularly if only carefully diagnosed patients are included in a study sample. A sample size of 10 patients is, at the very least, typical of past studies in this population, even for exclusively behavioral assessments. In the present study, in addition to behavioral assessment as an indirect measure of sensitive periods, we investigated participants with two neuroimaging methods (Magnetic Resonance Spectroscopy and electroencephalography) to directly assess the neural correlates of sensitive periods in humans. The electroencephalography data allowed us to link the results of our small sample to findings documented in large cohorts of both, sight recovery individuals and permanently congenitally blind individuals. As pointed out in a recent editorial recommending an “exploration-then-estimation procedure,” (“Consideration of Sample Size in Neuroscience Studies,” 2020), exploratory studies like ours provide crucial direction and specific hypotheses for future work.

      We included an age-matched sighted control group recruited from the same community, measured in the same scanner and laboratory, to assess whether early experience is necessary for a typical excitatory/inhibitory (E/I) ratio to emerge in adulthood. The present findings indicate that this is indeed the case. Based on these results, a possible question to answer in future work, with individuals who had developmental cataracts, is whether later visual deprivation causes similar effects. Note that even if visual deprivation at a later stage in life caused similar effects, the current results would not be invalidated; by contrast, they are essential to understand future work on late (permanent or transient) blindness.

      Thus, we think that the present manuscript has far reaching implications for our understanding of the conditions under which E/I balance, a crucial characteristic of brain functioning, emerges in humans.

      Finally, our manuscript is one of the first few studies that relate MRS neurotransmitter concentrations to parameters of EEG aperiodic activity. Since present research has been using aperiodic activity as a correlate of the E/I ratio, and partially of higher cognitive functions, we think that our manuscript additionally contributes to a better understanding of what might be measured with aperiodic neurophysiological activity.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this human neuroimaging and electrophysiology study, the authors aimed to characterize the effects of a period of visual deprivation in the sensitive period on excitatory and inhibitory balance in the visual cortex. They attempted to do so by comparing neurochemistry conditions ('eyes open', 'eyes closed') and resting state, and visually evoked EEG activity between ten congenital cataract patients with recovered sight (CC), and ten age-matched control participants (SC) with normal sight.

      First, they used magnetic resonance spectroscopy to measure in vivo neurochemistry from two locations, the primary location of interest in the visual cortex, and a control location in the frontal cortex. Such voxels are used to provide a control for the spatial specificity of any effects because the single-voxel MRS method provides a single sampling location. Using MR-visible proxies of excitatory and inhibitory neurotransmission, Glx and GABA+ respectively, the authors report no group effects in GABA+ or Glx, no difference in the functional conditions 'eyes closed' and 'eyes open'. They found an effect of the group in the ratio of Glx/GABA+ and no similar effect in the control voxel location. They then performed multiple exploratory correlations between MRS measures and visual acuity, and reported a weak positive correlation between the 'eyes open' condition and visual acuity in CC participants.

      The same participants then took part in an EEG experiment. The authors selected only two electrodes placed in the visual cortex for analysis and reported a group difference in an EEG index of neural activity, the aperiodic intercept, as well as the aperiodic slope, considered a proxy for cortical inhibition. They report an exploratory correlation between the aperiodic intercept and Glx in one out of three EEG conditions.

      The authors report the difference in E/I ratio, and interpret the lower E/I ratio as representing an adaptation to visual deprivation, which would have initially caused a higher E/I ratio. Although intriguing, the strength of evidence in support of this view is not strong. Amongst the limitations are the low sample size, a critical control cohort that could provide evidence for a higher E/I ratio in CC patients without recovered sight for example, and lower data quality in the control voxel.

      Strengths of study:

      How sensitive period experience shapes the developing brain is an enduring and important question in neuroscience. This question has been particularly difficult to investigate in humans. The authors recruited a small number of sight-recovered participants with bilateral congenital cataracts to investigate the effect of sensitive period deprivation on the balance of excitation and inhibition in the visual brain using measures of brain chemistry and brain electrophysiology. The research is novel, and the paper was interesting and well-written.

      Limitations:

      (1.1) Low sample size. Ten for CC and ten for SC, and a further two SC participants were rejected due to a lack of frontal control voxel data. The sample size limits the statistical power of the dataset and increases the likelihood of effect inflation.

      Applying strict criteria, we only included individuals who were born with no patterned vision in the CC group. The population of individuals who have remained untreated past infancy is small in India, despite a higher prevalence of childhood cataract than in Germany. Indeed, from the original 11 CC and 11 SC participants tested, one participant each from the CC and SC group had to be rejected, as their data had been corrupted, resulting in 10 participants in each group.

      It was a challenge to recruit participants from this rare group with no history of neurological diagnosis/intake of neuromodulatory medications, who were able and willing to undergo both MRS and EEG. For this study, data collection took more than 2.5 years.

      We safeguarded the validity of our results in two ways. First, we assessed not just MRS, but additionally, EEG measures of E/I ratio. The latter allowed us to link results to a larger population of CC individuals, that is, we replicated the results of a larger group of 28 additional individuals (Ossandón et al., 2023) in our sub-group.

      Second, we included a control voxel. As predicted, all group effects were restricted to the occipital voxel.

      (1.2) Lack of specific control cohort. The control cohort has normal vision. The control cohort is not specific enough to distinguish between people with sight loss due to different causes and patients with congenital cataracts with co-morbidities. Further data from more specific populations, such as patients whose cataracts have not been removed, with developmental cataracts, or congenitally blind participants, would greatly improve the interpretability of the main finding. The lack of a more specific control cohort is a major caveat that limits a conclusive interpretation of the results.

      The existing work on visual deprivation and neurochemical changes, as assessed with MRS, has been limited to permanent congenital blindness. In fact, most of the studies on permanent blindness included only congenitally blind or early blind humans (Coullon et al., 2015; Weaver et al., 2013), or, in separate studies, only late-blind individuals (Bernabeu et al., 2009). Accordingly, we started with the most “extreme” visual deprivation model, sight recovery after congenital blindness. If we had not observed any group difference compared to normally sighted controls, investigating other groups might have been trivial. Based on our results, subsequent studies in late blind individuals, and then individuals with developmental cataracts, can be planned with clear hypotheses.

      (1.3) MRS data quality differences. Data quality in the control voxel appears worse than in the visual cortex voxel. The frontal cortex MRS spectrum shows far broader linewidth than the visual cortex (Supplementary Figures). Compared to the visual voxel, the frontal cortex voxel has less defined Glx and GABA+ peaks; lower GABA+ and Glx concentrations, lower NAA SNR values; lower NAA concentrations. If the data quality is a lot worse in the FC, then small effects may not be detectable.

      Worse data quality in the frontal than the visual cortex has been repeatedly observed in the MRS literature, attributable to magnetic field distortions (Juchem & Graaf, 2017) resulting from the proximity of the region to the sinuses (recent example: (Rideaux et al., 2022)). Nevertheless, we chose the frontal control region rather than a parietal voxel, given the potential neurochemical changes in multisensory regions of the parietal cortex due to blindness. Such reorganization would be less likely in frontal areas associated with higher cognitive functions. Further, prior MRS studies of the visual cortex have used the frontal cortex as a control region as well (Pitchaimuthu et al., 2017; Rideaux et al., 2022). In the revised manuscript, we more explicitly inform the reader about this data quality difference between regions in the Methods (Pages 11-12, MRS Data Quality/Table 2) and Discussion (Page 25, Lines 644- 647).

      Importantly, while in the present study data quality differed between the frontal and visual cortex voxel, it did not differ between groups (Supplementary Material S6).  

      Further, we checked that the frontal cortex datasets for Glx and GABA+ concentrations were of sufficient quality: the fit error was below 8.31% in both groups (Supplementary Material S3). For reference, Mikkelsen et al. reported a mean GABA+ fit error of 6.24 +/- 1.95% from a posterior cingulate cortex voxel across 8 GE scanners, using the Gannet pipeline. No absolute cutoffs have been proposed for fit errors. However, MRS studies in special populations (I/E ratio assessed in narcolepsy (Gao et al., 2024), GABA concentration assessed in Autism Spectrum Disorder (Maier et al., 2022)) have used frontal cortex data with a fit error of <10% to identify differences between cohorts (Gao et al., 2024; Pitchaimuthu et al., 2017). Based on the literature, MRS data from the frontal voxel of the present study would have been of sufficient quality to uncover group differences.

      In the revised manuscript, we added the recently published MRS quality assessment form to the supplementary materials (Supplementary Excel File S1). Additionally, we would like to point to our a priori prediction of group differences for the visual cortex, but not for the frontal cortex voxel. Finally, EEG data quality did not differ between frontal and occipital electrodes; therefore, lower sensitivity of frontal measures cannot easily explain the lack of group differences for frontal measures.

      (1.4) Because of the direction of the difference in E/I, the authors interpret their findings as representing signatures of sight improvement after surgery without further evidence, either within the study or from the literature. However, the literature suggests that plasticity and visual deprivation drive the E/I index up rather than down. Decreasing GABA+ is thought to facilitate experience-dependent remodelling. What evidence is there that cortical inhibition increases in response to a visual cortex that is over-sensitised due to congenital cataracts? Without further experimental or literature support this interpretation remains very speculative.

      Indeed, higher inhibition was not predicted, which we attempt to reconcile in our discussion section. We base our discussion mainly on the non-human animal literature, which has shown evidence of homeostatic changes after prolonged visual deprivation in the adult brain (Barnes et al., 2015). It is also interesting to note that after monocular deprivation in adult humans, resting GABA+ levels decreased in the visual cortex (Lunghi et al., 2015). Assuming that after delayed sight restoration, adult neuroplasticity mechanisms must be employed, these studies would predict a “balancing” of the increased excitatory drive following sight restoration by a commensurate increase in inhibition (Keck et al., 2017). Additionally, the EEG results of the present study allowed for speculation regarding the underlying neural mechanisms of an altered E/I ratio. The aperiodic EEG activity suggested higher spontaneous spiking (increased intercept) and increased inhibition (steeper aperiodic slope between 1-20 Hz) in CC vs SC individuals (Ossandón et al., 2023).

      In the revised manuscript, we have more clearly indicated that these speculations are based primarily on non-human animal work, due to the lack of human studies on the subject (Page 23, Lines 609-613).

      (1.5) Heterogeneity in the patient group. Congenital cataract (CC) patients experienced a variety of duration of visual impairment and were of different ages. They presented with co-morbidities (absorbed lens, strabismus, nystagmus). Strabismus has been associated with abnormalities in GABAergic inhibition in the visual cortex. The possible interactions with residual vision and confounds of co-morbidities are not experimentally controlled for in the correlations, and not discussed.

      The goal of the present study was to assess whether we would observe changes in E/I ratio after restoring vision at all. We would not have included patients without nystagmus in the CC group of the present study, since it would have been unlikely that they experienced congenital patterned visual deprivation. Amongst diagnosticians, nystagmus or strabismus might not be considered genuine “comorbidities” that emerge in people with congenital cataracts. Rather, these are consequences of congenital visual deprivation, which we employed as diagnostic criteria. Similarly, absorbed lenses are clear signs that cataracts were congenital. As in other models of experience-dependent brain development (e.g. the extant literature on congenital permanent blindness, including anophthalmic individuals (Coullon et al., 2015; Weaver et al., 2013)), some uncertainty remains regarding whether the (remaining, in our case) abnormalities of the eye, or the blindness they caused, are the factors driving neural changes. In the case of people with reversed congenital cataracts, at least the retina is considered to be intact, as they would otherwise not receive cataract removal surgery.

      However, we consider it unlikely that strabismus caused the group differences, because the present study shows group differences in the Glx/GABA+ ratio at rest, regardless of eye opening or eye closure, for which strabismus would have caused distinct effects. By contrast, the link between GABA concentration and, for example, interocular suppression in strabismus, has so far been documented during visual stimulation (Mukerji et al., 2022; Sengpiel et al., 2006), and differed in direction depending on the amblyopic vs. non-amblyopic eye. Further, one MRS study did not find group differences in GABA concentration between the visual cortices of 16 amblyopic individuals and sighted controls (Mukerji et al., 2022), supporting that the differences in Glx/GABA+ concentration which we observed were driven by congenital deprivation, and not amblyopia-associated visual acuity or eye movement differences.

      In the revised manuscript, we discussed the inclusion criteria in more detail, and the aforementioned reasons why our data remains interpretable (Page 5, Lines 143 – 145, Lines 147-149). 

      (1.6) Multiple exploratory correlations were performed to relate MRS measures to visual acuity (shown in Supplementary Materials), and only specific ones were shown in the main document. The authors describe the analysis as exploratory in the 'Methods' section. Furthermore, the correlation between visual acuity and E/I metric is weak, and not corrected for multiple comparisons. The results should be presented as preliminary, as no strong conclusions can be made from them. They can provide a hypothesis to test in a future study.

      In the revised manuscript, we have clearly indicated that the exploratory correlation analyses are reported to put forth hypotheses for future studies (Page 4, Lines 118-128; Page 5, Lines 132-134; Page 25, Lines 644- 647).

      (1.7) P.16 Given the correlation of the aperiodic intercept with age ("Age negatively correlated with the aperiodic intercept across CC and SC individuals, that is, a flattening of the intercept was observed with age"), age needs to be controlled for in the correlation between neurochemistry and the aperiodic intercept. Glx has also been shown to negatively correlate with age.

      The correlation between chronological age and aperiodic intercept was observed across groups, but the correlation between Glx and the intercept of the aperiodic EEG activity was seen only in the CC group, even though the SC group was matched for age. Thus, such a correlation was very unlikely to be predominantly driven by an effect of chronological age.

      In the revised manuscript, we added the linear regressions with age as a covariate (Supplementary Material S16, referred to in the main Results, Page 21, Lines 534-537), demonstrating the significant relationship between aperiodic intercept and Glx concentration in the CC group. 
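      As an illustration only (not the analysis reported in Supplementary Material S16), the logic of controlling for age can be sketched with a partial correlation. This is a swapped-in technique: the partial correlation between two variables given a third tests the same question as the Glx coefficient in a linear regression of the aperiodic intercept on Glx with age as a covariate. The function names and all numeric values below are hypothetical and are not data from the study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(x, y, z):
    """Correlation of x and y after linearly removing z from both."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical values for illustration only (NOT data from the study).
age = [8, 11, 14, 17, 20, 24, 28, 33, 38, 43]
glx = [9.1, 8.7, 9.4, 8.2, 8.9, 8.0, 8.4, 7.6, 7.9, 7.3]
intercept = [1.9, 1.6, 2.0, 1.3, 1.7, 1.2, 1.5, 0.9, 1.1, 0.8]

print(f"raw r(Glx, intercept)       = {pearson_r(glx, intercept):.2f}")
print(f"partial r controlling age   = {partial_r(glx, intercept, age):.2f}")
```

      If the raw and the age-adjusted coefficients are similar, age is unlikely to be the main driver of the association; a large drop would point the other way.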

      (1.8) Multiple exploratory correlations were performed to relate MRS to EEG measures (shown in Supplementary Materials), and only specific ones were shown in the main document. Given the multiple measures from the MRS, the correlations with the EEG measures were exploratory, as stated in the text, p.16, and in Figure 4. Yet the introduction said that there was a prior hypothesis "We further hypothesized that neurotransmitter changes would relate to changes in the slope and intercept of the EEG aperiodic activity in the same subjects." It would be great if the text could be revised for consistency and the analysis described as exploratory.

      In the revised manuscript, we improved the phrasing (Page 5, Lines 130-132) and consistently reported the correlations as exploratory in the Methods and Discussion. We consider the correlation analyses as exploratory due to our sample size and the absence of prior work. However, we did hypothesize that both MRS and EEG markers would concurrently be altered in CC vs SC individuals.

      (1.9) The analysis for the EEG needs to take more advantage of the available data. As far as I understand, only two electrodes were used, yet far more were available as seen in their previous study (Ossandon et al., 2023). The spatial specificity is not established. The authors could use the frontal cortex electrode (FP1, FP2) signals as a control for spatial specificity in the group effects, or even better, all available electrodes and correct for multiple comparisons. Furthermore, they could use the aperiodic intercept vs Glx in SC to evaluate the specificity of the correlation to CC.

      The aperiodic intercept and slope did not differ between CC and SC individuals for Fp1 and Fp2, suggesting the spatial specificity of the results. In the revised manuscript, we added this analysis to the Supplementary Material (Supplementary Material S14) and referred to it in our Results (Page 20, Lines 513-514).

      Further, Glx concentration in the visual cortex did not correlate with the aperiodic intercept in the SC group (Figure 4), suggesting that this relationship was indeed specific to the CC group.

      The data from all electrodes has been analyzed and published in other studies as well (Pant et al., 2023; Ossandón et al., 2023). 

      Reviewer #2 (Public Review):

      Summary:

      The manuscript reports non-invasive measures of activity and neurochemical profiles of the visual cortex in congenitally blind patients who recovered vision through the surgical removal of bilateral dense cataracts. The declared aim of the study is to find out how restoring visual function after several months or years of complete blindness impacts the balance between excitation and inhibition in the visual cortex.

      Strengths:

      The findings are undoubtedly useful for the community, as they contribute towards characterising the many ways this special population differs from normally sighted individuals. The combination of MRS and EEG measures is a promising strategy to estimate a fundamental physiological parameter - the balance between excitation and inhibition in the visual cortex, which animal studies show to be heavily dependent upon early visual experience. Thus, the reported results pave the way for further studies, which may use a similar approach to evaluate more patients and control groups.

      Weaknesses:

      (2.1) The main issue is the lack of an appropriate comparison group or condition to delineate the effect of sight recovery (as opposed to the effect of congenital blindness). Few previous studies suggested an increased excitation/Inhibition ratio in the visual cortex of congenitally blind patients; the present study reports a decreased E/I ratio instead. The authors claim that this implies a change of E/I ratio following sight recovery. However, supporting this claim would require showing a shift of E/I after vs. before the sight-recovery surgery, or at least it would require comparing patients who did and did not undergo the sight-recovery surgery (as common in the field).

      Longitudinal studies would indeed be the best way to test the hypothesis that the lower E/I ratio in the CC group observed by the present study is a consequence of sight restoration.

      We have now explicitly stated this in the Limitations section (Page 25, Lines 654-655).

      However, longitudinal studies involving neuroimaging are an effortful challenge, particularly in research conducted outside of major developed countries and dedicated neuroimaging research facilities. Crucially, however, had CC and SC individuals, as well as permanently congenitally blind vs SC individuals (Coullon et al., 2015; Weaver et al., 2013), not differed on any neurochemical markers, such a longitudinal study might have been trivial. Thus, in order to justify and better tailor longitudinal studies, cross-sectional studies are an initial step.

      (2.2) MR Spectroscopy shows a reduced GLX/GABA ratio in patients vs. sighted controls; however, this finding remains rather isolated, not corroborated by other observations. The difference between patients and controls only emerges for the GLX/GABA ratio, but there is no accompanying difference in either the GLX or the GABA concentrations. There is an attempt to relate the MRS data with acuity measurements and electrophysiological indices, but the explorative correlational analyses do not help to build a coherent picture. A bland correlation between GLX/GABA and visual impairment is reported, but this is specific to the patients' group (N=10) and would not hold across groups (the correlation is positive, predicting the lowest GLX/GABA ratio values for the sighted controls - the opposite of what is found). There is also a strong correlation between GLX concentrations and the EEG power at the lowest temporal frequencies. Although this relation is intriguing, it only holds for a very specific combination of parameters (of the many tested): only with eyes open, only in the patient group.

      We interpret these findings differently, that is, in the context of experiments from non-human animals and the larger MRS literature (Page 23, Lines 609-611).

      Homeostatic control of E/I balance assumes that the ratio of excitation (reflected here by Glx) and inhibition (reflected here by GABA+) is regulated. Like prior work (Gao et al., 2024, 2024; Narayan et al., 2022; Perica et al., 2022; Steel et al., 2020; Takado et al., 2022; Takei et al., 2016), we assumed that the ratio of Glx/GABA+ is indicative of E/I balance rather than solely the individual neurotransmitter levels. One of the motivations for assessing the ratio vs the absolute concentration is that as per the underlying E/I balance hypothesis, a change in excitation would cause a concomitant change in inhibition, and vice versa, which has been shown in non-human animal work (Fang et al., 2021; Haider et al., 2006; Tao & Poo, 2005) and modeling research (Vreeswijk & Sompolinsky, 1996; Wu et al., 2022). Importantly, our interpretation of the lower E/I ratio is not just from the Glx/GABA+ ratio, but additionally, based on the steeper EEG aperiodic slope (1-20 Hz). 

      As stated in the Discussion section and Response 1.4, we did not expect to see a lower Glx/GABA+ ratio in CC individuals. We discuss the possible reasons for the direction of the correlation with visual acuity and aperiodic offset during passive visual stimulation, and offer interpretations and (testable) hypotheses.

      We interpret the direction of the Glx/GABA+ correlation with visual acuity to imply that patients with highest (compensatory) balancing of the consequences of congenital blindness (hyperexcitation), in light of visual stimulation, are those who recover best. Note, the sighted control group was selected based on their “normal” vision. Thus, clinical visual acuity measures are not expected to sufficiently vary, nor have the resolution to show strong correlations with neurophysiological measures. By contrast, the CC group comprised patients highly varying in visual outcomes, and thus were ideal to investigate such correlations.

      This holds for the correlation between Glx and the aperiodic intercept, as well. Previous work has suggested that the intercept of the aperiodic activity is associated with broadband spiking activity in neural circuits (Manning et al., 2009). Thus, an atypical increase of spiking activity during visual stimulation, as indirectly suggested by “old” non-human primate work on visual deprivation (Hyvärinen et al., 1981) might drive a correlation not observed in healthy populations.

      In the revised manuscript, we have more clearly indicated in the Discussion that these are possible post-hoc interpretations (Page 23, Lines 584-586; Page 24, Lines 609-620; Page 24, Lines 644-647; Page 25, Lines 650 - 657). We argue that given the lack of such studies in humans, it is all the more important that extant data be presented completely, even if the direction of the effects is not as expected.

      (2.3) For these reasons, the reported findings do not allow us to draw firm conclusions on the relation between EEG parameters and E/I ratio or on the impact of early (vs. late) visual experience on the excitation/inhibition ratio of the human visual cortex.

      Indeed, the correlations we have tested between the E/I ratio and EEG parameters were exploratory, and have been reported as such.

      We have now made this clear in all the relevant parts of the manuscript (Introduction, Page 5, Lines 132-135; Methods, Page 16, Line 415; Results, Page 21, Figure 4; Discussion, Page 22, Line 568, Page 25, Lines 644-645, Page 25, Lines 650-657).

      The goal of our study was not to compare the effects of early vs. late visual experience. The goal was to study whether early visual experience is necessary for a typical E/I ratio in visual neural circuits. We provided clear evidence in favor of this hypothesis. Thus, the present results suggest the necessity of investigating the effects of late visual deprivation. In fact, such research is missing in permanent blindness as well.

      Reviewer #3 (Public Review):

      This manuscript examines the impact of congenital visual deprivation on the excitatory/inhibitory (E/I) ratio in the visual cortex using Magnetic Resonance Spectroscopy (MRS) and electroencephalography (EEG) in individuals whose sight was restored. Ten individuals with reversed congenital cataracts were compared to age-matched, normally sighted controls, assessing the cortical E/I balance and its interrelationship to visual acuity. The study reveals that the Glx/GABA ratio in the visual cortex and the intercept and aperiodic signal are significantly altered in those with a history of early visual deprivation, suggesting persistent neurophysiological changes despite visual restoration.

      My expertise is in EEG (particularly in the decomposition of periodic and aperiodic activity) and statistical methods. I have several major concerns in terms of methodological and statistical approaches along with the (over)interpretation of the results. These major concerns are detailed below.

      (3.1) Variability in visual deprivation:

      - The document states a large variability in the duration of visual deprivation (probably also the age at restoration), with significant implications for the sensitivity period's impact on visual circuit development. The variability and its potential effects on the outcomes need thorough exploration and discussion.

      We work with a rare, unique patient population, which makes it difficult to systematically assess the effects of different visual histories while maintaining stringent inclusion criteria such as complete patterned visual deprivation at birth. Regardless, we considered the large variance in age at surgery and time since surgery as supportive of our interpretation: group differences were found despite the large variance in duration of visual deprivation. Moreover, the existing variance was used to explore possible associations between behavior and neural measures, as well as neurochemical and EEG measures.

      In the revised manuscript, we have detailed the advantages (Methods, Page 5, Lines 143 – 145, Lines 147-149; Discussion, Page 26, Lines 677-678) and disadvantages (Discussion, Page 25, Lines 650-657) of our CC sample, with respect to duration of congenital visual deprivation.

      (3.2) Sample size:

      - The small sample size is a major concern as it may not provide sufficient power to detect subtle effects and/or overestimate significant effects, which then tend not to generalize to new data. One of the biggest drivers of the replication crisis in neuroscience.

      We address the small sample size in our Discussion, and make clear that small sample sizes were due to the nature of investigations in special populations. In the revised manuscript, we added the sample sizes of previous studies using MRS in permanently blind individuals (Page 4, Lines 108 - 109). It is worth noting that our EEG results fully align with those of larger samples of congenital cataract reversal individuals (Page 25, Lines 666-676, Supplementary Material S18, S19) (Ossandón et al., 2023), providing us confidence about their validity and reproducibility. Moreover, our MRS results and correlations of those with EEG parameters were spatially specific to occipital cortex measures.

      The main problem with the correlation analyses between MRS and EEG measures is that the sample size is simply too small to conduct such an analysis. Moreover, it is unclear from the methods section that this analysis was only conducted in the patient group (which the reviewer assumed from the plots), and not explained why this was done only in the patient group. I would highly recommend removing these correlation analyses.

      In the revised manuscript, we have more clearly marked the correlation analyses as exploratory (Introduction, Page 4, Lines 118-128 and Page 5, Lines 132-134; Methods Page 16, Line 415; Discussion Page 22, Line 568, Page 24, Lines 644-645, Page 25, Lines 650-657); note that we do not base most of our discussion on the results of these analyses.

      As indicated by Reviewer 1, reporting them allows for deriving more precise hypotheses for future studies. It has to be noted that we investigate an extremely rare population, tested outside of major developed economies and dedicated neuroimaging research facilities. In addition to being a rare patient group, these individuals come from poor communities. Therefore, we consider it justified to report these correlations as exploratory, providing direction for future research.

      (3.3) Statistical concerns:

      - The statistical analyses, particularly the correlations drawn from a small sample, may not provide reliable estimates (see https://www.sciencedirect.com/science/article/pii/S0092656613000858, which clearly describes this problem).

      It would undoubtedly be better to have a larger sample size. We nonetheless think it is of value to the research community to publish this dataset, since 10 multimodal data sets from a carefully diagnosed, rare population, representing a human model for the effects of early experience on brain development, are quite a lot. Sample sizes in prior neuroimaging studies in transient blindness have most often ranged from n = 1 to n = 10. They nevertheless provided valuable direction for future research, and integration of results across multiple studies provides scientific insights. 

      Identifying possible group differences was the goal of our study, with the correlations being an exploratory analysis, which we have clearly indicated in the methods, results and discussion.

      - Statistical analyses for the MRS: The authors should consider some additional permutation statistics, which are more suitable for small sample sizes. The current statistical model (2x2) design ANOVA is not ideal for such small sample sizes. Moreover, it is unclear why the condition (EO & EC) was chosen as a predictor and not the brain region (visual & frontal) or neurochemicals. Finally, the authors did not provide any information on the alpha level nor any information on correction for multiple comparisons (in the methods section). Finally, even if the groups are matched w.r.t. age, the time between surgery and measurement, the duration of visual deprivation, (and sex?), these should be included as covariates as it has been shown that these are highly related to the measurements of interest (especially for the EEG measurements) and the age range of the current study is large.

      In our ANOVA models, the neurochemicals were the outcome variables, and the conditions were chosen as predictors based on prior work suggesting that Glx/GABA+ might vary with eye closure (Kurcyus et al., 2018). The study was designed based on a hypothesis of group differences localized to the occipital cortex, due to visual deprivation. The frontal cortex voxel was chosen to indicate whether these differences were spatially specific. Therefore, we conducted separate ANOVAs based on this study design.

      We have now clarified the motivation for these conditions in the Introduction (Page 4, Lines 122-125) and the Methods (Page 9, Lines 219-224).

      In the revised manuscript, we added the rationale for parametric analyses for our outcomes (Shapiro-Wilk as well as Levene’s tests, Supplementary Material S9). Note that in the Supplementary Materials (S12, S14), we have reported the correlations between visual history metrics and MRS/EEG outcomes, thereby investigating whether the variance in visual history might have driven these results. Specifically, we found a (negative) correlation between visual cortex Glx/GABA+ concentration during eye closure and the visual acuity in the CC group (Figure 2c). None of the other exploratory correlations between MRS/EEG outcomes vs time since surgery, duration of blindness or visual acuity were significant in the CC group (Supplementary Material S12, S15).  
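      For readers unfamiliar with the permutation statistics raised in the reviewer's comment, the following is a minimal sketch of a two-sided permutation test for a difference in group means. It is an illustration under stated assumptions, not an analysis from the manuscript: the function name, the number of permutations, and the example ratios are all hypothetical.

```python
import random

def permutation_test(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical Glx/GABA+ ratios for illustration only (NOT study data).
cc_ratio = [2.6, 2.9, 3.1, 2.7, 3.3, 2.8, 3.0, 3.6, 2.5, 3.2]
sc_ratio = [3.4, 3.1, 3.8, 3.5, 3.0, 3.9, 3.3, 3.7, 3.2, 3.6]
print(f"permutation p = {permutation_test(cc_ratio, sc_ratio):.4f}")
```

      The test makes no normality assumption, which is why it is often suggested for small samples; it covers only a single two-group contrast and is not a substitute for the factorial ANOVA design.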

      The alpha level used for the ANOVA models specified in the Methods section was 0.05. The alpha level for the exploratory analyses reported in the main manuscript was 0.008 (0.05/6), after Bonferroni correction for six comparisons, as also specified in the Methods. Note that the corrected p-values are reported as the uncorrected values multiplied by 6, because most readers assume an alpha level of 0.05 (see response regarding large p-values).

      We used a control group matched for age, recruited and tested in the same institutes, using the same setup. We feel that we followed the gold standards for recruiting a healthy control group for a patient group.

      - EEG statistical analyses: The same critique as for the MRS statistical analyses applies to the EEG analysis. In addition: was the 2x3 ANOVA conducted for EO and EC independently? This seems to be inconsistent with the approach in the MRS analyses, in which the authors chose EO & EC as predictors in their 2x2 ANOVA.

      The 2x3 ANOVA was not conducted independently for the eyes open/eyes closed condition. The ANOVA conducted on the EEG metrics was 2x3 because it had two groups (CC, SC) and three conditions (eyes open (EO), eyes closed (EC) and visual stimulation (LU)) as predictors.

      - Figure 4: The authors report a p-value of >0.999 with a correlation coefficient of -0.42 with a sample size of 10 subjects. This can't be correct (it should be around: p = 0.22). All statistical analyses should be checked.

      As specified in the Methods and Figure legend, the reported p values in Figure 4 have been corrected using the Bonferroni correction, and therefore multiplied by the number of comparisons, leading to the seemingly large values.
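
      The arithmetic behind the seemingly large value can be checked with a minimal sketch. It assumes a two-sided test of a Pearson correlation and six comparisons; the helper names are hypothetical, and the t-distribution tail is integrated numerically with the standard library only.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Probability density of Student's t distribution."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def pearson_p_value(r: float, n: int, steps: int = 10_000) -> float:
    """Two-sided p-value for a Pearson correlation r observed in n subjects."""
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Composite Simpson's rule for the integral of the density over [0, |t|].
    h = t / steps
    area = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return 1 - 2 * area


def bonferroni(p: float, n_comparisons: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_comparisons)


p_raw = pearson_p_value(r=-0.42, n=10)
p_adj = bonferroni(p_raw, n_comparisons=6)
print(f"uncorrected p = {p_raw:.3f}, corrected p = {p_adj:.3f}")
```

      The uncorrected value lands close to the reviewer's estimate, and multiplying it by 6 exceeds 1, so the adjusted value is capped at 1 and reported as > 0.999.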

      Additionally, to check all statistical analyses, we put the manuscript through an independent Statistics Check (Nuijten & Polanin, 2020) (https://michelenuijten.shinyapps.io/statcheck-web/) and have uploaded the consistency report with the revised Supplementary Material (Supplementary Report 1).

      - Figure 2c. Eyes closed condition: The highest score of the Glx/GABA ratio seems to be ~3.6. In subplot 2a, there seem to be 3 subjects that show a Glx/GABA ratio score > 3.6. How can this be explained? There is also a discrepancy for the eyes-closed condition.

      The three subjects that show the Glx/GABA+ ratio > 3.6 in subplot 2a are in the SC group, whereas the correlations plotted in figure 2c are only for the CC group, where the highest score is indeed ~3.6.

      (3.4) Interpretation of aperiodic signal:

      - Several recent papers demonstrated that the aperiodic signal measured in EEG or ECoG is related to various important aspects such as age, skull thickness, electrode impedance, as well as cognition. Thus, currently, very little is known about the underlying effects which influence the aperiodic intercept and slope. The entire interpretation of the aperiodic slope as a proxy for E/I is based on a computational model and simulation (as described in the Gao et al. paper).

      Apart from the modeling work of Gao et al., multiple papers that we have also cited used ECoG, EEG and MEG and showed concomitant changes in aperiodic activity with pharmacological manipulation of the E/I ratio (Colombo et al., 2019; Molina et al., 2020; Muthukumaraswamy & Liley, 2018). Further, several prior studies have interpreted changes in the aperiodic slope as reflective of changes in the E/I ratio, including studies of developmental groups (Favaro et al., 2023; Hill et al., 2022; McSweeney et al., 2023; Schaworonkow & Voytek, 2021) as well as patient groups (Molina et al., 2020; Ostlund et al., 2021).

      In the revised manuscript, we have cited those studies not already included in the Introduction (Page 3, Lines 92-94).

      - Especially the aperiodic intercept is a very sensitive measure to many influences (e.g. skull thickness, electrode impedance...). As crucial results (correlation aperiodic intercept and MRS measures) are facing this problem, this needs to be reevaluated. It is safer to make statements on the aperiodic slope than intercept. In theory, some of the potentially confounding measures are available to the authors (e.g. skull thickness can be computed from T1w images; electrode impedances are usually acquired alongside the EEG data) and could be therefore controlled.

      All electrophysiological measures indeed depend on parameters such as skull thickness and electrode impedance. As in the extant literature using neurophysiological measures to compare brain function between patient and control groups, we used a control group matched in age/sex, recruited in the same region, tested with the same devices, and analyzed with the same analysis pipeline. For example, impedance was kept below 10 kOhm for all subjects.

      This is now mentioned in the Methods, Page 13, Line 344.

      There is no evidence available suggesting that congenital cataracts are associated with changes in skull thickness that would cause the observed pattern of group results. Moreover, we cannot think of how any of the exploratory correlations between neurophysiological measures and MRS measures could be accounted for by a difference e.g. in skull thickness.

      - The authors wrote: "Higher frequencies (such as 20-40 Hz) have been predominantly associated with local circuit activity and feedforward signaling (Bastos et al., 2018; Van Kerkoerle et al., 2014); the increased 20-40 Hz slope may therefore signal increased spontaneous spiking activity in local networks. We speculate that the steeper slope of the aperiodic activity for the lower frequency range (1-20 Hz) in CC individuals reflects the concomitant increase in inhibition." The authors confuse the interpretation of periodic and aperiodic signals. This section refers to the interpretation of the periodic signal (higher frequencies). This interpretation cannot simply be translated to the aperiodic signal (slope).

      Prior work has not always separated the aperiodic and periodic components, making it unclear what might have driven these effects in our data. The interpretation of the higher frequency range was intended to contrast with the interpretations of lower frequency range, in order to speculate as to why the two aperiodic fits might go in differing directions. Note that Ossandón et al. reported highly similar results (group differences for CC individuals and for permanently congenitally blind humans) for the aperiodic activity between 20-40 Hz and oscillatory activity in the gamma range.

      In the revised Discussion, we removed this section. We primarily interpret the increased offset and prior findings from fMRI-BOLD data (Raczy et al., 2023) as an increase in broadband neuronal firing.

      - The authors further wrote: We used the slope of the aperiodic (1/f) component of the EEG spectrum as an estimate of E/I ratio (Gao et al., 2017; Medel et al., 2020; Muthukumaraswamy & Liley, 2018). This is a highly speculative interpretation with very little empirical evidence. These papers were conducted with ECoG data (mostly in animals) and mostly under anesthesia. Thus, these studies only allow an indirect interpretation by what the 1/f slope in EEG measurements is actually influenced.

      Note that Muthukumaraswamy & Liley (2018) used different types of pharmacological manipulations and analyzed periodic and aperiodic MEG activity in humans, in addition to monkey ECoG. Further, Medel et al. (now published as Medel et al., 2023) compared EEG activity in addition to ECoG data after propofol administration. The interpretation of our results is in line with a number of recent studies in developing (Hill et al., 2022; Schaworonkow & Voytek, 2021) and special populations using EEG. As mentioned above, several prior studies have used the slope of the 1/f component/aperiodic activity as an indirect measure of the E/I ratio (Favaro et al., 2023; Hill et al., 2022; McSweeney et al., 2023; Molina et al., 2020; Ostlund et al., 2021; Schaworonkow & Voytek, 2021), including studies using scalp-recorded EEG from humans.

      In the introduction of the revised manuscript, we have made more explicit that this metric is indirect (Page 3, Line 91), (additionally see Discussion, Page 24, Lines 644-645, Page 25, Lines 650-657).

      While a full understanding of aperiodic activity has yet to be achieved, some convergent ideas have emerged. We think that our results contribute to this enterprise, since our study is, to the best of our knowledge, the first to assess both MRS-measured neurotransmitter levels and EEG aperiodic activity.

      (3.5) Problems with EEG preprocessing and analysis:

      - It seems that the authors did not identify bad channels nor address the line noise issue (even a problem if a low pass filter of below-the-line noise was applied).

      As pointed out in the Methods and Figure 1, we only analyzed data from two occipital channels, O1 and O2, neither of which was rejected for any participant. Channel rejection was performed for the larger dataset, published elsewhere (Ossandón et al., 2023; Pant et al., 2023). As control sites, we added the frontal channels Fp1 and Fp2 (see Supplementary Material S14).

      Neither Ossandón et al. (2023) nor Pant et al. (2023) considered frequency ranges above 40 Hz to avoid any possible contamination with line noise. Here, we focused on activity between 0 and 20 Hz, definitely excluding line noise contaminations (Methods, Page 14, Lines 365-367). The low pass filter (FIR, 1-45 Hz) guaranteed that any spill-over effects of line noise would be restricted to frequencies just below the upper cutoff frequency.

      Additionally, a prior version of the analysis used spectrum interpolation to remove line noise; the group differences remained stable (Ossandón et al., 2023). We have reported this analysis in the revised manuscript (Page 14, Lines 364-357).

      Further, both groups were measured in the same lab, making line noise (~ 50 Hz) as an account for the observed group effects in the 1-20 Hz frequency range highly unlikely. Finally, any of the exploratory MRS-EEG correlations would be hard to explain if the EEG parameters would be contaminated with line noise.

      - What was the percentage of segments that needed to be rejected due to the 120μV criteria? This should be reported specifically for EO & EC and controls and patients.

      The mean percentage of 1-second segments rejected for each resting state condition and the percentage of 6.25-second segments rejected in each group for the visual stimulation condition have been added to the revised manuscript (Supplementary Material S10) and are referred to in the Methods (Page 14, Lines 372-373).

      - The authors downsampled the data to 60Hz to "to match the stimulation rate". What is the intention of this? Because the subsequent spectral analyses are conflated by this choice (see Nyquist theorem).

      These data were collected as part of a study designed to evoke alpha activity with visual white noise, which changed in luminance with equal power at all frequencies from 1-60 Hz, restricted by the refresh rate of the monitor on which stimuli were presented (Pant et al., 2023). This paradigm and method were developed by VanRullen and colleagues (Schwenk et al., 2020; VanRullen & MacDonald, 2012), wherein the analysis requires the same sampling rate for the presented frequencies and the EEG data. The downsampling function used here automatically applies an anti-aliasing filter (EEGLAB 2019).
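
      The reviewer's Nyquist concern, and the role of an anti-aliasing filter, can be illustrated with a minimal sketch. The function names are hypothetical and the calculation is the textbook folding rule, not a description of the EEGLAB implementation.

```python
def nyquist(sampling_rate_hz: float) -> float:
    """Highest frequency representable at a given sampling rate."""
    return sampling_rate_hz / 2


def aliased_frequency(signal_hz: float, sampling_rate_hz: float) -> float:
    """Apparent frequency of a sinusoid sampled without an anti-aliasing filter."""
    nearest_multiple = round(signal_hz / sampling_rate_hz) * sampling_rate_hz
    return abs(signal_hz - nearest_multiple)


fs = 60.0
print(nyquist(fs))                  # upper limit of the spectrum at 60 Hz sampling
print(aliased_frequency(50.0, fs))  # where unfiltered 50 Hz activity would fold to
print(aliased_frequency(20.0, fs))  # content below Nyquist keeps its frequency
```

      At a 60 Hz sampling rate the 1-20 Hz range analyzed here lies below the 30 Hz Nyquist limit. Activity above that limit must be attenuated before downsampling, because it would otherwise fold back into the analyzed band.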

      - "Subsequently, baseline removal was conducted by subtracting the mean activity across the length of an epoch from every data point." The actual baseline time segment should be specified.

      The time segment was the length of the epoch, that is, 1 second for the resting state conditions and 6.25 seconds for the visual stimulation conditions. This has now been explicitly stated in the revised manuscript (Page 14, Lines 379-380).
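
      The segment rejection and baseline removal steps discussed in the two preceding responses can be sketched as follows. This sketch assumes that the 120 μV criterion is an absolute amplitude threshold and that rejection precedes baseline removal; both are assumptions for illustration, and the epochs are toy values rather than study data.

```python
def remove_baseline(epoch):
    """Subtract the mean over the whole epoch from every sample."""
    mean = sum(epoch) / len(epoch)
    return [x - mean for x in epoch]


def is_rejected(epoch, threshold_uv=120.0):
    """Flag an epoch if any sample exceeds the absolute amplitude threshold."""
    return any(abs(x) > threshold_uv for x in epoch)


def percent_rejected(epochs, threshold_uv=120.0):
    """Percentage of epochs flagged by the amplitude criterion."""
    rejected = sum(is_rejected(e, threshold_uv) for e in epochs)
    return 100.0 * rejected / len(epochs)


# Toy epochs in microvolts (hypothetical values, not study data).
epochs = [
    [10.0, 12.0, 8.0, 10.0],
    [5.0, 150.0, -20.0, 0.0],  # contains a large artifact
    [-30.0, -28.0, -32.0, -30.0],
    [1.0, 2.0, 3.0, 2.0],
]
print(percent_rejected(epochs))
clean = [remove_baseline(e) for e in epochs if not is_rejected(e)]
print([round(sum(e) / len(e), 6) for e in clean])
```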

      - "We excluded the alpha range (8-14 Hz) for this fit to avoid biasing the results due to documented differences in alpha activity between CC and SC individuals (Bottari et al., 2016; Ossandón et al., 2023; Pant et al., 2023)." This does not really make sense, as the FOOOF algorithm first fits the 1/f slope, for which the alpha activity is not relevant.

      We did not use the FOOOF algorithm/toolbox in this manuscript. As stated in the Methods, we used a 1/f fit to the 1-20 Hz spectrum in the log-log space, and subtracted this fit from the original spectrum to obtain the corrected spectrum. Given the pronounced difference in alpha power between groups (Bottari et al., 2016; Ossandón et al., 2023; Pant et al., 2023), we were concerned it might drive differences in the exponent values. Our analysis pipeline had been adapted from previous publications of our group and other labs (Ossandón et al., 2023; Voytek et al., 2015; Waschke et al., 2017).

      We have conducted the analysis with and without the exclusion of the alpha range, as well as using the FOOOF toolbox both in the 1-20 Hz and 20-40 Hz ranges (Ossandón et al., 2023). The findings of a steeper slope in the 1-20 Hz range as well as lower alpha power in CC vs SC individuals remained stable. In Ossandón et al., the comparison between the piecewise fits and FOOOF fits led the authors to use the former, as it outperformed the FOOOF algorithm for their data.
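
      The 1/f fit described above can be sketched as a least-squares line in log-log space that skips the alpha bins. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the subtraction is done in log power, and the spectrum is synthetic.

```python
import math


def fit_aperiodic(freqs, power, fit_range=(1.0, 20.0), exclude=(8.0, 14.0)):
    """Least-squares line through log10(power) vs log10(frequency).

    Returns (offset, slope); bins inside `exclude` are left out of the fit.
    """
    xs, ys = [], []
    for f, p in zip(freqs, power):
        in_range = fit_range[0] <= f <= fit_range[1]
        in_alpha = exclude[0] <= f <= exclude[1]
        if in_range and not in_alpha:
            xs.append(math.log10(f))
            ys.append(math.log10(p))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    offset = mean_y - slope * mean_x
    return offset, slope


def corrected_spectrum(freqs, power, offset, slope):
    """Log-power residual after subtracting the aperiodic fit."""
    return [math.log10(p) - (offset + slope * math.log10(f)) for f, p in zip(freqs, power)]


# Synthetic spectrum (hypothetical): 1/f^1.5 background plus an alpha bump at 10 Hz.
freqs = [float(f) for f in range(1, 21)]
power = [
    100.0 * f ** -1.5 + 5.0 * math.exp(-((f - 10.0) ** 2) / (2 * 0.8 ** 2))
    for f in freqs
]

offset, slope = fit_aperiodic(freqs, power)
residual = corrected_spectrum(freqs, power, offset, slope)
peak_freq = freqs[residual.index(max(residual))]
print(f"offset = {offset:.2f}, slope = {slope:.2f}, residual peak at {peak_freq:.0f} Hz")
```

      Because the alpha bins are excluded, the fit recovers the background exponent of the synthetic spectrum, and the alpha peak is retained in the residual.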

      - The model fits of the 1/f fitting for EO, EC, and both participant groups should be reported.

      In Figure 3 of the manuscript, we depicted the mean spectra and 1/f fits for each group.

      In the revised manuscript, we added the fit quality metrics (average R² values > 0.91 for each group and condition) (Methods Page 15, Lines 395-396; Supplementary Material S11) and additionally show individual subjects’ fits (Supplementary Material S11).
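
      For reference, the R² fit quality metric can be computed as in the minimal sketch below. It assumes the standard coefficient of determination between observed and fitted log power; the values are hypothetical and are not taken from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination between observed and fitted values."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical log-power values and the corresponding 1/f fit.
observed = [2.0, 1.55, 1.28, 1.10, 0.95]
fitted = [2.0, 1.5, 1.3, 1.1, 0.9]
r2 = r_squared(observed, fitted)
print(f"R^2 = {r2:.3f}")
```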

      (3.6) Validity of GABA measurements and results:

      - According to a newer study by the authors of the Gannet toolbox (https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/abs/10.1002/nbm.5076), the reliability and reproducibility of the gamma-aminobutyric acid (GABA) measurement can vary significantly depending on acquisition and modeling parameters. Thus, did the authors address these challenges?

      We took care of data quality while acquiring MRS data by ensuring appropriate voxel placement and linewidth prior to scanning (Page 9, Lines 229-237). We now address this explicitly in the Methods in the “MRS Data Quality” section. Acquisition as well as modeling parameters were constant for both groups, so they cannot have driven group differences.

      The linked article compares the reproducibility of GABA measurement using Osprey (Oeltzschner et al., 2020), which was released in 2020 and uses linear combination modeling to fit the peak, as opposed to Gannet’s simple peak fitting (Hupfeld et al., 2024). The study finds better test-retest reliability for Osprey compared to Gannet’s method.

      As the present work was conceptualized in 2018, we used Gannet 3.0, which was the state-of-the-art edited-spectrum analysis toolbox at the time, and still is widely used.

      In the revised manuscript, we re-analyzed the data using linear combination modeling with Osprey (Oeltzschner et al., 2020), and reported that the main findings remained the same, i.e. the Glx/GABA+ concentration ratio was lower in the visual cortex of congenital cataract reversal individuals compared to normally sighted controls, regardless of whether participants were scanned with eyes open or with eyes closed. Further, NAA concentration did not differ between groups (Supplementary Material S3). Thus, we demonstrate that our findings were robust to analysis pipelines, and state this in the Methods (Page 9, Lines 242-246) and Results (Page 19, Lines 464-467).

      - Furthermore, the authors wrote: "We confirmed the within-subject stability of metabolite quantification by testing a subset of the sighted controls (n=6) 2-4 weeks apart." Looking at the supplementary Figure 5 (which would be better plotted as ICC or Bland-Altman plots), the within-subject stability compared to between-subject variability seems not to be great. Furthermore, I don't think such a small sample size qualifies for a rigorous assessment of stability.

      Indeed, we did not intend to provide a rigorous assessment of within-subject stability. Rather, we aimed to confirm that data quality/concentration ratios did not systematically differ between the same subjects tested longitudinally; driven, for example, by scanner heating or time of day. As with the phantom testing, we attempted to give readers an idea of the quality of the data, as they were collected from a primarily clinical rather than a research site.

      In the revised manuscript, we have removed the statement regarding stability and the associated section.

      - "Why might an enhanced inhibitory drive, as indicated by the lower Glx/GABA ratio" Is this interpretation really warranted, as the results of the group differences in the Glx/GABA ratio seem to be rather driven by a decreased Glx concentration in CC rather than an increased GABA (see Figure 2).

      We used the Glx/GABA+ ratio as a measure, rather than individual Glx or GABA+ concentration, which did not significantly differ between groups. As detailed in Response 2.2, we think this metric aligns better with an underlying E/I balance hypothesis and has been used in many previous studies (Gao et al., 2024; Liu et al., 2015; Narayan et al., 2022; Perica et al., 2022).

      Our interpretation of an enhanced inhibitory drive additionally comes from the combination of aperiodic EEG (1-20 Hz) and MRS measures, which, when considered together, are consistent with a decreased E/I ratio.

      In the revised manuscript, we have rewritten the Discussion and removed this section.   

      - Glx concentration predicted the aperiodic intercept in CC individuals' visual cortices during ambient and flickering visual stimulation. Why specifically investigate the Glx concentration, when the paper is about E/I ratio?

      As stated in the methods, we exploratorily assessed the relationship between all MRS parameters (Glx, GABA+ and Glx/GABA+ ratio) with the aperiodic parameters (slope, offset), and corrected for multiple comparisons accordingly. We think this is a worthwhile analysis considering the rarity of the dataset/population (see 1.2, 1.6, 2.1 and Reviewer 1’s comments about future hypotheses). We only report the Glx – aperiodic intercept correlation in the main manuscript as it survived correction for multiple comparisons.

      (3.7) Interpretation of the correlation between MRS measurements and EEG aperiodic signal:

      - The authors wrote: "The intercept of the aperiodic activity was highly correlated with the Glx concentration during rest with eyes open and during flickering stimulation (also see Supplementary Material S11). Based on the assumption that the aperiodic intercept reflects broadband firing (Manning et al., 2009; Winawer et al., 2013), this suggests that the Glx concentration might be related to broadband firing in CC individuals during active and passive visual stimulation." These results should not be interpreted (or only with great caution) for several reasons (see also problem with influences on aperiodic intercept and small sample size). This is a result of the exploratory analyses of correlating every EEG parameter with every MRS parameter. This requires well-powered replication before any interpretation can be provided. Furthermore and importantly: why should this be specifically only in CC patients, but not in the SC control group?

      We have indicated clearly in all parts of the manuscript that these correlations are presented as exploratory. Further, we interpret the Glx-aperiodic offset correlation, and none of the others, as it survived the Bonferroni correction for multiple comparisons. We offer a hypothesis in the Discussion as to why such a correlation might exist in the CC but not the SC group (see response 2.2), and do not speculate further.

      (3.8) Language and presentation:

      - The manuscript requires language improvements and correction of numerous typos. Over-simplifications and unclear statements are present, which could mislead or confuse readers (see also interpretation of aperiodic signal).

      In the revised manuscript, we have checked that speculations are clearly marked, and typos are removed.

      - The authors state that "Together, the present results provide strong evidence for experience-dependent development of the E/I ratio in the human visual cortex, with consequences for behavior." The results of the study do not provide any strong evidence, because of the small sample size and exploratory analyses approach and not accounting for possible confounding factors.

      We disagree with this statement and allude to convergent evidence of both MRS and neurophysiological measures. The latter link to corresponding results observed in a larger sample of CC individuals (Ossandón et al., 2023). In the revised manuscript, we have rephrased the statement as “to provide initial evidence” (Page 22, Line 676).

      - "Our results imply a change in neurotransmitter concentrations as a consequence of *restoring* vision following congenital blindness." This is a speculative statement to infer a causal relationship on cross-sectional data.

      As mentioned under 2.1, we conducted a cross-sectional study which might justify future longitudinal work. To advance the field, we put forward new testable hypotheses at the end of the manuscript.

      In the revised manuscript, we rephrased the sentence and added “might imply” to better indicate the hypothetical character of this idea (Page 22, Lines 586-587).

      - In the limitation section, the authors wrote: "The sample size of the present study is relatively high for the rare population, but undoubtedly, overall, rather small." This sentence should be rewritten, as the study is plainly underpowered. The further justification "We nevertheless think that our results are valid. Our findings neurochemically (Glx and GABA+ concentration), and anatomically (visual cortex) specific. The MRS parameters varied with parameters of the aperiodic EEG activity and visual acuity. The group differences for the EEG assessments corresponded to those of a larger sample of CC individuals (n=38) (Ossandón et al., 2023), and effects of chronological age were as expected from the literature." These statements do not provide any validation or justification of small samples. Furthermore, the current data set is a subset of an earlier published paper by the same authors "The EEG data sets reported here were part of data published earlier (Ossandón et al., 2023; Pant et al., 2023)." Thus, the statement "The group differences for the EEG assessments corresponded to those of a larger sample of CC individuals (n=38) " is a circular argument and should be avoided.

      Our intention was not to justify having a small sample, but to justify why we think the results might be valid as they align with/replicate existing literature.

      In the revised manuscript, we added a figure showing that the EEG results of the 10 subjects considered here correspond to those of the 28 other subjects of Ossandón et al (Supplementary Material S18). We adapted the text accordingly, clearly stating that the pattern of EEG results of the ten subjects reported here replicate those of the 28 additional subjects of Ossandón et al. (2023) (Page 25, Lines 671-672).

      References (Public Review)

      Barnes, S. J., Sammons, R. P., Jacobsen, R. I., Mackie, J., Keller, G. B., & Keck, T. (2015). Subnetwork-specific homeostatic plasticity in mouse visual cortex in vivo. Neuron, 86(5), 1290–1303. https://doi.org/10.1016/J.NEURON.2015.05.010

      Bernabeu, A., Alfaro, A., García, M., & Fernández, E. (2009). Proton magnetic resonance spectroscopy (1H-MRS) reveals the presence of elevated myo-inositol in the occipital cortex of blind subjects. NeuroImage, 47(4), 1172–1176. https://doi.org/10.1016/j.neuroimage.2009.04.080

      Bottari, D., Troje, N. F., Ley, P., Hense, M., Kekunnaya, R., & Röder, B. (2016). Sight restoration after congenital blindness does not reinstate alpha oscillatory activity in humans. Scientific Reports. https://doi.org/10.1038/srep24683

      Colombo, M. A., Napolitani, M., Boly, M., Gosseries, O., Casarotto, S., Rosanova, M., Brichant, J. F., Boveroux, P., Rex, S., Laureys, S., Massimini, M., Chieregato, A., & Sarasso, S. (2019). The spectral exponent of the resting EEG indexes the presence of consciousness during unresponsiveness induced by propofol, xenon, and ketamine. NeuroImage, 189(September 2018), 631–644. https://doi.org/10.1016/j.neuroimage.2019.01.024

      Consideration of Sample Size in Neuroscience Studies. (2020). Journal of Neuroscience, 40(21), 4076–4077. https://doi.org/10.1523/JNEUROSCI.0866-20.2020

      Coullon, G. S. L., Emir, U. E., Fine, I., Watkins, K. E., & Bridge, H. (2015). Neurochemical changes in the pericalcarine cortex in congenital blindness attributable to bilateral anophthalmia. Journal of Neurophysiology. https://doi.org/10.1152/jn.00567.2015

      Fang, Q., Li, Y. T., Peng, B., Li, Z., Zhang, L. I., & Tao, H. W. (2021). Balanced enhancements of synaptic excitation and inhibition underlie developmental maturation of receptive fields in the mouse visual cortex. Journal of Neuroscience, 41(49), 10065–10079. https://doi.org/10.1523/JNEUROSCI.0442-21.2021

      Favaro, J., Colombo, M. A., Mikulan, E., Sartori, S., Nosadini, M., Pelizza, M. F., Rosanova, M., Sarasso, S., Massimini, M., & Toldo, I. (2023). The maturation of aperiodic EEG activity across development reveals a progressive differentiation of wakefulness from sleep. NeuroImage, 277. https://doi.org/10.1016/J.NEUROIMAGE.2023.120264

      Gao, Y., Liu, Y., Zhao, S., Liu, Y., Zhang, C., Hui, S., Mikkelsen, M., Edden, R. A. E., Meng, X., Yu, B., & Xiao, L. (2024). MRS study on the correlation between frontal GABA+/Glx ratio and abnormal cognitive function in medication-naive patients with narcolepsy. Sleep Medicine, 119, 1–8. https://doi.org/10.1016/j.sleep.2024.04.004

      Haider, B., Duque, A., Hasenstaub, A. R., & McCormick, D. A. (2006). Neocortical network activity in vivo is generated through a dynamic balance of excitation and inhibition. Journal of Neuroscience. https://doi.org/10.1523/JNEUROSCI.5297-05.2006

      Hill, A. T., Clark, G. M., Bigelow, F. J., Lum, J. A. G., & Enticott, P. G. (2022). Periodic and aperiodic neural activity displays age-dependent changes across early-to-middle childhood. Developmental Cognitive Neuroscience, 54, 101076. https://doi.org/10.1016/J.DCN.2022.101076

      Hupfeld, K. E., Zöllner, H. J., Hui, S. C. N., Song, Y., Murali-Manohar, S., Yedavalli, V., Oeltzschner, G., Prisciandaro, J. J., & Edden, R. A. E. (2024). Impact of acquisition and modeling parameters on the test–retest reproducibility of edited GABA+. NMR in Biomedicine, 37(4), e5076. https://doi.org/10.1002/nbm.5076

      Hyvärinen, J., Carlson, S., & Hyvärinen, L. (1981). Early visual deprivation alters modality of neuronal responses in area 19 of monkey cortex. Neuroscience Letters, 26(3), 239–243. https://doi.org/10.1016/0304-3940(81)90139-7

      Juchem, C., & Graaf, R. A. de. (2017). B0 magnetic field homogeneity and shimming for in vivo magnetic resonance spectroscopy. Analytical Biochemistry, 529, 17–29. https://doi.org/10.1016/j.ab.2016.06.003

      Keck, T., Hübener, M., & Bonhoeffer, T. (2017). Interactions between synaptic homeostatic mechanisms: An attempt to reconcile BCM theory, synaptic scaling, and changing excitation/inhibition balance. Current Opinion in Neurobiology, 43, 87–93. https://doi.org/10.1016/J.CONB.2017.02.003

      Kurcyus, K., Annac, E., Hanning, N. M., Harris, A. D., Oeltzschner, G., Edden, R., & Riedl, V. (2018). Opposite Dynamics of GABA and Glutamate Levels in the Occipital Cortex during Visual Processing. Journal of Neuroscience, 38(46), 9967–9976. https://doi.org/10.1523/JNEUROSCI.1214-18.2018

      Liu, B., Wang, G., Gao, D., Gao, F., Zhao, B., Qiao, M., Yang, H., Yu, Y., Ren, F., Yang, P., Chen, W., & Rae, C. D. (2015). Alterations of GABA and glutamate-glutamine levels in premenstrual dysphoric disorder: A 3T proton magnetic resonance spectroscopy study. Psychiatry Research - Neuroimaging, 231(1), 64–70. https://doi.org/10.1016/J.PSCYCHRESNS.2014.10.020

      Lunghi, C., Berchicci, M., Morrone, M. C., & Russo, F. D. (2015). Short‐term monocular deprivation alters early components of visual evoked potentials. The Journal of Physiology, 593(19), 4361. https://doi.org/10.1113/JP270950

      Maier, S., Düppers, A. L., Runge, K., Dacko, M., Lange, T., Fangmeier, T., Riedel, A., Ebert, D., Endres, D., Domschke, K., Perlov, E., Nickel, K., & Tebartz van Elst, L. (2022). Increased prefrontal GABA concentrations in adults with autism spectrum disorders. Autism Research, 15(7), 1222–1236. https://doi.org/10.1002/aur.2740

      Manning, J. R., Jacobs, J., Fried, I., & Kahana, M. J. (2009). Broadband shifts in local field potential power spectra are correlated with single-neuron spiking in humans. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, 29(43), 13613–13620. https://doi.org/10.1523/JNEUROSCI.2041-09.2009

      McSweeney, M., Morales, S., Valadez, E. A., Buzzell, G. A., Yoder, L., Fifer, W. P., Pini, N., Shuffrey, L. C., Elliott, A. J., Isler, J. R., & Fox, N. A. (2023). Age-related trends in aperiodic EEG activity and alpha oscillations during early- to middle-childhood. NeuroImage, 269, 119925. https://doi.org/10.1016/j.neuroimage.2023.119925

      Medel, V., Irani, M., Crossley, N., Ossandón, T., & Boncompte, G. (2023). Complexity and 1/f slope jointly reflect brain states. Scientific Reports, 13(1), 21700. https://doi.org/10.1038/s41598-023-47316-0

      Molina, J. L., Voytek, B., Thomas, M. L., Joshi, Y. B., Bhakta, S. G., Talledo, J. A., Swerdlow, N. R., & Light, G. A. (2020). Memantine Effects on Electroencephalographic Measures of Putative Excitatory/Inhibitory Balance in Schizophrenia. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 5(6), 562–568. https://doi.org/10.1016/j.bpsc.2020.02.004

      Mukerji, A., Byrne, K. N., Yang, E., Levi, D. M., & Silver, M. A. (2022). Visual cortical γ−aminobutyric acid and perceptual suppression in amblyopia. Frontiers in Human Neuroscience, 16. https://doi.org/10.3389/fnhum.2022.949395

      Muthukumaraswamy, S. D., & Liley, D. T. (2018). 1/f electrophysiological spectra in resting and drug-induced states can be explained by the dynamics of multiple oscillatory relaxation processes. NeuroImage, 179(November 2017), 582–595. https://doi.org/10.1016/j.neuroimage.2018.06.068

      Narayan, G. A., Hill, K. R., Wengler, K., He, X., Wang, J., Yang, J., Parsey, R. V., & DeLorenzo, C. (2022). Does the change in glutamate to GABA ratio correlate with change in depression severity? A randomized, double-blind clinical trial. Molecular Psychiatry, 27(9), 3833–3841. https://doi.org/10.1038/s41380-022-01730-4

      Nuijten, M. B., & Polanin, J. R. (2020). “statcheck”: Automatically detect statistical reporting inconsistencies to increase reproducibility of meta-analyses. Research Synthesis Methods, 11(5), 574–579. https://doi.org/10.1002/jrsm.1408

      Oeltzschner, G., Zöllner, H. J., Hui, S. C. N., Mikkelsen, M., Saleh, M. G., Tapper, S., & Edden, R. A. E. (2020). Osprey: Open-source processing, reconstruction & estimation of magnetic resonance spectroscopy data. Journal of Neuroscience Methods, 343, 108827. https://doi.org/10.1016/j.jneumeth.2020.108827

      Ossandón, J. P., Stange, L., Gudi-Mindermann, H., Rimmele, J. M., Sourav, S., Bottari, D., Kekunnaya, R., & Röder, B. (2023). The development of oscillatory and aperiodic resting state activity is linked to a sensitive period in humans. NeuroImage, 275, 120171. https://doi.org/10.1016/J.NEUROIMAGE.2023.120171

      Ostlund, B. D., Alperin, B. R., Drew, T., & Karalunas, S. L. (2021). Behavioral and cognitive correlates of the aperiodic (1/f-like) exponent of the EEG power spectrum in adolescents with and without ADHD. Developmental Cognitive Neuroscience, 48, 100931. https://doi.org/10.1016/j.dcn.2021.100931

      Pant, R., Ossandón, J., Stange, L., Shareef, I., Kekunnaya, R., & Röder, B. (2023). Stimulus-evoked and resting-state alpha oscillations show a linked dependence on patterned visual experience for development. NeuroImage: Clinical, 103375. https://doi.org/10.1016/J.NICL.2023.103375

      Perica, M. I., Calabro, F. J., Larsen, B., Foran, W., Yushmanov, V. E., Hetherington, H., Tervo-Clemmens, B., Moon, C.-H., & Luna, B. (2022). Development of frontal GABA and glutamate supports excitation/inhibition balance from adolescence into adulthood. Progress in Neurobiology, 219, 102370. https://doi.org/10.1016/j.pneurobio.2022.102370

      Pitchaimuthu, K., Wu, Q. Z., Carter, O., Nguyen, B. N., Ahn, S., Egan, G. F., & McKendrick, A. M. (2017). Occipital GABA levels in older adults and their relationship to visual perceptual suppression. Scientific Reports, 7(1). https://doi.org/10.1038/S41598-017-14577-5

      Rideaux, R., Ehrhardt, S. E., Wards, Y., Filmer, H. L., Jin, J., Deelchand, D. K., Marjańska, M., Mattingley, J. B., & Dux, P. E. (2022). On the relationship between GABA+ and glutamate across the brain. NeuroImage, 257, 119273. https://doi.org/10.1016/J.NEUROIMAGE.2022.119273

      Schaworonkow, N., & Voytek, B. (2021). Longitudinal changes in aperiodic and periodic activity in electrophysiological recordings in the first seven months of life. Developmental Cognitive Neuroscience, 47. https://doi.org/10.1016/j.dcn.2020.100895

      Schwenk, J. C. B., VanRullen, R., & Bremmer, F. (2020). Dynamics of Visual Perceptual Echoes Following Short-Term Visual Deprivation. Cerebral Cortex Communications, 1(1). https://doi.org/10.1093/TEXCOM/TGAA012

      Sengpiel, F., Jirmann, K.-U., Vorobyov, V., & Eysel, U. T. (2006). Strabismic Suppression Is Mediated by Inhibitory Interactions in the Primary Visual Cortex. Cerebral Cortex, 16(12), 1750–1758. https://doi.org/10.1093/cercor/bhj110

      Steel, A., Mikkelsen, M., Edden, R. A. E., & Robertson, C. E. (2020). Regional balance between glutamate+glutamine and GABA+ in the resting human brain. NeuroImage, 220. https://doi.org/10.1016/J.NEUROIMAGE.2020.117112

      Takado, Y., Takuwa, H., Sampei, K., Urushihata, T., Takahashi, M., Shimojo, M., Uchida, S., Nitta, N., Shibata, S., Nagashima, K., Ochi, Y., Ono, M., Maeda, J., Tomita, Y., Sahara, N., Near, J., Aoki, I., Shibata, K., & Higuchi, M. (2022). MRS-measured glutamate versus GABA reflects excitatory versus inhibitory neural activities in awake mice. Journal of Cerebral Blood Flow & Metabolism, 42(1), 197. https://doi.org/10.1177/0271678X211045449

      Takei, Y., Fujihara, K., Tagawa, M., Hironaga, N., Near, J., Kasagi, M., Takahashi, Y., Motegi, T., Suzuki, Y., Aoyama, Y., Sakurai, N., Yamaguchi, M., Tobimatsu, S., Ujita, K., Tsushima, Y., Narita, K., & Fukuda, M. (2016). The inhibition/excitation ratio related to task-induced oscillatory modulations during a working memory task: A multimodal-imaging study using MEG and MRS. NeuroImage, 128, 302–315. https://doi.org/10.1016/J.NEUROIMAGE.2015.12.057

      Tao, H. W., & Poo, M. M. (2005). Activity-dependent matching of excitatory and inhibitory inputs during refinement of visual receptive fields. Neuron, 45(6), 829–836. https://doi.org/10.1016/J.NEURON.2005.01.046

      VanRullen, R., & MacDonald, J. S. P. (2012). Perceptual echoes at 10 Hz in the human brain. Current Biology. https://doi.org/10.1016/j.cub.2012.03.050

      Voytek, B., Kramer, M. A., Case, J., Lepage, K. Q., Tempesta, Z. R., Knight, R. T., & Gazzaley, A. (2015). Age-related changes in 1/f neural electrophysiological noise. Journal of Neuroscience, 35(38). https://doi.org/10.1523/JNEUROSCI.2332-14.2015

      Vreeswijk, C. V., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274(5293), 1724–1726. https://doi.org/10.1126/SCIENCE.274.5293.1724

      Waschke, L., Wöstmann, M., & Obleser, J. (2017). States and traits of neural irregularity in the age-varying human brain. Scientific Reports, 7(1), 1–12. https://doi.org/10.1038/s41598-017-17766-4

      Weaver, K. E., Richards, T. L., Saenz, M., Petropoulos, H., & Fine, I. (2013). Neurochemical changes within human early blind occipital cortex. Neuroscience. https://doi.org/10.1016/j.neuroscience.2013.08.004

      Wu, Y. K., Miehl, C., & Gjorgjieva, J. (2022). Regulation of circuit organization and function through inhibitory synaptic plasticity. Trends in Neurosciences, 45(12), 884–898. https://doi.org/10.1016/J.TINS.2022.10.006

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for The Authors):

      Thank you for the interesting submission. I have inserted my comments to the authors here. Some of them will be more granular comments related to the concerns raised in the public review.

      (1) Introduction:

      Could you please justify the rationale for using eyes open and eyes closed in the MRS condition, and the use of the three different conditions in the EEG experiment? If these resulted in negative findings, then the implications should be discussed.

      Previous work with MRS in sighted individuals has suggested that eye opening in darkness results in a decrease of visual cortex GABA+ concentration, while visual stimulation results in an increase of Glx concentration, compared to a baseline concentration at eye closure (Kurcyus et al., 2018). Moreover, visual stimulation/eye opening is known to result in alpha desynchronization (Adrian & Matthews, 1934).

      While previous work of our group has shown significantly reduced alpha oscillatory activity in congenital cataract-reversal individuals, desynchronization following eye opening was indistinguishable from that of normally sighted controls (Ossandón et al., 2023; Pant et al., 2023).

      Thus, we decided to include both conditions to test whether a similar pattern of results would emerge for GABA+/Glx concentration.

      We added our motivation to the Introduction of the revised manuscript (Page 4, Lines 122-125) along with the Methods (Page 9, Lines 219-223).

      It does not become clear from the introduction why a higher intercept is predicted in the EEG measure. The rationale for this hypothesis needs to be explained better.

      Given the prior findings suggesting an increased E/I ratio in CC individuals and the proposed link between neuronal firing (Manning et al., 2009) and the aperiodic intercept, we expected a higher intercept for the CC compared to the SC group.

      We have now added this explanation to the Introduction (Page 4, Lines 126-128).

      (2) Participants

      Were participants screened for common MRS exclusion criteria such as history of psychiatric conditions or antidepressant medication, which could alter neurochemistry? If not, then this needs to be pointed out.

      All participants were clinically screened at the LV Prasad Eye Institute, and additionally self-reported no neurological or psychiatric conditions or medications. Moreover, all subjects were screened against the exclusion criteria for scanning using the standard questionnaire of the radiology center.

      We have now made this clear in the Methods (Page 7, Lines 168-171).

      Table 1 needs to show the age of the participant, which can only be derived by adding the columns 'duration of deprivation' and 'time since surgery'. Table 1 also needs to include the controls.

      We have accordingly modified Table 1 in the revised manuscript and added age for the patients as well as the controls (Table 1, Pages 6-7).

      The control cohort is not specific enough to exclude reduced visual acuity, or co-morbidities, as the primary driver of the differences between groups. Ideally, a cohort with developmental cataracts is recruited. Normally sighted participants as a control cohort cannot distinguish between different types of sight loss, or stages of plasticity.

      The goal of this study was not to distinguish between different types of sight loss or stages of plasticity. We aimed to assess whether the most extreme forms of visual deprivation (i.e. congenital and total patterned vision loss) affected the E/I ratio. Low visual acuity and nystagmus are genuine diagnostic criteria (Methods, Page 5, Lines 142-145). Visual acuity cannot solely explain the current findings, since the MRS data were acquired both with eyes closed and with diffuse visual stimulation in a dimly lit room, without any visual task.

      In light of the present results, we consider it worthwhile for future work to investigate additional groups, such as developmental cataract-reversal individuals, to narrow down the contribution of the age of onset and degree of visual deprivation to the observed group differences.

      (3) Data collection and analysis

      - More detail is needed: how long were the sessions, how long was each part?

      We have added this information on Page 7, Lines 178-181 of the Methods. MRS scanning took between 45 and 60 minutes, EEG testing took 20 minutes excluding the time for capping, and visual acuity testing took 3-5 minutes.

      - It should be mentioned here that the EEG data is a reanalysis of a subset of legacy data, published previously in Ossandón et al., 2023; Pant et al., 2023.

      In the revised manuscript, we explicitly state at the beginning of the “Electrophysiology recordings” section of the Methods (Page 13, Lines 331-334) that the EEG datasets were a subset of previously published data.

      (4) MRS Spectroscopy

      - Please fill out the minimum reporting standards form (Lin et al., 2021), or report all the requested measures in the main document https://pubmed.ncbi.nlm.nih.gov/33559967/

      We have now filled out this form and added it as Supplementary Material (Supplementary Excel File 1). Additionally, all the requested information has been moved to the Methods section of the main document (MRS Data Quality, Pages 10-12).

      - Information on how the voxels were placed is missing. The visual cortex voxel is not angled parallel to the calcarine, as is a common way to capture processing in the early visual cortex. Describe in the paper what the criteria for successful placement were, and how was it ensured that non-brain tissue was avoided in a voxel of this size.

      Voxel placement was optimized in each subject to avoid the meninges, ventricles, skull and subcortical structures; this was ensured by examining the voxel region across slices in the acquired T1 volume for each subject. Saturation bands were placed to nullify the skull signal during MRS acquisition, at the anterior (frontal) and posterior (visual) edge of the voxel for every subject. Due to limitations in the clinical scanner, rotated/skewed voxels were not possible, and thus voxels were not always located precisely parallel to the calcarine.

      We have added this information to Page 9 (Lines 229-237) of the revised manuscript.

      - Figure 1. shows voxels that are very close to the edge of the brain (frontal cortex) or to the tentorium (visual cortex). Could the authors please calculate the percentage overlap between the visual cortex MRS voxel and the visual cortex, and compare them across groups to ensure that there is no between-group bias from voxel placement?

      We have now added the requested analysis to Supplementary Material S2 and referred to it in the main manuscript on Page 9, Lines 236-237.

      Briefly, the percentage overlap with areas V1-V6 in every individual subject’s visual cortex voxel was 60% or more; the mean overlap was 67% in the CC group and 70% in the SC group. The percentage overlap did not differ between groups (t-test: t(18) = -1.14, p = 0.269).
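
      The overlap check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for Supplementary Material S2: the function names (`percent_overlap`, `pooled_t`), the set-of-grid-points representation of the masks, and all numbers are hypothetical, and the pooled-variance form of the t-test is assumed because it yields n1 + n2 - 2 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def percent_overlap(voxel_mask: set, roi_mask: set) -> float:
    """Percentage of the MRS voxel's grid points that fall inside the ROI."""
    if not voxel_mask:
        raise ValueError("voxel mask is empty")
    return 100.0 * len(voxel_mask & roi_mask) / len(voxel_mask)


def pooled_t(a, b):
    """Independent-samples t statistic with pooled variance; returns (t, df)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Toy masks on an integer grid (hypothetical, not the study's data).
voxel = {(x, y, z) for x in range(4) for y in range(3) for z in range(2)}
roi = {(x, y, z) for x in range(1, 6) for y in range(3) for z in range(2)}
overlap = percent_overlap(voxel, roi)  # 18 of the 24 voxel grid points lie in the ROI

# Toy per-subject overlap percentages for two groups (hypothetical values).
group_a = [60.0, 65.0, 70.0]
group_b = [66.0, 70.0, 74.0]
t_stat, df = pooled_t(group_a, group_b)
print(f"overlap = {overlap:.1f}%, t({df}) = {t_stat:.2f}")
```

      With ten subjects per group, the same formula gives 18 degrees of freedom, consistent with the t(18) reported above. Obtaining a p-value additionally requires the t distribution, which a statistics library would provide.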

      - Figure 1. I would recommend displaying data on a skull-stripped image to avoid identifying information from the participant's T1 profile.

      We have now replaced the images in Figure 1 with skull-stripped images. Note that images from SPM12 were used instead of GannetCoregister, as GannetCoregister only displays images with the skull.

      - Please show more rigor with the MRS quality measures. Several examples of inconsistency and omissions are below.

      • SNR was quantified and shows a difference in SNR between voxel positions, with lower SNR in the frontal cortex. No explanation or discussion of the difference was provided.

      • Looking at S1, the linewidth of NAA seems to be a lot broader in the frontal cortex than in the visual cortex. The figures suggest that acquisition quality was very different between voxel locations, making the comparison difficult.

      • Linewidth of NAA is a generally agreed measure of shim quality in megapress acquisitions (Craven et al., 2022).

      The data quality difference between the frontal and visual cortices has been observed in the literature (Juchem & Graaf, 2017; Rideaux et al., 2022). We nevertheless chose a frontal cortex voxel as control site instead of the often-chosen sensorimotor cortex. The main motivation was to avoid any cortical region linked to sensory processing since crossmodal compensation as a consequence of visual deprivation is a well-documented phenomenon.

      We now make this clearer in the Methods (Page 11, Lines 284 – 299) and in the Discussion/Limitations (Page 25, Lines 662 - 665).

      - To get a handle on the data quality, I would recommend that the authors display their MRS quality measures in a separate section 'MRS quality measure', including NAA linewidth, NAA SNR, GABA+ CRLB, Glx CRLB, and test for the main effects and interaction of voxel location (VC, FC) and group (SC, CC) and discuss any discrepancies.

      We have moved all the quality metric values for GABA+, Glx and NAA from the supplement to the Methods section (see Table 2), and added the requested section titled “MRS Data quality.”

      We have conducted the requested analyses and reported them in Supplementary Material S6: there was a strong effect of region confirming that data quality was better in the visual than frontal region. We have referred to this in the main manuscript on Page 11, Line 299.

      In the revised manuscript, we discuss the data quality in the frontal cortex, and how we ensured it was comparable to prior work. Moreover, there were no significant group effects or group-by-region interactions, suggesting that group differences observed for the visual cortex voxel cannot be accounted for by differences in data quality. We have now included a section on data quality, both in the Methods (Page 11, Lines 284 – 299) and in the limitations section of the Discussion (Page 25, Lines 662 - 665).

      Please clarify the MRS acquisition: "Each MEGA-PRESS scan lasted for 8 minutes and was acquired with the following specifications: TR = 2000 ms, TE = 68 ms, Voxel size = 40 mm x 30 mm x 25 mm, 192 averages (each consists of two TRs)." 192 averages x 2 TRs x 2 s TR = 12.8 min, not 8 min; apologies if I have misunderstood these details.

      We have corrected this error in the revised manuscript and stated the parameters more clearly: there were a total of 256 averages, each consisting of 1 TR, resulting in an 8.5-minute scan (256 repetitions x 2 s / 60 ≈ 8.5 min) (Page 8, Lines 212-213).
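
      The two duration calculations in this exchange can be checked with a short sketch. The function name `scan_minutes` and its parameter names are illustrative only; the numbers are those quoted in the comment and the response.

```python
def scan_minutes(n_averages: int, trs_per_average: int, tr_seconds: float) -> float:
    """Total acquisition time in minutes for a sequence of repeated TRs."""
    return n_averages * trs_per_average * tr_seconds / 60.0


# Reviewer's reading of the original text: 192 averages, two TRs each, TR = 2 s.
original_reading = scan_minutes(192, 2, 2.0)

# Corrected parameters from the response: 256 averages, one TR each, TR = 2 s.
corrected = scan_minutes(256, 1, 2.0)

print(f"original reading: {original_reading:.1f} min")
print(f"corrected:        {corrected:.2f} min")
```

      The first value reproduces the 12.8 minutes noted by the reviewer, and the second is roughly the 8.5 minutes stated in the revised Methods.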

      - What was presented to participants in the eyes open MRS? Was it just normal room illumination or was it completely dark? Please add details to your methods.

      The scans were conducted in regular room illumination, with no visual stimulation.

      We have now clarified this on Page 9 (Lines 223-224) of the Methods.

      (5) MRS analysis

      How was the tissue fraction correction performed? Please add or refer to the exact equation from Harris et al., 2015.

      We have clarified that the reported GABA+/Glx values are water-normalized alpha corrected values (Page 10, Line 249), and cited Harris et al., 2015 on Page 10 (Line 251) of the Methods.

      (6) Statistical approach

      How was the sample size determined? Please add your justification for the sample size

      We collected as many qualifying patients as we were able to recruit for this study within 2.5 years of data collection (commencing August 2019, ending February 2022), given the constraints of the patient population and the pandemic. We have now made this clear in the Discussion (Page 25, Lines 650-652).

      Please report the tests for normality.

      We have now reported the Shapiro-Wilk test results for normality as well as Levene’s test for homogeneity of variance between groups for every dependent variable in our dataset in Supplementary Material S9, and added references to it in the descriptions of the statistical analyses (Methods, Page 13, Lines 326-329 and Page 15, Lines 400-402).

      Calculate the Bayes Factor where possible.

      As our analyses are all frequentist, instead of re-analyzing the data within a Bayesian framework, we added partial eta squared (ηp²) values for all the reported ANOVAs to give readers an idea of the effect size (Results).
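
      For readers unfamiliar with the measure, partial eta squared is the effect sum of squares divided by the sum of the effect and error sums of squares, and it can equivalently be recovered from an F value and its degrees of freedom. The sketch below uses hypothetical function names and made-up numbers, not values from the manuscript.

```python
def partial_eta_squared(ss_effect: float, ss_error: float) -> float:
    """Partial eta squared from sums of squares: SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)


def partial_eta_squared_from_f(f_value: float, df_effect: int, df_error: int) -> float:
    """Same quantity from an F statistic: F * df_effect / (F * df_effect + df_error)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Made-up ANOVA terms for illustration only.
ss_effect, ss_error = 2.0, 6.0
df_effect, df_error = 1, 18
f_value = (ss_effect / df_effect) / (ss_error / df_error)

eta_from_ss = partial_eta_squared(ss_effect, ss_error)
eta_from_f = partial_eta_squared_from_f(f_value, df_effect, df_error)
print(f"F = {f_value:.2f}, partial eta squared = {eta_from_ss:.2f} / {eta_from_f:.2f}")
```

      Both routes give the same value, which makes it possible to recover the effect size from a reported F test when the sums of squares are not available.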

      I recommend partial correlations to control for the influence of age, duration, and time of surgery, rather than separate correlations.

      Given the combination of small sample size and the expected multicollinearity in our variables (duration of blindness, for example, would be expected to correlate with age, as well as visual acuity post-surgery), partial correlations could not be calculated on this data.

      We are aware of the limits of correlational analyses. Given the unique data set from a rare population, we had planned to relate behavioral, EEG and MRS parameters in an exploratory manner by calculating correlations. Since no similar data existed when we started (and to the best of our knowledge our data set is still unique), these correlation analyses were exploratory, but the most transparent to run.

      We have now clearly outlined these limitations in our Introduction (Page 5, Lines 133-135), Methods (Page 15, Lines 408-410) and Discussion section (Page 24, Line 634, Page 25, Lines 652-65) to ensure that the results are interpreted with appropriate caution.

      (7) Visual acuity

      Is the VA monocular average, from the dominant eye, or bilateral?

      We have now clarified that the VA reported here is bilateral (Methods, Page 7, Line 165 and Page 15, Line 405). Bilateral visual acuity in congenital cataract-reversal individuals typically corresponds to the visual acuity of the best eye.

      It is mentioned here that correlations with VA are exploratory, please be consistent as the introduction mentions that there was a hypothesis that you sought to test.

      We have now accordingly modified the Introduction (Page 5, Lines 133-135) and added the appropriate caveats in the discussion with regards to interpretations (Page 25, Lines 652-665).

      (8) Correlation analyses between MRS and EEG

      It is mentioned here that correlations between EEG and MRS are exploratory, please consistently point out the exploratory nature, as these results are preliminary and should not be overinterpreted ("We did not have prior hypotheses as to the best of our knowledge no extant literature has tested the correlation between aperiodic EEG activity and MRS measures of GABA+,Glx and Glx/GABA+." ).

      In the revised manuscript, we explicitly state that the reported associations between EEG (aperiodic component) and MRS parameters allow more specific, directed hypotheses to be put forward for future studies (Introduction, Page 5, Lines 133-135; Methods, Page 15, Line 415; Discussion, Page 25, Lines 644-645 and Lines 652-665).

      (9) Results

      Figure 2 uses the same y-axis for the visual cortex and frontal cortex to facilitate a comparison between the two locations. Comparing Figure 2 a with b demonstrates poorer spectral peaks and reduced amplitudes. Lower spectral quality in the frontal cortex voxel could contribute to the absence of a group effect in the control voxel location. The major caveat that spectral quality differs between voxels needs to be pointed out and the limitations thereof discussed.

      We have now explicitly pointed out this issue in the Methods (MRS Data Quality, Supplementary Material S6) and Discussion in the Limitations section (Page 25, Lines 662-665). While data quality was lower for the frontal compared to the visual cortex voxels, as has been observed previously (Juchem & Graaf, 2017; Rideaux et al., 2022), this was not an issue for the EEG recordings. Thus, lower sensitivity of frontal measures cannot easily explain the lack of group differences for frontal measures. Crucially, data quality did not differ between groups.

      The results in 2c are the result of multiple correlations with metabolite values ("As in previous studies, we ran a number of exploratory correlation analyses between GABA+, Glx, and Glx/GABA+ concentrations, and visual acuity at the date of testing, duration of visual deprivation, and time since surgery respectively in the CC group"), it seems at least six for the visual acuity measure (VA vs Glx, VA vs GABA+, VA vs Glx/GABA+ x 2 conditions). While the trends are interesting, they should be interpreted with caution because of the exploratory nature, small sample size, the lack of multiple comparison correction, and the influence of two extreme data points. The authors should not overinterpret these results and should point out the need for replication.

      See response to (6) last section, which we copy here for convenience:

      We are aware of the limits of correlational analyses. Given the unique data set from a rare population, we related behavioral, EEG and MRS parameters in an exploratory manner by calculating correlations. Since no similar data existed when we started (and to the best of our knowledge our data set is still unique), these correlation analyses were exploratory, but the most transparent to run.

      We have now clearly outlined these limitations in our Discussion section to ensure that the results are interpreted with appropriate caution (Discussion, Page 25, Lines 644-645 and Lines 652-665).

      (10) Discussion:

      Please explain the decrease in E/I balance from MRS in view of recent findings on an increase in E/I balance in CC using RSN-fMRI (Raczy et al., 2022) and EEG (Ossandon et al. 2023).

      We have edited our Abstract (Pages 1-2, Lines 31-35) and Discussion (Page 23, Lines 584-590; Page 24, Lines 613-620). In brief, we think our results reflect a homeostatic regulation of E/I balance, that is, an increase in inhibition due to an increase in stimulus-driven excitation following sight restoration.

      Names limitations but does nothing to mitigate concerns about spatial specificity. The limitations need to be rewritten to include differences in SNR between the visual cortex and frontal lobe. Needs to include caveats of small samples, including effect inflation.

      We have now discussed the data quality differences between the visual and frontal cortex voxel in MRS data quality, which we find irrespective of group (MRS Data Quality, Supplementary Material S6). We also reiterate why this might not explain our results; data quality was comparable to prior studies which have found group differences in frontal cortex (Methods Page 11, Lines 284 – 299), and data quality did not differ between groups. Further, EEG data quality did not differ across frontal and occipital regions, but group differences in EEG datasets were localized to the occipital cortex.

      Reviewer #2 (Recommendations for The Authors):

      To address the main weakness, the authors could consider including data from a third group, of congenitally blind individuals. Including this would go a very long way towards making the findings interpretable and relating them to the rest of the literature.

      Unfortunately, recruitment of these groups was not possible due to the pandemic. Indeed, we would consider a pre- vs. post-surgery approach the most suitable design in the future, which, however, will require several years to be completed. Such time- and resource-intensive longitudinal studies are justified by the present cross-sectional results.

      We have explicitly stated our contribution and need for future studies in the Limitations section of the Discussion (Page 25, Lines 650-657).

      Analysing the amplitude of alpha rhythms, as well as the other "aperiodic" components, would be useful to relate the profile of the tested patients with previous studies. Visual inspection of Figure 3 suggests that alpha power with eyes closed is not reduced in the patients' group compared to the controls. This would be inconsistent with previous studies (including research from the same group) and it could suggest that the small selected sample is not really representative of the sight-recovery population - certainly one of the most heterogeneous study populations. This further highlights the difficulty of drawing conclusions on the effects of visual experience merely based on this N=10 set of patients.

      Alpha power was indeed reduced in the present subsample of 10 CC individuals (Supplementary Material S19). A likely source of the confusion (that the graphs of the CC and SC groups look so similar for the EC condition in Figure 3) is that the spectra are shown with the aperiodic components not yet removed, and are scaled to accommodate very different alpha power values. As documented in Supplementary Material S18 and S19, alpha power and the aperiodic intercept/slope results of the resting state data in the present 10 CC individuals correspond to the results from a larger sample of CC individuals (n = 28) in Ossandón et al., 2023. We explicitly highlight this “replication” in the main manuscript (Pages 25-26, Lines 671-676). Thus, the present sub-sample of CC individuals is representative of its population.

      To further characterise the MRS results, the authors may consider an alternative normalisation scheme. It is not clear whether the lack of significant GABA and GLX differences in the face of a significant group difference in the GLX/GABA ratio is due to the former measures being noisier since taking the ratio between two metabolites often helps reduce inter-individual variability and thereby helps revealing group differences. It remains an open question whether the GABA or GLX concentrations would show significant group differences after appropriate normalisation (e.g. NAA?).

      We repeated the analysis with Creatine-normalized values of GABA+ and Glx, and the main results i.e. reduced Glx/GABA+ concentration in the visual cortex of CC vs SC individuals, and no such difference in the frontal cortex, remained the same (Supplementary Material S5).

      Further, we re-analyzed the data using Osprey, an open-source toolbox that uses linear combination modeling, and found once more that our results did not change (Supplementary Material S3). We refer to these findings in the Methods (Page 10, Lines 272-275) and Results (Page 10, Lines 467-471) of the main manuscript.

      In fact, the Glx concentration in the visual cortex of CC vs SC individuals was significantly decreased when Cr-normalized values were used, a difference that was not significant in the original analysis. However, we do not interpret this result as it was not replicated with the water-normalized values from Gannet or Osprey.

      I suggest revising the discussion to present a more balanced picture of the existent evidence of the relation between E/I and EEG indices. Although there is evidence that the 1/f slope changes across development, in a way that could be consistent with a higher slope reflecting more immature and excitable tissue, the link with cortical E/I is far from established, especially when referring to specific EEG indices (intercept vs. slope, measured in lower vs. higher frequency ranges).

      We have revised the Introduction (Page 4, Line 91, Lines 101-102) and Discussion (Page 22, Lines 568-569, Page 24, Lines 645-647 and Lines 654-657) in the manuscript accordingly; we allude to the fact that the links between cortical E/I and aperiodic EEG indices have not yet been unequivocally established in the literature.

      Minor:

      - The authors estimated NAA concentration with different software than the one used to estimate GLX and GABA; this examined the OFF spectra only; I suggest that the authors consider running their analysis with LCModel, which would allow a straightforward approach to estimate concentrations of all three metabolites from the same edited spectrum and automatically return normalised concentrations as well as water-related ones.

      We re-analyzed all of the MRS datasets using Osprey, which uses linear combination modelling and has shown quantification results similar to LCModel for NAA (Oeltzschner et al., 2020). The results of a lower Glx/GABA+ concentration in the visual cortex of CC vs SC individuals, and no difference in NAA concentration, were replicated using this pipeline.

      We have now added these analyses to the Supplementary Material S3 and referred to them in the Methods (Page 9, Lines 242-246) and Results (Page 18, Lines 464-467).

      - Of course the normalisation used to estimate GABA and GLX values is completely irrelevant when the two values are expressed as ratio GLX/GABA - this may be reflected in the text ("water normalised GLX/GABA concentration" should read "GLX/GABA concentration" instead).

      We have adapted the text on Page 16 (Line 431) and have ensured that throughout the manuscript the use of “water-normalized” is in reference to Glx or GABA+ concentration, and not the ratio.

      - Please specify which equation was used for tissue correction - is it alpha-correction?

      We have clarified that the reported GABA+/Glx values are water-normalized alpha corrected values (Page 10, Line 249), and cited Harris et al., 2015 on Page 10 (Line 251) of the Methods.

      - Since ANOVA was used, the assumption is that values are normally distributed. Please report evidence supporting this assumption.

      We have now reported the Shapiro-Wilk test results for normality as well as Levene’s test for homogeneity of variance between groups for every dependent variable in our dataset in Supplementary Material S9, and added references to it in the Methods (Page 13, Lines 326-329 and Page 15, Lines 400-402).

      Reviewer #3 (Recommendations for The Authors):

      In addition to addressing major comments listed in my Public Review, I have the following, more granular comments, which should also be addressed:

      (1) The paper's structure could be improved by presenting visual acuity data before diving into MRS and EEG results to better contextualize the findings.

      We now explicitly state in the Methods (Page 5, Line 155) that lower visual acuity is expected in a cohort of CC individuals with long lasting congenital visual deprivation.

      We have additionally included a plot of visual acuities of the two groups (Supplementary Material S1).

      (2) The paper should better explain the differences between CC individuals whose sight is restored and congenitally blind patients. The authors write in the introduction that there are sensitive periods/epochs during the lifespan for the development of local inhibitory neural circuits, and that "Human neuroimaging studies have similarly demonstrated that visual experience during the first weeks and months of life is crucial for the development of visual circuits. If human infants born with dense bilateral cataracts are treated later than a few weeks from birth, they suffer from a permanent reduction of not only visual acuity (Birch et al., 1998; Khanna et al., 2013) and stereovision (Birch et al., 1993; Tytla et al., 1993) but additionally from impairments in higher-level visual functions, such as face perception (Le Grand et al., 2001; Putzar et al., 2010; Röder et al., 2013)...".

      Thus, the current participants (sight restored after a sensitive period) seem to be similarly affected by the development of the local inhibitory circuits as congenitally blind individuals. To assess the effect of plasticity and sight restoration, longitudinal data would be necessary.

      In the Introduction (Page 2, Lines 59-64; Page 3, Lines 111-114) we added that in order to identify sensitive periods e.g. for the elaboration of visual neural circuits, sight recovery individuals need to be investigated. The study of permanently blind individuals allows for investigating the role of experience (whether sight is necessary to introduce the maturation of visual neural circuits), but not whether visual input needs to be available at early epochs in life (i.e. whether sight restoration following congenital blindness could nevertheless lead to the development of visual circuits).

      This is indeed the conclusion we make in the Discussion section. We have now highlighted the need for longitudinal assessments in the Discussion (Page 25, Lines 654-656).

      (3) What's the underlying idea of analyzing two separate aperiodic slopes (20-40 Hz and 1-19 Hz)? It is very unusual to compute the slope between 20-40 Hz, where the SNR is rather low.

      "Ossandón et al. (2023), however, observed that in addition to the flatter slope of the aperiodic power spectrum in the high frequency range (20-40 Hz), the slope of the low frequency range (1-19 Hz) was steeper in both, congenital cataract-reversal individuals, as well as in permanently congenitally blind humans."

      The present manuscript computed the slope between 1-20 Hz. Ossandón et al. as well as Medel et al. (2023) found a “knee” of the 1/f distribution at 20 Hz and describe further the motivations for computing both slope ranges. For example, Ossandón et al. used a data driven approach and compared single vs. dual fits and found that the latter fitted the data better. Additionally, they found the best fit if a knee at 20 Hz was used. We would like to point out that no standard range exists for the fitting of the 1/f component across the literature and, in fact, very different ranges have been used (Gao et al., 2017; Medel et al., 2023; Muthukumaraswamy & Liley, 2018).
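
      To make the dependence on fitting range concrete, the following minimal sketch (not the authors' pipeline) fits a straight line to log10 power versus log10 frequency separately within 1-20 Hz and 20-40 Hz, and once across the whole 1-40 Hz range. The synthetic spectrum, its exponents, and the function names are illustrative assumptions; only the 20 Hz knee is taken from the response above.

```python
import math


def fit_aperiodic(freqs, power, f_lo, f_hi):
    """Least-squares line through (log10 f, log10 P) for f_lo <= f <= f_hi.

    Returns (slope, intercept) of the aperiodic fit in log-log space.
    """
    pts = [(math.log10(f), math.log10(p))
           for f, p in zip(freqs, power) if f_lo <= f <= f_hi]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def synthetic_power(f):
    """Toy spectrum (assumption): exponent 1.5 up to 20 Hz, 0.8 above it."""
    if f <= 20:
        return f ** -1.5
    return (20 ** -1.5) * (f / 20) ** -0.8


freqs = [0.5 * k for k in range(2, 81)]  # 1.0, 1.5, ..., 40.0 Hz
power = [synthetic_power(f) for f in freqs]

low_slope, low_intercept = fit_aperiodic(freqs, power, 1, 20)
high_slope, high_intercept = fit_aperiodic(freqs, power, 20, 40)
single_slope, _ = fit_aperiodic(freqs, power, 1, 40)
```

      In this toy case the two range-specific fits recover the two exponents, whereas the single fit returns a compromise slope that describes neither range, which is the motivation given for comparing single and dual fits.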

      (4) "For this scan, participants were instructed to keep their eyes closed and stay as still as possible." Why should it be important to have the eyes closed during a T1w data acquisition? This statement at this location does not make sense.

      To avoid misunderstandings, we removed this statement in this context.

      (5) "Two SC subjects did not complete the frontal cortex scan for the EO condition and were excluded from the statistical comparisons of frontal cortex neurotransmitter concentrations." Why did the authors not conduct whole-brain MRS, which seems to be on the market for quite some time (e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3590062/)?

      Similar to previous work (Coullon et al., 2015; Weaver et al., 2013) our hypothesis was related to the visual cortex, and we chose the frontal cortex voxel as a control. This has now been clarified in the Introduction (Page 4, Lines 103-114), Methods (Page 9, Lines 225-227) and Discussion (Page 25, Lines 662-665).

      (6) In "....during visual stimulation with stimuli that changed in luminance (LU) (Pant et al., 2023)." the authors should provide a link on the visual stimulation, which is provided further below

      In the revised manuscript, we have moved up the description of the visual stimulation (Page 13, Line 336).

      (7) "During the EO condition, participants were asked to fixate on a blank screen." This is not really possible. Typically, resting state EO conditions include a fixation cross, as the participants would not be able to fixate on a blank screen and move their eyes, which would impact the recordings.

      We have now rephrased this as “look towards” with the goal of avoiding eye movements (Page 14, Line 347).

      (8) "Components corresponding to horizontal or vertical eye movements were identified via visual inspection and removed (Plöchl et al., 2012)." It is unclear what the Plöchl reference should serve for. Is the intention of the authors to state that manual (and subjective) visual inspection of the ICA components is adequate? I would recommend removing this reference.

      The intention was to provide the basis for classification during the visual inspection, as opposed to an automated method such as ICLabel.

      We stated this clearly in the revised manuscript (Page 14 Lines 368-370).

      (9) "The datasets were divided into 6.25 s long epochs corresponding to each trial." This is a bit inaccurate, as the trial also included some motor response task. Thus, I assume the 6.25 s are related to the visual stimulation.

      We have modified the sentence accordingly (Page 15, Line 378).

      (10) Figure 2. a & b. Just an esthetic suggestion: I would recommend removing the lines between the EC and EO conditions, as they suggest some longitudinal changes. Unless it is important to highlight the changes between EC and EO within each subject.

      In fact, EC vs. EO was a within-subject factor with expected changes for the EEG and possible changes in the MRS parameters. To allow the reader to track changes due to EC vs. EO for individual subjects (rather than just comparing the change in the mean scores), we use lines.  

      (11) Figure 3A: I would plot the same y-axis range for both groups to make it more comparable.

      We have changed Figure 3A accordingly.

      (12) "flattening of the intercept": replace "flattening", as it is too closely related to slope.

      We have replaced “flattening” with “reduction” (Page 20, Line 517).

      (13) The plotting of only the significant correlation between MRS measures and EEG measures seems to be rather selective reporting. For this type of exploratory analysis, I would recommend plotting all of the scatter plots and moving the entire exploratory analysis to the supplementary (as this provides the smallest evidence of the results).

      We have made clear in the Methods (Page 16, Lines 415-426), Results and Discussion (page 24, Lines 644-645), as well as in the Supplementary material, that the reason for only reporting the significant correlation was that this correlation survived correction for multiple comparisons, while all other correlations did not. We additionally explicitly allude to the Supplementary Material where the plots for all correlations are shown (Results, Page 21, Lines 546-552).

      (14) "Here, we speculate that due to limited structural plasticity after a phase of congenital blindness, the neural circuits of CC individuals, which had adapted to blindness after birth, employ available, likely predominantly physiological plasticity mechanisms (Knudsen, 1998; Mower et al., 1985; Röder et al., 2021), in order to re-adapt to the newly available visual excitation following sight restoration."

      I don't understand the logic here. The CC individuals are congenitally blind, thus why should there be any physiological plasticity mechanism to adapt to blindness, if they were blind at birth?

      With “adapt to blindness” we mean adaptation of a brain to an atypical or unexpected condition when taking an evolutionary perspective (i.e. the lack of vision). We have made this clear in the revised manuscript (Introduction, Page 4, Lines 111-114; Discussion, Page 23, Lines 584-591).

      (15) "An overall reduction in Glx/GABA ratio would counteract the aforementioned adaptations to congenital blindness, e.g. a lower threshold for excitation, which might come with the risk of runaway excitation in the presence of restored visually-elicited excitation."

      This could be tested by actually investigating the visual excitation by visual stimulation studies.

      The visual stimulation condition in the EEG experiment of the present study found a higher aperiodic intercept in CC compared to SC individuals. Given the proposed link between the intercept and spontaneous neural firing (Manning et al., 2009), we interpreted the higher intercept in CC individuals as increased broadband neural firing during visual stimulation (Results Figure 3; Discussion Page 24, Lines 635-640). This idea is compatible with enhanced BOLD responses during an EO condition in CC individuals (Raczy et al., 2022). Future work should systematically manipulate visual stimulation to test this idea.

      (16) As the authors also collected T1w images, was the hypothesis of increased visual cortex thickness in CC investigated?

      This hypothesis was investigated in a separate publication which included this subset of participants (Hölig et al., 2023), and found increased visual cortical thickness in the CC group. We refer to this publication, and related work (Feng et al., 2021) in the present manuscript.

      (17) The entire discussion of age should be omitted, as the current data set is too small to assess age effects.

      We have removed this section and just allude to the fact that we replicated typical age trends to underline the validity of the present data (Page 26, Lines 675-676).

      (18) Table 1 should include the age and the age at the time point of surgery.

      We added age to the revised Table 1. We clarified that in CC individuals, duration of blindness is the same as age at the time point of surgery (Page 6, Line 163).

      (19) Why are no group comparisons of visual acuity reported?

      Lower visual acuity in CC than SC individuals is a well-documented fact.

      We have now added the visual acuity plots for readers (Supplementary Material S1, referred to in the Methods, Page 5, Line 155) which highlight this common finding.

      References (Recommendations to the Authors)

      Adrian, E. D., & Matthews, B. H. C. (1934). The berger rhythm: Potential changes from the occipital lobes in man. Brain. https://doi.org/10.1093/brain/57.4.355

      Coullon, G. S. L., Emir, U. E., Fine, I., Watkins, K. E., & Bridge, H. (2015). Neurochemical changes in the pericalcarine cortex in congenital blindness attributable to bilateral anophthalmia. Journal of Neurophysiology. https://doi.org/10.1152/jn.00567.2015

      Feng, Y., Collignon, O., Maurer, D., Yao, K., & Gao, X. (2021). Brief postnatal visual deprivation triggers long-lasting interactive structural and functional reorganization of the human cortex. Frontiers in Medicine, 8, 752021. https://doi.org/10.3389/FMED.2021.752021

      Gao, R., Peterson, E. J., & Voytek, B. (2017). Inferring synaptic excitation/inhibition balance from field potentials. NeuroImage, 158(March), 70–78. https://doi.org/10.1016/j.neuroimage.2017.06.078

      Hölig, C., Guerreiro, M. J. S., Lingareddy, S., Kekunnaya, R., & Röder, B. (2023). Sight restoration in congenitally blind humans does not restore visual brain structure. Cerebral Cortex, 33(5), 2152–2161. https://doi.org/10.1093/CERCOR/BHAC197

      Juchem, C., & Graaf, R. A. de. (2017). B0 magnetic field homogeneity and shimming for in vivo magnetic resonance spectroscopy. Analytical Biochemistry, 529, 17–29. https://doi.org/10.1016/j.ab.2016.06.003

      Kurcyus, K., Annac, E., Hanning, N. M., Harris, A. D., Oeltzschner, G., Edden, R., & Riedl, V. (2018). Opposite Dynamics of GABA and Glutamate Levels in the Occipital Cortex during Visual Processing. Journal of Neuroscience, 38(46), 9967–9976. https://doi.org/10.1523/JNEUROSCI.1214-18.2018

      Manning, J. R., Jacobs, J., Fried, I., & Kahana, M. J. (2009). Broadband shifts in local field potential power spectra are correlated with single-neuron spiking in humans. The Journal of Neuroscience : The Official Journal of the Society for Neuroscience, 29(43), 13613–13620. https://doi.org/10.1523/JNEUROSCI.2041-09.2009

      Medel, V., Irani, M., Crossley, N., Ossandón, T., & Boncompte, G. (2023). Complexity and 1/f slope jointly reflect brain states. Scientific Reports, 13(1), 21700. https://doi.org/10.1038/s41598-023-47316-0

      Muthukumaraswamy, S. D., & Liley, D. T. (2018). 1/F electrophysiological spectra in resting and drug-induced states can be explained by the dynamics of multiple oscillatory relaxation processes. NeuroImage, 179(November 2017), 582–595. https://doi.org/10.1016/j.neuroimage.2018.06.068

      Oeltzschner, G., Zöllner, H. J., Hui, S. C. N., Mikkelsen, M., Saleh, M. G., Tapper, S., & Edden, R. A. E. (2020). Osprey: Open-source processing, reconstruction & estimation of magnetic resonance spectroscopy data. Journal of Neuroscience Methods, 343, 108827. https://doi.org/10.1016/j.jneumeth.2020.108827

      Ossandón, J. P., Stange, L., Gudi-Mindermann, H., Rimmele, J. M., Sourav, S., Bottari, D., Kekunnaya, R., & Röder, B. (2023). The development of oscillatory and aperiodic resting state activity is linked to a sensitive period in humans. NeuroImage, 275, 120171. https://doi.org/10.1016/J.NEUROIMAGE.2023.120171

      Pant, R., Ossandón, J., Stange, L., Shareef, I., Kekunnaya, R., & Röder, B. (2023). Stimulus-evoked and resting-state alpha oscillations show a linked dependence on patterned visual experience for development. NeuroImage: Clinical, 103375. https://doi.org/10.1016/J.NICL.2023.103375

      Raczy, K., Hölig, C., Guerreiro, M. J. S., Lingareddy, S., Kekunnaya, R., & Röder, B. (2022). Typical resting-state activity of the brain requires visual input during an early sensitive period. Brain Communications, 4(4). https://doi.org/10.1093/BRAINCOMMS/FCAC146

      Rideaux, R., Ehrhardt, S. E., Wards, Y., Filmer, H. L., Jin, J., Deelchand, D. K., Marjańska, M., Mattingley, J. B., & Dux, P. E. (2022). On the relationship between GABA+ and glutamate across the brain. NeuroImage, 257, 119273. https://doi.org/10.1016/J.NEUROIMAGE.2022.119273

      Weaver, K. E., Richards, T. L., Saenz, M., Petropoulos, H., & Fine, I. (2013). Neurochemical changes within human early blind occipital cortex. Neuroscience. https://doi.org/10.1016/j.neuroscience.2013.08.004

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      This manuscript by Neininger-Castro and colleagues presents a novel automatic image analysis method for assessing sarcomeres, the basic units of myofibrils, and validates this tool in a couple of experimental approaches that interfere with sarcomere assembly in iPSC-cardiomyocytes (iPSC-CM).

      Automatic quantification of sarcomeres is definitely something that is useful to the field. I am surprised that there is no reference in the manuscript to SarcTrack, published by Toepfer and colleagues in 2019 (PMID 30700234), which has exactly the same purpose. The advantage of the image analysis software presented in the current manuscript appears to me to be that it can cover both mature sarcomeres and nascent sarcomeres in premyofibrils effectively.

      We whole-heartedly disagree that SarcTrack has the exact same purpose as sarcApp. sarcApp measures more than the frequency of actinin2 images, and can measure real-space quantifications of actinin, myomesin, and titin, which has not been done before in this way. However, SarcTrack is an interesting method that we hope many researchers find helpful in their research. SarcTrack is a particle tracker that outputs the dimensions of the objects found, but does not distinguish between Z-Lines and other actinin2-positive structures (Z-Bodies, adhesions). It also does not group these structures into higher order structures such as myofibrils and muscle stress fibers.

      When going through the manuscript there were a few issues that should be addressed in a revised version of the manuscript:

      1) I am a bit puzzled that they took 1.4 um length as a cutoff length for a mature A-band in their quantifications, since the consensus in the field for thick filament length seems to be 1.6 um?

      We use 1.4 µm as a cutoff length for the length of a Z-Line rather than the A-Band. We believe the reviewer is referring to the width of the A-Band perpendicular to the Z-lines, which is indeed 1.6 µm. However, we are referring to the length of the Z-Lines, which can span anywhere from 1.4 µm to up to 10 or more µm. Thank you for allowing us to make the clarification.

      2) When doing the knockdown for alpha and beta-myosin heavy chain, respectively, why did they not also do a Western blot for the "other" isoform as well (Figure 7)? We know that iPSC-CM express a mixture, so the relatively mild phenotype that they observe in single knockdown experiments may well be due to concomitant upregulation of the expression of the other isoform. In my point of view this should be checked.

      It is likely that in the single knockdown experiments the other isoform is upregulated, which is why we were careful in stating that neither muscle myosin alone is required for sarcomere formation. We do agree this would be an interesting experiment to check beyond the scope of this manuscript.

      3) There seems to be a disconnect between the images for myomesin knockdown shown in Figure 8H and the quantification shown in Figure 8I, which makes me wonder whether the image shown in H middle (MYOM1 (1) KD), where the beta-myosin doublets do not seem to be much affected is really representative?

      The image shown in the middle of H is representative of the mean length of beta-myosin doublets in MYOM1 (1) KD hiCMs. While the beta-myosin doublets are still present and organized, they are significantly shorter. In the zoomed out image, you can appreciate much shorter arrays of beta-myosin doublets that, while extending across the entire cell, are thinner than control cells.

      Reviewer #2 (Public Review):

      Neininger-Castro et al report on their original study entitled "Independent regulation of Z-lines and M-lines during sarcomere assembly in cardiac myocytes revealed by the automatic image analysis software sarcApp". In this study, the research team developed two software tools, yoU-Net and sarcApp, that provide new binarization and sarcomere quantification methods. The authors further utilized human induced pluripotent stem cell-derived cardiomyocytes (hiCMs) as their model to verify their software by staining multiple sarcomeric components with and without the treatment of Blebbistatin, a known myosin II activity inhibitor. With the treatment of different Blebbistatin concentrations, the morphology of sarcomeric proteins was disturbed. These disrupted sarcomeric structures were further quantified using sarcApp and the quantification data supported the phenotype. The authors further investigated the roles of muscle myosins in sarcomere assembly by knocking down MYH6, MYH7, or MYOM in hiCMs. The knockdown of these genes did not affect Z-line assembly yet the knockdown of MYOM affected M-line assembly. The authors demonstrated that different muscle myosins participate in sarcomere assembly in different manners.

      Reviewer #3 (Public Review):

      Neininger-Castro and colleagues developed software tools for the quantification of sarcomeres and sarcomere-precursor features in immunostained human induced pluripotent stem cell-derived cardiac myocytes (hiCMs). In the first part they used a deep-learning-based model called a U-Net to construct and train a network for binarization of immunostained cardiomyocyte images. They also wrote graphical user interface (GUI) software that will assist other labs in using this approach and made it publicly available. They did not compare their approach to existing ones, but an example from one image suggests their binarization tool outperforms Otsu thresholding binarization.

      In the second part they developed a software tool called sarcApp that classifies sarcomere structures in the binarized image as a Z-Line or Z-Body and assigns each to either a myofibril or to stress fibers. The tools can then automatically count and measure multiple features (33 per cell and 24 per myofibril) and report them on a per-cell, per-myofibril, and per-stress-fiber basis.

      To test the tools they used Blebbistatin to inhibit sarcomere assembly and showed that the sarcApp tool could capture changes in multiple features such as fewer myofibrils, fewer Z-Lines, decreased myofibril persistence, decreased Z-Line length and altered myofibril orientation in the Blebbistatin treated cells. With some changes the tool was also shown to quantify sarcomeres in titin and myomesin stained cardiomyocytes.

      Finally they used sarcApp to quantify the changes in sarcomere assembly after siRNA-mediated knockdown of MYH6, MYH7, or MYOM. The analysis indicates that neither MYH6 nor MYH7 knockdown perturbed the assembly of Z- or M-lines, and that knockdown of MYOM perturbed the A-band/M-Line but not the Z-Line assembly according to features captured by the sarcApp tool.

      Overall the authors developed and made publicly available an excellent software tool that will be very useful for labs that are interested in studying sarcomere assembly. Multiple features that are difficult to measure or count manually can be automatically measured by the software quickly and accurately.

      There are however some remaining questions about these tools:

      1) The binarization tool which is tailored to sarcomere image binarization appears promising but was not systematically compared with existing approaches.

      We compared it with the existing approach we used previously in the lab, which was Otsu’s method for binarization. We are not aware of several other binarization approaches to compare to, other than using other machine learning techniques that are less advanced than a U-Net, the current standard in image-to-image translation.
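
      For readers unfamiliar with this baseline, Otsu's method picks the single global intensity threshold that maximizes the between-class variance of the image histogram. The pure-Python sketch below, run on a made-up 8-bit toy image, is illustrative only; it is neither the yoU-Net nor the lab's earlier pipeline, and the function names are assumptions.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    Pixels with value <= t count as background, pixels > t as foreground.
    """
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of background pixels for the current t
    sum_bg = 0  # summed intensity of those background pixels
    for t in range(levels):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # no foreground pixels left
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Proportional to the between-class variance (constant factor dropped).
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(image, threshold):
    """Map a 2D list of intensities to a 0/1 mask."""
    return [[1 if v > threshold else 0 for v in row] for row in image]


# Toy image: dim background with a few bright pixels.
image = [
    [10, 12, 11, 200],
    [9, 13, 210, 205],
    [11, 198, 202, 12],
    [10, 11, 12, 10],
]
t = otsu_threshold([v for row in image for v in row])
mask = binarize(image, t)
```

      Because the threshold is global and depends only on intensity, structures that differ in shape but not in brightness cannot be separated this way, which is one reason a learned image-to-image model can behave differently on the same data.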

      2) How robust is the tool? The tool was tested on images from one type of cardiomyocytes (hiCMs) taken from one lab using Nikon Spinning Disk confocal microscope equipped with Apo TIRF Oil 100X 1.49 NA objective or instant Structured Illumination Microscopy (iSIM), using deconvolution (Microvolution software) and in a specific magnification. It remains to be seen whether the tool would be equally effective with images taken with other microscopy systems, with other cardiomyocytes (chick or neonatal rat), with different magnifications, live imaging, etc.

      We tested the software with several magnifications, with live imaging, and with other tissues. We did not include the information in the manuscript because the data we tested the software with is for future manuscripts studying different aspects of sarcomere formation and maintenance. sarcApp reliably identifies Z-Lines and sarcomeres with deconvolved widefield fluorescence images of hiCMs and frozen human tissue, and we are currently using it to measure zebrafish data for another study. Further, it works for live imaging with an actinin2-GFP (or similar) label. For the titin quantification, we would recommend using only 60-100X magnification, as the titin structures (doublets and rings) are not resolvable at lower magnifications.

      3) The tool was developed for evaluation of sarcomere assembly. The authors show that for this application it can detect the perturbation by Blebbistatin, or knockdown of sarcomeric genes. It remains to be seen if this tool is also useful for assessment of sarcomere structure for other questions beside sarcomere assembly and in other sarcomere pathologies.

      While this is beyond the scope of this specific methods paper, we welcome other researchers to use our software for other questions in other pathologies. We are currently doing the same for other manuscripts from our lab.

      Reviewer #1 (Recommendations For The Authors):

      1) "alpha-actinin..., which border the sarcomeric contractile machinery (thin and thick filaments)": Z-lines do NOT border thick filaments in a relaxed sarcomere.

      We have removed “(thin and thick filaments)” from the text.

      2) myomesin targeting siRNAs (gene name MYOM): there are actually three genes encoding for myomesin family members, specify, which one was targeted (I am assuming MYOM1).

      Thank you for the clarification: we do target MYOM1.

      3) I am not surprised that they found not many mature Z-lines in the absence of both sarcomeric myosins; a similar codependence of assembly of mature Z-discs and the presence of functional thick filaments was previously shown by Geach and colleagues in 2015 (PMID 25845369)

      Thank you for sharing this manuscript: we have added a reference to it in our study.

      Reviewer #2 (Recommendations For The Authors):

      This work offers the possibility to gain more insights into the process of sarcomere assembly through the advancement in sarcomeric or myofibril structure analyses. However, some clarifications are needed from the authors, please see below for the comments.

      1) It is recommended that the authors include the time points for replating and harvesting hiCMs. After replating, the cardiomyocytes require at least three to four days for sarcomeric structures to reform. If the hiCMs were fixed before sarcomere assembly had completed, the staining of sarcomeric proteins including ACTN2 and titin could be compromised and it is difficult to tell if the phenotypes observed were consequences of drug treatments or knockdown of sarcomeric genes or simply because the replating hiCMs were fixed before their sarcomeric structures had fully regrown. It is also recommended that the authors replate hiCMs at a fixed time point to avoid discrepancies in the data.

      Cardiomyocytes do not require three to four days for sarcomeric structures to re-form, and indeed only require 24 hours, with the first sarcomeres typically appearing at ~6 hours. We and others have published several studies demonstrating this (Fenix et al., eLife 2018, Taneja, Neininger and Burnette MBoC 2020, Chen et al. Nature Methods, 2022). While sarcomeres continue to develop and turn over after this time, our lab is interested in the beginning steps of sarcomerogenesis rather than the turnover of mature structures.

      2) The sarcApp automatically identifies Z-lines and Z-bodies; however, is there an option for the users to set their own thresholds? Some users may select different criteria when quantifying sarcomeres. Moreover, the Z-lines and Z-bodies identified by the software are not always accurate. Can the users modify the list manually in an unbiased way? If this function is not available, the authors may consider adding this function to their software. sarcApp measures Z-line and Z-body length but does not measure Z-line and Z-body width, but sometimes it is also necessary to measure the width.

      Absolutely, users can modify the thresholds to identify Z-Lines and Z-Bodies. There is not a way for users to modify the list in an unbiased way per se, as editing the list of Z-Lines and Z-Bodies based on non-mathematical measurements is inherently biased, but the user is free to add in other Z-Lines and Z-Bodies as they wish. In this context, “manually” and “unbiased” are mutually exclusive.
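
      A user-adjustable threshold of this kind can be pictured as a simple length cutoff. The sketch below is a hypothetical illustration and not sarcApp's actual code or API: the function name, the data layout, and the use of length as the only criterion are assumptions, with the 1.4 µm default taken from the Z-Line cutoff discussed in the response to Reviewer #1.

```python
def classify_structures(lengths_um, z_line_min_um=1.4):
    """Split actinin2-positive structures into Z-Lines and Z-Bodies by length.

    Structures at or above the cutoff are treated as Z-Lines; shorter ones
    as Z-Bodies. The cutoff is a parameter so users can apply their own.
    """
    z_lines = [x for x in lengths_um if x >= z_line_min_um]
    z_bodies = [x for x in lengths_um if x < z_line_min_um]
    return z_lines, z_bodies


lengths = [0.4, 0.9, 1.4, 2.3, 5.0]  # made-up structure lengths in µm

default_lines, default_bodies = classify_structures(lengths)
strict_lines, strict_bodies = classify_structures(lengths, z_line_min_um=2.0)
```

      Changing the cutoff moves structures between the two classes by a stated rule, which keeps the classification reproducible in a way that hand-editing individual entries does not.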

      3) It is recommended that the authors include the original images beside the sarcomeric structures identified by sarcApp (Figure 2A, 2C, 4C-F and more). It would be easier to compare the original Z-lines and Z-bodies with those identified by the software.

      We have added these in Author response image 1.

      Author response image 1.

      Uncropped images and merges from Figures 2, 4 and 6, respectively.

      4) The M-line length quantification data in Figure 3G, 5F, and 6H showed different colored-dots labeling n1 to n3, but the authors did not discuss the significance of these symbols.

      We are not sure what the reviewer means by this statement: there is no significance of the different colored dots other than to mark the biological replicate shown. These graphs were created using SuperPlots, which was not stated in the original methods. It has now been added to the Statistical Analysis section.
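
      As a brief illustration of the SuperPlots convention, individual measurements are shown color-coded by biological replicate, and the summary statistics are based on one mean per replicate rather than on the pooled measurements. The sketch below computes those per-replicate means; the values and variable names are made up for illustration and are not data from the manuscript.

```python
from statistics import mean

# Made-up M-Line lengths (µm); keys n1..n3 stand for biological replicates.
measurements = {
    "n1": [2.1, 2.4, 2.2, 2.3],
    "n2": [2.0, 2.2, 2.1],
    "n3": [2.5, 2.4, 2.6, 2.3, 2.2],
}

# One mean per biological replicate (the larger markers on a SuperPlot).
replicate_means = {rep: mean(vals) for rep, vals in measurements.items()}

# Summary based on replicates (n = 3), not on individual cells (n = 12).
summary_mean = mean(list(replicate_means.values()))
pooled_mean = mean([v for vals in measurements.values() for v in vals])
```

      The replicate-based and pooled means differ whenever replicates contribute unequal numbers of measurements, which is why the colored dots identify the replicate each point belongs to.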

      5) Can the authors elaborate more on the reasons why they treated with Blebbistatin at concentrations of 50µM and 100µM? Previous studies showed that 25µM of Blebbistatin was sufficient to delay the transformation of cardiomyocytes (PMID 27072942). Can the authors also comment on why they selected 6 hours, 12 hours, and 24 hours post replating for drug treatment? Moreover, the drug treatment at different time points was only done on ACTN2 but not titin or myomesin.

      We selected 6, 12, and 24 hours for actinin2 to show the time course of sarcomere formation and to show that sarcomeres are developed by 24 hours, as also mentioned above. We are interested in future studies of the time course of titin and myomesin over time, and are working on it in the lab.

      We chose 50 and 100 µM Blebbistatin as these completely blocked sarcomere assembly whereas treatment with 25 µM did not. This manuscript is a methods paper that aims to validate sarcApp and show how it could be used. We did not intend for it to be a comprehensive study of how different concentrations of blebbistatin affect sarcomere assembly.

      We are also unsure what the reviewer means by “transformation of cardiomyocytes”. The manuscript with the PMID of 27072942 does not address this issue. Its stated aim is to “review and analyze readmission data for patients who received a continuous flow left ventricular assist device (LVAD)”. We assume the reviewer is referring to differentiation. The model system we developed and published in eLife in 2018 does not use differentiating iPSC cardiac myocytes. The hiCMs we use are terminally differentiated but still immature, as they are more transcriptionally similar to primary fetal myocytes. As such, they do not maintain their sarcomeres when they are removed from the 96-well plate and plated onto a glass coverslip for high-resolution microscopy. They assemble sarcomeres within 24 hours, with the sarcomeres forming close to the dorsal membrane and then rearranging over time (e.g., moving from the top of the cell to the bottom) (Fenix et al., eLife 2018). With that said, we do agree with the reviewer that a study of sarcomere assembly in the context of cardiac myocyte differentiation would be a fascinating direction for future studies, and we think sarcApp could facilitate such studies.

      6) The authors mentioned that the myofibrils of Z-line, titin, and M-line were randomly oriented after Blebbistatin treatments. The myofibrils were randomly oriented for titin and M-line. However, the orientation of Z-line after 50µM Blebbistatin treatment was not necessarily random, only the orientation after 100µM Blebbistatin treatment was randomized. The authors might consider changing bar graph to other types of charts if the orientation was really randomized after quantification.

      We find that the bar chart is the most informative to us, but users can consider other types of charts in their analyses.

      7) It is recommended that the authors include images staining ACTN2 at lower magnifications (Figure 1A, 1C). With current images, it is true that yoU-Net can separate Z-lines from Z-bodies yet it is difficult to tell if yoU-Net can still distinguish Z-lines from Z-bodies with larger images or it only applies to a small portion of the image.

      The yoU-Net can distinguish Z-Lines from Z-Bodies with images of any size, as image size (height vs. width in pixels) does not affect how binarization occurs. During binarization, the only pixel requirement is that the width and height are divisible by 8 (for downsampling purposes). Usually this is not the case with raw images, so the image borders are slightly cropped to make them usable. In terms of resolution, we recommend using 60X-100X objectives on confocal or superresolution data for the clearest results. We have, however, successfully binarized deconvolved widefield images at 100X as well.
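
      The divisibility requirement can be met with a small cropping step, sketched below. This is an illustration rather than the yoU-Net code: the function name and the choice of a centered crop are assumptions, since the response only states that the borders are slightly cropped. A factor of 8 would be consistent with three successive 2x downsampling steps, although the number of steps is likewise our assumption.

```python
def crop_to_multiple(image, multiple=8):
    """Center-crop a 2D list so its height and width are divisible by `multiple`."""
    height, width = len(image), len(image[0])
    new_h = height - height % multiple
    new_w = width - width % multiple
    top = (height - new_h) // 2   # rows trimmed from the top edge
    left = (width - new_w) // 2   # columns trimmed from the left edge
    return [row[left:left + new_w] for row in image[top:top + new_h]]


# 19 x 21 toy image whose pixel value encodes its original (row, column).
image = [[r * 100 + c for c in range(21)] for r in range(19)]
cropped = crop_to_multiple(image)
```

      Here a 19 x 21 image becomes 16 x 16, losing at most 7 pixels per dimension, so the crop is negligible for typical micrographs.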

      8) The authors mentioned that the knockdown of MYH7 did not affect Z-lines and M-lines; however, the structures of ACTN2, myomesin, and titin appeared more organized as compared to those in control.

      We agree that the sarcomeres and myofibrils look slightly more organized. We meant to state that the knockdown did not negatively affect Z-Lines and M-Lines, and we have updated the manuscript to be more accurate.

      9) Please provide the merge images for Fig. 4D, 4E, 6B

      The merge images for Fig. 4D, 4E, and 6B are included with the original images requested above (point 3).

      10) In the text, they described: "antibodies to the titin I-band localize to both MSFs and sarcomeres in hiCMs (Figure 4A). Titin forms ring-like structures around the Z-Bodies of MSFs that are closer to the apparent sarcomere transition point (Figure 4A)". However, based on the antibody information they provided, it is not explicitly recognized for N- or C-terminus TITIN. Please provide TTN N-terminus or TTN C-terminus co-stainings with ACTN2 antibody to understand which part of TTN together with ACTN2 forms a Z-Body.

      The TTN antibody is an N-terminal antibody localizing to the I-Band region of sarcomeres. We agree with the reviewer that a more thorough study of titin will be of interest and we are currently undertaking such a study. However, this is a methods paper presenting a tool. While some of the data we present does point to mechanistic hypotheses, it is beyond the scope of this study to fully characterize titin during sarcomere assembly.

      11) TITIN doublet was used to indicate a sarcomere in Fig. 4C-D. Moreover, they also used another combination (myomesin and F-ACTIN) to label a sarcomere in Fig. 6D. Can they compare the difference between these two methods or by using these two methods (TITIN doublet) and (myomesin and F-ACTIN), how is the average length of sarcomere? Will the sarcomere length be the same?

      We noted in the manuscript that, because titin doublets wrap around the ends of Z-Lines, the average titin doublet will be approximately 0.3 µm longer than the Z-Line. We did not expect to see a difference in the lengths of myomesin M-Lines and mature actinin2 Z-Lines, and indeed we do not see major differences in the average lengths (between 2.0 and 2.5 µm in 24 hour control cells).

      12) They used siRNA method to knockdown MYH6, MYH7 and MYOM and concluded that the knockdown of these genes did not affect the Z-line assembly. Even though they showed very nice knockdown efficiency of these proteins, they should (1) co-stain MYH6/TITIN/actinin2 and MYH6/myomesin/actinin2 for Fig. 7C. (2) MYH7/TITIN/actinin2 and MYH7/myomesin/actinin2 for Fig. 7I. (3) MYOM1/TITIN/actinin2 and MYOM2/TITIN/actinin2 for Fig. 8A. (4) MYH7/MYOM1 and MYH7/MYOM2 for Fig. 8H to make sure the cells they measured were truly knockdown-positive cells.

      The antibodies for alpha and beta myosin are not very efficient for immunofluorescence, and work best for western blots. We decided also to choose a random subset of the cells on the dish to be sure to eliminate any risk of cherry-picking. While imaging cells on the dish, we looked only at the DAPI nuclear channel and selected 50 cells minimum per dish with only this channel, then imaged the other channels.

      Minor comments:

      1) Well-organized sarcomere structure on DMSO treated cells in Fig.5A and Fig. 6A, but it was disarray in Fig. S3M. Why?

      Figure S3 shows hiCMs that have only been allowed to spread for 6 hours, which have not formed mature sarcomeres yet, hence the disarray.

      2) Fig 1A, Fig2B: please label the name of the antibody, not the actin filament

      We used phalloidin labelling here, which marks actin filaments. We have updated the figure legends to be clearer. Thank you!

      3) Fig. 7I: actinin2 instead of actinin

      Thank you for catching this! We have fixed it.

      Reviewer #3 (Recommendations For The Authors):

      Testing the app using images shot by other microscopy systems, magnifications, and cardiomyocytes from other species, as noted in the public review above, should make the app even more widely useful.

      A more formal head-to-head comparison with other approaches will be more convincing in showing the new tool is superior.

      I also think that a more detailed protocol for using the app will help other investigators.

      The app counts and measures many features, but it is not always clear how and using what algorithm these are measured. Including these details in a protocol or even as comments in the code will be very helpful for others.

      The protocol found on the public GitHub for the app will help other investigators to download, use, and understand the application. We have received contact from researchers who have been able to use the application without assistance from us, which is a good sign that the application is user-friendly and that the online protocol is sufficient.

    1. Author response:

      The following is the authors’ response to the original reviews.

      The reviewers praised multiple aspects of our study. Reviewer 1 noted that “the work aligns well with current research trends and will greatly interest researchers in the field.” Reviewer 2 highlighted the unique capability of our imaging approach, which “allows for investigation of the heterogeneity of response across individual dopamine axons, unlike other common approaches such as fiber photometry.” Reviewer 3 commented that “the experiments are beautifully executed” and “are revealing novel information about how aversive and rewarding stimuli is encoded at the level of individual axons, in a way that has not been done before.”

      In addition to the positive feedback, the reviewers also provided useful criticisms and suggestions, some of which may not be fully addressed in a single study. For instance, questions regarding whether dopamine axons encode the valence or specific identity of the stimuli, or the most salient aspects of the environment, remain open. At the same time, as all the reviewers agreed, our report on the diversity of dopamine axonal responses using a novel imaging design introduces significant new insights to the neuroscience community. Following the reviewers’ recommendations, we have refrained from making interpretations that could be perceived as overinterpretation, such as concluding that “dopamine axons are involved in aversive processing.” This has necessitated extensive revisions, including modifying the title of our manuscript to make clear that the novelty of our work is revealing ‘functional diversity’ using our new imaging approach.

      Below, we respond to the reviewers’ comments point by point.

      eLife assessment

      This valuable study shows that distinct midbrain dopaminergic axons in the medial prefrontal cortex respond to aversive and rewarding stimuli and suggests that they are biased toward aversive processing. The use of innovative microprism-based two-photon calcium imaging to study single axon heterogeneity is solid, although the experimental design could be optimized to distinguish aversive valence from stimulus salience and identity in this dopamine projection. This work will be of interest to neuroscientists working on neuromodulatory systems, cortical function and decision making.

      Reviewer #1

      Summary:

      In this manuscript, Abe and colleagues employ in vivo 2-photon calcium imaging of dopaminergic axons in the mPFC. The study reveals that these axons primarily respond to unconditioned aversive stimuli (US) and enhance their responses to initially-neutral stimuli after classical association learning. The manuscript is well-structured and presents results clearly. The utilization of a refined prism-based imaging technique, though not entirely novel, is well-implemented. The study's significance lies in its contribution to the existing literature by offering single-axon resolution functional insights, supplementing prior bulk measurements of calcium or dopamine release. Given the current focus on neuromodulator neuron heterogeneity, the work aligns well with current research trends and will greatly interest researchers in the field.

      However, I would like to highlight that the authors could further enhance their manuscript by addressing study limitations more comprehensively and by providing essential details to ensure the reproducibility of their research. In light of this, I have a number of comments and suggestions that, if incorporated, would significantly contribute to the manuscript's value to the field.

      Strengths:

      • Descriptive.

      • Utilization of a well-optimized prism-based imaging method.

      • Provides valuable single-axon resolution functional observations, filling a gap in existing literature.

      • Timely contribution to the study of neuromodulator neuron heterogeneity.

      We thank the reviewer for this positive assessment.

      Weaknesses:

      (1) It's important to fully discuss the fact that the measurements were carried out only on superficial layers (30-100 µm), while major dopamine projections target deep layers of the mPFC as discussed in the cited literature (Vander Weele et al., 2018) and as illustrated in FigS1B,C. This limitation should be explicitly acknowledged and discussed in the manuscript, especially given the potential functional heterogeneity among dopamine neurons in different layers. This potential across-layer heterogeneity could also be the cause of discrepancy among past recording studies with different measurement modalities. Also, mentioning technical limitations would be informative. For example: how deep can the authors perform 2p-imaging through the prism? Was the "30-100 µm" range the maximum depth the authors could reach?

      Thank you for pointing out this important issue about layer differences.

      It is possible that the mesocortical pathway has layer-specific channels, with some neurons targeting supragranular layers and others targeting infragranular ones. Alternatively, it is also plausible that the axons of the same neurons branch into both superficial and deep layers. This is a critical issue that has not been investigated in anatomical studies and will require single-cell labeling of dopamine neurons (Matsuda et al 2009 and Aransay et al 2015). We now discuss this issue in the Discussion.

      As for the imaging depth of 30–100 µm, we were unable to visualize deeper axons in a live view mode. Our imaging system has already been optimized to detect weak signals (e.g., we have employed an excitation wavelength of 980 nm, dispersion compensation, and a hybrid photodetector). It is possible that future studies using improved imaging approaches may be able to visualize deeper layers. Importantly, sparse axons in the supragranular layers are advantageous in detecting weak signals; dense labeling of axons would increase the background fluorescence relative to signals. We now reference this layer issue in the Results and Discussion sections.

      (2) In the introduction, it seems that the authors intended to refer to Poulin et al. 2018 regarding molecular/anatomical heterogeneity of dopamine neurons, but they inadvertently cited Poulin et al. 2016 (a general review on scRNAseq). Additionally, the statement that "dopamine neurons that project to the PFC show unique genetic profiles (line 85)" requires clarification, as Poulin et al. 2018 did not specifically establish this point. Instead, they found at least the Vglut2/Cck+ population projects into mPFC, and they did not reject the possibility of other subclasses projecting to mPFC. Rather, they observed denser innervation with DAT-cre, suggesting that non-Vglut2/Cck populations would also project to mPFC. Discuss the potential molecular heterogeneity among mPFC dopamine axons in light of the sampling limitation mentioned earlier.

      We thank the reviewer for pointing this out. Genetic profiles of PFC-projecting DA neurons are still being investigated, so describing them as “unique” was misleading. We have edited the Introduction accordingly, and now discuss this issue in detail in the Discussion.

      (3) I find the data presented in Figure 2 to be odd. Firstly, the latency of shock responses in the representative axons (right panels of G, H) is consistently very long - nearly 500ms. It raises a query whether this is a biological phenomenon or if it stems from a potential technical artifact, possibly arising from an issue in synchronization between the 2-photon imaging and stimulus presentation. My reservations are compounded by the notable absence of comprehensive information concerning the synchronization of the experimental system in the method section.

      The synchronization of the stimulus and data acquisition is accomplished at a sub-millisecond resolution. We use a custom-made MATLAB program that sends TTL commands to standard imaging software (ThorImage or ScanImage) and a stimulator for electrical shocks. All events are recorded as analogue inputs to a different DAQ to ensure synchronization. We have provided additional details regarding the configuration in the Methods section.

      We consider that the long latency of shock response is biological. For instance, a similar long latency was found after electrical shock in a photometry imaging study (Kim, …, Deisseroth, 2016).

      Secondly, there appear to be irregularities in Panel J. While the authors indicate that "Significant axons were classified as either reward-preferring (cyan) or aversive-preferring (magenta), based on whether the axons are above or below the unity line of the reward/aversive scatter plot (Line 566)," a cyan dot slightly but clearly deviates above the unity line (around coordinates (x, y) = (20, 21)). This needs clarification. Lastly, when categorizing axons for analysis of conditioning data in Fig3 (not Fig2), the authors stated "The color-coded classification (cyan/magenta) was based on k-means clustering, using the responses before classical conditioning (Figure 2J)". I do not understand why the authors used different classification methods for two almost identical datasets.

      We thank the reviewer for pointing out these insufficient descriptions. We classified the axons using k-means clustering, and the separation of the two clusters happened to roughly coincide with the unity line of the reward/aversive scatter plot in Fig 2J. In other words, we did not use the unity line to classify the data points (which is why the color separation of the histogram is not at 45 degrees). We have clarified this point in the Methods section.
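
      As an illustration of why a k-means split need not follow the unity line exactly, the sketch below clusters synthetic response pairs into two groups and then compares the cluster labels with an above/below-unity rule. Everything here is assumed for illustration: the data are simulated, the column order (reward response, then aversive response) is arbitrary, and the farthest-first initialization is a simple deterministic choice rather than the setting used in our analysis.

```python
import numpy as np

def farthest_first_init(points, k):
    """Pick initial centroids deterministically: first point, then farthest points."""
    centroids = [points[0]]
    for _ in range(k - 1):
        dist_to_nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(points[dist_to_nearest.argmax()])
    return np.array(centroids)

def kmeans(points, k=2, n_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    centroids = farthest_first_init(points, k)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

# Synthetic responses: column 0 = reward response, column 1 = aversive response.
rng = np.random.default_rng(1)
reward_pref = rng.normal(loc=[20.0, 5.0], scale=2.0, size=(30, 2))
aversive_pref = rng.normal(loc=[5.0, 25.0], scale=2.0, size=(70, 2))
responses = np.vstack([reward_pref, aversive_pref])

labels, centroids = kmeans(responses, k=2)

# Compare the clustering with a simple unity-line rule (aversive > reward).
above_unity = responses[:, 1] > responses[:, 0]
aversive_cluster = labels == labels[-1]
agreement = np.mean(above_unity == aversive_cluster)
print(f"agreement with unity-line rule: {agreement:.2f}")
```

      With well-separated groups the two rules agree, but the k-means boundary is the perpendicular bisector between the two centroids, so it sits at 45 degrees through the origin only if the centroids happen to be symmetric about the unity line.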

      (4) In connection with Point 3, conducting separate statistical analyses for aversive and rewarding stimuli would offer a fairer approach. This could potentially reveal a subset of axons that display responses to both aversive and appetitive stimuli, aligning more accurately with the true underlying dynamics. Moreover, the characterization of Figure 2J as a bimodal distribution while disregarding the presence of axons responsive to both aversive and appetitive cues seems somewhat arbitrary and circular logic. A more inclusive consideration of this dual-responsive population could contribute to a more comprehensive interpretation.

      We also attempted k-means clustering with additional dimensions (e.g., temporal domains as shown in Fig. 3I, J), but no additional clusters were evident. We note that the lack of other clusters does not exclude the possibility of their existence, which may only become apparent with a substantial increase in the number of samples. In the current report, we present the clusters that were the easiest/simplest for us to identify.

      Additionally, we have revised our manuscript to reflect that many axons respond to both reward and aversive stimuli, and that aversive-preferring axons do not exclusively respond to the aversive stimulus.

      (5) The contrast in initialization to novel cues between aversive and appetitive axons mirrors findings in other areas, such as the tail-of-striatum (TS) and ventral striatum (VS) projecting dopamine neurons (Menegas et al., 2017, not 2018). You might consider citing this very relevant study and discussing potential collateral projections between mPFC and TS or VS.

      Thank you for pointing this out. We have now included Menegas et al., 2017, and also discuss the possibility of collaterals to these areas. In addition, we also referred to Azcorra et al., 2023 - this was published after our initial submission.

      (6) The use of correlation values (here >0.65) to group ROIs into axons is common but should be justified based on axon density in the FOV and imaging quality. It's important to present the distribution of correlation values and demonstrate the consistency of results with varying cut-off values. Also, provide insights into the reliability of aversive/appetitive classifications for individual ROIs with high correlations. Importantly, if you do the statistical testing and aversive/appetitive classifications for individual ROIs with above-threshold high correlation (to be grouped into the same axon), do they always fall into the same category? How many false positives/false negatives are observed?


      "Our results remained similar for different correlation threshold values (Line 556)" (data not shown) is obsolete.

      We have conducted additional analysis using correlation values 0.5 and 0.3 that resulted in a smaller number of axon terminals. In essence, the relationship between reward responses and aversive responses remained very similar to Fig. 2J, K.
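
      For readers who wish to repeat this kind of threshold check on their own data, one possible sketch is shown below. It assumes that ROIs are merged into one putative axon whenever they are linked by a chain of pairwise correlations above the threshold (connected components); this grouping rule, the function name `group_rois`, and the synthetic traces are illustrative assumptions, not a description of our pipeline.

```python
import numpy as np

def group_rois(traces, threshold=0.65):
    """Group ROIs whose activity traces are linked by correlations above `threshold`.

    traces: array of shape (n_rois, n_timepoints).
    Returns an integer group label for every ROI.
    """
    corr = np.corrcoef(traces)
    n_rois = corr.shape[0]
    labels = -np.ones(n_rois, dtype=int)
    current = 0
    for start in range(n_rois):
        if labels[start] != -1:
            continue
        # Flood-fill over the "correlation above threshold" graph.
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            neighbors = np.flatnonzero((corr[i] > threshold) & (labels == -1))
            labels[neighbors] = current
            stack.extend(neighbors.tolist())
        current += 1
    return labels

# Synthetic example: 3 ROIs on one axon, 2 ROIs on another, plus independent noise.
rng = np.random.default_rng(0)
n_timepoints = 2000
axon_a = rng.normal(size=n_timepoints)
axon_b = rng.normal(size=n_timepoints)
traces = np.vstack(
    [axon_a + 0.3 * rng.normal(size=n_timepoints) for _ in range(3)]
    + [axon_b + 0.3 * rng.normal(size=n_timepoints) for _ in range(2)]
)

for threshold in (0.3, 0.5, 0.65):
    groups = group_rois(traces, threshold)
    print(threshold, groups.tolist())
```

      In this clean synthetic case the grouping is identical at all three thresholds; with real data, lower thresholds merge more ROIs and therefore yield fewer putative axons.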

      Author response image 1.

      Reviewer #2 (Public Review):

      Summary:

      This study aims to address existing differences in the literature regarding the extent of reward versus aversive dopamine signaling in the prefrontal cortex. To do so, the authors chose to present mice with both a reward and an aversive stimulus during different trials each day. The authors used high spatial resolution two-photon calcium imaging of individual dopaminergic axons in the medial PFC to characterize the response of these axons to determine the selectivity of responses in unique axons. They also paired the reward (water) and an aversive stimulus (tail shock) with auditory tones and recorded across 12 days of associative learning.

      The authors find that some axons respond to both reward and aversive unconditioned stimuli, but overall, there is a strong preference to respond to aversive stimuli consistent with expectations from prior studies that used other recording methods. The authors find that both of their two auditory stimuli initially drive responses in axons, but that with training axons develop more selective responses for the shock associated tone indicating that associative learning led to changes in these axons' responses. Finally, the authors use anticipatory behaviors during the conditioned stimuli and facial expressions to determine stimulus discrimination and relate dopamine axons signals with this behavioral evidence of discrimination. This study takes advantage of cutting-edge imaging approaches to resolve the extent to which dopamine axons in PFC respond to appetitive or aversive stimuli. They conclude that there is a strong bias to respond to the aversive tail shock in most axons and weaker more sparse representation of water reward.

      Strengths:

      The strength of this study is the imaging approach that allows for investigation of the heterogeneity of response across individual dopamine axons, unlike other common approaches such as fiber photometry which provide a measure of the average population activity. The use of appetitive and aversive stimuli to probe responses across individual axons is another strength.

      We thank the reviewer for this positive assessment.

      Weaknesses:

      A weakness of this study is the design of the associative conditioning paradigm. The use of only a single reward and single aversive stimulus makes it difficult to know whether these results are specific to the valence of the stimuli versus the specific identity of the stimuli. Further, the reward presentations are more numerous than the aversive trials making it unclear how much novelty and habituation account for results. Moreover, the training seems somewhat limited by the low number of trials and did not result in strong associative conditioning. The lack of omission responses reported may reflect weak associative conditioning. Finally, the study provides a small advance in our understanding of dopamine signaling in the PFC and lacks evidence for if and what might be the consequence of these axonal responses on PFC dopamine concentrations and PFC neuron activity.

      We thank the reviewer for the suggestions.

      We agree that interpreting the response change during classical conditioning is not straightforward. Although the reward and aversive stimuli we employed are commonly used in the field, future studies with more sophisticated paradigms will be necessary to address whether dopamine axons encode the valence of the stimuli, the specific identity of the stimuli, or novelty and habituation. In our current manuscript, we refrain from concluding that distinct groups of neurons encode different valences. In fact, many axons respond to both stimuli, at different ratios. We have removed descriptions that may suggest exclusive coding of reward or aversive processing. Additionally, we have extensively discussed possible interpretations.

      In terms of the strength of the conditioning association, behavioral results indicated that the learning plateaued – anticipatory behaviors did not increase during the last two phases when the conditioned span was divided into six phases (Figure 3–figure supplement 1).

      Our goal in the current manuscript is to provide new insight into the functional diversity of dopamine axons in the mPFC. Investigating the impact of dopamine axons on local dopamine concentration and neural activity in the mPFC is important but falls beyond the scope of our current study. In particular, given the functional diversity of dopamine axons, interpreting bulk optogenetic or chemogenetic axonal manipulation experiments would not be straightforward. As suggested, measuring the dopamine concentration through two-photon imaging of dopamine sensors and monitoring the activity of dopamine recipient neurons (e.g., D1R- or D2R-expressing neurons) is a promising approach that we plan to undertake in the near future.

      Reviewer #3 (Public Review):

      Summary:

      The authors image dopamine axons in medial prefrontal cortex (mPFC) using microprism-mediated two-photon calcium imaging. They image these axons as mice learn that two auditory cues predict two distinct outcomes, tailshock or water delivery. They find that some axons show a preference for encoding of the shock and some show a preference for encoding of water. The authors report a greater number of dopamine axons in mPFC that respond to shock. Across time, the shock-preferring axons begin to respond preferentially to the cue predicting shock, while there is a less pronounced increase in the water-responsive axons that acquire a response to the water-predictive cue (these axons also increase non-significantly to the shock-predictive cue). These data lead the authors to argue that dopamine axons in mPFC preferentially encode aversive stimuli.

      Strengths:

      The experiments are beautifully executed and the authors have mastered an impressively complex technique. Specifically, they are able to image and track individual dopamine axons in mPFC across days of learning. This technique is used the way it should be: the authors isolate distinct dopamine axons in mPFC and characterize their encoding preferences and how this evolves across learning of cue-shock and cue-water contingencies. Thus, these experiments are revealing novel information about how aversive and rewarding stimuli is encoded at the level of individual axons, in a way that has not been done before. This is timely and important.

      We thank the reviewer for this positive assessment.

      Weaknesses:

      The overarching conclusion of the paper is that dopamine axons preferentially encode aversive stimuli. This is prevalent in the title, abstract, and throughout the manuscript. This is fundamentally confounded. As the authors point out themselves, the axonal response to stimuli is sensitive to outcome magnitude (Supp Fig 3). That is, if you increase the magnitude of water or shock that is delivered, you increase the change in fluorescence that is seen in the axons. Unsurprisingly, the change in fluorescence that is seen to shock is considerably higher than water reward.

      We agree that the interpretation of our results is not straightforward. Our current manuscript now focuses on our strength, which is reporting the functional diversity of dopamine axons. Therefore, we avoid using the word ‘encode’ when describing the response.

      We believe that our results could reconcile the apparent discrepancy as to why some previous studies reported only aversive responses while others reported reward responses. In particular, if the reward volume were very small, the reward response could go undetected.

      Further, when the mice are first given unexpected water delivery and have not yet experienced the aversive stimuli, over 40% of the axons respond [yet just a few lines below the authors write: "Previous studies have demonstrated that the overall dopamine release at the mPFC or the summed activity of mPFC dopamine axons exhibits a strong response to aversive stimuli (e.g., tail shock), but little to rewards", which seems inconsistent with their own data].

      We always recorded the reward and aversive response together, which might have confused the reviewer. Therefore, there is no inconsistency in our data. We have clarified our methods and reasoning accordingly.

      Given these aspects of the data, it could be the case that the dopamine axons in mPFC encode different types of information and delegate preferential processing to the most salient outcome across time.

      This is certainly an exciting interpretation, so we have included it in our discussion. Meanwhile, ‘the most salient outcome’ alone cannot fully capture the diverse response patterns of the dopaminergic axons, particularly reward-preferring axons. We discuss our findings in more detail in the revised manuscript.

      The use of two similar sounding tones (9 kHz and 12 kHz) for the reward and aversive predicting cues is likely to enhance this as it requires a fine-grained distinction between the two cues in order to learn effectively. There is considerable literature on mPFC function across species that would support such a view. Specifically, theories of mPFC function (in particular prelimbic cortex, which is where the axon images are mostly taken) generally center around resolution of conflict in what to respond, learn about, and attend to. That is, mPFC is important for devoting the most resources (learning, behavior) to the most relevant outcomes in the environment. These data, then, provide a mechanism for this to occur in mPFC. That is, dopamine axons signal to the mPFC the most salient aspects of the environment, which should be preferentially learned about and responded towards. This is also consistent with the absence of a negative prediction error during omission: the dopamine axons show increases in responses during receipt of unexpected outcomes, but do not encode negative errors. This supports a role for this projection in helping to allocate resources to the most salient outcomes and their predictors, and not learning per se. Below are just a few references from the rich literature on mPFC function (some consider rodent mPFC analogous to DLPFC, some mPFC), which advocate for a role in this region in allocating attention and cognitive resources to most relevant stimuli, and do not indicate preferential processing of aversive stimuli.

      Distinguishing between 9 kHz and 12 kHz sound tones may not be that difficult, considering anticipatory licking and running are differentially manifested. In addition, previous studies have shown that mice can distinguish between two sound tones when they are separated by 7% (de Hoz and Nelken 2014). Nonetheless, we agree with the attractive interpretation that “the mPFC devotes the most resources (learning, behavior) to the most relevant outcomes in the environment” and that dopamine is a mechanism for this. Therefore, we discuss this interpretation in the revised text.
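
      The size of the gap between the two tones can be made concrete with a short calculation. The sketch assumes that the 7% figure refers to the frequency difference relative to the lower tone; how the cited study defines the percentage is not restated here, so the comparison is indicative only.

```python
import math

low_khz, high_khz = 9.0, 12.0
discrimination_limit = 0.07  # 7% separation, as cited above (definition assumed)

# Separation of the two cue tones relative to the lower tone, and in octaves.
relative_separation = (high_khz - low_khz) / low_khz
separation_octaves = math.log2(high_khz / low_khz)

print(f"relative separation: {relative_separation:.1%}")
print(f"separation in octaves: {separation_octaves:.3f}")
print(f"ratio to the 7% limit: {relative_separation / discrimination_limit:.1f}x")
```

      Under this assumption, the two cue tones differ by roughly a third of the lower frequency, several times the cited discrimination limit.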

      References:

      (1) Miller, E. K., & Cohen, J. D. (2001). An integrative theory of prefrontal cortex function. Annual review of neuroscience, 24(1), 167-202.

      (2) Bissonette, G. B., Powell, E. M., & Roesch, M. R. (2013). Neural structures underlying set-shifting: roles of medial prefrontal cortex and anterior cingulate cortex. Behavioural brain research, 250, 91-101.

      (3) Desimone, R., & Duncan, J. (1995). Neural mechanisms of selective visual attention. Annual review of neuroscience, 18(1), 193-222.

      (4) Sharpe, M. J., Stalnaker, T., Schuck, N. W., Killcross, S., Schoenbaum, G., & Niv, Y. (2019). An integrated model of action selection: distinct modes of cortical control of striatal decision making. Annual review of psychology, 70, 53-76.

      (5) Ridderinkhof, K. R., Ullsperger, M., Crone, E. A., & Nieuwenhuis, S. (2004). The role of the medial frontal cortex in cognitive control. Science, 306(5695), 443-447.

      (6) Nee, D. E., Kastner, S., & Brown, J. W. (2011). Functional heterogeneity of conflict, error, task-switching, and unexpectedness effects within medial prefrontal cortex. Neuroimage, 54(1), 528-540.

      (7) Isoda, M., & Hikosaka, O. (2007). Switching from automatic to controlled action by monkey medial frontal cortex. Nature neuroscience, 10(2), 240-248.

      Reviewer #1 (Recommendations For The Authors):

      Specific Suggestions and Questions on the Methods Section:

      In general, the methods part is not well documented and sometimes confusing. Thus, as it stands, it hinders reproducible research. Specific suggestions/questions are listed in the following section.

      (1) Broussard et al. 2018 introduced axon-GCaMP6 instead of axon-jGCaMP8m. The authors should provide details about the source of this material. If it was custom-made, a description of the subcloning process would be appreciated. Additionally, consider depositing sequence information or preferably the plasmid itself. Furthermore, the introduction of the jGCaMP8 series by Zhang, Rozsa, et al. 2023 should be acknowledged and referenced in your manuscript.

      We thank the reviewer for pointing this out. We have now included details on how we prepared the axon-jGCaMP8m, which was based on plasmids available at Addgene. Additionally, we have deposited our construct to Addgene ( https://www.addgene.org/216533/ ). We have also cited Janelia’s report on jGCaMP8, Zhang et al.

      (2) The authors should elaborate on the approach taken for experimental synchronization. Specifically, how was the alignment achieved between 2-photon imaging, treadmill recordings, aversive/appetitive stimuli, and videography? It would be important to document the details of the software and hardware components employed for generating TTLs that trigger the pump, stimulator, cameras, etc.

      We have now included a more detailed explanation about the timing control. We utilize a custom-made MATLAB program that sends TTL square waves and analogue waves via a single National Instruments board (USB-6229) to control two-photon image acquisition, behavior camera image acquisition, water syringe movement, current flow from a stimulator, and sound presentation. We also continuously recorded, at 30 kHz via a separate National Instruments board (PCIe-6363), the frame timing of two-photon imaging, the frame timing of a behavior camera, copies of command waves (sent to the syringe pump, the stimulator, and the speaker), and signals from the treadmill corresponding to running speed.
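
      To show how such a continuous recording allows events to be aligned offline, the sketch below detects rising edges in two synthetic 30 kHz traces (an imaging frame clock and a stimulator command) and assigns the stimulus onset to an imaging frame. It is written in Python for illustration and is not our acquisition code; the 2.5 V threshold, the 30 Hz frame clock, the pulse widths, and all function and variable names are assumptions.

```python
import numpy as np

def rising_edge_times(signal, sample_rate_hz=30_000, threshold=2.5):
    """Return the times (s) at which `signal` crosses `threshold` upward."""
    high = signal > threshold
    # A rising edge is a low sample immediately followed by a high sample.
    edges = np.flatnonzero(~high[:-1] & high[1:]) + 1
    return edges / sample_rate_hz

sample_rate_hz = 30_000
n_samples = sample_rate_hz * 2           # 2 s of synthetic recording

# Synthetic imaging frame clock: one 5 V pulse per frame (30 Hz assumed).
frame_clock = np.zeros(n_samples)
samples_per_frame = sample_rate_hz // 30
for start in range(0, n_samples, samples_per_frame):
    frame_clock[start + 10:start + 510] = 5.0

# Synthetic copy of the stimulator command: one 10 ms pulse starting at sample 30500.
shock_command = np.zeros(n_samples)
shock_command[30_500:30_800] = 5.0

frame_times = rising_edge_times(frame_clock, sample_rate_hz)
shock_times = rising_edge_times(shock_command, sample_rate_hz)

# Index of the imaging frame that was being acquired at each stimulus onset.
frame_index = np.searchsorted(frame_times, shock_times, side="right") - 1
print(len(frame_times), shock_times, frame_index)
```

      Because every signal is sampled on the same clock, the timing uncertainty of each detected edge is one sample, about 33 µs at 30 kHz, which is consistent with sub-millisecond alignment.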

      (3) The information regarding the cameras utilized in the study presents some confusion. In one instance, you mention, "To monitor licking behavior, the face of each mouse was filmed with a camera at 60 Hz (CM3-U3-13Y3M-CS, FLIR)" (Line 488). However, there's also a reference to filming facial expressions using an infrared web camera (Line 613). Could you clarify whether the FLIR camera (which is an industrial CMOS not a webcam) is referred to as a webcam? Alternatively, if it's a different camera being discussed, please provide product details, including pixel numbers and frame rate for clarity.

      We thank the reviewer for pointing this out. This was a mistake on our end. The camera used in the current project was a CM3-U3-13Y3M-CS, not a web camera. We have now corrected this.

      (4) Please provide more information about the methodology employed for lick detection. Specifically, did the authors solely rely on videography for this purpose? If so, why was an electrical (or capacitive) detector not used? It would provide greater accuracy in detecting licking.

      Lick detection was performed offline based on videography, using DeepLabCut. As licking occurs at a frequency of ~6.5 Hz (Xu, …, O’Connor Nature Neurosci, 2022), the movement can be detected at a frame rate of 60 Hz. Initially, we used both a lick sensor and videography. However, we favored videography because it could potentially provide non-binary information.
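
      The sampling argument above can be checked with a short calculation. This is a minimal sketch using only the two rates quoted in the response (60 Hz video, ~6.5 Hz licking); the function names are hypothetical.

```python
def frames_per_cycle(frame_rate_hz, event_rate_hz):
    """Average number of video frames captured during one event cycle."""
    return frame_rate_hz / event_rate_hz


def satisfies_nyquist(frame_rate_hz, event_rate_hz):
    """True if the frame rate exceeds twice the event rate."""
    return frame_rate_hz > 2 * event_rate_hz


frames = frames_per_cycle(60, 6.5)      # about 9.2 frames per lick cycle
adequate = satisfies_nyquist(60, 6.5)   # 60 Hz is well above 2 x 6.5 Hz = 13 Hz
```

      With roughly nine frames per lick cycle, each tongue protrusion spans several consecutive frames, which is what makes frame-by-frame pose tracking of licking feasible at this frame rate.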

      Other Minor Points:

      (5) Ensure consistency in the citation format; both Vander Weele et al. 2018 and Weele et al. 2019, share the same first author.

      Thank you for pointing this out. Endnote processes the first author’s name differently depending on the journal. We fixed the error manually. The first paper (2018) is an original research paper, and the second one (2019) is a review about how dopamine modulates aversive processing in the mPFC. We cited the second one in three instances where we mentioned review papers.

      (6) The distinction between "dashed vs dotted lines" in Figure 3K and 3M appears to be very confusing. Please consider providing a clearer visualization/labeling to mitigate this confusion.

      We have now changed the line styles.

      (7) Additionally plotting mean polar angles of aversive/appetitive axons as vectors in the Cartesian scatter plots (2J, 3I,J) would make interpretation easier.

      We have now made this change to Figures 2, 3, 4.

      (8) Data and codes should be shared in a public database. This is important for reproducible research and we believe that "available from the corresponding author upon reasonable request" is outdated language.

      We have uploaded the data to GitHub, https://github.com/pharmedku/2024-elife-da-axon.

      Reviewer #2 (Recommendations For The Authors):

      (1) Authors don't show which mouse each axon data comes from, making it hard to know if differences arise from inter-mouse differences vs differences in axons. The best way to address this point is to show similar plots as Figure 2J & K but broken down by mouse to show whether each mouse had evidence of these two clusters.

      We have now made this change to Figure 2-figure supplement 3.

      (2) Line 166: Should this sentence point to panels 2F, G, H rather than 2I which doesn't show a shock response?

      We thank the reviewer for pointing this out. We have fixed the incorrect labels.

      Line 195: The population level bias to aversive stimuli was shown previously using photometry so it is not justified to say "for the first time" regarding this statement.

      We have adjusted this sentence so that the claim of "for the first time" is not associated with the population-level bias.

      (4) The paper lacks a discussion of the potential role that novelty plays in the amplitude of the responses given that tail shocks occur less often than rewards. Is the amplitude of the first reward of the day larger than subsequent rewards? Would tail shock responses decay if they occurred in sequential trials?

      Following the reviewer's suggestion, we conducted a comparison of individual axonal responses to both conditioned and unconditioned stimuli across the first trial and subsequent trials. Our findings reveal a notable trend: aversive-preferring axons exhibited attenuation in response to CSreward, yet enhancement in response to CSaversive. Conversely, the response of these axons to USreward was attenuated, with no significant change observed for USaversive. In contrast, reward-preferring axons displayed an invariable activity pattern from the initial trial, highlighting the functional diversity present within dopamine axons. This analysis has been integrated into Figure 3-figure supplement 4 and is elaborated upon in the Discussion section.

      (5) Fix typo in Figure 1 - supplement 1. Shift

      We have now corrected this. Thank you.

      (6) The methods section needs information about trial numbers. Please indicate how many trials were presented to each mouse per day.

      We have now added the information about trial numbers to the Methods section.

      Reviewer #3 (Recommendations For The Authors):

      In line with the public review, my recommendation is for the authors to remain as objective about their data as possible. There are many points in the manuscript where the authors seem to directly contradict their own data. For example, they first detail that dopamine axons respond to unexpected water rewards. Indeed, they find that there are 40% of dopamine axons that respond in this way. Then, a few paragraphs later they state: "Previous studies have demonstrated that the overall dopamine release at the mPFC or the summed activity of mPFC dopamine axons exhibits a strong response to aversive stimuli (e.g., tail shock), but little to rewards". As detailed above, I do not think these data support an idea that dopamine axons in mPFC preferentially encode aversive outcomes. If the authors wanted to examine a role for mPFC in preferential encoding of aversive stimuli, you would first have to equate the outcomes by magnitude and then compare how the axons acquire preferences across time. Alternatively, a prediction of a more general process that I detail above would predict that you could give mice two rewards that differ in magnitude (e.g., lots of food vs. small water) and you would see the same results that the authors have seen here (i.e., a preference for the food, which is the larger and more salient outcome). Without other tests of how dopamine axons in mPFC respond to situations like this, I don't think any conclusion around mPFC in favoring aversive stimuli can be made.

      As suggested, we have made the current manuscript as objective as possible, removing interpretation aspects regarding what dopamine axons encode and emphasizing their functional diversity. In particular, we removed the word ‘encode’ when describing the response of dopamine axons.

      Although it may have appeared unclear, there was no contradiction within our data regarding the response to reward and aversive stimuli. We have now improved the readability of the Results and Methods sections. Concerning the interpretation of what exactly the mPFC dopamine axons encode, we have rewritten the discussion to be as objective about our data as possible, as suggested. We also have edited our title and abstract accordingly. Meanwhile, we wish to emphasize that our reward and aversive stimuli are standard paradigms commonly used in the field. We believe, and all the reviewers agreed, that reporting the diversity of dopamine axonal responses with a novel imaging design constitutes new insight for the neuroscience community. Therefore, we have decided to leave the introduction of new behavioral tasks for future studies and instead expanded our discussion.

      As mentioned, I think the experiments are executed really well and the technological aspects of the authors' methods are impressive. However, there are also some aspects of the data presentation that would be improved. Some of the graphs took a considerable amount of effort to unpack. For example, Figure 4 is hard going. Is there a way to better illustrate the main points that this figure wants to convey? Some of this might be helped by a more complete description in the figure captions about what the data are showing. It would also be great to see how the response of dopamine axons changes across trial within a session to the shock and water-predictive cues. Supp Figure 1 should be in the main text with standard error and analyses across time. Clarifying these aspects of the data would make the paper more relevant and accessible to the field.

      We thank the reviewer for pointing out that the legend of Figure 4 was incomplete. We have fixed it, along with improving the presentation of the figure. We have also prepared a new figure (Figure 3– figure supplement 4) to compare CSaversive and CSreward signals for the first and rest of the trials within daily sessions, revealing further functional diversity in dopamine axons. We have decided to keep Figure 1–figure supplement 2 as a figure supplement with an additional analysis, as another reviewer pointed out that the design is not completely new. Furthermore, as eLife readers can easily access figure supplements, we believe it is appropriate to maintain it in this way.

      Minor points:

      (1) What is the control period for the omission test? Was omission conducted for the shock?

      The control period for reward omission is a 2-second period just before the CS onset. We did not include shock omission, because a sufficient number of trials (> 6 trials) for the rare omission condition could not be achieved within a single day.
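
      A pre-stimulus control window of this kind can be sketched as follows. This is a hypothetical illustration, not the authors' analysis code: the 2-second window comes from the response, while the function name, the 30 Hz frame rate, and the synthetic trace are assumptions.

```python
def baseline_mean(signal, frame_rate_hz, cs_onset_s, window_s=2.0):
    """Mean of the signal over the window_s seconds just before CS onset."""
    start = int(round((cs_onset_s - window_s) * frame_rate_hz))
    stop = int(round(cs_onset_s * frame_rate_hz))
    if start < 0 or stop <= start:
        raise ValueError("control window falls outside the recording")
    window = signal[start:stop]
    return sum(window) / len(window)


# Synthetic trace at an assumed 30 Hz: flat baseline, then a step at CS onset (2 s).
signal = [1.0] * 60 + [3.0] * 30
baseline = baseline_mean(signal, frame_rate_hz=30, cs_onset_s=2.0)
response = [x - baseline for x in signal[60:]]  # baseline-subtracted post-CS samples
```

      Activity in the omission window can then be compared against this baseline on a trial-by-trial basis.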

      (2) The authors should mention how similar the tones were that predicted water and shock.

      According to de Hoz and Nelken (2014), a frequency difference of 4–7% is enough for mice to discriminate between tones. In addition, anticipatory licking and running confirmed that the mice could discriminate between the frequencies. We have now included this information in the Discussion.

      (3) I realize the viral approach used in the current studies may not allow for an idea of where in VTA dopamine neurons are that project to mPFC- is there data in the literature that speak to this? Particularly important as we now know that there is considerable heterogeneity in dopamine neuronal responses, which is often captured by differences in medial/lateral position within VTA.

      Some studies have suggested that mesocortical dopamine neurons are located in the medial posterior VTA (e.g., Lammel et al., 2008). However, in mouse anterograde tracing, it is not possible to spatially confine the injection of conventional viruses/tracers. We now refer to Lammel et al., 2008 in the Introduction.

    1. Author Response

      The following is the authors’ response to the original reviews.

      REVIEWER #1:

      The authors present a carefully controlled set of experiments that demonstrate an additional complexity for GPCR signaling in that endosomal signaling may be different when b-arrestin is or isn't associated with a G protein-bound V2R vasopressin receptor. It uses state of the art biosensor-based approaches and b-arrestin KO lines to assess this. It adds to a growing body of evidence that G proteins and b-arrestin can associate with GPCR complexes simultaneously. They also demonstrate the possibility that Gaq might also be activated by the V2R receptor. My sense is that one thing that may need to be considered is the possibility that such "megacomplexes" might actually involve receptor dimers or oligomers.

      1.1 Can the authors please review the data that describes the concept of "GPCR megacomplexes"? I feel this is missing from the introduction. The notion means different things to different people. As you will see from my other comments, you should especially focus on evidence at the level of the single receptor.

      We appreciate the reviewer’s comments and have now included a more comprehensive description of the GPCR megacomplex, or ‘megaplex’, concept in the introduction (page 2, 1st paragraph).

      1.2 The authors use mini-G proteins to conclude that V2R receptors interact with Gaq (in addition to Gas). I would prefer if there were a more direct measure of this. Can the authors show that the receptor interacts with full length Gaq (and not the other G proteins in Figure)? Is there a signaling phenotype associated with Gaq coupling? Is it sensitive to Gaq inhibition?

      Excellent point and we are happy to expand further on this. The ability of the V2R to activate Gq/11 has already been demonstrated before (Zhu, X. et al. Mol Pharmacol 46(3):460-9 (1994); Lykke, K. et al. Physiol Rep. 3(8):e12519 (2015); Avet, C. et al. eLife 11: e74101 (2022); Heydenreich, F.M. et al. Mol Pharmacol 102(3):139-49 (2022)). Therefore, we did not attempt to document this activation using more traditional assays. On the other hand, demonstrating an interaction between V2R and the Ga subunit in cells is challenging for several reasons. First, the full-length Ga subunit is already located at the plasma membrane at basal state, and thus generates high background signals in proximity assays. Second, upon receptor activation, the Ga subunit interaction with V2R is so transient that it is difficult, if not impossible, to catch this moment in a proximity assay. Although the miniG proteins are highly engineered, coupling specificity of the different subtypes (Gas, Gai/o, Gaq/11, and Ga12/13) to GPCRs is maintained. In addition, as they are homogenously expressed in the cytosol under basal states rather than at the membrane, they generate low background noise. Upon agonist stimulation, miniG proteins are recruited from the cytosol to the V2R at the plasma membrane, resulting in a robust signal in proximity assays. Thus, miniG proteins are unique in that they can actually detect GPCR–G protein interactions in cellular proximity assays, which is very challenging using full-length Ga subunits.

      That being said, we fully understand the reviewer’s concern and greatly value the effort in enhancing robustness of our study. Therefore, we have now monitored downstream signaling events of Gaq/11 in the absence or presence of the selective Gaq/11 inhibitor YM-254890 as a secondary method of documenting Gaq/11 activity. Specifically, we used a newly developed biosensor to measure diacylglycerol (DAG) production, a downstream second messenger of Gaq/11 activation, at both the plasma membrane and endosomes. Using a second biosensor, we detect general protein kinase C (PKC) activation, which is another downstream signaling event of Gaq/11 activation. Together, we demonstrated that AVP-stimulation leads to DAG production at both the plasma membrane and endosomes (Fig. 1C-D) as well as PKC activation (Fig. 1E), which all are sensitive to YM-254890 inhibition (Fig. 1C-D and E). Together these results rigorously suggest that the V2R interacts with and activates Gaq/11.

      1.3 I raise a similar concern with Gaq coupling in endosomes.

      For similar reasons that miniG proteins are excellent tools for demonstrating V2R interaction with G proteins at the plasma membrane, miniG proteins can also be used to detect V2R interaction with G proteins at endosomes by measuring proximity between miniG and an endosomal marker in response to agonist challenge. However, to ensure that the endosomal recruitment of miniGsq to the V2R demonstrated in our study corresponds to endosomal Gaq/11 activation, we monitored the production of DAG at the early endosomes in a similar way to which we detected DAG production at the plasma membrane. As shown in Fig. 1D, stimulation of V2R with AVP induces recruitment of the DAG-binding biosensor to the early endosomal marker Rab5. Pre-treatment of the cells with the selective Gaq/11 inhibitor YM-254890 abrogated this response, confirming that V2R activation leads to production of DAG at the early endosomes in a Gaq/11-dependent manner (Fig. 1D).

      1.4 Can the confocal data be shown for Gai and Ga12?

      Yes, we can certainly show this data as negative control. We have now included the confocal data using Halo-mGsi as a negative control for confocal microscopy (Fig. 2). As seen on this figure, mGsi does not colocalize with Lck (plasma membrane), nor with EEA1 (early endosomes) upon stimulation of cells with AVP in line with a receptor that does not couple to Gai/o.

      We did not include data using Halo-mG12, as this G protein subtype, similar to Gi/o, does not couple functionally to V2R. Therefore, it is highly unlikely we would obtain different results from the experiments using Halo-mGsi.

      1.5 The authors want us to believe that there is simultaneous binding of G proteins and b-arrestin. This is never demonstrated and is at odds with the structural basis of G protein and b-arrestin binding. Have the authors considered that "simultaneous" occupancy might simply reflect binding at distinct GPCR monomers in the context of dimeric or oligomeric receptors? They could I suppose provide data at the level of a single receptor rather than using the bulk BRET approaches used.

      We appreciate the comment and the opportunity to highlight some of our previous work, which addresses the megacomplexes at the level of a single receptor. First, we have characterized the megacomplex biochemically and structurally at a low resolution (Thomsen ARB et al. 2016, Cell 166(4):907-19). The results unequivocally demonstrate that a single GPCR interacts simultaneously with heterotrimeric G protein, at the receptor core, and with b-arrestin via the phosphorylated receptor carboxy-terminal. We also documented functionality of the megacomplex as the receptor can interact with and activate the G protein, which was shown by 3 different biochemical approaches (Thomsen ARB et al. 2016, Cell 166(4):907-19). In addition, we solved a high-resolution cryo-EM structure of a megacomplex further highlighting the architecture of this complex (Nguyen AH et al. 2019, Nat Struct Mol Biol 26:1123-31). As both biochemical and structural analyses were done in vitro in which the receptor was embedded in a detergent micelle, we also confirmed that the megacomplex structural architecture fits naturally within the context of a membrane in molecular dynamics simulation experiments (Nguyen AH et al. 2019, Nat Struct Mol Biol 26:1123-31).

      In cells, we and others have also shown that GPCRs such as the V2R can bind b-arrestins exclusively via the phosphorylated carboxy-terminal tail as it does in the megacomplex (Kumari P et al. 2016, Nat Commun 7:13416; Cahill III TJ et al. 2017, PNAS 114(10):2562-67; Kumari P et al. 2017, Mol Biol Cell 28(8):1003-10; Chen K et al. 2023, Nature (online doi: https://doi.org/10.1038/s41586-023-06420-x)). In addition, we and others have used BRET and confocal microscopy to show that the V2R and other GPCRs recruit G protein and b-arrestin simultaneously and that the three components colocalize in endosomes upon prolonged agonist exposure (Thomsen ARB et al. 2016, Cell 166(4):907-19; Chen K et al. 2023, Nature (online doi: https://doi.org/10.1038/s41586-023-06420-x)). As the reviewer correctly points out, in these cellular experiments (as well as in single molecule microscopy), the working resolution is not high enough to rule out that the receptors that co-recruit G protein and b-arrestin in endosomes could be dimeric instead of monomeric. Thus, we conducted a series of experiments with GPCR–b-arrestin fusions where the two proteins are covalently attached at the receptor carboxy-terminal tail. We showed that despite the GPCR–b-arrestin coupling being fully functional (with respect to b-arrestin promoting a high-affinity state of the receptor for agonist binding and constitutively internalizing the receptor) the receptor could still activate G proteins (Thomsen ARB et al. 2016, Cell 166(4):907-19; Nguyen AH et al. 2019, Nat Struct Mol Biol 26:1123-31), which demonstrates that the single receptor megaplex can physically form in cells.

      We have now included an extra paragraph in the discussion to go over these megaplex-related considerations (5th paragraph in the discussion), and we thank the reviewer for raising this point.

      1.6 Please introduce abbreviations when you first use this- this was not done consistently.

      Thank you for noticing these errors, which we now have corrected.  

      REVIEWER #2:

      This manuscript by Daly et al., probes the emerging paradigm of GPCR signaling from endosomes using the V2R as a model system with an emphasis on Gaq/11 and b-arrestins. The study employs cellular imaging, enzyme complementation assays and energy transfer-based sensors to probe the potential formation of GPCR-G-protein-b-arrestin megaplexes. While the study is certainly very interesting, it appears to be very preliminary at many levels, and clearly requires further development in order to make robust conclusions. The authors should consider expanding on this work further to make the points more convincingly to make the work solid and impactful. The two corresponding authors are among the leaders in the field having demonstrated the existence of megaplexes, and building on the work in a systematic fashion should certainly move the paradigm forward. As the work presented in the current manuscript is already pre-printed, the authors should take this opportunity to present a completer and more comprehensive story to the field.

      We are grateful for the time and efforts the reviewer has put into reviewing our work. We are certainly excited to learn that the reviewer finds our work “very interesting”. Regarding the robustness, we have added extra control experiments to increase the completeness of the study. These experiments include:

      • Measurements of AVP-stimulated diacylglycerol production, a signaling event downstream of Gaq/11 activation. These measurements were conducted both at plasma membrane (Fig. 1C) and early endosomes (Fig. 1D) using a newly developed DAG-binding biosensor, and demonstrate that the V2R activates Gaq/11 at both of these subcellular locations.

      • Monitoring AVP-promoted protein kinase C activation, another downstream signaling effect of Gaq/11 activation (Fig. 1E). The result of this approach shows in another way that V2R activates Gaq/11.

      • Inhibition of signaling events downstream of Gaq/11 activation using the selective Gaq/11 inhibitor YM-254890. YM-254890 inhibits both AVP-stimulated DAG production at plasma membrane and endosomes as well as PKC activation (Fig. 1C-E), which strongly confirms that these signaling outputs are results of Gaq/11 activation.

      • We have also included the confocal data using Halo-mGsi as a negative control for confocal microscopy (Fig. 2). As seen in this figure, mGsi does not translocate to the plasma membrane or early endosomes upon stimulation with AVP, which validates that V2R activation does not couple to and activate Gai/o.

      Finally, we would like to kindly remind the reviewer that the production of the pre-print manuscript is part of the peer-review process in eLife.

      2.1 The use of miniG proteins in these experiments is a major concern as these are highly engineered and may not represent the true features of G proteins. While these have been used as a readout in other publications, their use in demonstrating megaplex formation is sub-optimal, and native, full-length G proteins should be used.

      We are a bit unsure as to what the reviewer means by using native full-length G proteins. If the reviewer is suggesting to co-immunoprecipitate V2R with native unlabeled G protein and b-arrestin, it should be considered that the G protein interaction with the receptor is extremely transient and unlikely to survive the pull-down procedure unless stabilized by a nanobody or crosslinking. Although the b-arrestin interaction with the receptor is more stable of nature, co-immunoprecipitation with the receptor requires crosslinking or stabilization with a Fab/nanobody. Therefore, we do not think this approach can be used as a more accurate way of detecting native megaplexes.

      If the reviewer is suggesting the use of full-length G proteins in our cell-based proximity assays instead of miniG proteins, we would like to highlight that this approach is somewhat prone to false-positive responses. The major reason behind this is that G proteins are located at regions in membranes close to the receptor whereas b-arrestins are distributed throughout the cytosol. Upon activation of the V2R, b-arrestins translocate to the receptor at the plasma membrane, which results in enhanced BRET between V2R-coupled G protein subtypes and b-arrestins (see Author response image 1 below of preliminary data). This translocation also results in non-specific BRET signals between b-arrestins and G protein subtypes at the plasma membrane that do not couple to V2R but are located in close proximity to the receptor. As these nonspecific BRET signals do not report on the formation of functional V2R megaplexes (see Author response image 1), we have purposely not used this approach.

      Author response image 1.

      To overcome this technical hurdle in detection of functional megaplexes, we have replaced full-length G proteins by miniG proteins as the latter are located in the cytosol at resting states and only translocate to the membrane area if a receptor adopts an active conformation. This replacement is advantageous since activation of megaplex-forming receptors such as the V2R results in simultaneous translocation of miniG proteins and b-arrestins from the cytosol to the receptor at the plasma membrane, which produces a highly specific proximity signal (see Author response image 2 below of preliminary data). When stimulating the V2R, we only observe increases in proximity between b-arrestin1 and miniG proteins that are activated by the V2R (miniGs and miniGsq) but not the miniG proteins that are not activated by this receptor (miniGsi and miniG12) (see Author response image 2). Therefore, usage of miniG proteins offers a more accurate experimental approach to detect functional megaplexes as compared to the usage of full-length G proteins.

      Author response image 2.

      2.2 The interpretation of complementation (NanoLuc) or proximity (BRET) as evidence of signaling is not appropriate, especially when overexpression system and engineered constructs are being used.

      We thank the reviewer for raising this concern. We have previously demonstrated global Gas activation and Gas signaling in form of cAMP stimulated by internalized V2R (Thomsen ARB et al. 2016, Cell 166(4):907-19). As mentioned previously, in the current updated manuscript we have now included experiments to document downstream signaling events in response to Gaq/11 activation. These experiments include measurement of production of DAG at the plasma membrane (Fig. 1C) and early endosomes (Fig. 1D), as well as phosphorylation/activation of PKC (Fig. 1E). Pre-incubation with the selective Gaq/11 inhibitor YM-254890, abrogated all these downstream signals and confirms that the V2R stimulates Gaq/11 protein signaling at both the plasma membrane and endosomes (Fig. 1C-E).

      2.3 After the original work from the same corresponding authors on megaplex formation, the major challenge in the field is to demonstrate the existence and relevance of megaplex formation at endogenous levels of components, and the current study focuses solely on showing the proximity of Gaq and b-arrestins.

      We completely agree with the reviewer that it will be important to demonstrate the functionality of endogenous megaplexes, and we are currently working on this in other studies using different receptor systems. However, doing this is not trivial, and we will have to overcome major technical barriers that we feel are somewhat out of the scope of the current study. The goal of our V2R study is to demonstrate that V2R megaplexes form with Gaq/11, resulting in Gaq/11 activation at endosomes, and that endosomal G protein activation by the V2R can occur independently of b-arrestin, which, in our humble opinion, we accomplish.

      2.4 The study lacks a coherent approach, and the assays are often shifted back and forth between the two b-arrestin isoforms (1 and 2), for example, confocal vs. complementation etc.

      We understand the reviewer’s concern. However, as opposed to the β2-adrenergic receptor that binds β-arrestin2 with higher affinity than β-arrestin1, V2R has a strong affinity for both β-arrestin1 and β-arrestin2 (Oakley et al. 2000, JBC 275(22):17201-10). The V2R’s almost identical affinity for β-arrestin1 and β-arrestin2 is well illustrated in Fig. 3B. Thus, although different β-arrestin isoforms were used in some experiments, it is very unlikely that the overall results and conclusions from this study will change by adding extra experiments to ensure that both β-arrestin isoforms are used in every experiment.

      2.5 In every assay, only the G proteins and b-arrestins are monitored without a direct assessment of the presence of receptor, and absent that data, it is difficult to justify calling these entities megaplexes.

      Mini G proteins and b-arrestin come into close proximity upon agonist stimulation of the V2R. Using confocal microscopy, we observed this co-recruitment of miniGs/miniGsq and b-arrestin in response to prolonged V2R stimulation at endosomes specifically (Fig. 3D-F). In the absence of GPCR stimulation, both miniG and b-arrestin would be homogenously distributed throughout the cytosol, and thus the only reason why both proteins have been recruited to endosomes in response to AVP challenge is that they are recruited to internalized and active V2R. This point was obviously not adequately described in the original manuscript, and thus we have now clarified this further in the updated manuscript at the 8th sentence of the last paragraph of the "The V2R recruits Gas/Gaq and barrs simultaneously" section.

      REVIEWER #3:

      The manuscript by Daly et al. examines endosomal signaling of the vasopressin type 2 receptors using engineered mini G protein (mG proteins) and a number of novel techniques to address if sustained G protein signaling in the endosomal compartment is enhanced by b-arrestin. Employing these interesting techniques, they have shown how V2R could activate Gas and Gaq in the endosomal compartments and how this modulation could occur in an arrestin-dependent and -independent manner. Although the phenomenon of endosomal signaling is complex to address, the authors have tried their best to examine these using a number of well-controlled sets of experiments. Though this is an interesting and well carried out study of endosomal signaling of G proteins, my concerns are:

      3.1 The study is done in overexpressed HEK 293 cells with these engineered constructs making me wonder if the kinetics would be the same in primary cells?

      The reviewer raises an interesting and valid point. It is possible that in the context of primary cells the kinetics would differ slightly, and it would definitely be interesting to address this in a subsequent study. However, despite being an interesting aspect of our study, the kinetics are not our major take-home message, but rather the subcellular localization of the G protein activation and the role of β-arrestin in these events. We have now highlighted this aspect in our updated manuscript (1st paragraph of the discussion) and we thank the reviewer for addressing this.

      3.2 The use of the phrase "G protein activation independent of b-arrestins to a minor degree" would make me question its physiological relevance. The authors should discuss the relevance of their findings in physiological or pathological context.

      We are glad that the reviewer focuses on this point, and we would like to highlight that other GPCRs, including the glucagon-like peptide-1 receptor (GLP1R), internalize in a β-arrestin-independent manner (Claing A et al. 2000 PNAS 97(3):1119-24), while signaling through Gas from endosomes. In the case of the GLP1R, this endosomal Gas signaling promotes glucose-stimulated insulin secretion in pancreatic β-cells (Kuna RS et al. 2013 Am J Physiol Endocrinol Metab 305:E161-70). Consequently, β-arrestin-independent endosomal G protein signaling appears to have some physiological relevance. Similarly, in a very recent pre-print from the von Zastrow group (Blythe EE and von Zastrow M 2023 BioRxiv https://doi.org/10.1101/2022.09.07.506997), it was reported that endogenously expressed vasoactive intestinal peptide receptor 1 (VIPR1), which regulates gastro-intestinal functions, promotes robust G protein signaling from endosomes in a completely β-arrestin-independent fashion. This again suggests that endogenously expressed GPCRs can internalize and activate G proteins from endosomes independently of β-arrestin to produce physiological responses. We have now discussed these studies in the 6th paragraph of the discussion.

      3.3 The confocal colocalization studies shown in Figure 2 and their conclusion "suggesting a certain level of endosomal Gas/Gaq signaling despite the absence of barr2" seems rather inconclusive.

      As opposed to V2R, a receptor that retains β-arrestin in endosomes upon internalization, β-arrestin quickly dissociates from V2b2AR after internalization due to the low affinity of the carboxy-terminus of β2AR for β-arrestin. In the previous Fig. 2 (now Fig. 3), after 45 minutes of AVP stimulation, no β-arrestin is visible at endosomes in cells expressing V2b2AR, as β-arrestin has already dissociated from the receptor and translocated back to the cytosol. However, clear green clusters of mGs and mGsq are still visible at endosomes, indicating the presence of active receptor interacting with Gas or Gaq despite the fact that β-arrestin is back in the cytosol. We quantified the percentage of the green mGs or mGsq clusters that do not colocalize with β-arrestin and have added this information to the updated version of the manuscript (Fig. 3G). In V2R-expressing cells, almost all active receptors that interact with Gas or Gaq/11 also associate with β-arrestin (Fig. 3G). In contrast, in V2b2AR-expressing cells, approximately 75% of the active receptors do not interact with β-arrestin (Fig. 3G). This suggests that β-arrestin binding to V2R is not an absolute requirement for endosomal Gas and Gaq activation by V2R. This point was obviously not addressed adequately in the original manuscript, and thus we have now elaborated further on this in the updated version in the last paragraph of the "The V2R recruits Gas/Gaq and βarrs simultaneously" section.
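
      The percentage reported for Fig. 3G can be illustrated with a minimal sketch. It assumes each mGs or mGsq cluster has already been scored as overlapping a β-arrestin signal or not; the function name and the example scores are hypothetical and are not taken from the manuscript's image-analysis workflow.

```python
def percent_without_arrestin(cluster_has_arrestin):
    """Return the percentage of mG clusters that lack a beta-arrestin signal.

    cluster_has_arrestin: iterable of booleans, one per scored cluster
    (True = the cluster colocalizes with beta-arrestin). Hypothetical input.
    """
    scores = list(cluster_has_arrestin)
    if not scores:
        raise ValueError("no clusters were scored")
    without = sum(1 for has_arrestin in scores if not has_arrestin)
    return 100.0 * without / len(scores)


# Illustrative scores only: three of four clusters lack beta-arrestin.
example_scores = [False, False, False, True]
example_percent = percent_without_arrestin(example_scores)
```

      In practice the per-cluster scores would come from whatever colocalization criterion was applied to the confocal images; that criterion is not specified here and is left outside the sketch.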

      3.4 Though a novel observation it is not clear to me how V2R would internalize after activation without arrestin. Is it some sort of generalized microcytosis occurring in these overexpressed cells? Should discuss.

      This is certainly a very interesting observation and something other research laboratories have also seen recently, in particular in the context of endosomal G protein signaling (Blythe EE and von Zastrow M 2023 BioRxiv https://doi.org/10.1101/2022.09.07.506997). The main and best characterized pathway for GPCR internalization is clathrin-dependent, where receptors are most commonly associated with β-arrestins. However, for some GPCRs, the β-arrestin association is not required for clathrin-mediated internalization. One example is the apelin receptor, which can internalize via clathrin-coated pits, but in a β-arrestin-independent manner (Pope GR et al. 2016 Mol Cell Endocrinol. 437:108-19). Alternatively, GPCRs can also internalize independently of any clathrin and β-arrestin associations via caveolae or fast endophilin-mediated endocytosis (FEME). We have now expanded our discussion of possible mechanisms for β-arrestin-independent receptor internalization in the updated manuscript in the 6th paragraph of the discussion, and we thank the reviewer for the suggestion.

      3.5 Is use of mini G protein a good representation? The authors should justify.

      Excellent point and something we have comprehensively discussed in our responses to reviewers 1 and 2 (points 1.2 and 2.1).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      Bendzunas, Byrne et al. explore two highly topical areas of protein kinase regulation in this manuscript. Firstly, the idea that Cys modification could regulate kinase activity. The senior authors have published some standout papers exploring this idea of late, and the current work adds to the picture of how active site Cys might have been favoured in evolution to serve critical regulatory functions. Second, BRSK1/2 are understudied kinases listed as part of the "dark kinome" so any knowledge of their underlying regulation is of critical importance to advancing the field.

      Strengths:

      In this study, the authors pinpoint highly-conserved, but BRSK-specific, Cys residues as key players in kinase regulation. There is a delicate balance between equating what happens in vitro with recombinant proteins relative to what the functional consequence of Cys mutation might be in cells or organisms, but the authors are very clear with the caveats relating to these connections in their descriptions and discussion. Accordingly, by extension, they present a very sound biochemical case for how Cys modification might influence kinase activity in cellular environs.

      Weaknesses:

      I have very few critiques for this study, and my major points are barely major.

      Major points

      (1) My sense is that the influence of Cys mutation on dimerization is going to be one of the first queries readers consider as they read the work. It would be, in my opinion, useful to bring forward the dimer section in the manuscript.

      We agree that the influence of Cys on BRSK dimerization is a topic of significant interest. Our primary focus was to explore oxidative regulation of the understudied BRSK kinases as they contain a conserved T-loop Cys, and we have previously demonstrated that equivalent residues at this position in related kinases were critical drivers of oxidative modulation of catalytic activity. We have demonstrated here that BRSK1 & 2 are similarly regulated by redox and that this is due to oxidative modification of the T+2 Cys, in addition to Cys residues that are conserved amongst related ARKs as well as BRSK-specific Cys. Although we also provide evidence for limited redox-sensitive higher order BRSK species (dimers) in our in vitro analysis, these represent a small population of the total BRSK protein pool (this was validated by SEC-MALS analysis). As such, we do not have strong evidence to suggest that these limited dimers significantly contribute to the pronounced inhibition of BRSK1 & 2 in the presence of oxidizing agents, and instead believe that other biochemical mechanisms likely drive this response. This may result from oxidized Cys altering the conformation of the activation loop. Indeed, the formation of an intramolecular disulfide within the T-loop of BRSK1 & 2, which we detected by MS, is one such regulatory modification. It is noteworthy that intramolecular disulfide bonds within the T-loop of AKT and MELK have already been shown to induce an inactive state in the kinase, and we posit a similar mechanism for BRSKs.

      While we recognize the potential importance of dimerization in this context, our current data from in vitro and cell-based assays do not provide substantial evidence to assert dimerization as a primary regulatory mechanism. Hence, we maintained a more conservative stance in our manuscript, discussing dimerization in later sections where it naturally followed from the initial findings. That being said, we acknowledge the potential significance of dimerization in the regulation of the BRSK T-loop cysteine. We believe this aspect merits further investigation and could indeed be the focus of a follow-up study.

      (2) Relatedly, the effect of Cys mutation on the dimerization properties of preparations of recombinant protein is not very clear as it stands. Some SEC traces would be helpful; these could be included in the supplement.

      In order to determine whether our recombinant BRSK proteins (and T-loop mutants) existed as monomers or dimers, we performed SDS-PAGE under reducing and non-reducing conditions (Fig 7). This unambiguously revealed that a monomer was the prominent species, with little evidence of dimers under these experimental conditions (even in the presence of oxidizing agents). Although we cannot discount a regulatory role for BRSK dimers in other physiological contexts, we could not produce sufficient evidence to suggest that multimerization played a substantial role in modifying BRSK kinase activity in our assays. We note that our in vitro analysis was performed using truncated forms of the protein, and as such it is entirely possible that regions of the protein that flank the kinase domain may serve additional regulatory functions that may include higher order BRSK conformations. In this regard, although we have not included SEC traces of our recombinant proteins, we have included analytical SEC-MALS of the truncated proteins (Supplementary Figure 6) which we believe to be more informative. We have also now included additional SEC-MALS data for BRSK2 C176A and C183A (Supplementary Figure 6d and e), which supports our findings in Fig 7, demonstrating the presence of limited dimer species under non-reducing conditions.

      (3) Is there any knowledge of Cys mutants in disease for BRSK1/2?

      We have conducted an extensive search across several databases: COSMIC (Catalogue of Somatic Mutations in Cancer), ProKinO (Protein Kinase Ontology), and TCGA (The Cancer Genome Atlas). These databases are well-regarded for their comprehensive and detailed records of mutations related to cancer and protein kinases. Our analysis using the COSMIC and TCGA databases focused on identifying any reported instances of Cys mutations in BRSK1/2 that are implicated in cancer. Additionally, we utilized the ProKinO database to explore the broader landscape of protein kinase mutations, including any potential disease associations of Cys mutations in BRSK1/2. However, we found no evidence to indicate the presence of Cys mutations in BRSK1/2 that are associated with cancer or disease. This lack of association in the current literature and database records suggests that, as of our latest search, Cys mutations in BRSK1/2 have not been reported as significant contributors to pathogenesis.

      (4) In bar charts, I'd recommend plotting data points. Plus, it is crucial to report in the legend what error measure is shown, the number of replicates, and the statistical method used in any tests.

      We have added the data points to the bar charts and included statistical methods in figure legends.

      (5) In Figure 5b, the GAPDH loading control doesn't look quite right.

      The blot has been repeated and updated.

      (6) In Figure 7 there is no indication of what mode of detection was used for these gels.

      We have updated the figure legend to confirm that the detection method was western blot.

      (7) Recombinant proteins - more detail should be included on how they were prepared. Was there a reducing agent present during purification? Where did they elute off SEC... consistent with a monomer or higher order species?

      We have added ‘produced in the absence of reducing agents unless stated otherwise’ in the methods section to improve clarity. Although we have not added additional sentences to describe the elution profile of the BRSK proteins by SEC during purification, we believe that the inclusion of analytical SEC-MALS data is sufficient evidence that the proteins are largely monomeric under non-reducing conditions.

      Reviewer #2 (Public Review):

      Summary:

      In this study by Bendzunas et al, the authors show that the formation of intra-molecular disulfide bonds involving a pair of Cys residues near the catalytic HRD motif and a highly conserved T-Loop Cys with a BRSK-specific Cys at an unusual CPE motif at the end of the activation segment function as repressive regulatory mechanisms in BRSK1 and 2. They observed that mutation of the CPE-Cys only, contrary to the double mutation of the pair, increases catalytic activity in vitro and drives phosphorylation of the BRSK substrate Tau in cells. Molecular modeling and molecular dynamics simulations indicate that oxidation of the CPE-Cys destabilizes a conserved salt bridge network critical for allosteric activation. The occurrence of spatially proximal Cys amino acids in diverse Ser/Thr protein kinase families suggests that disulfide-mediated control of catalytic activity may be a prevalent mechanism for regulation within the broader AMPK family. Understanding the molecular mechanisms underlying kinase regulation by redox-active Cys residues is fundamental as it appears to be widespread in signaling proteins and provides new opportunities to develop specific covalent compounds for the targeted modulation of protein kinases.

      The authors demonstrate that intramolecular disulfide bonding between conserved cysteines can function as a repressive mechanism, as indicated by the effect of DTT and the consequent increase in activity of BRSK1 and 2 (WT). The cause-effect relationship explaining why mutation of the CPE-Cys alone increases catalytic activity in vitro and drives phosphorylation of the BRSK substrate Tau in cells is not clear to me. The explanation given by the authors, based on molecular modeling and molecular dynamics simulations, is that oxidation of the CPE-Cys (which will favor disulfide bonding) destabilizes a conserved salt bridge network critical for allosteric activation. However, no functional evidence of the impact of the salt-bridge network is provided. If you mutate the two main Cys pairs (aE-CHRD and A-loop T+2-CPE), you lose the effect of DTT, as the disulfide pairs cannot be formed and hence no repression takes place. However, when looking at individual residues, I do not understand why mutating the CPE-Cys alone results in the opposite effect, unless it is independent of its connection with the T+2 residue on the A-loop.

      Strengths:

      This is an important and interesting study providing new knowledge in the protein kinase field with important therapeutic implications for the rational design and development of next-generation inhibitors.

      Weaknesses:

      There are several issues with the figures that this reviewer considers should be addressed.

      Reviewer #1 (Recommendations for The Authors):

      Major points

      Page 26 - the discussion could be more concise. There's an element of recapping the results, which should be avoided.

      Regarding the conciseness of the discussion section, we have thoroughly revised it to ensure a more succinct presentation, deliberately avoiding the recapitulation of results. The revised discussion now focuses on interpreting the findings and their implications, steering clear of redundancy with the results section.

      Figure 1b seems to be mislabeled/annotated. I recommend checking whether the figure legends match more broadly. Figure 1 appears to be incorrectly cited throughout the results.

      Thank you for pointing out the discrepancies in the labeling and citation of Figure 1b. We have carefully reviewed and corrected these issues to ensure that all figure labels, legends, and citations accurately reflect the corresponding data and illustrations. We appreciate your attention to detail and the opportunity to improve the clarity and accuracy of our presentation.

      Figure 6 - please include a color-coding key in the figure. Further support for these simulations could be provided by supplementary movies or plots of the interaction. Figure 4 colour palette should be adjusted for the spheres in the Richardson diagrams to have greater distinction.

      As suggested, we have amended the colour palette in Figure 4 to improve conformity throughout the figure.

      Minor points

      Figure 2 - it'd be helpful to know what the percentage coverage of peptides is.

      We have updated the figure legend to include peptide coverage for both proteins.

      Some typos - Supp 2 legend "Domians".

      Fixed

      Figure 6 legend - analyzed by needs a space;

      Fixed

      Fig 8 legend schematic misspelled.

      Fixed

      Broadly, if you Google T-loop you get a pot pourri of enzyme answers. Why not just use Activation loop?

      The choice of "T-loop" over "Activation loop" in our manuscript was made to maintain consistency with other literature in the field, and in particular our previous paper “Aurora A regulation by reversible cysteine oxidation reveals evolutionarily conserved redox control of Ser/Thr protein kinase activity” where we refer to the activation loop cysteine as T-loop + 2. We acknowledge the varied enzyme contexts in which "T-loop" is used and agree on the importance of clarity. To address this, we made an explicit note in the manuscript that the "T-loop" is also referred to as the "Activation loop", ensuring readers are aware of the interchangeable use of these terms. Additionally, this nomenclature facilitates a more straightforward designation of cysteine residues within the loop (T+2 Cysteine). We believe this approach balances adherence to established conventions with the need for clarity and precision in our descriptions.

      Methods - what is LR cloning. Requires some definition. Some manufacturer detail is missing in methods, and referring to prior work is not sufficient to empower readers to replicate.

      We agree, and have added the following to the methods section:

      “BRSK1 and 2 were sub-cloned into pDest vectors (to encode the expression of N-terminal Flag or HA tagged proteins) using the Gateway LR Clonase II system (Invitrogen) according to the manufacturer’s instructions. pENTR BRSK1/2 clones were obtained in the form of Gateway-compatible donor vectors from Dr Ben Major (Washington University in St. Louis). The Gateway LR Clonase II enzyme mix mediates recombination between the attL sites on the Entry clone and the attR sites on the destination vector. All cloned BRSK1/2 genes were fully sequenced prior to use.”

      Page 7 - optimal settings should be reported. How were pTau signals quantified and normalised?

      We have added the following to the methods section:

      “A two-color Western blot detection method employing infrared fluorescence was used to measure the ratio of Tau phospho serine 262 to total Tau. Total GFP Tau was detected using a mouse anti GFP antibody and visualized at 680 nm using goat anti mouse IRDye 680, while phospho-Tau was detected using a Tau phospho serine 262 specific antibody and visualized at 800 nm using goat anti rabbit IRDye 800. Imaging was performed using a LI-COR Odyssey CLx with scan control settings set to 169 μm, medium quality, and 0.0 mm distance. Quantification was performed using LI-COR Image Studio on the raw image files. The phospho-Tau to total Tau ratio was determined by measuring the ratio of the fluorescence intensities measured at 800 nm (pTau) to those at 680 nm (total Tau).”
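
      The ratio described in this methods text can be sketched as follows. This is only an illustration under stated assumptions: the band intensities are made-up numbers, the function names are hypothetical, and the optional normalization to a reference lane is an assumption added for the example rather than a step stated in the methods.

```python
def ptau_to_total_ratio(intensity_800, intensity_680):
    """Ratio of the 800 nm (pTau S262) signal to the 680 nm (total Tau) signal."""
    if intensity_680 <= 0:
        raise ValueError("total Tau intensity must be positive")
    return intensity_800 / intensity_680


def normalize_to_reference(ratios, reference_lane):
    """Express each lane's ratio relative to a chosen reference lane.

    Assumption for illustration only; the quoted methods do not describe
    this normalization step.
    """
    reference = ratios[reference_lane]
    return {lane: ratio / reference for lane, ratio in ratios.items()}


# Hypothetical band intensities per lane: (800 nm signal, 680 nm signal).
lanes = {"WT": (500.0, 1000.0), "mutant": (1500.0, 1000.0)}
ratios = {lane: ptau_to_total_ratio(i800, i680) for lane, (i800, i680) in lanes.items()}
relative = normalize_to_reference(ratios, "WT")
```

      Dividing by the total Tau signal in the same lane corrects for differences in Tau expression or loading, which is why both channels are read from the same blot.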

      In the Figure 6g-j legend, the salt bridge is incorrectly annotated as E185-R248 rather than 258.

      Fixed

      Lines 393-395 provide a repeat statement on BRSKs phosphorylating Tau (from 388-389).

      We have removed the repetition and reworded the opening lines of the results section to improve the overall flow of the manuscript.

      Supp. Figure 1 is difficult to view - would it be possible to increase the size of the phylogenetic analysis?

      We thank the reviewer for this observation. We have rotated (90°) and expanded the figure so that it can be more clearly viewed.

      Supp. Figure 2 - BRSK1/2 incorrectly spelled.

      Fixed

      Please check the alignment of labels in Supp. Figure 3e.

      Fixed

      Reviewer #2 (Recommendations For The Authors):

      (1) In Figure 1, current panel b is not mentioned/described in the figure legend and as a consequence, the rest of the panels in the legends do not fit the content of the figure.

      Reviewer 1 also noted this error, and we have amended the manuscript accordingly.

      What is the rationale for using the HEK293T cells as the main experimental/cellular system? Are there cell lines that express both proteins endogenously so that the authors can recapitulate the results obtained from ectopic overexpression?

      The selection of HEK-293T cells was driven by their well-established utility in overexpression studies, which make them ideal for the investigation of protein interactions and redox regulation. This cell line's robust transfection efficiency and well-characterized biology provide a reliable platform for dissecting the molecular mechanisms underlying the redox regulation of proteins. Furthermore, the use of HEK-293T cells aligns with the broader scientific practice, serving as a common ground for comparability with existing literature in the field of BRSK1/2 signaling, protein regulation and interaction studies.

      The application of HEK-293T cells as a model system in our study serves as a foundational step towards eventually elucidating the functions of BRSK1/2 in neuronal cells, where these kinases are predominantly expressed and play critical roles. Given the fact that BRSKs are classed as ‘understudied’ kinases, the choice of a HEK-293T co-overexpression system allowed us to analyze the direct effects of BRSK kinase activity (using phosphorylation of Tau as a readout) in a cellular context and in a more controlled manner. This approach not only aids in the establishment of a baseline understanding of the redox regulation of BRSK1/2, but also sets the stage for subsequent investigations in more physiologically relevant neuronal models.

      In current panel d, could the authors recapitulate the same experimental conditions as in current panel c?

      Figure 1 panel c shows that both BRSK1 and 2 are reversibly inhibited by oxidizing agents such as H2O2, whilst panels d and e show the concentration dependent activation and inhibition of the BRSKs with increasing concentrations of DTT and H2O2 respectively. The experimental conditions were identical, other than changing amounts of reducing and oxidizing agents, and used the same peptide coupled assays. Data for all experiments were originally collected in ‘real time’ as depicted in Fig 1c (increase in substrate phosphorylation over time). However, to aid interpretation of the data, we elected to present the latter two panels as dose response curves by calculating the change in the rate of enzyme activity (shown as pmol phosphate incorporated into the peptide substrate per min) for each condition. To aid the reader, we now include an additional supplementary figure (new supplementary figure 2) depicting BRSK1 and 2 dependent phosphorylation of the peptide substrate in the presence of different concentrations of DTT and H2O2 in a real time (kinetic) assay. The new data shown is a subset of the unprocessed data that was used to calculate the rates of BRSK activity in Fig 1d & e.
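
      The conversion from a real-time trace to a single rate can be illustrated with a short sketch. It assumes the rate is taken as the least-squares slope of product formed against time over a linear portion of the trace; the fitting method, the function name, and the example values are assumptions for illustration and are not stated in the response.

```python
def linear_rate(times_min, product_pmol):
    """Least-squares slope of product (pmol) against time (min), in pmol/min."""
    if len(times_min) != len(product_pmol) or len(times_min) < 2:
        raise ValueError("need at least two paired time/product readings")
    n = len(times_min)
    mean_t = sum(times_min) / n
    mean_p = sum(product_pmol) / n
    numerator = sum((t - mean_t) * (p - mean_p) for t, p in zip(times_min, product_pmol))
    denominator = sum((t - mean_t) ** 2 for t in times_min)
    if denominator == 0:
        raise ValueError("time points must not all be identical")
    return numerator / denominator


# Hypothetical traces keyed by DTT concentration (mM); values are invented.
times = [0.0, 5.0, 10.0, 15.0, 20.0]
traces = {
    0.0: [0.0, 0.5, 1.0, 1.5, 2.0],
    1.0: [0.0, 5.0, 10.0, 15.0, 20.0],
}
# One rate per condition gives the points of a dose-response curve.
rates = {dtt_mM: linear_rate(times, trace) for dtt_mM, trace in traces.items()}
```

      Plotting the resulting rates against reagent concentration yields the kind of dose-response presentation described for Fig 1d and e, while the underlying traces correspond to the real-time view of Fig 1c.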

      Why did the authors use full-length constructs in these experiments and did not in e.g. Figure 2 where they used KD constructs instead?

      In the initial experiments, illustrated in Figure 1, we employed full-length protein constructs to establish a proof of concept, demonstrating the overall behavior and interactions of the proteins in their full-length form. This confirmed that BRSK1 & 2, which both contain a conserved T+2 Cys residue that is frequently prognostic for redox sensitivity in related kinases, displayed a near-obligate requirement for reducing agents to promote kinase activity.

      Subsequently, in Figure 2, our focus shifted towards delineating the specific regions within the proteins that are critical for redox regulation. By using constructs that encompass only the kinase domain, we aimed to demonstrate that the redox-sensitive regulation of these proteins is predominantly mediated by specific cysteine residues located within the kinase domain itself. This strategic use of the kinase domain of the protein allowed for a more targeted investigation. Furthermore, in our hands these truncated forms of the protein were more stable at higher concentrations, enabling more detailed characterization of the proteins by DSF and SEC-MALS. We predict that the flanking disordered regions of the full-length protein (as predicted by AlphaFold) contribute to this effect.

      (2) In Figure 2, Did the authors try to do LC/MS-MS in the same experimental conditions as in Figure 1 (e.g. buffer minus/plus DTT, H2O2, H2O2 + DTT)?

      We would like to clarify that the mass spectrometry experiments were conducted exclusively on proteins purified under native (non-reducing) conditions. We did not extend the LC/MS-MS analyses to include proteins treated with various buffer conditions such as minus/plus DTT, H2O2, or H2O2 + DTT as used in the experiments depicted in Figure 1. Given that we could readily detect disulfides in the absence of oxidizing agents, we did not see the benefit of additional treatment conditions as peroxide treatment of protein samples can frequently complicate interpretation of MS data. However, it should be noted that prior to MS analysis, tryptic peptides were subjected to a 50:50 split, with one half alkylated in the presence of DTT (as described in the methods section) to eliminate disulfides and other transiently oxidized Cys forms. Comparative analysis between reduced and non-reduced tryptic peptides improved our confidence when assigning disulfide bonds (which were eliminated in identical peptides in the presence of DTT).

      On panel b, why did the authors show AlphaFold predictions and not empirical structural information (e.g. X-ray, EM,..)?

      The AlphaFold models were primarily utilized to map the general locations of redox-sensitive cysteine pairs within the proteins of interest. Although we have access to the crystal structure of mouse BRSK2, it does not fully capture the active conformation seen in the AlphaFold model of the human version. The use of AlphaFold models for human proteins in this study aids in consistently tracking residue numbering across the manuscript, offering a useful framework for understanding the spatial arrangement of these critical cysteine pairs in their potentially active-like states. This approach facilitates our analysis and discussion by providing a reference for the structural context of these residues in the human proteins.

      What was the rationale for using the KD construct and not the FL as in Figure 1?

      The rationale to use the kinase domain was primarily based on the significantly lower confidence in the structural predictions for regions outside the kinase domain (KD). Our experimental focus was to investigate the role of conserved cysteine residues within the kinase domain, which are critical for the protein's function and regulation. This targeted approach allowed us to concentrate our analyses on the most functionally relevant and structurally defined portion of the protein, thereby enhancing the precision and relevance of our findings. As is frequently the case, truncated forms of the protein, consisting only of the kinase domain, are much more stable than their full length counterparts and are therefore more amenable to in vitro biochemical analysis. In our hands this was true for both BRSK1 and 2, and as such much of the data collected here was generated using kinase-domain (KD) constructs. Simulations using the KD structures are therefore much more representative of our original experimental setup.

      The BRSK1 KD construct appears to be rather inactive and not responsive to DTT treatment. Could the authors comment on the differences observed with the FL construct of Figure 1?

      It is important to note that BRSK1, in general, exhibits lower intrinsic activity compared to BRSK2. This reduced activity could be attributed to a range of factors, including the need for activation by upstream kinases such as LKB1, as well as potential post-translational modifications (PTMs) that may be absent in the bacterially expressed KD construct. The full-length forms of the protein were purified from Sf21 cells, and as such may have additional modifications that are lacking in the bacterially derived KD counterparts. We also cannot discount additional regulatory roles of the regions that flank the KD, and these may contribute in part to the modest discrepancy observed between constructs. Despite these differences, it is crucial to emphasize that both the KD and FL constructs of BRSK1 are regulated by DTT, indicating a conserved redox-dependent activation for both of the related BRSK proteins.

      (3) In Figure 4, on panel A, wouldn't the authors expect that mutating one residue of a pair, e.g. C198A in BRSK1, would have the same effect as mutating the C191 from the T+2 site? Did they try mutating individual sites of the aE/CHRD pair? The same will apply to BRSK2.

      We appreciate the insightful comment. It's important to clarify that the redox regulation of these proteins is influenced not solely by the formation of disulfide bonds but also by the oxidation state of individual cysteine residues, particularly the T+2 Cys. This nuanced mechanism of regulation allows for a diverse range of functional outcomes based on the specific cysteine involved and its state of oxidation. This aspect forms a key finding of our paper, highlighting the complexity of redox regulation beyond mere disulfide bond formation. For example, AURA kinase activity is regulated by oxidation of a single T+2 Cys (Cys290, equivalent to Cys191 and Cys176 of BRSK1 and 2 respectively), but this regulation can be supplemented through artificial incorporation of a secondary Cys at the DFG+2 position (Byrne et al., 2020). This targeted genetic modification of AURA mirrors equivalent regulatory disulfide-forming Cys pairs that naturally occur in kinases such as AKT and MELK, and which provide an extra layer of regulatory fine tuning (and a possible protective role to prevent deleterious over-oxidation) to the T+2 Cys. We surmise that the CPE Cys is also an accessory regulatory element to the T+2 Cys in BRSK1 and 2, which is the dominant driver of BRSK redox sensitivity (as judged by the fact that CPE Cys mutants are still potently regulated by redox [Fig 4]), by locking it in an inactive disulfide configuration.

      In our preliminary analysis of BRSK1, we observed that mutations of individual sites within the aE/CHRD pair were similarly detrimental to kinase activity as a tandem mutation (see Author response image 1). As discussed in the manuscript, we think that these Cys may serve important structural regulatory functions and opted to focus on co-mutations of the aE/CHRD pair for the remainder of our investigation.

      Author response image 1.

      In vitro kinase assays showing rates of in vitro peptide phosphorylation by WT and Cys-to-Ala (aE/CHRD residues) variants of BRSK1 after activation by LKB1.

      In panels C and D, the same experimental conditions should have been measured as in A and B.

      Panels A and B were designed to demonstrate the enzymatic activity and the response to DTT treatment to establish the baseline redox regulation of the kinase and a panel of Cys-to-Ala mutant variants. In contrast, panels C and D were specifically focused on rescue experiments with mutants that showed a significant effect under the conditions tested in A and B. These panels were intended to further explore the role of redox regulation in modulating the activity of these mutants, particularly those that retained some level of activity or exhibited a notable response to redox changes.

      The rationale for this experimental design was to prioritize the investigation of mutants, such as those at the T+2 and CPE cysteine sites, which provided the most insight into the redox-dependent modulation of kinase activity. Other mutants, which resulted in inactivation, were deprioritized in this context as they offered limited additional information regarding the redox regulation mechanism. This focused approach allowed us to delve deeper into understanding how specific cysteine residues contribute to the redox-sensitive control of kinase function, aligning with the overall objective of elucidating the nuanced roles of redox regulation in kinase activity.

      (4) In figure 5: Why did the authors use reduced Glutathione instead of DTT? The authors should have recapitulated the same experimental conditions as in Figure 4 and not focused only on the T+2 or the CPE single mutants but using the double and the aE/CHRD mutants as well, as internal controls and validation of the enzymatic assays using the modified peptide

      Regarding the use of reduced glutathione (GSH) instead of DTT in Figure 5, we chose GSH for its well characterized biological relevance as an antioxidant in cellular responses to oxidative stress. Furthermore, while DTT has been widely used in experimental setups, it is also potentially cytotoxic at high concentrations.

      Addressing the point on experimental consistency with Figure 4, we appreciate the suggestion and indeed had already conducted such experiments (Previously Supp Fig 3, now changed to current Supp Fig 4). These experiments include analyses of BRSK mutant activity in a HEK-293T model. However, we chose not to focus on inactivating mutants (such as the aE/CHRD mutants which had depleted expression levels possibly as a consequence of compromised structural integrity) or pursue the generation of double mutant CMV plasmids, as these were deemed unlikely to add significant insights into the core narrative of our study. Our focus remained on the mutants that yielded the most informative results regarding the redox regulation mechanisms in the in vitro setting, ensuring a clear and impactful presentation of our findings.

      A time course evaluation of the reducing or oxidizing reagents should have been performed. Would we expect that in WT samples, and in the presence of GSH, and also in the case of the CPE mutant, an increment in the levels of Tau phosphorylation as a readout of BSK1-2 activity?

      We acknowledge the importance of such analyses in understanding the dynamic nature of redox regulation on kinase activity and have included a time course (Supp Fig 2 e-g). These results confirm a depletion of Tau phosphorylation over time in response to peroxide generated by the enzyme glucose oxidase.

      (5) In Figure 6, did the authors look at the functional impact of the residues with which interact the T+2 and the CPE motifs e.g. T174 and the E185-R258 tether?

      Our primary focus was on the salt bridges, as this is a key regulatory structural feature that is conserved across many kinases. Regarding the additional interactions mentioned, we have thoroughly evaluated their roles and dynamics through molecular dynamics (MD) simulations but did not find any results of significant relevance to warrant inclusion.

      (6) In Figure 7: Did the author look at the oligomerization state of the BSK1-2 multimers under non-reducing conditions? Were they also observed in the case of the FL constructs? What was the stoichiometry?

      Our current work indicates that the kinase domain of BRSK1-2 primarily exists in a monomeric state, with some evidence of dimerization or multimer formation under specific conditions. Our SEC-MALS (Supp Fig 6) and SDS-PAGE analyses (Figure 7) clearly demonstrate that monomers are overwhelmingly the dominant species under non-reducing conditions (>90%). We also conclude that these limited oligomeric species can be removed by inclusion of reducing agents such as DTT (Figure 7), which may suggest a role for one or more Cys residues. Notably, removal of the T+2 Cys was insufficient to prevent multimerization.

      We were unable to obtain reliable SEC-MALS data for the full-length forms of the protein, likely due to the presence of disordered regions that flank the kinase domain, which result in a highly heterodispersed and unstable preparation (at the concentrations required for SEC-MALS). Although we are therefore unable to comment on the stoichiometry of FL BRSK dimers, we can detect BRSK1 and 2 hetero- and homo-complexes in HEK-293T cells by IP, which supports the existence of limited BRSK1 & 2 dimers (Supp Fig 6a). However, we were unable to detect intermolecular disulfide bonds by MS, although this does not necessarily preclude their existence. The physiological role of BRSK multimerization (if any), and specifically which Cys residues drive this phenomenon, are of significant interest for our future investigations.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      I will summarize my comments and suggestions below.

      (1) Abstract:

      "Non-catalytic (pseudo)kinase signaling mechanisms have been described in metazoans, but information is scarce for plants." To the best of my understanding EFR is an active protein kinase in vitro and in vivo and cannot be considered a pseudokinase. Consider rephrasing.

      We rephrased to: “Non-catalytic signaling mechanisms of protein kinase domains have been described in metazoans, but information is scarce for plants.”

      (2) Page 4: It should be noted, that while membrane associated Rap-RiD systems have been used in planta to activate receptor kinase intracellular domains by promoting interaction with a co-receptor kinase domain, this system does not resemble the actual activation mechanism in the plasma membrane. This would be worth discussing when introducing the system. For example, the first substrates of the RK signaling complex may also be membrane associated and not freely diffuse in solution, which may be important for enzyme-substrate interaction.

      We inserted on page 4: “The RiD system was previously applied in planta, maintaining membrane-association by N-terminal myristoylation (Kim et al., 2021). For the in vitro experiments, the myristoylation sites were excluded to facilitate the production of recombinant protein.”

      (3) Page 4 and Fig 1: The catalytic Asp in BRI1 is D1027 and not D1009 (https://pubmed.ncbi.nlm.nih.gov/21289069/). Please check and prepare the correct mutant protein if needed.

      We clarified this in the text by stating that we mutated the HRD-aspartate to asparagine in all our catalytic-dead mutants: “Kinase-dead variants with the catalytic residue (HRD-aspartate) replaced by asparagine (EFRD849N and BRI1D1009N), had distinct effects […]”. D1027 in BRI1 is the DFG-Asp, which was not mutated in our study.

      (4) Page 4 and Fig 1: Is BIK1 a known component of the BR signaling pathway and a direct BRI1 substrate? Or in other words how specific is the trans-phosphorylation assay? In my opinion, a more suitable substrate for BRI1/BAK1 would be BSK1 or BSK3 (for example https://pubmed.ncbi.nlm.nih.gov/30615605/).

      Kinase-dead BIK1 is a reported substrate of BRI1. We clarified this in the results section by inserting: “BIK1 was chosen as it is a reported substrate of both EFR/BAK1 and BRI1/BAK1 complexes (Lin et al., 2013).”

      (5) Fig. 1B Why is BIK1 D202N partially phosphorylated in the absence of Rap? I would suggest to add control lanes showing BRI1, EFR, FLS2, BAK1 and BIK1 in isolation. Given that a nice in vitro activation system with purified components is available, why not compare the different enzyme kinetics rather than band intensities at only 1 enzyme : substrate ratio?

      BIK1 D202N is partially phosphorylated due to the presence of active BAK1, which is capable of trans-phosphorylating BIK1 D202N, as reported in a previous study (DOI: 10.1038/s41586-018-0471-x).

      (6) Page 4 and Fig 1: Is the kinase dead variant of EFR indeed kinase dead? I could still see a decent autorad signal for this mutant when expressed in E. coli (Fig 1 A in Bender et al., 2021; https://pubmed.ncbi.nlm.nih.gov/34531323/)? If this mutant is not completely inactive, could this change the interpretation of the experiments performed with the mutant protein in vitro and in planta in the current manuscript? In my opinion, it could be possible that a partially active EFR mutant can be further activated by BAK1, and in turn can phosphorylate BIK1 D202N. The differences in autorad signal for BRI1D1009?N and EFRD849N is very small, and the entire mechanism hinges on this difference.

      We would like to emphasize that the mechanism hinges on the difference between non-dimerized and dimerized kinase domains in the in vitro kinase assay. BRI1 D1009N fails to enhance BIK1 D202N trans-phosphorylation compared to the non-dimerized sample, while EFR D849N is still capable of enhancing BIK1 transphosphorylation upon dimerization as indicated by quantification of autorads (Figure 1B/C). We have also addressed this point in a section on the limitations of our study.

      (7) Fig 1B. "Our findings therefore support the hypothesis that EFR increases BIK1 phosphorylation by allosterically activating the BAK1 kinase domain." To the best of my understanding presence of wild-type EFR in the EFR-BAK1 signaling complex leads to much better phosphorylation of BIK1D202N when compared to the EFRD849N mutant. How does that support the allosteric mechanism? By assuming that the D849N mutant is in an inactive conformation and fully catalytically inactive (see above)? Again, I think the data could also be interpreted in such a way that the small difference in autorad signal for BIK1 between BRI1 inactive (but see above) and EFR inactive is due to EFR not being completely kinase dead (see above), rather than EFR being an allosteric regulator. To clarify this point I would suggest (1) to perform quantitative auto- and trans-(generic substrate) phosphorylation assays with wt and D849N EFR to derive enzyme kinetic parameters, (2) to include the EFRD849 mutant in the HDX analysis and (3) to generate transgenic lines for EFRD489N/F761H/Y836F // EFRD489N/F761H/SSAA and compare them to the existing lines in Fig. 3.

      Mutations of proteins, especially those that require conformational plasticity for their function, can have pleiotropic effects: a mutation may alter the conformational plasticity and, consequently, both the catalytic and non-catalytic functions that depend on it. In such cases, it is difficult to fully untangle catalytic and non-catalytic functions. Returning to EFR D849N, the D849N mutation may also impact the non-catalytic function by altering the conformational plasticity, which would explain the difference observed between EFR and EFR D849N. As you rightly suggested, HDX would be a way to address this but would still not clarify whether catalytic activity contributes to activation. We instead attempted to produce analog-sensitive EFR variants for in vivo characterization of EFR-targeted catalytic inhibition. Unfortunately, we failed to produce an analog-sensitive variant for which we could show ATP-analog binding. To address your concern, we inserted a section on limitations of the study.

      (8) Fig. 2B,C, supplement 3 C,D. Has it been assessed if the different EFR versions were expressed to similar protein levels and still localized to the PM?

      Localization of the mutant receptors has not been explicitly evaluated by confocal microscopy. However, the selected mutation EFRF761H is shown to accumulate in stable Arabidopsis lines (Figure 3 – Supplement 1C) and BAK1 could be coIPed by all EFR variants upon elf18-treatment (Figure 3 B), indicating plasma membrane localization.

      (9) How the active-like conformation of EFR is in turn activating BAK1 is poorly characterized, but appears to be the main step in the activation of the receptor complex. Extending the HDX analyses to resting and Rap-activated receptor complexes could be a first step to address this question. I tried to come up with an experimental plan to test if indeed the kinase activity of BAK1 and not of EFR is essential for signal propagation, but this is a complex issue. You would need to be able to mimic an activated form of EFR (which you can), to make sure it's inactive (possibly, see above) and likewise to engineer a catalytically inactive form of BAK1 in an active-like state (difficult). As such a decisive experiment is difficult to implement, I would suggest to discuss different possible interpretations of the existing data and alternative scenarios in the discussion section of the manuscript.

      We addressed your concern about whether BAK1 kinase activity is essential for signal propagation by pairing EFRF761H and BAK1D416N (Figure 4 Supplement 2 C), which fails to induce signaling. In this case, EFRF761H is in its activated conformation but cannot activate downstream signaling. We also attempted to address your concern with an in vitro kinase assay, pairing EFR and BAK1D416N and using a range of concentrations of the substrate BIK1D202N. We observed that the catalytic activity of BAK1, but not of EFR, was essential for BIK1 phosphorylation. However, this experiment does not address whether activated EFR can efficiently propagate signaling in the absence of BAK1 catalytic activity. In the limitations of the study section, we now discuss the catalytic importance of EFR for signaling activation.

      Author response image 1.

      BIK1 trans-phosphorylation depends on BAK1 catalytic activity. Increasing concentrations of BIK1 D202N were used as substrate for Rap-induced dimers of EFR-BAK1, EFR D849N-BAK1, and EFR-BAK1 D416N respectively. BIK1 trans-phosphorylation depended on the catalytic activity of BAK1. Proteins were purified from E. coli λPP cells. Three experiments yielded similar results of which a representative is shown here.

      Reviewer #2:

      All of my suggestions are minor.

      Figure 1B, I think it would be more useful to readers to explain the amino acid in the D-N change, rather than just call it D-to-N? Also, please label the bands on the stained gel; the shift on FKBP-BRI1 and FKBP-EFR are noticeable on the Coomassie stain.

      We implemented your suggestions.

      Figure 1-Supplement 1. There is still a signal in pS612 BAK1 (it states 'also failed to induce BAK1 S612 phosphorylation' in the text, which is not quite correct). Also, could mention the gel shift seen in BAK1, which appears absent in Y836F.

      We corrected the text which now states: “To test whether the requirement for Y836 phosphorylation is similar, we immunoprecipitated EFR-GFP and EFRY836F-GFP from mock- or elf18-treated seedlings and probed co-immunoprecipitated BAK1 for S612 phosphorylation. EFRY836F also obstructed the induction of BAK1 S612 phosphorylation (Figure 1 – Supplement 1), indicating that EFRY836F and EFRSSAA impair receptor complex activation.” The gel shift of BAK1 you pointed out was not observed in replications and thus we prefer not to comment on it.

      Figure 2 and 3 are full of a, b, c,d's, which I don't understand. Sorry

      We used uppercase letters to indicate subpanels and lowercase letters to indicate the results of the statistical testing. In the figure caption, we have clarified that the lowercase letters refer to statistical comparisons.

      Figure 2 A. If each point on the x-axis is one amino acid, I think it would again be useful to name the amino acids that the gold or purple or blue colored lines extend through.

      Each point represents a peptide; the peptides are sorted by the position of their starting amino acid, from N-terminus to C-terminus. We have now added plots of HDX for the individual peptides that correspond to the highlighted region in subpanel A.

      Figure Supplement 1 is very small for what it is trying to show, even on the printed page. If this residue were to be phosphorylated, what would happen to the H-bond?

      We suppose that VIa-Tyr phosphorylation would break the H-bond and cause displacement of the aC-b4 loop. Recent studies, published after our submission, highlight the importance of this loop for substrate coordination and ATP binding. Thus, phosphorylating VIa-Tyr and displacing this loop may render the kinase rather unproductive. We have expanded the discussion to include this point.

      Figure 2B: Tyr 836 is not present in any of the alignments in Figure 2A. This should be rectified, because the text talks about the similarity to Tyr 156 in PKA.

      We have adjusted the alignments such that they now contain the VIa-Tyr residues of EFR and PKA.

      Figure 4D. Is there any particular reason that these blots are so hard to compare for FKBP and BAK1?

      We assume this refers to Figure 4 – Supplement 2 D. FKBP-EFR and FRB-BAK1 are both approximately the size of RubisCo, the most abundant protein in plant protein samples, which overlays the FKBP- and FRB-tagged kinases. Thus, it is difficult to detect these proteins.

      Reviewer #3:

      (1) The paper reporting the allosteric activation mechanism of EGFR should be cited.

      Will be included.

      (2) The authors showed that "Rap addition increased BIK1 D202N phosphorylation when the BRI1 or EFR kinase domains were dimerized with BAK1, but no such effect was observed with FLS2". Please explain why FLS2 failed to enhance BIK1 transphosphorylation by Rap treatment?

      Even though BIK1 is a reported downstream signaling component of FLS2/BAK1, it might not be the most relevant one, and related RLCKs, such as PBL1, might be better substrates for dimerized FLS2/BAK1. We have not tested this, however. Alternatively, the purified FLS2 kinase domain might be labile and unfold quickly even though it was kept on ice until the start of the assay, or the N-terminal FKBP-tag may disrupt function. As the reason for our observation is not clear, we have removed the FLS2 in vitro dimerization experiments from the manuscript.

      (3) Based solely on the data presented in Figure 1, it can be concluded that EFR's kinase activity is not required to facilitate BIK1 transphosphorylation. Therefore, the title of Figure 1, "EFR Allosterically Activates BAK1," may be inappropriate.

      We have changed the figure title to: “EFR facilitates BIK1 trans-phosphorylation by BAK1 non-catalytically.”

      (4) In Figure 1- Supplement 1, I could not find any bands in anti-GFP and anti-BAK1 pS612 of input. Please redo it.

      Indeed, we could not detect protein in the input samples of this experiment. BAK1 S612 phosphorylation is an activation mark and not necessarily expected to be abundant enough for detection in input samples. EFR-GFP, however, is usually detected in input samples and is reported in Macho et al. 2014, the manuscript from which these lines come. Why EFR-GFP is not detected in this set of experiments is unclear but, in our opinion, does not detract from the conclusions drawn, since similar amounts of EFR-GFP are pulled down across all samples.

      (5) For Figure 2A, please mark the structure represented by each color directly in the figure.

      We have made the suggested change.

      (6) Please modify "EFRF761/Y836F and EFRF761H/SSAA restore BIK1 trans-phosphorylation" to "EFRF761H/Y836F and EFRF761H/SSAA restore BIK1 trans-phosphorylation".

      Thank you for spotting this. We changed it.

      (7) The HDX-MS analysis demonstrated that the EFR (Y836F) mutation inhibits the formation of the active-like conformation. Conversely, the EFR (F761H) mutation serves as a potent intragenic suppressor, significantly stabilizing the active-like conformation. Confirming through HDX-MS conformational testing that the EFR (Y836F F761H) double mutation does not hinder the formation of the active-like EFR kinase conformation would greatly strengthen the conclusions of the article.

      We agree that this would be beneficial. We attempted it but failed to produce enough protein for HDX-MS analysis. We now state this in an extra section of the paper (“Limitations of the study”).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public review): 

      Summary: 

      This is a comprehensive study that clearly and deeply investigates the function of GATA6 in human early cardiac development. 

      Strengths: 

      This study combines hESC engineering, differentiation, detailed gene expression, genome occupancy, and pathway modulation to elucidate the role of GATA6 in early cardiac differentiation. The work is carefully executed and the results support the conclusions. The use of publicly available data is well integrated throughout the manuscript. The RIME experiments are excellent. 

      Weaknesses: 

      Much has been known about GATA6 in mesendoderm development, and this is acknowledged by the authors. 

      We appreciate the comments and have tried to highlight both the early role of GATA6 in cardiac progenitor biology as well as the haploinsufficiency for relevance to human congenital heart disease, which we believe adds value to other recent published work, among others Sharma et al. eLife 2020.

      Reviewer #2 (Public review): 

      Summary: 

      This manuscript by Bisson et al describes the role of GATA6 to regulate cardiac progenitor cell (CPC) specification and cardiomyocyte (CM) generation using human embryonic stem cells (hESCs). The authors found that GATA6 loss-of-function hESC exhibits early defects in mesendoderm and lateral mesoderm patterning stages. Using RNA-seq and CUT&RUN assays the genes of the Wnt and BMP programs were found to be affected by the loss of GATA6 expression. Modulating Wnt and BMP during early cardiac differentiation can partially rescue CPC and CM defects in GATA6 hetero- and homozygous mutant hESCs. 

      Strengths: 

      The studies performed were rigorous and the rationale for the experimental design was logical. The results obtained were clear and supported the conclusions that the authors made regarding the role of GATA6 on Wnt and BMP pathway gene expression. 

      Weaknesses: 

      Given the wealth of studies that have been performed in this research area previously, the amount of new information provided in this study is relatively modest. Nevertheless, the results are quite clear and should make a strong contribution to the field.

      Likewise for reviewer 2, we appreciate the comments and have tried to highlight both the early role of GATA6 in cardiac progenitor biology as well as the haploinsufficiency for relevance to human congenital heart disease.

      Reviewer #3 (Public review): 

      In this study, Bisson et al. analyzed the role of the GATA6 transcription factor in patterning the early mesoderm and generating cardiomyocytes, using human embryonic stem cell differentiation assays and patient-derived hiPSCs with heart defects associated with mutations in the GATA6 gene. They identified a novel role for GATA6 in regulating genes involved in the WNT and BMP pathways, findings not previously noted in earlier analyses of GATA6 mutant hiPSCs during early cardiac mesoderm specification (Sharma et al., 2020). Modulation of the WNT and BMP pathways may partially rescue early cardiac mesoderm defects in GATA6 mutant hESCs. These results provide significant insights into how GATA6 loss-of-function and heterozygous mutations contribute to heart defects.

      I have the following comments: 

      (1) Throughout the manuscript, Bisson et al. alternate between different protocols to generate cardiomyocytes, which creates some confusion (e.g., Figure 1 vs. Supplemental Figure 2A). The authors should provide a clear justification for using alternative protocols.

      We agree and clarified this issue in the revision (p. 6). The reviewer is correct that there are two widely used protocols for directed differentiation of PSCs to cardiac fate. One is a cytokine-based protocol (Fig. 1A) and the other uses small molecules to manipulate the WNT pathway (CHIR protocol, Supplemental Fig. 2B). In our study, we used the CHIR protocol only for experiments in Supplemental Figure 2B-E. Since our data implicated BMP and WNT as mediators of the GATA6-dependent program, we did this mainly to confirm that the phenotype we observed with the cytokine-based protocol was not biased by the differentiation protocol. However, we found the CHIR protocol to be overall relatively inefficient for cardiac differentiation using the parental H1 hESCs and the various isogenic lines. The in vitro cardiac differentiation protocols for hPSCs are known to be variable depending on lines and sometimes require extensive optimization for various media components and concentrations, cell seeding densities, and batch variations for crucial reagents. The cytokine-based protocol we optimized worked most efficiently with our hPSC lines to generate cardiomyocytes, therefore we committed to using it for the bulk of experiments in this study.  

      (2) The authors should characterise the mesodermal identity and cardiomyocyte subtypes generated with the activin/BMP-induction protocol thoroughly and clarify whether defects in the expression of BMP- and WNT-related genes affect the formation of specific cardiomyocyte subtypes in a chamber-specific manner. This analysis is important, as Sharma et al. suggested a role for GATA6 in orchestrating outflow tract formation, and Bisson et al. similarly identified decreased expression of NRP1, a gene involved in outflow tract septation, in their GATA6 mutant cells.

      We agree it is important that the mesodermal identities are quite thoroughly characterized.

      For example, Fig. 2 (K+P+, Brachyury, EOMES), Fig. 3G&H (lateral mesoderm, cardiac mesoderm RNAseq & GSEA comparing datasets from Koh et al.). The capacity of the cytokine-based protocol to generate both FHF and SHF derived sub-types has been rigorously evaluated by Keller and colleagues, which we now cite (Yang et al. 2022). Since the null cells do not generate CMs, chamber specific subtypes cannot be evaluated; whether the GATA6 heterozygous mutants are biased is an interesting question. Indeed, the top GO term identified by CUT&RUN analysis for GATA6 at day 2 of differentiation is outflow tract morphogenesis, which is consistent with the interpretation by Sharma et al., but implicates this program at a much earlier developmental stage, long before cardiomyocyte differentiation. We think this is one of the most important findings of our study and appreciate the chance to highlight this in the revision (p. 9, 17). When we evaluated chamber-specificity for differentiated cardiomyocytes, we did not find significant differences, as indicated for the reviewer in the panel below (day 20 of differentiation). Since our study focuses on early stages of progenitor specification rather than cardiomyocyte differentiation, we agree that a more rigorous analysis would be of value, and indicated this as a limitation of our current study (p. 18).

      Author response image 1.

      (3) The authors developed an iPSC line derived from a congenital heart disease (CHD) patient with an atrial septal defect and observed that these cells generate cTnnT+ cells less efficiently. However, it remains unclear whether atrial cardiomyocytes (or those localised specifically at the septum) are being generated using the activin/BMP-induction protocol and the patient-derived iPSC line.

      As indicated above, our study is focused on cardiac progenitor specification, and we found similar differences with the patient-derived iPSC-CMs compared to using hESC heterozygous targeted mutants. While we did not note any major differences in expression of cardiomyocyte markers, whether the mutants show any biases toward sub-types of cardiomyocytes is an interesting question to be pursued in subsequent work.

      (4) The authors should also justify the necessity of using the patient-derived line to further analyse GATA6 function. 

      This is a good point, and as suggested we provided the justification (p. 5-6). This is the first patient-derived iPSC line published with a heterozygous GATA6 mutation along with an isogenic mutation-corrected control generated for cardiac directed differentiation. Patients with congenital heart disease (CHD) associated with GATA6 mutations are typically heterozygous (also true for many other CHD variants; presumably homozygous null embryos would not survive). It is important to query if phenotypes found using targeted mutations in hESCs (or iPSCs) model the human disease, since the patient cells (or the hESCs) likely have additional genetic variants that might interact with the GATA6 mutation. The fact that both types of heterozygous cells (patient-derived iPSCs and targeted hESCs) generate similar defects in CM differentiation provides evidence supporting the use of these human cellular models to study the genetic and cellular basis for congenital heart disease. This is particularly important, since other models, such as heterozygous mice, do not show such phenotypes.

      (5) Figure 3 suggests an enrichment of paraxial mesoderm genes in the context of GATA6 loss-of-function, which is intriguing given the well-established role of GATA6 in specifying cardiac versus pharyngeal mesoderm lineages in model organisms. Could the authors expand their analysis beyond GO term enrichment to explore which alternative fates GATA6 mutant cells may acquire? Additionally, how does the potential enrichment of paraxial mesoderm, rather than pharyngeal mesoderm, relate to the initial mesodermal induction from their differentiation protocol? Could the authors also rule out the possibility of increased neuronal cell fates? 

      We need to interpret our in vitro differentiation data cautiously in relation to what has been shown in vivo, since we are unlikely to be reproducing all the complex signaling taking place in the embryo. Yet we do see modest increases in gene expression levels including signatures of paraxial mesoderm and ECM/mesenchymal at days 2 or 3 of differentiation in the GATA6 mutant cells. Therefore, we now include a heatmap showing enriched paraxial mesoderm gene expression in the mutant cells, new Fig. 3I (see page 10).

      A caveat of this result is that the cells are being differentiated toward cardiac fate, so a bias for alternative fates might be suppressed. We modified the protocol to favor paraxial fate by adding CHIR at day 2 (rather than XAV) and performing qPCR assays at day 3. We found that this successfully induced paraxial mesoderm gene expression, but to an equal extent in wildtype, heterozygous, and null cells, so we do not feel it warrants highlighting further.

      Recommendations for the authors:  

      Reviewing Editor (Recommendations for the authors): 

      Incorporation of marker analysis for various stages of iPSC to CM differentiation (mesoderm, cardiac progenitor, CM subtypes) would increase the significance and support for the findings presented. Further data on the link (direct or indirect) between GATA6 and Wnt/BMP signalling would also add to the significance of this study. A number of textual changes/clarifications are also suggested to improve the manuscript. 

      We appreciate the feedback and provide responses for issues raised for markers, direct or indirect interactions, and textual changes/clarifications in the following sections. As indicated above, we did not find obvious alterations in cardiac subtypes, but since our study is focused on early progenitor specification, this is an interesting question that we think should be more rigorously evaluated in subsequent work.  

      Reviewer #1 (Recommendations for the authors)

      Minor details: 

      (1) On p6 "Principal component analysis (PCA) showed that the cells derived from each genotype were well separated from each other (Supplemental Figure 2C)". All genotypes should be in one PCA plot to better evaluate the three genotypes. 

      We prepared the new plot as suggested, presented as new Supplemental Fig. 2C. 

      (2) p10: "Chia et al.22 and found a significantly decreased enrichment in GATA6-/- cells relative to WT at day 2" decreased enrichment of what? Direct target genes? 

      Thank you for catching this. Yes, the text was changed to indicate a “decreased enrichment in GATA6-/- cells relative to WT at day 2 for putative direct GATA6 target genes.” 

      Reviewer #2 (Recommendations for the authors): 

      Overall, this is an interesting study that addresses the early developmental roles of GATA6 on cardiac differentiation. While the identification of Wnt and BMP pathway genes to be involved in GATA6 regulation is not entirely unexpected, the authors do bring forth some useful knowledge that helps to further elucidate the mechanism of pre-cardiac mesoderm regulation. Some suggestions for improvement are included below - 

      Major points: 

      (1) Since the loss of Gata6 in this study is global (either as heterozygous or homozygous), it is likely that the very early requirement of Gata6 (e.g. at the mesodermal stage of differentiation) is responsible for the cardiac transcriptional phenotype observed, and not a specific role of Gata6 in the cardiac lineage, which would need to be addressed using conditional knockout of Gata6 in an hPSC model. The authors should be more explicit when discussing the results as disruption of mesodermal differentiation leading to loss of downstream cardiac lineage cells. For example, I would change the title "GATA6 loss-of-function impairs CM differentiation" to "GATA6 loss-of-function impairs mesodermal (or mesodermal lineage) differentiation" and show the changes in cardiac progenitor cell genes (Isl1, Tbx1, Hand1, and BAF60c/Smarcd3) in addition to cardiomyocyte genes but no change in mesodermal (e.g. Brachyury, T, Eomes, Mesp1/2, etc) genes. 

      We agree with the reviewer’s interpretation. The title for the section was changed as suggested. In Fig. 1, we show changes in cardiac progenitor cell genes (Isl1, Hand1, and BAF60c/Smarcd3) while not seeing changes in mesodermal genes in Fig. 2 (e.g. Brachyury, Eomes, Mesp1/2). We note that the defect may be specific to cardiac (or anterior lateral) mesoderm, as the ability to express paraxial mesoderm markers was not impaired.  

      (2) The use of NKX2.5, TBX5, TBX20, and GATA4 as markers for CPC is not ideal. These markers are also expressed in differentiated cardiomyocytes. ISL1 or TBX1 for second heart field progenitors and HAND1 or BAF60c/Smarcd3 for first heart field progenitors would be ideal.  

      As suggested, we included an additional day 6 qPCR panel (new Fig. 1E) to evaluate the heart field progenitor markers. 

      (3) Much of the findings described in this study have been known in the field including the requirement of Wnt and BMP to induce mesodermal and subsequently cardiomyocyte differentiation. The key new information here is that Gata6 knockout disrupts Wnt and BMP signaling. It would help to further validate experimentally some of the Wnt and BMP genes as either direct or indirect targets of Gata6 using reporter assays. 

      While reporter assays are feasible and do provide relevant outputs, we feel that the use of any one or even several response elements in a reporter assay adds relatively little value compared to comprehensive analysis of bona fide network components. To address the reviewer's concern, we have included profiling heat maps for WNT and BMP pathway components to more rigorously and specifically evaluate the disruption in the signaling networks caused by loss of GATA6. Proving direct targets of endogenous genes is challenging, but we mapped many binding peaks for GATA6 to putative enhancers of WNT/BMP pathway genes (based on histone marks). We provide a list of these genes (new Fig. 4F) and distinguish these from WNT/BMP pathway genes that were not bound by GATA6 yet are down-regulated in the GATA6 mutant cells and are likely to be indirect targets (p. 12). 

      Minor points: 

      (1) Figures 1 and 2 - in the figure legend the labels w2, w4, m2, m5, m11, and m14 should be explained as the name of the clones of targeted hESC.  

      The legends were edited to provide this information.  

      (2) Supplemental Figure 3A - the resolution of the FACS plot is suboptimal. 

      We apologize and have corrected the plot resolution in the revised manuscript.  

      (3) Supplemental Table 1 - it's intriguing that amongst all the SWI/SNF factors, the one that is known to be cardiac-specific (SMARCD3) did not come up in the GATA6-RIME-enriched proteins. Is this a reflection of the early stage in which GATA6 plays a role in development (e.g. mesendoderm development but not precardiac mesoderm development when SMARCD3 is expressed)? 

      We agree and have noted this feature in the revised manuscript (p. 17). We note that SMARCD3 is expressed in the RNA-seq data as early as day 2. Although speculative, it may be that GATA6 primarily interacts with SWI/SNF complexes prior to the role for SMARCD3 in cardiac specification.

      Reviewer #3 (Recommendations for the authors): 

      (1) Figures 3G and 3H, as well as others, have resolution issues. The gene names are unreadable, and higher-resolution images should be provided. 

      We apologize for the resolution issues and these have been fixed in the revised version. 

      (2) In their early manipulation of the WNT and BMP pathways (Figure 6A), it is unclear whether the activin/BMP protocol shown in Figure 1A was used. If this is the case, the authors should compare their results to a wild-type + DOX EV condition for consistency. 

      We clarified in the revision (Fig. 6A) that all the experiments in Fig. 6 use the cytokine protocol. In the revised figure, we included the wild-type + DOX EV condition as suggested. 

      (3) In Figures 6C and 6D, the authors should include an analysis of a wild-type isogenic line under their new CHIR/LB condition for comparison. 

      As suggested, we included the WT isogenic line in the comparison. For Fig. 6C these are shown on a separate graph because the Y-axis values are very different. Note that the CHIR/LB treatments that improve mutant cell differentiation impact the WT cells in the opposite manner.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      Summary:

      The manuscript by Bohra et al. describes the indirect effects of ligand-dependent gene activation on neighboring non-target genes. The authors utilized single-molecule RNA-FISH (targeting both mature and intronic regions), 4C-seq, and enhancer deletions to demonstrate that the non-enhancer-targeted gene TFF3, located in the same TAD as the target gene TFF1, alters its expression when TFF1 expression declines at the end of the estrogen signaling peak. Since the enhancer does not loop with TFF3, the authors conclude that mechanisms other than estrogen receptor or enhancer-driven induction are responsible for TFF3 expression. Moreover, ERα intensity correlations show that both high and low levels of ERα are unfavorable for TFF1 expression. The ERα level correlations are further supported by overexpression of GFP-ERα. The authors conclude that the transcriptional machinery used by TFF1 for its acute activation can negatively impact TFF3 at the peak of signaling, but once the condensate dissolves, TFF3 benefits from it for its low expression.

      Strengths:

      The findings are indeed intriguing. The authors have maintained appropriate experimental controls, and their conclusions are well-supported by the data.

      Weaknesses:

      There are some major and minor concerns that relate to the approach, data presentation, and discussion. But I think they can be fixed with more effort.

      We thank the reviewer for their positive comments on the paper. We have addressed all their specific recommendations below.  

      The deletion of enhancer reveals the absolute reliance of TFF1 on its enhancers for its expression. Authors should elaborate more on this as this is an important finding.

      We thank the reviewer for the comment. We have now added a more detailed discussion on the requirement of enhancer for TFF1 expression in the revised manuscript (line 368-385).  

      In Fig. 1, TFF3 expression is shown to be induced upon E2 signaling through qRT-PCR, while smFISH does not display a similar pattern. The authors attribute this discrepancy to the overall low expression of TFF3. In my opinion, this argument could be further supported by relevant literature, if available. Additionally, does GRO-seq data reveal any changes in TFF3 expression following estrogen stimulation? The GRO-seq track shown in Fig.1 should be adjusted to TFF3 expression to appreciate its expression changes.

      We have now included a browser shot of the TFF3 region showing the GRO-Seq signal over the E2 time course (Fig. S1C). We observed increased transcription towards the 3’ end of the TFF3 gene body at 3h, which is consistent with the smFISH data. The relative changes in TFF3 expression measured by qRT-PCR and by smFISH for intronic transcripts are somewhat different. We speculate that biases in measurements that depend on PCR amplification could be larger for genes expressed at low levels, and that smFISH using intronic probes may be a more sensitive assay for detecting such changes.

      Since the mutually exclusive relationship between TFF1 and TFF3 is based on snap shots in fixed cells, can authors comment on whether the same cell that expresses TFF1 at 1h, expresses TFF3 at 3h? Perhaps, the calculations taking total number of cells that express these genes at 1 and 3h would be useful.

      As pointed out by the reviewer, since these are fixed cells, we cannot comment on the fate of the same cell at two time points. To address this limitation, future work could employ cells with endogenous tags for TFF1 and TFF3 and utilize live-cell imaging techniques. In a fixed-cell assay, as the reviewer suggests, it can be investigated whether a similar fraction shows high TFF3 expression at 3h as the fraction that shows high TFF1 expression at 1h. To quantify the fractions as suggested by the reviewer, we plotted the fraction of cells showing high TFF1 and TFF3 expression at 1h and 3h. We identify truly high-expressing cells by taking the mean plus one standard deviation (for single-cell-level data) at E2-1hr as the threshold for TFF1 (80 and above transcript counts) and the mean plus one standard deviation (for single-cell-level data) at E2-3hr as the threshold for TFF3 (36 and above transcript counts). The fraction with high TFF1 expression at 1h (12.06 ± 2.1) is indeed comparable to that with high TFF3 expression at 3h (12.50 ± 2.0) (Fig. 2C and Author response image 1). We should note that if the transcript counts were normally distributed, a predetermined fraction would be expected to be above these thresholds, and comparable fractions could arise just from the underlying statistics. But in our experiments, this is unlikely to be the case given the many outliers that affect both the mean and the standard deviation, and the lack of normality and high dispersion in single-cell distributions. Of course, despite the fractions being comparable, we cannot be certain whether it is the same set of cells that goes from high expression of TFF1 to high expression of TFF3, but that is certainly a possibility. We thank the reviewer for pointing out this comparison.

      Author response image 1.

      The graph represents the percentage of cells that show high expression of TFF1 and TFF3 at 1h and 3h post E2 signaling. The threshold was determined by pooling absolute RNA counts from 650 analyzed cells (as in Fig. 2C). The mean and standard deviation over single-cell data were calculated. Mean plus one standard deviation was used to set the threshold for identifying high-expressing cells. For TFF1, as it is maximally expressed at 1h, the threshold used was 80. For TFF3, as it is maximally expressed at 3h, the threshold used was 36. The fractions of cells expressing at or above 80 and 36 for TFF1 and TFF3, respectively, were calculated from three different repeats. The mean of means and standard deviations from the three experiments are plotted here.
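
      The thresholding described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the analysis code used for the figure: the function names and the toy counts are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption, since the response does not specify which was used. The last line computes the fraction expected above mean plus one standard deviation if counts were exactly normal, which is the "predetermined fraction" referred to in the response.

```python
from statistics import NormalDist, mean, stdev


def high_expression_threshold(pooled_counts):
    """Mean plus one (sample) standard deviation of pooled per-cell counts."""
    return mean(pooled_counts) + stdev(pooled_counts)


def percent_high(counts, threshold):
    """Percentage of cells with a transcript count at or above the threshold."""
    return 100.0 * sum(c >= threshold for c in counts) / len(counts)


def summarize_repeats(repeats, threshold):
    """Mean and standard deviation of the per-repeat percentages."""
    percents = [percent_high(r, threshold) for r in repeats]
    return mean(percents), stdev(percents)


# Hypothetical toy data: per-cell transcript counts from three repeats.
repeats = [
    [0, 2, 5, 10, 12, 20, 35, 90, 150, 4],
    [1, 3, 3, 8, 15, 22, 40, 85, 7, 6],
    [0, 0, 6, 9, 11, 30, 60, 120, 5, 2],
]
pooled = [c for r in repeats for c in r]
threshold = high_expression_threshold(pooled)
avg_percent, sd_percent = summarize_repeats(repeats, threshold)

# Percentage expected above mean + 1 SD for a perfectly normal distribution.
expected_if_normal = 100.0 * (1.0 - NormalDist().cdf(1.0))
```

      Comparing `avg_percent` with `expected_if_normal` gives a quick sense of how far a skewed, outlier-heavy distribution departs from the normal expectation.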

      Authors conclude that TFF3 is not directly regulated by enhancer or estrogen receptor. Does ERa bind on TFF3 promoter? 

      The ERa ChIP-seq performed at 1h and 3h of signaling suggests that TFF3 promoter is not bound by ERa as shown in supplementary Fig. 1B and S1B. However, one peak upstream to TFF1 promoter is visible and that is lost at 3h. 

      Minor comments:

      Reviewer’s comment -The figures would benefit from resizing of panels. There is very little space between the panels.

      We have now resized the figures in the revised manuscript.

      The discussion section could include an extrapolation on the relationship between ERα concentration and transcriptional regulation. Given that ERα levels have been shown to play a critical role in breast cancer, exploring how varying concentrations of ERα affect gene expression, including the differential regulation of target and non-target genes, would provide valuable insights into the broader implications of this study.

      This is a very important point that was missing from the manuscript. We have included this in the discussion in the revised manuscript (line 426-430).

      Reviewer #2:

      Summary:

      In this manuscript by Bohra et al., the authors use the well-established estrogen response in MCF7 cells to interrogate the role of genome architecture, enhancers, and estrogen receptor concentration in transcriptional regulation. They propose there is competition between the genes TFF1 and TFF3 which is mediated by transcriptional condensates. This reviewer does not find these claims persuasive as presented. Moreover, the results are not placed in the context of current knowledge.

      Strengths:

      High level of ERalpha expression seems to diminish the transcriptional response. Thus, the results in Fig. 4 have potential insight into ER-mediated transcription. Yet this observation is not pursued in great depth, for example with mutagenesis of ERalpha. However, this phenomenon - which falls under the general description of non-monotonic dose response - is treated at great depth in the literature (i.e. PMID: 22419778). For example, the result the authors describe in Fig. 4 has been reported and in fact mathematically modeled in PMID 23134774. One possible avenue for improving this paper would be to dig into this result at the single-cell level using deletion mutants of ERalpha or by perturbing co-activators.

      We thank the reviewer for pointing us to the relevant literature on our observation, which will enhance the manuscript. We have discussed these findings in relation to ours in the discussion section (Line 400-413). We thank the reviewer for the insight on non-monotonic behavior.

      Weaknesses:

      There are concerns with the sm-RNA FISH experiments. It is highly unusual to see so much intronic signal away from the site of transcription (Fig. 2) (PMID: 27932455, 30554876), which suggests to me the authors are carrying out incorrect thresholding or have a substantial amount of labelling background. The Cote paper cited in the manuscript is likewise inconsistent with their findings and is cited in a misleading manner: they see splicing within a very small region away from the site of transcription. 

      We thank the reviewer for this comment, and apologize if they feel we misrepresented the argument from Cote et al. This has now been rectified in the manuscript. However, we do not agree that the intronic signals away from the site of transcription are an artefact. First, the images presented here are just representative 2D projections of 3D Z-stacks, whereas the full 3D stack is used for spot counting using a widely used algorithm that reports spot counts that are constant over a wide range of thresholds (Raj et al., 2008). The veracity of the automated counts was first verified by comparison to manual counts. Even for the 2D representations, the extragenic intronic signals show up at similar thresholds to the transcription sites. 

      The signal does not arise from non-specific background labeling, for the following reasons:

      • To further support the time-course smFISH data and its interpretation without depending on the dispersed intronic signal, we have analyzed the number of alleles firing (sites of transcription) at a given time in a cell under the three conditions. We counted the sites of transcription in a given cell and calculated the percentage of cells showing 1, 2, 3, 4 or >4 sites. We see that the percentage of cells showing a single site of transcription for TFF1 is very high in uninduced cells and decreases at 1h. At 1h, the percentage of cells showing 2, 3 and 4 sites of transcription increases, and it goes down again at 3h (Author response image 2A). This agrees with the interpretation made from mean intronic counts away from the site of transcription. Similarly, for TFF3, the number of cells showing 2, 3 and 4 sites of transcription increases slightly at 3hr compared to uninduced and 1hr (Author response image 2B). We can also see that several cells have no alleles firing at a given time, as quantified in the graphs on the right showing the total fraction of cells with zero versus non-zero alleles firing (Author response image 2A-B). A non-specific signal would be present in all cells.

      • There is literature on post-transcriptional splicing of RNA beyond our work, which suggests that intronic signal can be found at relatively large distances away from the site of transcription. Waks et al. showed that some fraction of unspliced RNA could be observed up to 6-10 microns away from the site of transcription, suggesting that there can be a delay between transcription and (alternative) splicing (Waks et al., 2011). Pan-nuclear dispersed intronic signals can arise as there can be more than one allele firing at a time in different nuclear locations. The spread of intronic transcripts in our images is also limited in cells in which only 1 allele is firing at E2-1 hour (Author response image 2C) or uninduced cells (Author response image 2D). Furthermore, Cote et al. discuss that “Of note, we see that increased transcription level correlates with intron dispersal, suggesting that the percentage of splicing occurring away from the transcription site is regulated by transcription level for at least some introns. This may explain why we observe posttranscriptional splicing of all genes we measured, as all were highly expressed.” This is in line with our interpretation that intron signal dispersal can occur in case of post-transcriptional splicing (Coté et al., 2023). Additionally, other studies have suggested that transcripts in cells do not necessarily undergo co-transcriptional splicing, which leads us to conclude that intronic signal can be found farther away from the site of transcription. Coulon et al. showed that splicing can occur after transcript release from the site and suggested that no strict checkpoint exists to ensure intron removal before release, which results in splicing and release being kinetically uncoupled from each other (Coulon et al., 2014). Similarly, using live-cell imaging, it was shown that splicing is not always coupled with transcription, and this could depend on the nature and structural features of the transcript (such as blockage of the polypyrimidine tract, which results in delayed recognition) (Vargas et al., 2011). Drexler et al. showed that, as opposed to Drosophila transcripts, which are shorter, in mammalian cells splicing of the terminal intron can occur post-transcriptionally (Drexler et al., 2020). Using RNA polymerase II ChIP-Seq time course data from ERα activation in MCF-7 cells, Honkela et al. showed that a large number of genes can show significant delays between the completion of transcription and mRNA production (Honkela et al., 2015). This was attributed to faster transcription of shorter genes, suggesting that rapid completion of transcription on shorter genes can lead to splicing-associated delays (Honkela et al., 2015). More recently, comparisons of nascent and mature RNA levels suggested a time lapse between transcription and splicing for the genes that are early responders during signaling (Zambrano et al., 2020). The presence of significant numbers of TFF1 nascent RNA in the nucleus in our data is consistent with the above observations. 

      • Uniform intensities across many transcripts suggest that these are true signals arising from RNA molecules, which would not be the case for non-specific background signal (Author response image 2E).

      • Splicing occurs in the nucleus, and intron-containing pre-transcripts should be nuclear localized. Thus, intronic signals should remain localized to the nucleus, unlike the mature mRNA, which translocates to the cytoplasm after processing, so exonic signals can be found both in the nucleus and the cytoplasm. In keeping with this, we observe no signal in the cytoplasm for the intronic probes; it remains localized within the nucleus as expected, as can be seen in Author response image 2F, while exonic signals are observed in both compartments. This suggests to us that the signal is coming from true pre-transcripts. There is no reason for non-specific background labelling to remain restricted to the nucleus.

      • We observe that the mean intronic label counts for both TFF1 and TFF3 increase upon E2 induction compared to the uninduced condition (Fig. 2B). Similarly, the mean intronic counts for both genes decrease drastically in the TFF1-enhancer-deleted cells (Fig. 3C, D). This change in the number of intronic signals specifically on induction and enhancer deletion suggests that the signal is not an artefact and arises from true nascent transcripts that are sensitive to stimulus or enhancer deletion.

      • We expect colocalization of intronic signal with exonic signals in the nucleus, while there can be exonic signals that do not colocalize with intronic, representing more mature mRNA. Indeed, we observe a clear colocalization between the intronic and exonic signals in the nucleus, while exonic signals can occur independent of intronic both in the nucleus and the cytoplasm. This clearly demonstrates that the intronic signals in our experiments are specific and not simply background labelling (Author response image 2G).

      These studies and the arguments above lead us to conclude that the presence of intronic transcripts in the nucleus, away from the site of transcription, is not an artefact. We hope the reviewer will agree with us. These analyses have now been included in the manuscript as Supplementary Figure 6 and have been added in the manuscript at line numbers 106-111, 201-204, 215-217 and 231-235. We thank the reviewer for raising this important point.

      Author response image 2.

      Dynamic induction and RNA localization of TFF1 and TFF3 transcription across cell populations using smRNA FISH. A. Bar graph depicting the percentage of cells with 1, 2, 3, 4, or greater than 4 sites of transcription for TFF1 (left) is shown. The graph shows the mean of means from different repeats of the experiment, and error bars denote SEM (n>200, N=3). Only the cells with at least one allele firing were counted and cells with no alleles were not included in this. The graph on the right shows the number of cells with zero or non-zero number of alleles firing. B. Bar graph depicting the percentage of cells with 1, 2, 3, 4 or greater than 4 sites of transcription for TFF3 (left) is shown. The graph shows the mean of means from different repeats of the experiment, and error bars denote SEM (n>200, N=3). Only the cells with at least one allele firing were counted and cells with no alleles were not included in this. The graph in the middle shows the number of cells with 2, 3, 4 or greater than 4 sites of transcription for TFF3. The graph on the right shows the number of cells with zero or non-zero number of alleles firing. C. Images from single molecule RNA FISH experiment showing transcripts for InTFF1 in cells induced for 1 hour with E2. The image shows that when a single allele of TFF1 is firing, the transcripts show a more spatially restricted localisation. The scale bar is 5 microns. D. Images from single molecule RNA FISH experiment showing transcripts for InTFF1 in uninduced cells. The image shows that when a single allele of TFF1 is firing and transcription is low, the transcripts show a more spatially restricted localisation. The scale bar is 5 microns. E. Line profile through several transcripts in the nucleus show uniform and similar intensities indicating that these are true signals. F. 60X Representative images from a single molecule RNA FISH experiment showing transcripts for InTFF1 and ExTFF1 (top) and InTFF3 and ExTFF3 (bottom). The image shows that there is no intronic signal in the cytoplasm, while exonic signals can be found both in the nucleus and the cytoplasm. The scale bar is 5 microns. G. 60X Representative images from single molecule RNA FISH experiment showing transcripts for InTFF1 and ExTFF1. The image shows that all intronic signals are colocalized with exonic signals, but all exonic signals are expectedly not colocalized with intronic signals, representing more mature mRNA. The scale bar is 5 microns.
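
      The tabulation behind panels A and B (percentage of cells with 1, 2, 3, 4 or more than 4 sites of transcription, counted only among cells with at least one active allele, plus a separate zero versus non-zero tally) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the example input are hypothetical, and the input is assumed to be one integer per cell giving the number of detected transcription sites.

```python
from collections import Counter

BINS = ["1", "2", "3", "4", ">4"]


def site_distribution(sites_per_cell):
    """Percentage of active cells in each site-count bin (cells with 0 sites excluded)."""
    active = [n for n in sites_per_cell if n > 0]
    if not active:
        return {b: 0.0 for b in BINS}
    binned = Counter(str(n) if n <= 4 else ">4" for n in active)
    return {b: 100.0 * binned[b] / len(active) for b in BINS}


def zero_vs_nonzero(sites_per_cell):
    """Number of cells with no active allele and with at least one active allele."""
    zero = sum(n == 0 for n in sites_per_cell)
    return zero, len(sites_per_cell) - zero


# Hypothetical example: six cells, two of them with no allele firing.
example = [0, 1, 1, 2, 5, 0]
distribution = site_distribution(example)
silent, firing = zero_vs_nonzero(example)
```

      Repeating this per experimental repeat and averaging the resulting percentages would give the mean of means described in the legend.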

      One substantial way to improve the manuscript is to take a careful look at previous single cell analysis of the estrogen response, which in some cases has been done on the exact same genes (PMID: 29476006, 35081348, 30554876, 31930333). In some of these cases, the authors reach different conclusions than those presented in the present manuscript. Likewise, there have been more than a few studies that have characterized these enhancers (the first one I know of is: PMID 18728018). Also, Oh et al. 2021 (cited in the manuscript) did show an interaction between TFF1e and TFF3, which seems to contradict the conclusion from Fig. 3. In summary, the results of this paper are not in dialogue with the field, which is a major shortcoming. 

      We thank the reviewer for pointing out these important studies. The studies from Prof. Larson's group are particularly insightful (Rodriguez et al., 2019). We have now included this in the discussion (line 106-111 and line 420-424), where we describe the differences and similarities between our findings and those of Larson’s group and Mancini’s group (Patange et al., 2022; Stossi et al., 2020). 

      The 4C-Seq data from Oh et al. 2021 are fully consistent with our observation from Fig. 3: they also observed little to no interaction between TFF1e and TFF3p in WT cells, and only upon TFF1p deletion did TFF1e become engaged with TFF3p. In agreement with this, we also observe little to no interaction between TFF1e and TFF3p in WT cells (Fig. 3A). This is also consistent with our model of competition for resources between these two genes. Oh et al. show an interaction between TFF1e and TFF3 when the TFF1 promoter is deleted, showing that when the primary promoter is not available the enhancer is retargeted to the next available gene (Oh et al., 2021). They do not show that TFF1e and TFF3 interact in WT cells or at any time point of E2 signalling.

      In the opinion of this reviewer, there are few - if any - experiments to interrogate the existence of LLPS for diffraction-limited spots such as those associated with transcription. This difficulty is a general problem with the field and not specific to the present manuscript. For example, transient binding will also appear as a dynamic 'spot' in the nucleus, independently of any higher-order interactions. As for Fig. 5, I don't think treating cells with 1,6 hexanediol is any longer considered a credible experiment. For example, there are profound effects on chromatin independent of changes in LLPS (PMID: 33536240).  

      We are cognizant of and appreciate the limitations pointed out by the reviewer. We and others have previously shown that ERα forms condensates on the TFF1 chromatin region using an ImmunoFISH assay (Saravanan et al., 2020). The data below show the relative mean ERα intensity on TFF1 FISH spots and random regions, clearly showing the appearance of the condensate at the TFF1 site. Further, the deletion of TFF1e reduces the size of this condensate. Thus, we expect that these ERα condensates are characterized by higher-order interactions and become disrupted on treatment with 1,6-hexanediol. These condensates are sub-micron in size, as mentioned by the reviewer, but most TF condensates are of similar size. We agree with the reviewer that 1,6-hexanediol treatment is a brute-force experiment that causes several irreversible changes to the chromatin, although we have tried to use it at a low concentration for a short period of time, and it has been used in several papers (Chen et al., 2023; Gamliel et al., 2022). The opposite pattern of TFF1 vs. TFF3 expression upon 1,6-hexanediol treatment suggests that there is specificity. Further, to perturb condensates, mutants of ERα can be used (N-terminus IDR truncations); however, the transcriptional response of these mutants is also altered due to perturbed recruitment of coactivators that recognize the N-terminus of ER, which makes it difficult to distinguish ERα functions from condensate formation.

      References:

      Chen, L., Zhang, Z., Han, Q., Maity, B. K., Rodrigues, L., Zboril, E., Adhikari, R., Ko, S.-H., Li, X., Yoshida, S. R., Xue, P., Smith, E., Xu, K., Wang, Q., Huang, T. H.-M., Chong, S., & Liu, Z. (2023). Hormone-induced enhancer assembly requires an optimal level of hormone receptor multivalent interactions. Molecular Cell, 83(19), 3438-3456.e12. https://doi.org/10.1016/j.molcel.2023.08.027

      Coté, A., O’Farrell, A., Dardani, I., Dunagin, M., Coté, C., Wan, Y., Bayatpour, S., Drexler, H. L., Alexander, K. A., Chen, F., Wassie, A. T., Patel, R., Pham, K., Boyden, E. S., Berger, S., Phillips-Cremins, J., Churchman, L. S., & Raj, A. (2023). Post-transcriptional splicing can occur in a slow-moving zone around the gene. eLife, 12. https://doi.org/10.7554/eLife.91357.2

      Coulon, A., Ferguson, M. L., de Turris, V., Palangat, M., Chow, C. C., & Larson, D. R. (2014). Kinetic competition during the transcription cycle results in stochastic RNA processing. eLife, 3, e03939. https://doi.org/10.7554/eLife.03939

      Drexler, H. L., Choquet, K., & Churchman, L. S. (2020). Splicing Kinetics and Coordination Revealed by Direct Nascent RNA Sequencing through Nanopores. Molecular Cell, 77(5), 985-998.e8. https://doi.org/10.1016/j.molcel.2019.11.017

      Gamliel, A., Meluzzi, D., Oh, S., Jiang, N., Destici, E., Rosenfeld, M. G., & Nair, S. J. (2022). Long-distance association of topological boundaries through nuclear condensates. Proceedings of the National Academy of Sciences of the United States of America, 119(32), e2206216119. https://doi.org/10.1073/pnas.2206216119

      Honkela, A., Peltonen, J., Topa, H., Charapitsa, I., Matarese, F., Grote, K., Stunnenberg, H. G., Reid, G., Lawrence, N. D., & Rattray, M. (2015). Genome-wide modeling of transcription kinetics reveals patterns of RNA production delays. Proceedings of the National Academy of Sciences of the United States of America, 112(42), 13115. https://doi.org/10.1073/pnas.1420404112

      Oh, S., Shao, J., Mitra, J., Xiong, F., D’Antonio, M., Wang, R., Garcia-Bassets, I., Ma, Q., Zhu, X., Lee, J.-H., Nair, S. J., Yang, F., Ohgi, K., Frazer, K. A., Zhang, Z. D., Li, W., & Rosenfeld, M. G. (2021). Enhancer release and retargeting activates disease-susceptibility genes. Nature, 595(7869), Article 7869. https://doi.org/10.1038/s41586-021-03577-1

      Patange, S., Ball, D. A., Wan, Y., Karpova, T. S., Girvan, M., Levens, D., & Larson, D. R. (2022). MYC amplifies gene expression through global changes in transcription factor dynamics. Cell Reports, 38(4). https://doi.org/10.1016/j.celrep.2021.110292

      Raj, A., van den Bogaard, P., Rifkin, S. A., van Oudenaarden, A., & Tyagi, S. (2008). Imaging individual mRNA molecules using multiple singly labeled probes. Nature Methods, 5(10), Article 10. https://doi.org/10.1038/nmeth.1253

      Rodriguez, J., Ren, G., Day, C. R., Zhao, K., Chow, C. C., & Larson, D. R. (2019). Intrinsic Dynamics of a Human Gene Reveal the Basis of Expression Heterogeneity. Cell, 176(1–2), 213-226.e18. https://doi.org/10.1016/j.cell.2018.11.026

      Saravanan, B., Soota, D., Islam, Z., Majumdar, S., Mann, R., Meel, S., Farooq, U., Walavalkar, K., Gayen, S., Singh, A. K., Hannenhalli, S., & Notani, D. (2020). Ligand dependent gene regulation by transient ERα clustered enhancers. PLOS Genetics, 16(1), e1008516. https://doi.org/10.1371/journal.pgen.1008516

      Stossi, F., Dandekar, R. D., Mancini, M. G., Gu, G., Fuqua, S. A. W., Nardone, A., De Angelis, C., Fu, X., Schiff, R., Bedford, M. T., Xu, W., Johansson, H. E., Stephan, C. C., & Mancini, M. A. (2020). Estrogen-induced transcription at individual alleles is independent of receptor level and active conformation but can be modulated by coactivators activity. Nucleic Acids Research, 48(4), 1800. https://doi.org/10.1093/nar/gkz1172

      Vargas, D. Y., Shah, K., Batish, M., Levandoski, M., Sinha, S., Marras, S. A. E., Schedl, P., & Tyagi, S. (2011). Single-Molecule Imaging of Transcriptionally Coupled and Uncoupled Splicing. Cell, 147(5), 1054–1065. https://doi.org/10.1016/j.cell.2011.10.024

      Waks, Z., Klein, A. M., & Silver, P. A. (2011). Cell-to-cell variability of alternative RNA splicing. Molecular Systems Biology, 7(1), 506. https://doi.org/10.1038/msb.2011.32

      Zambrano, S., Loffreda, A., Carelli, E., Stefanelli, G., Colombo, F., Bertrand, E., Tacchetti, C., Agresti, A., Bianchi, M. E., Molina, N., & Mazza, D. (2020). First Responders Shape a Prompt and Sharp NF-κB-Mediated Transcriptional Response to TNF-α. iScience, 23(9), 101529. https://doi.org/10.1016/j.isci.2020.101529

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The authors in this paper investigate the nature of the activity in the rodent EPN during a simple freely moving cue-reward association task. Given that primate literature suggests movement coding whereas other primate and rodent studies suggest mainly reward outcome coding in the EPNs, it is important to try to tease apart the two views. Through careful analysis of behavior kinematics, position, and neural activity in the EPNs, the authors reveal an interesting and complex relationship between the EPN and mouse behavior.

      Strengths:

      (1) The authors use a novel freely moving task to study EPN activity, which displays rich movement trajectories and kinematics. Given that previous studies have mostly looked at reward coding during head-fixed behavior, this study adds a valuable dataset to the literature. (2) The neural analysis is rich and thorough. Both single neuron level and population level (i.e. PCA) analysis are employed to reveal what EPN encodes.

      Thank you very much for this appreciation.

      Weaknesses:

      (1) One major weakness in this paper is the way the authors define the EPN neurons. Without a clear method of delineating EPN vs other surrounding regions, it is not convincing enough to call these neurons EPNs solely from looking at the electrode cannula track from Figure 2B. Indeed, EPN is a very small nucleus and previous studies like Stephenson-Jones et al (2016) have used opto-tagging of Vglut2 neurons to precisely label EPN single neurons. Wallace et al (2017) have also shown the existence of SOM and PV-positive neurons in the EPN. By not using transgenic lines and cell-type specific approaches to label these EPN neurons, the authors miss the opportunity to claim that the neurons recorded in this study do indeed come from EPN. The authors should at least consider showing an analysis of neurons slightly above or below EPN and show that these neurons display different waveforms or firing patterns.

      We thank the reviewer for their comment and for the opportunity to explain and expand on the inclusion criteria for the units studied.

      As part of another study, we recorded in the EPN with optrodes and photoidentification in PV-Cre animals. We found optoidentified units both in animals with correct placement (within the EPN) and in those with off-target placement (within the thalamus or medial to the EPN). Thus, despite the use of Cre animals, we relied on histology to ensure correct EPN recording. We believe that optotagging based purely on neural markers such as PV, SOM, VGLUT, or VGAT would not provide a better anatomical delineation of the EPN, since adjacent structures are rich in those same markers. The thalamic reticular nucleus is just dorsal to the EPN and has been shown to express both SOM and PV (Martinez-Garcia et al., 2020).

      On the other hand, the lateral hypothalamus (just medial to the EPN) also expresses vGlut2 and SOM. Stephenson-Jones (2016), Extended Data Figure 1, panel g, shows vGluT2 and somatostatin labeling of neurons, with important expression of neurons dorsal, ventral and medial to the EPN. Thus, we believe that viral strategies relying on single neuronal markers still depend on careful histological analysis of recording sites.

      A combination of neural markers or more complex viral strategies might be more suitable to delineate the EPN. As an example, for anatomical tracing Stephenson-Jones et al. 2016 performed a rabies-virus based approach involving retrogradely transported virus making use of projection sites through two injections. Two step viral approaches were also performed in Wallace, M. et al. 2017. We attempted to perform a two-step viral approach, using an anterogradely transported Cre-expressing virus (AAV1.hSyn.Cre.WPRE.hGH) injected into the striatum and a second Cre dependent ChR2 into the EPN. However, our preliminary experiments showed that this double viral approach had a stark effect decreasing the performance of animals during the task (we attempted re-training 2-3 weeks after viral infections and animals failed to turn to the contralateral side of the injections). We believe that this approach might have had a toxic effect (Zingg et al., 2017). 

      To this point, a recent paper (Lazaridis et al., 2019) repeated an optogenetic experiment performed in the Stephenson-Jones et al. study, using a set of different viral approaches and concluded that increasing the activity of GPi-LHb is not aversive, as it had been previously reported. Thus, future studies attempting to increase anatomical specificity are a must, but they will require using viral approaches amenable to the behavioral paradigm.

      We attempted to find waveform, firing rate, or firing pattern properties that distinguish units above or below the EPN; however, we did not find a marker that could generate a clear demarcation. The figure below shows the units included in this study together with the excluded ones, illustrating that the two groups clearly overlap.

      Author response image 1.

      Finally, we completely agree with the reviewer in that there is still room for improvement. We have further expanded the Methods section to explain better our efforts to include units recorded within the EPN. Further, we have added a paragraph within the Discussion section to point out this limitation (lines 871-876).

      Methods (lines 116-131):

      “Recordings. Movable microwire bundles (16 microwires, 32 micrometers in diameter, held inside a cannula, Innovative Neurophysiology, Durham, NC) were stereotaxically implanted just above the entopeduncular nucleus (-0.8 AP, 1.7 ML, 3.9 DV). Post-surgical care included antibiotic, analgesic and anti-inflammatory pharmacological treatment. After 5 days of recovery, animals were retrained for 1-2 weeks. Unitary activity was recorded for 2-6 days at each dorsoventral electrode position and the session with the best electrophysiological (signal to noise ratio (>2), stability across time) and behavioral [performance, number of trials (>220)] quality was selected. Microwire electrodes were advanced in 50 micrometer dorsoventral steps for 500 micrometers in total. After experiment completion, animals were perfused with a 4% paraformaldehyde solution. Brains were extracted, dehydrated with a 30% sucrose solution and sectioned in a cryostat into 30-micron-thick slices. Slices were mounted and photographed using a light microscope. Microwire tracks of the 16-microwire bundle were analyzed (Fig. 2A-B) and only animals with tracks traversing the EPN were selected (6 out of 10). Finally, we located the final position of microwire tips and inferred the dorsoventral recording position of each of the recording sessions. Only units recorded within the EPN were included.”

      Discussion (lines 871-876):

      “A weakness of the current study is the lack of characterization of neuronal subtypes. An area of opportunity for future research could be to perform photo-identification of neuronal subtypes within the EPN which could contribute to the overall description of the information representation. Further, detailed anatomical viral vector strategies could aid to improve anatomical localization of recordings, reduce reliance on histological examination, and solve some current controversies (Lazaridis et al., 2019).” 

      (2) The authors fail to replicate the main finding about EPN neurons which is that they encode outcome in a negative manner. Both Stephenson-Jones et al (2016) and Hong and Hikosaka (2008) show a reward response during the outcome period where firing goes down during reward and up during neutral or aversive outcome. However, Figure 2 G top panel shows that the mean population is higher during correct trials and lower during incorrect trials. This could be interesting given that the authors might try recording from another part of EPN that has not been studied before. However, without convincing evidence that the neurons recorded are from EPN in the first place (point 1), it is hard to interpret these results and reconcile them with previous studies.

      We really thank the reviewer for pointing out that we need to better explain how EPN units encode outcome. We now provide an additional panel in Figure 4, its corresponding text in the results section (lines 544-562) and a new paragraph in the discussion related to this comment.

      We believe that we do indeed recapitulate the findings of both Stephenson-Jones et al (2016) and Hong and Hikosaka (2008). Both studies focus on a specific subpopulation of GPi/EPN neurons that project to the lateral habenula (LHb). Stephenson-Jones et al (2016) posit that GPi-LHb neurons (which they opto-tag as vGluT2) exhibit a decreased firing rate during rewarding outcomes. Hong and Hikosaka (2008) antidromically identified LHb-projecting neurons within the GPi and found reward-positive and reward-negative neurons, which were modulated by increasing or decreasing their firing rate, respectively, with a rewarding outcome (red and green dots on the x-axis of Figure 5A in their paper).

      As the reviewer pointed out, the zScore may be misleading. Therefore, in our study we also decomposed population activity along a reward axis through dPCA. When marginalizing for reward in Figure 3F, we find that the weights of individual units on this axis are centered around zero, with positive and negative values (Figure 3F, right panel). Thus, units can code a rewarding outcome as either an increase or a decrease of activity. We show example units of such modulation in Figure 3-1g and h.

      We had segregated our analysis of spatio-temporal and kinematic coding upon the reward coding of units in Figure 4L-M. Yet, following this comment and in an effort of further clarifying this segregation, we introduced panels with the mean zScore of units during outcome evaluation in Figure 4L.

      We amended the main text to better explain these findings (lines 544-562).

      “Previous reports suggest that EPN units that project to the lateral habenula encode reward as a decrease in firing rate. Thus, we wished to ask whether reward encoding units can code kinematic and spatio-temporal variables as well.

      To this end, we first segregated units upon their reward coding properties: reward positive (which increased activity with reward) and reward negative units (which decreased activity with reward). We performed auROC on the 250ms after head entry comparing rewarded trials and incorrect trials (p<0.001, permutation test). Mean activity of reward insensitive, positive and negative units is shown in Fig. 4L. Next, we performed a dimensionality reduction on the coefficients of the model that best explained both contexts (kinematic + spatio-temporal model on pooled data) using UMAP (McInnes et al., 2018). We observe a continuum rather than discrete clusters (Fig. 4L). Note that individual units are color coded according to their responsivity to reward. We did not find a clear clustering either.”
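
      The classification step quoted above could be sketched roughly as follows. This is an illustration only, not the authors' code: the function names, the use of per-trial spike counts from the 250 ms window as input, the two-sided label-shuffling scheme, and the number of permutations are all assumptions. Only the auROC comparison between rewarded and incorrect trials, the permutation test, and the p<0.001 threshold come from the text.

```python
import random

def auroc(pos, neg):
    """Area under the ROC curve: P(pos > neg) + 0.5 * P(pos == neg)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def permutation_p_value(pos, neg, n_perm=2000, seed=0):
    """Two-sided p-value for auROC differing from 0.5, by shuffling trial labels."""
    rng = random.Random(seed)
    observed = abs(auroc(pos, neg) - 0.5)
    pooled = list(pos) + list(neg)
    n_pos = len(pos)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled = abs(auroc(pooled[:n_pos], pooled[n_pos:]) - 0.5)
        if shuffled >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_perm + 1)

def classify_unit(rewarded, incorrect, alpha=0.001, n_perm=2000, seed=0):
    """Label a unit from its spike counts on rewarded vs. incorrect trials."""
    area = auroc(rewarded, incorrect)
    p = permutation_p_value(rewarded, incorrect, n_perm=n_perm, seed=seed)
    if p >= alpha:
        return "reward insensitive"
    return "reward positive" if area > 0.5 else "reward negative"
```

      Note that with this add-one estimate the smallest attainable p-value is 1/(n_perm + 1), so n_perm must exceed 1000 for any unit to pass a 0.001 threshold; the default of 2000 is an arbitrary choice for the sketch.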

      Paragraph added in the discussion (lines 749-755):

      “In this study, we found that rewarding outcomes can be represented by EPN units through either an increase or a decrease in firing rate (Fig. 3F, 3-1g-h, 4L). While Stephenson-Jones et al., 2016 found that lateral habenula (LHb)-projecting neurons within the EPN of mice primarily encoded rewarding outcomes by a decrease in firing rate, Hong and Hikosaka, 2008 observed that in primates, LHb-projecting units could encode reward through either a decrease or an increase in firing rate. Thus, our results align more closely with the latter study, which also employed an operant conditioning task.”

      (3) The authors say that: 'reward and kinematic coding are not mutually exclusive, challenging the notion of distinct pathways and movement processing'. However, it is not clear whether the data presented in this work supports this statement. First, the authors have not attempted to record from the entire EPN. Thus it is possible that the coding might be more segregated in other parts of EPN. Second, EPNs have previously been shown to display positive firing for negative outcomes and vice versa, something which the authors do not find here. It is possible that those neurons might not encode kinematic and movement variables. Thus, the authors should point out in the main text the possibility that the EPN activity recorded might be missing some parts of the whole EPN.

      We thank the reviewer for the opportunity to expand on this topic. We believe it is certainly possible that other not-recorded regions of the EPN might exhibit greater segregation of reward and kinematics. However, we considered it worthwhile pointing out that from the dataset collected in this study reward-sensitive units encode kinematics in a similar fashion to reward-insensitive ones (Fig. 4L,M). Moreover, we asked specifically whether reward-negative units (that decrease firing rate with rewarding outcomes, as previously reported) could encode kinematics and spatio-temporal variables with different strength than reward-insensitive ones and could not find significant differences (Fig. 4M).

      We did indeed find units that displayed decreased firing rate upon rewarding outcomes, as has been previously reported. We have addressed this fact more thoroughly in point (2). 

      Finally, we agree with the reviewer that the dataset collected in this study is by no means exhaustive of the entire EPN and have thus included a sentence pointing this out in the Discussion section (lines 805-806):

      “Given that we did not record from the entire EPN, it is still possible that another region of the nucleus might exhibit more segregation.”

      (4) The authors use an IR beam system to record licks and make a strong claim about the nature of lick encoding in the EPN. However, the authors should note that IR beam system is not the most accurate way of detecting licks given that any object blocking the path (paw or jaw-dropping) will be detected as lick events. Capacitance based, closed-loop detection, or video capturing is better suited to detect individual licks. Given that the authors are interested in kinematics of licking, this is important. The authors should either point this out in the main text or verify in the system if the IR beam is correctly detecting licks using a combination of those methods.

      We thank the reviewer for the opportunity of clarifying the lick event acquisition. We have experience using electrical alternatives to lickometers; however, we believe they were not best suited to this application. Closed-loop lickometers generally use a metallic grid upon which animals stand so that the loop can be closed; however, we wanted to have a transparent floor. We have found capacitance based lickometers to be useful in head-fixed conditions but have noticed that they are very dependent on animal position and proximity of other bodyparts such as limbs. Given the freely moving aspect of the task this was difficult to control. Finally, both electric alternatives for lickometers are more prone to noise and may introduce electrical artifacts that might contaminate the spiking signal. This is why we opted to use a slit in combination with an IR beam that would only fit the tongue and that forced enough protrusion such that individual licks could be monitored. Further, the slit could not fit other body-parts like the paw or jaw. We have now included a video (Supp. Video 2) showing a closeup of this behavior that better conveys how the jaw and paw do not fit inside the slit. The following text has been added in the corresponding methods section (lines 97-98):

      “The lickometer slit was just wide enough to fit the tongue and deep enough to evoke a clear tongue protrusion.”

      Reviewer #1 (Recommendations For The Authors):

      (1)The authors should verify using opto-tagging of either Vglut2, SOM, or PV neurons whether they can see the same firing pattern. If not, the authors should address this weakness in the paper.

      We thank the reviewer for this important point, we have provided a more detailed reply above.

      (2)The way dPCA or PCA is applied to the data is not stated at all in the main text. Are all units from different mice combined? Or applied separately for each mouse? How does that affect the interpretation of the data? At least a brief text should be included in the main text to guide the readers.

      We thank the reviewer for pointing out this important omission. We have included an explanation in the Methods section and in the Main text.

      Methods (lines 182-184):

      “For all population level analyses, individual units recorded from all sessions and all animals were pooled to construct a pseudo-simultaneous population response from data that were mostly recorded separately.”
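
      The pooling described here might look like the sketch below. The nested data layout (sessions containing units, each with per-condition lists of binned trials), the function names, and the rule of skipping units that lack trials in some condition are hypothetical and not taken from the manuscript. The only point illustrated is that each unit contributes one trial-averaged row, regardless of which animal or session it came from.

```python
def trial_average(trials):
    """Average equal-length binned spike-count vectors across trials."""
    n_trials = len(trials)
    return [sum(values) / n_trials for values in zip(*trials)]

def build_pseudo_population(sessions, conditions):
    """Stack trial-averaged responses of units from all sessions and animals.

    Returns (unit_ids, matrix). Each row of matrix is one unit's averaged
    response, concatenated over conditions in the given order.
    """
    unit_ids, matrix = [], []
    for session in sessions:
        for unit_name, by_condition in session["units"].items():
            # Assumption: a unit needs at least one trial in every condition.
            if any(not by_condition.get(c) for c in conditions):
                continue
            row = []
            for c in conditions:
                row.extend(trial_average(by_condition[c]))
            unit_ids.append((session["animal"], session["session"], unit_name))
            matrix.append(row)
    return unit_ids, matrix
```

      Because rows come from different recordings, trial-to-trial correlations between units are lost; the resulting matrix only supports analyses of trial-averaged activity.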

      Main text (lines 397-399):

      “For population level analyses throughout the study, we pooled recorded units from all animals to construct a pseudo-simultaneous population.”

      Discussion (lines 729-730):

      “…(from pooled units from all animals to construct a pseudo-simultaneous population, which assumes homogeneity across subjects)”

      (3) The authors argue that they do not find 'value coding' in this study. However, the authors never manipulate reward size or probability, but only the uncertainty or difficulty of the task. This might be better termed 'difficulty', and it is difficult to say whether this correlates with value in this task. For instance, mice might be very confident about the choice, even for an intermediate frequency sweep, if the mouse had waited long enough to hear the full sweep. In that case, the difficulty would not correlate with value, given that the mouse will think the value of the port it is going to is high. Thus, authors should avoid using the term value.

      We agree with the reviewer. We have modified the text to specify that difficulty was the variable being studied and added the following sentence in the Discussion (lines 747-748):

      “It is still possible that by modifying reward contingencies such as droplet size value coding could be evidenced.”

      (4) How have the authors obtained Figure 7D bottom panel? It is unclear at all what this correlation represents. Are the authors looking at a correlation between instantaneous firing rate and lick rate during a lick bout?

      We thank the reviewer for pointing out that omission. It is indeed the correlation coefficient between the instantaneous firing rate and the instantaneous lick rate for a lick bout. We have included labeling in Figure 7D and pointed this out in the main text [lines 680-681]:

      “Fig.7D, lower panel shows the correlation coefficient between the instantaneous firing rate and the instantaneous lick rate within a lick bout for all units.”
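
      One way such a per-unit coefficient could be computed is sketched below. The response does not say how the instantaneous rates were estimated, so this sketch assumes that, for each inter-lick interval in a bout, the lick rate is the reciprocal of the interval and the firing rate is the spike count in that interval divided by its duration. The function names and this per-interval definition are assumptions, and Pearson correlation is used as one plausible choice of coefficient.

```python
import math
from bisect import bisect_left

def pearson(x, y):
    """Pearson correlation coefficient; NaN if either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def rate_correlation(lick_times, spike_times):
    """Correlate per-interval lick rate and firing rate within one lick bout.

    lick_times: sorted lick onsets (s) within the bout.
    spike_times: spike times (s) of one unit.
    """
    spikes = sorted(spike_times)
    lick_rate, firing_rate = [], []
    for start, end in zip(lick_times, lick_times[1:]):
        duration = end - start
        # Count spikes in the half-open interval [start, end).
        n_spikes = bisect_left(spikes, end) - bisect_left(spikes, start)
        lick_rate.append(1.0 / duration)
        firing_rate.append(n_spikes / duration)
    if len(lick_rate) < 2:
        return float("nan")  # a correlation needs at least two intervals
    return pearson(lick_rate, firing_rate)
```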

      Reviewer #2 (Public Review):

      This paper examined how the activity of neurons in the entopeduncular nucleus (EPN) of mice relates to kinematics, value, and reward. The authors recorded neural activity during an auditory-cued two-alternative choice task, allowing them to examine how neuronal firing relates to specific movements like licking or paw movements, as well as how contextual factors like task stage or proximity to a goal influence the coding of kinematic and spatiotemporal features. The data shows that the firing of individual neurons is linked to kinematic features such as lick or step cycles. However, the majority of neurons exhibited activity related to both movement types, suggesting that EPN neuronal activity does not merely reflect muscle-level representations. This contradicts what would be expected from traditional action selection or action specification models of the basal ganglia.

      The authors also show that spatiotemporal variables account for more variability compared to kinematic features alone. Using demixed Principal Component Analysis, they reveal that at the population level, the three principal components explaining the most variance were related to specific temporal or spatial features of the task, such as ramping activity as mice approached reward ports, rather than trial outcome or specific actions. Notably, this activity was present in neurons whose firing was also modulated by kinematic features, demonstrating that individual EPN neurons integrate multiple features. A weakness is that what the spatiotemporal activity reflects is not well specified. The authors suggest some may relate to action value due to greater modulation when approaching a reward port, but acknowledge action value is not well parametrized or separated from variables like reward expectation.

      We thank the reviewer for the comment. We indeed believe that further exploring these spatiotemporal signals is important and will be the subject of future studies.

      A key goal was to determine whether activity related to expected value and reward delivery arose from a distinct population of EPN neurons or was also present in neurons modulated by kinematic and spatiotemporal features. In contrast to previous studies (Hong & Hikosaka 2008 and Stephenson-Jones et al., 2016), the current data reveals that individual neurons can exhibit modulation by both reward and kinematic parameters. Two potential differences may explain this discrepancy: First, the previous studies used head-fixed recordings, where it may have been easier to isolate movement versus reward-related responses. Second, those studies observed prominent phasic responses to the delivery or omission of expected rewards - responses largely absent in the current paper. This absence suggests a possibility that neurons exhibiting such phasic "reward" responses were not sampled, which is plausible since in both primates and rodents, these neurons tend to be located in restricted topographic regions. Alternatively, in the head-fixed recordings, kinematic/spatial coding may have gone undetected due to the forced immobility.

      Thank you for raising this point. Nevertheless, there is some phasic activity associated with reward responses, which can be seen in the new panel in Figure 4L.

      Overall, this paper offers needed insight into how the basal ganglia output encodes behavior. The EPN recordings from freely moving mice clearly demonstrate that individual neurons integrate reward, kinematic, and spatiotemporal features, challenging traditional models. However, the specific relationship between spatiotemporal activity and factors like action value remains unclear.

      We really appreciate this reviewer for their valuable comments.

      Reviewer #2 (Recommendations For The Authors):

      One small suggestion is to make sure that all the panels in the figures are well annotated. I struggled in places to know what certain alignments or groupings meant because they were not labelled. An example would be what do the lines correspond to in the lower panels of Figure 2D and E. I could figure it out from other panels but it would have helped if each panel had better labelling.

      Thanks for pointing this out, we have improved labelling across the figures and corrected the specific example you have pointed out.

      The paper is very nice though. Congratulations!

      Thank you very much.

      Editor's note:

      Should you choose to revise your manuscript, please include full statistical reporting including exact p-values wherever possible alongside the summary statistics (test statistic and df) and 95% confidence intervals. These should be reported for all key questions and not only when the p-value is less than 0.05 in the main manuscript.

      We thank the editor for the comment. A statistics table has been added.

      References:

      Lazaridis, I., Tzortzi, O., Weglage, M., Märtin, A., Xuan, Y., Parent, M., Johansson, Y., Fuzik, J., Fürth, D., Fenno, L. E., Ramakrishnan, C., Silberberg, G., Deisseroth, K., Carlén, M., & Meletis, K. (2019). A hypothalamus-habenula circuit controls aversion. Molecular Psychiatry, 24(9), 1351–1368. https://doi.org/10.1038/s41380-019-0369-5

      Martinez-Garcia, R. I., Voelcker, B., Zaltsman, J. B., Patrick, S. L., Stevens, T. R., Connors, B. W., & Cruikshank, S. J. (2020). Two dynamically distinct circuits drive inhibition in the sensory thalamus. Nature, 583(7818), 813–818. https://doi.org/10.1038/s41586-020-2512-5

      McInnes, L., Healy, J., Saul, N., & Großberger, L. (2018). UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software, 3(29), 861. https://doi.org/10.21105/joss.00861

      Zingg, B., Chou, X. lin, Zhang, Z. gang, Mesik, L., Liang, F., Tao, H. W., & Zhang, L. I. (2017). AAV-Mediated Anterograde Transsynaptic Tagging: Mapping Corticocollicular Input-Defined Neural Pathways for Defense Behaviors. Neuron, 93(1), 33–47. https://doi.org/10.1016/j.neuron.2016.11.045

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This paper reports valuable results regarding the potential role and time course of the prefrontal cortex in conscious perception. Although the sample size is small, the results are clear and convincing, and strengths include the use of several complementary analysis methods. The behavioral test includes subject report so the results do not allow for distinguishing between theories of consciousness; nevertheless, results do advance our understanding of the contribution of prefrontal cortex to conscious perception.

      We greatly appreciate the encouraging assessment from the editor and reviewers. In particular, we thank the three reviewers for their professional and constructive comments, which helped us improve the manuscript substantially.

      Public Reviews:

      Reviewer #1 (Public Review):

      This is a clear and rigorous study of intracranial EEG signals in the prefrontal cortex during a visual awareness task. The results are convincing and worthwhile, and strengths include the use of several complementary analysis methods and clear results. The only methodological weakness is the relatively small sample size of only 6 participants compared to other studies in the field. Interpretation weaknesses that can easily be addressed are claims that their task removes the confound of report (it does not), and claims of primacy in showing early prefrontal cortical involvement in visual perception using intracranial EEG (several studies already have shown this). Also the shorter reaction times for perceived vs not perceived stimuli (confident vs not confident responses) has been described many times previously and is not a new result.

      We greatly appreciate the reviewer’s encouraging opinion. We address the reviewer’s specific questions and comments point by point below.

      ‘The only methodological weakness is the relatively small sample size of only 6 participants compared to other studies in the field.’

      We agree that the sample size is relatively small in the present study. To compensate for this shortcoming, we rigorously verified each result at both the individual and population levels, following the data analysis approach used in non-human primate studies.

      Interpretation weaknesses that can easily be addressed are claims that their task removes the confound of report (it does not),

      Thank you very much for your comment. We agree that our task does not remove the confound of report entirely. However, we believe that our task minimizes the motor confounds by dissociating the emergence of awareness from the motor response in time and by balancing the direction of the motor response between aware and unaware conditions. We have modified the text in the revised manuscript according to the reviewer’s comment as follows: “This task removes the confound of motor-related activity”.

      ..and claims of primacy in showing early prefrontal cortical involvement in visual perception using intracranial EEG (several studies already have shown this).

      We agree that several iEEG studies, including ERP and HFA analyses, have shown early involvement of the prefrontal cortex in visual perception. However, these studies did not investigate the differential activity between conscious and unconscious conditions; thus, the activity in the prefrontal cortex might be correlated with unconscious rather than conscious processing. In the present study, we compared the neural activity in PFC between conscious and unconscious trials and found a correlation between PFC activity and conscious perception. Although one iEEG study (Gaillard et al., 2009) reported awareness-specific PFC activation, the awareness-related activity started 300 ms after the onset of the visual stimuli, which was ~100 ms later than the early awareness-related activity in our study. Also, because of the limited number of electrodes in that study (2 patients with 19 recording sites, mostly in mesiofrontal and peri-insular regions), its ability to explore awareness-related activity in PFC was restricted. In the present study, the number of recording sites (245) was much larger than in the previous study, and the sites covered multiple areas in PFC. Our results further show earlier awareness-related activity (~200 ms after visual stimulus onset), including in ERP, HFA and PLV, which sheds new light on the role of PFC in conscious perception.

      We have added this discussion in the MS (lines 522-536).

      Also the shorter reaction times for perceived vs not perceived stimuli (confident vs not confident responses) has been described many times previously and is not a new result.

      Thank you very much for your comment. We agree that reaction time is strongly modulated by confidence level, as has been described previously (Broggin, Savazzi, & Marzi, 2012; Marzi, Mancini, Metitieri, & Savazzi, 2006). However, in previous studies, different confidence levels were usually induced by presenting stimuli with different physical properties, such as spatial frequency, eccentricity and contrast. It is well known that more salient stimuli are processed faster and speed up visuomotor transformation, eventually shortening the reaction time (Corbetta & Shulman, 2002; Posner & Petersen, 1990). Therefore, the dependence of visual processing on stimulus salience is confounded with the effect of visual awareness on reaction time, which makes it hard to attribute the shorter reaction time in the more salient condition purely to visual awareness. In contrast, in the present study we created a condition (near perceptual threshold) in which the salience (contrast) of the visual stimulus was very similar in the aware and unaware conditions, in order to eliminate the influence of stimulus salience on reaction time. We think that the difference in reaction time in our study is mainly due to the modulation of awareness state, which was not reported previously.

      We have added the discussion in the MS (lines 497-507).

      Reviewer #1 (Recommendations For The Authors):

      Specific comments follow:

      Abstract: "we designed a visual awareness task that can minimize report-related confounding" and in the Introduction lines 112-115: "Such a paradigm can effectively dissociate awareness-related activity from report-related activity in terms of time... and report behavior"; Discussion lines 481-483 "even after eliminating the influence of the confounding variables related to subjective reports such as motion preparation" and other similar statements in the manuscript should be removed. The task involves report using eye movements with every single stimulus. The fact that there is report for both perceived and not perceived stimuli, that the direction of report is not determined until the time of report, and that there is delay between stimulus and report, does not remove the report-related post-perceptual processing that will inevitably occur in a task where overt report is required for every single trial. For example, brain activity related to planning to report perception will only occur after perceived trials, regardless of the direction of eye movement later decided upon. This preparation to respond is different for perceived and not perceived stimuli, but is not part of the perception itself. In this way the current task is not at all unique and does not substantially differ from many other report-based tasks used previously.

      The objective of the present study is to assess whether PFC is involved in the emergence of visual awareness. To do so, it is crucial to determine the subjective awareness state as accurately as possible. Considering the disadvantage of no-report paradigms in determining the subjective awareness state (Tsuchiya et al., TiCS, 2015; Mashour et al., Neuron, 2020), we employed a balanced report paradigm. It has been argued (Merten & Nieder, PNAS, 2011) that, in balanced report paradigms, subjects cannot prepare any motor response during the delay period, because only the appearance of a rule cue (a color change of the fixation point at the end of the delay period) informs subjects about the appropriate motor action. In this case, the post-perceptual processing during the delay period might reflect non-motor cognitive activity. Alternatively, as mentioned by the reviewer, the post-perceptual processing might relate to planning to report perception, which differs between perceived and not perceived stimuli. Therefore, to date, the interpretation of post-perceptual processing remains controversial. Following the reviewer’s comment, we have modified the description of our task as follows: “we designed a visual awareness task that can minimize report-related motor confounding”. We have also changed “report-related” to “motor-related” in the text of the manuscript.

      Figures 3, 4 changes in posterior middle frontal gyri suggest early frontal eye field involvement in perception. This should be interpreted in the context of many previous studies showing FEF involvement in signal detection. The authors claim that "earlier visual awareness related activities in the prefrontal cortex were not found in previous iEEG studies, especially in the HG band" on lines 501-502 of the Discussion. This statement is not true and should be removed. The following statement in the Discussion on lines 563-564 should be removed for the same reasons: "our study detected 'ignition' in the human PFC for the first time." Authors should review and cite the following studies as precedent among others:

      Blanke O, Morand S, Thut G, Michel CM, Spinelli L, Landis T, Seeck M (1999) Visual activity in the human frontal eye field. Neuroreport 10 (5):925-930. doi:10.1097/00001756-199904060-00006

      Foxe JJ, Simpson GV (2002) Flow of activation from V1 to frontal cortex in humans. A framework for defining "early" visual processing. Exp Brain Res 142 (1):139-150. doi:10.1007/s00221-001-0906-7

      Gaillard R, Dehaene S, Adam C, Clemenceau S, Hasboun D, Baulac M, Cohen L, Naccache L (2009) Converging intracranial markers of conscious access. Plos Biology 7 (3):e61

      Gregoriou GG, Gotts SJ, Zhou H, Desimone R (2009) High-frequency, long-range coupling between prefrontal and visual cortex during attention. Science 324:1207-1210

      Herman WX, Smith RE, Kronemer SI, Watsky RE, Chen WC, Gober LM, Touloumes GJ, Khosla M, Raja A, Horien CL, Morse EC, Botta KL, Hirsch LJ, Alkawadri R, Gerrard JL, Spencer DD, Blumenfeld H (2019) A Switch and Wave of Neuronal Activity in the Cerebral Cortex During the First Second of Conscious Perception. Cereb Cortex 29 (2):461-474.

      Khalaf A, Kronemer SI, Christison-Lagay K, Kwon H, Li J, Wu K, Blumenfeld H (2022) Early neural activity changes associated with stimulus detection during visual conscious perception. Cereb Cortex. doi:10.1093/cercor/bhac140

      Kwon H, Kronemer SI, Christison-Lagay KL, Khalaf A, Li J, Ding JZ, Freedman NC, Blumenfeld H (2021) Early cortical signals in visual stimulus detection. Neuroimage 244:118608.

      We agree that several iEEG studies, including ERP and HFA analyses, have shown early involvement of the prefrontal cortex in visual perception. However, these studies did not investigate the differential activity between conscious and unconscious conditions; thus, the activity in the prefrontal cortex might be correlated with unconscious rather than conscious processing. In the present study, we compared the neural activity in PFC between conscious and unconscious trials and found a correlation between PFC activity and conscious perception. Although one iEEG study reported awareness-specific PFC activation, the awareness-related activity started 300 ms after the onset of the visual stimuli, which was ~100 ms later than the early awareness-related activity in our study. Also, because of the limited number of electrodes in that study (2 patients with 19 recording sites, mostly in mesiofrontal and peri-insular regions), its ability to explore awareness-related activity in PFC was restricted. In the present study, the number of recording sites (245) was much larger than in the previous study, and the sites covered multiple areas in PFC. Our results further show earlier awareness-related activity (~200 ms after visual stimulus onset), including in ERP, HFA and PLV, which sheds new light on the role of PFC in conscious perception.

      We have added this discussion in the MS (lines 522-533).

      Minor weakness that should be mentioned in the Discussion: The intervals for the FP (fixation period) and Delay period were both fixed at 600 ms instead of randomly jittered, so that subjects likely had anticipatory activity predictably occurring with each grating and cue stimulus.

      Thank you very much for your comment. We agree that subjects might have had anticipatory activity during the experiment. In fact, we designed the task in this way in order to balance the effects of attention and anticipation between the aware and unaware conditions. We have added this discussion in the MS (lines 467-469).

      The faster reaction times for perceived/confident responses vs not perceived/unconfident responses has been reported many times previously in the literature and should be acknowledged rather than being claimed as a novel finding. Authors should modify p. 163 lines 160-162, first sentence of the Discussion lines 445-446 "reaction time.. shorter" claiming this was a novel finding; same for lines 464-467. Please see the following among others:

      Broggin E, Savazzi S, Marzi CA (2012) Similar effects of visual perception and imagery on simple reaction time. Q J Exp Psychol (Hove) 65 (1):151-164. doi:10.1080/17470218.2011.594896

      Chelazzi L, Marzi CA, Panozzo G, Pasqualini N, Tassinari G, Tomazzoli L (1988) Hemiretinal differences in speed of light detection in esotropic amblyopes. Vision Res 28 (1):95-104

      Marzi CA, Mancini F, Metitieri T, Savazzi S (2006) Retinal eccentricity effects on reaction time to imagined stimuli. Neuropsychologia 44 (8):1489-1495. doi:10.1016/j.neuropsychologia.2005.11.012

      Posner MI (1994) Attention: the mechanisms of consciousness. Proceedings of the National Academy of Sciences of the United States of America 91 (16):7398-7403

      Sternberg S (1969) Memory-scanning: mental processes revealed by reaction-time experiments. Am Sci 57 (4):421-457

      Thanks. We have cited only some of these papers in the revised manuscript because of the limit on the number of citations.

      Methods lines 658-659: "results under LU and HA conditions were classified as the control group and were only used to verify and check the results during calculation." However the authors show these results in the figures and they are interesting. HA stimuli show earlier responses than NA stimuli. This is a valuable result which should be discussed and interpreted in light of the other findings.

      We thank the reviewer very much for this comment. We have added a discussion accordingly in the revised MS (lines 535-536).

      General comment on figures: Many of the figure elements are tiny and the text labels and details can't be seen at all, especially single trial color plots, and the brain insets showing recording sites.

      We have modified the figures accordingly.

      Other minor comments: Typo: Figure 2 legend, line 169 "The contrast level resulted in an awareness percentage greater than 25%..." is missing a word and should say instead something like "The contrast level that resulted in an awareness percentage greater than 25%..."

      Thanks. We have corrected the typo accordingly.

      Figure 2 Table description in text line 190 says "proportions of recording sites" but the Table only shows number of recording sites and number of subjects, not "proportions." This should be corrected in the text.

      Thanks. We have corrected the error.

      Figure 3, and other figures, should always label the left and right hemispheres to avoid ambiguity.

      Thanks. We have made the correction accordingly. In the caption of Figure 2D (line 189), we modified the sentence to read: ‘In all brain images, right side of the image represents the right side of the brain’.

      Methods line 666. The saccadic latency calculations paragraph should have a separate heading before it, to separate it from the Behavioral data analysis section.

      Thanks. It has been corrected in line 725.

      Reviewer #2 (Public Review):

      The authors attempt to address a long-standing controversy in the study of the neural correlates of visual awareness, namely whether neurons in prefrontal cortex are necessarily involved in conscious perception. Several leading theories of consciousness propose a necessary role for (at least some sub-regions of) PFC in basic perceptual awareness (e.g., global neuronal workspace theory, higher order theories), while several other leading theories posit that much of the previously reported PFC contributions to perceptual awareness may have been confounded by task-based cognition that co-varied between the aware and unaware reports (e.g., recurrent processing theory, integrated information theory). By employing intracranial EEG in human patients and a threshold detection task on low-contrast visual stimuli, the authors assessed the timing and location of neural populations in PFC that are differentially activated by stimuli that are consciously perceived vs. not perceived. Overall, the reported results support the view that certain regions of PFC do contribute to visual awareness, but at time-points earlier than traditionally predicted by GNWT and HOTs.

      Reply: We very much appreciate the reviewer’s encouraging comments.

      Major strengths of this paper include the straightforward visual threshold detection task including the careful calibration of the stimuli and the separate set of healthy control subjects used for validation of the behavioral and eye tracking results, the high quality of the neural data in six epilepsy patients, the clear patterns of differential high gamma activity and temporal generalization of decoding for seen versus unseen stimuli, and the authors' interpretation of these results within the larger research literature on this topic. This study appears to have been carefully conducted, the data were analyzed appropriately, and the overall conclusions seem warranted given the main patterns of results.

      Reply: We very much appreciate the reviewer’s encouraging comments.

      Weaknesses include the saccadic reaction time results and the potential flaws in the design of the reporting task. This is not a "no report" paradigm, rather, it's a paradigm aimed at balancing the post-perceptual cognitive and motor requirements between the seen and unseen trials. On each trial, subjects/patients either perceived the stimulus or not, and had to briefly maintain this "yes/no" judgment until a fixation cross changed color, and the color change indicated how to respond (saccade to the left or right). Differences in saccadic RTs (measured from the time of the fixation color change to moving the eyes to the left or right response square) were evident between the seen and unseen trials (faster for seen). If the authors' design achieved what they claim on page 3, "the report behaviors were matched between the two awareness states ", then shouldn't we expect no differences in saccadic RTs between the aware and unaware conditions? The fact that there were such differences may indicate differences in post-perceptual cognition during the time between the stimulus and the response cue. Alternatively, the RT difference could reflect task-strategies used by subjects/patients to remember the response mapping rules between the perception and the color cue (e.g., if the YES+GREEN=RIGHT and YES+RED=LEFT rules were held in memory, while the NO mappings were inferred secondarily rather than being actively held in memory). This saccadic RT result should be better explained in the context of the goals of this particular reporting-task.

      The objective of the present study is to assess whether PFC is involved in the emergence of visual awareness. To do so, it is crucial to determine the subjective awareness state as accurately as possible. Considering the disadvantage of no-report paradigms in determining the subjective awareness state (Tsuchiya et al., TiCS, 2015; Mashour et al., Neuron, 2020), we employed a balanced report paradigm. It has been argued (Merten & Nieder, PNAS, 2011) that, in balanced report paradigms, subjects cannot prepare any motor response during the delay period, because subjects are informed about the appropriate motor action only after the appearance of a rule cue (a color change of the fixation point at the end of the delay period). In this case, the post-perceptual processing during the delay period might reflect non-motor cognitive activity, such as working memory (Mashour et al., Neuron, 2020). Alternatively, as mentioned by the reviewer, the post-perceptual processing might relate to planning to report perception, which differs between perceived and not perceived stimuli (Aru et al., Neurosci Biobehav Rev, 2012). Therefore, to date, the interpretation of post-perceptual processing remains controversial. Considering the reviewer’s comment together with other opinions, we have modified the description of our task as follows: “we designed a visual awareness task that can minimize report-related motor confounding”. We have also changed “report-related” to “motor-related” in the rest of the manuscript.

      Regarding the question of whether the saccadic RT in our balanced response paradigm should be expected to be similar between the aware and unaware conditions, we think that the RTs should be similar only if the delay period is long enough for the “no” decision to be completed. In fact, in a previous study (Merten & Nieder, PNAS, 2011), the neuronal encoding of the “no” decision did not appear until 2 s after the stimulus cue onset. In our task, however, the delay period lasted only 600 ms, which was long enough to form the “yes” decision but not long enough to form the “no” decision. This might be the reason that our data show shorter RTs in the aware condition than in the unaware condition.

      We fully agree with the reviewer’s alternative interpretation of the RT difference between the aware and unaware conditions in our study, i.e., that it reflects task strategies used by subjects/patients to remember the response mapping rules between the perception and the color cue (e.g., if the YES+GREEN=RIGHT and YES+RED=LEFT rules were held in memory, while the NO mappings were inferred secondarily rather than being actively held in memory). We have added discussion of these questions in the revised manuscript (lines 492-496).

      Nevertheless, the current results do help advance our understanding of the contribution of PFC to visual awareness. These results, when situated within the larger context of the rapidly developing literature on this topic (using "no report" paradigms), e.g., the recent studies by Vishne et al. (2023) Cell Reports and the Cogitate consortium (2023) bioRxiv, provide converging evidence that some sub-regions of PFC contribute to visual awareness, but at latencies earlier than originally predicted by proponents of, especially, global neuronal workspace theory.

      We very much appreciate the reviewer’s encouraging comments.

      Reviewer #2 (Recommendations For The Authors):

      Abstract: "the spatiotemporal overlap between the awareness-related activity and the interregional connectivity in PFC suggested that conscious access and phenomenal awareness may be closely coupled." I strongly suggest revising this sentence. The current results cannot be used to make such a broad claim about p-consciousness vs. a-consciousness. This study used a balanced trial-by-trial report paradigm, which can only measure conscious access.

      We thank the reviewer for this comment. We have removed this sentence from the revised manuscript.

      Task design: A very similar task was used previously by Schröder et al. (2021) J Neurosci. See specifically, their Figure 1, and Figure 4B-C. Using almost the exact same "matching task", the authors of this previous study show that they get a P3b for both the perceived and not-perceived conditions, confirming that post-perceptual cognition/report confounds were not eliminated, but instead were present in (and balanced between) both the perceived/not-perceived trials due to the delayed matching aspect of the design. This previous paper should be cited and the P3b result should be considered when assessing whether cognition/report confounds were addressed in the current study.

      Thank you very much for reminding us of the study by Schröder et al. We are sorry for not citing this closely related study in our previous manuscript. Schröder et al. found that, while the P3b differed significantly between perceived and not-perceived trials in the direct report task, it was present in both perceived and not-perceived trials and did not differ significantly between them in the matching task. Based on these findings, Schröder et al. argued that the P3b represents task-specific post-perceptual cognition/report rather than the emergence of awareness per se. Considering the similarity between the task of Schröder et al. and ours, we agree that our task cannot totally eliminate the confound between post-perceptual cognition/report-related activity and awareness-related activity. Nevertheless, our task can minimize the confound between motor-related activity and the emergence of awareness by separating them in time and balancing the direction of the response movements. Therefore, we changed the term “report-related” to “motor-related” in the text of the revised manuscript.

      On page 2, lines 71-75, the authors' review of the Frassle et al. (2014) experiment should be revised for accuracy. In this study, all PFC activity did not disappear as the authors claim. Also, the main contrast in the Frassle et al. study was rivalry vs. replay. However, in both of these conditions, visual awareness was changing with the main difference being whether there was sensory conflict between the two eyes or not. Such a contrast would presumably subtract out the common activity patterns related to visual awareness changes, while isolating rivalry (and the resulting neural competition) vs. non-rivalry (and the lack of such competition) which is not broadly relevant for the goal of measuring neural correlates of visual awareness which are present in both sides of the contrast (rivalry and replay).

      Thank you very much for your suggestion. We agree and have revised the text in the MS (lines 71-76).

      ‘For instance, a functional magnetic resonance imaging (fMRI) study employing human binocular rivalry paradigms found that when subjects need to manually report the changing of their awareness between conflict visual stimuli, the frontal, parietal, and occipital lobes all exhibited awareness-related activity. However, when report was not required, awareness-related activation was largely diminished in the frontal lobe but remained in the occipital and parietal lobes’

      On page 2, lines 76-78, the authors write, "no-report paradigm may overestimate unconscious processing because it cannot directly measure the awareness state". This should be reworded for clarity, as report paradigms also do not "directly measure the awareness state". All measures of awareness are indirect, either via subjects verbal or manual reports, or via behaviors or other physiological measures like OKN, pupillometry, etc. It's also not clear as written why no-report paradigms might overestimate unconscious processing.

      Thank you very much for your suggestion. We agree and have modified the description in lines 76-80:

      ‘Nevertheless, the no-report paradigm may overestimate the neural correlates of awareness by including unconscious processing, because it infers the awareness state through other relevant physiological indicators, such as optokinetic nystagmus and pupil size(Tsuchiya, Wilke, Frassle, & Lamme, 2015). In the absence of subjective reports, it remains controversial regarding whether the presented stimuli are truly seen or not.’

      However, the no-report paradigm may overestimate the neural correlates of awareness, because it infers the awareness state through other relevant physiological indicators, such as optokinetic nystagmus and pupil size (Tsuchiya et al., 2015), in the absence of subjective reports, and it remains controversial whether the stimuli presented in such paradigms are truly seen as opposed to being merely potentially visible but unattended.

      On page 5, line 155, there is a typo. This should be Figure 2C, not 2B.

      Thanks. We have modified the description.

      On page 5, lines 160-162, the authors state, "The results showed that the saccadic reaction time in the aware trials was systematically shorter than that in the unaware trials. Such results demonstrate that visual awareness significantly affects the speed of information processing in the brain." I don't understand this. If subjects can never make a saccade until the fixation cross changes color, both for Y and N decisions, why would a difference in saccadic reaction times indicate anything about visual awareness affecting the speed of information processing in the brain? Doesn't this just show that the Red/Green x Left/Right response contingencies were easier to remember and execute for the Yes-I-did-see-it decisions compared to the No-I-didn't-see-it decisions?

      We agree and have added discussion of these questions in the revised manuscript (lines 492-496).

      ‘An alternative interpretation for RT difference between aware and unaware condition in our study is that the difference in task-strategies used by subjects/patients to remember the response mapping rules between the perception and the color cue (e.g., if the YES+GREEN=RIGHT and YES+RED=LEFT rules were held in memory, while the NO mappings were inferred secondarily rather than being actively held in memory).’

      In Figure 3B (and several other figures) due to the chosen view and particular brain visualization used, many readers will not know whether the front of brain is up and back of brain down or vice versa (there are no obvious landmarks like the cerebellum, temporal sulcus, etc.). I suggest specifying this in the caption or better yet on the figure itself.

      Thanks. We have added these descriptions in the caption of Figure 2D.

      Line 189 ‘In all brain images, right and up sides of each image represent the right and up sides of the brain’.

      In Figure 3B, the color scale may confuse some readers. When I first inspected this figure, I immediately thought the red meant positive voltage or activation, while the blue meant negative voltage or deactivation. Only later, I realized that any color here is meaningful. Not sure if an adjustment of the color scale might help, or perhaps not normalizing (and not taking absolute values of the voltage diffs, but maintaining the +/- diffs)?

      We thank the reviewer for this comment. We are sorry for not clearly describing why we normalized the activity as absolute values and chose a color scale from 0 to 20. The major reason is that the biological characteristics of LFP polarity are not yet clearly understood (Einevoll et al., Nat Rev Neurosci, 2013). To simplify this complex issue, we consider a change in the magnitude of the LFP during the delay period of our task to represent awareness-related activity, regardless of whether its actual value is positive or negative. Therefore, we first calculated the absolute value of the activity difference between aware and unaware trials at each individual recording site, then used Shepard's method (see Methods for detailed information) to calculate the activity at each vertex, and projected it onto the surface of the brain template, as shown in Fig. 3B.

      We have added the description in the MS (lines 794-800).

      We tried adjusting the color scale to range from -20 to 20, as the reviewer suggested. However, in the resulting topographic heatmap, brain regions with different strengths of awareness-related activity were less distinguishable. We would therefore like to keep our original way of analyzing and presenting these results.
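
      For readers unfamiliar with the interpolation step, the procedure described above (absolute aware-minus-unaware difference at each recording site, then Shepard's inverse-distance weighting at each surface vertex) can be sketched as follows. This is only an illustrative sketch, not the analysis code used in the study: the function names, the coordinates, the example values and the weighting exponent `power=2.0` are all assumptions, and the Methods may specify a different exponent or a distance cutoff.

```python
import math

def abs_awareness_difference(aware_mean, unaware_mean):
    """Unsigned aware-vs-unaware difference at one site (polarity is ignored)."""
    return abs(aware_mean - unaware_mean)

def shepard_interpolate(vertex, site_coords, site_values, power=2.0):
    """Inverse-distance-weighted (Shepard) estimate of activity at one vertex.

    `power` is a hypothetical choice for illustration only.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for coord, value in zip(site_coords, site_values):
        distance = math.dist(vertex, coord)
        if distance == 0.0:
            # The vertex coincides with a recording site: use its value directly.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Two hypothetical recording sites with opposite-polarity differences.
sites = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
values = [
    abs_awareness_difference(3.0, 1.0),   # signed difference +2
    abs_awareness_difference(-1.0, 3.0),  # signed difference -4
]
# A vertex midway between the two sites receives equal weight from each.
midpoint_activity = shepard_interpolate((1.0, 0.0, 0.0), sites, values)
```

      Because absolute values are taken before interpolation, two nearby sites with opposite polarity add to the interpolated map rather than cancelling, which is why the color scale starts at 0.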

      Figure 3B: Why choose seemingly arbitrary time points in this figure? What's the significance of 247 and 314 and 381ms (why not show 200, 250, 300, etc.)? Also, are these single time-points or averages within a broader time window around this time-point, e.g., 225-275ms for the 250ms plot?

      We thank the reviewer for this helpful comment. We are sorry for not clearly describing why we chose these 8 time points to demonstrate the spatiotemporal characteristics of awareness-related activity in Fig. 3B. To identify the awareness-related activity, we analyzed the activity difference between aware and unaware trials during the delay period (180-650 ms after visual stimulus onset). The whole dynamic process is presented in the SI as a video (video S1). Here, we simply sampled the activity at 8 time points (180 ms, 247 ms, 314 ms, etc.) that equally divided the 430 ms delay period.

      We have added the description in the MS (lines 213-215).
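
      The sampling can be illustrated with a short sketch. It assumes that the 8 time points are equally spaced over the 180-650 ms analysis window stated above, with both endpoints included; the function name is hypothetical and this is not the code used for the figure.

```python
def sample_times(start_ms, end_ms, n_points):
    """Return n_points equally spaced times from start_ms to end_ms inclusive."""
    step = (end_ms - start_ms) / (n_points - 1)
    return [start_ms + i * step for i in range(n_points)]

# Eight equally spaced samples across the assumed 180-650 ms window,
# rounded to the nearest millisecond for display.
times = [round(t) for t in sample_times(180, 650, 8)]
```

      Under this assumption the first four values are 180, 247, 314 and 381 ms, which matches the panel labels the reviewer asked about.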

      Figure 3D: It's not clear how this figure panel is related to the data shown in Fig3A. In Fig3A, the positive amplitude diffs all end at around 400ms, but in Fig3D, these diffs extend out to 600+ms. I suggest adding clarity about the conversion being used here.

      We thank the reviewer for this comment. We are sorry for not clearly describing how we analyzed the population activity (Fig. 3D) in the previous version of the manuscript. Since the biological characteristics of LFP polarity are not yet clearly understood, to simplify this complex issue we consider a change in the magnitude of the LFP during the delay period of our task to be awareness-related activity, regardless of whether its actual value is positive or negative. Therefore, when analyzing the awareness-related population activity, we first calculated the absolute value of the activity difference between aware and unaware trials at each individual recording site, then pooled the data of the 43 recording sites and calculated the mean and standard error of the mean (SEM) (Fig. 3D). As can be seen in Fig. 3A, the activity difference between aware (red) and unaware (blue) trials lasts until or beyond the end of the delay period. Thus, the awareness-related population activity in Fig. 3D extends out to 600 ms.

      We have added the description in the MS (lines 769-777).
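
      The pooling step can be sketched as below. This is a minimal illustration under stated assumptions rather than the analysis code used in the study: the function name and the toy data are hypothetical, each site is assumed to contribute one trial-averaged time series per condition, and the SEM is computed as the sample standard deviation across sites divided by the square root of the number of sites.

```python
import math
import statistics

def population_abs_difference(aware, unaware):
    """Mean and SEM across sites of |aware - unaware| at each time point.

    `aware` and `unaware` are lists with one time series per recording site.
    """
    abs_diff = [
        [abs(a - u) for a, u in zip(site_aware, site_unaware)]
        for site_aware, site_unaware in zip(aware, unaware)
    ]
    n_sites = len(abs_diff)
    means, sems = [], []
    # zip(*abs_diff) groups the values of all sites at the same time point.
    for across_sites in zip(*abs_diff):
        means.append(statistics.fmean(across_sites))
        sems.append(statistics.stdev(across_sites) / math.sqrt(n_sites))
    return means, sems

# Toy data: two sites, two time points, with opposite polarity at time 0.
aware = [[1.0, 2.0], [-3.0, 0.0]]
unaware = [[0.0, 0.0], [0.0, 4.0]]
means, sems = population_abs_difference(aware, unaware)
```

      In the toy data the signed differences at the first time point are +1 and -3, which would average to -1, whereas the unsigned differences average to 2. This is one reason a pooled unsigned trace such as Fig. 3D can remain elevated at latencies where a single-site signed trace such as Fig. 3A has returned toward zero.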

      Figure 6D could be improved by making the time labels much bigger, perhaps putting them on the time axis on the bottom rather than in tiny text above each brain.

      We thank the reviewer for this comment. We have modified the figure accordingly.

      Page 18, line 480: "our results show that the prefrontal cortex still displays visual awareness-related activities even after eliminating the influence of the confounding variables related to subjective reports such as motion preparation" This is too strong of a statement. It's not at all clear whether confounding variables related to subjective reports (especially the cognition needed to hold in mind the Y/N decision about seeing the stimulus prior to the response cue) were eliminated with the design used here. In other places of the manuscript, the authors use "minimized" which is more accurate.

      We thank the reviewer for this comment. We have modified the statement accordingly.

      Page 19, section starting on line 508: The authors should consider citing the study by Vishne et al. (2023), which was just accepted for publication recently, but has been posted on bioRxiv for almost a year now: https://www.biorxiv.org/content/10.1101/2022.08.02.502469v1 . And on page 20, line 563, the authors claim that to the best of their knowledge, they were the first to detect "ignition" in PFC in human subjects. Consider revising this statement, now that you know about the Vishne et al. paper.

      We agree.

      Thank you for reminding us of these papers. We have cited this study and added discussion in the revised manuscript (lines 522-533). We agree that several iEEG studies have shown the early involvement of PFC in visual perception (Vishne et al. 2023; Khalaf et al. 2023; Kwon et al. 2021). However, in these studies, the authors did not compare the neural activity between conscious and unconscious conditions, leaving open the possibility that the ERP and HFA were correlated with unconscious information processing rather than awareness-specific processing. In the present study, we compared the neural activity in PFC between conscious and unconscious trials and found that PFC activity was specifically correlated with conscious perception. As we mentioned in the previous version of the manuscript, one iEEG study (Gaillard et al. 2009) reported awareness-specific activity in PFC. However, that awareness-related activity started more than 300 ms after the onset of the visual stimuli, about 100 ms later than the early awareness-related activity in our study. Nevertheless, following the reviewer’s comment, we modified our argument in lines 621-623 as follows:

      ‘However, as discussed above, in contrast with previous studies, our study detected earlier awareness-specific ‘ignition’ in the human PFC, while minimizing the motor-related confounding.’

      Experimental task section of Methods: Were any strategies for learning the response cue matching task suggested to patients/subjects, and/or did any patients/subjects report which strategy they ended up using? For example, if I were a subject in this experiment, I would remember and mentally rehearse the rules: "YES+GREEN = RIGHT" and "YES+RED = LEFT". For trials in which I didn't see anything, I wouldn't need to hold 2 more rules in mind, as they can be inferred from the inverse of the YES rules (and it's much harder to hold 4 things in mind than 2). This extra inference needed to get to the NO+GREEN = LEFT and NO+RED = RIGHT rules would likely cause me to respond slightly slower to the NO trials compared to the YES trials, leading to saccadic RT effects in the same direction the authors found. More information about the task training and strategies used by patients/subjects would be helpful.

      We agree and discussed this in lines 492-496.

      Reviewer #3 (Public Review):

      The authors report a study in which they use intracranial recordings to dissociate subjectively aware and subjectively unaware stimuli, focusing mainly on prefrontal cortex. Although this paper reports some interesting findings (the videos are very nice and informative!) the interpretation of the data is unfortunately problematic for several reasons. I will detail my main comments below. If the authors address these comments well, I believe the paper may provide an interesting contribution to further specifying the neural mechanisms important for conscious access (in line with Gaillard et al., Plos Biology 2009).

      Reply: We very much appreciate the reviewer’s encouraging opinion.

      The main problem with the interpretation of the data is that the authors have NOT used a so called "no-report paradigm". The idea of no report paradigms is that subjects passively view a certain stimulus without the instruction to "do something with it", e.g., detect the stimulus, immediately or later in time. Because of the confusion of this term, specifically being related to the "act of reporting", some have argued we should use the term no-cognition paradigm instead (Block, TiCS, 2019, see also Pitts et al., Phil Trans B 2018). The crucial aspect is that, in these types of paradigms, the critical stimulus should be task-irrelevant and thus not be associated with any task (immediately or later). Because in this experiment subjects were instructed to detect the gratings when cued 600 ms later in time, the stimuli are task relevant, they have to be reported about later and therefore trigger all kinds of (known and potentially unknown) cognitive processes at the moment the stimuli are detected in real-time (so stimulus-locked). You could argue that the setup of this delayed response task excludes some very specific report related processes (e.g., the preparation of an eye-movement), which is good, however this is usually not considered the main issue. For example when comparing masked versus unmasked stimuli (Gaillard et al., 2009 Plos Biology), these conditions usually also both contain responses but these response related processes are "averaged out" in the specific contrasts (unmasked > masked). In this paper, RT differences between conditions (that are present in this dataset) are taken care of by using this delayed response in this paper, which is a nice feature for that and is not the case for the above example set-up.

      Given the task instructions, and this being merely a delayed-response task, it is to be expected that prefrontal cortex shows stronger activity for subjectively aware versus subjectively unaware stimuli. Unfortunately, given the nature of this task, the novelty of the findings is severely reduced. The authors cannot claim that prefrontal cortex is associated with "visual awareness", or what people have called phenomenal consciousness (this is the goal of using no-cognition paradigms). The only conclusion that can be drawn is that prefrontal cortex activity is associated with accessing sensory input: and hence conscious access. This less novel observation has been shown many times before and there is also little disagreement about this issue between different theories of consciousness (e.g., global workspace theory and local recurrency theories both agree on this).

      We fully agree that no-report/no-cognition paradigms involve less cognition in post-perceptual processing than report paradigms. We designed the balanced response task to minimize the motor-related component of post-perceptual processing, even though this task does not eliminate all cognition from post-perceptual processing. However, we disagree with the reviewer’s comment that our task cannot assess the involvement of PFC in the emergence of awareness. As we mentioned in the manuscript, the early awareness-related activity (~200 ms) in PFC, which resembles the VAN activity in EEG studies, indicates an association of PFC with the emergence of visual awareness (phenomenal consciousness).

      The best solution at this point seems to rewrite the paper entirely in light of this. My advice would be to state in the introduction that the authors investigate conscious access using iEEG and then not refer too much to no-cognition paradigm or maybe highlight some different strategies about using task-irrelevant stimuli (see Canales-Johnson et al., Plos Biology 2023; Hesse et al., eLife 2020; Hatamimajoumerd et al Curr Bio 2022; Alilovic et al., Plos Biology 2023; Pitts et al., Frontiers 2014; Dwarakanth et al., Neuron 2023 and more). Obviously, the authors should then also not claim that their results solve debates about theories regarding visual awareness (in the "no-cognition" sense, or phenomenal consciousness), for example in relation to the debate about the "front or the back of the brain", because the data do not inform that discussion. Basically, the authors can just discuss their results in detail (related to timing, frequency, synchronization etc) and relate the different signatures that they have observed to conscious access.

      The objective of the present study is to assess whether PFC is involved in the emergence of visual awareness (i.e., phenomenal consciousness). Interestingly, we found early awareness-related activity (~200 ms after visual stimulus onset) in PFC, including ERP, high-gamma activity and phase synchronization, which indicates an association of PFC with the emergence of visual awareness. Therefore, we would like to keep the basic framing of the manuscript and revise it according to the reviewers’ comments.

      On the other hand, we fully agree with the reviewer’s argument that the report paradigm is more suitable for studying access consciousness. Indeed, we have found that the awareness-related activity in PFC can be separated into two subgroups: early activity with shorter latency (~200 ms after stimulus onset) and late activity with longer latency (> 350 ms after stimulus onset). In addition, the early activity declined to baseline within ~200 ms during the delay period, whereas the late activity lasted throughout the delay period and extended into the next stage of the task (the color change of the fixation point). Moreover, the early activity occurs primarily in the PFC contralateral to the visual stimulus, whereas the late activity occurs in both contralateral and ipsilateral PFC. While the early awareness-related activity resembles the VAN activity in EEG studies (associated with p-consciousness), the late awareness-related activity resembles the P3b activity (associated with a-consciousness). We are going to report these results in a separate paper soon.

      I think the authors have to discuss the Gaillard et al PLOS Biology 2009 paper in much more detail. Gaillard et al also report a study related to conscious access contrasting unmasked and masked stimuli using iEEG. In this paper they also report ERP, time frequency and phase synchronization results (and even Granger causality). Because of the similarities in approach, I think it would be important to directly compare the results presented in that paper with results presented here and highlight the commonalities and discrepancies in the Discussion.

      Thank you for the comment. We have performed additional analyses and added a detailed discussion accordingly. We have also extended the discussion to other relevant studies in the revised manuscript.

      In lines 528-549,

      ‘Although one iEEG study reported awareness-specific PFC activation, the awareness-related activity started 300 ms after the onset of visual stimuli, which was ~100 ms later than the early activity in our study. Also, due to the limited number of electrodes in PFC (2 patients with 19 recording sites mostly in mesiofrontal and peri-insular regions), their experiments were restricted while exploring the awareness-related activity in PFC. In the present study, the number of recording sites (245) were much more than previous study and covered more areas in PFC. Our results further show earlier awareness-related activity (~ 200 ms after visual stimuli onset), including ERP, HFA and PLV. These awareness-related activity in PFC occurred even earlier (~150 ms after stimulus onset) for the salient stimulus trials (Fig. 3A\D and Fig. 4A\D, HA condition).

      However, the proportions are much smaller than that reported by Gaillard et al, which peaked at ~60%. We think that one possibility for the difference may be due to the more sampled PFC subregions in present study and the uneven distribution of awareness-related activity in PFC. Meanwhile, we noticed that the peri-insula regions and middle frontal gyrus (MFG), which were similar with the regions reported by Gaillard et al, seemed to show more fraction of awareness-related sites than other subregions during the delay period (0-650 ms after stimulus onset). To test such possibility and make comparison with the study of Gaillard et al. we calculated the proportion of awareness-related site in peri-insula and MFG regions. We found although the proportion of awareness-related site was larger in peri-insula and MFG than in other subregions, it was much lower than the report of Gaillard et al. One alternative possibility for the difference between these two studies might be due to the more complex task in Gaillard et al. Nevertheless, we think these new results would contribute to our understanding of the neural mechanism underlying conscious perception, especially for the role of PFC.’

      In lines 601-603:

      ‘The only human iEEG study reported that the phase synchronization of the beta band in the aware condition also occurred relatively late (> 300 ms) and mainly confined to posterior zones but not PFC.’

      As for the Granger causality analysis between PFC and the occipital lobe: because this study focused mainly on PFC and there were few recording sites in the occipital lobe, we would like to perform this analysis in later studies after we collect more data.

      In the Gaillard paper they report a figure plotting the percentage of significant frontal electrodes across time (figure 4A) in which it can be seen that significant electrodes emerge after approximately 250 ms in PFC as well. It would be great if the authors could make a similar figure to compare results. In the current paper there are much more frontal electrode contacts than in the Gaillard paper, so that is interesting in itself.

      We thank the reviewer for this constructive comment. We performed an analysis similar to that of Gaillard et al. and plotted the results in the figure below. As can be seen, awareness-related sites started to emerge about 200 ms after visual stimulus onset according to both ERP and HG activity. The proportion of awareness-related sites peaked at ~14% (8% for HG) at 300-400 ms. However, these proportions are much smaller than that reported by Gaillard et al., which peaked at ~60%. One possible reason for the difference is that more PFC subregions were sampled in the present study and that awareness-related activity is unevenly distributed in PFC. We also noticed that the peri-insular regions and middle frontal gyrus (MFG), which are similar to the regions reported by Gaillard et al., seemed to show a larger fraction of awareness-related sites than other subregions during the delay period (0-650 ms after stimulus onset). To test this possibility and to compare with the study of Gaillard et al., we calculated the proportion of awareness-related sites in the peri-insular and MFG regions. We found that although the proportion of awareness-related sites was larger in the peri-insular and MFG regions than in other subregions, it was still much lower than that reported by Gaillard et al. An alternative explanation for the difference between the two studies might be the more complex task in Gaillard et al.

      We have added this figure and discussion to the revised manuscript as a new result (Figure 4E & S2 and lines 537-549).

      Author response image 1.

      Percentage of awareness-related sites in ERP and HG analysis. n, number of recording sites in PFC.

      Author response image 2.

      Percentage of awareness-related sites in ERP and HG analysis at parsopercularis and middle frontal gyrus (MFG). n, number of recording sites.
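
      As an illustration only, the time-resolved percentage of awareness-related sites described above could be computed along the following lines. This is a minimal sketch, not the authors' analysis code: the function name `percent_sites_per_bin` and the input layout (one row of per-bin significance flags per recording site) are assumptions made for the example, and how significance is established for each site and bin is left outside its scope.

```python
def percent_sites_per_bin(significant):
    """Percentage of recording sites flagged as awareness-related in each time bin.

    significant[s][b] is True when site s shows a significant aware-vs-unaware
    difference in time bin b (hypothetical input format for this sketch).
    """
    n_sites = len(significant)
    if n_sites == 0:
        return []
    n_bins = len(significant[0])
    # For each bin, count flagged sites and express the count as a percentage.
    return [
        100.0 * sum(1 for site in significant if site[b]) / n_sites
        for b in range(n_bins)
    ]


# Toy example: 4 sites, 3 time bins (made-up flags, not data from the study).
flags = [
    [False, True, True],
    [False, False, True],
    [False, False, False],
    [False, True, False],
]
percentages = percent_sites_per_bin(flags)
```

      With real data, each column would correspond to a time window after stimulus onset, so the returned list traces how the fraction of awareness-related sites evolves over the delay period.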

      In my opinion, some of the most interesting results are not highlighted: the findings that subjectively unaware stimuli show increased activations in the prefrontal cortex as compared to stimulus absent trials (e.g., Figure 4D). Previous work has shown PFC activations to masked stimuli (e.g., van Gaal et al., J Neuroscience 2008, 2010; Lau and Passingham J Neurosci 2007) as well as PFC activations to subjectively unaware stimuli (e.g., King, Pescetelli, and Dehaene, Neuron 2016) and this is a very nice illustration of that with methods having more detailed spatial precision. Although potentially interesting, I wonder about the objective detection performance of the stimuli in this task. So please report objective detection performance for the patients and the healthy subjects, using signal detection theoretic d'. This gives the reader an idea of how good subjects were in detecting the presence/absence of the gratings. Likely, this reveals far above chance detection performance and in that case I would interpret these findings as "PFC activation to stimuli indicated as subjectively unaware" and not unconscious stimuli. See Stein et al., Plos Biology 2021 for a direct comparison of subjectively and objectively unaware stimuli.

      We greatly appreciate the reviewer’s helpful and valuable comments. We did notice that PFC activity in the subjectively unaware condition (stimulus contrast near perceptual threshold) is significantly higher than in the stimulus-absent condition. These results, obtained with sEEG recordings that offer much higher spatial resolution than brain imaging and scalp EEG, support the findings of previous studies (citations). Because the neural correlates of unaware processing are a topical and interesting question, after careful consideration we would like to report these results in a separate paper rather than add them to the current manuscript, to avoid distraction.

      In response to the reviewer’s comment about objective detection performance in our task, we computed the signal detection theoretic d’. The values of d’ in patients and healthy subjects are similar (1.81±0.27 in patients and 2.12±0.37 in healthy subjects). These results indicate that the objective detection performance of subjects in our task is well above chance level. Since our task measures only subjective awareness, we agree with the reviewer that our results should be interpreted as PFC activation to stimuli reported as subjectively unaware rather than to objectively unaware stimuli. We will emphasize this point in our next paper.

      We have added the d’ values to the manuscript (lines 149-150).
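
      For readers unfamiliar with the measure, d’ is the difference between the z-transformed hit rate and the z-transformed false-alarm rate. The sketch below shows one common way to compute it from trial counts. It is an illustration under stated assumptions, not the authors' code: the function name `d_prime` is hypothetical, and the log-linear correction (adding 0.5 to each cell) is one of several conventions for avoiding rates of exactly 0 or 1; the response does not state which convention was used.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (0.5 added to each count) keeps both rates
    strictly between 0 and 1. This correction is an assumption of the
    sketch, not necessarily the procedure used in the study.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal distribution
    return z(hit_rate) - z(fa_rate)


# Toy counts (made up for illustration): 80 hits, 20 misses on stimulus-present
# trials; 20 false alarms, 80 correct rejections on stimulus-absent trials.
example = d_prime(80, 20, 20, 80)
```

      A value of 0 corresponds to chance-level detection, and larger positive values indicate better discrimination between stimulus-present and stimulus-absent trials.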

      In Figure 7 of the paper the authors want to make the case that the contrast does not differ between subjectively aware stimuli and subjectively unaware stimuli. However so far they've done the majority of their analyses across subjects, and for this analysis the authors only performed within-subject tests, which is not a fair comparison imo. Because several P values are very close to significance I anticipate that a test across subjects will clearly show that the contrast level of the subjectively aware stimuli is higher than of the subjectively unaware stimuli, at the group level. A solution to this would be to sub-select trials from one condition (NA) to match the contrast of the other condition (NU), and thereby create two conditions that are matched in contrast levels of the stimuli included. Then do all the analyses on the matched conditions.

      We thank the reviewer for the helpful comment. Regarding the comment “However so far they've done the majority of their analyses across subjects, and for this analysis the authors only performed within-subject tests, which is not a fair comparison imo”, if we understand correctly, the reviewer considers it unfair that the analysis of neural activity in PFC was done across subjects while the stimulus contrast analysis between NA and NU was done within individual subjects. This is not the case. In the neural activity analysis, significant awareness-related sites were first identified in each individual subject (Fig. 3A and Fig. 4A, and Methods), as in the stimulus contrast analysis (see Methods). Only in the neural population activity analysis was the activity of awareness-related sites pooled for further analysis.

      To provide further evidence that the awareness-related activity in PFC is not strongly correlated with stimulus contrast, we compared the activity difference across two stimulus contrast differences: between the high-contrast aware (HA) and NA conditions (large difference, ~14%), and between the NA and NU conditions (slight difference, ~0.2%). The working hypothesis is that, if PFC activity is closely correlated with stimulus contrast, the activity difference between the HA and NA conditions should be much larger than that between the NA and NU conditions. To test this hypothesis, we analyzed data from two patients in whom the previous analysis showed a significant or near-significant difference in stimulus contrast between NA and NU conditions (Author response image 3, below, patients #2 and #1). The results (Author response image 3) show that the averaged activity difference (0-650 ms after visual stimulus onset) between HA and NA was similar to that between NA and NU trials, even though the stimulus contrast difference was much larger between HA and NA conditions than between NA and NU conditions. These results indicate that the awareness-related activity in PFC cannot be explained solely by the contrast difference between NA and NU conditions. Based on these results, we think it is not necessary to perform the analysis suggested by the reviewer: “A solution to this would be to sub-select trials from one condition (NA) to match the contrast of the other condition (NU), and thereby create two conditions that are matched in contrast levels of the stimuli included. Then do all the analyses on the matched conditions”. Another reason that prevents us from doing this analysis is the limited number of trials in our dataset.

      Author response image 3.

      Relationship between stimulus contrast and PFC activity. The X axis represents the stimulus contrast difference between two paired conditions, i.e., aware versus unaware in near-perceptual-threshold conditions (NA – NU, red dots), and aware in the high-contrast condition versus aware in the near-perceptual-threshold condition (HA – NA, blue dots). The Y axis represents the activity difference between the paired stimulus conditions. The results show that the activity difference is similar for the two pairs of conditions despite the marked difference in contrast between them. These results indicate that the greater activity in NA trials than in NU trials (Fig. xx-xx) cannot be explained by the slight difference in stimulus contrast between NA and NU trials.
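
      For reference, the contrast-matching sub-selection proposed in the reviewer's comment above could be sketched as follows. This is an illustrative example only and was not part of the reported analyses: the function name `match_trials_by_contrast`, the tolerance parameter, and the toy contrast values are all assumptions. It pairs trials from two conditions whose contrasts differ by no more than a tolerance, using each trial at most once.

```python
def match_trials_by_contrast(contrasts_a, contrasts_b, tolerance):
    """Pair trials from two conditions with near-identical stimulus contrast.

    Returns a list of (index_in_a, index_in_b) pairs such that
    |contrasts_a[i] - contrasts_b[j]| <= tolerance and every trial is used
    at most once. Both lists are walked in order of increasing contrast.
    """
    order_a = sorted(range(len(contrasts_a)), key=lambda i: contrasts_a[i])
    order_b = sorted(range(len(contrasts_b)), key=lambda j: contrasts_b[j])
    pairs = []
    ia = ib = 0
    while ia < len(order_a) and ib < len(order_b):
        i, j = order_a[ia], order_b[ib]
        diff = contrasts_a[i] - contrasts_b[j]
        if abs(diff) <= tolerance:
            pairs.append((i, j))  # close enough: keep both trials as a matched pair
            ia += 1
            ib += 1
        elif diff < 0:
            ia += 1  # this a-trial is too low for every remaining b-trial
        else:
            ib += 1  # this b-trial is too low for every remaining a-trial
    return pairs


# Toy contrast values in percent (made up for illustration).
aware = [1.0, 1.2, 1.4, 2.0]
unaware = [1.1, 1.21, 0.5]
matched = match_trials_by_contrast(aware, unaware, tolerance=0.15)
```

      Downstream analyses would then be restricted to the matched trial indices. As the authors note, such sub-selection reduces the number of usable trials, which may be limiting in small datasets.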

      Related, Figure 7B is confusing and the results are puzzling. Why is there such a strong below chance decoding on the diagonal? (also even before stimulus onset) Please clarify the goal and approach of this analysis and also discuss/explain better what they mean.

      We have removed Figure 7B because of the confusing decoding results on the diagonal.

      I was somewhat surprised by several statements in the paper and it felt that the authors may not be aware of several intricacies in the field of consciousness. For example, a statement like the following "Consciousness, as a high-level cognitive function of the brain, should have some similar effects as other cognitive functions on behavior (for example, saccadic reaction time). With this question in mind, we carefully searched the literature about the relationship between consciousness and behavior; surprisingly, we failed to find any relevant literature." This is rather problematic for at least two reasons. First, not everyone would agree that consciousness is a high-level cognitive function and second there are many papers arguing for a certain relationship between consciousness and behavior (Dehaene and Naccache, 2001 Cognition; van Gaal et al., 2012, Frontiers in Neuroscience; Block 1995, BBS; Lamme, Frontiers in Psychology, 2020; Seth, 2008 and many more). Further, the explanation for the reaction time differences in this specific case is likely related to the fact that subjects' confidence in that decision is much higher in the aware trials than in the unaware trials, hence the speeded response for the first. This is a phenomenon that is often observed if one explores the "confidence literature". Although the authors have not measured confidence I would not make too much out of this RT difference.

      We agree and have modified the text accordingly in lines 492-507.

      ‘An alternative interpretation for RT difference between aware and unaware condition in our study, i.e., reflecting task-strategies used by subjects/patients to remember the response mapping rules between the perception and the color cue (e.g., if the YES+GREEN=RIGHT and YES+RED=LEFT rules were held in memory, while the NO mappings were inferred secondarily rather than being actively held in memory).

      Another possibility is that the reaction time is strongly modulated by the confident level, which has been described in previous studies(Broggin et al., 2012; Marzi et al., 2006). However, in previous studies, the confident levels were usually induced by presenting stimulus with different physical property, such as spatial frequency, eccentricity and contrast. However, the dependence of visual process on the salience of visual stimulus confounds with the effect of visual awareness on the reaction time of responsive movements, which is hard to attribute the shorter reaction time in more salient condition purely to visual awareness. In contrast, we create a condition (near aware threshold) in the present study, in which the saliency (contrast) of visual stimulus is very similar in both aware and unaware conditions in order to eliminate the influence of stimulus saliency in reaction time. We think that the difference in reaction time in our study is mainly due to the modulation of awareness state, which was not reported previously.’

      I would be interested in a lateralized analysis, in which the authors compare the PFC responses and connectivity profiles using PLV as a factor of stimulus location (thus comparing electrodes contralateral to the presented stimulus and electrodes ipsilateral to the presented stimulus). If possible this may give interesting insights in the mechanism of global ignition (global broadcasting), supposing that for contralateral electrodes information does not have to cross from one hemisphere to another, whereas for ipsilateral electrodes that is the case (which may take time). Gaillard et al refer to this issue as well in their paper, and this issue is sometimes discussed regarding to Global workspace theory. This would add novelty to the findings of the paper in my opinion.

      We greatly appreciate the reviewer’s helpful and valuable suggestions. We have performed the analysis accordingly. We find that the awareness-related ERP activation occurs first in the contralateral PFC, with a latency of about 200 ms, and then in both contralateral and ipsilateral PFC about 100 ms later. In addition, the magnitude of awareness-related activity is stronger in the contralateral than in the ipsilateral PFC during the early phase (200-400 ms), after which the activity becomes similar between contralateral and ipsilateral PFC. Moreover, the awareness-related HG activity appears only in the contralateral PFC. These results show the spatiotemporal characteristics of visual awareness-related activity across the two hemispheres. We are going to report these results in a separate paper soon.

      Reviewer #3 (Recommendations For The Authors):

      Some of the font sizes in the figures are too small.

      We have modified accordingly.

      To me, the abbreviations are confusing, (NA/NU etc). I would try to come up with easier ones or just not use abbreviations.

      We have modified the text accordingly and tried to avoid using abbreviations.

      The data/scripts availability statement states "available upon reasonable request". I would suggest that the authors make the data openly available when possible, and I believe eLife requires that as well.

      Thank you for the suggestion. Because several ongoing studies are based on this dataset, we would like to make our data openly available after completing these studies, provided there is no restriction from national policy.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Comments

      Reviewer 1

      (1) Despite the well-established role of Netrin-1 and UNC5C axon guidance during embryonic commissural axons, it remains unclear which cell type(s) express Netrin-1 or UNC5C in the dopaminergic axons and their targets. For instance, the data in Figure 1F-G and Figure 2 are quite confusing. Does Netrin-1 or UNC5C express in all cell types or only dopamine-positive neurons in these two mouse models? It will also be important to provide quantitative assessments of UNC5C expression in dopaminergic axons at different ages.

      Netrin-1 is a secreted protein, and in this manuscript we did not examine which cell types express Netrin-1. This question is not the focus of the study and we consider it irrelevant to the main issue we are addressing, which is where Netrin-1+ cells are present in the forebrain regions we examined. As per the reviewer’s request, we include below images showing Netrin-1 protein and Netrin-1 mRNA expression in the forebrain. In Author response image 1 below, we show a high-magnification immunofluorescent image of a coronal forebrain section showing Netrin-1 protein expression.

      Author response image 1.

      This confocal microscope image shows immunofluorescent staining for Netrin-1 (green) localized around cell nuclei (stained by DAPI in blue). This image was taken from a coronal section of the lateral septum of an adult male mouse. Scale bar = 20µm

      In Author response images 2 and 3 below, we show low- and high-magnification images from an RNAscope experiment confirming that cells in the forebrain regions examined express Netrin-1 mRNA.

      Author response image 2.

      This confocal microscope image of a coronal brain section of the medial prefrontal cortex of an adult male mouse shows Netrin-1 mRNA expression (green) and cell nuclei (DAPI, blue). Brain regions are as follows: Cg1: Anterior cingulate cortex 1, DP: dorsopeduncular cortex, fmi: forceps minor of the corpus callosum, IL: Infralimbic Cortex, PrL: Prelimbic Cortex

      Author response image 3.

      A higher-resolution image from the same sample as in Author response image 2 shows Netrin-1 mRNA (green) and cell nuclei (DAPI; blue). DP = dorsopeduncular cortex

      Regarding UNC5c, this receptor homologue is expressed by dopamine neurons in the rodent ventral tegmental area (Daubaras et al., 2014; Manitt et al., 2010; Phillips et al., 2022). This does not preclude UNC5c expression in other cell types. UNC5c receptors are ubiquitously expressed in the brain throughout development, performing many different developmental functions (Kim and Ackerman, 2011; Murcia-Belmonte et al., 2019; Srivatsa et al., 2014). In this study we are interested in UNC5c expression by dopamine neurons, and particularly by their axons projecting to the nucleus accumbens. We therefore used immunofluorescent staining in the nucleus accumbens, showing UNC5 expression in TH+ axons. This work adds to the study by Manitt et al., 2010, which examined UNC5 expression in the VTA. Manitt et al. used Western blotting to demonstrate that UNC5 expression in VTA dopamine neurons increases during adolescence, as can be seen in the following figure:

      References:

      Daubaras M, Bo GD, Flores C. 2014. Target-dependent expression of the netrin-1 receptor, UNC5C, in projection neurons of the ventral tegmental area. Neuroscience 260:36–46. doi:10.1016/j.neuroscience.2013.12.007

      Kim D, Ackerman SL. 2011. The UNC5C Netrin Receptor Regulates Dorsal Guidance of Mouse Hindbrain Axons. J Neurosci 31:2167–2179. doi:10.1523/jneurosci.5254-10.2011

      Manitt C, Labelle-Dumais C, Eng C, Grant A, Mimee A, Stroh T, Flores C. 2010. Peri-Pubertal Emergence of UNC-5 Homologue Expression by Dopamine Neurons in Rodents. PLoS ONE 5:e11463-14. doi:10.1371/journal.pone.0011463

      Murcia-Belmonte V, Coca Y, Vegar C, Negueruela S, Romero C de J, Valiño AJ, Sala S, DaSilva R, Kania A, Borrell V, Martinez LM, Erskine L, Herrera E. 2019. A Retino-retinal Projection Guided by Unc5c Emerged in Species with Retinal Waves. Current Biology 29:1149-1160.e4. doi:10.1016/j.cub.2019.02.052

      Phillips RA, Tuscher JJ, Black SL, Andraka E, Fitzgerald ND, Ianov L, Day JJ. 2022. An atlas of transcriptionally defined cell populations in the rat ventral tegmental area. Cell Reports 39:110616. doi:10.1016/j.celrep.2022.110616

      Srivatsa S, Parthasarathy S, Britanova O, Bormuth I, Donahoo A-L, Ackerman SL, Richards LJ, Tarabykin V. 2014. Unc5C and DCC act downstream of Ctip2 and Satb2 and contribute to corpus callosum formation. Nat Commun 5:3708. doi:10.1038/ncomms4708

      (2) Figure 1 used shRNA to knockdown Netrin-1 in the Septum and these mice were subjected to behavioral testing. These results, again, are not supported by any valid data that the knockdown approach actually worked in dopaminergic axons. It is also unclear whether knocking down Netrin-1 in the septum will re-route dopaminergic axons or lead to cell death in the dopaminergic neurons in the substantia nigra pars compacta?

      First, we want to clarify and emphasize that our knockdown approach was not designed to knock down Netrin-1 in dopamine neurons or their axons. Our goal was to knock down Netrin-1 expression in cells expressing this guidance cue gene in the dorsal peduncular cortex.

      We have previously established the efficacy of the shRNA Netrin-1 knockdown virus used in this experiment for reducing the expression of Netrin-1 (Cuesta et al., 2020). The shRNA reduces Netrin-1 levels in vitro and in vivo.

      We agree that our experiments do not address the fate of the dopamine axons that are misrouted away from the medial prefrontal cortex. This research is ongoing, and we have now added a note regarding this to our manuscript.

      Our current hypothesis, based on experiments being conducted as part of another line of research in the lab, is that these axons are rerouted to a different brain region, which they then ectopically innervate. In these experiments we are finding that male mice exposed to tetrahydrocannabinol in adolescence show reduced dopamine innervation in the medial prefrontal cortex in adulthood but increased dopamine input in the orbitofrontal cortex. In addition, these mice show increased action impulsivity in the Go/No-Go task in adulthood (Capolicchio et al., Society for Neuroscience 2023 Abstracts).

      References:

      Capolicchio T., Hernandez, G., Dube, E., Estrada, K., Giroux, M., Flores, C. (2023) Divergent outcomes of delta 9 - tetrahydrocannabinol in adolescence on dopamine and cognitive development in male and female mice. Society for Neuroscience, Washington, DC, United States [abstract].

      Cuesta S, Nouel D, Reynolds LM, Morgunova A, Torres-Berrío A, White A, Hernandez G, Cooper HM, Flores C. 2020. Dopamine Axon Targeting in the Nucleus Accumbens in Adolescence Requires Netrin-1. Frontiers Cell Dev Biology 8:487. doi:10.3389/fcell.2020.00487

      (3) Another issue with Figure1J. It is unclear whether the viruses were injected into a WT mouse model or into a Cre-mouse model driven by a promoter specifically expresses in dorsal peduncular cortex? The authors should provide evidence that Netrin-1 mRNA and proteins are indeed significantly reduced. The authors should address the anatomic results of the area of virus diffusion to confirm the virus specifically infected the cells in dorsal peduncular cortex.

      All the virus knockdown experiments were conducted in wild-type mice; we have added this information to Figure 1K.

      The efficacy of the shRNA in knocking down Netrin-1 was demonstrated by Cuesta et al. (2020) both in vitro and in vivo, as we show in our response to the reviewer’s previous comment above.

      We also now provide anatomical images demonstrating the localization of the injection and area of virus diffusion in the mouse forebrain. In Author response image 4 below the area of virus diffusion is visible as green fluorescent signal.

      Author response image 4.

      Fluorescent microscopy image of a mouse forebrain demonstrating the localization of the injection of a virus to knock down Netrin-1. The location of the virus is in green, while cell nuclei are in blue (DAPI). Abbreviations: DP: dorsopeduncular cortex; IL: infralimbic cortex.

      References:

      Cuesta S, Nouel D, Reynolds LM, Morgunova A, Torres-Berrío A, White A, Hernandez G, Cooper HM, Flores C. 2020. Dopamine Axon Targeting in the Nucleus Accumbens in Adolescence Requires Netrin-1. Frontiers Cell Dev Biology 8:487. doi:10.3389/fcell.2020.00487

      (4) The authors need to provide information regarding the efficiency and duration of knocking down. For instance, in Figure 1K, the mice were tested after 53 days post injection, can the virus activity in the brain last for such a long time?

      In our study we are interested in the role of Netrin-1 expression in the guidance of dopamine axons from the nucleus accumbens to the medial prefrontal cortex. The critical window for these axons leaving the nucleus accumbens and growing to the cortex is early adolescence (Reynolds et al., 2018b). This is why we injected the virus at the onset of adolescence, at postnatal day 21. As dopamine axons grow from the nucleus accumbens to the prefrontal cortex, they pass through the dorsal peduncular cortex. We disrupted Netrin-1 expression at this point along their route to determine whether it is the Netrin-1 present along their route that guides these axons to the prefrontal cortex. We hypothesized that the shRNA Netrin-1 virus would disrupt the growth of the dopamine axons, reducing the number of axons that reach the prefrontal cortex and therefore the number of axons that innervate this region in adulthood.

      We conducted our behavioural tests during adulthood, after the critical window during which dopamine axon growth occurs, so as to observe the enduring behavioural consequences of this misrouting. This experimental approach is designed for the shRNA Netrin-1 virus to be expressed in cells in the dorsopeduncular cortex when the dopamine axons are growing, during adolescence.

      References:

      Capolicchio T., Hernandez, G., Dube, E., Estrada, K., Giroux, M., Flores, C. (2023) Divergent outcomes of delta 9 - tetrahydrocannabinol in adolescence on dopamine and cognitive development in male and female mice. Society for Neuroscience, Washington, DC, United States [abstract].

      Reynolds LM, Yetnikoff L, Pokinko M, Wodzinski M, Epelbaum JG, Lambert LC, Cossette M-P, Arvanitogiannis A, Flores C. 2018b. Early Adolescence is a Critical Period for the Maturation of Inhibitory Behavior. Cerebral cortex 29:3676–3686. doi:10.1093/cercor/bhy247

      (5) In Figure 1N-Q, silencing Netrin-1 results in less DA axons targeting to infralimbic cortex, but why the Netrin-1 knocking down mice revealed the improved behavior?

      This is indeed an intriguing finding, and we have now added a mention of it to our manuscript. We have demonstrated that misrouting dopamine axons away from the medial prefrontal cortex during adolescence alters behaviour, but why it improves performance on a measure of action impulsivity is currently unknown to us. One potential answer is that the dopamine axons are misrouted to a different brain region that is also involved in controlling impulsive behaviour, perhaps the dorsal striatum (Kim and Im, 2019) or the orbital prefrontal cortex (Jonker et al., 2015).

      We would also like to note that we are finding that other manipulations that appear to reroute dopamine axons to unintended targets can lead to reduced action impulsivity as measured using the Go/No-Go task. As we mentioned above, current experiments in the lab, which are part of a different line of research, are showing that male mice exposed to tetrahydrocannabinol in adolescence show reduced dopamine innervation in the medial prefrontal cortex in adulthood, but increased dopamine input in the orbitofrontal cortex. In addition, these mice show increased action impulsivity in the Go/No-Go task in adulthood (Capolicchio et al., Society for Neuroscience 2023 Abstracts).

      References

      Capolicchio T., Hernandez, G., Dube, E., Estrada, K., Giroux, M., Flores, C. (2023) Divergent outcomes of delta 9 - tetrahydrocannabinol in adolescence on dopamine and cognitive development in male and female mice. Society for Neuroscience, Washington, DC, United States [abstract].

      Jonker FA, Jonker C, Scheltens P, Scherder EJA. 2015. The role of the orbitofrontal cortex in cognition and behavior. Rev Neurosci 26:1–11. doi:10.1515/revneuro-2014-0043

      Kim B, Im H. 2019. The role of the dorsal striatum in choice impulsivity. Ann N York Acad Sci 1451:92–111. doi:10.1111/nyas.13961

      (6) What is the effect of knocking down UNC5C on dopamine axons guidance to the cortex?

      We have found that mice that are heterozygous for a nonsense Unc5c mutation, and as a result have reduced levels of UNC5c protein, show reduced amphetamine-induced locomotion and stereotypy (Auger et al., 2013). In the same manuscript we show that this effect only emerges during adolescence, in concert with the growth of dopamine axons to the prefrontal cortex. This is indirect but strong evidence that UNC5c receptors are necessary for correct adolescent dopamine axon development.

      References

      Auger ML, Schmidt ERE, Manitt C, Dal-Bo G, Pasterkamp RJ, Flores C. 2013. unc5c haploinsufficient phenotype: striking similarities with the dcc haploinsufficiency model. European Journal of Neuroscience 38:2853–2863. doi:10.1111/ejn.12270

      (7) In Figures 2-4, the authors only showed the amount of DA axons and UNC5C in NAcc. However, it remains unclear whether these experiments also impact the projections of dopaminergic axons to other brain regions, critical for the behavioral phenotypes. What about other brain regions such as prefrontal cortex? Do the projection of DA axons and UNC5c level in cortex have similar pattern to those in NAcc?

      UNC5c receptors are expressed throughout development and are involved in many developmental processes (Kim and Ackerman, 2011; Murcia-Belmonte et al., 2019; Srivatsa et al., 2014). We cannot say whether the pattern we observe here is unique to the nucleus accumbens, but it is certainly not universal throughout the brain.

      The brain region we focus on in our manuscript, in addition to the nucleus accumbens, is the medial prefrontal cortex. Close and thorough examination of the prefrontal cortices of adult mice revealed practically no UNC5c expression by dopamine axons. However, we did observe very rare cases of dopamine axons expressing UNC5c. It is not clear whether these rare cases are present before or during adolescence.

      Below is a representative set of images of this observation, which is now also included as Supplementary Figure 4:

      Author response image 5.

      Expression of UNC5c protein in the medial prefrontal cortex of an adult male mouse. Low (A) and high (B) magnification images demonstrate that there is little UNC5c expression in dopamine axons in the medial prefrontal cortex. Here we identify dopamine axons by immunofluorescent staining for tyrosine hydroxylase (TH, see our response to comment #9 regarding the specificity of the TH antibody for dopamine axons in the prefrontal cortex). This figure is also included as Supplementary Figure 4 in the manuscript. Abbreviations: fmi: forceps minor of the corpus callosum, mPFC: medial prefrontal cortex.

      References:

      Kim D, Ackerman SL. 2011. The UNC5C Netrin Receptor Regulates Dorsal Guidance of Mouse Hindbrain Axons. J Neurosci 31:2167–2179. doi:10.1523/jneurosci.5254-10.2011

      Murcia-Belmonte V, Coca Y, Vegar C, Negueruela S, Romero C de J, Valiño AJ, Sala S, DaSilva R, Kania A, Borrell V, Martinez LM, Erskine L, Herrera E. 2019. A Retino-retinal Projection Guided by Unc5c Emerged in Species with Retinal Waves. Current Biology 29:1149-1160.e4. doi:10.1016/j.cub.2019.02.052

      Srivatsa S, Parthasarathy S, Britanova O, Bormuth I, Donahoo A-L, Ackerman SL, Richards LJ, Tarabykin V. 2014. Unc5C and DCC act downstream of Ctip2 and Satb2 and contribute to corpus callosum formation. Nat Commun 5:3708. doi:10.1038/ncomms4708

      (8) Can overexpression of UNC5c or Netrin-1 in male winter hamsters mimic the observations in summer hamsters? Or overexpression of UNC5c in female summer hamsters to mimic the winter hamster? This would be helpful to confirm the causal role of UNC5C in guiding DA axons during adolescence.

      This is an excellent question. We are very interested in both increasing and decreasing UNC5c expression in hamster dopamine axons to see if we can directly manipulate summer hamsters into winter hamsters and vice versa. We are currently exploring virus-based approaches to design these experiments and are excited for results in this area.

      (9) The entire study relied on using tyrosine hydroxylase (TH) as a marker for dopaminergic axons. However, the expression of TH (either by IHC or IF) can be influenced by other environmental factors, that could alter the expression of TH at the cellular level.

      This is an excellent point that we now carefully address in our methods by adding the following:

      In this study we pay great attention to the morphology and localization of the fibres from which we quantify varicosities to avoid counting any fibres stained with TH antibodies that are not dopamine fibres. The fibres that we examine and that are labelled by the TH antibody show features indistinguishable from the classic features of cortical dopamine axons in rodents (Berger et al., 1974; 1983; Van Eden et al., 1987; Manitt et al., 2011), namely they are thin fibres with irregularly-spaced varicosities, are densely packed in the nucleus accumbens, sparsely present only in the deep layers of the prefrontal cortex, and are not regularly oriented in relation to the pial surface. This is in contrast to rodent norepinephrine fibres, which are smooth or beaded in appearance, relatively thick with regularly spaced varicosities, increase in density towards the shallow cortical layers, and are in large part oriented either parallel or perpendicular to the pial surface (Berger et al., 1974; Levitt and Moore, 1979; Berger et al., 1983; Miner et al., 2003). Furthermore, previous studies in rodents have noted that only norepinephrine cell bodies are detectable using immunofluorescence for TH, not norepinephrine processes (Pickel et al., 1975; Verney et al., 1982; Miner et al., 2003), and we did not observe any norepinephrine-like fibres.

      Furthermore, we are not aware of any other processes in the forebrain that are known to be immunopositive for TH under any environmental conditions.

      To reduce confusion, we have replaced the abbreviation for dopamine – DA – with TH in the relevant panels in Figures 1, 2, 3, and 4 to clarify exactly what is represented in these images. As can be seen in these images, fluorescent green labelling is present only in axons, which is to be expected of dopamine labelling in these forebrain regions.

      References:

      Berger B, Tassin JP, Blanc G, Moyne MA, Thierry AM (1974) Histochemical confirmation for dopaminergic innervation of the rat cerebral cortex after destruction of the noradrenergic ascending pathways. Brain Res 81:332–337.

      Berger B, Verney C, Gay M, Vigny A (1983) Immunocytochemical Characterization of the Dopaminergic and Noradrenergic Innervation of the Rat Neocortex During Early Ontogeny. In: Proceedings of the 9th Meeting of the International Neurobiology Society, pp 263–267 Progress in Brain Research. Elsevier.

      Levitt P, Moore RY (1979) Development of the noradrenergic innervation of neocortex. Brain Res 162:243–259.

      Manitt C, Mimee A, Eng C, Pokinko M, Stroh T, Cooper HM, Kolb B, Flores C (2011) The Netrin Receptor DCC Is Required in the Pubertal Organization of Mesocortical Dopamine Circuitry. J Neurosci 31:8381–8394.

      Miner LH, Schroeter S, Blakely RD, Sesack SR (2003) Ultrastructural localization of the norepinephrine transporter in superficial and deep layers of the rat prelimbic prefrontal cortex and its spatial relationship to probable dopamine terminals. J Comp Neurol 466:478–494.

      Pickel VM, Joh TH, Field PM, Becker CG, Reis DJ (1975) Cellular localization of tyrosine hydroxylase by immunohistochemistry. J Histochem Cytochem 23:1–12.

      Van Eden CG, Hoorneman EM, Buijs RM, Matthijssen MA, Geffard M, Uylings HBM (1987) Immunocytochemical localization of dopamine in the prefrontal cortex of the rat at the light and electron microscopical level. Neurosci 22:849–862.

      Verney C, Berger B, Adrien J, Vigny A, Gay M (1982) Development of the dopaminergic innervation of the rat cerebral cortex. A light microscopic immunocytochemical study using anti-tyrosine hydroxylase antibodies. Dev Brain Res 5:41–52.

      (10) Are Netrin-1/UNC5C the only signal guiding dopamine axon during adolescence? Are there other neuronal circuits involved in this process?

      Our intention for this study was to examine the role of Netrin-1 and its receptor UNC5C specifically, but we do not suggest that they are the only molecules to play a role. The process of guiding growing dopamine axons during adolescence is likely complex and we expect other guidance mechanisms to also be involved. From our previous work we know that the Netrin-1 receptor DCC is critical in this process (Hoops and Flores, 2017; Reynolds et al., 2023). Several other molecules have been identified in Netrin-1/DCC signaling processes that control corpus callosum development and there is every possibility that the same or similar molecules may be important in guiding dopamine axons (Schlienger et al., 2023).

      References:

      Hoops D, Flores C. 2017. Making Dopamine Connections in Adolescence. Trends in Neurosciences 1–11. doi:10.1016/j.tins.2017.09.004

      Reynolds LM, Hernandez G, MacGowan D, Popescu C, Nouel D, Cuesta S, Burke S, Savell KE, Zhao J, Restrepo-Lozano JM, Giroux M, Israel S, Orsini T, He S, Wodzinski M, Avramescu RG, Pokinko M, Epelbaum JG, Niu Z, Pantoja-Urbán AH, Trudeau L-É, Kolb B, Day JJ, Flores C. 2023. Amphetamine disrupts dopamine axon growth in adolescence by a sex-specific mechanism in mice. Nat Commun 14:4035. doi:10.1038/s41467-023-39665-1

      Schlienger S, Yam PT, Balekoglu N, Ducuing H, Michaud J-F, Makihara S, Kramer DK, Chen B, Fasano A, Berardelli A, Hamdan FF, Rouleau GA, Srour M, Charron F. 2023. Genetics of mirror movements identifies a multifunctional complex required for Netrin-1 guidance and lateralization of motor control. Sci Adv 9:eadd5501. doi:10.1126/sciadv.add5501

      (11) Finally, despite the authors' claim that the dopaminergic axon project is sensitive to the duration of daylight in the hamster, they never provided definitive evidence to support this hypothesis.

      By “definitive evidence” we think that the reviewer is requesting a single statistical model including measures from both the summer and winter groups. Such a model would provide a probability estimate of whether dopamine axon growth is sensitive to daylight duration. Therefore, we ran these models, one for male hamsters and one for female hamsters.

      In both sexes we find a significant effect of daylength on dopamine innervation, interacting with age. Male age by daylength interaction: F = 6.383, p = 0.00242. Female age by daylength interaction: F = 21.872, p = 1.97 × 10⁻⁹. The full statistical analysis is available as a supplement to this letter (Response_Letter_Stats_Details.docx).
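To illustrate what an age by daylength interaction test measures, here is a minimal sketch that computes the interaction F statistic for a balanced two-way ANOVA. It is an assumption that a two-way ANOVA resembles the model reported above; the actual model is specified in the supplementary statistics file. The function name `interaction_f`, the group labels, and all numbers are hypothetical and are not data from the study.

```python
from statistics import mean


def interaction_f(cells):
    """Interaction F statistic for a balanced, fully crossed two-way design.

    `cells` maps (age, daylength) -> list of measurements.
    Returns (F, df_interaction, df_within).
    """
    ages = sorted({age for age, _ in cells})
    days = sorted({day for _, day in cells})
    n = len(next(iter(cells.values())))
    if len(cells) != len(ages) * len(days) or any(len(v) != n for v in cells.values()):
        raise ValueError("a balanced, fully crossed design is required")

    cell_mean = {key: mean(values) for key, values in cells.items()}
    grand = mean(list(cell_mean.values()))
    age_mean = {a: mean([cell_mean[(a, d)] for d in days]) for a in ages}
    day_mean = {d: mean([cell_mean[(a, d)] for a in ages]) for d in days}

    # Interaction: the part of each cell mean not explained by the two main effects.
    ss_interaction = n * sum(
        (cell_mean[(a, d)] - age_mean[a] - day_mean[d] + grand) ** 2
        for a in ages
        for d in days
    )
    # Residual variation of observations around their own cell mean.
    ss_within = sum(
        (x - cell_mean[key]) ** 2 for key, values in cells.items() for x in values
    )
    df_interaction = (len(ages) - 1) * (len(days) - 1)
    df_within = len(ages) * len(days) * (n - 1)
    f_stat = (ss_interaction / df_interaction) / (ss_within / df_within)
    return f_stat, df_interaction, df_within


# Made-up innervation scores (arbitrary units), NOT measurements from the study.
example = {
    ("adolescent", "summer"): [1.0, 3.0],
    ("adult", "summer"): [5.0, 7.0],
    ("adolescent", "winter"): [1.0, 3.0],
    ("adult", "winter"): [1.0, 3.0],
}
f_value, df_int, df_res = interaction_f(example)
print(f"F({df_int}, {df_res}) = {f_value:.3f}")
```

In this toy data set, innervation rises with age only in the summer group, so the age effect depends on daylength; that dependence is what the interaction term captures.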

      Reviewer 3

      (1) Fig 1 A and B don't appear to be the same section level.

      The reviewer is correct that Fig 1B is anterior to Fig 1A. We have changed Figure 1A to match the section level of Figure 1B.

      (2) Fig 1C. It is not clear that these axons are crossing from the shell of the NAC.

      We have added a dashed line to Figure 1C to highlight the boundary of the nucleus accumbens, which hopefully emphasizes that there are fibres crossing the boundary. We also include here an enlarged image of this panel:

      Author response image 6.

      An enlarged image of Figure 1C in the manuscript. The nucleus accumbens (left of the dotted line) is densely packed with TH+ axons (in green). Some of these TH+ axons can be observed extending from the nucleus accumbens medially towards a region containing dorsally oriented TH+ fibres (white arrows).

      (3) Fig 1. Measuring width of the bundle is an odd way to measure DA axon numbers. First the width could be changing during adult for various reasons including change in brain size. Second, I wouldn't consider these axons in a traditional bundle. Third, could DA axon counts be provided, rather than these proxy measures.

      With regard to potential changes in brain size, we agree that this could potentially have explained the increased width of the dopamine axon pathway. That is why it was important for us to use stereology to measure the density of dopamine axons within the pathway. If the width increased but no new axons grew along the pathway, we would have seen a decrease in axon density from adolescence to adulthood. Instead, our results show that the density of axons remained constant.

      We agree with the reviewer that the dopamine axons do not form a traditional “bundle”. Therefore, throughout the manuscript we now avoid using the term bundle.

      Although we cannot count every single axon, an accurate estimate of this number can be obtained using stereology, an unbiased method for efficiently quantifying large numbers of irregularly distributed objects. We used stereology to count TH+ axons in an unbiased subset of the total area occupied by these axons. Unbiased stereology is the gold-standard technique for estimating populations of anatomical objects, such as axons, that are so numerous that it would be impractical or impossible to measure every single one. Here and elsewhere we generally provide results as densities and areas of occupancy (Reynolds et al., 2022). To avoid confusion, we now clarify that we are measuring the width of the area that dopamine axons occupy (rather than the dopamine axon “bundle”).
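The reasoning in this response (a wider pathway with unchanged density implies more axons) can be made concrete with a small worked example. This is only an illustrative sketch: the function `axon_density`, the rectangular sampling region, and every number below are hypothetical and are not values from the study.

```python
def axon_density(axon_count, width_um, height_um=100.0):
    """Axons per square micrometre in a rectangular sampling region."""
    return axon_count / (width_um * height_um)


# Hypothetical adolescent pathway: 200 axons across a 50 µm wide region.
adolescent = axon_density(axon_count=200, width_um=50.0)

# Scenario A: the pathway doubles in width (e.g. brain growth) with no new axons.
adult_no_growth = axon_density(axon_count=200, width_um=100.0)

# Scenario B: the pathway doubles in width and the number of axons doubles too.
adult_with_growth = axon_density(axon_count=400, width_um=100.0)

print(adolescent, adult_no_growth, adult_with_growth)
```

Only scenario B keeps density constant while width increases, which is the pattern that a combined width and density measurement is meant to distinguish from simple expansion of the tissue.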

      References:

      Reynolds LM, Pantoja-Urbán AH, MacGowan D, Manitt C, Nouel D, Flores C. 2022. Dopaminergic System Function and Dysfunction: Experimental Approaches. Neuromethods 31–63. doi:10.1007/978-1-0716-2799-0_2

      (4) TH in the cortex could also be of noradrenergic origin. This needs to be ruled out to score DA axons

      This is the same comment as Reviewer 1 #9. Please see our response below, which we have also added to our methods:

      In this study we pay great attention to the morphology and localization of the fibres from which we quantify varicosities to avoid counting any fibres stained with TH antibodies that are not dopamine fibres. The fibres that we examine and that are labelled by the TH antibody show features indistinguishable from the classic features of cortical dopamine axons in rodents (Berger et al., 1974; 1983; Van Eden et al., 1987; Manitt et al., 2011), namely they are thin fibres with irregularly-spaced varicosities, are densely packed in the nucleus accumbens, sparsely present only in the deep layers of the prefrontal cortex, and are not regularly oriented in relation to the pial surface. This is in contrast to rodent norepinephrine fibres, which are smooth or beaded in appearance, relatively thick with regularly spaced varicosities, increase in density towards the shallow cortical layers, and are in large part oriented either parallel or perpendicular to the pial surface (Berger et al., 1974; Levitt and Moore, 1979; Berger et al., 1983; Miner et al., 2003). Furthermore, previous studies in rodents have noted that only norepinephrine cell bodies are detectable using immunofluorescence for TH, not norepinephrine processes (Pickel et al., 1975; Verney et al., 1982; Miner et al., 2003), and we did not observe any norepinephrine-like fibres.

      References:

      Berger B, Tassin JP, Blanc G, Moyne MA, Thierry AM (1974) Histochemical confirmation for dopaminergic innervation of the rat cerebral cortex after destruction of the noradrenergic ascending pathways. Brain Res 81:332–337.

      Berger B, Verney C, Gay M, Vigny A (1983) Immunocytochemical Characterization of the Dopaminergic and Noradrenergic Innervation of the Rat Neocortex During Early Ontogeny. In: Proceedings of the 9th Meeting of the International Neurobiology Society, pp 263–267 Progress in Brain Research. Elsevier.

      Levitt P, Moore RY (1979) Development of the noradrenergic innervation of neocortex. Brain Res 162:243–259.

      Manitt C, Mimee A, Eng C, Pokinko M, Stroh T, Cooper HM, Kolb B, Flores C (2011) The Netrin Receptor DCC Is Required in the Pubertal Organization of Mesocortical Dopamine Circuitry. J Neurosci 31:8381–8394.

      Miner LH, Schroeter S, Blakely RD, Sesack SR (2003) Ultrastructural localization of the norepinephrine transporter in superficial and deep layers of the rat prelimbic prefrontal cortex and its spatial relationship to probable dopamine terminals. J Comp Neurol 466:478–494.

      Pickel VM, Joh TH, Field PM, Becker CG, Reis DJ (1975) Cellular localization of tyrosine hydroxylase by immunohistochemistry. J Histochem Cytochem 23:1–12.

      Van Eden CG, Hoorneman EM, Buijs RM, Matthijssen MA, Geffard M, Uylings HBM (1987) Immunocytochemical localization of dopamine in the prefrontal cortex of the rat at the light and electron microscopical level. Neurosci 22:849–862.

      Verney C, Berger B, Adrien J, Vigny A, Gay M (1982) Development of the dopaminergic innervation of the rat cerebral cortex. A light microscopic immunocytochemical study using anti-tyrosine hydroxylase antibodies. Dev Brain Res 5:41–52.

      (5) Netrin staining should be provided with NeuN + DAPI; its not clear these are all cell bodies. An in situ of Netrin would help as well.

      A similar comment was raised by Reviewer 1 in point #1. Please see below the immunofluorescent and RNAscope images showing expression of Netrin-1 protein and mRNA in the forebrain.

      Author response image 7.

      This confocal microscope image shows immunofluorescent staining for Netrin-1 (green) localized around cell nuclei (stained by DAPI in blue). This image was taken from a coronal section of the lateral septum of an adult male mouse. Scale bar = 20 µm.

      Author response image 8.

      This confocal microscope image of a coronal brain section of the medial prefrontal cortex of an adult male mouse shows Netrin-1 mRNA expression (green) and cell nuclei (DAPI, blue). RNAscope was used to generate this image. Brain regions are as follows: Cg1: Anterior cingulate cortex 1, DP: dorsopeduncular cortex, IL: Infralimbic Cortex, PrL: Prelimbic Cortex, fmi: forceps minor of the corpus callosum

      Author response image 9.

      A higher resolution image from the same sample as in Figure 2 shows Netrin-1 mRNA (green) and cell nuclei (DAPI; blue). DP = dorsopeduncular cortex

      (6) The Netrin knockdown needs validation. How strong was the knockdown etc?

      This comment was also raised by Reviewer 1 #1.

      We have previously established the efficacy of the shRNA Netrin-1 knockdown virus used in this experiment for reducing the expression of Netrin-1 (Cuesta et al., 2020). The shRNA reduces Netrin-1 levels in vitro and in vivo.

      References:

      Cuesta S, Nouel D, Reynolds LM, Morgunova A, Torres-Berrío A, White A, Hernandez G, Cooper HM, Flores C. 2020. Dopamine Axon Targeting in the Nucleus Accumbens in Adolescence Requires Netrin-1. Frontiers Cell Dev Biology 8:487. doi:10.3389/fcell.2020.00487

      (7) If the conclusion that knocking down Netrin in cortex decreases DA innervation of the IL, how can that be reconciled with Netrin-Unc repulsion.

      This is an intriguing question and one that we are in the planning stages of addressing with new experiments.

      Although we do not have a mechanistic answer for how a repulsive receptor helps guide these axons, we would like to note that previous indirect evidence from a study by our group also suggests that reducing UNC5c signaling in dopamine axons in adolescence increases dopamine innervation to the prefrontal cortex (Auger et al., 2013).

      References

      Auger ML, Schmidt ERE, Manitt C, Dal-Bo G, Pasterkamp RJ, Flores C. 2013. unc5c haploinsufficient phenotype: striking similarities with the dcc haploinsufficiency model. European Journal of Neuroscience 38:2853–2863. doi:10.1111/ejn.12270

      (8) The behavioral phenotype in Fig 1 is interesting, but its not clear if its related to DA axons/signaling. IN general, no evidence in this paper is provided for the role of DA in the adolescent behaviors described.

      We agree with the reviewer that the behaviours we describe in adult mice are complex and are likely to involve several neurotransmitter systems. However, there is ample evidence for the role of dopamine signaling in cognitive control behaviours (Bari and Robbins, 2013; Eagle et al., 2008; Ott et al., 2023) and our published work has shown that alterations in the growth of dopamine axons to the prefrontal cortex leads to changes in impulse control as measured via the Go/No-Go task in adulthood (Reynolds et al., 2023, 2018a; Vassilev et al., 2021).

      The other adolescent behaviour we examined was risk-like taking behaviour in male and female hamsters (Figures 4 and 5), as a means of characterizing maturation in this behaviour over time. We decided not to use the Go/No-Go task because, as far as we know, it has never been employed in Siberian hamsters and would be difficult to implement. Instead, we chose the light/dark box paradigm, which requires no training and is ideal for charting behavioural changes over short time periods. Indeed, risk-like taking behaviour in rodents and in humans changes from adolescence to adulthood, paralleling changes in prefrontal cortex development, including the gradual input of dopamine axons to this region.

      References:

      Bari A, Robbins TW. 2013. Inhibition and impulsivity: Behavioral and neural basis of response control. Progress in neurobiology 108:44–79. doi:10.1016/j.pneurobio.2013.06.005

      Eagle DM, Bari A, Robbins TW. 2008. The neuropsychopharmacology of action inhibition: cross-species translation of the stop-signal and go/no-go tasks. Psychopharmacology 199:439–456. doi:10.1007/s00213-008-1127-6

      Ott T, Stein AM, Nieder A. 2023. Dopamine receptor activation regulates reward expectancy signals during cognitive control in primate prefrontal neurons. Nat Commun 14:7537. doi:10.1038/s41467-023-43271-6

      Reynolds LM, Hernandez G, MacGowan D, Popescu C, Nouel D, Cuesta S, Burke S, Savell KE, Zhao J, Restrepo-Lozano JM, Giroux M, Israel S, Orsini T, He S, Wodzinski M, Avramescu RG, Pokinko M, Epelbaum JG, Niu Z, Pantoja-Urbán AH, Trudeau L-É, Kolb B, Day JJ, Flores C. 2023. Amphetamine disrupts dopamine axon growth in adolescence by a sex-specific mechanism in mice. Nat Commun 14:4035. doi:10.1038/s41467-023-39665-1

      Reynolds LM, Pokinko M, Torres-Berrío A, Cuesta S, Lambert LC, Pellitero EDC, Wodzinski M, Manitt C, Krimpenfort P, Kolb B, Flores C. 2018a. DCC Receptors Drive Prefrontal Cortex Maturation by Determining Dopamine Axon Targeting in Adolescence. Biological psychiatry 83:181–192. doi:10.1016/j.biopsych.2017.06.009

      Vassilev P, Pantoja-Urban AH, Giroux M, Nouel D, Hernandez G, Orsini T, Flores C. 2021. Unique effects of social defeat stress in adolescent male mice on the Netrin-1/DCC pathway, prefrontal cortex dopamine and cognition (Social stress in adolescent vs. adult male mice). Eneuro ENEURO.0045-21.2021. doi:10.1523/eneuro.0045-21.2021

      (9) Fig2 - boxes should be drawn on the NAc diagram to indicate sampled regions. Some quantification of Unc5c would be useful. Also, some validation of the Unc5c antibody would be nice.

      The images presented were taken medial to the anterior commissure and we have edited Figure 2 to show this. However, we did not notice any intra-accumbens variation, including between the core and the shell. Therefore, the images are representative of what was observed throughout the entire nucleus accumbens.

      To quantify UNC5c in the accumbens we conducted a Western blot experiment in male mice at different ages. A one-way ANOVA analyzing band intensity (relative to the 15-day-old average band intensity) as the response variable and age as the predictor variable showed a significant effect of age (F=5.615, p=0.01). Posthoc analysis revealed that 15-day-old mice have less UNC5c in the nucleus accumbens compared to 21- and 35-day-old mice.

      Author response image 10.

      The graph depicts the results of a Western blot experiment of UNC5c protein levels in the nucleus accumbens of male mice at postnatal days 15, 21 or 35 and reveals a significant increase in protein levels at the onset of adolescence.

      Our methods for this Western blot were as follows: Samples were prepared as previously described (Torres-Berrío et al., 2017). Briefly, mice were sacrificed by live decapitation and brains were flash frozen in heptane on dry ice for 10 seconds. Frozen brains were mounted in a cryomicrotome and two 500 µm sections were collected for the nucleus accumbens, corresponding to plates 14 and 18 of the Paxinos mouse brain atlas. Two tissue core samples were collected per section, one for each side of the brain, using a 15-gauge tissue corer (Fine surgical tools Cat no. NC9128328) and ejected in a microtube on dry ice. The tissue samples were homogenized in 100 µl of standard radioimmunoprecipitation assay buffer using a handheld electric tissue homogenizer. The samples were clarified by centrifugation at 4 °C at a speed of 15,000 g for 30 minutes. Protein concentration was quantified using a bicinchoninic acid assay kit (Pierce BCA protein assay kit, Cat no. PI23225), and samples were denatured with standard Laemmli buffer for 5 minutes at 70 °C. For each sample, 10 µg of protein was loaded and run by SDS-PAGE gel electrophoresis in a Mini-PROTEAN system (Bio-Rad) on an 8% acrylamide gel by stacking for 30 minutes at 60 V and resolving for 1.5 hours at 130 V. The proteins were transferred to a nitrocellulose membrane for 1 hour at 100 V in standard transfer buffer on ice. The membranes were blocked using 5% bovine serum albumin dissolved in tris-buffered saline with Tween 20 and probed with primary (UNC5c, Abcam Cat. no ab302924) and HRP-conjugated secondary antibodies for 1 hour. α-Tubulin was probed and used as loading control. The probed membranes were resolved using SuperSignal West Pico PLUS chemiluminescent substrate (ThermoFisher Cat no. 34579) in a ChemiDoc MP Imaging system (Bio-Rad). Band intensity was quantified using the ChemiDoc software and all ages were normalized to the P15 age group average.
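The normalization to the P15 average and the one-way ANOVA on age described above can be sketched as follows. This is a minimal illustration in plain Python rather than the analysis pipeline actually used; the helper names `normalize_to_reference` and `one_way_anova_f` are hypothetical, and the band intensities are made-up numbers, not the Western blot data.

```python
from statistics import mean


def normalize_to_reference(groups, reference):
    """Express every value relative to the mean of the reference group."""
    ref_mean = mean(groups[reference])
    return {name: [v / ref_mean for v in values] for name, values in groups.items()}


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a dict of group -> list of values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2 for values in groups.values()
    )
    # Variation of individual samples around their own group mean.
    ss_within = sum(
        (v - mean(values)) ** 2 for values in groups.values() for v in values
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up band intensities (arbitrary units), NOT the data from the experiment.
raw = {
    "P15": [1.0, 2.0, 3.0],
    "P21": [2.0, 3.0, 4.0],
    "P35": [3.0, 4.0, 5.0],
}
normalized = normalize_to_reference(raw, "P15")
f_stat, df_between, df_within = one_way_anova_f(normalized)
print(f"F({df_between}, {df_within}) = {f_stat:.3f}")
```

Dividing every value by the same reference mean rescales the data without changing the F statistic, so the normalization affects how band intensities are displayed (P15 averages to 1) but not the outcome of the test for an age effect.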

      Validation of the UNC5c antibody was performed in the lab of Dr. Liu, from whom it was kindly provided. Briefly, in the validation study the authors showed that the anti-UNC5C antibody can detect endogenous UNC5C expression and the level of UNC5C is dramatically reduced after UNC5C knockdown. The antibody can also detect the tagged-UNC5C protein in several cell lines, which was confirmed by a tag antibody (Purohit et al., 2012; Shao et al., 2017).

      References:

      Purohit AA, Li W, Qu C, Dwyer T, Shao Q, Guan K-L, Liu G. 2012. Down Syndrome Cell Adhesion Molecule (DSCAM) Associates with Uncoordinated-5C (UNC5C) in Netrin-1-mediated Growth Cone Collapse. The Journal of biological chemistry 287:27126–27138. doi:10.1074/jbc.m112.340174

      Shao Q, Yang T, Huang H, Alarmanazi F, Liu G. 2017. Uncoupling of UNC5C with Polymerized TUBB3 in Microtubules Mediates Netrin-1 Repulsion. J Neurosci 37:5620–5633. doi:10.1523/jneurosci.2617-16.2017

      (10) "In adolescence, dopamine neurons begin to express the repulsive Netrin-1 receptor UNC5C, and reduction in UNC5C expression appears to cause growth of mesolimbic dopamine axons to the prefrontal cortex"... This is confusing. Figure 2 shows a developmental increase in Unc5c, not a decrease. So when is the "reduction in Unc5c expression" occurring?

      We apologize for the mistake in this sentence. We have corrected the relevant passage in our manuscript as follows:

      In adolescence, dopamine neurons begin to express the repulsive Netrin-1 receptor UNC5C, particularly when mesolimbic and mesocortical dopamine projections segregate in the nucleus accumbens (Manitt et al., 2010; Reynolds et al., 2018a). In contrast, dopamine axons in the prefrontal cortex do not express UNC5c except in very rare cases (Supplementary Figure 4). In adult male mice with Unc5c haploinsufficiency, there appears to be ectopic growth of mesolimbic dopamine axons to the prefrontal cortex (Auger et al., 2013). This miswiring is associated with alterations in prefrontal cortex-dependent behaviours (Auger et al., 2013).

      References:

      Auger ML, Schmidt ERE, Manitt C, Dal-Bo G, Pasterkamp RJ, Flores C. 2013. unc5c haploinsufficient phenotype: striking similarities with the dcc haploinsufficiency model. European Journal of Neuroscience 38:2853–2863. doi:10.1111/ejn.12270

      Manitt C, Labelle-Dumais C, Eng C, Grant A, Mimee A, Stroh T, Flores C. 2010. Peri-Pubertal Emergence of UNC-5 Homologue Expression by Dopamine Neurons in Rodents. PLoS ONE 5:e11463-14. doi:10.1371/journal.pone.0011463

      Reynolds LM, Pokinko M, Torres-Berrío A, Cuesta S, Lambert LC, Pellitero EDC, Wodzinski M, Manitt C, Krimpenfort P, Kolb B, Flores C. 2018a. DCC Receptors Drive Prefrontal Cortex Maturation by Determining Dopamine Axon Targeting in Adolescence. Biological psychiatry 83:181–192. doi:10.1016/j.biopsych.2017.06.009

      (11) In Fig 3, a statistical comparison should be made between summer male and winter male, to justify the conclusions that the winter males have delayed DA innervation.

      This analysis was also suggested by Reviewer 1, #11. Here is our response:

      We analyzed the summer and winter data together in ANOVAs separately for males and females. In both sexes we find a significant effect of daylength on dopamine innervation, interacting with age. Male age by daylength interaction: F = 6.383, p = 0.00242. Female age by daylength interaction: F = 21.872, p = 1.97 × 10⁻⁹. The full statistical analysis is available as a supplement to this letter (Response_Letter_Stats_Details.docx).
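
      For readers less familiar with interaction terms, the sketch below shows how an age by daylength interaction F statistic is computed in the simplest case. It assumes a balanced design with equal replicates per cell, which the letter does not state; the function name, group labels, and numbers are hypothetical, and the analysis actually run is the one in the supplement.

```python
from statistics import mean


def interaction_f(cells):
    """F statistic for the A x B interaction in a balanced two-way ANOVA.

    cells: dict mapping (level_a, level_b) -> list of replicate values,
    with the same number of replicates (n >= 2) in every cell.
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))

    cell_mean = {key: mean(values) for key, values in cells.items()}
    grand = mean(list(cell_mean.values()))
    mean_a = {a: mean(cell_mean[(a, b)] for b in levels_b) for a in levels_a}
    mean_b = {b: mean(cell_mean[(a, b)] for a in levels_a) for b in levels_b}

    # Interaction: what remains of each cell mean after removing both main effects.
    ss_interaction = n * sum(
        (cell_mean[(a, b)] - mean_a[a] - mean_b[b] + grand) ** 2
        for a in levels_a
        for b in levels_b
    )
    # Within-cell (error) variation around each cell mean.
    ss_within = sum(
        (x - cell_mean[key]) ** 2 for key, values in cells.items() for x in values
    )
    df_interaction = (len(levels_a) - 1) * (len(levels_b) - 1)
    df_within = len(levels_a) * len(levels_b) * (n - 1)
    return (ss_interaction / df_interaction) / (ss_within / df_within)


# Hypothetical values, for illustration only (not data from the study).
additive = {
    ("early", "summer"): [1.0, 3.0],
    ("early", "winter"): [2.0, 4.0],
    ("late", "summer"): [4.0, 6.0],
    ("late", "winter"): [5.0, 7.0],
}
# Same data, except that the daylength effect is larger at the later age.
interacting = dict(additive)
interacting[("late", "winter")] = [9.0, 11.0]
```

      In the additive example the daylength effect is the same at both ages, so the interaction F is zero; in the second example it grows with age, which is the pattern a significant age by daylength interaction indicates.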

      (12) Should axon length also be measured here (Fig 3)? It is not clear why the authors have switched to varicosity density. Also, a box should be drawn in the NAC cartoon to indicate the region that was sampled.

      It is untenable to quantify axon length in the prefrontal cortex as we cannot distinguish independent axons. Rather, they are “tangled”; they twist and turn in a multitude of directions as they make contact with various dendrites. Furthermore, they branch extensively. It would therefore be impossible to accurately quantify the number of axons. Using unbiased stereology to quantify varicosities is a valid, well-characterized and straightforward alternative (Reynolds et al., 2022).

      References:

      Reynolds LM, Pantoja-Urbán AH, MacGowan D, Manitt C, Nouel D, Flores C. 2022. Dopaminergic System Function and Dysfunction: Experimental Approaches. Neuromethods 31–63. doi:10.1007/978-1-0716-2799-0_2

      (13) In Fig 3, Unc5c should be quantified to bolster the interesting finding that Unc5c expression dynamics are different between summer and winter hamsters. Unc5c mRNA experiments would also be important to see if similar changes are observed at the transcript level.

      We agree that it would be very interesting to see how UNC5c mRNA and protein levels change over time in summer and winter hamsters, both in males, as the reviewer suggests here, and in females. We are working on conducting these experiments in hamsters as part of a broader expansion of our research in this area. These experiments will require a lengthy amount of time and at this point we feel that they are beyond the scope of this manuscript.

      (14) Fig 4. The peak in exploratory behavior in winter females is counterintuitive and needs to be better discussed. In general, the light/dark behavior seems quite variable.

      This is indeed a very interesting finding, which we have expanded upon in our manuscript as follows:

      When raised under a winter-mimicking daylength, hamsters of either sex show a protracted peak in risk taking. In males, it is delayed beyond 80 days old, but the delay is substantially less in females. This is a counterintuitive finding considering that dopamine development in winter females appears to be accelerated. Our interpretation of this finding is that the timing of the risk-taking peak in females may reflect a balance between different adolescent developmental processes. The fact that dopamine axon growth is accelerated does not imply that all adolescent maturational processes are accelerated. Some may be delayed, for example those that induce axon pruning in the cortex. The timing of the risk-taking peak in winter female hamsters may therefore reflect the amalgamation of developmental processes that are advanced with those that are delayed – producing a behavioural effect that is timed somewhere in the middle. Disentangling the effects of different developmental processes on behaviour will require further experiments in hamsters, including the direct manipulation of dopamine activity in the nucleus accumbens and prefrontal cortex.

      Full Reference List

      Auger ML, Schmidt ERE, Manitt C, Dal-Bo G, Pasterkamp RJ, Flores C. 2013. unc5c haploinsufficient phenotype: striking similarities with the dcc haploinsufficiency model. European Journal of Neuroscience 38:2853–2863. doi:10.1111/ejn.12270

      Bari A, Robbins TW. 2013. Inhibition and impulsivity: Behavioral and neural basis of response control. Progress in neurobiology 108:44–79. doi:10.1016/j.pneurobio.2013.06.005

      Cuesta S, Nouel D, Reynolds LM, Morgunova A, Torres-Berrío A, White A, Hernandez G, Cooper HM, Flores C. 2020. Dopamine Axon Targeting in the Nucleus Accumbens in Adolescence Requires Netrin-1. Frontiers Cell Dev Biology 8:487. doi:10.3389/fcell.2020.00487

      Daubaras M, Bo GD, Flores C. 2014. Target-dependent expression of the netrin-1 receptor, UNC5C, in projection neurons of the ventral tegmental area. Neuroscience 260:36–46. doi:10.1016/j.neuroscience.2013.12.007

      Eagle DM, Bari A, Robbins TW. 2008. The neuropsychopharmacology of action inhibition: cross-species translation of the stop-signal and go/no-go tasks. Psychopharmacology 199:439–456. doi:10.1007/s00213-008-1127-6

      Hoops D, Flores C. 2017. Making Dopamine Connections in Adolescence. Trends in Neurosciences 1–11. doi:10.1016/j.tins.2017.09.004

      Jonker FA, Jonker C, Scheltens P, Scherder EJA. 2015. The role of the orbitofrontal cortex in cognition and behavior. Rev Neurosci 26:1–11. doi:10.1515/revneuro-2014-0043

      Kim B, Im H. 2019. The role of the dorsal striatum in choice impulsivity. Ann N York Acad Sci 1451:92–111. doi:10.1111/nyas.13961

      Kim D, Ackerman SL. 2011. The UNC5C Netrin Receptor Regulates Dorsal Guidance of Mouse Hindbrain Axons. J Neurosci 31:2167–2179. doi:10.1523/jneurosci.5254-10.2011

      Manitt C, Labelle-Dumais C, Eng C, Grant A, Mimee A, Stroh T, Flores C. 2010. Peri-Pubertal Emergence of UNC-5 Homologue Expression by Dopamine Neurons in Rodents. PLoS ONE 5:e11463-14. doi:10.1371/journal.pone.0011463

      Murcia-Belmonte V, Coca Y, Vegar C, Negueruela S, Romero C de J, Valiño AJ, Sala S, DaSilva R, Kania A, Borrell V, Martinez LM, Erskine L, Herrera E. 2019. A Retino-retinal Projection Guided by Unc5c Emerged in Species with Retinal Waves. Current Biology 29:1149-1160.e4. doi:10.1016/j.cub.2019.02.052

      Ott T, Stein AM, Nieder A. 2023. Dopamine receptor activation regulates reward expectancy signals during cognitive control in primate prefrontal neurons. Nat Commun 14:7537. doi:10.1038/s41467-023-43271-6

      Phillips RA, Tuscher JJ, Black SL, Andraka E, Fitzgerald ND, Ianov L, Day JJ. 2022. An atlas of transcriptionally defined cell populations in the rat ventral tegmental area. Cell Reports 39:110616. doi:10.1016/j.celrep.2022.110616

      Purohit AA, Li W, Qu C, Dwyer T, Shao Q, Guan K-L, Liu G. 2012. Down Syndrome Cell Adhesion Molecule (DSCAM) Associates with Uncoordinated-5C (UNC5C) in Netrin-1-mediated Growth Cone Collapse. The Journal of biological chemistry 287:27126–27138. doi:10.1074/jbc.m112.340174

      Reynolds LM, Hernandez G, MacGowan D, Popescu C, Nouel D, Cuesta S, Burke S, Savell KE, Zhao J, Restrepo-Lozano JM, Giroux M, Israel S, Orsini T, He S, Wodzinski M, Avramescu RG, Pokinko M, Epelbaum JG, Niu Z, Pantoja-Urbán AH, Trudeau L-É, Kolb B, Day JJ, Flores C. 2023. Amphetamine disrupts dopamine axon growth in adolescence by a sex-specific mechanism in mice. Nat Commun 14:4035. doi:10.1038/s41467-023-39665-1

      Reynolds LM, Pantoja-Urbán AH, MacGowan D, Manitt C, Nouel D, Flores C. 2022. Dopaminergic System Function and Dysfunction: Experimental Approaches. Neuromethods 31–63. doi:10.1007/978-1-0716-2799-0_2

      Reynolds LM, Pokinko M, Torres-Berrío A, Cuesta S, Lambert LC, Pellitero EDC, Wodzinski M, Manitt C, Krimpenfort P, Kolb B, Flores C. 2018a. DCC Receptors Drive Prefrontal Cortex Maturation by Determining Dopamine Axon Targeting in Adolescence. Biological psychiatry 83:181–192. doi:10.1016/j.biopsych.2017.06.009

      Reynolds LM, Yetnikoff L, Pokinko M, Wodzinski M, Epelbaum JG, Lambert LC, Cossette M-P, Arvanitogiannis A, Flores C. 2018b. Early Adolescence is a Critical Period for the Maturation of Inhibitory Behavior. Cerebral cortex 29:3676–3686. doi:10.1093/cercor/bhy247

      Schlienger S, Yam PT, Balekoglu N, Ducuing H, Michaud J-F, Makihara S, Kramer DK, Chen B, Fasano A, Berardelli A, Hamdan FF, Rouleau GA, Srour M, Charron F. 2023. Genetics of mirror movements identifies a multifunctional complex required for Netrin-1 guidance and lateralization of motor control. Sci Adv 9:eadd5501. doi:10.1126/sciadv.add5501

      Shao Q, Yang T, Huang H, Alarmanazi F, Liu G. 2017. Uncoupling of UNC5C with Polymerized TUBB3 in Microtubules Mediates Netrin-1 Repulsion. J Neurosci 37:5620–5633. doi:10.1523/jneurosci.2617-16.2017

      Srivatsa S, Parthasarathy S, Britanova O, Bormuth I, Donahoo A-L, Ackerman SL, Richards LJ, Tarabykin V. 2014. Unc5C and DCC act downstream of Ctip2 and Satb2 and contribute to corpus callosum formation. Nat Commun 5:3708. doi:10.1038/ncomms4708

      Torres-Berrío A, Lopez JP, Bagot RC, Nouel D, Dal-Bo G, Cuesta S, Zhu L, Manitt C, Eng C, Cooper HM, Storch K-F, Turecki G, Nestler EJ, Flores C. 2017. DCC Confers Susceptibility to Depression-like Behaviors in Humans and Mice and Is Regulated by miR-218. Biological psychiatry 81:306–315. doi:10.1016/j.biopsych.2016.08.017

      Vassilev P, Pantoja-Urban AH, Giroux M, Nouel D, Hernandez G, Orsini T, Flores C. 2021. Unique effects of social defeat stress in adolescent male mice on the Netrin-1/DCC pathway, prefrontal cortex dopamine and cognition (Social stress in adolescent vs. adult male mice). Eneuro ENEURO.0045-21.2021. doi:10.1523/eneuro.0045-21.2021

      Private Comments

      Reviewer #1

      (12) The language should be improved. Some expressions are confusing (lines 178-179). There are also some spelling errors (e.g. Figure 1M).

      We have removed the word “Already” to make the sentence in lines 178-179 clearer; however, we cannot find a spelling error in Figure 1M or its caption. We have further edited the manuscript for clarity and flow.

      Reviewer #2

      (1) The authors claim to have revealed how the 'timing of adolescence is programmed in the brain'. While their findings certainly shed light on molecular, circuit and behavioral processes that are unique to adolescence, their claim may be an overstatement. I suggest they refine this statement to discuss more specifically the processes they observed in the brain and animal behavior, rather than adolescence itself.

      We agree with the reviewer and have revised the manuscript to specify that we are referring to the timing of specific developmental processes that occur in the adolescent brain, not adolescence overall.

      (2) Along the same lines, the authors should also include a more substantive discussion of how they selected their ages for investigation (for both mice and hamsters). For mice, their definition of adolescence (P21) is earlier than some (e.g. Spear L.P., Neurosci. and Beh. Reviews, 2000).

      There are certainly differences of opinion between researchers as to the precise definition of adolescence and the period it encompasses. Spear, 2000, provides one excellent discussion of the challenges related to identifying adolescence across species. This work gives specific ages only for rats, not mice (as we use here), and characterizes post-natal days 28-42 as being the conservative age range of “peak” adolescence (page 419, paragraph 1). Immediately thereafter the review states that the full adolescent period is longer than this, and it could encompass post-natal days 20-55 (page 419, paragraph 2).

      We have added the following statement to our methods:

      There is no universally accepted way to define the precise onset of adolescence. Therefore, there is no clear-cut boundary to define adolescent onset in rodents (Spear, 2000). Puberty can be more sharply defined, and puberty and adolescence overlap in time, but the terms are not interchangeable. Puberty is the onset of sexual maturation, while adolescence is a more diffuse period marked by the gradual transition from a juvenile state to independence. We, and others, suggest that adolescence in rodents spans from weaning (postnatal day 21) until adulthood, which we take to start on postnatal day 60 (Reynolds and Flores, 2021). We refer to “early adolescence” as the first two weeks postweaning (postnatal days 21-34). These ranges encompass discrete DA developmental periods (Kalsbeek et al., 1988; Manitt et al., 2011; Reynolds et al., 2018a), vulnerability to drug effects on DA circuitry (Hammerslag and Gulley, 2014; Reynolds et al., 2018a), and distinct behavioral characteristics (Adriani and Laviola, 2004; Makinodan et al., 2012; Schneider, 2013; Wheeler et al., 2013).

      References:

      Adriani W, Laviola G. 2004. Windows of vulnerability to psychopathology and therapeutic strategy in the adolescent rodent model. Behav Pharmacol 15:341–352. doi:10.1097/00008877-200409000-00005

      Hammerslag LR, Gulley JM. 2014. Age and sex differences in reward behavior in adolescent and adult rats. Dev Psychobiol 56:611–621. doi:10.1002/dev.21127

      Hoops D, Flores C. 2017. Making Dopamine Connections in Adolescence. Trends in Neurosciences 1–11. doi:10.1016/j.tins.2017.09.004

      Kalsbeek A, Voorn P, Buijs RM, Pool CW, Uylings HBM. 1988. Development of the Dopaminergic Innervation in the Prefrontal Cortex of the Rat. The Journal of Comparative Neurology 269:58–72. doi:10.1002/cne.902690105

      Makinodan M, Rosen KM, Ito S, Corfas G. 2012. A critical period for social experience-dependent oligodendrocyte maturation and myelination. Science 337:1357–1360. doi:10.1126/science.1220845

      Manitt C, Mimee A, Eng C, Pokinko M, Stroh T, Cooper HM, Kolb B, Flores C. 2011. The Netrin Receptor DCC Is Required in the Pubertal Organization of Mesocortical Dopamine Circuitry. J Neurosci 31:8381–8394. doi:10.1523/jneurosci.0606-11.2011

      Reynolds LM, Flores C. 2021. Mesocorticolimbic Dopamine Pathways Across Adolescence: Diversity in Development. Front Neural Circuit 15:735625. doi:10.3389/fncir.2021.735625

      Reynolds LM, Yetnikoff L, Pokinko M, Wodzinski M, Epelbaum JG, Lambert LC, Cossette M-P, Arvanitogiannis A, Flores C. 2018. Early Adolescence is a Critical Period for the Maturation of Inhibitory Behavior. Cerebral cortex 29:3676–3686. doi:10.1093/cercor/bhy247

      Schneider M. 2013. Adolescence as a vulnerable period to alter rodent behavior. Cell and tissue research 354:99–106. doi:10.1007/s00441-013-1581-2

      Spear LP. 2000. Neurobehavioral Changes in Adolescence. Current directions in psychological science 9:111–114. doi:10.1111/1467-8721.00072

      Wheeler AL, Lerch JP, Chakravarty MM, Friedel M, Sled JG, Fletcher PJ, Josselyn SA, Frankland PW. 2013. Adolescent Cocaine Exposure Causes Enduring Macroscale Changes in Mouse Brain Structure. J Neurosci 33:1797–1803. doi:10.1523/jneurosci.3830-12.2013

      (3) Figure 1 - the conclusions hinge on the Netrin-1 staining, as shown in panel G, but the cells are difficult to see. It would be helpful to provide clearer, more zoomed images so readers can better assess the staining. Since Netrin-1 expression reduces dramatically after P4 and they had to use antigen retrieval to see signal, it would be helpful to show some images from additional brain regions and ages to see if expression levels follow predicted patterns. For instance, based on the allen brain atlas, it seems that around P21, there should be high levels of Netrin-1 in the cerebellum, but low levels in the cortex. These would be nice controls to demonstrate the specificity and sensitivity of the antibody in older tissue.

      We do not study the cerebellum and have never stained this region; doing so now would require generating additional tissue, and we are not sure it would add enough to the information provided to be worthwhile. Note that we have stained the forebrain for Netrin-1 previously, providing broad staining of many brain regions (Manitt et al., 2011).

      References:

      Manitt C, Mimee A, Eng C, Pokinko M, Stroh T, Cooper HM, Kolb B, Flores C. 2011. The Netrin Receptor DCC Is Required in the Pubertal Organization of Mesocortical Dopamine Circuitry. J Neurosci 31:8381–8394. doi:10.1523/jneurosci.0606-11.2011

      (4) Figure 3 - Because mice tend to avoid brightly-lit spaces, the light/dark box is more commonly used as a measure of anxiety-like behavior than purely exploratory behavior (including in the paper they cited). It is important to address this possibility in their discussion of their findings. To bolster their conclusions about the coincidence of circuit and behavioral changes in adolescent hamsters, it would be useful to add an additional measure of exploratory behaviors (e.g. hole board).

      Regarding the light/dark box test, this is an excellent point. We prefer the term “risk taking” to “anxiety-like” and now use the former term in our manuscript. Furthermore, our interest in the behaviour is purely to chart the development of adolescent behaviour across our treatment groups, not to study a particular emotional state. Regardless of the specific emotion or emotions governing the light/dark box behaviour, it is an ideal test for charting adolescent shifts in behaviour as it is well-characterized in this respect, as we discuss in our manuscript.

      (5) Supplementary Figures 4 and 5: The authors defined puberty onset using uterine and testes weights in hamsters. While the weights appear to be different for summer and winter hamsters, there was no statistical comparison. Please add statistical analyses to bolster claims about puberty start times. Also, as many studies use vaginal opening to define puberty onset, it would be helpful to discuss how these measurements typically align and cite relevant literature that described use of uterine weights. Also, Supplementary Figures 4 and 5 were mis-cited as Supp. Fig. 2 in the text (e.g. line 317 and others).

      These are great suggestions. We have added statistical analyses to Supplementary Figures 5 and 6 and provided Vaginal Opening data as Supplementary Figure 7. The statistical analyses confirm that all three characters are delayed in winter hamsters compared to summer hamsters.

      We have also added the following references to the manuscript:

      Darrow JM, Davis FC, Elliott JA, Stetson MH, Turek FW, Menaker M. 1980. Influence of Photoperiod on Reproductive Development in the Golden Hamster. Biol Reprod 22:443–450. doi:10.1095/biolreprod22.3.443

      Ebling FJP. 1994. Photoperiodic Differences during Development in the Dwarf Hamsters Phodopus sungorus and Phodopus campbelli. Gen Comp Endocrinol 95:475–482. doi:10.1006/gcen.1994.1147

      Timonin ME, Place NJ, Wanderi E, Wynne-Edwards KE. 2006. Phodopus campbelli detect reduced photoperiod during development but, unlike Phodopus sungorus, retain functional reproductive physiology. Reproduction 132:661–670. doi:10.1530/rep.1.00019

      (6) The font in many figure panels is small and hard to read (e.g. 1A,D,E,H,I,L...). Please increase the size for legibility.

      We have increased the font size of our figure text throughout the manuscript.

      Reviewer #3

      (15) Fig 1 C,D. Clarify the units of the y axis

      We have now fixed this.

      Full Reference List

      Adriani W, Laviola G. 2004. Windows of vulnerability to psychopathology and therapeutic strategy in the adolescent rodent model. Behav Pharmacol 15:341–352. doi:10.1097/00008877-200409000-00005

      Hammerslag LR, Gulley JM. 2014. Age and sex differences in reward behavior in adolescent and adult rats. Dev Psychobiol 56:611–621. doi:10.1002/dev.21127

      Hoops D, Flores C. 2017. Making Dopamine Connections in Adolescence. Trends in Neurosciences 1–11. doi:10.1016/j.tins.2017.09.004

      Kalsbeek A, Voorn P, Buijs RM, Pool CW, Uylings HBM. 1988. Development of the Dopaminergic Innervation in the Prefrontal Cortex of the Rat. The Journal of Comparative Neurology 269:58–72. doi:10.1002/cne.902690105

      Makinodan M, Rosen KM, Ito S, Corfas G. 2012. A critical period for social experience-dependent oligodendrocyte maturation and myelination. Science 337:1357–1360. doi:10.1126/science.1220845

      Manitt C, Mimee A, Eng C, Pokinko M, Stroh T, Cooper HM, Kolb B, Flores C. 2011. The Netrin Receptor DCC Is Required in the Pubertal Organization of Mesocortical Dopamine Circuitry. J Neurosci 31:8381–8394. doi:10.1523/jneurosci.0606-11.2011

      Reynolds LM, Flores C. 2021. Mesocorticolimbic Dopamine Pathways Across Adolescence: Diversity in Development. Front Neural Circuit 15:735625. doi:10.3389/fncir.2021.735625

      Reynolds LM, Yetnikoff L, Pokinko M, Wodzinski M, Epelbaum JG, Lambert LC, Cossette M-P, Arvanitogiannis A, Flores C. 2018. Early Adolescence is a Critical Period for the Maturation of Inhibitory Behavior. Cerebral cortex 29:3676–3686. doi:10.1093/cercor/bhy247

      Schneider M. 2013. Adolescence as a vulnerable period to alter rodent behavior. Cell and tissue research 354:99–106. doi:10.1007/s00441-013-1581-2

      Spear LP. 2000. Neurobehavioral Changes in Adolescence. Current directions in psychological science 9:111–114. doi:10.1111/1467-8721.00072

      Wheeler AL, Lerch JP, Chakravarty MM, Friedel M, Sled JG, Fletcher PJ, Josselyn SA, Frankland PW. 2013. Adolescent Cocaine Exposure Causes Enduring Macroscale Changes in Mouse Brain Structure. J Neurosci 33:1797–1803. doi:10.1523/jneurosci.3830-12.2013

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The manuscript co-authored by Pál Barzó et al is very clear and very well written, demonstrating the electrophysiological and morphological properties of human cortical layer 2/3 pyramidal cells across a wide age range, from age 1 month to 85 years using whole-cell patch clamp. To my knowledge, this is the first study that looks at the cross-age differences in biophysical and morphological properties of human cortical pyramidal cells. The community will also appreciate the significant effort involved in recording data from 485 cells, given the challenges associated with collecting data from human tissue. Understanding the electrophysiological properties of individual cells, which are essential for brain function, is crucial for comprehending human cortical circuits. I think this research enhances our knowledge of how biophysical properties change over time in the human cortex. I also think that by building models of human single cells at different ages using these data, we can develop more accurate representations of brain function. This, in turn, provides valuable insights into human cortical circuits and function and helps in predicting changes in biophysical properties in both health and disease.

      Strengths:

      The strength of this work lies in demonstrating how the electrophysiological and morphological features of human cortical layer 2/3 pyramidal cells change with age, offering crucial insights into brain function throughout life.

      Weaknesses:

      One potential weakness of the paper is that the methodology could be clearer, especially in how different cells were used for various electrophysiological measurements and the conditions under which the recordings were made. Clarifying these points would improve the study's rigor and make the results easier to interpret.

      Reviewer #2 (Public review):

      Summary:

      In this study, Barzo and colleagues aim to establish an appraisal for the development of basal electrophysiology of human layer 2/3 pyramidal cells across life and compare their morphological features at the same ages.

      Strengths:

      The authors have generated recordings from an impressive array of patient samples, allowing them to directly compare the same electrophysiological features as a function of age and other biological features. These data are extremely robust and well organised.

      Weaknesses:

      The use of spine density and shape characteristics is performed from an extremely limited sample (2 individuals). How reflective these data are of the population is not possible to interpret. Furthermore, these data assume that spines fall into discrete types - which is an increasingly controversial assumption.

      Many data are shown according to somewhat arbitrary age ranges. It would have been more informative to plot by absolute age, and then perform more rigorous statistics to test age-dependent effects.

      Overall, the authors achieve their aims by assessing the physiological and morphological properties of human L2/3 pyramidal neurons across life. Their findings have extremely important ramifications for our understanding of human life and implications for how different neuronal properties may influence neurological conditions.

      Reviewer #3 (Public review):

      Summary:

      To understand the specificity of age-dependent changes in the human neocortex, this paper investigated the electrophysiological and morphological characteristics of pyramidal cells in a wide age range from infants to the elderly.

      The results show that some electrophysiological characteristics change with age, particularly in early childhood. In contrast, the larger morphological structures, such as the spatial extent and branching frequency of dendrites, remained largely stable from infancy to old age. On the other hand, the shape of dendritic spines is considered immature in infancy, i.e., the proportion of mushroom-shaped spines increases with age.

      Strengths:

      Whole-cell recordings and intracellular staining of pyramidal cells in defined areas of the human neocortex allowed the authors to compare quantitative parameters of electrophysiological and morphological properties between finely divided age groups.

      They succeeded in finding symmetrical changes specific to both infants and the elderly, and asymmetrical changes specific to either infants or the elderly. The similarity of pyramidal cell characteristics between areas is unexpected.

      Weaknesses:

      Human L2/3 pyramidal cells are thought to be heterogeneous, as L2/3 has expanded to a high degree during the evolution from rodents to humans. However, the diversity (subtyping) is not revealed in this paper.

      Recommendations for the authors: 

      Reviewer #1 (Recommendations for the authors):

      The manuscript co-authored by Pál Barzó et al is very clear and very well written, demonstrating the electrophysiological and morphological properties of the human cortical layer 2/3 pyramidal cells across a wide age range, from age 1 month to 85 years using whole-cell patch clamp. To my knowledge, this is the first study that looks at the cross-age differences in morphological and electrophysiological properties of human cortical pyramidal cells. The community will also appreciate the significant effort involved in recording data from 485 cells, given the challenges associated with collecting data from human tissue. Understanding the electrophysiological properties of individual cells, which are essential for brain function, is crucial for comprehending human cortical circuits. I think this research enhances our knowledge of how biophysical properties change over time in the human cortex. I also think that by building models of human single cells at different ages using these data, we can develop more accurate representations of brain function. This, in turn, provides valuable insights into human cortical circuits and function and helps in predicting changes in biophysical properties in both health and disease.

      We are grateful for the positive evaluation of our work. We also thank the reviewers for their comments and believe that our manuscript has improved significantly with their help. In addition to the reviewer’s suggestions for improvement, further cell reconstructions were performed to make the anatomical data more robust (n = 1, 2, 3, 3, 4, 3, 2 additional reconstructions in age groups infant, early childhood, late childhood, adolescence, young adulthood, middle adulthood and late adulthood, respectively; Σn = 18). Four additional cells were added to the spine analysis and the statistics associated with each additional dataset were updated.

      I have some comments, particularly regarding the methodology and data presentation, to improve the clarity of the paper.

      (1) I assume the tissue is from the resected area adjacent to the tumor. Could you please clarify this in the Methods section?

      Thank you for this comment. It has been clarified in the Methods section with the following sentence: “We used human cortical tissue adjacent to the pathological lesion that had to be surgically removed from patients (n = 63 female, n = 45 male) as part of the treatment for tumors, hydrocephalus, apoplexy, cysts, and arteriovenous malformation.”

      (2) Regarding the presentation of data in the Methods section, could you please clarify whether the authors used different cells for measuring the various electrophysiological properties? The number of recorded cells for calculating subthreshold properties (e.g., late adulthood: n = 113) differs from the number the cells used for calculating suprathreshold properties (e.g., late adulthood: n = 83). If this is the case, it may make it difficult to compare the electrophysiological properties. Could you please clarify this?

      The different cell numbers are indeed due to the fact that different quality criteria were defined for the analysis of fast and slow signals. For the analysis of fast signals (e.g. AP half-width, AP upstroke velocity, AP amplitude), higher quality requirements were established; therefore, cells with high series resistance (> 30 MΩ) were excluded. We have updated and clarified the recording conditions in the text, figures, and methodology section accordingly.

      (3) Additionally, they mentioned that their recordings were done at zero holding current and at more than -50 pA. Could you clarify whether the data from these two sets of experiments were combined? If so, please provide an explanation in the methods section.

      Basically, we wanted to determine the parameters of the membrane potential changes at rest. However, for technical reasons related to the biological amplifier, a continuous holding current may have been present during the measurement in some of the experiments (3.5% of all experiments). The holding currents were in the range of -50 pA to +60 pA. Within this range, which we had previously checked on mouse neurons, we found no linear correlation between the electrophysiological properties and the holding current. This is reported in the Methods section.

      (4) This section needs revision. It is unclear why different series resistances (Rs) or different cells were used to compute various electrophysiological properties: "To calculate passive membrane properties (resting membrane potential, input resistance, time constant, and sag) either cells with series resistance (Rs): 22.85 ± 9.04 MΩ (ranging between -4.55 MΩ and 56.76 MΩ) and 0 pA holding current (n = 154), or cells with holding current > -50 pA (-7.46 ± 28.56 pA, min: -49.89 pA, max: 59.68 pA) and Rs < 30 MΩ (18.96 ± 6.48 MΩ) (n = 23) were used. For the analysis of high frequency action potential features (AP half-width, AP up-stroke velocity, AP amplitude and rheobase) cells with Rs < 30 MΩ (n = 331 cells with Rs 19.2 ± 6.6 MΩ) and holding current > -50 pA (n = 308 with 0 pA holding current and Rs: 19.22 ± 6.59 MΩ, n = 23 with holding current: -7.46 ± 28.56 pA and Rs: 18.96 ± 6.48 MΩ) were used."

      To make this section clearer, we simplified the cell groups used to analyse the different electrophysiological properties and revised the Methods section as follows: “For the analysis of the electrophysiological recordings n = 457 recordings with a series resistance (Rs) of 24.93 ± 11.18 MΩ (max: 63.77 MΩ) were used. For the analysis of fast parameters related to the action potential (AP half-width, AP upstroke velocity, AP amplitude and rheobase), higher quality requirements were set and cells with Rs > 30 MΩ were excluded. This reduced the data set to n = 331 cells with Rs 19.42 ± 6.2 MΩ.”
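
      The exclusion rule quoted above can be sketched as a simple filter. This is only an illustration: the function name and the toy Rs values are hypothetical, and the only thing taken from the response is the 30 MΩ cutoff.

```python
from statistics import mean, stdev

RS_CUTOFF_MOHM = 30.0  # cutoff quoted in the revised Methods text


def split_by_series_resistance(rs_values, cutoff=RS_CUTOFF_MOHM):
    """Return (kept, excluded); cells with Rs strictly above the cutoff are excluded."""
    kept = [rs for rs in rs_values if rs <= cutoff]
    excluded = [rs for rs in rs_values if rs > cutoff]
    return kept, excluded


# Made-up Rs values in MΩ, for illustration only (not data from the study).
toy_rs = [12.4, 18.9, 25.0, 30.0, 31.2, 47.5]
kept, excluded = split_by_series_resistance(toy_rs)

print(f"kept {len(kept)} cells, excluded {len(excluded)} cells")
print(f"Rs of kept cells: {mean(kept):.2f} ± {stdev(kept):.2f} MΩ")
```

      In this sketch the full list would serve the passive-parameter analysis, while only `kept` would enter the analysis of fast action potential features.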

      (5) The authors recorded the sag ratio using a -100 pA injected current. Is there a technical reason why they did not inject more than -100 pA?

      There is no particular technical reason; like others, we have used this current amplitude over the years to record voltage responses.

      (6) In the abstract, the authors mentioned that data were recorded from ages 1 month to 85 years. However, in the results, they stated that data were recorded from ages 0 to 85 years. Could you please clarify this discrepancy?

      We corrected this discrepancy.

      (7) Additionally, the results mention that data were collected from 485 human cortical layer 2/3 (L2/3) pyramidal cells, but subthreshold membrane features such as resting membrane potential, input resistance, time constant (tau), and sag ratio were calculated in 475 cortical pyramidal cells from 99 patients. Could you please clarify these discrepancies? In the discussion "We recorded from n = 457 human cortical excitatory pyramidal cells from the supragranular layer from birth to 85 years"

      Thank you for pointing this out, we have corrected the error. Although our full data set contained 485 pyramidal cells, 28 recordings were excluded from the electrophysiological analysis and were used for morphological evaluation only, therefore 457 recordings were used for passive parameter measurements.

      (8) Regarding the distance from the pia to the border layer L1/L2, did the authors notice any differences across ages?

      To investigate whether the thickness of cortical layer 1 changes throughout life, we measured the L1 thickness and found no significant differences between age groups (P = 0.09, Kruskal-Wallis test) (Author response image 1).

      Author response image 1.

      Thickness of cortical layer 1 at different life stages. (A) Boxplot shows the thickness of layer 1. (B) Scatter plot shows the distribution of L1 thickness measured on the reconstructed cells. Age is shown in years on a logarithmic scale, dots are color-coded according to the corresponding age groups.

      (9) I am not sure why they referred to the data as layer 2/3 when most of the data, based on Figure 1E, were recorded from a distance of 0-200 µm from the L1/L2 border. Could it be that there is no significant depth-dependent variation in electrophysiological properties, as reported by Berg (2021), Kalmbach (2018), and Chameh (2021)?

      Although the vast majority of our data comes from a distance of less than 200 μm from the L1/L2 border, we cannot neglect the fact that our dataset also contains a small number of cells deeper than this, which are layer 3 cells. Apart from some differences shown in Supplementary Figures 7-9, we found no general difference between cells located at a distance of less than 200 μm and more than 200 μm from the L1 border.

      (10) In Figure 1, there is variability in resting membrane potential (RMP), tau, and input resistance (IR) within the infant age group. However, this trend is not observed in the sag ratio. Could you please discuss this finding?

      The large variance in the data is due to dramatic changes in these three parameters during the first year of life. Supplementary Figure 3 compares the parameter distributions of patients aged 0-6 months and 6-12 months. The sag amplitude in these cells is generally low; therefore, no such large changes could have occurred in it.

      (11) Did the authors use a K-Nearest Neighbors (KNN) test to assess the accuracy of the infant cluster in Figure 3F?

      Based on eight electrophysiological features of the cells (resting Vm, input resistance, tau, sag ratio, rheobase, AP half-width, AP up-stroke, and AP amplitude), the infant pyramidal cells on a UMAP form a distinct group (Author response image 2A) represented by cluster 4 on Author response image 2B. When calculating the sum of the Euclidean distances of cells within the cluster from the centroid, the isolated infant group (cluster 4) shows the smallest distance value from the centroid (cluster 1: 40.2, cluster 2: 36.21, cluster 3: 39.96, cluster 4: 5.72, cluster 5: 39.2, cluster 6: 55.74, cluster 7: 54.27), demonstrating that infant cells create a discrete cluster distinct from other age groups (Author response image 2B).
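
      The compactness measure described above (the summed Euclidean distance of each cluster's cells from its centroid) can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes 2D embedding coordinates and cluster labels are already available, it takes each centroid as the mean of the cluster members (which is what k-means centroids are at convergence), and the function names and toy points are hypothetical.

```python
import math
from collections import defaultdict


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def within_cluster_distances(points, labels):
    """Map each cluster label to (sum, mean) of member distances from the centroid."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    result = {}
    for label, members in clusters.items():
        c = centroid(members)
        total = sum(math.dist(p, c) for p in members)
        result[label] = (total, total / len(members))
    return result


# Toy 2D coordinates: cluster 0 is tight, cluster 1 is spread out.
toy_points = [(0, 0), (0, 2), (2, 0), (2, 2),
              (10, 10), (10, 16), (18, 10), (18, 16)]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

for label, (total, avg) in sorted(within_cluster_distances(toy_points, toy_labels).items()):
    print(f"cluster {label}: sum = {total:.2f}, mean = {avg:.2f}")
```

      Because a summed distance grows with the number of cells in a cluster, the sketch also returns the per-cell mean, which may be easier to compare across clusters of different sizes.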

      Author response image 2.

      (A) Uniform Manifold Approximation and Projection (UMAP) of 8 selected electrophysiological properties (resting Vm, input resistance, tau, sag ratio, rheobase, AP half-width, AP up-stroke, and AP amplitude) with data points for 331 cortical L2/3 pyramidal cells, colored with the corresponding age groups. (B) UMAP colored by k-means clustering with 7 clusters, red crosses represent the centroids of the clusters.

      (12) Missing citation: 'Previous research has shown that the biophysical properties of human pyramidal cells show depth-related correlations throughout L2/3 (Berg et al., 2021).' Please include citations for Kalmbach (2018) and Chameh (2021).

      We thank the reviewer for the additional references; these studies are now cited.

      (13) Have they noticed any differences in morphological properties among the different cortical lobes (parietal, temporal, frontal, and occipital)? It would be beneficial to present these data, especially since they have a sufficient sample size from each cortical lobe.

      The majority of our data set on the morphological properties of pyramidal cells comes from the parietal (n = 17 cells) and temporal lobe (n = 15). We found no significant differences in the morphological properties of cells from these two brain regions and no differences between age groups in the same cortical lobes.

      (14) Have the authors found differences in spine characteristics among different cortical areas, as reported previously (10.1023/a:1024134312173)?

      We found morphological differences in dendritic spines in the different brain regions; however, our data are too limited to draw definitive conclusions.

      Reviewer #2 (Recommendations for the authors):

      Major

      (1) I believe that these data presented in all main text figures would be more intuitive to be plotted on a log(age) scale, such as shown in supplementary Figure 13. The bounds of the ages used for different groups, as summarised in Figure 1 feel somewhat arbitrary.

      Recent neuroscientific studies on postnatal ageing mainly use the age-group comparison format (Kang 2011, Bethlehem 2022), which has been defined based on milestones in the cognitive, motor, social-emotional, and language/communications domains of observable behaviour (Zubler et al. 2022, for detailed definitions see Kang 2011). Since many parameters do not vary linearly but take a U-shape (or inverted U-shape), statistical quantification of these is not straightforward, so we would retain the age-group format for the main graphs. However, at the reviewer's suggestion, electrophysiological and morphological parameters are presented on a log(age) scale as supplementary figures (Supplementary Figures 2, 4 and 6), and further statistical analysis was carried out without grouping the data (see response 5).

      (2) The authors present a lot of data values in the text, which is also shown in the figures. This makes reading of the manuscript somewhat difficult in places. For brevity, it may be best to present this data as supplementary tables.

      Thank you for this suggestion. We have inserted these data as tables.

      (3) I am unclear why the authors excluded cells that fired doublets or triplets in Figure 4? Were these included in the passive and AP-specific analysis - but excluded from F-I plots? Please clarify the rationale and the relative abundance of these physiological types based on age - one might predict that more initial-burst firing types are associated with older neurons?

      Thank you for drawing attention to this anomaly. We have updated the figures and text by adding the cells with initial burst firing. These cells are also included in the analysis of passive and action potential properties. In our overall dataset, 6.78% of cells show burst firing. Within each age group, the proportions are: infant: 0%, early childhood: 3.57% (1 cell), late childhood: 0%, adolescence: 11.11% (6 cells), young adulthood: 10.11% (9 cells), middle adulthood: 10.71% (6 cells), late adulthood: 7.96% (9 cells).

      (4) The statistical analyses performed in Figure 6 are not justified. From the authors' description of these data, they derive spine density measurements from 1 infant and 1 aged adult, then perform pseudoreplicated analysis in these individuals. These data would require greater replication from infant and aged groups - with the possible inclusion of a younger adult group also. It would be ideal to have n=3/age group to allow robust statistical analysis.

      Thank you for this point. Accordingly, we have expanded our data set to include n = 3 infant pyramidal cells (83 days old, from one patient) and n = 3 pyramidal cells from three late adulthood patients (64.3 ± 2.08 years old).

      (5) Given the high number of individuals and replicates throughout this manuscript, a more circumspect approach to statistics would be appreciated, e.g. a generalised linear mixed effects model - with age as a fixed effect and sex, patient, etc as random effects. This may reveal the greatest statistical power of these important and rich data.

      Among the generalized models, we used the Generalized Additive Mixed Model (GAMM) to describe the relationship between age and the various passive and active electrophysiological features. We defined age, with a cubic spline smoothing term, as the fixed effect and gender, brain area, surgical procedure, and hemisphere as random effects. With the GAMM incorporating the four random effects, we found that the age dependence of the examined parameters (resting membrane potential, input resistance, tau, sag ratio, rheobase current, AP half-width, AP up-stroke velocity, AP amplitude, first AP latency, adaptation) was significant, except for F-I slope. We also observed correlations with gender, brain area, hemisphere, and surgical procedure in various intrinsic properties. Author response table 1 below shows the statistical values of the GAMM alongside those of the statistical tests used in the manuscript, for comparison.
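
      For orientation, the model structure described above can be written out as follows. This is a sketch under stated assumptions rather than the authors' exact specification: a Gaussian response with identity link and normally distributed random intercepts are assumed here, and the index functions g, a, s, h (mapping cell i to its gender, brain area, surgical procedure, and hemisphere) are notation introduced only for this illustration.

```latex
y_i = \beta_0 + f(\mathrm{age}_i)
      + b^{\mathrm{gender}}_{g(i)}
      + b^{\mathrm{area}}_{a(i)}
      + b^{\mathrm{procedure}}_{s(i)}
      + b^{\mathrm{hemisphere}}_{h(i)}
      + \varepsilon_i ,
\qquad
b^{(k)}_{j} \sim \mathcal{N}\!\left(0, \sigma_k^2\right), \quad
\varepsilon_i \sim \mathcal{N}\!\left(0, \sigma^2\right)
```

      Here y_i is one electrophysiological feature of cell i (for example input resistance), f is the cubic spline smooth of age, and each b term is a random intercept for one patient attribute.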

      Author response table 1.

      Statistical significance of patient attributes. *In the pairwise comparison, the age of cells in the two groups was significantly different: female (subthreshold: 37.36 ± 26.25 years old, suprathreshold: 38.3 ± 25.6 y.o.) - male (subthreshold: 24.86 ± 23.7 y.o., suprathreshold: 25.7 ± 23.93 y.o.), subthreshold: P = 1.96*10<sup>-6</sup>, suprathreshold: P = 3.25*10<sup>-5</sup>, Mann-Whitney test. **In the pairwise comparison, the age of cells in the two groups was significantly different: surgical procedure: tumor removal (subthreshold: 33.72 ± 24.33 y.o., suprathreshold: 36.43 ± 27.07 y.o.) - VP shunt (subthreshold: 27.38 ± 29.69 y.o., suprathreshold: 27.07 ± 29.37 y.o.), subthreshold: P = 3.68*10<sup>-3</sup>, suprathreshold: P = 1.64*10<sup>-3</sup>, Mann-Whitney test.

      (6) Regarding the morphological diversity of dendritic spines. There is some debate in the field as to whether the specific dendritic spine types distinguished in this manuscript are true subtypes or reflect a continuum of diverse morphology (see Tønnesen et al., 2014 Nature Neuroscience). It is appreciated that the approach taken by the authors is the dogma within the field - however, dogma should continue to be challenged. Given that the authors have used DAB labelling combined with light microscopy, the possibility of accurately measuring spine morphology required for determining this continuum is extremely limited (e.g. Li et al., (2023) ACS Chemical Neuroscience). I would suggest that alongside the inclusion of further replicates for their spine analysis, the authors tone down their discussion of spine subtypes given the absence of any synaptic data presented in this current study to support the maturation (or otherwise) of dendritic spine synapses.

      Many thanks to the reviewer for this comment. We agree that our method has drawbacks for testing spine categorization. To increase the reliability of our results, we increased the number of pyramidal cells in the infant and late adult groups. We also revised the figure and, as suggested by Reviewer #3, added photos of spines to each category in addition to the schematic drawings to give an impression of the phenotype. In the discussion, we only address the differences between the two readily separable forms, mushroom and filopodial, and highlight results that only confirm findings already known in the literature. Although the concerns are valid, we would cite the sentence from the above Li et al. (2023) reference: “...the most sophisticated equipment may not always be necessary for answering some research questions”. We believe that it is worth sharing our data and the somewhat subjective grouping, which we hope to report in more detail in the future.

      Minor

      (1) The supplemental materials are out of order relative to their introduction in the text. These should be revised to reflect the order mentioned in the text.

      Thank you for your comment, we have corrected the order of the supplementary figures.

      (2) In Supplementary Figure 13, it would be informative to include some form of linear regression to confirm whether an age-dependent effect on neuronal morphology exists.

      We have added linear regression to the figure.

      (3) Figure 3D = should this be AP - not Ap?

      Thank you for drawing attention to this; we have corrected the typo in the figure.

      (4) For UMAP analysis in Figure 3, please provide a table of the features that were used for the 32 & 8-parameter UMAPs respectively.

      We have added a table to the Materials and methods section of all the electrophysiological features included in the UMAP.

      (5) For morphology, please include pia and L1/2 border for reconstructions shown for clarity.

      We indicated both the pia mater and the L1/2 border on the figure showing all the reconstructions (Supplementary Figure 10).

      Reviewer #3 (Recommendations for the authors):

      Major:

      (1) Data were obtained from different cortical areas of human patients of different ages. The electrophysiological characteristics were largely independent of other attributes such as disease, gender, and cortical areas (Supplementary Figure 2). To support the conclusion that age is one of the key attributes responsible for change, a similar morphological analysis would be necessary for gender.

      We updated the text and added Supplementary Figures 18-21 to the supplementary section to determine whether age-related differences in biophysical characteristics are affected by the patient's gender.

      (2) 'mushroom-shaped, thin, filopodial, branched, and stubby spines'

      Show photographs of individual typical spine types to make the classification easier to understand.

      To make the classification more understandable, we have updated the corresponding figure (Figure 6) with representative photos of the dendritic spine types.

      (3) Some electrophysiological parameters of the infant group showed higher deviations compared to other age groups. A UMAP (Supplementary Figure 2) shows that some infant neurons form a small cluster, while other infant neurons are scattered with neurons of other ages. Are there any differences between infant neurons in the small cluster and other infant neurons with respect to attributes other than age?

      For most of the electrophysiological parameters, the infant age group showed age-dependent variability, as illustrated in Supplementary Figures 2, 3, 4 and 6. The small group of infant cells is not clustered by gender, brain region, or medical condition, as shown in Supplementary Figure 5.

      (4) A recent paper (Benavides-Piccione et al. 2024, doi:10.1093/cercor/bhae180) reported that some morphological parameters of human layer 3 neurons differ between occipital and temporal regions. Area-dependent morphological differences have also been reported in non-human primates. Discussion of potential contradictions may therefore be requested.

      Most of the cells we reconstructed originated from the parietal and temporal regions (parietal: n = 20, temporal: n = 23, frontal: n = 15, occipital: n = 5). We found no differences in morphological features between these two regions, and we also found no significant differences when we compared the cells from the same brain regions by age group.

      (5) L2/3 cells of rodents are morphologically differentiated according to cortical depth. If individual L2/3 cells of humans are less differentiated than those of rodents, this point should be discussed.

      Depth-related morphological heterogeneity has been reported previously (Berg 2021); however, our dataset on the morphological characteristics of pyramidal cells is from the upper L2/3 region, with somata located at a distance of 117.85 ± 65.3 μm (range: 11.05 to 243.3 μm) from the L1/L2 border. Therefore, we cannot conclude from our data whether human L2/3 cells are less differentiated than those of rodents.

      Minor:

      (1) Cell body morphology may affect electrophysiological properties. However, morphological quantification of cell bodies has not been reported. It may be added.

      In our DAB-labeled samples, we could not reliably measure the total volume of the cell body in the reconstructions; therefore, our measurements of soma morphology are not shown in the manuscript. When comparing the cell body area in the middle sections of the soma of the reconstructed cells between the age groups, we found no significant differences (P = 0.082, Kruskal–Wallis test).

      (2) 'The adaptation of the AP frequency response'

      Describe how this parameter was obtained.

      The adaptation of the AP frequency response or adaptation was calculated as the average adaptation of the interspike interval between consecutive APs.
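
      The response does not give the exact formula, so the following is only a minimal sketch of one common convention: the mean normalized difference between consecutive interspike intervals (ISIs). Both the normalization and the function name are assumptions made for illustration, not the authors' stated method.

```python
def adaptation_index(spike_times):
    """Mean of (ISI[k+1] - ISI[k]) / (ISI[k+1] + ISI[k]) over consecutive ISIs.

    Positive values mean the intervals lengthen during the train (adaptation),
    0 means regular firing. Returns None when fewer than two ISIs are available.
    """
    isis = [t1 - t0 for t0, t1 in zip(spike_times, spike_times[1:])]
    if len(isis) < 2:
        return None
    ratios = [(b - a) / (b + a) for a, b in zip(isis, isis[1:])]
    return sum(ratios) / len(ratios)


# Toy spike times in ms, for illustration only (not data from the study).
regular_train = [0.0, 10.0, 20.0, 30.0]   # constant 10 ms ISIs
adapting_train = [0.0, 10.0, 30.0, 70.0]  # ISIs of 10, 20, 40 ms

print(adaptation_index(regular_train))
print(adaptation_index(adapting_train))
```

      With this convention the index is bounded between -1 and 1 and does not depend on the time unit.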

      (3) 'we excluded cells showing initial duplet or triplet action potential bursts'

      Why were the burst cells excluded from the analysis?

      We have modified the figures and text to include cells with initial burst firing.

      (4) Electrophysiological characteristics to be analyzed:

      Spike thresholds and afterhyperpolarizations

      We found age-related differences in the amplitude of the afterhyperpolarization (P = 2.56*10<sup>-30</sup>, Kruskal-Wallis test) and in the threshold of the action potential (P = 5.24*10<sup>-12</sup>, Kruskal-Wallis test) (Author response image 3).

      Author response image 3.

      Age-dependence of afterhyperpolarization and AP threshold. (A-B) Boxplots show the differences in afterhyperpolarization (AHP) amplitude (A) and AP threshold (B) between age groups. Asterisks indicate statistical significance (* P < 0.05, ** P < 0.01, *** P < 0.001, Kruskal-Wallis test with post-hoc Dunn test). (C-D) Scatter plots show AHP amplitude (C) and AP threshold (D) across the lifespan. Age is shown on a logarithmic scale, dots are colored according to the corresponding age group.

      (5) 'We identified and labeled each spine on n = 2 fully 3D-reconstructed cells'

      To which cortical area do these cells belong?

      At what depths are they distributed?

      Is it possible to report the number of spines, in addition to the density per unit length?

      We increased the number of cells in which we analyzed dendritic spine density. The data shown in Figure 6 are from pyramidal cells from an infant patient (n = 3 from a single patient) and late adulthood patients (n = 3 from 3 patients) (Supplementary Figure 13). The infant cells are from the same patient, the sample is from the right parietal lobe, and the patient is 83 days old. The older cells are from three different patients (#1: 65 years old, right temporal lobe; #2: 66 years old, right parietal lobe; #3: 62 years old, right frontal lobe). Infant cells are located 144.43 ± 45.26 µm (#1: 109.3, #2: 128.49, #3: 195.5 µm) and late adult cells 161.22 ± 66.22 µm (#1: 183.5, #2: 213.42, #3: 86.73 µm) from the L1/2 border. We provide the number of spines in an additional supplementary table (Supplementary Table 2).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Thank you for your time and consideration of our submission. We also thank the reviewers for their consideration and helpful comments. We have revised the introduction, results, and discussion sections of the revised manuscript in accordance with the reviewers’ suggestions, which have enhanced the clarity of our work. Specifically, we have clarified that the aim of the study is to report newly discovered sperm behaviours inside the uterus via high resolution deep tissue live imaging, and to stimulate further studies and discussion in the field of postcopulatory sexual selection in mice based on our observations. To the best of our knowledge, many of the specific sperm behaviours described in our manuscript are being reported for the first time, proven through direct observation inside the living reproductive tract.

      We have also restructured our manuscript and moved our hypothetical interpretations based on our experimental observations to the discussion section. We hope that these revisions have clarified our claims and that our revised manuscript effectively communicates the importance of our findings and their value in prompting new questions and insights that encourage further studies. We believe that our work clearly demonstrates the importance of sperm/reproductive tract interaction, which cannot be adequately studied in artificial environments, and may become an important guideline for designing future experiments and studies.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary: 

      The authors want to determine the role of the sperm hook of the house mouse sperm in movement through the uterus. The authors are trying to distinguish between two hypotheses put forward by others on the role of the sperm hook: (1) the sperm cooperation hypothesis (the sperm hook helps to form sperm trains) vs (2) the migration hypothesis (that the sperm hook is needed for sperm movement through the uterus). They use transgenic lines with fluorescent labels to sperm proteins, and they cross these males to C57BL/6 females in pathogen-free conditions. They use 2-photon microscopy on ex vivo uteri within 3 hours of mating and the appearance of a copulation plug. There are a total of 10 post-mating uteri that were imaged with 3 different males. They provide 10 supplementary movies that form the basis for some of the quantitative analysis in the main body figures. Their data suggest that the role of the sperm hook is to facilitate movement along the uterine wall. 

      We thank the reviewer for summarizing our work and the critical review of our paper. As summarized, the sperm hook has been primarily associated with the sperm cooperation (sperm hook) hypothesis and the migration hypothesis. However, we would like to emphasize that the aim of our work is not to cross check between the two hypotheses. Our aim was not to disprove either hypothesis, but rather to develop an experimental platform that enables detailed observation of sperm migration dynamics within the live reproductive tract. 

      Through live imaging, we observed both the formation of sperm trains and interaction between the sperm and the female reproductive tract epithelium. However, in our observations, we could not find any advantage in terms of faster movement for the rarely observed sperm trains. While these events were infrequent in our experiments, we are not asserting that the sperm train hypothesis is invalid but rather reporting our observations as is.

      The main findings of our work lie in the newly observed dynamic behaviours of mouse sperm interacting with the female reproductive tract epithelium. Specifically, tapping and associated guided movement along the uterus wall, anchoring and related resistance to internal fluid flow and migration through the utero-tubal junction, and self-organized behaviour while clinging onto the colliculus tubarius. We have extensively revised the manuscript structure to clarify our findings.

      Strengths: 

      Ex vivo live imaging of fluorescently labeled sperm with 2-photon microscopy is a powerful tool for studying the behavior of sperm. 

      Weaknesses: 

      The paper is descriptive and the data are correlations. 

      The data are not properly described in the figure legends. 

      When statistical analyses are performed, the authors do not comment on the trend that sperm from the three males behave differently from each other. This weakens confidence in the results. For example, in Figure 1 the sperm from male 3613 (blue squares) look different from male 838 (red circles), but all of these data are considered together. The authors should comment on why sperm across males are considered together when the individual data points appear to be different across males. 

      Thank you for your comments and suggestions. We have revisited all figure legends and made the necessary amendments (shown in the red-lined manuscript). Please note that, for a better flow of the paper, the previous Figure 1 has been changed to Figure 2 in the revised manuscript.

      Regarding the analysis using different males, we would like to explain the statistics used. We used generalized linear mixed models to test the effect of the Angle and Distance to the wall on the migration kinetic parameters. The advantage of the generalized linear mixed models is that they consider individual variations in the data as an error term, thereby controlling such individual variations. 

      There are two main factors contributing to individual variations. One is, as you pointed out, the difference in sperm from different males. However, we used genetically similar mice, so genetic variation should be minimal. Nonetheless, there must be individual differences that cause variation, including age, stress level, and body condition. As these factors cannot be controlled, we used the mixed model approach, where individual variations are grouped within the individual. This approach enabled us to test the effect of each explanatory variable (Angle and Distance) within an individual.

      The second factor that could cause variations is the female oestrous status. To avoid artifacts that could influence sperm behaviour, we did not use any invasive methods, such as hormone injections, to control or induce female oestrus. We controlled for this possible effect by including the mating date as a random effect. Since each female was used only once, the mating date reflects the variation caused by each female.

      To provide further verification that the variation between individual males does not affect our results, we conducted the analysis per individual male and per mating date (i.e., per female). As clearly shown, sperm data points from individual males or females also show consistent, clear correlations with the distance from the uterus wall. As pointed out, the mean sperm speed could differ between individuals, but this is not the topic of interest here; our interest is the effect of the distance between sperm and the uterine wall. Additionally, the variation between males is not always larger than the effect of the day (female), which together suggests that integrating male variation is not essential. We have added this information as a Supplementary Figure (Fig. S3) in the revised supplementary materials.

      Moving forward, we can also consider the same analysis for the effects of the distance from the wall on sperm SWR and LIN (linearity of forward progression), where no statistical significance was found. As seen in the following figures, the regression lines drawn for each male and mating date likewise show no statistically significant effect of the distance to the wall on SWR and LIN.

      In summary, the statistical approach we used here has successfully reflected variations in sperm kinetics from different males as well as the variance from different females. We hope that our explanations and additional analysis answer your concerns. 

      Movies S8-S10 are single data points and no statistical analyses are performed. Therefore, it is unclear how penetrant the sperm movements are. 

      With respect to Movie S8, Figure 4A and B (Figure 5A and B in the current revised manuscript) depict the trajectories of accumulated spermatozoa (sperm trains) in the female uterus, as shown in Movie S8. We have added this information to the revised figure legend (L 293) for clarity. In over 100 hours of observation and over 10 TB of collected images, we did not observe sperm trains that moved faster than single sperm. The three sperm trains presented in Fig. 5B were the sperm trains that moved in the head-forward direction. Most other identifiable trains, or clusters, did not move or could not move forward as their heads were entangled randomly. Although we agree that a statistical test for Movie S8 (also Fig. 5B) would be valuable, the small number of sperm trains we found did not allow meaningful statistical tests. Instead, we provided all data in the box plots in Fig. 5C so that readers can evaluate and understand our points. We believe that this is a more neutral way of presenting our data than providing statistical significance.

      Regarding Movies S9 and S10, we are not entirely sure that we understood your comments correctly. It would be very helpful if you could point to the specific lines of the manuscript, as we would like to address your concerns and suggestions, and we believe that your input will improve our manuscript. We did not describe the penetration of sperm in these movies. Movies S9 and S10 show newly found sperm behaviours inside the UTJ and Isthmus. We observed that sperm beating is influenced by the width of the luminal space as well as by internal flow, as seen in Movies S9 and S10. As our animal model only expresses red fluorescence in the midpiece, beating frequency cannot be measured accurately. However, we can clearly observe that beating is not continuous and almost comes to a halt depending on variations in the reproductive tract. We revised our description of the findings on beating speed changes in the revised manuscript (LL 305-335).

      Movies S1B - did the authors also track the movement of sperm located in the middle of the uterus (not close to the wall)? Without this measurement, they can't be certain that sperm close to the uterus wall travels faster. 

      We revised Movie S1B to include the videos that were used for the sperm migration kinetics analysis in Figure 2 (previously Figure 1). As you can see in the movies, the graph, and the statistical analysis, there is a clear trend showing that spermatozoa migrate more slowly as the distance from the uterus wall increases. Regarding your comment on the middle of the uterus (not close to the wall), we have added another movie (Movie S1C) that was acquired at different depths from the wall (going towards the centre of the uterus). As clearly seen in Movie S1C, when imaging deeper into the uterus, there is an increasing number of inactive or slow-moving spermatozoa. Since the diameter of the uterus is easily over 2 mm, we currently do not have optical access to the exact centre of the uterus, but for all depths that are observable, spermatozoa near the wall were clearly faster.

      Movie S5A - is of lower magnification (200 µm scale bar) while the others have 50 and 20 µm scale bars. Individual sperm movement can be observed at the 20 µm scale (Movie S5C). If the authors want to prove that there is no upsucking movement of sperm by the uterine contractions, they need to provide a high-magnification image. 

      The main focus of Movie S5A is the intramural UTJ, where spermatozoa are located in rows within the narrow luminal space (see Author response image 1). If up-suck-like passive sperm carriage occurred, there would have to be sperm movement from the uterus into the intramural UTJ, as in Author response image 1 (left). However, no such sperm movement was seen in our observations, as shown in Movie S5A. Importantly, as indicated by an arrow from 5 sec to 6 sec in Movie S5A, some spermatozoa are moving downward (see also Author response image 1, right). This is the opposite of the direction expected for up-suck-like sperm carriage. 

      Genetic evidence also supports the view that up-suck-like passive sperm carriage does not account for sperm migration from the uterus to the UTJ. If environmental up-suck-like passive transfer played an important role, it would be unlikely that genetically modified spermatozoa fail to pass the entrance of the intramural UTJ (Nakanishi et al., 2004, Biol. Reprod.; Li et al., 2013, J. Mol. Cell Biol.; Larasati et al., 2020, Biol. Reprod.; Qu et al., 2021, Protein Cell). 

      Author response image 1.

      The left image represents what is expected when up-suck like passive sperm carriage occurs. The right image represents what is actually experimentally observed in the intramural UTJ (see Movie S5A). The direction of the arrowheads indicates the direction of sperm movement.

      Movie S8 - if the authors want to make the case that clustered sperm do not move faster than unclustered sperm, then they need to show Movie S8 at higher magnification. They also need to quantify these data. 

      We understand your concern. As shown in Figure 5B, we included the kinetics data of each sperm train and of each unlinked spermatozoon around the trains as individual dots. The only analysis we did not conduct was a statistical test, as it could be erroneous due to the large difference in sample size (3 trains vs 181 unlinked spermatozoa). As the medians of the four sperm kinetic parameters are similar except for SWR, we concluded that sperm trains are not necessarily faster than unlinked single spermatozoa. Since there is no known advantage in sperm competition for spermatozoa (including sperm trains) with intermediate moving speeds (for example, in IVF the fertilization success rate is high when faster, active spermatozoa with normal shape are selected; Vaughan & Sakkas, 2019, Biol. Reprod.), it is questionable whether forming sperm trains that are no faster than unlinked spermatozoa, as in our data, confers an advantage.

      However, we do not agree with your comment regarding the need for higher magnification. Measuring sperm migration speeds (kinetic parameters) does not require measuring exact tail movements in this study. Only sperm heads were tracked to measure their trajectories, and such tracking is better done at low magnification. By analogy, measuring the speed of a car does not require a magnification high enough to visualize the rotation of its wheels. Additionally, including observation magnification as an effect in all 4 GLMM models for Figure 2 (Table S3) does not change the result, which shows that magnification is not a factor that influences our analysis. 

      Movie S9C - what is the evidence that these sperm are dead or damaged? 

      Thank you for your valid comment. We tracked sperm movements for at least 10 minutes, and such entangled spermatozoa in the UTJ never became active again. As you can see in the new Movie S9b, entangled spermatozoa were also acrosome-reacted (the green acrosome signal in the head is gone), while active spermatozoa within the same video respond to peristaltic movement. However, as you pointed out, we did not measure their viability with appropriate dyes. Although we also considered extracting these spermatozoa and performing viability tests, we could not come up with a way to specifically extract the exact spermatozoa that were imaged. Considering your comments, we changed the term “damaged or dead” to “inactive” in the revised manuscript (LL 313-316, Legend Figure 6D. LL 380-384).

      Movie S10 - both slow- and fast-moving sperm are seen throughout the course of the movie, which does not support the authors' conclusion that sperm tails beat faster over time. 

      There must have been a misunderstanding. We did not indicate that sperm beating got faster over time anywhere in the main manuscript, including the figure legend and related movie captions. As correctly pointed out, the sperm beating speed changes over time (not getting faster over time) and shows a correlation with internal fluid flow and width of luminal space (LL 320-332). Please let us know if you meant something else. 

      Reviewer #2 (Public Review): 

      Summary: 

      The specific objective of this study was to determine the role of the large apical hook on the head of mouse sperm (Mus musculus) in sperm migration through the female reproductive tract. The authors used a custom-built two-photon microscope system to obtain digital videos of sperm moving within the female reproductive tract. They used sperm from genetically modified male mice that produce fluorescence in the sperm head and flagellar midpiece to enable visualization of sperm moving within the tract. Based on various observations, the authors concluded that the hook serves to facilitate sperm migration by hooking sperm onto the lining of the female reproductive tract, rather than by hooking sperm together to form a sperm train that would move them more quickly through the tract. The images and videos are excellent and inspirational to researchers in the field of mammalian sperm migration, but interpretations of the behaviors are highly speculative and not supported by controlled experimentation. 

      Thank you for your critical review and valuable comments on our manuscript. As pointed out, some of our findings and suggestions were largely observation based. However, to the best of our knowledge, many of our observations are novel, particularly in the context of live imaging inside the female uterus and reproductive tract. We believe these observations open doors to many questions and follow up studies that can be envisioned based on our findings, which is what drives science forward. 

      That being said, we entirely agree that many follow up experiments need to be designed and performed, especially to validate the exact molecular mechanisms of the observed dynamics. We acknowledge that it is unfortunate we currently lack the proper molecular experimental toolsets to perform further tests. We have removed much of the hypothetical discussions from the results section and moved them to the discussion section. We hope that our revision more clearly defines the observed experimental data and our interpretations.

      Strengths: 

      The microscope system developed by the authors could be of interest to others investigating sperm migration. 

      The new behaviors shown in the images and videos could be of interest to others in the field, in terms of stimulating the development of new hypotheses to investigate. 

      Weaknesses: 

      The authors stated several hypotheses about the functions of the sperm behaviors they saw, but the hypotheses were not clearly stated or tested experimentally. 

      The hypothesis statements were weakened by the use of hedge words, such as "may". 

      We appreciate your helpful comments and have revised our hypotheses and suggestions accordingly. We have removed instances of “may” or revised it to be more direct. We have also moved most of our interpretations and hypotheses from the results to the discussion section. 

      It is important to note that experimental approaches to test what we suggested from our findings in the current ex-vivo observation platform are not trivial and require extensive investigation of several unknown factors of the female reproductive tract. For instance, obtaining detailed information on the chemical characteristics and fluid dynamics in the female reproductive tract is essential to build a microfluidic channel that accurately resembles the uterus and oviduct, replicating what we found in an extracted living entire organ. This poses a significant challenge and requires collaborative expertise from many labs, which we hope to build in the near future. 

      Furthermore, our biggest concern is that, even if we were to construct the appropriate microfluidic channel to test sperm migration, it is very likely that the sperm behaviours that we observed under natural conditions may not be replicated in artificial environments. This raises questions about whether in-silico or in-vitro findings can truly resemble what we reported here using the ex-vivo observation inside a living organ.

      To share our experience of this difficulty: at the initial stage of our study, we attempted sperm injection combined with fluorescent beads to visualize the fluid flow, as well as dyeing the female reproductive tract and spermatozoa after mating. However, none of these attempts produced meaningful results. Another potential approach to test our claims is to use genetic engineering to indirectly confirm the influence of sperm hook morphology on sperm behaviour. However, such an approach lacks a mechanical demonstration of how the sperm hook interacts with the female reproductive tract. 

      It is unfortunate that the sperm behaviours that we found and reported here are considered as highly speculative. The main findings of our work lie in the newly observed dynamic behaviours of mouse sperm interacting with the female reproductive tract epithelium. Specifically, these behaviours include tapping and associated guided movement along the uterus wall, anchoring and related resistance to internal fluid flow and migration through the utero-tubal junction, and self-organized behaviour while clinging onto the colliculus tubarius. 

      We have extensively revised the manuscript structure to clarify our findings and integrated our points in the introduction. Although we understand that our hypotheses may be considered speculative and that the causal relationship between the sperm hook and its role in sperm migration requires further experimental work, we believe that the image-based observations of the dynamic behaviours of spermatozoa are solid. We believe our findings will facilitate further studies and discussion in the field of postcopulatory sexual selection in rodents.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      The manuscript is written for an expert in a fairly small field. I recommend that the authors rewrite the manuscript to make it more accessible to people outside of the field. These suggestions include 

      (1) Provide a diagram of the female reproductive tract in Figure 1. 

      a. Indicate where sperm enter the tract and the location of the oocyte they are trying to reach. 

      b. Label all areas of the uterus that are mentioned in this study and be consistent about the label. 

      (2) All movies should have a diagram of the location of the uterus that is being imaged. 

      Thank you for the great suggestion. We have added a diagram of the female reproductive tract in the revised Figure 1A. In response to your comments 1a and b, we have indicated such information by including eggs in the ampulla and arrows that indicate sperm migration direction. We have also labelled the name of the specific areas that were studied in the manuscript.

      We are unsure how to integrate the diagram into all movies without reframing the videos, which could corrupt the files. More importantly, we think that adding the same diagram to all movies may complicate the visuals and distract from the annotations and subjects in the movies. Instead, we refer to the common diagram (Figure 1A) in each movie caption, specifying where the video was taken. Thank you for the suggestion. With this information, we hope readers can now more easily understand where we made the observations. 

      (3) The major questions in the field need to be better described in the introduction. 

      Thank you for your valuable suggestions and specific comments which have greatly helped improve our manuscript. We have revised our introduction and discussion sections by adding more literature reviews and integrating studies across a wider range of the postcopulatory sexual selection, as per your suggestion (LL 34-57, LL 385-398).

      (4) The major question that the authors are trying to address should be described in the introduction. 

      Thank you for the helpful suggestion. We have clarified in the introduction that our aim was to contribute to the field of postcopulatory sexual selection in rodents by advancing methodological progress and to stimulate discussion and future research on the function of the sperm hook in murine rodents (LL 76-94) based on our observations.

      (5) A discussion of the sperm hook should be provided. How many species have this structure (or similar structure)? 

      We have integrated your point into the revised discussion section. Essentially, most murine rodent species have sperm hooks (although their exact shapes differ). However, as there are over 500 species and not all of them have been examined, we do not know exactly how many of them have this structure. Therefore, we included references to papers that examined species variation in sperm hook characteristics and their possible correlation with sperm competition in the discussion (LL 385-417). Additionally, we included papers by Breed (2004) and by Roldan et al. (1992) that investigated murine rodents with a sperm hook in the introduction section (LL 58-61).

      (6) The figure legends must describe everything in the figure or movie. 

      Thank you for the helpful suggestion. We previously thought that our figure legends may be too long. We have included further information in the figure legends and movie captions. We have also revised the movies by adding some clips following our revision (Movie S1).

      Reviewer #2 (Recommendations For The Authors): 

      Here are some specific concerns I had about the clarity of approach to experiments and interpretations of results. 

      In the Introduction, the authors stated that the study was intended to determine the function of the hooks on the mouse sperm heads. However, in the Results section, the authors did not explain the rationale for the first set of experiments with respect to the overall objective of the study. In this experiment, the authors measured the velocities of sperm swimming in the uterus and found that the sperm moved faster when closer to the uterine wall (VCL, VSL). They concluded that migration along the uterine wall "may" be an efficient strategy for reaching the entrance to the uterotubal junction (UTJ) and did not explain how this related to the function of the hooks. 

      Thank you for your critical comment and guidance. We have changed the order of Figure 1 and Figure 2 and revised the results section to integrate your points. At the initial stage of the study, we expected to find evidence of the function of sperm trains in aiding sperm migration in the female uterus (which has not been observed in the live uterus; previous work was done in vitro with sperm extracted from the epididymis or from the uterus after mating). However, what we found was unexpected: dynamic sperm hook-related movements that facilitate sperm migration inside the female uterus by playing a mechanical role in sperm interaction with the uterine wall. These results, which were presented in the previous Figure 2, have been reorganized as the new Figure 1.

      Based on this observation, our research later moved to clarify whether such sperm-epithelium interaction indeed helps sperm migration. This led us to measure sperm kinetics in relation to their distance and angle to the uterine wall. We have revised our introduction and result parts by integrating these points. We hope that our revision will answer your questions. We have also reduced the use of ‘may’ or ‘can’ in the results section. In the revised manuscript, we have moved such hypotheses to the discussion section and focused on what we observed in the results section.

      The authors proposed that the sperm hook "may" play a crucial role in determining the direction of migration. When sperm encountered a uterine wall, significantly more changed migration direction toward the pro-hook direction than toward the anti-hook direction. In Figure 2B, sperm behavior is not visually understandable nor clearly explained. 

      Thank you for the helpful comments. We have removed “may” and “might” to make our claim clearer and more concise. We have also combined the previous Figure 2B with the previous Figure 2C (together they now form Figure 1C). In addition, we revised Figure 1B by increasing the line thickness of the sperm trajectory in the pro-wall-hook direction and adding the anti-wall-hook trajectory. We hope that these revisions make the figure easier to understand.

      In Figure 2E, are the authors showing that the tip of the hook is caught between two epithelial cells? Please clarify the meaning of this figure. 

      Please clarify the difference between "tapping" and "anchoring". 

      Thank you for the detailed comments. As you pointed out, we currently have no evidence as to whether sperm can be caught in epithelial intercellular gaps. We have addressed this source of confusion by removing the gap in the revised figure (Figure 1E). We have also included the definitions of anchoring (LL 142-143) and tapping (LL 128-130). Anchoring facilitates the attachment of sperm to the uterine epithelia. Such anchoring also involves the catching of the sperm head in the inter-mucosal fold or gap, particularly at the entrance of the intramural UTJ at the end of the uterus. Tapping is the interaction between the head hook and epithelia in which the sperm hook taps (or pats) on the surface. Sperm tapping could be a byproduct of flagellar beating when spermatozoa migrate in the pro-wall-hook direction along the uterine wall (epithelia), or it could play some role in sperm migration. As we currently cannot draw a conclusion, we did not integrate the possible function of tapping in the manuscript.

      The authors proposed that opposite sliding of neighboring mucosal folds lining the UTJ would cause small openings to form, through which only perhaps one sperm at a time could enter and pass through the UTJ into the uterus. This hypothesis was not actually tested. 

      Imaging deep inside tissue is challenging because light is scattered as it penetrates biological tissue. While this is also true for the uterus, the intramural UTJ is especially difficult to image because it consists of several thick muscle and cell layers (see Movie S5A). Another challenge is that the peristaltic movement of the UTJ results in constant motion, making it impossible in our current experiments to continuously track single spermatozoa as they pass through the entire UTJ. We have moved this hypothesis to the discussion section and restated that it is a purely hypothetical model (LL 399-406). We hope that our model encourages the community to design or establish an improved ex-vivo observation system that may be able to test this hypothetical model in the near future.

      Next, the authors hypothesized that sperm that encounter the small openings in the UTJ may then be guided onward and the hooks could prevent backward slipping. This was also not tested. 

      As you’ve noted, the function of the sperm hook that aids in sliding and preventing backward slipping could not be tested directly in our ex-vivo observation platform that relies on natural movement of the living organ. However, we believe that these limitations also highlight the importance of continued research and the development of more advanced methodologies in this field.

      We would also like to note that we provide direct observations of spermatozoa resisting internal flow caused by reproductive tract contractions in Movies S3A and B as well as Movie S5B. We referred to these movies and pointed out the role of anchoring (sperm attachment) in preventing sperm from being squeezed out (LL 140-149, LL 224-241). Unfortunately, we cannot conceive of how this behaviour could be tested further in any uterus-resembling microfluidic device or ex-vivo system. In line with your suggestion, we have rewritten the related results section and moved the related discussion from the results to the discussion section (LL 224-241, LL 399-417). 

      The authors observed that large numbers of uterine sperm are attached to the entrance of the UTJ. Some sperm clustered and synchronized their flagellar beating. The authors speculated that this behavior served to push sperm in clusters onward through the UTJ. 

      We would like to note that we did not speculate that sperm clustering and their synchronization could serve to push spermatozoa in a cluster to move onward through the UTJ. We only pointed out our observation in recorded videos, that generative flow from the clustered spermatozoa pushed away other spermatozoa as seen in Movie S7 (LL 261-264). Although such sperm cooperation is possible (blocking passage of later sperm), we cannot draw that conclusion from our observation. The possibility you pointed out (pushing sperm onward through the UTJ) was suggested by Qu et al in 2021 [Cooperation-based sperm clusters mediate sperm oviduct entry and fertilization, Protein & Cell] based on their observations on cleared dead reproductive tracts.

      The authors found only a few sperm trains in the uterus, UTJ, and oviduct, so they could not measure sufficient numbers of samples to test whether sperm trains swim faster than single sperm. Without sufficient data, they concluded that the "sperm trains did not move faster than unlinked single spermatozoa." 

      We would like to take this opportunity to clarify our claims. We do not claim that our current experiments can give the final verdict on whether the sperm train hypothesis for faster swimming is correct or not. The phrase “sperm trains did not move faster” was not intended to mean that the sperm train hypothesis is invalid.  We did not draw a conclusion but dryly described the experimental data that we observed (LL 279-286).  We would once again like to emphasize that the main claim of our manuscript is not to rule out the sperm train hypothesis, but to present the various dynamic interactions of the sperm head with the female reproductive tract. To make the statement more balanced, we revised the sentence as “observed sperm trains did not move faster or slower than unlinked single spermatozoa” (LL 281-282).

      The authors hypothesized that the dense sperm clusters at the entrance into the UTJ could prevent the rival's sperm from entering the UTJ (due to plugging entrance and/or creating an outward flow to sweep back the rival's sperm), but they did not test it. 

      We agree that we were not able to test this possible function of the sperm cluster at the UTJ entrance. Following your concerns, we revised the results section (LL 256-264) by removing most of our discussion of the observed phenomena. We instead moved some interpretation to the discussion section (LL 421-437) and suggested that future work using appropriate microfluidic channel designs or sequential double-mating experiments could provide additional tests (LL 443-447). However, we would like to point out that Movie S7C clearly shows surrounding spermatozoa being swept away from the sperm clusters. Since the sperm density is high, this is almost equivalent to a particle image velocimetry experiment, and we can clearly see the effect of the outward flow generated by the sperm clusters.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Weakness #1: The authors claim to have identified drivers that label single DANs in Figure 1, but their confocal images in Figure S1 suggest that many of those drivers label additional neurons in the larval brain. It is also not clear why only some of the 57 drivers are displayed in Figure S1.

      As described in the Results section, we screened 57 GAL4 driver lines based on previous reports. These included drivers that had been shown to label a single dopaminergic neuron (DAN) or a small subset of DANs in the larval or adult brain hemisphere, suggesting potential for specific DAN labeling in larvae.

      In Figure 1, TH-GAL4 was used to cover all neurons in the DL1 cluster, while R58E02 and R30G08 are well-known drivers for pPAM. The fly strains in Figure 1h, k, l, and m were reported as single-DAN strains in larvae[1], while the strains in Figure 1e, f, and g were reported to label only a few DANs in adult brains[2,3]. We examined these strains and found that only some of them labeled a single DAN per brain hemisphere in 3rd instar larvae (Figure 1f, g, h, l and m). Among them, only the strains in Figure 1f and h labeled a single DAN in the brain hemisphere without labeling non-DANs. The other strains labeled non-DANs in addition to a single DAN (Figure 1g, l and m). Taking the ventral nerve cord (VNC) into consideration, the strain in Figure 1h also labeled neurons in the VNC (Figure S1e), while the strain in Figure 1f did not (Figure S1c).

      In summary, the driver shown in Figure 1f (R76F02AD;R55C10DBD, labeling DAN-c1) is the only line we identified that labels a single DAN in the 3rd instar larval brain hemisphere without additional labeling. The other lines shown in Figure 1 (g, h, l, m) label a single DAN but also include some non-DANs. Figure 1 focuses on strains that label a single or a pair of DANs.

      Labeling patterns for all 57 driver lines are summarized in Table 1. Figure S1 includes representative examples; full confocal images for all screened strains are available upon request, as stated in the figure legend.

      Weakness #2: Critically, R76F02-AD; R55C10-DBD labels more than one neuron per hemisphere in Figure S1c, and the authors cite Xie et al. (2018) to note that this driver labels two DANs in adult brains. Therefore, the authors cannot argue that the experiments throughout their paper using this driver exclusively target DAN-c1.

      Figure S1c shows a single dopaminergic (DA) neuron in each brain hemisphere. While additional GFP-positive signals were occasionally observed, they did not originate from the cell bodies of DA neurons, as these were not labeled by the tyrosine hydroxylase (TH) antibody. These additional GFP signals primarily appeared to be neurites, including axonal terminals, although we cannot rule out the possibility that some represent false-positive signals or weakly stained non-neuronal cell bodies. This interpretation is based on the analysis of 22 third-instar larval brains.

      To clarify this point in the manuscript, we added the following sentence to the Results section: “Based on the analysis of 22 brain samples, we observed this driver strain labels one neuron per hemisphere in the third-instar larval brain (Figure 2a–d, Figure S1c, Table S3).” Additionally, Table S3 was included to summarize the DAN-c1 labeling pattern across all 22 samples. An enlarged inset highlighting GFP-positive signals was also added to Figure S1c.

      Weakness #3: Missing from the screen of 57 drivers is the driver MB320C, which typically labels only PPL1-γ1pedc in the adult and should label DAN-c1 in the larva. If MB320C labels DAN-c1 exclusively in the larva, then the authors should repeat their key experiments with MB320C to provide more evidence for DAN-c1 involvement specifically.

      We thank the reviewer for this insightful suggestion. The MB320C driver primarily labels the PPL1-γ1pedc neuron in the adult brain, along with one or two additional weakly labeled cells. It would indeed be interesting to examine the expression pattern of this driver in third-instar larval brains. If it is found to label only DAN-c1 at this stage, we could consider using it to knock down D2R and assess whether this recapitulates our current findings.

      While we agree that this is a promising direction for future studies, we believe it is not essential for the current manuscript, given the specificity of the DAN-c1 driver (please see our response to Reviewer #3 for details). Nonetheless, we appreciate the reviewer’s suggestion, and we recognize that MB320C could be a valuable tool for future experiments.

      Weakness #4: The authors claim that the SS02160 driver used by Eschbach et al. (2020) labels other neurons in addition to DAN-c1. Could the authors use confocal imaging to show how many other neurons SS02160 labels? Given that both Eschbach et al. and Weber et al. (2023) found no evidence that DAN-c1 plays a role in larval aversive learning, it would be informative to see how SS02160 expression compares with the driver the authors use to label DAN-c1.

      We do not have our own images showing DANs in brains from crosses with the SS02160 driver. However, Extended Data Figure 1 of Eschbach et al. shows four strongly labeled neurons in each brain hemisphere[4], indicating that this driver does not label DAN-c1 alone.

      Weakness #5: The claim that DAN-c1 is both necessary and sufficient in larval aversive learning should be reworded. Such a claim would logically exclude any other neuron or even the training stimuli from being involved in aversive learning (see Yoshihara and Yoshihara (2018) for a detailed discussion of the logic), which is presumably not what the authors intended because they describe the possible roles of other DANs during aversive learning in the discussion.

      We agree with the reviewer that the terms “necessary” and “sufficient” may be too exclusive and could unintentionally exclude contributions from other neurons. As noted in the Discussion section, we acknowledge that additional dopaminergic neurons may also play roles in larval aversive learning. To reflect this, we have revised our wording to use “important” and “mediates” instead of the more definitive terms “necessary” and “sufficient,” making our conclusions more accurate and appropriately measured.

      Weakness #6: Moreover, if DAN-c1 artificial activation conveyed an aversive teaching signal irrespective of the gustatory stimulus, then it should not impair aversive learning after quinine training (Figure 2k). While the authors interpret Figure 2k (and Figure 5) to indicate that artificial activation causes excessive DAN-c1 dopamine release, an alternative explanation is that artificial activation compromises aversive learning by overriding DAN-c1 activity that could be evoked by quinine.

      This is an excellent point, and we agree that we cannot rule out the possibility that artificial activation interferes with aversive learning by overriding the natural activity of DAN-c1 that would normally be evoked by quinine. The observed results with TRPA1 could potentially be attributed to dopamine depletion, inactivation due to prolonged depolarization, or neural adaptation. However, we believe that our hypothesis - that over-excitation of DAN-c1 impairs learning - is more consistent with our experimental findings and with previously published data. Our rationale is as follows:

      (1) Associative learning in larvae occurs only when the conditioned stimulus (CS, e.g., an odor such as pentyl acetate) and unconditioned stimulus (US, e.g., quinine) are paired. In wild-type larvae, the CS depolarizes a subset of Kenyon cells in the mushroom body (MB), while the US induces dopamine (DA) release from DAN-c1 into the lower peduncle (LP) compartment (Figure 7a). When both stimuli coincide, calcium influx from CS activation and Gαs signaling via D1-type dopamine receptors activate the MB-specific adenylyl cyclase, rutabaga, which functions as a coincidence detector (Figure 7d).

      (2) Rutabaga converts ATP to cAMP, activating the PKA signaling pathway and modifying synaptic strength between Kenyon cells and mushroom body output neurons (MBONs) (Figure 7d). These changes in synaptic strength underlie learned behavioral responses to future presentations of the same odor.

      (3) Our results show that D2R is expressed in DAN-c1, and that D2R knockdown impairs aversive learning. Since D2Rs typically inhibit neuronal excitability and reduce cAMP levels[5], we hypothesize that D2R acts as an autoreceptor in DAN-c1 to restrict DA release. When D2R is knocked down, this inhibition is lifted, leading to increased DA release in response to the US (quinine). The resulting excess DA, in combination with CS-induced calcium influx, would elevate cAMP levels in Kenyon cells excessively - disrupting normal learning processes (Figure 7b). This is supported by studies showing that dunce mutants, which have elevated cAMP levels, also exhibit aversive learning deficits[6].

      (4) The TRPA1 activation results are consistent with our over-excitation model. When DAN-c1 was artificially activated at 34°C in the distilled water group, this mimicked the natural activation by quinine, producing an aversive learning response toward the odor (Figure 2k or new Figure 2i, DW group). Similarly, in the sucrose group, artificial activation mimicked quinine, producing a learning response that reflected both appetitive and aversive conditioning (Figure 2k, SUC group).

      (5) Over-excitation impairs learning in the quinine group. When DAN-c1 was activated during quinine exposure, both artificial and natural activation combined to produce excessive DA release. This over-excitation likely disrupted the cAMP balance in Kenyon cells, impairing learning and resulting in failure of aversive memory formation (Figure 2k, QUI group). This phenotype closely mirrors the effect of D2R knockdown in DAN-c1.

      (6) Optogenetic activation of DAN-c1 during aversive training similarly produced elevated DA levels due to both natural and artificial stimulation. This again would result in MBN over-excitation and a corresponding learning deficit. When optogenetic activation occurred during non-training phases (resting or testing), no additional DA was released during training, and aversive learning remained intact (Figure 5b).

      (7) Notably, when optogenetic activation was applied during training, we observed no aversive learning in the distilled water group and no reduction in the sucrose group (Figure 5c, 5d). We interpret this as evidence that the optogenetic stimulation was strong enough to cause elevated DA release in both groups, impairing learning in a manner similar to D2R knockdown or TRPA1 overactivation.

      (8) We extended this over-excitation framework to directly activate Kenyon cells (MBNs). Since MBNs are involved in both appetitive and aversive learning, their over-excitation disrupted both types of learning (Figure 6), further supporting our hypothesis.

      In summary, we propose that DAN-c1 activity is tightly regulated by D2R autoreceptors to ensure appropriate levels of dopamine release during aversive learning. Disruption of this regulation - either through D2R knockdown or artificial overactivation of DAN-c1 - results in excessive DA release, over-excitation of Kenyon cells, and impaired learning. This over-excitation model is consistent with both our experimental results and prior literature.

      Weakness #7: The authors should not necessarily expect that D2R enhancer driver strains would reflect D2R endogenous expression, since it is known that TH-GAL4 does not label p(PAM) dopaminergic neurons.

      As with TH-GAL4, the D2R driver strains may only partially reflect the expression pattern of endogenous D2R in larval brains. However, when we crossed the D2R driver strains with the GFP-tagged D2R strain, we observed co-localization in DM1 and DL2b dopaminergic neurons, as well as in mushroom body neurons (Figure S3c to h). In addition, D2R knockdown with D2R-miR directly supported the view that the GFP-tagged D2R strain reflects the expression pattern of endogenous D2R (Figure 4b to d; signals were reduced in DM1). In summary, we think the D2R driver strains support the expression pattern we observed with the GFP-tagged D2R strain, especially in DM1 DANs.

      Weakness #8: Their observations of GFP-tagged D2R expression could be strengthened with an anti-D2R antibody such as that used by Lam et al., (1999) or Love et al., (2023).

      Love et al. (2023) used the antibody originally described by Draper et al.[6]. We attempted to use the same antibody in our experiments; however, we were unable to detect clear signals following staining. This may be due to a lack of specificity for neurons in the Drosophila larval brain or incompatibility with our staining protocol. Unfortunately, we were unable to locate a copy of the Lam (1999) paper for further reference.

      Weakness #9: Finally, the authors could consider the possibility other DANs may also mediate aversive learning via D2R. Knockdown of D2R in DAN-g1 appears to cause a defect in aversive quinine learning compared with its genetic control (Figure S4e). It is unclear why the same genetic control has unexpectedly poor aversive quinine learning after training with propionic acid (Figure S5a). The authors could comment on why RNAi knockdown of D2R in DAN-g1 does not similarly impair aversive quinine learning (Figure S5b).

      We re-analyzed the data related to DAN-g1. Interestingly, knockdown of D2R in DAN-g1 larvae trained with quinine (QUI) showed a significant difference in response index (R.I.) compared to the distilled water (DW) control group. However, it also differed significantly from the DAN-g1 genetic control group trained with QUI (two-way ANOVA with Tukey’s multiple comparisons, p = 0.0002), while it was not significantly different from the UAS-D2R-miR genetic control group (p = 0.2724). Furthermore, knockdown of D2R in DAN-g1 did not lead to aversive learning deficits when larvae were trained with a different odorant, propionic acid (ProA; Figure S5a). Similarly, using an RNAi line to knock down D2R in DAN-g1 did not result in learning impairment when larvae were trained with pentyl acetate (PA; Figure S5b). These inconsistencies may stem from differences in stimulus intensity across odorants, as well as the variable efficiency of the knockdown strategies (microRNA vs. RNAi). Based on these results, we propose that D2Rs in DAN-g1 may modulate larval aversive learning in a quantitative manner but do not play as critical a role as those in DAN-c1, where knockdown produces a clear qualitative effect. We have added this paragraph to the Discussion section of the manuscript.

      Reviewer #2 (Public review):

      Weakness#1: Is not completely clear how the system DAN-c1, MB neurons and Behavioral performance work. We can be quite sure that DAN-c1;Shits1 were reducing dopamine release and impairing aversive memory (Figure 2h). Similarly, DAN-c1;ChR2 were increasing dopamine release and also impaired aversive memory (Figure 5b). However, is not clear what is happening with DAN-c1;TrpA1 (Figure 2K). In this case the thermos-induction appears to impair the behavioral performance of all three conditions (QUI, DW and SUC) and the behavior is quite distinct from the increase and decrease of dopamine tone (Figure 2h and 5b).

      The study successfully examined the role of D2R in DAN-c1 and MB neurons in olfactory conditioning. The conclusions are well supported by the data, with the exception of the claim that dopamine release from DAN-c1 is sufficient for aversive learning in the absence of unconditional stimulus (Figure 2K). Alternatively, the authors need to provide a better explanation of this point.

      Please refer to our response to Weakness #6 of Reviewer #1 above.

      Reviewer #3 (Public review):

      Weakness #1: It is a strength of the paper that it analyses the function of dopamine neurons (DANs) at the level of single, identified neurons, and uses tools to address specific dopamine receptors (DopRs), exploiting the unique experimental possibilities available in larval Drosophila as a model system. Indeed, the result of their screening for transgenic drivers covering single or small groups of DANs and their histological characterization provides the community with a very valuable resource. In particular the transgenic driver to cover the DANc1 neuron might turn out useful. However, I wonder in which fraction of the preparations an expression pattern as in Figure 1f/ S1c is observed, and how many preparations the authors have analyzed. Also, given the function of DANs throughout the body, in addition to the expression pattern in the mushroom body region (Figure 1f) and in the central nervous system (Figure S1c) maybe attempts can be made to assess expression from this driver throughout the larval body (same for Dop2R distribution).

      We thank the reviewer for the positive comments and thoughtful suggestions.

      Regarding the R76F02AD; R55C10DBD strain, we examined 22 third instar larval brains expressing GFP, Syt-GFP, or Den-mCherry. All brains clearly labeled DAN-c1. In approximately half of the samples, only DAN-c1 was labeled. In the remaining samples, 1 to 5 additional weakly labeled soma were observed, typically without associated neurites. Only 1 or 2 strongly labeled non-DAN-c1 cells were occasionally detected. These additional labeled neurons were rarely dopaminergic. In the ventral nerve cord (VNC), 8 out of 12 samples showed no labeled cells. The remaining 4 samples had 2–4 strongly labeled cells. These results support our conclusion that the R76F02AD; R55C10DBD combination predominantly and specifically labels DAN-c1 in the third instar larval brain. As for the reviewer’s question about the expression pattern of R76F02AD; R55C10DBD and D2R in the larval body, we agree that this is a very interesting avenue for further investigation. However, our current study is focused on the central nervous system and larval learning behaviors. We hope to explore this question more fully in future work.

      We added the following sentence to the Results section: “Based on analysis of 22 brain samples, we believe this driver strain consistently labels one neuron per hemisphere in the third-instar larval brain (Figure 2a - d, Figure S1c, Table S3).” In addition, we included Table S3 to summarize the DAN-c1 labeling patterns observed across these samples.

      Weakness #2: A first major weakness is that the main conclusion of the paper, which pertains to associative memory (last sentence of the abstract, and throughout the manuscript), is not justified by their evidence. Why so? Consider the paradigm in Figure 2g, and the data in Figure 2h (22 degrees, the control condition), where the assay and the experimental rationale used throughout the manuscript are introduced. Different groups of larvae are exposed, for 30min, to an odour paired with either i) quinine solution (red bar), ii) distilled water (yellow bar), or iii) sucrose solution (blue bar); in all cases this is followed by a choice test for the odour on one side and a distilled-water blank on the other side of a testing Petri dish. The authors observe that odour preference is low after odour-quinine pairing, intermediate after odour-water pairing and high after odour-sucrose pairing. The differences in odour preference relative to the odour-water case are interpreted as reflecting odour-quinine aversive associations and odour-sucrose appetitive associations, respectively. However, these differences could just as well reflect non-associative effects of the 30-min quinine or sucrose exposure per se (for a classical discussion of such types of issues see Rescorla 1988, Annu Rev Neurosci, or regarding Drosophila Tully 1988, Behav Genetics, or with some reference to the original paper by Honjo & Furukubo-Tokunaga 2005, J Neurosci that the authors reference, also Gerber & Stocker 2007, Chem Sens).

      As it stands, therefore, the current 3-group type of comparison does not allow conclusions about associative learning.

      We adopted the single-odor larval learning paradigm from Honjo et al., who first developed and validated this method for studying larval olfactory associative learning[7,8]. To address the reviewer’s concern regarding potential non-associative effects from 30-minute exposure to quinine or sucrose, we refer to multiple lines of evidence provided in Honjo’s studies: (1) Honjo et al. demonstrated that only larvae receiving paired presentations of odor and unconditioned stimulus (quinine or sucrose) exhibited learned responses. Exposure to either stimulus alone, or temporally dissociated presentations, failed to induce any learning response. (2) When tested with a second, non-trained odorant, larvae only responded to the odorant previously paired with the unconditioned stimulus. This rules out generalized olfactory suppression and confirms odor-specific associative learning. (3) Well-characterized learning mutants (e.g., rutabaga, dunce) that show deficits in adult reciprocal odor learning also failed to exhibit learned responses in this single-odor paradigm, further supporting its validity. (4) In our study, we used two distinct odorants (pentyl acetate and propionic acid) and two independent D2R knockdown approaches (UAS-miR and UAS-RNAi). We consistently observed that D2R knockdown in DAN-c1 impaired aversive learning. Importantly, naïve olfactory, gustatory, and locomotor assays ruled out general sensory or motor defects. Comparisons with control groups (odor paired with distilled water) also ruled out non-associative effects such as habituation. Taken together, these results strongly support that the single-odor paradigm is a robust and reliable assay for assessing larval olfactory associative learning in Drosophila. We have added a section in the Discussion to clarify and defend the use of this paradigm in our study.

      Weakness #3: A second major weakness is apparent when considering the sketch in Figure 2g and the equation defining the response index (R.I.) (line 480). The point is that the larvae that are located in the middle zone are not included in the denominator. This can inflate scores and is not appropriate. That is, suppose from a group of 30 animals (line 471) only 1 chooses the odor side and 29, bedazzled after 30-min quinine or sucrose exposure or otherwise confused by a given opto- or thermogenetic treatment, stay in the middle zone... a P.I. of 1.0 would result.

      We allowed 5 min during the testing stage for the larvae to move about the testing plate. Under most conditions, more than half of the larvae (>50%) explore the plate, and the rest may stay in the middle zone, where they are not counted. We used 25-50 larvae in each learning assay, so around 10-30 larvae end up in the two semicircular areas. Indeed, in our raw data an R.I. of 1 seldom appears; most R.I. values fall between -0.2 and 0.8. We acknowledge that the R.I. equation is not linear, so values change more steeply as they approach -1 and 1. However, because most values fall between -0.2 and 0.8, we think such ‘border effects’ can be neglected when enough larvae (10-30) enter the calculation.
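
      The reviewer's extreme case and the role of sample size can be illustrated numerically. The sketch below assumes the response index has the form R.I. = (N_odor - N_control) / (N_odor + N_control), which matches the reviewer's description that middle-zone larvae are left out of the denominator. The exact equation in the manuscript is not reproduced here, and the function names and counts are hypothetical.

```python
def response_index(n_odor: int, n_control: int) -> float:
    """Assumed R.I.: (odor - control) / (odor + control).

    Larvae that stay in the middle zone are not counted at all.
    """
    counted = n_odor + n_control
    if counted == 0:
        raise ValueError("no larvae in either scoring zone")
    return (n_odor - n_control) / counted


def response_index_all(n_odor: int, n_control: int, n_middle: int) -> float:
    """Hypothetical alternative: every tested larva enters the denominator."""
    total = n_odor + n_control + n_middle
    if total == 0:
        raise ValueError("no larvae tested")
    return (n_odor - n_control) / total


def step_size(counted: int) -> float:
    """Change in the assumed R.I. when one counted larva switches sides."""
    return 2 / counted


if __name__ == "__main__":
    # Reviewer's extreme case: 1 larva on the odor side, 29 in the middle zone.
    print(response_index(1, 0))
    print(response_index_all(1, 0, 29))
    # Range described in the response: 10-30 larvae counted per assay.
    for n in (10, 30):
        print(n, step_size(n))
```

      Under this assumed form, the index moves in steps of 2 divided by the number of counted larvae, so an assay with only 10 counted larvae resolves the index in steps of 0.2. This is why the number of larvae entering the calculation matters.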

      Weakness #4: Unless experimentally demonstrated, claims that the thermogenetic effector shibire/ts reduces dopamine release from DANs are questionable. This is because firstly, there might be shibire/ts-insensitive ways of dopamine release, and secondly because shibire/ts may affect co-transmitter release from DANs.

      The Shibire<sup>ts1</sup> gene encodes a thermosensitive mutant of dynamin. Expressing this mutant in target neurons blocks neurotransmitter release at ambient temperatures above 30°C, because it represses vesicle recycling[7]. It is a widely used tool for testing whether a target neuron is involved in a specific physiological function. We cannot rule out that Shibire<sup>ts1</sup>-insensitive routes of dopamine release exist. However, blocking dopamine release from DAN-c1 with Shibire<sup>ts1</sup> already changed the learning response (Figure 2h). This result indicates that dopamine release from DAN-c1 during training is important for larval aversive learning, which supports our hypothesis.

      The second question, about potential co-transmitter release, is an excellent one. Yamazaki et al. recently reported that co-neurotransmitters in the dopaminergic system modulate adult olfactory memories in Drosophila[9], and we cannot rule out roles for co-released neurotransmitters or neuropeptides in larval learning. Ideally, observing real-time changes in dopamine release from DAN-c1 in wild-type and TH-knockdown larvae would answer this question. However, live imaging of dopamine release from a single dopaminergic neuron is not practical for us at this time. On the other hand, the roles of dopamine receptors in olfactory associative learning support the importance of dopamine for Drosophila learning. The D1 receptor, dDA1, has been shown to be involved in both appetitive and aversive learning in adults and larvae[10,11]. In our work, D2R in the mushroom body played important roles in both larval appetitive and aversive learning (Figure 6a). Together, this evidence points to the importance of dopamine in Drosophila olfactory associative learning. In addition, too much remains unknown about co-released neurotransmitters and neuropeptides, and about their potentially complex interactions and crosstalk. We believe that investigating them is beyond the scope of this study.

      Weakness #5: It is not clear whether the genetic controls when using the Gal4/ UAS system are the homozygous, parental strains (XY-Gal4/ XY-Gal4 and UAS-effector/ UAS-effector), or as is standard in the field the heterozygous driver (XY-Gal4/ wildtype) and effector controls (UAS-effector/ wildtype) (in some cases effector controls appear to be missing, e.g. Figure 4d, Figure S4e, Figure S5c).

      Almost all controls we used were homozygous parental strains. They showed no abnormal behavior in learning, naïve sensory, or locomotion assays. The only exception is the control for DAN-c1: larvae from the homozygous R76F02AD; R55C10DBD strain showed much reduced locomotion speed (Figure S6). To prevent this reduced speed from affecting learning performance, we used heterozygous R76F02AD; R55C10DBD/wildtype larvae as the control, which showed normal learning, naïve sensory, and locomotion abilities (Figure 4e to i).

      Figure 4d is a column graph quantifying the efficiency of D2R knockdown with miR. Because we needed to induce and quantify the knockdown in specific DANs (DM1), only TH-GAL4 could serve as the control group, rather than UAS-D2R-miR. The control groups missing from Figures S4e and S5c are shown in other figures (Figure 4e).

      We describe this in the Materials and Methods section: “All control strains used in learning assays were homozygous (except DAN-c1×WT), while all experimental groups (D2R knockdown and thermogenetics) used were heterozygous by crossing the corresponding control strains”.

      We also reorganized Figures S4e and S5c to show them alongside the control groups, making them easier to interpret.

      Weakness #6: As recently suggested by Yamada et al 2024, bioRxiv, high cAMP can lead to synaptic depression (sic). That would call into question the interpretation of low-Dop2R leading to high-cAMP, leading to high-dopamine release, and thus the authors interpretation of the matching effects of low-Dop2R and driving DANs.

      We appreciate the reviewer’s suggestion. We read this paper, which also addresses a question we raise in the Discussion section: the discrepancy between cAMP elevation in mushroom body neurons and the reduced MBN-MBON synaptic plasticity after olfactory associative learning in Drosophila. The authors offer an explanation for the existing D1R-cAMP elevation-MBN-MBON LTD axis, which helps our understanding of the learning mechanism. Unfortunately, we do not think it offers an explanation for our D2R-related mechanisms. We have added this reference to our citations.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Throughout the behavioral experiments, a defect in aversive learning is defined as a relative increase in the response index (RI) after olfactory training with quinine (red) and a defect in appetitive learning as a relative decrease in RI after training with sucrose (blue). Training with distilled water (yellow) is intended to be a control for comparisons within genotypes/treatment groups but causes interpretation issues if it is also affected by experimental manipulations.

      The authors typically make comparisons between quinine, water, and sucrose within each group, but this often forces readers to infer the key comparisons of interest. For example, the key comparison in Figure 2h is the statistically significant difference between the red groups, which differ only in the temperature used during training. Many other figure panels in the paper would also benefit from more direct statistical comparisons, particularly Figure 2k.

      While I recognize the value of the water control, I strongly recommend that the authors make statistical comparisons directly between genotypes/treatment groups where possible and to interpret results with more caution when the water RI score differs substantially between groups. Also, since the authors are conducting two-way ANOVAs before Dunnett's multiple comparisons tests, they ideally should report the p-value for the main effect of each factor, plus the interaction p-value between the two factors before making multiple comparisons.

      We appreciate the reviewer’s suggestion. In response, we re-analyzed all learning assay data in Figures 2 and 4 using two-way ANOVA followed by Tukey’s multiple comparisons test. Unlike our previous analysis, which only compared each experimental group to its corresponding DW control, we now compared all groups against one another. First, we found that most R.I. values from different temperature conditions (Figure 2) or genotypes (Figure 4) trained with DW were not significantly different, with the exception of the data in Figure 2i (formerly Figure 2k; discussed further below). The R.I. from DAN-c1 × D2R-miR larvae trained with QUI was significantly different from both genotype control groups (DAN-c1 × WT and UAS-D2R-miR), while no significant difference was observed between the two controls trained with QUI. Thus, this more comprehensive statistical approach supports the conclusions we previously reported. Second, as the reviewer noted, the new analysis allows for a more direct interpretation of our findings. For example, in the thermogenetic experiments using the Shibire<sup>ts1</sup> strain, the R.I. of DAN-c1 × UAS-Shibire<sup>ts1</sup> larvae trained with QUI at 34°C was not significantly different from the DW group at 34°C, but was significantly different from the QUI group at 22°C. Both findings support our conclusion that blocking dopamine release from DAN-c1 impairs larval aversive learning (Figure 2f).

      In the dTRPA1 activation experiments, the R.I. of DAN-c1 × UAS-dTRPA1 larvae trained with DW at 34°C was significantly lower than that of the DW group at 22°C and the QUI group at 34°C, but not significantly different from the QUI group at 22°C (Figure 2i). These results indicate that activating DAN-c1 during training is sufficient to drive aversive learning even in the absence of QUI. Interestingly, when DAN-c1 × UAS-dTRPA1 larvae were trained with QUI at 34°C, their R.I. was significantly higher than that of the DW group at 34°C and significantly different from the QUI group at 22°C, but not significantly different from the DW group at 22°C (Figure 2i). We interpret this as evidence that simultaneous activation of DAN-c1 by both QUI and dTRPA1 leads to over-excitation, which in turn impairs aversive learning.

      We have revised the figures (Figures 2, 4, 5, and 6) and updated the corresponding Results sections to reflect this new statistical analysis. Additionally, we now report the p-values for interaction, row factor, and column factor - either in Table S4 (for Figure 2) or in the figure captions for Figures 4, 5, 6, S4, S5, and S7.

      (2) The authors' motivation to find tools that label DANs other than DAN-c1 was unclear until much later in the paper when I saw the screening experiments in Figures S4 and S5. The authors could provide a clearer justification for why they focus on DAN-c1 in Figure 2 rather than another DAN for which they found a specific driver in Figure 1. The motivation for looking at individual pPAM neurons was also unclear.

      We sincerely appreciate the reviewer’s thoughtful suggestion. Our study was initially motivated by the goal of characterizing the expression pattern of D2R in the larval brain. From there, we aimed to identify DAN drivers that label specific pairs of dopaminergic neurons, enabling us to assess the functional role of D2R in distinct DAN subtypes through targeted knockdown experiments. This approach ultimately led us to focus on DAN-c1, as it was the only neuronal population for which D2R knockdown resulted in a learning deficit. We then returned to examine the functional significance of DAN-c1 in aversive learning. While we recognize that a more comprehensive narrative might be desirable, the current structure of our manuscript reflects the most logical progression of our work based on our research priorities and experimental outcomes. We did explore alternative manuscript structures - such as beginning with the D2R expression pattern - but found that the current format best conveys our findings and rationale.

      Regarding our motivation to study individual PAM neurons: we aimed to identify whether D2R plays a role in a specific pair of pPAM neurons involved in larval appetitive learning. However, we were unable to find a driver that exclusively labels DAN-j1, which we believe to be the key neuron in this context (see Figure 1). As a result, our investigation into appetitive learning did not progress beyond the observation of D2R expression in pPAM neurons (Figure 3d), and we did not proceed with learning assays in this context. While we acknowledge the limitations of our study, we believe that our focus on DAN-c1 is well-justified based on both our findings and the tools currently available. We respectfully note that a major restructuring of the manuscript would not necessarily clarify the rationale for focusing on DAN-c1, and therefore we have maintained the current organization.

      (3) The authors should also double-check and update the expression patterns of the drivers in Table 1 using references such as the FlyLight online resource. For example, MB438B labels PPL1-α'2α2, PPL1-α3, PPL1-γ1pedc according to FlyLight, not just PPL1-γ1pedc as initially reported by Aso and Hattori et al. (2014).

      We appreciate the reviewer’s suggestion. We have double-checked and updated the driver expression patterns in Table 1, using FlyLight data as a reference.

      (4) Interpreting overlaid green-and-red fluorescence confocal images would be difficult for any colorblind readers; I suggest that the authors consider using a more friendly color set.

      We thank the reviewer for the suggestion. In our study, we need three distinct colors to represent different channels. We also tested an alternative color scheme using cyan, magenta, and yellow (CMY) instead of the standard red, green, and blue (RGB). As a comparison (see below), we used a R76F02AD;R55C10DBD (DAN-c1) GFP-labeled brain as an example. In our evaluation, the RGB combination provided clearer visualization and appeared more natural, while the CMY scheme looked somewhat artificial. Therefore, we decided to retain the original RGB color scheme and did not modify the colors in the figures.

      Author response image 1.

      (5) For Figure 4d, counting each DAN as an individual N would violate the assumption of independence made by the unpaired t test, since multiple DANs are found in each brain and therefore are not independent. Instead, it would be better to count each individual N as the average intensity of the four DANs measured in each brain.

      We revised the analysis of microRNA efficiency by averaging the fluorescence intensity of DANs within each brain, treating each brain as a single sample. Based on this approach, we re-plotted Figure 4d.
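
      The per-brain averaging described above can be sketched as follows. The intensity values, brain identifiers, and function name are hypothetical. The point is only that the sample size becomes the number of brains, not the number of DANs measured.

```python
from collections import defaultdict
from statistics import mean


def per_brain_means(measurements):
    """Collapse (brain_id, intensity) pairs into one mean intensity per brain."""
    by_brain = defaultdict(list)
    for brain_id, intensity in measurements:
        by_brain[brain_id].append(intensity)
    return {brain_id: mean(values) for brain_id, values in by_brain.items()}


if __name__ == "__main__":
    # Hypothetical data: four DANs measured in each of two brains.
    control = [
        ("b1", 100.0), ("b1", 110.0), ("b1", 90.0), ("b1", 100.0),
        ("b2", 120.0), ("b2", 115.0), ("b2", 125.0), ("b2", 120.0),
    ]
    means = per_brain_means(control)
    # Eight DAN measurements collapse to two independent samples (brains).
    print(len(control), "DAN measurements ->", len(means), "brains")
    print(means)
```

      The resulting per-brain means, one value per brain, are what would then enter the unpaired t test.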

      (6) Finally, the authors ought to make it clearer throughout the paper that they have implicated a pair of DAN-c1 neurons in aversive learning, not just a single DAN as currently stated in the title.

      We thank the reviewer for this suggestion about our wording. We have changed every instance of “single neuron” to “a pair of neurons”.

      Reviewer #2 (Recommendations for the authors):

      (1) The results section presents: "Activation of DAN-c1 with dTRPA1 at 34°C during training induced repulsion to PA in the distilled water group (Figure 2k). These data suggested that DAN-c1 excitation and presumably increased dopamine release is sufficient for larval aversive learning in the absence of gustatory pairing." An alternative interpretation is that 30 min of TrpA activation depletes the synaptic vesicle pool, or inactivates neurons because of prolonged depolarization, or that the DAN shows firing rate adaptation (e.g. see Pulver et al. 2009; doi:10.1152/jn.00071.2009). In such a case DA release would be reduced and not increased. Therefore, the interpretation that DAN-c1 activation is both necessary and sufficient in larval aversive learning is difficult to sustain.

      In this regard, it is important to know how sensorimotor abilities are affected during thermo-induction at 34°C for 30 min.

      We thank the reviewer for the thoughtful suggestion. Regarding the concern about potential dopamine depletion or neuronal inactivation, we believe a comparison with the Shibire<sup>ts1</sup> experiments helps clarify the interpretation. Activation of Shibire<sup>ts1</sup> during training with distilled water did not result in aversive learning (Figure 2f), which is a distinct phenotype from that observed with dTRPA1 activation (Figure 2i). This suggests that the phenotypes seen with dTRPA1 activation are not due to reduced dopamine release. Additionally, as the reviewer suggested, we have revised our conclusion to state that “DAN-c1 is important for larval aversive learning,” rather than claiming it is both necessary and sufficient.

      (2) The GRASP system can label the contact of a cell in close proximity like synaptic contacts, but also other situations like no synaptic contact. It would be useful to use a more specific synaptic labelling tool, like the trans-synaptic tracing system (Talay et al., 2017 https://doi.org/10.1016/j.neuron.2017.10.011), which provides a better label of synaptic contact.

      We really appreciate the reviewer’s suggestion. First, we acknowledge that there are several general methods to reveal synaptic connections between neurons: immunohistochemistry (IHC), neuron labeling, viral tracing, GRASP, and electron microscopy (EM). Among these, IHC is not sufficiently convincing, viral tracing is challenging and rarely used in Drosophila, and EM, while the most accurate, is prohibitively expensive for our current goals. For these reasons, we chose the GRASP system to demonstrate the synaptic connections from dopaminergic neurons to the mushroom body. Second, we utilized an activity-dependent version of the GRASP system, linking split-GFP1-10 with synaptic proteins (e.g., synaptobrevin)[12] rather than with cell surface proteins like CD4 or CD8. This version significantly reduces false positive signals compared to the previous version, which was tagged with cell surface proteins. While we admit that this method does not provide as solid evidence of synaptic connections as EM, it is the most efficient method available to us for showing the synaptic connections from dopaminergic neurons to the mushroom body. Finally, we thank the reviewer for suggesting the literature on trans-synaptic tracing methods. Unfortunately, this method is not suitable for our goal, as it labels the entire postsynaptic neuron. In our study, we use GRASP to identify the specific dopaminergic neurons based on the synaptic locations and compartments within the mushroom body lobe. We require a labeling system at the subcellular level because, as noted, DAN-c1 forms synapses specifically in the lower peduncle (LP) of the mushroom body lobe, which is part of the axonal bundles from mushroom body neurons. Using the trans-synaptic tracing method would label the entire mushroom body, making it impossible to distinguish DAN-c1 from other DL1 dopaminergic neurons.

      (3) Previously, Honjo et al (2009) used a petri dish of 8.5 cm and a filter paper for reinforcement of 5.5 cm. In this study the petri dish was 10 cm and the size of the filter paper was not reported. That is important information because it will determine the probability of conditioning.

      A square piece of filter paper (0.25 cm<sup>2</sup>) was used to hold odorants in this study. We have added this information to the Materials and Methods.

      (4) Statistical analysis of behavioral performance in Fig 2H-I was performed by ANOVA followed by Dunnett's multiple comparisons test. Which was the control group? In each graph, were 2 independent Dunnett tests performed against the DW control group?

      We have re-analyzed the data using a two-way ANOVA followed by Tukey’s multiple comparison test, as suggested by Reviewer #1. In Figure 2f-j (previously Figure 2h-l), the DW groups serve as the control groups. In our new analysis, we compared data across all groups using Tukey’s multiple comparison test, with particular focus on comparisons to the corresponding DW control groups.

      (5) The sample sizes in the staining experiments of Figures 1-4 were not reported.

      We have added Table S2 in the supplementary materials to provide the N numbers for brain samples used in the figures.

      (6) The color code in Fig 5 is missing; I assumed that it is the same as in Figure 4e.

      We have added the color code to the legend of Figure 5.

      (7) Line 506 "0.1% QH solutions" should be 0.1% QUI solutions

      Changed.

      (8) There is no information on the availability of data

      We have added a Data Availability Statement: Data will be made available on request.

      Reviewer #3 (Recommendations for the authors):

      (1) Axes of behavioural experiments should better show the full span of possible values (-1;1) to allow a fair assessment.

      We have adjusted the axes in all learning assay graphs to a range from -1 to 1 for consistency and clarity.

      (2) Ns should better be given within the figures.

      We have added Table S2 in the supplementary materials to provide the N numbers for brain samples used in the figures. Additionally, Tables S4 to S6 include the N numbers for the learning assays. While we initially considered including the N numbers within the figure captions, we found it challenging to present this information clearly and efficiently. Therefore, we decided to summarize the N numbers in the tables instead.

      (3) Dot- or box-plots would be better for visualizing the data than means and SEMs.

      We agree with the reviewer’s suggestion. In the behavioral assay graphs, both dot plots and mean ± SEM have been included for better visualization of the data.

      (4) The paper reads as if Dop2R would reduce neuronal activity, rather than "just" cAMP levels. Such a misunderstanding should be avoided.

      We appreciate the reviewer’s comment. Under most conditions, dopamine binding to D2Rs activates the Gαi/o pathway, which inhibits adenylyl cyclase (AC) and reduces cAMP levels. This reduction in cAMP ultimately leads to decreased neuronal activity; in other words, D2R activation typically has an inhibitory effect on neurons. D2R can also exert inhibitory effects through other signaling pathways, such as inhibition of voltage-gated calcium channels via Gβγ subunits[5]. As noted in the Introduction, dopamine receptors are also involved in other signaling cascades, including the PKC, MAPK, and CaMKII pathways. In the context of our study, based on the current understanding of molecular signaling in Drosophila olfactory learning, we continue to regard the D2R-mediated AC-cAMP-PKA signaling pathway as the most important one, although we cannot rule out the involvement of other signaling pathways.

      (5) It would be better if citations were more clearly separated into ones that refer to adult flies versus work on larvae.

      We separated the citations related to adult flies from those working on larvae.

      (6) Line 81-83. DopECR is not found in mammals, is it?

      You are correct. DopECR is not found in mammals. This non-canonical receptor shares structural homology with vertebrate β-adrenergic-like receptors. It can be activated rapidly by dopamine as well as insect ecdysteroids[13,14].

      (7) Line 99: Better "a" learning center (some forms of learning work without mushroom bodies).

      We have revised the text from "the learning center" to "a learning center," as suggested by the reviewer.

      (8) Supplemental figures should be numbered according to the sequence in which they are mentioned in the text.

      We have rearranged the sequence of supplemental figures to match the order in which they are referenced in the text.

      (9) It is striking that dTRPA1-driving DANc1 is punishing in the water condition but that this effect does not summate with quinine punishment (but rather seems to impair it). Maybe you can back this up by ChR- or Chrimson-driving DANc1? Or by silencing DANc1 by GtACR1?

      We appreciate the reviewer’s suggestion. Indeed, we observed similar but not identical results when we used ChR2 to activate DAN-c1 during the training stage (Figure 5b and c). We found that activating DAN-c1 with quinine (QUI) impaired aversive learning (Figure 5b), consistent with our findings using dTRPA1 activation of DAN-c1 when trained in QUI at 34°C (Figure 2i). We propose that the over-excitation of DAN-c1, whether induced by QUI or artificial manipulation (optogenetics and thermogenetics), impairs aversive learning, which aligns with our findings for D2R knockdown (Figure 4e). However, there are some differences between dTRPA1 and ChR2 activation. While dTRPA1 activation induced aversive learning when trained with distilled water (DW) at 34°C (Figure 2i), ChR2 did not induce aversive learning under the same conditions (Figure 5c). We believe this difference is due to the varying activation levels between the two manipulations. Our optogenetic stimulus may have been stronger than the thermogenetic one, potentially leading to over-excitation in the DW group, preventing aversive learning. In the QUI group, the more severe over-excitation impaired aversive learning, producing a phenotype similar to that observed with other over-excitation methods (e.g., thermogenetics or D2R knockdown), where the phenotype reached a maximum level. We have also addressed these points in the Discussion section.

      (10) Unless I got the experimental procedure wrong, isn't it surprising that Figure S7b does not uncover a punishing effect of driving TH-GAL4 neurons?

      This optogenetic experiment with ChR2 expression in TH-GAL4 neurons was a pioneering attempt to activate DAN-c1 using ChR2. As explained in response to question (9), the failure to observe a punishing effect in the DW group when TH-GAL4 neurons were activated during training may be due to our optogenetic stimulus being too strong. This likely resulted in over-excitation of DAN-c1 (among the neurons labeled by TH-GAL4), impairing aversive learning and preventing the appearance of typical aversive behaviors.

      (11) It seems that Figure 1f´ is repeated, in a mirrored manner, in Figure 2e.

      We have removed Figure 2e, as it was deemed redundant and not necessary for this section.

      References

      (1) Saumweber, T. et al. Functional architecture of reward learning in mushroom body extrinsic neurons of larval Drosophila. Nat Commun 9, 1104 (2018). https://doi.org/10.1038/s41467-018-03130-1

      (2) Aso, Y. & Rubin, G. M. Dopaminergic neurons write and update memories with cell-type-specific rules. eLife 5 (2016). https://doi.org/10.7554/eLife.16135

      (3) Xie, T. et al. A Genetic Toolkit for Dissecting Dopamine Circuit Function in Drosophila. Cell Rep 23, 652-665 (2018). https://doi.org/10.1016/j.celrep.2018.03.068

      (4) Eschbach, C. et al. Recurrent architecture for adaptive regulation of learning in the insect brain. Nat Neurosci 23, 544-555 (2020). https://doi.org/10.1038/s41593-020-0607-9

      (5) Neve, K. A., Seamans, J. K. & Trantham-Davidson, H. Dopamine receptor signaling. J Recept Signal Transduct Res 24, 165-205 (2004). https://doi.org/10.1081/rrs-200029981

      (6) Draper, I., Kurshan, P. T., McBride, E., Jackson, F. R. & Kopin, A. S. Locomotor activity is regulated by D2-like receptors in Drosophila: an anatomic and functional analysis. Dev Neurobiol 67, 378-393 (2007). https://doi.org/10.1002/dneu.20355

      (7) Honjo, K. & Furukubo-Tokunaga, K. Induction of cAMP response element-binding protein-dependent medium-term memory by appetitive gustatory reinforcement in Drosophila larvae. J Neurosci 25, 7905-7913 (2005). https://doi.org/10.1523/JNEUROSCI.2135-05.2005

      (8) Honjo, K. & Furukubo-Tokunaga, K. Distinctive neuronal networks and biochemical pathways for appetitive and aversive memory in Drosophila larvae. J Neurosci 29, 852-862 (2009). https://doi.org/10.1523/JNEUROSCI.1315-08.2009

      (9) Yamazaki, D., Maeyama, Y. & Tabata, T. Combinatory Actions of Co-transmitters in Dopaminergic Systems Modulate Drosophila Olfactory Memories. J Neurosci 43, 8294-8305 (2023). https://doi.org/10.1523/jneurosci.2152-22.2023

      (10) Selcho, M., Pauls, D., Han, K. A., Stocker, R. F. & Thum, A. S. The role of dopamine in Drosophila larval classical olfactory conditioning. PLoS One 4, e5897 (2009). https://doi.org/10.1371/journal.pone.0005897

      (11) Kim, Y. C., Lee, H. G. & Han, K. A. D1 dopamine receptor dDA1 is required in the mushroom body neurons for aversive and appetitive learning in Drosophila. J Neurosci 27, 7640-7647 (2007). https://doi.org/10.1523/JNEUROSCI.1167-07.2007

      (12) Macpherson, L. J. et al. Dynamic labelling of neural connections in multiple colours by trans-synaptic fluorescence complementation. Nat Commun 6, 10024 (2015). https://doi.org/10.1038/ncomms10024

      (13) Abrieux, A., Duportets, L., Debernard, S., Gadenne, C. & Anton, S. The GPCR membrane receptor, DopEcR, mediates the actions of both dopamine and ecdysone to control sex pheromone perception in an insect. Front Behav Neurosci 8, 312 (2014). https://doi.org/10.3389/fnbeh.2014.00312

      (14) Lark, A., Kitamoto, T. & Martin, J. R. Modulation of neuronal activity in the Drosophila mushroom body by DopEcR, a unique dual receptor for ecdysone and dopamine. Biochim Biophys Acta Mol Cell Res 1864, 1578-1588 (2017). https://doi.org/10.1016/j.bbamcr.2017.05.015

    1. Author Response

      The following is the authors’ response to the original reviews.

      We would like to first thank the Editor as well as the two reviewers for their enthusiasm and careful evaluation of our manuscript. We also appreciate their thoughtful and constructive comments and suggestions. They did, however, have concerns regarding experimental design, data analysis, and over-interpretation of our findings. We endeavored to address these concerns through refinement of our framing, inclusion of additional new analyses, and rewriting some parts of our discussion section. We hope our response can better explain the rationale of our experimental design and data interpretation. In addition, we also acknowledge the limitations of our present study, so that it will benefit future investigations into this topic. Our detailed responses are provided below.

      Reviewer #1 (Public Review)

      This study examines whether the human brain uses a hexagonal grid-like representation to navigate in a non-spatial space constructed by competence and trustworthiness. To test this, the authors asked human participants to learn the levels of competence and trustworthiness for six faces by associating them with specific lengths of bar graphs that indicate their levels in each trait. After learning, participants were asked to extrapolate the location from the partially observed morphing bar graphs. Using fMRI, the authors identified brain areas where activity is modulated by the angles of morphing trajectories in six-fold symmetry. The strength of this paper lies in the question it attempts to address. Specifically, the question of whether and how the human brain uses grid-like representations not only for spatial navigation but also for navigating abstract concepts, such as social space, and guiding everyday decision-making. This question is of emerging importance.

      Thanks very much again for the evaluation and comments. Please find our revision plans for each comment below.

      The weak points of this paper are that its findings are not sufficiently supporting their arguments, and there are several reasons for this:

      (1) Does the grid-like activity reflect 'navigation over the social space' or 'navigation in sensory feature space'? The grid-like representation in this study could simply reflect the transition between stimuli (the length of bar graphs). Participants in this study associated each face with a specific length of two bars, and the 'navigation' was only guided by the morphing of a bar graph image. Moreover, no social cognition was required to perform the task in which the grid-like activity was estimated. For the social decision-making task, which was conducted separately, we do not know whether participants needed to navigate between faces in a social space. Instead, they can recall bar graphs associated with faces and compute the decision values by comparing the length of bars. Notably, in the trust game in this study, competence and trustworthiness are not equally important to make a decision (Equation 1). The expected value is more sensitive to one over the other. This also suggests that the space might not reflect social values but perceptual differences.

      The Reviewer raises an interesting point. We apologize for not being clear enough to address this possibility in our original manuscript and we will improve the clarity in our revision. To address this issue, we would like to break it into two sub-questions and answer them separately: 1) Are participants merely memorizing the values associated with each avatar, or do they place the avatars on a two-dimensional map in their internal representation? 2) If so, are the two dimensions of this internal representation social dimensions relating to competence and trust or sensory dimensions relating to bar height (i.e., social space or sensory space)?

      For the first question, we hope our analysis of the distance effect on the reaction time in the comparison task can address this issue. Specifically, it came from the idea that distance is a measure of similarity between two avatars in the 2D social space. The closer two avatars are, the more similar they are, hence distinguishing them will be harder and result in longer reaction time. If participants are merely memorizing the avatars as six isolated instances without integrating them into a low-dimensional map, then avatars should be equidistant (as if they were lying on the vertices of a 5-simplex), and would not show a distance effect. Therefore, we interpreted the stronger distance effect as a behavioural index of having a better internal map-like representation. This approach is adopted from the work by Park et al. (2020), where they used the distance effect to demonstrate human brains map abstract relationships among entities from piecemeal learning.
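
      The logic of the distance effect can be sketched as follows. This is only an illustration under stated assumptions: the avatar coordinates and the reaction times are hypothetical (the reaction times are constructed to shrink with distance so that the expected sign is visible), and the helper `ols_slope` is introduced here for the example. A negative slope corresponds to the distance effect described above, whereas a slope near zero would be expected if the six avatars were stored as isolated, effectively equidistant items.

```python
from itertools import combinations
from math import dist
from statistics import mean


def ols_slope(x, y):
    """Slope of the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical avatar positions as (competence, trustworthiness) on a 0-1 scale.
avatars = {
    "A": (0.1, 0.2), "B": (0.3, 0.9), "C": (0.5, 0.4),
    "D": (0.7, 0.8), "E": (0.9, 0.1), "F": (0.6, 0.6),
}

# All 15 unordered pairs of the six avatars and their Euclidean distances.
pairs = list(combinations(avatars, 2))
distances = [dist(avatars[a], avatars[b]) for a, b in pairs]

# Hypothetical reaction times (s), built to shrink with distance for illustration.
reaction_times = [1.5 - 0.5 * d for d in distances]

slope = ols_slope(distances, reaction_times)
print(f"{len(pairs)} pairs, RT-by-distance slope = {slope:.3f} s per unit distance")
```

      In a real data set the slope would be estimated per participant from measured reaction times, and its magnitude could then serve as the behavioural index discussed here.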

      For the second question of ‘social space’ vs. ‘sensory space’, our study adopted the paradigm developed by Constantinescu et al. (2016), in which they used a similar way to construct a conceptual space and found that such a space can be represented with a grid-like code in the entorhinal and prefrontal cortex. We stayed close to their original design and hoped that our work could provide, to some extent, a close replication of their result using non-spatial social concepts instead. Indeed, this led to a limitation of our study: participants passively traverse the artificial space rather than actively navigating it to make decisions or inferences. We also did not find evidence as strong as that reported in previous grid-like coding fMRI studies. This may be related to low signal quality in the medial temporal region, although we are not entirely sure. Nevertheless, we don’t think our findings contradict or disprove previous findings in any way. Here we would also like to point to the work by Park et al. (2021). Their task involved making novel inferences in a 2D social hierarchy space, and they found that a grid-like code in the entorhinal cortex and medial prefrontal cortex supports such novel inferences. Hence, we argue that results from these studies and partial evidence from our study collectively support the idea that the entorhinal cortex is important for representing abstract knowledge (spatial and non-spatial).

      (2) Does the brain have a common representation of faces in a social space? In this study, participants don't need to have a map-like representation of six faces according to their levels of social traits. Instead, they can remember the values of each trait. The evidence of neural representations of the faces in a 2-dimensional social space is lacking. The authors argued that the relationship between the reaction times and the distances between faces provides evidence of the formation of internal representations. However, this can be found without the internal representation of the relationships between faces. If the authors seek internal representations of the faces in the brain, it would be important to show that this representation is not simply driven by perceptual differences between bar graphs that participants may recall in association with each face.

      Considering these caveats, it is hard for me to agree if the authors provide evidence to support their claims.

      With regard to the common representation of faces, this is a potential limitation of our paradigm because our current task design didn’t include a stage of face presentation to properly test this question. With regard to the asymmetry between the two dimensions in determining expected value, we think that the prerequisite for identifying six-fold grid-like coding is to have an abstract space formed by orthogonal dimensions, i.e., competence and trustworthiness in our task are not correlated. In addition, the scanner task does not require computation of expected value. However, we do think that it is worth investigating whether the extent to which each dimension contributes to decision-making and inference will distort the grid-like representation of the map. Our prediction is that the entorhinal cortex will maintain a representation of the map invariant to this aspect so that it can support inferences in different contexts where different weights may be assigned to different dimensions. But this will be an interesting hypothesis for future studies to test. We hope that our revision plans, with the above considerations, can address the Reviewer’s comments.

      Reviewer #2 (Public Review)

      Summary:

      In this work, Liang et al. investigate whether an abstract social space is neurally represented by a grid-like code. They trained participants to 'navigate' around a two-dimensional space of social agents characterized by the traits of warmth and competence, then measured neural activity as participants imagined navigating through this space. The primary neural analysis consisted of three procedures: 1) identifying brain regions exhibiting the hexagonal modulation characteristic of a grid-like code, 2) estimating the orientation of each region's grid, and 3) testing whether the strength of the univariate neural signal increases when a participant is navigating in a direction aligned with the grid, compared to a direction that is misaligned with the grid.

      From these analyses, the authors find the clearest evidence of a grid-like code in the prefrontal cortex and weaker evidence in the entorhinal cortex.
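
      The three procedures summarised above can be illustrated with a common two-regressor formulation of six-fold (hexagonal) modulation. This is a hedged sketch, not the manuscript's pipeline: it assumes the signal is modelled as a weighted sum of cos(6θ) and sin(6θ), it uses noiseless synthetic data with a made-up orientation of 20°, and the function name `fit_grid_orientation` is hypothetical. In practice the orientation is usually estimated on one part of the data and the aligned versus misaligned contrast is tested on held-out data.

```python
from math import atan2, cos, degrees, radians, sin
from statistics import mean


def fit_grid_orientation(angles, signal):
    """Least-squares fit of signal ~ b0 + b_cos*cos(6θ) + b_sin*sin(6θ).

    Returns the modulation amplitude and the grid orientation in radians,
    which lies in (-30°, 30°] because of the six-fold periodicity.
    """
    mean_y = mean(signal)
    y = [v - mean_y for v in signal]
    c_raw = [cos(6 * a) for a in angles]
    s_raw = [sin(6 * a) for a in angles]
    mean_c, mean_s = mean(c_raw), mean(s_raw)
    c = [v - mean_c for v in c_raw]  # centring absorbs the intercept b0
    s = [v - mean_s for v in s_raw]
    # Solve the 2x2 normal equations for the two regressors.
    cc = sum(ci * ci for ci in c)
    ss = sum(si * si for si in s)
    cs = sum(ci * si for ci, si in zip(c, s))
    cy = sum(ci * yi for ci, yi in zip(c, y))
    sy = sum(si * yi for si, yi in zip(s, y))
    det = cc * ss - cs * cs
    b_cos = (cy * ss - sy * cs) / det
    b_sin = (sy * cc - cy * cs) / det
    return (b_cos ** 2 + b_sin ** 2) ** 0.5, atan2(b_sin, b_cos) / 6


# Synthetic trajectories every 10 degrees; true orientation and amplitude are made up.
angles = [radians(d) for d in range(0, 360, 10)]
true_orientation = radians(20)
signal = [2.0 + 0.8 * cos(6 * (a - true_orientation)) for a in angles]

# Steps 1 and 2: detect the hexagonal modulation and estimate its orientation.
amplitude, orientation = fit_grid_orientation(angles, signal)

# Step 3: directions within 15 degrees of a grid axis count as aligned.
aligned = [v for a, v in zip(angles, signal) if cos(6 * (a - orientation)) > 0]
misaligned = [v for a, v in zip(angles, signal) if cos(6 * (a - orientation)) <= 0]
aligned_mean, misaligned_mean = mean(aligned), mean(misaligned)

print(f"orientation = {degrees(orientation):.1f} deg, amplitude = {amplitude:.2f}")
print(f"aligned mean = {aligned_mean:.3f}, misaligned mean = {misaligned_mean:.3f}")
```

      Because the orientation is divided by six, estimates that differ by a multiple of 60° describe the same grid, which is why alignment is tested with cos(6(θ − φ)) rather than with the raw angular difference.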

      Strengths:

      The work demonstrates the existence of a grid-like neural code for a socially-relevant task, providing evidence that such coding schemes may be relevant for a variety of two-dimensional task spaces.

      Thank you very much again for your careful evaluation and thoughtful comments. Please find our response to the comments below.

      Weaknesses:

      In various parts of this manuscript, the authors appear to use a variety of terms to refer to the (ostensibly) same neural regions: prefrontal cortex, frontal pole, ventromedial prefrontal cortex (vmPFC), and orbitofrontal cortex (OFC). It would be useful for the authors to use more consistent terminology to avoid confusing readers.

      Thanks for pointing out the inconsistent use of terms. We will use more consistent terminology in the revision of our manuscript.

      Claims about a grid code in the entorhinal cortex are not well-supported by the analyses presented. The whole-brain analysis does not suggest that the entorhinal cortex exhibits hexagonal modulation; the strength of the entorhinal BOLD signal does not track the putative alignment of the grid code there; multivariate analyses do not reveal any evidence of a grid-like representational geometry.

      On a conceptual level, it is not entirely clear how this work advances our understanding of grid-like encoding of two-dimensional abstract spaces, or of social cognition. The study design borrows heavily from Constantinescu et al. 2016, which is itself not an inherent weakness, but the Constantinescu et al. study already suggests that grid codes are likely to underlie two-dimensional spaces, no matter how abstract or arbitrary. If there were a hypothesis that there is something unique about how grid codes operate in the social domain, that would help motivate the search for social grid codes specifically, but no such theory is provided. The authors do note that warmth and competence likely have ecological importance as social traits, but other past studies have used slightly different social dimensions without any apparent loss of generality (e.g., Park et al. 2021). There are some (seemingly) exploratory analyses examining how individual difference measures like social anxiety and avoidance might affect the brain and behavior in this study, but a strong theoretical basis for examining these particular measures is lacking.

      We acknowledge that we used very similar dimensions to the work by Park et al. (2021). While Park and colleagues (2021) took a more innovative and rigorous approach, we tried to stay close to the original design by Constantinescu et al. (2016) with the hope that our work could provide, to some extent, a close replication of their result. Our data were collected before the 2021 paper came out and, as the comment points out, we did not find evidence as complete and convincing as in these previous grid-like coding fMRI papers. This may be due to low signal quality in the medial temporal region, although we are not entirely sure. But we don’t think our current findings contradict or disprove previous findings in any way.

      I found it difficult to understand the analyses examining whether behavior (i.e., reaction times) and individual difference measures (i.e., social anxiety and avoidance) can be predicted by the hexagonal modulation strength in some region X, conditional on region X having a similar estimated grid alignment with some other region Y. It is possible that I have misunderstood the authors' logic and/or methodology, but I do not feel comfortable commenting on the correctness or implications of this approach given the information provided in the current version of this manuscript.

      We apologize for not being clear enough in the manuscript and we will improve the clarity in our revision. This exploratory analysis aims to examine whether there is any correlation between the strength of the grid-like representation of the social value map and behavioral indicators of a map-like representation, and to test whether there is any correlation between the strength of the grid-like representation of this social value map and participants’ social traits. For the behavioral indicator, we used the distance effect in the reaction time of the comparison task outside the scanner. The closer a pair of avatars are, the more similar they are; hence distinguishing them will be harder and will result in longer reaction times when making comparison judgements. If participants are merely memorizing the avatars as six isolated instances without integrating them into a map, all avatars should be equidistant and there wouldn’t be a distance effect. We interpreted stronger grid-like activity as a neural index of a better representation of the 2D social space, and we interpreted a stronger distance effect as a behavioral index of having a better internal map-like representation.

      It was puzzling to see passing references to multivariate analyses using representational similarity analysis (RSA) in the main text, given that RSA is only used in analyses presented in the supplementary material.

      We speculated that RSA in the entorhinal ROI might be more sensitive than the whole-brain univariate analysis for identifying a grid-like code, because a previous paper on grid-like codes in olfactory space (Bao et al., 2019) didn’t identify a grid-like representation with univariate analysis but did identify it with RSA. However, we failed to find evidence of a grid-like code in the entorhinal ROI aligned to its own putative grid orientation with the RSA approach. We reported this result in the main text to show that we carried out a relatively thorough investigation to test the hypothesis using various approaches, and therefore decided to add references to the RSA approach in the main text as well.

      Reviewer #3 (Public Review)

      Liang and colleagues set out to test whether the human brain uses distance and grid-like codes in social knowledge using a design where participants had to navigate in a two-dimensional social space based on competence and warmth during an fMRI scan. They showed that participants were able to navigate the social space and found distance-based codes as well as grid-like codes in various brain regions, and the grid-like code correlated with behavior (reaction times).

      On the whole, the experiment is designed appropriately for testing for distance-based and grid-like codes and is relatively well-powered for this type of study, with a large amount of behavioral training per participant. They revealed that a number of brain regions correlated positively or negatively with distance in the social space, and found grid-like codes in the frontal polar cortex and posterior medial entorhinal cortex, the latter in line with prior findings on grid-like activity in the entorhinal cortex. The current paper seems quite similar conceptually and in design to previous work, most notably by Park et al., 2021, Nature Neuroscience.

      Thanks very much again for your careful evaluation and comments. Please find our response to the comments below.

      Below, I raise a few issues and questions on the evidence presented here for a grid-like code as the basis of navigating abstract social space or social knowledge.

      (1) The authors claim that this study provides evidence that humans use a spatial / grid code for abstract knowledge like social knowledge.

      These data do not specifically add anything new to this argument. As with almost all studies that test for a grid code in a similar "conceptual" space (not only the current study), the problem is that when the space is not a uniform, square/circular, 2-dimensional space, there is no reason the code will be perfectly grid-like, i.e., show six-fold symmetry. In real-world scenarios of social space (as well as navigation, semantic concepts), it must be higher dimensional - or at least more than two-dimensional. It is unclear if this generalizes to larger spaces where not all parts of the space are relevant. Modelling work from Tim Behrens' lab (e.g., Whittington et al., 2020) and Bradley Love's lab (e.g., Mok & Love, 2019) has shown/argued this to be the case. In experimental work, like in mazes from the Mosers' labs (e.g., Derdikman et al., 2009), or trapezoid environments from the O'Keefe lab (Krupic et al., 2015), there are distortions in mEC cells, and these cells would not pass as grid cells in terms of the six-fold symmetry criterion.

      The authors briefly discuss the limitations of this at the very end but do not really say how this speaks to the goal of their study and the claim that social space or knowledge is organized as a grid code and if it is in fact used in the brain in their study and beyond. This issue deserves to be discussed in more depth, possibly referring to prior work that addressed this, and raising the issue for future work to address the problem - or if the authors think it is a problem at all.

      Thanks very much for the references to the papers that we haven’t considered enough in our discussion. We will endeavour to discuss the topic in more depth in our revision. In summary, we raise this discussion point because various research groups have found gridlike representations in 2D artificial conceptual space. We think that the next step for a stronger claim would be to find the representation of more spontaneous non-spatial maps.

      Data and analysis

      (2) Concerning the negative correlation of distance with activation in the fusiform gyrus and visual cortex: this is a slightly puzzling but potentially interesting finding. However, could this be related to reaction times? The larger the distance, the longer the reaction times, so the original finding might reflect larger activations with smaller distances.

      Thanks very much for the suggestion. However, we didn’t find a correlation between response time in the choice stage of the scanner task and the negative distance activation in the fusiform gyrus (Figures below). Meanwhile, because the morph period in each trial remains the same, the negative correlation of distance with activation in the fusiform gyrus could also be interpreted as a positive correlation of morphing speed with activation in the fusiform gyrus. Indeed, stronger negative activation indicates larger activation for smaller distances, but we are uncertain what it indicates concerning the functional role of the fusiform gyrus in our current task.

      Author response image 1.

      (3) Concerning the correlation of grid-like activity with behavior: is the correlation with reaction time just about how long people took (rather than a task-related neural signal)? The authors have only reported correlations with reaction time. The issue here is that the duration of reaction times also relates to the starting positions of each trial and where participants will navigate to. Considering the speed-accuracy tradeoff, could performance accuracy be negatively correlated with these grid consistency metrics? Or it could be positively correlated, which would suggest the grid signal reflects a good representation of the task.

      We apologize for not being clear enough in the manuscript and we will improve the clarity in our revision. The reaction time used to calculate the distance effect is from a task outside the scanner. The closer a pair of avatars are, the more similar they are; hence distinguishing them will be harder and will result in longer reaction times when making comparison judgements. If participants were merely memorizing the avatars as six isolated instances without integrating them into a map, all avatars should be equidistant and there wouldn’t be a distance effect. We interpreted stronger grid-like activity as a neural index of a better representation of the 2D social space, and we interpreted a stronger distance effect as a behavioural index of a better internal map-like representation. This was the motivation behind this analysis.
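
      The behavioural distance effect described above can be illustrated with a minimal sketch. The avatar coordinates and reaction times below are hypothetical values invented for illustration, not data from the study, and the helper names are our own. The sketch only shows how pairwise Euclidean distance in a 2D space could be correlated with comparison reaction time; a negative correlation would correspond to slower judgements for closer pairs.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pairwise_distances(coords):
    """Euclidean distance for every unordered pair of named points."""
    return {
        (a, b): math.dist(coords[a], coords[b])
        for a, b in combinations(sorted(coords), 2)
    }


# Hypothetical avatar positions on two social dimensions (illustration only).
avatars = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 2.0), "D": (3.0, 1.0)}
distances = pairwise_distances(avatars)

# Hypothetical mean reaction times (seconds) for each pairwise comparison.
reaction_times = {
    ("A", "B"): 1.90, ("A", "C"): 1.60, ("A", "D"): 1.20,
    ("B", "C"): 1.50, ("B", "D"): 1.55, ("C", "D"): 1.25,
}

pairs = sorted(distances)
distance_effect = pearson_r(
    [distances[p] for p in pairs], [reaction_times[p] for p in pairs]
)
print(f"distance effect (r) = {distance_effect:.3f}")
```

      Under this reading, a participant who treated the avatars as isolated, equidistant instances would show no systematic relationship between distance and reaction time.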

      References

      Bao, X., Gjorgieva, E., Shanahan, L. K., Howard, J. D., Kahnt, T., & Gottfried, J. A. (2019). Grid-like Neural Representations Support Olfactory Navigation of a Two-Dimensional Odor Space. Neuron, 102(5), 1066-1075 e1065. https://doi.org/10.1016/j.neuron.2019.03.034

      Constantinescu, A. O., O'Reilly, J. X., & Behrens, T. E. J. (2016). Organizing conceptual knowledge in humans with a gridlike code. Science, 352(6292), 1464-1468. https://doi.org/10.1126/science.aaf0941

      Park, S. A., Miller, D. S., & Boorman, E. D. (2021). Inferences on a multidimensional social hierarchy use a grid-like code. Nat Neurosci, 24(9), 1292-1301. https://doi.org/10.1038/s41593-021-00916-3

      Park, S. A., Miller, D. S., Nili, H., Ranganath, C., & Boorman, E. D. (2020). Map Making: Constructing, Combining, and Inferring on Abstract Cognitive Maps. Neuron, 107(6), 1226-1238 e1228. https://doi.org/10.1016/j.neuron.2020.06.030

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      The Bagnat and Rawls groups' previous published work (Park et al., 2019) described the kinetics and genetic basis of protein absorption in a specialized cell population of young vertebrates termed lysosome-rich enterocytes (LREs). In this study they seek to understand how the presence and composition of the microbiota impacts the protein absorption function of these cells and reciprocally, how diet and intestinal protein absorption function impact the microbiome.

      Strengths of the study include the functional assays for protein absorption performed in live larval zebrafish, which provides detailed kinetics on protein uptake and degradation with anatomic precision, and the gnotobiotic manipulations. The authors clearly show that the presence of the microbiota or of certain individual bacterial members slows the uptake and degradation of multiple different tester fluorescent proteins.

      To understand the mechanistic basis for these differences, the authors also provide detailed single-cell transcriptomic analyses of cells isolated based on both an intestinal epithelial cell identity (based on a transgenic marker) and their protein uptake activity. The data generated from these analyses, presented in Figures 3-5, are valuable for expanding knowledge about zebrafish intestinal epithelial cell identities, but of more limited interest to a broader readership. Some of the descriptive analysis in this section is circular because the authors define subsets of LREs (termed anterior and posterior) based on their fabp2 expression levels, but then go on to note transcriptional differences between these cells (for example in fabp2) that are a consequence of this initial subsetting.

      Inspired by their single-cell profiling and by previous characterization of the genes required for protein uptake and degradation in the LREs, the authors use quantitative hybridization chain reaction RNA-fluorescent in situ hybridization to examine transcript levels of several of these genes along the length of the LRE intestinal region of germ-free versus mono-associated larvae. They provide good evidence for reduced transcript levels of these genes that correlate with the reduced protein uptake in the mono-associated larval groups.

      The final part of the study (shown in Figure 7) characterized the microbiomes of 30-day-old zebrafish reared from 6-30 days on defined diets of low and high protein and with or without homozygous loss of the cubn gene required for protein uptake. The analysis of these microbiomes notes some significant differences between fish genotypes by diet treatments, but the discussion of these data does not provide strong support for the hypothesis that "LRE activity has reciprocal effects on the gut microbiome". The most striking feature of the MDS plot of Bray Curtis distance between zebrafish samples shown in Figure 7B is the separation by diet independent of host genotype, which is not discussed in the associated text. Additionally, the high protein diet microbiomes have a greater spread than those of the low protein treatment groups, with the high protein diet cubn mutant samples being the most dispersed. This pattern is consistent with the intestinal microbiota under a high protein diet regimen and in the absence of protein absorption machinery being more perturbed in stochastic ways than in hosts competent for protein uptake, consistent with greater beta dispersal associated with more dysbiotic microbiomes (described as the Anna Karenina principle here: https://pubmed.ncbi.nlm.nih.gov/28836573/). It would be useful for the authors to provide statistics on the beta dispersal of each treatment group.

      Overall, this study provides strong evidence that specific members of the microbiota differentially impact gene expression and cellular activities of enterocyte protein uptake and degradation, findings that have a significant impact on the field of gastrointestinal physiology. The work refines our understanding of intestinal cell types that contribute to protein uptake and their respective transcriptomes. The work also provides some evidence that microbiomes are modulated by enterocyte protein uptake capacity in a diet-dependent manner. These latter findings provide valuable datasets for future related studies.

      We thank the Reviewer for their thorough and kind assessment. We appreciate the suggestion for edits and for pointing out areas that needed further clarification.

      One point in need of further explanation is the use of fabp6 (referred to as fabp2 by the reviewer) to define anterior LREs and their gene expression pattern, which includes high levels of fabp6, something that was deemed a “circular argument” by the reviewer. The rationale for using fabp6 as a reference is that we were able to define its spatial pattern in relation to other LRE markers and the neighboring ileocyte population using transgenic markers (Lickwar et al., 2017; Wen et al., 2021). Thus, far from being a circular argument, using fabp6 allowed us to identify other markers that are differentially expressed between anterior and posterior LREs, which share a core program that we highlight in our study. In the revised manuscript, we clarified this point (lines 166 – 169).

      We followed the Reviewer’s suggestion to test if LRE activity and dietary protein affected beta dispersal. Our analyses revealed that beta dispersion was not significantly different between our experimental conditions. We added details about this analysis (lines 384 – 386) and a new supplemental figure panel (Figure S7C).
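
      For readers unfamiliar with beta dispersion, the idea can be sketched as follows. This is a simplified illustration and not the analysis used in the study, which is not specified in this response: formal dispersion tests typically measure each sample's distance to its group centroid in an ordination space, whereas the sketch below uses the mean pairwise Bray-Curtis dissimilarity within each group as a rough proxy. The group names and taxon counts are hypothetical.

```python
from itertools import combinations


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


def mean_within_group_dissimilarity(samples):
    """Average Bray-Curtis dissimilarity over all sample pairs in one group."""
    pairs = list(combinations(samples, 2))
    return sum(bray_curtis(a, b) for a, b in pairs) / len(pairs)


# Hypothetical taxon counts (rows are samples, columns are taxa).
groups = {
    "group_a": [[10, 5, 1], [9, 6, 1], [11, 4, 1]],
    "group_b": [[10, 5, 1], [2, 12, 2], [6, 1, 9]],
}

dispersion = {
    name: mean_within_group_dissimilarity(samples)
    for name, samples in groups.items()
}
for name, value in dispersion.items():
    print(f"{name}: mean within-group dissimilarity = {value:.3f}")
```

      Whether a difference in such a proxy is meaningful would still need a permutation-based or comparable test, which this sketch omits.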

      Reviewer #2 (Public review):

      Summary:

      The authors set out to determine how the microbiome and host genotype impact host protein-based nutrition.

      Strengths:

      The quantification of protein uptake dynamics is a major strength of this work and the sensitivity of this assay shows that the microbiome and even mono-associated bacterial strains dampen protein uptake in the host by causing down-regulation of genes involved in this process rather than a change in cell type.

      The use of fluorescent proteins in combination with transcript clustering in the single cell seq analysis deepens our understanding of the cells that participate in protein uptake along the intestine. In addition to the lysosome-rich enterocytes (LRE), subsets of enteroendocrine cells, acinar, and goblet cells also take up protein. Intriguingly, these non-LRE cells did not show lysosomal-based protein degradation; but importantly, analysis of the transcripts upregulated in these cells includes dab2 and cubn, genes shown previously as being essential to protein uptake.

      The derivation of zebrafish mono-associated with single strains of microbes paired with HCR to localize and quantify the expression of host protein absorption genes shows that different bacterial strains suppress these genes to variable extents.

      The analysis of microbiome composition, when host protein absorption is compromised in cubn-/- larvae or by reducing protein in the food, demonstrates that changes to host uptake can alter the abundance of specific microbial taxa like Aeromonas.

      Weaknesses:

      The finding that neurons are positive for protein uptake in the single-cell data set is not adequately discussed. It is curious because the cldn:GFP line used for sorting does not mark neurons and if the neurons are taking up mCherry via trans-synaptic uptake from EECs, those neurons should be mCherry+/GFP-; yet methods indicate GFP+ and GFP+/mCherry+ cells were the ones collected and analyzed.

      We thank the Reviewer for the kind and positive assessment of our work, for suggestions to improve the accessibility and clarity of the manuscript, and for pointing out an issue related to a neuronal population that needed further clarification.

      It turns out that there is a population of neurons that express cldn15la. They are not easily visualized by microscopy because IECs express this gene much more highly. However, the endogenous cldn15la transcripts can be found in neurons, as shown in a recently published dataset (PMID: 35108531) as well as in this study. We added a discussion point to clarify this issue (lines 463 – 465).

      Reviewer #3 (Public review):

      Summary:

      Childers et al. address a fundamental question about the complex relationship within the gut: the link between nutrient absorption, microbial presence, and intestinal physiology. They focus on the role of lysosome-rich enterocytes (LREs) and the microbiota in protein absorption within the intestinal epithelium. By using germ-free and conventional zebrafishes, they demonstrate that microbial association leads to a reduction in protein uptake by LREs. Through impressive in vivo imaging of gavaged fluorescent proteins, they detail the degradation rate within the LRE region, positioning these cells as key players in the process. Additionally, the authors map protein absorption in the gut using single-cell sequencing analysis, extensively describing LRE subpopulations in terms of clustering and transcriptomic patterns. They further explore the monoassociation of ex-germ-free animals with specific bacterial strains, revealing that the reduction in protein absorption in the LRE region is strain-specific.

      Strengths:

      The authors employ state-of-the-art imaging to provide clear evidence of the protein absorption rate phenotype, focusing on a specific intestinal region. This innovative method of fluorescent protein tracing expands the field of in vivo gut physiology.

      Using both conventional and germ-free animals for single-cell sequencing analysis, they offer valuable epithelial datasets for researchers studying host-microbe interactions. By capitalizing on fluorescently labelled proteins in vivo, they create a new and specific atlas of cells involved in protein absorption, along with a detailed LRE single-cell transcriptomic dataset.

      Weaknesses:

      While the authors present tangible hypotheses, the data are primarily correlative, and the statistical methods are inadequate. They examine protein absorption in a specific, normalized intestinal region but do not address confounding factors between germ-free and conventional animals, such as size differences, transit time, and oral gavage, which may impact their in vivo observations. This oversight can lead to bold conclusions, where the data appear valuable but require more nuance.

      The sections of the study describing the microbiota or attempting functional analysis are elusive, with related data being overinterpreted. The microbiome field has long used 16S sequencing to characterize the microbiota, but its variability due to experimental parameters limits the ability to draw causative conclusions about the link between LRE activity, dietary protein, and microbial composition. Additionally, the complex networks involved in dopamine synthesis and signalling cannot be fully represented by RNA levels alone. The authors' conclusions on this biological phenomenon based on single-cell data need support from functional and in vivo experiments.

      We thank the Reviewer for their assessment and for pointing out some areas that needed to be explained better and/or discussed.

      The Reviewer mentions some potential confounding factors (i.e., size differences, transit time, oral gavage) in the gnotobiology experiments. We would like to convey that these aspects have been addressed in our experimental design and are now clarified in the revised manuscript: 1- larval sizes were recorded and found to be similar between GF and monoassociated larvae (Figure S6A); 2- while intestinal transit time may be affected by microbes and is a topic of interest, in our assay luminal mCherry cargo is present at high levels throughout the gut and is not limiting at any point during the experiment; 3- gavage, which is necessary for quantitative assays, is indeed an experimental manipulation that may somehow alter the subjects (the same is true for microscopy and virtually any research method). However, it cannot explain differences between GF and CV or alter our conclusions via microbial or dietary effects. We now elaborate on the former point in the revised discussion (line 426). A new panel has been added to Fig. S6 to show that standard length was similar in GF and monoassociated larvae (Figure S6A).

      We are aware that microbial community composition is often highly variable between experiments and this necessitates adequately high biological replication and inclusion of internal controls to allow conclusions to be drawn. Nevertheless, studies evaluating the utility of 16S rRNA gene sequencing have found that this analysis reveals important impacts of environmental factors on the gut microbiome (PMIDs: 21346791, 31409661, 31324413). Our results provide further evidence that 16S rRNA gene sequencing remains a useful method to detect perturbations to the zebrafish gut microbiome. Reproducing previous findings, we detected many of the core zebrafish microbiota strains in our samples that have been identified by other studies (PMIDs: 26339860, 21472014, 17055441). To ensure the robustness of our results, we included several biological replicates for each condition, co-housed genotypes and included large sample sizes to minimize environmental variability between groups. In response to this reviewer concern, we have added a supplemental beta diversity plot and statistical analyses showing that the microbiomes in our larvae were significantly different from the diets or tank water (Figure S7A). This analysis shows that the host environment influenced microbial community composition (lines 376 – 378). We also added an additional supplemental panel and performed analysis showing that the experimental replicates (i.e., different tanks) were not a significant source of variation in this study (lines 378 – 380) (Figure S7B). This result underscores that the microbiota in these larvae were influenced by both the host and diet.

      Regarding dopamine pathways, we acknowledge that it involves complex biology that will require dedicated studies. In this work, we simply point out gene expression patterns we find interesting as they may inform future studies.

      Finally, the Reviewer mentions the use of inadequate statistical methods for some analyses without specifying them or indicating alternative analyses; only the need to justify the use of two-way ANOVA is made explicit. On this point, we respectfully disagree and would like to emphasize that we use statistical methods that are standard in the field (PMID: 37707499). We nevertheless added a justification for the use of two-way ANOVA where appropriate (lines 635-637, 653-654, 773-776). The two-way ANOVA test was used to compare fluorescence profiles of gavaged cargoes or HCR probes along the length of the LRE region. This test accounts for differences in fluorescence between experimental conditions in segments (30 μm) along the LRE region (~300 μm). This allows us to capture differences in fluorescence between experimental conditions while accounting for heterogeneity in the LRE region. Please see our comment below for more information about our use of the 2-way ANOVA.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Please provide in the materials and methods the strain identifiers and sources of the bacteria used in the study.

      Thank you for the suggestions. Strain identifiers and source information were added to the methods (lines 576-579).

      Reviewer #2 (Recommendations for the authors):

      (1) This is a very satisfying and thorough analysis of the reciprocal influence of diet, microbiome, and host genotype on protein absorption by the host. Below I make suggestions that mainly relate to making the paper more accessible to a broader audience.

      (2) Line 233 Starts a section that reports the findings of the scRNA dataset. The writing is inconsistent with respect to how the genes are listed: whether abbreviation only or spelled out followed by abbreviation. I prefer the latter. For example, slc10a2 is a bile acid Na cotransporter but for those not in the know, they would have to look this up. Perhaps adding a supplementary table that provides a gene list of those discussed in the text with abbreviation/spelled-out, and KEGG terms.

      Thank you for pointing out inconsistent gene labeling. We have revised the text with spelled out gene names followed by abbreviations.

      (3) Line 461 Where did the neurons come from when you were sorting cldn+ cells?

      Neuronal expression of cldn15la was detected in our data and other published datasets (PMID: 37995681, 35108531). We added a note to the text clarifying that neuronal cells can express cldn15la (lines 463-465).

      (4) Line 561 1x tricaine should be converted to percentage in solution or concentration throughout.

      The tricaine concentration was 0.2 mg/mL. We added this detail to the methods (line 596).

      (5) Line 612 Please clarify how normalizations are carried out: is it to the peak value in the germ-free condition? CV never reaches 1.

      AUC values were normalized to the peak value in the GF condition at 60 minutes PG. We clarified this step in the methods (lines 618-619).
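
      The normalization step can be sketched as follows. This is a minimal illustration under stated assumptions: we assume the AUC is a trapezoidal area under a fluorescence-versus-position profile along the LRE region, and the positions, fluorescence values, and time points are hypothetical. Every AUC is divided by the GF value at 60 minutes post-gavage, so the reference condition equals 1.

```python
def trapezoid_auc(positions_um, fluorescence):
    """Area under a fluorescence profile using the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(positions_um)):
        width = positions_um[i] - positions_um[i - 1]
        area += width * (fluorescence[i] + fluorescence[i - 1]) / 2
    return area


# Hypothetical positions along the LRE region (micrometres).
positions_um = [0, 100, 200, 300]

# Hypothetical fluorescence profiles keyed by (condition, minutes post-gavage).
profiles = {
    ("GF", 20): [0.0, 2.0, 3.0, 1.0],
    ("GF", 60): [0.0, 4.0, 6.0, 2.0],
    ("CV", 20): [0.0, 1.0, 1.5, 0.5],
    ("CV", 60): [0.0, 2.0, 3.0, 1.0],
}

auc = {key: trapezoid_auc(positions_um, values) for key, values in profiles.items()}

# Normalize every AUC to the GF condition at 60 minutes (assumed reference).
reference = auc[("GF", 60)]
normalized = {key: value / reference for key, value in auc.items()}

for key in sorted(normalized):
    print(key, round(normalized[key], 3))
```

      With this scheme the CV values stay below 1 whenever CV uptake is lower than the GF reference, which matches the reviewer's observation that CV never reaches 1.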

      (6) Line 654-663 I think mCherry here should be mTurquoise?

      Thank you for catching this typo. We corrected it in the text.

      (7) In Figure 1 Please consider adding a color so that magenta does not represent BOTH germ-free AND mCherry.

      Due to the many colors of fluorescent proteins and HCR probes in this paper, we were not able to find an alternative plot line color to represent GF.

      (8) In Figure 2 I suggest consistency with respect to the order you present GF/CV

      Figure 1 GF->CV

      Figure 2 CV->GF

      My preference is GF->CV

      Images in Figure 2 were re-ordered following reviewer’s recommendation.

      Here, 20 minute time point also appears qualitatively different between GF and CV.

      There can be slight differences in LREs between individuals. These images were selected because they represented the average differences in the amount of mTurquoise degradation activity that occurred between 20 – 60 minutes post-flushing in the GF and CV conditions.

      In Figure 3E Figure legend refers to being able to see BSA in vacuoles. The image should be modified to show this- currently too small.

      In response, we enlarged the confocal microscopy images showing DQ red BSA in the LRE region (Figure 3E). We added a panel with confocal microscopy images of the LREs in 6 dpf larva gavaged with DQ red BSA (Figure S3F). These images show that DQ red BSA fluorescence was localized to the LRE lysosomal vacuole.

      In Figure 5D, Posterior LRE should be pink not green in the key to the right of the heatmap.

      Thank you for catching this error. We have corrected the colors (Figure 5D).

      Reviewer #3 (Recommendations for the authors):

      (1) Introduction and context:

      Expand the introduction to include more background on microbial-mediated protein absorption, with references to relevant findings in Drosophila. This will provide a stronger foundation for the study's contributions to the field.

      Thank you for this suggestion. We added information about microbe-mediated amino acid harvest in Drosophila to the introduction (lines 49-53).

      (12) Methodological suggestions:

      Measure and report differences between germ-free (GF) and conventional (CV) animals, such as transit time, to account for potential confounding factors in protein absorption dynamics.

      We respectfully assert that a transit assay is not required for this study and could actually create confusion as an effect in transit time could be interpreted as a contributing factor when it is in fact not the case due to the experimental design. This is because the concentration of luminal protein was equivalent in GF and CV larvae (Figure S1E), so the LREs had equal saturating access to those proteins in both conditions. Furthermore, we showed the microbiota did not degrade fluorescent protein (Figure S1F). Therefore, we feel confident that there was lower protein uptake in the LREs of CV larvae because the microbiome exerted regulatory effects on LRE activity.

      Provide detailed information on the gating strategy used for single-cell sorting to enhance the dataset's utility and support claims about cell changes.

      The methods we used for sorting cells were previously described (PMID: 31474562). In this manuscript, we describe them under the heading “Fluorescence activated cell sorting for single cell RNA-sequencing.”

      Explain the "GeneRatio" metric in figure legends for clarity.

      The GeneRatio is the ratio of genes associated with each individual GO term to the number of genes associated with the domain. An explanation was added to the caption (Figure S3C).
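
      As a small worked example of this definition, the sketch below computes a GeneRatio for one term. It reflects our reading of the definition above, in which the denominator is the number of input genes annotated anywhere in the GO domain; the gene identifiers and annotation sets are hypothetical placeholders, not real annotations.

```python
def gene_ratio(input_genes, term_genes, domain_genes):
    """Fraction of domain-annotated input genes that belong to one GO term."""
    annotated_inputs = set(input_genes) & set(domain_genes)
    if not annotated_inputs:
        return 0.0
    hits = annotated_inputs & set(term_genes)
    return len(hits) / len(annotated_inputs)


# Hypothetical gene sets (placeholder identifiers).
input_genes = {"gene_1", "gene_2", "gene_3", "gene_4", "gene_5"}
domain_genes = {"gene_1", "gene_2", "gene_3", "gene_4", "gene_9"}
term_genes = {"gene_1", "gene_2", "gene_9"}

ratio = gene_ratio(input_genes, term_genes, domain_genes)
print(f"GeneRatio = {ratio:.2f}")
```

      Here four of the five input genes carry an annotation in the domain and two of those belong to the term, giving a ratio of 2/4.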

      (13) Visual and statistical improvements:

      Include images of labeled peptidases within lysosome-rich enterocytes (LREs) to reinforce findings.

      Thank you for the suggestion. We added images of labeled peptidases in the LRE region (Figure S6E-D).

      For Panels 4-F and 5-D, consider using violin plots of selected genes to improve clarity and emphasize major ideas.

      In Figure 4F, the heatmap shows multiple genes were upregulated in mCherry-positive cells. We tried the plotting suggested by the reviewer and felt that violin plots could not convey this message as clearly. Likewise, the heatmap in Figure 5D effectively shows the gradient of expression between ileocytes, anterior and posterior LREs.

      Strengthen statistical analysis by employing more rigorous methods and justifying their selection, such as using two-way ANOVA where appropriate.

      The two-way ANOVA was used to quantify protein uptake or HCR probe fluorescence along the length of the LRE region. This statistical test allowed us to compare differences in fluorescence between experimental conditions in multiple LRE segments (see Author response image 1 below for example). As our assays show, the LRE region is heterogeneous, with segments showing different levels of activity and gene expression. The two-way ANOVA is appropriate because it allows us to account for this heterogeneity by comparing fluorescence across multiple segments.

      Author response image 1.

      Our figures display these fluorescence levels in line plots (above, left) rather than bar plots (above, right). The results are easier to visualize and interpret in line plots, and they display the fluorescence profiles in greater detail.
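
      To make the structure of this comparison concrete, the following sketch decomposes variance for a balanced two-way design with the factors condition and LRE segment. It is an illustration under simplifying assumptions and not the analysis pipeline used in the study: the fluorescence values are hypothetical, every cell has the same number of replicates, observations are treated as independent, and only sums of squares and F statistics are computed (p-values would additionally require the F distribution).

```python
def two_way_anova(data):
    """Balanced two-way ANOVA; data[condition][segment] is a list of replicates."""
    conditions = list(data)
    segments = list(data[conditions[0]])
    n_rep = len(data[conditions[0]][segments[0]])
    n_cond, n_seg = len(conditions), len(segments)

    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    grand = mean(v for c in conditions for s in segments for v in data[c][s])
    cond_mean = {c: mean(v for s in segments for v in data[c][s]) for c in conditions}
    seg_mean = {s: mean(v for c in conditions for v in data[c][s]) for s in segments}
    cell_mean = {(c, s): mean(data[c][s]) for c in conditions for s in segments}

    # Sums of squares for each source of variation.
    ss = {
        "condition": n_seg * n_rep * sum((cond_mean[c] - grand) ** 2 for c in conditions),
        "segment": n_cond * n_rep * sum((seg_mean[s] - grand) ** 2 for s in segments),
        "interaction": n_rep * sum(
            (cell_mean[c, s] - cond_mean[c] - seg_mean[s] + grand) ** 2
            for c in conditions for s in segments
        ),
        "residual": sum(
            (v - cell_mean[c, s]) ** 2
            for c in conditions for s in segments for v in data[c][s]
        ),
    }
    df = {
        "condition": n_cond - 1,
        "segment": n_seg - 1,
        "interaction": (n_cond - 1) * (n_seg - 1),
        "residual": n_cond * n_seg * (n_rep - 1),
    }
    ms_residual = ss["residual"] / df["residual"]
    f_stat = {
        source: (ss[source] / df[source]) / ms_residual
        for source in ("condition", "segment", "interaction")
    }
    return ss, df, f_stat


# Hypothetical fluorescence values: three larvae per condition and segment.
profiles = {
    "GF": {"seg1": [10, 12, 11], "seg2": [20, 22, 21], "seg3": [15, 14, 16]},
    "CV": {"seg1": [8, 9, 7], "seg2": [12, 13, 11], "seg3": [10, 9, 11]},
}

ss, df, f_stat = two_way_anova(profiles)
for source, value in f_stat.items():
    print(f"{source}: F = {value:.2f} (df = {df[source]}, {df['residual']})")
```

      The condition term captures the overall difference between treatments, the segment term captures heterogeneity along the LRE region, and the interaction term indicates whether the treatment difference varies by segment.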

      (14) Technical corrections:

      Correct figure references: Figure 5 about tryptophan metabolism should be 5A, S5G-S5H.

      We corrected the figure references.

      Line 518: Spell out "heterozygotes" instead of using "hets".

      We changed the term from “hets” to “heterozygotes.”

      (15) Revise Figure S2 citation to match the actual figure labeling.

      We corrected the text to indicate “Figure S2” rather than “Figure S2A.”

      Additional manuscript modification

      · Figure panels 3B-C, S3A-B, 4A-C: Two clusters were relabeled with improved descriptors based on our updated annotations. The clusters “Pharynx-esophagus-cloaca 1” (PEC1) and PEC2 were relabeled as “Pharynx-cloaca 1” and “Pharynx-cloaca 2.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1:

      This study of mixed glutamate/GABA transmission from axons of the supramammillary nucleus to dentate gyrus seeks to sort out whether the two transmitters are released from the same or different synaptic vesicles. This conundrum has been examined in other dual-transmission cases and even in this particular pathway, there are different views. The authors use a variety of electrophysiological and immunohistochemical methods to reach the surprising (to me) conclusion that glutamate and GABA-filled vesicles are distinct yet released from the same nerve terminals. The strength of the conclusion rests on the abundance of data (approaches) rather than the decisiveness of any one approach, and I came away believing that the boutons may indeed produce and release distinct types of vesicles, but have reservations.

      We thank the reviewer for his/her evaluation of our work. At present, several studies reported that a variety of combinations of two transmitters are co-released from different synaptic vesicles in the central nervous system. In this regard, we think the cotransmission of glutamate/GABA from different synaptic vesicles is not surprising. To better explain to the reader how much we know about co-release of dual transmitters in the brain, we have now added new sentences describing segregated co-release of two neurotransmitters in other synapses in the Introduction (line 63-80).

      Accepting the conclusion, one is now left with another conundrum, not addressed even in the discussion: how can a single bouton sort out VGLUTs and VIAATs to different vesicles, position them in distinct locations with nm precision, and recycle them without mixing? And why do it this way instead of with single vesicles having mixed chemical content? For example, could a quantitative argument be made that separate vesicles allow for higher transmitter concentrations? I feel the paper needs to address these problems with some coherent discussion, at minimum. 

      Although these questions are very important and interesting to address, little is known about the molecular mechanisms by which VGluT2 and VIAAT are sorted to different vesicles and by which each type of synaptic vesicle is segregated. That is why we had not mentioned the sorting mechanisms in the original manuscript. Nevertheless, in response to the reviewer’s suggestion, we have now added new sentences describing possible mechanisms for the sorting and segregation of VGluT2 and VIAAT in the Discussion (line 439-462).

      As for the question regarding why glutamate and GABA are released from different synaptic vesicles, we mentioned the functional roles of separate release of two transmitters over release from single vesicles several times in the Introduction (line 94-100), Results (line 300-302), and Discussion (line 406-408, 521-522). Although it seems to be an interesting point to think about transmitter concentrations in the vesicles, we think this issue is beyond the scope of the present study. Given that manipulation of vesicular transmitter contents is technically possible (Hori and Takamori, 2021), this issue awaits further investigation.

      Major concerns: 

      (1) Throughout the paper, the authors use repetitive optogenetic stimulation to activate SuM fibers and co-release glutamate and GABA. There are several issues here: first, can the authors definitively assure the reader that all the short-term plasticity is presynaptic and not due to ChR2 desensitization? This has not been addressed. Second, can the authors also say that all the activated fibers release both transmitters? If for example 20% of the fibers retained a one-transmitter identity and had distinct physiological properties, could that account for some of the physiological findings?

      Thank you for raising this important point. To examine whether repetitive light illumination induces ChR2 desensitization, we recorded the fiber volley extracellularly. We found that paired pulses or trains of 10 stimuli at 5, 10, and 20 Hz reliably evoked fiber volleys of similar amplitude throughout the light stimulation. These results clearly indicate that repetitive light stimulation can reliably activate ChR2 and elicit action potentials in the SuM axons. These new findings are now included in Figure 1-figure supplement 2 and Figure 5-figure supplement 2. We also previously demonstrated, by direct patch-clamp recordings from ChR2-expressing hippocampal mossy fiber terminals, that 125 light pulses at 25 Hz reliably elicited action potentials (Fig. S1: Fukaya et al., 2023). Therefore, we believe that if the expression level of ChR2 is high, repetitive light stimulation reliably induces action potentials and mediates synaptic transmission with high efficiency.

      We found that most of the SuM terminals (95%) contain both VGluT2 and VIAAT (Figure 1E). This anatomical evidence strongly indicates that most SuM terminals are able to release both glutamate and GABA, and that SuM fibers with a single transmitter identity are a minor population.

      (2) PPR differences in Figures 1F-I are statistically significant but still quite small. You could say they are more similar than different in fact, and residual differences are accounted for by secondary factors like differential receptor saturation. 

      In this experiment, the light intensity was adjusted to yield less than 80% of the maximum response, as described in the Methods section of the original and revised manuscripts, minimizing the possibility of receptor saturation. We also excluded the possibility that the PPR differences could be attributed to differential receptor saturation and desensitization by using a low-affinity AMPA receptor antagonist and a low-affinity GABAA receptor antagonist (Figure 5-figure supplement 3). These results indicate that the PPR differences are of presynaptic origin.

      (3) The logic of the GPCR experiments needs a better setup. I could imagine different fibers released different transmitters and had different numbers of mGluRs, so that one would get different modulations. On the assumption that all the release is from a single population of boutons, then either the mGluRs are differentially segregated within the bouton, or the vesicles have differential responsiveness to the same modulatory signal (presumably a reduced Ca current). This is not developed in the paper. 

      Based on our minimal stimulation results and anatomical analysis, we believe that many SuM terminals contain both glutamate and GABA. Therefore, both forms of transmission can be modulated by mGluRs and GABAB receptors within the same terminals. As the reviewer pointed out, differential responsiveness of glutamate-containing and GABA-containing vesicles to the GPCR signal could be one of the molecular mechanisms underlying the differential effects of GPCRs on EPSCs and IPSCs. In addition, the spatial coupling between GPCRs and the active zones for glutamate and GABA in the same SuM terminals may differ, which may give rise to differential modulation of glutamate and GABA release. These possible mechanisms are now described in the Discussion (lines 469-476).

      (4) The biphasic events of Figures 3 and S3: I find these (unaveraged) events a bit ambiguous. Another way to look at them is that they are not biphasic per se but rather are not categorizable. Moreover, these events are really tiny, perhaps generated by only a few receptors whose open probability is variable, thus introducing noise into the small currents. 

      We agree with the reviewer that some events are tiny and that some small currents could be masked by background noise. We understand that detecting biphasic events by minimal stimulation has technical limitations. As described in the Methods section, biphasic events were detected automatically and were defined as an EPSC-IPSC sequence only if an outward peak current following an inward current appeared within 20 ms of light illumination; we therefore cannot exclude the possibility that the biphasic events we detected include false biphasic responses. To compensate for these technical limitations, we also used strontium-induced asynchronous release as another approach and obtained results similar to those of the minimal stimulation experiments (Figures 3E and 3F). Furthermore, we confirmed that the amplitudes and kinetics of minimal light stimulation-evoked EPSCs or IPSCs were not altered by blockade of their counterpart currents (Figure 3-figure supplement 2). Even if false biphasic responses were accidentally included in the analysis, biphasic events remain a minor population, and we successfully detected discernible independent EPSCs and IPSCs, which were the major population of uniquantal release-mediated synaptic responses. Thus, multiple pieces of evidence support distinct release of glutamate and GABA from SuM terminals.
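      As an illustration only, and not the authors' analysis code, the detection rule described above (an outward peak following an inward current within 20 ms of light onset) could be sketched as follows. The function name, the 5 pA amplitude threshold, and the assumption of a baseline-subtracted trace sampled at a fixed interval are all hypothetical choices made for this sketch; only the 20 ms window comes from the response.

```python
def classify_event(trace_pa, dt_ms, light_onset_ms,
                   threshold_pa=5.0, window_ms=20.0):
    """Classify one baseline-subtracted sweep (hypothetical helper).

    trace_pa       : current samples in pA (inward current is negative)
    dt_ms          : sampling interval in ms
    light_onset_ms : time of light onset in ms
    threshold_pa   : assumed detection threshold (not from the manuscript)
    window_ms      : detection window after light onset (20 ms in the text)
    """
    start = int(round(light_onset_ms / dt_ms))
    stop = min(len(trace_pa), start + int(round(window_ms / dt_ms)) + 1)
    window = trace_pa[start:stop]
    if not window:
        return "failure"

    # Locate the largest inward (most negative) deflection in the window.
    i_min = min(range(len(window)), key=window.__getitem__)
    inward = window[i_min] <= -threshold_pa

    # An outward peak only counts toward "biphasic" if it FOLLOWS the
    # inward peak, mirroring the EPSC-IPSC sequence described in the text.
    after = window[i_min + 1:] if inward else window
    outward = bool(after) and max(after) >= threshold_pa

    if inward and outward:
        return "biphasic"
    if inward:
        return "EPSC"
    if outward:
        return "IPSC"
    return "failure"


# Toy sweeps sampled at 1 ms with light onset at 5 ms.
biphasic_sweep = [0.0] * 5 + [0.0, -10.0, -4.0, 8.0, 2.0] + [0.0] * 20
epsc_sweep = [0.0] * 5 + [0.0, -10.0, -4.0, 0.0] + [0.0] * 20
ipsc_sweep = [0.0] * 5 + [0.0, 9.0, 3.0] + [0.0] * 20
for sweep in (biphasic_sweep, epsc_sweep, ipsc_sweep):
    print(classify_event(sweep, dt_ms=1.0, light_onset_ms=5.0))
```

      A fixed amplitude threshold of this kind is exactly where noise in tiny events could produce false biphasic classifications, which is the limitation acknowledged in the response.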

      (5) Figure 4 indicates that the immunohistochemical analysis is done on SuM terminals, but I do not see how the authors know that these terminals come from SuM vs other inputs that converge in DG. 

      We thank the reviewer for raising an important point. As shown in Figure 4A, B, almost all VGluT2-positive terminals in the GC layer co-expressed VIAAT. We are aware that VTA neurons reportedly project to the GC layer of the DG and co-release glutamate and GABA (Ntamati and Luscher, 2016). Contrary to this report, our retrograde tracing analysis did not reveal direct projections from the VTA to the DG. These new data are now included in Figure 4-figure supplement 1. We also added pre-embedding immunogold EM analysis, in which SuM terminals were virally labeled with eYFP, confirming that they form both asymmetric and symmetric synapses (revised Figure 4F). Together, these new data clearly demonstrate that SuM terminals in the GC layer form both asymmetric and symmetric synapses. While our results strongly suggest that VGluT2-positive terminals and SuM terminals in the GC layer are nearly identical, we cannot fully exclude the possibility that other inputs originating from unidentified brain regions may co-express VGluT2 and VIAAT in the GC layer. Therefore, in Figure 4 of the revised manuscript, we refer to “VGluT2-positive terminals” instead of “SuM terminals”.

      (6) Figure 4E also shows many GluN1 terminals not associated with anything, not even Vglut, and the apparent numbers do not mesh with the statistics. Why? 

      In triple immunofluorescence for VGluT2, VIAAT, and GluN1, free GluN1 puncta were predominantly observed in the molecular layer. Given that VGluT2-positive terminals are sparse in the molecular layer, these GluN1 puncta are primarily associated with VGluT1, the dominant subtype. In this study, we focused the analysis of GluN1 puncta specifically on the GC layer, excluding the molecular layer. To avoid miscommunication, we changed the original Figure 4E to the new Figure 4G, which focuses on the GC layer and aligns with the quantitative analysis. Additionally, we used ultrathin sections (100 nm thick) to enhance spatial resolution, which limits the detection of co-localization events to this confined spatial range, as noted in the Discussion (lines 485-488).

      (7) Do the conclusions based on the fluorescence immuno mesh with the apparent dimensions of the EM active zones and the apparent intermixing of labeled vesicles in immuno EM? 

      To further support our immunofluorescence results, we performed an EM study and found that a single SuM terminal formed both asymmetric and symmetric synapses on a GC soma (revised Figures 4E and 4F). These new data and our immunofluorescence results clearly indicate that a single SuM terminal forms both glutamatergic and GABAergic synapses on a GC and co-releases glutamate and GABA.

      As the reviewer pointed out, our immuno-EM shows that VGluT2- and VIAAT-labeled vesicles appear to intermix in asymmetric and symmetric synapses. Accordingly, in the revised manuscript, Figure 7 has been modified to show the intermixing of glutamate- and GABA-containing vesicles in the SuM terminal. It should be noted that, because of low labeling efficiency, our immuno-EM images do not represent the whole picture of synaptic vesicles for glutamate and GABA. The vesicles could be distributed with a bias toward their release sites (more VGluT2-containing vesicles close to asymmetric synapses and more VIAAT-containing vesicles close to symmetric synapses), as reported previously (Root et al., 2018). Additionally, our results could be explained by another mechanism: co-release of glutamate and GABA from the same vesicles, with one transmitter undetected due to the absence of its postsynaptic receptor. This possibility is now mentioned in the Discussion (lines 512-520). The detailed vesicle configuration in a single SuM terminal will have to be investigated in future studies.

      (8) Figure 6 is not so interesting to me and could be removed. It seems to test the obvious: EPSPs promote firing and IPSPs oppose it. 

      We believe these results are necessary for the following two reasons. First, we showed that the balance of glutamate/GABA co-transmission changes dynamically in a frequency-dependent manner (Figure 5). In terms of physiological significance, it is important to demonstrate how these frequency-dependent dynamic changes affect GC firing. Therefore, we believe that Figure 6, which shows how SuM inputs modulate GC firing during repetitive SuM stimulation, is necessary for this paper. Second, we previously reported the excitatory effects of the SuM inputs on GC firing, suggesting an important role for the glutamatergic transmission of the SuM inputs in synaptic plasticity (Hashimotodani et al., 2018; Hirai et al., 2022; Tabuchi et al., 2022). In contrast, how GABAergic co-transmission contributes to SuM-GC synaptic plasticity and DG information processing was not well understood. Our results in Figure 6, which demonstrate the inhibitory effects of GABAergic co-transmission on GC firing during high-frequency repetitive SuM input activity, clearly show the contribution of GABAergic co-transmission to short-term plasticity at SuM-GC synapses. For these reasons, we would like to keep Figure 6. We hope that our explanations convince the reviewer.

      Reviewer #2:

      Summary:

      In this study, the authors investigated the release properties of glutamate/GABA co-transmission at the supramammillary nucleus (SuM)-granule cell (GC) synapses using in vitro electrophysiology and anatomical approaches at the light and electron microscopy level. They found that SuM to dentate granule cell synapses, which co-release glutamate and GABA, exhibit distinct differences in paired-pulse ratio, Ca2+ sensitivity, presynaptic receptor modulation, and Ca2+ channel-vesicle coupling configuration for each neurotransmitter. The study shows that glutamate/GABA co-release produces independent glutamatergic and GABAergic synaptic responses, with postsynaptic targets segregated. They show that most SuM boutons form distinct glutamatergic and GABAergic synapses in close proximity, characterized by GluN1 and GABAAα1 receptor labeling, respectively. Furthermore, they demonstrate that glutamate/GABA co-transmission exhibits distinct short-term plasticity, with glutamate showing frequency-dependent depression and GABA showing frequency-independent stable depression.

      Their findings suggest that these distinct modes of glutamate/GABA co-release by SuM terminals serve as frequency-dependent filters of SuM inputs. 

      Strengths:

      The conclusions of this paper are mostly well supported by the data. 

      We thank the reviewer for their positive and constructive comments on our manuscript.

      Weaknesses: 

      Some aspects of Supplementary Figure 1A and the table need clarification. Specifically, the claim that the authors have stimulated an axon fiber rather than axon terminals is not convincingly supported by the diagram of the experimental setup. Additionally, the antibody listed in the primary antibodies section recognizes the gamma2 subunit of the GABAA receptor, not the alpha1 subunit mentioned in the results and Figure 4. 

      We have now answered these questions in the Recommendations section below.

      Reviewer #3:

      Summary: 

      In this manuscript, Hirai et al. investigated the release properties of glutamate/GABA co-transmission at SuM-GC synapses and reported that glutamate/GABA co-transmission exhibits distinct short-term plasticity with segregated postsynaptic targets. Using optogenetics, whole-cell patch-clamp recordings, and immunohistochemistry, the authors reveal distinct transmission modes of glutamate/GABA co-release as frequency-dependent filters of incoming SuM inputs.

      Strengths: 

      Overall, this study is well-designed and executed; conclusions are supported by the results. This study addressed a long-standing question of whether GABA and glutamate are packaged in the same vesicles and co-released in response to the same stimuli in the SuM-GC synapses (Pedersen et al., 2017; Hashimotodani et al., 2018; Billwiller et al., 2020; Chen et al., 2020; Li et al., 2020; Ajibola et al., 2021). Knowledge gained from this study advances our understanding of neurotransmitter co-release mechanisms and their functional roles in the hippocampal circuits. 

      Weaknesses:

      No major issues are noted. Some minor issues related to data presentation and experimental details are listed below. 

      We appreciate the reviewer’s positive view of our study. We respond in more detail in the Recommendations section below.

      Recommendations for the authors:

      Reviewer #1:

      (1) The blue color for VIAAT in panel 1C is extremely hard to see. 

      Thank you for pointing this out. We have changed the color for VIAAT to cyan in Figure 1C and D of the revised manuscript.

      (2) Line 329 "perforant" not "perfomant".  

      We appreciate the reviewer’s careful attention. We have corrected this typo in the revised manuscript.

      Reviewer #2:

      To convincingly demonstrate that the authors stimulated SuM axon fibers instead of SuM terminals (Supplementary Figure 1A), they should provide an image showing the distribution of SuM-labeled fibers and axon terminals reaching the dentate gyrus (DG) and the trace of the optic fiber, rather than providing a diagram of the experimental setup.

      We appreciate the reviewer’s suggestion. We have now provided a new image of the experimental setup (Figure 1-figure supplement 1A) showing a single GC, the distribution of SuM fibers in the GC layer, and the illumination area at each location. Because SuM inputs make synapses onto the GC soma and onto dendrites close to the GC cell body, the SuM-GC synapses on the recorded GCs lie within a very limited area. This characteristic synaptic localization allowed us to position the illumination area so that light was not applied to the SuM terminals on the recorded GCs. The delayed onsets of EPSCs/IPSCs evoked by over-axon stimulation (Figure 1-figure supplement 1C, D) also support the conclusion that the SuM terminals on the recorded GCs were outside the illumination area.

      Additionally, the authors should clarify the discrepancy between the antibody mentioned in the list of primary antibodies, which recognizes the gamma2 subunit of the GABAA receptor, and the alpha1 subunit of the GABAA receptor mentioned in the results and Figure 4. 

      We apologize for this mistake. As described in the main text and figure, we used an antibody against the α1 subunit of the GABAA receptor. Table S1 has been corrected in the revised version of the paper.

      Reviewer #3:

      (1) In Figure 1, the authors used two [Ca2+]o concentrations to study the EPSC and IPSC amplitudes. How does the Ca2+ concentration affect the PPR in the EPSC and IPSC, respectively? 

      Given that lowering the extracellular Ca2+ concentration reduces the release probability, a 1 mM extracellular Ca2+ concentration is expected to increase the PPR compared with 2.5 mM. Indeed, we observed that lowering the extracellular Ca2+ concentration increased the 2nd to 10th synaptic responses (both EPSCs and IPSCs) during train stimulation (Figure 5).

      (2) In Figure 2D, does baclofen also have a dose-dependent effect on the inhibition of the EPSC and IPSC similar to the DCG-IV in Figure 2C? 

      Thank you for your question. Because we aimed to demonstrate the differential inhibitory effects of baclofen at a given concentration on glutamatergic and GABAergic co-transmission, we did not examine dose dependence in detail. In response to the reviewer’s comment, we tested the effects of a higher concentration of baclofen on EPSCs and IPSCs. As shown in the figure below, 50 µM baclofen inhibited EPSCs and IPSCs to a similar extent. Therefore, by comparing the inhibitory effects of two different concentrations of baclofen (5 and 50 µM), we believe that baclofen also has a dose-dependent inhibitory effect on both EPSCs and IPSCs, similar to DCG-IV.

      Author response image 1.

      (3) In Figure 2E, statistical labels, such as "*" or "n.s." (not significant), should be provided on the plots to facilitate the reading of figures. 

      In response to the reviewer’s comment, we have added statistical labels to Figure 2E.

      (4) In Figure 3A, the latency of the evoked EPSC for the lower light stimulation groups seems to be much slower than the one shown on the left or other figures in the paper, such as Figure 1F.

      Please double-check if the blue light stimulation label is placed in the right location. 

      Corrected, thanks.

      (5) The use of minimal light stimulation in optogenetic experiments is not appropriately justified or described. More detailed information should be provided, such as whether the optogenetic stimulation is performed on the axon or the terminals of the SuM. 

      We appreciate the reviewer’s suggestion. To effectively detect stochastic synaptic responses, the light stimulation was applied to the terminals of the SuM. We have now stated this information (line 212). We have also further described the justification for using minimal light stimulation in the revised manuscript (lines 207-209).

      References

      Fukaya R, Hirai H, Sakamoto H, Hashimotodani Y, Hirose K, Sakaba T (2023) Increased vesicle fusion competence underlies long-term potentiation at hippocampal mossy fiber synapses. Sci Adv 9:eadd3616.

      Hashimotodani Y, Karube F, Yanagawa Y, Fujiyama F, Kano M (2018) Supramammillary Nucleus Afferents to the Dentate Gyrus Co-release Glutamate and GABA and Potentiate Granule Cell Output. Cell Rep 25:2704-2715 e2704.

      Hirai H, Sakaba T, Hashimotodani Y (2022) Subcortical glutamatergic inputs exhibit a Hebbian form of long-term potentiation in the dentate gyrus. Cell Rep 41:111871.

      Hori T, Takamori S (2021) Physiological Perspectives on Molecular Mechanisms and Regulation of Vesicular Glutamate Transport: Lessons From Calyx of Held Synapses. Front Cell Neurosci 15:811892.

      Ntamati NR, Luscher C (2016) VTA Projection Neurons Releasing GABA and Glutamate in the Dentate Gyrus. eNeuro 3.

      Root DH, Zhang S, Barker DJ, Miranda-Barrientos J, Liu B, Wang HL, Morales M (2018) Selective Brain Distribution and Distinctive Synaptic Architecture of Dual Glutamatergic-GABAergic Neurons. Cell Rep 23:3465-3479.

      Tabuchi E, Sakaba T, Hashimotodani Y (2022) Excitatory selective LTP of supra-mammillary glutamatergic/GABAergic co-transmission potentiates dentate granule cell firing. Proc Natl Acad Sci U S A 119:e2119636119.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The manuscript by Goetz et al. takes a new perspective on sensory information processing in cells. In contrast to previous studies, which have used population data to build a response distribution and which estimate sensory information at about 1 bit, this work defines sensory information at the single cell level. To do so, the authors take two approaches. First, they estimate single cells' response distributions to various input levels from time-series data directly. Second, they infer these single-cell response distributions from the population data by assuming a biochemical model and extracting the cells' parameters with a maximum-entropy approach. In either case, they find, for two experimental examples, that single-cell sensory information is much higher than 1 bit, and that the reduction to 1 bit at the population level is due to the fact that cells' response functions are so different from each other. Finally, the authors identify examples of measurable cell properties that do or do not correlate with single-cell sensory information.

      The work brings an important and distinct new insight to a research direction that generated strong interest about a decade ago: measuring sensory information in cells and understanding why it is so low. The manuscript is clear, the results are compelling, and the conclusions are well supported by the findings. Several contributions should be of interest to the quantitative biology community (e.g., the demonstration that single cells' sensory information is considerably larger than previously implied, and the approach of inferring single-cell data from population data with the help of a model and a maximum-entropy assumption).

      We thank the reviewer for the excellent summary of our research.

      Reviewer #2 (Public Review):

      In this paper the authors present an existing information theoretic framework to assess the ability of single cells to encode external signals sensed through membrane receptors.

      The main point is to distinguish actual noise in the signaling pathway from cell-cell variability, which could be due to differences in their phenotypic state, and to formalize this difference using information theory.

      After correcting for this cellular variability, the authors find that cells may encode more information than one would estimate from ignoring it, which is expected. The authors show this using simple models of different complexities, and also by analyzing an imaging dataset of the IGF/FoxO pathway.

      The implications of the work are limited because the analysed data is not rich enough to draw clear conclusions. Specifically,

      • the authors do not distinguish what could be methodological noise inherent to microscopy techniques (segmentation etc), and actual intrinsic cell state. It's not clear that cell-cell variability in the analyzed dataset is not just a constant offset or normalization factor. Other authors (e.g. Gregor et al Cell 130, 153-164) have re-centered and re-normalized their data before further analysis, which is more or less equivalent to the idea of the conditional information in the sense that it aims to correct for this experimental noise.

      We thank the reviewer for the comment. However, we do not believe our analysis is a consequence of normalization artifacts. Prior to modeling the single cell data, we removed well-dependent background fluorescence. This should take care of technical variation related to overall offsets in the data. We agree with the reviewer that background subtraction may not fully account for technical variability. For example, some of the cell-to-cell variability may potentially be ascribed to issues such as incorrect segmentation. Unfortunately, however, attempting to remove this technical variability through cell-specific normalization as suggested by the reviewer¹ will diminish to a very large extent the true biological effects related to extensivity (cell size, total protein abundance). We note that these effects are a direct function of cell-state variables (see for example Cohen-Saidon et al.², who use cell-state specific normalization to improve signaling fidelity). Therefore, an increase in mutual information after normalization does not only reflect the removal of technical noise but also accounts for the effect of cell-state variables.

      Nonetheless, as the reviewer suggested, we performed a cell-specific normalization wherein the mean nuclear FoxO levels in each cell (in the absence of IGF) were normalized to one. Then, for each ligand concentration, we collated the FoxO response across all cells and computed the channel capacity corresponding to the cell-state agnostic mutual information I_CSA. As expected, I_CSA increases from ∼0.9 bits to ∼1.3 bits when cell-specific normalization is performed (Author response image 1). However, this value is significantly lower than the average of ∼1.95 bits for the cell-state specific mutual information ⟨I_Cee⟩. Finally, we note that the cell-specific normalization does not change the calculations of channel capacity at the single cell level, as these calculations do not depend on linear transformations of the data (centering and normalization). Therefore, we do not think that our analysis of experimental data suffers from artifacts related to microscopy.

      Author response image 1. Left: nuclear FoxO response averaged over all cells in the population across different ligand concentration. Right: nuclear FoxO response was first normalized at the single cell level and then averaged over all cells in the population across different ligand concentrations.
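      The gap between a pooled (cell-state agnostic) channel capacity and the average single-cell capacity can be illustrated with a toy calculation. The sketch below is not the authors' estimator: it applies the standard Blahut-Arimoto iteration to hand-written, discretized response distributions for two hypothetical cells whose response ranges are shifted relative to each other. The function name, the two-cell example, and the three output bins are assumptions made purely for illustration.

```python
import math


def channel_capacity_bits(p_y_given_x, n_iter=1000, tol=1e-12):
    """Blahut-Arimoto estimate of the capacity (bits) of a discrete channel.

    p_y_given_x[i][j] is the probability of output bin j given input
    level i; every row must sum to 1.
    """
    n_in = len(p_y_given_x)
    n_out = len(p_y_given_x[0])
    p_x = [1.0 / n_in] * n_in  # start from a uniform input distribution
    capacity = 0.0
    for _ in range(n_iter):
        # Output distribution implied by the current input distribution.
        q_y = [sum(p_x[i] * p_y_given_x[i][j] for i in range(n_in))
               for j in range(n_out)]
        # Kullback-Leibler divergence (nats) between each row and q_y.
        kl = [sum(p * math.log(p / q_y[j])
                  for j, p in enumerate(row) if p > 0.0)
              for row in p_y_given_x]
        weights = [p_x[i] * math.exp(kl[i]) for i in range(n_in)]
        z = sum(weights)
        p_x = [w / z for w in weights]  # shift weight to informative inputs
        new_capacity = math.log2(z)     # lower bound converging to capacity
        converged = abs(new_capacity - capacity) < tol
        capacity = new_capacity
        if converged:
            break
    return capacity


# Two hypothetical cells, two input levels (rows), three output bins (columns).
# Each cell responds noiselessly, but cell B's responses are shifted by one bin.
cell_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
cell_b = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Pooling responses across cells mixes the two response functions.
pooled = [[0.5 * (a + b) for a, b in zip(row_a, row_b)]
          for row_a, row_b in zip(cell_a, cell_b)]

per_cell_mean = 0.5 * (channel_capacity_bits(cell_a)
                       + channel_capacity_bits(cell_b))
pooled_capacity = channel_capacity_bits(pooled)
print(per_cell_mean, pooled_capacity)
```

      In this toy case each cell transmits the input perfectly, yet the pooled channel loses information because the "high" response of one cell coincides with the "low" response of the other; this is the qualitative effect that the comparison between I_CSA and ⟨I_Cee⟩ quantifies.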

      • in the experiment, each condition is shown only once and sequentially. This means that the reproducibility of the response upon repeated exposures in a single cell was not tested, casting doubt on the estimate of the response fidelity (estimated as the variance over time in a single response).

      The reviewer raises an excellent question about the persistence of cell states. To verify that cell states are indeed conserved on the time scale of the experiment, we reanalyzed data generated by Gross et al.³, wherein cells were perturbed with IGF (37.5 pM), followed by a washout that allowed the cells to return to pre-stimulation nuclear FoxO levels, followed by a re-perturbation with the same amount of IGF. The nuclear FoxO response was measured at the single cell level after 90 minutes of IGF exposure in both rounds of stimulation. Since the response x to the same input u was measured twice in the same cell (x1 and x2), we could evaluate the intrinsic variability in the response at the single cell level. We then compared this intrinsic variability to the extrinsic cell-state dependent variability in the population.

      To do so, we computed for each cell the difference δ = x1 - x2 between the two responses. Author response image 2 shows the histogram p(δ) computed from the data (pink) and from the model that was trained on the single cell data (blue). We also computed p(δ0), the distribution of the difference between the responses of two different cells, both from the data and from the model.

      As we see in Author response image 2, the distribution p(δ) is significantly narrower than p(δ0), suggesting that intracellular variability is significantly smaller than across-population variability and that cells’ responses to the same stimulus are well conserved, especially when compared with the responses of randomly picked pairs of cells. This shows that cell states and the corresponding responses to extracellular perturbations are conserved, at least on the time scale of the experiment. Therefore, our estimates of cell-to-cell variability in signaling fidelity are stable and reliable. We have now incorporated this discussion in the manuscript (lines 275-281).

      Author response image 2. Left: Cells were treated with 37.5 pM of IGF for 90 minutes, washed out for 120 minutes and again treated with 37.5 pM of IGF. Nuclear FoxO was measured during the treatment and the washout. The distributions on the left show the difference in FoxO levels in single cells after the two 90 minutes IGF stimulations (pink: data, blue: model). Right: Distribution of difference in FoxO levels in two randomly picked cells after 90 minutes of exposure to 37.5 pM IGF.
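      The comparison between p(δ) and p(δ0) can be mimicked on synthetic data. The following sketch does not use the data of Gross et al.; it assumes a hypothetical population in which each cell has its own stable response level plus a small amount of independent noise on each stimulation, and all numerical values (population size, spreads, random seed) are invented for illustration.

```python
import random
import statistics


def within_cell_differences(x1, x2):
    """delta = x1 - x2 for the two responses of the same cell."""
    return [a - b for a, b in zip(x1, x2)]


def between_cell_differences(x1, x2, rng):
    """delta0 = first response of cell i minus second response of a
    randomly chosen different cell j."""
    n = len(x1)
    diffs = []
    for i in range(n):
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1  # skip i so that the two cells always differ
        diffs.append(x1[i] - x2[j])
    return diffs


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_cells = 2000

# Assumed cell-state dependent response level (extrinsic variability).
cell_level = [rng.gauss(1.0, 0.3) for _ in range(n_cells)]
# Two stimulations of the same cell, each with small intrinsic noise.
x1 = [m + rng.gauss(0.0, 0.05) for m in cell_level]
x2 = [m + rng.gauss(0.0, 0.05) for m in cell_level]

delta = within_cell_differences(x1, x2)
delta0 = between_cell_differences(x1, x2, rng)

spread_within = statistics.pstdev(delta)
spread_between = statistics.pstdev(delta0)
print(f"spread of delta:  {spread_within:.3f}")
print(f"spread of delta0: {spread_between:.3f}")
```

      When cell states persist between the two stimulations, the cell-specific level cancels in δ but not in δ0, so the within-cell distribution is much narrower, which is the pattern reported for the experimental data.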

      • another dataset on the EGF/EGFR pathway is analyzed, but no conclusion can be drawn from it because single-cell information cannot be directly estimated from it. The authors instead use a maximum-entropy Ansatz, which cannot be validated for lack of data.

      We thank the reviewer for this comment. We agree with the reviewer that we have not verified our predictions for the EGF/EGFR pathway. That study was meant to show the potential generality of our analysis. We look forward to validating our predictions for the EGF/EGFR pathway in future studies.

      Reviewer #3 (Public Review):

      Goetz, Akl and Dixit investigated the heterogeneity in the fidelity of sensing the environment by individual cells in a population using computational modeling and analysis of experimental data for two important and well-studied mammalian signaling pathways: the insulin-like growth factor (IGF)/FoxO and epidermal growth factor (EGF)/EGFR pathways. They quantified this heterogeneity using the conditional mutual information between the input (e.g., level of IGF) and output (e.g., level of FoxO in the nucleus), conditioned on the "state" variables which characterize the signaling pathway (such as abundances of key proteins, reaction rates, etc.). First, using a toy stochastic model of a receptor-ligand system - which constitutes the first step of both signaling pathways - they constructed the population average of the mutual information conditioned on the number of receptors and maximized over the input distribution and showed that it is always greater than or equal to the usual or "cell state agnostic" channel capacity. They constructed the probability distribution of cell state dependent mutual information for the two pathways, demonstrating agreement with experimental data in the case of the IGF/FoxO pathway using previously published data. Finally, for the IGF/FoxO pathway, they found the joint distribution of the cell state dependent mutual information and two experimentally accessible state variables: the response range of FoxO and total nuclear FoxO level prior to IGF stimulation. In both cases, the data approximately follow the contour lines of the joint distribution. Interestingly, high nuclear FoxO levels, and therefore lower associated noise in the number of output readout molecules, are not correlated with higher cell state dependent mutual information, as one might expect.

      This paper contributes to the vibrant body of work on information theoretic characterization of biochemical signaling pathways, using the distribution of cell state dependent mutual information as a metric to highlight the importance of heterogeneity in cell populations. The authors suggest that this metric can be used to infer "bottlenecks" in information transfer in signaling networks, where certain cell state variables have a lower joint distribution with the cell state dependent mutual information.

      The utility of a metric based on the conditional mutual information to quantify fidelity of sensing and its heterogeneity (distribution) in a cell population is supported in the comparison with data. Some aspects of the analysis and claims in the main body of the paper and SI need to be clarified and extended.

      1. The authors use their previously published (Ref. 32) maximum-entropy based method to extract the probability distribution of cell state variables, which is needed to construct their main result, namely p_CeeMI (I). The salient features of their method, and how it compares with other similar methods of parameter inference should be summarized in the section with this title. In SI 3.3, the Lagrangian, L, and Rm should be defined.

      We thank the reviewer for the comment and apologize for the omission. We have now rewritten the manuscript to include references to previous reviews of works that infer probability distributions (ref. 4) of cell state variables (lines 156-168). Notably, as we argued in our previous work (ref. 5), no current method can efficiently estimate the joint distribution over parameters that is consistent with measured single cell data and models of signaling networks. Therefore, we could not use multiple approaches to infer parameter distributions. We have now expanded our discussion of the method in the supplementary information sections.

      1. Throughout the text, the authors refer to "low" and "high" values of the channel capacity. For example, a value of 1-1.5 bits is claimed to be "low". The authors need to clarify the context in which this value is low: In some physically realistic cases, the signaling network may need to simply distinguish between the presence or absence of a ligand, in which case this value would not be low.

      We agree with the reviewer that small values of channel capacities might be sufficient for cells to carry out some tasks, in which case a low channel capacity does not necessarily indicate a network not performing its task. Indeed, how much information is needed for a specific task is a related but distinct question from how much information is provided through a signaling network. Both questions are essential to understand a cell's signaling behavior, with the former being far less easy to answer in a way which is generalizable. In contrast, the latter can be quantitatively answered using the analysis presented in our manuscript.

      1. Related to (2), the authors should comment on why in Fig. 3A, I_Cee=3. Importantly, where does the fact that the network is able to distinguish between 2^3 ligand levels come from? Is this related to the choice (and binning) of the input ligand distribution (described in the SI)?

      We thank the reviewer for the comment. The network can distinguish between all inputs used in the in silico experiment precisely because the noise at the cellular level is small enough that there is negligible overlap between single cell response distributions. Indeed, the mutual information will not increase with the number of equally spaced inputs in a sub-linear manner, especially when the input number is very high.
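
      The point about negligible overlap can be made concrete with a small numerical sketch. It assumes a hypothetical Gaussian response with eight equally weighted, equally spaced input levels; neither the response model nor the noise values come from the manuscript. When the noise is small relative to the spacing, the mutual information approaches log2(8) = 3 bits, and it falls as the response distributions begin to overlap.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mutual_information_bits(means, sigma, n_grid=20000):
    """MI (bits) between equally weighted discrete inputs and a Gaussian response.

    means[k] is the mean response to input k (hypothetical values);
    the integral over the response is approximated with the midpoint rule.
    """
    n = len(means)
    lo = min(means) - 8.0 * sigma
    hi = max(means) + 8.0 * sigma
    dx = (hi - lo) / n_grid
    mi = 0.0
    for i in range(n_grid):
        x = lo + (i + 0.5) * dx
        cond = [gaussian_pdf(x, mu, sigma) for mu in means]
        marginal = sum(cond) / n
        for p in cond:
            if p > 0.0:
                mi += (p / n) * math.log2(p / marginal) * dx
    return mi

if __name__ == "__main__":
    doses = [float(k) for k in range(8)]  # eight hypothetical response means
    for sigma in (0.05, 0.3, 1.0):        # hypothetical noise levels
        mi = mutual_information_bits(doses, sigma)
        print(f"sigma={sigma:4.2f}  MI={mi:.3f} bits  (upper bound {math.log2(len(doses)):.0f} bits)")
```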

      1. The authors should justify the choice of the gamma distribution in a number of cases (eg. distribution of ligand, distribution cell state parameters, such as number of receptors, receptor degradation rate, etc.).

      We thank the reviewer for the comment. We note that previous work on protein abundances and gene expression levels (e.g. see ref. 6) has reported distributions with positive skews that can be fit well with gamma distributions or log-normal distributions. Moreover, many stochastic models of protein abundance levels and signaling networks are also known to result in abundances that are distributed according to a negative binomial distribution, the discrete counterpart of the gamma distribution. Therefore, we chose gamma distributions in our study. We have now clarified this point in the Supplementary Information. At the same time, the gamma distribution serves only as a regularization for the finite data and, in principle, our analysis and conclusions do not depend on the choice of the gamma distribution for abundances of proteins, ligands, and cell parameters.
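
      One standard way to see the connection between gamma and negative binomial distributions is the gamma-Poisson mixture identity: a Poisson count whose rate is gamma distributed follows a negative binomial distribution. The sketch below checks this numerically. It is an illustration of that textbook identity, not a reproduction of the authors' models, and the shape and scale values are hypothetical.

```python
import math

def gamma_pdf(lam, shape, scale):
    """Density of a gamma distribution with the given shape and scale (lam > 0)."""
    log_p = ((shape - 1.0) * math.log(lam) - lam / scale
             - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_p)

def poisson_pmf(n, lam):
    """Probability of observing n events for a Poisson rate lam > 0."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))

def negative_binomial_pmf(n, shape, scale):
    """Negative binomial pmf written in terms of the gamma shape and scale."""
    p = scale / (1.0 + scale)
    log_p = (math.lgamma(n + shape) - math.lgamma(n + 1) - math.lgamma(shape)
             + n * math.log(p) + shape * math.log(1.0 - p))
    return math.exp(log_p)

def gamma_poisson_mixture_pmf(n, shape, scale, upper=200.0, n_grid=20000):
    """Integrate Poisson(n | lam) * Gamma(lam) over lam with the midpoint rule."""
    dlam = upper / n_grid
    total = 0.0
    for i in range(n_grid):
        lam = (i + 0.5) * dlam
        total += poisson_pmf(n, lam) * gamma_pdf(lam, shape, scale) * dlam
    return total

if __name__ == "__main__":
    shape, scale = 3.0, 5.0  # hypothetical gamma parameters
    for n in (0, 5, 10, 20):
        mix = gamma_poisson_mixture_pmf(n, shape, scale)
        nb = negative_binomial_pmf(n, shape, scale)
        print(f"n={n:2d}  mixture={mix:.6f}  negative binomial={nb:.6f}")
```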

      1. Referring to SI Section 2, it is stated that the probability of the response (receptor binding occupancy) conditioned on the input ligand concentration and number of receptors is a Poisson distribution. Indeed this is nicely demonstrated in Fig. S2. Therefore it is the coefficient of variation (std/mean) that decreases with increasing R0, not the noise (which is strictly the standard deviation) as stated in the paper.

      We thank the reviewer for the comment. We have now corrected our text.
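
      The distinction raised by the reviewer can be sketched in a few lines. For a Poisson-distributed count, the standard deviation is the square root of the mean, so it grows with the number of receptors while the coefficient of variation shrinks. The one-site equilibrium binding formula and the ligand and Kd values below are illustrative assumptions, not the parameters of the manuscript's model.

```python
import math

def mean_bound_receptors(r0, ligand, kd):
    """Mean number of ligand-bound receptors (assumed one-site equilibrium binding)."""
    return r0 * ligand / (ligand + kd)

def poisson_noise(mean_bound):
    """Return (standard deviation, coefficient of variation) of a Poisson count."""
    std = math.sqrt(mean_bound)
    cv = std / mean_bound
    return std, cv

if __name__ == "__main__":
    # Hypothetical receptor numbers; ligand and kd are in the same arbitrary units.
    for r0 in (100, 1000, 10000):
        m = mean_bound_receptors(r0, ligand=1.0, kd=1.0)
        std, cv = poisson_noise(m)
        print(f"R0={r0:6d}  mean={m:8.1f}  std={std:7.2f}  CV={cv:.4f}")
```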

      1. In addition to explicitly stating what the input (IGF level) and the output (nuclear GFP-tagged FoxO level) are, it would be helpful if it is also stated what is the vector of state variables, theta, corresponding to the schematic diagram in Fig. 2C.

      We thank the reviewer for the comment. We have now corrected our text in the supplementary material as well as the main text (Figure 2 caption).

      1. Related to Fig. 2C, the statement in the caption: "Phosphorylated Akt leads to phosphorylation of FoxO which effectively shuttles it out of the nucleus." needs clarification: From the figure, it appears that pFoxO does not cross the nuclear membrane, in which case it would be less confusing to say that phosphorylation prevents reentry of FoxO into the nucleus.

      We thank the reviewer for the comment. We have now corrected our text (Figure 2 caption).

      1. The explanations for Fig. 2D, E and insets are sparse and therefore not clear. The authors should expand on what is meant by model and experimental I(theta). What is CC input dose? Also in Fig. 2E, the overlap between the blue and pink histograms means that the value of the blue histogram for the final bin - and therefore agreement or lack thereof with the experimental result - is not visible. Also, the significance of the values 3.25 bits and 3 bits in these plots should be discussed in connection with the input distributions.

      We thank the reviewer for the comment. We have now corrected our text (Figure 2 caption and lines 249-251).

      1. While the joint distribution of the cell state dependent mutual information and various biochemical parameters is given in Fig. S7, there is no explanation of what these results mean, either in the SI or main text. Related to this, while a central claim of the work is that establishing this joint distribution will allow determination of cell state variables that differentiate between high and low fidelity sensing, this claim would be stronger with more discussion of Figs. 3 and S7. The related central claim that cell state dependent mutual information leads to higher fidelity sensing at the population level would be made stronger if it can be demonstrated that in the limit of rapidly varying cell state variables, the I_CSA is retrieved.

      We thank the reviewer for this excellent comment. We have now added more discussion about interpreting the correlation between cell state variables and cell-state specific mutual information (lines 294-306). We also appreciate the suggestion about a toy model calculation to show that dynamics of cell state variables affects cell state specific mutual information. We have now performed a simple calculation to show how dynamics of cell state variables affects cells’ sensing ability (lines 325-363). Specifically, we constructed a model of a receptor binding to the ligand wherein the receptor levels themselves changed over time through a slow process of gene expression (Author response image 3, main text Figure 4). In this model, the timescales of fluctuations of ligand-free receptors on the cell surface can be tuned by speeding up/slowing down the degradation rate of the corresponding mRNA while keeping the total amount of steady state mRNA constant. As shown in Author response image 3, the dependence of cell-specific mutual information on cell state variable diminishes when the time scale of change of cell state variables is fast.

      Author response image 3. Cell state dynamics governs cell state conditioned mutual information. A. In a simple stochastic model, receptor mRNA is produced at a constant rate from the DNA and then translated into ligand-free receptors. The number of ligand-bound receptors after a short exposure to ligands is considered the output. B. A schematic showing dynamics of receptor numbers when mRNA dynamics are slower compared to signaling time scales. C. Conditioning on receptor numbers leads to differing abilities in sensing the environment when the time scale of mRNA dynamics τ is slow. In contrast, when the mRNA dynamics are fast (large τ⁻¹), conditioning on cell state variables does not lead to differences in sensing abilities.

      Reviewer #1 (Recommendations For The Authors):

      My major concerns are mainly conceptual, as described below. With proper attention to these concerns, I feel that this manuscript could be a good candidate for the eLife community.

      Major concerns:

      1. The manuscript convincingly demonstrates that cells are good sensors after all, and that heterogeneity makes their input-output functions different from each other. This raises the question of what happens downstream of sensing. For single-celled organisms, where it may be natural to define behavioral consequences at the single-cell level, it may very well be relevant that single-cell information is high, even if cells respond differently to the environment. But for cells in multicellular organisms, like those studied here, I imagine that most behavioral consequences of sensing occur at the multicellular level. Thus, many cells' responses are combined into a larger response. Because their responses are different, their high-information individual responses may combine into a low-information collective response. In fact, one could argue that a decent indicator of the fidelity of this collective response is indeed the population-level information measure estimated in previous works. Thus, a fundamental question that the authors must address is: what is the ultimate utility of reliable, but heterogeneous, responses for a multicellular system? This question has an important bearing on the relevance of their findings.

      We thank the reviewer for this thought-provoking comment. We agree that the fidelity with which cells sense their environment, especially those in multicellular organisms, may not always need to be very high. We speculate that when the biological function of a collection of cells can be expressed as an average over the response of individual cells, high-information but heterogeneous cells can be considered equivalent to low-information homogeneous cells. An example of such a function is population differentiation to maintain relative proportions of different cell types in a tissue or the production of a certain amount of extracellular enzyme.

      In contrast, we believe that when the biological function involves collective action, spatial patterning, or temporal memory, the difference between reliable but heterogeneous population and unreliable homogeneous population will become significant. We plan to explore this topic in future studies.

      1. The authors demonstrate that the agreement is good between their inference approach and the direct estimation of response distributions from single-cell time series data. In fact, the agreement is so good that it raises the question of why one would need the inference approach at all. Is it because single-cell time series data is not always available? Is that why the authors used it for one example and not the other? The validation is an asset, but I imagine that the inference approach is complicated and may make assumptions that are not always true. Thus, its utility and appropriate use must be clarified.

      We thank the reviewer for the comment. As the reviewer correctly pointed out, live cell imaging data is not always available and has limited scope. Specifically, optical resolution limits measurements of multiple targets. Moreover, typical live cell measurements measure total abundance or localization and not post-translational modifications (phosphorylation, methylation, etc.), which are crucial to signaling dynamics. The most readily available single cell data, such as those measured using single cell RNA sequencing, immunofluorescence, or flow cytometry, are necessarily snapshots. Therefore, computational models that can connect underlying signaling networks to snapshot data become essential when imputing single cell trajectories. In addition, the modeling also allows us to identify network parameters that correlate most strongly with cellular heterogeneity. We have now clarified this point in the manuscript (lines 366-380).

      Minor comments:

      1. I would point out that the maximum values in the single-cell mutual information distributions (Fig 2D and E) correspond to log2 of the number of input levels, corresponding to perfect distinguishability of each of the equally-weighted input states. It is clear that many of the mutual information values cluster toward this maximum, and it would help readers to point out why.

      We thank the reviewer for the comment. We have now included a discussion about the skew in the distribution in the text (lines 251-260).

      1. Line 216 references Fig 2C for the EGF/EGFR pathway, but Fig 2C shows the FoxO pathway. In fact, I did not see a schematic of the EGF/EGFR pathway. It may be helpful to include one, and for completeness perhaps also one for the toy model, and organize the figures accordingly.

      We thank the reviewer for the comment. We did not include three separate schematics because the schematics of the EGF/EGFR model and the toy model are subsets of the schematic of the IGF/FoxO model. We have now clarified this point in the manuscript (Figure 2 caption).

      Reviewer #2 (Recommendations For The Authors):

      • the simple model of Fig. 2A would gain from a small cartoon explaining the model and its parameters.

      We thank the reviewer for the comment. We did not include a schematic for the toy model as it is a subset of the schematic of the IGF/FoxO model. The schematic of the toy model is included in the supplementary information.

      • L should be called u, and B should be called x, to be consistent with the rest of the notations in the paper.

      We have decided to keep the notation originally presented in the manuscript.

      • legend of 2E and D should be clarified. "CC input dose" is cryptic. The x axis is the input dose, the y axis is its distribution at the argmax of I. CC is the max of I, not its argmax. Likewise "I" in the legend for the colors should not be used to describe the insets, which are input distributions.

      We have now changed this in the manuscript.

      • the data analysis of the IGF/FoxO pathway should be explained in the main text, not the SI. Otherwise it's impossible to understand how one arrives at, or how to interpret, figure 2E, which is central to the paper. For instance the fact that p(x|u,theta) is assumed to be Gaussian, and how the variance and mean are estimated from the actual data is very important to understand the significance of the results.

      While we have added more details in the manuscript in various places, for the sake of brevity and clarity, we have decided to keep the details of the calculations in the supplementary materials.

      • there's no Methods section. Most of the paper's theoretical work is hidden in the SI, while it should be described in the methods.

      We thank the reviewer for the comment. However, we believe that adding a methods section will break the narrative of the paper. The methods are described in the supplementary materials in sufficient detail to reproduce our results. Additionally, we also provide a link to the GitHub page that has all scripts related to the manuscript.

      PS: please submit a PDF of the SI for review, so that people can read it on any platform (as opposed to a word document, especially with equations)

      We have now done this.

      Reviewer #3 (Recommendations For The Authors):

      1. Subplots in Fig. 1, inset in Fig. 3 are not legible due to small font.

      We have now increased the font.

      1. Mean absolute error in Fig. S5 and relative error in related text should be clarified.

      We have now clarified this in the manuscript.

      1. Acronyms (MACO, MERIDIAN) should be defined.

      We have now made these changes.

      References

      1. Gregor T, Tank DW, Wieschaus EF, Bialek W. Probing the limits to positional information. Cell. 2007;130(1):153-64. doi: 10.1016/j.cell.2007.05.025. PubMed PMID: WOS:000248587000018.

      2. Cohen-Saidon C, Cohen AA, Sigal A, Liron Y, Alon U. Dynamics and Variability of ERK2 Response to EGF in Individual Living Cells. Mol Cell. 2009;36(5):885-93. doi: 10.1016/j.molcel.2009.11.025. PubMed PMID: WOS:000272965400020.

      3. Gross SM, Dane MA, Bucher E, Heiser LM. Individual Cells Can Resolve Variations in Stimulus Intensity along the IGF-PI3K-AKT Signaling Axis. Cell Syst. 2019;9(6):580-8 e4.

      4. Loos C, Hasenauer J. Mathematical modeling of variability in intracellular signaling. Current Opinion in Systems Biology. 2019;16:17-24.

      5. Dixit PD, Lyashenko E, Niepel M, Vitkup D. Maximum Entropy Framework for Predictive Inference of Cell Population Heterogeneity and Responses in Signaling Networks. Cell Syst. 2020;10(2):204-12 e8.

      6. Taniguchi Y, Choi PJ, Li GW, Chen H, Babu M, Hearn J, Emili A, Xie XS. Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science. 2010;329(5991):533-8. doi: 10.1126/science.1188308. PubMed PMID: 20671182; PMCID: PMC2922915.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors attempted to dissect the function of a long non-coding RNA, lnc-FANCI-2, in cervical cancer. They profiled lnc-FANCI-2 in different cell lines and tissues, generated knockout cell lines, and characterized the gene using multiple assays.

      Strengths:

      A large body of experimental data has been presented and can serve as a useful resource for the scientific community, including transcriptomics and proteomics datasets. The reported results also span different parts of the regulatory network and open up multiple avenues for future research.

      Thanks for your positive comments on the strengths.

      Weaknesses:

      The write-up is somewhat unfocused and lacks deep mechanistic insights in some places.

      As lnc-FANCI-2 is a novel lncRNA that had never been examined in any functional study, our report found that it regulates RAS signaling. This report therefore focuses on lnc-FANCI-2 and the RAS signaling pathway but also includes some screening data that are important for our readers to understand how we arrived at RAS signaling.

      Reviewer #2 (Public review):

      The study by Liu et al provides a functional analysis of lnc-FANCI-2 in cervical carcinogenesis, building on their previous discovery of FANCI-2 being upregulated in cervical cancer by HPV E7.

      The authors conducted a comprehensive investigation by knocking out (KO) FANCI-2 in CaSki cells and assessing viral gene expression, cellular morphology, altered protein expression and secretion, altered RNA expression through RNA sequencing (verification of which by RT-PCR is well appreciated), protein binding, etc. Verification experiments by RT-PCR, Western blot, etc are notable strengths of the study.

      The KO and KD were related to increased Ras signaling and EMT and reduced IFN-γ/α responses.

      Thanks for your positive comments. It did take us a few years to reach this scientific point for understanding of lnc-FANCI-2 function.

      Although the large amount of data is well acknowledged, it is a limitation that most data come from CaSki cells, in which FANCI-2 localization is different from SiHa cells and cancer tissues (Figure 1). The cytoplasmic versus nuclear localization is somewhat puzzling.

      Regarding lnc-FANCI-2 localization, it can be both cytoplasmic and nuclear in cervical cancer tissues, HPV16- or HPV18-infected keratinocytes, and the HPV16+ cervical cancer cell line CaSki, which contains multiple integrated HPV16 DNA copies. Surprisingly, it is mostly detectable in the nucleus in HPV16+ SiHa cells, which contain only one copy of integrated HPV16 DNA (Yu, L., et al. mBio 15: e00729-24, 2024). Regardless of localization, knockdown of lnc-FANCI-2 expression in SiHa cells induces RAS signaling, leading to an increase in the expression of p-AKT and p-Erk1/2 (suppl. Fig. S6B).

      Reviewer #3 (Public review):

      Summary:

      A long noncoding RNA, lnc-FANCI-2, was reported by this group to be regulated by the HPV E7 oncoprotein and a cellular transcription factor, YY1. The current study focuses on the function of lnc-FANCI-2 in HPV-16-positive cervical cancer, which is to intrinsically regulate RAS signaling, thereby furthering our understanding of additional cellular alterations during HPV oncogenesis. The authors used advanced technical approaches such as KO, transcriptome, IRPCRP, and LC-MS/MS analyses in the current study and concluded that KO of lnc-FANCI-2 significantly increases RAS signaling, especially phosphorylation of Akt and Erk1/2.

      Strengths:

      (1) HPV E6E7 are required for full immortalization and maintenance of the malignant phenotype of cervical cancer, but they are NOT sufficient for full transformation and tumorigenesis. This study helps further understanding of other cellular alterations in HPV oncogenesis.

      (2) lnc-FANCI-2 is upregulated in cervical lesion progression from CIN1, CIN2-3 to cervical cancer, cancer cell lines, and HPV transduced cell lines.

      (3) Viral E7 of high-risk HPVs and host transcription factor YY1 are two major factors promoting lnc-FANCI-2 expression.

      (4) Proteomic profiling of cytosolic and secreted proteins showed inhibition of MCAM, PODXL2, and ECM1 and increased levels of ADAM8 and TIMP2 in KO cells.

      (5) RNA-seq analyses revealed that KO cells exhibited significantly increased RAS signaling but decreased IFN pathways.

      (6) Increased phosphorylated Akt and Erk1/2, IGFBP3, MCAM, VIM, and CCND2 (cyclin D2) and decreased RAC3 were observed in KO cells.

      Thanks for your positive comments. It has taken us almost nine years to reach this point to gradually understand lnc-FANCI-2 functions, which are more complex than our initial thoughts.  

      Weaknesses:

      (1) The authors observed increased lnc-FANCI-2 in HPV 16 and 18 transduced cells, and in other cervical cancer tissues as well, but HPV-18 positive HeLa cells exhibited different expression of lnc-FANCI-2.

      Both HPV16 and HPV18 infections induce lnc-FANCI-2 expression in keratinocytes (Liu H., et al. PNAS, 2021). However, HPV18+ cervical cancer cell lines HeLa and C4II cells (Figure S1A and S1B) do not express lnc-FANCI-2 as we see in HPV-negative cell lines such as HCT116, HEK293, HaCaT, and BCBL1 cells. Although we don’t know why, our preliminary data show that the lnc-FANCI-2 promoter functions well and is sensitive to YY1 binding in lnc-FANCI-2 expressing CaSki and C33A cells in our dual luciferase assays but is much less sensitive to YY1 binding in HeLa and HCT116 cells, indicating some unknown cellular factors negatively regulating lnc-FANCI-2 promoter activity.

      Author response image 1.

      A firefly luciferase (FLuc) reporter containing either the wild-type (−600 wt) or YY1-binding-site-mutated lnc-FANCI-2 promoter was evaluated in CaSki, HeLa, C33A, and HCT116 cells for its promoter activity, with Renilla luciferase (RLuc) activity driven by a TK promoter serving as an internal control. The two YY1-binding motifs (A and B) with a X for mutation are illustrated in the right diagram.

      (2) Previous studies and data in the current study showed a steady increase in lnc-FANCI-2 during cancer progression; however, the authors did not observe significant changes in cell behaviors (both morphology and proliferation) in lnc-FANCI-2 KO cells.

      Thanks. We do see decreases in cell proliferation, colony formation, and cell migration, accompanied by increased cell senescence, in the lnc-FANCI-2 KO cells relative to the parental WT cells. These data are now added to the revised Fig. 1 and the revised supplemental Fig. S3.

      (3) The authors observed the significant changes of RAS signaling (downstream) in KO cells, but they provided limited interpretations of how these results contributed to full transformation or tumorigenesis in HPV-positive cancer.

      As stated in the title, lnc-FANCI-2 intrinsically restricts RAS signaling and phosphorylation of Akt and Erk in HPV16-infected cervical cancer. Presumably, high RAS-AKT-ERK signaling inhibits tumor cell survival through senescence induction, as we show in our new Figure 1 and supplemental Fig. S3. A similar finding was reported in a lung cancer study (Patricia Nieto, et al. Nature 548: 239-243, 2017).

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Major comments:

      (1) A major issue is that parts of the manuscript read like a collection of experimental results. However, some of the results do not contribute directly to the central story. Besides confusing the reader, the large amount of apparently disparate results can raise more questions. For example:

      a) Why is lnc-FANCI-2 highly expressed in HPV16-infected cervical cancer cell lines (but not in HPV18-infected cells)?

      b) How do p53 and RB repress the expression of lnc-FANCI-2?

      c) What regulates the sub-cellular localization of lnc-FANCI-2?

      d) How does lnc-FANCI-2 negatively regulate RAS signalling?

      e) How does MAP4K4 bind to lnc-FANCI-2?

      f) Do lnc-FANCI-2 and MAP4K4 require each other to regulate RAS signalling?

      g) How does RAS signalling regulate the transcription of MCAM and IGFBP3?

      h) How does MCAM feedback on RAS? Do the different MCAM isoforms impact on RAS signalling differently?

      i) How does IGFBP3 feedback on ERK but not AKT?

      j) How do the other mentioned proteins like ADAM8 fit into the regulatory network?

      k) Each question will require a lot more work to address. I think it would be good if the authors could think through carefully what the key message(s) in the current manuscript should be and then present a more focused write-up.

      Thanks for the critical comments. Because this study is the first to explore lnc-FANCI-2 functions, we would like to be inclusive. We believe these data are important to guide any future studies. We really appreciate our reviewer listing many questions related to HPV infection, cell biology, RAS signaling, and cancer biology in questions a to k. Addressing each question in a satisfactory way will require a separate study, but fortunately, our report points out such directions with some preliminary data for future studies. Below are our responses to each question from a to k:

      a) Both HPV16 and HPV18 infection induce lnc-FANCI-2 expression in keratinocytes (Liu H., et al. PNAS, 2021). However, HPV18+ cervical cancer cell lines HeLa and C4II cells (Figure S1A and S1B) do not express lnc-FANCI-2 as we see in HPV-negative cell lines such as HCT116, HEK293, HaCaT, and BCBL1 cells. Although we don’t know why, our preliminary data show that lnc-FANCI-2 promoter functions well and is sensitive to YY1 binding in lnc-FANCI-2 expressing CaSki and C33A cells but is much less sensitive to YY1 in HeLa and HCT116 cells, indicating some unknown cellular factors negatively regulating lnc-FANCI-2 promoter activity.

      b) We don’t know whether p53 and pRB could repress the expression of lnc-FANCI-2, although C33A cells bearing a mutant p53 and a mutant pRB express a high amount of lnc-FANCI-2. However, KD of E2F1 had no effect on lnc-FANCI-2 promoter activity in CaSki cells (Liu, H., et al. PNAS, 2021).

      c) RNA cellular localization can be affected by many factors, including splicing, export, and polyadenylation. As lnc-FANCI-2 is a long non-coding RNA, the regulation of its cellular localization could be more complicated than that of mRNAs and thus could be a future research direction.

      d) The conclusion that lnc-FANCI-2 negatively regulates RAS signaling is based on both lnc-FANCI-2 KO and KD studies. Please see the proposed hypothetical model in Figure 8E.

      e) MAP4K4 binding to lnc-FANCI-2 was demonstrated by our IRPCRP-mass spectrometry (Fig. 8A and 8C), although the exact binding site on lnc-FANCI-2 was not explored. As you probably know, many enzymes have turned out to be RNA-binding proteins (Castello A., et al. Trends Endocrinol. Metab. 26: 746-757, 2015; Hentze MW., et al. Nat. Rev. Mol. Cell Biol. 19: 327-341, 2018).

      f) Yes, they rely slightly on each other in regulating RAS signaling. We found that KD of MAP4K4 in parental CaSki cells (Figure 8D) had a greater effect on RAS signaling (MCAM, IGFBP3, p-Akt) than in lnc-FANCI-2 KO ΔPr-A9 cells. In contrast, the latter displayed more p-Erk1/2 than that induced by KD of lnc-FANCI-2 in the parental CaSki cells (Figure S7C).

      g) We believe RAS signaling most likely regulates the transcription of MCAM and IGFBP3 through phosphorylated transcription factors (Figure 8E diagram).

      h) As a signaling molecule with at least 13 ligands/coreceptors (Joshkon A., et al. Biomedicines 8: 633, 2020), the increased MCAM appears to sustain RAS signaling (Fig. 7J and Fig. 8E). We assume that the full-length cytoplasmic MCAM plays the predominant role in RAS signaling because it is more abundant than the cleaved nuclear MCAM, which lacks both the transmembrane and cytoplasmic regions. In addition, RAS signaling mainly occurs in the cytosol.

      i) The exact mechanism remains unknown. Lnc-FANCI-2 KO cells exhibit high expression levels of IGFBP3 RNA and protein and of p-Erk1/2, but less so of p-Akt, possibly because IGFBP3 regulates MAPK for Erk phosphorylation but has much less effect on PI3K for Akt phosphorylation.

      j) The dysregulation of RAS signaling and ADAM protein activity is implicated in various cancers. ADAM proteins can modulate RAS signaling by cleaving and releasing ligands that activate or inactivate RAS-related pathways (Schafer B., et al. JBC 279: 47929-38, 2004; Ohtsu H., et al. Am J Physiol Cell Physiol 291: C1-C10, 2006; Dang M, et al. JBC 286: 17704-17713, 2011; Kleino I, et al. PLoS One 10: e0121301, 2015). Some ADAM proteins are involved in the migration and invasion of cancer cells, and their loss can promote the degradation of KRAS (Huang Y-K., et al. Nat Cancer 5: 400-419, 2024). In this revision, we have added a brief discussion of ADAMs and RAS signaling.

      k) We agree with our reviewer that each question will require a lot more work to address. As this study is to explore the lnc-FANCI-2 function for the first time, however, we prefer to include all of these data that have been selectively included in this write-up. We hope reviewer 1 will be satisfied with our response to each question from a to j. 

      (2) Figures S1A & S1C - Replicates are needed.

      Yes, we have repeated all of the experiments. The quantification shown in Figure S1A and S1C was performed in triplicate, and error bars have been added to the updated figure.

      (3) Figure S1D - There seems to be some lnc-FANCI-2 RNA in the nucleus of CaSki cells as well. Please quantify the relative amount of lnc-FANCI-2 in the nucleus vs cytoplasm.

      Yes, a small fraction of lnc-FANCI-2 is in the nucleus of CaSki cells as we reported (Liu H., PNAS, 2021, Movies S1 and S2). We did quantify by fractionation and RT-qPCR the relative amount of lnc-FANCI-2 in the nucleus vs cytoplasm in Figure S1C. 

      (4) Figure S2B - (a) For ΔPr-A9 cells, it looks like there is an increase in E6 and a decrease in E7, instead of "little change" as the authors claimed. (b) I suggest checking the protein levels for all the control and KO clones.

      Thanks for the questions. We had some variation in E6 and E7 detection, and the submitted result was a representative one. We regrew the lnc-FANCI-2 KO clones A9 and B3 and reexamined the expression of HPV16 E6/E7 proteins and their downstream targets, p53 and E2F1. As shown in new Figure S3A expt II, we again saw some variation in detection (~20-30%), and this variation was not reflected in a noticeable change in their downstream targets. Thus, we do not consider these changes significant enough to draw a conclusion in our study; they most likely arise from sampling in the assays.

      (5) In the Proteome Profiler Human sReceptor Array analysis, multiple proteins were highlighted as having at least 30% change. But it is unclear how they relate to RAS signaling.

      Thanks for this comment. Cellular soluble receptors are essential for RAS signaling, the EMT pathway and IFN responses. For example, the dysregulation of RAS signaling and ADAM protein activity is implicated in various cancers. ADAM proteins can modulate RAS signaling by cleaving and releasing ligands that activate or inactivate RAS-related pathways (Schafer B., et al. JBC 279: 47929-38, 2004; Ohtsu H., et al. Am J Physiol Cell Physiol 291: C1-C10, 2006; Dang M, et al. JBC 286: 17704-17713, 2011; Kleino I, et al. PLoS One 10: e0121301, 2015). Some ADAM proteins are involved in the migration and invasion of cancer cells, and their loss can promote the degradation of KRAS (Huang Y-K., et al. Nat Cancer 5: 400-419, 2024). In this revision, we have added a brief discussion on ADAMs and RAS signaling.

      (6) Does knockdown of MAP4K4 lead to an increase in MCAM and IGFBP3?

      Yes, MAP4K4 KD in parental WT CaSki cells does lead to an increase in MCAM (~70%) and IGFBP3 (~30%), similar to the knockdown of lnc-FANCI-2, as shown in the revised Figure 8D.

      Minor comments:

      (7) In the opinion of this reviewer the title is somewhat unwieldy.

      Thanks. We have shortened the title to “The lnc-FANCI-2 intrinsically restricts RAS signaling in HPV16-infected cervical cancer”

      (8) The abstract can be more focused and doesn't have to mention so many gene names. In fact, the significance paragraph works better as an abstract. For the significance, the authors can provide another write-up on the implications of their research instead.

      Thanks. We have revised the abstract and added the implications of this research.

      (9) The last sentence of the introduction feels a little abrupt. It would be good to elaborate a little more on the key findings.

      Thanks for this critical comment. We have revised as in the following: In this report, we demonstrate that lnc-FANCI-2 in HPV16-infected cells controls RAS signaling by interaction with MAP4K4 and other RNA-binding proteins. Ablation of lnc-FANCI-2 in the cells promotes RAS signaling and phosphorylation of Akt and Erk. High levels of lnc-FANCI-2 and low level of MCAM expression in cervical cancer patients correlate with improved survival, indicating that lnc-FANCI-2 plays a critical role in regulating RAS signaling to affect cervical cancer progression and patient outcomes.

      (10) Typo on line 191: Should be ADAM8 and not ADMA8.

      Corrected.

      Reviewer #2 (Recommendations for the authors):

      The paper contains a vast amount of data and would greatly benefit from an expanded version of the schematic of Figure 8E summarizing the main results. Including additional details on FANCI-2 regulation by HPV (primarily from previous studies) and its implications for HPV16-driven carcinogenesis would provide a more comprehensive overview.

      Thanks for the suggestion. We have modified our Figure 8E to include HR-HPV E7 and YY1 in regulation of lnc-FANCI-2 transcription.

      Further specific comments:

      (1) The introduction may be shortened to increase readability (e.g. lines 77-90; 94-105).

      We have shortened the introduction by deletion of the lines 94-105 from our initial submission.

      (2) Lines 55-57 the number of cervical cancer diagnoses and mortality need to be updated to the latest literature. The reference is from 2012.

      Thanks. We have revised and updated accordingly with a new citation (Bray F., et al: Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 74, 229-263 (2024))

      (3) Line 61: Progression rate of CIN3 is incorrect (31% in 30 years according to reference 5).

      Thanks. Corrected.

      (4) Lines 108-112 are difficult to understand and should be rewritten.

      Thanks. Revised accordingly.

      (5) Line 116 Is this correct or should 'but' be 'and'?

      Thanks. Corrected accordingly.

      (6) Figure 1A top: The difference between cervical cancer and normal areas is hard to see in the top figure. The region labeled as "normal" does not resemble typical differentiating epithelium or normal glandular epithelium, though this is difficult to assess accurately from the image provided. I suggest adding HE staining and also the histotypes.

      We have added an H&E staining panel in the corresponding region to Figure 1A, which clearly shows the normal and cancer regions. Both cervical cancer tissues were cervical squamous cell carcinoma.

      (7) HFK-HPV16 & 18 cells (Figure 1B) are not described in the Materials & Methods.

      Thanks. We revised our Materials and Methods by citing our two previous publications.

      (8) Figure 2E (RNA scope on FANCI-2 KO) only shows 2 to 3 cells, which makes it somewhat difficult to assess downregulated expression in the KO. I suggest replacing these with pictures showing more cells (i.e. >10) to strengthen the results.

      We have replaced the image in Figure 2E to include more cells.

      (9) The spindle-like morphology in deltaPr-A9 cells shown in FigS2A is not very distinct. Including images at higher magnification could help clarify this feature.

      Good comment. We have enlarged the images for better view and revised the context.

      (10) Both protein and RNA expression analysis have been performed on WT CaSki cells and FANCI-2 KO cells. If I am correct there is little overlap between the significantly changed gene products. What does this mean? Have you looked into the comparison?

      The DEGs identified by RNA-seq reflect a genome-wide transcriptome change, whereas the protein array we used covered only 105 soluble protein receptors. However, we did find that 9 of 15 (60%) membrane proteins in cell lysates (PODXL2, ECM1, NECTIN2, MCAM, ADAM9, CDH5, ADAM10, ITGA5, NOTCH1, SCARF2, ADAM8, TIMP2, LGALS3BP, CDH13, and ITGB6) exhibited consistent changes in expression (underlined) in both the RNA-seq and protein array assays. We have revised the text with this information (page 11). The inconsistent expression of the other six proteins (40%) between the two assays could be due to post-translational mechanisms, such as protein stability, modification, and secretion.

      (11) Figure S7, which represents TCGA data and survival is quite complex. It would be more effective to display a similar figure for FANCI-2, as was done for MCAM in Figure 7I, to simplify the comparison and enhance clarity.

      Thanks. However, the suggested figure for lnc-FANCI-2 was already published in our PNAS paper (Liu H., et al. PNAS, 2021). Figure S8 in this revision is the result from our in-house GradientScanSurv pipeline, a new way to correlate expression and survival more accurately.

      What do the Figures look like if you analyse only HPV16+ patients versus HPV18+ patients, considering that FANCI-2 upregulation in cell lines is related to HPV16 and not 18? Is there an effect of histotype? Or tumor stage?

      HPV18-infected keratinocytes express high levels of lnc-FANCI-2. The lack of lnc-FANCI-2 expression in the two HPV18<sup>+</sup> cell lines, HeLa and C4II, and in HPV-negative cell lines such as HCT116 could be due to the presence of some unknown repressive factors. In our dual luciferase assays, we found that the lnc-FANCI-2 promoter responds well to YY1 binding in CaSki and C33A cells, which express lnc-FANCI-2, but does not do so in HeLa and HCT116 cells.

      (12) It remains puzzling that FANCI-2 upregulation was previously shown to already occur in CIN lesions and increase further in cervical cancer, while the current data indicate that FANCI-2 suppresses AKT activation. If I am correct Akt activation has been linked to cervical carcinogenesis. Similarly, line 434 states that increased MCAM might promote cervical tumorigenesis, implying that low FANCI-2 would stimulate tumorigenesis. If I understand correctly, the increase in FANCI-2 observed in CIN lesions would reflect a "brake" on the carcinogenic pathway and its sustained increase in cancer might indicate that growth is still (partly) controlled. As mentioned earlier, a Figure illustrating the relation between FANCI-2, HPV, and the carcinogenic process would be beneficial for clarity.

      Yes. Increased MCAM, but low level of lnc-FANCI-2, correlates with poor cervical cancer survival. We have revised Figure 8E to illustrate this relation better.  

      (13) May part of the potentially conflicting findings be explained by CaSki cells being of metastatic origin? Related to this, does the expression of FANCI-2 or MCAM depend on the tumor stage?

      Thanks for this important suggestion. Unfortunately, we found that the expression of lnc-FANCI-2 and MCAM is not associated with cervical cancer stage based on the TCGA data (http://gepia.cancer-pku.cn/index.html). See the data below:

      Author response image 2.

      Despite some lingering uncertainty, the extensive experiments conducted using KO and KD cells do provide compelling evidence that lnc-FANCI-2 function is linked to RAS signaling and EMT.

      Thanks for your positive review and instructive comments.

      Reviewer #3 (Recommendations for the authors):

      (1) The authors observed increased lnc-FANCI-2 in HPV16- and HPV18-transduced cells and in cervical cancer tissues as well, yet HPV18-positive HeLa cells exhibited different expression of lnc-FANCI-2. I suggest the authors provide more discussion of this difference, for example, HPV genotypes, HPV genome status in host cells, or cell types.

      Thanks. We found that keratinocyte infection with HPV16, HPV18, and other HR-HPVs induces lnc-FANCI-2 expression (Liu H., et al. PNAS, 2021). In this report, we found that HPV18<sup>+</sup> HeLa and C4II cells and HPV-negative cell lines do not express it. Our preliminary lnc-FANCI-2 promoter activity assays indicated the presence of negative regulatory factor(s) in cells that do not express lnc-FANCI-2. See the data in Author response image 1.

      We have revised our discussion by including these luciferase data as data not shown.

      (2) I suggest the authors discuss more details on how the changes of RAS signaling in KO cells help our further understanding of the molecular mechanisms for HPV-associated full-cell transformation and malignancy in addition to the well-known functions of HPV E6 and E7.

      Thanks. We have modified the Figure 8E as suggested by reviewer 2 and revised the discussion further.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      Summary:

      This paper performs fine-mapping of the silkworm mutants bd and its fertile allelic version, bdf, narrowing down the causal intervals to a small interval of a handful of genes. In this region, the gene orthologous to mamo is impaired by a large indel, and its function is later confirmed using expression profiling, RNAi, and CRISPR KO. All these experiments are convincingly showing that mamo is necessary for the suppression of melanic pigmentation in the silkworm larval integument. The authors also use in silico and in vitro assays to probe the potential effector genes that mamo may regulate. Strengths: The genotype-to-phenotype workflow, combining forward (mapping) and reverse genetics (RNAi and CRISPR loss-of-function assays) linking mamo to pigmentation are extremely convincing.

      Response: Thank you very much for your affirmation of our work. The reviewer discussed the parts of our manuscript that involve evolution sentence by sentence. We have further refined the description in this regard and improved the logical flow. Thank you again for your help.

      Weaknesses:

      1) The last section of the results, entitled "Downstream target gene analysis" is primarily based on in silico genome-wide binding motif predictions.

      While the authors identify a potential binding site using EMSA, it is unclear how much this general approach over-predicted potential targets. While I think this work is interesting, its potential caveats are not mentioned. In fact the Discussion section seems to trust the high number of target genes as a reliable result. Specifically, the authors correctly say: "even if there are some transcription factor-binding sites in a gene, the gene is not necessarily regulated by these factors in a specific tissue and period", but then propose a biological explanation that not all binding sites are relevant to expression control. This makes a radical short-cut that predicted binding sites are actual in vivo binding sites. This may not be true, as I'd expect that only a subset of binding motifs predicted by Positional Weight Matrices (PWM) are real in vivo binding sites with a ChIP-seq or Cut-and-Run signal. This is particularly problematic for PWM that feature only 5-nt signature motifs, as inferred here for mamo-S and mamo-L, simply because we can expect many predicted sites by chance.

      Response: Thank you very much for your careful work. The analysis and identification of transcription factor-binding sites is an important issue in gene regulation research. Techniques such as ChIP-seq can be used to experimentally identify the binding sites of transcription factors (TFs). However, reports using these techniques often only detect specific cell types and developmental stages, resulting in a limited number of downstream target genes for some TFs. Interestingly, TFs may regulate different downstream target genes in different cell types and developmental stages.

      Previous research has suggested that the ZF-DNA binding interface can be understood as a “canonical binding model”, in which each finger contacts DNA in an antiparallel manner. The binding sequence of the C2H2-ZF motif is determined by the amino acid residue sequence of its α-helical component. Considering the first amino acid residue in the α-helical region of the C2H2-ZF domain as position 1, positions -1, 2, 3, and 6 are key amino acids for recognizing and binding DNA. The residues at positions -1, 3, and 6 specifically interact with base 3, base 2, and base 1 of the DNA sense sequence, respectively, while the residue at position 2 interacts with the complementary DNA strand (Wolfe SA et al., 2000; Pabo CO et al., 2001). Based on this principle, the binding sites of C2H2-ZF have good reference value. For the 5-nt PWM sequence, we referred to the study of D. melanogaster, which was identified by EMSA (Shoichi Nakamura et al., 2019). In the new version, we have rewritten this section.
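      The contact rule described above can be written down as a small lookup. The following is only an illustrative sketch: the function name, the position labels, and the seven-residue input string (helix positions -1 and 1 to 6) are assumptions introduced here for illustration, not part of the analysis in the manuscript.

```python
# Helix positions in the canonical C2H2-ZF numbering used above:
# -1 (the residue just before the helix), then 1..6. There is no position 0.
HELIX_POSITIONS = (-1, 1, 2, 3, 4, 5, 6)

# Which helix position contacts which part of the DNA site, as stated in the text.
CONTACTS = {
    -1: "base 3 (sense strand)",
    3: "base 2 (sense strand)",
    6: "base 1 (sense strand)",
    2: "complementary strand",
}


def helix_contacts(helix):
    """Map each DNA contact to the residue at the corresponding helix position.

    `helix` is a hypothetical 7-letter string of residues at positions -1, 1..6.
    """
    if len(helix) != len(HELIX_POSITIONS):
        raise ValueError("expected 7 residues covering helix positions -1 and 1..6")
    residue_at = dict(zip(HELIX_POSITIONS, helix))
    return {target: residue_at[pos] for pos, target in CONTACTS.items()}


if __name__ == "__main__":
    # "RSDELTR" is an arbitrary example input, not a Bm-mamo sequence.
    for target, residue in helix_contacts("RSDELTR").items():
        print(f"{target}: {residue}")
```

      The lookup only records which residue faces which base; predicting the base itself requires a recognition code or a trained model, which is outside this sketch.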

      Pabo CO, Peisach E, Grant RA. Design and selection of novel Cys2His2 zinc finger proteins. Annu Rev Biochem. 2001;70:313-340.

      Wolfe SA, Nekludova L, Pabo CO. DNA recognition by Cys2His2 zinc finger proteins. Annu Rev Biophys Biomol Struct. 2000;29:183-212.

      Nakamura S, Hira S, Fujiwara M, et al. A truncated form of a transcription factor Mamo activates vasa in Drosophila embryos. Commun Biol. 2019;2:422. Published 2019 Nov 20.

      2) The last part of the current discussion ("Notably, the industrial melanism event, in a short period of several decades ... a more advanced self-regulation program") is flawed with important logical shortcuts that assign "agency" to the evolutionary process. For instance, this section conveys the idea that phenotypically relevant mutations may not be random. I believe some of this is due to translation issues in English, as I understand that the authors want to express the idea that some parts of the genome are paths of least resistance for evolutionary change (e.g. the regulatory regions of developmental regulators are likely to articulate morphological change). But the language and tone are made worse by the mention that in another system, a mechanism involving photoreception drives adaptive plasticity, making it sound like the authors want to make a Lamarckian argument here (inheritance of acquired characteristics), or a point about orthogenesis (e.g. the idea that the environment may guide non-random mutations).

      Because this last part of the current discussion suffers from confused statements on modes and tempo of regulatory evolution and is rather out of topic, I would suggest removing it.

      In any case, it is important to highlight here that while this manuscript is an excellent genotype-to-phenotype study, it has very few comparative insights on the evolutionary process. The finding that mamo is a pattern or pigment regulatory factor is interesting and will deserve many more studies to decipher the full evolutionary study behind this Gene Regulatory Network.

      Response: Thank you very much for your careful work. In this part of the manuscript, we introduced some assumptions that make the statement slightly unconventional. The color pattern of insects is an adaptive trait. The bd and bdf mutants used in the study are formed spontaneously. As a frequent variation and readily observable phenotype, color patterns have been used as models for evolutionary research (Wittkopp PJ et al., 2011). Darwin's theory of natural selection has epoch-making significance. I deeply believe in the theory that species strive to evolve through natural selection. However, with the development of molecular genetics, Darwinism’s theory of undirected random mutations and slow accumulation of micromutations resulting in phenotype evolution has been increasingly challenged.

      The prerequisite for undirected random mutations and micromutations is excessive reproduction to generate a sufficiently large population. A sufficiently large population can contain sufficient genotypes to face various survival challenges. However, it is difficult to explain how some small groups and species with relatively low fertility rates have survived thus far. More importantly, the theory cannot explain the currently observed genomic mutation bias. In scientific research, every theory is constantly being modified to adapt to current discoveries. The most famous example is the debate over whether light is a particle or a wave, which has lasted for hundreds of years. However, in the 20th century, both sides seemed to compromise with each other, believing that light has a wave‒particle duality.

      In summary, we have rewritten this section to reduce unnecessary assumptions.

      Wittkopp PJ, Kalay G. Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence. Nat Rev Genet. 2011;13(1):59-69.

      Minor Comment:

      The gene models presented in Figure 1 are obsolete, as there are more recent annotations of the Bm-mamo gene that feature more complete intron-exon structures, including for the neighboring genes in the bd/bdf intervals. It remains true that the mamo locus encodes two protein isoforms.

      An example of the Bm-mamo locus annotation can be found at https://www.ncbi.nlm.nih.gov/gene/101738295, where RNAseq expression tracks (including from larval epidermis) can be displayed in the embedded genome browser using the "Configure Tracks" tool.

      Based on these more recent annotations, I would say that most of the work on the two isoforms remains valid, but FigS2, and particularly Fig.S2C, need to be revised.

      Response: Thank you very much for your careful work. In this study, we referred to the predicted genes of SilkDB, NCBI and Silkbase. In different databases, there are varying degrees of differences in the number of predicted genes and the length of gene mRNA. Because the SilkDB database is based on the first silkworm genome, it has been used for the longest time and has a relatively large number of users. In the revised manuscript, we have added the predicted genes of NCBI and Silkbase in Figure S1.

      Author response image 1.

      The predicted genes and qPCR analysis of candidate genes in the responsible genomic region for the bd mutant. (A) The predicted genes in SilkDB; (B) the predicted genes in GenBank; (C) the predicted genes in Silkbase; (D) analysis of nucleotide differences in the responsible region of bd; (E) investigation of the expression level of candidate genes.

      Reviewer #2 (Public Review):

      Summary:

      The authors tried to identify new genes involved in melanin metabolism and its spatial distribution in the silkworm Bombyx mori. They identified the gene Bm-mamo as playing a role in caterpillar pigmentation. By functional genetic and in silico approaches, they identified putative target genes of the Bm-mamo protein. They showed that numerous cuticular proteins are regulated by Bm-mamo during larval development.

      Strengths:

      • preliminary data about the role of cuticular proteins to pattern the localization of pigments

      • timely question

      • challenging question because it requires the development of future genetic and cell biology tools at the nanoscale

      Response: Thank you very much for your affirmation of our work. The reviewer's familiarity with the color patterns of Lepidoptera is helpful, and the recommendation raised has provided us with very important assistance. This has allowed us to make significant progress with our manuscript.

      Weaknesses:

      • statistical sampling limited

      • the discussion would gain in being shorter and refocused on a few points, especially the link between cuticular proteins and pigmentation. The article would be better if the last evolutionary-themed section of the discussion is removed.

      A recent paper has been published on the same gene in Bombyx mori (https://www.sciencedirect.com/science/article/abs/pii/S0965174823000760) in August 2023. The authors must discuss and refer to this published paper through the present manuscript.

      Response: Thank you very much for your careful work. First, we believe that competitive research is sometimes coincidental and sometimes intentional. Our research began in 2009, when we began to configure the recombinant population. In 2016, we published an article on comparative transcriptomics (Wu et al. 2016). The article mentioned above has a strong interest in our research and is based on our transcriptome analysis for further research, with the aim of making a preemptive publication. To discourage such behavior, we cannot cite it and do not want to discuss it in our paper.

      Songyuan Wu et al. Comparative analysis of the integument transcriptomes of the black dilute mutant and the wild-type silkworm Bombyx mori. Sci Rep. 2016 May 19:6:26114. doi: 10.1038/srep26114.

      Reviewer #1 (Recommendations For The Authors):

      1) please consider using a more recent annotation model of the B. mori genome to revise your Result Section 1, Fig.1, and Fig. S2. https://www.ncbi.nlm.nih.gov/gene/101738295

      Specifically, you used BGIM_ gene models, while the current annotation such as the one above featured in the NCBI database provides more accurate intron-exon structures without splitting mamo into two genes. I believe this can be done with minor revisions of the figures, and you could keep the BGIM_ gene names for the text.

      Response: Thank you very much for your careful work. The GenBank of NCBI (National Center for Biotechnology Information) is a very good database that we often use and refer to in this research process. Our research started in 2009, so we mainly referred to the SilkDB database (Jun Duan et al., 2010), although other databases also have references, such as NCBI and Silkbase (https://silkbase.ab.a.u-tokyo.ac.jp/cgi-bin/index.cgi). Because the SilkDB database was constructed based on the first published silkworm genome data, it has been used for the longest time and has a relatively large number of users. Recently, researchers are still using these data (Kejie Li et al., 2023).

      The problem with predicting the mamo gene as two genes (BGIBMGA012517 and BGIBMGA012518) in SilkDB is mainly due to the presence of alternative splicing of the mamo gene. BGIBMGA012517 corresponds to the shorter transcript (mamo-s) of the mamo gene. Due to the differences in sequencing individuals, sequencing methods, and methods of gene prediction, there are differences in the number and sequence of predicted genes in different databases. We added the pattern diagram of predicted genes from NCBI and Silkbase, and the expression levels of new predicted genes are shown in Supplemental Figure S1.

      Jun Duan et al., SilkDB v2.0: a platform for silkworm (Bombyx mori) genome biology. Nucleic Acids Res. 2010 Jan;38(Database issue): D453-6. doi: 10.1093/nar/gkp801.

      Kejie Li et al., Transcriptome analysis reveals that knocking out BmNPV iap2 induces apoptosis by inhibiting the oxidative phosphorylation pathway. Int J Biol Macromol. 2023 Apr 1;233:123482. doi: 10.1016/j.ijbiomac.2023.123482. Epub 2023 Jan 31.

      Author response image 2.

      The predicted genes and qPCR analysis of candidate genes in the responsible genomic region for the bd mutant. (A) The predicted genes in SilkDB; (B) the predicted genes in GenBank; (C) the predicted genes in Silkbase; (D) analysis of nucleotide differences in the responsible region of bd; (E) investigation of the expression level of candidate genes.

      2) As I mentioned in my public review, I strongly believe the interpretation of the PWM binding analyses require much more conservative statements taking into account the idea that short 5-nt motifs are expected by chance. The work in this section is interesting, but the manuscript would benefit from a quite significant rewrite of the corresponding Discussion section, making it that the in silico approach is prone to the identification of many sites in the genomes, and that very few of those sites are probably relevant for probabilistic reasons. I would recommend statements such as "Future experiments assessing the in vivo binding profile of Bm-mamo (eg. ChIP-seq or Cut&Run), will be required to further understand the GRNs controlled by mamo in various tissues".

      Response: Thank you very much for your careful work. Previous research has suggested that the ZF-DNA binding interface can be understood as a “canonical binding model”, in which each finger contacts DNA in an antiparallel manner. The binding sequence of the C2H2-ZF motif is determined by the amino acid residue sequence of its α-helical component. Considering the first amino acid residue in the α-helical region of the C2H2-ZF domain as position 1, positions -1, 2, 3, and 6 are key amino acids for recognizing and binding DNA. The residues at positions -1, 3, and 6 specifically interact with base 3, base 2, and base 1 of the DNA sense sequence, respectively, while the residue at position 2 interacts with the complementary DNA strand (Wolfe SA et al., 2000; Pabo CO et al., 2001). Based on this principle, the prediction of DNA recognition motifs of C2H2-type zinc finger proteins currently has good accuracy.

      The predicted DNA binding sequence (GTGCGTGGC) of the mamo protein in Drosophila melanogaster was highly consistent with that of silkworms. In addition, in D. melanogaster, the predicted DNA binding sequence of mamo, the bases at positions 1 to 7 (GTGCGTG), was highly similar to the DNA binding sequence obtained from EMSA experiments (Seiji Hira et al., 2013). Furthermore, in another study on the mamo protein of Drosophila melanogaster, five bases (TGCGT) were used as the DNA recognition core sequence of the mamo protein (Shoichi Nakamura et al., 2019). In the JASPAR database (https://jaspar.genereg.net), there are also some shorter (4-6 nt) DNA recognition sequences; for example, the DNA binding sequence of Ubx is TAAT (ID MA0094.1) in Drosophila melanogaster. However, we used longer DNA binding motifs (9 nt and 15 nt) of mamo to study the 2 kb genomic regions near the predicted gene. Over 70% of predicted genes were found to have these feature sequences near them. This analysis method is carried out with common software and processes. Due to sufficient target proteins, the accessibility of DNA, the absence of suppressors, the suitability of ion environments, etc., zinc finger protein transcription factors are more likely to bind to specific DNA sequences in vitro than in vivo. Using ChIP-seq or Cut&Run techniques to analyze various tissues and developmental stages in silkworms can yield one comprehensive DNA-binding map of mamo, and some false positives generated by predictions can be excluded. Thank you for your suggestion. We will conduct this work in the next research step. In addition, for brevity, we deleted the predicted data (Supplemental Tables S7 and S8) that used shorter motifs.
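      To make the motif-length point concrete, the expected number of chance matches in a 2 kb region can be estimated with a back-of-the-envelope calculation. This is only a rough sketch under simplifying assumptions that are not taken from the manuscript: equal base frequencies, independent positions, exact matching to a single sequence (no PWM degeneracy or score threshold), and scanning of both strands. The function name is hypothetical.

```python
def expected_chance_hits(motif_len, region_len=2000, both_strands=True):
    """Expected number of exact matches of one fixed motif in a random sequence.

    Assumes each base occurs with probability 1/4, independently at every position.
    """
    if motif_len <= 0:
        raise ValueError("motif_len must be positive")
    if motif_len > region_len:
        return 0.0
    windows = region_len - motif_len + 1      # possible start positions on one strand
    strands = 2 if both_strands else 1        # optionally count the reverse strand too
    return strands * windows / 4 ** motif_len


if __name__ == "__main__":
    for k in (5, 9, 15):
        print(f"{k}-nt motif: {expected_chance_hits(k):.3g} expected chance hits per 2 kb")
```

      Under these assumptions a 5-nt motif is expected to occur by chance roughly four times per 2 kb region, whereas a 9-nt exact match is expected only about 0.015 times. A PWM that tolerates mismatches will match more often than this exact-match estimate, so the numbers are best read as a lower bound on chance hits.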

      Pabo CO, Peisach E, Grant RA. Design and selection of novel Cys2His2 zinc finger proteins. Annu Rev Biochem. 2001;70:313-340.

      Wolfe SA, Nekludova L, Pabo CO. DNA recognition by Cys2His2 zinc finger proteins. Annu Rev Biophys Biomol Struct. 2000;29:183-212.

      Anton V Persikov et al., De novo prediction of DNA-binding specificities for Cys2His2 zinc finger proteins. Nucleic Acids Res. 2014 Jan;42(1):97-108. doi: 10.1093/nar/gkt890. Epub 2013 Oct 3.

      Seiji Hira et al., Binding of Drosophila maternal Mamo protein to chromatin and specific DNA sequences. Biochem Biophys Res Commun. 2013 Aug 16;438(1):156-60. doi: 10.1016/j.bbrc.2013.07.045. Epub 2013 Jul 20.

      Shoichi Nakamura et al., A truncated form of a transcription factor Mamo activates vasa in Drosophila embryos. Commun Biol. 2019 Nov 20;2: 422. doi: 10.1038/s42003-019-0663-4. eCollection 2019.

      3) In my opinion, the last section of the Discussion needs to be completely removed ("Notably, the industrial melanism event, in a short period of several decades ... a more advanced self-regulation program"), as it is over-extending the data into evolutionary interpretations without any support. I would suggest instead writing a short paragraph asking whether the pigmentary role of mamo is a Lepidoptera novelty, or if it could have been lost in the fly lineage.

      Below, I tried to comment point-by-point on the main issues I had.

      Wu et al: Notably, the industrial melanism event, in a short period of several decades, resulted in significant changes in the body color of multiple Lepidoptera species(46). Industrial melanism events, such as changes in the body color of pepper moths, are heritable and caused by genomic mutations(47).

      Yes, but the selective episode was brief, and the relevant "carbonaria" mutations may have existed for a long time at low-frequency in the population.

      Response: Thank you very much for your careful work. Moth species often have melanic variants at low frequencies outside industrial regions. Recent molecular work on genetics has revealed that the melanic (carbonaria) allele of the peppered moth had a single origin in Britain. Further research indicated that the mutation event causing industrial melanism of peppered moth (Biston betularia) in the UK is the insertion of a transposon element into the first intron of the cortex gene. Interestingly, statistical inference based on the distribution of recombined carbonaria haplotypes indicates that this transposition event occurred in approximately 1819, a date highly consistent with a detectable frequency being achieved in the mid-1840s (Arjen E Van't Hof, et al., 2016). From molecular research, it is suggested that this single origin melanized mutant (carbonaria) was generated near the industrial development period, rather than the ancient genotype, in the UK. We have rewritten this part of the manuscript.

      Arjen E Van't Hof, et al., The industrial melanism mutation in British peppered moths is a transposable element. Nature. 2016 Jun 2;534(7605):102-5. doi: 10.1038/nature17951.

      Wu et al: If relying solely on random mutations in the genome, which have a time unit of millions of years, to explain the evolution of the phenotype is not enough.

      What you imply here is problematic for several reasons.

      First, as you point out later, some large-effect mutations (e.g. transpositions) can happen quickly.

      Second, it's unclear what "a time unit of millions of years" means here... mutations occur, segregate in populations, and are selected. The speed of this process depends on the context and genetic architectures.

      Third, I think I understand what you mean with "to explain the evolution of the phenotype is not enough", but this would probably need a reformulation and I don't think it's relevant to bring it here. After all, you used loss-of-function mutants to explain the evolution of artificially selected mutants. The evolutionary insights from these mutants are limited. Random mutations at the mamo locus are perfectly sufficient here to explain the bd and bdf phenotypes and larval traits.

      Response: Thank you very much for your careful work. Charles Darwin himself argued that “natural selection can act only by taking advantage of slight successive variations; she can never take a leap, but must advance by the shortest and slowest steps” (Darwin, C. R. 1859). This ‘micromutational’ view of adaptation proved extraordinarily influential. However, the accumulation of micromutations is a lengthy process that requires a very long time to produce a significant phenotypic change, and it may account for only a proportion of cases. Interestingly, recent molecular biology studies have shown that the evolution of some morphological traits involves a modest number of genetic changes (H Allen Orr. 2005).

      One example is the analysis of the genetic basis of armor-plate reduction and pelvic reduction in the three-spined stickleback (Gasterosteus aculeatus) in postglacial lakes. Although the marine form of this species has thick armor, the lake population (which was recently derived from the marine form) does not. The repeated independent evolution of lake morphology has resulted in reduced armor plate and pelvic structures, and there is no doubt that these morphological changes are adaptive. Research has shown that pelvic loss in different natural populations of three-spined stickleback occurs by regulatory mutations deleting a tissue-specific enhancer (Pel) of the pituitary homeobox transcription factor 1 (Pitx1) gene. The researchers genotyped 13 pelvic-reduced populations of three-spined stickleback from disparate geographic locations. Nine of the 13 pelvic-reduced stickleback populations had sequence deletions of varying lengths, all of which were located at the Pel enhancer. Relying solely on random mutations in the genome cannot lead to such similar mutation forms among different populations. The authors suggested that the Pitx1 locus of the stickleback genome may be prone to double-stranded DNA breaks that are subsequently repaired by NHEJ (Yingguang Frank Chan et al., 2010).

      The bd and bdf mutants used in the study are formed spontaneously. Natural mutation is one of the driving forces of evolution. Nevertheless, we have rewritten the content of this section.

      Darwin, C. R. The Origin of Species (J. Murray, London, 1859).

      H Allen Orr. The genetic theory of adaptation: a brief history. Nat Rev Genet. 2005 Feb;6(2):119-27. doi: 10.1038/nrg1523.

      Yingguang Frank Chan et al., Adaptive evolution of pelvic reduction in sticklebacks by recurrent deletion of a Pitx1 enhancer. Science. 2010 Jan 15;327(5963):302-5. doi: 10.1126/science.1182213. Epub 2009 Dec 10.

      Wu et al: Interestingly, the larva of peppered moths has multiple visual factors encoded by visual genes, which are conserved in multiple Lepidoptera, in the skin. Even when its compound eyes are covered, it can rely on the skin to feel the color of the environment to change its body color and adapt to the environment(48). Therefore, caterpillars/insects can distinguish the light wave frequency of the background. We suppose that perceptual signals can stimulate the GRN, the GRN guides the expression of some transcription factors and epigenetic factors, and the interaction of epigenetic factors and transcription factors can open or close the chromatin of corresponding downstream genes, which can guide downstream target gene expression.

      This is extremely confusing because you are bringing in a plastic trait here. It's possible there is a connection between the sensory stimulus and the regulation of mamo in peppered moths, but this is a mere hypothesis. Here, by mentioning a plastic trait, this paragraph sounds as if it was making a statement about directed evolution, especially after implying in the previous sentence that (paraphrasing) "random mutations are not enough". To be perfectly honest, the current writing could be misinterpreted and co-opted by defenders of the Intelligent Design doctrine. I believe and trust this is not your intention.

      Response: Thank you very much for your careful work. The plasticity of the body color of peppered moth larvae is very interesting, but we mainly wanted to emphasize that their skin expresses the products of visual genes, which can sense the color of the environment by perceiving light. Moreover, these genes are conserved in many insects. Human skin can also perceive light through opsins, which might initiate light-induced signaling pathways (Haltaufderhyde K et al., 2015). This indicates that the perception of environmental light by the skin of animals, and the induction of feedback through signaling pathways, is a common phenomenon. For clarity, we have rewritten this section of the manuscript.

      Haltaufderhyde K, Ozdeslik RN, Wicks NL, Najera JA, Oancea E. Opsin expression in human epidermal skin. Photochem Photobiol. 2015;91(1):117-123.

      Wu et al: In addition, during the opening of chromatin, the probability of mutation of exposed genomic DNA sequences will increase (49).

      Here again, this is veering towards a strongly Lamarckian view with the environment guiding specific mutation. I simply cannot see how this would apply to mamo, nothing in the current article indicates this could be the case here. Among many issues with this, it's unclear how chromatin opening in the larval integument may result in heritable mutations in the germline.

      Response: Thank you very much for your careful work. Previous studies have shown that there is a mutation bias in the genome; compared with the intergenic region, the mutation frequency is reduced by half inside gene bodies and by two-thirds in essential genes. In addition, they compared the mutation rates of genes with different functions. The mutation rate in the coding region of essential genes (such as translation) is the lowest, and the mutation rates in the coding region of specialized functional genes (such as environmental response) are the highest. These patterns are mainly affected by the traits of the epigenome (J Grey Monroe et al., 2022).

      In eukaryotes, chromatin is organized as repeating units of nucleosomes, each consisting of a histone octamer and the surrounding DNA. This structure can protect DNA. When a gene is activated, the chromatin region of this gene is locally opened, becoming an accessible region. Research has found that DNA accessibility can lead to a higher mutation rate in the region (Radhakrishnan Sabarinathan et al., 2016; Schuster-Böckler B et al., 2012; Lawrence MS et al., 2013; Polak P et al., 2015). In addition, mamo belongs to the BTB-ZF protein family, members of which can recruit chromatin-modifying factors such as DNA methyltransferase 1 (DNMT1), cullin3 (CUL3), histone deacetylase 1 (HDAC1), and histone acetyltransferase 1 (HAT1) to perform chromatin remodeling at specific genomic sites. Although mutation-prone regions can be predicted from the epigenetic characteristics of chromatin, the forms of mutations are diverse and random. Therefore, this does not violate randomness. For clarity, we have rewritten this section of the manuscript.

      J Grey Monroe et al. Mutation bias reflects natural selection in Arabidopsis thaliana. Nature. 2022 Feb;602(7895):101-105.

      Sabarinathan R, Mularoni L, Deu-Pons J, Gonzalez-Perez A, López-Bigas N. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature. 2016;532(7598):264-267.

      Schuster-Böckler B, Lehner B. Chromatin organization is a major influence on regional mutation rates in human cancer cells. Nature. 2012;488(7412):504-507.

      Lawrence MS, Stojanov P, Polak P, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499(7457):214-218.

      Polak P, Karlić R, Koren A, et al. Cell-of-origin chromatin organization shapes the mutational landscape of cancer. Nature. 2015;518(7539):360-364.

      Mathew R, Seiler MP, Scanlon ST, et al. BTB-ZF factors recruit the E3 ligase cullin 3 to regulate lymphoid effector programs. Nature. 2012;491(7425):618-621.

      Wu et al: Transposon insertion occurs in a timely manner upstream of the cortex gene in melanic pepper moths (47), which may be caused by the similar binding of transcription factors and opening of chromatin.

      No, we do not think that the peppered moth mutation is Lamarckian at all, as seems to be inferred here (notice that by mentioning the peppered moth twice, you are juxtaposing a larval plastic trait and then a purely genetic wing trait, making it even more confusing). Also, the "in a timely manner" is superfluous, because all the data are consistent with a chance mutation being eventually picked up by strong directional selection. The mutation and selection did NOT occur at the same time.

      Response: Thank you very much for your careful work. The insertion of a transposon into the first intron of the cortex gene, which underlies industrial melanism in the peppered moth, occurred in approximately 1819, close to the time of industrial development in the UK (Arjen E Van't Hof, et al., 2016). In multiple species of Heliconius, the cortex gene is the shared genetic basis for the regulation of wing coloring patterns. Interestingly, the cortex SNPs associated with the wing color pattern do not overlap among different Heliconius taxa, such as H. erato demophoon and H. erato favorinus, which suggests that the mutations of this cortex gene have different origins (Nadeau NJ et al., 2016). In addition, in Junonia coenia (van der Burg KRL et al., 2020) and Bombyx mori (Ito K et al., 2016), the cortex gene is a candidate for regulating changes in wing coloring patterns. Overall, the cortex gene is an evolutionary hotspot for the variation of multiple butterfly and moth wing coloring patterns. In addition, the variations in cortex are diverse in these species, including SNPs, indels, transposon insertions, inversions, etc. This indicates that although there are evolutionary hotspots in the insect genome, the variation itself is random. Therefore, this is not completely detached from randomness.

      Arjen E Van't Hof, et al., The industrial melanism mutation in British peppered moths is a transposable element. Nature. 2016 Jun 2;534(7605):102-5. doi: 10.1038/nature17951.

      Nadeau NJ, Pardo-Diaz C, Whibley A, et al. The gene cortex controls mimicry and crypsis in butterflies and moths. Nature. 2016;534(7605):106-110.

      van der Burg KRL, Lewis JJ, Brack BJ, Fandino RA, Mazo-Vargas A, Reed RD. Genomic architecture of a genetically assimilated seasonal color pattern. Science. 2020;370(6517):721-725.

      Ito K, Katsuma S, Kuwazaki S, et al. Mapping and recombination analysis of two moth colour mutations, Black moth and Wild wing spot, in the silkworm Bombyx mori. Heredity (Edinb). 2016;116(1):52-59.

      Wu et al: Therefore, we proposed that the genetic basis of color pattern evolution may mainly be system-guided programmed events that induce mutations in specific genomic regions of key genes rather than just random mutations of the genome.

      While the mutational target of pigment evolution may involve a handful of developmental regulator genes, you do not have the data to infer such a strong conclusion at the moment.

      The current formulation is also quite strong and teleological: "system-guided programmed events" imply intentionality or agency, an idea generally assigned to the anti-scientific Intelligent Design movement. There are a few examples of guided mutations, such as the adaptation phase of gRNA motifs in bacterial CRISPR assays, where I could see the term "system-guided programmed events" to be applicable. But it is irrelevant here.

      Response: Thank you very much for your careful work. The CRISPR-Cas9 system is indeed very well known. In addition, recent studies have found the existence of a Cas9-like gene editing system in eukaryotes, namely Fanzor. Fanzor (Fz) was reported in 2013 as a eukaryotic TnpB-IS200/IS605 protein of transposon origin, and it was initially thought that the Fz protein (and prokaryotic TnpBs) might regulate transposon activity through methyltransferase activity (Saito M et al., 2023). Fz has recently been found to be a eukaryotic programmable RNA-guided endonuclease, functionally analogous to CRISPR-Cas systems. Although this system has been found in fungi and mollusks, it raises hopes of finding similar systems in other higher animals. However, before these gene-editing systems became popular, zinc finger nucleases (ZFNs) were already being studied as a gene-editing system in many species. The mechanism by which a ZFN recognizes DNA depends on its zinc finger motif (Urnov FD et al., 2005). This is consistent with the mechanism by which transcription factors recognize DNA-binding sites.

      Furthermore, a very important evolutionary event in sexual reproduction is chromosome recombination during meiosis, which helps to produce more abundant allele combinations. Current research has found that this recombination event is not random. In mice and humans, the PRDM9 transcription factor is able to specify the sites of double-stranded breaks (DSBs) in meiotic recombination. PRDM9 is a histone methyltransferase consisting of three main regions: an amino-terminal region resembling the family of synovial sarcoma X (SSX) breakpoint proteins, which contains a Krüppel-associated box (KRAB) domain and an SSX repression domain (SSXRD); a PR/SET domain (a subclass of SET domains), surrounded by a pre-SET zinc knuckle and a post-SET zinc finger; and a long carboxy-terminal C2H2 zinc finger array. In most mammalian species, during early meiotic prophase, PRDM9 can determine recombination hotspots by H3K4 and H3K36 trimethylation (H3K4me3 and H3K36me3) of nucleosomes near its DNA-binding site. Subsequently, meiotic DNA DSBs are formed at hotspots through the combined action of SPO11 and TOPOVIBL. In addition, some proteins (such as RAD51) are involved in repairing the break point. In summary, programmed events of induced and repaired DSBs are widely present in organisms (Bhattacharyya T et al., 2019).

      These studies indicate that on the basis of randomness, the genome also exhibits programmability.

      Saito M, Xu P, Faure G, et al. Fanzor is a eukaryotic programmable RNA-guided endonuclease. Nature. 2023;620(7974):660-668.

      Urnov FD, Miller JC, Lee YL, et al. Highly efficient endogenous human gene correction using designed zinc-finger nucleases. Nature. 2005;435(7042):646-651.

      Bhattacharyya T, Walker M, Powers NR, et al. Prdm9 and Meiotic Cohesin Proteins Cooperatively Promote DNA Double-Strand Break Formation in Mammalian Spermatocytes [published correction appears in Curr Biol. 2021 Mar 22;31(6):1351]. Curr Biol. 2019;29(6):1002-1018.e7.

      Wu et al: Based on this assumption, animals can undergo phenotypic changes more quickly and more accurately to cope with environmental changes. Thus, seemingly complex phenotypes such as cryptic coloring and mimicry that are highly similar to the background may have formed in a short period. However, the binding sites of some transcription factors widely distributed in the genome may be reserved regulatory interfaces to cope with potential environmental changes. In summary, the regulation of genes is smarter than imagined, and they resemble a more advanced self-regulation program.

      Here again, I can agree with the idea that certain genetic architectures can evolve quickly, but I cannot support the concept that the genetic changes are guided or accelerated by the environment. And again, none of this is relevant to the current findings about Bm-mamo.

      Response: Thank you very much for your careful work. Darwin's theory of natural selection has epoch-making significance. I deeply believe in the theory that species strive to evolve through natural selection. However, with the development of molecular genetics, Darwinism’s theory of undirected random mutations and slow accumulation of micromutations resulting in phenotype evolution has been increasingly challenged.

      The prerequisite for undirected random mutations and micromutations is excessive reproduction to generate a sufficiently large population. A sufficiently large population can contain sufficient genotypes to face various survival challenges. However, it is difficult to explain how some small groups and species with relatively low fertility rates have survived thus far. More importantly, the theory cannot explain the currently observed genomic mutation bias. In scientific research, every theory is constantly being modified to adapt to current discoveries. The most famous example is the debate over whether light is a particle or a wave, which has lasted for hundreds of years. However, in the 20th century, both sides seemed to compromise with each other, believing that light has a wave‒particle duality.

      Epigenetics has developed rapidly since 1987. Epigenetics has been widely accepted, defined as stable inheritance caused by chromosomal conformational changes without altering the DNA sequence, which differs from genetic research on variations in gene sequences. However, an increasing number of studies have found that histone modifications can affect gene sequence variation. In addition, both histones and epigenetic factors are essentially encoded by genes in the genome. Therefore, genetics and epigenetics should be interactive rather than parallel. However, some transcription factors play an important role in epigenetic modifications. Meiotic recombination is a key process that ensures the correct separation of homologous chromosomes through DNA double-stranded break repair mechanisms. The transcription factor PRDM9 can determine recombination hotspots by H3K4 and H3K36 trimethylation (H3K4me3 and H3K36me3) of nucleosomes near its DNA-binding site (Bhattacharyya T et al., 2019). Interestingly, mamo has been identified as an important candidate factor for meiosis hotspot setting in Drosophila (Winbush A et al., 2021).

      Bhattacharyya T, Walker M, Powers NR, et al. Prdm9 and Meiotic Cohesin Proteins Cooperatively Promote DNA Double-Strand Break Formation in Mammalian Spermatocytes [published correction appears in Curr Biol. 2021 Mar 22;31(6):1351]. Curr Biol. 2019;29(6):1002-1018.e7.

      Winbush A, Singh ND. Genomics of Recombination Rate Variation in Temperature-Evolved Drosophila melanogaster Populations. Genome Biol Evol. 2021;13(1): evaa252.

      Reviewer #2 (Recommendations For The Authors):

      Major comments

      Response: Thank you very much for your careful work. First, we believe that competitive research is sometimes coincidental and sometimes intentional. Our research began in 2009, when we began to construct the recombinant population. In 2016, we published an article on comparative transcriptomics (Wu et al. 2016). We believe that the authors of the article mentioned above took a strong interest in our research and built their further work on our transcriptome analysis, with the aim of publishing preemptively.

      To discourage such behavior, we cannot cite it and do not want to discuss it in our paper.

      Songyuan Wu et al. Comparative analysis of the integument transcriptomes of the black dilute mutant and the wild-type silkworm Bombyx mori. Sci Rep. 2016 May 19:6:26114. doi: 10.1038/srep26114.

      • line 52-54. The numerous biological functions of insect coloration have been thoroughly investigated. It is reasonable to expect more references for each function.

      Response: Thank you very much for your careful work. We have made the appropriate modifications.

      Sword GA, Simpson SJ, El Hadi OT, Wilps H. Density-dependent aposematism in the desert locust. Proc Biol Sci. 2000;267(1438):63-68. … Behavior.

      Barnes AI, Siva-Jothy MT. Density-dependent prophylaxis in the mealworm beetle Tenebrio molitor L. (Coleoptera: Tenebrionidae): cuticular melanization is an indicator of investment in immunity. Proc Biol Sci. 2000;267(1439):177-182. … Immunity.

      N. F. Hadley, A. Savill, T. D. Schultz, Coloration and Its Thermal Consequences in the New-Zealand Tiger Beetle Neocicindela-Perhispida. J Therm Biol. 1992; 17, 55-61. … Thermoregulation.

      Y. G. Hu, Y. H. Shen, Z. Zhang, G. Q. Shi, Melanin and urate act to prevent ultraviolet damage in the integument of the silkworm, Bombyx mori. Arch Insect Biochem. 2013; 83, 41-55. … UV protection.

      M. Stevens, G. D. Ruxton, Linking the evolution and form of warning coloration in nature. P Roy Soc B-Biol Sci. 2012; 279, 417-426. … Aposematism.

      K. K. Dasmahapatra et al., Butterfly genome reveals promiscuous exchange of mimicry adaptations among species. Nature. 2012; 487, 94-98. … Mimicry.

      Gaitonde N, Joshi J, Kunte K. Evolution of ontogenic change in color defenses of swallowtail butterflies. Ecol Evol. 2018;8(19):9751-9763. Published 2018 Sep 3. … Crypsis.

      B. S. Tullberg, S. Merilaita, C. Wiklund, Aposematism and crypsis combined as a result of distance dependence: functional versatility of the colour pattern in the swallowtail butterfly larva. P Roy Soc B-Biol Sci. 2005; 272, 1315-1321. … Aposematism and crypsis combined.

      • line 59-60. This general statement needs to be rephrased. I suggest remaining simple by indicating that insect coloration can be pigmentary, structural, or bioluminescent. About the structural coloration and associated nanostructures, the authors could cite recent reviews, such as: Seago et al., Interface 2009 + Lloyd and Nadeau, Current Opinion in Genetics & Development 2021 + "Light as matter: natural structural colour in art" by Finet C. 2023. I suggest doing the same for recent reviews that cover pigmentary and bioluminescent coloration in insects. The very recent paper by Nishida et al. in Cell Reports 2023 on butterfly wing color made of pigmented liquid is also unique and worth considering.

      Response: Thank you very much for your careful work. We have made the appropriate modifications.

      Insect coloration can be pigmentary, structural, or bioluminescent. Pigments are mainly synthesized by the insects themselves and form solid particles that are deposited in the cuticle of the body surface and the scales of the wings (10, 11). Interestingly, recent studies have found that biosynthesized bile pigments and carotenoid pigments are incorporated into body fluids and carried through the wing membranes of two butterflies (Siproeta stelenes and Philaethria diatonica) via hemolymph circulation, providing color in the form of liquid pigments (12). The pigments form colors by selective absorption and/or scattering of light depending on their physical properties (13). In contrast, structural color refers to colors, such as metallic colors and iridescence, generated by optical interference and grating diffraction of the microstructure/nanostructure of the body surface or appendages (such as scales) (14, 15). Pigment color and structural color are widely distributed in insects and can only be observed by the naked eye in illuminated environments. However, some insects, such as fireflies, exhibit colors (green to orange) in the dark due to bioluminescence (16). Bioluminescence occurs when luciferase catalyzes the oxidation of the small molecule luciferin (17). In conclusion, the color patterns of insects have evolved to be highly sophisticated and are closely related to their living environments. For example, cryptic color can deceive animals via high similarity to the surrounding environment. However, the molecular mechanism by which insects form precise color patterns to match their living environment is still unknown.

      • RNAi approach. I have no doubt that obtaining phenocopies by electroporation might be difficult. However, I find the final sampling a bit limited to draw conclusions from the RT-PCR (n=5 and n=3 for phenocopies and controls). Three control individuals is a very low number. Moreover, it would be nice to see the variability on the plot, using for example violin plots.

      Response: Thank you very much for your careful work. In the RNAi experiment, we injected more than 20 individuals in the experimental group and control group. We have added the RNAi data in Figure 4.

      Author response table 1.

      • Figure 6. Higher magnification images of Dazao and Bm-mamo knockout are needed, as shown in Figure 5 on RNAi.

      Response: Thank you very much for your careful work. We have added enlarged images.

      Author response image 3.

      • Phylogenetic analysis/Figure S6. I am not sure to what extent the sampling is biased or not, but if not, it is noteworthy that mamo does not show duplicated copies (negative selection?). It might be interesting to discuss this point in the manuscript.

      Response: Thank you very much for your careful work. mamo belongs to the BTB/POZ zinc finger family. The members of this family exhibit significant expansion in vertebrates. For example, there are 3 members in C. elegans, 13 in D. melanogaster, 16 in Bombyx mori, 58 in M. musculus and 63 in H. sapiens (Wu et al, 2019). These members contain conserved BTB/POZ domains but vary in number and amino acid residue compositions of the zinc finger motifs. Due to the zinc finger motifs that bind to different DNA recognition sequences, there may be differences in their downstream target genes. Therefore, when searching for orthologous genes from different species, we required high conservation of their zinc finger motif sequences. Due to these strict conditions, only one orthologous gene was found in these species.

      • Differentially-expressed genes and CP candidate genes (line 189-191). The manuscript would gain in clarity if the authors explained their procedure in more detail. For instance, they moved from a list of 191 genes to CP genes only. Can they say a little bit more about the non-CP genes that are differentially expressed? Maybe quantify the number of CPs among the total number of differentially-expressed genes to show that CPs are the main class?

      Response: Thank you very much for your careful work. The nr (Nonredundant Protein Sequence Database) annotations for the 191 differentially expressed genes were added in Supplemental Table S3. Among them, there were 19 cuticular protein genes, 17 antibacterial peptide genes, 6 transporter genes, 5 transcription factor genes, 5 cytochrome genes, 53 enzyme-encoding genes and others. Because CP genes were significantly enriched among the differentially expressed genes (DEGs), and because previous studies have found that BmorCPH24 can affect pigmentation, we first investigated the CP genes.
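
      The category counts listed in this response can be tallied to see what share of the 191 DEGs each annotated class represents. The following is a minimal Python sketch: the dictionary labels are illustrative, and the size of the "other" group is simply inferred as the remainder, not a number reported in the response.

```python
# Annotated DEG categories as listed in the response (Supplemental Table S3).
TOTAL_DEGS = 191
deg_counts = {
    "cuticular protein": 19,
    "antibacterial peptide": 17,
    "transporter": 6,
    "transcription factor": 5,
    "cytochrome": 5,
    "enzyme-encoding": 53,
}

# Assumption: every remaining gene falls into the unspecified "others" group.
deg_counts["other (inferred)"] = TOTAL_DEGS - sum(deg_counts.values())


def category_shares(counts, total):
    """Return each category's share of the DEG list as a percentage."""
    return {name: 100.0 * n / total for name, n in counts.items()}


shares = category_shares(deg_counts, TOTAL_DEGS)

if __name__ == "__main__":
    # Print categories from largest to smallest share.
    for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:24s} {deg_counts[name]:4d} {pct:5.1f}%")
```

      A formal enrichment test (for example, a hypergeometric test) would additionally require the number of CP genes in the whole annotated gene set, which is not given here.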

      • Interaction between Bm-mamo. It is not clear why the authors chose to investigate the physical interaction of Bm-mamo protein with the putative binding site of yellow, and not with the sites upstream of tan and DDC. Do the authors test one interaction and assume the conclusion stands for the y, tan and DDC?

      Response: Thank you very much for your careful work. In D. melanogaster, the yellow gene is the most studied pigment gene. The upstream and intron sequences of the yellow gene have been identified as containing multiple cis-regulatory elements. Due to the important pigmentation role of the yellow gene and its variable cis-regulatory sequence among different species, it has been considered a research model for cis-regulatory elements (Laurent Arnoult et al. 2013, Gizem Kalay et al. 2019, Yaqun Xin et al. 2020, Yann Le Poul et al. 2020). We use yellow as an example to illustrate the regulation of the mamo gene. We added this description to the discussion.

      Laurent Arnoult et al. Emergence and diversification of fly pigmentation through evolution of a gene regulatory module. Science. 2013 Mar 22;339(6126):1423-6. doi: 10.1126/science.1233749.

      Gizem Kalay et al. Redundant and Cryptic Enhancer Activities of the Drosophila yellow Gene. Genetics. 2019 May;212(1):343-360. doi: 10.1534/genetics.119.301985. Epub 2019 Mar 6.

      Yaqun Xin et al. Enhancer evolutionary co-option through shared chromatin accessibility input. Proc Natl Acad Sci U S A. 2020 Aug 25;117(34):20636-20644. doi: 10.1073/pnas.2004003117. Epub 2020 Aug 10.

      Yann Le Poul et al. Regulatory encoding of quantitative variation in spatial activity of a Drosophila enhancer. Sci Adv. 2020 Dec 2;6(49):eabe2955. doi: 10.1126/sciadv.abe2955. Print 2020 Dec.

      • Please note that some controls are missing for the EMSA experiments. For instance, the putative binding-sites should be mutated and it should be shown that the interaction is lost.

      Response: Thank you very much for your careful work. In this study, we found that the DNA recognition sequence of mamo is highly conserved across multiple species. In D. melanogaster, studies have found that mamo can directly bind to the intron of the vasa gene to activate its expression. The DNA recognition sequence used in that work is TGCGT (Shoichi Nakamura et al. 2019). We chose a longer sequence, GTGCGTGGC, which contains this core, to detect the binding of mamo. This binding mechanism is consistent across species.
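
      To illustrate how candidate sites of this kind can be located, the sketch below scans both strands of a DNA sequence for the core motif TGCGT and for the longer sequence GTGCGTGGC. The function names and the toy sequence are hypothetical, invented for illustration; they are not taken from the yellow upstream region or from the authors' analysis.

```python
CORE_MOTIF = "TGCGT"          # core recognition sequence reported for Drosophila mamo
EXTENDED_MOTIF = "GTGCGTGGC"  # longer sequence chosen by the authors

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_motif(seq, motif):
    """Return (0-based start, strand) for every occurrence on either strand.

    Reverse-strand hits are reported by their start on the forward strand.
    """
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = seq.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = seq.find(pattern, start + 1)  # allow overlapping hits
    return sorted(hits)


# Hypothetical toy sequence: one forward extended site, one reverse core site.
toy_sequence = "AAGTGCGTGGCTTTTACGCATT"

if __name__ == "__main__":
    print(find_motif(toy_sequence, CORE_MOTIF))
    print(find_motif(toy_sequence, EXTENDED_MOTIF))
```

      Because the extended sequence contains the core, every extended hit is also a core hit; the mutated-site control requested by the reviewer would correspond to a probe on which such a scan finds no match.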

      • Figure 7 and supplementary data. How were the names of the CPs attributed? According to automatic genome annotation of Bm genes and proteins? Based on the Drosophila genome and associated gene names? Did the authors perform phylogenetic analyses to name the different CP genes?

      Response: Thank you very much for your careful work. The naming of CPs is based on their conserved motif and their arrangement order on the chromosome. In previous reports, sequence identification and phylogenetic analysis of CPs have been carried out in silkworms (Zhengwen Yan et al. 2022, Ryo Futahashi et al. 2008). The members of the same family have sequence similarity between different species, and their functions may be similar. We have completed the names of these genes in the text, for example, changing CPR2 to BmorCPR2.

      Zhengwen Yan et al. A Blueprint of Microstructures and Stage-Specific Transcriptome Dynamics of Cuticle Formation in Bombyx mori. Int J Mol Sci. 2022 May 5;23(9):5155.

      Ningjia He et al. Proteomic analysis of cast cuticles from Anopheles gambiae by tandem mass spectrometry. Insect Biochem Mol Biol. 2007 Feb;37(2):135-46.

      Maria V Karouzou et al. Drosophila cuticular proteins with the R&R Consensus: annotation and classification with a new tool for discriminating RR-1 and RR-2 sequences. Insect Biochem Mol Biol. 2007 Aug;37(8):754-60.

      Ryo Futahashi et al. Genome-wide identification of cuticular protein genes in the silkworm, Bombyx mori. Insect Biochem Mol Biol. 2008 Dec;38(12):1138-46.

      • Discussion. I think the discussion would gain in being shorter and refocused on the understudied role of CPs. Another non-canonical aspect of the discussion is the reference to additional experiments (e.g., parthenogenesis line 290-302, figure S14). This is not the place to introduce more results, and it breaks the flow of the discussion. I encourage the authors to reshuffle the discussion: 1) summary of their findings on mamo and CPs, 2) link between pigmentation mutant phenotypes, pigmentation pattern and CPs, 3) general discussion about the (evo-)devo importance of CPs and link between pigment deposition and coloration. Three important papers should be mentioned here:

      1) Matsuoka Y and A Monteiro (2018) Melanin pathway genes regulate color and morphology of butterfly wing scales. Cell Reports 24: 56-65... Yellow has a pleiotropic role in cuticle deposition and pigmentation.

      2) https://arxiv.org/abs/2305.16628... Link between nanoscale cuticle density and pigmentation

      3) https://www.cell.com/cell-reports/pdf/S2211-1247(23)00831-8.pdf... Variation in pigmentation and implication of endosomal maturation (gene red).

      Response: Thank you very much for your careful work. We have rewritten the discussion section.

      1) We have summarized our findings.

      Bm-mamo may affect the synthesis of melanin in epidermis cells by regulating yellow, DDC, and tan; regulate the maturation of melanin granules in epidermis cells through BmMFS; and affect the deposition of melanin granules in the cuticle by regulating CP genes, thereby comprehensively regulating the color pattern in caterpillars.

      2) We describe the relationship among the pigmentation mutation phenotype, pigmentation pattern, and CP.

      Previous studies have shown that the lack of expression of BmorCPH24, which encodes important components of the endocuticle, can lead to dramatic changes in body shape and a significant reduction in the pigmentation of caterpillars (53). We crossed Bo (BmorCPH24 null mutation) and bd to obtain F1(Bo/+Bo, bd/+), then self-crossed F1 and observed the phenotype of F2. The lunar spots and star spots decreased, and light-colored stripes appeared on the body segments, but the other areas still had significant melanin pigmentation in double mutation (Bo, bd) individuals (Fig. S13). However, in previous studies, introduction of Bo into L (ectopic expression of wnt1 results in lunar stripes generated on each body segment) (24) and U (overexpression of SoxD results in excessive melanin pigmentation of the epidermis) (58) strains by genetic crosses can remarkably reduce the pigmentation of L and U (53). Interestingly, there was a more significant decrease in pigmentation in the double mutants (Bo, L) and (Bo, U) than in (Bo, bd). This suggests that Bm-mamo has a stronger ability than wnt1 and SoxD to regulate pigmentation. On the one hand, mamo may be a stronger regulator of the melanin metabolic pathway, and on the other hand, mamo may regulate other CP genes to reduce the impact of BmorCPH24 deficiency.

      3) We discussed the importance of (evo-) devo in CPs and the relationship between pigment deposition and coloring.

      CP genes usually account for over 1% of the total genes in an insect genome and can be categorized into several families, including CPR, CPG, CPH, CPAP1, CPAP3, CPT, CPF and CPFL (68). The CPR family is the largest group of CPs, containing a chitin-binding domain called the Rebers and Riddiford motif (R&R) (69). The variation in the R&R consensus sequence allows subdivision into three subfamilies (RR-1, RR-2, and RR-3) (70). Among the 28 CPs, 11 RR-1 genes, 6 RR-2 genes, 4 hypothetical cuticular protein (CPH) genes, 3 glycine-rich cuticular protein (CPG) genes, 3 cuticular protein Tweedle motif (CPT) genes, and 1 CPFL (like the CPFs in a conserved C-terminal region) gene were identified. The RR-1 consensus among species is usually more variable than RR-2, which suggests that RR-1 may have a species-specific function. RR-2 genes often cluster into several branches, which may be due to gene duplication events in co-orthologous groups and may result in conserved functions between species (71). The classification of CPH is due to their lack of known motifs. In the epidermis of Lepidoptera, the CPH genes often have high expression levels. For example, BmorCPH24 had the highest expression level in the silkworm larval epidermis (72). The CPG protein is rich in glycine. The CPH and CPG genes are less commonly found in insects outside the order Lepidoptera (73). This suggests that they may provide species-specific functions for the Lepidoptera. CPT contains a Tweedle motif, and the TweedleD1 mutation has a dramatic effect on body shape in D. melanogaster (74). The CPFL members are relatively conserved in species and may be involved in the synthesis of larval cuticles (75). CPT and CPFL may have relatively conserved functions among insects. The CP genes are a group of rapidly evolving genes, and their copy numbers may undergo significant changes in different species. In addition, RNAi experiments on 135 CP genes in brown planthopper (Nilaparvata lugens) showed that deficiency of 32 CP genes leads to significant defective phenotypes, such as lethality and developmental retardation. It is suggested that the 32 CP genes are indispensable, and other CP genes may have redundant and complementary functions (76). In previous studies, it was found that the construction of the larval cuticle of silkworms requires the precise expression of over two hundred CP genes (22). The production, interaction, and deposition of CPs and pigments are complex and precise processes, and our research shows that Bm-mamo plays an important regulatory role in this process in silkworm caterpillars. For further understanding of the role of CPs, future work should aim to identify the function of important cuticular protein genes and the deposition mechanism in the cuticle.

      Minor comments - Title. At this stage, there is no evidence that Bm-mamo regulates caterpillar pigmentation outside of Bombyx mori. I suggest specifying 'silkworm caterpillars' in the title.

      Response: Thank you very much for your careful work. We have modified the title.

      • Abstract, line 29. Because the knowledge on pigmentation pathway(s) is advanced, I would suggest writing 'color pattern is not fully understood' instead of 'color pattern is not clear'.

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 29. I suggest 'the transcription factor' rather than 'a transcription factor'.

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 30. If you want to mention the protein, the name 'Bm-mamo' should not be italicized.

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 30. 'in the silkworm'.

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 31. 'mamo' should not be italicized.

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 31. 'in Drosophila' rather 'of Drosophila'.

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 32. Please specify whether the gamete function is conserved in insects. In all animals?

      Response: Thank you very much for your careful work. The sentence was changed to “This gene has a conserved function in gamete production in Drosophila and silkworms and evolved a pleiotropic function in the regulation of color patterns in caterpillars.”

      • Introduction, line 51. I am not sure what the authors mean by 'under natural light'. Please rephrase.

      Response: Thank you very much for your careful work. We have deleted “under natural light”.

      • line 43. I find that the sentence 'In some studies, it has been proven that epidermal proteins can affect the body shape and appendage development of insects' is not necessary here. Furthermore, this sentence breaks the flow of the teaser.

      Response: Thank you very much for your careful work. We have deleted this sentence.

      • line 51-52. 'Greatly benefit them' should be rephrased in a more neutral way. For example, 'colour patterns have been shown to be involved in...'.

      Response: Thank you very much for your careful work. We have modified to “and the color patterns have been shown to be involved in…”

      • line 62. CPs are secreted by the epidermis, but I would say that CPs play their structural role in the cuticle, not directly in the epidermis. I suggest rephrasing this sentence and adding references.

      Response: Thank you very much for your careful work. We have modified “epidermis” to “cuticle”.

      • line 67. Please indicate that pathways have been identified/reported in Lepidoptera (11). Otherwise, the reader does not understand whether you refer to previous biochemical studies in Drosophila, for example.

      Response: Thank you very much for your careful work. We have modified this sentence. “Moreover, the biochemical metabolic pathways of pigments used for color patterning in Lepidoptera…have been reported.”

      • line 69. Missing examples of pleiotropic factors and associated references. For example, I suggest adding: engrailed (Dufour, Koshikawa and Finet, PNAS 2020) + antennapedia (Prakash et al., Cell Reports 2022) + optix (Reed et al., Science 2011), etc. Need to add references for clawless, abdominal-A.

      Response: Thank you very much for your careful work. We have made modifications.

      • line 76. The simpler term moth might be enough (instead of Lepidoptera).

      Response: Thank you very much for your careful work. We have modified this to “insect”.

      • line 96. I would simplify the text by writing "Then, quantitative RT-PCR was performed..."

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 112. 'Predict' instead of 'estimate'?

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 113. I would rather indicate the full name first, then indicate mamo between brackets.

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 144. The Perl script needs to be made accessible on public repository.

      Response: Thank you very much for your careful work.

      • line 147-150. Too many technical details here. The details are already indicated in the material and methods section. Furthermore, the details break the flow of the paragraph.

      Response: Thank you very much for your careful work. We have modified this section.

      • line 152. Needs to make the link with the observed phenotypes in Figure 1. Just needs to state that RNAi phenocopies mimic the mutant alleles.

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 153-157. Too many technical details here. The details are already indicated in the material and methods section. Furthermore, the details break the flow of the paragraph.

      Response: Thank you very much for your careful work. We have simplified this paragraph.

      • line 170. Please rephrase 'conserved in 30 species' because it might be understood as conserved in 30 species only, and not in other species.

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 182. Maybe explain the rationale behind restricting the analysis to +/- 2kb. Can you cite a paper that shows that most of binding sites are within 2kb from the start codon?

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 182. '14,623 predicted genes'.

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 183. '10,622 genes'

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 183. Redundancy. Please remove 'silkworm' or 'B. mori'.

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 187. '10,072 genes'

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 188. '9,853 genes'

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 200. "Therefore, the differential...in caterpillars" is a strong statement.

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 204. Remove "The" in front of eight key genes. Also, needs a reference... maybe a recent review on the biochemical pathway of melanin in insects.

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 220. This sentence is too general and vague. Please make explicit what you mean by "in terms of evolution". Number of insect species? Diversity of niche occupancy? Morphological, physiological diversity?

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 285. The verb "believe" should be replaced by a more neutral one.

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 354-355. This sentence needs to be rephrased in a more objective way.

      Response: Thank you very much for your careful work. We have rewritten this sentence.

      • line 378. Missing reference for MUSCLE.

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 379. Pearson model?

      Response: Thank you very much for your careful work. We have modified this sentence.

      • line 408. "The CRISPRdirect online software was used...".

      Response: Thank you very much for your careful work. We have modified this sentence.

      • Figure 1. In the title, I suggest indicating Dazao, bd, bdf as it appears in the figure. Please specify 'silkworm larval development'.

      Response: Thank you very much for your careful work. We have modified this figure title.

      • Figure 3. In the title, is the word 'pattern' really necessary? In the legend, please indicate the meaning of the acronyms AMSG and PSG.

      Response: Thank you very much for your careful work. We have modified this figure legend.

      • Figure S7A. Typo 'Znic finger 1', 'Znic finger 2', 'Znic finger 3',

      Response: Thank you very much for your careful work. We have fixed these typos.

    1. Author Response:

      Reviewer #1 (Public Review):

      Summary:

      The authors identified that genetic and pharmacological inhibition of CERS1, an enzyme implicated in ceramide biosynthesis, worsens muscle fibrosis and inflammation during aging.

      Strengths:

      The study points out an interesting issue on excluding CERS1 inhibition as a therapeutic strategy for sarcopenia. Overall, the article is well written and clear.

      Weaknesses:

      Many of the experiments confirmed previously published data, which also show a decline of CERS1 in ageing and the generation and characterization of a muscle-specific knockout mouse line. The mechanistic insights into how the increased amount of long ceramides (cer c24) and the decrease in shorter ones (cer c18) might influence muscle mass, force production, fibrosis and inflammation in aged mice have not been addressed.

      We thank the reviewer for the assessment and would like to point out that Cers1 had not previously been studied in the context of aging. Moreover, our unbiased pathway analyses in human skeletal muscle implicate CERS1 for the first time in myogenic differentiation, which we validate in cell culture systems. To improve mechanistic insights, as suggested by Reviewer #1, we performed more experiments to gain insight into how Cers1-derived c18 and Cers2-derived c24 ceramide species affect myogenesis. We recently showed that knocking out Cers2 reduces c24:0/c24:1 and promotes muscle cell maturation (PMID: 37118545, Fig. 6m-r and Supplementary Fig. 5e). This suggests that the very long chain ceramides c24 might indeed be driving the effect we see upon Cers1 inhibition because we observe an accumulation of c24 ceramides upon Cers1 (c18) inhibition (Fig 2B, Fig 3B, Fig 4A, Fig S3E), which is associated with impaired muscle maturation (Fig 4B-C, Fig S3G-I, Fig S4G-I). To study whether impaired muscle cell differentiation upon Cers1 inhibition is dependent on Cers2, we knocked down Cers1 alone, or in combination with the knockdown of Cers2. Results show that reduced muscle cell maturation mediated by Cers1KD is rescued by the simultaneous knockdown of Cers2 as shown by gene expression analyses and immunohistochemical validation and quantification. Hence, we believe that reducing Cers1 function during aging might lead to an increase in sphingosine levels as has been shown previously (PMID: 31692231). Increased sphingosine triggers cell apoptosis due to its toxicity (PMID: 12531554). Therefore, channeling accumulating sphingosine towards C24 ceramides may avoid toxicity but, as we show in this manuscript, will reduce the myogenic potential in muscle. However, if C24 production is also blocked by Cers2 inhibition, sphingosine is forced towards the production of other, potentially less toxic or myogenesis-impairing ceramides. We added these new data to the revised manuscript as new Fig 5D-E and new Fig S5G-I.

      Reviewer #2 (Public Review):

      Summary:

      The manuscript by Wohlwend et al. investigates the implications of inhibiting ceramide synthase Cers1 on skeletal muscle function during aging. The authors propose a role for Cers1 in muscle myogenesis and aging sarcopenia. Both pharmacological and AAV-driven genetic inhibition of Cers1 in 18-month-old mice lead to reduced C18 ceramides in skeletal muscle, exacerbating age-dependent features such as muscle atrophy, fibrosis, and center-nucleated fibers. Similarly, inhibition of the Cers1 orthologue in C. elegans reduces motility and causes alterations in muscle morphology.

      Strengths:

      The study is well-designed, carefully executed, and provides highly informative and novel findings that are relevant to the field.

      Weaknesses:

      The following points should be addressed to support the conclusions of the manuscript.

      (1) It would be essential to investigate whether P053 treatment of young mice induces age-dependent features besides muscle loss, such as muscle fibrosis or regeneration. This would help determine whether the exacerbation of age-dependent features solely depends on Cers1 inhibition or is associated with other factors related to age-dependent decline in cell function. Additionally, considering the reported role of Cers1 in whole-body adiposity, it is necessary to present data on mice body weight and fat mass in P053-treated aged mice.

      We thank the reviewer for suggesting that we study Cers1 inhibition in young mice. In fact, a previous study shows that muscle-specific Cers1 knockout in young mice impairs muscle function (PMID: 31692231). Similar to our observation, these authors report reduced muscle fiber size and muscle force. Therefore, we do not believe that our observed effects of Cers1 inhibition in aged mice are specific to aging, although the phenotypic consequences are accentuated in aged mice. As requested by the reviewer, we attached the mouse body weights and fat mass (Author response image 1A-B). The reduced fat mass upon P053 treatment is in line with previously reported reductions in fat mass in chow diet or high fat diet fed young mice upon Cers1 inhibition (PMID: 30605666, PMID: 30131496), again suggesting that the effect of Cers1 inhibition might not be specific to aging.

      Author response image 1.

      (A-B) Body mass (A) and fat mass as % of body mass (B) were measured in 22mo C57BL/6J mice intraperitoneally injected with DMSO or P053 using EchoMRI (n=7-12 per group). (C-D) Grip strength measurements in all limbs (C) or only the forelimbs (D) in 24mo C57BL/6J mice intramuscularly injected with AAV9 particles containing scramble, or shRNA targeting Cers1 (n=8 per group). (E-F) Pax7 gene expression in P053 or AAV9 treated mice (n=6-7 per group) (E), or in mouse C2C12 muscle progenitor cells treated with 25nM scramble or Cers1 targeting shRNA (n=8 per group) (F). (G) Proliferation as measured by luciferase intensity in mouse C2C12 muscle cells treated with 25nM scramble or Cers1 targeting shRNA (n=24 per group). Each column represents one biological replicate. (H) Overlaid FACS traces of Annexin-V (BB515, left) and Propidium Iodide (Cy5, right) of mouse C2C12 muscle myotubes treated with 25nM scramble or Cers1 targeting shRNA (n=3 per group). Quantification right: early apoptosis (Annexin+-PI-), late apoptosis (Annexin+-PI+), necrosis (Annexin--PI+), viability (Annexin--PI-). (I) Normalized Cers2 gene expression in mouse C2C12 muscle cells treated with 25nM scramble or Cers1 targeting shRNA (n=6-7 per group). (J-K) Representative mitochondrial respiration traces of digitonin-permeabilized mouse C2C12 muscle cells treated with DMSO or P053 (J) with quantification of basal, ATP-linked, proton leak respiration as well as spare capacity and maximal capacity linked respiration (n=4 per group). (L) Reactive oxygen production in mitochondria of mouse C2C12 muscle cells treated with DMSO or P053. (M) Enriched gene sets related to autophagy and mitophagy in 24mo C57BL/6J mouse muscles intramuscularly injected with AAV9 particles containing scramble, or shRNA targeting Cers1 (left), or intraperitoneally injected with DMSO or P053 (right). Color gradient indicates normalized effect size. Dot size indicates statistical significance (n=6-8 per group). (N) Representative confocal Proteostat® stainings with quantifications of DMSO and P053 treated mouse muscle cells expressing APPSWE (top) and human primary myoblasts isolated from patients with inclusion body myositis (bottom). (O) Stillness duration during a 90-second interval in adult day 5 C. elegans treated with DMSO or 100 µM P053. (P) Lifespan of C. elegans treated with DMSO or P053. (n=144-147 per group, for method details see main manuscript page 10).

      (2) As grip and exercise performance tests evaluate muscle function across several muscles, it is not evident how intramuscular AAV-mediated Cers1 inhibition solely in the gastrocnemius muscle can have a systemic effect or impact different muscles. This point requires clarification.

      The grip strength measurements presented in the manuscript come from hindlimb grip strength, as pointed out in the Methods section. We also measured grip strength in all four limbs, as well as in the forelimbs only (Author response image 1C-D). Forelimb strength did not change; only hindlimb grip strength was significantly different in AAV-Cers1KD compared to the scramble control AAV (Fig 3I), which is in line with the fact that we only injected the AAV in the hindlimbs. This is similar to the effect we observed in our previous data, where we saw altered muscle function upon IM AAV delivery in the gastrocnemius (PMID: 34878822, PMID: 37118545). The gastrocnemius likely has the largest contribution to hindlimb grip strength given its size, and possibly even overall grip strength as suggested by a trend of reduced grip strength in all four limbs (Author response image 1C). We also suspect that the hindlimb muscles have the largest contribution to uphill running as we could also see an effect on running performance. While we carefully injected a minimal amount of AAV into gastrocnemius to avoid leakage, we cannot completely rule out that some AAV might have spread to other muscles. We added this information to the discussion of the manuscript as a potential limitation of the study.

      (3) To further substantiate the role of Cers1 in myogenesis, it would be crucial to investigate the consequences of Cers1 inhibition under conditions of muscle damage, such as cardiotoxin treatment or eccentric exercise.

      While it would be interesting to study Cers1 in the context of muscle regeneration, and possibly mouse models of muscular dystrophy, we think such work would go beyond the scope of the current manuscript.

      (4) It would be informative to determine whether the muscle defects are primarily dependent on the reduction of C18-ceramides or the compensatory increase of C24-ceramides or C24-dihydroceramides.

      To improve mechanistic insights, as suggested by Reviewer #2, we performed more experiments to gain insight into how Cers1-derived c18 and Cers2-derived c24 ceramide species affect myogenesis. We recently showed that knocking out Cers2 reduces c24:0/c24:1 and promotes muscle cell maturation (PMID: 37118545, Fig. 6m-r and Supplementary Fig. 5e). This suggests that the very long chain ceramides c24 might indeed be driving the effect we see upon Cers1 inhibition because we observe an accumulation of c24 ceramides upon Cers1 (c18) inhibition (Fig 2B, Fig 3B, Fig 4A, Fig S3E), which is associated with impaired muscle maturation (Fig 4B-C, Fig S3G-I, Fig S4G-I). To study whether impaired muscle cell differentiation upon Cers1 inhibition is dependent on Cers2, we knocked down Cers1 alone, or in combination with the knockdown of Cers2. Results show that reduced muscle cell maturation mediated by Cers1KD is rescued by the simultaneous knockdown of Cers2 as shown by gene expression analyses and immunohistochemical validation and quantification. We added these data to the manuscript as new Fig 5D-E, new Fig S5G-I. These data, together with our previous results showing that Degs1 knockout reduces myogenesis (PMID: 37118545, Fig. 6s-x and Fig. 7), suggest that C24/dhC24 might contribute to the age-related impairments in myogenesis. We added the new results to the revised manuscript.

      (5) Previous studies from the research group (PMID 37118545) have shown that inhibiting the de novo sphingolipid pathway by blocking SPLC1-3 with myriocin counteracts muscle loss and that C18-ceramides increase during aging. In light of the current findings, certain issues need clarification and discussion. For instance, how would myriocin treatment, which reduces Cers1 activity because of the upstream inhibition of the pathway, have a positive effect on muscle? Additionally, it is essential to explain the association between the reduction of Cers1 gene expression with aging (Fig. 1B) and the age-dependent increase in C18-ceramides (PMID 37118545).

      Blocking the upstream enzyme of the ceramide pathway (SPT1) shuts down the entire pathway that is overactive in aging, and therefore seems beneficial for muscle aging. While most enzymes in the ceramide pathway that we studied so far (SPTLC1, CERS2) revealed muscle benefits in terms of myogenesis, inflammation (PMID: 35089797; PMID: 37118545) and muscle protein aggregation (PMID: 37196064), the CERS1 enzyme shows opposite effects. This is also visible in the direction of CERS1 expression compared to the other enzymes in one of our previously published studies (PMID: 37118545, Fig. 1e and Fig. 1f). In the current study, we show that Cers1 inhibition indeed exacerbates age-related myogenesis and inflammation as opposed to the inhibition of Sptlc1 or Cers2. As the reviewer points out, both C18- and C24-ceramides seem to accumulate upon muscle aging. We think this is due to an overall overactive ceramide biosynthesis pathway. Blocking C18-ceramides via Cers1 inhibition results in the accumulation of C24-ceramides and worsens muscle phenotypes (see reply to question #4). On the other hand, blocking C24-ceramides via Cers2 inhibition improves muscle differentiation. These observations, together with the finding that Cers1-mediated inhibition of muscle differentiation is dependent on proper Cers2 function (new Fig 5D-E, new Fig S5G-I), point towards C24-ceramides as the main culprit of reduced muscle differentiation. Hence, at least a significant part of the benefits of blocking SPTLC1 might have been related to reducing very long-chain ceramides. We believe that reduced Cers1 expression in skeletal muscle upon aging, observed by us and others (PMID: 31692231), might reflect a compensatory mechanism to make up for an overall overactive ceramide flux in aged muscles. Reducing Cers1 function during aging might lead to an increase in sphingosine levels as has been shown previously (PMID: 31692231). Increased sphingosine triggers cell apoptosis due to its toxicity (PMID: 12531554). Therefore, channeling accumulating sphingosine towards C24 ceramides may avoid toxicity but, as we show in this manuscript, will reduce the myogenic potential in muscle. However, if C24 production is also blocked by Cers2 inhibition (new Fig 5D-E, new Fig S5G-I), sphingosine is forced towards the production of other, potentially less toxic, or myogenesis-impairing ceramides. These data are now added to the revised manuscript (see page 7). Details were added to the discussion of the manuscript (see page 8).

      Addressing these points will strengthen the manuscript's conclusions and provide a more comprehensive understanding of the role of Cers1 in skeletal muscle function during aging.

      Reviewer #1 (Recommendations For The Authors):

      The authors identified that genetic and pharmacological inhibition of CERS1, an enzyme implicated in ceramide biosynthesis, worsens muscle fibrosis and inflammation during aging.

      Even though many of the experiments only confirmed previously published data (ref 21, 11, 37, 38), which also show a decline of CERS1 in ageing and the generation and characterization of a muscle-specific knockout mouse line, the study points out an interesting issue on excluding CERS1 inhibition as a therapeutic strategy for sarcopenia and opens new questions on understanding how inhibition of SPTLC1 (upstream of CERS1) has beneficial effects in healthy aging (ref 15 published by the same authors).

      Overall, the article is well written and clear. However, there is a major weakness. The mechanistic insights into how the increased amount of long ceramides (c24) and the decrease in shorter ones (cer c18) might influence muscle mass, force production, fibrosis and inflammation in aged mice have not been addressed. At the present stage the manuscript is descriptive and confirmatory of CERS1-mediated function in preserving muscle mass. The authors should consider the following points:

      Comments:

      (1) Muscle data

      (a) The effect of CERS1 inhibition on myotube formation must be better characterized. Which step of myogenesis is affected? Is stem cell renewal or MyoD replication/differentiation, or myoblast fusion or an increased cell death the major culprit of the small myotubes? Minor point: Figure S1C: show C14:00 level at 200 h; text of Fig S2A and 1F: MRF4 and Myogenin are not early genes in myogenesis, please correct; Fig S2B and 2C: changes in transcript do not mean changes in protein or myotube differentiation and therefore, authors must test myotube formation and myosin expression.

      Cers1 inhibition seems to affect differentiation and myoblast fusion. To test other suggested effects we performed more experiments as delineated. Inhibiting Cers1 systemically with the pharmacological inhibitor of Cers1 (P053) or with intramuscular delivery of AAV expressing a short hairpin RNA (shRNA) against Cers1 in mice did not affect Pax7 transcript levels (Author response image 1E). Moreover, we also did not observe an effect of shRNA targeting Cers1 on Pax7 levels in mouse C2C12 muscle progenitor cells (Author response image 1F). To characterize the effect of Cers1 inhibition on muscle progenitor proliferation/renewal, we used scramble shRNA, or shRNA targeting Cers1 in C2C12 muscle progenitors and measured proliferation using CellTiter-Glo (Promega). Results showed that Cers1KD had no significant effect on cell proliferation (Author response image 1G). Next, we assayed cell death in differentiating C2C12 myotubes deficient in Cers1 using FACS analysis of Annexin V (left) and propidium iodide (right). We found no difference in early apoptosis, late apoptosis, necrosis, or muscle cell viability, suggesting that cell death can be ruled out to explain smaller myotubes (Author response image 1H). These findings support the notion that the inhibitory effect of Cers1 knockdown on muscle maturation is primarily based on effects on myogenesis rather than on apoptosis. Our data in the manuscript also suggest that Cers1 inhibition affects myoblast fusion, as shown by reduced myonucleation upon Cers1KD (Fig S3H right, Fig S5I).

      (b) The phenotype of CERS1 knockdown is milder than that of P053-treated mice (Fig S5D and Figure 3F, 3H are not significant) despite similar changes of Cer18:0, Cer24:0, Cer 24:1 concentration in muscles. Why?

      Increases in very long chain ceramides were in fact larger upon P053 administration compared to AAV-mediated knockdown. For example, Cer24:0 levels increased by >50% upon P053 administration, compared to 20% by AAV injections. Moreover, dhC24:1 increased by 6.5-fold vs 2.5-fold upon P053 vs AAV treatment, respectively. These differences might not only explain the slightly attenuated phenotypes in the AAV-treated mice but also underline the notion that very long chain ceramides might cause muscle deterioration. We believe inhibiting the enzymatic activity of Cers1 (P053) as compared to degrading Cers1 transcripts is a more efficient strategy to reduce ceramide levels. However, we cannot completely rule out multi-organ, systemic effects of P053 treatment beyond its direct effect on muscle. We added these details in the discussion of the revised manuscript (see page 8 of the revised manuscript).

      (c) The authors talk about a possible compensation of CERS2 isoform but they never showed mRNA expression levels or CERS2 protein levels after treatment. Is CERS2 higher expressed when CERS1 is downregulated in skeletal muscle?

      We appreciate the suggestion of the reviewer. We found no change in Cers2 mRNA levels upon Cers1 inhibition in mouse C2C12 myoblasts (Author response image 1I). We would like to point out that mRNA abundance might not be the optimal measurement for enzymes, because transcript levels do not necessarily reflect enzymatic activity. Therefore, we think metabolite levels are a better proxy of enzymatic activity. It should also be pointed out that “compensation” might not be an accurate description, as sphingoid base substrate might simply be more available upon Cers1KD and hence, more substrate might be present for Cers2 to synthesize very long chain ceramides. This “re-routing” has been previously described in the literature and hypothesized to help avoid toxic (dh)sphingosine accumulation (PMID: 30131496). Therefore, we changed the wording in the revised manuscript to be more precise.

      (d) Force measurement of AAV CERS1 downregulated muscles could be a plus for the study (assay function of contractility)

      In the current study we measured grip strength in mice, which had previously been shown to be a good proxy of muscle strength and general health (PMID: 31631989). Indeed, our results of reduced muscle grip strength are in line with previous work that shows reduced contractility in muscles of Cers1 deficient mice (PMID: 31692231).

      (e) How are degradation pathways affected by the downregulation of CERS1? Is autophagy/mitophagy affected? How is mTOR and protein synthesis affected? There is a recent paper that showed that CerS1 silencing leads to a reduction in C18:0-Cer content, with a subsequent increase in the activity of the insulin pathway, and an improvement in skeletal muscle glucose uptake. Could it be possible that CERS1 downregulation increases mTOR signalling and decreases the autophagy pathway? Autophagic flux using colchicine in vivo would be useful to answer this hypothesis.

      Cers1 in skeletal muscle has indeed been linked to metabolic homeostasis (see PMID: 30605666). In line with their finding in young mice, we also find reduced fat mass upon P053 treatment in aged mice (Author response image 1A-B). We also looked into mitochondrial bioenergetics upon blocking Cers1 with P053 treatment using an O2k oxygraph (Author response image 1J-L). Results show that Cers1 inhibition in mouse muscle cells increases mitochondrial respiration, similar to what has been shown before (PMID: 30131496). However, we also found that reactive oxygen species production in mouse muscle cells is increased upon P053 treatment, suggesting the presence of dysfunctional mitochondria upon inhibiting Cers1 with P053. We next looked into the mitophagy/autophagy degradation pathways suggested by the reviewer and did not find convincing evidence that Cers1 has a major impact on autophagy- or mitophagy-derived gene sets in mice treated with shRNA against Cers1 or with the Cers1 pharmacological inhibitor P053 (Author response image 1M).

      We then assessed the effect of Cers1 inhibition on transcript levels related to mTORC1/protein synthesis, as suggested by the reviewer. Cers1 knockdown in differentiating mouse muscle cells showed only a weak trend to reduce mTORC1 and its downstream targets (new Fig S4A). In line with this, there was no notable difference in protein synthesis in differentiating, Cers1 deficient mouse C2C12 myoblasts as assessed by L-homopropargylglycine (HPG) amino acid labeling using confocal microscopy (new Fig S4B) or FACS analyses (new Fig S4C). However, Cers1KD increased transcripts related to the myostatin-Foxo1 axis as well as the ubiquitin proteasome system (e.g. atrogin-1, MuRF1) (new Fig S4D), suggesting Cers1 inhibition increases protein degradation. We added these details to the revised manuscript on page 7. We recently implicated the ceramide pathway in regulating muscle protein homeostasis (PMID: 37196064). Therefore, we assessed the effect of Cers1 inhibition with the P053 pharmacological inhibitor on protein folding in muscle cells using the Proteostat dye that intercalates into the cross-beta spine of quaternary protein structures typically found in misfolded and aggregated proteins. Interestingly, inhibiting Cers1 further increased misfolded proteins in C2C12 mouse myoblasts expressing the Swedish mutation in APP and human myoblasts isolated from patients with inclusion body myositis (Author response image 1N). These findings suggest that deficient Cers1 might upregulate protein degradation to compensate for the accumulation of misfolded and aggregating proteins, which might contribute to impaired muscle function observed upon Cers1 knockdown. Further studies are needed to disentangle the underlying mechanisms.

      (f) The balances of ceramides have been found to play roles in mitophagy and fission with an impact on cell fate and metabolism. Did the authors check how are mitochondria morphology, mitophagy or how dynamics of mitochondria are altered in CERS1 knockdown muscles? (fission and fusion). There is growing evidence relating mitochondrial dysfunction to the contribution of the development of fibrosis and inflammation.

      Previously, CERS1 has been studied in the context of metabolism and mitochondria (for reference, please see PMID: 26739815, PMID: 29415895, PMID: 30605666, PMID: 30131496). In summary, these studies demonstrate that C18 ceramide levels are inversely related to insulin sensitivity in muscle and mitochondria, and that Cers1 inhibition improves insulin-stimulated suppression of hepatic glucose production and reduced high-fat diet induced adiposity. Moreover, improved mitochondrial respiration, citrate synthase activity and increased energy expenditure were reported upon Cers1 inhibition. Lack of Cers1 specifically in skeletal muscle was also reported to improve systemic glucose homeostasis. While these studies agree on the effect of Cers1 inhibition on fat loss, results on glucose homeostasis and insulin sensitivity differ depending on whether a pharmacologic or a genetic approach was used to inhibit Cers1. The current manuscript describes the effect of CERS1 on muscle function and myogenesis because these were the most strongly correlated pathways with CERS1 in human skeletal muscle (Fig 1C) and impact of Cers1 on these pathways is poorly studied, particularly in the context of aging. Therefore, we would like to refer to the mentioned studies investigating the effect of CERS1 on mitochondria and metabolism.

      (2) C.elegans data:

      (a) The authors checked maternal RNAi protocol to knockdown lagr-1 and showed alteration of muscle morphology at day 5. They also give pharmacological exposure of P053 drug at L4 stage. Furthermore, the authors also used a transgenic ortholog lagr-1 to perform the experiments. All of them were consistent showing a reduced movement. It would be important to show rescue of the muscle phenotype by overexpressing CERS1 ortholog in knockdown transgenic animals.

      We used RNAi to knockdown the Cers1 orthologue, lagr-1, in C.elegans. Therefore, we do not have transgenic animals. Overexpressing lagr-1 in the RNAi treated animals would also not be possible as the RNA from the overexpression would just get degraded.

      (b) The authors showed data about distance of C.elegans. It would be interesting to specify if body bends, reversals and stillness are affected in RNAi and transgenic Knockdown worms.

      As suggested by the reviewer, we measured thrashing and stillness and found reduced thrashing (new Fig S5B) and a trend towards an increase in stillness (Author response image 1O) in P053 treated worms on day 5 of adulthood, which is the day we observed significant differences in muscle morphology and movement (Fig 4D-E, Fig S5A). These data are now included in the revised manuscript.

      (c) Is there an effect on lifespan extension by knocking down CERS1?

      We performed two independent lifespan experiments in C.elegans treated with the Cers1 inhibitor P053 and found reduced lifespan in both replicate experiments (for second replicate, see Author response image 1P). We added these data to the revised manuscript as new Fig 4H.

      How do the authors explain the beneficial effect of sptlc1 inhibition on healthy aging muscle? Discuss more during the article if there is no possible explanation at the moment.

      We believe that blocking the upstream enzyme of the ceramide pathway (SPT1) shuts down the entire pathway that is overactive in aging, and therefore is more beneficial for muscle aging. Our current work suggests that at least a significant part of Sptlc1-KD benefits might stem from blocking very long chain ceramides. While SPTLC1 and CERS2 revealed muscle benefits in terms of myogenesis, inflammation (PMID: 35089797; PMID: 37118545) and muscle protein aggregation (PMID: 37196064), the CERS1 enzyme shows opposite effects, which is also visible in Fig 1e and Fig 1f of PMID: 37118545. In the current study, we show that Cers1 inhibition indeed exacerbates aging defects in myogenesis and inflammation as opposed to the inhibition of Sptlc1 or Cers2. The fact that the effect of Cers1 on inhibiting muscle differentiation is dependent on the clearance of Cers2-derived C24-ceramides suggests that reducing very long chain ceramides might be crucial for healthy muscle aging. We added details to the discussion.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This study aims to understand the malaria antigen-specific cTfh profile of children and adults living in a malaria holoendemic area. PBMC samples from children and adults were unstimulated or stimulated with PfSEA-1A or PfGARP in vitro for 6h and analysed by a cTfh-focused panel. Unsupervised clustering and analysis on cTfh were performed.

      The main conclusions are:

      (1) the cohort of children has more diverse (cTfh1/2/17) recall responses compared to the cohort of adults (mainly cTfh17) and

      (2) Pf-GARP stimulates better cTfh17 responses in adults, thus a promising vaccine candidate.

      Strengths:

      This study is in general well-designed and with excellent data analysis. The use of unsupervised clustering is a nice attempt to understand the heterogeneity of cTfh cells. Figure 9 is a beautiful summary of the findings.

      Weaknesses:

      (1) Most of my concerns are related to using PfSEA-1A and PfGARP to analyse cTfh in vitro stimulation response. In vitro, stimulation on cTfh cells has been frequently used (e.g. Dan et al, PMID: 27342848), usually by antigen stimulation for 9h and analysed CD69/CD40L expression, or 18h and CD25/OX40. However, the authors use a different strategy that has not been validated to analyse in vitro stimulated cTfh. Also, they excluded CD25+ cells which might be activated cTfh. I am concerned about whether the conclusions based on these results are reliable.

      It has been shown that cTfh cells can hardly produce cytokines by Dan et al. However, in this paper, the authors report the significant secretion of IL-4 and IFNg on some cTfh clusters after 6h stimulation. If the stimulation is antigen-specific through TCR, why do cTfh1 cells upregulate IL-4 but not IFNg in Figure 6? I believe including the representative FACS plots of IL-4, IFNg, IL21 staining, and using %positive rather than MFI can make the conclusion more convincing. Similarly, the author should validate whether TCR stimulation under their system for 6h can induce robust BCL6/cMAF expression in cTfh cells. Moreover, there is no CD40L expression. Does this mean TCR stimulation mediated BCL6/cMAF upregulation and cytokine secretion precede CD40L expression?

      In summary, I am particularly concerned about the method used to analyse PfSEA-1A and PfGARP-specific cTfh responses because it lacks proper validation. I am unsure if the conclusions related to PfSEA-1A/PfGARP-specific responses are reliable.

      An unfortunate reality of these types of complex immunologic studies is that it takes time to optimize a multiparameter flow cytometry panel, run this number of samples, and then conduct the analysis (not to mention the time it takes for a manuscript to be accepted for peer-review). An unexpected delay, frankly, was the COVID-19 pandemic when non-essential research lab activities were put on hold. We designed our panel in 2019 and referred to the “T Follicular Helper Cells” Methods and Protocols book from Springer 2015. Obviously the field of human immunology took a huge leap forward during the pandemic as we sought to characterize components of protective immunity, and as a result there are several new markers we will choose for future studies of Tfh subsets. We agree with the reviewer that cytokine expression kinetics differ depending on the in vitro stimulation conditions. Due to small blood volumes obtained from healthy children, we were limited in the number of timepoints we could test. However, since we were most interested in IL21 expression, we found 6 hrs to be the best in combination with the other markers of interest during our optimization experiments. We did find IFNg expression from non-Tfh cells, therefore we believe our stimulation conditions worked.

      Dan et al used stimulated tonsil cells to assess the CXCR5<sup>pos</sup>PD1<sup>pos</sup>CD45RA<sup>neg</sup> Tfh and CXCR5<sup>neg</sup> CD45RA<sup>neg</sup> non-Tfh, whereas in our study we evaluated CXCR5<sup>pos</sup>PD1<sup>pos</sup>CD45RA<sup>neg</sup> Tfh from PBMCs. The PBMC work by Dan et al used EBV/CMV or other pathogen product stimuli and only gated on CD25<sup>pos</sup>OX40<sup>pos</sup> cells, which are not the cells we are assessing in our study. This might explain in part the differences in cytokine kinetics, as we evaluated CD25<sup>neg</sup> PBMCs only. However, we agree that more recent studies focused on CXCR5<sup>pos</sup>PD1<sup>pos</sup> cells included more activation-induced markers (AIM), which are missing in our study and limit the depth of our analysis.

      Percentage of positive cells and MFI are complementary data. Indeed, the percentage of positive cells only indicates which cells express the marker of interest without giving a quantitative value of this expression. MFI indicates how much the marker of interest is expressed by cells, which is important as it can indicate the degree of activation or exhaustion per cell. Meta-cluster analysis is not ideal to assess the percentage of positivity, whereas it does provide essential information regarding the intensity of expression. We added supplemental figures 14 (Bcl6 and cMAF), 15 (IFNg and IL21) and 16 (IL4 and IL21) where the percentage of positive cells was manually gated directly from the total CXCR5<sup>pos</sup>CD4<sup>pos</sup>CD45RA<sup>neg</sup>CD25<sup>neg</sup> TfH based on the FMO or negative control, and we overlaid the positive cells on the UMAP of all the CXCR5<sup>pos</sup>CD4<sup>pos</sup>CD45RA<sup>neg</sup>CD25<sup>neg</sup> meta-clusters. Results from the manual gating are consistent with the results we show using clustering. However, manual gating helps to better visualize that antigen-specific IL21 expression was statistically significant in children, whereas the high background observed for adults did not reveal higher expression after stimulation, perhaps suggesting an upper threshold of cytokine expression (supplemental figure 15). The following sentence has been added in the methods at the end of the “OMIQ analysis” section: “However, the percentage of positive IFN𝛾, IL-4, IL-21, Bcl6, or cMAF using manual gating can be found in Supplemental Figures 14, 15, and 16 along with the overlay of the gated positive cells on the CD4<sup>pos</sup>CXCR5<sup>pos</sup>CD25<sup>neg</sup> UMAP and the cytoplots of the gated positive cells for each meta-cluster (Supplemental Figures 14, 15, and 16).”
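
      To illustrate why the two readouts are complementary, the following is a minimal sketch, not the gating procedure used in the study. It assumes that positivity is called above a high percentile of the FMO control and that MFI denotes the median fluorescence intensity; the function names, the 99th-percentile cutoff, and the example intensities are all hypothetical.

```python
from statistics import median, quantiles


def gate_threshold(fmo_values, percentile=99):
    """Positivity cutoff: a high percentile of the FMO control (assumed convention)."""
    cuts = quantiles(fmo_values, n=100, method="inclusive")  # 99 cut points
    return cuts[percentile - 1]


def summarize_marker(sample_values, fmo_values, percentile=99):
    """Return percent-positive and median fluorescence intensity for one marker."""
    threshold = gate_threshold(fmo_values, percentile)
    positive = [v for v in sample_values if v > threshold]
    return {
        "threshold": threshold,
        # How many cells express the marker at all.
        "pct_positive": 100.0 * len(positive) / len(sample_values),
        # How strongly the marker is expressed, over all cells and over positive cells.
        "mfi_all": median(sample_values),
        "mfi_positive": median(positive) if positive else float("nan"),
    }


if __name__ == "__main__":
    fmo = list(range(101))                   # hypothetical FMO control intensities
    dim = [10, 50, 80, 120, 130, 140]        # half the cells positive, weakly
    bright = [10, 50, 80, 900, 1000, 1100]   # half the cells positive, strongly
    for name, sample in (("dim", dim), ("bright", bright)):
        print(name, summarize_marker(sample, fmo))
```

      In this toy example both samples have the same percentage of positive cells but very different intensities among the positive cells, which is the distinction drawn above.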

      Indeed, cMAF can be induced by TCR signaling, ICOS and IL6 (Imbratta et al., 2020). However, in our study populations, ICOS was expressed (see Author response image 1, panel A) in the absence of any stimulation, suggesting that CXCR5<sup>pos</sup>CD4<sup>pos</sup>CD25<sup>neg</sup>CD45RA<sup>neg</sup> cells were already capable of expressing cMAF. Indeed, after gating Bcl6 and cMAF positive cells based on their FMOs (Author response image 1, panel B and C, respectively), we overlaid positive cells on the CXCR5<sup>pos</sup>CD4<sup>pos</sup>CD25<sup>neg</sup>CD45RA<sup>neg</sup> cells UMAP and we can see that most of our cells already express cMAF alone (Author response image 1, panel D) or co-express cMAF and Bcl6 (Author response image 1, panel E), confirming that they are TfH cells, whereas very few cells expressed Bcl6 alone (Author response image 1, panel F). Because we knew that cT<sub>FH</sub> cells already express Bcl6 and cMAF, we focused our analysis on the intensity of their expression to assess if our vaccine candidates were inducing more expression of these transcription factors.

      Author response image 1.

      (2) The section between lines 246-269 is confusing. Line 249, comparing the abundance after antigen stimulation is improper because 6h stimulation (under Golgi stop) should not induce cell division. I think the major conclusions are contained in Figure 5e, that (A) antigen stimulation will not alter cell number in each cluster and (B) children have more MC03, 06 and fewer MC02, etc.). The authors should consider removing statements between lines 255-259 because the trends are the same regardless of stimulations.

      We agree that there is no cell division after 6h and that the different meta clusters did not proliferate after such a short in vitro stimulation. The use of the word ‘abundance’ in the context of cluster analysis is in reference to comparing the contribution of events by each group to the concatenated data. After the meta clusters are defined and then deconvoluted by study group, certain meta clusters could be more abundant in one group compared to another, meaning that this group contributed more events to a particular meta cluster.

      Dimensionality reduction is more nuanced than manual gating and reveals a continuum of marker expression between the cell subsets, as there is no hard “straight line” threshold of the kind used in 2D gating. Because of this, differences in marker expression levels after stimulation can shift cells from one cluster to another, thereby changing cluster abundance.
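
      The notion of abundance described above can be sketched as follows. This is a minimal illustration and not the OMIQ/FlowSOM implementation: it assumes each cell already carries a meta-cluster label, and the condition names, labels and counts are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_fractions(events):
    """Fraction of each condition's events that fall in each meta-cluster.

    `events` is an iterable of (condition, metacluster) pairs, one per cell.
    """
    counts = defaultdict(Counter)
    for condition, metacluster in events:
        counts[condition][metacluster] += 1
    fractions = {}
    for condition, per_cluster in counts.items():
        total = sum(per_cluster.values())
        fractions[condition] = {mc: n / total for mc, n in per_cluster.items()}
    return fractions


if __name__ == "__main__":
    # Same number of cells in both conditions (no proliferation in 6 h),
    # but two cells change phenotype and are assigned to another cluster.
    unstimulated = [("unstim", "MC02")] * 6 + [("unstim", "MC09")] * 4
    stimulated = [("stim", "MC02")] * 4 + [("stim", "MC09")] * 6
    for condition, fractions in cluster_fractions(unstimulated + stimulated).items():
        print(condition, fractions)
```

      Here the total number of events is identical in both conditions, yet the relative abundance of the two meta-clusters differs because cells are reassigned, not because they divided.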

      To clarify how this type of analysis is interpreted, we have modified lines 255-259 as follows:

      “In contrast, the quiescent PfSEA-1A- and PfGARP-specific cT<sub>FH</sub>2-like cluster (MC02) was significantly more abundant in adults compared to children (Figure 5c and 5d, pf<0.05). Interestingly, following PfGARP stimulation, the activated cT<sub>FH</sub>1/17-like subset (MC09) became more abundant in children compared to adults (Figure 5d, pf<0.05 with a False Discovery Rate=0.08), but no additional subsets shifted phenotype after PfSEA-1A stimulation (Figure 5c).”

      Reviewer #2 (Public Review):

      Summary:

      Forconi et al explore the heterogeneity of circulating Tfh cell responses in children and adults from malaria-endemic Kenya, and further compare such differences following stimulation with two malaria antigens. In particular, the authors also raised an important consideration for the study of Tfh cells in general, which is the hidden diversity that may exist within the current 'standard' gating strategies for these cells. The utility of multiparametric flow cytometry as well as unbiased clustering analysis provides a potentially potent methodology for exploring this hidden depth. However, the current state of analysis presented does not aid the understanding of this heterogeneity. This main goal of the study could hopefully be achieved by putting all the parameters used in one context, before dissecting such differences into their specific clinical contexts.

      Strengths:

      Understanding the full heterogeneity of Tfh cells in the context of infection is an important topic of interest to the community. The study included clinical groupings such as age group differences and differences in response to different malaria antigens to further highlight context-dependent heterogeneity, which offers new knowledge to the field. However, improvements in data analyses and presentation strategies should be made in order to fully utilize the potential of this study.

      Weaknesses:

      In general, most studies using multiparameter analysis coupled with an unbiased grouping/clustering approach aim to describe differences between all the parameters used for defining groupings, prior to exploring differences between these groupings in specific contexts. However, the authors have opted to separate these into sections using "subset chemokine markers", "surface activation markers" and then "cytokine responses", yet nuances within all three of these major groups were taken into account when defining the various Tfh identities. Thus, it would make sense to show how all of these parameters are associated with one another within one specific context to first logically establish to the readers how we can better define Tfh heterogeneity. When presented this way, some of the identities that are less clear, such as "MC03/MC04/ MC05/ MC08", may even be better revealed. Once established, all of these clusters can then be subsequently explored in further detail to understand cluster-specific differences in children vs adults, and in the various stimulation conditions. Since the authors also showed that many of the activation markers were not significantly altered post-stimulation, there is no real obstacle to merging the entire dataset for the first part of this study, which is to define Tfh heterogeneity in an unbiased manner regardless of age groups or stimulation conditions. Other studies using similar approaches such as Mathew et al 2020 (doi: 10.1126/science.abc8) or Orecchioni et al 2017 (doi: 10.1038/s41467-017-01015-3) can be referred to for more effective data presentation strategies.

      Accordingly, the expression of cytokines and transcription factors can only be reliably detected following stimulation. However, the underlying background responses need to be taken into account for understanding "true" positive signals. The only raw data for this was shown in the form of a heatmap where no proper ordering was given to ensure that readers can easily interpret the expression of these markers following stimulation relative to no stimulation. Thus, it is difficult to reliably interpret any real differences reported without this. Finally, the authors report differences in either cluster abundance or cluster-specific cytokine/ transcription factor expression in Tfh cell subsets when comparing children vs adults, and between the two malaria antigens. The comparisons of cytokine/transcription factor between groups will be more clearly highlighted by appropriately combining groupings rather than keeping them separate as in Figures 6 and 7.

      Thank you for sharing these references. Similar to the SPADE clustering and ViSNE dimensionality algorithms used in Orecchioni et al, we used all the extracellular markers from our panel in our FlowSOM algorithm with consensus meta-clustering, which includes both the chemokine receptors and the activation markers even though they are presented separately in our manuscript across Figures 3 and 4. This was explained in the methods section (lines 573 - 587). We then chose the UMAP algorithm for visual dimensionality reduction of the meta-clusters generated by FlowSOM-consensus meta-clustering, as explained under the “OMIQ analysis” subpart of our methods (lines 588- 604). Therefore, we believe we have conducted the analysis as this reviewer suggests, even if we chose to show the figures that were informative to our story. The heatmap of the results makes it possible to see which combinations of markers respond or not to the different conditions and between groups. All the raw data are presented in supplemental figures 10 to 13, which show, using bar plots, the differences expressed in the heatmaps. We believe this strengthens our interpretation of the results.

      Regarding the transcription factor and cytokine background, we added supplemental figures 14, 15 and 16 where we used manual gating to select Bcl6, cMAF, IFNg, IL21 or IL4 positive cells directly from total CXCR5<sup>pos</sup>CD4<sup>pos</sup>CD45RA<sup>neg</sup>CD25<sup>neg</sup> TfH cells based on the FMO or negative control, and we overlaid the positive cells on the UMAP of all the CXCR5<sup>pos</sup>CD4<sup>pos</sup>CD45RA<sup>neg</sup>CD25<sup>neg</sup> meta-clusters. Moreover, all the dot plots (with their statistics) used for the heatmaps in Figures 6 and 7 can be found in supplemental figures 10, 11, 12 and 13. These supplemental figures address the concerns above by showing the difference in signal between unstimulated and stimulated conditions.

      Reviewer #3 (Public Review):

      Summary:

      The goal of this study was to carry out an in-depth granular and unbiased phenotyping of peripheral blood circulating Tfh specific to two malaria vaccine candidates, PfSEA-1A and PfGARP, and correlate these with age (children vs adults) and protection from malaria (antibody titers against Plasmodium antigens). The authors further attempted to identify any specific differences in the Tfh responses to these two distinct malaria antigens.

      Strengths:

      The authors had access to peripheral blood samples from children and adults living in a malaria-endemic region of Kenya. The authors studied these samples using in vitro restimulation in the presence of specific malaria antigens. The authors generated a very rich data set from these valuable samples using cutting-edge spectral flow cytometry and a 21-plex panel that included a variety of surface markers, cytokines, and transcription factors.

      Weaknesses:

      - Quantifying antigen-specific T cells by flow cytometry requires the use of either 1- tetramers or 2- in vitro restimulation with specific antigens followed by identification of TCR-activated cells based on de-novo expression of activation markers (e.g. intracellular cytokine staining and/or surface marker staining). Although authors use an in vitro restimulation strategy, they do not focus their study on cells de-novo expressing activation markers as a result of restimulation; therefore, their study is not really on antigen-specific cTfh. Moreover, the authors report no changes in the expression of activation markers commonly used to identify antigen-specific T cells upon in vitro restimulation (including IFNg and CD40L); therefore, it is not clear if their in vitro restimulation with malaria antigens actually worked.

      We understand the reviewer’s point of view and apologize for any confusion. IFNg was expressed but did not differ statistically between groups. Indeed, looking at the CD8 T cells and using manual gating, we were able to show that IFNg from CD4<sup>pos</sup>CXCR5<sup>pos</sup> cells was increased upon stimulation, although not to a statistically significant degree (supplemental figure 15, panel C), confirming our primary observation using clustering analysis. These results showed that our malaria antigens induced an IFNg response in some participants, but not all of them, revealing heterogeneity in this response among individuals within the same group.

      Regarding CD40L, in supplemental figure 7, we can see that some of our meta-clusters expressed more CD40L upon stimulation, but again without leading to statistical differences between groups. Combined with the increased expression of other cytokines and transcription factors, this shows that our stimulation did indeed work. However, because of the high variation within groups, there were no statistical differences across our groups. Because CD40L is not the only marker showing specific T cell activation, and not all T cells respond using this marker alone, a more comprehensive multimarker AIM panel might have highlighted differences between groups. We recognize the limitations of our study and believe that future studies will benefit from more activation markers commonly used to identify antigen-specific T cells, such as CD69, OX40, 4-1BB (AIM panel), among other markers.

      - CXCR5+CD4+ memory T cells have been shown to present multi-potency and plasticity, capable of differentiating to non-Tfh subsets upon re-challenge. Although authors included in their flow panel a good number of markers commonly used in combination to identify Tfh (CXCR5, PD-1, ICOS, Bcl-6, IL-21), they only used one single marker (CXCR5) as their basis to define Tfh, thus providing a weak definition for Tfh cells and follow up downstream analysis.

      We apologize for the confusion. Even though we subsampled on the CD4<sup>pos</sup>CXCR5<sup>pos</sup> CD25<sup>neg</sup> cells to run our FlowSOM, we showed the different levels of expression across meta-clusters (figure 4 panels A and B) of PD1 (Tfh being PD1 positive cells) and ICOS (indicating the activation stage of the Tfh, “T Follicular Helper Cells” Methods and Protocols book from Springer 2015). We also included an overlay of the manually gated double positive Bcl6-cMAF cells on the CXCR5<sup>pos</sup>CD45RA<sup>neg</sup>CD25<sup>neg</sup> CD4 T cell UMAP plot to show that most of them express Bcl6 (supplemental figure 14). Interestingly, the manually gated IL21 positive cells were less abundant, particularly for children (supplemental figure 15). Because we were not able to include all the markers that are now used to define Tfh cells, we referred to our cell subsets as “TFH-like”. This is an acknowledged limitation of our study. Due to the limited blood volume obtained from children and the cost of running multiplex flow cytometry assays, our results showing antigen-specific heterogeneity of Tfh subsets will have to be validated in future studies that include these additional defining markers.

      - Previous works have used FACS-sorting and in vitro assays for cytokine production and B cell help to study the functional capacity of different cTfh subsets in blood from Plasmodium-infected individuals. In this study, authors do not carry out any such assays to isolate and evaluate the functional capacity of the different Tfh subsets identified. Thus, all the suggestions for the role that these different cTfh subsets may have in vivo in the context of malaria remain highly hypothetical.

      Unfortunately, low blood volumes obtained from children prevented us from running in vitro functional assays, and the study design did not allow us to correlate them with protection. However, since the function of identified Tfh subsets from malaria-exposed individuals has been evaluated using Pf lysates in other studies, we referenced them when interpreting the differences we reported in Tfh subset recognition between malaria antigens. If either of these antigens moves forward into vaccine trials, then evaluating their function would be important.

      - The authors have not included malaria unexposed control groups in their study, and experimental groups are relatively small (n=13).

      This study design did not include the recruitment of malaria-naive negative controls, as its goal was to compare the quality and abundance of malaria antigen-specific responses to the potential new vaccine targets PfSEA-1A and PfGARP between malaria-exposed children and adults. We did, however, test 3 malaria-naive adults and found no non-specific activation after stimulation with these two malaria antigens. Since this was done as part of our assay optimization, we did not feel the need to show these negative findings.

      And even with our small sample size, we demonstrated significant age-associated differences in malaria antigen-specific responses from cT<sub>FH</sub>-like subsets.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Minor points are:

      (1) Line 88, cTfh cells are not only from GC-Tfh, they have GC-independent origin (He et al, PMID: 24138884).

      The following sentence was added at line 88: “Interestingly, cT<sub>FH</sub> cells can also come from peripheral cT<sub>FH</sub> precursor CCR7<sup>low</sup>PD1<sup>high</sup>CXCR5<sup>pos</sup> cells; thus, they also have a GC-independent origin (He, Cell, 2013 PMID: 24138884).”

      (2) I believe all participants were free of blood-stage infection upon enrolment. But can authors clearly state this information between lines 151-159?

      We mentioned in the methods, lines 495-496: “Participants were eligible if they were healthy and not experiencing any symptoms of malaria at the time venous blood was collected”. However, using qPCR we found 5 children with blood-stage malaria infection. As shown in Author response image 2, when comparing malaria-free to blood-stage children, no differences were observed without any stimulation. However, MC03 is more abundant upon malaria antigen stimulation in the blood-stage group, whereas MC04 is more abundant in the malaria-free group upon PfGARP stimulation only, confirming that our stimulation worked.

      Author response image 2.

      Reviewer #3 (Recommendations For The Authors):

      (1) The strategy for gating on antigen-specific cTfh cells needs to be revised. The correct approach would be to gate on those cells that respond by de-novo expression of activation markers upon antigen restimulation (also termed activation-induced markers. e.g. CD69, CD40L, CXCL13 and IL-21, Niessl 2020; CD69, CD40L, CD137 and OX40, Lemieux 2023; CD137 and OX40, Grifoni 2020). As it stands, the study is not really on antigen-specific T cells, but rather on the overall CD4 T cell compartment plus or minus antigenic stimulation.

      We recognize the limitation in our flow panel design, which prevents us from performing this gating. We originally based our panel design on the “T follicular helper cells methods and protocols” book (Springer 2015), which used CD45RA, CD25, CXCR5, CCR6, CXCR3, CCR7, ICOS and PD1 to define cT<sub>FH</sub>. We had already optimized our 21-color panel, purchased reagents and started to run our experiments by the time the publications by Niessl, Lemieux and Grifoni changed how TFH cells are defined. Indeed, we optimized and performed our assay from November 2019 to March 2020, finishing running the samples during the first quarantine. Because of the urgent need for research on SARS-CoV-2, which we were involved with from that time onward, the analysis of our TFH work was substantially delayed. Moreover, 2020 is also the year when many TFH papers came out with better ways to define cT<sub>FH</sub> and responses to antigen stimulation. In our future studies, our panel will include AIM.

      (2) It is not clear if the antigenic stimulation actually worked. Does the proportion of IFNg+ or IL-4+ or IL-21+ or CD40L+ or CD25+ CD4 or CD8 T cells increase following in vitro antigen restimulation?

      Yes, using manual gating, we are able to show an increase of IL4 (supplemental figure 16 panel B and C) and IL21 (supplemental figure 15 panel J and K) production in both children and adults. However, we did not observe significant production of IFNg (supplemental figure 15, panel C) or changes in CD40L expression (supplemental figure 7) after malaria antigen stimulation, although our positive control SEB worked. So, yes, our stimulation assay worked, but these 2 malaria antigens did not significantly induce IFNg or CD40L. It could be that the responses are too low to detect in every participant, since these are single antigens and not whole parasite lysates, as other studies have used. It could also be that these antigens don’t stimulate CD40L or IFNg in all our participants. We brought up this limitation as follows in the discussion, line 473: “Although the heterogeneity in the response of CD40L and IFNγ suggests that our tested malaria antigens did not induce significant differences in the expression of these markers in all our participants, our panel did not include other activated induced markers, such as OX40, 4-1BB, and CD69”.

      (3) It is not clear what is the proportion of cTfh over the total CD4 T cell compartment among the different groups. Does this vary among different groups? It would be valuable to display this as an old-fashioned combination of contour plots with outliers for illustrating flow cytometry and bar graphs for the cumulative data.

      The proportion of CD3<sup>pos</sup>CD4<sup>pos</sup>CD25<sup>neg</sup>CXCR5<sup>pos</sup> cTfh cells did not differ within the total number of CD4 T cells between groups (figure 2).

      (4) The gating strategy could be refined and become more robust if adding additional markers in combination with CXCR5 for identifying cTfh (e.g. CXCR5+Bcl6+).

      Thank you for this suggestion. An overlay of Bcl6 expression can be found in supplemental figure 14 where we confirm that our CXCR5+ cT<sub>FH</sub>-like subsets express cMAF and Bcl6.

      (5) The protocols for intracellular and intranuclear staining seem to be incomplete in Materials and Methods. In particular, cell permeabilization strategies seem to be missing.

      Our apologies for this oversight, we added the following sentences in the methods line 545: “Cells were fixed and permeabilized for 45 mins using the transcription factor buffer set (BD Pharmingen) followed by a wash with the perm-wash buffer. Intracellular staining was performed at 4 °C for 45 more mins followed by two washes using the kit’s perm-wash buffer”.

      (6) In Materials and Methods, the authors mention they have used fluorescence minus one control to set their gating strategy. It would be valuable to show these, either on the main body or as part of supplementary figures.

      We added the cytoplots of the FMOs and/or negative controls as appropriate in the supplemental figures 14 (cMAF and Bcl6), 15 (IFNg and IL21) and 16 (IL4 and IL21).

      (7) Line 194 and Figure 3, it is not clear the criteria that the authors used for down-sampling events before FlowSOM analysis. Was this random? Was this done with unstimulated or stimulated samples?

      We chose to down-sample on CD3<sup>pos</sup>CD4<sup>pos</sup>CD25<sup>neg</sup>CD45RA<sup>neg</sup>CXCR5<sup>pos</sup> cells prior to our FlowSOM so that the cluster analysis would focus only on the differences among those cells. The down-sampling used 1,000 CD3<sup>pos</sup>CD4<sup>pos</sup>CD25<sup>neg</sup>CD45RA<sup>neg</sup>CXCR5<sup>pos</sup> cells from each fcs file (unstimulated and stimulated samples). If the fcs file had more than 1,000 CXCR5<sup>pos</sup> cells, the down-sampling was done randomly by the OMIQ platform algorithm to select only 1,000 CXCR5<sup>pos</sup> cells within this specific fcs file. The last sentence was added to the methods line 593.
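The per-file down-sampling described above can be illustrated with a minimal sketch. This is not the OMIQ algorithm, whose internals are not described here; it only assumes uniform random sampling without replacement, capped at 1,000 events per file. The function name `downsample_events`, the `seed` parameter and the representation of events as a plain list are hypothetical choices made for illustration.

```python
import random


def downsample_events(events, max_events=1000, seed=None):
    """Return at most `max_events` events from one file.

    Hypothetical helper: files with `max_events` or fewer events are
    returned unchanged; larger files are sampled uniformly at random
    without replacement (an assumption, not the documented OMIQ method).
    """
    events = list(events)
    if len(events) <= max_events:
        return events
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(events, max_events)


# Usage sketch: one list of gated events per (hypothetical) fcs file.
files = {
    "unstimulated": list(range(5000)),  # placeholder event identifiers
    "stimulated": list(range(800)),
}
subsampled = {name: downsample_events(ev, seed=0) for name, ev in files.items()}
```

Capping every file at the same number of events keeps any single sample from dominating the clustering, which is the usual motivation for this kind of step.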

      (8) Lines 201, 202, As it stands, the take of the authors on the role of different cTfh subsets during infection remains highly speculative. Are these differences in cTfh phenotypes actually reflected in their in vitro capacity to provide B cell help (e.g. as in the Obeng-Adjei 2015 paper) or to produce IL-21, express co-stimulatory molecules, or any other characteristic that would allow them to better infer their functional roles during infection? Any additional in vitro analysis of the functional capacity of isolated cTfh subsets identified in this research would greatly increase its value.

      We agree with the reviewer that this sentence is speculative, and we rephrased it as follows: “First, we found different CXCR5 expression levels between meta-clusters (Figure 3b); CXCR5 is essential for cT<sub>FH</sub> cells to migrate to the lymph nodes and interact with B-cells”. We would have liked to perform in vitro functional assays. However, as explained above, we did not have sufficient cells collected from children to do so.

      (9) It is not clear why authors omitted IL-17 and did not use IFNg and IL-4 to refine their definition of Th1, Th2 and Th17 cTfh.

      We would have liked to include IL-17; however, we were constrained by only having access to a 4-laser cytometer at the time we ran our assay. We needed to prioritize markers when designing our flow panel, and cTfh1 had been shown to be preferentially activated during episodes of acute febrile malaria in children (Obeng-Adjei). Therefore, we chose to focus on IFNg and IL4 to differentiate Tfh1 from Tfh2, in addition to other markers as surrogates of functional potential. We did not use IFNg and IL4 to refine our definition of Tfh1, Tfh2 and Tfh17, as recent publications have shown that IL4 is not only expressed in Tfh2 but also in the other Tfh subsets, at lower intensity (Gowthaman among others). Therefore, IFNg and IL4 by themselves were not sufficient to properly define the different Tfh subsets. In future studies, we plan to include transcription factor profiles (T-bet, BATF, GATA3) to further refine definitions of Tfh subsets.

      (10) Lines, 226, 228, based on the combination of markers that the MC03 subset expresses, it is tempting to think that this is the only "truly" committed Tfh subset from the entire analysis. Please, discuss.

      If the reviewer is referring to changes in marker expression levels that indicate they have not reached a level of differentiation that would make them reliable (i.e., “true”) Tfh cells, we agree that this is an important question now that we have technology that can measure and analyse so many phenotypic markers at once. This brings forward the need for the scientific method: replicating study findings to determine whether they are consistent given the same study design and experimental conditions.

      (11) Lines 243 244, Again, is this reflected in functional capacity?

      The study described in this manuscript did not include functional assays. However, this did not change the key finding that different malaria antigens behaved differently, demonstrating heterogeneity in Tfh recognition of malaria antigens. Regarding CD40L expression, we did not observe differences between groups; however, some individuals had an increase in their CD40L (supplemental figure 7). It is possible that some individuals responded through other activation-induced markers (CD69, ICOS, OX40, 4-1BB among others) and that our stimulation was not long enough to assess CD40L expression upon malaria antigen stimulation. This limitation has been addressed by editing lines 243-244 as follows: “we were unable to find statistical differences in the CD40L expression between groups as only few individuals responded through it (supplemental figure 7).”

      (12) Lines 243, 244, Are these cTfh subsets exclusively detected in malaria-exposed individuals? This is confounded by the lack of a malaria unexposed control group in this study, which would have been highly valuable.

      We agree with the reviewer that having malaria-naive children would have been valuable as a negative control group. However, this study was conducted in Kenya, where all children are suspected to have had at least one malaria infection. We also did not have ethical approval or the means to enroll children in the USA who would not have been exposed to malaria as a negative control group. Since we were also evaluating differences by age group, comparing US adults would not have helped to address this point. Therefore, this remains an open question that might be addressed by another study recruiting children in non-malaria endemic areas.

      (13) Line 267, as the authors have not gated on T cells de-novo expressing activation markers in response to antigen restimulation, how do they know these are indeed antigen-specific cTfh?

      OMIQ analysis accounts for marker expression levels in the resting cells (unstimulated well) for each individual compared to each experimental/stimulated well. The algorithm computationally determines whether that expression level changed without an arbitrary positive threshold, keeping the expression levels as a continuous variable rather than a dichotomous one, which is the power of unbiased cluster analyses. Therefore, we know that these cells are antigen-specific based on the statistical difference in expression intensity between the resting cells and the stimulated ones. Nevertheless, manual gating to show “de-novo” responding cells produced the same results as assessing the MFI of each meta-cluster (supplemental figures 14, 15 and 16).

      (14) Lines, 292-295, it is very surprising that Tfh cells would not produce IL-21 upon restimulation. Have the authors observed upregulation of IL-21 following SEB restimulation?

      Yes, we observed IL21 positive cells upon SEB stimulation (supplemental figure 15, panel J and K). However we found unexpectedly high background levels of IL21, specifically within the adult group (supplemental figure 15, panel K and M) making it challenging to find antigen-specific increases above background. Interestingly, an increase in IL21 using manual gating was observed upon PfSEA-1A or PfGARP stimulation in children (supplemental figure 15, panel J and L).

      (15) In Figures 3 and 4, it is not clear if there are any significant differences in expression of different markers between different cTfh subsets and/or different conditions. Moreover, the lack of differences in response to antigen stimulation seems to suggest that it did not work adequately.

      We intentionally chose a 6-hour stimulation to better assess changes in cytokines, which we did. However, because it is a short stimulation, we did not expect dramatic changes in the extracellular markers presented in figures 3 and 4. A longer stimulation, such as 24h, would better highlight these changes.

      (16) Figure 5b would benefit from bar graphs.

      Please find below the bar-graphs for the highlighted meta-clusters in figure 5b. We did not include these bar-graphs to our figure 5 as they do not bring new information. They repeat the information already presented through the EdgeR plot.

      Author response image 3.

      (17) Figures 6 and 7 would greatly benefit from showing individual examples of old-fashioned contour with outliers flow plots to illustrate the different cTfh subsets identified in the study.

      The different cT<sub>FH</sub> subsets can be found with a contour plot with outliers in the supplemental figure 4.

      (18) Figures 3,4, 6, and 7, the authors exclusively focused on the study of MFI to measure the expression of cytokine and transcription factors among different groups/stimulations. Have the authors observed any differences in the percentage or absolute counts of cytokine+ and/or TF+ between different subsets of cTfh and/or different conditions?

      Yes. We added the supplemental figures 14 (transcription factors) and 15/16 (cytokines) where cytokines and transcription factors were assessed using manual gating. We found that total CD4<sup>pos</sup>CXCR5<sup>pos</sup> IL4 was significantly increased upon stimulation in both adults and children while IFNg was not. However, we found significantly higher IFNg on total CD8<sup>pos</sup> cells showing that the stimulation worked, but the total CD4<sup>pos</sup>CXCR5<sup>pos</sup> did not express IFNg. Finally, we observed a trend of higher IL21<sup>pos</sup>CD4<sup>pos</sup>CXCR5<sup>pos</sup> in adults, not significant due to high background whereas IL21 was significantly increased upon stimulation in children. Regarding cMAF and Bcl6, both transcription factors were significantly increased upon stimulation within children only.

      (19) Figure 8, the definition for high and low PfGARP antibody titers seems rather arbitrary. Are these associations still significant when attempting a regular correlation analysis between Ab values (i.e. Net MFI) and different cTfh subsets?

      Yes, the definition for high and low PfGARP antibody levels is arbitrary, but when looking at the antibody data (figure 1b), it was naturally bimodal. Therefore, as a sub-analysis, we assessed the association between PfGARP antibody levels and cT<sub>FH</sub> subsets, see Author response image 4. We checked the correlation between the abundance of the meta-clusters and the level of IgG anti-PfGARP and anti-PfSEA after PfGARP and PfSEA stimulation. We also checked the correlation between the MFI expression of Bcl6 and cMAF after stimulation (PfGARP or PfSEA-1A minus the unstimulated) by the meta-clusters and the level of IgG anti-PfGARP and anti-PfSEA. However, we believe that because of our small sample size, our results are not robust enough and that we risk over-interpreting the data. Therefore, we chose not to include this analysis in the manuscript.

      Author response image 4.

      (20) The comprehensive 21-plex panel that authors used in this study could generate insights on additional immune cells beyond cTfh (e.g. additional CD4 T cell subsets, CD8 T cells, CD19 B cells). It is not clear why the authors limited their analysis to cTfh only.

      The primary goal of the study was to assess the cT<sub>FH</sub> response to malaria vaccine candidates. However, we were able to assess the IFNg expression for CD8 T cells upon stimulation using the manual gating as indicated in the supplemental figure 15. Without additional markers to more clearly define other CD4 T cell or B cell subsets, we do not believe this dataset would go deep enough into characterizing antigen-specific responses to malaria antigens that would yield new insight.

      (21) Minor point, the punctuation should be revised throughout the manuscript.

      Punctuation was revised throughout the manuscript by our departmental scientific writer Dr. Trombly, as per reviewer request.

    1. Reviewer #2 (Public Review):

      Assessment

      This study develops a potentially useful metric for quantifying codon usage adaptation – the Codon Adaptation Index of Species (CAIS) – that is intended to allow for more direct comparisons of the strength of selection at the molecular level across species by controlling for interspecies variation in amino acid usage and GC content. As evidence to support their claim that CAIS better controls for GC content and amino acid usage across species, they note that CAIS has only a weak positive correlation with GC% (that does not stand up to multiple hypothesis testing correction) while CAI has a clear negative correlation with GC%. Using CAIS, they find better adapted species have more disordered protein domains; however, excitement about these findings is dampened due to (1) this result also being observed using the effective number of codons (ENC) and (2) concerns over the interpretation of CAIS as a proxy for the effectiveness of selection.

      Public Review

      Summary

      The goal of the authors in this study is to develop a more reliable approach for quantifying codon usage such that it is more comparable across species. Specifically, the authors wish to estimate the degree of adaptive codon usage, which is potentially a general proxy for the strength of selection at the molecular level. To this end, the authors created the Codon Adaptation Index for Species (CAIS) that attempts to control for differences in amino acid usage and GC% across species. Using their new metric, the authors observe a positive relationship between CAIS and the overall “disorderedness” of a species protein domains. I think CAIS has the potential to be a valuable tool for those interested in comparing codon adaptation across species in certain situations. However, I have certain theoretical concerns about CAIS as a direct proxy for the efficiency of selection sNe when mutation bias changes across species.

      Strengths

      (1) I appreciate that the authors recognize the potential issues of comparing CAI when amino acid usage varies and correct for this in CAIS. I think this is sometimes an under-appreciated point in the codon usage literature, as CAI is a relative measure of codon usage bias (i.e. only considers synonyms). However, the strength of natural selection on codon usage can potentially vary across amino acids, such that comparing mean CAI between protein regions with different amino acid biases may result in spurious signals of statistical significance.

      (2) The CAIS metric presented here is generally applicable to any species that has an annotated genome with protein-coding sequences. A significant improvement over the previous version is the implementation of a software tool for applying this method.

      (3) The authors do a better job of putting their results in the context of the underlying theory of CAIS compared to the previous version.

      (4) The paper is generally well-written.

      Weaknesses

      (1) The previously observed correlation between CAIS and body size was due to a bug when calculating phylogenetic independent contrasts. I commend the authors for acknowledging this mistake and updating the manuscript accordingly. I feel that the unobserved correlation between CAIS and body size should remain in the final version of the manuscript. Although it is disappointing that it is not statistically significant, the corrected results are consistent with previous findings (Kessler and Dean 2014).

      (2) I appreciate the authors providing a more detailed explanation of the theoretical basis of the model. However, I remain skeptical that shifts in CAIS across species indicate shifts in the strength of selection. I am leaving the math from my previous review here for completeness.

      As in my previous review, let’s take a closer look at the ratio of observed codon frequencies vs. expected codon frequencies under mutation alone, which was previously notated as RSCUS in the original formulation. In this review, I will keep using the RSCUS notation, even though it has been dropped from the updated version. The key point is this is the ratio of observed and expected codon frequencies. If this ratio is 1 for all codons, then CAIS would be 0 based on equation 7 in the manuscript – consistent with the complete absence of selection on codon usage. From here on out, subscripts will only be used to denote the codon and it will be assumed that we are only considering the case of r = genome for some species s.

      I think what the authors are attempting to do is “divide out” the effects of mutation bias (as given by Ei), such that only the effects of natural selection remain, i.e. deviations from the expected frequency based on mutation bias alone represents adaptive codon usage. Consider Gilchrist et al. GBE 2015, which says that the expected frequency of codon i at selection-mutation-drift equilibrium in gene g for an amino acid with Na synonymous codons is

      where ∆M is the mutation bias, ∆η is the strength of selection scaled by the strength of drift, and φ<sub>g</sub> is the gene expression level of gene g. In this case, ∆M and ∆η reflect the strength and direction of mutation bias and natural selection relative to a reference codon, for which ∆M, ∆η = 0. Assuming the selection-mutation-drift equilibrium model is generally adequate to model the true codon usage patterns in a genome (as I do and I think the authors do, too), E<sub>i,g</sub> could be considered the expected observed frequency of codon i in gene g, E[O<sub>i,g</sub>].

      Let’s re-write E<sub>i</sub> in the form of Gilchrist et al., such that it is a function of mutation bias ∆M. For simplicity we will consider just the two codon case and assume the amino acid sequence is fixed. Assuming GC% is at equilibrium, the terms g<sub>r</sub> and 1 − g<sub>r</sub> can be written as

      where µx→y is the mutation rate from nucleotides x to y. As described in Gilchrist et al. MBE 2015 and Shah and Gilchrist PNAS 2011, the mutation bias . This can be expressed in terms of the equilibrium GC content by recognizing that

      As we are assuming the amino acid sequence is fixed, the probability of observing a synonymous codon i at an amino acid becomes just a Bernoulli process.

      If we do this, then

      Recall that in the Gilchrist et al. framework, the reference codon has ∆M<sub>NNG,NNG</sub> = 0 ⟹ e<sup>−∆M<sub>NNG,NNG</sub></sup> = 1. Thus, we have recovered the Gilchrist et al. model from the formulation of E<sub>i</sub> under the assumption that natural selection has no impact on codon usage and codon NNG is the pre-defined reference codon. To see this, plug in 0 for ∆η in equation (1).

      We can then calculate the expected RSCUS using equation (1) (using notation E[O<sub>i</sub>]) and equation (6) for the two codon case. For simplicity, assume we are only considering a gene of average expression (defined as ). Assume in this case that NNG is the reference codon (∆M<sub>NNG</sub>, ∆η<sub>NNG</sub> = 0).

      This shows that the expected value of RSCUS for a two codon amino acid is expected to increase as the strength of selection ∆η increases, which is desired. Note that ∆η in Gilchrist et al. is formulated in terms of selection against a codon relative to the reference, such that a negative value represents that a codon is favored relative to the reference. If ∆η = 0 (i.e. selection does not favor either codon), then E[RSCUS] = 1. Also note that the expected RSCUS does not remain independent of the mutation bias. This means that even if sNe (i.e. the strength of natural selection) does not change between species, changes to the strength and direction of mutation bias across species could impact RSCUS. Assuming my math is right, I think one needs to be cautious when interpreting CAIS as representative of the differences in the efficiency of selection across species except under very particular circumstances.
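The two-codon argument above can be sketched in LaTeX as follows. This is a reconstruction under stated assumptions, not the exact equations of the review: the first line uses the standard form of the selection-mutation-drift model of Gilchrist et al.; the non-reference codon is labelled NNA (a hypothetical label), NNG is the reference codon, g denotes the equilibrium GC content, and average expression is taken as φ = 1 (an assumed convention).

```latex
% Expected frequency of codon i in gene g (standard form of the
% selection-mutation-drift model, N_a synonymous codons):
\begin{align}
E[O_{i,g}] &= \frac{e^{-\Delta M_i - \Delta\eta_i \phi_g}}
                   {\sum_{j=1}^{N_a} e^{-\Delta M_j - \Delta\eta_j \phi_g}}
\intertext{Two codons, NNG as reference
($\Delta M_{NNG} = \Delta\eta_{NNG} = 0$), average expression $\phi = 1$
(assumed), and $\Delta M = \log\frac{g}{1-g}$ under mutation alone:}
E[O_{NNA}] &= \frac{e^{-\Delta M - \Delta\eta}}{1 + e^{-\Delta M - \Delta\eta}},
\qquad
E_{NNA} = \frac{e^{-\Delta M}}{1 + e^{-\Delta M}} = 1 - g \\
E[\mathrm{RSCUS}_{NNA}] &= \frac{E[O_{NNA}]}{E_{NNA}}
  = \frac{e^{-\Delta\eta}\left(1 + e^{-\Delta M}\right)}
         {1 + e^{-\Delta M - \Delta\eta}}
\end{align}
```

Setting ∆η = 0 gives a ratio of 1, as stated above. For any ∆η ≠ 0, the ratio still contains ∆M, which is why a change in mutation bias alone can shift the value.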

      Consider our 2-codon amino acid scenario. You can see how changing GC content without changing selection can alter the CAIS values calculated from these two codons. Particularly problematic appears to be cases of extreme mutation biases, where CAIS tends toward 0 even for higher absolute values of the selection parameter. Codon usage for the majority of the genome will be primarily determined by mutation biases, with selection being generally strongest in a relatively few highly-expressed genes. Strong enough mutation biases ultimately can overwhelm selection, even in highly-expressed genes, reducing the fraction of sites subject to codon adaptation.
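A minimal Python sketch of the two-codon scenario, for readers who want to vary the parameters themselves. It is not the simulation code behind the figures. It assumes the selection-mutation-drift form of Gilchrist et al. with NNG as the reference codon, a hypothetical non-reference codon NNA, mutation bias derived from equilibrium GC content as ∆M = log(g / (1 − g)), and average expression φ = 1. It computes only the ratio of observed to mutation-only codon frequency, not CAIS itself, and all function names are illustrative.

```python
import math


def mutation_bias_from_gc(gc):
    """Delta-M of codon NNA relative to reference NNG (assumed form)."""
    return math.log(gc / (1.0 - gc))


def expected_freq_nna(delta_m, delta_eta, phi=1.0):
    """Expected frequency of NNA for a two-codon amino acid."""
    weight = math.exp(-delta_m - delta_eta * phi)
    return weight / (1.0 + weight)  # the reference codon has weight 1


def expected_ratio_nna(gc, delta_eta, phi=1.0):
    """Observed frequency over mutation-only frequency for NNA."""
    delta_m = mutation_bias_from_gc(gc)
    observed = expected_freq_nna(delta_m, delta_eta, phi)
    mutation_only = expected_freq_nna(delta_m, 0.0, phi)
    return observed / mutation_only


# Same selection (delta_eta = -1 favors NNA), different GC content.
for gc in (0.1, 0.41, 0.5, 0.9):
    print(gc, round(expected_ratio_nna(gc, delta_eta=-1.0), 3))
```

With selection held fixed, the printed ratio changes with GC content, which is the dependence on mutation bias described above; with `delta_eta=0.0` the ratio is 1 for every GC value.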

      Peer review image 1.

      Peer review image 2.

      CAIS (Low Expression)

      Peer review image 3.

      CAIS (Average Expression)

      Peer review image 4.

      CAIS (High Expression)

      If we treat the expected codon frequencies as genome-wide frequencies, then we are basically assuming this genome is made up entirely of a single 2-codon amino acid with selection on codon usage being uniform across all genes. This is obviously not true, but I think it shows some of the potential limitations of the CAIS approach. Based on these simulations, CAIS seems best employed under specific scenarios. One such case could be when it is known that mutation bias varies little across the species of interest. Looking at the species used in this manuscript, most of them have a GC content around 0.41, so I suspect their results are okay (assuming things like GC-biased gene conversion are not an issue). Outliers in GC content probably are best excluded from the analysis.

      Although I have not done so, I am sure this could be extended to the 4 and 6 codon amino acids. One potential challenge to CAIS is the non-monotonic changes in codon frequencies observed in some species (again, see Shah and Gilchrist 2011 and Gilchrist et al. 2015).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, Liu et al. present CROWN-seq, a technique that simultaneously identifies transcription-start nucleotides and quantifies N6,2'-O-dimethyladenosine (m6Am) stoichiometry. This method is derived from ReCappable-seq and GLORI, a chemical deamination approach that differentiates A and N6-methylated A. Using ReCappable-seq and CROWN-seq, the authors found that genes frequently utilize multiple transcription start sites, and isoforms beginning with an Am are almost always N6-methylated. These findings are consistently observed across nine cell lines. Unlike prior reports that associated m6Am with mRNA stability and expression, the authors suggest here that m6Am may increase transcription when combined with specific promoter sequences and initiation mechanisms. Additionally, they report intriguing insights on m6Am in snRNA and snoRNA and its regulation by FTO. Overall, the manuscript presents a strong body of work that will significantly advance m6Am research.

      Strengths:

      The technology development part of the work is exceptionally strong, with thoughtful controls and well-supported conclusions.

      We appreciate the reviewer for the very positive assessment of the study. We have addressed the concerns below.

      Weaknesses:

      Given the high stoichiometry of m6Am, further association with upstream and downstream sequences (or promoter sequences) does not appear to yield strong signals. As such, transcription initiation regulation by m6Am, suggested by the current work, warrants further investigation.

      We thank the reviewer for the insightful comments. We have softened the language related to m<sup>6</sup>Am and transcription regulation. We totally agree with the reviewer that future investigation is required to determine the molecular mechanism behind m<sup>6</sup>Am and transcription regulation.

      Reviewer #2 (Public review):

      Summary:

      In the manuscript "Decoding m6Am by simultaneous transcription-start mapping and methylation quantification" Liu and co-workers describe the development and application of CROWN-Seq, a new specialized library preparation and sequencing technique designed to detect the presence of cap-adjacent N6,2'-O-dimethyladenosine (m6Am) with single nucleotide resolution. Such a technique was a key need in the field since prior attempts to get accurate positional or quantitative measurements of m6Am positioning yielded starkly different results and failed to generate a consistent set of targets. As noted in the strengths section below the authors have developed a robust assay that moves the field forward.

      Furthermore, their results show that most mRNAs whose transcription start nucleotide (TSN) is an 'A' are in fact m6Am (85%+ for most cell lines). They also show that snRNAs and snoRNAs have a substantially lower prevalence of m6Am TSNs.

      Strengths:

      Critically, the authors spent substantial time and effort to validate and benchmark the new technique with spike-in standards during development, cross-comparison with prior techniques, and validation of the technique's performance using a genetic PCIF1 knockout. Finally, they assayed nine different cell lines to cross-validate their results. The outcome of their work (a reliable and accurate method to catalog cap-adjacent m6Am) is a particularly notable achievement and is a needed advance for the field.

      Weaknesses:

      No major concerns were identified by this reviewer.

      We thank the reviewer for the positive assessment of the method and dataset. We have addressed the concerns below.

      Mid-level Concerns:

      (1) In Lines 625 and 626, the authors state that “our data suggest that mRNAs initate (mis-spelled by authors) with either Gm, Cm, Um, or m6Am.” This reviewer took those words to mean that for A-initiated mRNAs, m6Am was the ‘default’ TSN. This contradicts their later premise that promoter sequences play a role in whether m6Am is deposited.

      We thank the reviewer for the comment. We have changed this sentence to “Instead, our data suggest that mRNAs initiate with either Gm, Cm, Um, or Am, where Am are mostly m<sup>6</sup>Am modified.” The revised sentence separates the processes of transcription initiation and m<sup>6</sup>Am deposition, which should avoid confusing the reader.

      (2) Further, the following paragraph (lines 633-641) uses fairly definitive language that is unsupported by their data. For example in lines 637 and 638 they state “We found that these differences are often due to the specific TSS motif.” Simply, using ‘due to’ implies a causative relationship between the promoter sequences and m6Am has been demonstrated. The authors do not show causation, rather they demonstrate a correlation between the promoter sequences and an m6Am TSN. Finally, despite claiming a causal relationship, the authors do not put forth any conceptual framework or possible mechanism to explain the link between the promoter sequences and transcripts initiating with an m6Am.

      (3) The authors need to soften the language concerning these data and their interpretation to reflect the correlative nature of the data presented to link m6Am and transcription initiation.

      For (2) and (3): We have softened the language in the revised manuscript. Specifically, for lines 633-641 in the original manuscript, we have changed “are often due to” to “are often related to” in the revised manuscript, which claims a correlation rather than causation.

      Reviewer #3 (Public review):

      Summary:

      m6Am is an abundant mRNA modification present on the TSN. Unlike the structurally similar and abundant internal mRNA modification m6A, m6Am’s function has been controversial. One way to resolve controversies surrounding mRNA modification functions has been to develop new ways to better profile said mRNA modification. Here, Liu et al. developed a new method (based on GLORI-seq for m6A-sequencing), for antibody-independent sequencing of m6Am (CROWN-seq). Using appropriate spike-in controls and knockout cell lines, Liu et al. clearly demonstrated CROWN-seq’s precision and quantitative accuracy for profiling transcriptome-wide m6Am. Subsequently, the authors used CROWN-seq to greatly expand the number of known m6Am sites in various cell lines and also determine m6Am stoichiometry to generally be high for most genes. CROWN-seq identified gene promoter motifs that correlate best with high stoichiometry m6Am sites, thereby identifying new determinants of m6Am stoichiometry. CROWN-seq also helped reveal that m6Am does not regulate mRNA stability or translation (as opposed to past reported functions). Rather, m6Am stoichiometry correlates well with transcription levels. Finally, Liu et al. reaffirmed that FTO mainly demethylates m6Am, not of mRNA but of snRNAs and snoRNAs.

      Strengths:

      This is a well-written manuscript that describes and validates a new m6Am-sequencing method: CROWN-seq as the first m6Am-sequencing method that can both quantify m6Am stoichiometry and profile m6Am at single-base resolution. These advantages facilitated Liu et al. to uncover new potential findings related to m6Am regulation and function. I am confident that CROWN-seq will likely be the gold standard for m6Am-sequencing henceforth.

      Weaknesses:

      Though the authors have uncovered a potentially new function for m6Am, they need to be clear that without identifying a mechanism, their data might only be demonstrating a correlation between the presence of m6Am and transcriptional regulation rather than causality.

      We thank the reviewer for the very positive assessment of the CROWN-seq method. We have softened the language concerning the correlation between m<sup>6</sup>Am and transcription regulation.

      Reviewer recommendations:

      We thank the reviewers for their constructive suggestions. In the revised manuscript, we have corrected the errors and updated the requested discussions and figures.

      Reviewer #1 (Recommendations for the authors):

      (1) The prior work from the research group, "Reversible methylation of m6Am in the 5′ cap controls mRNA stability" (PMID: 28002401), should be cited, even if the current findings differ from earlier conclusions, particularly in line 58 and the section titled "m6Am does not substantially influence mRNA stability or translation".

      We thank the reviewer for this comment. We have added the citation.

      (2) I wonder why the authors chose to convert A to I before capping and recapping, as RNA fragmentation caused by chemical treatment may introduce noise into these processes.

      We thank the reviewer for this comment. This is a very good point. We have indeed considered this alternative protocol. There are two concerns in performing decapping-and-recapping before A-to-I conversion: (1) it is unclear whether the 3’-desthiobiotin, which is essential for the 5’ end enrichment, is stable or not during the harsh A-to-I conversion; (2) performing decapping-and-recapping first requires more enzyme and 3’-desthiobiotin-GTP, which are the major cost of the library preparation. This is because the input of CROWN-seq (~1 μg mRNA) is much higher than that in ReCappable-seq (~5 μg total RNA or ~250 ng mRNA). In the current protocol, many 5’ ends are highly fragmented and therefore are lost during the A-to-I conversion. As a result, less enzyme and 3’-desthiobiotin-GTP are needed.

      (3) During CROWN-seq benchmarking, the authors found that 93% of reads mapped to transcription start sites, implying a 7% noise level with a spike-in probe. This noise could lead to false positives in TSN assignments in real samples. It appears that additional filters (e.g., a known TSS within 100 nt) were applied to mitigate false positives. If so, I recommend that the authors clarify these filters in the main text.

      We thank the reviewer for this comment. We think that the spike-in probes might lead to an underestimation of the accuracy of TSN mapping. The spike-in probes are made by in vitro transcription with m<sup>7</sup>Gpppm<sup>6</sup>AmG or m<sup>7</sup>GpppAmG analogs. We found that the in vitro transcription exhibits a small amount of non-specific initiation, which leads to spike-in probes with 5’ ends that are not precisely aligned with the desired TSS. To better illustrate the mapping accuracy of CROWN-seq, we provided Figure 2H, which compares the non-conversion rates of newly found A-TSNs between wild-type and PCIF1 knockout cells. If the newly found A-TSNs are real, they should show high non-conversion rates in wild-type cells (i.e., high m<sup>6</sup>Am) and almost zero non-conversion rates (i.e., Am) in PCIF1 knockout cells. As expected, most of the newly found A-TSNs are true A-TSNs since they are m<sup>6</sup>Am in wild-type and Am in PCIF1 knockout. Thus, we think that CROWN-seq is very precise in TSS mapping. We have clarified this in the Discussion.
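
      The wild-type versus knockout check described above can be illustrated with a minimal sketch. It assumes that an unconverted A-TSN read is sequenced as A and a converted one as G (inosine is read as G). The function names and the two thresholds are hypothetical and are not taken from the manuscript.

```python
def non_conversion_rate(unconverted_a: int, converted_g: int) -> float:
    """Fraction of reads at an A-TSN that still read as A after A-to-I conversion."""
    total = unconverted_a + converted_g
    if total == 0:
        raise ValueError("no read coverage at this site")
    return unconverted_a / total


def consistent_with_true_a_tsn(wt_rate: float, ko_rate: float,
                               wt_min: float = 0.5, ko_max: float = 0.1) -> bool:
    """Hypothetical rule: high non-conversion in wild-type, near zero in PCIF1 knockout.

    wt_min and ko_max are illustrative cut-offs, not values used by the authors.
    """
    return wt_rate >= wt_min and ko_rate <= ko_max


# Illustrative read counts for one site (made-up numbers).
wt = non_conversion_rate(unconverted_a=90, converted_g=10)
ko = non_conversion_rate(unconverted_a=2, converted_g=98)
site_ok = consistent_with_true_a_tsn(wt, ko)
```

      A site that kept a high non-conversion rate in the knockout would fail this rule, which corresponds to the outlier behaviour a reader might want to inspect separately.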

      (4) I wonder if PCIF1 knockout affects TSN choice and abundance. If not, this data should be presented. If so, how are these changes accounted for in Figure 2H and Figure S5?

      We thank the reviewer for this comment. PCIF1 KO does not substantially affect TSN choice. Here we calculated the correlation of relative TSN expression within genes between wild-type and PCIF1 KO cells (shown using Pearson’s r). Most genes have similar TSN choices (i.e., high Pearson’s r) in wild-type and PCIF1 KO cells. Thus, PCIF1 KO does not alter global TSN expression.

      Author response image 1.
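
      A per-gene comparison of this kind could be computed as in the sketch below. This is only an illustration under stated assumptions: the function name and the read counts are hypothetical, and the normalization of counts to within-gene fractions is our reading of "relative TSN expression within genes".

```python
import numpy as np


def tsn_usage_correlation(wt_counts, ko_counts) -> float:
    """Pearson's r between within-gene TSN usage fractions in two conditions.

    Both inputs list read counts for the same TSNs of one gene, in the same order.
    Returns NaN if usage is identical across all TSNs in either condition.
    """
    wt = np.asarray(wt_counts, dtype=float)
    ko = np.asarray(ko_counts, dtype=float)
    if wt.shape != ko.shape or wt.size < 2:
        raise ValueError("need the same >=2 TSNs in both conditions")
    wt_frac = wt / wt.sum()  # relative usage of each TSN in wild-type
    ko_frac = ko / ko.sum()  # relative usage of each TSN in knockout
    return float(np.corrcoef(wt_frac, ko_frac)[0, 1])


# Made-up counts for one gene with three TSNs.
r_similar = tsn_usage_correlation([120, 30, 10], [90, 40, 10])
r_switched = tsn_usage_correlation([120, 30, 10], [10, 30, 120])
```

      A gene whose dominant TSN is unchanged gives r close to 1, whereas a gene that switches its dominant TSN gives a low or negative r.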

      (5) The manuscript refers to Am as a rare modification in mRNA (e.g., introduction lines 101-102; discussion lines 574, 608; and possibly other locations) without specifying that this only applies to transcription start sites. As this study does not cover entire mRNA sequences, these statements may be misleading.

      We thank the reviewer for this comment. We have clarified it.

      Reviewer #2 (Recommendations for the authors):

      (1) On line 122, the authors state that: "On average, a gene uses 9.5±9 (mean and s.d., hereafter) TSNs (Figure 1A)." However, they do not discuss the dispersion apparent in the TSNs they observed. Figure panels 1A, B, and S1A, B show a range of 120 bases or less. What is the predominant range of distances between annotated TSNs and the newly identified ones?

      1a) For example, what percentage of new TSNs fall within 20? 50? 75? bases of the annotated sites? Additional text describing the distribution of these TSNs would help readers better understand the diversity inherent in these novel 5' RNA ends. Notably, this additional text likely is best placed in the CROWN-Seq section related to Figure 2 or S2.

      We thank the reviewer for this comment. We have updated Figure S2 to describe the newly found TSSs. Depending on the coverage in CROWN-seq, the TSSs with higher coverage tend to overlap with or locate proximally to known TSSs. In contrast, the TSSs with low coverage tend to be located further away from annotated TSSs.

      1b) The alternate TSNs can have effects on splicing patterns and isoform identity. Providing a few sentences to explain how regularly this occurs would be helpful.

      We thank the reviewer for this comment. It is a very interesting point. Different TSNs can indeed have different splicing patterns. Although the discovery of splicing patterns regulated by TSNs is out of the scope of this study, we have discussed this possibility in the revised Discussion section.

      (2) On Lines 241 and 242, the authors mentioned that 1284 sites were excluded from the analysis based on low (under 20-explained in the figure legend) read count, distance from TSS, or false negatives (which are not explained). Although I agree that the authors are justified in setting these reads aside, the information could be useful to readers willing to perform follow-up work if their mRNAs of interest were included in these 1284 sites.

      2a) An annotation of all of these sites (broken down by category, i.e. the 811, the 343, and the 130) as a supplementary table should be provided.

      We thank the reviewer for this comment. We have added the categories to the revised Table S1.

      (3) Although I have marked several typos/grammar mistakes in several parts of this review, others exist elsewhere in the text and should be corrected.

      We thank the reviewer for this comment. We have corrected them.

      (4) In lines 122 and 123 the authors say "Only ~9% of genes contain a single TSN (Figure 1A)." However, their figure shows 81% with a single TSN. Why is there a 10% discrepancy?

      We thank the reviewer for this comment. We have corrected the plot in Figure 1A, to match the description.

      (5) The first Tab of Table S2 is labeled 'Legend', but is blank. Is this intentional?

      We thank the reviewer for this comment. We have updated the table legends.

      (6) On lines 70 and 76 of the supplementary figure file pertaining to Figure S2, the legend labels for Figure S2E and S2F are not accurate, they need to be changed to G and H.

      (7) In Figure 4A 'percentile' is misspelled.

      (8) The color-coding legend for the 4 bases is missing from (and should be added to) Figure S4A.

      (9) On Lines 984, 1163, and 1194 the '2s' should be properly sub-scripted where appropriate.

      For (6) to (9): We thank the reviewer for finding these issues. We have now corrected them.

      Reviewer #3 (Recommendations for the authors):

      (1) The authors should discuss if their results can definitively distinguish between the SSCA+1GC motif promoting m6Am that, in turn, promotes transcription, versus the SCA+1GC motif promoting m6Am but also separately promoting transcription in a m6Am-independent manner. The authors should also discuss this in light of recent findings by An et al. (2024 Mol. Cell), which support the former conclusion.

      We thank the reviewer for the suggestion. We have now updated the Discussion to describe how our findings and those of An et al. support each other.

      (2) Given that the authors showed m6Am promotes gene expression (Figure 5) but does not affect mRNA stability (Fig. S5), logic dictates that m6Am must regulate mRNA transcription. However, the authors should explain why this regulation focuses on the initiation aspect of transcription rather than other aspects of transcription, e.g., premature termination, pause release, and elongation.

      We thank the reviewer for this comment. In this study, we did not profile the 3’ ends of nascent RNAs and thus we can only make conclusions about the overall transcription process but not a specific aspect. We have updated the revised Discussion section to mention that An et al. discovered that m<sup>6</sup>Am can sequester PCF11 and thus promote transcription, and therefore some of the effects we see could be related to differential premature termination.

      (3) Authors should add alternative versions of Figure 1D but with 3 colours corresponding to Am vs. m6Am vs. Cm/Gm/Um for all the cells they performed CROWN-seq on.

      We thank the reviewer for this comment. We have updated Figure S5 as the corresponding figure showing the fraction of Am vs. m6Am vs. Cm/Gm/Um.

      (4) Figure 2H (left): Please comment on the few outliers that still show high non-conversion even in PCIF1-KO cells.

      We thank the reviewer for this comment. We have discussed the outliers in the main text. These outliers can be found in the revised Table S3.

      (5) Line 254: "Second, if these sites were RNA fragments they would not contain m6Am." is missing a comma.

      (6) S2G and S2H labelling in Figure S2 legends is wrong.

      For (5) and (6): We thank the reviewer for these comments. We have corrected them.

      (7) Figure 3D: Many gene names are printed multiple times (e.g. ACTB is printed 5 times). Is this correct; is each dot representing 1 cell line?

      We thank the reviewer for this comment. These gene names represent different transcription-start nucleotides. We now clarify that each instance refers to a different start site.

      (8) S5A-C: Even if there's no substantial difference, authors should still display the Student's T-test P-values as they did for S5D-G.

      We thank the reviewer for this comment. We have updated the P-values.

      (9) Figure 5C and S5E: Why are the authors not showing the respective analysis for C-TSN and U-TSN genes?

      We thank the reviewer for this comment. Most mRNAs start with A or G. We therefore selected G-TSN as the control. Unlike G-TSNs which occur in diverse sequence and promoter contexts, C-TSNs and U-TSNs are unusual. Genes that mainly use C-TSNs and U-TSNs are the so-called “5’ TOP (Terminal OligoPyrimidine)” genes. The 5’ TOP genes are mostly genes related to translation and metabolism, and thus their expressions reflect the homeostasis of cell metabolism. Thus, we were concerned that any differential expression of the C-TSN and U-TSN genes between wild-type and PCIF1 knockout cells might reflect specific effects on TOP transcriptional regulation rather than the general effects of PCIF1 on transcription.

      (10) Line 82, 470, 506, 676: The authors should also cite Koh et al (2019 Nat. Comm.) in these lines that describe how snRNAs can also be m6Am-methylated and how FTO targets these same snRNAs for demethylation.

      We thank the reviewer for this comment. We have updated the citation.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public Review): 

      Summary: 

      This manuscript presents a method to infer causality between two genes (and potentially proteins or other molecules) based on the non-genetic fluctuations among cells using a version of the dual-reporter assay as a causal control, where one half of the dual-reporter pair is causally decoupled, as it is inactive. The authors propose a statistical invariant identity to formalize this idea. 

      We thank the referee for this summary of our work. 

      Strengths: 

      The paper outlines a theoretical formalism, which, if experimentally used, can be useful in causal network inference, which is a great need in the study of biological systems. 

      We thank the referee for highlighting the potential value of our proposed method.

      Weaknesses: 

      The practical utility of this method may not be straightforward and potentially be quite difficult to execute. Additionally, further investigations are needed to provide evidence of the broad applicability of the method to naturally occurring systems and its scalability beyond the simple circuit in which it is experimentally demonstrated. 

      We agree with these two points and have rewritten the manuscript, in particular highlighting the considerable future work that remains to be done to establish the broad applicability and scalability of our method.

      In the rewritten manuscript we explicitly spell out potential practical issues and we explicitly state that our presented proof-of-principle feasibility study does not guarantee that our method will successfully work in systems beyond the narrowly sampled test circuits. This helps readers to clearly distinguish what we claim to have done from what remains to be done. The rewritten parts and additional clarifications are:

      Abstract (p. 1), Introduction (p. 1-2), Sec. “Proposed additional tests” (p. 8), and “Limitations of this study” (p. 10).

      Reviewer #2 (Public Review): 

      Summary: 

      This paper describes a new approach to detecting directed causal interactions between two genes without directly perturbing either gene. To check whether gene X influences gene Z, a reporter gene (Y) is engineered into the cell in such a way that (1) Y is under the same transcriptional control as X, and (2) Y does not influence Z. Then, under the null hypothesis that X does not affect Z, the authors derive an equation that describes the relationship between the covariance of X and Z and the covariance of Y and Z. Violation of this relationship can then be used to detect causality. 

      The authors benchmark their approach experimentally in several synthetic circuits. In four positive control circuits, X is a TetR-YFP fusion protein that represses Z, which is an RFP reporter. The proposed approach detected the repression interaction in two or three of the positive control circuits. The authors constructed sixteen negative control circuit designs in which X was again TetR-YFP, but where Z was either a constitutively expressed reporter or simply the cellular growth rate. The proposed method detected a causal effect in one of the eight negative controls, which the authors argue is not a false positive, but due to an unexpected causal effect. Overall, the data support the practical usefulness of the proposed approach. 
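
      The logic of the test can be illustrated with a toy simulation. The sketch below is a static caricature, not the continuous-time Markov model analyzed in the paper, and it assumes that the invariant takes the form of equal normalized covariances, Cov(x,z)/(<x><z>) = Cov(y,z)/(<y><z>). The function names, distributions, and parameter values are all hypothetical.

```python
import numpy as np


def normalized_cov(a, b) -> float:
    """Covariance of a and b divided by the product of their means."""
    return float(np.cov(a, b)[0, 1] / (a.mean() * b.mean()))


def simulate(n_cells: int, coupling: float, seed: int = 0):
    """Draw snapshot abundances (x, y, z) for n_cells cells.

    X and Y are identically regulated by a shared upstream factor but carry
    independent Poisson noise. Z depends on the shared factor and, when
    coupling > 0, additionally on X itself (a causal X -> Z link).
    """
    rng = np.random.default_rng(seed)
    shared = rng.gamma(shape=4.0, scale=5.0, size=n_cells)  # common upstream variability
    x = rng.poisson(shared)
    y = rng.poisson(shared)  # passive reporter: same regulation, no effect on Z
    z = rng.poisson(0.5 * shared + coupling * x)
    return x, y, z


# Null case: X does not affect Z, so the two normalized covariances agree.
x0, y0, z0 = simulate(200_000, coupling=0.0, seed=1)
gap_null = normalized_cov(x0, z0) - normalized_cov(y0, z0)

# Causal case: X feeds into Z, so Cov(x, z) picks up the intrinsic noise of X.
x1, y1, z1 = simulate(200_000, coupling=0.5, seed=1)
gap_causal = normalized_cov(x1, z1) - normalized_cov(y1, z1)
```

      In this caricature the gap arises only because Z inherits fluctuations that are private to X. The reporter Y shares the upstream variability but not those private fluctuations, which is why it can serve as a no-causality control.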

      We thank the referee for their summary of our work.

      Strengths: 

      The idea of a "no-causality control" in the context of detecting directed gene interactions is a valuable conceptual advance that could potentially see play in a variety of settings where perturbation-based causality detection experiments are made difficult by practical considerations. 

      By proving their mathematical result in the context of a continuous-time Markov chain, the authors use a more realistic model of the cell than, for instance, a set of deterministic ordinary differential equations. 

      We thank the referee for summarizing the value of our work. 

      Caveats: 

      The term "causally" is used in the main-text statement of the central theorem (Eq 2) without a definition of this term. This makes it difficult to fully understand the statement of the paper's central theorem without diving into the supplement.  

      We thank the referee for this suggestion. In the revised manuscript we now define causal effects right before the statement of the main theorem of the main text (p. 2). We have also added a definition of the causal network arrows in the caption of Fig. 1 to help readers better understand our central claim.

      The basic argument of theorem 1 appears to rely on establishing that x(t) and y(t) are independent of their initial conditions. Yet, there appear to be some scenarios where this property breaks down: 

      (1) Theorem 1 does not seem to hold in the edge case where R=beta=W=0, meaning that the components of interest do not vary with time, or perhaps vary in time only due to measurement noise. In this case x(t), y(t), and z(t) depend on x(0), y(0), and z(0). Since the distributions of x(0), y(0), and z(0) are unspecified, a counterexample to the theorem may be readily constructed by manipulating the covariance matrix of x(0), y(0), and z(0). 

      (2) A similar problem may occur when transition probabilities decay with time. For example, suppose that again R=0 and X are degraded by a protease (B), but this protease is subject to its own first-order degradation. The deterministic version of this situation can be written, for example, dx/dt=-bx and db/dt=-b. In this system, x(t) approaches x(0)exp(-b(0)) for large t. Thus, as above, x(t) depends on x(0). If similar dynamics apply to the Y and Z genes, we can make all genes depend on their initial conditions, thus producing a pathology analogous to the above example. 

      The reviewer does not know when such examples may occur in (bio)physical systems. Nevertheless, since one of the advantages of mathematics is the ability to correctly identify the domain of validity for a claim, the present work would be strengthened by "building a fence" around these edge cases, either by identifying the comprehensive set of such edge cases and explicitly prohibiting them in a stated assumption set, or by pointing out how the existing assumptions already exclude them.  

      We thank the referee for bringing to our attention these edge cases that indeed violate our theorem as stated. In the revised manuscript we have “built a fence” around these edge cases by adding two requirements to the premise of our theorem: First, we have added the requirement that the degradation rate does not decay to zero for any possible realization. That is, if beta(t) is the degradation rate of X and Y for a particular cell over time, then taking the time average of beta(t) over all time must be non-zero. Second, we have added the requirement that the system has evolved for enough time such that the dual reporter averages <x> and <y>, along with the covariances Cov(x, z_{k}) and Cov(y, z_{k}) have reached a time-independent stationary state.  

      With these requirements, no assumptions need to be made about the initial conditions of the system, because any differences in the initial conditions will decay away as the system reaches stationarity. For instance, the referee’s example (1) is not possible with these requirements because beta(t) can no longer remain zero. Additionally, example (2) is no longer possible because the time average of the degradation rate would be zero, which is no longer allowed (i.e., we would have that integral from 0 to T of b(0)exp(-t)/T dt =  0 when T goes to infinity). 
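
      For completeness, the referee's deterministic example (2) can be worked out explicitly. The short derivation below uses only the two equations stated in that example, with the degradation rate written as b(t), and shows both why x(t) retains a memory of x(0) and why the time-averaged degradation rate vanishes.

```latex
\begin{align}
\frac{db}{dt} &= -b
  &&\Rightarrow\quad b(t) = b(0)\,e^{-t}, \\
\frac{dx}{dt} &= -b(t)\,x
  &&\Rightarrow\quad x(t) = x(0)\exp\!\Big(-\int_0^t b(s)\,ds\Big)
     = x(0)\exp\!\big(-b(0)\,(1-e^{-t})\big)
     \;\xrightarrow[t\to\infty]{}\; x(0)\,e^{-b(0)}, \\
\lim_{T\to\infty}\frac{1}{T}\int_0^T b(t)\,dt
  &= \lim_{T\to\infty}\frac{b(0)\,\big(1-e^{-T}\big)}{T} = 0 .
\end{align}
```

      The last line is the quantity that the added requirement forces to be non-zero, which is why this example falls outside the revised premise.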

      Note that adding the condition that degradation cannot decay to exactly zero does not reduce the biological applicability of the theorem. But as the referee correctly points out, any mathematical theorem needs to be accurately stated and stand on its own regardless of whether biological systems could realize particular edge cases. Also note that the requirement that the cellular ensemble has reached a time-independent distribution of cell-to-cell variability can be (approximately) experimentally verified by taking snapshots of ensemble variability at two sufficiently separated moments in time. 

      In response to the referee’s comment, we have added the above requirements when stating the theorem in the main text. We have also added the requirement of non-decay of the degradation rate to the definition of the system in SI Sec. 4, along with the stationarity requirement in theorem 1 in SI Sec 5. We have also added mathematical details to the proof of the invariant in SI Sec 5.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      This manuscript presents a method to infer causality between two genes (and potentially proteins or other molecules) based on the non-genetic fluctuations among cells using a version of the dual-reporter assay as a causal control, where one half of the dual-reporter pair is causally decoupled, as it is inactive. The authors propose a statistical invariant identity to formalize this idea. They propose and experimentally demonstrate the utility of this idea with a synthetic reporter system in bacteria. 

      The paper is well written and clearly outlines the principle and the mathematical invariant relationship, both in giving the reader an intuitive understanding of why the relationship must be true and in the mathematical derivation of the proof of Theorem 1. 

      The paper outlines a theoretical formalism, which, if experimentally used, can be useful in causal network inference, which is a great need in the study of biological systems. However, the practical utility of this method may not be straightforward and potentially be quite difficult to execute. We think this work could offer a platform to advance the field of network inference, but would encourage the authors to address the following comments. 

      We thank the reviewer for the positive comments on readability, summarizing the value of our work, as well as the critical comments below that helped us improve the manuscript.

      Major comments: 

      (1) Although the invariant identity seems theoretically sound, the data from synthetic engineered circuits in this manuscript do not support that the invariant holds for natural causal relations between genes in wild-type cells. In all the positive control synthetic circuits (numbers 1 to 4) the target gene Z, i.e. RFP, was always on the plasmid, and in circuit #4 there was an additional endogenous copy. The authors recapitulate the X-to-Z causality in circuits 1, 2, and 3 but not 4. Ultimately, the utility of this method lies in the ability to capture causality from endogenous correlations; this observation suggests that the method might not be useful for that task. 

      We thank the referee for their careful reading of our synthetic circuits and sincerely apologize for an error in our description of circuit #4 in the schematic of Table S2 of the supplement. We incorrectly stated that this circuit contained a chromosomally expressed RFP. In fact, in circuit #4 RFP was only on the plasmid just like in the circuits #1-3. We have corrected the schematic in the revised manuscript and have verified that the other circuits are correctly depicted.

      In the revised manuscript, we now explicitly spell out that all our “positive control” test cases had the genes of interest expressed on plasmids, and that we have not shown that our method successfully detected causal interactions in a chromosomally encoded gene regulatory circuit, see additional statements in Sec. “Causally connected genes that break the invariant” on p. 6. 

      In the absence of any explicit experimental evidence, it is then important to consider whether chromosomally encoded circuits are expected to cause problems for our method which is based on a fluctuation test. Due to plasmid copy number fluctuations, X and Z will fluctuate significantly more when expressed on plasmids than when expressed chromosomally. However, because this additional variability is shared between X and Z it does not help our analysis which relies on stochastic differences in X and Z expression due to “intrinsic noise” effects downstream of copy number fluctuations. The additional “extrinsic noise” fluctuations due to plasmid copy number variability would wash out violations of Eq. (2) rather than amplify them. If anything, we thus expect our test cases to have been harder to analyze than endogenous fluctuations. This theoretical expectation is indeed borne out by numerical test cases presented in the revised supplement where plasmid copy fluctuations severely reduced the violations of Eq. 2, see new additional SI Sec. 15. 

      Additionally, the case of the outlier circuit (number 12) suggests that exogenous expression of certain genes may lead to an imbalance of natural stoichiometry and lead to indirect effects on target genes which can be misinterpreted as causal relations. Knocking out the endogenous copy may potentially ameliorate this issue but that remains to be tested. 

      We agree with the referee that the expression of exogenous genetic reporters can potentially affect cellular physiology and lead to undesired effects. In the revised manuscript we now explicitly spell out that the metabolic burden or the phototoxicity of introducing fluorescent proteins could in principle cause artificial interactions that do not correspond to the natural gene regulatory network, see Sec. “Proposed additional tests” on p. 8.

      However, it is also important to consider that the test circuit #12 represents a synthetic circuit with genes that were expressed at extremely high levels (discussed in 3rd paragraph of Sec. “Evidence that RpoS mediated stress response affected cellular growth in the outlier circuit”, p. 8), which led to the presumed cellular burden. Arguably, natural systems would not typically exhibit such high expression levels, but importantly even if they did, our method does not necessarily rely on fluorescently tagged proteins but can, in principle, also be applied to other methods such as transcript counting through sequencing or in-situ hybridization of fluorescent probes.  

      Ultimately, the value of this manuscript will be greatly elevated if the authors successfully demonstrate the recapitulation of some known naturally existing causal and non-causal relations. For this, the authors can choose any endogenous gene Z that is causally controlled by gene X. The gene X can be on the exogenous plasmid along with the reporter and the shared promoter. Same for another gene Z' which is not causally controlled by gene X. Potentially a knockout of endogenous X may be required but it might depend  on what genes are chosen. 

      If the authors think the above experiments are outside the scope of this manuscript, they should at least address these issues and comment on how this method could be effectively used by other labs to deduce causal relations between their favorite genes. 

      Because a full analysis of naturally occurring gene interactions was beyond the scope of our work, we agree with the referee’s suggestion to add a section to discuss the limitations of our experimental results. In the revised manuscript we reiterate that additional investigations are needed to show that the method works to detect causal interactions between endogenous genes, see Abstract (p. 1), Introduction (p. 1-2), Sec. “Proposed additional tests” (p. 8), and “Limitations of this study” (p. 9). In the original manuscript we explicitly spelled out how other researchers can potentially carry out this further work in the subsections titled “Transcriptional dual reporters” (p. 3) and “Translational dual reporters” (p. 3). In the revised manuscript, we have added a section “Proposed additional tests” (p. 8) in which we propose an experiment analogous to the one proposed by the referee above, involving an endogenous gene circuit found in E. coli, as an example to test our invariant. 

      (2) For a theoretical exposition that is convincing, we suggest the authors simulate a larger network (for instance, a network with >10 nodes), like the one shown schematically in Figure 1, and demonstrate that the invariant relationship holds for the causally disconnected entities, but is violated for the causally related entities. It would also be interesting to see if any quantification for the causal distance between "X" and the different causally related entities could be inferred.

      We thank the referee for this suggestion. We have added SI Sec. 14 where we present simulation results of a larger network with 10 nodes. We find that all of the components not affected by X satisfy Eq. (2) as they must. However, it is important to consider that we have analytically proven the invariant of Eq. (2) for all possible systems. It provably applies equally to networks with 5, 100, or 10,000 components. The main purpose of the simulations presented in Fig. (2) is to illustrate our results and to show that correlation coefficients do not satisfy such an invariant. However, they are not used as a proof of our mathematical statements.

      We thank the referee for the interesting suggestion of quantifying a “causal distance”. Unfortunately, the degree to which Eq. (2) is violated cannot be directly equated with an absolute measure for the “causal distance” of an interaction. This is because both the strength of the interaction and the size of the stochastic fluctuations in X affect the degree to which Eq. (2) is violated. The distance from the line should thus be interpreted as a lower bound on the causal effect from X to Z because we do not know the magnitude of stochastic effects inherent to the expression of the dual reporters X and Y. While the dual reporters X and Y are identically regulated, they will differ due to stochastic fluctuations. Propagation of these fluctuations from X to Z is what creates an asymmetry between the normalized covariances. In the most extreme example, if X and Y do not exhibit any stochastic fluctuations we have x(t)=y(t) for all times and Eq. (2) will not be violated even in the presence of a strong causal link from X to Z.

      However, it might be possible to infer a relative causal distance to compare causal interactions within cells.

      That is, in a given network, the normalized covariances between X, Y and two other components of interest Z1, Z2 that are affected by X can be compared. If the asymmetry between (η𝑥𝑧1 , η𝑦𝑧1) is larger than the asymmetry between (η𝑥𝑧2 , η𝑦𝑧2) , then we might be able to conclude that X affects Z1 with a stronger interaction than the interaction from X to Z2, because here the intrinsic fluctuations in X are the same in both cases. 

      In response to the referee’s comment and to test the idea of a relative causal distance, we have simulated a larger network made of 10 components. In this network, X affects a cascade of components called Z8, Z9, and Z10, see the additional SI Sec. 14. Here the idea of a causal distance can be defined as the distance down the cascade: Z8 is closest to X and so has the largest causal strength, whereas Z10 has the weakest. Indeed, simulating this system we find that the asymmetry between η𝑥𝑧8 and η𝑦𝑧8 is the largest whereas that between  η𝑥𝑧10 and η𝑦𝑧10 the smallest. We also find that all of the components not affected by X have normalized covariances that satisfy Eq. (2). This result suggests that the relative causal distance or strength in a network could potentially be estimated from the degree of the violations of Eq. (2). 

      However, we note that these are preliminary results. In the case of the specific regulatory cascade now considered in SI Sec. 14, the idea of a causal distance can be well defined. Once feedback is introduced into the system, this definition may no longer make sense. For instance, consider the same network that we simulate in SI Sec. 14, but where the most downstream component in the cascade, Z10, feeds back and affects X and Y. In such a circuit it is unclear whether Z8 or Z10 is “causally closer” to X. A more thorough theoretical analysis, equipped with a more universal quantitative definition for causal distance or strength, would be needed to deduce what information can be inferred from the relative distances in the violations of Eq. (2). While this defines an interesting research question, answering it goes beyond the scope of the current manuscript. 
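
      The comparison of asymmetries described above can be sketched numerically. The following is a minimal illustration only, not the simulation of SI Sec. 14: it assumes the normalized covariance is defined as η_ab = Cov(a,b)/(⟨a⟩⟨b⟩), and it uses a hypothetical static toy model (a shared upstream signal driving X and Y, with Z either reading the shared signal or reading X itself). All function and variable names are illustrative assumptions.

```python
import numpy as np

def normalized_covariance(a, b):
    """Assumed definition of eta_ab: Cov(a, b) / (<a> <b>)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.mean((a - a.mean()) * (b - b.mean())) / (a.mean() * b.mean())

def asymmetry(x, y, z):
    """eta_xz - eta_yz; close to zero when the invariant is satisfied."""
    return normalized_covariance(x, z) - normalized_covariance(y, z)

# Hypothetical toy data (a static snapshot, not a dynamical simulation).
rng = np.random.default_rng(0)
n = 200_000
shared = rng.gamma(shape=20.0, scale=5.0, size=n)  # common upstream signal
x = rng.poisson(shared)              # gene of interest X
y = rng.poisson(shared)              # passive reporter Y, same control as X
z_unaffected = rng.poisson(0.5 * shared)  # Z driven by the shared signal only
z_affected = rng.poisson(0.5 * x)         # Z driven by X itself

asym_unaffected = asymmetry(x, y, z_unaffected)
asym_affected = asymmetry(x, y, z_affected)
print(f"asymmetry, Z not affected by X: {asym_unaffected:.4f}")
print(f"asymmetry, Z affected by X:     {asym_affected:.4f}")
```

      In this toy model the asymmetry for the unaffected component is consistent with zero up to sampling error, whereas the component that reads X directly shows a positive asymmetry because the fluctuations specific to X propagate to it. Comparing such asymmetries across several downstream components is the kind of relative comparison discussed above, subject to the same caveats about feedback.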

      Minor comments: 

      - The method relies on the gene X and the reporter Y having the same control which would result in similar dynamics. The authors do not quantitatively compare the YFP and CFP expression if this indeed holds for the synthetic circuits. It would be useful to know how much deviation between the two can be tolerated while not affecting the outcome. 

      We thank the referee for their comment. The invariant of Eq. (2) is indeed guaranteed to hold only when the transcription rate of Y is proportional to that of X. How much levels of X and Y covary depends on the stochastic effects intrinsic to the expression of the dual reporters as well as how similar the transcriptional control of X and Y is. The stochastic difference between X and Y is exactly what we exploit.

      However, in the limit of high YFP and CFP levels, intrinsic fluctuations that cause stochastic expression differences between X and Y become negligible and we can directly infer from time traces whether they are indeed tightly co-regulated. Below, we show two single-cell traces taken with our experimental setup in which the YFP and CFP fluorescence trajectories are almost exactly proportional. Both of these traces are from circuit #10 as defined in Table S4.

      Author response image 1.

      We chose the above traces because they showed the highest correlation between YFP and CFP levels. Other traces for lower expression levels have lower correlations due to effects of intrinsic noise (see Tables S2-S4). However, the existence of one trace in which YFP is almost perfectly proportional to CFP throughout can only occur if the YFP and CFP genes are under the same control. And, since the control of the YFP and CFP genes is identical in all of our synthetic circuits (with the same promoters and plasmid positions), these data strongly suggest that our dual reporters are tightly co-regulated in all the synthetic circuits. Moreover, the negative control experiments presented in Fig. 3E provide a natural consistency check that the YFP and CFP are under the same control and satisfy Eq. (1).

      We agree that it would be useful to know how much the X and Y production rates can differ for Eq. (2) to hold. Importantly, our proven theorem already allows for the rates to differ by an unspecified proportionality constant. In response to the referee’s comment we have derived a more general condition under which our approach holds. In the newly added SI Sec. 7 we prove that Eq. (2) holds also when rates differ as long as the difference is stochastic in nature with an average of zero. We also prove that Eq. (2) holds in the face of multiplicative noise that is independent of the X and Y production rates.

      However, the production rates of X and Y cannot differ in all ways. Some types of differences between the X and Y production rates can lead to deviations of Eq. (2) even when there is no causal interaction. To highlight this, we added the results of simulations of a toy model in which the X and Y production rates differ by an additive noise term that does not average to zero, see Fig. S19B of the newly added SI Sec. 7.

      - The invariant should potentially hold true for any biological species that are causally related e.g. protein-protein interactions. Also, this method could potentially find many applications in eukaryotic cells. Although it's outside the scope of current work to experimentally demonstrate such applications, the authors should comment on experimental strategies to apply this method to overcome potential pitfalls (e.g. presence of enhancers in eukaryotic cells). 

      We thank the referee for this suggestion. We agree that there are potential pitfalls that could come into effect when our proposed approach is applied on more complex systems such as eukaryotic gene expression. In response to the referee’s comment, we have added an explicit discussion of these potential pitfalls in the discussion section “Limitations of this study” (see p. 10). 

      In particular, in eukaryotes there are many genes in which promoter sequences may not be the sole factor determining transcription rates. Other factors that can be involved in gene regulation include the presence of enhancers, epigenetic modifications, and bursts in gene expression, to name a few. We thus propose a few strategies, which include positioning the passive reporter at a locus similar to that of the gene of interest, measuring the gene regulation activities of the gene of interest and its passive reporter using a separate method, and exploiting the invariant with a third gene, where it is known there is no causal interaction, as a consistency check. In addition, we include in the SI a new section SI Sec. 8 which shows that the invariant holds in the face of many types of bursty gene expression dynamics.

      However, the above is not a comprehensive list. Some of the issues the referee mentions are serious and may not be straightforward to overcome. We now spell this out explicitly in the revised manuscript (p. 10). 

      - In the legend of Fig. 1, the sentence "Data points here are for..." is missing a few words, or needs to be rephrased. 

      We thank the referee for this comment. We have rewritten the figure caption, which now reads “Data points are numerical simulations of specific example networks (see SI for details) to illustrate the analytically proven theorem of Eq. 2.”

      - Fig. 2 talks about the uncertainties associated with each point on the scatter plots. However, it is difficult to understand the quantification in such a plot. It would be great to have a plot quantifying the uncertainties in the invariant relation for the different topologies studied, specifically in order to understand if one topology is consistently deviating more from the x=y line than the other topologies studied here.  

      We thank the referee for this suggestion. In the supplement of the revised manuscript we have added supplemental Figs. S3, S4, and S5 to separately quantify the uncertainty of the different processes plotted in Fig. 2 and have added a new section (SI Sec. 11) to discuss the processes simulated in Fig. 2 in more detail. In short, each simulated process generated less than ~5% of outliers when considering 95% confidence intervals (with the max percentage deviation being 5.01% for process 5, see Fig. S5). These outliers were then simulated over a larger number of simulations to reduce the sampling error, which resulted in 0% of outliers (see Sec. “Confidence intervals for finite sampling error” in Materials and Methods, p. 11). Some simulated processes generated larger percentage errors in the normalized covariances than others, but this is expected as different processes have different dynamics which will result in different degrees of sampling of the underlying distributions.

      Note that the invariant of Eq. 2 is analytically proven for all tested topologies as none of the topologies include a causal effect from X to Z. Any deviation of the numerical data from the straight line prediction of Eq. 2 (right column in Fig. 2C) is due to the finite sampling of a stochastic process, in which the true covariance is estimated from the sample covariance. Any given parameter set was simulated several times, which allowed us to estimate the sampling error from differences between repeated samples. In the additional SI figures we now quantify this error for the different topologies.

      In addition to the above changes we want to highlight that the purpose of the simulations presented in Fig. (2) is not to prove our statements or explore the behavior of different topologies. The purpose of the data presented in the right column of Fig. 2C is to illustrate the theoretical invariant and act as a numerical sanity check of our analytically proven result. In contrast, the data in the left column of Fig 2C illustrates that the correlations do not satisfy an invariant like Eq. 2 which applies to covariances but not correlations.  

      - The legend for Fig. 3 seems to end abruptly. There likely needs to be more.  

      We thank the referee for catching this mistake. We have corrected the accidentally truncated figure caption of Fig. 3.

      - There is a typo in equation (5.3) on page 23 of supplementary material, there should be x instead of y in the degradation equation of x. 

      We thank the referee for catching this mistake which has been corrected in the revised manuscript.

      - In the supplemental material, to understand the unexpected novel discovery of causality, Figure S5 is presented. However, this doesn't give the context for other negative controls designed, and the effect of rfp dynamics (which can be seen in the plots both in the main paper and the supplement) in the growth rate of cells in those constructs. As a baseline, it would be nice to have those figures.  

      We thank the referee for this suggestion. We have now included representative RFP traces with the growth rates for other negative control circuits, see Fig. S10. In addition, we have now included the cross correlation functions between RFP and growth rate in these negative control circuits, see Fig. S10A. While in all cases, RFP and growth rate are negatively correlated, the outlier circuit exhibits the largest negative correlation.

      The suggested comparison of the referee thus highlights that – in isolation – a negative correlation between RFP and growth rate is only weak evidence for our hypothesized causal interaction because negative correlations can result from the effect of growth rate affecting volume dilution and thus RFP concentration. Crucially, we thus additionally considered the overall variability of growth rate and found the outlier circuit has the largest growth rate variability which is indicative of something that is affecting the growth rate of those cells, see Fig. S10B. To compare the magnitude of RFP variability against other strains requires constraining the comparison group to other synthetic circuits that have RFP located on the chromosome rather than a plasmid. This is why we compare the CV of the outlier with the CV of circuit #5, which corresponds to the “regular” repressilator (i.e., the outlier circuit without the endogenous lacI gene). As an additional comparison, we computed the CV for a strain of E. coli that does not contain a synthetic plasmid at all, but still contains the RFP gene on the chromosome. We find the CVs in the outlier circuit to be larger than in these two additional circuits, suggesting that the outlier circuit causes additional fluctuations in the RFP and growth rate. We now spell this out explicitly in the revised manuscript (see Sec. “Evidence that RpoS mediated stress response affected cellular growth in the outlier circuit”, p. 8).

      The referee is correct that the above arguments are only circumstantial evidence, but they do show that the data is consistent with a plausible explanation of the hypothesized causal interaction. Our main evidence for an RpoS mediated stress response that explains the deviations from Eq. 2 in the outlier circuit is the perturbation experiment in which the deviation disappears for the RpoS knockout strain. We now spell out this argument explicitly in the revised manuscript (see Sec. “Evidence that RpoS mediated stress response affected cellular growth in the outlier circuit”, p. 8).

      Reviewer #2 (Recommendations For The Authors): 

      The proof of theorem 1 relies on an earlier result, lemma 1. Lemma 1 only guarantees the existence of a "dummy" system that satisfies the separation requirement and preserves the dynamics of X and Y. However, in principle, it may be possible to maintain the dynamics of X and Y while still changing the relationship between Cov(X,Zk) and Cov(Y,Zk). This could occur if the dynamics of Zk differ in a particular way between the original system and the dummy system. So lemma 1 needs to be a little stronger- it needs  to mention that the dynamics of Zk are preserved, or something along these lines. The proof of lemma 1 appears to contain the necessary ingredients for what is actually needed, but this should be clarified. 

      We agree with the referee that this is an important distinction. Lemma 1 does in fact guarantee that any component Zk that is not affected by X and Y will have the same dynamics in the “dummy” system. However, as the referee points out, this is not stated in the lemma statement nor in the proof of the lemma. In response to the referee’s comment, we have made it clear in the lemma statement that the Zk dynamics are preserved in the “dummy” system, and we have also added details to the proof to show that this is the case, see Lemma 1 on p. 27 of the SI. 

      Readers who are familiar with chemical reaction diagrams, but not birth-death process diagrams may waste some time trying to interpret Equation 1 as a chemical reaction diagram with some sort of rate constant as a label on each arrow (I did this). It may be helpful to either provide a self-contained definition of the notation used, or mention a source where the necessary definitions can be found. 

      We agree with the referee. In the revised manuscript we have added a description of the notation used below Equation 1 of the main text, see p. 2. The notational overloading of the “arrow notation” is a perennial problem in the field and we thank the referee for reminding us of the need to clarify what the arrows mean in our diagrams.

      It would be helpful if the authors could propose a rule for deciding whether dependence is detected or not. As it stands presently, the output of the approach seems to be a chart like that in Figure 3D where you show eta_xz and eta_yz with confidence interval bars and the reader must visually assess whether the points more-or-less fall on the line of unity. It would be better to have some systematic procedure for making a "yes or no" call as to whether a causal link was detected or not. Having a systematic detection rule would allow you to make a call as to whether dependence in circuit 3 was detected or not. It would also allow you or a future effort to evaluate the true positive rate of the approach in simulated settings. 

      We thank the referee for this suggestion. In the revised manuscript we have added an explicit rule for detecting causality using the invariant of Eq. (2). Specifically, Eq. (2) can be re-written as r = 1 where r is the covariability ratio r = etaXZ/etaYZ. Given a 95% confidence interval for the experimentally determined covariability ratio r, we say that there is a causal interaction if the confidence interval does not overlap with the value r = 1.

      This corresponds to a null hypothesis test at the 2.5% significance level. The reason that it is at 2.5% significance and not 5% significance is as follows. Let’s say we measure a covariability ratio of r_m, and the 95% confidence interval is [r_m - e_m, r_m + e_m] for some error e_m. Without loss of generality, let’s say that r_m > 1 (the same applies if r_m < 1). This means that Prob(r < r_m - e_m) = 2.5% and Prob(r > r_m + e_m) = 2.5%, where r is the actual value of the covariability ratio. Under the null hypothesis that there is no causal interaction, we set r = 1. However, we now have Prob(1 > r_m + e_m) = 0, because we know that r_m > 1 and so we must have r_m + e_m > 1. The probability that the value of 1 falls outside the error bars is therefore 2.5% under the null hypothesis.

      This proposed rule is the same rule that we used to detect statistical outliers in our simulations, where we found a “false positive” rate of 2.3% over 6522 simulated systems due to statistical sampling error (as discussed in the Materials and Methods section). In response to the referee’s suggestion, we have added the section “A rule for detecting causality in the face of measurement uncertainty” (p. 4). We also apply the rule to the experimental data and find that the rule detects 2/4 causal interactions in Fig. 3D. We have clarified this in the Fig. 3D caption, in the main text, and we have added a figure in the SI (Fig. S2) where we apply the null hypothesis test on the measured covariability ratios. 

      Note, whether the third interaction is “detected” or not depends on the cut-off value used. We picked the most common 95% rule to be consistent with the traditional statistical approaches. With this rule one of the data points lies right at the cusp of detection, but ultimately falls into the “undetected” category if a strictly binary answer is sought under the above rule. 
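
      The detection rule can be sketched as a short procedure. This is a hedged illustration rather than the analysis pipeline of the manuscript: it assumes the normalized covariance η_ab = Cov(a,b)/(⟨a⟩⟨b⟩) and, purely for illustration, obtains the confidence interval by a percentile bootstrap over paired samples, which may differ from how the intervals were computed in the paper. The function names are hypothetical.

```python
import numpy as np

def eta(a, b):
    """Assumed normalized covariance: Cov(a, b) / (<a> <b>)."""
    return np.mean((a - a.mean()) * (b - b.mean())) / (a.mean() * b.mean())

def covariability_ratio_ci(x, y, z, n_boot=500, level=0.95, seed=0):
    """Percentile-bootstrap interval for r = eta_xz / eta_yz.

    The bootstrap is an illustrative assumption; any method that yields
    a confidence interval for r could be substituted.
    """
    rng = np.random.default_rng(seed)
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    n = len(x)
    ratios = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample (x, y, z) triplets together
        ratios[i] = eta(x[idx], z[idx]) / eta(y[idx], z[idx])
    lo, hi = np.quantile(ratios, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

def causal_link_detected(ci):
    """Report a causal interaction only if the interval excludes r = 1."""
    lo, hi = ci
    return not (lo <= 1.0 <= hi)
```

      With this convention, an interval such as (0.9, 1.1) yields no detection, while an interval lying entirely above or below 1 yields a detection. A data point at the cusp of detection corresponds to an interval whose edge nearly touches 1, so the binary call is sensitive to the chosen confidence level.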

      It would be helpful to mention what happens when the abundance of a species hits zero. Specifically, there are two ways to interpret the arrow from X to X+d with a W on top: 

      Interpretation (1): 

      P(X+d | X) = W if X+d ≥ 0
      P(X+d | X) = 0 if X_i+d_i < 0 for at least one i

      Interpretation (2): 

      P(X+d | X) = W regardless of whether X+d < 0
      W = 0 whenever X_i < d_i for at least one i

      Interpretation (1) corresponds to a graph where the states are indexed on the non-negative integers. Interpretation (2) corresponds to a graph where the states are indexed on the integers (positive or negative), and W is responsible for enforcing the non-negativity of mass. I believe you need the second interpretation because the first interpretation leads to problems with your definition of causality. For example, consider the reaction: 

      (Na, K) -- 0.1 --> (Na-1, K+1) 

      This could occur if Na and K are the intracellular concentrations of sodium and potassium ions in a cell that has an ATP-driven sodium-potassium exchanger whose rate is limited by the frequency with which extracellular potassium ions happen to flow by. Per the definition of causality found in the appendix, Na has no causal effect on K since Na does not show up in the reaction rate term. However, under interpretation (1), Na clearly has a causal effect on K according to a reasonable definition of causality because if Na=0, then the reaction cannot proceed, whereas if Na>0 then it can. However, under interpretation (2), the reaction above cannot exist and so this scenario is excluded. 

      We thank the referee for this comment that helped us clarify the meaning of arrows with propensities. In short, interpretation (2) corresponds to the definition of our stochastic systems. This is consistent with the standard notation used for the chemical master equation. As the referee points out, because molecular abundances cannot be negative, any biochemical system must then have the property that the propensity of a reaction must be equal to zero when the system is in a state in which an occurrence of that reaction would take one of the abundances to negative numbers. Stochastic networks that do not have this property cannot correspond to biochemical reaction networks.

      In the revised manuscript, we now spell this out explicitly to avoid any confusion, see SI page 25.

      Furthermore, we now discuss the referee’s example in which the rate of exchanging Na for K through an ion exchanger is approximately independent of the intracellular Na concentration. Because molecular abundances cannot become negative, the rate cannot be truly constant: at low concentrations it must decrease until it becomes exactly zero for zero molecules.

      Importantly, agreement with Eq. (2) does not imply that there is no causal effect from X to Zk. It is the deviation from Eq. (2) that implies the existence of a causal effect from X to Zk. Therefore, although the above referee’s example would constitute a causal interaction in our framework, it would not lead to a deviation of Eq. (2) because the fluctuations in Na (which we exploit) do not propagate to K. From a practical point of view, our method thus detects whether changing X over the observed range affects the production and degradation rates of Zk. 
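
      The non-negativity requirement on propensities under interpretation (2) can be illustrated with a small sketch. The helper below is hypothetical and is not taken from the manuscript: it wraps an arbitrary rate function so that the resulting propensity is zero whenever the reaction step would drive any abundance below zero, and applies this to the referee's Na/K example with the nominal rate 0.1.

```python
import numpy as np

def guarded_propensity(rate_fn, step):
    """Wrap rate_fn so the propensity vanishes where the step is impossible.

    rate_fn: maps a state vector to a nominal rate (hypothetical helper).
    step:    the change d applied to the state when the reaction fires.
    """
    step = np.asarray(step)

    def propensity(state):
        state = np.asarray(state)
        # A reaction that would produce a negative abundance cannot occur.
        if np.any(state + step < 0):
            return 0.0
        return float(rate_fn(state))

    return propensity

# Referee's example: (Na, K) -> (Na - 1, K + 1) with nominal rate 0.1.
exchange = guarded_propensity(lambda state: 0.1, step=(-1, +1))

print(exchange((3, 5)))  # nominal rate while Na is available
print(exchange((0, 5)))  # zero once Na is exhausted
```

      The guarded propensity is constant for Na ≥ 1 but necessarily depends on Na at Na = 0, which is why the exchange counts as a causal interaction in this framework even though fluctuations in Na over the observed range need not propagate to K.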

      In the course of setting up the negative control benchmark circuits, a perturbation-based causal validation would be nice. For instance, first, verify that X does not affect Z by intervening on X (e.g. changing its copy number or putting it under the control of an inducible promoter), and ensuring that Z's activity is not affected by such interventions upon X. This approach would help to adjudicate questions of whether the negative control circuits actually have an unknown causal link. The existing benchmark is already reasonably solid in my view, and I do not know how feasible this would be with the authors' setup, but I think that a perturbation-based validation could in principle be the gold standard benchmark.  

      We agree that additional perturbation-based validation tests on all of the negative control circuits would indeed improve the evidence that our method worked as advertised. While such experiments are indeed beyond the scope of our current work we now explicitly point out the benefits of such additional controls in the revised Discussion.

      Below is a series of comments about typography, mostly about section 4 of the supplement. 

      We thank the referee for their careful reading and highlighting those mistakes.

      At the bottom of page 21, Z_aff is defined as the set of components that are affected by X. However, later Z_aff seems to refer to components affected by X or Y. For instance, in the proof of lemma 1, it is written "However, because a is part of z_aff, the {ak} variables must be affected by X and/or Y." 

      We thank the referee for catching this mistake. We have changed the definition of Z_aff throughout the supplement to refer to components affected by X or Y. If it can be experimentally ensured that Y is a passive reporter (i.e., it does not affect other components in the cell), then the theorem can only be violated if X affects Z. 

      In the equation following Eq 5.2, W_k and d_k should be W_i and d_i ?  

      Yes, the referee is correct. In the revised manuscript we have corrected W_k and d_k to W_i and d_i. 

      In Eq 5.3 in the lower-left transition diagram, I think a "y" should be an "x". 

      Yes, the referee is correct. In the revised manuscript  we have fixed this typo.

      In the master equation above Eq 5.5, the "R" terms for the y reactions are missing the alpha term, and I think two of the beta terms need to be multiplied by x and y respectively.  

      The referee is correct. In the revised manuscript we have fixed these typos.

      The notation of Eq 5.8, where z_k(t) is the conditional expectation of z_kt, is strange and difficult to follow. Why does z_k(t) not get a bar over it like its counterparts for x, y, R, and beta? The bars, although not a perfect solution, do help.  

      We agree with the referee’s comment and have added further explanations to define the averages in question, see SI p. 28. In short, when we condition on the history of the components not affected by X or Y, we in effect condition on the time trajectories of z_{k} (when it is part of the components not affected by X and/or Y) and beta (since it only depends on the components not affected by X or Y). We thus previously did not include the bars when taking the averages of these components in the conditional space because the conditioning in effect sets their time-trajectories (so they become deterministic functions of time). In the revised manuscript we now also denote these conditional expectations with bars and we have added comments to the proof to clarify their definition.

      I think it would be helpful to show how the relationship <x>=<y>/alpha is obtained from Eq 5.5.  

      We agree with this suggestion and have added the derivations, see Eqs. (5.9) - (5.13) in the revised SI. 

      In the main text, the legend of Fig 3 cuts off mid-sentence.  

      We thank the referee for catching this mistake which has been fixed in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Oor et al. report the potentially independent effects of the spatial and feature-based selection history on visuomotor choices. They outline compelling evidence, tracking the dynamic history effects based on their clever experimental design (urgent version of the search task). Their finding broadens the framework to identify variables contributing to choice behavior and their neural correlates in future studies.

      Strengths:

      In their urgent search task, the variable processing time of the visual cue leads to a dichotomy in choice performance - uninformed guesses vs. informed choices. Oor et al. did rigorous analyses to find a stronger influence of the location-based selection history on the uninformed guesses and a stronger influence of the feature-based selection history on the informed choices. It is a fundamental finding that contributes to understanding the drivers of behavioral variance. The results are clear.

      Weaknesses:

      (1) In this urgent search task, as the authors stated in line 724, the variability in performance was mainly driven by the amount of time available for processing the visual cue. The authors used processing time (PT) as the proxy for this "time available for processing the visual cue." But PT itself is already a measure of behavioral variance since it is also determined by the subject's reaction time (i.e., PT = Reaction time (RT) - Gap). In that sense, it seems circular to explain the variability in performance using the variability in PT. I understand the Gap time and PT are correlated (hinted by the RT vs. Gap in Figure 1C), but Gap time seems to be more adequate to use as a proxy for the (imposed) time available for processing the visual cue, which drives the behavioral variance. Can the Gap time better explain some of the results? It would be important to describe how the results are different (or the same) if Gap time was used instead of PT and also discuss why the authors would prefer PT over Gap time (if that's the case).

      Thanks to Rev 1 for requesting clarification of this important point. As Rev 1 notes, PT is a derived variable, computed for each trial by subtracting the Gap interval from RT (PT=RT‒Gap). While it is true that Gap and PT are correlated (inversely), it is precisely because of the variance in RT that Gap alone is not an adequate (or certainly not the best) predictor of choice outcome. First, note that, if the Gap were fixed, there would still be variance in RT and in outcome, and any dependence of outcome on time would be explained necessarily by the PT. This is true at any Gap. So, clearly, the PT predicts outcome in a way that the Gap cannot. It is easy to see why: the Gap is the part of the RT interval during which no cue information is present, whereas the PT is the part of the same interval during which it is. Therefore, if one accepts the logical premise that the likelihood of a correct choice depends on the amount of time available to view the Cue before making that choice (i.e., the definition of PT), it follows that the relationship between PT and performance should be tighter than that between performance and Gap. And, indeed, this is the case. Mean accuracy declines systematically as a function of Gap, as expected, but its correlation with performance is much weaker than for PT.

      The comparison that Rev 1 requests, of how accuracy varies as a function of PT versus how it varies with Gap, has appeared in earlier publications (Stanford et al., 2010; Shankar et al., 2011; Salinas et al., 2014), and we now include it here for the current dataset by adding plots of accuracy versus Gap as a new panel in Fig. 1 (Fig. 1c). That PT (not Gap) better predicts the likelihood of success on a given trial is evident in comparing the tachometric (Fig. 1b) and psychometric curves (Fig. 1c). The tachometric curves vary from chance to asymptotic performance and do so over a short range of PT (~75 ms) with well-defined inflection points identifying key transitions in performance (e.g., from guesses to increasingly informed choices). In contrast, the psychometric function plotting average accuracy versus Gap (Fig. 1c) varies much more gradually, a reduction in temporal definition attributable to the failure to account for the RT’s contribution to determining PT for each trial at a given Gap.
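      The difference between the two curves can be illustrated with a minimal sketch. The Python snippet below is not the authors' analysis code: the trial values, bin widths, and function names are hypothetical and chosen only for illustration. It computes PT = RT − Gap for each trial and then bins accuracy either by PT (a tachometric curve) or by Gap (a psychometric curve).

```python
from collections import defaultdict

def processing_time(rt_ms, gap_ms):
    """PT = RT - Gap, computed per trial (all times in ms)."""
    return rt_ms - gap_ms

def binned_accuracy(values, correct, bin_width):
    """Fraction of correct trials in each bin of `values`.

    Returns a dict mapping the left edge of each bin to the accuracy in it.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for v, c in zip(values, correct):
        edge = (v // bin_width) * bin_width  # floor to the bin's left edge
        counts[edge] += 1
        hits[edge] += int(c)
    return {edge: hits[edge] / counts[edge] for edge in sorted(counts)}

# Hypothetical trials (not real data): (RT in ms, Gap in ms, correct?)
trials = [
    (250, 200, False), (260, 200, False), (330, 200, True), (360, 200, True),
    (240, 100, True), (300, 100, True), (210, 100, False), (280, 100, True),
]
rts, gaps, correct = zip(*trials)
pts = [processing_time(rt, gap) for rt, gap in zip(rts, gaps)]

tachometric = binned_accuracy(pts, correct, bin_width=50)     # accuracy vs PT
psychometric = binned_accuracy(gaps, correct, bin_width=100)  # accuracy vs Gap
```

      With these made-up numbers, accuracy binned by PT rises from 0 to 1 across bins, whereas accuracy binned by Gap changes only between 0.75 and 0.5, because each Gap value pools trials with very different PTs.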

      (2) The authors provide a compelling account of how the urgent search task affords

      (i) more pronounced selection history effects on choice and

      (ii) dissociating the spatial and feature-based history effects by comparing their different effects on the tachometric curves. However, the authors didn't discuss the limits of their task design enough. It is a contrived task (one of the "laboratory tasks"), but the behavioral variability in this simple task is certainly remarkable. Yet, is there any conclusion we should avoid from this study? For instance, can we generalize the finding in more natural settings and say, the spatial selection history influences the choice under time pressure? I wonder whether the task is simple yet general enough to make such a conclusion.

      As Rev. 1 notes, the CO task is a laboratory task that produces large history effects. But importantly, we don't think urgency is causal or essential to the existence of such effects (this is now more explicitly stated in the first section of the Results); it is simply a powerful tool for revealing and characterizing them. As noted in the Discussion, our results are consistent with studies that, based on simpler, non-urgent tasks, demonstrated either reward-driven spatial biases or color priming effects. The CO task uses urgency to generate a psychometric function that time resolves perceptually informed from perceptually uninformed choices, and thereby provides the logical key to disambiguating the simultaneous contributions of perceptual and non-perceptual biases to performance. Such was essential to our demonstration that distinct biases act independently on the same saccade choices.

      In a natural setting, we would certainly expect the respective magnitudes of such non-volitional history-based biases to be highly context dependent, but it would be difficult, if not impossible, to discern their relative impact on natural behavior. That said, we think that the biases revealed by the CO task are exemplary of those that would manifest in natural behaviors depending on the real-world context to which such behaviors correspond. Here, it is important to emphasize that the spatial- and feature-based biases we observed were not strategic, on average neither helping nor hindering overall performance. Thus, in the real-world we might expect the expression of similar biases to be an important source of behavioral variance. These observations are now summarized in the penultimate paragraph of the Discussion.

      (3) Although the authors aimed to look at both inter- and intra-trial temporal dynamics, I'm not sure if the results reflect the true within-trial dynamics. I expected to learn more about how the spatial selection history bias develops as the Gap period progresses (as the authors mentioned in line 386, the spatial history bias must develop during the Gap interval). Does Figure 3 provide some hints in this within-trial temporal dynamics?

      Because it is based on the location of the saccadic choice(s) on previous trial(s), we might expect a signal of spatial bias to be present before and during the Gap period and perhaps even before a trial begins (i.e., intertrial interval). However, because behavioral bias is a probabilistic measure of saccade tendency, we have no way of knowing if such a signal is present during periods devoid of saccadic choices. Note that, for both monkey subjects, average RT exceeded the duration of the longest Gap employed (Fig. 1), and this means that relatively few saccades occurred prior to Cue onset. That said, it's clear in Figs. 2, 3, and 6 that location bias is evident for saccades initiated at the transition between Gap and Cue intervals (PT=0). Anecdotally, we can report that spatial bias is evident when we extend our analysis back further into the range of negative PTs (i.e., Gap interval), but the statistics are weak given the paucity of trials at that point. Nevertheless, this is consistent with a bias that exists from the beginning of the trial, as would be expected based on neurophysiological studies from Hikosaka's lab in a simpler but comparable spatial bias task.

      Although our data do not unequivocally identify the temporal origin of the spatial bias, they clearly show that the bias is present early (at short PTs) and diminishes rapidly as the perceptual information accrues (at long PTs). Thus, the PT-dependent temporal dynamics that are revealed clearly suggest that spatial and perceptual biases operate over different intra-trial time frames, one decreasing and the other increasing. As mentioned by Rev. 1, Fig. 3 emphasizes this dichotomy.

      (4) The monkeys show significant lapse rates (enough error trials for further analyses). Do the choices in the error trials reflect the history bias? For example, if errors are divided in terms of PTs, do the errors with short PT reflect more pronounced spatial history bias (choosing the previously selected location) compared to the errors with long PT?

      The short answer is “yes”. Errors generally show a PT-dependent influence of history bias. However, correct and error trials are the result of the same biased dynamics, and analyzing them separately post-hoc does not provide much additional insight about the history effects beyond that provided by the tachometric curves themselves.

      To see this, first consider the figure below (Author response image 1). Two tachometric curves conditioned on color history are shown (left). These are the two extreme curves plotted in Fig. 2a, which correspond to the 4S (i.e., 4 repeats of the current target color) and 4D (4 color repeats and then a switch) conditions. Each of these curves already shows the probability of making an error at each PT but, indeed, we can compare the proportions of correct and error trials at short PTs (guesses) and long PTs (informed choices). These are indicated by the bar graphs on the right. Now, the effect of a bias would be to create a difference in success rate between repetitions (4S, blue) and switches (4D, red) relative to the overall, unbiased expectation (indicated by dotted lines). For color-based history, there is no bias at short PT: the proportions of correct choices are almost exactly at the expected chance level (filled bars coincide with dotted line). In contrast, at long PTs, there is a differential effect, but it is due both to a proportion of correct trials that is higher than expected in the 4S case (filled blue bar above dotted line) and to a proportion of correct trials that is lower than expected in the 4D case (filled orange bar below dotted line). This is exactly as one would expect if the current choice was biased by target color history.

      Author response image 1.

      A similar analysis can be done for location history (Author response image 2, which shows the two extreme curves from Fig. 2e). In this case the bias is much stronger at short PTs, and the difference between repeats (4S, blue) and switches (4D, red) is largely explained by a proportion of correct choices that is much higher than expected by chance in the 4S condition (filled blue bar well above dotted line). This makes sense, because a rewarded location is likely to become the next guess, so if the target happens to appear again at that same location, the subsequent guess is more likely than chance to be correct. At longer PTs, the differential effect is smaller, as would be expected for more informed choices, but it is again driven by the 4S condition. Importantly, in the case of location the total number of S trials is much smaller than the total number of D trials (because a target-location repetition has a probability of 0.25 only), so it only makes sense to compare the proportions of correct (or error) trials, not the absolute numbers, between those conditions.

      Author response image 2.

      In summary, although it is possible to examine the separate dependencies of correct and error trials on history and PT, the distinction is not very useful. Only the frequency of errors relative to that of correct choices makes complete sense, not so much, say, the frequency of short PT errors relative to that of long PT errors.  
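      The point about comparing proportions rather than absolute numbers can be made concrete with a small sketch. The counts below are hypothetical, chosen only so that repeat (S) trials are about three times rarer than switch (D) trials, as expected when a location repetition has probability 0.25. The chance level of 0.25 is likewise an assumption, based on four possible target locations.

```python
def proportion_correct(n_correct, n_total):
    """Fraction of correct trials; raises if there are no trials."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_correct / n_total

# Hypothetical short-PT counts (not real data).
repeat_trials = {"correct": 60, "total": 100}  # S: target location repeats
switch_trials = {"correct": 60, "total": 300}  # D: target location switches

CHANCE = 0.25  # assumed chance level with four possible target locations

p_repeat = proportion_correct(repeat_trials["correct"], repeat_trials["total"])
p_switch = proportion_correct(switch_trials["correct"], switch_trials["total"])

# Bias is read from the proportions relative to chance, not from raw counts.
bias_repeat = p_repeat - CHANCE
bias_switch = p_switch - CHANCE
```

      In this toy case the two conditions have the same absolute number of correct trials, yet the proportions differ markedly (0.6 versus 0.2), so only the proportions reveal the effect of location history.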

      Reviewer #2 (Public review):

      Summary:

      This is a clear and systematic study of trial history influences on the performance of monkeys in a target selection paradigm. The primary contribution of the paper is to add a twist in which the target information is revealed after, rather than before, the cue to make a foveating eye movement. This twist results in a kind of countermanding of an earlier "uninformed" saccade plan by a new one occurring right after the visual information is provided. As with countermanding tasks in general, time now plays a key factor in the success of this task, and it is time that allows the authors to quantitatively assess the parametric influences of things like previous target location, previous target identity, and previous correctness rate on choice performance. The results are logical and consistent with the prior literature, but the authors also highlight novelties in the interpretation of prior-trial effects that they argue are enabled by the use of their paradigm.

      Strengths:

      Careful analysis of a multitude of variables influencing behavior

      Weaknesses:

      Results appear largely confirmatory.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) The authors provide comprehensive accounts of the urgent search task in multiple places in the manuscript. But the description can be simpler and more consistent throughout. I found it confusing when the authors compared their task with previous search tasks used by Bichot and Schall, McPeek et al. I believe the authors wanted to explain that it is not just the urgency but the fact that the target color being randomly interleaved also contributes to the pronounced history bias in their task. I appreciate their thorough comparison with previous studies but it can be distracting or lose focus. It might read better if this statement can be expanded in the Discussion, not in the Results (lines 366-376).

      We thank the reviewer for pointing this out. We agree that the paragraph in question was ambiguous and appeared to elaborate a Discussion point, which was not our intent. Indeed, as the reviewer noted, the main point was that the randomization of the target colors (and not urgency) is the critical aspect of the task that makes it surprisingly difficult for the monkeys. We have revised the paragraph to emphasize this conclusion and the two empirical results from our own data that support it. The agreement with prior studies, which is somewhat tangential, is now briefly mentioned at the end of the paragraph. It should now be clear that the text mainly describes current data that are relevant to the interpretation of the main results.

      (2) It's important to state that feature-based selection history bias is not merely due to the monkey's intrinsic bias to one color over the other (red vs green). The authors did a nice job controlling that, as mentioned in Methods (lines 194-196) and supplementary figure (Figure 1 - Figure Supplement 2). It would be helpful for readers to read in Results as well.

      Thank you for the suggestion. We now mention this in the second section of the Results.

      (3) D trial examples for the location history in Results can be confusing to readers (lines 407-409; left-left-right, up-up-left). The examples in Methods (lines 224-229; left-up-right, up-down-left) are better to convey the preceding (different) trials can be of any kind.

      Indeed. Both types of example are now mentioned in the Results.

      Reviewer #2 (Recommendations for the authors):

      I have only minor comments:

      (1) In the abstract, I'm not sure what "when combined" means in the last sentence. What is combined? Selection history and stimulus salience? If so, this is not very clear. Also, it might be nice to end the abstract on how the study addresses the three components of attention that the abstract started with in the first place (salience, task, and history). Otherwise, I spent multiple abstract reads (before even reading the rest of the paper) trying to see whether indeed the paper addresses the three components of attention that were so prominently described at the beginning of the abstract or not. And, I still could not convince myself of whether all three were addressed by the study or not (I then resorted to proceeding with a reading of the rest of the paper).

      Thanks for pointing this out. We have reworded the abstract to clarify that we are focusing on selection history, not salience or top-down attention.

      (2) Line 72: isn't stimulus location still a feature????

      Our nomenclature here is intended to be consistent with the commonly applied distinction between “spatial” and “feature” -based attention that underscores the distinct mechanistic underpinnings of “where” and “what”.

      (3) Lines 76-79: I'm very confused here. The part about "guesses can be strongly biased toward an arbitrary location early on". However, I expected the later part of the sentence to still stick to location and mention what the temporal dynamic is. Instead, it discusses perceptual bias, which I presume is the color thing. So, the net result is that I'm a bit confused about how *both* location and color behave in *both* early and late times.

      We have rewritten the end of this paragraph to clarify when and how location and feature biases manifest in behavior. It may be useful to note the following. The tachometric curve describes different types of choices distinguished by their timing, guesses at short PTs vs informed decisions at long PTs. However, this also corresponds to the degree to which perceptual information becomes available over time within a single trial. Namely, perceptual information is initially absent but arrives later on. The revised text now reflects this distinction, making the logic for the expected results clearer.

      (4) Last paragraph of the introduction (lines 80-82): it would be helpful to justify here why the psychophysics were done in monkeys in this study, instead of humans.

      We now allude to the reason these studies were done in monkeys but feel that more elaboration of this point is better left to Discussion. The Discussion now more explicitly states that the current data are closely related to neurophysiological studies of spatial attention and color priming in monkeys (beginning of 4th paragraph).

      - Line 389: this kind of formulation is much clearer to me than lines 76-79 mentioned above.

      As noted, the above-mentioned section has been revised.

      - I'm a bit confused by Figure 4 in the sense that some of the effect sizes are not too different from Figure 2, even when there are some intermediate inconsistent trials. I guess the problem is aggravated by the different axis ranges in Figures 2, and 4.

      All the 1S and 1D data points are the same in both figures, as they should be, but the problem is that, otherwise, the two figures are just not comparable. Apples and oranges. To see this, note that the trends for the difference between S and D conditions should go in opposite directions as trials go further into the past, and indeed they do. In Figures 2c, f, the differences between 1S and 1D results are small, and those between 4S and 4D results are the largest because both S and D effects grow away from the average with more repetitions. In contrast, in Figure 4b-d, the differences between S and D shrink as the effect of a single trial becomes more distant (differences are largest between 1S and 1D results, smallest between 1S9x and 1D9x results). The only slightly ambiguous trend is that of Figure 2g, because the S data are noisier. We have expanded the text surrounding Figure 4 to highlight the different expected trends for this analysis in contrast to that presented in Figure 2. This should clarify the qualitative difference between the two.

      - On a related note, it is odd that the summary figures (e.g. Figures. 2, 4, etc) are vertically aligned such that the dependent measure is on the x-axis rather than the y-axis. For example, looking at Figure 2, it would make much more sense if panels b-d and f-h were rotated by 90 deg, such that the vertical axis is indeed the low asymptote or high asymptote or RT. This would directly correlate with the same data in panels a and e in the same figure and would be much easier to follow. Then, later in the paper, Fig. 8 suddenly does the dependent measure on the y-axis, as I said. I think it can help to use similarly consistent plotting approaches across all (or most) analyses.

      We tried other formats but settled on the current one because we felt it made it (slightly) easier to compare the patterns across history conditions between any two of the 6 bar graphs in each figure (in Figs 2, 5, 6), in part because it prevents any confusion with the PT axes. As this does not make a substantial difference either way, we prefer to maintain the present arrangement. Additional labels are now included, which should make the figures a bit more friendly.

      - At the beginning of the paper, I was under the impression that this will really be a free viewing search task (e.g. Wolfe search arrays or old Nakayama search arrays), but then it became clear later that it was still an instructed task, with the only difference being that the target onset is now 4 targets. I think this distinction should be clarified very early on, in order to avoid confusion by the readers. The reason I say this is that with enforced fixation, there are other factors in this task that come into play, like the monkey's individual microsaccade rates etc, which can modulate performance since they also have a form of countermanding that is like the one imposed by the compelled saccade task. So, better alert the readers to the context of the task early on.

      Thanks. We have provided additional detail when introducing the task for the first time in the Introduction, along with a citation to an earlier publication in which the specific task is described. There should be no ambiguity now.

      Reviewing Editor Comments:

      Short Assessment:

      This important study makes compelling use of the monkey animal model to capture the long-time course over which trial history affects decision-making under time pressure, showing decisions are affected by the stimulus sequence extending back as many as four trials previously.

      Summary:

      Decision-making is variable, but how much of this variability can be accounted for by the immediate previous history is not well known. Using an "urgent" saccade, Oor et al manipulated how much time monkeys had to process evidence, and evaluated what they did when there was too little time to make an evidence-based decision. They report that the history affected performance as far back as 4 previous trials and that different aspects of the stimulus history (color and location) affected performance differently.

      Strengths:

      The key strengths of this paper are that the monkey paradigm permitted a study under highly controlled conditions with stable performance across sessions and enough trials to conduct the history analysis farther back in time than is possible with smaller data sets. While the fact that prior history affects decisions was previously known, this study provides a careful quantification of the effect -- which proves to be quite large - as well as an assessment of both location and feature histories in combination with each other. The manuscript is well-written and easy to follow.

      Weaknesses and recommendations for the authors:

      (1) The figures are lovely but could use some more text/design elements to clarify, and there is space to do so. e.g., in Figure 2, there could be titles to indicate that the top row involves the color history and the bottom row involves location history. The information is there, in the y labels of panels B and F, but it takes a while to see that.

      Done. Titles have been added to Figure 2 and several others.

      (2) Furthermore, the abbreviations 1D, 4S, etc are explained in the legend but it seems there is room to spell them out or include a graphic to indicate what they mean.

      The labels 1D, 4S, etc are difficult to spell out because each one represents multiple conditions; for instance, 2S may correspond to green-green or red-red target colors, and so on. Figure legends have been edited to more clearly indicate that S and D labels correspond to repeat and switch trials, respectively, and that the associated number indicates how far back the history goes.

      (3) The terms "low asymptote" and "high asymptote" could be indicated in a graphic of a tachometric function, smoothing the transition to the rightmost panels. (Consider also alternative terms - perhaps "floor" and "ceiling" might be more readily understandable than asymptote to the student reader??).

      Thanks for the suggested terms, “floor” and “ceiling”, which we’ve adopted. They are indeed more natural. Figure 2a now indicates that floor and ceiling accuracies correspond to opposite ends of the PT axis.

      (4) The units for the asymptotes are not indicated - I assume these are "% correct" but that would be helpful to clarify.

      Yes. Units for floor and ceiling (and RT) are now indicated in all figures.

      (5) Figure 3 - "PT", and "1S-1D" could be spelled out, and the meaning of the two colored traces could be in the figure itself rather than only in the legend. Similar suggestions about labeling and abbreviations apply in subsequent figures.

      PT is now spelled out in all figures other than Figure 1, and labels for the two traces were added to Figure 3. Thanks for all the detailed suggestions.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      This study provides a thorough analysis of Nup107's role in Drosophila metamorphosis, demonstrating that its depletion leads to developmental arrest at the third larval instar stage due to disruptions in ecdysone biosynthesis and EcR signaling. Importantly, the authors establish a novel connection between Nup107 and Torso receptor expression, linking it to the hormonal cascade regulating pupariation.

      However, some contradictory results weaken the conclusions of the study. The authors claim that Nup107 is involved in the translocation of EcR from the cytoplasm to the nucleus. However, the evidence provided in the paper suggests it more likely regulates EcR expression positively, as EcR is undetectable in Nup107-depleted animals, even below background levels.

      We appreciate the concern raised in this public review. However, we must clarify that we do not claim that Nup107 directly regulates the translocation of EcR from the cytoplasm to the nucleus; rather, Nup107 regulates the synthesis of the ecdysone hormone (20E), which in turn affects EcR translocation. In the manuscript, we raised only as a hypothesis the possibility that Nup107 regulates EcR nuclear translocation (9th line of 2nd paragraph on page 6). We have spelled this out more clearly in the title of the 3rd subsection of the Results section, and in the Discussion (8th line of 2nd paragraph on page 11).

      20E acts through the EcR to induce the transcription of EcR-responsive genes, including EcR itself. This creates a positive autoregulatory loop that enhances the EcR level through ecdysone signaling (1). Since Nup107 depletion leads to a reduction in ecdysone levels, it disrupts the transcriptional autoregulatory loop of EcR expression. This can contribute to the reduced EcR levels seen in Nup107-depleted animals.

      Additionally, the link between Nup107 and Torso is not fully substantiated. While overexpression of Torso appears to rescue the lack of 20E production in the prothoracic gland, the distinct phenotypes of Torso and Nup107 depletion (developmental delay in the former versus complete larval arrest in the latter) complicate understanding of Nup107's precise role.

      We understand that there are differences in the developmental delay when Torso and Nup107 depletion is analyzed. However, the two molecules being compared here are very different, and variability in their depletion could contribute to the observed phenotypic differences (2). Even if there is no variability in the depletion of Torso and Nup107, we believe that Nup107, being more widely expressed and involved in the regulation of various cellular processes, induces stronger defects.

      Further, we think that RNAi-mediated depletion of Nup107 in prothoracic glands (PG) causes significant reduction in the PG size, which may exert a pronounced defect in 20E biosynthesis through the Halloween genes, inducing a stronger developmental arrest.

      To clarify these discrepancies, further investigation into whether Nup107 interacts with other critical signaling pathways related to the regulation of ecdysone biosynthesis, such as EGFR or TGF-β, would be beneficial and could strengthen the findings.

      In summary, although the study presents some intriguing observations, several conclusions are not well-supported by the experimental data.

      We agree with the reviewer’s suggestion. As noted in the literature, five RTKs (torso, InR, EGFR, Alk, and Pvr) stimulate the PI3K/Akt pathway, which plays a crucial role in PG function and in controlling pupariation and body size (3). We have examined torso and EGFR signaling. Torso overexpression rescued the Nup107 defects; however, constitutively active EGFR (BL-59843) did not rescue the phenotype (data not shown). Nonetheless, we plan to examine EGFR pathway activation by measuring pERK levels in Nup107-depleted PGs.

      Reviewer #2 (Public review):

      Summary:

      The manuscript by Kawadkar et al investigates the role of Nup107 in developmental progression via the regulation of ecdysone signaling. The authors identify an interesting phenotype of Nup107 whole-body RNAi depletion in Drosophila development - developmental arrest at the late larval stage. Nup107-depleted larvae exhibit mis-localization of the Ecdysone receptor (EcR) from the nucleus to the cytoplasm and reduced expression of EcR target genes in salivary glands, indicative of compromised ecdysone signaling. This mis-localization of EcR in salivary glands was phenocopied when Nup107 was depleted only in the prothoracic gland (PG), suggesting that it is not nuclear transport of EcR but the presence of ecdysone (normally secreted from PG) that is affected. Consistently, whole-body levels of ecdysone were shown to be reduced in Nup107 KD, particularly at the late third instar stage when a spike in ecdysone normally occurs. Importantly, the authors could rescue the developmental arrest and EcR mislocalization phenotypes of Nup107 KD by adding exogenous ecdysone, supporting the notion that Nup107 depletion disrupts biosynthesis of ecdysone, which arrests normal development. Additionally, they found that rescue of the Nup107 KD phenotype can also be achieved by over-expression of the receptor tyrosine kinase torso, which is thought to be the upstream regulator of ecdysone synthesis in the PG. Transcript levels of the torso are also shown to be downregulated in the Nup107KD, as are transcript levels of multiple ecdysone biosynthesis genes. Together, these experiments reveal a new role of Nup107 or nuclear pore levels in hormone-driven developmental progression, likely via regulation of levels of torso and torso-stimulated ecdysone biosynthesis.

      Strengths:

      The developmental phenotypes of an NPC component presented in the manuscript are striking and novel, and the data appears to be of high quality. The rescue experiments are particularly significant, providing strong evidence that Nup107 functions upstream of torso and ecdysone levels in the regulation of developmental timing and progression.

      Weaknesses:

      The underlying mechanism is however not clear, and any insight into how Nup107 may regulate these pathways would greatly strengthen the manuscript. Some suggestions to address this are detailed below.

      Major questions:

      (1) Determining how specific this phenotype is to Nup107 vs. to reduced NPC levels overall would give some mechanistic insight. Does knocking down other components of the Nup107 subcomplex (the Y-complex) lead to similar phenotypes? Given the published gene regulatory function of Nup107, do other gene regulatory Nups such as Nup98 or Nup153 produce these phenotypes?

      We thank the reviewer for raising this concern. With a Nup complex such as the Nup107 complex, this concern is anticipated but difficult to address, as many Nups function beyond their complex identity. Our observations with all other members of the Nup107 complex, including dELYS, suggest that, apart from Nup107, none of the tested Nup107-complex members induces larval developmental arrest.

      In this study, we primarily focused on the Nup107 complex (outer ring complex) of the NPC. However, previous studies have reported that Nup98 and Nup153 interact with chromatin, with these investigations conducted in Drosophila S2 cells (4, 5, 6). We have now examined other nucleoporins outside of this complex, such as Nup153.

      We ubiquitously depleted Nup153 using the Actin5C-Gal4 driver and assessed the pupariation profile of the knockdown larvae in comparison to control larvae. In contrast to the Nup107 knockdown, when Nup153 was depleted to less than 50% of control levels, no impact on pupariation was observed (Author response image 1).

      Author response image 1.

      Nup153 depletion does not affect Drosophila metamorphosis. Actin5C-Gal4 is used as a ubiquitous driver. (A) Comparison of pupariation profiles of control and Nup153 knockdown organisms. (B) Quantification of Nup153 knockdown efficiency. Data are from at least three independent experiments. Statistical significance was derived from the Student’s t-test. Error bars represent SEM. ***p < 0.001.

      (2) In a related issue, does this level of Nup107 KD produce lower NPC levels? It is expected to, but actual quantification of nuclear pores in Nup107-depleted tissues should be added. These and the above experiments would help address a key mechanistic question - is this phenotype the result of lower numbers of nuclear pores or specifically of Nup107?

      We agree with the concern raised here. To address it, we stained control and Nup107-depleted salivary glands with the mAb414 antibody (an antibody that exclusively recognizes FG-repeat Nups). While Nup107 intensities are significantly reduced at the nuclear envelope in Nup107-depleted salivary glands, the mAb414 staining appears unperturbed (Author response image 2).

      Author response image 2.

      Nup107 depletion does not perturb overall NPC composition. Comparison of salivary gland nuclei in control and Nup107 knockdown. Nup107 is shown in green, and mAb414 staining for other FG-repeat-containing nucleoporins is shown in red. Scale bars, 5 µm.

      (3) Additional experiments on how Nup107 regulates the torso would provide further insight. Does Nup107 regulate transcription of the torso or perhaps its mRNA export? Looking at nascent levels of the torso transcript and the localization of its mRNA can help answer this question. Or alternatively, does Nup107 physically bind the torso?

      While the concern regarding torso transcript level is genuine, we have already reported in the manuscript that Nup107 directly regulates torso expression. When Nup107 is depleted, torso levels go down, which in turn controls ecdysone production and subsequent EcR signaling (Figure 6B of the manuscript).

      However, the exact nature of Nup107 regulation of torso expression is still unclear. Since Nup107 is known to interact with chromatin (7), it may affect torso transcription. A stable and physiologically relevant interaction between Nup107 and torso in a cellular context is unlikely, largely because of their distinct subcellular localizations. Investigating this further would require a significant amount of time to obtain reagents and perform experiments, and it is currently beyond the scope of this manuscript.

      (4) The depletion level of Nup107 RNAi specifically in the salivary gland vs. the prothoracic gland should be compared by RT-qPCR or western blotting.

      Although we know that the Nup107 protein signal is reduced in SG upon knockdown (Figure 3B), we have not compared the Nup107 transcript level in these two tissues (SG and PG) upon RNAi. As suggested here, we evaluated the knockdown efficiency of Nup107 using the salivary gland-specific driver AB1-Gal4 and the prothoracic gland-specific driver Phm-Gal4. Our results indicate a significant reduction in Nup107 transcript levels upon Nup107 RNAi in both SG and PG compared to their respective controls (Author response image 3).

      Author response image 3.

      Nup107 levels are significantly reduced upon Nup107<sup>KK</sup> RNAi. Quantification of Nup107 transcript levels from control and Nup107-depleted larvae [tissue-specific depletion using AB1-Gal4 (A) and Phm-Gal4 (B)]. Data are represented from at least three independent experiments. Statistical significance was derived from the Student’s t-test. Error bars represent SEM. **p < 0.004.

      (5) The UAS-torso rescue experiment should also include the control of an additional UAS construct - so Nup107; UAS-control vs Nup107; UAS-torso should be compared in the context of rescue to make sure the Gal4 driver is functioning at similar levels in the rescue experiment.

      This is a very valid point, and we took this into account while planning the experiment. In such cases, GAL4 dilution can often be critical. We have demonstrated in Figure S7 that GAL4 dilution does not confound our observations. We used Nup107<sup>KK</sup>; UAS-GFP as a control alongside Nup107<sup>KK</sup>; UAS-torso. We conclude that the presence of GFP signal in the prothoracic glands, together with their reduced size, indicates that genes downstream of both UAS sequences are transcribed and that GAL4 dilution does not play a role here.

      Minor:

      (6) Figures and figure legends can stand to be more explicit and detailed, respectively.

      We have revisited all figures and their corresponding legends to ensure appropriate and explicit details are provided.

      Reviewer #3 (Public review):

      Summary:

      In this study by Kawadkar et al, the authors investigate the developmental role of Nup107, a nucleoporin, in regulating the larval-to-pupal transition in Drosophila through RNAi knockdown and CRISPR-Cas9-mediated gene editing. They demonstrate that Nup107, an essential component of the nuclear pore complex (NPC), is crucial for regulating ecdysone signaling during developmental transitions. The authors show that the depletion of Nup107 disrupts these processes, offering valuable insights into its role in development.

      Specifically, they find that:

      (1) Nup107 depletion impairs pupariation during the larval-to-pupal transition.

      (2) RNAi knockdown of Nup107 results in defects in EcR nuclear translocation, a key regulator of ecdysone signaling.

      (3) Exogenous 20-hydroxyecdysone (20E) rescues pupariation blocks, but rescued pupae fail to eclose.

      (4) Nup107 RNAi-induced defects can be rescued by activation of the MAP kinase pathway.

      Strengths:

      The manuscript provides strong evidence that Nup107, a component of the nuclear pore complex (NPC), plays a crucial role in regulating the larval-to-pupal transition in Drosophila, particularly in ecdysone signaling.

      The authors employ a combination of RNAi knockdown, CRISPR-Cas9 gene editing, and rescue experiments, offering a comprehensive approach to studying Nup107's developmental function.

      The study effectively connects Nup107 to ecdysone signaling, a key regulator of developmental transitions, offering novel insights into the molecular mechanisms controlling metamorphosis.

      The use of exogenous 20-hydroxyecdysone (20E) and activation of the MAP kinase pathway provides a strong mechanistic perspective, suggesting that Nup107 may influence EcR signaling and ecdysone biosynthesis.

      Weaknesses:

      The authors do not sufficiently address the potential off-target effects of RNAi, which could impact the validity of their findings. Alternative approaches, such as heterozygous or clonal studies, could help confirm the specificity of the observed phenotypes.

      This is a very valid point, and we are aware of the consequences of off-target effects of RNAi. To confirm that the phenotypes reflect authentic RNAi and to minimize off-target effects, we used two RNAi lines (Nup107<sup>GD</sup> and Nup107<sup>KK</sup>) against Nup107. Both RNAi lines induced comparable levels of Nup107 reduction, and with both lines, ubiquitous and PG-specific knockdown produced similar phenotypes. Although the Nup107<sup>GD</sup> line exhibited a relatively stronger knockdown than the Nup107<sup>KK</sup> line, we preferentially used the Nup107<sup>KK</sup> line because the Nup107<sup>GD</sup> line is based on a P-element insertion whose exact landing site is unknown. Furthermore, there is a predicted off-target for the Nup107<sup>GD</sup> line, where a 19 bp sequence aligns with the bifocal (bif) sequence. The bif-encoded protein is involved in axon guidance and regulation of axon extension. In contrast, the Nup107<sup>KK</sup> line has no predicted off-target, and its precise landing site on the second chromosome is known. Thus, the Nup107<sup>KK</sup> line was ultimately used for experimentation because of its clearer and more reliable genetic background.

      We are also investigating Nup107 knockdown in the prothoracic gland, which exhibits polyteny. Additionally, the number of cells in the prothoracic gland is quite limited, approximately 50-60 cells (8). Given this, there is a possibility that a clonal study may not yield the phenotype.

      NPC Complex Specificity: While the authors focus on Nup107, it remains unclear whether the observed defects are specific to this nucleoporin or if other NPC components also contribute to similar defects. Demonstrating similar results with other NPC components would strengthen their claims.

      We thank the reviewer for raising this concern. When working with a Nup complex such as the Nup107 complex, this concern is anticipated but difficult to address, as many Nups function beyond their complex identity. Our observations with all other members of the Nup107-complex, including dELYS, suggest that, except for Nup107, none of the other Nup107-complex members could induce larval developmental arrest. Since the study is primarily focused on the Nup107 complex (outer ring complex) of the NPC, we have not examined many more nucleoporins outside of this complex. However, knockdown of Nup153, a nuclear basket nucleoporin, is comparable to control, with no delay in development (Author response image 1).

      Although the authors show that Nup107 depletion disrupts EcR signaling, the precise molecular mechanism by which Nup107 influences this process is not fully explored. Further investigation into how Nup107 regulates EcR nuclear translocation or ecdysone biosynthesis would improve the clarity of the findings.

      We appreciate the concern raised. Through our observation, we have proposed the upstream effect of Nup107 on the PTTH-torso-20E-EcR axis regulating developmental transitions. We know that Nup107 regulates torso levels, but we do not know if Nup107 directly interacts with torso. We would like to address whether Nup107 exerts control on PTTH levels also.

      However, we must emphasize that Nup107 does not directly regulate the translocation of EcR. On the contrary, we have demonstrated that when Nup107 is depleted only in the salivary gland, EcR translocates into the nucleus. Thus we conclude that the EcR translocation is 20E dependent and Nup107 independent. Further, we have argued that Nup107 regulates the expression of Halloween genes required for ecdysone biosynthesis. We are interested in identifying if Nup107 associates directly or through some protein to chromatin to bring about the changes in gene expression required for normal development.

      There are some typographical errors and overly strong phrases, such as "unequivocally demonstrate," which could be softened. Additionally, the presentation of redundant data in different tissues could be streamlined to enhance clarity and flow.

      Response: We thank the reviewer for this observation. We have made our best efforts to remove all typographical errors and have now made more measured statements based on our conclusions.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the authors):

      The manuscript presents compelling evidence that Nup107 plays a role in regulating ecdysone production. However, significant concerns remain regarding the effects on EcR localization and expression, as well as the claimed link between PTTH/Torso signaling and Nup107's function, as the evidence provided is not conclusive.

      The hypothesis that Nup107 mediates EcR translocation from the cytoplasm to the nucleus appears misinterpreted by the authors. Based on the presented images, particularly for the prothoracic gland (PG) Figure 3C, Nup107 depletion seems to impact EcR protein levels rather than its localization. This conclusion is supported by data showing that EcR transcripts are autonomously downregulated in the absence of Nup107. Furthermore, the restoration of nuclear EcR levels upon exogenous 20E supplementation suggests that (1) Nup107 is dispensable for EcR activation and function, and (2) its primary role lies in regulating ecdysone production.

      We appreciate the concern raised by the reviewer. However, we must clarify that we do not claim that Nup107 directly regulates the translocation of EcR from the cytoplasm; rather, Nup107 regulates ecdysone hormone (20E) synthesis, which in turn affects EcR translocation. In the manuscript, we posed the hypothesis of whether Nup107 regulates EcR nuclear translocation (9th line of 2nd paragraph on page 6). We have spelled this out more clearly in the 3rd subsection title of the Results section and in the discussion (8th line of 2nd paragraph on page 11).

      20E acts through EcR to induce the transcription of EcR-responsive genes, including EcR itself. This creates a positive autoregulatory loop that enhances EcR levels through ecdysone signaling (1). Since Nup107 depletion leads to a reduction in ecdysone levels, it disrupts this autoregulatory loop of EcR transcription. This can contribute to the reduced EcR levels seen in Nup107-depleted animals.

      Given that nucleoporins are known to influence mRNA transport (for instance, Nup107 has been shown to control Scn5a mRNA transport; Guan et al., 2019), the observed effects on Halloween gene and EcR expression may stem from disruptions in mRNA transport to the cytoplasm. The downregulation of Shade further supports this hypothesis, as restricted ecdysone biosynthesis typically induces Shade upregulation in peripheral tissues. Quantifying potential mRNA accumulation in the nuclei of PG cells in Nup107-depleted animals would clarify this.

      The reviewer raised a valid point, and we fully agree with the concern that Nup107 has been shown to control Scn5a mRNA transport (Guan et al., 2019). The observed effects on Halloween gene and EcR expression could indeed stem from disruptions in efficient mRNA export to the cytoplasm. However, if Nup107 were regulating the mRNA export of Halloween genes and EcR, we should not expect a rescue of the Nup107 developmental delay phenotype with torso overexpression. But, by overexpressing the torso in the Nup107 depletion background, we are activating the torso pathway dependent Halloween gene expression, and rescuing the developmental delay phenotype of Nup107 depletion.

      With the current data, it is difficult to conclusively claim a role for Nup107 in EcR translocation or expression. Additional experiments, such as EcR overexpression in Nup107-depleted animals or Nup107 overexpression, would help determine its precise role.

      We appreciate the concern raised by the reviewer. We did attempt to rescue the Nup107 depletion phenotype by overexpressing EcR (BL-6868) in the Nup107-RNAi background. However, we were unable to rescue the Nup107 depletion-dependent developmental delay phenotype with this approach. This further suggests that the phenotype is not merely due to low levels of EcR but to low availability of ecdysone hormone and reduced EcR signaling.

      The second major issue is the proposed link between Nup107 and PTTH/Torso signaling. The authors suggest that Nup107 regulates ecdysone production through Torso expression based on rescue experiments. However, this is inconsistent with the distinct phenotypes observed when Nup107 or Torso signaling is disrupted. While PTTH/Torso signaling causes only a modest developmental delay (12 hours to 2 days, depending on the mutant), Nup107 depletion results in a complete developmental arrest at the larval stage. This discrepancy raises doubts about the assertion that Torso overexpression alone rescues such a severe phenotype. One possibility is that PTTH levels are upregulated in Nup107-depleted animals, leading to overactivation of the pathway when Torso is overexpressed. Quantifying PTTH levels in Nup107-depleted animals could address this.

      The reviewer raised a valid point, and we fully acknowledge this concern. While we do not completely agree with the idea of PTTH upregulation in Nup107 depleted larvae, as suggested here, we believe that quantifying PTTH levels upon Nup107 depletion can provide a useful insight. To address it, we quantified PTTH levels in Nup107-depleted larvae and found no significant change in PTTH expression compared to controls (Author response image 4).

      Author response image 4.

      Nup107 knockdown does not affect PTTH levels. Quantitation of PTTH transcript levels from control and Nup107-depleted larvae (prothoracic gland-specific depletion using Phm-Gal4). Data are represented from at least three independent experiments. Statistical significance was derived from the Student's t-test. ns, non-significant.

      Another possibility is that the stock used for Torso overexpression, which includes a trk mutant, may introduce genetic interactions that overactivate the pathway. Using a clean UAS-Torso stock would resolve this issue.

      We appreciate the reviewer’s observation regarding the use of the Torso overexpression line (BL-92604), which carries the trk null allele on the second chromosome. The cleaved form of Trk serves as a ligand for the torso receptor. Because Trk acts as a ligand for torso, it is unclear to us how a line bearing the trk null allele would overactivate the pathway when used for torso overexpression studies.

      We recognized this concern, and the fly line used in this study and reported in the manuscript was generated from the BL-92604 line through the following genetic strategy. First, a double balancer stock (Sco/CyO; MKRS/TM6.Tb) was used to generate the Sco/CyO; UAS-torso/UAS-torso genotype. This recombinant line was subsequently combined with the Nup107<sup>KK</sup> line. Through the use of the double balancer strategy, we effectively replaced the Nup107 RNAi genotype on the second chromosome, thereby ensuring that our final experimental setup is free from trk mutant contamination, if any.

      Moreover, the rescue of Nup107 depletion phenotypes by RasV12 overexpression suggests that multiple RTKs, not just Torso, are affected. EGFR signaling, the primary regulator of ecdysone biosynthesis in the PG during the last larval stage, is notably absent from the authors' analysis. EGFR inactivation is known to arrest development, and previous studies indicate that Nup107 can reduce EGFR pathway activity (Kim et al, 2010). The authors should analyze EGFR pathway activity in the absence of Nup107. Overexpressing EGF ligands like Vein or Spitz in the PG (rather than the receptor) in a Nup107-depleted background would provide more relevant insights.

      The RasGTPase is one of the common effector molecules downstream of an activated receptor kinase. Rescue with a constitutively activated form of RasGTPase (RasV12) suggests one of the routes which is activated downstream of the torso receptor. It does not directly suggest all different RTKs are affected and are involved. Our idea of performing a rescue experiment was to see if the pathway activated downstream of the torso involves RasGTPase. 

      As noted in the literature, five RTKs (torso, InR, EGFR, Alk, and Pvr) stimulate the PI3K/Akt pathway, which plays a crucial role in the PG for controlling pupariation and body size (3). Although EGFR signaling is important, PTTH/Torso signaling is considered the primary mediator of metamorphic timing. In response to the suggestion to analyze EGFR pathway activity in the absence of Nup107, we attempted to rescue the phenotype by overexpressing constitutively active EGFR (BL-59843) in the Nup107-depleted background (data not shown). We used constitutively active EGFR to bypass the requirement for its ligands (vein and spitz). Unfortunately, we were unable to rescue the phenotype with this approach, which further suggests that EGFR is not the targeted RTK pathway in this context. By rescuing with torso, we found that Nup107 regulates torso-mediated Ras/Erk signaling to control metamorphosis.

      Additional issues require clarification:

      (1) RNAi Efficiency: In Figure 1C, the Nup107GD line shows a stronger knockdown effect than Nup107KK, yet most experiments were conducted with the weaker line. This might explain the residual Nup107 protein observed in Figure 2. Could the authors justify this choice?

      This is a very valid point, and we are aware of the consequences of off-target effects of RNAi. To confirm that the phenotypes reflect authentic RNAi and to minimize off-target effects, we used two RNAi lines (Nup107<sup>GD</sup> and Nup107<sup>KK</sup>) against Nup107. Both RNAi lines induced comparable levels of Nup107 reduction, and with both lines, ubiquitous and PG-specific knockdown produced similar phenotypes. Although the Nup107<sup>GD</sup> line exhibited a relatively stronger knockdown than the Nup107<sup>KK</sup> line, we preferentially used the Nup107<sup>KK</sup> line because the Nup107<sup>GD</sup> line is based on a P-element insertion whose exact landing site is unknown. Furthermore, there is a predicted off-target for the Nup107<sup>GD</sup> line, where a 19 bp sequence aligns with the bifocal (bif) sequence. The bif-encoded protein is involved in axon guidance and regulation of axon extension. In contrast, the Nup107<sup>KK</sup> line has no predicted off-target, and its precise landing site on the second chromosome is known. Thus, the Nup107<sup>KK</sup> line was ultimately used for experimentation because of its clearer and more reliable genetic background.

      (2) Control Comparisons: In Figure 3, the effects of Nup107 depletion on EcR expression in salivary glands (SG) and PG are shown, but only SG controls are provided. Including PG controls would enable proper comparisons. These controls should also be added to Figures 5, 6, and S5.

      As suggested by the reviewer, we have also checked EcR localization in the prothoracic gland (Author response image 5). When PGs isolated from control larvae, Nup107-RNAi larvae, and larvae overexpressing torso in the Nup107-RNAi background were stained for EcR, the observations were indistinguishable from those made in SGs of the same genetic combinations. This indicates that Nup107 regulates EcR signaling by regulating 20E biosynthesis.

      Author response image 5.

      Prothoracic gland-specific torso expression rescues EcR nuclear translocation defects. Immunofluorescence-based detection of the nucleocytoplasmic distribution of EcR (EcR antibody, red) in control, prothoracic gland-specific Nup107 knockdown (Phm-Gal4>Nup107<sup>KK</sup>), and torso-overexpressing PG-specific Nup107 knockdown (Phm-Gal4>Nup107<sup>KK</sup>; UAS-torso) third instar larval prothoracic gland nuclei. DNA is stained with DAPI. Scale bars, 20 μm.

      (3) Clarify the function of Torso in the text: The authors must revise their description of Torso signaling as the primary regulator of ecdysone production in both the results and discussion sections. Specifically, in the results section, the claim that Torso depletion induces developmental arrest is inaccurate. Instead, available evidence, including Rewitz et al. 2009, demonstrates that Torso depletion causes a delay of approximately five days rather than a complete developmental arrest. This discrepancy should be corrected to avoid overstating the role of Torso signaling in ecdysone regulation and to align the manuscript with established findings.

      We agree with the reviewer. We have incorporated the suggestion at the relevant place in the main manuscript.

      Reviewer #3 (Recommendations for the authors):

      These findings suggest that Nup107 is involved in regulating ecdysone signaling during developmental transitions, with depletion of Nup107 disrupting hormone-regulated processes. Moreover, the rescue experiments hint that Nup107 might directly influence EcR signaling and ecdysone biosynthesis, though the precise molecular mechanism remains unclear.

      Overall, the manuscript presents compelling data supporting Nup107's role in regulating developmental transitions. However, I have a few comments for consideration:

      Major Comments:

      RNAi Specificity: While RNAi is a powerful tool, the authors do not sufficiently address potential off-target effects, which could undermine the conclusions. Although a mutant Nup107 is described, it is lethal; are heterozygous or clonal studies possible to validate the findings more robustly?

      This is a very valid point, and we are aware of the consequences of off-target effects of RNAi. To confirm that the phenotypes reflect authentic RNAi and to minimize off-target effects, we used two RNAi lines (Nup107<sup>GD</sup> and Nup107<sup>KK</sup>) against Nup107. Both RNAi lines induced comparable levels of Nup107 reduction, and with both lines, ubiquitous and PG-specific knockdown produced similar phenotypes. Although the Nup107<sup>GD</sup> line exhibited a relatively stronger knockdown than the Nup107<sup>KK</sup> line, we preferentially used the Nup107<sup>KK</sup> line because the Nup107<sup>GD</sup> line is based on a P-element insertion whose exact landing site is unknown. Furthermore, there is a predicted off-target for the Nup107<sup>GD</sup> line, where a 19 bp sequence aligns with the bifocal (bif) sequence. The bif-encoded protein is involved in axon guidance and regulation of axon extension. In contrast, the Nup107<sup>KK</sup> line has no predicted off-target, and its precise landing site on the second chromosome is known. Thus, the Nup107<sup>KK</sup> line was ultimately used for experimentation because of its clearer and more reliable genetic background.

      Following the suggestion from the reviewer, we considered conducting heterozygous and clonal analyses using the Nup107 mutant. We have carried out Nup107 knockdown studies in the prothoracic gland, which has a limited number of cells (50-60 cells) and is known to exhibit polyteny (8). Keeping these aspects of the prothoracic gland in mind, the likelihood that a clonal study will yield the phenotype is low. However, we will also consider moving forward with this approach.

      (2) NPC Complex Specificity: It remains unclear whether the observed defects are specific to Nup107 or if other NPC components also cause similar defects. If the authors are unable to use Nup107 mutants, they could demonstrate similar defects with other critical NPC members to bolster their claim.

      We thank the reviewer for raising this concern. When working with a Nup complex such as the Nup107 complex, this concern is anticipated but difficult to address, as many Nups function beyond their complex identity. Our analysis of Nup153-depleted organisms indicates no developmental delay or defect. We have also assessed the effects of knocking down all other members of the Nup107-complex, including dELYS; except for Nup107, no other member of the Nup107-complex could induce developmental arrest at the third instar stage with a resulting lack of pupariation. However, the null mutant of Nup133, the direct interactor of Nup107 in the Nup107-complex, induces a delay in pupariation (unpublished data).

      (3) Molecular Mechanism of EcR Signaling: The manuscript shows that Nup107 depletion affects EcR signaling and ecdysone biosynthesis, but the molecular basis of this regulation is not fully explored. Does phosphorylated ERK (p-ERK) fail to enter the nucleus? Clarifying this mechanism would strengthen the study's impact.

      We appreciate the reviewer’s insightful comment and fully agree with the concern. To address this, we examined the subcellular localization of phosphorylated ERK (p-ERK) in the prothoracic gland of control larvae, Nup107-depleted larvae, and Nup107-depleted larvae with torso overexpression. In control larvae, p-ERK was predominantly localized in the nucleus. However, in Nup107-depleted larvae, p-ERK was largely retained in the cytoplasm, indicating impaired pathway activation and nuclear translocation. Notably, overexpression of the torso in the Nup107-depleted background restored nuclear localization of p-ERK in the prothoracic gland (Author response image 6). These findings suggest that Nup107 regulates Drosophila metamorphosis, in part, through modulation of torso-mediated MAPK signaling.

      Author response image 6.

      Nup107 regulates torso activation-dependent p-ERK localization. Detection of the nucleocytoplasmic distribution of p-ERK (anti-p-ERK antibody, green) in the third instar larval prothoracic glands of control, PG-specific Nup107 knockdown (Phm-Gal4>Nup107<sup>KK</sup>), and PG-specific torso overexpression in the Nup107 knockdown background (Phm-Gal4>Nup107<sup>KK</sup>; UAS-torso). DNA is stained with DAPI. Scale bars, 20 µm.

      Minor Comments:

      (1) The manuscript contains typographical errors that may hinder readability. Additionally, some phrases (e.g., "unequivocally demonstrate") may be overly strong. Consider adjusting language to reflect the nature of the data more accurately.

      We agree with the reviewer. We have edited the manuscript accordingly to iron out such typographical errors at the relevant places in the main manuscript.

      (2) The data presentation could be improved by eliminating redundancy. Some sections repeat similar findings in different tissues, which could be consolidated to improve clarity and flow.

      While we agree with the comment, we could not avoid tissue redundancy when presenting our data for the EcR translocation studies, and we wish we could have used another tissue. However, we have included EcR localization and p-ERK translocation data in the responses to present another, non-redundant tissue perspective (Author response images 5 and 6).

      References:

      (1) Varghese, Jishy, and Stephen M Cohen. “microRNA miR-14 acts to modulate a positive autoregulatory loop controlling steroid hormone signaling in Drosophila.” Genes & development vol. 21,18 (2007): 2277-82. doi:10.1101/gad.439807

      (2) Rewitz, Kim F et al. “The insect neuropeptide PTTH activates receptor tyrosine kinase torso to initiate metamorphosis.” Science (New York, N.Y.) vol. 326,5958 (2009): 1403-5. doi:10.1126/science.1176450

      (3) Pan, Xueyang, and Michael B O'Connor. “Coordination among multiple receptor tyrosine kinase signals controls Drosophila developmental timing and body size.” Cell reports vol. 36,9 (2021): 109644. doi:10.1016/j.celrep.2021.109644

      (4) Pascual-Garcia, Pau et al. “Metazoan Nuclear Pores Provide a Scaffold for Poised Genes and Mediate Induced Enhancer-Promoter Contacts.” Molecular cell vol. 66,1 (2017): 63-76.e6. doi:10.1016/j.molcel.2017.02.020

      (5) Pascual-Garcia, Pau et al. “Nup98-dependent transcriptional memory is established independently of transcription.” eLife vol. 11 e63404. 15 Mar. 2022, doi:10.7554/eLife.63404

      (6) Kadota, Shinichi et al. “Nucleoporin 153 links nuclear pore complex to chromatin architecture by mediating CTCF and cohesin binding.” Nature communications vol. 11,1 2606. 25 May. 2020, doi:10.1038/s41467-020-16394-3

      (7) Gozalo, Alejandro et al. “Core Components of the Nuclear Pore Bind Distinct States of Chromatin and Contribute to Polycomb Repression.” Molecular cell vol. 77,1 (2020): 67-81.e7. doi:10.1016/j.molcel.2019.10.017

      (8) Shimell, MaryJane, and Michael B O'Connor. “Endoreplication in the Drosophila melanogaster prothoracic gland is dispensable for the critical weight checkpoint.” microPublication biology vol. 2023 10.17912/micropub.biology.000741. 21 Feb. 2023, doi:10.17912/micropub.biology.000741

    1. Author response:

      The following is the authors’ response to the original reviews.

      We have responded to these criticisms below and have revised the main text and figures. Here, we outline the major points of our responses:

      (1) The reviewers asked for more clarification regarding cell type annotation in the lung mesenchyme as shown in Figure 3C. We have included a new supplementary figure (Supplementary Figure 2) which shows differentially expressed genes amongst these mesenchymal cell subsets using a variety of visualization tools including a heatmap, UMAP plots, and the dotplot which was originally shown in Supplementary Figure 1D. The other supplemental figures have been re-numbered.

      (2) We acknowledge the lack of consensus in the field regarding the nomenclature of fibroblast subsets in the developing mouse lung. We are not attempting to define new subsets; rather, we adopted annotations based on previously published work. Specifically, we used Seurat to define mesenchymal cell clusters and then compared the gene expression patterns of these clusters to published work by Hurskainen et al. (Bernard Thebaud’s group) and Narvaez Del Pilar et al. (Jichao Chen’s group). We acknowledge these annotations might conflict with other published data, but any approach to choosing a cell label would be subject to scrutiny. For example, Col13a1 fibroblasts share markers with cells which have been defined by others as lipofibroblasts or alveolar fibroblasts. Similarly, Col14a1 fibroblasts appear to share markers with matrix fibroblasts. Further work is clearly needed to address these discrepancies, and we hope that making our data publicly available will help that effort.

      (3) The reviewers asked us to interrogate changes in canonical markers of fibroblast subsets (i.e. lipofibroblasts, matrix fibroblasts) to address whether the apparent loss of myofibroblasts could be explained by a change in myofibroblast specification/differentiation. We have included these data in the responses, but because we are unable to draw any clear conclusions from these results, we do not feel these data warrant inclusion in the manuscript/figures.

      (4) As highlighted in the eLife assessment, our study does not include tissue validation (i.e. immunohistochemistry) of myofibroblast markers to distinguish whether the loss of myofibroblasts is attributable to lack of proliferation and/or changes in differentiation/specification. We spent considerable time over the past few months attempting to address these questions; however, we were unable to produce convincing PDGFRa staining on tissues that we had collected during our original studies. Without PDGFRa staining, we regretfully could not co-stain for other useful markers to assess proliferation (EdU), apoptosis (TUNEL or caspase), or fibroblast function/specification (ACTA2, SM22a/TAGLN, ADRP, etc). We suspect that these experiments would require optimization of tissue fixation/processing at the time of harvest or the inclusion of a Pdgfra lineage tool for better identification of these cells by immunohistochemistry. Given that the majority of Pdgfra lineage tools require a knock-in/knock-out approach, data generated using these tools should be interpreted with caution given our results here show that Pdgfra-haploinsufficiency alone worsens disease outcomes after hyperoxia exposure.

      In summary, we have addressed several concerns raised by the reviewers and have attempted to perform some of the additional experiments suggested.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, the authors used both the commonly used neonatal hyperoxia model as well as cell-type-specific genetic inactivation of Tgfbr2 models to study the basis of BPD. The bulk of the analyses focus on the mesenchymal cells. Results indicate impaired myofibroblast proliferation, resulting in decreased cell number. Inactivation of Ect2 in Pdgfra-lineaged cells, preventing cytokinesis of myofibroblasts, led to alveolar simplification. Together, the findings demonstrate that disrupted myofibroblast proliferation is a key contributor to BPD pathogenesis.

      Strengths:

      Overall, this comprehensive study of BPD models advances our understanding of the disease. The data are of high quality.

      Weaknesses:

      The critiques are mostly minor and can be addressed without extensive experimentation.

      Reviewer #2 (Public Review):

      Summary:

      In this study, the authors systematically explore the mechanism(s) of impaired postnatal lung development with relevance to BPD (bronchopulmonary dysplasia) in two murine models of 'alveolar simplification', namely hyperoxia and epithelial loss of TGFb signaling. The work presented here is of great importance, given the limited treatment options for a clinical entity frequently encountered in newborns with high morbidity and mortality that is still poorly understood, and the unclear role of TGFb signaling, its signaling levels, and its cellular effects during secondary alveolar septum formation, a lung structure generating event heavily impacted by BPD. The authors show that hyperoxia and epithelial TGFb signaling loss have similar detrimental effects on lung structure and mechanical properties (emphysema-like phenotype) and are associated with significantly decreased numbers of PDGFRa-expressing cells, the major cell pool responsible for generation of postnatal myofibroblasts. They then use a single-cell transcriptomic approach combined with pathway enrichment analysis for both models to elucidate common factors that affect alveologenesis. Using cell communication analysis (NicheNet) between epithelial cells and myofibroblasts, they confirm increased projected TGFb-TGFbR interactions and decreased projected interactions for PDGFA-PDGFRA and other key pathways, such as SHH and WNT. Based on these results, they go on to show in a series of experiments that, surprisingly, increased TGFb appears reactive to postnatal lung injury and rather protective/homeostatic in nature, and they establish the requirement for alpha V integrins, but not the subtype alphaVbeta6, a known activator of TGFb signaling implicated in adult lung fibrosis. The authors then go beyond the TGFb axis evaluation to show that mere inhibition of proliferation by conditional KO of Ect2 in the Pdgfra lineage results in alveolar simplification, pointing to the pivotal role of PDGFRa-expressing myofibroblasts in normal postnatal lung development.

      Strengths:

      (1) The approach, which includes both pharmacologic and mechanistically relevant transgenic interventions that produced consistent results, lends robustness to the findings presented here.

      (2) Further adding to this robustness is the use of moderate levels of hyperoxia at 75% FiO2, which is less extreme than 100% FiO2 frequently used by others in the field, and therefore favors the null hypothesis.

      (3) The prudent use of advanced single-cell analysis tools, such as NicheNet to establish cell interactions through the pathways they tested and the validation of their scRNA-seq results by analysis of two external datasets. Delineation of the complexity of signals between different cell types during normal and perturbed lung development, such as attempted successfully in this study, will yield further insights into the underlying mechanism(s).

      (4) The combined readout of lung morphometric (MLI) and lung physiologic parameters generates a clinically meaningful readout of lung structure and function.

      (5) The systematic evaluation of TGFb signaling better determines the role in normal and postnatally-injured lungs.

      Weaknesses:

      (1) While the study convincingly establishes the effect of lung injury on the proliferation of PDGFRa-expressing cells, differentiation is equally important. Characterization of PDGFRa expressing cells and tracking the changes in the injury models in the scRNA analysis, a key feature of this study, would benefit from expansion in this regard. PDGFRa lineage gives rise to several key fibroblast populations, including myofibroblasts, lipofibroblasts, and matrix-type fibroblasts (Collagen13a1, Collagen14a1). Lipofibroblasts constitute a significant fraction of PDGFRa+ cells, and expand in response to hyperoxic injury, as shown by others. Collagen13a1-expressing fibroblasts expand significantly under both conditions (Figure 3), and appear to contain a significant number of PDGFRa-expressing cells (Suppl Fig.1). Effects of the applied injuries on known differentiation markers for these populations should be documented. Another important aspect would be to evaluate whether the protective/homeostatic effect of TGFb signaling is supporting the differentiation of myofibroblasts. Postnatal Gli1 lineage gains expression of PDGFRa and differentiation markers, such as Acta2 (SMA) and Eln (Tropoelastin). Loss of PDGFRa expression was shown to alter Elastin and TGFb pathway-related genes. TGFb signaling is tightly linked to the ECM via LTBPs, Fibrillins, and Fibulins. An additional analysis in the aforementioned regard has great potential to more specifically identify the cell type(s) affected by the loss of TGFb signaling and allow analysis of their specific transcriptomic changes in response and underlying mechanism(s) to postnatal injury.

      We attempted to conduct additional analyses on our sequencing data to evaluate the impact of lung injury on the differentiation of Pdgfra-expressing cells towards other fibroblast lineages. To specifically address the impact of hyperoxia on fibroblast differentiation, we subset wildtype cells collected at the P7 timepoint (while pups were still undergoing hyperoxia treatment) from the larger data set. Shown below are several violin plots comparing gene expression between RA and O2 conditions across the mesenchymal populations.

      Although there are some interesting observations in this analysis, we could not identify a consistent theme from these data which could clearly answer the reviewers’ questions. We see a clear reduction of Pdgfra and Eln in both myofibroblast subsets with hyperoxia, which supports our findings of reductions in the myofibroblast subsets. Acta2 and Tagln appear slightly lower in alveolar myofibroblasts, but both are higher in ductal myofibroblasts. Interestingly, both Acta2 and Tagln are higher in Col14a1 fibroblasts with hyperoxia. The functional relevance of these data is unclear because there appears to be higher per-cell expression of Acta2 in ductal myofibroblasts while the relative contribution of these cells is reduced (Figure 3D-E). Col14a1 fibroblasts show increased Acta2 and Tagln expression and are slightly increased in proportion at P7 with hyperoxia treatment (Figure 3D), albeit to a much lesser degree compared to Col13a1 fibroblasts.
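
      The comparison described above (restrict to wildtype P7 cells, then compare a gene between RA and O2 within each mesenchymal cluster) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the record layout, the `mean_expression` helper, and every number are hypothetical and synthetic, and the actual analysis was performed in Seurat rather than with this code.

```python
from collections import defaultdict
from statistics import mean

# Synthetic per-cell records: metadata plus one gene's normalized expression.
# Every value here is invented for illustration; none comes from the study.
cells = [
    {"age": "P7", "genotype": "WT", "condition": "RA", "cluster": "Alveolar myoFB", "Pdgfra": 2.0},
    {"age": "P7", "genotype": "WT", "condition": "RA", "cluster": "Alveolar myoFB", "Pdgfra": 3.0},
    {"age": "P7", "genotype": "WT", "condition": "O2", "cluster": "Alveolar myoFB", "Pdgfra": 1.0},
    {"age": "P7", "genotype": "WT", "condition": "O2", "cluster": "Alveolar myoFB", "Pdgfra": 1.0},
    {"age": "P14", "genotype": "WT", "condition": "O2", "cluster": "Alveolar myoFB", "Pdgfra": 9.0},  # other timepoint
    {"age": "P7", "genotype": "cKO", "condition": "RA", "cluster": "Alveolar myoFB", "Pdgfra": 9.0},  # other genotype
]


def mean_expression(records, gene, age="P7", genotype="WT"):
    """Return mean expression of `gene` per (cluster, condition) within one subset."""
    grouped = defaultdict(list)
    for cell in records:
        # Keep only cells from the requested timepoint and genotype.
        if cell["age"] == age and cell["genotype"] == genotype:
            grouped[(cell["cluster"], cell["condition"])].append(cell[gene])
    return {key: mean(values) for key, values in grouped.items()}


summary = mean_expression(cells, "Pdgfra")
for (cluster, condition), value in sorted(summary.items()):
    print(f"{cluster:15s} {condition}: {value:.2f}")
```

      A per-cluster mean is a much coarser summary than a violin plot, which shows the full distribution; the sketch is meant only to make the subsetting and grouping steps explicit.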

      Author response image 1.

      Markers of ductal myofibroblasts including Hhip, Cdh4, and Aspn all appear lower with hyperoxia. Interestingly, Plin2 expression is only slightly increased in Col13a1 fibroblasts with hyperoxia treatment, and there is also increased expression in alveolar myofibroblasts. Tcf21 is another marker commonly used to identify lipofibroblasts and its expression is similarly increased in myofibroblasts during hyperoxia, although its expression is conversely lower in Col13a1 and Col14a1 fibroblasts in our data. Overall, these data would appear consistent with recently published data by Riccetti et al. in which the authors observed an increase in lipofibroblast gene signatures and reduced myofibroblast gene signatures with hyperoxia treatment.

      Author response image 2.

      Author response image 3.

      The ability of our data to clearly identify changes in cell fate differentiation is limited by our use of Seurat to define cell clusters because these methods are likely to mask subtle gene expression changes in a small number of cells nested within a parent cluster. In the example above with Plin2, the change in Plin2 expression within myofibroblasts is not significant enough for Seurat to pull these cells out from their parent clusters to define a different lineage, nor are these cells similar enough in their current moment in time to be considered Col13a1 fibroblasts or lipofibroblasts. Increasing the dimensions used to define Seurat clusters might be sufficient to identify this subset of cells as a distinct cluster; however, this approach would come at the expense of creating several more cell subsets with increasingly small populations which would be difficult to further analyze.

      One alternative approach to address these questions regarding differentiation might include using pseudo-time analysis of our sequencing data to predict cell lineage. Unfortunately, these analyses are beyond the scope of our current study, but we hope that our public data set can be used by investigators hoping to utilize this approach. Another method to address these questions could utilize a pulse-chase lineage experiment where one could label Pdgfra-expressing cells at the onset of injury and compare the differentiation of these labeled cells following injury. Li et al. conducted a similar experiment with hyperoxia in which Pdgfra-expressing cells were labeled during embryonic development and then postnatally following hyperoxia exposure. The authors noted a decrease in both lineaged myofibroblasts and lineaged lipofibroblasts and concluded that Pdgfra-lineaged cells were lost with hyperoxia treatment rather than undergoing aberrant differentiation. While these experiments likely have their own caveats related to the timing and efficiency of labeling, they represent a more conclusive approach to addressing differences in cell specification as compared to our sequencing- and flow cytometry-based approaches.

      Author response image 4.

      Author response image 5.

      (2) Of the three major lung abnormalities encountered in BPD, the authors focus on alveolarization impairment in great detail, to a very limited extent on inflammation, and not on vascularization impairment. However, this would be important not only to better capture the established pathohistologic abnormalities of BPD, but also it is needed since the authors alter TGFb signaling, and inflammatory and vascular phenotypes with developmental loss of TGFb signaling and its activators have been described. Since the authors make the point about the absence of inflammation in their BPD model, it will be important to show the evidence.

      We acknowledge that vascular changes significantly contribute to BPD pathogenesis; however, our study was not designed to adequately characterize changes in vascular/endothelial cells. We were motivated to focus on the lung mesenchyme after observing a dramatic loss of PDGFRa+ cells with our initial characterization of the hyperoxia injury model (Figure 2). At the onset of our study, the existing publicly available data did not contain enough mesenchymal cells for in-depth analysis. To generate new observations and hypotheses within the lung mesenchyme, we enriched our single cell prep for mesenchymal cells at the time of FACS-sorting to ensure we would have sufficient cell numbers for downstream analysis.

      (3) Conceptually it would be important that in the discussion the authors reconcile their findings in the experimental BPD models in light of human BPD and the potential implications it might have on new ways to target key pathways and cell types for treatment. This allows the scientific community to formulate the next set of questions in a disease-relevant manner.

      We have edited text in the discussion to address this point.

      Reviewer #3 (Public Review):

      Summary:

      This paper seeks to understand the role of alveolar myofibroblasts in abnormal lung development after saccular stage injury.

      Strengths:

      Multiple models of neonatal injury are used, including hyperoxia and transgenic models that target alveolar myofibroblasts.

      Weaknesses:

      There are several weaknesses that leave the conclusions significantly undersupported by the data as presented:

      (1) There is no validation of the decreased number of myofibroblasts suggested by flow cytometry/scRNAseq at the level of the tissue. Given that multiple groups have reported increased myofibroblasts (aSMA+ fibroblasts) in humans with BPD and in mouse models, demonstrating a departure from prior findings with tissue validation in the mouse models is essential. There are many reasons for decreased numbers of a subpopulation by flow cytometry, most notably that injured cells may be less likely to survive the cell sorting process.

      Unfortunately, we were unable to produce convincing PDGFRa staining on tissues that we had collected during our original studies. Without PDGFRa staining, we regretfully could not co-stain for other useful markers to assess proliferation (EdU), apoptosis (TUNEL or caspase), or fibroblast function/specification (aSMA/ACTA2, SM22a/TAGLN, ADRP, etc). We suspect that these experiments would require optimization of tissue fixation/processing at the time of harvest or the inclusion of a Pdgfra lineage tool for better identification of these cells by immunohistochemistry. Given that the majority of Pdgfra lineage tools require a knock-in/knock-out approach, data generated using these tools should be interpreted with caution given our results here show that Pdgfra-haploinsufficiency alone worsens disease outcomes after hyperoxia exposure.

      Our single-cell data show increased expression of Acta2 and Tagln in several populations (see the plots above), which might be consistent with the increased aSMA staining that others have observed in these settings. Interestingly, the transcripts of both genes are reduced in alveolar myofibroblasts while increased in ductal myofibroblasts, Col13a1 fibroblasts, Col14a1 fibroblasts, and vascular smooth muscle. We did not include aSMA antibody staining in our flow cytometry experiments, but this would certainly add value to future attempts to characterize the phenotypic changes occurring during these injury models.

      (2) The hallmark genes used to define the subpopulations are not given in single-cell data. As the definition of fibroblast subtypes remains an area of unsettled discussion in the field, it is possible that the decreased number is due to classification and not a true difference. Tissue validation and more transparency in the methods used for single-cell sequencing would be critical here.

      See response above and new Supplemental Figure 2.

      (3) There is an oversimplification of neonatal hyperoxia as a "BPD model" used here without a reference to detailed prior work demonstrating that the degree and duration of hyperoxia dramatically change the phenotype. For example, Morty et al. have shown that hyperoxia of 85% or more for 14 days is required to demonstrate the septal thickening observed in severe human BPD. Other than one metric of lung morphometry (MLI, which is missing units on the y-axis) and flexiVent data, the authors have not fully characterized this model. Prior work comparing 75% O2 exposure for 5, 8, or 14 days shows that in the 8-day exposed group (similar to the model used here), much of the injury was reversible. What evidence do the authors have that hyperoxia alone is an accurate model of the permanent structural injury seen in human BPD?

      At the onset of our studies, we noted that several groups were using widely variable protocols ranging from 60-100% O2 exposure. Morty et al. have indeed conducted thorough experiments to characterize various hyperoxia exposure protocols. In their 2017 study (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5312005/) they showed that 85% O2 from P1-P7 was sufficient to produce increased septal thickness compared to control mice, and this change was comparable to P1-P14 exposure with 85% O2. Interestingly, they also noted that some therapeutic interventions could rescue disease caused by 60% O2 but not 85% O2 exposure. Our criteria in choosing a treatment protocol were: (1) nursing dams and pups survived hyperoxia exposure, (2) injury was reproducible across cohorts, and (3) injury was not reversible simply by recovering in room air. We found that recent work utilizing 75% O2 exposure was sufficient to cause the alveolar simplification phenotype which we sought to investigate. In our hands, we did not observe mortality of nursing dams or pups except for litters lost to cannibalism/failure of cross-fostering.

      We are confident that the injury caused by our hyperoxia protocol is not reversible simply by recovering mice in room air. Several groups have phenotyped mice at P4, P10, or P14 immediately following the conclusion of hyperoxia treatment. To ensure that we were studying a lasting, irreversible phenotype, we conducted our endpoint studies (morphometry and lung physiology) at P40. Because mice continue to undergo alveolarization until ~P36-P39, we reasoned that this additional recovery time following cessation of hyperoxia would allow for spontaneous recovery if this injury were transient. Additionally, shown below are unpublished flexiVent data in which mice were treated for 10 days with 75% O2 and recovered until analysis at 10 weeks of age. These results are entirely consistent with the flexiVent data we have included in the manuscript, and the persistence of lung physiologic changes in adult mice suggests the presence of permanent underlying structural changes. We did not conduct morphometry/MLI studies at later timepoints, but we have no reason to suspect a different outcome given the clear results from lung physiology.

      Author response image 6.

      (4) Thebaud et al. published a single-cell analysis of neonatal hyperoxia in 2021, with seemingly contrasting findings. How does this dataset compare in context?

      Our data are complementary to the single-cell analysis published by Thebaud et al. We included a re-analysis of their mesenchymal data in Supplementary Figure 2 which shows they also observed a relative decrease in myofibroblast clusters at the P7 and P14 timepoints following hyperoxia treatment. Figure 4 of their paper highlights the top differentially expressed genes between RA and O2 in Col13a1 FB and myofibroblasts, and we observe nearly identical findings in our data set within each of these clusters. Below we have created dotplots of P7 wildtype samples for the same selected genes shown in Figure 4G of the Thebaud et al. paper. It is important to note that their clustering pooled all myofibroblasts into one cluster, while our data are divided into alveolar myofibroblasts and ductal myofibroblasts. The other difference is that their data set includes all timepoints (P3, P7, and P14) pooled for display, while the plot we selected for simplicity here shows only P7 cells. From these data we can see that the general trends are identical to those observed by Thebaud et al., and the differences in genes such as Acta2 can be accounted for by the different changes observed in the different myofibroblast clusters. This matches what is shown in the violin plots above, namely that Acta2 is reduced in hyperoxia in alveolar myofibroblasts while increased in ductal myofibroblasts.

      Author response image 7.

      Alveolar myoFB

      Author response image 8.

      Ductal myoFB

      One difference between our two datasets is the relative contribution of myofibroblast and Col13a1 fibroblasts to the entire mesenchymal population of cells. Over 50% of all mesenchymal cells in our preps consist of myofibroblasts, while most of their mesenchymal cells are Col13a1 fibroblasts. These differences are likely accounted for by differences in tissue digestion and cell preparation protocols. However, despite these differences, their data show the same trends of decreased myofibroblasts and a relative expansion in Col13a1 fibroblasts.
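
      Because these contributions are fractions of the mesenchymal pool, a relative expansion of one cluster can arise partly from loss of another, which is one reason to read such shifts alongside absolute counts. The short Python sketch below makes that arithmetic explicit. It is illustrative only: the `cluster_fractions` helper and all counts are hypothetical, chosen to show the effect, and are not values from either dataset.

```python
from collections import Counter

# Synthetic cluster labels for mesenchymal cells in each condition.
# Counts are invented for illustration; none comes from the study.
labels_by_condition = {
    "RA": ["myoFB"] * 6 + ["Col13a1 FB"] * 4,
    "O2": ["myoFB"] * 2 + ["Col13a1 FB"] * 4,
}


def cluster_fractions(labels):
    """Return each cluster's share of all cells in `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cluster: n / total for cluster, n in counts.items()}


fractions = {cond: cluster_fractions(labels) for cond, labels in labels_by_condition.items()}
for cond, shares in fractions.items():
    for cluster, share in shares.items():
        print(f"{cond} {cluster}: {share:.2f}")
```

      In this toy example the number of Col13a1 FB cells is unchanged between conditions, yet their share rises because the myoFB count falls.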

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Figure 1, for the hyperoxia model, it is informative to have the analysis done at P40, while most of the previous studies using this model focus on outcomes shortly after the end of the hyperoxia regimen. The authors state "we did not see evidence of fibrosis, scarring, or inflammation." It will be helpful to include data supporting this conclusion, especially ACTA2, CTHRC1, and CD45 staining.

      We did not conduct trichrome staining or hydroxyproline assays to quantify the absence of fibrotic changes because there were no gross histologic changes consistent with scarring or fibrosis by H&E staining. We have amended the text to say “we did not see evidence of fibrosis or scarring” since we did not include any data characterizing the immune cell compartment.

      (2) Figure 3, single cell analysis, naming of the clusters is confusing. Is "alveolar myofibroblasts" the same as "secondary crest myofibroblasts"? Is "Col13a1 FB" the same as "alveolar fibroblasts" and "Col14a1 FB" the same as "adventitial fibroblasts"? The loss of myofibroblasts is intriguing because, by staining, there is an increase of ACTA2+ cells. Are ACTA2+ cells not myofibroblasts in scRNAseq data?

      As mentioned in responses above, we used Jichao Chen’s nomenclature of “alveolar myofibroblasts” and “ductal myofibroblasts”, but we agree that the former cluster is most consistent with “secondary crest myofibroblasts”. To distinguish the two remaining clusters of fibroblasts we used the same nomenclature as found in Thebaud et al.’s single cell data set: “Col13a1 FB” and “Col14a1 FB”. The Col13a1 FB cluster is most consistent with “alveolar fibroblasts” and contains high expression of several genes used to define “lipofibroblasts”, though it is unclear whether the latter may represent a subcluster within the Col13a1 FB cluster.

      As shown above, Acta2 is expressed broadly within the lung mesenchyme with highest levels found in myofibroblasts and smooth muscle cells.

      (3) Phosphorylated SMAD2/3 staining (e.g. Cell Signaling antibody) in the two models will be informative to show where TGF signaling activity is altered.

      We have not been successful in using SMAD2/3 staining to infer changes in TGFb signaling at the resolution needed to address this question. Other groups have shown qPCR and western blot data for SMAD2/3 signaling from whole lung extracts, but these approaches lack cell-type specificity and do not address spatial changes. We attempted to incorporate pSMAD2/3 staining into our flow cytometry experiments, but the staining protocol did not work in our hands.

      (4) Is cell death increased in the multiple models that showed simplification?

      While our EdU experiments address proliferation, we were unable to perform PDGFRa and TUNEL/caspase co-staining by histology to address apoptosis/cell death in our different models. Shown here are data from P7 wildtype mice in which Cdkn1a (which promotes cell cycle arrest) and the pro-apoptotic genes Bax, Bak1, and Fas are all upregulated in hyperoxia in several mesenchymal cell populations including myofibroblasts.

      Author response image 9.

      (5) Wording: "These data suggest that avb6 does not play a role in TGFb activation during normal development or neonatal hyperoxia, while av-integrins in the lung mesenchyme are required for normal development and play a protective role in response to hyperoxia." The first half of the sentence is missing a reference to the epithelium.

      Text now reads “These data suggest that epithelial avb6 does not play a role…”

      Reviewer #2 (Recommendations For The Authors):

      The reviewer greatly appreciates the work presented here, especially the hard task of addressing combined signaling pathway input into key mesenchymal cell types during an essential expansion of alveolar surface area in postnatal lung and its effect upon disturbance.

      The issues of concern are mentioned in the public review and are expanded upon below:

      (1) Expanded characterization of PDGFRa+ expressing cells in the scRNA dataset is needed (see public review). Also included should be some of the key myofibroblast genes (elastin, Acta2, etc.) and their changes in the relevant cell populations. It would be important to show (at least at the transcriptional level) that myofibroblast differentiation is impaired if the author claims that the alveolarization defect is due to functional myofibroblast impairment. Furthermore, Ect2 expression and changes with treatments should be shown for the different cell populations (relevant to Figure 9).

      See responses above

      (2) The authors stated that they did not find evidence of fibrosis, scarring, and inflammation, but did not provide data to support this statement. Given the importance of at least the inflammation component in BPD, the absence of inflammation needs to be shown, especially in the model using the TGFBR2-cKO mouse, where at least their data show a trend to increased CD45 cell numbers (Figure 2), and upregulated inflammatory upstream regulators (IL10, IFNa, IKBKB, CEBPB upregulated) in the IPA (Figure 3). BAL and/or tissue by flow or IHC have been used to assess different immune cell populations. In terms of evaluation of vascular impairment, the single-cell data set contains endothelial cells, vascular smooth muscle, and pericytes, which allows interrogation following the two different types of injury (hyperoxia, cKO TGFbR2) used for the scRNA-seq experiments.

      A full characterization of the immune cell or vascular/endothelial cell compartment within our models is beyond the scope of this current study as we were focusing on the shared changes observed within the lung mesenchyme. None of these compartments exist in isolation, so of course there are likely to be correlative and/or causative changes observed in each of the different models which we studied. We did consider further phenotypic analysis of the immune cells by flow cytometry within our different models, but deferred these experiments for future studies. As mentioned earlier we have omitted the reference to “no inflammation”.

      (3) The authors should report several litters per experiment and experimental group, mortality in the groups, and if present, visualize using e.g. Kaplan-Meier curves. The switch of the mothers during treatment, the early postnatal injections and treatments, and variability in outcome measures between different litters have to be anticipated. Therefore, at least 2 litters, but preferably 3 litters per experiment should be examined, to show reproducibility.

      All experiments were conducted with at least 2-3 contemporaneous litters in each treatment group as this was necessary to have enough animals per treatment condition/group to achieve statistical significance. This was essential as all experiments were conducted on the C57BL/6 background where litter sizes are typically 6-8 pups in our colony. We did not encounter any maternal mortality related to hyperoxia exposure while rotating between hyperoxia and normoxia every 48 hrs. Loss of pups in our experiments was mostly due to cannibalism either immediately after birth or from neglect due to failure of cross-fostering.

      (4) The reviewer is concerned about using PBS as a control for experiments involving antibody treatment, in this case, 1D11. The use of an isotype IgG would be the most appropriate and convincing control. In this case, an isotype-matched murine IgG1 control (13C4) has already been generated and is commercially available. While the reviewer does not suggest repeating all experiments, at least one small experiment showing that control IgG does not alter the lung phenotype with hyperoxia when compared with 1D11 would be important.

      We appreciate the reviewer’s suggestion and will consider an isotype antibody comparison in future studies. While not directly comparing 1D11 to isotype, we can share data in which we compared PBS to a different antibody. In this experiment, we attempted to use antibody blockade during the first 10 days of life while mice were undergoing hyperoxia treatment to target a specific component of the TGFb pathway. We observed no difference in outcomes either in RA or O2 when comparing PBS to xxx antibody. We cannot share the antibody identity due to intellectual property reasons; however, additional studies confirmed that this antibody likely had no impact due to poor in vivo blocking activity.

      Author response image 10.

      (5) While inhibited proliferation is one possible explanation for the decrease of PDGFRa expression in the injured mice, there should be consideration of increased and/or premature apoptosis (before the physiologically observed wave P14-P20) as another reason. Also, do the authors propose that only proliferation results in alveolarization impairment, but differentiation plays no significant role here? If that is the case, that would mean that there are some fully-differentiated myofibroblasts in the alveolar septa, but not enough to create the multitude of alveolar septal walls. Have the authors evaluated the decrease in secondary alveolar septa formed per alveolar airspace? This measure would give some sense of whether septum initiation was prevented or whether septa were formed, but are structurally abnormal, e.g. due to altered ECM (suspected decrease in Elastin and SMA expression, if myofibroblast differentiation was impaired) or cell content (suspected decrease in myofibroblasts and increase of other cell types, such as lipofibroblasts).

      Apoptosis/cell death is likely to play a role in addition to inhibited proliferation. See violin plots shown above with cell cycle arrest and pro-apoptotic genes upregulated within the mesenchyme. Because we were unable to optimize tissue sections/staining with the samples collected during the early time points of our experiments (i.e., P4, P7, P10, P14), we are unable to co-stain for markers of apoptosis and answer this question in a direct manner. Future experiments will focus on additional characterization of these early changes with particular attention to altered fibroblast phenotypes within the alveolar septa.

      (6) An illustration depicting key cells and the pathways involved in cartoon format would be a useful addition and visualize the important conclusions of this paper for the reader.

      We appreciate this suggestion but think the results are sufficiently straightforward that a summary cartoon would not add much.

      Figure 4A: the legend appears to be switched. The gray square seems to align with the epithelial ligands, while the blue square aligns with receptors.

      Thank you for identifying this mistake – fixed.

      Names of transgenic lines used through manuscript:

      Please use the correct name, as per JAX would be either Gli1tm3(cre/ERT2)Alj/J or Gli1-CreERT2.

      Please use the correct name, as per JAX would be either Pdgfratm1.1(cre/ERT2)Blh/J or Pdgfrα-CreERT2.

      PDGFRa-CRE would be JAX# 013148.

      The transgenic lines have been noted in the methods, and we have edited the text of the manuscript to reflect the correct names of these lines. For Supplementary Figure 4, which compares Gli1-CreERT2 to Pdgfrα-CreERT2, we left our prior nomenclature intact because it better reflects that each of these lines is haploinsufficient at its targeted locus, and that the controls are cre-negative littermates.

      We did not use the PDGFRa-CRE line (JAX# 013148).

      Reviewer #3 (Recommendations For The Authors):

      - More transparency about the single-cell analysis is required: 1) how are cell types and clusters defined? 2) what strategy was used for ambient RNA? 3) how do the controls compare with recently published mouse developmental datasets? 4) how does this model compare with the single-cell dataset published by Thibeault et al in 2021 (neonatal hyperoxia x 14 days with multiple time points used)?

      See responses above.

      - Tissue level validation of these findings is essential by RNA ISH or IF. While validation that the same process is at play in human tissue would be ideal, if this is not available, the conclusions must be tempered in the discussion.

      See responses above.

      - Is this milder neonatal injury reversible in mice? As noted above, more characterization of this model (and placing it in the context of other more widely published models) would be helpful.

      See responses above.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The innate immune system serves as the first line of defense against invading pathogens. Four major immune-specific modules - the Toll pathway, the Imd pathway, melanization, and phagocytosis- play critical roles in orchestrating the immune response. Traditionally, most studies have focused on the function of individual modules in isolation. However, in recent years, it has become increasingly evident that effective immune defense requires intricate interactions among these pathways. 

      Despite this growing recognition, the precise roles, timing, and interconnections of these immune modules remain poorly understood. Moreover, addressing these questions represents a major scientific undertaking. 

      Strengths: 

      In this manuscript, Ryckebusch et al. systematically evaluate both the individual and combined contributions of these four immune modules to host defense against a range of pathogens. Their findings significantly enhance our understanding of the layered architecture of innate immunity. 

      We thank the reviewer for their kind assessment.

      Weaknesses: 

      While I have no critical concerns regarding the study, I do have several suggestions to offer that may help further strengthen the manuscript. These include: 

      (1) Have the authors validated the efficiency of the mutants used in this study? It would be helpful to include supporting data or references confirming that the mutations effectively disrupted the intended immune pathways. 

      We have done so in Figure 1.

      (2) Given the extensive use of double, triple, and quadruple mutants, a more detailed description of the mutant construction process is warranted. 

      We now provide a supplement (File S1) that details the successive genetic crosses and recombinations that were required to generate these compound fly stocks carrying multiple mutations. We also provide some information regarding rapid screening of stocks for phenotypes. Of note, some of these fly stocks have been deposited at VDRC, as they will be useful to the fly community to assess immune modules in a controlled background, and complete stock information will be tied to these stocks there.

      Reviewer #2 (Public review): 

      Summary: 

      In this work, the authors take a holistic view of Drosophila immunity by selecting four major components of fly immunity often studied separately (Toll signaling, Imd signaling, phagocytosis, and melanization), and studying their combinatory effects on the efficiency of the immune response. They achieve this by using fly lines mutant for one of these components, or modules, as well as for a combination of them, and testing the survival of these flies upon infection with a plethora of pathogens (bacterial, viral, and fungal). 

      Strengths: 

      It is clear that this manuscript has required a large amount of hands-on work, considering the number of pathogens, mutations, and timepoints tested. In my opinion, this work is a very welcome addition to the literature on fly immune responses, which obviously do not occur in one type of response at a time, but in parallel, subsequently, and/or are interconnected. I find that the major strength of this work is the overall concept, which is made possible by the mutations designed to target the specific immune function of each module (at least seemingly) without major effects on other functions. I believe that the combinatory mutants will be of use for the fly community and enable further studies of the interplay of these components of immune response in various settings. 

      To control for the effects arising from the genetic variation other than the intended mutations, the mutants have been backcrossed into a widely used, isogenized Drosophila strain called w1118. Therefore, the differences accounted for by the genotype are controlled. 

      I also appreciate that the authors have investigated the two possible ways of dealing with an infection: tolerance and resistance, and how the modules play into those. 

      We thank the reviewer for their kind assessment. 

      Weaknesses: 

      While controlling for the background effects is vital, the w1118 background is problematic (an issue not limited to this manuscript) because of the wide effects of the white mutation on several phenotypes (also other than eye color/eyesight). It is a possibility that the mutation influences the functionality of the immune response components, for example, via effects of the faulty tryptophan handling on the metabolism of the animal. 

      I acknowledge that it is not reasonable to ask for data in different backgrounds better representing a "wild type" fly (however, that is defined is another question), but I think this matter should be brought up and discussed. 

      We agree with the reviewer and have included caveats on the different genetic effects brought about by the combinatory mutant approach, including differences in white gene status, insertion of GFP or DsRed markers, and nature of genetic mutations (Line 142-on).

      “Of note, the strains used in this study differ in their presence/absence of the white<sup>+</sup> gene, present in the PPO1<sup>∆</sup>, NimC1<sup>1</sup> and eater<sup>1</sup> mutations. In addition to its well established function in eye pigmentation, the white gene can also impact host neurology and intestinal stem cell proliferation (Ferreiro et al., 2017; Sasaki et al., 2021). We did not observe any obvious correlations between white<sup>+</sup> gene status and susceptibilities in this study. Moreover, in a previous study looking at the cumulative effects of AMP mutations on lifespan, white gene status and fluorescent markers did not readily explain differences in longevity (Hanson and Lemaitre, 2023). We therefore believe that the extreme immune susceptibility we have created through deficiencies for pathways regulating hundreds of genes, or major immune modules, overwhelms the potential effects of white<sup>+</sup> and other transgenic markers. For additional information on which stocks bear which markers, see discussion in Supplementary file 1.”

      Of interest, we were highly conscious of this concern in working with combinatory AMP mutants which differed in white, GFP, and DsRed copies. However, even over the many weeks of snowballing effects on microbiota community composition and structure, we found no trends tied strictly to white+ or to other genetic insertions on lifespan (Hanson and Lemaitre, 2023; DMM).

      The whole study has been conducted on male flies. Immune responses show quite extensive sex-specific variation across a variety of species studied, also in the fly. But the reasons for this variation are not fully understood. Therefore, I suggest that the authors conduct a subset of experiments on female flies to see if the findings apply to both sexes, especially the infection-specificity of the module combinations.  

      We thank the reviewer for this suggestion. We have performed the requested experiments, and include female survival trends in Figure 4supp1. We have added the following text to the main manuscript (Line 554):

      “All survival experiments to this point were done with males. We therefore assessed key survival trends for these infections in females to learn whether the dynamics we observed were consistent across sexes (Figure 4supp1). For all three pathogens (Pr rettgeri, Sa aureus, C. albicans) the rank order of susceptibility was broadly similar between males and females, with higher rates of mortality in females overall. Thus, we found no marked sex-by-genotype interaction. Interestingly, the greater susceptibility of females in our hands is true even for ∆ITPM flies, although there are only a few surviving flies on which we can base these conclusions. However, these data may suggest the sexual dimorphism in defense against infection that we see against these pathogens is due to factors independent of the immune modules we disrupted.”

      It is worth noting that male-female sex dichotomies in infection are inconsistent across the literature, with strong lab-specific effects (Belmonte et al., 2020 and personal observation). In our lab setting, we consistently see female mortality higher than males when compared, independent of pathogen and mutant background. We have not seen notable interaction terms of sex and genotype for most immune deficient mutants. It is quite interesting to have done these experiments with ITPM, however, which reveals that there is at least a trend suggesting this dichotomy is independent of the four immune modules we deleted. Still, our infection conditions kill most males, and so it would be good to replicate this sex-specific ∆ITPM result in a dedicated study with doses chosen to improve the resolution of male-female differences. For now, we prefer to use conservative language and avoid overinterpreting this trend, but do feel it merits mentioning.  

      Recommendations for the authors:

      Comment on statistical requests

      Both reviewers requested further clarity on the statistical analyses supplemental to Figure 3. We have addressed these comments as follows.

      First, we now provide an additional supplementary .zip file containing summary statistics for all survival data in Figure 3 (Supplementary File 3). We have additionally added this text to line 226 to make this data treatment more clear:

      “…we chose to focus on major differences apparent in summary statistics, highlighting…”

      And we highlight that all survival data are also provided as Kaplan-Meier survival curves in the main or supplementary figures in Line 233:

      “Kaplan-Meier survival curves for all experiments are provided in the main text or supplementary information”.

      Second, as outlined in the main text, we were unable to sample across all pathogen-by-genotype interactions systematically, and this unfortunately obfuscates robust statistical modelling. We addressed the challenge of finding meaningful statistical differences by focusing on trends only if they were i) consistent across experimental replicates, ii) of a consistent logic across comparable genotypes, ensuring random inter-experimental noise was not unduly shaping interpretations, and iii) of a mean lifespan difference ≥1.0 days compared to wild-type, and compared to relevant unchallenged or clean-injury controls. This last choice was especially important because not all experimental replicates included all genotypes due to challenges of animal husbandry and coordination among multiple researchers over five years of data collection. As a result, our initial analyses using a Cox mixed-effects model found it to be rather useless, being insensitive to important experiment batch effects visible to the eye because statistically-affected genotypes were not present in all experiments.

      We therefore ensured that behaviour relative to controls within experiments was consistent, rather than the comparison of genotypes to controls across the sum of experiments with a post-hoc treatment attempting to apportion variance to experiment batch (but unable to do so for some genotypes and some batches). Due to differences in baseline health and the dynamics explained by studies like Duneau et al. (2017; eLife), there is an expected unequal variance of genotype*pathogen interactions across experiment batches. Unfortunately, this unequal variance, coupled with incomplete sampling across experiment batches, means “highly significant” differences can emerge that don’t hold up to scrutiny of comparisons to controls taken only from within an experiment batch. Thus, we chose to forego a Cox mixed-effects model approach entirely. Instead, our highly conservative approach, focusing on only very large effects with a mean lifespan difference ≥1.0 days, mitigates these issues. We have taken great care to ensure that any results we highlight stand up to inter-experiment batch effects. We would further draw the reviewers’ attention to our response to Reviewer 2 relating to Figure 3, which emphasizes the level of conservatism that we are applying.
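
      As an illustration only (this is not the analysis code used in the study), the highlighting rule described above can be sketched in Python. The data layout, the genotype names mutA and mutB, the lifespan values, and the two-replicate minimum are all hypothetical assumptions made for the example; only the ≥1.0 day threshold and the within-experiment comparison to controls come from the text. Criterion ii (consistent logic across comparable genotypes) requires biological judgement and is not modelled here.

```python
from statistics import mean

THRESHOLD_DAYS = 1.0  # minimum mean-lifespan difference, as stated in the text


def within_experiment_differences(experiments, genotype, control="WT"):
    """Return control-minus-genotype mean lifespan differences, one per experiment.

    `experiments` is a list of dicts mapping genotype name -> mean lifespan (days).
    Experiments lacking either the genotype or the control are skipped, so every
    comparison uses a control taken from the same experiment batch.
    """
    return [
        exp[control] - exp[genotype]
        for exp in experiments
        if control in exp and genotype in exp
    ]


def is_highlighted(experiments, genotype, control="WT", threshold=THRESHOLD_DAYS):
    """Flag a genotype only if the effect has the same direction in every
    replicate and the average within-experiment difference is >= threshold."""
    diffs = within_experiment_differences(experiments, genotype, control)
    if len(diffs) < 2:  # assumption: require at least two replicates
        return False
    same_direction = all(d > 0 for d in diffs) or all(d < 0 for d in diffs)
    return same_direction and abs(mean(diffs)) >= threshold


# Hypothetical mean lifespans (days); not every genotype is in every experiment.
experiments = [
    {"WT": 6.7, "mutA": 6.4},
    {"WT": 6.8, "mutA": 6.3, "mutB": 3.1},
    {"WT": 6.6, "mutB": 2.9},
]

print(is_highlighted(experiments, "mutA"))  # small, consistent effect
print(is_highlighted(experiments, "mutB"))  # large, consistent effect
```

      In this sketch a small but consistent difference (mutA) is not flagged, whereas a large and consistent one (mutB) is, which mirrors the conservative intent of the rule without claiming to reproduce the actual analysis.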

      At the end of the Discussion, we have added the following sentence to emphasize these limitations:

      “…a combinatorial mutation approach to deciphering immune function can be extended even to the broad level of whole immune modules. Of note, we were unable to systematically sample all genotype-by-pathogen interactions equally. We have therefore been highly conservative in our reporting of major effects. There are likely many important interactions not discussed in our study. Future investigations may highlight important biology that is apparent in our data, but which we may not have mentioned here. To this end, we have deposited our isogenic immunity fly stocks in the Vienna Drosophila Resource Centre to facilitate their use. Beyond immunity, our tools can also be of use to study various questions at the cutting edge of aging, memory, neurodegeneration, cancer, and more, where immune genes are repeatedly implicated. We hope that this set of lines will be useful to the community to better characterize the Drosophila host defense.”

      We recognise this response may not fully satisfy the reviewers’ requests. While use of summary statistics is simple, our rules for highlighting interactions of importance are defined, readily understood and interpreted, and draw attention to key trends that are backed by a solid understanding of the data and its limitations. We have taken this approach out of a responsibility to avoid making spurious assertions that stem from underpowered statistical models rather than from the biology itself.

      Reviewer #1 (Recommendations for the authors): 

      (1) Lines 1092-1093 - Please double-check the labeling of the panels in Figure 2. It appears that panels A and C correspond to single-module mutants, whereas panels B and D refer to compound-module mutants. 

      We have modified Figure 2 and Figure 2supp1 labelling. We also realise there was an error in the column titling that contributed to the confusion. We hope the new layout is clear, and thank the reviewers for noting this issue.

      (2) Lines 347-377 - Figure 2D is not cited in the text. 

      We now cite Fig2D in Line 356.

      (3) P values should be indicated in Figure 2 and Figure 3 for all relevant comparisons. Additionally, "ns" (not significant) should be added in Figure 5A-B. 

      We make the effort to show key uninfected survival trends in Figure 2, and list the total flies (n_flies) in Fig3 to provide the reader with the underlying confidence in the trends observed. We focus on differences of mean lifespan of at least 1 day, and which are consistent in direction across combinatory mutations. We have avoided the multiple comparisons of Cox proportional hazards survival analyses throughout this study because they are overly sensitive for our purposes, as we have done previously when systematically comparing many genotypes to each other (see Hanson and Lemaitre, 2023; DMM).

      (4) Minor points: Hml-Gal4, UAS-GFP should be italic; Line 192-- "uL" and "uM"; Line 596: P>.05.

      We have made these changes. We’re unsure what the comment regarding P>.05 referred to, but have removed spaces and made it non-italics. 

      Reviewer #2 (Recommendations for the authors): 

      Statistical analyses and their outcomes are clearly indicated only for the data in Figure 1 and Figure 5 and in the supplement for Figure 1, while they are not reported/not easily accessible for other data. For the main figures, statistics should be indicated in the figure for an easier assessment of the data. In case of multiple comparisons potentially crowding the plots too much, statistics may be in a supplementary file/table. 

      See response above.

      In case of the hemocytes, besides phagocytosis, I would think that ROS generation via the DUOX/NOX system is also an integral part of the immune response against pathogens, and that has not been included here. That might be an interesting addition for future experiments. As the NimC1, eater double mutant flies are said to have fewer hemocytes, it is possible that this function of the hemocytes is affected as well. This could be commented on in the text. 

      The reviewer raises a good point. The role of DUOX and NOX in ROS responses is not assessed in our study. To our knowledge, DUOX and NOX participate primarily in the wound repair response, or in epithelial renewal at damage sites or in the gut. In our study on systemic immunity, we did not assess the role of clotting, the precise function of ROS, and we have missed other host defense or stress response mechanisms as well (e.g. constitutively-expressed AMP-like genes, TEPs, JAK-STAT) that likely play a role in the systemic immune defense. Considering the lethality caused by Nox and Duox mutation, there would be inherent genetic difficulties to recombine these as multiple mutations. Unfortunately, this makes it difficult to include these processes in our analysis in a systematic manner. We are already happy to have generated fly lines lacking four immune modules simultaneously, even if they are not fully immune deficient. We have mentioned this point in the discussion (Line 613-on).

      Of note, the NimC1, eater double mutants actually have decreased hemocyte counts at the adult stage (Melcarne et al., 2019). Thus NimC1, eater double mutants are impaired not only in phagocytosis, but in the overall cellular response. We make a point to outline this in Line 225-257, and 607.

      I think it could be mentioned that the melanization response at larval stage (against parasitoids) functions differently from the melanization described here (requiring hemocyte differentiation and PPO3).

      A good point. We have added this mention in Line 97:

      “In addition, a third PPO gene (PPO3) is specifically expressed by lamellocytes, specialized hemocytes that differentiate in larvae responding to and enveloping invading parasites (Dudzic et al., 2015)”.

      Overall, the clarity of the figures and figure legends could be worked on to make them a bit easier to follow. Below are some of my suggestions: 

      (1) In Figure 2, adding headings to parts C & D (similarly to A & B) would make it easier to follow what is happening in the figure at a glance. Also, it is rather difficult to visually follow which strain is which in the plots. I'd suggest adding the key/legend for single mutants below 2A & B, and the key for the double mutants below C & D. If a mutant is present in A & B and in C & D, it could be included in both keys. I also think that it would be intuitive to present the single mutants by dashed lines and double mutants by continuous lines (or vice versa), so that one would easily distinguish between them. Of note, the figure legend says that A & B are single mutants, but for example in B there are also some double mutants (?). 

      We have modified Figure 2 and Figure 2supp1 labelling. We also realise there was an error in the column titling that contributed to the confusion. We hope the new layout is clear, and thank the reviewers for noting this issue.

      (2) In Figure 3, it looks like ΔMel is almost identical to controls in the clean injury survival, but in Figure 2C, it is clearly doing worse. I might be missing something here, but would like the authors to clarify the matter. Also, the meaning of the numbers in the heat map could be explained in the figure legend and/or added to the figure (color key). 

      The reviewer is correct. We thank the reviewer for this astute observation. Inadvertently, we used an old version of the Figure 2 preparation where only a subset of experiments was entered in the Prism data file rather than the total data used to inform Figure 3. This issue affected all genotypes.

      We have reviewed the data in Figure 2, Figure 2supp1, and Figure 3, and updated these figures accordingly to ensure they represent the full survival data. We have also incorporated new experiments into the sum data related to male-female differences and to fill gaps in the data from the 1<sup>st</sup> submission. We will also note due to the nature of 1<sup>st</sup> decimal rounding that the difference between WT and ΔMel appears slightly underrepresented: the true difference (over the 7-day lifespan) is 0.37. We’ve provided a version of this figure rounded to 2 decimal places below, but prefer the simpler 1 decimal place in the main text for readability. The updated Figure 2 shows the full data in Figure 3 accurately.

      We will also take this opportunity to highlight how conservative our ≥1.0 days difference approach is. Breaking down survival curve patterns in Figure 2 relative to mean differences in Figure 3, for clean injury, approximately 75% of ΔMel flies survive to day 7 with mortality mostly taking place between days 3-7. The result is a mean lifespan of 6.37 days. On a survival curve, this difference appears quite strong, but in our mean lifespan table the difference is rather muted (WT vs. ΔMel difference = 0.37 days). Thus, differences of ≥1.0 days reflect very strong trends in survival data that are near-guaranteed to be independent of experimental noise. While we note issues that prevented us from a fully systematic sampling for all experiments, we are confident that the ≥1.0 day differences we highlight, using the rules explained in the main text, are robust. While this approach could be seen as overly conservative, it is our preference in this initial study, containing combinations of 25 treatments and 14 genotypes, to be highly conservative. Future studies may investigate other strong differences we have not highlighted, and the data we provide here can help generate expectations and guide those studies.

      Author response image 1.

      Figure 3 with 2 decimal places of rounding for mean lifespans. The 7-day clean injury mean lifespan of WT is 6.74 days, and of ΔMel is 6.37 days. Due to rounding, in the 1 decimal Figure 3 this difference appears as if it is only 0.3 days, but it is closer to 0.4 days. Regardless, this level of difference, which appears rather clearly in a survival curve, is well below the level of difference we have chosen to highlight in our study.
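
      The rounding effect described in this legend can be checked with a short piece of arithmetic. The sketch below uses only the two mean lifespans quoted above; the variable names are illustrative and the snippet is not part of the study's analysis.

```python
wt, mel = 6.74, 6.37  # mean lifespans (days) quoted in the legend above

# Difference computed from the 2-decimal values.
true_diff = round(wt - mel, 2)

# Difference as it appears when each mean is first rounded to 1 decimal place.
displayed_diff = round(round(wt, 1) - round(mel, 1), 1)

print(f"difference from 2-decimal means: {true_diff} days")
print(f"difference from 1-decimal means: {displayed_diff} days")
```

      Rounding each mean before subtracting (6.7 and 6.4) gives 0.3 days, whereas subtracting first gives 0.37 days; both are well below the 1.0 day threshold.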

      (1) Figure 4: I find it very tedious to compare CFUs among different mutants from the plots. As the idea is to compare bacterial loads among the mutants at different timepoints, it would be easier to compare them if the data were shown within a timepoint (CFUs of each mutant at 2h, at 6h, and so on). This is also how the results are written in the text (within a time point). Would it also be clearer if the CFU plots were named, for example: " A', B', and C'"? 

      We appreciate this note. We feel both representations have merits and pitfalls, but prefer our original design showing the progression of bacterial growth within genotype first. However, we have added dotted lines representing the wild-type bacterial loads at 2hpi, 12hpi, and 24hpi to assist the reader in making across-genotype comparisons at key time points. In this way, the reader can see if the error bars (StDev) overlap the mean of the wild-type, and so make more intuitive judgements about whether these differences are meaningful.

      (2) Figure 2D is not referred to in the text. 

      We now cite Fig2D in Line 356.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The modeling approaches are very sophisticated, and clearly demonstrate the selective nature of acute ketamine to reduce the impact of trial losses on subsequent performance, relative to neutral or gain outcomes. The authors then, not unreasonably, suggest that this effect is important in the context of the negative bias in interpreting events that is prominent in depression, in that if ketamine reduces the ability of negative outcomes to alter behavior, this may be a mechanism for its rapid acting antidepressant effects.

      However, there is a very strong assumption in this regard, as shown by the first sentence of the discussion which implies this is a systematic study of ketamine's acute antidepressant effects. In actuality, this is a study of the acute effects of ketamine on reinforcement learning (RL) modeled parameters. A primary concern here is that an effect presented as a "robust antidepressant-like behavioral effect" should be more enduring than just an alteration during the acute administration. As it is, the link to an "anti-depressant effect" is based solely on the selective effects on losses. This is not to say this is not an interesting observation, worthy of exploration. It is noted that a similar lack of enduring effects on outcome evaluation is observed in humans, as shown in supplemental fig. S4, but there is no accompanying citation for the human work.

      We agree with the reviewer that the way we linked the study results to ketamine’s antidepressant action can be misleading and based on a rather strong assumption which was not systematically tested in the study. We made the following changes to the manuscript:

      (1) These results constitute a rare report of a robust antidepressant-like behavioral effect produced by therapeutic doses of ketamine during acute phase (<1 hour) after injection (Introduction, 3rd paragraph, line 8-9 in the original manuscript).

      Changed to: These results constitute a rare report of an acute effect of therapeutic dose of ketamine on the processing of affectively negative events during dynamic decision-making.

      (2) We clarified in the Discussion that our study is to gain insights into, but not a systematic investigation of ketamine’s antidepressant action as follows:

      (2.1) A sentence was added (1st paragraph of Discussion): Using a token-based decision task and extensive computational modeling, we examined the behavioral modulation induced by therapeutic doses of ketamine to gain insights into possible early signs of ketamine’s antidepressant activity.

      (2.2) Consistent with the findings from humans, ketamine’s effect on outcome evaluation was acute and did not last over subsequent days (Supplemental Figure S4) (Discussion, 2nd paragraph, line 6-7 in the original manuscript).

      Changed to: While ketamine’s antidepressant effect is reported to be sustained over a period of a week (5), ketamine’s effect on outcome evaluation was acute and did not last over subsequent days (Supplemental Figure S4). This discrepancy might be attributable to the possible differences in the state of brain network between healthy subjects and those with depression as well as the type of measures taken to assess ketamine’s effect.

      (2.3) A sentence was added (Discussion, last sentence of the 2nd paragraph) : Nevertheless, systematic studies are required to understand whether the reduced aversiveness to loss in our task might share the same mechanisms that underlie ketamine’s antidepressant action.

      One question that comes to mind in terms of the selectivity observed is whether similar work has been done to examine the acute effects of any other drugs. If ketamine is unique in this regard, that would be quite interesting.

      We think this is an interesting idea. However, comparing ketamine’s effect to that of other drugs is not the scope of the current study. We hope that we will be able to answer this question with future studies.

      Reviewer #2 (Public Review):

      Oemisch and Seo set out to examine the effects of low-dose ketamine on reinforcement learning, with the idea that alterations in reinforcement learning and/or motivation might inform our understanding of what alterations co-occur with potential antidepressant effects. Macaques performed a reinforced/punished matching pennies task while under effects of saline or ketamine administration and the data were fit to a series of reinforcement learning models to determine which model described behavior under saline most closely and then what parameters of this best-fitting model were altered by ketamine. They found a mixed effect, with two out of three macaques primarily exhibiting an effect of ketamine on processing of losses and one out of three macaques exhibiting an effect of ketamine on processing of losses and perseveration. They found that these effects of ketamine appeared to be dissociable from the nystagmus effects of the ketamine.

      The findings are novel and the data suggesting that ketamine is primarily having its effects on processing of losses (under the procedures used) are solid. However, it is unclear whether the connection between processing of losses and the antidepressant effects of ketamine is justified and the current findings may be more useful for those studying reinforcement learning than those studying depression and antidepressant effects. In addition, the co-occurrence of different behavioral procedures with different patterns of ketamine effects, with one macaque tested with different parameters than the other two exhibiting effects of ketamine that were best fit with a different model than the other two macaques, suggests that there may be difficulty in generalizing these findings to reinforcement learning more generally.

      (1) First, the authors should be more explicit and careful in the connection they are trying to make about the link between loss processing and depression. The authors call their effect a "robust antidepressant-like behavioral effect" but there are no references to support this or discussion of how the altered loss processing would relate directly to the antidepressant effects.

      We agree with the reviewer’s point about the way we connected the study results to ketamine’s antidepressant action. This concern overlaps with Reviewer #1’s concern. Please refer to our responses 2, 2-1, 2-2 and 2-3.

      (2) It appears that the monkey P was given smaller rewards and punishers than the other two monkeys and this monkey had an effect of ketamine on perseveration that was not observed in the other two monkeys. Is this believed to be due to the different task, or was this animal given a different task because of some behavioral differences that preceded the experiment? The authors should also discuss what these differences may mean for the generality of their findings. For example, might there be some set of parameters where ketamine would only alter perseveration and not processing of losses?

      Although the best-fitting ketamine model for monkey P includes an additional element – perseveration, we believe that monkey P’s baseline behavior and ketamine’s effect are not significantly different from the other two monkeys for the following reasons.

      First, monkey P was the first animal in which we tested ketamine’s effect. We therefore aimed to make the baseline behavior of the other two monkeys similar to that of monkey P, in order to reduce variability in ketamine’s effect that could be attributed to differences in baseline behavior before pharmacological manipulation. We had to adjust the payoff matrix for the subsequent animals (Y and B) because these monkeys were more sensitive to loss and seldom chose the “risky” target (yielding loss). To make their behavior similar to that of monkey P, we adjusted the asymmetry between the risky and the safe target so that a loss (neutral) outcome could also occur from the safe (risky) target. Eventually, this adjustment made the baseline behavior similar across all three monkeys. The goal of the study was to reliably measure ketamine’s effect, not to study individual differences that can naturally occur with the same task parameters. Therefore, we believe that adjusting the payoff matrix helped to reliably detect ketamine’s effect starting from a common baseline behavior.

      Second, the best-fitting model for monkey P (K-model 7) and that for the other two monkeys (K-model 4) make very similar predictions both qualitatively and quantitatively, as seen in the revised Figure 4. The outcome-value parameters estimated from these two models in monkey P are also very similar, as seen in the revised Table 3. In addition, the difference in BIC between the model that includes only perseveration modulation (K-model 6) and the model that also incorporates outcome value modulation (K-model 7) is 441, whereas the difference in BIC between K-model 7 and the model that includes only outcome value modulation (K-model 4) is as small as 4. These BIC results indicate that the variability explained by ketamine’s modulation of outcome evaluation is much larger than that explained by its modulation of perseveration in monkey P.

      Therefore, we conclude that ketamine’s effect was not significantly different between monkey P and the other two monkeys. We clarified this in the revised manuscript by adding the following paragraph to the Results section:

      “Unlike monkey Y and B, the best-fitting model for monkey P indicated that ketamine increased overall tendency to switch choice in addition to outcome-dependent modulation of outcome evaluation. However, BIC differed only slightly (dBIC = 3.99) between the best-fitting (K-model 7) and the second-best model (K-model 4) and the model predictions for choice behavior were very similar both qualitatively and quantitatively (Table 3, Figure 4). We conclude that the behavioral effects of ketamine were consistent across all three monkeys.”
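
      For readers less familiar with the criterion, the BIC comparison described in this response can be sketched as follows. This is a minimal illustration and not the authors’ fitting code: the function names, log-likelihoods, parameter counts, and trial count are hypothetical placeholders, and only the formula BIC = k·ln(n) − 2·ln(L) is standard. It shows that an extra parameter is favored only when it improves the log-likelihood by more than its penalty.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L); lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def delta_bic(bic_a, bic_b):
    """Positive values mean model B is preferred over model A."""
    return bic_a - bic_b


# Hypothetical numbers for illustration only (not taken from the study).
n_trials = 5000
simpler = bic(log_likelihood=-3000.0, n_params=6, n_obs=n_trials)
richer = bic(log_likelihood=-2993.0, n_params=7, n_obs=n_trials)
print(round(delta_bic(simpler, richer), 2))
```

      In this toy case the richer model gains 7 log-likelihood units, worth 14 BIC points, but pays ln(5000) for its extra parameter, so the net advantage is small, much like the small dBIC reported between K-model 7 and K-model 4.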

      (3) The authors should discuss whether the plasma ketamine levels they observed are similar to those seen with rapid antidepressant ketamine or are higher or lower.

      We added the following sentence, with a reference, to the first paragraph of the Results section.

      “Plasma concentration and its time course over 60 minutes were also comparable to those measured after 0.5mg/kg in human subjects (35).”

      (35) Zarate CA, Brutsche N, Laje G, Luckenbaugh DA, Venkata SLV, Ramamoorthy A, et al (2012): Relationship of ketamine’s plasma metabolites with response, diagnosis, and side effects in major depression. Biol Psychiatry, 72: 331-338.

      (4) For Figure 4 or S3, the authors should show the data fitted to model 7, which was the best for one of the animals.

      We added the parameters and model predictions from both K-model 7 and K-model 4 for monkey P to facilitate comparison between the two models in Table 3 and Figure 4. The revised Table 3 and Figure 4 are as follows:

      Author response table 1.

      Maximum likelihood parameter estimates of the best models for saline and ketamine sessions.

      In all three animals, the model incorporating valence-dependent change in outcome evaluation best fit the choice data from ketamine sessions with (K-model 7 in the parenthesis, P) or without (K-model 4, P and Y/B) additional change in the tendency of choice perseveration (Figure 3, Table 3).

      Author response image 1.

      Ketamine-induced behavioral modulation simulated with the differential forgetting model (for saline sessions) and the best-fitting K-model (for ketamine sessions).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Response to Public Comments

      (1) BioRxiv version history.

      Reviewer 1 correctly noted that we have posted different versions of the paper on bioRxiv and that there were significant changes between the initial version and the one posted as part of the eLife preprint process. Here we provide a summary of that history.

      We initially posted a bioRxiv preprint in November, 2021 (Version 1) that included the results of two experiments. In Experiment 1, we compared conditions in which the stimulation frequency was at 2 kHz, 3.5 kHz, or 5.0 kHz. In Experiment 2, we replicated the 3.5 kHz condition of Experiment 1 and included two amplitude-modulated (AM) conditions, with a 3.5 kHz carrier signal modulated at 20 Hz or 140 Hz. Relative to the sham stimulation, non-modulated kTMP at 2 kHz and 3.5 kHz resulted in an increase in cortical excitability in Experiment 1. This effect was replicated in Experiment 2.

      In the original posting, we reported that there was an additional boost in excitability in the 20 Hz AM condition above that of the non-modulated condition. However, in re-examining the results, we recognized that the 20 Hz AM condition included an outlier that was pulling the group mean higher. We should have caught this outlier in the initial submission given that the resultant percent change for this individual is 3 standard deviations above the mean. Given the skew in the distribution, we also performed a log transform on the MEPs (which improves the normality and homoscedasticity of MEP distributions) and repeated the analysis. However, even here the participant’s results remained well outside the distribution. As such, we removed this participant and repeated all analyses. In this new analysis, there was no longer a significant difference between the 20 Hz AM and non-modulated conditions in Experiment 2. Indeed, all three true stimulation conditions (non-modulated, AM 20 Hz, AM 140 Hz) produced a similar boost in cortical excitability compared to sham. Thus, the results of Experiment 2 are consistent with those of Experiment 1, showing, in three new conditions, the efficacy of kHz stimulation on cortical excitability. But the results fail to provide evidence of an additional boost from amplitude modulation. 
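
      The screening steps described above (a 3 SD criterion and a log transform of MEP amplitudes) can be sketched as follows. This is only an illustration under stated assumptions: the function names and the data are hypothetical, the sample standard deviation is assumed, and the authors’ actual pipeline is not shown in this response.

```python
import math
import statistics


def log_transform(amplitudes):
    """Natural-log transform of strictly positive MEP amplitudes."""
    return [math.log(a) for a in amplitudes]


def flag_outliers(values, threshold=3.0):
    """Return indices of values more than `threshold` sample SDs above the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [i for i, v in enumerate(values) if (v - mean) / sd > threshold]


# Hypothetical percent-change scores: 19 typical participants and one extreme value.
percent_change = [10.0, 12.0, 8.0, 15.0, 9.0, 11.0, 13.0, 7.0, 14.0, 10.0,
                  12.0, 9.0, 11.0, 8.0, 13.0, 10.0, 12.0, 9.0, 11.0, 150.0]
print(flag_outliers(percent_change))
```

      One caveat with this kind of rule: because an extreme value inflates the SD it is judged against, a single observation can lie at most (n − 1)/√n sample SDs from the mean, so a 3 SD criterion can only trigger when there are at least 11 observations.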

      We posted a second bioRxiv preprint in May, 2023 (Version 2) with the corrected results for Experiment 2, along with changes throughout the manuscript given the new analyses.

      Given the null results for the AM conditions, we decided to run a third experiment prior to submitting the work for publication. Here we used an alternative form of amplitude modulation (see Kasten et al., NeuroImage 2018). In brief, we again observed a boost in cortical excitability from non-modulated kTMP at 3.5 kHz, but no additional effect of amplitude modulation. This work is included in the third bioRxiv preprint (Version 3), the paper that was submitted and reviewed at eLife.

      (2) Statistical analysis.

      Reviewer 1 raised a concern with the statistical analyses performed on aggregate data across experiments.  We recognize that this is atypical and was certainly not part of an a priori plan. Here we describe our goal with the analyses and the thought process that led us to combine the data across the experiments.

      Our overarching aim is to examine the effect on corticospinal excitability of different kTMP waveforms (carrier frequency and amplitude modulated frequency) matched at the same estimated cortical E-field (2 V/m). Our core comparison was of the active conditions relative to a sham condition (E-field = 0.01 V/m). We included the non-modulated 3.5 kHz condition in Experiments 2 and 3 to provide a baseline from which we could assess whether amplitude modulation produced a measurable difference from that observed with non-modulated stimulation. Thus, this non-modulated condition as well as the sham condition was repeated in all three experiments. This provided an opportunity to examine the effect of kTMP with a relatively large sample, as well as assess how well the effects replicate, and resulted in the strategy we have taken in reporting the results.

      As a first step, we present the data from the 3.5 kHz non-modulated and sham conditions (including the individual participant data) for all three experiments in Figure 4. We used a linear mixed effect model to examine if there was an effect of Experiment (Exps 1, 2, 3) and observed no significant difference within each condition. Given this, we opted to pool the data for the sham and 3.5 kHz non-modulated conditions across the three experiments. Once data were pooled, we examined the effect of the carrier frequency and amplitude modulated frequency of the kTMP waveform.

      (3) Carry-over effects

      As suggested by Reviewer 1, we will examine in the revision if there is a carry-over effect across sessions (for the most part, 2-day intervals between sessions). For this, we will compare MEP amplitude in baseline blocks (pre-kTMP) across the four experimental sessions.

      Reviewer 1 also commented that mixing the single- and paired-pulse protocols might have impacted the results. While our a priori focus was on the single-pulse results, we wanted to include multiple probes given the novelty of our stimulation method. Mixing single- and different paired-pulse protocols has been relatively common in the non-invasive brain stimulation literature (e.g., Nitsche 2005, Huang et al, 2005, López-Alonso 2014, Batsikadze et al 2013) and we are unaware of any reports suggesting that mixed designs (single and paired) distort the picture compared to pure designs (single only).

      (4) Sensation and Blinding

      Reviewer 2 brought up concerns about the sham condition and blinding of kTMP stimulation. We do think that kTMP is nearly ideal for blinding. The amplifier does emit an audible tone (at least for individuals with normal hearing) when set to an intensity to produce a 2 V/m E-field. For this reason, the participants and the experimenter wore ear plugs. Moreover, we played a 3.5 kHz tone in all conditions, including the sham condition, which effectively masked the amplifier sound. We measured the participants’ subjective ratings of annoyance, pain, and muscle twitches after each kTMP session (active and sham). Using a linear mixed effect model, we found no difference between active and sham for each of these ratings, suggesting that sensation was similar for active and sham (Fig 8). This matches our experience that kHz stimulation in the range used here has no perceptible sensation induced by the coil. To blind the experimenters (and participants) we used a coding system in which the experimenter typed in a number that had been randomly paired to a stimulation condition that varied across participants in a manner unknown to the experimenter.

      Reviewer 1 asked why we did not explicitly ask participants if they thought they were in an active or sham condition. This would certainly be a useful question. However, we did not want to alert them of the presence of a sham condition, preferring to simply describe the study as one testing a new method of non-invasive brain stimulation. Thus, we opted to focus on their subjective ratings of annoyance, pain, and finger twitches after kTMP stimulation for each experimental session.

      Response to Recommendations for the Authors

      Reviewer #1: 

      Reviewer #1 in the public review noted the possibility of carry-over effects and suggested that we compare the amplitude of the MEPs in the pre blocks across the four sessions.

      Although we did not anticipate carry-over effects lasting 2 or more days, we have now conducted an analysis in which we use a linear mixed effect model with a fixed factor of Session and a random factor of Participant. The results show that there is no effect of session [χ2(3) = 4.51, p = 0.211].

      Author response table 1.

      Detailed comments and some suggestions to maybe improve the writing and figures: 

      Abstract: 

      BioRxiv Version 1: "We replicated this effect in Experiment 2 and found that amplitude-modulation at 20 Hz produced an additional boost in cortical excitability. " 

      BioRxiv Version 2, 3 and current manuscript: "Although amplitude-modulated kTMP increased MEP amplitude compared to sham, no enhancement was found compared to non-modulated kTMP." 

      I am a little concerned about this history because the conclusions seem to have changed. It looks like the new data has a larger number of subjects, which could explain the divergence. Although it is generally not good practice to analyze the data at interim time points, without accounting for alpha spending. It appears that data analysis methods may have also changed, as some of the extreme points in version 1 seem to be no longer in the new manuscript (Figure 4 Sham Experiment 1). 

      In the public review above we explain in detail the different versions of the bioRxiv preprint and how the results changed from the first version to the current manuscript.

      Introduction:

      "Second, the E-fields for the two methods exist in orthogonal subspaces" Can you explain what this means?

      Thank you for this suggestion, we have updated the paper (pg. 4, line 78-81) by adding two sentences to explain what we mean by orthogonal subspaces and describe the consequences of this with respect to the E-fields resulting from tES and TMS. Specifically, we now comment that even if the E-fields of tES and TMS are similar in focality, they may target different populations of neurons.  

      "In addition, the kTMP waveform can be amplitude modulated to potentially mimic E-fields at frequencies matching endogenous neural rhythms [15]." That may be so, but reference [15] makes the exact opposite point, namely, that kHz stimulation has little effect on neuronal firing until you get to very strong fields. The paper that makes that claim is by Nir Grossman, but in my view, it is flawed as responses are most likely due to peripheral nerve (axon) stimulation there given the excessive currents used in that study. The reference to Wang and Peterchev [17] is in agreement with that by showing that you need 2 orders of magnitude stronger fields to activate neurons. 

      The reviewer is correct that Ref 15 (Esmaeilpour et al, 2021), as well as Wang et al, 2023, use much higher E-fields than we target in our present study. However, our point here is that, while we cannot use our approach to apply E-fields at endogenous frequencies, we can do amplitude modulation of the kHz carrier frequency at these lower frequencies. We cited Esmaeilpour et al., (2021) because they show that high frequency stimulation with amplitude-modulated waveforms resulted in dynamic modulation at the “beating” frequency. Given we are well in subthreshold space in this paper, and well below the E-field levels in Esmaeilpour et al (2021), the open question is whether amplitude modulation at this level will be able to perturb neural activity (e.g., increase power of endogenous oscillations at the targeted frequency).

      To address this concern, we modified the sentence (pg.6, lines 120-121) to now read "In addition, the kTMP waveform can be amplitude modulated at frequencies matching endogenous neural rhythms." In this way, we are describing a general property of kTMP (as well as other methods that can use high frequency signals).

      I am not aware of any in-vitro study showing the effects of kHz stimulation at 2V/m. The review paper by Neudorfer et al is very good. But if I got it correctly in a quick read it is not clear that there is experimental evidence for subthreshold effects. They do talk about facilitation, but the two experimental papers cited there on the auditory nerve don't quantify field magnitudes. I would really love it if you could point me to a relevant empirical study showing the effects of kHz stimulation at 2 V/m. 

      Perhaps all this is a moot point as you are interested in lasting (plastic) effects on MEP. For this, you cite one study with 11 subjects showing the effects of kHz tACS on MEPs [20]. I guess that is a start. The reference [21] is only a safety study, so it is probably not a good reference for that. Reference [22] also seems out of place as it is a modeling study. The effects on depression of low-intensity magnetic stimulation in references [23-26] are intriguing. 

      We agree with the reviewer that Ref 20 (now Ref 18: Chaieb, Antal & Paulus; 2011) is the most relevant one to cite here since it provides empirical evidence for changes in neural excitability from kHz stimulation, and in fact, serves as the model for the current study. We have retained Refs 23-26 (now Ref 19-22: Rohan et al., 2014; Carlezon et al., 2005; Rohan et al., 2004 & Dublin et al., 2019) since they also do show kHz effects on mood and removed Refs 21 (Chaieb et al., 2014) and 22 (Wang et al., 2018) for the reasons cited by the Reviewer.

      Figure 1: "The gray dashed function depicts the dependence of scalp stimulation threshold upon frequency [14]." It's hard to tell from that reference what the exact shape is, but the frequency dependence is likely steeper than what is shown here, i.e. 2 mA at 10 Hz can be really quite unpleasant. 

      We have removed the gray dashed line given that this might be taken to suggest a discrete transition. We now just have a graded transition to reflect that the tolerance of tES is subjective. We start the shading at 2 mA for the lowest frequencies given that there is general agreement that 2 mA is well-tolerated and decrease the shading intensity as frequency increases. The general aim of the figure is not to make strong claims about the threshold of scalp discomfort for tES, but to show that kTMP can target much higher cortical E-fields within the tolerable range.

      Methods:

      Procedures:

      It does not seem like double-blinding has been directly assessed.

      We did not assess double blinding by directly asking participants whether they thought they were in a sham or active condition. We did not want to alert the participants of the presence of a sham condition after the first session of the 4-session study, preferring to simply describe the study as a test of a new method of non-invasive brain stimulation. For this reason, we opted to focus on their subjective ratings of annoyance, pain, and finger twitches after kTMP stimulation for each experimental session. These ratings did not differ between active and sham kTMP, which suggests kTMP has good potential for double blinding.

      MEP data analysis: Taking the mean of log power is unusual, but I suppose the reference provided gives a good justification. Does this explain the deviation from the biorxiv v1 results? 

      We opted to perform a logarithmic transformation of MEP amplitudes to improve the normality and homoscedasticity of the MEP distribution. We cite three papers (Refs 50-52: Peterchev et al., 2013, Nielsen 1996a, & Nielsen 1996b) that have applied a similar approach in handling MEP data. We had not done the transformation in the first bioRxiv but opted to do so in the eLife submission based on further review of the literature. We note that the two analyses produce similar statistical outcomes once we removed the outlier discussed in the Public Review.

      "Interactions were tested by comparing a model in which the fixed effects were restricted to be additive against a second model that could have multiplicative and additive effects." Not sure what this means. Why not run a full model with interactions included and read off the stats from that single model for the various factors? Should one not avoid running multiple models as one would have to correct p-values for multiple comparisons for every new test? 

      We used the lme4 package in R to fit our linear mixed effect models (Ref 54: Bates, Mächler, Bolker & Walker, 2015). In this package they intentionally leave out p-values for individual models or factors because they note there is a lack of convergence in the field about how to calculate parameter estimates in complex situations for linear mixed effect models (e.g., unbalanced designs). They suggest model comparison using the likelihood-ratio test to obtain and report p-values, which is what we report in the current manuscript.

      We revised the text in the section Linear Mixed Effects Models to state that likelihood ratio tests were used to obtain p-values to remove any confusion.
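
      The likelihood-ratio comparison described above can be sketched in a few lines. The authors fit their models with lme4 in R; the Python below only reproduces the arithmetic of the test and is not their code. The function names and the log-likelihood values are hypothetical, and the chi-square tail probability is computed for integer degrees of freedom using only the standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Base cases: df = 1 uses the complementary error function,
    # df = 2 reduces to exp(-x / 2).
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(half))
    else:
        a, q = 1.0, math.exp(-half)
    # Recurrence for the regularized upper incomplete gamma function:
    # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    target = df / 2.0
    while a < target:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(loglik_reduced, loglik_full, df_diff):
    """Return (chi-square statistic, p-value) for two nested models."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, df_diff)


# Hypothetical log-likelihoods chosen so the statistic equals 4.51 with 3 df.
stat, p = likelihood_ratio_test(-100.0, -97.745, 3)
print(round(stat, 2), round(p, 3))
```

      Applied to the statistics quoted in this response, the same function gives tail probabilities consistent with the reported p-values, for example about 0.211 for χ2(3) = 4.51 and about 0.232 for χ2(1) = 1.43.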

      Procedures:

      kTMP: Nice that fields were measured. Would be nice to see the data that established the empirical constant k.

      We have expanded our discussion of how we established k in the Methods section. We first derived k using the equation E0 = k·fc·I based on previously published reports of the current (I) and frequency (fc) of the MagVenture Cool-B65 coil (now Refs 29-30: Deng, Lisanby & Peterchev, 2013; Drakaki, Mathiesen, Siebner, Madsen & Thielscher, 2022). We then verified this value using the triangular E-field probe to within 5% error.

      Figure 3, spectrum. The placement of the fm label on the left panel is confusing. It suggests that fm was at the edge of the spectrum shown, which would not be the best way to show that there is nothing there - obviously, there isn't, but the figure could be more didactic. 

      Thanks for pointing this out. We modified the figure, moving the ‘fm’ label to the center of the first panel. This change makes it clear that there is no peak at the amplitude modulated frequency.
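
      The point that an amplitude-modulated waveform has no spectral peak at the modulation frequency itself can be checked numerically. The sketch below assumes a textbook sinusoidal AM waveform with 100% modulation depth, which may differ from the waveforms used in the study; the sampling rate and function names are likewise assumptions made for illustration.

```python
import cmath
import math


def am_signal(fc, fm, depth, fs, n_samples):
    """Sinusoidal carrier at fc whose envelope is modulated at fm."""
    return [
        (1.0 + depth * math.cos(2 * math.pi * fm * n / fs))
        * math.cos(2 * math.pi * fc * n / fs)
        for n in range(n_samples)
    ]


def amplitude_at(signal, freq, fs):
    """Single-frequency DFT amplitude (peak units) at `freq`."""
    acc = sum(x * cmath.exp(-2j * math.pi * freq * n / fs)
              for n, x in enumerate(signal))
    return 2.0 * abs(acc) / len(signal)


FS = 20000            # sampling rate in Hz (assumed for the sketch)
FC, FM = 3500, 20     # carrier and modulation frequencies named in the text
wave = am_signal(FC, FM, depth=1.0, fs=FS, n_samples=FS)  # one second of signal

for f in (FM, FC - FM, FC, FC + FM):
    print(f, round(amplitude_at(wave, f, FS), 6))
```

      Under these assumptions the energy sits at the carrier and at the two sidebands (fc ± fm), while the amplitude at fm is numerically zero, which is the property the revised figure panel is meant to convey.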

      "a trio of TMS assays of cortical excitability" Can you clarify what this means? 

      Sorry for the confusion. The trio of TMS assays refers to the single pulse and two paired-pulse protocols (SICI - ICF). We edited the Procedure section to clarify this (pg 9, line 195-197).

      Figure 2A: it would be nice to indicate which TMS blocks were single pulse and which were the two paired-pulse protocols. It is hard to keep track of it all for the three different experiments. 

      We have now clarified in the text (see above) that all three probes were used in each block for Experiments 1 and 2, and only the single-pulse probe in Experiment 3. We have modified the legend for Figure 2 to also provide this information.

      Results:

      "Based on these results, we combined the data across the three experiments for these two conditions in subsequent analyses." This strikes me as inappropriate. Should not a single model have been used with a fixed effect of experiment and fixed effect of stimulation condition?

      We recognize that pooling data across experiments may be atypical. Indeed, our initial plan was to simply analyze each experiment on its own (completely within-subject analysis). However, after completing the three experiments, we realized that since the sham and non-modulated 3.5 kHz conditions were included in each experiment, we had an opportunity to examine the effect of kTMP in a relatively large N study (for NIBS research). Before pooling the data, we wanted to make sure that the factor of experiment did not impact the results and our analysis showed there was no effect of experiment. Note that we did not include the factor of stimulation condition in this model because we did not want to do multiple comparisons of the same contrast (3.5 kHz compared to sham). By pooling the data before analysis of the stimulation conditions we could then focus on our two key independent variables: 1) kTMP carrier frequency and 2) kTMP amplitude modulated frequency, doing fewer significance tests to minimize multiple comparisons. The linear mixed effect (LME) model allows us to include a random effect of participant. In this way, we account for the fact that some comparisons are within subjects and some comparisons are between subjects.

      The reviewer is correct that after pooling the data, we could have continued to include the factor of experiment in the LME models. This factor could still account for variance even though it was not significant in the initial test. Given this, we have now reanalyzed the data including the fixed factor of experiment in all the comparisons that contain data from multiple experiments. This has led us to modify the text in the Methods section under Linear Mixed Effects Models and in the Results section under Repeated kTMP Conditions (3.5 kHz and Sham) across Experiments. In addition, the results of the LME models have been updated throughout the Results section. We note that the pattern of results was unchanged with this modification of our analyses.

      "Pairwise comparisons of each active condition to sham showed that an increase was observed following both 2 kHz ..." I suppose this is all for Experiment 1? It is a little confusing to go back and forth between combining experiments and then separate analyses per experiment without some guiding text, aside from being a bit messy from the statistical point of view. 

      We did not go back to performing separate analyses of the experiments after pooling the data. Once we ran the test to justify pooling the data, subsequent tests were done with the pooled data to evaluate the effects of carrier frequency and amplitude modulation.

      Figure 5 is confusing because the horizontal lines with ** on top seem to refer to the same set of sham subjects, but the subjects of Experiments 2 and 3 are different from Experiment 1, so in these pairwise comparisons there is a mix of between-subject and within-subject comparisons going on here. Did I get that right?

      Yes, that is correct. As noted above, we pooled the data after showing that there was no effect of experiment. Thus, the data for the sham and 3.5 kHz non-modulated conditions are from three different experiments. There was some overlap of subjects in Experiments 1 and 2 (Experiment 3 was all new participants). We used a linear mixed effect model so that we could account for this mixed design. Participant was always included as a random factor, which allows us to account for the fact that some comparisons are within subjects and some are between subjects. Based on a previous comment, we now include Experiment as a fixed factor (see above), which provides a way to evaluate variance across the different experiments.

      "We next compared sham vs. active non-modulated kTMP and found that active kTMP produced a significant increase in corticospinal excitability [χ2(1) = 23.46 p < 0.001" Is this for the 3.5Hz condition? 

      No, that is for an omnibus comparison of non-modulated kTMP (including 2 kHz, 3.5 kHz and 5 kHz conditions) vs. sham. We have edited the paper to include the three conditions that are included as the active non-modulated kTMP conditions for clarity (pg. 22, line 463). Having observed a significant omnibus result, we continued with paired comparisons: “Pairwise comparisons of each active condition to sham showed that an increase was observed following both 2 kHz [χ2(1) = 6.90, p = 0.009; d = 0.49] and 3.5 kHz kTMP [χ2(1) = 37.75, p < 0.001; d = 0.70; Fig 5: Non-Modulated conditions]. The 5 kHz condition failed to reach significance [χ2(1) = 1.43, p = 0.232; d = 0.21].”

      Paired-Pulse Assays: There are a number of results here without pointing to a figure, and at one point there is a reference to Figure 6, which may be in error. It would help to point the reader to some visual corresponding to the stats.

      Thank you. This was an error on line 542. It should have read Figure 7. We have added two other pointers to Figure 7 where we discuss the absence of an effect of kTMP on SICI.

      Reviewer #2 (Recommendations For The Authors):

      I would recommend a couple of changes to the background.

      "Orthogonal subspaces" line 78. This is a fairly formal term that has little relevance here, although the difference between scalar and vector potential-based fields is interesting to think about. If it stays, it should be mathematically supported, but it's easily rewritten to deliver the gist of it. 

      We have updated the paper by adding text that we hope will clarify what we mean by orthogonal subspaces (pg. 4, line 78-81). We note that we developed the math behind this statement in a previous paper (Ref # 10: Sheltraw et al., 2021). We have changed the location of the citation so that it directly follows these sentences and will provide a pointer to readers interested in the physics and math concerning orthogonal subspaces. 

      The statement that the scalp e-field for TES is greater than the e-field for TMS for similar cortical fields needs a little more clarification, since historically they have operated orders of magnitude apart, and it is easy to misread and trip over this statement (although it is factually true). Presenting a couple of numbers at cortical and scalp positions would help illustrate the point. That you are not considering applying TES at traditional TMS levels but rather TMS at TES values is what is initially easy to miss. 

      We appreciate the feedback and have updated this section to provide the reader with a better intuition of this point. We now specify that the scalp to cortical E-field ratio is approximately 18 times larger for tES compared to TMS and cite our previous paper which has much more detail about how this was calculated.

      A note that the figures show scalp sensation around 1.0 V/m while the text states 0.5; cortical depths are an important thing for the reader to keep in mind. 

      This comment, when considered in tandem with one of the comments of Reviewer 1 led us to revise Figure 1. We removed the dashed gray line which might be taken to suggest a strict cutoff in terms of tolerability (which we did not intend). We now use shading that fades away to make the point of continuity. We have extended this down to a cortical E-field of 0.5 V/m to correspond with the text.  

      This is a nicely done and carefully reported experiment and I look forward to seeing more. 

      Thank you for your kind note!

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer 1

      Summary:

      In the present study, the authors identified the ternary complex formed by NCAN, TNC, and HA as an important factor facilitating the multipolar to bipolar transition in the intermediate zone (IZ) of the developing cortex. NCAN binds HA via the N-terminal Link modules, while TNC cross-links NCAN through the CDL domain at the C-terminal. The expression and correct localization of these three factors facilitate the multipolar-bipolar transition necessary for immature neurons to migrate radially. TNC and NCAN are also involved in neuronal morphology. The authors used a wide range of techniques to study the interaction between these three molecules in the developing cortex. In addition, single and double KO mice for NCAN and TNC were analyzed to decipher the role of these molecules in neuronal migration and morphology.

      Strengths:

      The study of the formation of the cerebral cortex is crucial to understanding the pathophysiology of many neurodevelopmental disorders associated with malformation of the cerebral cortex. In this study, the authors showed, for the first time, that the ternary complex formed by NCAN, TNC, and HA promotes neuronal migration. The results regarding the interaction between the three factors forming the ternary complex are convincing.

      We appreciate the reviewers' positive assessment of our research.

      Weaknesses:

      However, regarding the in vivo experiments, the authors should consider some points for the interpretation of the results:

      • The authors did not use the proper controls in their experiments. For embryonic analysis, such as cortical migration, neuronal morphology, and protein distribution (Fig. 6, 7, and 9), mutant mice should be compared with control littermates, since differences in the results could be due to differences in embryonic stages. For example, in Fig. 6 the dKO is more developed than the WT embryo.

      It was challenging to compare double knockout mice with control littermates. When crossing Ncan and Tnc double heterozygous mice, the probability of obtaining double knockout mice is 1/16. Given an average litter size of around 8, acquiring a substantial number of double knockout mice would necessitate an impractical number of breeding pairs. Consequently, we were constrained to use non-littermate control mice. To address potential differences in developmental stages, we analyzed 19-20 embryos obtained from five individuals in each group, demonstrating that the observed differences between the two groups are more substantial than the inherent variability within each group.
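
      The breeding arithmetic behind this argument can be sketched as follows. This is an illustrative calculation only, not part of the authors' analysis: it assumes the two loci segregate independently (which the quoted 1/16 figure implies), takes the litter size of 8 from the response, and uses the lower bound of 19 embryos as a hypothetical target for a double-heterozygous cross.

```python
from fractions import Fraction

# Each double-heterozygous parent transmits the null allele of a given gene
# with probability 1/2, so an embryo is homozygous null at one locus with
# probability 1/2 * 1/2 = 1/4.
p_ko_single_locus = Fraction(1, 2) * Fraction(1, 2)

# Assumption: the two loci are unlinked, so the probabilities multiply.
p_dko = p_ko_single_locus ** 2

litter_size = 8          # average litter size quoted in the response
target_dko_embryos = 19  # hypothetical target: lower bound of the 19-20 embryos analyzed

# Expected number of double-knockout embryos in one litter.
expected_dko_per_litter = litter_size * p_dko

# Expected number of litters needed to reach the target.
expected_litters_needed = target_dko_embryos / expected_dko_per_litter

print(p_dko)                    # 1/16
print(expected_dko_per_litter)  # 1/2
print(expected_litters_needed)  # 38
```

      Under these assumptions a double-heterozygous cross yields on average one double knockout embryo every two litters, so roughly 38 litters would be needed for a littermate-controlled group of this size.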

      • The authors claim that NCAN and TNC are involved in neuronal migration from experiments using single KO embryos. This is a strong statement considering the mild results, with no significant difference in the case of TNC KO embryos, and once again, using embryos from different litters.

      We agree with the reviewer's comment that a single deletion of TNC has a minimal impact on neuronal migration. We have revised the Results section to reflect the mild nature of the TNC KO phenotype more accurately.

      Page 8, line 225: "In NCAN KO mice, a significantly lower percentage of labeled cells resided in the upper layer (Bin2), and more cells remained in the lower layer (Bin5) than in WT mice (Figure 7a). In contrast, the impact of a single deletion of TNC on neuronal cell migration was minimal. Although TNC KO mice exhibited a tendency to have a higher proportion of labeled cells in the lower layer (Bin4) than in WT mice, this did not reach statistical significance (Figure 7a). The delay in neuronal migration observed in the single KO mice was milder when compared to that observed in DKO mice (Figure 6a-c), suggesting that simultaneous deletion of both NCAN and TNC is necessary for a more pronounced impairment in neuronal cell migration."

      • The measurement of immunofluorescence intensity is not the right method to compare the relative amount of protein between control and mutant embryos unless there is a right normalization.

      We agree that measuring immunofluorescence intensity alone is insufficient for comparing the relative amount of protein. In Figure 8, we have employed Western blotting to compare the protein levels, revealing an approximately 50% reduction in NCAN and TNC following hyaluronidase digestion. In Figures 7b and 7c, we demonstrated alterations in the localization patterns of TNC and NCAN in Ncan KO and Tnc KO mice; however, we did not mention their quantity.

      • Page 7, line 206. "No significant abnormalities were observed in the laminar structure in 4-week-old DKO mice". The authors should be more careful with this statement since they did not check the lamination of the adult cortex. I would recommend staining control and mutant mice with markers of different cortical populations, such as Cux1, Ctip2, and Tbr1, to assess this point.

      In response to the suggestion, we have conducted additional experiments to provide a more detailed examination of the laminar structure in the cerebral cortex. The results have been incorporated into the revised manuscript as follows:

      Page 7, line 209: "To investigate the laminar organization of the postnatal cerebral cortex, we analyzed the distribution of NeuN-positive postmitotic neurons in DKO mice at 2 weeks of age. No notable abnormalities were observed in the laminar structure of DKO mice (Figure 6-figure supplement 3a, b). Additionally, the laminar distribution of Ctip2-positive deep layer neurons showed no significant differences between WT and DKO mice (Figure 6-figure supplement 3a, c)."

      • The authors do not explain how they measured the intensity of TNC around the transfected Turbo-RFP-positive neurons.

      We added the following description to the Materials and Methods:

      Page 18, line 608: "Images were captured in the IZ region containing Turbo-RFP-positive neurons using a 100X magnification objective lens with 3.0X optical zoom on an AX R confocal microscope (Nikon). A total of 10 optical sections were acquired with a step size of 190 nm. Z-projection views were generated, and the staining intensity of TNC around Turbo-RFP-positive neurons was measured in a 59 × 59 µm area using ImageJ FIJI."

      • The loading control of the western blots should be always included.

      In Figure 6-figure supplement 1, we have incorporated western blot data using a GAPDH antibody as a loading control. We have added an explanation in the figure legend of Figure 3c, stating that we analyzed the same samples as those used in Figure 1e.

      • For Fig. 3e, I think values are represented relative to E18 instead of P2.

      Thank you for pointing that out. As suggested, we have corrected the representation in Fig. 3e to be relative to E18 instead of P2.

      • I would recommend authors use the standard nomenclature for the embryonic stages. The detection of the vaginal plug is considered as E0.5 and therefore, half a day should be added to embryonic stages (E14.5...).

      We have revised our manuscript to designate the detection of the vaginal plug as E0.5, and subsequently, we have adjusted all embryonic stages by adding half a day, such as E14.5.

      • Fig 10K: I do not see the differences in the number of neurites in the graph.

      We have modified the presentation from a box-and-whisker plot to a bar graph to enhance the visibility of differences in the average number of neurites.

      • Line 37: Not all of the cerebral cortex is structured in 6 layers; only the neocortex is.

      We have changed 'cerebral cortex' to 'cerebral neocortex.'

      Reviewer 2

      Summary:

      ECM components are prominent constituents of the pericellular environment of CNS cells and form complex and dynamic interactomes in the pericellular spaces. Based on bioinformatic analysis, more than 300 genes have been attributed to the so-called matrisome, many of which are detectable in the CNS. Yet, not much is known about their functions, while increasing evidence suggests important contributions to developmental processes, neural plasticity, and inhibition of regeneration in the CNS. In this respect, the present work offers new insights and adds interesting aspects to the facets of ECM contributions to neural development. This is even more relevant in view of the fact that neurocan has recently been identified as a potential risk gene for neuropsychiatric diseases. Because ECM components occur in the interstitial space and are linked in interactomes, their study is very difficult. A strength of the manuscript is that the authors used several approaches to shed light on ECM function, including proteome studies, the generation of knockout mouse lines, and the analysis of in vivo labeled neural progenitors. This multi-perspective approach made it possible to reveal hitherto unknown properties of the ECM and highlighted its importance for the overall organization of the CNS.

      Strengths:

      Systematic analysis of the ternary complex between neurocan, TNC, and hyaluronic acid; establishment of KO mouse lines to study the function of the complex; use of in utero electroporation to investigate the impact on neuronal migration.

      We appreciate the reviewers' insightful comments.

      Weaknesses:

      The analysis is focused on neuronal progenitors; however, the potential impact of the molecules of interest, in particular of their removal, on the differentiation and/or survival of neural stem/progenitor cells is not addressed. The potential receptors involved are not considered. It also seems that it is rather the passage to the outer areas of the forming cortex that is compromised, which is not the same as the migration process. The movement of the cells is not included in the analysis.

      In this study, we demonstrated that the ternary complex of NCAN, TNC, and HA is predominantly localized in the subplate/intermediate zone. This region lacks neural stem/progenitor cells but serves as the initiation site for the radial migration of postmitotic neurons. Consequently, our study focused on the role of the ternary complex in neuronal migration and polarity formation. We acknowledge that we did not investigate in-depth the potential effects of ECM perturbation on the differentiation and survival of neural stem/progenitor cells. However, as highlighted by the reviewer, it is important to explore the effects on neural stem/progenitor cells. To address this concern, we analyzed Pax6-positive radial glial cells and Tbr2-positive intermediate progenitor cells in the ventricular zone of wild-type and Ncan/Tnc double knockout (DKO) mice. Immunohistochemical analysis revealed no significant differences between WT and DKO mice (Figure 6-figure supplement 4a). Furthermore, the morphology of nestin-positive radial fibers exhibited no distinguishable variations between WT and DKO mice (Figure 6-figure supplement 4b, c).

      (1) In the description of the culture of cortical neurons the authors mentioned the use of 5% horse serum as a medium constituent. HS is a potent stimulus for astrocyte differentiation and astrocytes in vitro release neurocan. Therefore, the detection of neurocan in the supernatant of the cultures as shown in Figure 1h might as well reflect release by cultivated astrocytes.

      As pointed out by the reviewer, Figure 1h did not conclusively demonstrate that neurons are the sole source of NCAN production. Indeed, in situ hybridization analysis revealed the widespread distribution of Ncan mRNA throughout the cerebral cortex (Figure 2a). This result suggests that the production of NCAN involves not only neurons but also other cell populations, including radial glial cells and astrocytes. While we acknowledge the potential contribution of other cell types to NCAN production, Ncan expression by neurons during radial migration is a crucial aspect of our findings (Figure 1i, j). We have revised the manuscript as follows:

      Page 5, line 111: "This result suggested the secretion of NCAN by developing neurons; however, we cannot rule out the involvement of coexisting glial cells in the culture system. To investigate the expression of Ncan mRNA during radial migration in vivo, we labeled radial glial cells in the VZ with GFP through in utero electroporation at E14.5 (Figure 1i, Figure 1-figure supplement 1)."

      (2) It is known that neurocan in vivo is expressed by neurons, but may be upregulated in astrocytes after lesion, or in vitro, where the cells become reactive.

      We have incorporated the following description into the discussion:

      Page 11, line 359: "Previous studies have reported an upregulation of NCAN and TNC in reactive astrocytes, indicating the potential formation of the ternary complex of NCAN, TNC, and HA in the adult brain in response to injury (Deller et al., 1997; Haas et al., 1999)."

      (3) Do NCAN KO neurons show an increase in neurite growth on the TNC substrates? The response on POL was changed (Fig. 10h-k), but the ECM substrates were not tested with the KO neurons.

      The impact of ECM substrates on NCAN KO neurons has not been investigated, and this remains an avenue for further exploration in our ongoing research. Future studies aim to elucidate the NCAN-TNC connection by identifying TNC cell surface receptors and unraveling the subsequent intracellular signaling pathways.

      (4) Do the authors have an explanation for why the ternary complex is concentrated in the SP/IZ zone?

      In the mature brain, hyaluronan acts as a scaffold that facilitates the accumulation of ECM components, including proteoglycans and tenascins around neurons. Therefore, it is conceivable that the ECM components bind to hyaluronan in the embryonic brain, resulting in its accumulation in the subplate/intermediate zone. In support of this hypothesis, enzymatic digestion of hyaluronan in the subplate/intermediate zone led to the disappearance of TNC and NCAN accumulation (Figure 8a-c). This result may account for the disparity observed, where Tnc mRNA is expressed in the ventricular zone while the TNC protein localizes to the subplate/intermediate zone.

      (5) Are hyaluronic acid synthesizing complexes (HAS) concentrated in the SP/IZ?

      In response to the reviewer's comment, we investigated the localization of Has2 and Has3 mRNA using in situ hybridization. However, due to the relatively low expression levels of these enzymes, we encountered challenges in obtaining clear signals (Author response image 1). Further research is needed to understand the mechanisms behind the localization of hyaluronan in the intermediate zone.

      Author response image 1.

      In situ hybridization analysis of Has2 and 3 mRNA on the E16.5 cerebral cortex. Upper images show results of in situ hybridization using antisense against Has2 and 3. Lower images are in situ hybridization using sense probes as negative controls.

      (6) CSPGs as well as TNC are part of the neural stem/progenitors cell niche environment. Does the removal of either of the ECM compounds affect the proliferation, differentiation, and/or survival of NSPCs, or their progeny?

      (7) This question relates to the fact that the migration process itself is not visualized in the present study, rather its outcome - the quantitative distribution of labeled neurons in the different bins of the analysis. This could also derive from modified cell numbers.

      As pointed out by the reviewer, previous studies have shown the role of CSPGs and TNC as components of the neural stem/progenitor cell niche (see reviews by Faissner et al., 2017; Faissner and Reinhard, 2015). However, as mentioned in Response #2, based on our analyses, we did not observe a reduction in neural stem/progenitor cells in NCAN/TNC double-knockout mice. While we cannot precisely explain this discrepancy, it is worth noting that many past studies evaluated the activities of the ECM molecules in in vitro systems such as neurospheres. The observed differences may stem from variations in experimental systems.

      (8) What is the role of the ECM in the SP/IZ area? Do the cells need the ECM to advance, the reduction would then leave the neuronal progenitors in the VZ area? This somehow contrasts with interpretations that the ECM acts as an obstacle for neurite growth or cell migration, or as a kind of barrier.

      The role of the ECM is multifaceted, with certain ECM molecules known to inhibit neurite outgrowth while others facilitate it. Additionally, the effects of ECM can vary depending on the cell type. It is established that after migrating neurons adhere to radial fibers, they utilize these fibers as a scaffold to migrate toward the cortical surface. However, in the subplate/intermediate zone, migrating neurons have not yet adhered to radial fibers. This study provides evidence that multipolar neurons undergo morphological changes into bipolar cells with the assistance of the NCAN, TNC, and HA complex. Subsequently, this facilitates their movement along radial fibers.

      (9) A direct visualization of the movement of neural progenitors in the tissue as has been for example performed by the Kriegstein laboratory might help resolve some of these issues.

      As suggested by the reviewer, utilizing live imaging techniques to directly observe the movement of neural progenitors within the tissue is indeed a powerful tool. We recognize the significance of addressing these points in future research.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Zhang et al. investigated the relationship between monocular and binocular responses of V1 superficial-layer neurons using two-photon calcium imaging. They found a strong relationship in their data: neurons that exhibited a greater preference for one eye or the other (high ocular dominance) were more likely to be suppressed under binocular stimulation, whereas neurons that are more equivalently driven by each eye (low ocular dominance) were more likely to be enhanced by binocular stimulation. This result chiefly demonstrates the relationship between ocular dominance and binocular responses in V1, corroborating what has been shown previously using electrophysiological techniques but now with greater spatial resolution (albeit less temporal resolution). The binocular responses were well-fitted by a model that institutes divisive normalization between the eyes that accounts for both the suppression and enhancement phenomena observed in the subpopulation of binocular neurons. In so doing, the authors reify the importance of incorporating ocular dominance in computational models of binocular combination.

      The conclusions of this paper are mostly well supported by the data, but there are some limitations of the methodology that need to be clarified, and an expansion of how the results relate to previous work would better contextualize these important findings in the literature.

      Strengths:

      The two-photon imaging technique used to resolve the activity of individual neurons within intact brain tissue grants a host of advantages. Foremost, two-photon imaging confers considerably high spatial resolution. As a result, the authors were able to sample and analyze the activity from thousands of verified superficial-layer V1 neurons. The animal model used, awake macaques, is also highly relevant for the study of binocular combination. Macaques, like humans, are binocular animals, meaning they have forward-facing eyes that confer overlapping visual fields. Importantly, macaque V1 is organized into cortical columns that process specific visual features from the separate eyes just like in humans. In combination with a powerful imaging technique, this allowed the authors to evaluate the monocular and binocular response profiles of V1 neurons that are situated within neighboring ocular dominance columns, a novel feat. To this aim, the approach was well-executed and should instill further confidence in the notion that V1 neurons combine monocular information in a manner that is dependent on the strength of their ocular dominance.

      Weaknesses:

      While two-photon imaging provides excellent spatial resolution, its temporal resolution is often lower compared to some other techniques, such as electrophysiology. This limits the ability to study the fast dynamics of neuronal activity, a well-understood trade-off of the method. The issue is more so that the authors draw comparisons to electrophysiological studies without explicit appreciation of the temporal difference between these techniques. In a similar vein, two-photon imaging is limited spatially in terms of cortical depth, preferentially sampling from neurons in layers 2/3. This limitation does not invalidate any of the interpretations but should be considered by readers, especially when making comparisons to previous electrophysiological reports using microelectrode linear arrays that sample from all cortical layers. Indeed, it is likely that a complete picture of early cortical binocular processing will require high spatial resolution (i.e., sampling from neurons in neighboring ocular dominance columns, from pia mater to white matter) at the biophysically relevant timescales (1ms resolution, capturing response dynamics over the full duration of the stimulus presentation, including the transient onset and steady-state periods).

      To address the same concern from all three reviewers, we discussed the technical limitations of two photon calcium imaging at the end of Discussion, including limited imaging depth, low temporal resolution, and nonlinearity. The relevant texts are copied here:

      (Ln 304) “Limitations of the current study

      Although capable of sampling a large number of neurons at cellular resolution and with low sampling bias, two-photon calcium imaging has known limitations that may make it better suited as a complementary research tool to electrophysiological recordings.

      For example, two-photon imaging can only sample neurons from superficial-layers, while binocular neurons also exist in deeper layers, and even neurons in the input layer are affected by feedback from downstream binocular neurons to exhibit binocular response properties (Dougherty, Cox, Westerberg, & Maier, 2019). Furthermore, calcium signals are relatively slow and cannot reveal the fast dynamics of neuronal responses. Due to these spatial and temporal limitations, a more complete picture of the neuronal mechanisms underlying binocular combination of monocular responses may come from studies using both technologies.

      In addition, calcium signals may exaggerate the nonlinear properties of neurons. Although calcium signals indicated by GCaMP5, our favored choice of calcium indicator, display a linear relationship to neuronal spike rates within a range of 10-150 Hz (Li, Liu, Jiang, Lee, & Tang, 2017), weak and strong signals out of this range are more nonlinear, and may appear poorer and stronger, respectively, than electrode-recorded effects. Consequently, the differences in population responses between monocular and binocular stimulations revealed by this study might be less pronounced.”

      (Recommendations For The Authors):

      Overall, my main suggestion for the authors to improve the paper is to revise some of the interpretations of their results in relation to previous research. The purpose of the present study was to illustrate a more complete picture of the binocular combination of monocular responses by taking into consideration the ocular dominance of V1 cells (lines 34-36). A study published earlier this year had an identical purpose (Mitchell et al., Current Biology, 2023) and arrived at a highly similar conclusion (and also applied divisive normalization to fit their data). I would ask that this paper be mentioned in the introduction and discussed.

      The Mitchell et al 2023 paper is added to the Introduction and Discussion:

      (Ln 50) “In addition (to the Dougherty et al 2019 paper from the same group), Mitchell, Carlson, Westerberg, Cox, and Maier (2023) reported that binocular combination of monocular stimuli with different contrasts is also affected by neurons’ eye preference.”

      (Ln 286) “The critical roles of ocular dominance have been largely overlooked by extant binocular vision models to our knowledge, except that Anderson and Movshon (1989) demonstrated that a model consisting of multiple ocular dominance channels can better explain their psychophysical adaptation data, and that Mitchell et al. (2023) revealed that binocular combination of different contrasts presented to different eyes are affected by neurons’ ocularity preference.”

      Nevertheless, the results of the present study are very valuable. They add substantial spatial resolution and sophisticated relational analysis of monocular and binocular responses that Mitchell et al., 2023 did not include. Therefore, my suggestion is to emphasize the advantages of two-photon imaging in the introduction, focusing on the ability to image neurons in neighboring ocular dominance columns. The rigorous modeling of the relationship between nearby neurons with a range of eye preferences, in tandem with the incredible yield of two-photon imaging, is what sets this paper apart from previous electrophysiological work.

      The finding that binocular responses were dependent on ocular dominance is largely consistent with previous electrophysiological results. However, there should be a paragraph in the discussion section that speaks to the limitations of comparing two-photon imaging data to electrophysiological data. Namely, there are two limitations:

      (1) These two techniques confer different temporal resolutions. It is conceivable that some of the electrophysiology relationships (for example, described by Dougherty et al., 2019) may be dependent on the temporal window over which the data was averaged, typically over 50-100ms around stimulus onset, or 100-250ms comprising the neurons' sustained response to the stimulus. This possible explanation of the difference in obtained results would be especially useful for the discussion paragraph starting at line 232. It would also be helpful to readers for there to be some mention of the advantage of having high temporal resolution (i.e., the benefits of electrophysiology) since (a) recent work has distinguished between sequential stages of binocular combination (Cox et al., 2019) and (b) modern models of V1 neurons emphasize recurrent feedback to explain V1 temporal dynamics (see Heeger et al., 2019; Rubin et al., 2015), which could prove to be relevant for combination of stimuli in the two eyes (Fleet et al., 1997).

      Our discussion regarding the technical limitations of 2-p calcium imaging has been listed earlier. Specific to the Dougherty et al. 2019 paper, we added the following discussion to address the issue of the temporal resolution difference between the two technologies.

      (Ln 266) “In addition, it is unclear whether the discrepancies are caused by different temporal resolutions of electrode recording and calcium imaging. The results of Dougherty et al. (2019) represent changes of neuronal spike activities over a period of approximately 50-200 ms after the stimulus onset, which may reflect the sustained neuronal responses to the stimulus and possible feedback signals. Calcium signals are much slower and indicative of the aggregated neuronal responses over a longer period (up to 1000 ms in the current study). They should have smeared, rather than exaggerated, the differences between monocular and binocular responses, although we cannot exclude the possibility that some neuronal response changes beyond 200 ms are responsible for the discrepancies.”

      (2) The sample of V1 neurons in this study is limited to cells in the most superficial layers of the cortex (layers 2/3). This limitation is, of course, well understood, but it should be mentioned at least in the context of studying the formative mechanisms of binocular combination in V1 (since we know that binocular neurons also exist in layers 5/6, and there is now substantial evidence that even layer 4 neurons are not as "monocular" as we previously thought (Dougherty et al., 2019)).

      See our discussion regarding the technical limitations of 2-p calcium imaging listed earlier.

      In short, I believe the paper would be improved by (1) adding the above citations in the appropriate places, (2) acknowledging in the introduction that this question has been investigated electrophysiologically but emphasizing the advantages of two-photon imaging, and (3) adding a paragraph to the discussion section that discusses the temporal and spatial limitations when using two-photon imaging to study binocular combination, particularly when comparing the results to electrophysiology.

      Reviewer #2 (Public Review):

      Summary:

      This study examines the pattern of responses produced by the combination of left-eye and right-eye signals in V1. For this, they used calcium imaging of neurons in V1 of awake, fixating monkeys. They take advantage of calcium imaging, which yields large populations of neurons in each field of view. With their data set, they observe how response magnitude relates to ocular dominance across the entire population. They analyze carefully how the relationship changed as the visual stimulus switched from contra-eye only, ipsi-eye only, and binocular. As expected, the contra-eye-dominated neurons responded strongly with a contra-eye-only stimulus. The ipsi-eye-dominated neurons responded strongly with an ipsi-eye-only stimulus. The surprise was responses to a binocular stimulus. The responses were similarly weak across the entire population, regardless of each neuron's ocular dominance. They conclude that this pattern of responses could be explained by interocular divisive normalization, followed by binocular summation.

      Strengths:

      A major strength of this work is that the model-fitting was done on a large population of simultaneously recorded neurons. This approach is an advancement over previous work, which did model-fitting on individual neurons. The fitted model in the manuscript represents the pattern observed across the large population in V1, and washes out any particular property of individual neurons. Given the large neuronal population from which the conclusion was drawn, the authors provide solid evidence supporting their conclusion. They also observed consistency across 5 fields of view.

      The experiments were designed and executed appropriately to test their hypothesis. Their data support their conclusion.

      Weaknesses:

      One weakness of their study is that calcium signals can exaggerate the nonlinear properties of neurons. Calcium imaging renders poor responses poorer and strong responses stronger, compared to single-unit recording. In particular, the dramatic change in the population response between monocular stimulation and binocular stimulation could actually be less pronounced when measured with single-unit recording methods. This means their choice of recording method could have accidentally exaggerated the evidence of their finding.

      We discussed the nonlinearity of calcium signals as part of the technical limitations of 2-p calcium imaging. The calcium indicator we use, GCaMP5, has a reasonable range of linear relationship with spike rates. But out of this range, the nonlinearity is indeed a concern.

      (Ln 314) “In addition, calcium signals may exaggerate the nonlinear properties of neurons. Although signals indicated by GCaMP5, our favored choice of calcium indicator, display a linear relationship to neuronal spike rate within a range of 10-150 Hz (Li et al., 2017), weak and strong signals out of this range are more nonlinear, and may appear poorer and stronger, respectively, than electrode-recorded effects. Consequently, the changes in population responses between monocular and binocular stimulations revealed by this study might be less pronounced.”

      The implication of their finding is that strong ocular dominance is the result of release from interocular suppression by a monocular stimulus, rather than the lack of binocular combination as many traditional studies have assumed. This could significantly advance our understanding of the binocular combination circuitry of V1. The entire population of neurons could be part of a binocular combination circuitry present in V1.

      This is a very good insight. We added the following sentences to the end of the first paragraph of Discussion:

      (Ln 242) “These findings implicate that at least for neurons in superficial layers of V1, significant ocular dominance may result from a release of interocular suppression during monocular stimulation, an unusual viewing condition as our vision is typically binocular, rather than a lack of binocular combination of inputs from upstream monocular neurons.”

      (Recommendations For The Authors):

      Line 150: "To model interocular response suppression, responses from each eye in Eq. 2 were further normalized by an interocular suppression factor wib or wcb," I recommend the authors improve their explanation of how they arrived at Eq. 3 from Eq. 2. As it stands, my impression is that they have one model for the responses to monocular stimulation, and another model for the responses to binocular stimulation. What I think is missing is that both equations are derived from the same model. Monocular stimulation is a situation in which the stimulus in one eye's contrast is zero. Could the authors clarify whether this situation produces an interocular suppression of zero, and how that leads to Eq. 2?

      We rewrote the modeling part to show that Equations 1-3 are sequential steps of development for the same model. We also added a brief paragraph to discuss how Eq. 3 could lead to Eq. 2 under monocular viewing:

      (Ln 166) “Although not shown in Eq. 3, we also assumed that the nonlinear exponent b also depends on the contrast of the stimulus presented to the other eye (i.e., Sc or Si). Consequently, when Sc or Si = 0 under monocular stimulation, Rc or Ri = 0 (Eq. 1), and interocular suppression wib or wcb = 1, so Eq. 3 changes back to Eq. 2. It is only when Sc and Si are equal and close to 1, as in the current study, that interocular suppression and binocular combination would be in the current Eq. 3 format.”

      Line 225: "However, individually, compared to monocular responses, responses of monocular neurons more preferring the stimulated eye are actually suppressed, and only responses of binocular neurons are increased by binocular stimulation." This sentence is difficult to follow. I recommend the authors improve clarity by breaking up the sentence into several sentences. If I understand correctly, they summarize the pattern in the data that is indicative of interocular divisive normalization, i.e., their final conclusion.

      This sentence no longer exists in the Discussion.

      Line 426: "Third, for those showing significant orientation difference, the trial-based orientation responses of each neuron were fitted with a Gaussian model with a MATLAB nonlinear least squares function:" The choice of using a Gaussian function to fit orientation tuning was probably suboptimal. A Gaussian function provides an adequate fit only for neurons whose tuning is very sharp. The responses outside of the peak fall down to the baseline and the two ends meet. Otherwise, the two ends do not meet. An adequate fit would be achieved with a function of a circular variable, which wraps around 180 deg. I recommend using a Von Mises function for fitting orientation tuning.

      We agree with the reviewer that the Von Mises function is more accurate than the Gaussian for fitting orientation tuning functions. Indeed, we are using it to fit the orientation tuning of V4 neurons, many of which have two peaks. For the current V1 data, the differences between Von Mises and Gaussian fits are very small, as shown in the orientation functional maps from three macaques below. Because we also use the same Gaussian fitting of orientation tuning in several published papers and in papers currently under review, we prefer to keep the Gaussian fitting results in the manuscript.

      Author response image 1.
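
      The reviewer's point about wrap-around can be illustrated with a minimal sketch. It is not the authors' MATLAB fitting code: the function names, the preferred orientation, and the width parameters (`sigma`, `kappa`) are hypothetical values chosen for illustration. It compares a plain Gaussian with a Von Mises function of period 180 deg at the two ends of the orientation axis.

```python
import math

def gaussian_tuning(theta, pref, sigma, amp=1.0, base=0.0):
    """Non-circular Gaussian: distance is measured on a line, so it does not wrap."""
    d = theta - pref
    return base + amp * math.exp(-d * d / (2.0 * sigma * sigma))

def von_mises_tuning(theta, pref, kappa, amp=1.0, base=0.0):
    """Von Mises with a 180-deg period (the angle is doubled to cover a full circle)."""
    d = math.radians(2.0 * (theta - pref))
    return base + amp * math.exp(kappa * (math.cos(d) - 1.0))

# Hypothetical broadly tuned neuron preferring 10 deg.
pref = 10.0
g0 = gaussian_tuning(0.0, pref, sigma=40.0)
g180 = gaussian_tuning(180.0, pref, sigma=40.0)
v0 = von_mises_tuning(0.0, pref, kappa=1.0)
v180 = von_mises_tuning(180.0, pref, kappa=1.0)

# 0 deg and 180 deg are the same orientation, so the two ends should meet.
print(f"Gaussian:  R(0)={g0:.3f}  R(180)={g180:.3f}")
print(f"Von Mises: R(0)={v0:.3f}  R(180)={v180:.3f}")
```

      For sharply tuned neurons both functions fall to baseline away from the peak, which is consistent with the authors' observation that the two fits differ little for their V1 data.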

      Reviewer #3 (Public Review):

      The authors have made simultaneous recordings of the responses of large numbers of neurons from the primary visual cortex using optical two-photon imaging of calcium signals from the superficial layers of the cortex. Recordings were made to compare the responses of the cortical neurons under normal binocular viewing of a flat screen with both eyes open and monocular viewing of the same screen with one eye's view blocked by a translucent filter. The screen displayed visual stimuli comprising small contrast patches of Gabor function distributions of luminance, a stimulus that is known to excite cortical neurons.

      This is an important data set, given the large numbers of neurons recorded. The authors present a simple model to explain the binocular combination of neuronal signals from the right and left eyes.

      The limitations of the paper as written are as follows. These points can be addressed with some additional analysis and rewriting of sections of the paper. No new experimental data need to be collected.

      (1) The authors should acknowledge the fact that these recordings arise from neurons in the superficial layers of the cortex. This limitation arises from the usual constraints on optical imaging in the macaque cortex. This means that the sample of neurons forming this data set is not fully representative of the population of binocular neurons within the visual cortex. This limitation is important in comparing the outcome of these experiments with the results from other studies of binocular combination, which have used single-electrode recording. Electrode recording will result in a sample of neurons that is drawn from many layers of the cortex, rather than just the superficial layers.

      See our discussion regarding the technical limitations of 2-p calcium imaging listed earlier.

      (2) Single-neuron recording of binocular neurons in the primary visual cortex has shown that these neurons often have some spontaneous activity. Assessment of this spontaneous level of firing is important for accurate model fitting [1]. The paper here should discuss the level of spontaneous neuronal firing and its potential significance.

      We have noticed previously that at non-optimal spatial frequencies, calcium responses to a moving Gabor grating are close to zero (Guan et al., Prog Neurobiology, 2021, Fig. 1B), but we cannot tell whether this is due to calcium response nonlinearity or to a close-to-zero level of spontaneous neuronal activity. Prince et al. (2002) reported low spontaneous responses of V1 neurons with moving grating stimuli (e.g., about 3 spikes/sec in one exemplar neuron, their Fig. 1B), so this does not appear to be a large effect. In our data fitting, we do have an orientation-unspecific component in the Gaussian model, which represents the neuronal response at a non-preferred orientation, but not necessarily the spontaneous activity.

      (3) The arrangements for visual stimulation and comparison of binocular and monocular responses mean that the stereoscopic disparity of the binocular stimuli is always at zero or close to zero. The animal's fixation point is in the centre of a single display that is viewed binocularly. The fixation point is, by definition, at zero disparity. The other points on the flat display are also at zero disparity or very close to zero because they lie in the same depth plane. There will be some small deviations from exactly zero because the geometry of the viewing arrangements results in the extremities of the display being at a slightly different distance than the centre. Therefore, the visual stimulation used to test the binocular condition is always at zero disparity, with a slight deviation from zero at the edges of the display, and never changes. [There is a detail that can be ignored. The experimenters tested neurons with visual stimulation at different real distances from the eyes, but this is not relevant here. Provided the animals accurately converged their eyes on the provided binocular fixation point, then the disparity of the visual stimuli will always be at or close to zero, regardless of viewing distance in these circumstances.] However, we already know from earlier work that neurons in the visual cortex exhibit a range of selectivity for binocular disparity. Some neurons have their peak response at non-zero disparities, representing binocular depths nearer than the fixation depth or beyond it. The response of other neurons is maximally suppressed by disparities at the depth of the fixation point (so-called Tuned Inhibitory [TI] neurons). 
      The simple model and analysis presented in the paper for the summation of monocular responses to predict binocular responses will perform adequately for neurons that are tuned to zero disparity, so-called tuned excitatory neurons [TE], but is necessarily compromised when applied to neurons that have other, different tuning profiles. Specifically, when neurons are stimulated binocularly with a non-preferred disparity, the binocular response may be lower than the monocular response [2, 3]. This more realistic view of binocular responses needs to be considered by the authors and integrated into their modelling.

      We agree and have included the following text in our discussion of future work:

      (Ln 298) “In addition, in our experiments, binocular stimuli were presented with zero disparity, which best triggered the responses of neurons with zero-disparity tuning. A more realistic model of binocular combination also requires the consideration of neurons with other disparity-tuning profiles.”

      (4) The data in the paper show some features that have been reported before but are not captured by the model. Notably for neurons with extreme values of ocular dominance, the binocular response is typically less than the larger of the two monocular responses. This is apparent in the row of plots in Figure 2D from individual animals and in the pooled data in Figure 2E. Responses of this type are characteristic of tuned inhibitory [TI] neurons[2]. It is not immediately clear why this feature of the data does not appear in the summary and analysis in Figure 3.

      This difference is indeed captured by the model, which can be more easily appreciated in Fig. 4A where monocular and binocular model simulations are plotted in the same panel. In the text, we also wrote: (Ln 195) “It is apparent that binocular responses cannot be explained by the sum of monocular responses, as binocular responses are substantially lower than the summed monocular responses for both monocular and binocular neurons. Nor can binocular responses be explained by the responses to the preferred eye, as binocular responses are also lower than those to the preferred eye (the larger of the two monocular responses) for monocular neurons.”

      The paper text states that the responses were "first normalized by the median of the binocular responses". This will certainly get rid of this characteristic of the data, but this step needs better justification, or an amendment to the main analysis is needed.

      The relevant sentence has been rewritten as “Monocular and binocular data of each FOV/depth, as well as the pooled data, were first normalized by the respective median of the binocular responses of all neurons in the same FOV/depth.” This normalization brings the overall binocular responses to around unity, which facilitates comparisons across FOVs/depths, but it does not affect the overall characteristics of the data.
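
      The normalization described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name and the response values are hypothetical. It shows that dividing every response in one FOV/depth by a single positive constant (the median binocular response) leaves each neuron's binocular-to-monocular ratio unchanged.

```python
from statistics import median

def normalize_by_binocular_median(monocular, binocular):
    """Divide all responses from one FOV/depth by the median binocular response."""
    m = median(binocular)
    if m <= 0:
        raise ValueError("median binocular response must be positive")
    return [r / m for r in monocular], [r / m for r in binocular]

# Hypothetical responses of three neurons from one FOV/depth (arbitrary units).
mono = [3.0, 1.0, 2.0]   # response to the preferred eye
bino = [1.5, 2.0, 4.0]   # response to binocular stimulation

mono_n, bino_n = normalize_by_binocular_median(mono, bino)

# Per-neuron binocular/monocular ratios before and after normalization.
ratios_raw = [b / m for b, m in zip(bino, mono)]
ratios_norm = [b / m for b, m in zip(bino_n, mono_n)]
print(median(bino_n), ratios_raw, ratios_norm)
```

      In this sketch, a neuron whose binocular response is lower than its preferred-eye response before normalization remains so afterwards.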

      In the present form, the model and analysis do not appear to fit the data in Figure 2 as accurately as needed.

      Thanks for pointing out the problem; the data fitting for FOV C_270 and the pooled data was especially inaccurate. The issue was mostly fixed by weighting each datum by its standard deviation (please see the updated Fig. 3).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations For The Authors):

      In its current form, I would exclude the cryo-EM data from the manuscript. It does not add much and it is distracting from the excellent work that you did on the functional characterization of the variant. Alternatively, you could try to improve the resolution and see if you can get some more meaningful analysis out of the structures? I noticed that you only collected very small datasets. If you decide to pursue a higher resolution reconstruction, collecting more movies will give you a better chance to obtain a higher resolution.

      We express our gratitude to the reviewer for their invaluable feedback. While we acknowledge that our structure is currently of low resolution, it still provides valuable insights into the proximity of the splice insert to the N412 glycan density. This proximity, together with the low-resolution map, hindered the complete modeling of all the splice residues. Notably, this structure represents the first depiction of this particular splice variant. Consequently, it lays a foundation for subsequent studies in the field, and hence we would like to keep it in the manuscript. As per the reviewers’ suggestions, we have now included comparisons of our structure with the recently reported GluK1-2a receptor structure (Mayerson et al. 2022). We do plan to obtain higher-resolution structures in the future.

      I would probably also exclude the RNAseq analysis. I think that Figure 1 is fine, but the supplement 1 is not very successful in convincing me that the exon 9 is expressed mainly in early stages of brain development. In addition, the plot in Figure 1 indicates strong expression in the cerebellar cortex in 20s and 30s. If you decide to keep the data, I strongly encourage you to include more details on the analysis in the methods section.

      Thanks for this insightful comment. We have now modified this section extensively for better clarity. Indeed, the expression of this variant seems to be dynamic across brain regions; this has now been specified in the revised manuscript. Figure 1 shows the expression of GRIK1 exon 9 across different regions of the human brain and donor ages. Supplementary Figure 1 zooms in on one such region, the cerebral cortex, where we observe the maximum expression of GRIK1. In this region, we also observed higher expression of exon 9 in the early stages of development. The scales of Figure 1 (0-4 RPKM) and Supplementary Figure 1 (0-6 RPKM) differ because other exons are expressed more strongly in Supplementary Figure 1 (for example, an expression of 4 RPKM appears in a shade of red in Figure 1, whereas the same value appears orange-yellow in Supplementary Figure 1). With Supplementary Figure 1, we wanted to show the expression of exon 9 relative to other exons across developmental stages, which shows that GluK1-1 is highly expressed in the initial stages of life. More details on the analysis have now been added to the Methods section.

      Additionally, there are a few minor issues in the data presentation:

      (1) In Fig. 2C there seems to be a mismatch between the green dose-response plot and the GluK1-2a trace shown. The plot reports an EC50 of 187.7 uM, whereas in the sample trace 0.25 mM agonist activates only to ~20%.

      We have verified the data and statistics, confirming their consistency with the values reported in the manuscript. For Figure 2C, we present representative traces from a single cell. However, the EC50 value was calculated using Hill's equation based on averaged data from 5 cells.
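
      The reviewer's arithmetic can be reproduced with a minimal Hill-equation sketch. The EC50 of 187.7 µM is taken from the comment above; the Hill coefficient is not stated in this response, so `n_hill = 1.0` is an assumption, and the function name is hypothetical.

```python
def hill_response(conc_um, ec50_um, n_hill=1.0):
    """Fraction of the maximal response predicted by the Hill equation."""
    return conc_um ** n_hill / (ec50_um ** n_hill + conc_um ** n_hill)

EC50_UM = 187.7   # reported EC50 from the averaged dose-response curve
N_HILL = 1.0      # assumed Hill coefficient (not given in the text)

at_ec50 = hill_response(EC50_UM, EC50_UM, N_HILL)    # half-maximal by definition
at_250um = hill_response(250.0, EC50_UM, N_HILL)     # 0.25 mM agonist
print(f"fraction at EC50:   {at_ec50:.2f}")
print(f"fraction at 250 uM: {at_250um:.2f}")
```

      Under these assumptions the averaged curve predicts more than half-maximal activation at 0.25 mM, which is why a single-cell trace showing about 20% stands out; the authors attribute the difference to the trace coming from one cell while the EC50 is derived from the average of 5 cells.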

      (2) The axis label is misprinted in Figure 3C

      Thanks. Corrected.

      (3) In Fig 5 supplement 1, panel B - the 3 last labels above the western blot lanes are off so it is difficult to see which sample corresponds to which lane.

      Thanks. We have corrected the figure.

      Reviewer #2 (Recommendations For The Authors):

      Overall I congratulate the authors of this study nicely done. It represents a large body of work.

      We thank the reviewer for his/her time and positive comments.

      I have several minor corrections that the authors could consider for the revision of the manuscript.

      P7. The desensitization rate of GluK1-2a was "delayed"... replace by "increased".

      Corrected.

      P9. Last line 0.37; P.. Add the P value.

      P value has been added as suggested.

      P11 authors indicate that the K368/375/379/382H376-E mutant exhibits a significant difference in desensitization properties in the presence of Neto1, but on the 1st line of p11, they provide a P value above 0.05

      We thank the reviewer for pointing out this discrepancy and have fixed it. We discuss two mutants that show slower desensitization than GluK1-1a co-expressed with Neto1. The difference is significant for the K to E mutant, while the τdes value for the K368/375/379/382H376-E mutant shows the same pattern, though not significantly. We have now modified the text to explain this more clearly.

      P19 the calculation of mean weighted tau TDes is not clear and should be better explained.

      Thanks. We have added more details in the Methods section. We analyzed the current decays in response to 1–2 ms or 1 s applications by fitting an exponential function or the sum of two exponential functions. This analysis allowed us to derive a weighted mean τdes using the formula [(τ1 × amplitude1) + (τ2 × amplitude2)]/[amplitude1 + amplitude2]. The tau values represent the time constants obtained from the exponential fits, while the amplitudes correspond to the estimated contributions of each component to the total peak current amplitude.

      [(A1 * t1) + (A2 * t2)] / (A1 + A2)

      It represents the calculation of a weighted mean, where A1 and A2 are the amplitudes, and t1 and t2 are the corresponding time constants. The formula calculates the overall mean time constant by taking into account the contribution of each component to the total amplitude.
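
      The weighted-mean formula above can be checked numerically with a short sketch. The function name and the amplitudes and time constants are hypothetical values for illustration, not data from the study.

```python
def weighted_tau(a1, tau1, a2, tau2):
    """Amplitude-weighted mean time constant: (A1*t1 + A2*t2) / (A1 + A2)."""
    total = a1 + a2
    if total == 0:
        raise ValueError("amplitudes must not sum to zero")
    return (a1 * tau1 + a2 * tau2) / total

# Hypothetical bi-exponential fit: a fast component carrying 80 pA
# and a slow component carrying 20 pA of the peak current.
tau_w = weighted_tau(a1=80.0, tau1=5.0, a2=20.0, tau2=50.0)   # time constants in ms
print(f"weighted tau_des = {tau_w:.1f} ms")

# With a single exponential (second amplitude zero) the result is just tau1.
tau_single = weighted_tau(a1=100.0, tau1=5.0, a2=0.0, tau2=50.0)
```

      The weighted value lies between the two time constants and is pulled toward the component that contributes the larger amplitude.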

      P19 the rate of recovery was obtained by fitting the one-phase association "with" exponential function. With is missing.

      We have corrected this error.  Thanks.

      P21 which method has been used for site directed mutagenesis

      Overlapping PCR was carried out for mutagenesis using the primers listed in Figure 4-table supplement 1. A ligation-free cloning approach (Zhang et al., 2017) was used. This is now described in the Methods section under “Site-directed mutagenesis”.

      P21 and 22. Provide complete reference of reagent including species of antibodies.

      Thanks. We have added all the details in the methods section now. 

      Anti-His: Rabbit mAb #12698 (Cell Signaling Technology)

      Anti-Neto1: Rabbit #SAB3500679 (Sigma Aldrich)

      Anti-GFP: Mouse mAb G1546 (Sigma Aldrich)

      Anti-actin: Mouse mAb A3853 (Sigma Aldrich)

      P22 How much anti His antibody was used with 40microliter of protein A?

      We used 2 µg of anti-His antibody per 40 µL of Protein A slurry. This has now been added to the Methods section.

      P23 Authors seem to have used a virus to express protein but the protocol is not given. For example what is P2 virus?

      We have now modified the manuscript to include details of baculovirus generation as per the protocol described in Goehring et al. 2014. We followed the same protocol, in which the second generation of virus (P2) generated in insect (SF9) cells was used to infect suspension-adapted HEK293-T cells for large-scale GluK1-1aEM protein expression.

      Reviewer #3 (Recommendations For The Authors):

      Major concerns:

      (1) The effect of the splice insert on Gluk1 regulation by Neto proteins is not fully clear. For example, experiments in Fig. 3G indicate that the desensitization time for Gluk1-1a + Neto2 is ~32ms. This value is half compared with data obtained from whole-cell experiments shown in Fig. 3A (~70ms). What is the reason for this discrepancy? If variability is observed between experiments, I wonder how valid are the comparisons made in panel A between GluK1-1a+Neto2 vs GluK1-2a+Neto2 groups. In the case of recovery analysis, authors found significant differences comparing both groups in the presence of Neto (Fig. 3B) but recovery times are not identical for Gluk1-1a vs Gluk1-2a (without Neto). Thus, I wonder if the fold change related to the control group (without Neto) is different.

      We appreciate your detailed feedback, which has allowed us to clarify and reinforce the validity of our experimental findings. Different recording configurations were used: outside-out patch recordings (Fig. 3G) versus whole-cell recordings (Fig. 3A). Whole-cell recordings average responses over a larger membrane area and also have slower solution exchange times compared to outside-out patch recordings. This may have contributed to the variability in desensitization times. However, similar trends were observed in our whole-cell and outside-out patch recordings. Further, all the data except those presented in Figs 3G and 3H are from whole-cell recordings. We have performed multiple independent experiments and utilized rigorous statistical analyses to validate our comparisons. We report mean values with standard deviations or confidence intervals to provide a more accurate representation of the data.

      Neto1 significantly speeds up the recovery from desensitization for both variants, with a more pronounced effect on GluK1-1a (GluK1-1a + Neto1: 0.68 s) compared to GluK1-2a (GluK1-2a + Neto1: 1.15 s). The recovery times are not identical for the two variants, likely due to the presence of the splice insert in GluK1-1a. Neto2, on the other hand, slows recovery for both variants without significant differential effects. In the absence of Neto proteins, however, recovery from the desensitized state is faster for GluK1-1 than for GluK1-2a, although the difference is not significant.

      In the case of the glutamate concentration-response curve (Fig. 3C), EC50 values for Neto1 and Neto2 are relatively the same, but this approach on its own does not provide insights about the role of the splice insert. Previous experiments with the Gluk1 reveal differences between EC50 in the presence of Neto1 or 2 (Fisher, 2015), suggesting that the insert could regulate glutamate binding affinity, but still, this point is not directly demonstrated in this work.

      Thanks for this insightful comment. Indeed, we cannot conclude that splice residues directly affect glutamate sensitivity and have modified the text accordingly. The Fisher paper demonstrated that both Neto1 and Neto2 can influence the glutamate sensitivity of GluK1-2a, which has an EC50 of 124.6 ± 16.2 µM. Specifically, in the presence of Neto1 and Neto2, the EC50 values are 4.4 ± 0.4 µM and 13.7 ± 4.2 µM, respectively, indicating a noticeable effect that does not differ substantially between GluK1-2a coexpressed with Neto1 and with Neto2. Our observations for GluK1-1a are similar, with both Neto1 and Neto2 producing a leftward shift.

      (2) Similar to the previous point, a proper interpretation of mutant data is missing in the manuscript. From current data, it is difficult to visualize the role of the insert on Neto-dependent regulation, mainly because some mutations alone affect Gluk1-1 channel properties. The authors conclude their data by stating that "while the modulation of the receptor by Neto 1 is affected by mutations in splice insert, the modulation by Neto 2 remains largely unaffected" (Page 13). However, this statement is confusing since the co-expression of Gluk1-1a with Neto2 (Fig. 5) prevents the effect caused by mutation K368 alone (Fig. 4), indicating that modulations by Neto 2 are indeed potentially affected by the mutations. Please, clarify. Also, the effect of the K368/375/379/382H376-E mutant on Neto modulation (pink bar in Fig. 5) is impossible to interpret properly since the effect of the mutation alone is not shown in the manuscript.

      Thanks for seeking this important clarification. It is indeed true that splice residue mutations themselves affect the receptor functional properties in comparison to the wild-type receptors. For the sake of clarity, we have presented the effect of splice mutants on receptor properties separately from the effect of mutations on modulation by Neto proteins. Figure 4 demonstrates a comparison between wild-type and mutant receptors without the Neto proteins, showcasing different kinetic properties, while Figure 5 provides detailed information on the role of the insert in Neto-dependent regulation. 

      It’s true we could not record the effect of the K368/375/379/382H376-E mutant alone or when coexpressed with Neto 2 due to low peak amplitudes (mentioned in Table 1) that prevented reliable comparisons. However, robust currents were observed when the same mutant was coexpressed with Neto1, and hence comparisons were shown for this mutant with GluK1-1a wild-type + Neto1. 

      We have now modified the statement "while the modulation of the receptor by Neto 1 is affected by mutations in splice insert, the modulation by Neto 2 remains largely unaffected" and the last paragraph as follows:

      “Neto1 appears to have more pronounced effects on the mutant receptors compared to Neto2. Specifically, Neto1 significantly slowed desensitization for the K368-E mutant, accelerated recovery from desensitization for K368-E and K368/375/379/382H376-E mutants, increased agonist efficacy for K368-E and K375/379/382H376-E mutants, and altered rectification properties for K368E and K368/375/379/382H376-E mutants. In contrast, Neto2 had fewer significant effects on the mutant receptors, with the main impact being an increase in agonist efficacy for the K368-E mutant. Notably, Neto2 did not significantly affect desensitization, recovery from desensitization, or rectification properties of the mutant receptors when compared with wildtype GluK1-1a coexpressed with Neto2. These findings suggest that the splice residues in GluK1-1a differentially influence receptor modulation by Neto1 and Neto2, with Neto1 showing more extensive modulation of the mutant receptors' functional properties.”

      (3) An open question after reading this interesting work is if the proposed change in Neto regulation because of the splice insert is due to changes in Gluk1-Neto interactions or because the rearrangement after interaction with Neto proteins is different. Pull-down experiments (Fig 5 Sup.1) suggest that the splice insert and all the mutants tested do not prevent interaction with Neto proteins. I wonder if the authors could complement their data with a quantitative approach/analysis to demonstrate if the splice insert and the mutants affect Neto1/2 interactions (as expected for the rationale when creating the mutants).

      Thank you for this insightful suggestion. You raise an important point about distinguishing between changes in GluK1-Neto interactions and potential differences in receptor rearrangement after Neto binding. While our pull-down experiments suggest that the splice insert and mutants don't prevent Neto interactions (probably due to a larger interaction interface all along the receptor), a quantitative approach would indeed provide more nuanced information. In future studies, we plan to use a quantitative approach such as surface plasmon resonance to assess how mutations in the splice insert and/or Neto proteins change these interactions in different states of the receptor. In addition, obtaining cryo-EM structures of GluK1 splice variants in complex with Neto1 and Neto2 would provide crucial insights into their interaction interfaces and any conformational changes induced by binding.

      (4) Related to the Gluk1-1a structure, the authors state that the overall structure is similar to the one without the insert (page 14); however, this is not properly shown in the manuscript. Even if the overall architecture of the channel is the same, authors should make a proper/adequate comparison between both structures/domains to support their claims. Also, one should expect that the insertion of 15 amino acids would affect in some way the closing neighboring domains. The differential effect of the splice insert on glutamate and kainate EC50 values (Fig. 2 and Fig. 2 sup.1), suggests that the insert could introduce a sort of rearrangement in the binding domain. Thus, I wonder if a more elaborated analysis of the current structural data could reveal some structural insights that would explain the specific functional differences due to the splice insert. If the low resolution and the missing residues avoid making some comparisons and establish differences between sidechain orientations, still, a proper comparison between the domain backbones would be helpful to validate the author's statement at least. Also, I wonder if the changes could be resolved better in a closed state or APO structure, instead of the desensitized structure. Finally, are the structures obtained in DDM and nanodiscs similar?

      As per the reviewer’s suggestion, we have now added a new figure in the supplementary information, “Figure 6-figure supplement 9,” where we show a superimposition of GluK1-1aEM (detergent-solubilized or reconstituted in nanodiscs) and GluK1-2a (PDB:7LVT; silver) showing overall conservation of the structures in the desensitized state.

      As evident from the figure and rmsd values mentioned above, we do not observe significant movements at either the ATD or LBD layers of GluK1-1a with respect to GluK1-2a. Also, as can be observed, the DDM-solubilized and nanodisc-reconstituted GluK1-1a structures (Panel A) are very similar, with an rmsd of ~2.19 Å across all 2664 Cα atom pairs. Due to the low resolution of our structures, we have refrained from carrying out detailed structural comparisons.

      Our efforts to capture the closed state or apo state structures have failed due to either severe orientation bias (only top views) or increased heterogeneity. 

      (5) Methods section lacks relevant information for proper data interpretation as well as for replicating some experiments in the future. For example:

      A) The experimental design to determine the rectification index with a Ramp protocol is not clear: 1) Why the authors applied a ramp protocol if receptors desensitize along the time? Please clarify the protocol.

      Ramp protocols were used only for the wild-type receptors to compare their voltage-dependent behavior, as this was the first study to compare the two splice variants. All kainate receptors (GluK1-GluK5) desensitize over time. However, their rectification properties have been studied previously (both in the absence and presence of Neto proteins) using ramp protocols, as they are faster than step protocols.

      B) Are polyamines included in the solutions to perform the rectification assays?

      No, polyamines were not added to the intracellular solution, and the effect of the endogenous polyamine block was measured. This has now been specified in the results as well as the methods section.

      C) It is not clear if the experiments to calculate IK/IG ratios were performed in the same preparation (This is, the same cell was stimulated with glutamate and then kainate or vice versa).

      Indeed, the current responses to glutamate and kainate were recorded in the same cell (the same cell was stimulated by glutamate and then kainate) so that the responses can be compared. This has now been specified in the Methods section.

      D) The experimental design for calculating recovery is not clear.

      We employed a double pulse protocol to measure receptor recovery. The protocol involved applying two consecutive pulses of agonist stimulation to the receptor. Initially, we applied a brief agonist pulse to activate the receptor, followed by a specific recovery period. After the recovery period, we administered a second agonist pulse to assess the receptor's recovery response. The receptor's recovery was determined by comparing the response amplitude of the second pulse to that of the first pulse, providing valuable insights into the receptor's recovery kinetics. Recovery rates were calculated with single exponential association fits in Prism. We have now modified the text for better clarity.
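
      The double-pulse analysis described above can be sketched as follows. This is a simplified stand-in, not the Prism procedure the authors used: Prism performs nonlinear regression, whereas this sketch assumes recovery from 0 to a plateau of 1 and estimates the time constant with a log-linear least-squares fit. All names and values are hypothetical, and the data are synthetic and noise-free.

```python
import math

def recovery_fraction(p1_amp, p2_amp):
    """Second-pulse peak expressed as a fraction of the first-pulse peak."""
    return p2_amp / p1_amp

def fit_recovery_tau(intervals_s, fractions):
    """Estimate tau for f(t) = 1 - exp(-t / tau) via a line through the origin
    fitted to ln(1 - f) versus t (slope = -1 / tau)."""
    num = 0.0
    den = 0.0
    for t, f in zip(intervals_s, fractions):
        if not 0.0 <= f < 1.0:
            continue  # the log transform is undefined for f >= 1
        num += t * math.log(1.0 - f)
        den += t * t
    if den == 0.0 or num >= 0.0:
        raise ValueError("not enough usable points to estimate tau")
    return -den / num

# Synthetic double-pulse data generated with a known (hypothetical) tau of 1.0 s.
true_tau_s = 1.0
intervals_s = [0.1, 0.3, 1.0, 3.0]
p1_amp = 100.0  # first-pulse peak, pA
p2_amps = [p1_amp * (1.0 - math.exp(-t / true_tau_s)) for t in intervals_s]

fractions = [recovery_fraction(p1_amp, p2) for p2 in p2_amps]
tau_hat_s = fit_recovery_tau(intervals_s, fractions)
print(f"estimated recovery tau = {tau_hat_s:.3f} s")
```

      With real, noisy recordings a nonlinear fit with a free plateau, as in a one-phase association model, is usually preferable to the log transform used here.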

      E) Please indicate the species used for both functional and Cryo-EM (rat Gluk1 isoform?).

      Thanks for pointing this out. We have now specified in relevant methodology sections that Rattus norvegicus GluK1 and Neto proteins were used in this study.

      F) Please describe the nanodisc reconstitution protocol and how the nanodisc protein was purified, if appropriate.

      The MSP1E3D1 was purified by following the protocol given by the Sligar group in 2014 (doi: 10.1016/S0076-6879(09)64011-8). The nanodisc reconstitution protocol has now been elaborated in the revised manuscript.

      G) Site-directed mutagenesis methodology is incomplete. Please check.

      We have now elaborated this section to include more details.

      Minor concerns:

      (1) Authors state that splice residues are ~30A away from the TM domain. Currently, there is no friendly representation showing the localization of the splice in the structure, besides Fig.6E. The manuscript could benefit itself if authors include a better 3D representation or a scheme to highlight the position of the splice relative to critical domains.

      Thanks for pointing this out. The distance between TRP 381 CA (ATD) and LEU 636 CA (TM3) is 92.10 Å. We have changed the value in the text to ~92 Å.

      Author response image 1.

      (2) Authors mention that mutations in the insert to alanine show normal traffic to the plasma membrane but low current amplitude. Then, I wonder if single-channel conductance, mean open time or open probability is affected by the splice insert. Showing the effects of the insert on single-channel properties would strengthen the manuscript's quality.

      It is a good suggestion. However, as can be observed from our whole-cell and outside-out patch data, we obtained low peak amplitudes (<50 pA) for many of our receptor-only constructs and also suffered from high SEM for some recordings due to heterogeneity between cells of the same population. We will consider studying the single-channel properties of these receptors in future experiments.

      (3) It is unclear how the insert or the mutations specifically affect glutamate- or kainate-induced responses because authors analyze IK/IG ratios only. Maybe authors could consider including an analysis of the role of the insert on specific glutamate- or kainate-induced response to gain insights about ligand selectivity.

      All the values have been included in the Excel file of raw data. We have included the desensitization kinetics of mutant receptors in the presence of glutamate and compared them to the wild-type GluK1-1a. Kainate-induced responses were very heterogeneous (high SEM for % desensitization) and hence have not been included in the main data.

      (4) Please be consistent with nomenclature along the manuscript to avoid confusion. For example, Are Gluk-1-1 and Gluk-1-1a referring to the same variant?

      GluK1-1 has been used in the abstract and the introduction, where we introduce the N-terminal splice variant which either has the 15 residues (termed GluK1-1) or lacks them (GluK1-2). The C-terminal splice variants for GluK1 are named “a-d”, with “a” being the smallest C-terminal domain variant. Later in the manuscript, we have used only the GluK1-1a terminology to represent the ATD splice variant with the shortest C-terminal domain.

      The introduction and spatiotemporal results talk about the GluK1-1 receptors wherein the 

      (5) Legend figure 2: Repeated phrase should be removed. Please check.

      (6) Page 8: "This is similar to the effect observed in GluK1-2 receptors whereby the glutamate EC50 was shown to increase by Neto proteins [Neto1: 34-fold and Neto2: 7.5-fold (Palacios-Filardo et al., 2016) and Neto1/2: 10-30X (Fisher, 2015)]". It seems that values from Fisher's paper are backward. Please correct. 

      (7) Page 9. Second paragraph. Spelling mistake when referring to Fig. 3G.

      Thanks for pointing out the inadvertent errors; we have now corrected all of them.

      (8) Figure 3: The title in Y axis overlaps with the figure. Please check.

      We have corrected the error.

      (9) Page 10: "In addition, K375/379/382H376-E mutant also exhibited a slowdown in the recovery (K375/379/382H376-E: 4.83 ± 0.31 s P=0.2774) (Figure 4C; Table 1)." Statistical analysis indicates this is not correct. Please tone down this statement. For example: "...mutant also exhibited a trend to a slowdown in the recovery although differences do not reach statistical significance".

      Thanks. We have modified the statement as suggested.

      (10) Page 11: "and a reduction was observed for K375/379/382H376-E receptors (1.17 ± 0.28 P=0.3733) compared to wild-type (Figure 4D; Table 1)." Same issue as the previous minor comment.

      Thanks. We have modified the statement as suggested.

      (11) Page 11: "We observed that mutants K368-E and K368/375/379/382H376-E, desensitize significantly slower in the presence of Neto1" This statement is not true for K368/375/379/382H376-E mutant. Please correct.

      Thanks. We have modified the statement as suggested and specified the difference.

      (12) Legend Figure 4. Colored asterisks are not clear in the figure. Please check.

      Thanks. The reference to colored asterisks has been removed from the legend as they are not used.

      (13) Representative data shown in Fig 5 sup.2A do not match very well with the final quantification shown in Fig 5A. Please check. Also, the authors state in the result section (page 10) that data shown in Fig. 5A indicate that "GluK1-1a modulation by Neto 1 is influenced by the splice residues". This could be true only for residue K368; however, this is not so obvious since the two mutants containing K368E are inconsistent. Please check and clarify.

      Only representative traces are shown in Fig 5 sup 2 A. However, the quantification shown in Fig 5 A is from multiple cells. We have rechecked all the data and found it to be consistent. We have rewritten this section and modified it for better clarity.

      (14) Figure 6-supplement 2: Please incorporate missing values of MW standards in panel B.

      Thanks. We have modified the figure to include values for MW standards.

      (15) It is not clear the rationale for showing construct C552Y C557V C575S in Fig. 6 sup.3, panel A. This mutant is not mentioned in the manuscript.

      It has been mentioned in the methodology section under “Construct design for expression and purification of rat GluK1-1aEM”. It (C552Y C557V C576S) is one of the constructs used in optimizations that were checked for good protein yields. Based on FSEC protein profiles, we used C552Y, C557V (2X Cys mutant) as GluK1-1aEM, which is mentioned in the same section.

      (16) Fig. 6 sup.4 Not clear what does mean w.r.c. Please specify in the legend.

      With respect to (w. r. t.) has been specified in the manuscript.

      (17) Suggestion to improve data presentation in Fig. 4D and Fig. 3 sup.1B: For easier comparison of IK/IG ratios, representative traces for kainate and glutamate in the same group could be shown using the same Y-scale.

      It has been purposely shown with two different Y-scales due to the differences in peak amplitudes in the presence of glutamate or kainate. 

      (18) Fig. 3 sup.1A: Based on the figure legend, horizontal bars representing the application of glutamate are not consistent with time scale bars. Please, check. In the same figure, panel B, the representative traces shown for GluK-1a-Neto1 are not consistent with IK/IG ratio shown in Fig. 3D.

      Thanks, we have corrected the horizontal bars representing glutamate application. The representative traces shown for GluK-1a-Neto1 were rechecked and are consistent with the IK/IG ratio shown in Fig. 3D.

      (19) I wonder if the authors could discuss the lack of Neto1 effect on the wild type Gluk1-2a channel, as proposed previously.

      Sheng et al., 2015 showed that Neto1 enhances the desensitization onset of GluK1. However, it is unclear which GluK1 splice variants were used in that study. GluK1 has several splice variants, but in the present study, we specifically compared GluK1-1a and GluK1-2a. In our case, we did not observe an effect of Neto1 on wild-type GluK1-2a with either of the two techniques (whole-cell and outside-out patch) we used in our study. However, as can be observed from our data, the GluK1-2a receptor alone shows faster desensitization kinetics than reported in the previous study (Copits et al., 2011). The differences could stem from different experimental conditions, such as the constructs and recording conditions used.

      Copits BA, Robbins JS, Frausto S, Swanson GT. Synaptic targeting and functional modulation of GluK1 kainate receptors by the auxiliary neuropilin and tolloid-like (NETO) proteins. Journal of Neuroscience. 2011 May 18;31(20):7334-40.

      Sheng N, Shi YS, Lomash RM, Roche KW, Nicoll RA. Neto auxiliary proteins control both the trafficking and biophysical properties of the kainate receptor GluK1. Elife. 2015 Dec 31;4:e11682. doi: 10.7554/eLife.11682. PMID: 26720915; PMCID: PMC4749551.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Strengths:

      The authors embarked on an ambitious journey to seek the answer regarding 3D genome changes predisposing to metastatic organotropism. The authors succeeded in the assembly of a comprehensive panel of breast cancer cell lines and the aggregation of the 3D genome structure data to conduct a hypothesis-driven computation analysis. The authors also achieved in including proper controls representing normal non-cancerous epithelium and the end organ of interest. The authors did well in the citation of relevant references in 3D genome organization and EMT.

      Weaknesses:

      (1) The authors should clearly indicate how they determine the patterns of spread of the breast cancer cell lines being utilized in this manuscript. How did the authors arrive at the conclusion that certain cell lines would be determined as "localized spread" and "metastatic tropism to the lung"? This definition is crucial, and I will explain why.

      It is indeed a critical point to clearly define and explain what qualifies as metastatic potential to particular organs in our system. Here, we intentionally limited our scope to metastasis that had occurred within the human system. Our cell lines are chosen based on their sites of origin and etiological history in the patients from which they were derived. For example, the cancer cell line BT474 was classified as “localized” because these cells were derived from a solid tumor in the breast itself. Meanwhile, MCF7 and T47D cell lines are considered lung metastatic because these cells were collected from the pleural effusion from the lung. We therefore model human organotropism from the breast to the lung by using cells that originated from infiltrative ductal carcinoma (human breast) but were collected from pleural effusions (human lung). We then use as a comparison a human lung cancer-derived cell line that was itself purified from a pleural effusion. In this way, we can compare the genome structure of a lung cancer cell in the lung environment to a breast cancer cell that has metastasized to the lung environment.

      In our revised version, we further clarify this definition in the text as well as in additional annotations in our supplemental table of all cell line information.

      Todd Golub's team from the Broad Institute of MIT and Harvard published "A metastasis map of human cancer cell lines" to exhaustively create a first-generation metastasis map (MetMap) that reveals organ-specific patterns of metastasis. (By the way, this work was not cited in the reference in this manuscript.) The MetMap Explorer (https://depmap.org/metmap/vis-app/index.html) is a public resource that could be openly accessed to visualize the metastatic potential of each cell line as determined by the in vivo barcoding approach as described in the MetMap paper in the format of petal plots. 5 organs were tested in the MetMap paper, including brain, lung, liver, kidney, and bone. The authors would discover that some of the organ-specific metastasis patterns defined in the MetMap Explorer would be different from the authors' classification. For example, the authors defined MCF7 as a line as lung metastatic, and rightly so the MetMap charted a signal towards lung with low penetrance and low metastatic potential. The authors defined ZR751 as a line with localized spread, however, the MetMap charted a signal towards the kidney with low penetrance and low metastatic potential, the signal strength similar to the lung metastasis in MCF7. A similar argument could be made for T47D. The TNBC line MDA-MB-231 is indeed highly metastatic, however, in MetMap data, its metastasis is not only specific to the lung but towards all 5 organs with high penetrance and metastatic potential. The 2 lung cancer cell lines mentioned in this study, A549 and H460, the authors defined them as localized spread to the lung. However, the MetMap data clearly indicated that A549 and H460 are highly metastatic to all 5 organs with high penetrance and high metastatic potential.

      We acknowledge the valuable contributions of animal models in metastatic cancer studies, but we also want to avoid the potentially confounding variable of the animal microenvironment. The MetMap Explorer contains valuable information (and as part of our clarification on this point, we now cite the MetMap in the text), but the “metastatic potential of each cell line” for this tool is measured in a mouse environment. Knowing that a particular cell line, which originated from a human lung metastasis, can further metastasize to other organs in a mouse does not necessarily mean that those cells could do so in humans. The microenvironment responses to metastatic colonization recapitulate the events in wound repair, and these can differ among species (https://pubmed.ncbi.nlm.nih.gov/28916657/, https://pubmed.ncbi.nlm.nih.gov/39729995/). Further, the changes a cell needs to make to adapt to a new organ system in a mouse could be confounded by the changes needed to adapt to mouse conditions in general. Finally, migration from a site of ectopic injection may not mimic migration from an initial tumor site. These factors lead to well-known cases where MetMap does not reflect the metastatic potential of cancers in humans. As a classic example, prostate cancer frequently metastasizes to bone in humans, and the PC3 cell line was derived from a bone metastatic prostate cancer. However, MetMap shows no evidence of PC3 being able to metastasize to bone in a mouse.

      We agree that the very best data would come from matched primary and metastatic tumors in the same human patient, but those data do not currently exist and generating them would require future work beyond the scope of this study.

      Since results will vary among different experimental models testing metastatic organotropism, (intracardiac injection was the metastasis model being adopted in the MetMap), the authors should state more clearly which experimental model system served as the basis for their definition of organ-specific metastasis. In my opinion, this is the most crucial first step for this entire study to be sound and solid.

      Taking all the above into account, in our revision, we have now included further clarification in the main text to more clearly explain how and why we chose the cell lines we did and what the advantages and limitations of this choice are.

      (2) Figure 1b: The authors found that "MDA-MB-231 cells were grouped with the lung carcinoma cells. This implies that the genome organization of this cell line is closer to that of lung cells than to other breast epithelial cell lines.". In fact, another TNBC line BT549 was also clustered under the same clade. So this clade consisted of normal-like and highly metastatic lines. Therefore, the authors should be mindful of the fact that the compartment features might not directly link to metastasis (or even metastatic organotropism).

      In figure 1b, the grouping that links MDA-MB-231 (lung metastatic breast cancer) to A549 and H460 (lung cancer) occurs at a distance of about 0.2. If the clustering tree were cut at a distance of 0.26, 6 separate clusters would result: two clusters of Luminal subtypes (all labeled red), one that includes all healthy epithelial cells (both lung and breast, all labeled green), one that links two localized breast cancers, one that links MDA-MB-231 to lung carcinoma cell lines, and then BT549 by itself. So, while BT549 appears next to MDA-MB-231 along the horizontal axis, this is just a coincidence of the representation: the dendrogram shows it is quite distant from all the other cell lines in this cluster according to compartment profile.

      So, it is only MDA-MB-231 that is very closely linked with the lung cancer cell types.
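
      The effect of cutting a dendrogram at a fixed height can be illustrated with a small sketch. It assumes average linkage (the response does not state which linkage was used), and the sample labels and distances are invented for illustration rather than taken from Figure 1b. Two samples that are drawn side by side in a tree still end up in separate clusters if they only join above the cut height.

```python
def cut_average_linkage(names, dist, cut):
    """Merge clusters (average linkage) until the closest pair is farther than `cut`."""
    clusters = [[i] for i in range(len(names))]

    def linkage(a, b):
        # Mean pairwise distance between members of clusters a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > cut:
            break  # every remaining merge lies above the cut height
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(names[i] for i in c) for c in clusters]

# Hypothetical labels and distances (symmetric matrix, zero diagonal).
names = ["met_breast", "lung_ca_1", "lung_ca_2", "outlier"]
dist = [
    [0.00, 0.20, 0.22, 0.40],
    [0.20, 0.00, 0.10, 0.42],
    [0.22, 0.10, 0.00, 0.45],
    [0.40, 0.42, 0.45, 0.00],
]
groups = cut_average_linkage(names, dist, cut=0.26)
```

      In this toy example the three close samples merge below the cut, while the fourth sample stays on its own because it only joins the others at a height above 0.26.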

      It is true that the healthy lung cells (HTBE) are clustered separately and are more similar to normal/non tumorigenic breast epithelial cells (HMEC and MCF10A) than to any cancer cell type. This could suggest that there are aspects of the compartment pattern that represent any healthy epithelium as compared to cancer. What we find in the compartment profile, in both the clustering and the PCA analysis, is that compartment signatures contain information about cell properties on several overlapping levels: there is an aspect of the compartment profile that distinguishes healthy from cancerous cells, an aspect that distinguishes luminal cancers from other subtypes, a part that associates with organotropism, and an aspect that captures EMT status. The final compartment status is a composite of these numerous factors.

      We have clarified the text to indicate that we mean MDA-MB-231 clusters near lung cancer, not necessarily healthy lung cell models.

      (3) Figure 3: In the text, the authors stated, "To further investigate this result, we examined the transcription status of genes that changed compartment across the EMT spectrum and, conversely, the compartment status of genes that changed transcription (Fig. 3b, c, and d)". However, it was not apparent in the figure that the cell lines were arranged according to an EMT spectrum.

      To display these comparisons more clearly, we have now revised figure 3b, c, and d in two ways: First, we have defined the gene and cell line clustering by one set of data (for example, compartment identity in 3b) and then displayed the other data (gene expression) with all genes and cell lines in the same order. Therefore, for each column, genes and cell lines can be compared visually between top and bottom rows. Second, we have colored cell line names from purple to yellow according to their EMT scores as shown in Supplementary Figure 1a. This allows a visual indication of how the clustering separates cell lines by EMT status.

      Also, the clustering heatmaps did not provide sufficient information regarding the genes with concordant/divergent compartments vs transcription changes. It would be more informative if the authors could spend more effort in annotating these genes/pathways.

      We want to clarify that the genes plotted in the heatmaps in Figure 3 are also the genes whose functional enrichment we present in figures 1 and 2. So, the genes that segregate strongly based on A/B compartment (but not gene expression) in figure 3b are the same genes whose GO terms are annotated in Figure 1d. Likewise, the genes that segregate strongly based on gene expression, but not A/B compartment, in figure 3c and d are the same genes whose GO terms are annotated in Figure 2b. We have now made this connection clearer in the text.

      But, we also agree with the reviewer that it is important to explore a bit further the relationship between these divergent sets of genes. Our explorations have led to several observations:

      (1) In some cases, the compartment-segregated genes and the transcription-segregated genes are different members of the same pathways. In Author response image 1 below, for example, we show interactions (according to STRING) for genes from figure 3c that are highly expressed in the epithelial-like cell lines and are annotated as involved in epithelial development (green). We then added to the network genes from figure 3b that are specifically in the A compartment in the epithelial-like cell lines but not mesenchymal cell lines that are also annotated as involved in epithelial development (red). Most of these epithelial development genes that change expression are in the A compartment in all cell lines and therefore do not rely on spatial compartment changes for their regulation. But some additional epithelial development genes, which are interconnected in this same network, are changing compartments across the EMT spectrum. One example, FOXA1, is a key hub in the network and is known to be a pioneer transcription factor involved in development and differentiation. Controlling this gene at the level of spatial genome organization rather than local transcriptional control could be important in the stable cell fate changes that can happen with EMT.

      Author response image 1.

      (2) Overall, the set of genes that change compartments does not have as strong functional enrichment as the transcription change set of genes. This could indicate that some of the compartment changes that occur with EMT are not directly gene regulatory but rather enable an overall conformational change of the chromatin that is needed for the alterations in physical cell state or to accomplish long distance gene regulation changes.

      (3) Related to long distance gene regulation changes, we also see cases in which the gene that changes transcription but not compartment across EMT is adjacent to regions that switch compartments.

      A good example is TFF3 (yellow, Supplementary figure 1C). TFF3 is one of the genes that strongly segregates across EMT by transcription, being more highly expressed in epithelial-like (bottom 4 tracks) but not mesenchymal-like (top 4 tracks) cancers. Despite this differential expression, it is almost always in the A compartment across all cell lines. However, it is adjacent to regions that show strong compartment change EMT signatures. So, even though this specific gene region is not changing compartment, its regulation may be influenced by the entire region being A-associated in epithelial-like but neighboring regions becoming B-associated in mesenchymal-like cancers.

      TFF3 is expressed in normal breast epithelium and has been implicated as a biomarker for endocrine therapy response in breast cancer.

      Meanwhile, many genes that are in these compartment switching regions (BACE2, DSCAM, PDE9A) are not among the strongest expression signature genes.

      (4) Interestingly, some of the regions (such as the region shown in Supplementary figure 1C) that change compartment across the breast cancer spectrum overlap with regions that we found change compartment in the progression of prostate cancer, as shown in the string.db enrichment analysis below.

      Author response image 2.

      In our revised manuscript, we now include more of these explanations in the text and include the example offset compartment and transcription change region shown above as panel c of Supplementary Figure 1.

      (4) Figure 4: The title of the subheading of this section was "Lung metastatic breast cancer cell lines acquire lung-like genome architecture". Echoing my comments in point 1, I am a bit hesitant to term it as "lung metastatic" but rather "metastatic" in general since cell lines such as MDA-MB-231 do metastasize to other organs as well. However, I do get the point that the definition of "lung metastasis" is derived from the common metastasis features among the cell lines here (MCF7, T47D, SKBR3, MDA-MB-231). There might be another argument about whether the "lung" carcinoma cell lines can be considered "localized" since they are also capable of metastasizing to other organs.

      Rather than classifying cells on metastatic “potential” (as measured in a mouse), our cell lines are chosen based on their sites of origin and etiological history in the patients from which they were derived. Cancer cell lines called “lung metastasis” were collected from the pleural effusion from the human lung. Likewise, we call a cancer “localized” because it was taken from the tissue where the cancer originated, even if it might, if placed into a different context, be able to metastasize. We would argue that the genome structure features of the “localized” cancers reflect cancers that have not yet metastasized (even if they could in the future) while the “metastatic” cancers have already gone to a certain location (even if they could in theory have gone to a different location).

      In a way, what the authors probably were trying to leverage here is the "tissue" identity of that organ.

      Having said this, in addition to showing the "lung permissive changes", the authors should show the "breast identity conservation" as well. Because this section started to deal with the concept of "tissue/lineage identity", the authors should also clarify whether these breast cancer cell lines capable of making lung metastasis are also preserving their original tissue identity from the compartment features (which would most likely be the case).

      This is a great question. We have now more explicitly checked the proportions of genomic regions that change compartments to match lung vs. maintaining breast-specific compartment identity. The graphs in Supplementary Figure 2 begin with all genomic bins that have distinctive compartment identity between non-cancerous breast and lung epithelial cells. Then, the plots show what fraction of these tissue-specific bins change compartment to match lung vs. maintaining breast identity in each breast cancer cell line category. As we have shown in other graphs, particularly for switches to the A compartment, more bins change to match lung in the metastatic vs. primary site cell lines. In most cases, more than 50% of the tissue-specific bins shift to look more like lung.
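
      The bookkeeping behind this kind of comparison can be sketched as follows. This is an illustrative sketch only: it assumes each genomic bin has already been assigned a binary A or B label in each sample, and the label strings and function name are hypothetical, not data or code from the study.

```python
def tissue_specific_shift(breast, lung, cancer):
    """Among bins where normal breast and lung differ, report which tissue the cancer matches."""
    specific = [i for i, (b, l) in enumerate(zip(breast, lung)) if b != l]
    if not specific:
        raise ValueError("no tissue-specific bins to compare")
    to_lung = sum(1 for i in specific if cancer[i] == lung[i])
    n = len(specific)
    return {
        "n_specific": n,
        "frac_lung": to_lung / n,            # bins now matching the lung label
        "frac_breast": (n - to_lung) / n,    # bins keeping the breast label
    }

# Hypothetical A/B calls for eight bins (one character per bin).
normal_breast = "AABBABBA"
normal_lung = "ABBAABAB"
breast_cancer = "ABBBABAA"
summary = tissue_specific_shift(normal_breast, normal_lung, breast_cancer)
```

      With binary labels the two fractions sum to 1, so a value of `frac_lung` above 0.5 means that most tissue-specific bins look more like lung than breast.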

      (5) Rest of the sections: The authors started to claim that the organ-specific metastasis permissive compartmental features mimic the destinated end organ. The authors utilized additional non-breast cancer cell lines (prostate cancer cell lines LNCaP as localized and DU145 as brain metastatic) in brain metastasis to strengthen this claim. (DU145 in MetMap again is highly metastatic to lung, brain, and kidney). However, this makes one wonder that for cell lines that are capable of metastasizing to multiple organ sites (eg. MDA-MB-231, DU145, A459, H460), does it mean that they all acquire the permissive features for all these organs? This scenario is clinically relevant in Stage 4 patients who often present with not only one metastatic lesion in one single organ but multiple metastatic lesions in more than one organ (eg. concomitant liver and lung metastasis). Do the authors think that there might be different clones having different tropism-permissive 3D genome features or there might be evolutionary trajectory in this?

      In my opinion, to further prove this point, the authors might need to consider doing in vivo experiments to collect paired primary and organ-specific metastatic samples to look at the 3D genome changes.

      We agree that an ideal experimental follow up to this study would be to collect paired metastatic and primary tumors, either in mouse xenograft or, even better, from patients. This is beyond the scope of what we can do for our current paper, but we have added a statement to the discussion of further experiments that would be required to clarify this point.

      (6) Technically, the study utilized public Hi-C data without generating new Hi-C data. The resolution of the Hi-C data for compartments was set at 250KB as the binning size indicating that the Hi-C data was at lower resolution so it might not be ideal to address other 3D genome architecture changes such as TADs or long-range loops. It is therefore unknown whether there might be permissive TAD/loop changes associated with organotropism and this is the limitation of this study.

      Our decision to focus on A/B compartmentalization rather than TAD or loop structure in this analysis was intentional and biologically motivated, rather than solely being a reflection of data resolution. Both compartments and topologically associating domains (TADs) are key parts of genome organization, and disruption of these structures has the potential to alter downstream gene regulation, as shown by numerous studies. However, compartments have been found, more so than TADs, to be strongly associated with cell type and cell fate. Therefore, in this manuscript, we decided to focus only on the compartment organization changes between different healthy and cancerous cells, as they are more likely to represent the stable alterations of genome organization that accompany malignant transformation.

      (7) In the final sentence of the discussion the authors stated "Overall, our results suggest that genome spatial compartment changes can help encode a cell state that favors metastasis (EMT)". The "metastasis (EMT)" was in fact not clearly linked inside the manuscript. The authors did not provide a strong link between metastasis and EMT in their result description. It is also unclear whether the EMT-associated compartment identity would also correlate with the organotropic compartment identity.

      We agree that this statement involves too strong of an assumption. The literature on this topic is vast and complex, and while there is abundant evidence that pathways of EMT can play important roles in facilitating metastasis, there are other pathways at play in the metastatic process as well (https://journals.plos.org/Plosbiology/article?id=10.1371/journal.pbio.3002487). We have made a clearer statement about this in the text now.

      To address the question of whether the organotropic changes are related to the EMT changes, we calculated the overlap between the genomic bins that strongly segregated cell lines in the compartment principal component analysis (PC1) and those that showed “organotropic” changes. As you can see in supplementary table 3, this overlap is actually very small: only 3% of bins are important both for the EMT segregation of cell lines and for organotropism.

      We have now included this overlap information as supplementary table 3 and have addressed this in the text.
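
      An overlap of this kind reduces to a set intersection over bin indices. The sketch below is illustrative only: the bin indices are invented, and reporting the shared bins as a fraction of the union is an assumption, since the response does not state which denominator was used.

```python
def bin_overlap(emt_bins, organotropic_bins):
    """Count bins shared by two bin sets and express them as a fraction of the union."""
    emt, organ = set(emt_bins), set(organotropic_bins)
    shared = emt & organ
    union = emt | organ
    return {
        "shared": len(shared),
        "union": len(union),
        "fraction_of_union": len(shared) / len(union) if union else 0.0,
    }

# Hypothetical bin indices for the two signatures.
emt_signature_bins = range(0, 40)
organotropic_signature_bins = range(30, 50)
overlap = bin_overlap(emt_signature_bins, organotropic_signature_bins)
```

      Other denominators (for example, the size of either signature on its own) would give different percentages, so the choice should be stated alongside any reported overlap.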

      Reviewer #2 (Public review):

      Summary:

      This work addresses an important question of chromosome architecture changes associated with organotopic metastatic traits, showing important trends in genome reorganization. The most important observation is that 3D genome changes consistent with adaptations for new microenvironments, including lung metastatic breast cells exhibiting signatures of the genome architecture typical to a lung cell-like conformation and brain metastatic prostate cancer cells showing compartment shifts toward a brain-like state.

      Strengths:

      This work presents interesting original results, which will be important for future studies and biomedical implications of epigenetic regulation in norm and pathology.

      Weaknesses:

      The authors used publicly available data for 15 cell types. They should show how many different sources the data were obtained from and demonstrate that obtained results are consistent if the data from different sources were used.

      In our revised version, we have provided a clarified table of information about all the publicly available data used from all the cell lines, indicating the sources of the data. The 17 datasets used come from 8 different studies. So, indeed, the reviewer is correct that many different sources of data were used. To address the question of whether our results would be consistent if data from different sources were used, we created a comparison map of the A/B compartment profiles for data from multiple sources when it was available. You can see below that the Hi-C data from different sources for the same cell lines cluster quite closely and show high correlation and are well separated from different cell lines. So, we do not think that source batch effects play a major role in our results.

      Author response image 3.

      Recommendations for the authors:  

      Reviewer #1 (Recommendations for the authors):

      (1) Figure 1a: This figure could be re-formatted without the arrows. Arrows usually indicate upstream-to-downstream relationships along certain processes. Using arrows here would mislead people to think that the cell lines were derived from one another. The same could apply to the supplementary figures.

      We have now edited figure 1a to include lines linking cell lines, indicating conceptual relationships, rather than arrows, which would imply direct derivation.

      (2) Figure 1c: The PCA (PC2 axis) indeed seemed to separate the HER2 status quite well. One concern is MCF7, it is labeled as ERpos/HER2neg in MetMap but seems to be clustered as HER2pos in this study. Are they the same? (This again highlights the importance of cell line definition and annotation).

      It is a good point that MCF7, while generally considered HER2 negative (we indicate this negative status in Supplementary Table 1), falls near HER2 positive cells in PCA space. This indicates that PCA captures tendencies but is not a perfect classifier. In a high dimensional, complex system, it is expected that an unsupervised analysis such as this will not capture just one biological feature in a given principal component, and therefore something like HER2 status may not segregate perfectly. However, this analysis does suggest that MCF7 3D genome structure has features that are more similar to other HER2+ cell lines. This raises the interesting possibility that it may actually behave like HER2+ cells in some ways even while being HER2- itself. We have more clearly stated the MCF7 discrepancy in the text.

      Reviewer #2 (Recommendations for the authors):

      (1) The description of results can be shortened, to make it easier to read and understand.

      In our revision, we have tried to clarify where possible, but it was difficult to shorten without losing important caveats and context (especially to make important points emphasized by reviewer 1).

      (2) "100 most positive and negative eigenvalues for PC1" - please provide the correct description.

      We have altered this to make it clearer and more correct: “using the genes from the regions with the top 100 most positive and 100 most negative eigenvector loadings for this PC1”
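As a minimal sketch of the corrected description, the selection of regions with the most extreme PC1 eigenvector loadings could look like the following. The dictionary of loadings, the bin names, and the function name are hypothetical. The example uses `n=2` instead of 100 only to keep the toy input small.

```python
def extreme_loading_bins(loadings, n=100):
    """Return (most_positive, most_negative) bins by PC1 eigenvector loading.

    `loadings` maps a bin identifier to its PC1 loading. If fewer than
    2 * n bins are available, n is reduced so the two lists do not overlap.
    """
    ranked = sorted(loadings, key=loadings.get)  # ascending by loading
    n = min(n, len(ranked) // 2)
    if n == 0:
        return [], []
    most_negative = ranked[:n]
    most_positive = ranked[-n:][::-1]  # largest loading first
    return most_positive, most_negative


# Hypothetical PC1 loadings for six genomic bins.
pc1_loadings = {
    "bin_a": 0.9, "bin_b": -0.7, "bin_c": 0.1,
    "bin_d": -0.2, "bin_e": 0.5, "bin_f": -0.05,
}
top_positive, top_negative = extreme_loading_bins(pc1_loadings, n=2)
```

The genes located in the returned regions would then be passed on to the downstream analysis.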

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This manuscript provides novel and important findings regarding the impact of noradrenergic signaling from the locus coeruleus on hippocampal gene expression. The locus coeruleus is the sole source of noradrenaline to the hippocampus and many rapid molecular changes induced by stress are regulated by noradrenaline. This manuscript provides a rigorous investigation into hippocampal genes uniquely regulated by noradrenaline in the presence or absence of stress. Data were collected and analyses were performed using solid methodology, and the results mostly convincingly support the conclusion made with few weaknesses. The study would benefit from a more comprehensive analyses of sex differences.

      Response: We thank the reviewers and the editors for the positive evaluation of our work and for the constructive feedback. To address some of the key criticisms, we have performed several new experiments and analyses. Importantly, we now provide a much more rigorous comparison of males and females, which strongly suggests that there are no major sex differences in the transcriptomic response to stress and noradrenaline in the hippocampus. We think that these - and other additions discussed below - significantly strengthen the manuscript. We provide detailed responses to all the reviewers' comments. We have added numbers to the reviewers’ comments for easier referencing.

      Reviewer #1 (Public Review):

      Comment 1: Privitera et al., provide a comprehensive and rigorous assessment of how noradrenaline (NA) inputs from the locus coeruleus (LC) to the hippocampus regulate stress-induced acute changes in gene expression. They utilize RNA-sequencing with selective activation/inhibition of LC-NA activity using pharmacological, chemogenetic and optogenetic manipulations to identify a great number of reproducible sets of genes impacted by LC activation. It is noteworthy that this study compares transcriptomic changes in the hippocampus induced by stress alone, as compared with selective circuit activation/inhibition. This reveals a small set of genes that were found to be highly reproducible. Further, the publicly available data will be highly useful to the scientific community.

      Response: We are very grateful for this positive evaluation.

      Comment 2: A major strength of the study is the inclusion of both males and females. However, with this aspect of the study also lies the biggest weakness. While the experiments tested males and females, they were not powered for identifying sex differences. There are vast amounts of literature documenting the inherent sex differences, both under resting and stress-evoked conditions, in the LC-NA system and this is a major missed opportunity to better understand if there is an impact of these sex-specific differences at the genetic level in a major LC projection region. There are many instances whereby sex effects are apparent, but do not pass multiple testing correction due to low n's. The authors highlight one of them (Ctla2b) in supplemental figure 6. This gene is only upregulated by stress in females. It is appreciated that the manuscript provides an incredible amount of novel data, making the investigation of sex differences ambitious. Data are publicly available for others to conduct follow up work, and therefore it may be useful if a list of those genes that were different based on targeted interrogation of the dataset be provided with a clear statement that multiple testing corrections failed. This will aid further investigations that are powered to evaluate sex effects.

      Response: The assessment of the reviewers and the editorial feedback encouraged us to look more thoroughly into potential sex differences, because we believe it would indeed be a major additional strength if our manuscript could make a firm statement on this important issue. To this end, we have expanded the manuscript in two major ways:

      (1) To expand the analysis of sex effects also to the dorsal hippocampus, and to increase robustness of the data, we have performed RNA-seq in 32 additional samples of male and female mice exposed to stress (or control) and propranolol (or saline) injection. Figure 1f-h and Supplementary Figure 1d-f have been updated to reflect this new addition, and the results are presented in a new section on Pages 3-4 (pasted below for ease of reviewing). In summary, the new data strongly support our initial observation that the effects of stress on gene expression, as well as the effects of propranolol on blocking stress-induced effects, are highly similar in both sexes.

      (2) To further increase the power for detection of sex-effects, we have performed a small meta-analysis. For this, we combined several RNAseq datasets from the current manuscript and published datasets from our previous work (Floriou-Servou et al., 2018; von Ziegler et al., 2022), which also investigated transcriptomic sex-differences in the hippocampus 45 min after cold swim stress exposure in the same setup as used for the current manuscript. This approach increased our sample size to 51 males and 20 females. In summary, this well-powered approach shows no evidence for sex differences in the transcriptional response to stress, even when more lenient analyses were applied. These results are described in a new section on page 4, and summarized in Supplementary Figures 1f+g. This section is pasted below for ease of reviewing.

      "While blocking β-adrenergic receptors was able to block stress-induced gene expression, we did not test whether propranolol might decrease gene expression already at baseline, independent of stress. Additionally, all tests had thus far been conducted in male mice, raising the question about potential sex differences in NA-mediated transcriptomic responses. To address these two issues, we repeated the experiment in both sexes and included a group that received a propranolol injection but was not exposed to stress (Fig. 1f). Combining the data from both experiments, we repeated the analysis for each region, to identify genes whose response to stress was inhibited by propranolol (Figure 1g). As in the previous experiment, we found that many of the stress-induced gene expression changes were blocked by propranolol injection in both dHC (Figure 1g, left panel) and vHC (Figure 1g, right panel). Importantly, propranolol did not change the expression level of these genes in the absence of stress. We then directly compared the genes sensitive to stress and propranolol treatment in both dHC and vHC. To this end, we plotted the union of genes showing a significant stress:propranolol interaction in either region in one heatmap across both dHC and vHC (Supplementary Figure 1d). This showed again that the stress-induced changes were very similar in dHC and vHC, and that propranolol similarly blocked many of them. Finally, we asked whether the response differs between males and females. Despite clear sex differences in gene expression at baseline (data not shown), we found no significant sex differences in response to stress or propranolol between male and female mice (FDR<0.05; Fig. 1g). To more directly visualize this, we compared females and males by plotting the log2-fold changes of the stress:propranolol interaction across all stress-induced genes that were blocked by propranolol. We find very similar regulation patterns in both sexes (Figure 1h). 
Although none of these sex differences are significant, some genes seem to show quantitative differences, so we plotted the expression patterns of the 5 genes showing the largest difference in interaction term as box-plots, which suggest that these spurious differences are likely due to noisy coefficient estimates (Supplementary Fig. 1e). To address concerns that our analysis of sex differences might not have been sufficiently powered, we performed a meta-analysis of the experiments shown here along with previously published datasets from our lab (Floriou-Servou et al. 2018; von Ziegler et al. 2022). In all these experiments, the vHC of male and female mice was profiled 45 min after exposure to an acute swim stress challenge. This resulted in a sample size of 51 males and 20 females. Despite this high number of independent samples, we could not identify any statistically significant interaction between sex and the stress response. To identify candidates that might not reach significance while discounting differences due to noise in fold-change estimates, we reproduced the same analysis using DESeq2 with Approximate Posterior Estimation for generalized linear model (apeglm) logFC shrinkage (A. Zhu, Ibrahim, and Love 2018). This analysis also did not reveal any sex differences in the stress response (Supplementary Fig. 1f). We then tailored the meta-analysis specifically to the set of stress-responsive genes that were blocked by propranolol, and also for these genes the response to stress was strikingly similar in both sexes (Supplementary Fig. 1g). Altogether, we conclude that there are no major sex differences in the rapid transcriptomic stress response in the hippocampus, and that blocking beta-receptors prevents a large set of stress-induced genes in both females and males."
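To make the idea of a sex-by-stress interaction concrete, the following minimal sketch computes a difference of differences on log2 expression values for a single gene. This is not the DESeq2/apeglm analysis described in the quoted passage. It is a simplified illustration, and all expression values and names are hypothetical.

```python
from statistics import mean


def log2_fold_change(control, stressed):
    """Difference of group means on log2-scale expression values."""
    return mean(stressed) - mean(control)


def sex_by_stress_interaction(ctrl_m, stress_m, ctrl_f, stress_f):
    """Female stress response minus male stress response (log2 scale).

    A value near zero means both sexes respond to stress similarly.
    """
    return log2_fold_change(ctrl_f, stress_f) - log2_fold_change(ctrl_m, stress_m)


# Hypothetical log2 expression values for one gene.
ctrl_m, stress_m = [5.0, 5.5], [7.0, 7.5]
ctrl_f, stress_f = [5.5, 6.0], [7.5, 8.5]

interaction = sex_by_stress_interaction(ctrl_m, stress_m, ctrl_f, stress_f)
```

A real analysis would model count data, estimate dispersion, and correct for multiple testing across genes. The sketch only shows which contrast the interaction term represents.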

      To put these findings in context with existing literature, we agree with the reviewer that there are many studies that have reported sex differences in the LC-circuitry as summarized by Bangasser and colleagues (Bangasser et al., 2016, 2019). However, these studies primarily focus on the LC itself, suggesting that female rats have more LC neurons, denser LC-dendrites in the peri-LC region, and that LC neurons are more readily activated by stress in females because of heightened sensitivity to CRF-signaling. A recent study in mice reports, in contrast, that females have fewer TH-positive neurons in the LC, but they also find enhanced excitability of LC neurons in females (Mariscal et al., 2023). Similarly, one study has suggested molecular differences in the makeup of the LC (Mulvey et al., 2018). Our experiments, however, focus on the impact of NA release in a projection region (hippocampus). Further, we use a strong stress induction protocol (swim stress) and various potent modes of direct LC activation, so differences in "LC-excitability" are likely less relevant in this context. We added evidence showing that we trigger powerful NA release in both sexes (Supplementary Figure 2c-h; see response to Reviewer #2, Comment #3 for more details). In addition, we show that the intensity or pattern of LC stimulation does not appear to alter the molecular response (Figure 3a-b), and that various stressors (mild or intense) all trigger the same NA-dependent molecular changes (Figure 4a-b). Therefore, our results suggest that once NA is released (in the hippocampus), the molecular downstream effects on gene expression are very similar - independent of stimulation intensity, sex, or hippocampal subregion (dorsal/ventral). 
This does not mean that there are no sex differences for activation of LC, but rather that the transcriptional response to NA release in the hippocampus is robust across sexes, and that propranolol seems to block NA-dependent effects similarly in both sexes. This does not rule out quantitative differences between sexes that only emerge with targeted analyses of individual genes, or once fluctuations in ovarian hormones are taken into account. We have updated the section in the discussion to summarize these considerations in light of the new results (see pages 20-21, section: "A uniform molecular response to stress and noradrenaline release in both sexes").

      Comment 3: A major finding of the present study is the involvement of noradrenergic transcriptomic changes occurring in astrocytic genes in the hippocampus. Given the stated importance of this finding within the discussion, it seems that some additional dialogue integrating this with current literature about the role of astrocytes in the hippocampus during stress or fear memory would be important.

      Response: We thank the reviewer for giving us an opportunity to add a more detailed discussion about the role of astrocytes and thyroid hormones in the hippocampus during learning and memory formation. We have added these statements to the discussion:

      “Within the hippocampus, astrocytic pathways are emerging as important players for learning and memory processes (Gibbs, Hutchinson, and Hertz 2008; Bohmbach et al. 2022). In fact, it is well-known that NA enhances memory consolidation (Schwabe et al. 2022; McGaugh and Roozendaal 2002), and recent work suggests that these effects are mediated by astrocytic β-adrenergic receptors (Gao et al. 2016; Iqbal et al. 2023). Our transcriptomic screens revealed Dio2 as the most prominent target influenced by LC activity. Dio2 is selectively expressed in astrocytes and encodes for the intracellular type II iodothyronine deiodinase, which converts thyroxine (T4) to the bioactive thyroid hormone 3,3',5-triiodothyronine (T3) and therefore regulates the local availability of T3 in the brain (Bianco et al. 2019). Enzymatic activity of DIO2 has further been shown to be increased by prolonged noradrenergic transmission through desipramine treatment in LC projection areas (Campos-Barros et al. 1994). This suggests that the LC-NA system and its widespread projections could act as a major regulator of brain-derived T3. Notably, T3-signaling plays a role in hippocampal memory formation (Rivas and Naranjo 2007; Sui et al. 2006), raising the possibility that NA-induced Dio2 activity in astrocytes might mediate some of these effects.”

      Comment 4: The comparison of the candidate genes activated by the LC in the present study (swim) with datasets published by Floriou-Servou et al., 2018 (Novelty, swim, restraint, and footshock) is an interesting and important comparison. Were there other stressors identified in this paper or other publications that do not regulate these candidate genes? Further, can references be added to clarify to the reader, that prior studies have identified that novelty, restraint and footshock all activate LC-NA neurons.

      Response: Thank you for the positive feedback. We have only tested the stressors reported in Figure 4a-b (novelty, swim, restraint, and footshock). It is known that all these stressors trigger noradrenaline release; in fact, we are not aware of stressors that do not trigger NA release. This reproducible finding supports the notion that the identified set of genes is indeed highly NA-responsive. As suggested, we have now included references that show increased NA release in response to all these stressors:

      “Therefore, we assessed their expression in a dataset comparing the effect of various stressors on the hippocampal transcriptome (Floriou-Servou et al., 2018). The stressors included restraint, novelty and footshock stress, which have all previously been shown to increase hippocampal NA release (Hajós-Korcsok et al., 2003; Lima et al., 2019; Masatoshi Tanaka et al., 1982).”

      Comment 5: Comparisons are made between chemogenetic studies and yohimbine, stating that fewer genes were activated by chemogenetic activation of LC neurons. There is clear justification for why this may occur, but a caveat may need to be mentioned, that evidence of neuronal activation in the LC by each of these methods were conducted at 90 (yohimbine) versus 45 (hM3Dq) minutes, and therefore it cannot be ruled out that differences in LC-NA activity levels might also contribute.

      Response: The reviewer raises an important point about some inconsistencies between the time points chosen in our study, an aspect that was also pointed out by Reviewer #2. We have chosen the 45 and 90 min time points for two different reasons. On the one hand, cFos changes on the protein level are known to peak 90 min after neuronal activation, and we wanted to capture the strongest possible cFos signal in the LC. On the other hand, we wanted to measure gene expression changes triggered by NA release, which already occur 45 min after noradrenergic activation (Roszkowski et al., 2016). Thus, when the experimental design allowed separate experiments (e.g. systemic yohimbine injection), we chose to measure gene expression after 45 min, but to validate cFos activation in the LC separately after 90min. In response to DREADD activation, however, we wanted to confirm within the same animal that LC activation was successful, and thus we collected LC and hippocampus simultaneously (Figure 2c,d). While the cFos increase is already very pronounced at the 45min time point (Figure 2g), the quality of IHC is slightly lower because the tissue cannot be perfused in this experimental design. Therefore, we do not think that the time point for cFos sampling matters in this context. However, we agree with the reviewer that it remains unclear whether yohimbine and DREADDs activate the LC with similar potency. To directly compare NA release would require a set of photometry-based experiments to measure NA release using genetically-encoded NA-sensors. While we have added such experiments for LC activation with DREADDs and optogenetics to show rapid NA release indeed occurs in the hippocampus (see Reviewer #2, Comment 3; Supplementary Figure 2c-h), yohimbine interferes with the NA-sensors as explained in detail in response to Reviewer 2, Comment 3. 
Thus, it was too challenging for us to directly compare the release dynamics in response to DREADDs and yohimbine, which was also not the main focus of our work. To explicitly address this caveat, we have extended the corresponding section in the discussion:

      "Finally, our observation that systemic administration of the α2-adrenergic receptor antagonist yohimbine very closely recapitulates the transcriptional response to stress stands in contrast to the much more selective transcriptional changes observed after chemogenetic or optogenetic LC-NA activation. This difference could be due to various factors. First, it remains unclear how strong the LC gets activated by yohimbine versus hM3Dq-DREADDs. However, given the potent LC activation observed after DREADD activation, it seems unlikely that yohimbine would lead to a more pronounced LC activation, thus explaining the stronger transcriptional effects. Second, contrary to LC-specific DREADD-activation, systemic yohimbine injection will also antagonize postsynaptic α2-adrenergic receptors throughout the brain (and periphery). More research is needed to determine whether this could have a more widespread impact on the hippocampus (and other brain regions) than isolated LC-NA activation, further enhancing excitability by preventing α2-mediated inhibition of cAMP production. Finally, systemic yohimbine administration and noradrenergic activity have been shown to induce corticosterone release into the blood (Johnston, Baldwin, and File 1988; Leibowitz et al. 1988; Fink 2016). Thus, yohimbine injection could have broader transcriptional consequences, including corticosteroid-mediated effects on gene expression."

      Comment 6: Please add information about how virus or cannula placement was confirmed in these studies. Were missed placements also analyzed separately?

      Response: Pupillometry recordings were performed with all animals involving optogenetic or chemogenetic manipulations of the LC, before subjecting them to stress experiments. These assessments account for both correct optic fiber placement and virus expression (Privitera et al., 2020). If an animal did not show a clear pupil response, it was not included any further in the study. To demonstrate correct cannula placement for drug infusion of isoproterenol in the dorsal hippocampus, we added a representative image of cannula placement in Supplementary Figure 1h.

      Comment 7: Time of day for tissue collection used in genetic analysis should be reported for all studies conducted or reanalyzed.

      Response: Thank you for pointing out this omission. Tissue collection for RNA-seq analysis was always performed between 11am and 5pm during the dark phase of the reversed light-dark cycle. We have added this information to the corresponding method section (“Tissue collection”).

      Reviewer #1 (Recommendations For The Authors):

      Comment 8: This is a well written, comprehensive and rigorous manuscript that will be of great interest to those in the scientific community.

      Response: Thank you for the positive evaluation of our work and for the constructive feedback.

      Reviewer #2 (Public Review):

      Comment 1: The present manuscript investigates the implication of locus coeruleus-noradrenaline system in the stress-induced transcriptional changes of dorsal and ventral hippocampus, combining pharmacological, chemogenetic, and optogenetic techniques. Authors have revealed that stress-induced release of noradrenaline from locus coeruleus plays a modulatory role in the expression of a large scale of genes in both ventral and dorsal hippocampus through activation of β-adrenoreceptors. Similar transcriptional responses were observed after optogenetic and chemogenetic stimulation of locus coeruleus. Among all the genes analysed, authors identified the most affected ones in response to locus coeruleus-noradrenaline stimulation as being Dio2, Ppp1r3c, Ppp1r3g, Sik1, and Nr4a1. By comparing their transcriptomic data with publicly available datasets, authors revealed that these genes were upregulated upon exposure to different stressors. Additionally, authors found that upregulation of Ppp1r3c, Ppp1r3g, and Dio2 genes following swim stress was sustained from 90 min up to 2-4 hours after stress and that it was predominantly restricted to hippocampal astrocytes, while Sik1 and Nr4a1 genes showed a broader cellular expression and a sharp rise and fall in expression, within 90 min of stress onset.

      Overall, the paper is well written and provides a useful inventory of dorsal and ventral hippocampal gene expression upregulated by activation of LC-NA system, which can be used as starting point for more functional studies related to the effects of stress-induced physiological and pathological changes.

      Response: We thank the reviewer for the careful assessment of our work.

      Comment 2: However, I believe that the study would have benefited from a more comprehensive analysis of sex differences. Experiments in females were conducted only in one experiment and analyses restricted to the ventral hippocampus.

      Response: In response to the comments by the reviewer, as well as Reviewer #1 and the editors, we have sequenced an additional 32 brain samples to expand the comparison of sex effects in females and males across dorsal and ventral hippocampus, and we included a new meta-analysis of 3 experimental datasets (51 male and 20 female samples) to thoroughly assess sex differences in the transcriptomic response to stress. We refer the reviewer to our detailed response provided above to Reviewer #1, comment #2, and the updated results section on pages 3-4.

      Comment 3: Although, the experiments were overall sound and the results broadly support the conclusion made, I think some methodological choices should be better explained and rationalized. For instance, the study focuses on identifying transcriptional changes in the hippocampus induced by stress-mediated activation of the LC-NA system, however NA release following stress exposure and pharmacological or optogenetic manipulation was mostly measured in the cortex.

      Response: Because the hippocampus was used for RNA-sequencing, we could not assess NA release in the hippocampus (as this would require fiber implants that would interfere with molecular measures, or different tissue processing for HPLC). Nonetheless, we wanted to assess the transcriptional changes in the hippocampus, while simultaneously measuring successful stimulation of the LC-NA system in the same animals. To achieve this, we pursued 3 routes: 1) we used pupillometry to confirm functional LC activation; 2) we measured cFOS in the LC to directly demonstrate LC activation; 3) we assessed NA release using uHPLC (which requires larger tissue samples) and we chose the cortex because both cortex and hippocampus receive NA predominantly from the LC (Samuels & Szabadi, 2008). Importantly, we had previously shown that chemogenetic LC activation leads to a similar NA turnover in both the cortex and hippocampus, as measured by uHPLC (Zerbi et al., 2019). The relevant figure from that paper is inserted below to quickly show the striking similarity between hippocampus and cortex.

      Author response image 1.

      Levels of noradrenaline (NE) turnover (MHPG/NE ratio) in the cortex (CTX) and hippocampus (HC), measured in whole tissue with uHPLC 90min after hM3Dq-DREADD activation of the LC (copied and cropped from Zerbi et al, 2019, Neuron).

      In response to the reviewer's comment, we performed additional experiments to directly demonstrate that LC activation with DREADDs as well as optogenetics causes an increase in hippocampal NA release. We recorded NA release in the hippocampus (using fiber photometry combined with genetically encoded NA sensors). For DREADD activation, we observed a strong increase in hippocampal noradrenaline that started a few minutes after clozapine administration, and this increase was sustained throughout the duration of the 21 minute recording (see Supplementary Figure 2c-e). For optogenetic LC activation, we find a rapid and immediate sharp increase in NA levels in the hippocampus (Supplementary Figure 2f-h). These experiments were performed in females and males and triggered similar responses. An adapted and cropped version of Supplementary Figure 2 is pasted below for ease of reading.

      Please note that we could not perform a similar experiment using yohimbine, because the GRABNE sensors are based on the alpha-2 adrenergic receptor, thus yohimbine administration interferes with the photometry recording. However, we believe that it is clear from this response that strong activation of the LC leads to uniform release of NA in the hippocampus and cortex.

      Author response image 2.

      c, Schematic of fiber photometry recording of hippocampal NA during chemogenetic activation of the LC. After 5 min baseline recording in the homecage animals were injected with clozapine (0.03mg/kg, i.p.) and placed in the OFT for 21min. d, Average ΔF/F traces of GRABNE2m photometry recordings in response to chemogenetic activation of the LC (mean±SEM for hM3DGq+ and hM3DGq- split into females and males, n=3/group/sex). e, Peak ΔF/F response of fiber photometry trace. f, Schematic of fiber photometry recording of hippocampal NA during optogenetic activation of the LC. Animals were lightly anesthetized (1.5% isoflurane) and recorded in a stereotaxic frame. After 1 min baseline recording, animals were stimulated three times with 5Hz for 10s (10ms pulse width, ~8mW laser power) and recorded for 2 min post-stimulation. g, Average ΔF/F traces of the NA sensors GRABNE1m and nLightG in response to optogenetic activation of the LC (mean±SEM for females and males, n(females)=10, n(males)=5). h, Peak ΔF/F response of fiber photometry trace.
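The legend reports ΔF/F traces and peak ΔF/F responses. The response does not describe how these were computed, so the following is only a minimal sketch of one common convention: the mean fluorescence during the baseline period serves as F0, and the peak is the maximum ΔF/F after baseline. The trace values and function names are hypothetical.

```python
from statistics import mean


def delta_f_over_f(trace, n_baseline):
    """Normalize a fluorescence trace to its baseline: (F - F0) / F0.

    Assumption: F0 is the mean of the first `n_baseline` samples.
    """
    f0 = mean(trace[:n_baseline])
    return [(f - f0) / f0 for f in trace]


def peak_response(trace, n_baseline):
    """Maximum dF/F value observed after the baseline period."""
    return max(delta_f_over_f(trace, n_baseline)[n_baseline:])


# Hypothetical raw fluorescence samples; the first four form the baseline.
raw_trace = [100.0, 100.0, 100.0, 100.0, 110.0, 125.0, 120.0]
peak = peak_response(raw_trace, n_baseline=4)
```

Actual photometry pipelines usually also correct for bleaching and motion artifacts before normalization, which this sketch omits.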

      Comment 4: Furthermore, behavioral changes following systemic pharmacologic or chemogenetic manipulation were observed in the open field task immediately after peripheral injections of yohimbine or CNO, respectively. Is this timing sufficient for both drugs to cross the blood brain barrier and to exert behavioral effects?

      Response: We have previously shown that chemogenetic activation of the LC through clozapine elicits pupil responses within 1-2 minutes after injection (Privitera et al., 2020; Zerbi et al., 2019). This indicates that clozapine rapidly crosses the blood brain barrier and affects LC activity within a few minutes after injection. Our additional experiments using genetically encoded sensors in the hippocampus show this even more directly (Supplementary Figure 2d), see also the response to Comment 3 above.

      Similarly, yohimbine also rapidly crosses the blood brain barrier within the same time frame (Hubbard et al., 1988). These observations are consistent with the rapid behavioral effects that can be detected within a few minutes after injection of clozapine for LC-DREADD activation (Zerbi et al., 2019), and for yohimbine as well (von Ziegler et al., 2023). In response to another comment of this reviewer, we have also re-analyzed the behavior presented in the current manuscript in time-bins of 3 minutes, which also shows the rapid onset of effects in response to yohimbine (within the first 3 min) and DREADDs (within 6 min), see Supplementary Fig. 3.
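The re-analysis in 3-minute time bins mentioned above can be sketched as follows. This is a simplified illustration under the assumption that behavior is available as timestamped samples (for example, distance moved per tracking interval). The sample values and the function name are hypothetical.

```python
def bin_by_time(samples, bin_seconds=180):
    """Sum timestamped values into consecutive bins of `bin_seconds`.

    `samples` is an iterable of (time_in_seconds, value) pairs.
    Bins without samples are reported as 0.0.
    """
    totals = {}
    for t, value in samples:
        index = int(t // bin_seconds)
        totals[index] = totals.get(index, 0.0) + value
    if not totals:
        return []
    return [totals.get(i, 0.0) for i in range(max(totals) + 1)]


# Hypothetical (time in s, distance moved in m) samples from an open field test.
samples = [(10, 1.0), (100, 2.0), (200, 0.5), (400, 1.5)]
distance_per_bin = bin_by_time(samples)  # one value per 3-minute bin
```

Comparing the first bins between treatment groups is what would reveal how quickly a behavioral effect sets in after injection.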

      Comment 5: Finally, the study shows that activation of noradrenergic hippocampus-projecting LC neurons is sufficient to regulate the expression of several hippocampal genes. Although the necessity of these projections to induce the observed transcriptional effects has been tested to some extent through systemic blockade of beta-adrenoceptors, I believe the study would have benefited from more selective (optogenetic or chemogenetic) necessity experiments.

      Response: We understand the reviewer's point that blocking the LC during stress exposure would be an interesting experiment. However, it is very hard to completely silence the LC during intense stressors. In fact, despite intense efforts, we have not been able to silence the LC during swim stress exposure using DREADDs or other chemogenetic approaches (PSAM/PSEM). We were in fact able to silence the LC with the optogenetic inhibitor JAWS (and others have reported successful LC silencing with GtACR2), but there is a major issue involving the "rebound effect", where more NA is released once the inhibition is stopped. We would thus have had to optogenetically silence the LC for 45-90 min, which would create heat artifacts, and require challenging control experiments to draw firm conclusions. Given all these issues, we reasoned that blocking adrenergic receptors is a simple and elegant solution, which provides clear evidence for the necessity of beta-adrenergic signaling.

      Reviewer #2 (Recommendations For The Authors):

      Major concerns:

      Comment 6: The study focuses on the identification of transcriptional changes in the hippocampus induced by stress-mediated activation of the LC-NA system; however, noradrenaline release following stress exposure or yohimbine injection was measured in the cortex. The authors should consider measuring NA concentrations in the hippocampus after exposure to swim stress or administration of yohimbine, or at least explain their choice to analyse the cortex in the manuscript.

      Response: We have addressed this issue in detail in Response to "Reviewer 2, Comment #3", where we provided an overview of the additional data that support our approach. As mentioned before, measuring NA release after yohimbine is not compatible with our GRABNE-photometry approach, as the GRAB-sensor is based on alpha2-adrenoceptor. Here, we would like to add that measuring NA release using photometry during swim stress is also challenging. The challenge is the vigorous movement (swimming, typically in one direction), which creates pressure on the cables/implants. We felt that overcoming these experimental challenges (setup, troubleshooting and controls) would be beyond the scope of the paper, given that it is already known that this stressor leads to strong NA release in the hippocampus. We have now included references that demonstrate that all the stressors used in our work trigger NA increase in the hippocampus (see response to Reviewer 1, Comment 3): “Therefore, we assessed their expression in a dataset comparing the effect of various stressors on the hippocampal transcriptome (Floriou-Servou et al., 2018). The stressors included restraint, novelty and footshock stress, which have all previously been shown to increase hippocampal NA release (Hajós-Korcsok et al., 2003; Lima et al., 2019; Masatoshi Tanaka et al., 1982).”

      Comment 7: Concerning the experiment aimed at investigating sex differences in gene expression, it is not clear why the authors decided to restrict their analyses in females to the ventral hippocampus only. The explanation that in males they did not detect major differences between the dorsal and ventral hippocampus is not sufficient, because there could have been different effects in females. Therefore, the conclusion made by the authors that their "results suggest that the transcriptomic response is independent of sex" is not entirely correct, since sex differences were only evaluated in the ventral hippocampus.

      Response: We appreciate the reviewer's critique. As described above, we have now also sequenced the dorsal hippocampal tissue from the propranolol experiment (males and females, 32 samples) and additionally added an extensive meta-analysis of three large datasets (n=71) to compare transcriptional sex differences in response to stress. A detailed description of these experiments and how they have extended/supported our conclusions have been provided in response to Reviewer #1, Comment #2.

      Comment 8: Besides the effects on females, the same experiment examined whether propranolol by itself (in the absence of stress) would have been able to alter gene expression: such effects were not examined in the dorsal hippocampus. In contrast, in a different experiment, the effects of isoproterenol on gene expression were restricted to the dorsal hippocampus only. Furthermore, related to this latter experiment, intra-dorsal hippocampal injection of isoproterenol should presumably mimic the rise in NA observed after stress exposure. Why was gene expression measured 90 min after central isoproterenol injections, while in the other experiments gene expression was determined 45 min after stress, that is, when the authors observe the peak NA concentration?

      Response: We have addressed the reviewer's critique of dorsal vs ventral hippocampus by reanalyzing 32 additional samples from dorsal hippocampus of male and female mice after propranolol (or saline) injection. Please see response to Reviewer #1, comment #2.

      Regarding the time points: We chose the 45 and 90 min time points mainly for two reasons. First, cFos protein changes are known to be strongest 90 min after neuronal activation. Second, because we wanted to capture gene expression changes triggered by NA release, we reasoned that these effects must be fast and should thus be measured at an early transcriptional time-point (45 min). However, after performing the time-course experiment after swim stress exposure (Figure 4d,c), we observed that the LC-NA-sensitive genes (e.g. Dio2 and several PP1-subunits) show the strongest changes 90 min after stress exposure. Therefore, in some of our experiments we opted to analyze gene expression changes at 90 min, converging with the time-point we typically use for cFos staining. Contrary to the reviewer's statement, peak NA concentrations are not observed 45 min after the various interventions; rather, the peak in the main metabolite (MHPG) is observed then, due to the temporal dynamics of NA release and breakdown. NA release occurs immediately upon stress exposure (or direct LC activation), which we also show in the new photometry data described above. Thus, rapid NA release triggers intracellular cascades that lead to downstream transcriptional changes, which presumably peak between 45 and 90 min later.

      Comment 9: Behavioral changes following systemic pharmacologic or chemogenetic manipulation were observed in the open field task immediately after peripheral injections of yohimbine or CNO, respectively. Is this timing sufficient for both drugs to cross the blood brain barrier and to exert behavioral effects? It is also not immediately clear the reason why the open field tasks have different durations depending on the experiments, which can also impact the results. Authors might also consider to split the open field data analyses in 2 or 3 min time-bins, to allow for a better comparison across the different results.

      Response: We thank the reviewer for the suggestion to plot the behavior data as time-bins. We have implemented this change for the yohimbine and DREADD experiments, and updated the corresponding figure accordingly (Supplementary Figure 3, pasted below for ease of reading). The new visualization clearly shows that yohimbine injection triggers rapid behavioral effects already in the first three minutes, whereas the LC-DREADD activation triggers behavioral changes within 3-6 minutes after injection. Thus, clear drug effects are visible in the first 10 minutes, which is comparable to the standard OFT test (10min testing) shown in response to swim stress exposure (Suppl. Figure 3a). The choice to expose mice to the OFT for 21 minutes in total was due to the fact that we based our experimental approach on the optogenetic LC-stimulation protocol first published by McCall and colleagues (McCall et al, Neuron, 2015), in which the LC is stimulated for 3 min followed by 3 min pauses (see Suppl. Figure 3d). Because of this on-off design, we decided to keep the optogenetic analysis simple and show the overall effect (Supplementary Figure 3d), particularly as we know that NA dynamics do not recover rapidly enough after 3 min continuous stimulation to justify a bin-analysis (unpublished data).
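      The time-bin re-analysis described above can be illustrated with a minimal sketch. It assumes a hypothetical export of tracking data as (time in seconds, distance increment) pairs; the function name, the data layout and the toy values are illustrative assumptions and do not reproduce the authors' analysis pipeline.

```python
from collections import defaultdict


def distance_per_bin(samples, bin_minutes=3.0):
    """Sum distance travelled within consecutive time bins.

    samples: iterable of (time_in_seconds, distance_increment) pairs
             (a hypothetical tracking export, not the authors' format).
    Returns a list of (bin_start_minute, total_distance), ordered by time.
    Bins that contain no samples are simply absent from the result.
    """
    bin_seconds = bin_minutes * 60.0
    totals = defaultdict(float)
    for t, d in samples:
        # Integer index of the bin this sample falls into.
        totals[int(t // bin_seconds)] += d
    return [(i * bin_minutes, totals[i]) for i in sorted(totals)]


# Toy data: one sample per minute for 9 minutes, made-up distances (metres).
toy = [(m * 60.0, dist) for m, dist in enumerate([5, 4, 3, 2, 2, 1, 1, 1, 1])]
bins = distance_per_bin(toy)
for start, total in bins:
    print(f"{start:4.1f}-{start + 3.0:4.1f} min: {total:.1f} m")
```

      With 3-minute bins, an effect that starts in the first bin (as reported for yohimbine) can be distinguished from one that starts in the second bin (as reported for LC-DREADD activation), which a single session-wide total would hide.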

      Author response image 3.

      Effects of acute stress and noradrenergic stimulation on anxiety-like behaviour in the open field test. a, Stress-induced changes in the open field test 45 min after stress onset. Stressed animals show overall reductions in distance traveled (unpaired t-test; t=3.55, df=22, p=0.0018), time in center (Welch unpaired t-test; t=3.50, df=13.61, p=0.0036), supported rears (unpaired t-test; t=3.39, df=22, p=0.0026) and unsupported rears (unpaired t-test; t=5.53, df=22, p = 1.47e-05) compared to controls (Control n = 12; Stress n = 12). These data have been previously published (von Ziegler et al., 2022). b, Yohimbine (3 mg/kg, i.p.) injected animals show reduced distance traveled (unpaired t-test; t=2.39, df=10, p=0.03772), reduced supported rears (unpaired t-test; t=6.56, df=10, p=0.00006) and reduced unsupported rears (Welch unpaired t-test; t=3.69, df=4.4, p = 0.01785) compared to vehicle injected animals (Vehicle n = 6; Yohimbine n = 7). c, Chemogenetic LC activation induced changes in the open field test immediately after clozapine (0.03 mg/kg, i.p.) injection. hM3Dq+ animals show reduced distance traveled (unpaired t-test; t=6.28, df=13, p=0.00003), reduced supported rears (unpaired t-test; t=4.28, df=13, p=0.0009), as well as reduced unsupported rears (Welch unpaired t-test; t=4.28, df=13, p = 0.00437) compared to hM3Dq- animals (hM3Dq- n = 7; hM3Dq+ n = 8). d, Optogenetic 5 Hz LC activation induced changes during the open field test. ChR2+ animals show reduced supported rears (unpaired t-test; t=2.42, df=64, p=0.0185) and reduced unsupported rears (unpaired t-test; t=2.91, df=64, p = 0.00499) compared to ChR2- animals (ChR2- n = 32; ChR2+ n = 36). Data expressed as mean ± SEM. *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001.
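      The legend reports both pooled-variance (Student) and Welch unpaired t-tests. The sketch below shows how the two statistics and their degrees of freedom differ, and why Welch degrees of freedom are usually fractional (as in df=13.61 above). The data are made up and the function names are illustrative assumptions; p-values are omitted because they require a t-distribution that the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Pooled-variance (Student) t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


def welch_t(a, b):
    """Welch t statistic with Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up groups with clearly unequal variances (n = 6 and n = 7).
group_a = [10, 12, 11, 13, 9, 11]
group_b = [6, 14, 2, 10, 8, 3, 13]

t_s, df_s = student_t(group_a, group_b)
t_w, df_w = welch_t(group_a, group_b)
print(f"Student: t={t_s:.2f}, df={df_s}")
print(f"Welch:   t={t_w:.2f}, df={df_w:.2f}")
```

      When the group variances differ, the Welch degrees of freedom fall below n1 + n2 - 2, which makes the test more conservative than the pooled version.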

      Comment 9: The study shows that activation of noradrenergic hippocampus-projecting LC neurons is sufficient to regulate the expression of several hippocampal genes. I believe the study would have benefited from more selective necessity experiments. The authors might consider adding optogenetic (or chemogenetic) experiments aimed at inhibiting LC-NA hippocampal projections during stress exposure (or, alternatively, performing intrahippocampal pharmacological blockade of β-adrenoreceptors during stress exposure), and determining the effects on gene expression.

      Response: We kindly refer the reviewer to our previous response to Comment #2 above.

      Minor concerns:

      There is a typo in the abstract. Please correct "LN-NA" with "LC-NA"

      Response: Thank you, we have corrected it.

      References

      Bangasser, D. A., Eck, S. R., & Ordoñes Sanchez, E. (1/2019). Sex differences in stress reactivity in arousal and attention systems. Neuropsychopharmacology: Official Publication of the American College of Neuropsychopharmacology, 44(1), 129–139.

      Bangasser, D. A., Wiersielis, K. R., & Khantsis, S. (06/2016). Sex differences in the locus coeruleus-norepinephrine system and its regulation by stress. Brain Research, 1641, 177–188.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      Mice can learn to associate sensory cues (sound and light) with a reward or activation of dopamine neurons in the ventral tegmental area (VTA), and then anticipate the reward from the sensory cue only. Using this paradigm, Harada et al. showed that after learning, the cue is able to induce dopamine release in the projection targets of the VTA, namely the nucleus accumbens and lateral hypothalamus (LH). Within the LH, dopamine release from VTA neurons (either by presentation of the cue or direct optical stimulation of VTA neurons) activates orexin neurons, measured as an increase in intracellular calcium levels.

      Strengths:

      This study utilized genetically encoded optical tools to selectively stimulate dopamine neurons and to monitor dopamine release in target brain areas and the calcium response of orexin neurons. This allowed a direct assessment of the relationship between the behavioral response of the animals, the release of a key neurotransmitter in select brain areas, and its effect on target cells, with a precision previously not possible. The results shed light on the mechanism underlying reward-related learning and expectation.

      Weaknesses: - The Ca increase in orexin neurons in response to optical stimulation of VTA DA neurons is convincing. However, there is an accumulated body of literature indicating that dopamine inhibits orexin neurons through D2 receptors, particularly at high concentrations both directly and indirectly (PMID 15634779, 16611835, 26036709, 30462527; but note that synaptic effects at low conc are excitatory - PMID 30462527, 26036709). There should be a clear acknowledgment of these previous studies and a discussion directly addressing the discrepancy. Furthermore, there are in-vivo studies that investigated the role of dopamine in the LH involving orexin neurons in different behavioral contexts (e.g. PMID 24236888). The statement found in the introduction "whether and how dopamine release modulates orexin neuronal activity has not been investigated vigorously" (3rd para of Introduction) is an understatement of these previous reports.

      We thank the Reviewer for pointing out that we missed several important citations. We have added the references mentioned, and the discrepancy of concern is addressed in the discussion section.

      • Along these lines, previous reports of concentration-dependent bidirectional dopaminergic modulation of orexin neurons suggest that high and low levels of DA would affect orexin neurons differently. Is there any way to estimate the local concentration of DA released by the laser stimulation protocol used in this study? Could there be a dose dependency in the Intensity of laser stimulation and orexin neuron response?

      We agree that this is an interesting point. However, one limitation of our study, and of intensity-based genetically-encoded sensors in general, is that the estimation of the concentration is technically difficult. The sensor effectively reports changes in extra-synaptic levels of neurotransmitters, but to get the absolute value other modalities would be needed such as fast scan voltammetry. This limitation is now included in the discussion section.

      • The transient dip in DA signal during omission sessions in Fig 2C (approx. 1% decrease from baseline) is similar in amplitude to the decrease seen in non-laser trials shown in Fig 1C right panel (although the time course of the latter is unknown as the data is truncated). The authors should clarify whether those dips are a direct effect of the cue itself or indeed reward prediction error.

      Thanks for raising this important point. Indeed, there is a dip of the signal during non-stimulation trials. At day 1, the delivery of the cue triggered a dip; at day 10, there was a slight increase of the signal followed by the dip. The data are difficult to interpret, but our hypothesis is that two components trigger this dip. One is the aversiveness of the cue. Because a relatively loud sound (90 dB) was used for the cue, it would not be surprising if the auditory cue was slightly aversive to the experimental animals. It has been shown that aversive stimuli induce a dip of dopamine in the NAc, although this is specific to NAc subregions. The second component is reward prediction error. Although the non-laser-paired cue never triggered the laser stimulation, it is similar to the laser-paired one: both are composed of a loud tone and the same color of visual cue (presented at a different location). We think it is possible that the reward-related neuronal circuit was slightly activated by the non-laser-paired cue. In line with this interpretation, a small increase of the signal was observed at day 10 but not at day 1. If our hypothesis is true, then because this signal was induced by two components, further analysis is unfortunately difficult.

      • There seem to be orexin-negative-GCaMP6 positive cells (Fig. 4B), suggesting that not all cells were phenotypically orexin+ at the time of imaging. The proportion of GCaMP6 cells that were ORX+ or negative and whether they responded differently to the stimuli should be indicated.

      While we acknowledge the observation of orexin-negative-GCaMP6 positive cells in Figure 4B, it's important to note that this phenomenon is consistent with the characteristics of the hOX-GCaMP virus used in prior experiments. The virus has undergone thorough characterization, and it has been reported to exhibit over 90% specificity, as demonstrated in prior work conducted in the laboratory of one of our contributing authors (PMID: 27546579). To address the concern raised by the reviewer, we have included Supplemental Figure 4 confirming that all mice consistently exhibited qualitatively similar hOX-GCaMP transients upon dopaminergic terminal stimulation. This additional evidence supports the reliability and specificity of our experimental approach.

      • Laser stimulation of DA neurons at the level of cell bodies (in VTA) induces an increase in DA release within the LH (Fig. 3C, D), however, there is no corresponding Ca signal in orexin neurons (Fig.4C).

      We realize that the figures were not clear, which may explain why the reviewer did not see a corresponding Ca signal; however, such a signal is present. We have now added Supplemental Figure 3 to show that a Ca signal is already present at day 1.

      In contrast, stimulating DA terminals within the LH induces a robust, long-lasting Ca signal (> 30s) in orexin neurons (Fig. 5). The initial peak is blocked by raclopride but the majority of Ca signal is insensitive to DA antagonists (please add a positive control or cite references indicating that the dose of antagonists used was sufficient; also the timing of antagonist administration should be indicated).

      This is now included in the discussion section. Also, the timing and dose of the antagonist is now described in the method section.

      Taken together, these results seem to suggest that DA does not directly increase Ca signal in orexin neurons. What could be mediating the remaining component?

      This point has been included in the discussion section.

      • Similarly, there is an elevation of Ca signal in orexin neurons that remains significantly higher after the cue/laser stimulation (Fig. 4F). It appears that it is this sustained component that is missing in omission trials. This can be analyzed further.

      It is true that there is a sustained component in stimulation trials that is missing in omission trials. Most likely, it is evoked by the stimulation of dopamine neurons. We argue that this component is isolated in Fig 5 and analyzed as far as our data allow.

      • Mice of both sexes were used in this study; it would be interesting to know whether sex differences were observed or not.

      We agree that this is an important point. However, our sample number is not high enough to make a meaningful comparison between male and female.

      Reviewer #2 (Public Review):

      Summary:

      This is an interesting and well-written study assessing the role of dopaminergic inputs from the VTA on orexin cell responses in an opto-pavlovian conditioning task. These data are consistent with a possible role of this system in reward expectation and are surprisingly one of the first demonstrations of a role for dopamine in this phenomenon.

      Strengths:

      The study has used an interesting opto-Pavlovian approach combined with fibre photometry.

      Weaknesses:

      It is unclear what n size was used or analysed, particularly for AUC measures e.g. Figures 1 D/E and 3 G. The number of trials reflected and the animal numbers need clarification.

      The sample size is indicated in the legend section.

      The study focused on opto-stim omissions - this work would be significantly strengthened by a comparison to a real-world examination where animals are trained for a radiation reward (food pellet).

      We agree that this would be an important experiment. This experiment has been partially done in the laboratory of one of the contributing authors (doi.org/10.1101/2022.04.13.488195) and would be one of our follow-up studies.

      Have the authors considered the role of orexin in the opposing situation i.e. a surprise addition of reward?

      That would be an interesting experiment. To do that, a natural reward, not optical stimulation, should be used as a reinforcer. This could be part of our follow-up study.

      Similarly, there remains some conjecture regarding the role of these systems in reward and aversion - have the authors considered aversive learning paradigms - fear, or fear extinction - to further explore the roles of this system? There are some (important) discussions about the possible role of orexin in negative reinforcement. Further studies to address this could be warranted.

      It is true that dopamine also plays a significant role in aversive learning. Therefore, this would be an interesting experiment. The discussion section now includes this point.

      I think some further discussion of the work by Linehan concerning the interesting bidirectional actions of D1/D2 receptor signalling on glutamatergic transmission onto orexin neurons is worthwhile. While this work is currently cited, the nuance and perhaps relevance to D1 and D2 signalling could be contextualised a little more (https://doi.org/10.1152/ajpregu.00150.2018).

      Thanks for the suggestion. The discussion has been expanded.

      Reviewer #3 (Public Review):

      Summary:

      Harada and colleagues describe an interesting set of experiments characterizing the relationship between dopamine cell activity in the ventral tegmental area (VTA) and orexin neuron activity in the lateral hypothalamus (LH). All experiments are conducted in the context of an opto-Pavlovian learning task, in which a cue predicts optogenetic stimulation of VTA dopamine neurons. With training, cues that predict DA stimulation come to elicit dopamine release in LH (a similar effect is seen in accumbens). After training, omission trials (cue followed by no laser) result in a dip (inhibition) of dopamine release in LH, characteristic of reward prediction error observed in the striatum. Across cue training, the activity pattern of orexin neurons in LH mirrors that of LH DA levels. However, unlike the DA signal, orexin neurons do not exhibit a decrease in activity in omission trials. Systemic blockade of D2 but not D1 receptors blocked DA release in LH following VTA DA cell stimulation.

      Strengths: Although much work has been dedicated to examining projections from orexin cells to VTA, less has been done to characterize reciprocal projections and their function. In this way, this paper is a very important addition to the literature. The experiments are technically sound (with some limitations, below) and utilize sophisticated approaches, the manuscript is nicely written, and the conclusions are mostly reasonable based on the data collected.

      Weaknesses:

      I believe the impact of the paper could be enhanced by considering and/or addressing the following:

      Major:

      • I encourage the authors to discuss in the Introduction previous work on DA regulation of orexin neurons. In particular, the authors cite, but do not describe in any detail, the very relevant Linehan paper (2019; Am J Physiol Regul) which shows that DA differentially alters excitatory/inhibitory input onto orexin neurons and that these actions are reversed by D1 vs D2 receptor antagonists. Another paper (Bubser, 2005, EJN) showed that dopamine agonists increase the activity of orexin neurons and that these effects are blocked by D1/D2 antagonists. The current findings should be discussed in the context of these (and any other relevant) papers in the Discussion, too.

      Thanks for the valuable suggestion. This point has been integrated and the introduction and discussion sections have been revised carefully.

      • In the Discussion, the authors provide two (plausible) explanations for why they did not observe a dip in the calcium signal of orexin neurons during omission trials. Is it not possible that these cells do not encode for this type of RPE?

      We completely agree that this is possible. Our current hypothesis is that dopamine in the LH encodes RPE and that this information is transmitted to orexin neurons. Orexin neurons integrate other information and encode something else, which we call ‘multiplexed cognitive information’. It is still an open question what this means exactly. This point is now mentioned in the discussion section.

      • Related to the above - I am curious about the authors' thoughts on why there is such redundancy in the system. i.e. why is dopamine doing the same thing in NAC and LH in the context of cue-reward learning?

      Thank you for the question. This is an important point, indeed. Our current hypothesis is described in the discussion section.

      ’Our data indicate that dopamine in both the NAc and LH encodes reward prediction error (RPE). One open question is the existence of such a redundant mechanism. We hypothesize that dopamine in the LH boosts dopamine release via a positive feedback loop between the orexin and dopamine systems. It has already been established that some orexin neurons project to dopaminergic neurons in the VTA, positively modulating firing. On the other hand, our data indicate that dopamine in the LH stimulates orexinergic neurons. These collective findings suggest that when either the orexin or dopamine system is activated, the other system is also activated consequently. Although the current findings align with this idea, the hypothesis should be carefully challenged and scrutinized.’

      • The data, as they stand, are largely correlative and do not indicate that DA recruitment of orexin neurons is necessary for learning to occur. It would be compelling if blocking the orexin cell recruitment affected some behavioral outcomes of learning. Similarly - does raclopride treatment across training prevent learning?

      We appreciate the insightful comment. It is indeed a limitation of our study that we lack behavioral data. However, given the extensive previous research on the crucial role of orexin in motivated behavior, we argue that establishing dopaminergic regulation of the orexin system itself is a valuable contribution. This perspective is thoroughly discussed in the dedicated section of our paper. It's important to note that the injection of D2 antagonists, including raclopride, is known to induce significant sedation. Due to this sedative effect, combining behavioral experiments with these drugs poses considerable challenges.

      • Only single doses of SCH23390 and raclopride were used. How were these selected? It would be nice to use more of a dose range to show that 1) an effect of D1R blockade was not missed, and 2) the reduction in orexin signal with raclopride was dose-dependent.

      The rationale for the doses has been added to the discussion section. These doses have been reported to block dopamine receptors. Although we agree that it would be nice to have a dose-response curve, we are reluctant to increase the doses, in order to avoid adverse effects on the experimental animals. The doses we used effectively induced hypo-locomotion, although these data are not shown.

      • Fig 1C, could the effect the authors observed be due to movement?

      We argue this is unlikely. We recorded two channels, one for the control and the other for the signal. Motion-related artifacts are corrected based on the control channel. One example trace around the laser stimulation is shown below. Please note that a typical motion-related artifact is a fast dip of the signal, normally observed in both the 405 and 465 nm channels.
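      The response does not specify how the control channel is used for correction. One common approach in fibre photometry, shown here only as a hedged sketch and not as the authors' method, is to fit the 405 nm control trace to the 465 nm signal trace by least squares and to express the signal relative to that fit. All names and values below are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit y ~ slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def motion_corrected_dff(signal, control):
    """dF/F of the signal channel relative to the scaled control channel.

    signal:  ligand-dependent channel (e.g. 465 nm excitation).
    control: ligand-independent channel (e.g. 405 nm excitation).
    Artifacts shared by both channels are captured by the fit and cancel out.
    """
    slope, intercept = fit_line(control, signal)
    fitted = [slope * c + intercept for c in control]
    return [(s - f) / f for s, f in zip(signal, fitted)]


# Toy traces: a motion artifact (dip at index 2) appears in both channels.
control = [1.0, 1.0, 0.8, 1.0, 1.02, 1.0]
motion_only = [2.0 * c + 1.0 for c in control]

# Same trace plus a transient at index 3 that is absent from the control.
with_transient = list(motion_only)
with_transient[3] += 0.3

dff_motion_only = motion_corrected_dff(motion_only, control)
dff_with_transient = motion_corrected_dff(with_transient, control)
peak_index = max(range(len(dff_with_transient)), key=dff_with_transient.__getitem__)
print(peak_index)
```

      In this toy case the shared dip is removed almost entirely, whereas the transient present only in the signal channel survives the correction, which is the logic behind the argument that a movement artifact would appear in both channels.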

      Relatedly, what was the behavior like when the cue was on? Did mice orient/approach the cue?

      Although it has been reported that rats approach the cue (PMID: 30038277) in a similar task, it was not obvious in our case. It could be because we used both visual and auditory cues. Mice showed a general increase of locomotion during the cue and the stimulation but the direction was not clear to the experimenter.

      Also, when does the learning about the cue occur? Does it take all 10 days of learning or does this learning/cue-induced increase in dopamine signaling occur in less than 10 days?

      It is hard to say when the learning occurs. Looking at the learning curves in Figures 1, 3 and 4, the response to the cue seems to plateau at day 5, but since we do not have behavioral data, this assessment relies only on the neuronal signal.

      • Also related to the above, could the observed dopamine signal be a result of just the laser turning on? It would seem important to include mice with a control sensor.

      We recorded two channels, at 405 nm and 465 nm. The 405 nm signal did not show an increase, while the 465 nm signal did. An example trace is shown. In addition, the sensor has already been characterized by the corresponding author, so we argue that this is unlikely.

      Author response image 1.

      Fig 1E, the effect seems to be driven by one mouse which looks like it could be a statistical outlier. The inclusion of additional animals would make these data more compelling.

      We agree that adding more mice would make the data more compelling. However, considering that dopamine in the accumbens has been investigated extensively and that our data are in line with prior studies, we argue that we have enough data to support our conclusion.

      • For Fig 1C, 3D, 3F, and 4D, could the authors please show the traces for the entire length of laser onset? It would be helpful to see both the rise and the fall of dopamine signals.

      For Fig 1C, one panel has been added. For Figs 3 and 4, a supplemental figure was created to show the signal around laser stimulation.

      • Fig 2C, could the authors comment on how they compared the AUC to baseline? Was this comparison against zero? Because of natural hills and troughs during signals prior to cue (which may not equate to a zero), comparing the omission-induced dip to a zero may not be appropriate. A better baseline might be using the signals prior to the cue.

      The signal immediately before cue onset was taken as the baseline and subtracted from the trace. In our analysis, zero and baseline are therefore the same.
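      The baseline logic described in this response can be sketched as follows. The length of the pre-cue baseline period, the use of the mean as the baseline estimate, the trapezoidal rule and the analysis window are all assumptions made for illustration, since the response does not state them; the trace is made up.

```python
def baseline_subtracted_auc(times, values, cue_onset, baseline_s, window):
    """Area under a baseline-subtracted trace within an analysis window.

    times, values: sampled trace (seconds, signal units).
    cue_onset:     time of cue onset in seconds.
    baseline_s:    duration of the pre-cue period averaged as baseline
                   (an assumed parameter).
    window:        (start, end) of the analysis window in seconds.
    """
    pre_cue = [v for t, v in zip(times, values)
               if cue_onset - baseline_s <= t < cue_onset]
    baseline = sum(pre_cue) / len(pre_cue)
    start, end = window
    # After subtraction, the pre-cue baseline sits at zero by construction.
    pts = [(t, v - baseline) for t, v in zip(times, values) if start <= t <= end]
    # Trapezoidal rule over consecutive samples inside the window.
    return sum((t1 - t0) * (v0 + v1) / 2.0
               for (t0, v0), (t1, v1) in zip(pts, pts[1:]))


# Toy trace sampled at 1 Hz: flat at 1.0, with a dip to 0.0 at t = 2 to 3 s.
times = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
values = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0]
auc = baseline_subtracted_auc(times, values, cue_onset=0.0,
                              baseline_s=2.0, window=(1.0, 4.0))
print(auc)
```

      Because the baseline is subtracted first, testing the resulting AUC against zero is equivalent to testing it against the pre-cue signal level, which is the equivalence the response points out.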

      • Could the authors comment on how they came up with the 4-5.3s window to observe the AUC in Fig 3H?

      Since the kinetics of dopamine in the NAc and LH are different, different time windows were used to observe the dip of dopamine. An analysis of the kinetics has been added.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Specific feedback to the authors

      • Sample size for each experiment/group could not be found.

      The sample size is now included in the legends.

      • In most figures, the timing of onset for the cue and laser stimulation is unclear. This makes the data interpretation difficult. They should be labeled as in Fig. 3C, for example.

      Panels have been updated to address this point.

      • Please provide the rationale for selecting the time range for the measurement of AUC for different experiments (e.g. Fig. 2C, 3H, 4A, 5F).

      The kinetics of dopamine in the NAc and LH are different. This is now shown in the new Supplemental Figure 2. Based on this difference, different windows were chosen.

      • Fig. 1E, 3G right, 4E right: statistical analysis should use two-way repeated measures ANOVA rather than one-way ANOVA. Fig 1D, 3G left and 4E left panels can also be analyzed by two-way repeated measures ANOVA.

      We realized that those panels were redundant. Some panels have been removed and the analysis has been conducted according to this point.

      Minor comments:

      Fig. 2C can also show non-omission trials as a comparison.

      The panel has been updated.

      • The term "laser cue" is confusing, as the cue itself does not involve a laser.

      ‘Laser-paired cue’ is used instead.

      • Color contrast can be improved for some figures, including Fig. 2C right, Fig. 3H right, and green and blue fluorescent fonts.

      The panels have been updated.

      • Figure legends: Tukey's test, rather than Tekey's test.

      This has been fixed.

      • There are some long-winded sentences that are hard to follow.

      Edited.

      • p.2, line 11 from bottom: should read ...the VTA evokes the release of dopamine.

      Edited.

      • p.3, line 9: remove e from release.

      This has been addressed.

      Reviewer #3 (Recommendations For The Authors):

      Minor:

      • When discussing the understudied role of dopamine in brain regions other than the striatum in the Introduction, it might be helpful to cite this article: https://elifesciences.org/articles/81980 where the authors characterize dopamine in the bed nucleus of stria terminalis in associative behaviors and reward prediction error.

      The Discussion section has been updated accordingly.

      • In the Discussion, it might be better to refrain from describing the results as 'measuring dopamine release' in the LH. Since there was no direct detection of dopamine release, rather a dopamine binding to the dLight receptors, referring to the detection as dopamine signaling/binding/transients is a better alternative.

      This point has been addressed.

      • In the Discussion, without measuring tonic dopamine release, it is difficult to say that there was a tonic dopamine release in the LH prior to negative RPE. In addition, I wouldn't describe the negative RPE as silencing of dopamine neurons projecting to the LH since this was not directly measured and it is hard to say for sure if the dip in dopamine is caused by silencing of the neurons. There certainly seems to be a reduction in extra-synaptic dopamine signaling in LH, however, what occurs upstream is unknown.

      We respectfully disagree with this point. In our opinion, the dopamine transient is more important than the firing of dopamine neurons, because what matters for downstream neurons is the dopamine concentration. For example, cocaine administration increases the extra-synaptic dopamine concentration by blocking DAT, while the firing of dopamine neurons goes down via activation of D2 receptors expressed in dopamine neurons. Administration of cocaine is not known to induce negative RPE.

      • Typo at multiple places: 'Tekey's multiple comparison test'.

      This has been fixed.